Discussion: [zeromq-dev] IPC (again)
Erik Rigtorp
2009-12-30 22:34:48 UTC
Hi!

I've read the two discussions on using ZeroMQ for IPC. I think ZeroMQ
should support IPC and in-process communication.

TCP is nice to work with, but it has one problem: on Linux (and others)
TCP over loopback doesn't bypass the TCP stack, which makes the latency
several times higher than using pipes or Unix domain sockets. I know
that on Solaris this is optimized so that a loopback TCP connection
becomes more or less a pipe. For low-latency IPC on Linux, ZeroMQ needs
pipes or Unix domain sockets.

For ultra-low-latency IPC there is only one way to go, and that is to
use shared memory. I took a look at yqueue.hpp in zeromq2 and it's a
good start. We only need to add a lock-free memory allocator (which
can be implemented using a lock-free queue) or implement a lock-free
ringbuffer that would hold a fixed number of messages and block the
writer when it's full. For signaling I suggest implementing two
different approaches: one using pthread condition variables and one using
busy waiting.
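
(For illustration, a minimal sketch of the kind of fixed-capacity,
single-producer/single-consumer ringbuffer proposed above. It uses C++11
atomics, which postdate this thread, and the name spsc_ring_t is made up
rather than taken from 0MQ; it assumes the object is placed in memory
mapped into both processes and that std::atomic<size_t> is lock-free on
the target platform.)

//  Hypothetical sketch: fixed-capacity SPSC ring of message slots.
#include <atomic>
#include <cstddef>
#include <cstring>

template <size_t SLOT_SIZE, size_t SLOTS>
class spsc_ring_t
{
public:
    spsc_ring_t () : head (0), tail (0) {}

    //  Writer side: copy one message into the next free slot. Returns
    //  false when the ring is full; the caller can busy-wait or fall
    //  back to a condition variable.
    bool write (const void *data, size_t size)
    {
        if (size > SLOT_SIZE)
            return false;
        size_t t = tail.load (std::memory_order_relaxed);
        size_t next = (t + 1) % SLOTS;
        if (next == head.load (std::memory_order_acquire))
            return false;                        //  Ring is full.
        std::memcpy (slots [t].data, data, size);
        slots [t].size = size;
        //  Release: the memcpy becomes visible before the slot is published.
        tail.store (next, std::memory_order_release);
        return true;
    }

    //  Reader side: copy the oldest message out of the ring.
    bool read (void *data, size_t *size)
    {
        size_t h = head.load (std::memory_order_relaxed);
        if (h == tail.load (std::memory_order_acquire))
            return false;                        //  Ring is empty.
        std::memcpy (data, slots [h].data, slots [h].size);
        *size = slots [h].size;
        head.store ((h + 1) % SLOTS, std::memory_order_release);
        return true;
    }

private:
    struct slot_t { size_t size; char data [SLOT_SIZE]; };
    slot_t slots [SLOTS];
    std::atomic<size_t> head;                    //  Advanced by reader only.
    std::atomic<size_t> tail;                    //  Advanced by writer only.
};

A writer that finds the ring full can spin on write () for the busy-waiting
variant, or wait on a pthread condition signalled by the reader for the
blocking variant.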
Jon Dyte
2009-12-31 11:33:45 UTC
Post by Erik Rigtorp
Hi!
I've read the two discussions on using ZeroMQ for IPC. I think ZeroMQ
should support IPC and in-process communication.
I think we all agree on this.
Post by Erik Rigtorp
TCP is nice to work with, but it has one problem: on Linux (and others)
TCP over loopback doesn't bypass the TCP stack, which makes the latency
several times higher than using pipes or Unix domain sockets. I know
that on Solaris this is optimized so that a loopback TCP connection
Is that the case since a particular Solaris release (8, 9, 10)?
I haven't got my Solaris internals book to hand right now ;-)
Post by Erik Rigtorp
becomes more or less a pipe. For low-latency IPC on Linux, ZeroMQ needs
pipes or Unix domain sockets.
Just before Xmas I exchanged an email with Martin about providing a fifo/pipe
interface. (I wasn't concerned about performance, but wanted a zmq socket
connection that could only be accessed from the same machine and not via
loopback.) Subsequently I think that providing AF_LOCAL (AF_UNIX) sockets
would be a good idea.
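
(As an aside, a sketch of what such a local-only endpoint looks like at the
plain POSIX level, independent of the 0MQ API; the path /tmp/zmq-ipc-demo is
made up for illustration.)

//  Hypothetical illustration: create and bind an AF_UNIX (AF_LOCAL)
//  listening socket; such an endpoint is reachable only from the same host.
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main ()
{
    int fd = socket (AF_UNIX, SOCK_STREAM, 0);
    if (fd == -1) {
        perror ("socket");
        return 1;
    }

    sockaddr_un addr;
    memset (&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    //  The filesystem path doubles as the endpoint name.
    strncpy (addr.sun_path, "/tmp/zmq-ipc-demo", sizeof addr.sun_path - 1);
    unlink (addr.sun_path);                      //  Remove a stale socket file.

    if (bind (fd, (sockaddr*) &addr, sizeof addr) == -1 ||
            listen (fd, 10) == -1) {
        perror ("bind/listen");
        return 1;
    }

    //  accept () / read () / write () then work exactly as with TCP,
    //  but without the TCP/IP stack in the data path.
    close (fd);
    return 0;
}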
Post by Erik Rigtorp
For ultra-low-latency IPC there is only one way to go, and that is to
use shared memory. I took a look at yqueue.hpp in zeromq2 and it's a
good start. We only need to add a lock-free memory allocator (which
I'm glad someone else has looked at this, because a while back I wondered
whether yqueue.hpp could use shared memory.
Post by Erik Rigtorp
can be implemented using a lock-free queue) or implement a lock-free
ypipe.hpp, for example?
Post by Erik Rigtorp
ringbuffer that would hold a fixed number of messages and block the
writer when it's full. For signaling I suggest implementing two
different approaches: one using pthread condition variables and one using
busy waiting.
Erik Rigtorp
2010-01-05 12:03:29 UTC
Post by Martin Sustrik
3. The above would work OK for VSMs (very small messages). Still, larger
message contents are allocated via malloc (see zmq_msg_init_size
implementation) and these would require allocating shmem for each
message. While doable, it would make sense only for very large messages,
and only those very large messages that are known in advance to be sent
via shmem transport. It's kind of complex.
That would be a neat optimization, but complex. I think as a start we
should implement a ringbuffer with byte elements and use it as a
shared memory pipe. Basically you would write() and read() from the
buffer just like a socket, but without the overhead. If you know the
max message size you could optimize this and implement a ringbuffer
where each element is a message, and let the user program work directly
on shared memory. That would be hard to integrate with ZeroMQ's API.
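
(A sketch of how the shared region backing such a byte ringbuffer could be
set up with POSIX shm_open and mmap; the name /zmq-shm-pipe and the 1 MB
size are illustrative only.)

//  Hypothetical illustration: create a named shared-memory region that a
//  byte ringbuffer, as described above, could live in. Both processes map
//  the same name; one of them creates and sizes it first.
//  (On older glibc this needs linking with -lrt.)
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <cstdio>

int main ()
{
    const char *name = "/zmq-shm-pipe";          //  Made-up name.
    const size_t size = 1024 * 1024;             //  1 MB ring, for example.

    int fd = shm_open (name, O_CREAT | O_RDWR, 0600);
    if (fd == -1 || ftruncate (fd, size) == -1) {
        perror ("shm_open/ftruncate");
        return 1;
    }

    void *region = mmap (NULL, size, PROT_READ | PROT_WRITE,
        MAP_SHARED, fd, 0);
    if (region == MAP_FAILED) {
        perror ("mmap");
        return 1;
    }

    //  Place the ringbuffer control block and data area inside 'region';
    //  read () / write () against it never enter the kernel on the data path.

    munmap (region, size);
    close (fd);
    shm_unlink (name);                           //  Remove when done.
    return 0;
}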
What about passing just VSMs via the ringbuffer? You can increase
MAX_VSM_SIZE when compiling 0MQ so that all the messages fit into the
ringbuffer.
Is it correct that VSMs are messages that get copied within 0mq and
other messages get passed by reference?

Then we would implement copy-on-write for VSMs and zero-copy reading
in the client. For large messages we would need a shared memory area
which is mapped into all processes, a lock-free allocator and a
reference-counting garbage collector. Doable, but complex, and it has
its own performance bottlenecks. To utilize this we would also need to
change/complement the API (I think).

I think we should forget about implementing zero-copy to begin with;
for small messages it's not necessarily better anyway.
I'll try to find/write a good C++ lock-free ringbuffer template.
Post by Martin Sustrik
I would start with yqueue_t and ypipe_t. We've spent a lot of time making
them as efficient as possible. The only thing needed is to split each of
them into a read and a write part. This shouldn't be that complex. Both classes
have variables accessed exclusively by the reader and variables accessed
exclusively by the writer. Then there are shared variables manipulated by
atomic operations that should reside in the shared memory.
They look efficient, but they assume you can allocate new memory. For shm
it's simpler to assume a fixed-length buffer. Also, I think the
code is not correct: on PPC you are not guaranteed that the memcpy()
is committed before you update the atomic_ptr. You need to add a memory
barrier.
Martin Sustrik
2010-01-05 14:10:03 UTC
Post by Erik Rigtorp
Is it correct that VSMs are messages that get copied within 0mq and
other messages get passed by reference?
Yes. That's right.
Post by Erik Rigtorp
Then we would implement copy-on-write for VSMs and zero-copy reading
in the client. For large messages we would need a shared memory area
which is mapped into all processes, a lock-free allocator and a
reference-counting garbage collector. Doable, but complex, and it has
its own performance bottlenecks. To utilize this we would also need to
change/complement the API (I think).
Right. Now let's break it into small and gradual steps and find out
which is the first one to implement...
Post by Erik Rigtorp
I think we should forget about implementing zero-copy to begin with;
for small messages it's not necessarily better anyway.
I'll try to find/write a good C++ lock-free ringbuffer template.
I would start with yqueue_t and ypipe_t. We've spent a lot of time making
them as efficient as possible. The only thing needed is to split each of
them into a read and a write part. This shouldn't be that complex. Both classes
have variables accessed exclusively by the reader and variables accessed
exclusively by the writer. Then there are shared variables manipulated by
atomic operations that should reside in the shared memory.
They look efficient, but they assume you can allocate new memory. For shm
it's simpler to assume a fixed-length buffer. Also, I think the
code is not correct: on PPC you are not guaranteed that the memcpy()
is committed before you update the atomic_ptr. You need to add a memory
barrier.
Note that on PPC atomic_ptr is implemented using a mutex. The mutex should
trigger the barrier.

In general, the code works now due to various assumptions resembling the
one above. In the future it should be rewritten to use explicit memory barriers.
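
(A sketch of the publication pattern being discussed, written with C++11
release/acquire atomics, which postdate this thread; the names below are
illustrative and not 0MQ code. The release store keeps the memcpy from being
reordered past the publish; the acquire load pairs with it on the reader side.)

#include <atomic>
#include <cstddef>
#include <cstring>

struct msg_slot_t { char data [256]; size_t size; };

//  Shared between writer and reader, e.g. placed in shared memory.
//  A single one-shot handoff, just to show the ordering.
msg_slot_t slot;
std::atomic<bool> ready (false);

void writer (const void *src, size_t size)
{
    std::memcpy (slot.data, src, size);          //  1. Fill the payload.
    slot.size = size;
    //  2. Publish with release semantics; on PPC this typically emits the
    //     lwsync barrier that keeps the memcpy before the store.
    ready.store (true, std::memory_order_release);
}

bool reader (void *dst, size_t *size)
{
    //  Acquire pairs with the writer's release: if we observe 'true',
    //  we are guaranteed to observe the completed memcpy as well.
    if (!ready.load (std::memory_order_acquire))
        return false;
    std::memcpy (dst, slot.data, slot.size);
    *size = slot.size;
    return true;
}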

Martin

Martin Sustrik
2010-01-04 17:46:23 UTC
Post by Jon Dyte
Hi Martin, Erik,
Personally I quite like the idea of using a shared named pipe, then
passing the fd.
Yes, that's the simplest option. It requires some investigation whether
passing a descriptor via a Unix domain socket is guaranteed to be atomic.
If not, data for multiple file descriptors passed through the socket
could be interleaved, resulting in a nasty mess.
Post by Jon Dyte
The reason for that is that if you are going to do communication over pipes,
then it might be reasonable to assume the app is not doing TCP at all...
That's a good idea. We can even introduce a new configure option
--with-tcp so that TCP support can be turned off altogether, making the
binary smaller.

On the other hand, IPC can be implemented as an optimisation of the TCP
transport (thus tcp://127.0.0.1:5555 would be automatically passed via a
pipe).

The former option feels somewhat neater IMO.
Post by Jon Dyte
On Solaris there is an ioctl for passing the FD between processes, or we
could use more portable POSIX code (probably better).
Stevens suggests using sendmsg, sending the file descriptor as ancillary
data of SCM_RIGHTS type.
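
(A minimal sketch of that technique: standard POSIX sendmsg with SCM_RIGHTS
ancillary data over a connected AF_UNIX socket; the helper name send_fd is
made up for illustration.)

//  Hypothetical helper: pass one file descriptor to the peer process.
//  The kernel duplicates the descriptor into the receiving process, which
//  extracts it from the control message returned by recvmsg.
#include <sys/socket.h>
#include <cstring>

int send_fd (int sock, int fd_to_pass)
{
    char byte = 0;                               //  Must send at least 1 byte.
    iovec iov = { &byte, 1 };

    char ctrl [CMSG_SPACE (sizeof (int))];
    memset (ctrl, 0, sizeof ctrl);

    msghdr msg;
    memset (&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof ctrl;

    cmsghdr *cmsg = CMSG_FIRSTHDR (&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN (sizeof (int));
    memcpy (CMSG_DATA (cmsg), &fd_to_pass, sizeof (int));

    return sendmsg (sock, &msg, 0) == -1 ? -1 : 0;
}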

Martin