Boost message passing between Erlang nodes


Message passing between Erlang nodes is considerably slower than within a single node. This is normal: messages sent between nodes have to be copied out of the sender's memory area, serialized, sent over from one node to the other via TCP/IP, and finally copied into the receiver's memory area.

I became interested in this when a server that achieved amazing performance on a single node saw that performance drop considerably once it was distributed over two Erlang nodes. I therefore tried a small benchmark to somehow measure the difference in message passing speed within and across Erlang nodes.

The machine used in this benchmark is an Apple MacBook running Leopard, with a 2.0 GHz Intel Core 2 Duo and 2 GB of DDR3 RAM.

This benchmark is a very simple ping request/pong response test. A process ‘A’ sends a ping message to process ‘B’, which replies with a pong message.

Code for the pong process:
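
A minimal sketch of what this pong process could look like [the module structure and the exact {ping, Pid} / {pong} message formats are assumptions based on the description below]:

    -module(flood1).
    -export([start_pong/0, start_ping/2]).

    % start a process and register it under the name flood_pong
    start_pong() ->
        register(flood_pong, spawn(fun pong_loop/0)).

    % reply to every incoming ping request with a pong message
    pong_loop() ->
        receive
            {ping, SenderPid} ->
                SenderPid ! {pong},
                pong_loop()
        end.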

This very simple code basically starts a process and registers it under the name flood_pong. You can see that, for any incoming ping request carrying the Pid of the sender, flood_pong replies to that Pid with a pong message.

The code for the ping process, i.e. the one that starts the ping requests, is the following:
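
A sketch of the ping side, continuing the flood1 module above [the counting logic, the 10 second timeout and the output format are assumptions]:

    % Node is the node hosting flood_pong, NumRequests the number of ping requests
    start_ping(Node, NumRequests) ->
        % register the flood_ping process that awaits and counts the pong replies
        PingPid = spawn(fun() -> ping_loop(NumRequests, NumRequests, erlang:now()) end),
        register(flood_ping, PingPid),
        % spawn one process per ping request
        lists:foreach(fun(_) ->
            spawn(fun() -> {flood_pong, Node} ! {ping, PingPid} end)
        end, lists:seq(1, NumRequests)),
        ok.

    ping_loop(0, Total, StartTime) ->
        % all pong replies received: report elapsed time and rate per minute
        Micros = timer:now_diff(erlang:now(), StartTime),
        io:format("RECEIVED ALL ~p in ~p [~p/min]~n", [Total, Micros, Total / Micros * 60000000]);
    ping_loop(Remaining, Total, StartTime) ->
        receive
            {pong} ->
                ping_loop(Remaining - 1, Total, StartTime)
        after 10000 ->
            io:format("ping timeout, missing ~p pong, shutdown~n", [Remaining])
        end.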

This code does two things:

  • Starts and registers a flood_ping process, which is responsible for awaiting and counting the incoming pong replies. This allows it to compute the time needed for all the pong replies to get back to the node from which the ping requests were first sent.
  • Spawns the ping requests [one process per ping request].

Let’s give it a try. First, we have to start up two Erlang nodes, so I fire up a terminal and enter:
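
Something along these lines [the node-naming flag is an assumption; the other flags match the settings described just below]:

    $ erl +K true +P 500000 -name one@rob.loc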

and in another terminal window:
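
Presumably:

    $ erl +K true +P 500000 -name two@rob.loc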

This starts two Erlang nodes, ‘one@rob.loc’ and ‘two@rob.loc’, both with kernel polling enabled and the maximum number of processes per node set to 500,000.

I need to ensure that the nodes can see each other, so from ‘one@rob.loc’ I ping ‘two@rob.loc’:
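
With net_adm:ping/1, which returns pong once the nodes are connected:

    (one@rob.loc)1> net_adm:ping('two@rob.loc').
    pong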

Now let’s perform a message passing speed benchmark for messages sent and received within the same Erlang node. First, I start up the pong process:
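
Along the lines of:

    (one@rob.loc)2> flood1:start_pong().
    true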

I then need to start the ping requests, which I do by calling the start_ping/2 function; it takes as arguments the name of the node that will receive the ping requests, and the number of requests to perform. I therefore set the first parameter to node() and the second to 200,000 [we’ll always try out 200,000 ping requests].
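
That is:

    (one@rob.loc)3> flood1:start_ping(node(), 200000).
    ok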

I see therefore that, within the same Erlang node, this benchmark gives an average of 5.3 million ping/pong messages per minute. Quite a rush.

Let’s now try the same thing with pong running on ‘two@rob.loc’, and ping requests coming from ‘one@rob.loc’. I start the pong process on ‘two@rob.loc’:
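
Along the lines of:

    (two@rob.loc)1> flood1:start_pong().
    true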

Then I issue the ping request from ‘one@rob.loc’:
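
That is:

    (one@rob.loc)4> flood1:start_ping('two@rob.loc', 200000).
    ok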

I see therefore that, on two different Erlang nodes, this benchmark gives an average of 700 thousand ping/pong messages per minute.

This is roughly a 7x loss in performance compared to the same benchmark run on a single Erlang node. This is to be expected, but it does mean that a system heavily dependent on message passing and running on a single Erlang node would be considerably faster than one running on distributed nodes (distribution being one of the things Erlang is great at).

At this point, I tried reducing the overhead in message passing by using UDP instead of Erlang's native functionality. This was not successful at all: UDP brought all the troubles it is known for (packet loss, reordering, duplicate packets, …) without giving me any interesting speed improvement.

So I got this thought: what if Erlang’s message passing overhead is so demanding, due to TCP/IP, that queuing the messages destined for processes on the same remote node and sending them over all together would increase the overall passing speed? TCP/IP already provides such mechanisms, but what if performing this queuing at the Erlang application level could bring significant improvements to inter-node message passing?

Here is the picture. On both Erlang nodes I run a gen_server (which I called qr) that performs queuing and routing of messages sent from one node to the other. Instead of sending messages directly from a process A on node ‘one@rob.loc’ to a process B on node ‘two@rob.loc’, process A sends a routing request to the qr running on ‘one@rob.loc’, which sends it to the qr running on ‘two@rob.loc’, which finally forwards it to B. The key point is that the sending qr queues messages until a certain number of them for the same destination node have piled up, or a timeout period has expired.

qr

I start by defining the gen_server API to route a message. Here it is:
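
A sketch of what this could look like, including the server skeleton for context [names and details are assumptions; the gen_server callbacks follow in the next listings, and the remaining boilerplate callbacks are omitted]:

    -module(qr).
    -behaviour(gen_server).
    -export([start_link/0, route/3]).

    start_link() ->
        % register the queuing/routing server locally under the name qr
        gen_server:start_link({local, qr}, ?MODULE, [], []).

    init([]) ->
        % State: a list of {DestNode, CreationTime, MsgList} tuples, initially empty
        {ok, []}.

    % route a Message to the process Pid running on Node
    route(Pid, Node, Message) when Node =:= node() ->
        % local destination: deliver directly, no queuing involved
        Pid ! Message,
        ok;
    route(Pid, Node, Message) ->
        % remote destination: hand the message over to the local qr for queuing
        gen_server:cast(qr, {route, Pid, Node, Message}).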

You can see that we need to pass the Pid of the destination process, as well as the name of the Node where this process is running, together with the Message that we need to send.

In case the node is remote, this will cast a message to qr, which will handle it:
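
A sketch of this handler [thresholds, names and tuple shapes are assumptions]:

    % flush thresholds
    -define(MAX_QUEUE_LEN, 100).      % flush once this many messages are queued for a node
    -define(QUEUE_TIMEOUT, 200000).   % queue lifetime, in microseconds

    % State is a list of {DestNode, CreationTime, MsgList} tuples, keyed by DestNode
    handle_cast({route, Pid, Node, Message}, State) ->
        NewState = case lists:keyfind(Node, 1, State) of
            false ->
                % first message queued for this destination node
                [{Node, erlang:now(), [{Pid, Message}]} | State];
            {Node, CreationTime, MsgList} ->
                NewMsgList = [{Pid, Message} | MsgList],
                Expired = timer:now_diff(erlang:now(), CreationTime) > ?QUEUE_TIMEOUT,
                case Expired orelse length(NewMsgList) >= ?MAX_QUEUE_LEN of
                    true ->
                        % flush: send the whole batch to the qr of the destination node,
                        % reversing the list to preserve the original send order
                        gen_server:cast({qr, Node}, {route_multiple, lists:reverse(NewMsgList)}),
                        lists:keydelete(Node, 1, State);
                    false ->
                        lists:keyreplace(Node, 1, State, {Node, CreationTime, NewMsgList})
                end
        end,
        % arm an overall 200 ms timeout, handled in handle_info further below
        {noreply, NewState, 200};
    % (the clause that handles batches arriving from a remote qr is shown in a later listing)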

The mechanism is quite simple. When a message to be sent to a remote node arrives at qr, we build a list of keyed tuples [passed along as the gen_server State variable], one {DestNode, CreationTime, MsgList} entry per destination node, as in the sketch above. DestNode, i.e. the destination node, is the tuple key. CreationTime allows us to know when the timeout for the messages queued for DestNode has expired. MsgList is the list of messages to be sent to DestNode, each stored together with the Pid of its recipient.

We check this keyed list to decide when to send all the queued messages to the other node, which we do:

  • when a certain timeout has passed for that node [computed using CreationTime];
  • when the number of messages to be sent has reached a certain size.

We see that the gen_server cast also arms an overall timeout of 200 ms, which fires when no other messages are received by qr, at which point we need to send out whatever is still queued. This timeout is handled with a handle_info call:
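
A sketch of this handler:

    % no cast received for 200 ms: flush whatever is still queued, per destination node
    handle_info(timeout, State) ->
        lists:foreach(fun({Node, _CreationTime, MsgList}) ->
            gen_server:cast({qr, Node}, {route_multiple, lists:reverse(MsgList)})
        end, State),
        {noreply, []};
    handle_info(_Info, State) ->
        {noreply, State}.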

As we can see from the above, a message sent from the qr of one node to the qr of another node carries the whole queued batch in a single cast, with MsgList being the list of the individual messages, each paired with the Pid of its recipient.

When a qr of a node receives such a message, it handles it:
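
A sketch of the receiving side [route_multiple is an assumed name; this clause sits alongside the {route, ...} clause shown earlier]:

    % a batch arriving from the qr of another node: deliver each message to its local recipient
    handle_cast({route_multiple, MsgList}, State) ->
        lists:foreach(fun({Pid, Message}) -> Pid ! Message end, MsgList),
        % keep the flush timeout armed for anything still queued locally
        {noreply, State, 200}.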

Ok, this basically is it. Let’s now see if this does improve message passing between nodes.

For this, we need to redefine the pong_loop so that it does not send its replies directly to the requesting process, but routes them through qr instead.

New code for the pong process:
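
A sketch of the new pong side [flood2 and the message formats are assumptions; the ping now also carries the sender's node, so that the reply can be routed back through qr]:

    -module(flood2).
    -export([start_pong/0, start_ping/2]).

    start_pong() ->
        register(flood_pong, spawn(fun pong_loop/0)).

    % instead of replying directly, the pong is handed over to the local qr
    pong_loop() ->
        receive
            {ping, SenderPid, SenderNode} ->
                qr:route(SenderPid, SenderNode, {pong}),
                pong_loop()
        end.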

The same goes for the ping process:
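
A sketch of the new ping side [here the registered name flood_pong is used as the routing destination, so that the receiving qr can deliver the message with the ! operator; this, too, is an assumption]. The counting loop ping_loop/3 is the same as in the flood1 sketch:

    start_ping(Node, NumRequests) ->
        PingPid = spawn(fun() -> ping_loop(NumRequests, NumRequests, erlang:now()) end),
        register(flood_ping, PingPid),
        lists:foreach(fun(_) ->
            % each ping request is routed through the local qr instead of being sent directly
            spawn(fun() -> qr:route(flood_pong, Node, {ping, PingPid, node()}) end)
        end, lists:seq(1, NumRequests)),
        ok.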

We are all set. Let’s immediately try a test on two different Erlang nodes, with pong running on ‘two@rob.loc’ and ping requests coming from ‘one@rob.loc’. I start qr and the pong process on ‘two@rob.loc’:
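
Roughly, assuming freshly restarted nodes:

    (two@rob.loc)1> qr:start_link().
    (two@rob.loc)2> flood2:start_pong().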

Then I start qr and issue the ping request from ‘one@rob.loc’:
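
Roughly:

    (one@rob.loc)1> qr:start_link().
    (one@rob.loc)2> flood2:start_ping('two@rob.loc', 200000).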

I see therefore that, on two different Erlang nodes, this benchmark gives an average of 2.1 million ping/pong messages per minute, 3 times as much as without message queuing. This is interesting.

Just to be on the safe side, I try the same benchmark for messages sent via qr to processes running on the same Erlang node.
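
That is, with both flood2 processes and qr running on 'one@rob.loc':

    (one@rob.loc)3> flood2:start_pong().
    (one@rob.loc)4> flood2:start_ping(node(), 200000).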

I see therefore that, within the same Erlang node, this benchmark gives an average of 5.3 million ping/pong messages per minute, the same as we had without the queuing mechanism.

To summarize:

  • without queuing mechanism:
    • same Erlang node: 5.3 million messages/min;
    • different Erlang nodes: 700 K messages/min.
  • with queuing mechanism:
    • same Erlang node: 5.3 million messages/min;
    • different Erlang nodes: 2.1 million messages/min.

The complete code to run this on your machine is available here. This whole ‘queuing idea’ is still an experiment, and I’d be more than delighted to hear your feedback: whether you get the same results, whether you know how to improve the concept or the code, or whether you have any other considerations you would like to share.


UPDATE [April 17th, 2009]

 


Due to a comment below from Adam Bregenzer, who ran the test on a Linux box and got very different results, I’ve decided to perform some testing on three different operating systems. Here is a summary of these tests, but please bear in mind that the results are definitely non-exhaustive.

1. MacBook running OS X Leopard, with a 2.0 GHz Intel Core 2 Duo and 2 GB of DDR3 RAM.

  • on the same Erlang node: 5.3 million messages/min;
  • on different Erlang nodes:
    • without queuing mechanism: 700 K messages/min.
    • with queuing mechanism: 2.1 million messages/min.

2. Ubuntu 8.10 Linux, 64-bit system with 2.5GHz Intel Core 2 Duo [Adam’s test]

  • on the same Erlang node: 11.7 million messages/min;
  • on different Erlang nodes:
    • without queuing mechanism: 4.8 million messages/min.
    • with queuing mechanism: 3.6 million messages/min.

3. Ubuntu 9.04 beta Linux, 32-bit system on a VM with 1 CPU at 2.0 GHz and 512 MB of RAM.

  • on the same Erlang node: 4.6 million messages/min;
  • on different Erlang nodes:
    • without queuing mechanism: 1.6 million messages/min.
    • with queuing mechanism: 960K messages/min.

4. Windows XP, with a 3.0 GHz Intel Core 2 Duo E8400 and 2.5 GB of DDR3 RAM.

  • on the same Erlang node: 24.7 million messages/min;
  • on different Erlang nodes:
    • without queuing mechanism: 2.2 million messages/min.
    • with queuing mechanism: 5.1 million messages/min.

As is normal with Erlang, results differ greatly depending on many factors, one of which seems to be the OS. In the non-exhaustive tests above, the qr mechanism does:

  • improve message passing performance between Erlang nodes in the OSX and Windows environment;
  • decrease message passing performance between Erlang nodes in the Ubuntu Linux environment.

I do not have the insight to understand which characteristics of each OS (TCP/IP stack, threading, I/O) switch the results in this manner. However, I’m currently deploying a similar qr mechanism in my applications, which can be activated when warranted by individual benchmarking tests.


UPDATE [June 9th, 2009]

 


I had two HP ProLiant DL380 servers available, running Ubuntu 8.04 Server 32-bit, so I decided to also run the same benchmark across different machines.

Here is a summary of the results of these tests. As usual, please bear in mind that these results are definitely non-exhaustive.

  • on the same Erlang node: 5.3 million messages/min [on both machines];
  • on two different Erlang nodes, both running on the same machine:
    • without queuing mechanism: 1.7 million messages/min.
    • with queuing mechanism: 3.1 million messages/min.
  • on two different Erlang nodes, each running on a different machine:
    • without queuing mechanism: 1.7 million messages/min.
    • with queuing mechanism: 3.2 million messages/min.

This seems to indicate that the qr mechanism does improve message passing performance between Erlang nodes on the Ubuntu 8.04 environment as well.

I’ve also updated the original code, so that the destination node identifier is now extracted from the recipient Pid and thus does not need to be passed as a variable. This code can be found here.

Comments

  1. Peter

    Your “different nodes” were started at the same PC, weren’t they?
    I wonder what would happen if this test were performed over a real network, with the servers located in different places.

  2. Roberto Ostinelli Author

    hi peter, indeed the nodes were on the same machine.

  3. Vincent

    Hi Roberto,

    Very interesting experiment. Have you also benchmarked different queue lengths?

  4. Roberto Ostinelli Author

    hi vincent, in the code you can define both the queue length and the timeout. i think this mainly depends on how big your messages are, and on erlang’s internal way of splitting larger messages before sending them over.

    i believe the best is to find the right queue size/timeout compromise, to set those just before the message being sent from a node to the other gets split.

    if you do conduct experiments on this let me know :)

  5. nagle is at a lower level of implementation (tcp), though i’m trying to emulate its behavior from an application perspective.

    if you feel you can improve what you’ve read above by implementing this or another algorithm, i’d be happy to hear from you :)

  6. Ivan

    I downloaded your code. flood1 works as shown. flood2 does not. flood2:start_ping returns “ok” almost immediately and after some time it prints “ping timeout, missing 200000 pong, shutdown”.

  7. flood_pong has a timeout. did you ensure that it is running before running flood2:start/2? otherwise obviously messages do not get ponged back.

  8. Ivan

    first I run flood2:start_pong() and then flood2:start_ping() and that scenario doesn’t work. But it works fine with flood1.

  9. ivan, just tried it out again and everything works fine. did you ensure that you are running qr on both nodes? (which maybe wasn’t clear enough, so thank you for the feedback – I’ve updated the post).

  10. Ivan

    Yep. qr:start_link() resolved the problem. Thx.

  11. Roberto,

    Thanks for the interesting post and research. I was curious to try this myself and ran your benchmarks locally and got very different results. It seems that newer versions of erlang have made big improvements in this area.

    I used an ubuntu 8.10 64-bit system with a 2.5ghz core 2 duo and the latest erlang-base (1:12.b.3-d) installed via ubuntu. Running the tests I get the following results:
    flood1:start_ping(node(), 200000).
    RECEIVED ALL 200000 in 1024913 ms [11708310.851750344/min]

    flood1:start_ping(‘two@localhost’, 200000).
    RECEIVED ALL 200000 in 2504182 ms [4791983.969216295/min]

    flood2:start_ping(‘two@localhost’, 200000).
    RECEIVED ALL 200000 in 3269456 ms [3670335.370777279/min]

    As you can see I got very different results with flood1 on the same node performing ~2.5 times faster than on two different local nodes, and qr queuing actually makes it a bit slower.

    FYI, when I run erl I get more (and different) version info:
    Erlang (BEAM) emulator version 5.6.3 [source] [64-bit] [smp:2] [async-threads:0] [kernel-poll:true]

    Eshell V5.6.3 (abort with ^G)

    Have you tried your code on this version?

    Thanks,
    Adam

  12. hello adam,

    this is getting interesting. actually, i’m running a newer version than yours: the latest stable i.e. R12B-5, over a mac osx leopard [10.5.6].

    are you running ubuntu on a vm? erlang performs very differently depending on the host operating system, and i suspect that running on a 64-bit system might also influence the results. i am soon gonna perform some testing over an ubuntu vm, and see what that gives.

    have you tried modifying the queue length to see if results switch?

  13. Roberto,
    I have not tried anything else seriously yet, but knowing that you are using a more recent version I will.

    I did some limited, informal testing on an ubuntu vm, however it was running ubuntu server and apt-get only had R11. Its results were stranger – two local nodes were fastest. With two nodes performance was more than quadruple, and tests between remote nodes (across two vmware servers) were nearly identical, regardless of the version used. However, the servers I used were 32bit ubuntu server running in 64bit vmware, single core processors with ~256mb of ram – not good test servers…

    I will do some more testing on various systems and even try compiling my own version of erlang and report back. I would be interested in what you find as well. The wide variations between tests are a bit concerning.

    Thanks,
    Adam

  14. hi adam,

    ok please do keep me posted on this. i should have some time to test this on an ubuntu vm myself, and see what that gives too.

  15. walrus

    I tried those benchmarks on my home computer (ubuntu 8.10, AMD Athlon(tm) 64 X2 Dual Core Processor 4000+), erlang R13B (built from sources). Results:

    flood1:
    from same node: RECEIVED ALL 1000000 in 10983142 ms [5462917.624118854/min]
    from another node: RECEIVED ALL 1000000 in 19117340 ms [3138511.9477919/min]

    flood2+qr:
    same node: RECEIVED ALL 1000000 in 9553300 ms [6280552.269896266/min]
    another node: RECEIVED ALL 1000000 in 22377510 ms [2681263.4649699633/min]

    So, in my environment qr provided better results in all cases.

  16. walrus

    :-) sorry I meant in case of same node only.

  17. walrus

    the same test with hipe:
    same node:
    walrus@home:~/erlang_mq_boost$ erl +K true +P 500000 -sname one
    Erlang R13B (erts-5.7.1) [source] [smp:2:2] [rq:2] [async-threads:0] [hipe] [kernel-poll:true]

    Eshell V5.7.1 (abort with ^G)
    (one@home)1> flood1:start_pong().
    true
    (one@home)2> flood1:start_ping(node(),1000000).
    ok
    RECEIVED ALL 1000000 in 9906972 ms [6056340.928388613/min]
    (one@home)3> qr:start_link().
    {ok,}
    (one@home)4> flood2:start_ping(node(),1000000).
    ok
    RECEIVED ALL 1000000 in 9780417 ms [6134707.753258374/min]
    pong timeout, shutdown

    From another node:
    walrus@home:~/erlang_mq_boost$ erl +K true +P 500000 -sname tester
    Erlang R13B (erts-5.7.1) [source] [smp:2:2] [rq:2] [async-threads:0] [hipe] [kernel-poll:true]

    Eshell V5.7.1 (abort with ^G)
    (tester@home)1> net_adm:ping(one@home).
    pong
    (tester@home)2> flood1:start_ping(one@home,1000000).
    ok
    RECEIVED ALL 1000000 in 19568416 ms [3066165.3963202746/min]
    (tester@home)3> qr:start_link().
    {ok,}
    (tester@home)4> flood2:start_ping(one@home,1000000).
    ok
    RECEIVED ALL 1000000 in 21260062 ms [2822193.086737/min]

  18. thank you walrus for your feedback. parts of your results comply with what is written in my update.

    though, i believe there are some errors in your tests, as they state that qr can increase message passing speed within nodes: this is impossible, since qr basically does not queue local messages and sends these immediately.

    cheers,

    r.

  19. Davide

    Hi!

    I should mention that there are additional comments here:
    http://www.planeterlang.org/en/planet/article/Boost_message_passing_between_Erlang_nodes/

    Perhaps planeterlang.org should disable comments when re-posting to prevent scattering the conversation. :\

    Anyway, good work! It’s always good to see new ideas being tested! :)

    Cheers,
    :Davide

  20. thank you davide, seen those and have replied to them.

    cheers,

    r.

  21. There is an undocumented option, -kernel dist_nodelay Bool, which defaults to ‘true’. You could try benchmarking your program using plain messaging and setting dist_nodelay to ‘false’. It will try to aggregate messages in the inet driver. In previous cases, this has given comparable speedups to aggregating messages in ‘user space’, but obviously without the hassle.

    Note that it will (obviously) affect /all/ messages sent between the nodes.

  22. hi ulf,

    thank you for this pointer.

    one question though: does the dist_nodelay option have to be set as an application environment variable, or can it be set as a vm startup option somehow?

  23. When starting the VM, you can either set the variable on the erl command line like so:

    erl … -kernel dist_nodelay false

    or specify a .config file (often called sys.config), and point to it with:

    erl … -config sys

    The sys.config file has erlang syntax, and setting only the above variable would look like this:

    $ cat sys.config
    [{kernel, [{dist_nodelay, false}]}].
    $

    The details are covered in http://www.erlang.org/doc/design_principles/applications.html#7.8

  24. hi ulf,

    since it is undocumented i was unsure how to set it up. thank you for clearing that up.

    setting dist_nodelay to false does improve using plain messages, though the results are quite far from the ones i get with the qr mechanism.

    flood1 [dist_nodelay true]: ~700K
    flood1 [dist_nodelay false]: ~810K

    but

    flood2 [dist_nodelay true]: ~2,100K
    flood2 [dist_nodelay false]: ~2,000K

    unsurprisingly, using dist_nodelay coupled with another form of piping [my qr mechanism] actually slightly decreases performance.

    i’ve performed this rough bench only on the osx macbook described above. i will try this out also on other systems.

    thank you for your feedback, dist_nodelay definitely is worth a try, and nice knowing :)

    r.

  25. Richard Andrews

    This smells like a scheduling issue. When a packet is passed through TCP there is a complex chain of events: copy to kernel space, transfer to rx TCP stack, wake up receiver process, copy to user-space. I have a feeling that qr works because it reduces the number of userspace/kernel context switches.

    Have you observed the relative packet sizes for tests with vs without qr? Simply counting packets through the loopback interface (for constant data) should be sufficient to get a reasonable estimate of average size.

    I think Ulf’s idea is going in the right direction. I think you should be trying to get the underlying TCP stack to send fewer segments of larger size.

    TCP_CORK sockopt under linux might be interesting.

    A limitation will still be the number of copies *to* kernel space ie. send operation. Changing the aggregation scheme in the node is still required to influence that.

    What effect does ping message size have on the comparison?

  26. hi richard,

    first of all thank you for your feedback. yes, i too do believe that i should go the route of the tcp optimization. i’m perfectly aware that it is quite bizarre to ’emulate’ tcp piping mechanisms at application level. however, i’ve personally been unable to find a way to reproduce the results of the ‘qr’ pipe mechanism by mere tcp optimization.

    as per your suggestion, i’ve performed the same tests considerably increasing the ping message size: as one would expect, the message passing gets slower, but the relative increase in speed using qr remains the same.

  27. Richard Andrews

    Interesting result. I would not have predicted that the speedup would remain constant. I wasn’t necessarily advocating trying to optimise TCP but rather suggesting ways to track down the reason for the disparity.

    I think the unremovable bottleneck will be the number of writes required with many messages sent individually. System calls are expensive. Userspace should be reducing the number of send operations as much as possible.

    Out of curiosity are you using SMP in the erl nodes? Have you performed the test with SMP disabled on both nodes? I think there should only be a single erlang process (socket process) sending to TCP at any given time so IO lock contention should not be an issue. But it is worth checking.

    Also out of curiosity – can you provide a breakdown of user-space versus kernel-space CPU utilisation during the tests. The time utility might be sufficient. Given the lower throughput I would expect to see the utilisation without qr to be significantly down.

  28. @Richard: Actually, since the send() function spawns processes in a tight loop, one could imagine that there will indeed be lock contention on the distributed send. Spawn is non-blocking, and returns as soon as it has inserted the other process in the ready queue. It’s therefore likely to spawn a great number of processes before it yields, e.g. due to context reductions.

  29. …commenting on my own comment above, of course in a multicore world, it is by no means certain that this invalidates the benchmark, or the utility of qr, much like ets has been a popular optimization on single-core, but is much less attractive on SMP due to the low-level locking.

  30. …what’s more (my brain seems to schedule in slices, just like Erlang does – beware: this may happen to you too, in time!)

    There are good reasons to make ! fully asynchronous in SMP, while it isn’t in non-SMP. There’s a tiny backoff control that checks the length of the receiver’s inbox, and punishes the sender if it’s too long. This is *very* costly on many-core. Similarly, sending to a port is a blocking operation, which favors solutions where only one process sends to the port. Perhaps this too should be made fully asynchronous…? This is easier to accomplish for Distributed Erlang than e.g. for gen_tcp, since it would break the gen_tcp API unless it is extended.

    thank you ulf and richard for these precious insights. ulf, unfortunately i believe that it’s already too late for me, as my brain seems to be thinking in slices too :)

    still, i haven’t yet gotten a grasp on:

    1. why qr is actually working [or if it’s working only with such a benchmark];
    2. what i should beware of when using qr.

    to my understanding up until now, qr could be particularly costly on multi-core machines. is this correct? this alone could give me enough reasons to think again before using qr.

    fact is, i’ve designed a multi-node architecture from the start [i.e. not based on clusters] and i would really like to find a “best practice” way to get the speed between nodes as fast as it can be.

    i’m even considering developing an alternative carrier, but this seems to be a really complex and overwhelming task.

    again, thank you for any insights you feel like sharing.

    cheers,

    r.

  32. In an unusually timely fashion, OTP has just released R13B01, which, among other things, delivers significant speedups on port communication, e.g. Distributed Erlang. Specifically, it dramatically reduces the lock contention on the port.

    It would be interesting to see how R13B01 affects this benchmark.

  33. @Roberto: I’m not at all sure that qr is worse on multicore. All messages are going out on the same port eventually anyway, and sending to the same process gives little lock contention as appending to the message queue is one of the operations most optimized in the Erlang VM – and a very small critical section. Sending to the port (until R13B01), the critical section actually includes the term_to_binary and updating of the atom cache, as well as the actual send to the port. This means that the risk of lock contention is much higher. It’s the contention that’s costly, not the actual locking.

    hi ulf, i had seen the readme of the release, and i said to myself this was perfect timing. when you wrote, i had just installed erlang R13B01 and was starting to perform the test :)

    here are the results. again, on my macbook with leopard.

    1. to my surprise, sending speed of messages within the same node has consistently decreased. it is now 3.6 million/min instead of 5.3.

    2. speed between two erlang nodes is similar, however setting the dist_nodelay option to false now produces far better results, bringing message passing between erlang nodes up to 1 million [instead of ~810K].

    3. qr still beats everything: 2.3 million/min between different erlang nodes.

    r.

  35. ok. still it beats me why qr is actually performing so well.

    i would just like to understand where i should/should not use the qr mechanism, since as i already said – it is, after all, quite weird to implement a queue mechanism at a higher level than tcp.

    i am far from having the insights and experience you have in erlang; thus again, thank you for your feedback, it is warmly welcomed.

    @Roberto: Those were interesting numbers. One perpetual problem with benchmarking is that you have to gauge how close to reality your test comes. In your test, you begin by spawning 200,000 processes that all get into line to send a distributed message. This will probably result in an unusually great probability of lock contention, at least compared to most ‘normal’ erlang applications.

    The local case is interesting. You should probably let the OTP team know. My guess is that they have verified with their own benchmarks that local message passing does *not* deteriorate with each new release, but you may have hit some corner case that their benchmarks don’t cover.

  37. i completely agree. this benchmark test is very specific.

    i will need to re-run the local benchmark, as i prefer not to bug the otp team without being 100% sure of the results i’m seeing. as soon as i get some time, i will. i’ll get back to you then to know who to contact :)
