Boost message passing between Erlang nodes
Apr 2009

Message passing between Erlang nodes is considerably slower than within a single node. This is expected: messages sent between nodes are copied from the sender's memory area to the receiver's, then transferred from one node to the other over TCP/IP.

In the systems I’m building, message passing between nodes needs to be as fast as possible. Who doesn’t want that? :) I was getting frustrated: a server could achieve amazing performance on a single node, but that performance dropped sharply once the server was distributed over two Erlang nodes. I therefore put together a small benchmark to measure the difference in message passing speed within and across Erlang nodes.

The machine used in this benchmark is an Apple MacBook running Leopard, with a 2.0 GHz Intel Core 2 Duo and 2 GB of DDR3 RAM.

This benchmark is a very simple ping request/pong response test. A process ‘A’ sends a ping message to process ‘B’, which replies with a pong message.

Code for the pong process:

% starts pong
start_pong() ->
	register(flood_pong, spawn(fun() -> pong_loop() end)).

pong_loop() ->
	receive
		{{Sender, SenderNode}, Any} ->
			% pong back
			{Sender, SenderNode} ! {pong, Any},
			pong_loop();
		shutdown ->
			io:format("pong shutdown~n",[]);
		_Ignore ->
			pong_loop()
	after 60000 ->
		io:format("pong timeout, shutdown~n",[])
	end.

This very simple code starts a process and registers it under the name flood_pong. You can see that to any request of the format:

{{Sender, SenderNode}, Any}

flood_pong replies with a message of the format:

{pong, Any}
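To see this contract in action, here is a hypothetical shell interaction (the name probe is made up), assuming flood1:start_pong() has already been called on the local node:

```erlang
%% Register a name for the shell process, so that pong can reply to
%% {probe, node()}: the {Name, Node} send form requires a registered
%% atom name, not a bare pid.
register(probe, self()),
flood_pong ! {{probe, node()}, hello},
receive
    {pong, hello} -> got_pong
after 1000 ->
    no_reply
end.
```

This also shows why the flood code registers its processes: the reply is addressed as {Sender, SenderNode} ! Message, and the {Name, Node} form only accepts a registered name as its first element.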

The code for the ping process, i.e. the one that starts the ping requests, is the following:

% start ping
start_ping(PongNode, Num) ->
	register(flood_ping, spawn(fun() -> ping_loop(Num, now(), Num) end)),
	send(PongNode, Num).

send(_PongNode, 0) ->
	ok;
send(PongNode, Num) ->
	% send a spawned ping
	spawn(fun() -> {flood_pong, PongNode} ! {{flood_ping, node()}, ping_request} end),
	send(PongNode, Num - 1).

ping_loop(Num, Start, 0) ->
	% timer:now_diff/2 returns microseconds
	T = timer:now_diff(now(), Start),
	io:format("RECEIVED ALL ~p in ~p us [~p/min]~n",[Num, T, (Num*60000000/T)]);
ping_loop(Num, Start, Count) ->
	receive
		{pong, _PingBack} ->
			ping_loop(Num, Start, Count-1);
		_Received ->
			ping_loop(Num, Start, Count)
	after 10000 ->
		io:format("ping timeout, missing ~p pong, shutdown~n",[Count])
	end.

This code does two things:

  • Starts and registers a flood_ping process, which is responsible for awaiting and counting the incoming pong replies. This allows it to compute the time needed for all the pong replies to get back to the node where the ping requests were first sent.
  • Spawns the ping requests [one process per ping request].

Let’s give it a try. First, we have to start up two Erlang nodes, so I fire up a terminal and enter:

:~ roberto$ erl +K true +P 500000 -name 'one@rob.loc' -setcookie asd

and in another terminal window:

:~ roberto$ erl +K true +P 500000 -name 'two@rob.loc' -setcookie asd

This initializes two Erlang nodes, ‘one@rob.loc’ and ‘two@rob.loc’, both with kernel polling enabled and the maximum number of processes per node set to 500,000.

I need to ensure that the nodes can see each other, so from ‘one@rob.loc’ I ping ‘two@rob.loc’:

(one@rob.loc)1>net_adm:ping('two@rob.loc').
pong
(one@rob.loc)2>nodes().
['two@rob.loc']

Now let’s perform a message passing speed benchmark for messages sent and received within the same Erlang node. First, I start up the pong process:

(one@rob.loc)3> flood1:start_pong().
true

I then need to start the ping requests, which I do by calling the start_ping/2 function. Its arguments are the name of the recipient node of the ping requests and the number of requests to perform. I therefore set the first parameter to node() and the second to 200,000 [we'll always run 200,000 ping requests].

(one@rob.loc)4> flood1:start_ping(node(), 200000).
ok
RECEIVED ALL 200000 in 2257727 us [5315080.166911234/min]

I see therefore that within the same Erlang node, this benchmark averages 5.3 million ping/pong messages per minute. Quite a rush.

Let’s now try the same thing with pong running on ‘two@rob.loc’, and ping requests coming from ‘one@rob.loc’. I start the pong process on ‘two@rob.loc’:

(two@rob.loc)1> flood1:start_pong().

Then I issue the ping request from ‘one@rob.loc’:

(one@rob.loc)5> flood1:start_ping('two@rob.loc', 200000).
ok
RECEIVED ALL 200000 in 16727643 us [717375.4246189974/min]

I see therefore that across two different Erlang nodes, this benchmark averages 700 thousand ping/pong messages per minute.

This is roughly a 7x loss in performance compared to the same benchmark run on a single Erlang node. It is simply unacceptable: it means that a system heavily dependent on message passing would run considerably faster on a single Erlang node than distributed across several [distribution being one of the things Erlang is great at].

At this point, I tried reducing the message passing overhead by using UDP instead of Erlang’s native distribution. This was no success at all: UDP has all the troubles it’s known for (packet loss, reordering, duplicate packets, …) without giving me any interesting speed improvement.

It’s only here that I got this thought: what if Erlang’s message passing overhead is so demanding, due to TCP/IP, that queuing messages bound for processes on the same remote node and sending them all together would increase the overall passing speed? TCP/IP already provides such mechanisms, but what if performing this queuing at the Erlang application level could bring significant improvements to inter-node message passing?

This is the picture. On both Erlang nodes I run a gen_server [which I called qr] that performs queuing and routing of messages sent from one node to the other. Instead of sending messages directly from a process A on node ‘one@rob.loc’ to a process B on node ‘two@rob.loc’, process A sends a routing request to the qr running on ‘one@rob.loc’, which forwards it to the qr running on ‘two@rob.loc’, which finally delivers it to B. The trick is that the sending qr queues messages until a certain number of them for the same node have piled up, or a timeout period has expired.

I start by defining the gen_server API to route a message. Here it is:

% Function() -> void()
% Description: Gets a routing request
route({ToPid, ToNode}, Message) ->
	case ToNode =:= node() of
		true ->
			% send directly
			ToPid ! Message;
		false ->
			% queue
			gen_server:cast(?SERVER, {{queue, ToNode}, {ToPid, Message}})
	end.

You can see that we need to pass the Pid of the destination process and the name of the node where this process is running, together with the Message we need to send.

In case the node is remote, this will cast a message to qr, which will handle it:

% add incoming routing request to queue
handle_cast({{queue, DestNode}, {ToPid, Message}}, Queue) ->
	% to pid
	Msg = {route, ToPid, Message},
	% check if node already has an entry in queue
	case lists:keysearch(DestNode, 1, Queue) of
		false ->
			% add node
			NewQueue = [{DestNode, {now(), [Msg]}}|Queue];
		{value, {DestNode, {CreationTime, MsgList}}} ->
			% check if queue is long enough
			case length(MsgList) >= ?QUEUELENGTH of
				true ->
					% queue for this node has reached maximum length, send routing message
					{?QR, DestNode} ! {queue_route, [Msg|MsgList]},
					% empty queue for node
					NewQueue = lists:keydelete(DestNode, 1, Queue);
				false ->
					% add message to queue list and replace node
					NewQueue = lists:keyreplace(DestNode, 1, Queue, {DestNode, {CreationTime, [Msg|MsgList]}})
			end
	end,
	% purge timeout
	PurgedQueue = purge_queue_selective(NewQueue),
	{noreply, PurgedQueue, 200};

[...]

% Function -> PurgedQueue
% Description: Loop queue and purge only nodes on timeout.
purge_queue_selective(Queue) ->
	% check timeout, send and remove element if needed
	FilterFun = fun({DestNode, {CreationTime, MsgList}}) ->
		case timer:now_diff(now(), CreationTime) > ?QUEUETIMEOUT of
			true ->
				% timeout for a node, send routing message
				% io:format("send selective: {?QR, ~p} ! {queue_route, ~p}~n",[DestNode, MsgList]),
				{?QR, DestNode} ! {queue_route, MsgList},
				% delete node from queue
				false;
			false ->
				true
		end
	end,
	% return cleaned queue
	lists:filter(FilterFun, Queue).

The mechanism is quite simple. When a message to be sent to a remote node arrives at qr, we build a list of keyed tuples [passed along as the gen_server State variable] of the format:

{DestNode, {CreationTime, MsgList}}

DestNode, i.e. the destination node, is the tuple key. CreationTime allows us to know when the timeout for the messages to DestNode has passed. MsgList is the list of messages to be sent to DestNode, as tuples of the format:

{route, ToPid, Message}

We use this keyed list to decide when to send all the queued messages to the other node, which we do:

  • when a certain timeout has passed for that node [computed using CreationTime];
  • when the number of messages to be sent has reached a certain size.
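To make the State handling concrete, here is a small illustration (with made-up node and message values) of the lists:keysearch/3 and lists:keyreplace/4 operations used in handle_cast above; it can be pasted into an Erlang shell:

```erlang
%% Illustration (hypothetical values) of the keyed-list operations.
%% The gen_server State is a list keyed on DestNode.
Now = erlang:now(),       % legacy time API, as used throughout the post
Q1 = [{'two@rob.loc', {Now, [{route, pid_b, msg1}]}}],
%% Look up the entry for a node:
{value, {'two@rob.loc', {Now, Msgs}}} = lists:keysearch('two@rob.loc', 1, Q1),
%% Prepend a new message and replace the entry:
Q2 = lists:keyreplace('two@rob.loc', 1, Q1,
                      {'two@rob.loc', {Now, [{route, pid_b, msg2} | Msgs]}}),
%% Q2 now holds both messages for the node, newest first:
[{_, {_, [{route, pid_b, msg2}, {route, pid_b, msg1}]}}] = Q2.
```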

Note that the gen_server cast also sets an overall timeout of 200 ms, which fires when no further messages are received by qr, so that the remaining queued messages get sent out. This timeout is handled in a handle_info callback:

% timeout on a cast message, purge queue
handle_info(timeout, Queue) ->
	% purge timeout
	PurgedQueue = purge_queue(Queue),
	% return
	{noreply, PurgedQueue};

[...]

% Function -> PurgedQueue
% Description: Loop queue and purge all nodes.
purge_queue(Queue) ->
	% send to all remaining nodes
	SendFun = fun({DestNode, {_CreationTime, MsgList}}) ->
		% send routing message
		{?QR, DestNode} ! {queue_route, MsgList}
	end,
	lists:foreach(SendFun, Queue),
	% return an empty queue
	[].

As we have seen, a message sent from the qr on one node to the qr on another node has the format:

{queue_route, MsgList}

with MsgList being a list of the individual messages, each in the format:

{route, ToPid, Message}

When a qr of a node receives such a message, it handles it:

handle_info({queue_route, MsgList}, State) ->
	RouteFun = fun({route, ToPid, Message}) ->
		% local send to node
		ToPid ! Message
	end,
	lists:foreach(RouteFun, MsgList),
	% return
	{noreply, State, 200};
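The snippets above show route/2, handle_cast/2, and handle_info/2 only; for completeness, here is a minimal sketch of the boilerplate the qr module presumably carries. The macro names appear in the snippets above, but their values and the remaining callbacks are my assumptions [the module only compiles together with the callbacks shown in this post]:

```erlang
-module(qr).
-behaviour(gen_server).

-export([start_link/0, route/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

%% ?SERVER names the local gen_server; ?QR is the registered name under
%% which remote qr processes are reached (here the same atom, qr).
-define(SERVER, ?MODULE).
-define(QR, ?MODULE).
%% Assumed tuning values: flush a node's queue after 100 messages, or
%% after 100 ms (?QUEUETIMEOUT is in microseconds, since it is compared
%% against timer:now_diff/2).
-define(QUEUELENGTH, 100).
-define(QUEUETIMEOUT, 100000).

start_link() ->
    gen_server:start_link({local, ?SERVER}, ?MODULE, [], []).

%% The gen_server State is the queue: a list of
%% {DestNode, {CreationTime, MsgList}} tuples, initially empty.
init([]) ->
    {ok, []}.

%% handle_cast/2 and handle_info/2 are the callbacks shown in the post.

handle_call(_Request, _From, Queue) ->
    {reply, ok, Queue}.

terminate(_Reason, _Queue) ->
    ok.

code_change(_OldVsn, Queue, _Extra) ->
    {ok, Queue}.
```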

Ok, this is basically it. Let’s now see whether this actually improves message passing between nodes.

For this, we need to redefine pong_loop so that it does not send messages directly to the process, but routes its replies through qr instead.

New code for the pong process:

% starts pong
start_pong() ->
	register(flood_pong, spawn(fun() -> pong_loop() end)).

pong_loop() ->
	receive
		{{Sender, SenderNode}, Any} ->
			% pong back
			qr:route({Sender, SenderNode}, {pong, Any}),    % < === ONLY LINE CHANGED ===
			pong_loop();
		shutdown ->
			io:format("pong shutdown~n",[]);
		_Ignore ->
			pong_loop()
	after 60000 ->
		io:format("pong timeout, shutdown~n",[])
	end.

The same goes for the ping process:

% start ping
start_ping(PongNode, Num) ->
	register(flood_ping, spawn(fun() -> ping_loop(Num, now(), Num) end)),
	send(PongNode, Num).

send(_PongNode, 0) ->
	ok;
send(PongNode, Num) ->
	% send a spawned ping                               % === \/ ONLY LINE CHANGED ===
	spawn(fun() -> qr:route({flood_pong, PongNode}, {{flood_ping, node()}, ping_request}) end),
	send(PongNode, Num - 1).

ping_loop(Num, Start, 0) ->
	% timer:now_diff/2 returns microseconds
	T = timer:now_diff(now(), Start),
	io:format("RECEIVED ALL ~p in ~p us [~p/min]~n",[Num, T, (Num*60000000/T)]);
ping_loop(Num, Start, Count) ->
	receive
		{pong, _PingBack} ->
			ping_loop(Num, Start, Count-1);
		_Received ->
			ping_loop(Num, Start, Count)
	after 10000 ->
		io:format("ping timeout, missing ~p pong, shutdown~n",[Count])
	end.

We are all set. Let’s run the test right away on two different Erlang nodes, with pong running on ‘two@rob.loc’ and ping requests coming from ‘one@rob.loc’. I start qr and the pong process on ‘two@rob.loc’:

(two@rob.loc)55> qr:start_link().
{ok,<0.38.0>}
(two@rob.loc)56> flood2:start_pong().
true

Then I start qr and issue the ping request from ‘one@rob.loc’:

(one@rob.loc)44> qr:start_link().
{ok,<0.114.0>}
(one@rob.loc)45> flood2:start_ping('two@rob.loc', 200000).
ok
RECEIVED ALL 200000 in 5690225 us [2108879.7015935224/min]

I see therefore that on two different Erlang nodes, this benchmark now averages 2.1 million ping/pong messages per minute, three times as much as without message queuing!

Just to be on the safe side, I run the same benchmark for messages sent via qr to processes running on the same Erlang node.

(two@rob.loc)49> qr:start_link().
{ok,<0.3562.6>}
(two@rob.loc)50> flood2:start_pong().
true
(two@rob.loc)51> flood2:start_ping(node(), 200000).
ok
RECEIVED ALL 200000 in 2264539 us [5299091.779828035/min]

I see therefore that within the same Erlang node, this benchmark still averages 5.3 million ping/pong messages per minute, the same as we had without the queuing mechanism.

To summarize:

  • without queuing mechanism:
    • same Erlang node: 5.3 million messages/min;
    • different Erlang nodes: 700 K messages/min.
  • with queuing mechanism:
    • same Erlang node: 5.3 million messages/min;
    • different Erlang nodes: 2.1 million messages/min.

The complete code to run this on your machine is available here. This whole ‘queuing idea’ is still an experiment, and I’d be more than delighted to hear your feedback: whether you are getting the same results, know how to improve the concept or the code, or have any considerations at all you would like to share.


UPDATE [April 17th, 2009]
Following a comment below from Adam Bregenzer, who ran the test on a Linux box and got very different results, I’ve decided to perform some testing on a few different operating systems. Here is a summary of these tests, but please bear in mind that the results are definitely non-exhaustive.

1. MacBook running Mac OS X Leopard, with a 2.0 GHz Intel Core 2 Duo and 2 GB of DDR3 RAM.

  • on the same Erlang node: 5.3 million messages/min;
  • on different Erlang nodes:
    • without queuing mechanism: 700 K messages/min.
    • with queuing mechanism: 2.1 million messages/min.

2. Ubuntu 8.10 Linux, 64-bit system with 2.5GHz Intel Core 2 Duo [Adam's test]

  • on the same Erlang node: 11.7 million messages/min;
  • on different Erlang nodes:
    • without queuing mechanism: 4.8 million messages/min.
    • with queuing mechanism: 3.6 million messages/min.

3. Ubuntu 9.04 beta Linux, 32-bit system in a VM with a single 2.0 GHz CPU and 512 MB of RAM.

  • on the same Erlang node: 4.6 million messages/min;
  • on different Erlang nodes:
    • without queuing mechanism: 1.6 million messages/min.
    • with queuing mechanism: 960K messages/min.

4. Windows XP, with a 3.0 GHz Intel Core 2 Duo E8400 and 2.5 GB of DDR3 RAM.

  • on the same Erlang node: 24.7 million messages/min;
  • on different Erlang nodes:
    • without queuing mechanism: 2.2 million messages/min.
    • with queuing mechanism: 5.1 million messages/min.

As is normal with Erlang, results vary widely depending on many factors, one of which seems to be the OS. In the non-exhaustive tests above, the qr mechanism does:

  • improve message passing performance between Erlang nodes in the OSX and Windows environment;
  • decrease message passing performance between Erlang nodes in the Ubuntu Linux environment.

I do not have the insight to understand which characteristics of each OS swing the results in this manner: TCP/IP stack, threading, I/O. Still, I’m currently deploying a similar qr mechanism in my applications, which can be activated where individual benchmarking shows it helps.


UPDATE [June 9th, 2009]
I recently had access to two HP ProLiant DL380 servers running Ubuntu 8.04 Server 32-bit, so I decided to also run the same benchmark across separate machines.

Here is a summary of the results of these tests. As usual, please bear in mind that these results are definitely non exhaustive.

  • on the same Erlang node: 5.3 million messages/min [on both machines];
  • on two different Erlang nodes, both running on the same machine:
    • without queuing mechanism: 1.7 million messages/min.
    • with queuing mechanism: 3.1 million messages/min.
  • on two different Erlang nodes, each running on a different machine:
    • without queuing mechanism: 1.7 million messages/min.
    • with queuing mechanism: 3.2 million messages/min.

This seems to indicate that the qr mechanism does improve message passing performance between Erlang nodes on the Ubuntu 8.04 environment as well.

I’ve also updated the original code so that the destination node identifier is now extracted from the recipient Pid, making it unnecessary to pass it as a variable. This code can be found here.

Comments (39)

Your “different nodes” were started on the same PC, weren’t they?
I wonder what would happen if this test were performed on a real network, with servers in various locations.

1. Posted by Peter — April 8, 2009 @ 9:51 am

hi peter, yes the nodes were on the same machine.

of course, this means that on real networks timings might be longer, but the point here is to try to considerably improve these timings, regardless of absolute values :)

2. Posted by Roberto Ostinelli — April 8, 2009 @ 9:54 am

Hi Roberto,

Very interesting experiment. Have you also benchmarked different queue lengths?

3. Posted by Vincent — April 8, 2009 @ 11:23 am

hi vincent, in the code you can define both the queue length and the timeout. i think this mainly depends on how big your messages are, and on erlang’s internal mechanism by which larger messages get split before being sent over.

i believe the best is to find the right queue size/timeout compromise, setting them just below the point where messages sent from one node to the other get split.

if you do conduct experiments on this let me know :)

4. Posted by Roberto Ostinelli — April 8, 2009 @ 11:29 am

[...] Boost message passing between Erlang nodes (tags: erlang cluster tuning programming 247up) Comments (0) [...]

5. Pingback by links for 2009-04-09 « Bloggitation — April 9, 2009 @ 8:03 am

nagle is at a lower level of implementation (tcp), though i’m trying to emulate its behavior from an application perspective.

if you feel you can improve what you’ve read above by implementing this or another algorithm, i’d be happy to hear from you :)

7. Posted by Roberto Ostinelli — April 9, 2009 @ 10:04 am

I downloaded your code. flood1 works as shown. flood2 does not. flood2:start_ping returns “ok” almost immediately and after some time it prints “ping timeout, missing 200000 pong, shutdown’.

8. Posted by Ivan — April 9, 2009 @ 3:21 pm

flood_pong has a timeout. ensure that it is running before running flood2:start_ping/2, otherwise obviously messages do not get ponged back.

9. Posted by Roberto Ostinelli — April 9, 2009 @ 3:36 pm

first I run flood2:start_pong() and then flood2:start_ping() and that scenario doesn’t work. But it works fine with flood1.

10. Posted by Ivan — April 9, 2009 @ 3:43 pm

ivan, just tried it out again and everything works fine. you need to ensure that you are running qr on both nodes [which maybe wasn't clear enough, so I updated the post]. If you want me to help you out, you may consider leaving a non fake email address or contacting me directly.

11. Posted by Roberto Ostinelli — April 9, 2009 @ 4:01 pm

Yep. qr:start_link() resolved the problem. Thx.

12. Posted by Ivan — April 9, 2009 @ 9:06 pm

Roberto,

Thanks for the interesting post and research. I was curious to try this myself and ran your benchmarks locally and got very different results. It seems that newer versions of erlang have made big improvements in this area.

I used an ubuntu 8.10 64-bit system with a 2.5ghz core 2 duo and the latest erlang-base (1:12.b.3-d) installed via ubuntu. Running the tests I get the following results:
flood1:start_ping(node(), 200000).
RECEIVED ALL 200000 in 1024913 ms [11708310.851750344/min]

flood1:start_ping(‘two@localhost’, 200000).
RECEIVED ALL 200000 in 2504182 ms [4791983.969216295/min]

flood2:start_ping(‘two@localhost’, 200000).
RECEIVED ALL 200000 in 3269456 ms [3670335.370777279/min]

As you can see I got very different results with flood1 on the same node performing ~2.5 times faster than on two different local nodes, and qr queuing actually makes it a bit slower.

FYI, when I run erl I get more (and different) version info:
Erlang (BEAM) emulator version 5.6.3 [source] [64-bit] [smp:2] [async-threads:0] [kernel-poll:true]

Eshell V5.6.3 (abort with ^G)

Have you tried your code on this version?

Thanks,
Adam

13. Posted by Adam Bregenzer — April 14, 2009 @ 10:16 pm

hello adam,

this is getting interesting. actually, i’m running a newer version than yours: the latest stable i.e. R12B-5, over a mac osx leopard [10.5.6].

are you running ubuntu on a vm? erlang performs very differently depending on the host operating system, and i suspect you running on a 64-bit system might also influence the results. i am soon gonna perform some testing over an ubuntu vm, and see what that gives.

have you tried modifying the queue length to see if results switch?

14. Posted by Roberto Ostinelli — April 14, 2009 @ 10:29 pm

Roberto,
I have not tried anything else seriously yet, but knowing that you are using a more recent version I will.

I did some limited, informal testing on an ubuntu vm, however it was running ubuntu server and apt-get only had R11. Its results were stranger – two local nodes were fastest. With two nodes performance was more than quadruple and tests between remote nodes (across two vmware servers) was nearly identical, regardless of the version used. However, the servers I used were 32bit ubuntu server running in 64bit vmware, single core processors with ~256mb of ram – not good test servers…

I will do some more testing on various systems, even trying to compile my own version of erlang, and report back. I would be interested in what you find as well. The wide variation between tests is a bit concerning.

Thanks,
Adam

15. Posted by Adam Bregenzer — April 15, 2009 @ 12:28 am

hi adam,

ok please do keep me posted on this. i should have some time to test this on an ubuntu vm myself, and see what that gives too.

16. Posted by Roberto Ostinelli — April 15, 2009 @ 9:46 pm

I tried that benchmarks on my home computer (ubuntu 8.10, AMD Athlon(tm) 64 X2 Dual Core Processor 4000+) erlang R13B (built from sources). Results:

flood1:
from same node: RECEIVED ALL 1000000 in 10983142 ms [5462917.624118854/min]
from another node: RECEIVED ALL 1000000 in 19117340 ms [3138511.9477919/min]

flood2+qr:
same node: RECEIVED ALL 1000000 in 9553300 ms [6280552.269896266/min]
another node: RECEIVED ALL 1000000 in 22377510 ms [2681263.4649699633/min]

So, in my environment qr provided better results in all cases.

17. Posted by walrus — April 22, 2009 @ 5:30 pm

:-) sorry I meant in case of same node only.

18. Posted by walrus — April 22, 2009 @ 5:33 pm

the same test with hipe:
same node:
walrus@home:~/erlang_mq_boost$ erl +K true +P 500000 -sname one
Erlang R13B (erts-5.7.1) [source] [smp:2:2] [rq:2] [async-threads:0] [hipe] [kernel-poll:true]

Eshell V5.7.1 (abort with ^G)
(one@home)1> flood1:start_pong().
true
(one@home)2> flood1:start_ping(node(),1000000).
ok
RECEIVED ALL 1000000 in 9906972 ms [6056340.928388613/min]
(one@home)3> qr:start_link().
{ok,}
(one@home)4> flood2:start_ping(node(),1000000).
ok
RECEIVED ALL 1000000 in 9780417 ms [6134707.753258374/min]
pong timeout, shutdown

From another node:
walrus@home:~/erlang_mq_boost$ erl +K true +P 500000 -sname tester
Erlang R13B (erts-5.7.1) [source] [smp:2:2] [rq:2] [async-threads:0] [hipe] [kernel-poll:true]

Eshell V5.7.1 (abort with ^G)
(tester@home)1> net_adm:ping(one@home).
pong
(tester@home)2> flood1:start_ping(one@home,1000000).
ok
RECEIVED ALL 1000000 in 19568416 ms [3066165.3963202746/min]
(tester@home)3> qr:start_link().
{ok,}
(tester@home)4> flood2:start_ping(one@home,1000000).
ok
RECEIVED ALL 1000000 in 21260062 ms [2822193.086737/min]

19. Posted by walrus — April 22, 2009 @ 5:46 pm

thank you walrus for your feedback. parts of your results comply with what is written in my update.

though, i believe there are some errors in your tests, as they state that qr can increase message passing speed within a node: this is impossible, since qr does not queue local messages but sends them immediately.

cheers,

r.

20. Posted by Roberto Ostinelli — April 28, 2009 @ 5:19 pm

Hi!

I should mention that there are additional comments here:
http://www.planeterlang.org/en/planet/article/Boost_message_passing_between_Erlang_nodes/

Perhaps planeterlang.org should disable comments when re-posting to prevent scattering the conversation. :\

Anyway, good work! It’s always good to see new ideas being tested! :)

Cheers,
:Davide

21. Posted by Davide — May 29, 2009 @ 6:35 pm

thank you davide, seen those and have replied to them.

i’ll also note that i’ve opened up a forum post on trapexit: http://www.trapexit.org/forum/viewtopic.php?p=44589

cheers,

r.

22. Posted by Roberto Ostinelli — June 7, 2009 @ 1:04 pm

There is an undocumented option, -kernel dist_nodelay Bool, which defaults to ‘true’. You could try benchmarking your program using plain messaging and setting dist_nodelay to ‘false’. It will try to aggregate messages in the inet driver. In previous cases, this has given comparable speedups to aggregating messages in ‘user space’, but obviously without the hassle.

Note that it will (obviously) affect /all/ messages sent between the nodes.

23. Posted by Ulf Wiger — June 8, 2009 @ 3:07 pm

hi ulf,

thank you for this pointer.

one question though: has the dist_nodelay option to be set as application environment variable, or can it be set as vm startup option somehow?

24. Posted by Roberto Ostinelli — June 8, 2009 @ 3:40 pm

When starting the VM, you can either set the variable on the erl command line like so:

erl … -kernel dist_nodelay false

or specify a .config file (often called sys.config), and point to it with:

erl … -config sys

The sys.config file has erlang syntax, and setting only the above variable would look like this:

$ cat sys.config
[{kernel, [{dist_nodelay, false}]}].
$

The details are covered in http://www.erlang.org/doc/design_principles/applications.html#7.8

25. Posted by Ulf Wiger — June 8, 2009 @ 4:58 pm

hi ulf,

since it is undocumented i was unsure how to set it up. thank you for clearing that up.

setting dist_nodelay to false does improve using plain messages, though the results are quite far from the ones i get with the qr mechanism.

flood1 [dist_nodelay true]: ~700K
flood1 [dist_nodelay false]: ~810K

but

flood2 [dist_nodelay true]: ~2,100K
flood2 [dist_nodelay false]: ~2,000K

unsurprisingly, using dist_nodelay coupled with another form of piping [my qr mechanism] actually slightly decreases performance.

i’ve performed this rough bench only on the osx macbook described here above. i will try this out also on other systems.

thank you for your feedback, dist_nodelay definitely is worth a try, and nice knowing :)

r.

26. Posted by Roberto Ostinelli — June 9, 2009 @ 10:10 am

This smells like a scheduling issue. When a packet is passed through TCP there is a complex chain of events: copy to kernel space, transfer to rx TCP stack, wake up receiver process, copy to user-space. I have a feeling that qr works because it reduces the number of userspace/kernel context switches.

Have you observed the relative packet sizes for tests with v without qr? Simply counting packets through the loopback interface (for constant data) should be sufficient to get a reasonable estimate of avg size.

I think Ulf’s idea is going in the right direction. I think you should be trying to get the underlying TCP stack to send fewer segments of larger size.

TCP_CORK sockopt under linux might be interesting.

A limitation will still be the number of copies *to* kernel space ie. send operation. Changing the aggregation scheme in the node is still required to influence that.

What effect does ping message size have on the comparison?

27. Posted by Richard Andrews — June 9, 2009 @ 2:08 pm

hi richard,

first of all thank you for your feedback. yes, i too do believe that i should go the route of the tcp optimization. i’m perfectly aware that it is quite bizarre to ‘emulate’ tcp piping mechanisms at application level. however, i’ve personally been unable to find a way to reproduce the results of the ‘qr’ pipe mechanism by mere tcp optimization.

as per your suggestion, i’ve performed the same tests considerably increasing the ping message size: as one would expect, the message passing gets slower, but the relative increase in speed using qr remains the same.

28. Posted by Roberto Ostinelli — June 9, 2009 @ 3:07 pm

Interesting result. I would not have predicted that the speedup would remain constant. I wasn’t necessarily advocating trying to optimise TCP but rather suggesting ways to track down the reason for the disparity.

I think the unremovable bottleneck will be the number of writes required with many messages sent individually. System calls are expensive. Userspace should be reducing the number of send operations as much as possible.

Out of curiosity are you using SMP in the erl nodes? Have you performed the test with SMP disabled on both nodes? I think there should only be a single erlang process (socket process) sending to TCP at any given time so IO lock contention should not be an issue. But it is worth checking.

Also out of curiosity – can you provide a breakdown of user-space versus kernel-space CPU utilisation during the tests. The time utility might be sufficient. Given the lower throughput I would expect to see the utilisation without qr to be significantly down.

29. Posted by Richard Andrews — June 10, 2009 @ 12:07 am

@Richard: Actually, since the send() function spawns processes in a tight loop, one could imagine that there will indeed be lock contention on the distributed send. Spawn is non-blocking, and returns as soon as it has inserted the other process in the ready queue. It’s therefore likely to spawn a great number of processes before it yields, e.g. due to context reductions.

30. Posted by Ulf Wiger — June 10, 2009 @ 2:12 pm

…commenting on my own comment above, of course in a multicore world, it is by no means certain that this invalidates the benchmark, or the utility of qr, much like ets has been a popular optimization on single-core, but is much less attractive on SMP due to the low-level locking.

31. Posted by Ulf Wiger — June 10, 2009 @ 2:15 pm

…what’s more (my brain seems to schedule in slices, just like Erlang does – beware: this may happen to you too, in time!)

There are good reasons to make ! fully asynchronous in SMP, while it isn’t in non-SMP. There’s a tiny backoff control that checks the length of the receiver’s inbox, and punishes the sender if it’s too long. This is *very* costly on many-core. Similarly, sending to a port is a blocking operation, which favors solutions where only one process sends to the port. Perhaps this too should be made fully asynchronous…? This is easier to accomplish for Distributed Erlang than e.g. for gen_tcp, since it would break the gen_tcp API unless it is extended.

32. Posted by Ulf Wiger — June 10, 2009 @ 2:23 pm

thank you ulf and richard for these precious insights. ulf, unfortunately i believe that it’s already too late for me, as my brain seems to be thinking in slices too :)

still, i haven’t yet gotten a grip on:

1. why qr is actually working [or if it's working only with such a benchmark];
2. what i should beware of when using qr.

to my understanding up until now, qr could be particularly costly on multi-core machines. is this correct? this alone could give me enough reasons to think again before using qr.

fact is, i’ve designed a multi-node architecture from the start [i.e. not based on clusters] and i would really like to find a “best practice” way to get the speed between nodes as fast as it can be.

i’m even considering developing an alternative carrier, but this seems to be a really complex and overwhelming task.

again, thank you for any insights you feel like sharing.

cheers,

r.

33. Posted by Roberto Ostinelli — June 10, 2009 @ 2:39 pm

In an unusually timely fashion, OTP has just released R13B01, which, among other things, delivers significant speedups on port communication, e.g. Distributed Erlang. Specifically, it dramatically reduces the lock contention on the port.

It would be interesting to see how R13B01 affects this benchmark.

34. Posted by Ulf Wiger — June 10, 2009 @ 6:57 pm

@Roberto: I’m not at all sure that qr is worse on multicore. All messages are going out on the same port eventually anyway, and sending to the same process gives little lock contention as appending to the message queue is one of the operations most optimized in the Erlang VM – and a very small critical section. Sending to the port (until R13B01), the critical section actually includes the term_to_binary and updating of the atom cache, as well as the actual send to the port. This means that the risk of lock contention is much higher. It’s the contention that’s costly, not the actual locking.

35. Posted by Ulf Wiger — June 10, 2009 @ 7:02 pm

hi ulf, i had seen the readme of the release, and i said to myself this was perfect timing. when you wrote i just had erlang R13B01 installed and was starting to perform the test :)

here are the results. again, on my macbook with leopard.

1. to my surprise, sending speed of messages within the same node has consistently decreased. it is now 3.6 million/min instead of 5.3.

2. speed between two erlang nodes are similar, however setting the dist_nodelay option to false now produces far better results, bringing message passing between erlang nodes up to 1 million [instead of ~810K].

3. qr still beats everything: 2.3 million/min between different erlang nodes.

r.

36. Posted by Roberto Ostinelli — June 10, 2009 @ 7:10 pm

ok. still it beats me why qr is actually performing so well.

i would just like to understand where i should/should not use the qr mechanism, since as i already said – it is, after all, quite weird to implement a queue mechanism at a higher level than tcp.

i am far from having the insights and experience you have in erlang; thus again, thank you for your feedback, it is warmly welcomed.

37. Posted by Roberto Ostinelli — June 10, 2009 @ 7:16 pm

@Roberto: Those were interesting numbers. One perpetual problem with benchmarking is that you have to gauge how close to reality your test comes. In your test, you begin by spawning 200,000 processes that all get in line to send a distributed message. This will probably result in an unusually high probability of lock contention, at least compared to most ‘normal’ erlang applications.

The local case is interesting. You should probably let the OTP team know. My guess is that they have verified with their own benchmarks that local message passing does *not* deteriorate with each new release, but you may have hit some corner case that their benchmarks don’t cover.

38. Posted by Ulf Wiger — June 11, 2009 @ 12:45 pm

i completely agree. this benchmark test is very specific.

i will need to re-run the local benchmark, as i prefer not to bug the otp team without being 100% sure of the results i’m seeing. as soon as i get some time, i will, and i’ll get back to you then to find out who to contact :)

39. Posted by Roberto Ostinelli — June 12, 2009 @ 9:14 am
