An evaluation of Erlang global process registries: meet Syn

Due to my personal interests and history, I often find myself building applications in field of the Internet Of Things. Most of the times I end up using Erlang: it is based on the Actor’s Model and is an ideological (and practical) perfect match to manage IoT interactions.

I recently built an application where devices can connect to, and interact with each other. Every device is identified via a unique ID (its serial number) and based on this ID the devices can send and receive messages. Nothing new here: it’s a standard messaging platform, which supports a custom protocol.

Due to the large amount of devices that I needed to support, this application runs on a cluster of Erlang nodes. Once a device connects to one of those nodes, the related TCP socket events are handled by a process running on that node. To send a message to a specific device, you send a message to the process that handles the devices’s TCP socket.

While building this application, I was early in the process faced with a very common problem: I needed a global process registry that would allow me to globally register a process based on its serial number, so that messages can be sent from anywhere in the cluster. This registry would need to have the following main characteristics:

Distributed.
Fast write speeds (>10,000 / sec).
Handle naming conflict resolution.
Allow for adding/removal of nodes.

Therefore I started to search for possible solutions (which included posting to the Erlang Questions mailing list), and these came out as my options:

Erlang’s global module.
Erlang’s pg2 module.
Gproc.
CloudI Process Groups.
Roll out a custom solution.

The Stress Test

I decided to evaluate every one of these solutions based on a variety of considerations. However, I also wanted to see how they would perform when submitted to some kind of a stress test. Therefore, I defined and wrote a simple one that:

Launches a certain number of processes per node (for example, 25,000 processes per node).
Registers these processes (25,000 processes per node), each with a globally unique Key.
Waits for those Keys to be propagated to all the nodes.
Unregisters all of these processes.
Waits for those Keys to be removed from all the nodes.
Re-registers all of the processes, to check for unwanted effects of subsequent add/remove operations.
Again, waits for those Keys to be propagated to all the nodes.
Kills all the processes (this time, without previously unregistering them).
Waits for those Keys to be removed from all the nodes (to check for process monitoring).

The test measures how long each one of these steps takes.

The following is the code for this stress test. You can see that it defines a behaviour: this is to implement callback modules that match the different syntax used by the different libraries.

-module(process_registry_bench).

-export([start/3]).
-export([register/2, unregister/2]).
-export([register_on_node/2, unregister_on_node/2]).

-callback init() -> term().
-callback register(Key :: string(), pid()) -> term().
-callback unregister(Key :: string(), pid()) -> term().
-callback retrieve(Key :: string()) -> pid() | undefined.
-callback process_loop() -> any().

-define(MAX_RETRIEVE_WAITING_TIME, 60000).


start(CallbackModule, ProcessesCount, Nodes) ->
	%% connect
	connect_nodes(Nodes),

	%% callback init
	CallbackModule:init(),

	%% launch processes
	{UpperKey, PidInfos} = launch_processes(CallbackModule, ProcessesCount),

	%% benchmark: register
	{TimeReg, _} = timer:tc(?MODULE, register, [CallbackModule, PidInfos]),
	io:format("Registered processes in ~p sec, at a rate of ~p/sec~n", [
		TimeReg/1000000,
		ProcessesCount/TimeReg*1000000
	]),

	%% benchmark: registration propagation
	{RetrievedInMs1, RetrieveProcess1} = retrieve(pid, CallbackModule, UpperKey),
	io:format("Check that process with Key ~p was found: ~p in ~p ms~n", [
		UpperKey, RetrieveProcess1, RetrievedInMs1
	]),

	%% benchmark: unregister
	{TimeUnreg, _} = timer:tc(?MODULE, unregister, [CallbackModule, PidInfos]),
	io:format("Unregistered processes in ~p sec, at a rate of ~p/sec~n", [
		TimeUnreg/1000000,
		ProcessesCount/TimeUnreg*1000000
	]),

	%% benchmark: unregistration propagation
	{RetrievedInMs2, RetrieveProcess2} = retrieve(undefined, CallbackModule, UpperKey),
	io:format("Check that process with Key ~p was NOT found: ~p in ~p ms~n", [
		UpperKey, RetrieveProcess2, RetrievedInMs2
	]),

	%% benchmark: re-registering
	{TimeReg2, _} = timer:tc(?MODULE, register, [CallbackModule, PidInfos]),
	io:format("Re-registered processes in ~p sec, at a rate of ~p/sec~n", [
		TimeReg2/1000000,
		ProcessesCount/TimeReg2*1000000
	]),

	%% benchmark: re-registration propagation
	{RetrievedInMs3, RetrieveProcess3} = retrieve(pid, CallbackModule, UpperKey),
	io:format("Check that process with Key ~p was found: ~p in ~p ms~n", [
		UpperKey, RetrieveProcess3, RetrievedInMs3
	]),

	%% benchmark: monitoring
	io:format("Kill all processes~n", []),
	kill_processes(PidInfos),
	{RetrievedInMs4, RetrieveProcess4} = retrieve(undefined, CallbackModule, UpperKey),
	io:format("Check that process with Key ~p was NOT found: ~p in ~p ms~n", [
		UpperKey, RetrieveProcess4, RetrievedInMs4
	]).

connect_nodes(Nodes) ->
    [true = net_kernel:connect_node(Node) || Node <- Nodes].

launch_processes(CallbackModule, ProcessesCount) ->
	%% return the processes info in format [{Node, [{Key, Pid}]}, ...]
	Nodes = [node() | nodes()],
	ProcessesPerNode = round(ProcessesCount / length(Nodes)),
	UpperKey = integer_to_list(ProcessesPerNode * length(Nodes)),
	F = fun(Node, Acc) ->
		StartingKey = length(Acc) * ProcessesPerNode,
		Pids = launch_processes_on_node(CallbackModule, ProcessesPerNode, StartingKey, Node),
		[{Node, Pids} | Acc]
	end,
	{UpperKey, lists:foldl(F, [], Nodes)}.
launch_processes_on_node(CallbackModule, ProcessesPerNode, StartingKey, Node) ->
	%% return the key and process in a list of format [{Key, Pid}, ...]
	Seq = [
		integer_to_list(Key)
		|| Key <- lists:seq(StartingKey + 1, ProcessesPerNode + StartingKey)
	],
	[{Key, spawn(Node, CallbackModule, process_loop, [])} || Key <- Seq].

register(CallbackModule, PidInfos) ->
	%% register in parallel on all nodes
	F = fun({Node, NodePidInfos}, Acc) ->
		RpcKey = rpc:async_call(Node, ?MODULE, register_on_node, [
			CallbackModule, NodePidInfos
		]),
		[{Node, RpcKey} | Acc]
	end,
	RpcKeys = lists:foldl(F, [], PidInfos),
	%% wait for registration to complete on all nodes
	FResult = fun({Node, RpcKey}) ->
		Registered = rpc:yield(RpcKey),
		io:format("Registered ~p processes on node ~p~n", [Registered, Node])
	end,
	lists:foreach(FResult, RpcKeys).
register_on_node(CallbackModule, NodePidInfos) ->
	F = fun({Key, Pid}) ->
		CallbackModule:register(Key, Pid)
	end,
	lists:foreach(F, NodePidInfos),
	length(NodePidInfos).

retrieve(Expected, CallbackModule, Key) ->
	StartTime = epoch_time_ms(),
	retrieve(Expected, CallbackModule, Key, StartTime).
retrieve(pid, CallbackModule, Key, StartTime) ->
	%% wait for a pid to be returned
	case CallbackModule:retrieve(Key) of
		undefined ->
			timer:sleep(50),
			case epoch_time_ms() > StartTime + ?MAX_RETRIEVE_WAITING_TIME of
				true -> {error, timeout_during_retrieve};
				false -> retrieve(pid, CallbackModule, Key, StartTime)
			end;
		{error, Error} ->
			{error, Error};
		Pid ->
			RetrievedInMs = epoch_time_ms() - StartTime,
			{RetrievedInMs, Pid}
	end;
retrieve(undefined, CallbackModule, Key, StartTime) ->
	%% wait for undefined to be returned
	case CallbackModule:retrieve(Key) of
		undefined ->
			RetrievedInMs = epoch_time_ms() - StartTime,
			{RetrievedInMs, undefined};
		{error, Error} ->
			{error, Error};
		_Pid ->
			timer:sleep(50),
			case epoch_time_ms() > StartTime + ?MAX_RETRIEVE_WAITING_TIME of
				true -> {error, timeout_during_retrieve};
				false -> retrieve(undefined, CallbackModule, Key, StartTime)
			end
	end.

unregister(CallbackModule, PidInfos) ->
	%% unregister in parallel on all nodes
	F = fun({Node, NodePidInfos}, Acc) ->
		RpcKey = rpc:async_call(Node, ?MODULE, unregister_on_node, [
		CallbackModule, NodePidInfos
		]),
		[{Node, RpcKey} | Acc]
	end,
	RpcKeys = lists:foldl(F, [], PidInfos),
	%% wait for unregistration to complete on all nodes
	FResult = fun({Node, RpcKey}) ->
		Unregistered = rpc:yield(RpcKey),
		io:format("Unregistered ~p processes on node ~p~n", [Unregistered, Node])
	end,
	lists:foreach(FResult, RpcKeys).
unregister_on_node(CallbackModule, NodePidInfos) ->
	F = fun({Key, Pid}) ->
		CallbackModule:unregister(Key, Pid)
	end,
	lists:foreach(F, NodePidInfos),
	length(NodePidInfos).

kill_processes(PidInfos) ->
	F = fun({_Node, NodePidInfos}) ->
		[exit(Pid, kill) || {_Key, Pid} <- NodePidInfos]
	end,
	lists:foreach(F, PidInfos).

epoch_time_ms() ->
    {Mega, Sec, Micro} = os:timestamp(),
    (Mega * 1000000 + Sec) * 1000 + round(Micro / 1000).

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

-module(process_registry_bench).

-export([start/3]).

-export([register/2, unregister/2]).

-export([register_on_node/2, unregister_on_node/2]).

-callback init() -> term().

-callback register(Key :: string(), pid()) -> term().

-callback unregister(Key :: string(), pid()) -> term().

-callback retrieve(Key :: string()) -> pid() | undefined.

-callback process_loop() -> any().

-define(MAX_RETRIEVE_WAITING_TIME, 60000).

start(CallbackModule, ProcessesCount, Nodes) ->

%% connect

connect_nodes(Nodes),

%% callback init

CallbackModule:init(),

%% launch processes

{UpperKey, PidInfos} = launch_processes(CallbackModule, ProcessesCount),

%% benchmark: register

{TimeReg, _} = timer:tc(?MODULE, register, [CallbackModule, PidInfos]),

io:format("Registered processes in ~p sec, at a rate of ~p/sec~n", [

TimeReg/1000000,

ProcessesCount/TimeReg*1000000

]),

%% benchmark: registration propagation

{RetrievedInMs1, RetrieveProcess1} = retrieve(pid, CallbackModule, UpperKey),

io:format("Check that process with Key ~p was found: ~p in ~p ms~n", [

UpperKey, RetrieveProcess1, RetrievedInMs1

]),

%% benchmark: unregister

{TimeUnreg, _} = timer:tc(?MODULE, unregister, [CallbackModule, PidInfos]),

io:format("Unregistered processes in ~p sec, at a rate of ~p/sec~n", [

TimeUnreg/1000000,

ProcessesCount/TimeUnreg*1000000

]),

%% benchmark: unregistration propagation

{RetrievedInMs2, RetrieveProcess2} = retrieve(undefined, CallbackModule, UpperKey),

io:format("Check that process with Key ~p was NOT found: ~p in ~p ms~n", [

UpperKey, RetrieveProcess2, RetrievedInMs2

]),

%% benchmark: re-registering

{TimeReg2, _} = timer:tc(?MODULE, register, [CallbackModule, PidInfos]),

io:format("Re-registered processes in ~p sec, at a rate of ~p/sec~n", [

TimeReg2/1000000,

ProcessesCount/TimeReg2*1000000

]),

%% benchmark: re-registration propagation

{RetrievedInMs3, RetrieveProcess3} = retrieve(pid, CallbackModule, UpperKey),

io:format("Check that process with Key ~p was found: ~p in ~p ms~n", [

UpperKey, RetrieveProcess3, RetrievedInMs3

]),

%% benchmark: monitoring

io:format("Kill all processes~n", []),

kill_processes(PidInfos),

{RetrievedInMs4, RetrieveProcess4} = retrieve(undefined, CallbackModule, UpperKey),

io:format("Check that process with Key ~p was NOT found: ~p in ~p ms~n", [

UpperKey, RetrieveProcess4, RetrievedInMs4

]).

connect_nodes(Nodes) ->

[true = net_kernel:connect_node(Node) || Node <- Nodes].

launch_processes(CallbackModule, ProcessesCount) ->

%% return the processes info in format [{Node, [{Key, Pid}]}, ...]

Nodes = [node() | nodes()],

ProcessesPerNode = round(ProcessesCount / length(Nodes)),

UpperKey = integer_to_list(ProcessesPerNode * length(Nodes)),

F = fun(Node, Acc) ->

StartingKey = length(Acc) * ProcessesPerNode,

Pids = launch_processes_on_node(CallbackModule, ProcessesPerNode, StartingKey, Node),

[{Node, Pids} | Acc]

end,

{UpperKey, lists:foldl(F, [], Nodes)}.

launch_processes_on_node(CallbackModule, ProcessesPerNode, StartingKey, Node) ->

%% return the key and process in a list of format [{Key, Pid}, ...]

Seq = [

integer_to_list(Key)

|| Key <- lists:seq(StartingKey + 1, ProcessesPerNode + StartingKey)

[{Key, spawn(Node, CallbackModule, process_loop, [])} || Key <- Seq].

%% register in parallel on all nodes

F = fun({Node, NodePidInfos}, Acc) ->

RpcKey = rpc:async_call(Node, ?MODULE, register_on_node, [

CallbackModule, NodePidInfos

]),

[{Node, RpcKey} | Acc]

end,

RpcKeys = lists:foldl(F, [], PidInfos),

%% wait for registration to complete on all nodes

FResult = fun({Node, RpcKey}) ->

Registered = rpc:yield(RpcKey),

io:format("Registered ~p processes on node ~p~n", [Registered, Node])

end,

lists:foreach(FResult, RpcKeys).

register_on_node(CallbackModule, NodePidInfos) ->

F = fun({Key, Pid}) ->

CallbackModule:register(Key, Pid)

end,

lists:foreach(F, NodePidInfos),

length(NodePidInfos).

retrieve(Expected, CallbackModule, Key) ->

StartTime = epoch_time_ms(),

retrieve(Expected, CallbackModule, Key, StartTime).

retrieve(pid, CallbackModule, Key, StartTime) ->

%% wait for a pid to be returned

case CallbackModule:retrieve(Key) of

undefined ->

timer:sleep(50),

case epoch_time_ms() > StartTime + ?MAX_RETRIEVE_WAITING_TIME of

true -> {error, timeout_during_retrieve};

false -> retrieve(pid, CallbackModule, Key, StartTime)

end;

{error, Error} ->

{error, Error};

Pid ->

RetrievedInMs = epoch_time_ms() - StartTime,

{RetrievedInMs, Pid}

end;

retrieve(undefined, CallbackModule, Key, StartTime) ->

%% wait for undefined to be returned

case CallbackModule:retrieve(Key) of

undefined ->

RetrievedInMs = epoch_time_ms() - StartTime,

{RetrievedInMs, undefined};

{error, Error} ->

{error, Error};

_Pid ->

timer:sleep(50),

case epoch_time_ms() > StartTime + ?MAX_RETRIEVE_WAITING_TIME of

true -> {error, timeout_during_retrieve};

false -> retrieve(undefined, CallbackModule, Key, StartTime)

end

end.

unregister(CallbackModule, PidInfos) ->

%% unregister in parallel on all nodes

F = fun({Node, NodePidInfos}, Acc) ->

RpcKey = rpc:async_call(Node, ?MODULE, unregister_on_node, [

CallbackModule, NodePidInfos

]),

[{Node, RpcKey} | Acc]

end,

RpcKeys = lists:foldl(F, [], PidInfos),

%% wait for unregistration to complete on all nodes

FResult = fun({Node, RpcKey}) ->

Unregistered = rpc:yield(RpcKey),

io:format("Unregistered ~p processes on node ~p~n", [Unregistered, Node])

end,

lists:foreach(FResult, RpcKeys).

unregister_on_node(CallbackModule, NodePidInfos) ->

F = fun({Key, Pid}) ->

CallbackModule:unregister(Key, Pid)

end,

lists:foreach(F, NodePidInfos),

length(NodePidInfos).

kill_processes(PidInfos) ->

F = fun({_Node, NodePidInfos}) ->

[exit(Pid, kill) || {_Key, Pid} <- NodePidInfos]

end,

lists:foreach(F, PidInfos).

epoch_time_ms() ->

{Mega, Sec, Micro} = os:timestamp(),

(Mega * 1000000 + Sec) * 1000 + round(Micro / 1000).

To run this stress test:

process_registry_bench:start(CallbackModule, ProcessCount, Nodes).

1	process_registry_bench:start(CallbackModule, ProcessCount, Nodes).

For instance, to launch it with the callback module global_bench for 100,000 processes running on a cluster of 4 nodes ['1@127.0.0.1', '2@127.0.0.1', '3@127.0.0.1', '4@127.0.0.1']:

process_registry_bench:start(global_bench, 100000, [
    '1@127.0.0.1',
    '2@127.0.0.1',
    '3@127.0.0.1',
    '4@127.0.0.1'
]).

process_registry_bench:start(global_bench, 100000, [

'1@127.0.0.1',

'2@127.0.0.1',

'3@127.0.0.1',

'4@127.0.0.1'

]).

Running this test returns an output similar to:

Registered 25000 processes on node '1@127.0.0.1'
Registered 25000 processes on node '2@127.0.0.1'
Registered 25000 processes on node '3@127.0.0.1'
Registered 25000 processes on node '4@127.0.0.1'
Registered processes in 6.385835 sec, at a rate of 15659.659230155492/sec
Check that process with Key "100000" was found: <6218.25065.0> in 0 ms
Unregistered 25000 processes on node '1@127.0.0.1'
Unregistered 25000 processes on node '2@127.0.0.1'
Unregistered 25000 processes on node '3@127.0.0.1'
Unregistered 25000 processes on node '4@127.0.0.1'
Unregistered processes in 4.481706 sec, at a rate of 22312.93172733776/sec
Check that process with Key "100000" was NOT found: undefined in 0 ms
Registered 25000 processes on node '1@127.0.0.1'
Registered 25000 processes on node '2@127.0.0.1'
Registered 25000 processes on node '3@127.0.0.1'
Registered 25000 processes on node '4@127.0.0.1'
Re-registered processes in 4.943493 sec, at a rate of 20228.611631492146/sec
Check that process with Key "100000" was found: <6218.25065.0> in 0 ms
Kill all processes
Check that process with Key "100000" was NOT found: undefined in 0 ms
ok

Registered 25000 processes on node '1@127.0.0.1'

Registered 25000 processes on node '2@127.0.0.1'

Registered 25000 processes on node '3@127.0.0.1'

Registered 25000 processes on node '4@127.0.0.1'

Registered processes in 6.385835 sec, at a rate of 15659.659230155492/sec

Check that process with Key "100000" was found: <6218.25065.0> in 0 ms

Unregistered 25000 processes on node '1@127.0.0.1'

Unregistered 25000 processes on node '2@127.0.0.1'

Unregistered 25000 processes on node '3@127.0.0.1'

Unregistered 25000 processes on node '4@127.0.0.1'

Unregistered processes in 4.481706 sec, at a rate of 22312.93172733776/sec

Check that process with Key "100000" was NOT found: undefined in 0 ms

Registered 25000 processes on node '1@127.0.0.1'

Registered 25000 processes on node '2@127.0.0.1'

Registered 25000 processes on node '3@127.0.0.1'

Registered 25000 processes on node '4@127.0.0.1'

Re-registered processes in 4.943493 sec, at a rate of 20228.611631492146/sec

Check that process with Key "100000" was found: <6218.25065.0> in 0 ms

Kill all processes

Check that process with Key "100000" was NOT found: undefined in 0 ms

The Process Registry Libraries

The following are the considerations that I made for every solution.

1. Erlang’s native global module

Considerations

The Erlang global module has native functionalities to support a global process registry. I was not particularly attracted to it, because:

I always think that this module should be used to identify application’s long-running services.
I didn’t know if millions of entries can be supported. This module wasn’t built with my use case in mind: as per my previous point, it is generally used to register long-running processes.
It has a locking mechanism to ensure that the registration is atomic. I felt this could become a serious bottleneck to the registration of processes.

However, this is a native Erlang module, which also allows to define a resolve function to be used for conflict resolution (i.e. in case of race conditions, or during net splits, when a Key gets registered simultaneously on two different nodes). It is able to satisfy the distributed requirements out of the box, with no need for additional libraries.

Stress Test

I gave it a go at my stress test, with the following callback module:

-module(global_bench).
-behaviour(process_registry_bench).

-export([init/0]).
-export([register/2, unregister/2]).
-export([retrieve/1]).
-export([process_loop/0]).

init() ->
	ok.

register(Key, Pid) ->
	yes = global:register_name(Key, Pid).

unregister(Key, _Pid) ->
	global:unregister_name(Key).

retrieve(Key) ->
	global:whereis_name(Key).

process_loop() ->
	receive
		_ -> ok
	end.

-module(global_bench).

-behaviour(process_registry_bench).

-export([init/0]).

-export([register/2, unregister/2]).

-export([retrieve/1]).

-export([process_loop/0]).

init() ->

ok.

yes = global:register_name(Key, Pid).

unregister(Key, _Pid) ->

global:unregister_name(Key).

retrieve(Key) ->

global:whereis_name(Key).

process_loop() ->

receive

_ -> ok

end.

Note that process_loop (which is the loop running in the processes) does nothing, except keeping the process alive.

The results of the stress test are:

	1 Node	2 Nodes	3 Nodes	4 Nodes
Reg / second	27,233	2,673	1,997	1,579
Retrieve registered Key (ms)	0	0	0	0
Unreg / second	29,491	2,908	2,206	1,596
Retrieve unregistered Key (ms)	0	0	0	0
Re-Reg / second	27,149	2,993	2,131	2,542
Retrieve re-registered Key (ms)	0	0	0	0
Retrieve Key of killed Pid (ms)	0	timeout	timeout	timeout

Conclusions

The locking mechanism heavily influences the decrease in performance that can be seen when adding nodes. With a cluster of 2+ nodes we already are under the spec of 10,000 registrations / second.
The monitoring of processes is slow. After having killed all the processes, in a cluster of 2+ nodes it takes more than 60 seconds to have global:whereis_name/1 return undefined (this is what timeout means in the table here above). I had to decrease the number of processes to around 80,000 to have the stress test pass in a cluster of 4 nodes, and it would take around 55 seconds for a killed process’ Key to be removed from the registry.

For these reasons, it didn’t look like I could use this module.

2. Erlang’s native pg2 module

Considerations

Erlang pg2 module has native functionalities to support a global process registry. I was not particularly attracted to it, because:

This library handles Process Groups, which is very different from handling unique Registered Names. We can use it for our purpose though, by basically creating Groups with a single entry. These groups are named according to our Keys, and every Group has a single entry: the Pid that we are registering. This is kind of a trick, but it’s not a showstopper.
Having Process Groups basically means that conflict resolution isn’t covered. If two processes are registered on different nodes with the same Key (because of race conditions or during a net split) this will result in having a Process Group with two elements instead of one. Sometimes this is fine; however, I wanted to ensure that there would be a clearly identified single Pid per device in the whole system. Not a showstopper either, but a turn-off.
I didn’t know if millions of entries can be supported. This module wasn’t built with my use case in mind.
Here too, it has a locking mechanism to ensure that the registration is atomic which could become a bottleneck to the registration of processes.

Stress Test

Here’s the callback module:

-module(pg2_bench).
-behaviour(process_registry_bench).

-export([init/0]).
-export([register/2, unregister/2]).
-export([retrieve/1]).
-export([process_loop/0]).

init() ->
	ok.

register(Key, Pid) ->
	ok = pg2:create(Key), %% create group
	ok = pg2:join(Key, Pid). %% add pid

unregister(Key, _Pid) ->
	ok = pg2:delete(Key).

retrieve(Key) ->
	case pg2:get_members(Key) of
		{error, {no_such_group, Key}} -> undefined;
		[] -> undefined;
		[Pid] -> Pid
	end.

process_loop() ->
	receive
		_ -> ok
	end.

-module(pg2_bench).

-behaviour(process_registry_bench).

-export([init/0]).

-export([register/2, unregister/2]).

-export([retrieve/1]).

-export([process_loop/0]).

init() ->

ok.

ok = pg2:create(Key), %% create group

ok = pg2:join(Key, Pid). %% add pid

unregister(Key, _Pid) ->

ok = pg2:delete(Key).

retrieve(Key) ->

case pg2:get_members(Key) of

{error, {no_such_group, Key}} -> undefined;

[] -> undefined;

[Pid] -> Pid

end.

process_loop() ->

receive

_ -> ok

end.

The results of the stress test are:

	1 Node	2 Nodes	3 Nodes	4 Nodes
Reg / second	25,062	3,823	2,914	1,862
Retrieve registered Key (ms)	0	0	0	0
Unreg / second	39,522	6,903	5,191	3,425
Retrieve unregistered Key (ms)	0	0	0	0
Re-Reg / second	25,701	3,794	2,783	1,817
Retrieve re-registered Key (ms)	0	0	0	0
Retrieve Key of killed Pid (ms)	timeout	timeout	timeout	timeout

Conclusions

The locking mechanism heavily influences the decrease in performance that can be seen when adding nodes. With a cluster of 2+ nodes we already are under the spec of 10,000 registrations / second.
The monitoring of processes is slow. After having killed all the processes, even on a single nodes it takes more than 60 seconds to have pg2:get_members/1 return that the group no longer exits. I had to decrease the number of processes to around 45,000 to have the stress test pass in a cluster of 4 nodes, and it would take a little less than 60 seconds for a killed process’ Key to be removed from the registry.

For these reasons, it didn’t look like I could use this module.

3. Gproc

Considerations

gproc is a well-known process registry which is normally used for the additional features that it provides on top of Erlang’s native process dictionary (for instance, it is able to provide pub/sub patterns). It is a solid and well-supported library, and you can often see Ulf Wiger (one of the library’s authors) generously providing support for it.

However, there were some concerns I had:

For the distributed part it relies on gen_leader, on which I’ve heard too many horror stories (maybe that’s not a thing anymore). Ulf pointed me to a gproc branch that uses locks_leader, where he is mainly concentrating his efforts for gproc’s support for distributed operations.
I felt that the main purpose of this library is not to provide a distributed process registry as much as extending the existing Erlang registration mechanisms with some additional features. The README in gproc’s Github page clearly depicts it as being an “Extended process dictionary”; it just felt that the distributed part hasn’t been the primary focus in the development of this library.
I could not understand how conflict resolution is managed in a distributed environment.

Stress Test

Here’s the callback module:

-module(gproc_bench).
-behaviour(process_registry_bench).

-export([init/0]).
-export([register/2, unregister/2]).
-export([retrieve/1]).
-export([process_loop/0]).

init() ->
	%% start app on every node
	Nodes = [node() | nodes()],
	F = fun(Node) ->
		rpc:call(Node, application, ensure_all_started, [gproc]),
		rpc:call(Node, gproc_dist, start_link, [Nodes])
	end,
	lists:foreach(F, Nodes).

register(Key, Pid) ->
	Pid ! {self(), reg, Key},
	receive
		done -> ok
	end.

unregister(Key, Pid) ->
	Pid ! {self(), unreg, Key},
	receive
		done -> ok
	end.

retrieve(Key) ->
	case catch gproc:lookup_pid({n, g, Key}) of
		{'EXIT', _} -> undefined;
		Pid -> Pid
	end.

process_loop() ->
	receive
		{Sender, reg, Key} ->
			gproc:reg({n, g, Key}, ignored),
			Sender ! done,
			process_loop();
		{Sender, unreg, Key} ->
			gproc:unreg({n, g, Key}),
			Sender ! done,
			process_loop()
	end.

-module(gproc_bench).

-behaviour(process_registry_bench).

-export([init/0]).

-export([register/2, unregister/2]).

-export([retrieve/1]).

-export([process_loop/0]).

init() ->

%% start app on every node

Nodes = [node() | nodes()],

F = fun(Node) ->

rpc:call(Node, application, ensure_all_started, [gproc]),

rpc:call(Node, gproc_dist, start_link, [Nodes])

end,

lists:foreach(F, Nodes).

Pid ! {self(), reg, Key},

receive

done -> ok

end.

unregister(Key, Pid) ->

Pid ! {self(), unreg, Key},

receive

done -> ok

end.

retrieve(Key) ->

case catch gproc:lookup_pid({n, g, Key}) of

{'EXIT', _} -> undefined;

Pid -> Pid

end.

process_loop() ->

receive

{Sender, reg, Key} ->

gproc:reg({n, g, Key}, ignored),

Sender ! done,

process_loop();

{Sender, unreg, Key} ->

gproc:unreg({n, g, Key}),

Sender ! done,

process_loop()

end.

Note: in gproc, to ensure thread safety, a process can only set its own values. That’s why the register/2 and unregister/2 callbacks here above send messages to the processes, which then register or unregister themselves (see process_loop). As you can see here above I’ve decided to provide a locking call for these functions (by using a receive block), to emulate the locking calls that I’ve used in the other libraries.

The results of the stress test are:

	1 Node	2 Nodes	3 Nodes	4 Nodes
Reg / second	67,011	19,111	22,048	15,659
Retrieve registered Key (ms)	0	0	0	0
Unreg / second	118,228	22,845	24,282	22,312
Retrieve unregistered Key (ms)	0	0	0	0
Re-Reg / second	127,200	22,115	25,884	20,228
Retrieve re-registered Key (ms)	0	0	0	0
Retrieve Key of killed Pid (ms)	178	1,890	7,584	10,600

Conclusions

These are overall very good results.
I didn’t need to reduce the process count to make all of the test pass.
The monitoring of processes can be optimized. After having killed all the processes, on a cluster of 4 nodes it takes >10 seconds for gproc:lookup_pid/1 to not find the Pid once a process has exited.
Unfortunately, I had some inconsistent results running this test in a cluster of 2+ nodes. Often, the test could not retrieve the registered Key (after the first registration round) in less than 60 second, and timed out.

I was a little skeptical though on the inconsistency that I saw in the test results, which might be related to the gen_leader issues that I’ve occasionally heard about. The author’s choice to move towards locks_leader might be a sign of this. Despite these thoughts, this looked like a good potential candidate.

4. CloudI Process Groups

Considerations

cpg is an actively maintained library, and his main author Michael Truog is often very available to discuss his choices and provide support. cpg deals with Process Groups and not unique Registered Names, therefore my concerns where similar to the ones I had with pg2:

Handling Process Groups is very different from handling unique Registered Names. We can use the same trick used with pg2, i.e. creating Process Groups named with Key, with a single entry (the Pid).
Here too, having Process Groups basically means that conflict resolution isn’t covered. This made me a little uncomfortable because I wanted to ensure that there would be a clearly identified single Pid per device in the whole system.

Stress Test

Here’s the callback module:

-module(cpg_bench).
-behaviour(process_registry_bench).

-export([init/0]).
-export([register/2, unregister/2]).
-export([retrieve/1]).
-export([process_loop/0]).

init() ->
	%% start app on every node
	Nodes = [node() | nodes()],
	[rpc:call(Node, reltool_util, application_start, [cpg]) ||  Node <- Nodes].

register(Key, Pid) ->
	ok = cpg:join(Key, Pid).

unregister(_Key, Pid) ->
	ok = cpg:leave(Pid).

retrieve(Key) ->
	case catch cpg:get_members(Key) of
		{ok, Key, [Pid]} -> Pid;
		{error, {no_such_group, Key}} -> undefined;
		Error -> {error, Error}
	end.

process_loop() ->
	receive
		_ -> ok
	end.

-module(cpg_bench).

-behaviour(process_registry_bench).

-export([init/0]).

-export([register/2, unregister/2]).

-export([retrieve/1]).

-export([process_loop/0]).

init() ->

%% start app on every node

Nodes = [node() | nodes()],

[rpc:call(Node, reltool_util, application_start, [cpg]) || Node <- Nodes].

ok = cpg:join(Key, Pid).

unregister(_Key, Pid) ->

ok = cpg:leave(Pid).

retrieve(Key) ->

case catch cpg:get_members(Key) of

{ok, Key, [Pid]} -> Pid;

{error, {no_such_group, Key}} -> undefined;

Error -> {error, Error}

end.

process_loop() ->

receive

_ -> ok

end.

The results of the stress test are:

	1 Node	2 Nodes	3 Nodes	4 Nodes
Reg / second	110,198	42,680	20,703	8,488
Retrieve registered Key (ms)	0	0	0	0
Unreg / second	109,374	32,264	25,599	15,128
Retrieve unregistered Key (ms)	0	1	0	0
Re-Reg / second	126,791	30,862	32,138	20,791
Retrieve re-registered Key (ms)	0	0	0	0
Retrieve Key of killed Pid (ms)	error	error	error	error

Conclusions

These are overall very good results.
I was surprised of the major drop in a cluster of 4 nodes. I run this test multiple times and it always returned similar results.
The monitoring of processes didn’t work appropriately. Even on a single node, the test experienced an internal timeout:

{'EXIT',
 {timeout,
  {gen_server,
   call,
   [cpg_default_scope,
    {get_members,
     "100000"}]}}}

{'EXIT',

{timeout,

{gen_server,

call,

[cpg_default_scope,

{get_members,

"100000"}]}}}

I had to decrease the number of processes to around 25,000 to have the stress test pass in a cluster of 4 nodes. The monitoring issue didn’t make me feel particularly at ease, however this library did look like a potential candidate.

5. Custom Solution: Syn

Considerations

Since it became clear that I could not use Erlang’s native global or pg2 modules, and that the two other libraries I looked into were candidates but each one with their own little twerks, I decided to try a custom solution, which I called syn (short for synonym).

In any distributed system you are faced with a consistency challenge, which is often resolved by having one master arbiter performing all write operations (chosen with a mechanism of leader election), or through atomic transactions. As said here above, I needed a global process registry for an application of the IoT field. In this context, Keys used to identify a process are often the physical object’s unique identifier (for instance, its serial or mac address), and are therefore already defined and unique before hitting the system. The consistency challenge is less of a problem in this case, since the likelihood of concurrent incoming requests that would register processes with the same Key is extremely low and, in most cases, acceptable.

Therefore, Availability has been chosen over Consistency and Syn is eventually consistent.

Under the hood, Syn performs dirty reads and writes into a distributed in-memory Mnesia table, replicated across all the nodes of the cluster. This made me feel comfortable that I wouldn’t need to reinvent the replication mechanisms of Erlang’s native DB, however I needed a way to handle conflict resolution and net splits. For this reason, Syn can automatically manage conflict resolution by implementing a specialized and simplified version of the mechanisms used in Ulf Wiger’s unsplit framework.

You can read more about Syn in its github repo.

Stress Test

Here’s the callback module:

-module(syn_bench).
-behaviour(process_registry_bench).

-export([init/0]).
-export([register/2, unregister/2]).
-export([retrieve/1]).
-export([process_loop/0]).

init() ->
	%% start app on every node
	Nodes = [node() | nodes()],
	F = fun(Node) ->
		rpc:call(Node, syn, start, []),
		rpc:call(Node, syn, init, [])
	end,
	lists:foreach(F, Nodes).

register(Key, Pid) ->
	ok = syn:register(Key, Pid).

unregister(Key, _Pid) ->
	ok = syn:unregister(Key).

retrieve(Key) ->
	syn:find_by_key(Key).

process_loop() ->
	receive
		_ -> ok
	end.

-module(syn_bench).

-behaviour(process_registry_bench).

-export([init/0]).

-export([register/2, unregister/2]).

-export([retrieve/1]).

-export([process_loop/0]).

init() ->

%% start app on every node

Nodes = [node() | nodes()],

F = fun(Node) ->

rpc:call(Node, syn, start, []),

rpc:call(Node, syn, init, [])

end,

lists:foreach(F, Nodes).

ok = syn:register(Key, Pid).

unregister(Key, _Pid) ->

ok = syn:unregister(Key).

retrieve(Key) ->

syn:find_by_key(Key).

process_loop() ->

receive

_ -> ok

end.

The results of the stress test are:

	1 Node	2 Nodes	3 Nodes	4 Nodes
Reg / second	106,324	52,792	60,958	40,929
Retrieve registered Key (ms)	0	0	0	56
Unreg / second	105,506	50,591	67,042	42,896
Retrieve unregistered Key (ms)	0	0	0	0
Re-Reg / second	106,424	51,322	77,258	47,125
Retrieve re-registered Key (ms)	0	0	0	0
Retrieve Key of killed Pid (ms)	719	995	1,577	1,825

Conclusions

These are overall very good results. I’m not sure why Syn is performing better with 3 nodes than with 2 (and I’ve repeated this test more than once).
I didn’t need to reduce the process count to make all of the test pass.
The monitoring of processes worked appropriately.

Final notes

I want to stress out how comparisons and these tests are difficult to perform. Every library behaves differently, and it is hard (if not impossible) to define some kind of a common stress test to allow for a better understanding of their performance levels. I gave it a go, but looking at the above definition of my stress test for instance I ask myself: “Why did I set the process count to 100,000? I can see that most libraries behave fine with lower numbers”. Also, “What would happen if instead of registering processes sequentially in a single process per node, we had them register themselves simultaneously, therefore increasing the load on the registry?”. More importantly, “Does this test represent some kind of real life scenario?”.

This article wants to share my thoughts and how I ended up writing Syn. Sure, Syn performs well in the defined use case and stress test, but this does in no way mean that the other libraries here won’t perform way better in other stress tests and scenarios. I’d actually be glad to know that someone else is willing to take the time to evaluate these, and other, global process registries. They are a kind of holy grail; and let’s remember that anything distributed is never easy, nor given.

As a final note, I’d enjoy reading comments from the library authors or other Erlang enthusiasts. This is such a delicate matter that I’d love to have a healthy exchange of opinions, hopefully contributing to improving all of our experiences.

12 Comments

Adam Lindberg

July 6, 2015 at 12:53 pm / Reply

Would be interesting to see Locker (https://github.com/wooga/locker) compared in these benchmarks. It’s a distributed locker service that can also be used as a process registry.
- Roberto Ostinelli Author
  
  July 6, 2015 at 2:22 pm / Reply
  
  Hi Adam, Knut did point at it after my thread on Erlang Questions mailing list.
  
  As I told him, this is an expiring key/value store DB, which I read can be used as a mechanism for leader election. However, I assume all the rest still needs to be developed.
  For instance, does it do process monitoring? How does it handle conflict resolution?
Michael Truog

July 6, 2015 at 6:53 pm / Reply

Some issues with the test above:
1) No hardware or instance type was mentioned, making the test irreproducible and limiting the impact of the results.
2) No common timeout value was enforced for the operations among the separate process registries. It is important to determine the performance given some amount of time for a synchronous request.
3) Syn is likely to lose process registry data during a netsplit, due to process registry modifications that occur during the netsplit that are unable to be resolved after the partitions merge. This is a common problem with mnesia, so it is probably a good thing to explain here, since other process registry solutions do not have this problem.
4) The usage of CloudI Process Groups (https://github.com/okeuday/cpg) was erroneously mentioned as having monitoring processes, but the error clearly shows a timeout occurred, due to the default timeout value being used (5000 ms). For cpg, it does appear like you are reaching a bottleneck on a single scope process, which is why it is possible to create more than 1 scope (to avoid always using the single default scope, i.e., cpg_default_scope). The other thing to keep in mind is that cached cpg data can be used to avoid putting extra load on a scope process by utilizing the cpg_data module. These details may not match your use case, but it is at least important that you are using the same timeout values for all the process registry tests, to avoid premature timeout exceptions.
- Roberto Ostinelli Author
  
  July 6, 2015 at 7:20 pm / Reply
  
  Michael,
  Sometimes I have a hard time in understanding what you refer to. :) You say that Syn:
  
  is likely to lose process registry data during a netsplit, due to process registry modifications that occur during the netsplit that are unable to be resolved after the partitions merge
  
  …No. And that’s the whole point actually. Syn is able to resolve registry modifications that have happened during a net split. You might want to check Syn’s test suite that covers net splits. So, why do you think it is unable to do it?
  
  As far as cdg usage with multiple scopes, I understand them, but I don’t care about using them. My use case is extremely simple, I don’t see why I need to use multiple scope to be able to avoid a cdg bottleneck. You also refer to cdg being “erroneously mentioned as having monitoring processes”. No, I said that it monitor processes and removes them when a process dies. Was I mistaken in this?
  
  Finally, since you’re interested in hearing hardware specs here it goes: all of these tests were run on a local box, a 2014 MacBook 15-inch Retina running Erlang 17.5 on a Yosemite 10.10.4 (i7 2.2 GHz, 16 GB Ram).
  - Michael Truog
    
    July 6, 2015 at 7:35 pm /
    
    The comment on Syn, “is likely to lose process registry data during a netsplit, due to process registry modifications that occur during the netsplit that are unable to be resolved after the partitions merge”, is the case where a netsplit occurs so that partition A and partition B exist separately. Groups are modified in both parition A and paritition B. Generally, a mnesia merge operation has to choose whether partiion A or partition B is the accurate data for the modified groups, when merging the data after the partition A and partition B reconnect. If Syn is able to not discard process registry data from both partitions and merge in a way where all data survives the merge, it would be important to mention that and describe how it works. If it does work, it likely sacrifices availability to make it possible.
    
    I didn’t enter the comment on cpg properly. To quote the comment on cpg in the test results above, “The monitoring of processes didn’t work appropriately.”, however, the monitoring of processes worked fine. Your usage of the cpg module didn’t specify a timeout long enough to allow a response and since no common timeout was used among all process registries, it is unclear what conclusions you can draw, due to having no data when a timeout occurred without waiting for the same period with all process registries.
Sean Cribbs

March 24, 2016 at 5:14 pm / Reply

I’d be interested to see a benchmark with Christopher Meiklejohn’s riak_pg. https://github.com/cmeiklejohn/riak_pg
- Roberto Ostinelli Author
  
  November 3, 2016 at 6:07 pm / Reply
  
  Now I am too :)
Manuel Rubio

September 5, 2016 at 12:36 pm / Reply

In Altenwald we was working for long time in our own solution for global registering of processes available in a cluster (only one process with a specific name in the whole cluster) and we develop:

https://github.com/altenwald/forseti

I’ll check the benchmarks to see if our solution is worthy… good post! :-)
- Roberto Ostinelli Author
  
  November 3, 2016 at 6:08 pm / Reply
  
  :)
Artificial intelligence creates content for the site, no worse than a copywriter, you can also use it to write articles. 100% uniqueness :). Click Here: https://stanford.io/3FXszd0?h=79e7a2e3f05ebb327a202adbadf211ce&

November 11, 2022 at 4:01 am / Reply

8xn8sl
Artificial intelligence creates content for the site, no worse than a copywriter, you can also use it to write articles. 100% uniqueness :). Click Here: https://stanford.io/3FXszd0?h=aaeeb9911cda89d2cee3495a48069563&

November 11, 2022 at 4:02 am / Reply

eyqvas33
Artificial intelligence creates content for the site, no worse than a copywriter, you can also use it to write articles. 100% uniqueness :). Click Here: https://stanford.io/3FXszd0?h=79e7a2e3f05ebb327a202adbadf211ce&

November 14, 2022 at 11:15 pm / Reply

kmg8b9l5