I Hate Time-Outs
A recurring issue in the design of concurrent and/or distributed systems is choosing those dreaded timeout values. It seems that they are inevitable, doesn't it? They seem arbitrary, and they are baked in in a way that often makes them hard to change and manage. In my experience, they are always wrong somehow, or at least need a lot of trial-and-error hand tuning to get right.
Imagine if we could do away with them altogether. Wouldn't that be a relief?
The other day, when I was explaining some Erlang issues to Steven Freeman (@sf105), he summed up my explanation with those magic words: "When processes are cheap, who needs timeouts?" And it's so true.
The ideas I was describing to him came from a recent discussion on the erlang-questions mailing list. That discussion started with a question about the default timeout in Erlang's standard RPC mechanism, gen:call, which is used to do an RPC invocation on a generic server. It has a default timeout of five seconds, which applies whenever one process performs an RPC request to another process inside the Erlang ecosystem. Why? In the end someone authoritatively explained that it was really a mistake in the original design to have a default timeout different from infinity, because it's not really needed.
The reason timeouts are not needed is that when you use the gen:call mechanism, the caller monitors the callee while waiting for a reply. That means the caller finds out if the target of the RPC dies. Why have a timeout then?
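To make that concrete, here is a minimal sketch of a call that waits forever but still returns if the callee dies. It is not the actual OTP code; the module name monitored_call, the message shapes and the toy echo server are all made up for illustration.

```erlang
-module(monitored_call).
-export([call/2, start_echo/0]).

%% Send Request to Pid and wait for the reply with no timeout at all.
%% The monitor guarantees that we wake up if Pid dies before replying.
call(Pid, Request) ->
    Ref = erlang:monitor(process, Pid),
    Pid ! {call, {self(), Ref}, Request},
    receive
        {Ref, Reply} ->
            %% Got the reply: drop the monitor and any pending 'DOWN'.
            erlang:demonitor(Ref, [flush]),
            {ok, Reply};
        {'DOWN', Ref, process, Pid, Reason} ->
            %% The callee died (or never existed), so no reply will come.
            {error, Reason}
    end.

%% Hypothetical toy server: echoes every request, crashes on 'crash'.
start_echo() ->
    spawn(fun echo_loop/0).

echo_loop() ->
    receive
        {call, _From, crash} ->
            exit(boom);
        {call, {Caller, Ref}, Request} ->
            Caller ! {Ref, Request},
            echo_loop()
    end.
```

Calling a process that is already dead returns immediately too, because monitoring a dead process delivers a 'DOWN' message with reason noproc.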
Timeouts really belong at the edges of the system, not inside it. The relevant timeout is one that the ultimate caller -- the client -- should be concerned with. And instead of specifying a timeout, why not support some kind of cancel mechanism?
When interacting with some external resource, you can feasibly use a timeout to decide that the resource has become unavailable. (This strategy is also part of what Erlang nodes use to determine liveness of connected nodes, in combination with TCP sockets being broken.)
Timeouts waiting for some response to arrive should only be used at the edges.
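As an illustration of a timeout at the edge, here is a small sketch that talks to an external TCP service and gives up if it does not answer in time. The module name edge_timeout, the function fetch_line and the line-based protocol are assumptions made for the example; gen_tcp:connect/4 and gen_tcp:recv/3 are the standard calls that take a timeout in milliseconds.

```erlang
-module(edge_timeout).
-export([fetch_line/3]).

%% Read one line from an external service, waiting at most TimeoutMs
%% milliseconds for the connection and again for the data.
%% Returns {ok, Line} or {error, Reason}; Reason is 'timeout' when the
%% resource did not answer in time.
fetch_line(Host, Port, TimeoutMs) ->
    Opts = [binary, {active, false}, {packet, line}],
    case gen_tcp:connect(Host, Port, Opts, TimeoutMs) of
        {ok, Socket} ->
            Result = gen_tcp:recv(Socket, 0, TimeoutMs),
            gen_tcp:close(Socket),
            Result;
        {error, _Reason} = Error ->
            Error
    end.
```

Everything behind this function can then wait without timeouts of its own, since the decision that the resource is unavailable is made in exactly one place.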
Cancelling requests
Another case of a timeout at the "edges of the system" is a client sending a request to your server. Think of an HTTP request which is cancelled by the user closing the connection; or it could be an explicit cancel request in some other protocol. When such a cancellation arrives, you'd want to be able to propagate it to the part of your program which is working to carry out the request. In this case, the normal timeouts don't help you either.
Luckily, Erlang is great for something like this. A standard pattern is to create a process for each request, and then ... have processes that are spawned by the "request process" monitor or link to the original request process. That allows you to propagate the cancel request the other way -- down the RPC call chain. I think of it as an exception propagating in the opposite direction. It propagates from the caller to the callees. Awesome!
I have not been able to find code in OTP that covers this usage model, and even though this code probably doesn't do exactly the right thing, it sketches the idea:
...
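handle_request({do_something, Args}, {CallerPID,_}=ReplyTo, State) ->
    %% Run the job in its own process so the server is not blocked.
    _JobRunner = spawn(fun() ->
        %% While linked, the job receives an exit signal if the caller dies.
        link(CallerPID),
        {ok, Result} = compute_do_something(Args, State),
        unlink(CallerPID),
        gen:reply(ReplyTo, Result)
    end),
    {noreply, State}.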
i.e., while executing a long-running job compute_do_something, link with the requestor so that the JobRunner gets killed if the requestor dies.
Ideally, you'd want to use the less intrusive monitor option rather than linking, but that requires more integration across modules. But I like the idea of an "exception propagating in the opposite direction" to describe a cancel request.
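For comparison, here is a rough sketch of what the monitor-based variant could look like. It is only one possible shape, not code from OTP: the module name cancel_on_down, the function run_job and the job_done / job_failed messages are invented for the example. A small runner process monitors both the caller and a worker, and kills the worker if the caller goes away first.

```erlang
-module(cancel_on_down).
-export([run_job/2]).

%% Run Fun in a worker process on behalf of CallerPid.
%% The caller gets {job_done, Result} or {job_failed, Reason}.
%% If the caller dies first, the worker is killed: the cancel
%% travels from the caller down to the callee.
run_job(CallerPid, Fun) ->
    spawn(fun() ->
        Runner = self(),
        CallerRef = erlang:monitor(process, CallerPid),
        {Worker, WorkerRef} =
            spawn_monitor(fun() -> Runner ! {result, self(), Fun()} end),
        receive
            {result, Worker, Result} ->
                CallerPid ! {job_done, Result};
            {'DOWN', CallerRef, process, _Pid, _Reason} ->
                %% Caller is gone: nobody wants the result any more.
                exit(Worker, kill);
            {'DOWN', WorkerRef, process, _Pid, Reason} ->
                %% Worker crashed before producing a result.
                CallerPid ! {job_failed, Reason}
        end
    end).
```

Unlike the linked version, nothing here can take the caller down by accident, and the caller dying for any reason, even a normal exit, still cancels the job. The price is the extra runner process and a message protocol that caller and callee have to agree on, which is the cross-module integration mentioned above.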