A Caterpillar crawled to me one day and said
“Oh what the hell goes on inside your swollen head?
I don’t believe that you can see as much as I
Now close your eyes and tell me what do you spy?”
A detailed account of the development of
HostnameEndpoint follows, also leading to my talk at PyCon US 2014 in Montreal. During the course of its creation, many wise words were exchanged, thanks to exarkun and glyph (and itamar and tomprince!). I repeat them here, in hopes that it will help make you wiser too, should you aim for that goal. So well, indulge me whilst I rant on.
Soon after I finished working on a name-resolving TCP IPv6 client endpoint, I heard of a certain crazy endpoint, described (not in as much detail as it is now, but to some extent) in a ticket. In those days, seeing as the description was not too elaborate, and the goals not too clear, any mention of working on it would lead to a discussion such as:
<exarkun> I suspect it is obvious to anyone with more than ten years network programming experience.
<glyph> exarkun: Completely. ashfall should hurry up and get ten years of that, then.
This continued until one day Glyph decided to write a nice spec for it, and that’s where our story begins.
<kenaan> ashfall r35028 A(branches/gai-endpoint-4859/): Branching to ‘gai-endpoint-4859’ …
<ashfall> Okay, this one is a little bit scary to start.
The gist of this endpoint is that it takes a hostname, does a
getaddrinfo lookup on it, and picks the first two addresses out of the returned list. Using some time-involving algorithm, it tries connecting to both of them.
Doing things based on time in Twisted meant using
reactor.callLater. That in turn meant using
Clock as the reactor in the unit tests, since real time is slow and unpredictable, and tests are ideally fast and predictable.
This was also the day when my appreciation of the reactor grew.
My initial understanding of the task at hand, I remember, was that the endpoint would connect to the first address, record how long it took, then connect to the next address, record how long it took, compare the times taken, pick the one that took less time, and call connect one final time. After discussing it further with exarkun, I realized that this doesn’t actually need to record any durations, since we don’t care how long it took, as long as it was first.
Now, the high-level objective was to establish the connection as quickly as possible, and 3 connects would take a lot of time. So we make the connection attempts in parallel, drop the one that doesn’t connect first and use the one that does connect first. So that there are two connection attempts, one (hopefully) connection success, and one discarded connection attempt. And no third connection attempt.
And how do we do that?
“Well, fortunately Twisted is good at doing network operations in parallel”, explained exarkun. For example:
For my part I wondered, how is that parallel, with one reactor?
I learnt the reactor can do many things at once, and that that’s its job, essentially. ”
reactor.connectTCP returns immediately, after all. It doesn’t wait for the connection to be established first,” exarkun elaborated. “That’s why we need a factory with a
buildProtocol method, so that the reactor can call it once the connection has succeeded. Because by the time it does succeed, the application will have moved on to doing something else.”
“A secondary objective of the endpoint is to be conservative of network resources,” he added. “If the connection to 22.214.171.124 succeeds quickly enough, then it would be wasteful of network resources to try connecting to 1:2::3:4 as well. That’s why the relevant RFC talks about a 300 ms delay. If the first one is done really fast and you can avoid the second connection attempt, you should. So you won’t actually start both connection attempts at the same time. You’ll start the 2nd one a short while after the first one, if and only if the first one has not succeeded.
reactor.callLater is how you would do that, and
delayedCall.cancel() is how you would stop a delayed call from happening when the first one succeeds.
someDeferred.cancel() is how you would make a connection attempt give up early, if the other connection attempt succeeds.”
But now I wondered, what if 126.96.36.199 takes 254 ms and 1:2::3:4 would have taken, say, 230 ms, had we tried it (but we will never know?). And exarkun said, “[Then] Tough luck for us. We’re not going to try to solve that problem.” Turns out, it’s not as much of a problem as the problem this RFC is aimed at fixing, which is the problem where 1:2::3:4 takes 120 seconds and 188.8.131.52 takes 254 ms, and so a purely serial approach would take 120.254 seconds to succeed, whereas this one will only take 0.254 seconds to succeed. That’s a much more noticeable difference than 0.254 seconds vs 0.230 seconds.
“I watched nine cats dance on the moon
A flamingo stalked into my room
It bowed its head to me and knelt
To reveal the cards it had dealt
Now, where does
Clock come into the picture? The clock is actually a replacement for the reactor, at least in a very restricted scope. It has a
callLater method, but whereas
reactor.callLater(5, f) will only call
f after roughly 5 seconds have passed,
clock.callLater(5, f) will call
f the instant you call
clock.advance(5). (or any combination of
clock.advance calls so that the sum of the arguments passed to all calls is 5 or more). So if you’re writing a unit test for code that does
callLater(5, f), the test can complete in fewer than 5 seconds if you pass a
Clock instance in for the reactor.
Clock completes with more predictable results, since the reactor really does call
f in roughly 5 seconds, and the variation implied by “roughly” can sometimes make a difference (in a bad way) to how the code behaves. The idea is that maybe 5 seconds is close to some other critical amount of time. Perhaps sometimes the reactor gets around to calling
f very soon after 5 seconds, perhaps 5.001 seconds. That exercises one code path through the code under test. And perhaps sometimes, the reactor doesn’t get around to calling
f until 5.1 seconds have elapsed, meanwhile, something else happened in those extra .099 seconds that didn’t have time to happen in the first case, and so now you’re exercising a second, different code path through the code under test. But this is a single test method, and it should only ever exercise a single code path. Since it is exercising two, now, it can have two different outcomes, and that makes it much less useful. Perhaps one of the outcomes is a success and the other is a failure, so when you run the tests, the results flip flop between good and bad, even though you’re not changing any code. Or perhaps both outcomes are success - for now. And someone breaks one of the code paths accidentally. But they run the test, and it accidentally goes through the other code path, the one they didn’t break, and so the test passes. And they think their code works.
So, yes. Clock and writing two tests is the solution. Because with Clock, you can write a test where it appears that 5.001 seconds elapse before
f is called, and a second test where it appears that 5.1 seconds elapse. (just by doing
clock.advance(5.001) in one of them and
clock.advance(5.1) in the other)
An ace, three jacks, two queens, four kings
Then turned them into burning rings
The flames jumped out and chased four mice
Caught by their tails they turned to ice
The new super-smart TCP endpoint, what should it be called?
The first name glyph suggested was
HostnameEndpoint, because it connects you to a hostname, over whatever protocol is appropriate. Another suggestion was to call it
TCPEndpoint, since we haven’t used that name yet. And then it was pointed out that hostnames aren’t part of TCP. Nor even IP. They’re part of DNS. 99% of this endpoint is DNS functionality. So…
DNSEndpoint? But you can look up hostnames for use with SCTP or UDP via DNS as well! And this won’t let you do that. And, in fact, this endpoint does not necessarily use DNS: it could be doing name resolution via /etc/hosts, or, it could be doing name resolution using WINS… you don’t know.
The important lesson to be learnt here is that naming things for how they work is generally worse than naming things for what they do. DNS and TCP are both implementation details (albeit kind of important details in this case); the important thing about this endpoint is that it takes a hostname and gives you a stream-oriented connection.
HostnameEndpoint seemed like the best and the least ambiguous choice.
Deform to form a Star
Reading the spec, the whole idea sounded (to me, at least) too convoluted and it wasn’t very obvious where to begin. On seeking help, exarkun suggested:
- Make a list of features it should have.
- Sort them from simplest to most complex. (the failure ones come before the success ones)
- Now, write a unit test for the first thing on that list. (i.e., write the test before you have any implementation for the endpoint at all, or at least set up a
TestCasesubclass that mixes in that helper and implements the method for creating an endpoint)
Part of the idea with proceeding from simplest to most complicated is that once you’ve done all the simple things, the complicated things generally end up pretty simple. They’re only complicated if you have to tackle all the complexity at once.
We moved on to talk about what test cases to write first, because… TDD! Starting with the easy cases, I planned to begin by using the EndpointTestCaseMixin - Try making the test methods from that mixin that exercise failures pass. Then the success cases. Perhaps then try a test that exercises the case where getaddrinfo returns one ipv4 address. And then the case where it returns one ipv6 address. And when all that works, then maybe it’s time to start thinking about the interesting cases, where getaddrinfo returns an ipv4 and an ipv6 address, e.g. What if it returns ipv4 then ipv6? What if it returns ipv6 then ipv4? What if the first connection attempt succeeds in < 300ms? What if it succeeds in > 300ms, but before the second? What if the second succeeds before the first? What if they both fail? But start with the simpler cases, and work on them one at a time.
Deferreds and Raising Exceptions
A cloud appeared outside my door
And through the window saw four more
And on the back of each cloud sat
Two rainbow smiles in wizard’s hats
There was a part of the endpoint where I was using “except” to work with a Deferred firing with an exception. Needless to say, the unit test that was to check this was failing, and I wasn’t sure why. This led exarkun to elaborate on how Deferreds work, and where I was going wrong.
We have Deferreds so that functions can return before they have finished the abstract task they are responsible for. They return a Deferred and the Deferred becomes responsible for indicating the result of the abstract task, once it is complete. This lets multiple tasks proceed concurrently. You can just call one function to start one task, then it will return (before the task is complete), and you can call another function to start another task (which will also return before that task is complete). Now you have two Deferreds for the two tasks, and you can use them to learn the results of those tasks.
You learn the result by adding a callback to the Deferred. And when the result is available, the callback is called with it.
Not all results are “successes” though. But for the same reason that these functions can’t return their result directly, neither can they raise an exception directly. By the time they return, they might not have gotten far enough in their abstract task to know that there is going to be an error. So they can still only return a Deferred. So you can also attach error callbacks, or errbacks, to Deferreds as well. An errback is just like a callback, except it is called when the Deferred gets its result and that result is not a success.
So, to handle results from asynchronous, Deferred-returning functions, you take the returned Deferred and call
addCallback(callbackFunction) on it. And to handle error results from asynchronous, Deferred-returning functions, you take the returned Deferred and call
addErrback(errbackFunction) on it.
The errback is ignored if the result is not a failure, just like an “except” statement is ignored if no exception is raised. Similarly, the callback is ignored if the result is a failure.
Hence, “except” will not work for a Deferred that fires with exception, because the Deferred was returned. If anything was returned at all, an exception could not possibly have been raised. Because of how Python functions work - they can raise or they can return, they cannot do both.
Returning Deferreds From Tests
At one point, I had written a test that was returning a Deferred, and then the whole test case was timing out, never reaching completion. I learnt that since it doesn’t require any external resources, there was no reason this test needed to spin a reactor.
I asked, then, is returning Deferreds from tests a bad thing?
“It’s not ‘good’ or ‘bad really,” Glyph explained, “but if you’re going to think of it as one of those, let’s think of it as ‘bad’ by default”.
Returning a Deferred spins the global reactor. You return Deferreds from tests that need to interface with real, external resources or connect to real sockets. In this case you absolutely do not want to do that. You want to very carefully control everything that your
HostnameEndpoint is talking to. You want to fire the connection attempt success and failure in a very specific order - in several different orders, as a matter of fact. and to verify that attempts are in progress at certain times and not in progress at other times. That means you want to use Clock and you want to use MemoryReactor (possibly smashed together on one object so that you only need to have one reactor).
Then, at precise points in your test, you want to say “this deferred should be fired by now” not “whenever this Deferred happens to fire…”. As you can see, if you do it by returning Deferreds, then your failure mode is “uh… the test didn’t finish”, and not “Oh, of course, I forgot to callback deferred X when event Y happened, and my test failed telling me that that Deferred wasn’t fired”.
If you return an unfired Deferred from your test, then the test never ends. It just sits around waiting for that Deferred forever. where Trial defines “forever” to be 120 seconds, after which the test times out. There are times when it’s appropriate, but you won’t likely encounter them when testing Twisted itself, I was told.
Basically if you’re originating the Deferreds that you’re testing, you should definitely not be yielding them, or returning them from your test. Tests which do return Deferreds depend on the things that return the Deferreds that they’re returning being fully tested and having no chance of timing out.
So as a rule of thumb, only return Deferreds from a test if the test interacts with external stuff, e.g. writing tests that talk to, say, Postgres, and have to do multiple asynchronous operations. And, if you’re doing that, see if there’s a way you could reasonably interact with less external stuff first.
This is just a special case of a general rule though. The general rule is, your tests should not depend on any more stuff than they need to. Especially, they should not depend on unpredictable external factors that may make your tests behave differently under different conditions. A real reactor is by definition an unpredictable external factor; the whole point is that it’s for talking to unpredictable external stuff.
Glyph also shared an excellent analogy: “Think of it like an actual nuclear reactor. Before you hook your thermal regulator up to the coolant rods, you really want to have a high level of confidence that it is going to react in the appropriate way when the “uranium is too hot” wire gets some voltage. So you give that wire some voltage using a hand crank, not a critical mass.”
0 seconds later is still “some amount of time”
They threw five clocks down on my bed
The chimes danced out on golden threads
And turned to footprints on my wall
Sequined tears began to fall”
Now, I had a function
iterateEndpoint, which iterates through a generator that yields different endpoints for different addresses returned by gai. Now, I needed to keep calling this function for each endpoint, and as per the algorithm we are implementing, we introduce a time lag of 0.3 seconds between two endpoint’s connection attempts. So, each call to iterateEndpoint needs to occur 0.3 seconds after the previous call. I needed a technique to introduce this delay in the calls. There’s
reactor.callLater, which did this and I tried using it, and it didn’t work.
By “didn’t work”, I mean that all my tests were failing, but not failing in a way that would help me identify what’s wrong.
MemoryReactor.tcpClients was empty, when it should hold some host address, i.e., (in terms of real endpoint) a connection wasn’t being established.
callLater for this, because there was at least one connection attempt when I hadn’t used
callLater and introduced the delay, whereas now it wasn’t even trying the first endpoint’s host address.
Then I made the delay in the looped calls 0 seconds to check, and I learnt that I needed to “advance” the clock for this to work. Even when the delay was 0 seconds. And this just wasn’t make any sense.
That was when glyph started quoting Morpheus and said, “Free your mind!”
“Advancing the clock means calling
Clock.advance()”, he explained further. “Specifically when I say ‘free your mind’ I mean ‘stop thinking about everything in terms of real stuff happening’. Everything here is just a method calling another method. Some of those methods represent things (like the passage of time on a clock) but in a test it’s all a simulation. I’d also say that ultimately when it’s not in the test it’s just a higher fidelity simulation but that’s a thorny philosophical issue we probably shouldn’t get into. At any rate it’s always just methods calling other methods.
LoopingCall specifies a function to call to advance to the next connection attempt; it asks an
IReactorTime to call that thing via
But then, I wondered, if I set the time delay between two consecutive calls to the looped code as 0 seconds, why would I need to advance the clock now?
“Probably the easiest way to understand the answer to that is to look at the implementation of Clock”, Glyph said patiently. “Particularly the implementation of
Clock.callLater. When you call
callLater, it just remembers the function you asked it to run, puts it in a list, and sorts the list. Sorting a list does not call your function. Similarly, a “real” reactor’s callLater will never run your function reentrantly. 0 seconds later is still ‘some amount of time’ later, even if it’s not a very big amount of time. the behavior of Clock is that it only invokes application code when ‘time passes’ which is to say when you call .advance(). Real reactors do the same thing and only invoke timed calls when you get back to the main loop.
But… it’s zero. It’s supposed to mean “nothing”! I protested.
“You’re thinking too hard about it”, said Glyph. “It’s a number. The behavior of
Clock is such that it remembers a callable, sorts them by number, and then calls them. It stuffs it on the stack of
callLaters and re-evaluates that when the reactor gets control again. The magnitude of the number isn’t important except in that negative numbers aren’t allowed because that’s non-sensical. In actual fact it is actually an orderable object, ‘number’ is a convention that we apply to it.
callLater means ‘when time advances, no sooner than N seconds later, call this function’.”
“So, when the reactor is told to perform a task 0 seconds later, does it not do that immediately?”, I asked, still not quite convinced.
“Well of course it does it immediately. But not reentrantly. ‘As soon as possible but no sooner’.” was Glyph’s answer. “‘immediately’ is defined by some time-keeping thing, so in order to figure out when “immediately” occurs, it has to consult some time-keeping thing. And the point in the code - not in time - where it does that consultation is the reactor loop for reactors and the
advance() method for
Clock. Practically speaking,
callLater is not a hard realtime API, because Python isn’t a hard realtime system, so
callLater(n, f) cannot physically run
f precisely n seconds after its invocation; it can’t even get a zero-error concept of when its own invocation was. clocks keep moving while computers are computing. so if you say
callLater(1, f) really you’re likely for
f to get run at 1.000000001 seconds later, or something close to that. so if you say
callLater(0, f) you’re going to get 0.000000001 seconds later, if you’re comparing against real time. if you’re using a
task.Clock(), time is frozen, so you don’t get to 0 seconds later until you allow time to advance.”
“The most important thing though is that you understand the (very simple) implementation within
Clock. there’s nothing complicated going on in there, no magic. it’s almost as simple as
l = ; l.append(f); for g in l: g(). All the stuff about how time actually works is useful to understand but ultimately peripheral to the very simple rule that functions don’t get called until you call them.”
“One last time”, I said, now almost convinced, “what exactly does ‘reentrant’ mean here?”
“It means that
def forever(): reactor.callLater(0, forever) will not cause a ‘maximum recursion depth exceeded’ error, because
callLater does not call
forever below its stack frame.
forever() doesn’t re-enter itself upon the call to
callLater does not invoke its argument reentrantly. In other words, it doesn’t mutually recurse.”
The Calculus of Testing
Now, in my tests, I was just popping the result I want from
MemoryReactor.tcpClients, and matching it against a value. So, the test ensured that the connection attempts were being made, but they weren’t checking the algorithm.
Turns out, you can’t check an algorithm with a unit test. “Some people think you can check an algorithm with formal logic, but they’re probably wrong in practice. An algorithm is too complicated to check in any general way.”, explained exarkun.
So I can only hope that the implementation of the algorithm is correct?
“No, you can test that the output of the algorithm is correct for some number of inputs”, said he. “And you can hope that your selection of inputs exercises all meaningful branches (and branch combinations) in the algorithm.”
“The idea with unit tests is that you constrain the domain of possible incorrect outputs”, Glyph added. “You test what you believe to be a representative sample of each ‘interesting’ region of behavior for your algorithm, and by ‘interesting’ I mean each range of input that requires more code.”
A general lesson I learnt was that I was trying really hard to understand how the whole thing fits together all at once, but it helps to compartmentalize. It helps to just figure out what functions call what other functions; later, once you know what’s going on, learn why they’re getting called in that way. You can do a whole lot without knowing why. And in some cases there isn’t really a “why”, it’s just an arbitrary decision because the reactor has a bunch of work to do and it has to choose some order, so it does things in whatever order we thought to implement first.
And just like that, we had our endpoint written in a respectful, TDD-way, with well-tested parts, etc. But it still wasn’t perfect. Later in the reviews, exarkun suggested that
connect could be implemented more simply - probably as an independent object with instance state (and perhaps as an explicit state machine) rather than a collection of nested functions with shared, closed-over mutable objects. That led me to file another ticket, and work on getting a finite state machine into Twisted. But that’s another story!
The caterpillar gasped at me and said
“My god if that’s what’s going on inside your head
You can see so much more than I
I think it’s time to turn into a butterfly.”