Garajeando: Sleeping is not the best option

Sunday, October 18, 2020

Sleeping is not the best option

Introduction.

Some time ago we were developing a code that stored some data with a given TTL. We wanted to check not only that the data was stored correctly but also that it expired after the given TTL. This is an example of testing asynchronous code.

When testing asynchronous code we need to carefully coordinate the test with the system it is testing to avoid running the assertion before the tested action has completed^[1]. For example, the following test will always fail because the assertion in line 30 is checked before the data has expired:

In this case the test always fails but in other cases it might be worse, failing intermittently when the system is working, or passing when the system is broken. We need to make the test wait to give the action we are testing time to complete successfully and fail if this doesn’t happen within a given timeout period.

Sleeping is not the best option.

This is an improved version of the previous test in which we are making the test code wait before the checking that the data has expired to give the code under test time to run:

The problem with the simple sleeping approach is that in some runs the timeout might be enough for the data to expire but in other runs it might not, so the test will fail intermittently; it becomes a flickering test. Flickering tests are confusing because when they fail, we don’t know whether it’s due to a real bug, or it is just a false positive. If the failure is relatively common, the team might start ignoring those tests which can mask real defects and completely destroy the value of having automated tests.

Since the intermittent failures happen because the timeout is too close to the time the behavior we are testing takes to run, many teams decide to reduce the frequency of those failures by increasing the time each test sleeps before checking that the action under test was successful. This is not practical because it soon leads to test suites that take too long to run.

Alternative approaches.

If we are able to detect success sooner, succeeding tests will provide rapid feedback, and we only have to wait for failing tests to timeout. This is a much better approach than waiting the same amount of time for each test regardless it fails or succeeds.

There are two main strategies to detect success sooner: capturing notifications^[2] and polling for changes.

In the case we are using as an example, polling was the only option because redis didn’t send any monitoring events we could listen to.

Polling for changes.

To detect success as soon as possible, we’re going to probe several times separated by a time interval which will be shorter than the previous timeout. If the result of a probe is what we expect the test pass, if the result we expect is not there yet, we sleep a bit and retry. If after several retries, the expected value is not there, the test will fail.

Have a look at the checkThatDataHasExpired method in the following code:

By polling for changes we avoid always waiting the maximum amount of time. Only in the worst case scenario, when consuming all the retries without detecting success, we’ll wait as much as in the just sleeping approach that used a fixed timeout.

Extracting a helper.

Scattering ad hoc low level code that polls and probes like the one in checkThatDataHasExpired throughout your tests not only make them difficult to understand, but also is a very bad case of duplication. So we extracted it to a helper so we could reuse it in different situations.

What varies from one application of this approach to another are the probe, the check, the number of probes before failing and the time between probes, everything else we extracted to the following helper^[3]:

This is how the previous tests would look after using the helper:

Notice that we’re passing the probe, the check, the number of probes and the sleep time between probes to the AsyncTestHelpers::assertWithPolling function.

Conclusions.

We showed an example in Php of an approach to test asynchronous code described by Steve Freeman and Nat Pryce in their Growing Object-Oriented Software, Guided by Tests book. This approach avoids flickering test and produces much faster test suites than using a fixed timeout. We also showed how we abstracted this approach by extracting a helper function that we are reusing in our code.

We hope you’ve found this approach interesting. If you want to learn more about this and several other techniques to effectively test asynchronous code, have a look at the wonderful Growing Object-Oriented Software, Guided by Tests book^[4].

Acknowledgements.

Thanks to my Codesai colleagues for reading the initial drafts and giving me feedback and to Chrisy Totty for the lovely cat picture.

Notes.

[1] This is a nice example of Connascence of Timing (CoT). CoT happens when when the timing of the execution of multiple components is important. In this case the action being tested must run before the assertion that checks its observable effects. That's the coordination we talk about. Check our post about Connascence to learn more about this interesting topic.

[2] In the capturing notifications strategy the test code "observes the system by listening for events that the system sends out. An event-based assertion waits for an event by blocking on a monitor until gets notified or times out", (from Growing Object-Oriented Software, Guided by Tests book).
Some time ago we developed some helpers using the capturing notifications strategy to test asynchronous ClojureScript code that was using core.async channels. Have a look at, for instance, the expect-async-message assertion helper in which we use core.async/alts! and core.async/timeout to implement this behaviour. The core.async/alts! function selects the first channel that responds. If that channel is the one the test code was observing we assert that the received message is what we expected. If the channel that responds first is the one generated by core.async/timeout we fail the test. We mentioned these async-test-tools in previous post: Testing Om components with cljs-react-test.

[3] Have a look at the testing asynchronous systems examples in the GOOS Code examples repository for a more object-oriented implementation of helper for the polling for changes strategy, and also examples of the capturing notifications strategy.

[4] Chapter 27, Testing Asynchronous code, contains a great explanation of the two main strategies to test asynchronous code effectively: capturing notifications and polling for changes.

References.

Growing Object-Oriented Software, Guided by Tests, Steve Freeman and Nat Pryce (chapter 27: Testing Asynchronous Code)
Eradicating Non-Determinism in Tests, Martin Fowler

Garajeando