When designing systems that our businesses will rely on, we do so with resilience in mind.
Twenty-five years ago, technologies like RAID and server mirroring were novel and, in some ways, non-trivial to implement; today this is no longer the case and it is a reflex action to procure multiple servers, LAN switches, firewalls, and the like to build resilient systems.
This does not, of course, guarantee us 100 per cent uptime. The law of Mr Murphy applies from time to time: if your primary firewall suffers a hardware failure, there is a tiny, but non-zero, chance that the secondary will also collapse before you finish replacing the primary.
If you have a power failure, there is a similarly tiny but non-zero likelihood that the generator you have tested weekly for years will choose this moment to cough stubbornly rather than roar into life. That is, unless you are (or, more accurately, the nature of your business is) so risk-averse that you can justify spending on further levels of resilience to reduce the chance of an outage still more (though never, of course, to nothing).
There are occasions, though, where planning for failure becomes hard.
Let us look at a recent example. In July 2020, the main telco in Jersey had a major outage because of a problem with a device providing time service to the organisation’s network. The kicker in this event was that the failed device did not fail in the way we are all used to – by making a “bang” noise and emitting smoke; had it done so, in fact, all would have been well as the secondary unit would have taken over.
No, this was a more devious kind of time server which only part-failed. It kept running but started serving times from about 20 years in the past (by no coincidence at all this was the factory default time setting), thus confusing network infrastructure devices and causing traffic to stop flowing.
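A monitoring check for this failure mode need not be elaborate, once you have thought of it. The sketch below is illustrative rather than anything from the incident itself (the function name and the 60-second threshold are my own assumptions): it flags a time source whose reported time has strayed far from the local clock, which is exactly the silent part-failure described above.

```python
import time

# Maximum offset we will trust from a time source, in seconds.
# Hypothetical threshold; a real deployment would tune this.
MAX_OFFSET_SECONDS = 60.0

def time_source_is_sane(reported_unix_time, local_unix_time=None):
    """Return True if the reported time is within MAX_OFFSET_SECONDS
    of the local clock. A part-failed server quietly serving its
    factory-default epoch (a date decades in the past) fails this
    check even though it is still answering queries."""
    if local_unix_time is None:
        local_unix_time = time.time()
    return abs(reported_unix_time - local_unix_time) <= MAX_OFFSET_SECONDS

# A healthy response: an offset of a couple of seconds.
print(time_source_is_sane(time.time() + 2))   # True

# The Jersey-style failure: a server reset to roughly 20 years ago.
twenty_years = 20 * 365 * 24 * 3600
print(time_source_is_sane(time.time() - twenty_years))  # False
```

The point, of course, is that nobody writes such a check unless they have first imagined that a time server might keep answering while telling the wrong decade.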
Customer dissatisfaction was palpable, of course, but as an IT specialist one does have to feel something for the company’s technical team: how many of us would ever consider, as a possible failure case, something that the technical chief described quite correctly as a “sequence of events that was almost impossible to foresee”?
(Incidentally, in a somewhat more good-news story, stepping back a moment to our point about extra layers of resilience, the same company had previously survived three offshore cables being severed… by having a fourth).
Could monitoring tools have been put in place to see issues like this when they happen? Yes, absolutely, but the point is that to do so one would first need to identify the scenario as something that could happen. In risk management terms, this type of failure – very high impact but infinitesimally unlikely – is the worst possible kind for a risk manager. There are theories and books about how one can contemplate and deal with such risks; the best-known is probably Nassim Nicholas Taleb's The Black Swan, which talks of just this kind of risk. But if you want to try to defend against the unexpected, then at the very least you need to sit down with a significant number of people in a highly focused way, preferably with an expert in the field to guide and moderate, and work on identifying such possible "black swan" events.
While the black swan concept is most definitely a thing to bear in mind, there is in fact a far more common problem with systems that we consider resilient – a failure to understand how the resilience works.
One particular installation at a company with an office and two data centres had point-to-point links in a triangle between each premises, and each data centre had an internet connection. The two firewalls, one in each data centre, were configured as a resilient pair, and worked as such for years. One day internet service went down, and investigation showed that the secondary unit had lost track of the primary and had switched itself to become the primary. Having two active primaries caused split traffic flows, and hence an outage.
In hindsight, this was completely predictable. The way the primary/secondary relationship was maintained between the devices was for the primary to send a “heartbeat” signal to the secondary every few seconds; if the secondary failed to receive the heartbeat three times, it woke up and acted as a primary. Because the devices were in separate data centres, they were connected through various pieces of technology: a LAN patch cord into a switch, into a fibre transceiver, into a telco fibre, then the same in reverse at the other end.
A fault on any one of those elements could cause the network devices to reconfigure their topology to switch data over the other way around the fibre triangle – with the change causing a network blip sufficiently long to drop three heartbeats. In fact, the only approved configuration for the primary/secondary interconnection was a crossover Ethernet cable from one device to the other: the failover code was written with the assumption that, aside perhaps from a highly unlikely sudden patch cord fault, the primary becoming invisible to the secondary meant that the former had died.
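The failover logic described above can be modelled in a few lines. This is an illustrative sketch, not vendor code: it shows only the miss-counting behaviour, and the three-miss threshold is the one from the story. The flaw is visible immediately: the secondary promotes itself on silence alone, with no check that the primary has actually died.

```python
from dataclasses import dataclass

MISSED_HEARTBEATS_BEFORE_TAKEOVER = 3

@dataclass
class SecondaryFirewall:
    """Minimal model of the secondary's behaviour: each heartbeat
    interval it either receives the primary's heartbeat or counts a
    miss; three consecutive misses and it promotes itself."""
    missed: int = 0
    is_primary: bool = False

    def on_interval(self, heartbeat_received):
        if heartbeat_received:
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= MISSED_HEARTBEATS_BEFORE_TAKEOVER:
                # Takeover on silence alone: no attempt to confirm
                # that the primary is actually dead.
                self.is_primary = True

# A network blip long enough to drop three heartbeats. The primary is
# alive throughout, yet the secondary promotes itself regardless.
secondary = SecondaryFirewall()
for received in [True, True, False, False, False]:
    secondary.on_interval(received)
print(secondary.is_primary)  # True: split-brain, two active primaries
```

With a crossover cable between the devices, three consecutive misses really did mean a dead primary; with a multi-hop telco link in between, a routine topology reconvergence produced the same three misses and hence a false takeover.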
Many of us have come across similar instances, where something we expected to fail over has not done so. It’s equally common, too, to come across instances where the failover works OK but then there are issues with the failback, which can be just as problematic. I recall a global WAN I once worked on where, for whatever reason, failovers from primary to secondary were so quick that you didn’t notice any interruption (the only clue was the alert from the monitoring console) but there was a pause of several seconds when failing back.
In the firewall example, even when connectivity was restored the devices would not re-synch without a reboot: remember, the only supported failure scenario was the primary dying completely, which meant that it was only at boot time that it would check to see which role its partner was playing so it could act accordingly. Until someone turned it off and back on again, there was no chance that the problem would go away.
To make our resilient systems truly resilient, then, we need to do three things.
First, we should give some thought to those “black swan” events. It may be that we cannot afford masses of time and effort to consider such low-probability risks, but at the very least we should take a conscious decision on how much or how little we will do in that respect: risk management is all about reasoning and making conscious decisions like that.
Second, if we don’t have the knowledge of the precise way our systems and their failover mechanisms work, we must engage people who do and get the benefit of their expertise and experience… and while we’re at it, we should read the manual: nine times out of ten it will tell us how to configure things, even if it doesn’t explain why.
Finally, though, we need to test things – thoroughly and regularly. In our firewall example all potential failure modes should have been considered: if a failure of one of a handful of components could cause an outage, why not test all of them? And when we test, we need to do it for real: we don’t just test failover in the lab and then install the kit in a production cabinet, we test it once it’s in too.
This may require us to persuade the business that we need downtime – or at least potential downtime, to cater for the test being unsuccessful – but if management have any sense, they will be persuadable that an approved outage during a predictable time window, with the technical team standing by and watching like hawks, is far better than an unexpected but entirely foreseeable outage when something breaks for real and the resilience turns out not to work.
Oh, and when you test failover and failback, run for several days in a failed-over state if you can: many problems don’t manifest instantly, and you will always learn more in a multi-day failover than in one that lasts only a couple of minutes. Bear in mind also the word “regularly” that I used alongside “thoroughly”. Even if we know there has been no change to a particular component, there may well be some knock-on effect from a change to something else. Something that used to be resilient may have become less resilient or even non-resilient because something else changed and we didn’t realise the implication – so regular resilience testing is absolutely key.
Because if something isn't resilient, this will generally not be because of some esoteric potential failure mode that is next to impossible to anticipate and/or difficult or impossible to test. Most of the time it will be because something went wrong – or something was configured wrongly – in a way you could have emulated in a test. ®