
Excuse me, what just happened? Resilience is tough when your failure is due to a ‘sequence of events that was almost impossible to foresee’


Feature When designing systems that our businesses will rely on, we do so with resilience in mind.

Twenty-five years ago, technologies like RAID and server mirroring were novel and, in some ways, non-trivial to implement; today this is no longer the case and it is a reflex action to procure multiple servers, LAN switches, firewalls, and the like to build resilient systems.

This does not, of course, guarantee us 100 per cent uptime. The law of Mr Murphy applies from time to time: if your primary firewall suffers a hardware failure, there is a tiny, but non-zero, chance that the secondary will also collapse before you finish replacing the primary.

If you have a power failure, there is a similarly tiny but real likelihood that the generator you have tested weekly for years will choose this moment to cough stubbornly rather than roar into life. That is, unless you are (or, more accurately, the nature of your business is) so risk-averse that you can justify spending on further levels of resilience to reduce the chance of an outage even more (though never, of course, to nothing).

There are occasions, though, where planning for failure becomes hard.

Let us look at a recent example. In July 2020, the main telco in Jersey had a major outage because of a problem with a device providing time service to the organisation’s network. The kicker in this event was that the failed device did not fail in the way we are all used to – by making a “bang” noise and emitting smoke; had it done so, in fact, all would have been well as the secondary unit would have taken over.

Impossible

No, this was a more devious failure: the time server only part-failed. It kept running but started serving a time roughly 20 years in the past (by no coincidence at all, this was the factory default setting), thus confusing network infrastructure devices and causing traffic to stop flowing.
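
By way of illustration, a sanity check along the following lines could catch a time source drifting wildly rather than dying outright. This is a minimal sketch in Python using only the standard library: it sends a bare-bones SNTP query to a server and alerts if the answer is implausibly far from the local clock. The hostname is hypothetical, and a real monitoring check would compare several references rather than trusting the local clock alone.

```python
import socket
import struct
import time

NTP_EPOCH_OFFSET = 2208988800   # seconds between the NTP epoch (1900) and the Unix epoch (1970)
MAX_PLAUSIBLE_OFFSET = 300      # alert if the server is more than five minutes adrift

def ntp_server_time(host: str, port: int = 123, timeout: float = 5.0) -> float:
    """Ask an NTP/SNTP server for its idea of the current time, in Unix seconds."""
    packet = b"\x1b" + 47 * b"\0"   # LI=0, VN=3, Mode=3 (client request)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(packet, (host, port))
        data, _ = sock.recvfrom(512)
    # The Transmit Timestamp's integer seconds live in bytes 40-43 of the reply.
    transmit_seconds = struct.unpack("!I", data[40:44])[0]
    return transmit_seconds - NTP_EPOCH_OFFSET

def check_time_source(host: str) -> None:
    offset = abs(ntp_server_time(host) - time.time())
    if offset > MAX_PLAUSIBLE_OFFSET:
        # In production this would raise an alert in the monitoring system rather than print.
        print(f"ALERT: {host} is roughly {offset / 86400:.0f} days adrift from local time")
    else:
        print(f"OK: {host} agrees with local time to within {offset:.1f} seconds")

check_time_source("ntp1.example.internal")   # hypothetical internal time server
```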

Customer dissatisfaction was palpable, of course, but as an IT specialist one does have to feel something for the company’s technical team: how many of us would ever consider, as a possible failure case, something that the technical chief described quite correctly as a “sequence of events that was almost impossible to foresee”?

(Incidentally, in a somewhat happier story, and returning to our point about extra layers of resilience, the same company had previously survived three offshore cables being severed… by having a fourth.)

Could monitoring tools have been put in place to catch issues like this when they happen? Yes, absolutely, but to do so one would first need to identify the scenario as something that could happen. This type of failure – very high impact but infinitesimally unlikely – is the worst possible kind for a risk manager. There are theories and books about how one can contemplate and deal with such risks, the best-known probably being Nassim Nicholas Taleb’s The Black Swan, which talks of just this kind of risk. But if you want to try to defend against the unexpected, then at the very least you need to sit down with a significant number of people in a highly focused way, preferably with an expert in the field to guide and moderate, and work on identifying such possible “black swan” events.

While the black swan concept is most definitely a thing to bear in mind, there is in fact a far more common problem with systems that we consider resilient – a failure to understand how the resilience works.

One particular installation at a company with an office and two data centres had point-to-point links forming a triangle between the three sites, and each data centre had its own internet connection. The two firewalls, one in each data centre, were configured as a resilient pair, and worked as such for years. One day internet service went down, and investigation showed that the secondary unit had lost track of the primary and had switched itself to become a primary. Having two active primaries split the traffic flows, and hence caused an outage.

Predictable

In hindsight, this was completely predictable. The way the primary/secondary relationship was maintained between the devices was for the primary to send a “heartbeat” signal to the secondary every few seconds; if the secondary failed to receive the heartbeat three times, it woke up and acted as a primary. Because the devices were in separate data centres, they were connected through various pieces of technology: a LAN patch cord into a switch, into a fibre transceiver, into a telco fibre, then the same in reverse at the other end.

A fault on any one of those elements could cause the network devices to reconfigure their topology to switch data over the other way around the fibre triangle – with the change causing a network blip sufficiently long to drop three heartbeats. In fact, the only approved configuration for the primary/secondary interconnection was a crossover Ethernet cable from one device to the other: the failover code was written with the assumption that, aside perhaps from a highly unlikely sudden patch cord fault, the primary becoming invisible to the secondary meant that the former had died.
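
To make that failure mode concrete, here is a rough Python sketch of the promotion logic described above. It is purely illustrative rather than any vendor’s actual code: the device names are invented, and the three-missed-heartbeats threshold is the one from the story. It shows how an inter-site blip lasting a few heartbeat intervals leaves both units believing they are the primary.

```python
MISSED_HEARTBEATS_TO_PROMOTE = 3   # the threshold described above

class Firewall:
    def __init__(self, name: str, is_primary: bool):
        self.name = name
        self.is_primary = is_primary
        self.missed = 0

    def heartbeat_received(self) -> None:
        self.missed = 0

    def heartbeat_missed(self) -> None:
        self.missed += 1
        if not self.is_primary and self.missed >= MISSED_HEARTBEATS_TO_PROMOTE:
            # The secondary cannot distinguish "primary is dead" from "the path is blipping".
            self.is_primary = True
            print(f"{self.name}: no heartbeat for {self.missed} intervals, promoting myself to primary")

def simulate(blip_intervals: int) -> None:
    primary = Firewall("fw-dc1", is_primary=True)    # stays alive throughout
    secondary = Firewall("fw-dc2", is_primary=False)
    for interval in range(10):
        if interval < blip_intervals:                # inter-DC path down while the topology reconverges
            secondary.heartbeat_missed()
        else:
            secondary.heartbeat_received()
    if primary.is_primary and secondary.is_primary:
        print("Split brain: both devices now believe they are the primary")

simulate(blip_intervals=4)   # a reconvergence long enough to drop three heartbeats
```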

Many of us have come across similar instances, where something we expected to fail over has not done so. It is equally common to come across cases where the failover works fine but the failback then causes problems, which can be just as disruptive. I recall a global WAN I once worked on where, for whatever reason, failovers from primary to secondary were so quick that you didn’t notice any interruption (the only clue was the alert from the monitoring console), yet failing back brought a pause of several seconds.

In the firewall example, even when connectivity was restored the devices would not re-sync without a reboot: remember, the only supported failure scenario was the primary dying completely, so each device checked which role its partner was playing only at boot time and acted accordingly. Until someone turned one of them off and back on again, there was no chance that the problem would go away.
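
The reboot requirement follows directly from that design. As a tiny, purely illustrative sketch (again, not any vendor’s implementation): if the role decision lives only in the start-up path, nothing ever revisits it once both units are running and claiming to be primary.

```python
class Firewall:
    def __init__(self, name):
        self.name = name
        self.role = None

    def boot(self, partner_role):
        # The role is negotiated ONLY here, at start-up: defer to a partner that is
        # already primary, otherwise claim the primary role.
        self.role = "secondary" if partner_role == "primary" else "primary"

# After the split brain, the only code path that re-checks the partner is boot(),
# so the situation persists until one unit is power-cycled with its partner visible.
fw = Firewall("fw-dc2")
fw.boot(partner_role="primary")
print(fw.role)   # "secondary" - but only because the question was asked again at boot time
```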

To make our resilient systems truly resilient, then, we need to do three things.

First, we should give some thought to those “black swan” events. It may be that we cannot afford masses of time and effort to consider such low-probability risks, but at the very least we should take a conscious decision on how much or how little we will do in that respect: risk management is all about reasoning and making conscious decisions like that.

Expertise

Second, if we don’t have the knowledge of the precise way our systems and their failover mechanisms work, we must engage people who do and get the benefit of their expertise and experience… and while we’re at it, we should read the manual: nine times out of ten it will tell us how to configure things, even if it doesn’t explain why.

Finally, though, we need to test things – thoroughly and regularly. In our firewall example all potential failure modes should have been considered: if a failure of one of a handful of components could cause an outage, why not test all of them? And when we test, we need to do it for real: we don’t just test failover in the lab and then install the kit in a production cabinet, we test it once it’s in too.
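
As a sketch of what “test all of them” might look like as a plan, the skeleton below walks the components in the heartbeat path from the earlier example and fails each one in turn. Every name and function here is a placeholder: fail_component would really mean pulling a cable or powering a unit off, and service_still_up would be a genuine end-to-end health check, not a stub.

```python
# Components in the heartbeat path from the firewall example; the names are illustrative.
HEARTBEAT_PATH = [
    "patch-cord-dc1",
    "switch-dc1",
    "fibre-transceiver-dc1",
    "telco-fibre",
    "fibre-transceiver-dc2",
    "switch-dc2",
    "patch-cord-dc2",
]

def fail_component(name: str) -> None:
    print(f"--- simulating failure of {name} (pull the cable, power it off, ...) ---")

def restore_component(name: str) -> None:
    print(f"--- restoring {name} ---")

def service_still_up() -> bool:
    # Stub: in practice, probe both firewalls and confirm traffic still flows end to end.
    return True

def run_failure_tests() -> None:
    for component in HEARTBEAT_PATH:
        fail_component(component)
        passed = service_still_up()
        restore_component(component)
        print(f"{component}: {'PASS' if passed else 'FAIL - resilience did not hold'}")

run_failure_tests()
```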

This may require us to persuade the business that we need downtime – or at least potential downtime, to cater for the test being unsuccessful – but if management have any sense, they will be persuadable that an approved outage during a predictable time window, with the technical team standing by and watching like hawks, is far better than an unexpected but entirely foreseeable outage when something breaks for real and the resilience turns out not to work.

Testing

Oh, and when you test failover and failback, run for several days in a failed-over state if you can: many problems don’t manifest instantly, and you will always learn more in a multi-day failover than in one that lasts only a couple of minutes. Bear in mind also the word “regularly” that I used alongside “thoroughly”. Even if we know there has been no change to a particular component, there may well be some knock-on effect from a change to something else. Something that used to be resilient may have become less resilient or even non-resilient because something else changed and we didn’t realise the implication – so regular resilience testing is absolutely key.

Because if something isn’t resilient, this will generally not be because of some esoteric potential failure mode that is next to impossible to anticipate and/or difficult or impossible to test. Most of the time it will be because something went wrong – or something was configured wrongly – in a way you could have emulated in a test. ®


Orion the humpback whale ‘a dream sighting’ for marine observers


A member of the Irish Whale and Dolphin Group spotted the humpback whale while out conducting a survey on marine life off the Donegal coast.

Marine mammal observer Dr Justin Judge described the moment he spotted a lone humpback whale off the coast of Donegal as “a dream sighting.”

Judge spotted the whale at 9.30 on the morning of 9 July while representing the Irish Whale and Dolphin Group (IWDG) on board the Marine Institute’s RV Celtic Explorer.

The group of researchers and observers was out on the waters around 60 kilometres north-northwest of Malin Head when they saw the whale. They were carrying out the annual Western European Shelf Pelagic Acoustic (WESPAS) survey.

“This is a dream sighting for a marine mammal observer,” Judge said. He explained that the creature would be nicknamed Orion – which had a personal meaning for Judge and his family.

“The individual humpback whale ‘Orion’ has been named after the Greek mythological hunter, since the whale was moving with the fish stocks for food. It is also my son’s middle name so fitting on both fronts,” Judge said.


He added that the team had also observed “a lot of feeding action from a multitude of cetacean species that day, including bottlenose, common, Risso’s and white-sided dolphins, grey seals and minke whales.”

The IWDG has documented 112 individual humpback whales in Irish waters since 1999, many of which are recorded year after year. Humpback whales are frequent visitors because Irish waters are an ideal feeding ground for whales stopping off on their migration across the Atlantic.

The beasts are identifiable thanks to the distinctive pattern on the underside of the tail fluke, which is unique to every individual whale.

“Observing any apex predator in its natural environment is exciting but a new humpback whale for Irish waters, this is special,” said WESPAS survey scientist Ciaran O’Donnell of the Marine Institute.

The Marine Institute’s WESPAS survey is carried out annually, covering the shelf seas from France northwards to Scotland and west of Ireland. It is the largest single-vessel survey of its kind in the Northeast Atlantic, taking in upwards of 60,000 nautical miles every summer. The survey is funded through the European Maritime Fisheries and Aquaculture Fund under the Data Collection Programme, which is run by the Marine Institute.


Tesla second-quarter profits top $1bn even as it struggles to handle demand


Tesla made a profit of more than $1bn in the last three months even as it struggled to keep up with demand for electric cars in the face of a global chip shortage.

The company announced on Monday that it made a profit of $1.14bn in its second quarter, 10 times what it made a year earlier and its eighth consecutive profitable quarter.

Tesla has already reported deliveries of 201,250 electric vehicles, and production of 206,421 total vehicles, during the quarter ending 30 June.

Car manufacturers across the world have struggled to keep up with demand amid a shortfall of semiconductors.

“Our biggest challenge is supply chain, especially microcontroller chips. Never seen anything like it,” said Elon Musk, Tesla’s chief executive, in June. “Fear of running out is causing every company to over-order – like the toilet-paper shortage, but at epic scale.”

The company has pivoted to using other suppliers and the shortage has not dampened enthusiasm for its vehicles, especially Tesla’s Model Y compact sport-utility vehicle, the most popular all-electric vehicle in the US.

“Public sentiment and support for electric vehicles seems to be at a never-before-seen inflection point,” the company said in a statement.

The company’s shares rose more than 2% in after-hours trading.


SSD belonging to Euro-cloud Scaleway was stolen from back of a truck, then turned up on YouTube


In brief DeepMind and the European Bioinformatics Institute have released a database of more than 350,000 3D protein structures predicted by the biz’s AI model AlphaFold.

That data covers the 20,000 or so proteins made in the human body, and is available for anyone to study. The proteomes of 20 other organisms, from zebrafish to E. coli bacteria, are in there too, and hundreds of millions more structures will be added over time, we’re told.

“In the hands of scientists around the world, this new protein almanac will enable and accelerate research that will advance our understanding of these building blocks of life,” said DeepMind’s CEO Demis Hassabis. He hopes that it will be a valuable resource that will be used in the discovery of new drugs and our understanding of diseases.

