Today’s computer chips are so advanced, they are more ‘mercurial’ than precise – and here’s the proof • The Register

Computer chips have advanced to the point that they’re no longer reliable: they’ve become “mercurial,” as Google puts it, and may not perform their calculations in a predictable manner.

Not that they were ever completely reliable. CPU errors have been around as long as CPUs themselves. They arise not only from design oversights but also from environmental conditions and from physical system failures that produce faults.

But these errors have tended to be rare enough that only the most sensitive calculations are subjected to extensive verification when systems appear to be operating as expected. Mostly, computer chips are treated as trustworthy.

Lately, however, two of the world’s larger CPU stressors, Google and Facebook, have been detecting CPU misbehavior more frequently, enough that they’re now urging technology companies to work together to better understand how to spot these errors and remediate them.

“Our adventure began as vigilant production teams increasingly complained of recidivist machines corrupting data,” said Peter Hochschild, a Google engineer, in a video presented as a part of the Hot Topics in Operating Systems (HotOS) 2021 conference this week.

“These machines were credibly accused of corrupting multiple different stable well-debugged large-scale applications. Each machine was accused repeatedly by independent teams but conventional diagnostics found nothing wrong with them.”

Looking more deeply at the code involved and operational telemetry from their machines, Google engineers began to suspect problems with their hardware. Their investigation found that the incidence of hardware errors was greater than expected and these issues showed themselves sporadically, long after installation, and on specific, individual CPU cores rather than entire chips or a family of parts.

The Google researchers examining these silent corrupt execution errors (CEEs) concluded “mercurial cores” were to blame – CPUs that miscalculated occasionally, under different circumstances, in a way that defied prediction. (That’s mercurial as in unpredictable, not Mercurial as in the version control system of the same name.)

The errors were not the result of chip architecture design missteps, and they’re not detected during manufacturing tests. Rather, Google engineers theorize, the errors have arisen because we’ve pushed semiconductor manufacturing to a point where failures have become more frequent and we lack the tools to identify them in advance.

In a paper titled “Cores that don’t count” [PDF], Hochschild and colleagues Paul Turner, Jeffrey Mogul, Rama Govindaraju, Parthasarathy Ranganathan, David Culler, and Amin Vahdat cite several plausible reasons why the unreliability of computer cores is only now receiving attention, including larger server fleets that make rare problems more visible, increased attention to overall reliability, and software development improvements that reduce the rate of software bugs.

“But we believe there is a more fundamental cause: ever-smaller feature sizes that push closer to the limits of CMOS scaling, coupled with ever-increasing complexity in architectural design,” the researchers state, noting that existing verification methods are ill-suited for spotting flaws that occur sporadically or as a result of physical deterioration after deployment.

Google’s not alone

Facebook has noticed the errors, too. In February, the social ad biz published a related paper, “Silent Data Corruption at Scale,” that states, “Silent data corruptions are becoming a more common phenomena in data centers than previously observed.” The paper proposes mitigation strategies, though it doesn’t address the root cause.

As Google’s researchers see it, Facebook spotted a symptom of unreliable cores – silent data corruption. But identifying the cause of the problem, and coming up with a fix, will require further work.

The risks posed by misbehaving cores include not only crashes, which the existing fail-stop model for error handling can accommodate, but also incorrect calculations and data loss, which may go unnoticed and pose a particular risk at scale.

Hochschild recounted an instance where Google’s errant hardware conducted what might be described as an auto-erratic ransomware attack.

“One of our mercurial cores corrupted encryption,” he explained. “It did it in such a way that only it could decrypt what it had wrongly encrypted.”

Google’s researchers declined to reveal the CEE rates detected in its data centers, citing “business reasons,” though they provided a ballpark figure “on the order of a few mercurial cores per several thousand machines – similar to the rate reported by Facebook.”

Ideally, Google would like to see automated methods to identify mercurial cores and has suggested strategies like CPU testing throughout the chip’s lifecycle rather than relying only on burn-in testing prior to deployment. The mega-corp is currently relying on human-driven core integrity interrogation, which is not particularly accurate, because tools and techniques for identifying dubious cores remain works in progress.

“In our recent experience, roughly half of these human identified suspects are actually proven, on deeper investigation, to be mercurial cores – we must extract ‘confessions’ via further testing (often after first developing a new automatable test),” Google’s researchers explain. “The other half is a mix of false accusations and limited reproducibility.”
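
To give a rough sense of what lifecycle testing might look like, here is a minimal, hypothetical sketch – not Google’s tooling, and far simpler than real CEE screening – that pins a known-answer computation to each CPU core in turn and flags any core whose output deviates from the expected result.

# Hypothetical per-core known-answer check (Linux-only), illustrating the idea
# of testing cores after deployment rather than only at burn-in.
# This is not Google's tool; it is an illustrative sketch only.
import hashlib
import os

PAYLOAD = b"known-answer input" * 4096
EXPECTED = hashlib.sha256(PAYLOAD).hexdigest()   # reference result, computed once

def check_core(core_id: int) -> bool:
    """Pin this process to one core, redo the computation, compare results."""
    os.sched_setaffinity(0, {core_id})           # restrict execution to core_id
    return hashlib.sha256(PAYLOAD).hexdigest() == EXPECTED

if __name__ == "__main__":
    suspects = [c for c in sorted(os.sched_getaffinity(0)) if not check_core(c)]
    print("suspect cores:", suspects or "none")

In practice a single pass like this would catch very little – mercurial cores misbehave sporadically and often only under particular data patterns, voltages or temperatures – which is why the paper argues for continuous, automated testing across a chip’s lifetime.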

Let the Core Inquisition begin. ®

Amazon Web Services outage hits sites and apps such as IMDb and Tinder | Amazon

Several Amazon services – including its website, Prime Video and applications that use Amazon Web Services (AWS) – went down for thousands of users on Tuesday.

Amazon said the outage was probably due to problems with its application programming interface (API), a set of protocols for building and integrating application software, Reuters reported.

“We are experiencing API and console issues in the US-East-1 Region,” Amazon said in a report on its service health dashboard, adding that it had identified the cause. By late afternoon the outage appeared to be partially resolved, with the company saying that it was “working towards full recovery”.

“With the network device issues resolved, we are now working towards recovery of any impaired services,” the company said on the dashboard.

Downdetector showed more than 24,000 incidents of people reporting problems with Amazon. It tracks outages by collating status reports from a number of sources, including user-submitted errors on its platform.

The outage was also affecting delivery operations. Amazon’s warehouse operations use AWS and experienced disruptions, spokesperson Richard Rocha told the Washington Post. A Washington state Amazon driver said his facility had been “at a standstill” since Tuesday morning, CNBC reported.

Other services, including Amazon’s Ring security cameras, the mobile banking app Chime and robot vacuum cleaner maker iRobot, were also facing difficulties, according to their social media pages.

Ring said it was aware of the issue and working to resolve it. “A major Amazon Web Services (AWS) outage is currently impacting our iRobot Home App,” iRobot said on its website.

Other websites and apps affected include the Internet Movie Database (IMDb), language learning provider Duolingo and dating site Tinder, according to Downdetector.

The outage also affected presale tickets for Adele’s upcoming performances in Las Vegas. “Due to an Amazon Web Services (AWS) outage impacting companies globally, all Adele Verified Fan Presales scheduled for today have been moved to tomorrow to ensure a better experience,” Ticketmaster said on Twitter.

In June, websites including the Guardian, Reddit, Amazon, CNN, PayPal, Spotify, Al Jazeera Media Network and the New York Times were hit by a widespread hour-long outage linked to US-based content delivery network provider Fastly Inc, a smaller rival of AWS.

In July, Amazon experienced a disruption in its online stores service, which lasted for nearly two hours and affected more than 38,000 users.

Users have experienced 27 outages over the past 12 months on Amazon, according to the web tool reviewing website ToolTester.



South Korea sets reliability standards for Big Tech • The Register

South Korea’s Ministry of Science and ICT has offered Big Tech some advice on how to make their services suitably resilient, and added an obligation to notify users – in Korean – when they fail.

The guidelines apply to Google, Meta (parent company of Facebook), Netflix, Naver, Kakao and Wavve. All have been told to improve their response to faults by beefing up preemptive error detection and verification systems, and by creating backup storage systems that enable quick content recovery.

The guidelines offer methods Big Tech can use to measure user loads, then plan accordingly to ensure their services remain available. Uptime requirements are not spelled out.

Big Tech is already rather good at resilience. Google literally wrote the book on site reliability engineering.

The guidelines refer to legislation colloquially known as the “Netflix law”, which requires major service outages to be reported to the Ministry.

That law builds on another enacted in 2020 that made online content service providers responsible for the quality of their streaming services. It was put in place after a number of outages, including one where notifications of the problem were made on the offending company’s social media site – but only in English.

The new regulations follow South Korean telcos’ recent attempts to have platforms that guzzle their bandwidth pay for the privilege. Mobile carrier SK Broadband took legal action in October of this year, demanding Netflix pitch in some cash for the amount of bandwidth that streaming shows – such as Squid Game – consume.

In response, Netflix pointed at its own free content delivery network, Open Connect, which helps carriers to reduce traffic. Netflix then accused SK Broadband of trying to double up on profits by collecting fees from consumers and content providers at the same time.

For the record, Naver and Kakao pay carriers, while Apple TV+ and Disney+ have at the very least given lip service to the idea.

Korea isn’t the only place where telcos have noticed Big Tech taking up more than its fair share of bandwidth. The European Telecommunications Network Operators’ Association (ETNO) published a letter from ten telco CEOs asking that larger platforms “contribute fairly to network costs”. ®

Twitter acquires Slack competitor Quill to improve its messaging services

As part of the acquisition, Quill will be shutting down at the end of the week as its team joins the social media company.

Twitter has acquired the messaging platform Quill, seen as a potential competitor to Slack, in order to improve its messaging tools and services.

Quill announced that it will be shutting down at the end of the week as its team joins the social media company to continue its original goal “to make online communication more thoughtful, and more effective, for everyone”.

The purchase of Quill could be linked to Twitter’s new strategy to reduce its reliance on ad revenue and attract paying subscribers.

Twitter’s general manager for core tech, Nick Caldwell, described Quill as a “fresher, more deliberate way to communicate. We’re bringing their experience and creativity to Twitter as we work to make messaging tools like DMs a more useful and expressive way people can have conversations on the service”.

Users of Quill have until 11 December to export their team message history before the servers are fully shut down at 1pm PST (9pm Irish time). The announcement has instructions for users who wish to import their chat history into Slack and states that all active teams will be issued full refunds.

The team thanked its users and said: “We can’t wait to show you what we’ll be working on next.”

Quill was launched in February with the goal to remove the overwhelming aspects of other messaging services and give users a more deliberate and focused form of online chat.

In an online post, Quill creator Ludwig Pettersson said: “We started Quill to increase the quality of human communication. Excited to keep doing just that, at Twitter.”

The company became a potential competitor for Slack, which was bought by Salesforce at the end of 2020 for $27.7bn. The goal of that acquisition was to combine Salesforce’s CRM platform with Slack’s communications tools to create a unified service tailored to digital-led teams around the world.

Last week, Salesforce announced the promotion of Bret Taylor to vice-chair and co-CEO, just days after he was appointed independent chair of Twitter following CEO Jack Dorsey’s resignation.

