A Verizon error resulted in a “cascading catastrophic failure.”
After last month’s ‘catastrophic failure’ on Google Cloud due to a bad configuration, Google Cloud Platform (GCP) has suffered another multi-hour problem. This time, Google doesn’t quite consider it an outage, even though it was causing a spike in latency for customers.
Google said Tuesday’s “disruptions” to Google Cloud Networking and Load Balancing were caused by physically damaged fiber bundles serving its us-east1 data center in South Carolina.
The issues were first posted at 10:25 Pacific Time on Google’s status page, which has several updates detailing its response and explanation for why the disruptions were happening.
Google has mitigated the damaged fiber by “electively rerouting some traffic to ensure that customers’ services will continue to operate reliably until the affected fiber paths are repaired”.
Despite these measures, the cloud provider warned that some customers will still see higher-than-usual latency until the damaged fiber is repaired, which it expects to happen within the next 24 hours. The repair should fully resolve the latency problems.
An individual who claimed to work for Google Cloud using the handle ‘boulos’ on Hacker News popped into a thread on the site to correct comments that the networking issue meant the region was “down” – although boulos admitted that “network latency spiking up for external connectivity is bad”.
Another Hacker News user, ‘mrweasel’, challenged the explanation that the region technically wasn’t down.
“As one of my old bosses said: I don’t care that the site/service is technically running, if the customers can’t reach it, then IT’S DOWN,” wrote mrweasel.
Another user said mrweasel’s boss was “nitpicking” over words during a crisis and sacrificing “accuracy and precise understanding”.
Mrweasel countered that it was accurate: “From a business perspective the site was down. Nitpicking is telling him: No it is in fact up, the customer just can’t use it.”
Boulos explained that he or she had intervened because of “confusion” among commenters who claimed the region was down.
“During an outage is a tricky time for comms, so short corrections are best until a full postmortem can be done,” wrote boulos.
Another user, ‘david-cako’, who claimed to work for AWS, chimed in: “I work for AWS. There is typically a balance that has to be struck when sharing information with customers. I would imagine this goes for most companies, which is why it isn’t until a post-mortem that the messaging is fully refined.”
Like Google, popular CDN provider Cloudflare has suffered two outages in the past week and has done plenty of explaining. The first was blamed on a Verizon internet-routing error that caused a “catastrophic cascading failure”.
The second, on Tuesday, was caused by an internal “bad software deploy” that triggered an unprecedented CPU spike on its equipment. The outage only lasted 30 minutes, but impacted every single data center Cloudflare operates across the globe.
Visitors to sites that depend on Cloudflare were met with 502 ‘bad gateway’ error messages.
Cloudflare CTO John Graham-Cumming explained that the deploy contained new rules for the company’s Web Application Firewall (WAF).
“Unfortunately, one of these rules contained a regular expression that caused CPU to spike to 100 percent on our machines worldwide. This 100 percent CPU spike caused the 502 errors that our customers saw. At its worst traffic dropped by 82 percent,” wrote Graham-Cumming.
He admitted the company’s testing procedures were “insufficient” and said they are currently under review. The impact was so widespread because the new WAF rules were “deployed globally in one go”.
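For readers wondering how a single regular expression can pin a CPU at 100 percent, the usual culprit is excessive backtracking in the regex engine. The sketch below is illustrative only: the pattern `(a+)+$` is a textbook example with nested quantifiers and is an assumption, not the rule Cloudflare deployed, and the function name `time_failed_match` is made up for this example. It times Python's built-in `re` engine as it fails to match progressively longer inputs.

```python
import re
import time

# Hypothetical pattern with nested quantifiers. This is a textbook
# example of excessive backtracking, NOT the rule Cloudflare deployed.
PATTERN = re.compile(r"(a+)+$")


def time_failed_match(n: int) -> float:
    """Return the seconds spent failing to match n 'a' characters plus 'b'."""
    text = "a" * n + "b"  # the trailing 'b' guarantees the match fails
    start = time.perf_counter()
    result = PATTERN.match(text)
    elapsed = time.perf_counter() - start
    assert result is None  # the engine tries many splits before giving up
    return elapsed


if __name__ == "__main__":
    # Each extra character roughly doubles the work in a backtracking engine.
    for n in (10, 14, 18, 22):
        print(f"n={n:2d}  {time_failed_match(n):.4f}s")
```

On a backtracking engine such as Python's, the time should grow steeply as the input gets longer, which is why one such rule applied to live traffic on every machine at once can exhaust CPU.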