A February AT&T network outage that lasted 12 hours and affected all of its mobile network users was due to a misconfiguration, made during “routine night maintenance,” that failed to meet required quality assurance (QA) policies before it was deployed, according to a US Federal Communications Commission (FCC) post-mortem report.
That root cause, which is similar to the one behind the crippling CrowdStrike outage Friday, prompted both FCC recommendations and AT&T mitigations. Both show that all infrastructure providers should be doing more to evaluate and test system updates before deployment to prevent similar incidents.
The nationwide wireless service outage across AT&T’s network occurred on Thursday, February 22, at 2:45 am, immediately after an employee “placed a new network element into its production network during a routine night maintenance window in order to expand network functionality and capacity,” according to the FCC’s report.
The configuration of the element, which “did not conform to AT&T’s established network element design and installment procedures, which require peer review,” triggered an automated response that shut down all network connections to prevent the traffic from propagating further into the network.
The outage affected AT&T Mobility’s network, including AT&T Mobility subscribers, Cricket subscribers, FirstNet customers, and customers of MVNOs with access to AT&T Mobility’s network. In total it affected 125 million registered devices and blocked more than 92 million voice calls during its duration.
“AT&T Mobility’s outage caused serious disruptions to members of the public,” said the report. “This outage illustrates the need for mobile wireless carriers to adhere to best practices, implement adequate controls in their networks to mitigate risks, and be capable of responding quickly to restore service when an outage occurs.”
Mitigation and recommendations
In light of the incident, AT&T has taken “numerous steps” to put better QA in place to avoid such slip-ups in the future, including additional steps to confirm that “required peer reviews have been completed” before any maintenance work is deployed.
The provider also implemented technical controls within 48 hours of the incident to scan the network “for any network elements lacking the controls that would have prevented the outage,” so those controls could be put in place. AT&T is continuing its forensic investigation of the incident and has also enhanced its network for “robustness and resilience,” according to the report.
The FCC also recommended that only previously approved network changes developed “pursuant to internal procedures and industry best practices” should be deployed on the AT&T production network in the future. “It should not be possible to load changes that fail to meet those criteria,” the FCC said in the report.
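As an illustration only, the kind of gate the FCC describes, one that makes it impossible to load a change lacking the required approvals, might look something like the Python sketch below. The report does not describe AT&T's actual tooling; the `ChangeRequest` fields, the `check_gate` and `deploy` functions, and the one-approver threshold are all hypothetical names and assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """Hypothetical record of a proposed network change (not AT&T's schema)."""
    change_id: str
    author: str
    approvers: set[str] = field(default_factory=set)
    passed_preflight: bool = False  # assumed flag set by an automated pre-deployment check


class GateError(Exception):
    """Raised when a change does not meet the deployment criteria."""


def check_gate(change: ChangeRequest, required_approvals: int = 1) -> None:
    # An author approving their own change does not count as peer review.
    peers = change.approvers - {change.author}
    if len(peers) < required_approvals:
        raise GateError(f"{change.change_id}: peer review incomplete")
    if not change.passed_preflight:
        raise GateError(f"{change.change_id}: preflight validation not passed")


def deploy(change: ChangeRequest) -> str:
    # The gate runs first, so a non-conforming change raises before
    # anything is pushed to production.
    check_gate(change)
    return f"{change.change_id} deployed"
```

In this sketch, a change approved only by its own author, or one that skipped the automated check, is rejected with a `GateError` rather than relying on the engineer to follow procedure.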
Indeed, proper peer review also could have helped avoid the scenario that befell CrowdStrike on Friday, when “a defect found in a Falcon content update for Windows hosts” delivered the infamous Blue Screen of Death across millions of Windows systems worldwide, resulting in missed flights, closed call centers, and cancelled surgeries.
However, these reviews “are not adequate for the implementation of code at this level of hardware/software risk,” noted Marcus Merrell, principal test strategist at Sauce Labs.
“‘Peer reviews’ imply that a peer is looking over code, to make sure it’s high quality,” he said. “It rarely, if ever, involves actually executing said code on the target hardware in the target environment.”
Same same, but different
There are parallels between what happened at AT&T and the CrowdStrike debacle, in that both companies pushed out bad updates. However, the CrowdStrike scenario was more severe, Merrell observed, in that “it killed the computers themselves,” whereas the AT&T misconfiguration “crashed the connective tissue between billions of devices on the network, but the computers at every end of the network still worked fine.”
Still, what both scenarios also have in common is that they were caused by human error, with the person in question “acting without regard to the most basic procedures,” he said.
They also both were a “direct result of the lack of functional software testing,” and all top-level infrastructure providers should improve their QA when deploying and installing updates that could potentially break the system, Merrell said.
“These two events could not have been detected through security or performance testing, and could have been easily mitigated by rigorous functional testing—in the case of CrowdStrike, simply installing the update and turning on a Windows computer,” he observed. “It’s shameful that the most basic tenets of quality assurance, in 2024, cause the world to essentially realize all the worst predictions of Y2K.”
Source: Network World