‘Significant’ outage at Alaska Airlines not a security incident, but a hardware breakdown

An outage that grounded flights at Alaska Airlines for three hours on Sunday wasn’t the result of a cyberattack, but a hardware failure in one of the company’s data centers.

Alaska Airlines told Network World in an email that a piece of “multi-redundant hardware, manufactured by a third-party” experienced an “unexpected failure” that impacted several of its systems.

Initial speculation was that the outage was caused by a security incident, particularly as Hawaiian Airlines, which Alaska Airlines acquired last September, was hit with a cyberattack in June. The outage also occurred as Microsoft has been warning of “active attacks” targeting its on-premises SharePoint servers, and as cybercriminals have increasingly set their sights on the aviation industry.

The outage underscores the fact that even multi-redundant systems aren’t foolproof, and can go wrong all on their own.

“I think they emphasized redundancy to make it clear that they believe they invested appropriately in the solution: ‘We anticipated this with redundant systems and were still impacted!’” commented Jeremy Roberts, senior director of content and research at Info-Tech Research Group.

Hardware failure halts flights for three hours

At around 8 p.m. Pacific time on July 20 (Sunday), Alaska Airlines announced an IT outage was affecting its operations and had resulted in a “system-wide ground stop of flights.” This stop was lifted three hours later, at 11 p.m.

The airline told Network World that when the critical piece of what it described as “third-party multi-redundant hardware” failed unexpectedly, “it impacted several of our key systems that enable us to run various operations.” The company is currently working with its vendor to replace the faulty equipment at the data center.

The airline has cancelled more than 150 flights since Sunday evening, including 64 on Monday. The company said additional flight disruptions are likely as it repositions aircraft and crews throughout its network.

Alaska Airlines emphasized that the safety of its flights was never compromised, and that “the IT outage is not related to any other current events, and it’s not connected to the recent cybersecurity incident at Hawaiian Airlines.”

The airline did not provide additional information to Network World about the specifics of the outage.

“There are many redundant components that can fail,” said Roberts, noting that it could have been something as simple as a RAID array (which combines multiple physical data storage components into one or more logical units). Or, on the network side, it could have been the failure of a pair of load balancers.

“It’s interesting that redundancy didn’t save them,” said Roberts. “Perhaps multiple pieces of hardware were impacted by the same issue, like a firmware update. Or, maybe they’re just really unlucky.”
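Roberts’ point about a shared issue can be put in rough numbers. The sketch below is illustrative only: the probabilities are made-up assumptions, not figures from Alaska Airlines or its vendor, and the function names are hypothetical. It compares the chance that every redundant unit is down when failures are independent with the chance when a single common-mode event (such as a bad firmware update) can take out all units at once.

```python
def p_all_fail_independent(p_single: float, n: int) -> float:
    """Probability that all n redundant units fail, assuming independent failures."""
    return p_single ** n


def p_all_fail_common_mode(p_single: float, n: int, p_common: float) -> float:
    """Probability that all n units fail when a common-mode event can occur.

    Either the shared event happens (probability p_common) and takes out
    every unit, or it does not and the units must all fail independently.
    """
    return p_common + (1.0 - p_common) * p_single ** n


if __name__ == "__main__":
    # Assumed, illustrative numbers: each unit fails 1% of the time,
    # and a shared fault (e.g. firmware) occurs 0.1% of the time.
    independent = p_all_fail_independent(0.01, 2)
    with_common = p_all_fail_common_mode(0.01, 2, 0.001)
    print(f"independent pair:  {independent:.6f}")
    print(f"with common mode:  {with_common:.6f}")
```

Under these assumed numbers the shared fault, not the individual unit failures, dominates the risk, which is one way redundant hardware can still fail as a whole.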

While Alaska Airlines hasn’t released the total cost of this outage (and likely hasn’t even determined it yet), “it’s easily in the millions,” said Roberts.

Of course, he noted, there’s really no definitive answer until (or if) Alaska Airlines releases a post-mortem.

Redundancy, redundancy, redundancy

Investment in redundancy is a critical business practice, and the Alaska Airlines incident emphasizes the importance enterprises should place on specifying, managing, and maintaining such equipment.

“Redundancy is (or should be) ubiquitous,” said Roberts. “Basically, it should not be possible to run an enterprise service without some form of redundancy pretty much all the way down the stack.”

Multiple switches, load balancers, and internet connections are all standard when an enterprise has a service level to maintain, he said. “Basically, if the system is critical, you should have an active/active configuration so you can fail over in real time.”

He pointed out that redundancy is why, when Elon Musk got frustrated with the speed of a data center move and began unplugging servers at Twitter, the website didn’t go down. And some enterprises actively test their multi-redundant systems; Netflix, for one, uses and open-sources a tool called Chaos Monkey that randomly kills production instances to make sure systems are fault tolerant.
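The idea behind that kind of testing can be shown with a toy simulation. This is a minimal sketch, not Chaos Monkey itself or its API: `Instance`, `Pool`, and `kill_random_instance` are hypothetical names invented for illustration, and the “instances” are in-memory objects rather than real servers. A pool routes each request to the first healthy instance, and the test randomly kills instances to check that service continues as long as one remains.

```python
import random


class Instance:
    """A stand-in for one server; hypothetical, for illustration only."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True

    def handle(self, request: str) -> str:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class Pool:
    """Routes a request to the first healthy instance (simple failover)."""

    def __init__(self, instances: list[Instance]) -> None:
        self.instances = instances

    def handle(self, request: str) -> str:
        for instance in self.instances:
            try:
                return instance.handle(request)
            except ConnectionError:
                continue  # try the next redundant instance
        raise RuntimeError("no healthy instances left")


def kill_random_instance(pool: Pool, rng: random.Random) -> Instance:
    """Mark one randomly chosen live instance as dead and return it."""
    victim = rng.choice([i for i in pool.instances if i.alive])
    victim.alive = False
    return victim


if __name__ == "__main__":
    rng = random.Random()
    pool = Pool([Instance("a"), Instance("b"), Instance("c")])
    for _ in range(2):
        print("killed", kill_random_instance(pool, rng).name)
        print(pool.handle("GET /status"))  # still served by a survivor
```

Killing the last remaining instance makes `pool.handle` raise `RuntimeError`, which is the total outage that redundancy is meant to prevent.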

Ultimately, as evidenced by Alaska Airlines, “there’s a lot at stake,” said Roberts.

Source: Network World