Microsoft has explained how a cascading series of cockups left some of its Northern European Azure customers without access to services for nearly seven hours.
On September 29, the sounds of “Sacré bleu!” “Scheisse!” and “What are the bastards up to now?” were, we’re guessing, heard from Redmond’s Euro clients after key systems went down between 1327 and 2015 UTC. Virtual Machines, Cloud Services, Azure Backup, App Services and Web Apps, Azure Cache, Azure Monitor, Azure Functions, Time Series Insights, Stream Analytics, HDInsight, Data Factory and Azure Scheduler, and Azure Site Recovery were all titsup.
The problems started when one of Microsoft’s data centers was carrying out routine maintenance on fire extinguishing systems, and the workmen accidentally set them off. This released fire suppression gas, and triggered a shutdown of the air con to avoid feeding oxygen to any flames and cut the risk of an inferno spreading via conduits. This lack of cooling, though, knackered nearby powered-up machines, bringing down a “storage scale unit.”
“During a routine periodic fire suppression system maintenance, an unexpected release of inert fire suppression agent occurred. When suppression was triggered, it initiated the automatic shutdown of Air Handler Units (AHU) as designed for containment and safety,” the Azure team reported.
“While conditions in the data center were being reaffirmed and AHUs were being restarted, the ambient temperature in isolated areas of the impacted suppression zone rose above normal operational parameters. Some systems in the impacted zone performed auto shutdowns or reboots triggered by internal thermal health monitoring to prevent overheating of those systems.”
Microsoft said it got the air conditioning back up and running within 35 minutes and temperatures quickly returned to normal. However, some of the overheated servers and storage systems “did not shutdown in a controlled manner,” and it took a while to bring them back online.
As a result, virtual machines were axed to avoid any data corruption by keeping them alive. Azure Backup vaults were not available, and this caused backup and restore operation failures. Azure Site Recovery lost failover ability and HDInsight, Azure Scheduler and Functions dropped jobs as their storage systems went offline.
Azure Monitor and Data Factory showed serious latency and errors in pipelines, Azure Stream Analytics jobs stopped processing input and producing output, albeit only for a few minutes, and Azure Media Services saw failures and latency issues for streaming requests, uploads, and encoding.
All maintenance work on the fire suppression systems has been put on hold while the cockup is reviewed and a fuller report into what exactly happened is expected shortly.
“We sincerely apologize for the impact to affected customers,” the Azure team said in its refreshingly candid incident report.
“We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to): Suppression system maintenance analysis continues with facility engineers to identify the cause of the unexpected agent release, and to mitigate risk of recurrence.
“Engineering continues to investigate the failure conditions and recovery time improvements for storage resources in this scenario. As important investigation and analysis are ongoing, an additional update to this RCA will be provided before Friday [October 13].”
Source : theregister.co.uk