In the past few weeks, Microsoft has experienced multiple outages related to Azure and Office 365. We want to share our perspective on the outages, including the root cause, and some thoughts on how our customers can protect themselves against future disruption. This insight is a mix of research, direct conversations with Microsoft, and our perspective.
While not an exhaustive list, here are some recent outages:
- 7 October 2020 – Azure outage lasting approximately 3.5 hours
- 6 October 2020 – Azure Front Door outage lasting approximately 4.5 hours
- 1 October 2020 – Office 365 outage lasting approximately four hours
- 28 September 2020 – Office 365 outage locking many customers out for as long as five hours
- 18 September 2020 – Azure Storage Premium File Share outage lasting approximately 8.5 hours
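To put these durations in context, the sketch below converts an outage length into the availability a single service would show over a month, assuming that outage was its only downtime. The 30-day window and the function name are illustrative assumptions, not Microsoft's SLA methodology.

```python
# Assumed measurement window: a 30-day month (720 hours).
HOURS_PER_30_DAY_MONTH = 30 * 24


def monthly_availability(outage_hours: float,
                         window_hours: float = HOURS_PER_30_DAY_MONTH) -> float:
    """Return availability (percent) for a single outage within the window."""
    if not 0 <= outage_hours <= window_hours:
        raise ValueError("outage_hours must be between 0 and window_hours")
    return 100.0 * (1 - outage_hours / window_hours)


# Approximate durations (hours) taken from the list above.
outages = {
    "7 October, Azure": 3.5,
    "6 October, Azure Front Door": 4.5,
    "1 October, Office 365": 4.0,
    "28 September, Office 365": 5.0,
    "18 September, Azure Storage Premium File Share": 8.5,
}

for name, hours in outages.items():
    print(f"{name}: {monthly_availability(hours):.2f}% for the month")
```

Each figure treats the outage in isolation, so it is a rough yardstick for comparison rather than a measured availability number.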
The root cause analysis for each issue is different, but many do share a common thread of software bugs being introduced to the environment via the DevOps process. Let us take a deeper look at the 7 October outage as an example.
According to Microsoft's Preliminary Root Cause Analysis (https://status.azure.com/en-us/status/history/, Tracking ID 8TY8-HT0), a software update introduced a bug into the Azure Wide Area Network service. The problem took one hour after release to manifest. For reasons not fully specified, the normal auto-mitigation process (which should take 2-3 minutes) did not function properly, and the issue self-mitigated only after 22 minutes. It also created a cascading effect that resulted in multiple hours of performance impact.
It is interesting to note that the outages on 7 October, 6 October, 28 September, and 18 September were all caused by the introduction of new code with defects.
The outages within Azure and O365 have not been attributed to a cyberattack, but it is worth noting that we are seeing an increased level of large-scale DDoS (Distributed Denial of Service) attacks launched against cloud providers. Here is one example that targeted AWS in February 2020: https://www.zdnet.com/article/aws-said-it-mitigated-a-2-3-tbps-ddos-attack-the-largest-ever/
The rapid pace of innovation and feature updates will unfortunately make brief outages such as these a reality for some time. While in most cases the outage duration is short and overall availability exceeds that of on-premises environments, there is still a valid question around how organizations can insulate themselves from these outages more effectively. Here are a few thoughts on that topic.
- Distribute workloads for services built on IaaS (not applicable to Teams and O365) across multiple availability zones to provide redundancy in the case of a single data center failure.
- Leverage a multi-cloud strategy for IaaS to allow for failover between cloud providers should one experience an outage.
- Evaluate the level of redundancy available within your critical applications and consider moving from a purely Infrastructure as a Service option to more sophisticated Platform as a Service solutions (which bring fault tolerance into the underlying supporting structure).
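As a rough illustration of the failover idea in the points above, the following minimal sketch walks an ordered list of deployments (for example, a primary region followed by a secondary region or a second provider) and returns the first one that passes a health check. The function names, the health-check convention (any HTTP 2xx response), and the example URLs are all assumptions made for illustration; production failover is normally handled by DNS or a global load-balancing service rather than client code.

```python
from typing import Callable, Optional, Sequence
import urllib.error
import urllib.request


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with HTTP 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection failures, timeouts, HTTP errors, and malformed URLs
        # are all treated as "unhealthy".
        return False


def pick_healthy_endpoint(
    endpoints: Sequence[str],
    probe: Callable[[str], bool] = http_probe,
) -> Optional[str]:
    """Return the first endpoint (in priority order) that passes the probe."""
    for url in endpoints:
        if probe(url):
            return url
    return None  # every deployment is unhealthy


# Hypothetical health endpoints, listed in priority order.
ENDPOINTS = [
    "https://app-primary.example.com/health",
    "https://app-secondary.example.com/health",
    "https://app-other-cloud.example.com/health",
]
```

Calling `pick_healthy_endpoint(ENDPOINTS)` would return the highest-priority deployment that is currently responding, or `None` if none are. The probe is passed in as a parameter so that the selection logic can be exercised without network access.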
Another aspect to keep in mind is the distribution of responsibility between the customer and the cloud provider. The image below comes from Microsoft, though the concept is quite similar for any public cloud environment. In general terms, the provider owns the availability and security of the cloud platform, but the customer owns availability and security of their data, applications, etc. In a Software as a Service model (O365 / Teams) the provider owns the overall application, but data is still the customer’s responsibility.
Lastly, while outages are always frustrating and frequently disruptive to business, in many cases the availability we see from most public cloud providers equals or exceeds the levels many of us experienced with our own private data centers and infrastructure.
Logicalis assists organizations facing these challenges with our Production Ready Cloud framework. It incorporates proven best practice design elements for availability, performance, and security to help mitigate these exposures. Through our prescriptive Azure Expert MSP design engagements, we help limit outages with multi-region deployments and disaster recovery foundations built in. Additionally, to help maintain compliance with these best practices and governance standards, our Cloud Team Managed Services provides the daily operational support needed to stay on top of relevant new updates and implement them as they are announced.
If you have concerns regarding your existing deployment, or have been affected by the recent outages, contact Logicalis to perform a Cloud Security Assessment on your Azure environment to quickly see what issues, if any, may need to be addressed.