Kudos to Microsoft for sharing the issue summary, customer impact, workaround, root cause and mitigation, and next steps on the Azure Status History (see here).
So let’s dissect the information available in our customary style:
RCA - Storage Availability in East US
Summary of impact: Beginning at 22:19 UTC Mar 15 2017, due to a power event, a subset of customers using Storage in the East US region may have experienced errors and timeouts while accessing storage accounts or resources dependent upon the impacted Storage scale unit. As a part of standard monitoring, Azure engineering received alerts for availability drops for a single East US Storage scale unit. Additionally, data center facility teams received power supply failure alerts which were impacting a limited portion of the East US region. Facility teams engaged electrical engineers who were able to isolate the area of the incident and restored power to critical infrastructure and systems. Power was restored using safe power recovery procedures, one rack at time, to maintain data integrity. Infrastructure services started recovery around 0:42 UTC Mar 16 2017. 25% of impacted racks had been recovered at 02:53 UTC Mar 16 2017. Software Load Balancing (SLB) services were able to establish a quorum at 05:03 UTC Mar 16 2017. At that moment, approximately 90% of impacted racks were powered on successfully and recovered. Storage and all storage dependent services recovered successfully by 08:32 UTC Mar 16 2017. Azure team notified customers who had experienced residual impacts with Virtual Machines after mitigation to assist with recovery.
MyPOV – Good summary of what happened: a power failure / power event. Good to see that customers were notified. Power events can always be tricky to recover from, and it looks like Azure management erred on the side of caution, bringing up services rack by rack and then adding services like SLB later. But the downtime for affected customers was long – best case, for customers on the first 25% of racks recovered, around four and a half hours, and worst case 10 hours+. Remarkably, it took Azure technicians roughly 2 hours and 20 minutes to get power back. Microsoft needs to (and says it will) review power restore capabilities and find ways to bring storage back quicker. Luckily for customers and Microsoft this happened overnight, with possibly less effect on customers… but that said, we don’t know what kind of load was running on the infrastructure.
Rating: 3 Cloud Load Toads
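As a quick sanity check on those durations, here is a small back-of-the-envelope calculation using the UTC timestamps Microsoft published above – a minimal sketch, nothing Azure-specific, just datetime arithmetic on the figures from the RCA:

```python
from datetime import datetime

fmt = "%Y-%m-%d %H:%M"
incident_start = datetime.strptime("2017-03-15 22:19", fmt)  # power event begins

# Milestones as reported in the RCA summary above
milestones = [
    ("infrastructure recovery started", "2017-03-16 00:42"),
    ("25% of impacted racks recovered", "2017-03-16 02:53"),
    ("SLB quorum / ~90% of racks up",   "2017-03-16 05:03"),
    ("storage fully recovered",         "2017-03-16 08:32"),
]

for label, timestamp in milestones:
    elapsed = datetime.strptime(timestamp, fmt) - incident_start
    hours, remainder = divmod(int(elapsed.total_seconds()), 3600)
    print(f"{label}: {hours}h {remainder // 60:02d}m after the incident began")
```

That works out to roughly 2h23m until recovery started, about 4h34m until the first quarter of racks was back, and 10h13m until storage was fully recovered – the basis for the best case / worst case figures above.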
Customer impact: A subset of customers using Storage in the East US region may have experienced errors and timeouts while accessing their storage account in a single Storage scale unit. Virtual Machines with VHDs hosted in this scale unit shutdown as expected during this incident and had to restart at recovery. Customers may have also experienced the following:
- Azure SQL Database: approx. 1.5% customers in East US region may have seen failures while accessing SQL Database.
- Azure Redis Cache: approx. 5% of the caches in this region experienced availability loss.
- Event Hub: approx. 1.1% of customers in East US region have experienced intermittent unavailability.
- Service Bus: this incident affected the Premium SKU of Service Bus messaging service. 0.8% of Service Bus premium messaging resources (queues, topics) in the East US region were intermittently unavailable.
- Azure Search: approx. 9 % of customers in East US region have experienced unavailability. We are working on making Azure Search services to be resilient to help continue serving without interruptions at this sort of incident in future.
- Azure Site Recovery: approx. 1% of customers in East US region have experienced that their Site Recovery jobs were stuck in restarting state and eventually failed. Azure Site Recovery engineering started these jobs manually after the incident mitigation.
- Azure Backup: Backup operation would have failed during the incident, after the mitigation the next cycle of backup for their Virtual Machine(s) will start automatically at the scheduled time.
MyPOV – Kudos to Microsoft for giving insight into the percentage of customers affected. It looks like Azure Storage scale units carry mixed load across Azure services. That has pros and cons – e.g. co-location of customer data and mixed, averaged load profiles – but it also means that a lot of services are affected when a storage unit goes down.
Rating: 2 Cloud Load Toads
Workaround: Virtual Machines using Managed Disks in an Availability Set would have maintained availability during this incident. For further information around Managed Disks, please visit the following sites. For Managed Disks Overview, please visit https://docs.microsoft.com/en-us/azure/storage/storage-managed-disks-overview. For information around how to migrate to Managed Disks, please visit: https://docs.microsoft.com/en-us/azure/virtual-machines/virtual-machines-windows-migrate-to-managed-disks.
- Azure Redis Cache: although caches are region sensitive for latency and throughput, pointing applications to Redis Cache in another region could have provided business continuity.
- Azure SQL database: customers who had SQL Database configured with active geo-replication could have reduced downtime by performing failover to geo-secondary. This would have caused a loss of less than 5 seconds of transactions. Another workaround is to perform a geo-restore, with loss of less than 5 minutes of transactions. Please visit https://azure.microsoft.com/en-us/documentation/articles/sql-database-business-continuity/ for more information on these capabilities.
MyPOV – Good to see Microsoft explaining how customers could have avoided the downtime. But the Managed Disks option only applies to VMs affected by the storage outage. Good to see the Redis Cache option – the question, though, is how efficient (and costly) that would have been: cache syncing is chatty and therefore expensive. More importantly, good to see the Azure SQL option, which is key for any transactional database system that needs higher availability. Again, enterprises will have to balance costs and benefits.
More concerning is that the other four services affected by the outage seem to have no Azure-provided workaround, had customers needed one and decided to implement (and pay for) it. No workaround for Event Hub and Service Bus is not a good situation, especially since event and bus infrastructures are used to make systems more resilient. Azure Search seems to lack a workaround, too, affecting customers using that service. It’s not clear what the statistic means, though: was Search itself unavailable, or could the information on the affected storage units not be searched? That is an important distinction. The Azure Site Recovery impact isn’t good either, but kudos to Microsoft for starting those jobs manually. Manual starts can only be a stopgap, though, as they don’t scale, e.g. in a larger outage. The failure of Azure Backup is probably the least severe, but in power failures that are not contained and cascade it becomes substantially more serious, as customers lose the backup capability that protects them from potential further outages.
Rating: 2 Cloud Load Toads (with a workaround it would be 1, with no workaround 3 – the maximum; as we don’t have full clarity here, we use 2 as the average).
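To make the Redis Cache workaround a bit more concrete, here is a minimal, hypothetical sketch of an application-level fallback to a cache in a second region, using the redis-py client. The cache names and access keys are placeholders, not anything from the incident, and a real deployment would still have to keep the two caches in sync – the ‘chatty and therefore expensive’ part noted above:

```python
import redis

# Hypothetical Azure Redis Cache endpoints in two regions.
# Azure Redis Cache accepts SSL connections on port 6380; names and keys are placeholders.
PRIMARY = dict(host="mycache-eastus.redis.cache.windows.net",
               port=6380, password="<primary-access-key>", ssl=True)
SECONDARY = dict(host="mycache-westus.redis.cache.windows.net",
                 port=6380, password="<secondary-access-key>", ssl=True)


def get_cache_client():
    """Return a client for the primary cache, falling back to the secondary region."""
    for config in (PRIMARY, SECONDARY):
        client = redis.StrictRedis(socket_connect_timeout=2, **config)
        try:
            client.ping()          # cheap health check
            return client
        except redis.RedisError:   # region unreachable - try the next one
            continue
    raise RuntimeError("No Redis cache reachable in either region")


cache = get_cache_client()
cache.set("session:42", "cached-value", ex=300)  # entry expires after 5 minutes
```

The Azure SQL equivalent is the geo-replication failover Microsoft references above; in both cases the resilience has to be designed – and paid for – before the outage, which is exactly the cost / benefit tradeoff enterprises have to make.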
Root cause and mitigation: Initial investigation revealed that one of the redundant upstream remote power panels for this storage scale unit experienced a main breaker trip. This was followed by a cascading power interruption as load transferred to remaining sources resulting in power loss to the scale unit including all server racks and the network rack. Data center electricians restored power to the affected infrastructure. A thorough health check was completed after the power was restored, and any suspect or failed components were replaced and isolated. Suspect and failed components are being sent for analysis.
MyPOV – It is always ironic how a cheap breaker can affect a lot of business. I am not a power specialist / electrician, but reading this: if one redundant power panel fails and its load has to be transferred, the system should still keep operating. Maybe the redundant design did not account for the remaining throughput capacity – not a good place to be.
Rating: 5 Cloud Load Toads
Next steps: We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future, and in this case it includes (but is not limited to):
- The failed rack power distribution units are being sent off for analysis. Root cause analysis continues with site operations, facility engineers, and equipment manufacturers.
- To further mitigate risk of reoccurrence, site operations teams are evacuating the servers to perform deep Root Cause Analysis to understand the issue
- Review Azure services that were impacted by this incident to help tolerate this sort of incidents to serve services with minimum disruptions by maintaining services resources across multiple scale units or implementing geo-strategy.
MyPOV – Kudos for the hands-on next steps. The key question (which I am sure Microsoft is asking) is, though: how many other storage scale unit power setups – or Azure power setups overall – may have the same issue, and when will they be fixed and given the right capacity / redundancy so this event cannot repeat? And then there is the question of standardization: is this a local, site-specific event? Are other data centers set up differently – or the same – and can a repeat of this incident be avoided with higher certainty?
Out of curiosity – there was another event in Storage provisioning, a software defect, only 37 minutes earlier (you can find it on the Azure status page, right below the above incident), and these two events may have been connected. A plausible connection: when a storage failure hits one location, customers (and IaaS technicians) may scramble to open storage accounts – at the same or other locations. If they cannot, the ad hoc remediation and workarounds they need cannot happen. There may be a connection, there may not be. But when the hardware goes down and so does the software to manage accounts for that hardware, that is an unfortunate – and hopefully highly unlikely – combination of events.
(Luckily) a mostly minor event
Unless you were an affected party, this was a minor cloud-down event. And it was lucky that it stayed minor, as power failures can quickly propagate and create cascading effects. Unfortunately, for some of the affected services there is no easy workaround – or none at all: when they go down, they are down. Apart from Microsoft's own lessons learned, this is the larger concern going forward. I count a total of 12 toads, averaging 3 Cloud Load Toads for this event.
Lessons for Cloud Customers
Here are the key aspects for customers to learn from the Azure outage:
Have you built for resilience? Sure, it costs, but all major IaaS providers offer strategies for avoiding single location / data center failures. Way too many prominent internet properties chose not to do so – and if ‘born on the web’ properties miss this, it’s key to check that regular enterprises do not miss it. Uptime has a price; make it a rational decision. Now is a good time to get budget / investment approved, where warranted and needed.
Ask your IaaS vendor a few questions: Enterprises should not be shy about asking IaaS providers the following:
- How do you test your power system equipment?
- How much redundancy is in the power system?
- What are the single points of failure in the data center being used?
- When did you last test / take offline components of the power system?
- How do you make sure your power infrastructure remains adequate as you put more load through it (assuming the data center gets more utilized)?
- What is the expected uptime in case of a power failure?
- How can we code for resilience – and what does it cost?
- What kind of remuneration / payment / cost relief can be expected for a downtime?
- What other single points of failure should we be aware of?
- How do you communicate in a downtime situation with customers?
- How often and when do you refresh your older data centers, power infrastructure / servers?
- How often have you reviewed and improved your operational procedures in the last 12 months? Give us a few examples of how you have increased resilience.
And some key internal questions that customers of IaaS vendors have to ask themselves:
- How, and how often, do you test your power infrastructure?
- How do you ensure your power infrastructure keeps up with demand / utilization?
- How do you communicate with customers in case of power failure?
- How do you determine which systems to bring up and when?
- How do you isolate power failures, and at what level, to minimize downtime?
- Make sure to learn from AWS’s (recent) and Microsoft’s mistakes – what is your exposure to the same kind of event?
Overall MyPOV
Power failures are always tricky. IT is full of anecdotes of independent power supplies not starting – even after formal tests. But IaaS vendors need to do better and learn from what went wrong with Azure. There may be a commonality with the recent AWS downtime, in that IaaS vendors can become the victims of their own success: AWS saw more usage of S3 systems, and Microsoft may have seen more utilization of the servers attached to the failing power setup. Meanwhile, CAPEX demands flow into opening new data centers rather than refreshing and upgrading older ones. There is learning all around for all participants – customers using IaaS services, and IaaS providers. Redundancy always comes at a cost, and the tradeoff in regard to how much redundancy an enterprise and an IaaS provider want and need will differ from use case to use case. The key point is that redundancy options exist, that tradeoffs are made ideally in full awareness of the repercussions, and that they get revisited on a regular basis.
Ironically, over the next few years more minor IaaS failures like this one may push cloud resiliency up to the level where it should be, for both IaaS vendors and IaaS-consuming enterprises – as long as everyone keeps learning and then acting appropriately.