Amazon Web Services has now recovered from its latest major cloud computing outage, which spanned Christmas Eve and affected large customers including Netflix and Heroku. We reported the news yesterday and provided regular updates as Netflix’s customer care team worked to resolve the issue.
This was the second AWS-related outage in six months for Netflix, one of Amazon’s most sophisticated customers, which noted on its Twitter feed that it was “terrible timing.” The streaming video service gradually restored service to different devices throughout the night, but it wasn’t until 9 a.m. Pacific on Christmas morning – more than 19 hours after the incident began – that Netflix reported full recovery.
The problem, it seems, was with Amazon’s Elastic Load Balancing (ELB) service. It began on Christmas Eve at 1:45 p.m. Pacific time and wasn’t fully resolved until 9:41 a.m. on Christmas Day, an outage of about 20 hours.
The ELB service is important because it is widely used to manage reliability: it lets customers shift capacity between availability zones, a key strategy for preserving uptime when a single data center experiences problems. Amazon had trouble with ELB a few months earlier as well. During a June 29 outage, Amazon said a bug in its Elastic Load Balancing system prevented customers from quickly shifting workloads to other availability zones. This magnified the impact of the outage, as customers that normally use more than one availability zone to improve their reliability (such as Netflix) were unable to shift capacity.
Asked what steps the company will take to prevent similar outages in the future, an Amazon representative said: “As a result of these impacts and our learning from them, we are breaking ELB processing into multiple queues to improve overall throughput and to allow more rapid processing of time-sensitive actions such as traffic shifts. We are also going to immediately develop a backup DNS re-weighting that can very quickly shift all ELB traffic away from an impacted Availability Zone without contacting the control plane.”
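To make the idea of DNS re-weighting concrete, here is a minimal Python sketch of how traffic weights might be shifted away from one impacted zone. It is not Amazon's implementation: the function name `shift_weights`, the zone names, and the weight values are all hypothetical, and the sketch assumes the impacted zone's share is simply spread across the remaining zones in proportion to their existing weights.

```python
def shift_weights(weights, impacted_zone):
    """Return new per-zone weights with the impacted zone set to zero.

    Hypothetical illustration only: the impacted zone's share of traffic is
    redistributed across the remaining zones in proportion to their
    current weights, so the total weight stays the same.
    """
    if impacted_zone not in weights:
        raise KeyError(f"unknown zone: {impacted_zone}")

    # Zones that can still receive traffic.
    healthy = {
        zone: weight
        for zone, weight in weights.items()
        if zone != impacted_zone and weight > 0
    }
    if not healthy:
        raise ValueError("no healthy zone left to receive traffic")

    total = sum(weights.values())
    healthy_total = sum(healthy.values())

    # Start every zone at zero, then scale up the healthy ones.
    new_weights = {zone: 0.0 for zone in weights}
    for zone, weight in healthy.items():
        new_weights[zone] = total * weight / healthy_total
    return new_weights


if __name__ == "__main__":
    # Hypothetical zone names and weights.
    current = {"zone-a": 50, "zone-b": 30, "zone-c": 20}
    print(shift_weights(current, "zone-a"))
```

In this sketch the calculation needs only the current weights, which loosely mirrors the stated goal of shifting traffic without contacting the control plane. How Amazon actually stores and publishes such weights is not described in its statement.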
It will be interesting to see whether Amazon’s load balancing problems were related to any of the issues identified in the June outage mentioned above, and how the company plans to address them. We’ll likely see information on that front soon, as the Amazon team has been scrupulous about publishing detailed incident reports and its customer care service has been fairly responsive.