For the third time this month, AWS today suffered an outage in one of its data centers. This morning, a power outage in its US-EAST-1 region affected services including Slack, Asana and Epic Games.
The problems started around 7:30am ET, and their knock-on effects were still plaguing the service as of 1pm ET. AWS continues to report issues with a number of services in this region, specifically its EC2 compute service and related networking functions. Most recently, the single sign-on service in this region also started seeing increased error rates.
“We can confirm a loss of power within a single data center within a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region,” the company explained in an update at 8am ET. “This is affecting availability and connectivity to EC2 instances that are part of the affected data center within the affected Availability Zone. We are also experiencing elevated RunInstance API error rates for launches within the affected Availability Zone. Connectivity and power to other data centers within the affected Availability Zone, or other Availability Zones within the US-EAST-1 Region are not affected by this issue, but we would recommend failing away from the affected Availability Zone (USE1-AZ4) if you are able to do so.”
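AWS identifies the zone by its ID (USE1-AZ4) rather than by a name like us-east-1a because zone names map to different physical zones from one account to the next. Acting on the advice to fail away therefore starts with working out which of your own zone names carries that ID. The Python sketch below shows one way that lookup could work. The `ZoneName` and `ZoneId` fields match what EC2's DescribeAvailabilityZones call returns, but the sample mapping and the `split_zones` helper are made up for illustration.

```python
# Zone ID named in AWS's status update (the API reports IDs in lowercase).
AFFECTED_ZONE_ID = "use1-az4"

# Hypothetical name-to-ID mapping for one account. Real values come from
# EC2's DescribeAvailabilityZones call and differ between accounts.
SAMPLE_ZONES = [
    {"ZoneName": "us-east-1a", "ZoneId": "use1-az6"},
    {"ZoneName": "us-east-1b", "ZoneId": "use1-az1"},
    {"ZoneName": "us-east-1c", "ZoneId": "use1-az4"},
    {"ZoneName": "us-east-1d", "ZoneId": "use1-az2"},
]


def split_zones(zones, affected_zone_id=AFFECTED_ZONE_ID):
    """Split zone names into (affected, unaffected) by comparing zone IDs."""
    target = affected_zone_id.lower()
    affected, unaffected = [], []
    for zone in zones:
        bucket = affected if zone["ZoneId"].lower() == target else unaffected
        bucket.append(zone["ZoneName"])
    return affected, unaffected


affected, unaffected = split_zones(SAMPLE_ZONES)
print("Fail away from:", affected)
print("Keep serving from:", unaffected)
```

In this invented account, the affected zone happens to be us-east-1c. Another account could see the same data center under a different name, which is why the ID is the safer thing to match on.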
If this had been the only AWS outage in recent weeks, it would have barely been noteworthy. Given the complexity of the modern hyperscale clouds, outages are bound to happen every now and then. But outages are currently a weekly occurrence for AWS. On December 7, the same US-EAST-1 region went down for hours due to a networking issue. Then, on December 15, an outage that affected connectivity between two of its West Coast regions took down services from the likes of Netflix, Slack and Amazon’s own Ring. To add insult to injury, all of these outages happened shortly after AWS touted the resilience of its cloud at its re:Invent conference earlier this month.
Ideally, of course, none of these outages would ever happen. AWS users can protect themselves to some degree by architecting their systems to fail over to a geographically separate region, but that can add significant cost, so some decide that the reduced downtime isn’t worth the expense. At the end of the day, it’s on AWS to provide a stable platform. And while it’s hard to say if the company is just having a string of bad luck or if there are any systemic issues that have led to these problems, if I were hosting a service in the US-EAST-1 region right now, I would probably at least consider moving it elsewhere.