How SimpleGeo Stayed Up During the AWS Downtime
Last week we experienced some turbulence in the cloud. It started early Thursday morning when a “network event” at Amazon Web Services (AWS), the largest infrastructure services provider on the net, triggered repair mechanisms built into some of their core infrastructure. These compensating actions were meant to address what appeared, to the software system, to be a partial failure. Unfortunately, this process burdened the system so heavily that it led to cascading failures, over-provisioning of hardware resources, and correlated failures across a number of AWS products and properties.
As the morning wore on it became apparent that this outage was very serious. A large number of high-profile web applications were down, and people were noticing. Concerned, we began auditing our own monitoring and metrics to see how things had held up. We were very happy to find that they had held up well. Aside from a small, presumably related blip in availability measured by a third-party API monitoring service [http://api-status.com/6404/154812], we managed to stay up throughout the outage.
Since so many applications were affected, we’ve been fielding lots of questions about why ours wasn’t. There are a number of reasons, and they touch on business, technology, procedures, and even philosophy. Instead of burying the discussion in personal emails and short Twitter messages, we’re here to elaborate long-form.
First though, a short digression: I realize that this is a sore subject for many teams. Believe me, everyone at SimpleGeo can relate. I certainly don’t want to antagonize or alienate anyone. If anything I say here comes off as abrasive or condescending in any way, I personally accept full responsibility. But please know that’s not my intent. Honestly, we’re just into this sort of stuff, so we’re rather enthusiastic to participate in the conversation as things unfold.
Expect to fail
The first thing to observe about this failure was that it was highly unusual. AWS has been running for almost nine years and nothing quite like this has ever happened before. It’s the sort of thing that only happens when multiple, seemingly unrelated components go haywire at the same time. The perfect storm. Who could have predicted it?
While this is true, many unusual events are also normal, at least in the sense that they are inevitable. In his book “Normal Accidents,” Charles Perrow points out that it’s normal for us to die, but we only do it once. It’s an infrequent event, but it’s also inevitable and catastrophic. As morbid and unpleasant as it may sound, it’s also an event that software development teams often plan for [http://en.wikipedia.org/wiki/Bus_factor]. The point is that it’s important to separate the frequency with which something will happen from the likelihood that it eventually will.
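To put rough numbers on that distinction, here is a minimal sketch. It assumes each period (say, a year) carries the same small, independent chance of the event, which is a simplification and not something we measured. The function name is made up for illustration.

```python
def chance_of_at_least_once(per_period_probability, periods):
    """Chance that an event occurs at least once across `periods` periods.

    Assumes (for illustration only) that every period has the same,
    independent probability of the event.
    """
    if not 0.0 <= per_period_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if periods < 0:
        raise ValueError("periods must be non-negative")
    # The event never happening is (1 - p) multiplied by itself `periods` times.
    return 1.0 - (1.0 - per_period_probability) ** periods


if __name__ == "__main__":
    # A "once in a hundred years" event, viewed over different horizons.
    for years in (1, 10, 100):
        print(years, round(chance_of_at_least_once(0.01, years), 3))
```

Under these assumptions, an event with a 1% chance per year has roughly a 63% chance of showing up at least once in a century. Rare in any given year, but far from unlikely over the long run.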
More to the point, large scale regional infrastructure failures are inevitable, even for reliable basic infrastructure that we take for granted [http://en.wikipedia.org/wiki/List_of_power_outages]. While the actual mechanism of this AWS failure was unusual, it’s not hard to imagine other ways to replicate the result. A data center exists in one physical location. Any number of infrequent but inevitable events could trigger a similar data-center-wide outage.
There were some qualitative differences with this outage relative to those we’ve seen before. It was widespread, and it lasted a long time. These characteristics were perhaps unexpected. In our experience, though, they go hand-in-hand, particularly for large decentralized systems. Once things go badly sideways it can take some time for the system to return to steady state.
The applications that failed were tacitly accepting these risks. We were too. Understandably, our availability requirements are different than requirements for many of the applications that were affected. Consumer web sites rightly put more emphasis on feature development and less on perfect operational continuity. So it stands to reason that services like SimpleGeo and Twilio managed risk differently and therefore weathered the storm largely unscathed.
However, as we’ve adapted to the AWS environment, we’ve begun to see some patterns. Patterns that other folks have seen too. It starts with avoiding certain “high risk activities,” like running critical components in a single availability zone (AZ). Ultimately though, I think the way an application, or organization, gets through something like this has a lot to do with its philosophy on failure.
At SimpleGeo failure is a first class citizen. We talk a lot about it in design discussions, it influences our operational procedures, we think about it when we’re coding, and we joke about it at lunch. I believe that this emphasis on understanding system failure mechanisms and being open about them is the first step towards dealing with them. Before we introduce a new component into our infrastructure we plan how we’ll deal with it when it inevitably fails.
Understand interaction
Our interest in, and emphasis on, failure extends beyond individual components though. It’s equally valuable to think about how things fail together. In a distributed system, it’s critical.
Correlated failures are often environmental. AWS is a huge part of our environment, and one characteristic of AWS is that availability zones introduce correlated failure scenarios that are outside of our control. To address these risks we run infrastructure in multiple AZs. This may sound cost-prohibitive, but transparent replication techniques are quickly being commoditized, so costs are coming down fast.
Somewhat counter-intuitively, one strategy we use to address cascading, correlated failures is to simplify components and re-introduce complexity by layering systems and creating more sophisticated interactions. If we notice some unusual state we try to make small components fail fast and independently to avoid infecting their peers, which may then fail together. When failures do occur they are isolated to a small, tractable sub-component.
This is why we chose to layer our spatial indexing system loosely on top of a distributed hash table rather than introducing a more sophisticated partitioning strategy that would have been tightly coupled to the underlying system. The risk is that things will go rogue, which can be hard to manage in a decentralized environment. We do our best to code defensively for these scenarios by introducing dampening mechanisms and back-off whenever something suspicious occurs.
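For readers who want something concrete, here is a minimal sketch of the back-off idea. It is not our production code. The function names, default delays, and the choice of capped exponential back-off with random jitter are all assumptions picked for illustration.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=10.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based).

    Capped exponential back-off with jitter: the ceiling doubles on each
    attempt up to `cap`, and the actual delay is a random fraction of it so
    that many clients do not retry in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def call_with_backoff(operation, max_attempts=5, base=0.1, cap=10.0,
                      sleep=time.sleep, rng=random.random):
    """Call `operation`, backing off between failures.

    Gives up after `max_attempts` and re-raises the last error, so the caller
    fails fast instead of hammering a struggling dependency forever.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(backoff_delay(attempt, base, cap, rng))
```

The `sleep` and `rng` parameters exist only so the behavior can be exercised without real waiting or randomness. The important property is that a component under stress sees less traffic from its peers over time, not more.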
In reality it’s impossible to anticipate all of the possible interactions that could result in a failure. Realize that this reality applies to Amazon as much as it applies to SimpleGeo, or you. In the current case, as far as I can tell, it was a single availability zone that initially went rogue on AWS. We’ve been told by Amazon that availability zones are independent, isolated infrastructure. So why do official and unofficial reports indicate that multiple availability zones experienced issues?
First, reading between the lines on outage reports, it seems that there is some built-in cross-AZ failover mechanism in EBS. When the first AZ failed it triggered cross-AZ replication, network congestion, and eventually over-provisioning of limited hardware resources. It’s likely that Amazon is currently re-assessing how useful this strategy is. It’s probably better for them to cut their losses in a single AZ rather than risking a cascading failure across their entire infrastructure.
The second correlation mechanism is more subtle. Even if Amazon did completely isolate failures to a single availability zone internally, there’s not a lot they can do to isolate them at the level of customer interaction with the service. When EBS service degradation occurred in one AZ, I suspect that customers started provisioning replacement volumes elsewhere via the AWS API. This is largely unavoidable. The only way to address this issue is through careful capacity planning: over-provisioning isolated infrastructure so that it can absorb load from sub-components that might fail together. This is precisely what we do, and it’s one of the reasons we love Amazon. AWS has reduced the lead time for provisioning infrastructure from weeks to minutes while simultaneously reducing costs to the point where maintaining slack capacity is feasible.
In the wake of this event I suspect we’ll hear more about how “multi-cloud” strategies can make infrastructure more robust. These arguments may have some credence. But during this discussion be sure to consider what you’re protecting against, and how the human component may affect things. If Amazon were to experience a truly catastrophic failure across all availability zones, would Rackspace, Linode, or even Azure have enough capacity to quickly absorb the increased demand?
For our purposes, we’ve designed our system under the assumption that AZs will fail independently and are unlikely to fail together. We run parallel infrastructure in multiple AZs, each with operational replicas of all production data. We also maintain enough capacity to stay up if one zone goes down. To protect against correlated failures we code defensively, preferring to cut our losses and accept some small service degradations if a sub-component fails instead of risking a larger incident.
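The "enough capacity to stay up if one zone goes down" rule boils down to simple arithmetic. The sketch below is illustrative only: it assumes load is spread evenly across zones and that every instance handles the same amount of traffic, and the function name and numbers are hypothetical, not our actual figures.

```python
import math


def instances_per_zone(peak_load, per_instance_capacity, zones, zones_lost=1):
    """Instances each zone needs so the survivors can carry the full peak.

    Assumes (for illustration) that load is spread evenly across zones and
    that all instances have identical capacity.
    """
    surviving = zones - zones_lost
    if surviving < 1:
        raise ValueError("at least one zone must survive")
    if per_instance_capacity <= 0:
        raise ValueError("per-instance capacity must be positive")
    # After the failure, each surviving zone carries peak_load / surviving.
    return math.ceil(peak_load / (surviving * per_instance_capacity))


if __name__ == "__main__":
    # Hypothetical numbers: 9000 requests/s at peak, 500 requests/s per box.
    for zone_count in (2, 3, 4):
        per_zone = instances_per_zone(9000, 500, zone_count)
        print(zone_count, "zones:", per_zone, "per zone,",
              per_zone * zone_count, "total")
```

With these made-up numbers, two zones need 36 instances in total, three need 27, and four need 24, against a bare minimum of 18 with no slack at all. Spreading across more zones makes the spare capacity cheaper, which is part of why slack is affordable.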
It could have been worse… but we’ve planned for worse, too
If you’ve gotten this far, you’ve earned some honesty. While we have taken care to design around obvious AWS failure scenarios, it was sheer luck that our infrastructure was not affected at all by this outage. Had we been impacted, though, we were prepared.
We use RDS for our website, so that could have gone down. But our website is independent, non-critical infrastructure, separate from our web services. If it went down for a while we’d be sad, but it wouldn’t impact our core services. So we’ve intentionally risked some availability in the interest of architectural simplicity.
We also use EBS fairly extensively, even on some critical services. But where we have used EBS we’ve been careful to keep redundant systems online in separate availability zones. One thing we learned from last week’s outage is that this may not be enough. We considered EBS in separate AZs non-correlated subsystems. Clearly that’s not the case right now, and a more widespread failure might have introduced service degradations for some of our products. In that case we would have had to provision new boxes on AWS or elsewhere if possible. In any case, we might have taken an availability hit.
In order of escalating severity, the entire us-east region could have failed. This would be bad, to say the least, since all of our production infrastructure runs in east. It’s also far less likely to occur, so we’re willing to accept higher recovery costs. We’ve developed a fairly sophisticated automated deployment and configuration management system, and take periodic backups to cold storage. So we’re confident that re-provisioning in another region, or even on another provider, would be a chore that takes hours, not days. However, current events have inspired us to think more about these scenarios.
Of course, this is all easy to say in hindsight. But these are the realities of distributed systems that we think about daily at SimpleGeo. We’ve thought carefully about how our systems, and the systems that we depend on, might fail. We try to be honest and open about risks, and design around them where cost permits. Where it doesn’t, we’re open about the possibility of failure and try, at a minimum, to plan a way out. While we push for perfection, we also accept that some amount of failure is inevitable. So, in a sense, the reason SimpleGeo stayed up is that we expected this to happen.
Photo by Craig O’Neal
