I felt compelled to take a quick break from my three-part series on moving apps to the cloud to focus on a major topic with cloud solutions. That topic is resiliency, and in light of the recent outage for Amazon Web Services (http://zd.net/MuXbjE) I wanted to get some of my thoughts written up. There are a number of articles that have been written recently about whose fault it is when a service like Netflix or Pinterest goes down. Is it fair to blame the cloud vendor? Is it fair to blame the solution architects? Is it fair to blame the folks making the tough cost decisions?
These are the conversations that occur in the weeks that follow any major platform/infrastructure failure from the cloud vendors. It highlights one very important thing in my mind: cloud solutions are still in their infancy. So much is said about disaster recovery scenarios and the ability to geo-distribute services but when it comes down to actually seeing that work when the chips are down … well … we see what happened just 2 weeks ago. This was not the first failure and it is most likely not the last for any of the cloud vendors. The question we all have to ask is how do we mitigate the risk in a cost-effective way? The rest of this post will take a look at resiliency from a variety of different scopes and how to best insulate solutions from failure.
The type of solution matters … a lot!
There are a ton of different varieties of cloud solutions and each of them has a totally different profile for fault tolerance. Let’s look at the most common cloud topologies.
1. Pure PaaS
This is not really the most common solution out there but when you do have a purely PaaS implementation you have a lot of options when it comes to resiliency. Just to be clear, when I say pure PaaS I mean you are using the data platform of that PaaS offering (like entities in GAE or Table Storage with Windows Azure or DynamoDB/SimpleDB on AWS) along with the traditional app container model.
There are still ways to do some bad things even in PaaS that will make your service far less resilient to fail over. For example, any service can choose to rely on a static IP in a DNS A record and the moment the host goes down it will take a serious amount of time to recover because of typical DNS change latency. Another common example is when developers choose to use a relational data store without any backup/restore capability. Yes, this still has to be done and you should not store your backups in the same datacenter where you have your primary service! Always think about worst case scenarios and recognize that even in PaaS the service availability is something the developer still has major influence over.
2. Pure IaaS
These are probably the most common solutions in the industry today as most folks look to the public cloud as a way to move their canned solutions “as is”. EC2 from Amazon is by far the most popular, and that recent outage put a real spotlight on the type of impact an IaaS outage can have. The big difference here is the state of the VMs and how you can replicate those for fault tolerance. You still have some of the same issues I referenced in PaaS (static IPs and no data backups) but with IaaS things can get even more daunting because the entire VM itself has state that you need to replicate to get back up and running.
That leads me to a major cross cutting concern for resiliency … data of course. How do we ensure that our data is always synchronized and ready for recovery? This is often amplified in some of these cloud solutions because a lot of services are being designed for massive internet scale. It is not often that developers are thinking about how to restore their 50 MB backup to another datacenter during a failure. It’s more common that we are talking about gigabytes or terabytes of data. That is a very challenging and sometimes costly problem to solve. This really is a cost/benefit decision for service owners. You can find dozens of ways to keep data (even large amounts of it) in sync but you are essentially always going to pay for it.
But wait, I thought the cloud vendors would geo-replicate for me? Well, that is true and everything I’ve read and researched would suggest that your data is almost never going to be lost. I say almost because you have to understand the risk factors with the data platform architecture you’re relying upon for your VMs. Strong consistency versus eventual consistency is a huge factor here. Platforms like Azure Storage use strong consistency within a DC but use an eventual commit across data centers. Amazon’s Elastic Block Store (EBS) is a bit complex: a volume is replicated within a single availability zone rather than across zones, and if you opt in to it you can have it pushed to S3 and then replicated across regions. Google doesn’t have a lot of information available yet on how their IaaS VM disk durability will work but I’m certain it’ll be something unique. The lesson here is that every single one of these is different and has risk windows. Losing data is simply something you have to be prepared for even when you’re looking at IaaS!
The entire concept of hybrid cloud solutions is confusing right from the start. It tends to invoke a bunch of different ideas in developers’ minds. To some folks hybrid means using PaaS roles connecting to IaaS VMs. To some folks it means public cloud components used in conjunction with private cloud instances. For some it means you have connectivity between on-premise assets and public cloud assets. Last but not least, to some it means you’re leveraging specific services in the cloud for increased reach or some niche need (ex. using a federation identity provider for security on top of mobile apps). This is the hard part about focusing on resiliency in a hybrid solution: you have to figure out what hybrid means to you first.
For the purpose of this topic I’m going to focus on the scenario where we are connecting back to on-premise assets. I feel like this is the only scenario that has a genuinely unique fault tolerance component to it. The other ones are mainly the same factors you saw with pure PaaS and pure IaaS. The difference with this connectivity back to on-premise resources is that you have to deal with a potentially disconnected recovery plan and in some cases you introduce single points of failure.
Let’s expand on that last point. Single points of failure are almost always the first thing to smoke out of any architecture that has a high availability requirement. When we design a solution that connects back to on-premise resources we’ve likely done that because of some friction point regarding that component. Perhaps it is highly sensitive data that can never leave your DC. Perhaps it is a very specialized hardware profile (like something with a hardware security module or very high tier 1 I/O requirements). In other words, something you simply can’t restore in any public cloud infrastructure and likely something that is very expensive. These things are very often not redundant in an on-premise infrastructure so a hybrid solution that depends on them needs to be prepared to survive when/if there is a problem.
There is nothing really cloud specific here; these are disaster recovery topics that engineers have been wrestling with for years. The key takeaway though is that your availability is only as good as your weakest link. If you have no failover capability or a single point of failure then do not publicize a 4 9’s SLA, unless of course you can insulate yourself from that component being down. For example, if it is a highly sensitive database that must live on premise then can your database operations be done in an async way? At this point let’s transition into talking about designing for fault tolerance and scope of failure.
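To put a rough number on the weakest link idea, here is a minimal sketch of the arithmetic. It assumes every component in the chain must be up for the service to work and that failures are independent, which is an idealization. The availability figures and the names `serial_availability` and `downtime_minutes_per_year` are hypothetical, chosen only for illustration.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60


def serial_availability(components):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(components)


def downtime_minutes_per_year(availability):
    """Expected minutes of downtime per year at a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


# Hypothetical numbers for illustration only.
cloud_front_end = 0.9999    # a "four nines" cloud tier
on_prem_database = 0.995    # a non-redundant on-premise dependency

composite = serial_availability([cloud_front_end, on_prem_database])
print(f"composite availability: {composite:.4%}")
print(f"expected downtime: {downtime_minutes_per_year(composite):.0f} minutes/year")
```

Under these assumptions the composite figure can never be better than the worst component in the chain, which is why a single non-redundant dependency rules out a four nines promise.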
Is Geo-replicating enough?
You would think that taking your service and running it in Virginia and Oregon on AWS would be enough to protect you from any outage … but … you would be wrong! The recent failure by AWS was certainly triggered by a weather event but it was exacerbated by a bug in their platform (http://zd.net/N5r1rT). Another example of this was the leap day bug in the Windows Azure platform (http://bit.ly/AfdqyL). If you think like I do, reading these two cases would immediately make you think, “gee … maybe I need to determine how to keep my stuff running regardless of the platform”.
That is certainly easier said than done. Again this becomes a matter of cost/benefit. It also may be the type of thing you can do strategically based on varying availability demands. This is important to keep in mind here: if you have critical windows when your availability needs to be high then you have to spend money to insulate yourself from failure. If you are a retailer and your service absolutely must be available during the holiday season then I would suggest you can’t rely on any one cloud vendor. You may very well want to run your service in Google’s DC, Amazon’s DC, and Microsoft’s DC. The likelihood of all three major vendors being down at the same time is extremely low. Once you get out of that window where you need the highest amount of availability you can turn the extra deployments off (some platforms will just turn off for you if there is no traffic … like GAE).
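A back-of-the-envelope sketch of why running in several vendors’ data centers helps. It assumes outages are independent across vendors and that traffic fails over instantly, neither of which holds perfectly in practice. The per-vendor availability figure and the function name `parallel_availability` are hypothetical.

```python
from math import prod


def parallel_availability(deployments):
    """Availability when any one deployment being up is enough.

    Assumes independent outages and instant failover, both idealizations.
    The service is down only when every deployment is down at once.
    """
    return 1 - prod(1 - a for a in deployments)


# Hypothetical per-vendor availability, for illustration only.
single_vendor = 0.999
three_vendors = parallel_availability([single_vendor] * 3)

print(f"one vendor:    {single_vendor:.7%}")
print(f"three vendors: {three_vendors:.7%}")
```

The real gain will be smaller than this math suggests, because shared dependencies (DNS, your own deployment pipeline, a common bug in your code) make failures correlated.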
Now the really tricky part: how do you design a service that can run in all three of these DCs? Do they all support the same infrastructure and application platforms? The answer is … kind of … you can write services in Python or Java and run on Linux across all three but you’ll find different aspects to each that still make it pretty tough to be totally cloud neutral. This is a much larger topic and one I intend to cover at a later time. I need to spend a bit more time learning how Python operates across all three before I can do an adequate job presenting a common cloud design framework. I think it is something the industry is trying to achieve and fortunately all the vendors seem very open to the idea (see the recent announcement of Python on Azure).
Designing for Resiliency
We covered the different topologies and addressed the idea of being cloud vendor neutral for the highest level of resiliency. Those are key topics but they are not necessarily the easiest way to impact your availability in a cost effective way. The best way to implement a service that can avoid outages is to … well … design a fault tolerant service! Even in the face of the AWS outage there were some services that did not go down at all. Twilio specifically wrote up the details on their engineering blog (http://bit.ly/gCRRaW) as to how they felt their service was insulated from the outage based on their design.
At this point let’s start to get deep with the things in a cloud solution that you have to design for. Some of these are obvious and honestly, they’ve been talked about a lot. The only different spin I’ll put on it here is that I want to help stack rank their importance based on the availability impact each one will have. To do that I have to start by grouping important design decisions into some context.
Services are often designed with multiple components. Imagine a web UI that has background worker processes or cron jobs. These are all part of the same service and without any one of them running the service is degraded or dead. Having a service degraded is far better than being down though. To achieve this you have to consider every piece of your service as a ticking time bomb. Queues and idempotency are basically your bomb defusing kit here. If you are able to queue up work between these components, and replay that work when a component fails or the work is lost, you are likely to insulate yourself from a ton of potential issues.
This was clearly covered in the Twilio write up but I think they did miss an important aspect of this design. It is critical that you have the components loosely coupled through queues, and replay capability (aka idempotency) is crucial … but … how do we know that it failed in the first place? This requires a developer to do a bit more work. For example, each and every operation that changes the state of your data must be implemented with a multi-phased commit. If the operation is critical to you then you should always be thinking about babysitting it. Start with an initial ACID (Atomic, Consistent, Isolated, and Durable) operation and build generic health monitors that look for failed or orphaned operations. Without this I would argue you’ll never know if something needs to be replayed. Simply saying we have a queue and idempotent operation is not enough!
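Here is a minimal sketch of that idea: record intent first, do the work, mark it complete, and let a monitor replay anything left pending. It is a simple two-step journal, not a full distributed commit protocol. The `Journal` class is an in-memory stand-in for what would need to be a durable store, and every name in it (`Operation`, `apply_credit`, `health_monitor`) is hypothetical.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Operation:
    op_id: str
    payload: dict
    state: str = "pending"   # pending -> done
    started_at: float = field(default_factory=time.monotonic)


class Journal:
    """In-memory stand-in for a durable store of operation records."""

    def __init__(self):
        self.ops = {}

    def begin(self, op_id, payload):
        # Phase 1: record intent before doing any work.
        # setdefault makes a duplicate begin for the same id a no-op.
        return self.ops.setdefault(op_id, Operation(op_id, payload))

    def complete(self, op_id):
        # Phase 2: mark the work as finished.
        self.ops[op_id].state = "done"

    def orphaned(self, older_than_seconds):
        now = time.monotonic()
        return [op for op in self.ops.values()
                if op.state == "pending"
                and now - op.started_at >= older_than_seconds]


balances = {}
applied = set()


def apply_credit(op):
    """Idempotent handler: replaying the same op_id has no extra effect."""
    if op.op_id in applied:
        return
    account = op.payload["account"]
    balances[account] = balances.get(account, 0) + op.payload["amount"]
    applied.add(op.op_id)


def health_monitor(journal, handler, older_than_seconds=0.0):
    """Replay every operation that was started but never completed."""
    replayed = []
    for op in journal.orphaned(older_than_seconds):
        handler(op)
        journal.complete(op.op_id)
        replayed.append(op.op_id)
    return replayed


journal = Journal()
# Simulate a worker that recorded intent, then crashed before doing the work.
journal.begin("op-1", {"account": "alice", "amount": 10})
# A second operation that ran to completion normally.
op2 = journal.begin("op-2", {"account": "alice", "amount": 5})
apply_credit(op2)
journal.complete("op-2")

print(health_monitor(journal, apply_credit))  # replays only the orphaned op
print(health_monitor(journal, apply_credit))  # nothing left to replay
print(balances)
```

The monitor is what answers the "how do we know it failed" question: without the pending record there would be nothing to find and therefore nothing to replay. In a real service `older_than_seconds` would be set well above the normal processing time so in-flight work is not replayed prematurely.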
2. External dependencies
This comes in a number of different forms and the common guidance I hear is retry everything and back up everything. There is some truth to that but I would argue that a proper design for resiliency here should be much smarter. Let’s first look at the idea of retrying everything. I don’t agree with this as a universal piece of guidance because it leads to code that can be wasteful and potentially even damaging. Take for example one of the common transient conditions developers see in the Azure platform, which is throttling. This can occur in any one of the multi-tenant features (ACS, SQL Database, ServiceBus, Storage, etc…) and I’d argue that this is a necessary evil with any shared service on any platform. Throttling is absolutely something we as developers need to build retry logic for because each service can take action to solve the problem on the fly. Your service should experience nothing but a hiccup in these situations.
That is a smart example of a retry. A poor example of a retry is when someone looks to design a framework for retrying every external call coming out of their service. Depending on the error or the state you are introducing bloat and inefficiencies. For example, don’t retry calls to an external service that returns a business fault. You know what they say about retrying the same thing with no changes and expecting different results … that’s right … it is insane!
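A minimal sketch of a selective retry: back off and retry on transient conditions such as throttling, but let a business fault surface immediately. The exception classes `TransientError` and `BusinessFault` are hypothetical markers; a real client library will have its own error types that you would need to map onto these two buckets.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for throttling or similar temporary conditions."""


class BusinessFault(Exception):
    """Hypothetical marker for a rejection that a retry cannot fix."""


def call_with_retry(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Retry only transient failures, with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            # Back off so a throttled service gets time to recover.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
        # BusinessFault and anything else propagates immediately.


attempts = {"throttled": 0, "rejected": 0}


def throttled_twice():
    attempts["throttled"] += 1
    if attempts["throttled"] < 3:
        raise TransientError("server busy")
    return "ok"


def always_rejected():
    attempts["rejected"] += 1
    raise BusinessFault("insufficient funds")


def no_sleep(seconds):
    """Skip real waiting in this demo."""


print(call_with_retry(throttled_twice, sleep=no_sleep))
try:
    call_with_retry(always_rejected, sleep=no_sleep)
except BusinessFault as fault:
    print("not retried:", fault)
print(attempts)
```

The throttled call succeeds on its third attempt while the rejected call is made exactly once, which is the distinction between a smart retry and a blanket one.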
Applications change … I know this is probably not a shocker to anyone reading this but when you’re trying to implement a solution with high availability you have to think about the strategies for versioning your application and even more challenging … your data. The application itself can be fairly simple depending on the platform. Just make sure you really understand the concepts and underlying platform support for rolling out changes. GAE has a nice way of tagging versions in their config and providing for immediate rollback as part of the platform. Azure has two key concepts. The first is rolling upgrades, which on their surface seem like a nice way to go about making app updates. These rolling upgrades can actually introduce major complexity if you use them for app upgrades because each instance is taken down and stood back up on its own, before all instances have been upgraded. This is great when the code isn’t changing because you can maintain higher capacity in your service but when you are upgrading code you risk having two different versions of your code running at the same time! The second, and the better solution, is VIP swaps, but those sometimes have limitations. (Amazon is something I still haven’t worked with enough to understand the internals but rest assured I will when I start to deploy and update production services there.)
The application isn’t usually the hardest component to version. Things like load balancers make it pretty easy. The harder part is your database and the schema for your database. This actually plagues developers with on-premise solutions as much as it does with cloud services. You have to plan for this and it can be even more challenging when you consider solutions that leverage hundreds or thousands of databases. I’ve seen really interesting custom designs in this space that roll out changes using operational queues. I’ve also seen solutions that leverage views, with different versions of the views maintained to insulate code from the underlying schema. Whichever way you decide to solve this problem just be prepared to take some time to think it through and have a way to evolve without downtime or maintenance windows. Users today just don’t understand web sites with construction JPEGs on them anymore 🙂
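To make the versioned views approach concrete, here is a small sketch using Python’s built-in `sqlite3` module with an in-memory database. The table, view, and function names are hypothetical, and a production database would need its own migration tooling. The point is only that old code reading `customers_v1` keeps working while the underlying table gains a column.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Version 1 of the schema, exposed to code through a versioned view.
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE VIEW customers_v1 AS SELECT id, name FROM customers")
db.execute("INSERT INTO customers (name) VALUES ('Ada Lovelace')")


def read_names_v1(conn):
    """Old application code: knows only the v1 shape."""
    rows = conn.execute("SELECT name FROM customers_v1 ORDER BY id")
    return [row[0] for row in rows]


before = read_names_v1(db)

# Schema change: add a column and publish a new view for new code.
# customers_v1 is left untouched, so old code keeps working.
db.execute("ALTER TABLE customers ADD COLUMN email TEXT")
db.execute("CREATE VIEW customers_v2 AS SELECT id, name, email FROM customers")
db.execute(
    "INSERT INTO customers (name, email) VALUES ('Grace Hopper', 'grace@example.com')"
)


def read_contacts_v2(conn):
    """New application code: reads the v2 shape."""
    return list(conn.execute("SELECT name, email FROM customers_v2 ORDER BY id"))


print(read_names_v1(db))
print(read_contacts_v2(db))
```

Both versions of the code can run side by side during a rollout, and the v1 view can be dropped once no deployed code depends on it.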
While application versioning may not seem like something that can impact your resiliency, it really can. The overarching requirement is availability and uptime, and if you aren’t planning for how to handle this you’ll find yourself abandoning the 4 9’s soon after you roll out your first or second hotfix.
Resiliency, fault tolerance, high availability, chasing the 9’s, whatever you want to call it. Cloud service developers have to understand a lot about the platform they run on and they need to consider a number of key design patterns in order to ensure they can keep their users happy. High availability is (or has become) a non-negotiable requirement to users. If you think your cloud vendor is going to just give you this because they have an uptime SLA you are simply missing the finer points of service resiliency. Hopefully this post has given you some thoughts on how to design, deploy, and maintain your service to decrease the amount of time you are down, and if you’re really good (and have deep pockets) you can get to the point where your SLA is as good as or better than that of the vendor that hosts your solution.