Accessing an EC2 instance with RDP when deployed to a VPC

My Amazon experiments are coming along but I’ve had to step back from some of my ambitious plans to cover key workloads like DR and HA to focus on the AWS foundations first. Things like virtual networks, or what Amazon calls a VPC. Up until now most of what I’ve done has been simple provisioning and accessing of an instance. Things stay pretty easy in that mode. For example, I had deployed at least a half dozen EC2 instances into what they call EC2-Classic, which just means you’re dropping it into a canned tenant-specific network and you don’t really need to configure the subnets or IP addresses or DHCP or wire up access to the internet. There are also all the wizards in the AWS management portal that just set up things like an RDP security group for the EC2 instance you want to access.

As a side note, I also discovered an important factor in my AWS research. This stuff costs money … shocking right 🙂 So I have decided that I’m willing to spend about $50 a month on this stuff, and I went and set up billing alerts on my AWS account to email me the minute I hit that $50 mark. Pretty cool right … I will say there is a lot of visibility into the usage in Azure for MSDN accounts, which is nice. You can clearly see how much of your overall allotment ($150 for Ultimate and $100 for Pro) remains and the days you have left to consume it. I look forward to the day that same thing translates to full Azure subscriptions that are somehow governed by budget settings configured by enterprise administrators. There are some 3rd party tools making that possible today for both AWS and Azure, but I want it native and the closest thing I have in either platform is an alerting system in AWS.
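In case it helps anyone, that alert can also be set up from code rather than the console. Here’s a rough boto3 sketch of a CloudWatch billing alarm; the SNS topic ARN is a made-up placeholder, and this assumes billing alerts are already enabled on the account (billing metrics only live in us-east-1):

```python
import boto3

# Billing metrics are only published to CloudWatch in us-east-1.
cw = boto3.client("cloudwatch", region_name="us-east-1")

cw.put_metric_alarm(
    AlarmName="monthly-spend-over-50",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=21600,              # check every 6 hours
    EvaluationPeriods=1,
    Threshold=50.0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111111111111:billing-alerts"],  # placeholder SNS topic
)
```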

That’s a lot of words to just say that I decided to stop being lame and doing the intro level setup stuff and configure a VPC the same way I would a Virtual Network in Azure. So off I went and I started with a simple VPC. First good news I got was that this doesn’t cost me anything, and I tracked down the section in the AWS environment easy peasy. You go to https://console.aws.amazon.com/console/home, click on “VPC” and then click “Create VPC” and here is what you see.

[Screenshot: Create VPC dialog]

Ok this is pretty simple, but the first thing that jumped out at me was the Tenancy drop down. If that were opened up you’d see “Default” or “Dedicated”. With dedicated you are informed that this will be on single tenant hardware but there are additional charges. You get a $2 per hour tax for this privilege and you pay more for each instance. Makes sense, nice option if you are not comfortable running next to anyone, but sort of a silly fear in a multi-tenant cloud where underlying resources are all shared (firewalls, load balancers, storage subsystem, etc.). The other thing that was different for me from the Azure world was this forced CIDR block system. I’m just a dumb developer; I have never had to worry about this magic system of slash-something to define the subnet mask that gives me an idea of what IP addresses are available.


Well guess what … I had a cool “a ha” moment with this one and I can’t believe it took me this long to figure this out. There are really only three key CIDR ranges you need to know: /24, /16, and /8. Why is that? Well, it’s because they line up with the octet boundaries of the subnet mask. A /24 is one full range of 256 addresses, a /16 is 256 * 256 for a total of 65,536 addresses, and a /8 is 256 * 256 * 256. Every number you subtract from the prefix doubles the count, so if I drop a /16 to a /15 I get two full ranges of 65,536. I am pretty sure some professor showed me this in college since it’s all numbers and I literally took a math programming track, but I’m sure I was either dozing off or hung over.
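If you want to sanity check that math yourself, Python’s ipaddress module will do it for you; a quick sketch:

```python
import ipaddress

# How many addresses does each prefix length give you?
for prefix in (24, 23, 16, 15, 8):
    net = ipaddress.ip_network(f"10.0.0.0/{prefix}")
    print(f"/{prefix}: {net.num_addresses} addresses")
# /24: 256, /23: 512, /16: 65536, /15: 131072, /8: 16777216
```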

Ok, now that I’m a CIDR expert I should have no problem creating subnets. I will say I am kind of surprised you can’t label these VPCs to make it easier to select them later on when deploying an instance, but it’s probably not a big deal since the IP address scheme will be something I am using to drive the deployments. I also had to decide if I wanted to let Amazon handle DNS resolution and DNS host name creation. Apparently my only option here is to use Amazon or create what is called a DHCP option set, which specifies how I want to manage DNS myself. I’ll stick with the Amazon stuff for now since I’m not looking to run up my bill just to do DNS. And now we’re onto adding subnets. This wasn’t part of the VPC creation; you have to jump to a new “Subnet” section in the console. No biggie, “Create Subnet”, go.

[Screenshot: Create Subnet dialog]

So here you can see that I can choose from a list of previously built VPCs and then I get to start defining a subnet that is within that range. Good thing I mastered that CIDR notation, because I am sure I would screw this up if I didn’t understand how to set the starting address after a subnet range is added here. Also notice the availability zone option. This is unique to AWS and takes the region (East US/West US/etc.) and subdivides it into multiple sub-region datacenters (three that I could pick from). Very nice way to provide resiliency beyond the internal server clusters. Now I was only able to pick between the three east availability zones, and it turns out that was because I provisioned the VPC in the east region by selecting that region before I started creating it. This is a little less than obvious actually, because I never told the VPC to be in the east region; I was defaulted to that in the console because that was what I had chosen in the top right corner (or at least that is what was set originally, because I had no idea).
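For what it’s worth, the same VPC and subnet setup can be scripted instead of clicked through. A rough boto3 sketch (the region, CIDR blocks, and name tag are just example values I picked):

```python
import boto3

# The region you pick determines which availability zones you can use.
ec2 = boto3.client("ec2", region_name="us-east-1")

vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")
vpc_id = vpc["Vpc"]["VpcId"]

# Carve two /24 subnets out of the VPC range, each pinned to a different AZ.
subnet_a = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24", AvailabilityZone="us-east-1a")
subnet_b = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.2.0/24", AvailabilityZone="us-east-1b")

# A Name tag makes the VPC easier to spot later than the cryptic ID alone.
ec2.create_tags(Resources=[vpc_id], Tags=[{"Key": "Name", "Value": "experiment-vpc"}])
print(vpc_id, subnet_a["Subnet"]["SubnetId"], subnet_b["Subnet"]["SubnetId"])
```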

Another important factor here is the ability to create subnets that are in different availability zones to allow for snapshotting and HA configurations beyond one AZ. See, this is where I really needed to focus before I could start looking at HA and DR scenarios. Without these foundations I am basically clueless as to how it all works and is set up. So I went ahead and created a few subnets, and as I did that I noticed I had the option to specify Network ACLs for each individual subnet. The default basically allows everything straight through, but I could do some nice lock down work here if I wanted a private subnet. That’s only the start though; I also have route tables and security groups to jump into. This is some crazy tight control you can have, but any network admin I’ve met is a borderline control freak so this probably works nicely. With Azure today there are inbound ACLs but only on the edge, so this is clearly an area where our platform has some catching up to do.

By the way, I’ll mention this again, when you deploy into EC2-Classic you don’t encounter any of this stuff. I thought I was a bad ass when I could deploy and RDP into an instance in no time at all, but this stuff all started to become real complex real fast. Lots of layering, and I haven’t even gotten to the part where I was completely stuck. I basically knew I couldn’t RDP into my instance yet, so I went and did a little Google searching and found the AWS instructions for setting up a security group. That was actually straightforward, and FWIW I realized something important here too. You can only control inbound and outbound flow if you are on a VPC. I had always wondered why the security groups on my EC2 instance only had inbound, because I knew AWS had outbound control. That’s the key: you need to use a VPC to get access to more advanced controls like that. So I went to security groups and added one, and you can see I have to pick which VPC I am adding this to. Remember what I mentioned before, it sure would be nice if these labels weren’t so cryptic.

Once created, the security group can be updated with additional rules for how traffic can flow into or out of the VPC. So of course I added 3389 to flow in (AWS has a nice feature to plug in your current IP address in /32 CIDR notation if you don’t want it open to the world). And I honestly thought I was pretty much done. My experiences in Windows Azure gave me a little false confidence that I had set up everything and exposed access in a logical way similar to an endpoint on our Azure firewall. Oh how wrong I was. The next thing I did was jump over to EC2, deploy an instance into one of the subnets on that VPC, wait the 5 minutes or so for it to launch, and try to connect. FAIL #1
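For reference, that security group and RDP rule can also be created from code; a hedged boto3 sketch where the VPC ID and home IP address are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create a security group scoped to the VPC (placeholder ID).
sg = ec2.create_security_group(
    GroupName="rdp-from-home",
    Description="Allow RDP from my current public IP only",
    VpcId="vpc-0123456789abcdef0",  # placeholder VPC ID
)

# Allow inbound RDP (3389) from a single /32 address instead of the whole internet.
ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 3389,
        "ToPort": 3389,
        "IpRanges": [{"CidrIp": "203.0.113.25/32"}],  # placeholder home IP
    }],
)
```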

Back to the internet for some searching. I do have to say there is an abundance of information out there and the documentation AWS produces is very good. I realized that I absolutely screwed up by not creating an Elastic IP and associating that with my EC2 instance. Without a public address there was no way I would be able to access the instance from the internet. Then I also realized that I hadn’t deployed this thing called an internet gateway. What the hell is that? So I read about it a bit, and it’s a key component to wire up the VPC to the outside world. You have to make an intentional deployment choice to start making anything accessible over the internet. Confusing … nice … but confusing for this dumb developer. So I must be ready now … I must have everything set up now, right? Nope, still no connection over RDP. FAIL #2

Now this was a good hour or so of head scratching before I finally found some good information. I actually went back and deployed an EC2 instance into classic to prove that I could even connect from my home network. The key missing element was a wire-up step: create a custom route table that routes to the internet gateway and associate it with my subnet. That did the trick! I was able to access my instance via RDP, and holy crap, I could control and block it in like 20 different ways. This is complex, but after spending the last year and a half talking to network specialists I understand why these things exist.
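Pulled together, the pieces I was missing look roughly like this in boto3. This is just a sketch of the wiring, not a full script, and the resource IDs are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
vpc_id, subnet_id, instance_id = "vpc-111", "subnet-222", "i-333"  # placeholder IDs

# 1. Internet gateway: create it and attach it to the VPC.
igw = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
ec2.attach_internet_gateway(InternetGatewayId=igw, VpcId=vpc_id)

# 2. Custom route table: send 0.0.0.0/0 to the gateway, then associate
#    the table with the subnet that holds the instance.
rt = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
ec2.create_route(RouteTableId=rt, DestinationCidrBlock="0.0.0.0/0", GatewayId=igw)
ec2.associate_route_table(RouteTableId=rt, SubnetId=subnet_id)

# 3. Elastic IP: allocate a public address and bind it to the instance.
eip = ec2.allocate_address(Domain="vpc")
ec2.associate_address(InstanceId=instance_id, AllocationId=eip["AllocationId"])
```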

So just to recap, in order to set up an EC2 Windows instance (Linux as well, just use SSH instead of RDP) that you can access remotely on a custom VPC you have to do the following:

  1. Create a VPC with a CIDR range that has private IPs you want to use. Use AWS for DNS and host name resolution or bring your own.
  2. Create Subnets and associate those with your VPC
  3. Create an EC2 instance and deploy that to your VPC
  4. Create an ElasticIP and associate that with your instance. (FYI, you can create multiple vNICs and multiple Elastic IPs and host multiple paths into your node … very nice)
  5. Create a Security Group and associate that with your VPC
  6. Create an internet gateway
  7. Create a custom route table and add an association to your subnet and the internet gateway.
  8. Pound your chest if you’re a developer and you just figured out how to configure an advanced software defined network 🙂


So what’s next for me? I think I’ll be doing either some CloudFormation experiments or Identity and Access Management (IAM). Both are very interesting and I’ve been watching a ton of YouTube videos on them. Role based access for all AWS resources is very cool, but CloudFormation and JSON templates to drive multi node deployments is kind of hitting me in the developer nerve. That said, IAM is free and I may not have a ton of dollars left in February if I’m going to stay under $50.

Posted in AWS, Cloud Computing, IaaS, Virtual Networking | 1 Comment

Examining How AWS and Azure Handle Key Workloads

I’ve spent over three years now working 40 hours(ish) a week on a cloud platform. My jobs have included helping debug issues on that platform, helping design, code, and deliver a production SaaS solution on that platform, teaching developers how to build cloud oriented solutions on that platform and most recently helping customers in the earliest phases of thinking about using that cloud platform for their business. Now in my case 99% of my time has been spent making that a reality on Windows Azure.

Recently though the wind has started to swirl a bit for me in my design and architecture discussions. It’s always started with the typical why-cloud rhetoric, but I do feel like much of that is now in the past for customers. Most folks really are starting to get it and they want to figure out great ways to use this elastic, commodity, often throw-away infrastructure. Obviously I’m speaking from the point of view of someone looking to pay $0.05 an hour to use a machine in some other data center here. No one would call their own hardware disposable … right 🙂

My Awakening

The last few months have been very interesting for me, and it’s mainly because of the fact that I’m starting to get past those introductory discussions and into making things real. That’s put me neck deep in comparing how to do key workloads in Azure versus AWS. I’ve always had deep respect for what Amazon has been able to do with their cloud platform and I confidently can claim that I know about 40% of their services at maybe a 200 level of depth. I know … how impressive. So the big dilemma now is how to get deeper understanding of those services so I can be better equipped to discuss strategies for moving to the cloud. The broader world of cloud is starting to open up for me now and I want to drink it all in.

That brings me to the point of this blog post: I am officially going to commit time and energy as I go into 2014 to do real implementations of key workloads that I keep getting asked about in both AWS and Azure, and document my experience. I will be extremely candid about my experience and I will explain where things are complex, challenging, expensive, or just simply a pain in the ass. This is the only way I know how to get that deep understanding … it’s time to stop reading about it and actually DO IT!

My First Experiment

This is probably a good time to mention that this blog is my own personal blog with my own personal opinions that have no reflection on my employer … there, I’m covered right 🙂 Seriously though, I will begin next week with my first “experiment”. That experiment is setting up a disaster recovery scenario for a public facing web site. The whole disaster recovery scenario for a cloud provider is freaking huge because of the potential cost savings and geographic distribution. So how then can we do this in both environments? I’ll be exploring global DNS failover, cold versus warm standbys, how to synchronize data, and of course features of the networking stacks on both platforms.

Should be fun. I sure do hope I get a bunch of Amazon gift cards for Christmas so I can help subsidize that AWS bill that’s about to start showing up on my Amex 🙂

Future Experiments

If the disaster recovery experiment doesn’t get you excited then here are the first five experiments I plan to run. I will likely do about one a month, so this should keep me super busy.

  1. Implementing HA database servers (likely SQL Server since I’m a bit of an Oracle newb)
  2. Doing real elasticity on a multi-tiered web app based on spiky traffic
  3. Automating a multi-tiered deployment
  4. Setting up a Site 2 Site VPN tunnel to my home office
  5. Implementing a snapshot strategy for backup/restore
Posted in AWS, Cloud Computing, IaaS, Windows Azure | Leave a comment

Slides from .NET User Group Meeting in Atlanta

Thanks to everyone that attended the Azure Quick Hits session at the .NET User Group meeting this month. I’ve posted the slides we used here if you’d like to take a look at them.

PPTX – Azure Quick Hits – Building Apps for the Cloud

Posted in Uncategorized | Leave a comment

Moving Apps to the Cloud (Good, Better, Best) – Part 3

I have really been struggling with finishing off this 3 part series because the last part should be the “big finish”, it will drive huge amounts of traffic to this blog and folks will be raising their lighters in the air asking for a Part 4! Had I sat down and written all of this a week or two after we presented it back in June I think that would have been easier. The reality however is that I’m finding myself struggling with the idea of moving to PaaS as the “best of breed” solution these days. Through this blog I’ll explain what our original thoughts were about PaaS and why it is represented as the “best” model, and throughout I am going to sprinkle in my more cynical view points on it as a migration path.

Why did we think PaaS was the “best” way to host a cloud solution?

If we take a relatively shallow view of this I think the PaaS model is actually very easy to sell as a migration target for your application. Let’s look at 5 extremely obvious benefits and how they are true on the surface but are way more complex when you peek under the hood.

1. Automation is good, manual intervention destroys elasticity

There is no way to deny that automation is a great thing. In fact if you look at any of the leading PaaS models today (Amazon Web Services w/ Elastic Beanstalk, Windows Azure with Web/Worker Roles, and Google App Engine with customized runtimes) they all cover this feature by providing a way to upload a “packaged” application and all of the environment setup (OS, App Server, etc.). Other setup aspects can be easily scripted out or in the case of Amazon simply baked into a template and reused. Being able to automate the installation of an application is paramount to being able to survive things like hardware outages or being able to handle spikes in demand that require quick responses by adding N instances of your application. Without automation there is no way cloud computing can deliver on half of the promises of elasticity and resiliency.

That’s the surface argument for moving to PaaS but in reality it’s not anything new. We have all sorts of demand for automation in setting up and rolling out applications in on premise web farms today and I can remember being involved in projects to handle this well over 10 years ago (the deployment processes at the time were typically 10+ page manual setup scripts that the ops team had to try and execute without error … repeatedly … sometimes after hours … FAIL). Private cloud options from all the key vendors are also taking this problem head on to simplify that task and when you start to move into the public cloud you have all sorts of automation APIs from the IaaS vendors as well. I guess what I’m saying is that there is nothing wrong with this as a necessary element for cloud computing but in reality we’re not talking about a PaaS specific aspect here.

The reason this often gets broken out as a PaaS specific thing is because it does in theory make your life a bit easier by providing you with a super simplified model for doing this. If you look at things like the PowerShell cmdlets for IaaS or the CloudFormation templates to tailor an Amazon Machine Image (AMI) you’ll find some relatively labor intensive things you have to do (albeit reusable if you have some canned architecture templates). With PaaS, you do in fact have all of this stuff super simplified. Role templates (Web/Worker/Cache) and startup tasks for Windows Azure provide a simplified way to lay down dependencies, and Elastic Beanstalk is effectively dedicated to providing all sorts of different pre-baked configurations for your web apps written in .NET, Java, PHP, or any other customized setup you believe you can reuse.

So what then do you have to give up? That depends on which platform you’re looking to push your code to. For the case of Azure I can tell you there is a lot of control that you will relinquish to plug into this model. The startup and underlying fabric that lives between your application and the host VM you’re running on can be a blessing … and it can be a curse. I have spent a solid 2 years working with folks that wanted to seamlessly migrate applications into those containers and it simply is not friction free.

For Amazon, you end up seeing a model that looks a whole lot more like “deploy it and then get out of the way,” reusing the IaaS model under the hood. You won’t find yourself trying to understand which parts of the underlying cloud specific app container provide obstacles … but you will not get the benefits of higher security and greater isolation that you get with a platform like Azure. This is true of Google App Engine as well by the way; they have customized Python runtimes very similar to the app containers you deal with in Azure. These things will get in between your VM and your user code … often for the betterment of the overall health of the cloud infrastructure and not necessarily your specific application!

2. Reducing developers’ need to focus on horizontal application elements is good, a retailer’s dev team building complex caching frameworks or complex identity infrastructure is bad

I don’t think anyone that has done enterprise development for a significant amount of their career will argue this. I’ve spent most of my career in these roles as an enterprise developer looking for ways to add as much value as possible. The best way to do that is not to get caught in the weeds of horizontal application frameworks and infrastructure. There are a number of ways this happens when building line of business applications but the most common is poor technical leadership and a desire to do “gold plating” for your application. A good friend of mine who had spent many years in the enterprise dev space loved to say “great is the enemy of good”. I have to say I agree and live by this credo when working on designing enterprise solutions.

So when you look at something like PaaS in the cloud you will find that a huge benefit is the building blocks available for you to reduce the amount of code you have to write. There are elements that all vendors have like NoSql style storage repositories and then there are ways to store block and page level data in a highly durable/scalable storage infrastructure. Then you’ll see identity and access infrastructure, messaging capabilities, relational databases, and caching or content distribution features. The list goes on and on and will probably change before the ink is even dry on this post. All of these have nominal charges considering the amount of code the app developer ultimately avoids having to write.

Now I have to say these features are absolutely wonderful if you’re envisioning a brand new application or looking at rewriting your application. That said, there is rarely any simple migration path for an existing on premise application. For example, if you want to look at something like Azure queues or Amazon’s Simple Queue Service because your current application has been using an on premise queue between tiers then how do you do that … well … you rewrite your code if you want to move to PaaS. With IaaS you’ll have other options but if you want to start using a PaaS feature it is very hard to do that from an existing application dependency like this.
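To make that concrete, here is roughly what the rewritten queue code ends up looking like if the target is Amazon’s Simple Queue Service. This is a hedged boto3 sketch; the queue name and message body are made up:

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

# The on premise queue calls get replaced wholesale with SQS API calls.
queue_url = sqs.create_queue(QueueName="orders-to-process")["QueueUrl"]

# Producer side: enqueue a unit of work.
sqs.send_message(QueueUrl=queue_url, MessageBody='{"orderId": 42}')

# Consumer side: poll, process, then explicitly delete the message.
resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=10)
for msg in resp.get("Messages", []):
    print("processing", msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```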

The other big challenge is the multi-tenant nature of the PaaS features you’re going to consume. The most common one I hear folks struggle with is the relational database that they choose to use. In order for vendors to achieve better economies of scale it is typical to see the isolation boundaries blur some, thereby leaving the applications subject to transient problems. Kind of like having your neighbor suck up all the internet bandwidth on your street by hosting a really popular Justin Bieber fan site (no this has not actually happened to me, just thought it was a good illustration). The big difference here though is you can’t walk over to that neighbor and say … hey … cut that crap out or go put that freaking site up in the cloud man! In a PaaS scenario you have to hope the superintendent or home owners association rules come to the rescue. This type of issue is something you can design for if you’re building a new solution but if you’re simply migrating an existing application, well, that is really hard and often causes developers to stumble right out of the gate when moving to the cloud.

3. Config and app version consistency is good, environment drift is bad

PaaS is awesome for this; when you consider the basic model here you are almost forced to stay fairly clean with the footprint of your application on the VM itself. You are building an application “package” and scripting or building templates for any specific infrastructure dependencies. The promise of elasticity and resiliency means you need to be able to have your application stand back up from a bare bones VM quickly, easily, and consistently. A typical PaaS developer has to consider the environment evil and untrustworthy.

There are some specifics you’ll need to understand though. First of all, environment drift is still a potential risk in a PaaS environment because remote access to the VM is available from Azure and AWS (but not Google, the Google App Engine really is a pure black box … you don’t touch that environment at all). You will also find different rules about when the environment is blown away and when it isn’t. Is the environment state lost when we version our application? Is it lost when a hardware failure occurs? Is it lost when the OS image is updated? The answer is actually not consistent here and you need to explore the platform you’re using to fully determine how the state of your application might be cleaned up.

Now let’s address the topic of consistency. This is always a goal for application developers that are working with multi-node configurations. When I roll out a new version of my application I want to ensure that it gets done consistently and allows me to do it without any downtime. The best way to do this is by leveraging load balancers and virtual IPs with replica staging environments. Seems simple enough; in the Windows Azure space you get an option to execute a VIP swap from a staged environment and Amazon provides a similar option to control the LB routing from their portal with tagging options for staging VMs. If that were the end of the story then I probably would just say this is a no brainer … but it isn’t.
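One concrete flavor of that swap on the AWS side, assuming Elastic Beanstalk is in the picture, is the CNAME swap between a staging and a production environment; a minimal boto3 sketch with made-up environment names:

```python
import boto3

eb = boto3.client("elasticbeanstalk", region_name="us-east-1")

# Swap the CNAMEs of the staging and production environments so traffic
# shifts to the newly deployed version without downtime.
eb.swap_environment_cnames(
    SourceEnvironmentName="myapp-staging",        # hypothetical environment names
    DestinationEnvironmentName="myapp-production",
)
```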

There are still certain types of solution changes and certain architectures that don’t work well with a direct staging to production swap. For those you’ll see varying solutions. In the case of Windows Azure you have a rolling upgrade option but this has a lot of complexity. It is baked into the PaaS model but you have to deal with a downgraded capacity and in any scenario with more than 2 instances you could have to handle running 2 versions of your application side by side! This is because of the way upgrade domains cascade through your solution: they shut down a percentage of your application for upgrade and then bring it back online while continuing to “walk the upgrade domains”. If you can stick with VIP/LB management you’re fine and I would still consider that part of the platform. It gets a bit more sticky as you look at smart fabric options.

Last but not least, the concept of high availability is baked into your cloud platform in a few different ways. Options like clustering on top of availability zones in AWS or availability sets in Azure can get very confusing. You have to understand how to dissect your application into tiers that cannot be taken down at the same time. You may need to look at what internal dependencies exist between those tiers and you have to realize that your nodes may be taken down or fail over at any time due to hardware failures or host infrastructure updates. High availability is not something PaaS gives you for free but it is something to consider when you’re looking at versioning your application. You have some work to do here if you want to provide for continuous service availability.

4. Sticking with a dev platform you know and using tools you’re comfortable with provides for efficient cloud development and is therefore good

I have found this to be true for me but I’ve been a pure Windows, Visual Studio, .NET guy for the past 8-10 years. The tools we have for doing development are great, the tools we have for doing cloud development are good and continue to get better. They do provide for a lot of efficiency gain because you have local emulators and you have numerous SDKs that simplify the work against some of the more confusing APIs. The same can be said of the tools for Eclipse working with plug-ins to deploy to AWS (or even Azure from Eclipse). The Google tools work well too and have full blown local emulators. Amazon is the only one where I don’t see any local emulators today. Yes, they have the Eucalyptus bits that can do a form of private cloud and thereby give you an on-premise configuration but I think we’ve stepped out of the realm of dev emulators at that point.

The reason I focus on the emulators is because this whole post is about PaaS. How are you expected to write any application and test it locally without emulators to verify that it works! Moving the code back and forth to an actual public cloud set of instances is time consuming and ridiculously inefficient. Therein lies one of the first ways the tools story starts to break down some. The emulators are good … they are not great. I’ve used all of them for a long time and you’re always going to find that they are not 100% consistent with the cloud platform you’re deploying to. They also have their own overhead associated with them so if you’re planning to run tests in those local emulators just plan for some extra coffee breaks.

I would also say that while tools are important for development they don’t necessarily make or break the platform. I’ve seen incredibly efficient developers that use Notepad++ and incredibly inefficient developers that use the most advanced dev tools ever seen. The real efficiency gains are in the approach and knowledge of the platform. So much of the gains you might see from the tool are vaporized when developers are not building good unit tests or not aware of the latest API wrappers that could have saved them hundreds of lines of code. If you’re basing your success on a tool alone when moving to the cloud then I’d say you’ve already fed the Mogwai after midnight.

5. Moving your application from a shared infrastructure to a dedicated one with horizontal and vertical scaling options is good, being locked in to a constrained set of shared infrastructure on premise is bad

You may have noticed I said “shared infrastructure” twice in that line. That was on purpose; what you often see in an on premise application that is trying to move to the cloud is an application that has no idea how to run in an isolated set of infrastructure. To be cost effective you want to run with the smallest set of VMs that can satisfy your user demand. Now, admittedly, this isn’t just a PaaS issue, it really is for any move into a hosted cloud infrastructure. Most on premise operations teams will run applications together on a VM. Sometimes I’ve seen this divided by business unit, sometimes I’ve seen it divided by some sort of costing model. Whatever it is, there are usually a lot of applications that run side by side with other applications. In addition, these servers are often loaded up with a set of enterprise standard software that was chosen 5 – 10 years ago.

I mention all of this because the goal of running in a cost effective manner means you have to determine your existing capacity requirements. These requirements are often skewed by what my good friend the Samurai Programmer likes to call “Shmutz” (I can’t take claim to using Yiddish … he’s taught me all the words I know … thanks Greg 🙂 ). That “Shmutz” is the stuff distracting us from finding out what the real capacity requirements are. I might have a bunch of CPU getting burned up by a virus scanner, I might have another application that gets a spike in demand every morning at 9 am and forces my application to throttle down, I might have background jobs eating up a bunch of memory, etc.

There is no arguing that getting to an isolated infrastructure is more flexible, but is it the most cost effective model for your solution? What else could you consider? There are shared hosting models now in Windows Azure and with Amazon you’d have no problem setting up multiple sites and routing different DNS requests based on the incoming URL. High density hosting and pushing again for the right economies of scale can be a difficult decision depending on the type of application. At this point it does become more of a PaaS topic but really an even finer grained one than you may have originally thought. Are you going to be happy with PaaS shared or do you need isolated, and how can you switch when you need to?

I’m very impressed with what we came up with for Azure Web Sites when you’re looking at this scenario but it is limited to web applications. I’m also impressed with what Amazon does for their “Spot Instances” pricing option. This isn’t really a model for shared hosting per se but it is a way for you to drive your costs down by bidding for compute space and if there is some available then Amazon will give it to you. Imagine if you have some flexibility in how long that computation job takes but you only have $X to get it done. You bid and wait and you likely get it done at some point. Priceline.com for cloud computing … where’s the Shatner commercial?

PaaS and Vendor Lock In

I can’t complete this topic of migrating to PaaS without being fair to the topic of vendor lock-in. After all, if you’re going to write code that targets a specific platform you have to wonder if the platforms are locking you into their model. In some cases the answer is yes but you can insulate yourself in a number of ways. If you’re planning to write code that is purely PaaS and leverages all of the tools and SDKs that are most mature for each vendor then yes, you’re likely to end up locked in. In fact some of the programming languages themselves are going to bind you to a specific platform (the Go language for Google App Engine or .NET for Windows Azure).

What you can do however is look at options that allow for better portability. Languages like Python are being built up on Amazon, Google, and Microsoft’s clouds complete with SDKs for their platform. You could look to those runtimes and frameworks if you want to avoid getting locked in. This is an area I continue to explore in more depth and you can almost guarantee there will be more posts coming from me on this in the near future.

The other interesting area here is how to leverage the various private cloud data center management tools to provide for seamless back and forth work in the public and private cloud … also without too hard of a vendor lock in. Again, more to come on this as I’m focusing in this area extensively over the next 12 months. As it relates to pure portability I know Azure has a distinct advantage because they are the only cloud vendor with a developing private cloud solution. VMWare could certainly poke their head up in this area at some point but they are way behind with nothing but Cloud Foundry (in beta) in the public cloud space today.

Summary

So have I completely changed my mind about PaaS being the “best” location for migrating your application? Honestly, I’d say I have changed my mind, mainly because I can’t see how it is feasible as a migration target given all the friction. That is not an indictment on the Google, Amazon, or Microsoft platforms. They are all fantastic in their own rights but as someone who has written a ton of on premise applications and attempted to help migrate them to the cloud for the past 2 years I can tell you the reality is a new application or a rewrite is likely going to be the only workable option to get to PaaS.

This is why IaaS continues to be an important bridge to the cloud. Many solutions may use that bridge to let the air out of the balloon on their datacenter and start investing in new development in a platform of their choosing. The issues of compliance and cost have been vetted out. They are not necessarily all consistent across the vendors but they will be at some point in the near future. I personally believe that the 2 biggest questions should now be:

  1. Can I do this cloud app development on premise and have some flexibility in and out of the public cloud? Will it be seamless?
  2. Does the vendor provide me a platform that will lock me in and how do I avoid that?

I hope you found some of these three parts interesting and maybe they made you ask some new questions you weren’t thinking about. PaaS does, at its core, provide for a simplification of some aspect(s) of the development of your cloud solution. The question for developers now is, how and when do I start to look at that, because straight migrations are often a complete dead end.

Posted in PaaS, Windows Azure | 1 Comment

Moving Apps to the Cloud (Good, Better, Best) – Part 2

First, my apologies for the delay in getting part 2 written, my only excuse is that I really wanted to understand a broader set of options for building Hybrid solutions in the cloud. You know from part 1 that this whole series comes from the lessons learned putting on the TechEd talk in June 2012 which of course was exclusively about moving apps to Windows Azure and how IaaS can play a role in that. What about other cloud vendors? How are they addressing the hybrid solution space?

After a good amount of reading and researching I think I can at least give this topic proper theoretical coverage. Unfortunately, I’ve not yet had the opportunity to put it into practice and look at these different options side by side, but I do believe I can give readers a sense of what I think is interesting and unique about building hybrid solutions in the cloud across the key vendors. Of course I’ll also explain why this is better than pure IaaS and continue upon that theme. Enough with the sales pitch … let’s get it started.

What are examples of “Hybrid” solutions?

This is the obvious place to start. We must attempt to get a baseline on the types of solutions that would be classified as hybrid. Saying something is hybrid probably gives folks a very specific initial reference concept. Hybrid means we are trying to do two things in a blended way … think about hybrid vehicles that use gas and electric or genetic hybrids like a mule that is a hybrid species (horse + donkey). Effectively, let’s try and take the best of multiple “things” and come up with a better “thing” than what we had when it was pure or homogeneous.

1. PaaS and IaaS Public Cloud Hybrid

This is maybe the most common version of this because it takes things like stateless web tier components and uses stateful database instances. I see architecture diagrams like this all over the place with Amazon implementations and we’re starting to see it come about with Windows Azure and now Google and their Compute Engine behind the PaaS App Engine. These tend to happen very naturally based on how the tiers of the application were originally designed.

The web UI tier is usually relatively easy to write in a stateless way whereas the data tier is inherently not. The web tier is typically very resilient as well because of things like load balancers and health probes allowing for simple redirections and retries. The data tier on the other hand is not; you can really see this in the implementations of multi-tenant relational databases in the cloud today. They introduce a lot of complexity and constraints for developers (throttling, size constraints, backup/restore limitations, transient failure conditions, etc.). This leads to a desire to move back to something more familiar, which is where IaaS comes to the rescue. This is also exactly the architecture we look at with MSDN/TechNet in our TechEd talk (see diagram above). In this case we actually had a hard constraint blocker (a multi-terabyte relational data footprint) which resulted in our need to build it as a hybrid.

2. On Premise and Public Cloud Hybrid

Tons of people I talk to love the idea of cloud computing. It is almost impossible to debate the economics of it for elastic work loads. They however are always challenged by the constraints around compliance and security. This really is the gorilla in the elevator for the cloud today and this hybrid model attempts to provide a solution. Don’t worry so much about compliance in the cloud; use it for those things that don’t have heavy compliance and security demands and connect back into your own data center for accessing sensitive data or running compliance bound transactions (like credit card transactions that need PCI for example).

Another example of this model is centered around securing and reusing private data. It is very common to find enterprises unwilling to move heavy sets of core data into the public cloud. Even if you could implement proper encryption at rest capabilities there is still the size and cost associated with moving that data. Add to that key management complexities and the potential on premise uses (like business reporting or existing intranet application consumption) that would be impacted. There are points of friction all over the place when moving large data sets but with a hybrid model like this you don’t have to choose. To use a golf metaphor … “play it where it lies”.

There are only two solutions that I’ve seen to date that address this and that’s the site-to-site networking capabilities in Windows Azure and the Direct Connect option with Amazon Web Services. The Amazon technology is the only one in the market today that can go up to 10 Gbps, but the Windows Azure approach (currently in preview) is nipping at their heels with something that may not provide as much bandwidth but still can provide a cost effective solution for a lot of workloads. Taking the concepts of virtual networking into this mode where you extend your corporate data center for on demand public cloud compute power while still mitigating the risk of compliance and security issues certainly has a lot of potential.

3. Private and Public Cloud Hybrid

This really comes down to how you want to leverage your existing virtualization with the public cloud. Many companies have either VMWare or HyperV capabilities today and being able to move those private virtual machines into the cloud is attractive for a number of reasons.  Two reasons I hear all the time are disaster recovery (aka business continuity) and geo-distribution.

Disaster recovery is a huge topic that I will save for another post but I will say that using private cloud VMs and moving them to the public cloud as a fail over can be much more cost effective than say standing up and owning your own stand by data center. I realize that is stating the obvious and definitely easier said than done. Would your security and compliance concerns be the same in a disaster scenario? Could you run in a degraded form and still satisfy your requirements in the event of a disaster? As I said, this is a huge topic but a common one that folks look at with hybrid cloud solutions. Often the driving goal is to blur the lines between the private and public cloud here.  For more on this topic watch the TechEd 2012 talk here.

Geo distribution is another thing that attracts folks to this model. Assuming you do have private cloud capabilities in your DC what happens when your users start to spread out or even roam? Using the public cloud to stand up pieces of new infrastructure for your services when those changes come about is powerful as well. We cover a scenario in the TechEd talk (see part 1) where an inventory service needs to start servicing customers nationwide but it was blocked by some missing platform features for previous versions of Windows Azure.

Once again, IaaS to the rescue here with a hybrid solution architecture. The beauty with this in the Windows Azure IaaS model is that the VM format (Hyper-V .vhd) is preserved between the private and public cloud environments. You can literally set it up, move it around, and boot from it locally and in the cloud. There are no similar capabilities out there today from other vendors (I have to imagine the new VMWare public cloud offerings will provide this but there isn’t even a roadmap for those yet).

Why is this better than pure IaaS?

Let me get the easy one out of the way: moving to a hybrid model is better in some cases because you don’t have any other option. Basically as a solution architect you’re just trying to “break through the wall” if you will. Many of the hybrid solution scenarios above describe constraints that can not be met with PaaS or with public cloud in general. Moving to a hybrid model in a lot of cases is simply necessary because of that. Is that “better”? Well it is better than staying stuck in a purely on premise infrastructure model that costs you and your business way more than it should to respond to wildly varying demand for compute resources. It is also better than sitting around waiting for months maybe even years for all the compliance and platform features to exist.

The most interesting solution model to explore as a “better” model than pure IaaS is the one that does start to move pieces and parts to PaaS. This essentially brings up the question of why is PaaS better? The typical answer here is always “of course PaaS is better … I have less code to write”. That is a tough value vs. complexity cost trade off discussion actually. The different PaaS environments do in fact automate certain things for your app but they introduce lots of different constraints at the same time.

You also have to be very careful if you are going to attempt to avoid lock in on any one PaaS vendor. The last part of this series will approach this whole idea that PaaS is the ultimate goal and why. As we do that I also want to look at options from OpenStack which promises freedom to move from vendor to vendor. For now let’s just agree that moving to something with more automation and less application specific horizontal concerns can be better.

Summary

The hybrid solution models vary quite a bit so you have to start by making sure you define what type of hybrid solution you want to build and why. Are you trying to overcome some sort of PaaS constraint or are you avoiding some type of cumbersome data migration effort? The technologies that exist for extending enterprise networks into the public cloud are starting to provide a lot of flexibility here and as the major vendors provide tooling and reference architectures for using IaaS and PaaS together I think we’ll see more and more of these hybrid architectures in practice. At a minimum, the hybrid solutions will be around for the next few years as interim steps to purely managed PaaS solutions or compliance constrained workloads.

Posted in Cloud Computing, IaaS, PaaS, Windows Azure | Leave a comment

Resiliency in the Cloud: If you build it … it WILL break

I felt compelled to take a quick break from my 3 part series on moving apps to the cloud to focus on a major topic with cloud solutions. That topic is resiliency and in light of the recent outage for Amazon Web Services (http://zd.net/MuXbjE) I wanted to get some of my thoughts written up. There are a number of articles that have been written recently about whose fault it is when a service like Netflix or Pinterest goes down. Is it fair to blame the cloud vendor? Is it fair to blame the solution architects? Is it fair to blame the folks making the tough cost decisions?

These are the conversations that occur in the weeks that follow any major platform/infrastructure failure from the cloud vendors. It highlights one very important thing in my mind, cloud solutions are still in their infancy. So much is said about disaster recovery scenarios and the ability to geo-distribute services but when it comes down to actually seeing that work when the chips are down … well … we see what happened just 2 weeks ago. This was not the first failure and it is most likely not the last for any of the cloud vendors. The question we all have to ask is how do we mitigate the risk in a cost effective way? The rest of this blog will take a look at resiliency from a variety of different scopes and how to best insulate solutions from failure.

The type of solution matters … a lot!

There are a ton of different varieties of cloud solutions and each of them has a totally different profile for fault tolerance. Let’s look at the most common cloud topologies.

  1. Pure PaaS

This is not really the most common solution out there but when you do have a purely PaaS implementation you have a lot of options when it comes to resiliency. Just to be clear, when I say pure PaaS I mean you are using the data platform of that PaaS offering (like entities in GAE or Table Storage with Windows Azure or DynamoDB/SimpleDB on AWS) along with the traditional app container model.

There are still ways to do some bad things even in PaaS that will make your service far less resilient to fail over. For example, any service can choose to rely on a static IP in a DNS A record and the moment the host goes down it will take a serious amount of time to recover because of typical DNS change latency. Another common example is when developers choose to use a relational data store without any backup/restore capability. Yes, this still has to be done and you should not store your backups in the same datacenter where you have your primary service! Always think about worst case scenarios and recognize that even in PaaS the service availability is something the developer still has major influence over.

2. Pure IaaS

These are probably the most common solutions in the industry today as most folks look to the public cloud as a way to move their canned solutions “as is”. EC2 in Amazon is by far the most popular and that recent outage put a real spotlight on the type of impact an IaaS outage can have. The big difference here is the state of the VMs and how you can replicate those for fault tolerance. You still have some of the same issues I referenced in PaaS (static IPs and no data backups) but with IaaS things can get even more daunting because the entire VM itself has state that you need to replicate to get back up and running.

That leads me to a major cross cutting concern for resiliency … data of course. How do we ensure that our data is always synchronized and ready for recovery? This is often amplified in some of these cloud solutions because a lot of services are being designed for massive internet scale. It is not often that developers are thinking about how to restore their 50 MB backup to another datacenter during a failure. It’s more common that we are talking about gigabytes or terabytes of data. That is a very challenging and sometimes costly problem to solve. This really is a cost/benefit decision for service owners. You can find dozens of ways to keep data (even large amounts of it) in sync but you are essentially always going to pay for it.

But wait, I thought the cloud vendors would geo-replicate for me? Well, that is true and everything I’ve read and researched would suggest that your data is almost never going to be lost. I say almost because you have to understand the risk factors with the data platform architecture you’re relying upon for your VMs. Strong consistency versus eventual consistency is a huge factor here. Platforms like Azure storage use strong consistency within a DC but use an eventual commit across data centers. Amazon’s Elastic Block Store (EBS) is a bit complex, but it is mirrored within its availability zone and if you opt in you can have snapshots pushed to S3 and then copied across regions. Google doesn’t have a lot of information available yet on how their IaaS VM disk durability will work but I’m certain it’ll be something unique. The lesson here is that every single one of these is different and has risk windows. Losing data is simply something you have to be prepared for even when you’re looking at IaaS!
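If you want to take that opt-in step yourself rather than trust the defaults, snapshotting a volume and copying the snapshot to a second region looks roughly like this in boto3 (the volume ID and regions are placeholders):

```python
import boto3

# Snapshot the EBS volume in its home region (snapshots are stored in S3 behind the scenes).
ec2_east = boto3.client("ec2", region_name="us-east-1")
snap = ec2_east.create_snapshot(VolumeId="vol-0123456789abcdef0",  # placeholder volume
                                Description="nightly backup")
ec2_east.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

# Copy the completed snapshot into another region for cross-region recovery.
ec2_west = boto3.client("ec2", region_name="us-west-2")
ec2_west.copy_snapshot(SourceRegion="us-east-1",
                       SourceSnapshotId=snap["SnapshotId"],
                       Description="DR copy of nightly backup")
```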

3. Hybrid

The entire concept of hybrid cloud solutions is confusing right from the start. It tends to invoke a bunch of different ideas in developers’ minds. To some folks hybrid means using PaaS roles connecting to IaaS VMs. To some folks it means public cloud components used in conjunction with private cloud instances. For some it means you have connectivity between on premise assets and public cloud assets. Last but not least, to some it means you’re leveraging specific services in the cloud for increased reach or some niche need (ex. using a federated identity provider for security on top of mobile apps). This is the hard part about focusing on resiliency in a hybrid solution: you have to figure out what hybrid means to you first.

For the purpose of this topic I’m going to focus on the scenario where we are connecting back to on premise assets. I feel like this is the only scenario that has a genuinely unique fault tolerance component to it. The other ones are mainly the same factors you saw with pure PaaS and pure IaaS. The difference with this connectivity back to on-prem resources is that you have to deal with a potentially disconnected recovery plan and in some cases you introduce single points of failure.

Let’s expand on that last point: single points of failure are almost always the first thing to smoke out of any architecture that has a high availability requirement. When we design a solution that connects back to on premise resources we’ve likely done that because of some friction point regarding that component. Perhaps it is highly sensitive data that can never leave your DC. Perhaps it is a very specialized hardware profile (like something with a hardware security module or very high tier 1 I/O requirements). In other words, something you simply can’t restore in any public cloud infrastructure and likely something that is very expensive. These things are very often not redundant in an on premise infrastructure so your hybrid solution that depends on them needs to be prepared to survive when/if there is a problem.

There is nothing really cloud specific here; these are disaster recovery topics that engineers have been wrestling with for years. The key take away though is that your availability is only as good as your weakest link. If you have no fail over capability or a single point of failure then do not publicize a four nines SLA. Unless of course you can insulate yourself from that component being down. For example, if it is a highly sensitive database that must live on premise then can your database operations be done in an async way? At this point let’s transition into talking about designing for fault tolerance and scope of failure.

Is Geo-replicating enough?

You would think that taking your service and running it in Virginia and Oregon on AWS would be enough to protect you from any outage … but … you would be wrong! The recent failure by AWS was certainly triggered by a weather event but it was exacerbated by a bug in their platform (http://zd.net/N5r1rT). Another example of this was the leap day bug in the Windows Azure platform (http://bit.ly/AfdqyL). If you think like I do, reading these two cases would immediately make you think, “gee … maybe I need to determine how to keep my stuff running regardless of the platform”.

That is certainly easier said than done. Again this becomes a matter of cost/benefit. It also may be the type of thing you can do strategically based on varying availability demands. This is important to keep in mind here: if you have critical windows when your availability needs to be high then you have to spend money to insulate yourself from failure. If you are a retailer and your service absolutely must be available during the holiday season then I would suggest you can’t rely on any one cloud vendor. You may very well want to run your service in Google’s DC, Amazon’s DC, and Microsoft’s DC. The likelihood of all three major vendors being down at the same time is extremely low. Once you get out of that window where you need the highest amount of availability you can turn your service off (some platforms will just turn off for you if there is no traffic … like GAE).

Now the really tricky part, how do you design a service that can run in all three of these DCs? Do they all support the same infrastructure and application platforms? The answer is … kind of … you can write services in Python or Java and run on Linux across all three but you’ll find different aspects to each that still make it pretty tough to be totally cloud neutral. This is a much larger topic and one I intend to cover at a later time. I need to spend a bit more time learning how Python operates across all three before I can do an adequate job presenting a common cloud design framework. I think it is something the industry is trying to achieve and fortunately all the vendors seem very open to the idea. (see recent announcement of Python on Azure)

Designing for Resiliency

We covered the different topologies and addressed the idea of being cloud vendor neutral for the highest level of resiliency. Those are key topics but they are not necessarily the easiest way to impact your availability in a cost effective way. The best way to implement a service that can avoid outages is to … well … design a fault tolerant service! Even in the face of the AWS outage there were some services that did not go down at all. Twilio specifically wrote up the details on their engineering blog (http://bit.ly/gCRRaW) as to how they felt their service was insulated from the outage based on their design.

At this point let’s start to get deep with the things in a cloud solution that you have to design for. Some of these are obvious and honestly, they’ve been talked about a lot. The only different spin I’ll put on it here is that I want to help stack rank the importance of each based on the availability impact it will have. To do that I have to start by grouping important design decisions into some context.

1. Intra-Service

Services are often designed with multiple components. Imagine a web UI that has background worker processes or cron jobs. These are all part of the same service, and without any one of them running the service is degraded or dead. Having a service degraded is far better than being down though. To achieve this you have to consider every piece of your service as a ticking time bomb. Queues and idempotency are basically your bomb defusing kit here. If you can queue up work between these components, and replay that work when a component fails or a message is lost, you are likely to insulate yourself from a ton of potential issues.
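Here is a minimal sketch of that queue-plus-replay idea using Python’s in-memory queue module as a stand-in for a durable cloud queue (SQS, Azure Storage queues, etc.). The job names and the three-attempt limit are illustrative assumptions; the point is that failed work goes back on the queue and already-processed work is recognized and skipped.

```python
import queue

work_queue = queue.Queue()   # stand-in for a durable cloud queue
processed_ids = set()        # in real life this would be a durable store, not process memory

def handle(payload):
    # The actual work done by the background worker.
    print("processing", payload)

def worker():
    while not work_queue.empty():
        job = work_queue.get()
        if job["id"] in processed_ids:
            continue  # idempotency: a replayed message is recognized and skipped
        try:
            handle(job["payload"])
            processed_ids.add(job["id"])
        except Exception:
            job["attempts"] = job.get("attempts", 0) + 1
            if job["attempts"] < 3:
                work_queue.put(job)   # replay the work later
            else:
                print("giving up on", job["id"])  # a real system would dead-letter this

work_queue.put({"id": "order-42", "payload": {"sku": "ABC", "qty": 1}})
worker()
```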

This was clearly covered in the Twilio write up but I think they did miss an important aspect of this design. It is critical that the components are loosely coupled through queues and that you have replay capability (aka idempotency) … but … how do we know that something failed in the first place? This requires a developer to do a bit more work. For example, each and every operation that changes the state of your data must be implemented with a multi-phased commit. If the operation is critical to you then you should always be thinking about babysitting it. Start with an initial ACID (Atomic, Consistent, Isolated, and Durable) operation and build generic health monitors that look for failed or orphaned operations. Without this I would argue you’ll never know if something needs to be replayed. Simply saying we have a queue and an idempotent operation is not enough!
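To illustrate the babysitting idea, here is a rough sketch of recording every critical operation before the work starts and then sweeping for ones that never completed. The dictionary stands in for a durable operations table, and the five-minute threshold is an arbitrary assumption; the takeaway is that orphaned work is discoverable, so it can be replayed.

```python
import time

pending_ops = {}   # stand-in for a durable operations table

def begin_operation(op_id, description):
    # Record intent *before* doing the work so a crash mid-operation leaves evidence behind.
    pending_ops[op_id] = {"desc": description, "started": time.time(), "done": False}

def complete_operation(op_id):
    pending_ops[op_id]["done"] = True

def health_monitor(max_age_seconds=300):
    """Find operations that started but never finished; these are candidates for replay."""
    now = time.time()
    orphans = [op_id for op_id, op in pending_ops.items()
               if not op["done"] and now - op["started"] > max_age_seconds]
    for op_id in orphans:
        print("orphaned operation, consider replaying:", op_id, pending_ops[op_id]["desc"])
    return orphans

begin_operation("charge-1001", "charge card for order 1001")
# ... do the actual work here; if the process dies before the next line,
# the monitor will eventually surface this operation as orphaned ...
complete_operation("charge-1001")
health_monitor()
```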

2. External dependencies

This comes in a number of different forms and the common guidance I hear is retry everything and back up everything. There is some truth to that but I would argue that a proper design for resiliency here should be much smarter. Let’s first look at the idea of retrying everything. I don’t agree with this as a universal piece of guidance because it leads to code that can be wasteful and potentially even damaging. Take for example one of the common transient conditions developers see in the Azure platform, which is throttling. This can occur in any one of the multi-tenant features (ACS, SQL Database, ServiceBus, Storage, etc…) and I’d argue that this is a necessary evil with any shared service on any platform. Throttling is absolutely something we as developers need to build retry logic for because each service can take action to solve the problem on the fly. Your service should experience nothing but a hiccup in these situations.

That is a smart example of a retry; a poor example of a retry is when someone designs a framework for retrying every external call coming out of their service. Depending on the error or the state, you are introducing bloat and inefficiencies. For example, don’t retry calls to an external service that returns a business fault. You know what they say about retrying the same thing with no changes and expecting different results … that’s right … it is insane!
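A small sketch of the distinction I’m drawing: back off and retry when the shared service says it is throttling you, but fail fast on a business fault because retrying the identical request will never change the answer. The exception names here are made up for illustration; map them to whatever your platform’s client libraries actually throw.

```python
import time

class ThrottledError(Exception):
    """Transient: the multi-tenant service asked us to back off."""

class BusinessFaultError(Exception):
    """Permanent: e.g. a validation failure; retrying will never succeed."""

def call_with_retry(operation, max_attempts=5, base_delay=0.5):
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts:
                raise
            # Exponential backoff gives the shared service room to recover.
            time.sleep(base_delay * (2 ** (attempt - 1)))
        except BusinessFaultError:
            # Same request, no changes: do not retry, surface the fault immediately.
            raise
```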

3. Versioning

Applications change … I know this is probably not a shocker to anyone reading this, but when you’re trying to implement a solution with high availability you have to think about the strategies for versioning your application and, even more challenging, your data. The application itself can be fairly simple depending on the platform. Just make sure you really understand the concepts and underlying platform support for rolling out changes. GAE has a nice way of tagging versions in their config and providing for immediate rollback as part of the platform. Azure has two key concepts. The first is rolling upgrades, which on their surface seem like a nice way to go about making app updates. Rolling upgrades can actually introduce a major complexity if you use them for app upgrades because of the way they take down an instance and stand it back up before all instances have been upgraded. This is great when the code isn’t changing because you can maintain higher capacity in your service, but when you are upgrading code you risk having two different versions of your code running at the same time! The better solution is VIP swaps in Azure, but those sometimes have limitations. (Amazon is something I still haven’t worked with enough to understand the internals, but rest assured I would/will when I start to deploy and update production services there.)

The application isn’t usually the hardest component to version. Things like load balancers make it pretty easy. The harder part is your database and the schema for your database. This actually plagues developers with on prem solutions as much as it does with cloud services. You have to plan for this and it can be even more challenging when you consider solutions that leverage hundreds or thousands of databases. I’ve seen really interesting custom designs in this space that roll out changes using operational queues. I’ve also seen solutions that leverage views, with different versions of the views maintained to insulate code from the underlying schema. Whichever way you decide to solve this problem, just be prepared to take some time to think it through and have a way to evolve without downtime or maintenance windows. Users today just don’t understand web sites with construction JPEGs on them anymore 🙂
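Here is a tiny sketch of that versioned-views approach, using SQLite purely for illustration; the table, view names, and schema change are all invented. The idea is that code bound to the old view keeps working while new code moves to the new shape, so the schema can evolve without a maintenance window.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Original schema plus a v1 view that existing application code reads from.
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, full_name TEXT);
    CREATE VIEW customers_v1 AS SELECT id, full_name FROM customers;
""")

# The schema evolves: the name is split into two columns. customers_v1 is
# redefined on top of the new columns so old code keeps working, while
# customers_v2 exposes the new shape to new code.
conn.executescript("""
    ALTER TABLE customers ADD COLUMN first_name TEXT;
    ALTER TABLE customers ADD COLUMN last_name TEXT;
    DROP VIEW customers_v1;
    CREATE VIEW customers_v1 AS
        SELECT id, first_name || ' ' || last_name AS full_name FROM customers;
    CREATE VIEW customers_v2 AS
        SELECT id, first_name, last_name FROM customers;
""")

conn.execute("INSERT INTO customers (first_name, last_name) VALUES ('Ada', 'Lovelace')")
print(conn.execute("SELECT full_name FROM customers_v1").fetchall())   # old code path still works
print(conn.execute("SELECT first_name FROM customers_v2").fetchall())  # new code path
```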

While application versioning may not seem like something that can impact your resiliency, it really can. Mainly because the overarching requirement is availability and uptime, and if you aren’t planning for how to handle this you’ll find yourself abandoning the 4 9’s soon after you roll out your first or second hotfix.

Summary

Resiliency, fault tolerance, high availability, chasing the 9’s, whatever you want to call it. Cloud service developers have to understand a lot about the platform they run on and they need to consider a number of key design patterns in order to ensure they can keep their users happy. High availability is (or has become) a non-negotiable requirement for users. If you think your cloud vendor is going to just give you this because they have an uptime SLA you are simply missing the finer points of service resiliency. Hopefully this post has given you some thoughts on how to design, deploy, and maintain your service to decrease the amount of time you are down, and if you’re really good (and have deep pockets) you can get to the point where your SLA is as good as or better than that of the vendor that hosts your solution.


Moving Apps to the Cloud (Good, Better, Best) – Part 1

We are a few weeks removed from TechEd 2012 in Orlando (June 11th – 14th) and I was reflecting on all of the stuff I learned as two of my colleagues and I put together our breakout session titled “How to Move and Enhance Existing Apps for Windows Azure“. The journey we took to come up with this topic was an interesting one. Initially, we were truly thinking purely about Platform as a Service (PaaS) because as a developer this is what has been the most compelling in the cloud. That all started to quickly change as we learned more and more about how Infrastructure as a Service (IaaS) was going to open significantly more scenarios for subscribers.

In fact, our initial outline for the session was not chosen for the event. We had to be willing to evolve our view of what it meant to “move apps” to the cloud, and specifically to Windows Azure. Our discussions with the Azure track experts gave us some great ideas. This idea of moving from good … to better … to best approaches of moving apps started to make a ton of sense. It was also quickly clear to us that making this story compelling would require some real world case studies and some truly unique implementation demos.

Migrating versus Moving

One of the things that had us scratching our head a bit when we first started reviewing our session submission was the desire of the Azure track folks to use the term “Move” instead of “Migrate”. I know this seems subtle but after 2 solid months of researching and preparing for this talk I think I understand why. When you say you are going to “migrate” something there is some perceived work involved. When a flock of geese migrate south for the winter that’s a good bit of work (or at least it looks like it is). When folks migrate between different countries there are lots of logistics and challenges.

Now when you think of the term “move” I think folks have different visceral reactions. Some of you think of moving your house, which for most is not a trivial undertaking. It is however something a bit more common and often comes with tons of ways to make it less painful (hiring movers for example). There aren’t any shortcuts like that when you look at migrating. I realize some of you may be thinking, uh … really … they are synonyms. Yes and no. The two are certainly different, and we definitely embraced that with our session.

Moving Apps to the Cloud – The Good Way

So we’ve settled on the move term and we’re looking to cover what it is that would be considered a “good” approach. This is where the Infrastructure as a Service offering fits in, along with the other term being popularized: “forklifting”. The forklift metaphor works really well for moving apps into an IaaS model. In your head you should hopefully be picturing some large warehouse where a truck has backed in pallets of widgets. Those pallets are simply picked up and put in their expected resting spot. There should be no changes necessary!

This is one of the most critical differences between IaaS and PaaS and it is true of all vendors that have these offerings (Amazon, Microsoft, Google, Oracle, etc.). This was a huge challenge for subscribers that wanted to just get to the public cloud … friction-free. I have personally spent the last 2 years attempting to help developers understand how to get to PaaS. It is not simple when you consider all of the complexities of existing applications. IaaS really is the way to do this … it really does feel like a “good” way to start getting to the cloud.

Let me give some more specifics. So far this feels like a bit of an opinion piece. Let’s look at three key reasons why PaaS is a difficult first step for an existing application.

1. You need to understand the platform

This is a big deal, whether you’re looking at Windows Azure or Google App Engine or Heroku or Elastic Beanstalk. You have to know the constraints and intricacies of the platform. How does scaling work? Can the platform auto-scale? What are the key constraints? What programming languages are supported? How do updates occur? What level of control and permissions do you have with the local machine resources? How do the diagnostics work? What type of multi-tenancy conditions can occur? Are there APIs to help automate? What features exist and most importantly … what features are missing?

This last question was the catalyst for my demo in the first section of the TechEd talk. Missing platform features plague all of the PaaS models. They result in developers either abandoning plans to move to the cloud or looking for alternatives like IaaS. My example was the Microsoft Distributed Transaction Coordinator (MSDTC). I won’t go into detail on what that is or why it isn’t supported in Windows Azure PaaS … I’ll let you watch the session for that. Just know that it was not and likely will not be a feature in PaaS, so IaaS is the only solution for applications that have a dependency on it. As a side note, I intend to do some deep exploration of the other leading PaaS offerings to see if this is a universal blocker … expect some blog entries in the near future on Google App Engine, Elastic Beanstalk, and Heroku.

2. You have to really know your application

The diagram below represents the DTC-dependent application that we designed to move to the cloud. This application was expected to handle a scenario where the users would greatly benefit from geo-distribution (just a fancy way of saying multi-datacenter deployment … please forgive my Fancy Nancy style of writing this note … I have young daughters 🙂). This application could also use existing platform features to handle central data repository sync, so we were about 95% of the way to the cloud and PaaS, but we hit a blocker.

This may seem like an obvious thing but with legacy applications there are a number of factors that can decrease the level of knowledge developers have. The first and most obvious one is that they simply haven’t worked on the app in years. There are also situations where the original developers managed to strike it rich and retired to an island somewhere and refuse to take any calls when you ask for help to understand the app. These represent knowledge leaks and they are very common.

3. Your application needed to be scalable and resilient by design

This is by far the hardest aspect and also the one I almost never see in existing applications that we want to move to the cloud. A developer who wrote an application 10+ years ago was not thinking about it going from their on premise 3-server farm to a 50-instance public cloud with no control over the load balancer, server affinity, or application state. Another trap we often fell into 10-15 years ago (not that we knew it was a trap at the time) was that we relied heavily on the ability of hardware to handle our scaling needs. Stated another way, we would add CPUs and RAM to solve any scale issue. I am not saying this was the case for every app written by every developer … but not everyone had the foresight to plan for huge scale-out options.

Resiliency is another one that no developer (including myself) would have planned for in a legacy solution.  Heck, I can remember when I was an enterprise architect I would actually do finger wagging at any developer that wrote retry logic around their data access code. It felt unnecessary in a controlled on premise data center. In the PaaS world it is as important as oxygen for mammals.

Coding for resiliency comes in a ton of flavors. It starts with data access code but then it trickles into any stateful code or workflow code. How do you know your transaction failed, and is it idempotent (a fancy way of saying replayable)? I think I could write a thousand words on resiliency in the cloud … and I plan to … so I’ll cut this off here.

Why are we calling IaaS good? I think it is freaking awesome!

That is a fair statement, and those that have been moving apps to EC2 instances in Amazon for the last five years probably agree 🙂 However, there continues to be a lot of overhead with managing a solution that is using IaaS. The primary promise of the cloud is almost always about cost efficiency, whether it is the rent vs. buy example for an elastic scenario or the reuse of world class management and deployment processes and software to handle your application.

If you’re not making an effort to get to PaaS then you are losing out on a lot of good stuff. Servicing of the underlying infrastructure is the most obvious. Rolling out patches to Linux and Windows is a mundane and well understood IT activity, and sticking your apps into IaaS will not get you out of it.

Additionally, consistent management of your infrastructure is one PaaS benefit, but what about consistent management of your application? The configuration and the versions of the code? How do you handle app version rollouts on premise? Do you have a zero-downtime (aka no maintenance window) rollout? In some cases this is easy with load balancers and stateless solutions, but it gets infinitely more difficult when you start dealing with a ton of instances and constant usage demand, plus all of those shared resources like the DB, cache servers, queues, etc…

Summary

I decided to split this into three parts because I knew I had a lot of learnings as I went through creating this TechEd session. This first part was about how we settled on “The Good” first step into the cloud, which is moving existing apps into IaaS. The main message here is that this is the start of your journey, not the end. If you want the true benefits of the cloud you need to keep pushing yourself to think about all the app containers out there. The economics are simply awesome, and we start to see that immediately when we move to IaaS and PaaS working together in a hybrid model. That’ll be the focus of Part 2.
