Friday, September 12, 2014

Utilising the cloud for disaster recovery purposes: server and data replication





Introduction

Disaster Recovery is one of the most interesting use cases for adopting cloud technology. Almost every company has to deal with it: those pesky accountants keep asking how you are preparing for potential disasters and how your IT is going to survive them.

The problem with DR is that it used to be a very costly exercise that is very hard to get (and keep) right, and all that effort and money goes into something that you hope will never be used. A difficult business case indeed!

The pay-as-you-go model of cloud computing is an excellent fit for this. You can prepare your entire backup solution in advance but, depending on your RTO (recovery time objective) and RPO (recovery point objective), only a minimal set of resources has to be active all the time, and hence only a minimal part of your backup infrastructure costs will be invoiced to you. Much better!

Disaster Recovery is much more than just IT, and even for the IT part, designing a disaster recovery system, whether cloud based or more traditional, is not to be underestimated. It definitely takes a lot more than a single blog post.

However, one of the usually tricky things is how to keep the configurations of your server infrastructure, as well as the data in sync between the different environments.

I just got off a phone call with some people from CloudLeap, who demonstrated their product. It is quite an interesting product, and I figured it would be just as easy to capture my impressions in a blog post.

What is the problem?

Data is obviously not static. It changes all the time, and in order to have a useful disaster recovery solution, the data in your DR environment must be as up to date as possible. Cloud providers and quite a few technology vendors provide solutions for this. For instance, AWS Storage Gateway allows you to replicate on-premises data volumes to the cloud, so they can be used as backups.

However, the same applies, although to a lesser extent, to the servers themselves. Servers are not static: they change because of updates, newly installed software and so on, and in order to have a reliable DR solution it is critical that you can rely on these configurations being in sync with the primary systems.

Solutions

To deal with this aspect, different solutions are available. For instance, if you are using dynamic provisioning solutions such as Chef or Puppet, things will be a lot easier. However, getting started with these kinds of tools is not necessarily easy, and automating everything that you have configured in the past is quite a task.

From that perspective, CloudLeap is a solution that requires a lot less investment to get started, as it allows you to mirror your server configurations as well as your data to a cloud provider of your choice. It does this by simply copying the disks block by block to your target provider. Obviously the mammoth AWS is supported, but quite a few others as well (Azure support is still on its way though).

CloudLeap can be used to do a one-time migration of server (and data) images, and optionally it can also keep them in sync by periodically replicating them again. The big difference with the likes of AWS Storage Gateway is that it turns your mirrors (where applicable, of course) into bootable volumes which can be launched upon request, basically providing a point-in-time copy of your server and data.

It requires the installation of an agent on the source server systems and uses a (single tenant) management console which controls these agents by specifying things such as which volumes to replicate and at what frequency.
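
Since the replication frequency is something you configure, it is worth checking it against your RPO. Below is a minimal sketch of that arithmetic. It is not based on anything CloudLeap documents: the function names are made up for illustration, and it assumes that a copy only becomes usable once its transfer has completed and that transfers do not queue up behind each other.

```python
from datetime import timedelta


def worst_case_data_loss(replication_interval: timedelta,
                         transfer_duration: timedelta) -> timedelta:
    """Upper bound on how stale the newest usable copy can be.

    Assumption: a copy started at time t reflects the state at t and
    becomes usable at t + transfer_duration. A disaster just before the
    next copy completes leaves only the previous one, which is one
    interval plus one transfer old.
    """
    return replication_interval + transfer_duration


def meets_rpo(replication_interval: timedelta,
              transfer_duration: timedelta,
              rpo: timedelta) -> bool:
    """True if the worst-case data loss stays within the RPO."""
    return worst_case_data_loss(replication_interval, transfer_duration) <= rpo


if __name__ == "__main__":
    interval = timedelta(hours=4)      # hypothetical replication frequency
    transfer = timedelta(minutes=30)   # hypothetical time to copy the changes
    print(worst_case_data_loss(interval, transfer))
    print(meets_rpo(interval, transfer, rpo=timedelta(hours=6)))
```

In other words, a four-hour replication schedule does not give you a four-hour RPO; the transfer time has to be added on top.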

This introduction still leaves quite a few questions unanswered but at first glance this definitely is an interesting piece of the puzzle, and deserves to be considered when designing a DR system. I need to get started with it.

Friday, May 16, 2014

Troubleshooting AWS' Elastic Load Balancers




Introduction

Amazon Web Services' Elastic Load Balancers are excellent, well, maybe not excellent, but still a very useful service. An ELB does the, well, load balancing, but in contrast with your own load balancer (e.g. HAProxy) it also takes care of scaling, fail-over and so on.
So if you have multiple servers, with not too complex routing requirements, ELBs are a very convenient way forward. I've also found myself using ELBs when exposing only a single server, because it helps satisfy security requirements and gives you some useful features on top of that. For instance, it allows you to easily check the health of a server on an application level, and to monitor the round-trip times of requests.
But anyway, that's not what I want to talk about. Using ELBs can also be a p@!n in the@$$. Troubleshooting connection problems has wasted quite a few days of my precious life, and I thought it would be good, at least for me, to have an overview of the things you need to check. So here you go...

Btw, I am assuming a VPC environment, as this provides the broadest choice of things that can go wrong.

Problems, possible causes

  1. Are the load balancers in the right subnet(s)?
    • It's probably just me, but I find the console and documentation of the ELB quite confusing. Even if the ELB is listed as internet facing, it is quite possible that it lives in a private VPC subnet. Go fix it, otherwise it will never work.
  2. Are the listeners properly configured?
    • Do you have listeners for the ports you try to connect to? Make sure they exist and have the right protocol (see next).
  3. Which protocols did you use?
    • For the listeners you can select HTTP, HTTPS and TCP. Depending on how your web site behaves, you might want to switch between (e.g.) HTTP and TCP to see if that makes a difference.
    • Usually this is not something that solves connection problems, but it might help in letting your application behave as it should.
  4. Do you have the proper ports opened in the Load Balancer's security group?
    • So for each listener, typically there should be a security group entry.
  5. Do you have the proper ports opened in the Server's security group?
    • The same applies to the Server's security group. For each protocol in the ELB listener and security group there should be a corresponding item in the server's security group.
    • However, the server's security group should typically limit access on these protocols to the load balancer only, in contrast with the security group for the load balancer itself.
    • Note that this is not necessarily the same port, as the ELB's listener can map the incoming port to another outgoing port.
  6. Do you have the health check properly configured? If not the servers will never come 'In Service'.
    • Check if you have the health check page installed on your web server.
      • And, the smarter this health check page is, the better of course...
    • Check if your health check page is accessible from the ELB. E.g. this might run on a different port.
    • Check the protocol of the health check. Preferably you use HTTP or HTTPS, but if the server runs HTTPS and you have the ELB configured in Pass-Through mode, you should use TCP for the health check configuration.
  7. Are the ELBs and the Server instances in the same Availability Zone?
    • Until the (recent) announcement of Cross-Zone load balancing, you had to make sure that both your load balancer and server instances lived in the same AZ(s). If not, it simply didn't work.
    • This is quite frustrating, even more so if you realise that the VPC wizard, without explicit action, quite often generates a public and a private subnet in different zones.
    • With Cross-Zone support this problem should not occur, but it might well be that you haven't activated it. So check it.
  8. Did you use the proper SSL configuration?
    • Just loading an SSL certificate (if you want to use one) is not enough. You need to disable SSLv2 and the weaker cipher suites to get a decent rating. Check http://www.ssllabs.com.
    • Again, this is not really a connectivity issue, but while we're on the subject...
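
The security group checks (items 4 and 5) and the Availability Zone check (item 7) boil down to comparing a few lists, which can be sketched roughly as follows. This is only an illustration under assumptions: the function names and the `from_port`/`to_port` dictionary layout are made up, and you would have to fill the lists yourself from the console or the API.

```python
def unopened_listener_ports(listener_ports, security_group_rules):
    """Return the ports that no ingress rule in the security group covers.

    listener_ports: iterable of port numbers to check.
    security_group_rules: list of dicts with hypothetical keys
        'from_port' and 'to_port' describing the allowed ingress ranges.
    """
    missing = []
    for port in listener_ports:
        covered = any(rule["from_port"] <= port <= rule["to_port"]
                      for rule in security_group_rules)
        if not covered:
            missing.append(port)
    return missing


def zones_without_load_balancer(elb_zones, instance_zones):
    """Return zones that host instances but are not enabled on the ELB."""
    return sorted(set(instance_zones) - set(elb_zones))


if __name__ == "__main__":
    # Hypothetical setup: listeners on 80 and 443, but only 80 is opened.
    print(unopened_listener_ports([80, 443],
                                  [{"from_port": 80, "to_port": 80}]))
    # Hypothetical setup: one instance lives in a zone the ELB does not serve.
    print(zones_without_load_balancer(["eu-west-1a"],
                                      ["eu-west-1a", "eu-west-1b"]))
```

The same port check works for the server side: pass in the ports the ELB forwards to, together with the rules of the server's security group.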
So this was a quick brain dump, I'm sure I have forgotten a few cases. If you know them, please leave them in the comments.

Monday, March 17, 2014

EC2 Instance in public VPC subnet not being able to communicate with outside world

When provisioning systems within Amazon Web Services, I often use CloudFormation templates that bootstrap a Chef client for further provisioning. For that bootstrapping part, we utilise the user-data mechanism.


I have this user-data bootstrapping script that has worked flawlessly for what seems like an eternity, but all of a sudden I got a call from a customer complaining that the script was failing.

It turned out that the apt-get update statement, which was one of the first statements in the script, could not reach the repositories. And the strange thing was that when I ran the script manually, it was ok.

It only happened in a particular topology where we had the server instance running in the public subnet and attached to an elastic IP. Usually, we don't use that topology, as I'd like to have the server in the private subnet and load distributed via an ELB, but for development purposes this topology is still useful.

Long story short: when running in the public subnet, the server needs to have an EIP (or another public IP address) attached, otherwise it cannot communicate with the outside world. The servers in the private subnet do not have that issue, as they rely on a NAT instance to relay their traffic.

It turned out that it takes a little while before that outgoing communication path for the server in the public subnet is established, and that the user-data script was sometimes (not always!) executed before the communication path was actually there. And then, obviously, it fails for things like apt-get update. A simple sleep, although it is a workaround, solved the problem.
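
A slightly more robust variation on the plain sleep is to poll until the outgoing path is actually there. The sketch below is not the original script (user-data scripts are usually shell); it only illustrates the retry idea in Python. The function names, the number of attempts, the delay and the repository host in the usage note are all assumptions.

```python
import socket
import time


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


def wait_for_network(probe, attempts=30, delay=2.0, sleep=time.sleep):
    """Call probe() until it returns True or the attempts run out.

    probe: zero-argument callable returning True once the network is up.
    sleep: injectable so the loop can be exercised without real waiting.
    Returns True as soon as probe() succeeds, False otherwise.
    """
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

At the top of the bootstrap this could be used as `wait_for_network(lambda: can_connect("archive.ubuntu.com", 80))`, with the host replaced by whatever repository your apt-get update talks to, and the script bailing out if it returns False.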

Patience is, as always, a virtue!

Tuesday, January 21, 2014

Best Practices for accessing AWS accounts - Quick Reference

Introduction

Now and then I run into the question of which best practices should apply to providing access to your AWS accounts. To a large extent, this information is readily available in resources across the internet (such as this and this), but I thought it would be useful to provide a quick reference.

Providing access to AWS services

| Practice | Rationale | Priority |
|---|---|---|
| Always use IAM users; do not use the AWS account user (or its access keys) | The AWS account user is considered the root user, providing access to all services and data. Normal operations should not be carried out using this account; instead, an IAM user with the appropriate privileges is recommended. | High |
| Use different AWS accounts for production and non-production purposes | This provides strong separation, and requires a re-login when moving from one account to another. This reduces the risk of unintended changes, and allows tailoring access rights for devops engineers. | High |
| Use RBAC practices using IAM groups, following a least-privilege model | Using roles (groups) provides a more transparent and manageable access rights model. | Medium |
| Enforce a strong password policy | Important to prevent passwords from being guessed or cracked. | High |
| Enable multi-factor authentication | Enable multi-factor authentication for both (human) IAM users and the account owner, as an additional layer of security. | High |
| Implement key rotation | It is advisable to ensure that access keys for IAM users are changed once in a while (say once every three months). For more info, see this blog. | Medium |
| Let users manage their own password | Ensure that users can manage their own passwords, and (strongly) encourage a password rotation schedule. | Medium |
| Use policy conditions for extra security | By means of policies, additional security can be enforced, for instance by defining that only particular users can delete a particular AWS resource. This is useful for fundamental services whose failure would have a great impact on the service. | Medium |
| Enable audit trails | Enabling AWS’ CloudTrail option enables security analysis, resource change tracking, and compliance auditing. | Medium |
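
For the key rotation practice, finding the keys that are due comes down to a date comparison. Here is a minimal sketch that takes "once every three months" as 90 days. The function name and the `id`/`created` dictionary layout are made up for illustration; in practice you would fill the list from whatever your IAM tooling reports.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "once every three months" is interpreted as 90 days.
MAX_KEY_AGE = timedelta(days=90)


def keys_due_for_rotation(access_keys, now, max_age=MAX_KEY_AGE):
    """Return the ids of access keys older than max_age.

    access_keys: list of dicts with hypothetical keys 'id' and 'created'
        (a timezone-aware datetime).
    now: timezone-aware datetime used as the reference point.
    """
    return [key["id"] for key in access_keys
            if now - key["created"] > max_age]


if __name__ == "__main__":
    keys = [
        {"id": "key-old", "created": datetime(2013, 9, 1, tzinfo=timezone.utc)},
        {"id": "key-new", "created": datetime(2013, 12, 15, tzinfo=timezone.utc)},
    ]
    print(keys_due_for_rotation(keys, now=datetime(2014, 1, 21, tzinfo=timezone.utc)))
```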

Providing access to the AWS hosted environments

| Practice | Rationale | Priority |
|---|---|---|
| Always use Virtual Private Clouds | This is the default in newly created AWS Accounts, but should be used in all cases as it adds a lot of additional options from a security perspective. For certain applications, a dedicated VPC can be considered. | High |
| Use temporary credentials | When providing your server instances access to AWS services, always deploy temporary credentials in combination with IAM roles. | High |
| Consider use of VPNs | Consider using a VPN connection between your corporate network and the VPC, as it provides a strongly secured connection. However, ensure that only qualified staff have access to that particular network zone. | Medium |
| Use remote access bastions | In case there is no VPN connectivity between the VPC and the corporate network, use (SSH or RDP) Bastions to get access to your server instances. Only run these bastions when access is required. | High |
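
To make the "temporary credentials in combination with IAM roles" practice a bit more concrete: a role that server instances can assume needs a trust policy naming the EC2 service as principal. The sketch below only builds and prints that document; the function name is made up, and attaching the actual permissions to the role is left out.

```python
import json


def ec2_trust_policy():
    """Build the trust policy that lets EC2 instances assume a role.

    With such a role attached to an instance (via an instance profile),
    the instance receives temporary credentials instead of long-lived
    access keys stored on disk.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(ec2_trust_policy(), indent=2))
```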