Monday, June 1, 2015

The art of troubleshooting

Recently I had to spend a few hours (days, weeks even) fixing a nagging issue that reared its head every now and then.

As always, the issue was difficult to reproduce; sometimes it seemed to go away, but unfortunately it always came back to drive users crazy. So it was only a matter of time before the shit hit the fan and the thing really got priority.

We got it fixed, but it still left me with a bit of a hangover. Why did it take so long, and why did we postpone it until there was so much frustration around before we actually got on with it?

After a bit of soul searching, I'll try to define a few principles that could be followed when dealing with such issues. On purpose, I won't be talking about technology; in fact, I guess this is useful in most disciplines.

Rule #1: Never ignore these recurring issues. Give them priority.
If an issue recurs once or twice, there is a very good chance it will keep on rearing its head. So acknowledge the issue and focus on finding its root cause; otherwise you'll be forced to wipe your agenda (typically when it doesn't suit you at all) after things have escalated.

Rule #2: Focus
Of course you need to prioritize, but fixing problems is not a part-time job. Focus on it, wipe your agenda and do not give up until you have found a solution.

Rule #3: Understand and co-operate across the entire chain
More and more, solutions are chains of services, and if you do not have an overview of the entire chain, there is a very big chance that fingers will be pointed between the various parties. You need to have that overview, work together and make sure everyone involved has the same sense of urgency.

Rule #4: Work in tandems
When working alone, there is a good chance you'll overlook things. Working in duos forces you to explain to your partner why something might or might not be an issue, which is a critical part of finding the root cause.

Rule #5: Brace yourself for the hangover
Fixing the problem is like a night out with friends: while you're doing it, it is great, but there is a good chance it feels different the day after. Why did it take so long? Why was it such a simple thing in the end? These are the sort of questions (accusations?) you will be asking yourself. That's the way it is; I guess you just have to live with it.

Is this rocket science? Far from it! Would it be helpful to live a bit more by these rules? Most definitely; at least I'll give it a try next time :-)

Friday, September 12, 2014

Utilising the cloud for disaster recovery purposes: server and data replication

Introduction

Disaster Recovery is one of the most interesting use cases for adopting cloud technology. Almost every company has to deal with it: those pesky accountants asking how you're dealing with potential disasters and how your IT is going to survive them.

The problem with DR is that it used to be a very costly exercise, very hard to get (and keep) right, and all that effort and money goes into something that you hope will never be used. A difficult business case indeed!

The pay-as-you-go model of cloud computing is an excellent option for this. You can prepare your entire backup solution in advance but, depending on your RTO (Recovery Time Objective) and RPO (Recovery Point Objective), only a minimal set of resources has to be active all the time, and hence only a minimal part of your backup infrastructure costs will be invoiced to you. Much better!

Disaster Recovery is much more than just IT, and even for the IT part, designing a disaster recovery system, whether cloud based or more traditional, is not to be underestimated. It definitely takes a lot more than a single blog post.

However, one of the things that is usually tricky is keeping the configurations of your server infrastructure, as well as the data, in sync between the different environments.

I just got off a phone call with some people from CloudLeap, where they demonstrated their product. It is quite an interesting product, and I figured it would be just as easy to capture my impressions in a blog post.

What is the problem?

Data is obviously not static. It changes all the time, and in order to have a useful disaster recovery solution, the data in your DR environment must be as up to date as possible. Cloud providers, and quite a few technology vendors, offer solutions for this. For instance, AWS Storage Gateway allows you to replicate on-premises data volumes to the cloud so they can be used as backups.
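
To make that a bit more concrete, here is a minimal boto3 sketch of triggering a snapshot of a Storage Gateway volume so it can serve as a restore point. This is purely my own illustration, not something from the product documentation: it assumes a gateway volume is already configured, and the region and volume ARN below are placeholders.

    import boto3

    # Minimal sketch: take a point-in-time snapshot of a Storage Gateway volume.
    # The snapshot ends up as a regular EBS snapshot that a DR environment can
    # restore from. Region and volume ARN are placeholders.
    storagegateway = boto3.client("storagegateway", region_name="eu-west-1")
    VOLUME_ARN = "arn:aws:storagegateway:eu-west-1:123456789012:gateway/sgw-12345678/volume/vol-12345678"

    def snapshot_volume(volume_arn):
        # Trigger the snapshot and return its id so it can be tracked or tagged.
        response = storagegateway.create_snapshot(
            VolumeARN=volume_arn,
            SnapshotDescription="DR restore point",
        )
        return response["SnapshotId"]

    if __name__ == "__main__":
        print("Created snapshot:", snapshot_volume(VOLUME_ARN))

Run this from a scheduler (cron, for instance) and you get a rolling set of restore points without keeping any DR servers running.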

However, the same applies, although to a lesser extent, to the servers themselves. Servers are not static either: they change because of updates, new software being installed and so on, and in order to have a reliable DR solution it is critical that these configurations stay in sync with the primary systems.

Solutions

To deal with this aspect, different solutions are available. For instance, if you are using dynamic provisioning tools such as Chef or Puppet, things will be a lot easier. However, getting started with these kinds of tools is not necessarily easy, and automating everything that you have configured in the past is quite a task.

From that perspective, CloudLeap is a solution that requires a lot less investment to get started, as it allows you to mirror your server configurations as well as your data to a cloud provider of your choice. It does this by simply copying the disks block by block to the target provider. Obviously the mammoth AWS is supported, but quite a few others are as well (Azure support is still on its way, though).

CloudLeap can be used for a one-time migration of server (and data) images, and optionally it can also keep them in sync by replicating them periodically. The big difference with the likes of AWS Storage Gateway is that it turns your mirrors (where applicable, of course) into bootable volumes that can be launched on request, basically providing a point-in-time copy of your server and data.
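
In plain AWS terms, "launching the mirror on request" boils down to something like the sketch below. This is my own illustration of the idea, not how CloudLeap itself does it: it assumes the replication tooling has left you with a bootable AMI, and the AMI ID, subnet and instance type are placeholders.

    import boto3

    # Illustration only: boot a point-in-time copy of a replicated server inside
    # the DR VPC, assuming replication has produced a bootable AMI. All IDs are
    # placeholders.
    ec2 = boto3.client("ec2", region_name="eu-west-1")

    def launch_recovery_instance(image_id, subnet_id, instance_type="m3.medium"):
        # Launch a single instance from the replicated image in the DR subnet.
        response = ec2.run_instances(
            ImageId=image_id,
            InstanceType=instance_type,
            SubnetId=subnet_id,
            MinCount=1,
            MaxCount=1,
        )
        return response["Instances"][0]["InstanceId"]

    if __name__ == "__main__":
        print("Recovery instance:", launch_recovery_instance("ami-12345678", "subnet-12345678"))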

It requires the installation of an agent on the source servers and uses a (single-tenant) management console that controls these agents, specifying things such as which volumes to replicate and at what frequency.

This introduction still leaves quite a few questions unanswered, but at first glance this definitely is an interesting piece of the puzzle and deserves to be considered when designing a DR system. I need to get started with it.

Friday, May 16, 2014

Troubleshooting AWS' Elastic Load Balancers

Introduction

Amazon Web Services' Elastic Load Balancer is excellent. Well, maybe not excellent, but still a very useful service. It does do the, well, load balancing, but in contrast with running your own (e.g. HAProxy) load balancer it also takes care of scaling, fail-over and so on.
So if you have multiple servers with not too complex routing requirements, ELBs are a very convenient way forward. I've also found myself using ELBs when exposing only a single server, because it helps satisfy security requirements and gives you some useful features on top of that. For instance, it allows you to easily check the health of a server at the application level and to monitor the round-trip times of requests.
But anyway, that's not what I want to talk about. Using ELBs can also be a p@!n in the @$$. Troubleshooting connection problems has wasted quite a few days of my precious life, and I thought it would be good, at least for me, to have an overview of things you need to check. So here you go...

Btw, I am assuming a VPC environment, as this provides the broadest choice of things that can go wrong.

Problems, possible causes

  1. Are the load balancers in the right subnet(s)?
    • Probably it's just me, but I find the ELB console and documentation quite confusing. Even if the ELB is listed as internet-facing, it is quite possible that it lives in a private VPC subnet. Go fix it; otherwise it will never work.
  2. Are the listeners properly configured?
    • Do you have listeners for the ports you are trying to connect to? Make sure they exist and have the right protocol (see next).
  3. Which protocols did you use?
    • For the listeners you can select HTTP, HTTPS and TCP. Depending on how your web site behaves, you might want to switch between (e.g.) HTTP and TCP to see if that makes a difference.
    • Usually this is not something that solves connection problems, but it might help in making your application behave as it should.
  4. Do you have the proper ports opened in the Load Balancer's security group?
    • For each listener there should typically be a corresponding security group entry.
  5. Do you have the proper ports opened in the Server's security group?
    • The same applies to the server's security group: for each protocol in the ELB listener and security group there should be a corresponding entry in the server's security group.
    • However, the server's security group should typically limit access on these protocols to the load balancer only, in contrast with the security group of the load balancer itself.
    • Note that this is not necessarily the same port, as the ELB's listener can map the incoming port to a different outgoing port.
  6. Do you have the health check properly configured? If not, the servers will never come 'In Service'. (A small inspection sketch follows after this list.)
    • Check that the health check page is installed on your web server.
      • And the smarter this health check page is, the better, of course...
    • Check that your health check page is accessible from the ELB; e.g. it might run on a different port.
    • Check the protocol of the health check. Preferably you use HTTP or HTTPS, but if the server runs HTTPS and you have the ELB configured in pass-through mode, you should use TCP for the health check configuration.
  7. Are the ELB and the server instances in the same Availability Zone(s)?
    • Until the (recent) announcement of cross-zone load balancing, you had to make sure that both your load balancer and your server instances live in the same AZ(s). If not, it simply doesn't work.
    • This is quite frustrating, even more so if you realise that, without explicit action, the VPC wizard quite often generates a public and a private subnet in different zones.
    • With cross-zone support this problem should not occur, but it might well be that you haven't activated it, so check.
  8. Did you use the proper SSL configuration?
    • Just loading an SSL certificate (if you want to use one) is not enough. You need to disable SSLv2 and the weaker cipher suites to get a decent rating. Check your setup with http://www.ssllabs.com.
    • Again, this is not really a connectivity issue, but while we're on the subject...
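Since most of the items above come down to reading back the ELB's configuration, a small boto3 sketch like the one below can save some clicking around in the console. It is my own snippet, not part of the original checklist; the load balancer name and region are placeholders, and it assumes a classic ELB.

    import boto3

    # Inspect a classic ELB: listeners, subnets, AZs, security groups, health
    # check, cross-zone setting and instance health. Name and region are
    # placeholders.
    elb = boto3.client("elb", region_name="eu-west-1")
    ELB_NAME = "my-load-balancer"

    description = elb.describe_load_balancers(
        LoadBalancerNames=[ELB_NAME]
    )["LoadBalancerDescriptions"][0]
    print("Subnets:           ", description["Subnets"])
    print("Availability Zones:", description["AvailabilityZones"])
    print("Security groups:   ", description["SecurityGroups"])
    print("Health check:      ", description["HealthCheck"])
    for listener in description["ListenerDescriptions"]:
        print("Listener:          ", listener["Listener"])

    attributes = elb.describe_load_balancer_attributes(
        LoadBalancerName=ELB_NAME
    )["LoadBalancerAttributes"]
    print("Cross-zone enabled:", attributes["CrossZoneLoadBalancing"]["Enabled"])

    # Which instances are actually 'InService', and why not if they aren't.
    for state in elb.describe_instance_health(LoadBalancerName=ELB_NAME)["InstanceStates"]:
        print(state["InstanceId"], state["State"], state["Description"])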
So this was a quick brain dump; I'm sure I have forgotten a few cases. If you know of any, please leave them in the comments.

Monday, March 17, 2014

EC2 Instance in public VPC subnet not being able to communicate with outside world

When provisioning systems within Amazon Web Services, I often use CloudFormation templates that bootstrap a Chef client for further provisioning. For that bootstrapping part, we utilise the user-data mechanism.


I have a user-data bootstrapping script that had worked flawlessly for what seems like an eternity, but all of a sudden I got a call from a customer complaining that the script was failing.

It turned out that the apt-get update statement, one of the first statements in the script, could not reach the repositories. The strange thing was that when I ran the script manually, it worked fine.

It only happened in a particular topology where we had the server instance running in the public subnet and attached to an Elastic IP. Usually we don't use that topology, as I prefer to have the server in a private subnet with load distributed via an ELB, but for development purposes it is still useful.

Long story short: when running in the public subnet, the server needs to have an EIP attached, otherwise it cannot communicate with the outside world. The servers in the private subnet do not have that issue, as they rely on a NAT instance to relay their communications.

It turned out that it takes a little while before that outgoing communication path for a server in the public subnet is established, and that the user-data script was sometimes (not always!) executed before the communication path was actually there. And then, obviously, things like apt-get update fail. A simple sleep, although only a workaround, solved the problem.
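
A slightly more robust variant of that workaround is to wait explicitly for outbound connectivity before kicking off apt-get. The sketch below is my own illustration, not the actual bootstrap script; the host, port and retry counts are arbitrary choices.

    #!/usr/bin/env python3
    # Wait until outbound connectivity is there before running apt-get update.
    # Host, port and retry counts are arbitrary placeholders.
    import socket
    import subprocess
    import time

    def wait_for_network(host="archive.ubuntu.com", port=80, attempts=30, delay=5):
        # Poll an external endpoint until it answers, or give up after `attempts` tries.
        for _ in range(attempts):
            try:
                socket.create_connection((host, port), timeout=5).close()
                return True
            except OSError:
                time.sleep(delay)
        return False

    if wait_for_network():
        subprocess.check_call(["apt-get", "update"])
    else:
        raise RuntimeError("No outbound connectivity, giving up")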

Patience is, as always, a virtue!