Friday, May 16, 2014

Troubleshooting AWS' Elastic Load Balancers




Introduction

Amazon Web Services' Elastic Load Balancers (ELBs) are excellent, well, maybe not excellent, but still a very useful service. An ELB does the, well, load balancing, but in contrast with running your own load balancer (e.g. HAProxy), it also takes care of scaling, fail-over and so on.
So if you have multiple servers without overly complex routing requirements, ELBs are a very convenient way forward. I've also found myself using ELBs when exposing only a single server, because it helps satisfy security requirements and gives you some useful features on top of that. For instance, it allows you to easily check the health of a server at the application level, and to monitor the round-trip times of requests.
But anyway, that's not what I want to talk about. Using ELBs can also be a p@!n in the @$$. Troubleshooting connection problems has wasted quite a few days of my precious life, and I thought it would be good, at least for me, to have an overview of the things you need to check. So here you go...

Btw, I am assuming a VPC environment, as this provides the broadest choice of things that can go wrong.

Problems, possible causes

  1. Are the load balancers in the right subnet(s)?
    • Maybe it's just me, but I find the ELB console and documentation quite confusing. Even if the ELB is listed as internet-facing, it is entirely possible that it lives in a private VPC subnet (one without a route to an internet gateway). Go fix it, otherwise it will never work.
  2. Are the listeners properly configured?
    • Do you have listeners for the ports you are trying to connect to? Make sure they exist and have the right protocol (see next).
  3. Which protocols did you use?
    • For the listeners you can select HTTP, HTTPS and TCP. Depending on how your web site behaves, you might want to switch between (e.g.) HTTP and TCP to see if that makes a difference.
    • Usually this will not fix connection problems, but it might help your application behave as it should.
  4. Do you have the proper ports opened in the Load Balancer's security group?
    • For each listener there should typically be an inbound entry for that port in the load balancer's security group.
  5. Do you have the proper ports opened in the Server's security group?
    • The same applies to the server's security group. For each listener on the ELB (and its entry in the ELB's security group) there should be a corresponding entry in the server's security group.
    • However, the server's security group should typically limit access on these ports to the load balancer only, in contrast with the security group for the load balancer itself.
    • Note that this is not necessarily the same port, as the ELB's listener can map the incoming port to a different port on the server.
  6. Do you have the health check properly configured? If not, the servers will never come 'In Service'.
    • Check that you have the health check page installed on your web server.
      • And, the smarter this health check page is, the better of course...
    • Check that your health check page is accessible from the ELB. E.g. it might run on a different port than the rest of your application.
    • Check the protocol of the health check. Preferably use HTTP or HTTPS, but if the server runs HTTPS and you have the ELB configured in pass-through mode, you should use TCP for the health check configuration.
  7. Are the ELBs and the server instances in the same Availability Zone?
    • Until the (recent) announcement of cross-zone load balancing, you had to make sure that both your load balancer and your server instances lived in the same AZ(s). If not, it simply didn't work.
    • This is quite frustrating, even more so once you realise that the VPC wizard, unless you explicitly tell it otherwise, quite often generates a public and a private subnet in different zones.
    • With cross-zone support this problem should not occur, but it might well be that you haven't activated it. So check it.
  8. Did you use the proper SSL configuration?
    • Just loading an SSL certificate (if you want to use that) is not enough. You need to disable SSLv2 and the weaker cipher suites to get a decent rating. Check http://www.ssllabs.com.
    • Again, this is not really a connectivity issue, but while we're on the subject...
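
To make the port and zone checks (items 4 to 7) a bit more concrete, here is a minimal sketch in Python. It does not talk to AWS at all: the dictionaries are a simplified description of a setup that I made up for this example, and the field names (`listeners`, `sg_open_ports`, `health_check_port`, `azs`, `cross_zone`) are hypothetical, not the shape of any AWS API response. The zone check simply mirrors item 7 as written above.

```python
def check_elb_setup(elb, server):
    """Return a list of problems found in a (hypothetical) ELB/server description."""
    problems = []

    for listener in elb["listeners"]:
        elb_port = listener["elb_port"]
        instance_port = listener["instance_port"]

        # Item 4: each listener port must be open in the ELB's security group.
        if elb_port not in elb["sg_open_ports"]:
            problems.append("ELB security group blocks listener port %d" % elb_port)

        # Item 5: the port the ELB forwards to (not necessarily the same port)
        # must be open in the server's security group.
        if instance_port not in server["sg_open_ports"]:
            problems.append(
                "server security group blocks instance port %d" % instance_port)

    # Item 6: the health check may use yet another port, which the server
    # must accept as well.
    hc_port = elb["health_check_port"]
    if hc_port not in server["sg_open_ports"]:
        problems.append("server security group blocks health check port %d" % hc_port)

    # Item 7: without cross-zone load balancing, the server has to live in
    # one of the ELB's Availability Zones.
    if not elb["cross_zone"] and server["az"] not in elb["azs"]:
        problems.append("server is in %s but the ELB only covers %s"
                        % (server["az"], ", ".join(sorted(elb["azs"]))))

    return problems


# Made-up example values, purely for illustration.
example_elb = {
    "listeners": [{"elb_port": 443, "instance_port": 8443}],
    "sg_open_ports": {443},
    "health_check_port": 8080,
    "azs": {"eu-west-1a"},
    "cross_zone": False,
}
example_server = {"sg_open_ports": {8443}, "az": "eu-west-1b"}

for problem in check_elb_setup(example_elb, example_server):
    print(problem)
```

With the example values, the listener ports are fine, but the sketch reports two problems: the health check port is closed on the server, and the server sits in a zone the ELB does not cover. In real life you would fill these dictionaries from the console or an SDK; how you do that is outside the scope of this sketch.
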
So this was a quick brain dump; I'm sure I have forgotten a few cases. If you know of any, please leave them in the comments.