Monday, March 17, 2014

EC2 instance in a public VPC subnet unable to communicate with the outside world

When provisioning systems within Amazon Web Services, I often use CloudFormation templates that bootstrap a Chef client for further provisioning. For that bootstrapping part, I use the user-data mechanism.


I have a user-data bootstrapping script that has worked flawlessly for what seems like an eternity, but all of a sudden I got a call from a customer complaining that the script was failing.

It turned out that the apt-get update statement, which is one of the first statements in the script, could not reach the repositories. The strange thing was that when I ran the script manually, it worked fine.

It only happened in a particular topology, where the server instance runs in the public subnet with an Elastic IP attached. We don't usually use that topology, as I prefer to have the server in the private subnet with load distributed via an ELB, but for development purposes it is still useful.

Long story short: when running in the public subnet, the server needs a public address (in our case an EIP); otherwise it cannot communicate with the outside world. Servers in the private subnet do not have that issue, as they rely on a NAT instance to relay their outbound traffic.

It turned out that it takes a little while before the outgoing communication path for a server in the public subnet is established, and that the user-data script was sometimes (not always!) executed before that path was actually there. When that happens, anything that needs the network, such as apt-get update, obviously fails. That also explains why running the script manually later worked fine. A simple sleep at the start of the script, although only a workaround, solved the problem.
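A fixed sleep either waits too long or not long enough, so a polling loop may be a slightly more robust variant of the same workaround. The following is only a minimal sketch, not the script I actually used: the function name retry, its arguments, and the retry counts are all made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: run a command repeatedly until it succeeds
# or the attempts run out.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
    local max_attempts="$1" delay="$2"
    shift 2
    local attempt=1
    # "$@" is the command to probe with; loop until it exits 0.
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay}s: $*" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

In the user-data script this could be called before the first network-dependent statement, for example `retry 30 5 curl --silent --fail --head --max-time 5 --output /dev/null http://archive.ubuntu.com/` followed by `apt-get update`. That line assumes curl is present on the image and that the instance uses the Ubuntu archive; substitute whatever repository host your image is configured for. Probing with curl rather than retrying apt-get update itself is deliberate, since apt-get update can exit successfully even when some repositories could not be fetched.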

Patience is, as always, a virtue!