Saturday, September 29, 2012

Why I ditched my Apple MacBook and went back to Windows

I am the owner of a heterogeneous IT landscape. Windows, OSX, iOS, Android, Linux and even good old Symbian are all still in the house, and I'm perfectly fine with that.

About three years ago I bought my first MacBook Pro, assuming I would be very happy with it. Sure, it would take a bit of time to get used to the specifics of OSX, but hey, that would only make it a bit more interesting.

And coming from a corporate laptop running Windows XP, completely bogged down by all the crapware running on it and suffering from the dreaded 'things are getting slower and slooower and slooooower' Windows problem, I was delighted by the speed and stability of OSX.

Rebooting? Hardly any need for it. Waking up? Almost instant. Searching? Spectacularly useful and responsive. Hardware? Rock solid.

And still, after a while I noticed I didn't feel entirely comfortable with it, which was strange given the wide range of OSes I am usually exposed to. The most-used part of the OS is the interface to the file system, and the Finder turned out to be a big disappointment. It feels clumsy, and is really no match for its Windows counterpart. Cutting and pasting a file? Forget about it (I know, you can buy an extension, but still). Creating a new file once you have navigated to a particular directory? Nope. Minor things, but they simply don't help the overall experience.

The other thing I use a lot (and who doesn't?) is Office. I would really have liked to use something like OpenOffice, but given that 99.99% of the world uses Microsoft Office, that's what I settled on, trying to avoid conversion problems. Well, Office on the Mac is really rubbish compared to its Windows counterpart. Obviously this is not Apple's fault, but still: it doesn't help me get comfortable on the platform.

Maximizing a window: let OSX decide how big is big enough? Except that it doesn't work properly. (I know, I didn't upgrade from 10.6, and things were supposed to be better after that.)

Taking a screenshot of a single window: Command+Shift+4, then spacebar, then click the window. Are you kidding me? This is supposed to be simple and intuitive? I could go on, but I think you get the point.

But OSX is great as a development box, as it ships with SVN, SSH, Apache and all that kind of thing out of the box. Yes, I like the bash shell and having native access to SSH, but this is pretty much offset by the lack of tools that are available on Windows only, like TortoiseSVN and VMware Player (much better than VirtualBox). And if you occasionally need Windows-only tools (e.g. SQL Server, Enterprise Architect and so on), you obviously need a virtual machine on OSX, while you wouldn't on a Windows box; that doesn't help either.

After having worked almost exclusively on my MacBook for nearly three years, I started a new assignment and got a Windows desktop which was (still) fast, running MS Office and the like. After a few weeks of using my laptop and this Windows desktop more or less side by side, I couldn't do anything but admit: I simply like Windows better. OSX is rock solid but still not there yet from a user-experience perspective. And it's not just me: since I spoke the unspeakable, I have run into quite a few people who admitted they were struggling with their shiny MacBooks and didn't like them either.

So that's why I ended up back in the Windows world. I bought myself a new, flashy ultrabook and I am delighted to be rid of OSX. I do miss the hardware though; the three-year-old MacBook Pro is still an excellent piece of equipment, outshining this new (Asus) ultrabook in many ways (as a matter of fact, I am typing this on my MacBook running Windows 7 via Boot Camp).

If a MacBook shipped natively with Windows, that would be my choice, but as that will never happen, I don't see myself buying a new MacBook anytime soon. Windows it is then; I just need to find a way to keep it fast.

Friday, September 21, 2012

Building managed Service Stacks using CloudFormation

Infrastructure as Code

With the advance of cloud computing, we have reached a point where the hard infrastructure really can be treated as software, as code.

Networks, servers, firewalls, DNS registrations and so on no longer require the physical handling they needed not so long ago; they can be managed by running a script. Hardware goes soft!

This is a very interesting development, but it brings challenges that will sound very familiar to the software development community. How do we keep these building blocks maintainable and reusable? How do we deal with relatively quickly changing versions of this infrastructure configuration? How do we minimize dependencies? And so on.

For the developers among us this sounds pretty straightforward, but that is not necessarily the case for the guys and gals without a software or scripting background.

Amazon's CloudFormation

One of the questions I have run into quite a few times lately is how to structure these building blocks when using Amazon's (great) CloudFormation service. According to AWS, "AWS CloudFormation gives developers and systems administrators an easy way to create and manage a collection of related AWS resources, provisioning and updating them in an orderly and predictable fashion."

Using CloudFormation, AWS resources are managed by specifying the needed resources (declaratively) in a JSON-formatted template. Through the management console or command line interface, a so-called Stack can then be instantiated by providing this template and a set of input parameters. The stack can be updated by providing a revised template or changed parameters, and CloudFormation takes care of propagating these changes to the actual infrastructure components. CloudFormation even provides tools to configure (bootstrap) the actual server instances, but its main focus is on the infrastructure elements themselves.
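As a minimal sketch of what such a template looks like (the resource names, parameter and AMI ID below are made up purely for illustration), here is one that declares a single web server plus its firewall rules:

{
  "AWSTemplateFormatVersion" : "2010-09-09",
  "Description" : "Minimal example: one web server plus its firewall rules",
  "Parameters" : {
    "InstanceType" : {
      "Type" : "String",
      "Default" : "t1.micro",
      "Description" : "EC2 instance type for the web server"
    }
  },
  "Resources" : {
    "WebSecurityGroup" : {
      "Type" : "AWS::EC2::SecurityGroup",
      "Properties" : {
        "GroupDescription" : "Allow inbound HTTP only",
        "SecurityGroupIngress" : [
          { "IpProtocol" : "tcp", "FromPort" : "80", "ToPort" : "80", "CidrIp" : "0.0.0.0/0" }
        ]
      }
    },
    "WebServer" : {
      "Type" : "AWS::EC2::Instance",
      "Properties" : {
        "ImageId" : "ami-12345678",
        "InstanceType" : { "Ref" : "InstanceType" },
        "SecurityGroups" : [ { "Ref" : "WebSecurityGroup" } ]
      }
    }
  }
}

Feeding this template to CloudFormation creates both resources in one go; deleting the stack removes them again.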

All very nice and dandy, but the question I keep running into is how to structure these CloudFormation stacks. One big stack containing everything? Each individual resource in its own stack? Or something in between?

This sounds remarkably similar to many discussions in the software engineering arena. Remember object orientation? Component-based development? Service-oriented architectures?

Well, in fact it bears a lot of resemblance, and I truly believe we should embrace those best practices rather than reinventing the wheel.

Unfortunately we don't have the level of sophistication that real software development languages and tools provide, but given the options we do have, we can achieve a reasonable level of isolation and re-usability.

Meet the Managed Service Stacks.

The core of such an infrastructure is typically a server or a set of servers. Such a server implements one or more roles (e.g. web server, app server, database server) and can be used in different setups (e.g. development, test, production).

However, in order to let this server do what it is supposed to do, a lot more resources are needed. A few examples include:

  • We need firewall configurations (security groups) to open up only the necessary ports to a particular set of clients.
  • We want this server to be accessible by name rather than IP, hence the DNS must be configured, including a fixed (elastic) IP address.
  • Maybe we do not want one server, but a flexible, automatically scalable set of servers. We're in the cloud after all, aren't we? And sure, we need to load balance these servers as well.
  • We want to monitor these servers for health and availability and want to be informed if things are getting out of hand.
Suppose we have a rather traditional system setup consisting of:
  • one or more web servers;
  • one or more app servers;
  • one database cluster.
I prefer to model the CloudFormation stacks along these three server types, rather than putting everything together in one stack. Each server type gets its own stack (the Service Stack) containing all the elements those servers need to fulfil their role.

The service stack for the web server group could look like this:
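As a sketch, the Resources section of such a web server service stack might declare the following (the resource names are illustrative and the property details are elided; this is an outline, not a deployable template):

{
  "Resources" : {
    "WebLaunchConfig"   : { "Type" : "AWS::AutoScaling::LaunchConfiguration", ... },
    "WebServerGroup"    : { "Type" : "AWS::AutoScaling::AutoScalingGroup", ... },
    "WebLoadBalancer"   : { "Type" : "AWS::ElasticLoadBalancing::LoadBalancer", ... },
    "WebSecurityGroup"  : { "Type" : "AWS::EC2::SecurityGroup", ... },
    "WebDnsRecord"      : { "Type" : "AWS::Route53::RecordSet", ... },
    "WebCpuAlarmHigh"   : { "Type" : "AWS::CloudWatch::Alarm", ... }
  }
}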
The stack contains all the resources needed for the web server to provide its core services. All changes (e.g. a different scaling policy, an additional DNS name) are managed through this service stack.

The overall environment (let's assume a production environment) of this straightforward setup then consists of a number of Service Stacks, one for each server type, plus possibly an environment stack that contains the setup of the network (VPC, subnets, NAT instance and so on).

Conclusion

Infrastructure as Code brings challenges that are very familiar to the software engineers among us. Following the concepts pioneered in the good old days of component-based development helps in splitting the configuration of CloudFormation stacks into manageable chunks. Unfortunately the tools are not very sophisticated yet and lack core features for maximising code and configuration re-use, but they can still be immensely useful in dealing with large-scale deployments.

Wednesday, September 12, 2012

Setting up a central log analysis architecture (with syslog and Splunk)

Introduction

The larger the system, the bigger the headache when troubleshooting problems. One of the things that really helps, and is relatively easy to achieve, is making sure that all logs are accessible from a central place and can be analysed and queried in real time.

Setting this up is easier than you might think, and in this post I will walk you through the process using rsyslog (the native syslog daemon on many Linux distros these days) and Splunk.

Overview

The diagram below shows a high-level overview of the architecture.


Typically there will be many (logging) clients in the solution; examples are the web and application servers, database servers, management servers and so on. These clients run one or more applications which either log their messages directly to the (local) syslog daemon and/or write them to one or more log files. The syslog daemon is by default configured to write incoming log messages to a number of local log files, but it can easily be configured to forward these messages to a remote syslog server.

This log server takes the inbound messages and stores them in a convenient folder structure. These local log files can then be indexed by the Splunk server, which allows for very powerful analysis of the data through a web interface.

If the applications on the clients cannot write to the syslog daemon but write to local log files instead, the rsyslog daemon can be configured to monitor these log files and submit their contents as syslog messages to the log server. Not all syslog daemons can do this though, and even rsyslog has limited capabilities in this regard; e.g. the name of the input log file must be static. Another (optional) way of forwarding messages to the Splunk server is the Splunk Forwarder. Personally I prefer syslog, as I feel it is a leaner and more proven method and all messages on the log server are handled the same way, but it is always good to have an alternative, right?

Configuration

Let's start with setting up the central log server. We are assuming an Ubuntu 12.04 instance, which comes with rsyslog by default, but the setup on other flavours should be identical or similar.

Accept inbound messages from remote servers

To accept inbound messages from remote servers, ensure that in /etc/rsyslog.conf the following configs are present:
### Load TCP and UDP modules
$ModLoad imtcp
$ModLoad imudp


Rsyslog supports templates and rulesets, which allow you to specify how particular messages must be dealt with. In this case we distinguish between incoming messages from remote clients and local messages, by defining a separate ruleset for each and a template that determines where the remote messages are stored.

### Templates
# log every host in its own directory
$template RemoteHost,"/mnt/syslog/hosts/%HOSTNAME%/%$YEAR%/%$MONTH%/%$DAY%/%syslogfacility-text%.log"

### Rulesets
# Local Logging
$RuleSet local
# Follow own preferences here....

# use the local RuleSet as default if not specified otherwise
$DefaultRuleset local

# Remote Logging
$RuleSet remote
*.* ?RemoteHost  


Then bind these rulesets to the listeners:

### Listeners
# bind ruleset to tcp listener and activate it
$InputTCPServerBindRuleset remote
$InputTCPServerRun 5140
$InputUDPServerBindRuleset remote
$UDPServerRun 514


That's it: all messages coming in on the TCP or UDP listeners will now be stored in their own directory structure, conveniently grouped by host and date.

The syslog daemons on the clients in turn need to be configured to send their messages to this syslog server.

In this case we follow the convention of adding configuration snippets in the /etc/rsyslog.d directory rather than modifying the /etc/rsyslog.conf file. By default, all *.conf files in this directory are included.

In order to read a number of log files and process them as syslog messages, we add the following config file:

File: /etc/rsyslog.d/51-read-files.conf
#  Read a few files and send these to the central server.
#

# Load module
$ModLoad imfile #needs to be done just once
# Nginx Access log
$InputFileName /var/log/nginx/access.log
$InputFileTag nginx-access:
$InputFileStateFile stat-nginx-access
$InputFileSeverity info
$InputFileFacility local0
$InputRunFileMonitor
# Nginx Error log
$InputFileName /var/log/nginx/error.log
$InputFileTag nginx-error:
$InputFileStateFile stat-nginx-error
$InputFileSeverity error
$InputFileFacility local0
$InputRunFileMonitor


This will pick up the nginx access and error log files and process them as syslog messages.

To forward these messages to the syslog server, we add the following file:

File: /etc/rsyslog.d/99-forward.conf
#  Forward all messages to central syslog server.
#
$ActionQueueType LinkedList   # use asynchronous processing
$ActionQueueFileName srvrfwd   # set file name, also enables disk mode
$ActionResumeRetryCount -1     # infinite retries on insert failure
$ActionQueueSaveOnShutdown on # save in-memory data if rsyslog shuts down
*.* @@log.mydomain.com:5140 # Do the actual forward using TCP (@@) 


This will forward all messages using TCP to the log server. UDP (using one @) or a high reliability protocol (RELP, using :relp:) can be used as well.

After restarting the syslog daemon, this basically takes care of collecting all log files in one central location. This in itself is already very useful.
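To verify the whole chain, a quick test from any client could look like this (the host name appsrv1 and the date path simply follow the fictitious examples used in this post):

# on the client: restart rsyslog and emit a test message
sudo service rsyslog restart
logger -t mytest "Hello central logging"

# on the log server: the message should show up under the client's host name
# (logger uses the 'user' facility by default, hence the user.log file)
tail /mnt/syslog/hosts/appsrv1/2012/09/12/user.log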

But having Splunk running on top of this is even more powerful. Splunk can be downloaded for evaluation purposes and used as a free version (with a few feature limitations) up to an indexing capacity of 500 MB per day. Note that it does not limit the total index size, only the daily volume.

Once installed, it allows you to log in, change your password and add data to the Splunk indexer. After all, this is what you want to do.

After clicking Add Data, you'll be greeted with the following screen:



On this page, select 'From files and directories'. This takes you to the Preview data dialogue, which lets you see a preview of the data before adding it to a Splunk index. Select Skip preview and click Continue.


This takes you to the Home > Add data > Files & directories > Add new view. Select the default source (Continuously index data from a file or directory this Splunk instance can access) and fill in the path to your data.

Normally you would index everything in a particular subdirectory (e.g. /mnt/syslog/hosts/appsrv1/2012/09/12/*) or set of directories (e.g. /mnt/syslog/hosts/.../*). It might be useful to add the individual files one by one in order to define how each of them is dealt with by Splunk.

Now select More Settings.

This enables you to override Splunk's default settings for Host, Source type, and Index. To automatically determine the host based on the path in which the log file is stored, select 'segment in path' with value 4.



Note this has to match the value as specified in the rsyslog template definition:
$template RemoteHost,"/mnt/syslog/hosts/%HOSTNAME%/%$YEAR%/%$MONTH%/%$DAY%/%syslogfacility-text%.log"

What about the Source type and Index settings? The source type of an event tells you what kind of data it is, usually based on how it's formatted; examples are access_combined or cisco_syslog. This classification lets you search for the same type of data across multiple sources and hosts. The index setting tells Splunk where to store the data. By default it goes into main, but you might want to consider partitioning your data into different indexes if you have many types.
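For those who prefer configuration files over clicking through the UI, the same monitor can be defined in inputs.conf and activated with a Splunk restart. A sketch, assuming a default Splunk install:

File: $SPLUNK_HOME/etc/system/local/inputs.conf
# monitor the complete central log tree
[monitor:///mnt/syslog/hosts]
# take the host name from the 4th path segment: /mnt(1)/syslog(2)/hosts(3)/<host>(4)
host_segment = 4
sourcetype = syslog
index = main
disabled = false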

Click Save and you are ready to open the search app. If you want to get your feet wet with the search app, have a look at this tutorial. Happy splunking!

Wednesday, September 5, 2012

The Economics of the Cloud

Two sides of a story

"We're gonna save lots of money in the cloud!".

Well, there you have it. If you want to save money (and who doesn't?) and you have one or more IT applications that can live in a cloudy world, then this is the way to go, isn't it? After all, if you look at the pricing of, for instance, Amazon Web Services, you can have a server instance for as little as $0.02 per hour. Two pennies! Who could ever compete with that?

Of course this sounds very attractive but reality is usually a bit less rosy.

"Wow, this Amazon service is really expensive!" 

This is not an unusual reaction after running an application in the cloud for a few months, once the actual costs become more visible. Running a full-fledged system 24/7 in the cloud is certainly not free, and the costs associated with it are significant. That can come as a nasty surprise when the first bill arrives.

In my view, both extremes stem from a lack of insight into:
  • what exactly is needed to support a cloud-based application, and/or
  • the total cost of ownership of on-premise solutions.
Below, I will provide a few (non-exhaustive) pointers that might be helpful in comparing the costs of both options.

Context

Before trying to answer these questions, first some context. Cloud computing is a very broad term applied to a wide variety of services. To keep things simple, we focus on a specific type of cloud service: the type where you rent computing capacity and storage, usually known as Infrastructure as a Service (IaaS). This is the type of cloud computing with the lowest abstraction level, meaning you have to manage most of the stack yourself.



The good thing about IaaS, and what makes it the most popular option on the market today, is that it is very flexible and requires little migration. Basically, what you run on-premise you can typically run in an IaaS cloud as well (at least from a technical perspective).

So what about this rosy picture?

When moving to the cloud it is easy to get blinded by the stunningly low advertised prices, for instance $0.02 per hour for a Micro server instance. That is, however, just part of the story.

Nothing is for free


Taking Amazon as an example, it soon becomes evident that (almost) nothing is for free. You need storage? Pay for it. Use your storage? Here's the bill. Backup? This is what it costs. You want to restore your backup? Pay for it! Monitoring alerts? Well, you get the point.

All these individual cost components are priced very reasonably, but they add up and make things significantly more expensive than you initially thought when you read about those 2 cents per hour.
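To illustrate how it adds up, here is a back-of-the-envelope calculation for a single small server; all prices below are made-up examples, not actual AWS rates:

# Illustrative monthly bill for one small server instance.
# All prices are hypothetical examples, NOT actual AWS rates.
hours_per_month = 24 * 30

costs = {
    "instance (0.02 $/hour)": 0.02 * hours_per_month,
    "block storage (50 GB)":  5.00,
    "storage I/O requests":   3.50,
    "snapshots / backups":    2.50,
    "bandwidth out":          4.00,
    "monitoring / alerts":    3.50,
}

for item, dollars in costs.items():
    print(f"{item:<24} ${dollars:6.2f}")
print(f"{'total':<24} ${sum(costs.values()):6.2f}")

With these made-up numbers, the '2 cents per hour' server ends up costing more than twice the bare instance price.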

So: understanding all cost components is needed to make a valid comparison.

A server instance is not a server

Probably one of the least understood things within AWS is how these server instances compare to their real-life, physical counterparts. For example, AWS advertises CPU capacity in ECUs, a fictional unit roughly equivalent to the CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor. Note: a 2007 unit, so hardly state of the art. IO capacity of an instance is even more obscure, as it is indicated only vaguely as 'Moderate' or 'High' IO performance, without mentioning the actual bandwidth that comes with it.

In addition, keep in mind that the inherently multi-tenant nature of a cloud platform might have some negative impact on the performance of your own server instances. This can be countered by allocating larger chunks of capacity (rather than multiple smaller ones) or, in the case of AWS, even requesting dedicated hardware, but obviously this has a cost impact.

The bottom line is that when you launch a server instance, you might not get the capacity you were anticipating, requiring you to upgrade or launch additional instances.

So: make sure you understand the actual capacity provided in comparison with physical servers.

Elasticity might be difficult to achieve

The single most cited benefit of cloud computing is the ability to size the capacity of your systems (whether automatically or not) according to demand. In contrast, traditional systems are often sized for the maximum anticipated peak in demand, resulting in dramatically under-utilised systems. Server virtualisation already started addressing this, and cloud computing is supposed to be best-in-class in this regard, with dramatic cost savings as a result. Right?

Well, yes, it could be. Automatic scaling is a very useful feature but it might be difficult to fully exploit it. It comes with its own set of challenges, such as:
  • Is your application capable of dynamically spreading the load over multiple servers? For a straightforward web server this is typically not much of an issue, but for more complex applications or servers holding state (e.g. databases), things usually get more complicated.
  • How do you provision these spontaneously launching server instances? 
  • How much time is needed to spin up a new server instance and configure it so that it can actually take part in processing the workload? Does the time needed to scale up match the actual peaks in demand?
  • How do you keep a grip on the instances that are actually running? An error in the provisioning might result in server instances being launched that never become an active part of the system.
  • Auto-scaling is best served by relatively small server instances; however, as discussed before, these smaller instances come with their own drawbacks. A trade-off is needed.
So: make sure you assess the applicability of auto-scaling before counting on the cost savings.

Allocation vs. Usage

Cloud computing is typically associated with the Pay as you Go paradigm. You pay for something when you need and use it.

Not so fast. This applies to quite a few things, but unfortunately not to all. For example, online (block) storage is typically allocated beforehand, and you pay for what you allocate, not for what you actually use. Another example is the use of reserved instances, which allow you to buy reserved and discounted capacity for a one- or three-year period. The more you pay upfront, the greater the discount you get. However, for the Heavy Utilisation reserved instances you are charged for every hour of the term, regardless of the state of the server instance.
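A quick sketch (with made-up prices, not actual AWS rates) shows why utilisation matters when buying such a reserved instance:

# Compare on-demand vs. heavy-utilisation reserved pricing over one year.
# All prices are hypothetical examples, NOT actual AWS rates.
hours_per_year = 24 * 365

on_demand_rate   = 0.08    # $/hour, only charged while the instance runs
reserved_upfront = 200.00  # one-time fee for a one-year reserved instance
reserved_rate    = 0.03    # $/hour, charged for EVERY hour of the term

for utilisation in (1.0, 0.5, 0.25):
    on_demand = on_demand_rate * hours_per_year * utilisation
    reserved  = reserved_upfront + reserved_rate * hours_per_year
    print(f"utilisation {utilisation:4.0%}: on-demand ${on_demand:7.2f}"
          f" vs. reserved ${reserved:7.2f}")

With these numbers, the reserved instance only pays off when the server runs most of the time; at 25% utilisation it costs well over twice the on-demand alternative.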

So: it is really necessary to understand the extent to which the Pay as you Go paradigm applies to the different cost components.

So that settles it, the cloud is way too expensive

Hold on, that is not the message I'm trying to convey. I am actually a strong believer in cloud computing and I sincerely believe this is a massive paradigm shift we are witnessing. Does this mean it is applicable to all use cases? Of course not. And does it mean that computing costs are all of a sudden a fraction of what they used to be? No way.

However, when that unexpectedly high bill comes in at the end of the month, it is easy to forget about the service that has been delivered, and you quickly end up comparing apples and pears.

What is the total cost of ownership of the alternative then?

As mentioned before, almost everything comes at a cost at Amazon, which might be considered a blessing in disguise. In the end, Amazon provides the service to make money, and being one of the largest operators of IT infrastructure in the world, you can expect them to have a good understanding of the cost components of such an infrastructure.

By charging you these individual line items, Amazon provides you with an insight into the total cost of ownership (TCO) which you might not have had before. And as long as it is not clear what the TCO of the alternatives is, you cannot state that either one is too expensive.

In my view, there are plenty of use cases where an on-premise or co-located solution is more economical, but the differences won't be spectacular, and you need a very well-organised IT organisation to achieve those benefits.

So: ensure you understand the alternative's TCO when comparing it with a cloud based solution.

Apples and Apples?

It is a popular pastime among IT pros to compare the cost of an off-the-shelf server with what Amazon charges for its equivalent. And boy, does Amazon suffer then.

Except that this is not a valid comparison. To start with, it usually doesn't end with the server alone. You need additional infrastructure such as storage, backup equipment and so on. The server must be housed, cooled, mounted in a rack and supplied with power, and physical installation is needed.

And what happens when this server dies (which will happen)? At best a spare server is available or a sufficient service contract is in place, but even then it more often than not takes a significant amount of time before the replacement is ready to go. How much is it worth, then, that in a cloud computing scenario a replacement can be fired up (even fully automatically if needed) within minutes? And that the solution can be migrated to a disaster recovery site without huge upfront costs and without intervention from the cloud provider?

Basically you are comparing a service with a piece of equipment, which is at best only a part of the solution. Apples and pears.

So: calculate the cost of the on-premise IT service rather than a piece of equipment when comparing costs.

Benefits of capacity on demand

As discussed before, automatically scaling the solution's capacity to meet demand might be more complicated than it sounds. That's very true, but it doesn't mean it has no value. It certainly does, big time! A system designed for elasticity and used for fluctuating load profiles can save significant costs, no doubt about it. In short: the spikier the demand, the better it suits a cloud use case.

But there are many more use cases where capacity on demand proves to be a real winner. What about setting up a temporary test system? Running a load test on a representative system setup? Testing a disaster recovery procedure? Scaling up capacity during a large data migration that would normally last for days or even weeks? And very often for very little cost, as such capacity typically runs for hours or weeks rather than months or years.

It is evident that this flexibility is extremely useful; more often than not, quite a few of these things simply wouldn't be possible with on-premise systems.

So: take the value of this flexibility into account before making up your mind.

OPEX vs. CAPEX

When talking about costs, it is tempting to simply compare the sum of the monthly costs over a three-year period with a scenario involving a large initial investment combined with smaller monthly costs. Except that these are not the same. Any financial specialist (which I am not) will be able to explain the benefits of Operational Expenses (OPEX) vs. Capital Expenses (CAPEX). It is very attractive to spread the payment of a large sum of money over three years rather than paying the majority upfront, a principle firmly exploited by credit card firms.
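A simple present-value calculation (the discount rate and amounts below are made-up examples) shows why nominally summing monthly fees overstates their true cost:

# Compare a large upfront investment (CAPEX) with monthly payments (OPEX)
# using a simple present-value calculation. All numbers are hypothetical.
monthly_rate = 0.08 / 12   # 8% cost of capital per year, compounded monthly
months       = 36

capex_upfront = 30000.00   # buy the hardware now
opex_monthly  = 1000.00    # or pay as you go: 36 x 1000 = 36000 nominal

pv_opex = sum(opex_monthly / (1 + monthly_rate) ** m
              for m in range(1, months + 1))
print(f"nominal OPEX total : ${opex_monthly * months:9.2f}")
print(f"present value OPEX : ${pv_opex:9.2f}")
print(f"CAPEX paid today   : ${capex_upfront:9.2f}")

In this example, the 36,000 of monthly payments is worth roughly 31,900 in today's money, much closer to the upfront investment than the nominal sum suggests.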

So: make sure to take the cost of capital into account.

The bottom line

I deliberately tried to focus on the economic aspects of cloud computing versus more traditional alternatives, while playing a bit of a devil's advocate role. But economics is just one part of the equation. Rather than managing your own data centres with dedicated equipment, cloud computing allows for more focus on your core business. From that perspective, whether or not to utilise cloud computing is much more a strategic choice than something based on numbers alone.

Sometimes it works, sometimes it certainly does not, but in all cases a thorough understanding of both sides of the story is needed to make a qualified decision.

A few references

Googling for the economic benefits of the cloud will result in a huge number of hits, but a few of them I found very interesting. One article that raised quite a stir was the AWS vs. Self-Hosted article, including the response from Amazon's Jeff Barr. Another interesting, though less quantified, article comes with the intriguing title Is cloud computing really cheaper? Finally, a nice interactive spreadsheet that aims at comparing Cloud vs. Colo costs is worth a look.