Tuesday, May 14, 2013

Quick guide to AWS Database options

Introduction

Amazon Web Services is widely known as the leading Infrastructure as a Service provider, and let's be clear: they are by far the most powerful option in this area.

However, it would be unwise to believe that AWS is active only in the IaaS space: over time they have built up a significant number of services that can be classified as Platform as a Service.

A few weeks ago I visited the AWS Summit in London and attended a few (big) data sessions. I figured it would be helpful to list the database platform services AWS currently provides, with a quick introduction to each and some guidance on how and when to use them. Note that both SQL and NoSQL options are discussed, but I will not go into the details of the two approaches (not to mention the various SQL and NoSQL flavours that exist).

SimpleDB

According to AWS, Amazon SimpleDB is a highly available and flexible non-relational data store that offloads the work of database administration. Developers simply store and query data items via web services requests and Amazon SimpleDB does the rest. It provides a very flexible query interface, sports multi-data center replication, high availability, and offers rock-solid durability. And yet customers never need to worry about setting up, configuring, or patching their database.

The main drawback, however, is that its scalability is limited and its performance is not always predictable. It also has a strict storage limit of 10 GB per domain, so larger systems will probably need to partition their data across multiple domains.
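To make the partitioning idea a bit more concrete, here is a minimal sketch in Python of how items could be spread over several domains with a stable hash. It is only an illustration: the function name, the "orders" prefix and the number of domains are my own assumptions and not part of any AWS API.

```python
import hashlib


def domain_for(item_name: str, num_domains: int, prefix: str = "orders") -> str:
    """Pick the domain an item lives in, based on a stable hash of its name.

    The prefix and the naming scheme are hypothetical; choose whatever
    fits your application.
    """
    if num_domains < 1:
        raise ValueError("num_domains must be at least 1")
    # hashlib gives the same digest on every run and every machine,
    # unlike Python's built-in hash(), which is randomised per process.
    digest = hashlib.sha256(item_name.encode("utf-8")).hexdigest()
    shard = int(digest, 16) % num_domains
    return f"{prefix}_{shard:02d}"


# The same item name always maps to the same domain.
print(domain_for("customer-42", num_domains=4))
```

A simple scheme like this keeps every domain well below the limit as long as the data is spread evenly, but note that changing the number of domains later means moving most items around, so pick it with some headroom.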

So when you need the flexibility of a non-relational database and your scalability requirements are modest, this could be a good option.

DynamoDB

DynamoDB is AWS' other NoSQL database offering. It is a tremendously scalable and fast service, and provisioning and scaling databases (tables) is straightforward. The underlying technical architecture is state of the art; for instance, it runs on SSDs rather than good old hard drives. It integrates nicely with other AWS services such as S3, Data Pipeline and Redshift.
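Provisioning a table mostly comes down to telling DynamoDB how much read and write throughput you need. As a rough sketch of that arithmetic, the Python below estimates capacity units from item size and request rate. The unit sizes are an assumption taken from the AWS documentation as I understand it today (one read unit covers a strongly consistent read of up to 4 KB per second, one write unit a write of up to 1 KB per second) and may differ from what applied when this post was written; the function names are my own.

```python
import math

# Assumed unit sizes; check the current DynamoDB documentation.
READ_UNIT_BYTES = 4 * 1024   # one read unit covers an item of up to 4 KB
WRITE_UNIT_BYTES = 1024      # one write unit covers an item of up to 1 KB


def read_capacity_units(item_bytes, reads_per_second, strongly_consistent=True):
    """Estimate the read capacity units needed for a steady read rate."""
    units_per_read = max(1, math.ceil(item_bytes / READ_UNIT_BYTES))
    total = units_per_read * reads_per_second
    if not strongly_consistent:
        # Eventually consistent reads cost half as much.
        total = total / 2
    return math.ceil(total)


def write_capacity_units(item_bytes, writes_per_second):
    """Estimate the write capacity units needed for a steady write rate."""
    units_per_write = max(1, math.ceil(item_bytes / WRITE_UNIT_BYTES))
    return units_per_write * writes_per_second


# 3 KB items, 100 reads/s and 20 writes/s.
print(read_capacity_units(3 * 1024, 100))   # 100
print(write_capacity_units(3 * 1024, 20))   # 60
```

Item sizes are rounded up per request, which is why small items are cheaper to serve than you might expect from their total volume.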

So when you are in the NoSQL game and really need a highly scalable, low-latency solution, DynamoDB is the way to go.

RDS

Relational databases are known territory for most of us, and the likes of Oracle, SQL Server and MySQL are big players in this area. It is perfectly possible to run your own (relational) database on AWS EC2, but if you don't want to be bothered with configuration issues such as storage configuration, backup schedules and read replicas, RDS is certainly an option. All the more so because the extra cost of RDS over running your own database on EC2 is very modest.

So, given that NoSQL, although a nice and shiny option, is certainly not always the preferred one, RDS is well worth a look, especially for transactional processing systems. For analytical processing, AWS has...

Redshift

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools. It is optimized for datasets ranging from a few hundred gigabytes to a petabyte or more and costs less than $1,000 per terabyte per year, a tenth the cost of most traditional data warehousing solutions.
Redshift is accessible through standard ODBC and JDBC drivers, as it implements the PostgreSQL interface (but note: it is not PostgreSQL under the hood!).
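Because of that PostgreSQL-compatible interface, connecting looks much like connecting to any PostgreSQL server. The Python sketch below only builds the connection strings; the host name, database and user are made-up placeholders, and 5439 is the port Redshift clusters use by default as far as I know. Passing the result to an actual driver is left as a comment, since that requires a running cluster.

```python
DEFAULT_REDSHIFT_PORT = 5439  # assumed default; your cluster may use another port


def libpq_dsn(host, dbname, user, port=DEFAULT_REDSHIFT_PORT):
    """Build a libpq-style connection string, as PostgreSQL drivers accept."""
    return f"host={host} port={port} dbname={dbname} user={user}"


def jdbc_url(host, dbname, port=DEFAULT_REDSHIFT_PORT):
    """Build a JDBC URL for a PostgreSQL-compatible JDBC driver."""
    return f"jdbc:postgresql://{host}:{port}/{dbname}"


# Hypothetical endpoint, for illustration only.
host = "example-cluster.example.com"
print(libpq_dsn(host, "analytics", "report_user"))
print(jdbc_url(host, "analytics"))

# With a PostgreSQL driver such as psycopg2 installed, the DSN could be
# used roughly like this (not executed here):
#   conn = psycopg2.connect(libpq_dsn(host, "analytics", "report_user"),
#                           password=...)
```

Your existing BI tools will typically ask for exactly these pieces: host, port, database name and credentials.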

So when large amounts of data must be analysed and reported on (in real-time fashion), Redshift is a very attractive option.

EMR

AWS' Elastic MapReduce (EMR) is technically not really a database, but it is still very well suited to processing large amounts of data. EMR is a fully configured Hadoop setup, including tools such as Pig and Hive. It is tremendously powerful, although also a bit low-level, and it typically operates in a batch-oriented (rather than real-time) fashion.
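For readers new to the MapReduce model that Hadoop implements, the following toy word count in plain Python shows the three steps involved: map, shuffle and reduce. It runs in a single process and uses no Hadoop or EMR API at all; the function names are my own, and a real job would distribute each phase over many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run the whole batch: map every line, shuffle, then reduce per key."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(run_job(["big data big", "data pipeline"]))
# {'big': 2, 'data': 2, 'pipeline': 1}
```

The batch nature is visible here too: nothing comes out until the whole input has been mapped and grouped.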

EMR is a very good option for processing incoming data streams and moving the results to the other database options discussed earlier in this post.

Conclusion

This is only the tip of the iceberg, and an in-depth comparison of the various options would take a lot more space. However, I do hope it provides a first impression and some guidance when you have to choose between them.

Note that the options discussed are really platform services, so a lot of the complexity is abstracted away from you. You may or may not like that. If you are in the latter camp, rest assured that you can still use the IaaS options and run your own database solution just the way you want it, utilising services such as EC2 and EBS.