Amazon EC2 Performance Drops – Too Many Users

Alan Williamson recently wrote about his company's long-term (multi-year) experience with Amazon EC2 and sums up the degrading experience as:

Again, great services just as long as you don’t use them too much!

Alan clarified that they started deploying on EC2 about 2 years ago, initially using the “SMALL” instances for most DB instances and Web front-ends where needed, but last year they had to upgrade a handful of those instances to “High-CPU MEDIUM” instances just to maintain the same performance that the SMALL instances had given them the year before.

The author goes on to mention that, besides poor individual VM performance, one common problem they are seeing is internal EC2 network congestion that is killing their application performance. In any large, scaled-out app you will have your database(s) on separate machines from your web server(s), which connect back to the DBs for data. Unfortunately, during especially hard failures of EC2 performance, Alan and his team were seeing 7+ seconds of internal network lag.

7 seconds is a long time; too long to wait for a website to load, let alone for packets to travel across an internal private network between your web and DB servers.
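
If you want to check for this kind of lag on your own setup, here is a minimal Python sketch that times a TCP connection from one machine to another. It is not how Alan measured his numbers (his post doesn't say); the function names, the host, and the port are all made up for illustration, and a TCP connect only approximates a single network round trip.

```python
import socket
import time


def tcp_connect_latency(host: str, port: int, timeout: float = 10.0) -> float:
    """Return the seconds taken to open (and immediately close) a TCP connection.

    Hypothetical helper: connect time is only a rough proxy for network lag.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # The connection is closed as soon as it is established.
    return time.perf_counter() - start


def sample_latencies(host: str, port: int, count: int = 5, pause: float = 0.0) -> list:
    """Collect `count` connect-latency samples, sleeping `pause` seconds between them."""
    samples = []
    for _ in range(count):
        samples.append(tcp_connect_latency(host, port))
        if pause:
            time.sleep(pause)
    return samples
```

Run from a web server, something like `max(sample_latencies("db.internal.example", 3306))` (a placeholder hostname and port) would show the worst connect time to the DB box. Anything measured in whole seconds on a private network is a red flag.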

He goes on to point out the similarity between Amazon EC2's recent performance failings and the failures that caused him and his company to leave Flexiscale (now Flexiant) years earlier.

Alan notes that Cloudkick did its own analysis of seemingly poor EC2 instance performance and corroborated his findings, including horrible latency on the internal network.

He also links to a study by Eran Tromer about the Amazon EC2 network architecture and how bad neighbors can really put the screws to you in the EC2 cloud environment.

Both Wire Turf and Alan pointed out that not all of the underlying commodity hardware running EC2 is equal: some of the servers your VM might get created on are many years old and perform like garbage, while others are brand new, beefier servers that will host your images with minimal issue.

Alan mentioned a (horrible but necessary) workaround to the performance problems that they were having during one particularly bad firestorm; he was simply sitting at the AWS Management Console, starting and killing EC2 instances until he would get one on a server that was performing well enough to push out into his production circle of servers.
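
For the curious, that manual start-and-kill routine can be sketched as a small loop in Python. This is only an illustration of the idea, not Alan's actual process (he did it by hand in the console). The `launch`, `benchmark`, and `terminate` callables are hypothetical hooks you would have to supply yourself, and `min_score` is whatever threshold your own benchmark considers "good enough".

```python
from typing import Callable, Optional


def find_acceptable_instance(
    launch: Callable[[], str],
    benchmark: Callable[[str], float],
    terminate: Callable[[str], None],
    min_score: float,
    max_attempts: int = 10,
) -> Optional[str]:
    """Launch instances until one benchmarks at or above `min_score`.

    All three callables are assumptions supplied by the caller:
      launch()       -> starts an instance and returns its ID
      benchmark(id)  -> returns a score, higher meaning faster
      terminate(id)  -> shuts the instance down
    Returns the ID of the first acceptable instance, or None if every
    attempt fell short.
    """
    for _ in range(max_attempts):
        instance_id = launch()
        try:
            score = benchmark(instance_id)
        except Exception:
            # Don't leave a billable instance running if the benchmark breaks.
            terminate(instance_id)
            raise
        if score >= min_score:
            return instance_id
        # Too slow: kill it and roll the dice again.
        terminate(instance_id)
    return None
```

The `max_attempts` cap is there because every rejected instance still costs money; without it, a bad day in the data center could turn into a very expensive loop.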

That is exactly what you aren’t supposed to be doing with “cloud computing”; the idea is that you pay for the convenience of not needing to think about “Servers” anymore… you just think about “Resource requirements”.

Given some of the available usage statistics out there about Amazon EC2 load, it may not be surprising that the existing underlying hardware is starting to buckle under the load — everyone is hopping on the cloud-wagon and sinking it into the mud. I assume Amazon is well aware of the performance issues on their end and working hard to not only get better load-balancing software and hardware in place to make better use of idle cycles in their compute and network infrastructure, but also better “bad neighbor” protection policies and hardware upgrades to aging infrastructure.

This does give anyone looking into cloud computing an opportunity to consider performant alternatives that come highly recommended though.


2 Responses to Amazon EC2 Performance Drops – Too Many Users

  1. Dual Screen Laptops January 16, 2010 at 3:18 am #

    Reality check for the sales guys aside, I don’t think it matters whether you scale vertically or horizontally, eventually you are going to run into the hard problems. I mean how much faster disks, processors, or additional RAM can you even get? That stuff gets pretty expensive and there is a limit to how far it can go. Meanwhile, “share-nothing” is usually a misnomer because you always have to share something (usually a database). However, depending on the nature of your application, you may be able to cache a majority of your traffic, in which case throwing more cheap hardware at it is more effective than climbing up the exponential price scale of server hardware. “share-nothing” doesn’t solve the scaling problem for you, but it gives you a head start.

  2. Riyad Kalla January 16, 2010 at 9:22 am #

    DSL — I can’t tell if this is a real human reply or a spam-approximate-reply just to push your URL into wordpress comments… the article is pretty clear that the problem has nothing to do with scaling but degrading performance over the last 3 years with the same level of service from EC2 as well as failing performance (7 second lag) between internal nodes in the EC2 cloud… that’s not “Scaling”, that’s “Failing”
