Alan Williamson recently wrote about his company's long-term (multi-year) experience with Amazon EC2 and sums up the degrading experience as:
Again, great services just as long as you don’t use them too much!
Alan clarified that they started deploying on EC2 about two years ago, initially using "SMALL" instances for most DB instances and web front-ends where needed. Last year, though, they noticed they needed to upgrade a handful of those instances to "High-CPU MEDIUM" just to maintain the same performance the SMALL instances had given them the year prior.
The author goes on to mention that besides poor individual VM performance, one common problem they are seeing is internal EC2 network congestion that is killing their application performance. You can imagine that in any large, scaled-out app you will have your database(s) on separate machines and your web server(s) on other machines that connect back to the DBs for data. Unfortunately, during especially bad EC2 performance failures, Alan and his team were seeing 7+ seconds of internal network lag.
7 seconds is a long time; too long to wait for a website to load, let alone for packets to travel across an internal private network between your web and DB servers.
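To put a number like that in context, a simple latency probe makes the problem concrete. The sketch below (my own illustration, not from Alan's post) times a TCP handshake from a web host to a DB host; on a healthy private network this should take well under a millisecond, so the 7+ second figure above would fail any sane budget many times over. The function names and the 1-second budget are assumptions for illustration.

```python
import socket
import time

def tcp_round_trip(host: str, port: int, timeout: float = 5.0) -> float:
    """Return the time (in seconds) to complete a TCP handshake with host:port."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return time.monotonic() - start

def is_network_healthy(host: str, port: int, budget: float = 1.0) -> bool:
    """Check the web-to-DB hop against a (generous) latency budget.

    budget is hypothetical; pick one that matches your app's tolerances.
    """
    try:
        return tcp_round_trip(host, port) < budget
    except OSError:
        # Connection refused / timed out counts as unhealthy.
        return False
```

Run periodically from each web server against the DB's listening port, a probe like this would have flagged the congestion Alan describes long before users felt it.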
He continues by pointing out the similarities between Amazon EC2's recent performance failings and the failures that caused him and his company to leave Flexiscale (now Flexiant) years prior.
He also links to a study by Eran Tromer on the Amazon EC2 network architecture and how bad neighbors can really put the screws to you in the EC2 cloud environment.
Both Wire Turf and Alan pointed out that not all the underlying commodity hardware running EC2 is equal — some of the servers your VM might get created on are many years old and perform like garbage, while others are brand-new, beefier servers that will host your images with minimal issue.
Alan mentioned a (horrible but necessary) workaround to the performance problems that they were having during one particularly bad firestorm; he was simply sitting at the AWS Management Console, starting and killing EC2 instances until he would get one on a server that was performing well enough to push out into his production circle of servers.
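That manual start-and-kill loop can be sketched as a small piece of automation. The code below is my own illustrative sketch, not Alan's actual process: the launch, benchmark, and terminate callables are hypothetical stand-ins for whatever you use to drive EC2 (in practice they might wrap an EC2 API client), and min_score is whatever performance threshold your workload demands.

```python
from typing import Callable, Optional

def find_good_instance(
    launch: Callable[[], str],          # starts an instance, returns its ID
    benchmark: Callable[[str], float],  # returns a performance score for it
    terminate: Callable[[str], None],   # kills an underperforming instance
    min_score: float,
    max_attempts: int = 10,
) -> Optional[str]:
    """Start and kill instances until one benchmarks above min_score.

    The (horrible but sometimes necessary) workaround described above:
    keep cycling instances until you land on decent underlying hardware.
    """
    for _ in range(max_attempts):
        instance_id = launch()
        if benchmark(instance_id) >= min_score:
            return instance_id     # good host found; keep this one
        terminate(instance_id)     # bad host; throw it back and retry
    return None                    # gave up after max_attempts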
That is exactly what you aren't supposed to be doing with "cloud computing"; the idea is that you pay for the convenience of not needing to think about "servers" anymore… you just think about "resource requirements".
Given some of the available usage statistics out there about Amazon EC2 load, it may not be surprising that the existing underlying hardware is starting to buckle under the load — everyone is hopping on the cloud-wagon and sinking it into the mud. I assume Amazon is well aware of the performance issues on their end and working hard to not only get better load-balancing software and hardware in place to make better use of idle cycles in their compute and network infrastructure, but also better “bad neighbor” protection policies and hardware upgrades to aging infrastructure.
This does give anyone looking into cloud computing an opportunity to consider better-performing alternatives that come highly recommended, though.