Amazon S3 outage rains on cloud-computing parade

COMMENT

Amazon.com's Simple Storage Service, or S3, struck a pothole on the road to the glorious cloud-computing future on Sunday when an outage took the storage system offline for several hours.

The outage may not come as a surprise, as the computing industry is making up what is known as 'cloud computing' as it goes along, often with a server and networking architecture that's one part improvisation to two parts proven best practice.

Computing practices tend to adopt one of two stances. One is tight control, higher prices and high reliability. The other is openness, lower cost, but some degree of unreliability. High-end mainframes and Unix servers can handle transaction loads that would crush most machines using Intel or AMD x86 processors, but they cost more and are less adaptable. Most of the cutting-edge, large-scale action on the internet — including various cloud-computing efforts — is happening with the more free-wheeling technology.

One company operating at colossal scale, Google, has concluded it is better to buy cheap x86 servers and write software that automatically paves over hardware failures. The bigger problem comes when a large system composed of many interacting components loses track of its self-conception, and rebooting a single system or swapping out a hard drive isn't sufficient.

Essentially, Amazon had to reboot S3. Here's how the company described its S3 problem in a statement: "As a distributed system, the different components of S3 need to be aware of the state of each other. For example, this awareness makes it possible for the system to decide which redundant physical storage server to route a request to.

"We experienced a problem with those internal system communications, leaving the components unable to interact properly, and customers unable to successfully process requests. After exploring several alternatives, the team determined it had to take the service offline to restore proper communication and then bring service online again. These are sophisticated systems and it generally takes a while to get to root cause in such a situation," Amazon said. "We will be providing our customers with more information when we've fully investigated the incident."

Om Malik, an analyst with GigaOM, described cloud computing as frail: "The S3 outage points to a bigger [and a larger] issue: the cloud has many points of failure — routers crashing, cable getting accidentally cut, load balancers getting misconfigured, or simply bad code."
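
Amazon's point about components needing to know each other's state can be illustrated with a toy example. The Python sketch below is not Amazon's implementation; the ReplicaRouter class and its method names are hypothetical, invented here for illustration. It shows a router that picks a redundant storage server from a shared view of which servers are healthy, and that cannot serve requests once that view no longer lists any usable server.

```python
import random


class ReplicaRouter:
    """Toy router (hypothetical, not S3's design): picks a replica believed healthy."""

    def __init__(self, replicas):
        # Shared view of component state; every replica starts out healthy.
        self.health = {name: True for name in replicas}

    def mark(self, name, healthy):
        # Record a state update about one replica.
        self.health[name] = healthy

    def route(self, rng=random):
        # Choose among the replicas currently believed to be healthy.
        candidates = sorted(name for name, ok in self.health.items() if ok)
        if not candidates:
            # With no trustworthy state, the request cannot be routed at all.
            raise RuntimeError("no healthy replica known")
        return rng.choice(candidates)
```

In this sketch, losing one replica is harmless because the router simply picks another. Losing the state information itself is what stops every request, which is roughly the situation Amazon's statement describes.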

Yet there are three things that shouldn't be overlooked before writing cloud computing off as a failure:

  • First, you should compare the problems of cloud computing to the alternatives, including running computing services in-house. Corporate datacentres also have crashing routers, bad code and misconfigured load balancers.
  • Second, you can expect reliability to increase as the companies providing cloud infrastructure and services explore the terra incognita.
  • Third, don't confuse Web 2.0 with the foundational elements of cloud computing. A website that uses an online application at another site to mash up data from some other sites, then presents it using a service from yet another site, is indeed susceptible to numerous points of failure. But a single-purpose infrastructure such as Amazon S3 is, at least in theory, a more tightly controlled utility that can offer higher reliability.
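
The third point can be put in rough numbers. The Python sketch below assumes that a mashup works only when every service it chains together is up, and that the services fail independently; both are simplifying assumptions, and the function names and the 99.9 per cent figure are hypothetical, not taken from Amazon or any real service level agreement.

```python
from math import prod


def serial_availability(availabilities):
    # A chain is up only when every link is up, so availabilities multiply
    # (assuming independent failures).
    return prod(availabilities)


def annual_downtime_hours(availability, hours_per_year=365 * 24):
    # Expected hours of downtime per year for a given availability fraction.
    return (1 - availability) * hours_per_year


# Hypothetical figures: one service at 99.9% versus a chain of four such services.
single = 0.999
mashup = serial_availability([0.999] * 4)
print(annual_downtime_hours(single), annual_downtime_hours(mashup))
```

With these made-up figures, the four-service chain comes out at roughly 99.6 per cent, or about 35 hours of downtime a year against under nine for the single service.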

That's not to excuse Amazon's outage or gloss over the effect it had on business partners reliant on it. But S3 is the sole part of Amazon Web Services that comes with a service level agreement promising customers reliability.

A little silver lining to this particular cloud problem is that Amazon is setting expectations at the right level. It said in a statement: "Any downtime is unacceptable, and we won't be satisfied until it is perfect."

Talkback

Despite what many pundits have to say, reliability issues will not be the downfall of cloud computing. Using cloud computing does not mean neglecting to architect solutions that meet business requirements, including reliability requirements.

I wrote more about this idea here:

Cloud Computing and Reliability
http://faseidl.com/public/item/212584

faseidl 10 September, 2008 17:35