Gmail: When efficiency equals death


LEADER
Yesterday, Gmail went down. For nearly two hours, users of the world's most visible cloud service were stuck with a very old-fashioned "502 – unable to reach server" error message. That's apt: it was a very old-fashioned problem. Google had been hit by cascading failure.

The company is being unusually open about what went wrong: an upgrade to part of the system rendered that part unable to handle its usual traffic. It signalled this to the rest of the Gmail cloud, and handed over the requests it couldn't serve. Unfortunately, that triggered similar problems in the parts of the system handling the overflow: they in turn shut down and passed on the original overflow, plus their own traffic — and so on, and so forth.
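The mechanism described above can be sketched as a toy simulation. This is an illustration under stated assumptions, not a model of Google's actual system: the function name `simulate_cascade`, the capacity figures, the equal starting loads and the even redistribution of overflow are all hypothetical.

```python
def simulate_cascade(capacities, load_per_node, first_failure=0):
    """Return the indices of nodes still serving once the cascade settles."""
    # Assumption: every node starts out carrying the same load.
    loads = {node: float(load_per_node) for node in range(len(capacities))}
    # The first node drops out and hands over everything it was serving.
    overflow = loads.pop(first_failure)
    while overflow > 0 and loads:
        # Assumption: orphaned traffic is spread evenly over running nodes.
        share = overflow / len(loads)
        for node in loads:
            loads[node] += share
        # Any node pushed past its capacity shuts down and passes on
        # the overflow it received plus its own traffic.
        failed = [node for node in loads if loads[node] > capacities[node]]
        overflow = sum(loads.pop(node) for node in failed)
    return sorted(loads)


capacities = [100, 100, 100, 120, 150]  # hypothetical requests-per-second limits
print(simulate_cascade(capacities, load_per_node=90))  # prints []
print(simulate_cascade(capacities, load_per_node=60))  # prints [1, 2, 3, 4]
```

At 90 units of load per node the two larger nodes survive the first round, then fall over when the three smaller ones hand on their traffic. At 60 units the remaining four nodes absorb the loss and nothing else fails.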

Cascading failure is well known. In the past, it has brought down telephone systems and power distribution grids and, most recently, it came close to toppling the global financial system. It happens in human biology. It is, in short, a classic problem, characterised by the speed at which it develops (Google knew about it within seconds of it kicking off) and the totality of its consequences.

In the case of Gmail, it even leaped the species barrier to completely independent services. For a while Twitter, now the world's back channel for cloud error reporting, was overloaded by people asking "Is Gmail down?".

Expect more of the same, as we build ever more complex and interconnected systems. The irony is that the major cause of the problem is good engineering, as Google admits: the upgrade that triggered the meltdown was designed to improve the very thing that went wrong.

Efficiency, normally a touchstone of proper design, is the enemy. With large systems, any over-engineering is very expensive, so the tendency is to plan for the worst case and build for that and no more. But any worst case gets worse if the system itself starts to fail. There is no slack to soak up the sudden increase in demand on the remaining components, and the cosmos falls apart.
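The point about slack can be put as simple arithmetic. The sketch below is illustrative only: `max_tolerable_failures` is a hypothetical helper, and it assumes identical nodes, integer units of load and perfectly even redistribution of traffic.

```python
def max_tolerable_failures(num_nodes, capacity, load_per_node):
    """How many identical nodes can be lost before the rest are overloaded."""
    total_load = num_nodes * load_per_node
    # Smallest number of nodes able to carry the whole load
    # (ceiling division, kept in integers to avoid rounding surprises).
    nodes_needed = -(-total_load // capacity)
    return max(num_nodes - nodes_needed, 0)


# Ten nodes, each able to serve 100 units (hypothetical figures).
print(max_tolerable_failures(10, 100, 95))  # prints 0
print(max_tolerable_failures(10, 100, 90))  # prints 1
print(max_tolerable_failures(10, 100, 50))  # prints 5
```

Run at 95 per cent of capacity, the system cannot lose a single node. Run at a "wasteful" 50 per cent, it can lose half of them.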

There is no cure, but there is a smart way to prepare for the problem. Have classically inefficient systems that are relaxed about overloads. Don't try to offer everything to everybody: Gmail's alternative, non-web access methods, far less popular, carried on working. Have separately engineered control and monitoring pathways that keep on going when the core functions are broken. Diversity, flexibility, inefficiency and the expectation of failure: these are the hallmarks of reliable distributed systems.
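One way to read "relaxed about overloads" is load shedding: a node that is offered more than it can handle turns the excess away and keeps serving, rather than shutting down and passing everything on. The sketch below is a guess at what that could look like, not anything Google has described; `shed_overflow`, the capacity figures and the even redistribution are all assumptions.

```python
def shed_overflow(capacities, load_per_node, first_failure=0):
    """Return (served, rejected) load when survivors shed excess traffic.

    Assumes at least two nodes, so that one is left after the failure.
    """
    survivors = [n for n in range(len(capacities)) if n != first_failure]
    # Assumption: the failed node's traffic is spread evenly over the rest.
    share = load_per_node / len(survivors)
    served = rejected = 0.0
    for node in survivors:
        offered = load_per_node + share
        # Serve up to capacity and refuse the remainder; the node stays up.
        accepted = min(offered, capacities[node])
        served += accepted
        rejected += offered - accepted
    return served, rejected


capacities = [100, 100, 100, 120, 150]  # hypothetical limits
print(shed_overflow(capacities, load_per_node=90))  # prints (425.0, 25.0)
```

With these figures, 25 of the 450 units of traffic are refused and the remaining 425 are still served. Some users see an error, but the service as a whole stays up.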

For those of us who look to the cloud for the next generation of computing, these lessons are essential. Fortunately, we have a good example that's been working for 40 years, the internet itself, and we still live in a heterogeneous world, where a diversity of options is bolstered by open standards that ensure flexibility.

It is essential that in every step we take into the cloud we ask ourselves "What happens when this goes wrong?" and expect a sensible reply before going further. Those who claim to have all the answers will end up being the universal problem.

 

Talkback

I'm pleased you posed the question "What happens WHEN this goes wrong?" rather than "What happens IF this goes wrong?" However, I would go further and say the mark of a true engineer (hardware or software) would be, "What do we do when SEVERAL things go wrong?"

I think this has actually happened at a good time. Enough people have been inconvenienced for it to be a real wake-up call, but not enough for it to be a disaster. We might not be so lucky next time.

Tezzer 2 September, 2009 15:35

From the graphic on this piece and the headline, I thought you might extrapolate into the implications for the Conservatives' recent Electronic Medical Record plan, which looks like it could be based around Google Health or Microsoft's HealthVault offering.

Gmail going down is one thing, but health records crashing? Obviously I am being alarmist, as there are a lot of hoops to jump through before the Conservatives have the chance to try out the plan, which should give the IT industry the time to make cloud apps rigorous enough to do the job.

Andrew Donoghue 2 September, 2009 18:01

As an occasional Gmail user who accesses it over IMAP, I can't say I noticed. Goodness, that sounds smug. Perhaps it is.

But the lesson surely is that you cannot afford to rely absolutely on a system that's free. Paid-for systems are not necessarily more reliable - it'd be a fool who claimed that, I feel, although some research might not go amiss here - but at least users have some form of redress in the form of an SLA. And if service providers know their mortgages are resting on the service's continuity, it concentrates the mind wonderfully...

manek 3 September, 2009 11:19
