Google: 'At scale, everything breaks'

Q&A

Google operates technology that is expected to be reliable in the face of major traffic demands.

To scale its services, the company has developed many systems, such as MapReduce and the Google File System, whose published designs have since been reimplemented as open source, largely at Yahoo, and worked into the popular Hadoop data-analytics framework.

However, behind the scenes, the company is fighting a constant battle against the twin demons of cascading failovers and the increasingly challenging levels of complexity that massively scaled services bring.

Urs Hölzle was Google's first vice president of engineering. Before joining Google he worked on high-performance implementations of object-orientated languages, contributed to Darpa's national compiler infrastructure project, and developed compilers for Smalltalk and Java.

According to Hölzle, "at scale, everything breaks", and Google must walk a tightrope between scaling up its systems and avoiding cascading failovers, such as the outage that affected Gmail in March this year.

Q: Apart from focusing on physical infrastructure, such as datacentres, are there efficiencies that Google gains from running software at massive scale?
A: I think there absolutely is a very large benefit there, probably more so than you can get from the physical efficiency. It's because when you have an on-premise server it's almost impossible to size the server to the load, because most servers are actually too powerful and most companies [using them] are relatively small.

[But] if you have a large-scale email service where millions of accounts are in one place, it's much easier to size the pool of servers to that load. If you aggregate the load, it's intrinsically much easier to keep your servers well utilised.
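The arithmetic behind that point can be sketched in a few lines. This is only an illustration under stated assumptions: the tenant names, the load figures and the 100-requests-per-second server capacity are all hypothetical, not Google's numbers. It compares sizing a server for each small tenant's own peak with sizing one pool for the peak of the combined load.

```python
import math

# Hypothetical numbers: requests per second for three small tenants,
# sampled at four times of day. Each tenant peaks at a different time.
TENANT_LOADS = {
    "tenant_a": [5, 30, 10, 5],
    "tenant_b": [10, 5, 30, 5],
    "tenant_c": [5, 10, 5, 30],
}
SERVER_CAPACITY = 100  # assumed requests per second one server can handle


def servers_needed(peak_load, capacity=SERVER_CAPACITY):
    """Smallest whole number of servers able to carry the peak load."""
    return max(1, math.ceil(peak_load / capacity))


def utilisation(total_mean_load, servers, capacity=SERVER_CAPACITY):
    """Fraction of provisioned capacity that is used on average."""
    return total_mean_load / (servers * capacity)


# On-premises: every tenant buys enough servers for its own peak.
dedicated_servers = sum(servers_needed(max(load)) for load in TENANT_LOADS.values())

# Aggregated: one pool is sized for the peak of the combined load.
pooled_load = [sum(sample) for sample in zip(*TENANT_LOADS.values())]
pooled_servers = servers_needed(max(pooled_load))

mean_total_load = sum(pooled_load) / len(pooled_load)
dedicated_utilisation = utilisation(mean_total_load, dedicated_servers)
pooled_utilisation = utilisation(mean_total_load, pooled_servers)

print(dedicated_servers, round(dedicated_utilisation, 3))
print(pooled_servers, round(pooled_utilisation, 3))
```

With these made-up figures, three dedicated servers sit mostly idle, while a single pooled server carries the same total work at three times the utilisation. Real sizing would also need headroom for failures and bursts, which this sketch ignores.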

Q: What are Google's plans for the evolution of its internal software tools?
A: There's obviously an evolution. For example, most applications don't use [Google File System (GFS)] today. In fact, we're phasing out GFS in favour of the next-generation file system that is very similar, but it's not GFS anymore. It scales better and has better latency properties as well. I think three years from now we'll try to retire that, because flash memory is coming, and faster networks and faster CPUs are on the way, and that will change how we want to do things.

One of the nice things is that if everyone today is using the Bigtable compressed database, suppose we have a better Bigtable down the line that does the right thing with flash — then it's relatively easy to migrate all these applications as long as the API stays stable.
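The idea of migrating applications behind a stable API can be illustrated with a minimal sketch. Everything named here is hypothetical: `TableStore`, `DiskBackedStore`, `FlashBackedStore` and the `put`/`get` methods are invented for illustration and are not Bigtable's actual interface. Both backends are plain in-memory stand-ins, since the point is only that the application code does not change when the backend does.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Tuple


class TableStore(ABC):
    """Hypothetical stable API that applications program against."""

    @abstractmethod
    def put(self, row: str, column: str, value: bytes) -> None:
        ...

    @abstractmethod
    def get(self, row: str, column: str) -> Optional[bytes]:
        ...


class DiskBackedStore(TableStore):
    """Stand-in for today's backend (an in-memory dict, for illustration)."""

    def __init__(self) -> None:
        self._cells: Dict[Tuple[str, str], bytes] = {}

    def put(self, row: str, column: str, value: bytes) -> None:
        self._cells[(row, column)] = value

    def get(self, row: str, column: str) -> Optional[bytes]:
        return self._cells.get((row, column))


class FlashBackedStore(TableStore):
    """Stand-in for a later backend with a different internal layout."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, bytes]] = {}

    def put(self, row: str, column: str, value: bytes) -> None:
        self._rows.setdefault(row, {})[column] = value

    def get(self, row: str, column: str) -> Optional[bytes]:
        return self._rows.get(row, {}).get(column)


def save_display_name(store: TableStore, user: str, name: str) -> None:
    """Application code: depends only on the TableStore API."""
    store.put(user, "display_name", name.encode("utf-8"))


def load_display_name(store: TableStore, user: str) -> Optional[str]:
    raw = store.get(user, "display_name")
    return None if raw is None else raw.decode("utf-8")


# Swapping the backend requires no change to the application functions.
results = []
for backend in (DiskBackedStore(), FlashBackedStore()):
    save_display_name(backend, "user42", "Ada")
    results.append(load_display_name(backend, "user42"))
```

The two backends store cells differently, yet `save_display_name` and `load_display_name` run unchanged against either one, which is the property that makes a migration comparatively easy.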

Q: How significant is it to have these back-end systems, such as MapReduce and the Google File System, spawn open-source applications such as Hadoop through publication and adaptation by other companies?
A: It's an unavoidable trend in the sense that [open source] started with the operating system, which was the lowest level that everyone needed. But the power of open source is that you can continue to build on the infrastructure that already exists [and you get] things like Apache for the web server. Now we're getting into a broader range of services that are available through the cloud.

For instance, cluster management itself or some open-source version will happen, because everyone needs it as their computation scales and their issue becomes not the management of a single machine, but the management of a whole bunch of them. Average IT shops will have hundreds of virtual machines (VMs) or hundreds of machines they need to manage, so a lot of their work is about cluster management and not about the management of individual VMs.

Often, if computation is cheap enough, then it doesn't pay to...

Talkback

Very good article, and insightful about the challenges Google faces. It also explains why the banks in Australia are experiencing a run of service failures in their systems, which has never happened before: the complexity has got beyond them. It's not going to go away, and the effort to fix it will be bigger than the effort it took to create the instability they currently have. With a skill base a fraction of Google's, the Aussie banks are facing dire consequences from out-of-control complexity.

Walter @adamson

Walter Adamson via Facebook 27 June, 2011 03:03