Google: 'At scale, everything breaks'

Q&A

Google operates technology that is expected to be reliable in the face of major traffic demands.

To scale its services, the company has developed many systems, such as MapReduce and Google File System, that have since been reimplemented as open source, with heavy backing from Yahoo, and worked into the popular Hadoop data-analytics framework.

However, behind the scenes, the company is fighting a constant battle against the twin demons of cascading failovers and the increasingly challenging levels of complexity that massively scaled services bring.

Urs Hölzle was Google's first vice president of engineering. Before joining Google he worked on high-performance implementations of object-orientated languages, contributed to Darpa's national compiler infrastructure project, and developed compilers for Smalltalk and Java.

According to Hölzle, "at scale, everything breaks", and Google must walk a tightrope between increasing the scale of its systems and avoiding cascading failovers, such as the outage that affected Gmail in March this year.

Q: Apart from focusing on physical infrastructure, such as datacentres, are there efficiencies that Google gains from running software at massive scale?
A: I think there absolutely is a very large benefit there, probably more so than you can get from the physical efficiency. It's because when you have an on-premise server it's almost impossible to size the server to the load, because most servers are actually too powerful and most companies [using them] are relatively small.

[But] if you have a large-scale email service where millions of accounts are in one place, it's much easier to size the pool of servers to that load. If you aggregate the load, it's intrinsically much easier to keep your servers well utilised.
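
As a rough illustration of that sizing argument, the Python sketch below compares dedicated servers with one shared pool. The load figures, the one-unit server capacity and the `utilisation` helper are all assumptions made up for this example, not Google data, and the model ignores peak demand and failover headroom.

```python
import math


def utilisation(loads, capacity=1.0, pooled=False):
    """Return the fraction of provisioned server capacity actually in use.

    loads    -- average load per customer, in the same units as capacity
    capacity -- what a single (hypothetical) server can handle
    pooled   -- True: one shared pool sized to the total load
                False: every customer buys whole servers of its own
    """
    total = sum(loads)
    if pooled:
        servers = math.ceil(total / capacity)
    else:
        servers = sum(math.ceil(load / capacity) for load in loads)
    if servers == 0:
        return 0.0  # nothing provisioned, so nothing to utilise
    return total / (servers * capacity)


# 100 hypothetical small companies, each using 5-20% of one server.
loads = [0.05, 0.10, 0.20, 0.15] * 25

dedicated = utilisation(loads)            # one on-premise server per company
shared = utilisation(loads, pooled=True)  # one aggregated pool

print(f"dedicated: {dedicated:.1%}, pooled: {shared:.1%}")
```

With these made-up figures, 100 dedicated servers run at 12.5 percent utilisation, while the shared pool needs only 13 servers and runs above 90 percent.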

Q: What are Google's plans for the evolution of its internal software tools?
A: There's obviously an evolution. For example, most applications don't use [Google File System (GFS)] today. In fact, we're phasing out GFS in favour of the next-generation file system that is very similar, but it's not GFS anymore. It scales better and has better latency properties as well. I think three years from now we'll try to retire that because flash memory is coming and faster networks and faster CPUs are on the way and that will change how we want to do things.

One of the nice things is that if everyone today is using the Bigtable compressed database, suppose we have a better Bigtable down the line that does the right thing with flash — then it's relatively easy to migrate all these applications as long as the API stays stable.
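
The migration point can be sketched in Python as follows. Everything here is hypothetical: `KeyValueStore`, `DiskStore`, `FlashStore` and `record_visit` are invented names for illustration and do not reflect Bigtable's real interface. The sketch only shows why application code survives a back-end swap when the interface it depends on stays the same.

```python
from typing import Optional, Protocol


class KeyValueStore(Protocol):
    """The stable API that applications are written against (hypothetical)."""

    def put(self, key: str, value: bytes) -> None: ...

    def get(self, key: str) -> Optional[bytes]: ...


class DiskStore:
    """Stand-in for today's storage back end."""

    def __init__(self) -> None:
        self._rows = {}

    def put(self, key: str, value: bytes) -> None:
        self._rows[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._rows.get(key)


class FlashStore:
    """Stand-in for a later back end; different internals, same methods."""

    def __init__(self) -> None:
        self._pages = []  # list of (key, value) pairs, newest last

    def put(self, key: str, value: bytes) -> None:
        self._pages.append((key, value))

    def get(self, key: str) -> Optional[bytes]:
        # Scan from the newest entry so the latest write wins.
        for stored_key, value in reversed(self._pages):
            if stored_key == key:
                return value
        return None


def record_visit(store: KeyValueStore, user: str) -> int:
    """Application code: it only knows about put() and get()."""
    raw = store.get(user)
    count = int(raw.decode()) + 1 if raw is not None else 1
    store.put(user, str(count).encode())
    return count


# The same application function runs unchanged on either back end.
for backend in (DiskStore(), FlashStore()):
    record_visit(backend, "alice")
    print(type(backend).__name__, record_visit(backend, "alice"))
```

Because `record_visit` touches only `put` and `get`, replacing one store with the other requires no change to the application.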

Q: How significant is it to have these back-end systems, such as MapReduce and the Google File System, spawn open-source applications such as Hadoop through publication and adaptation by other companies?
A: It's an unavoidable trend in the sense that [open source] started with the operating system, which was the lowest level that everyone needed. But the power of open source is that you can continue to build on the infrastructure that already exists [and you get] things like Apache for the web server. Now we're getting into a broader range of services that are available through the cloud.

For instance, cluster management itself or some open-source version will happen, because everyone needs it as their computation scales and their issue becomes not the management of a single machine, but the management of a whole bunch of them. Average IT shops will have hundreds of virtual machines (VMs) or hundreds of machines they need to manage, so a lot of their work is about cluster management and not about the management of individual VMs.

Often, if computation is cheap enough, then it doesn't pay to...

Talkback

Very good article, and insightful about the challenges Google faces. It also explains why the banks in Australia are experiencing a run of service failures in their systems, which has never happened before: the complexity has got beyond them. It's not going to go away, and the effort to fix it will be bigger than the effort it took to create the instability they currently have. With a skill base a fraction of Google's, the Aussie banks are facing dire consequences from out-of-control complexity.

Walter @adamson

Walter Adamson via Facebook 27 June, 2011 03:03