Microsoft is not alone in chip woes

COMMENT

Recently, Microsoft's problem with the Xbox's infamous Red Ring Of Death resulted in a billion-dollar bill. The consoles simply died after a while, a fault that seemed to be linked to heat, though the company was reluctant to disclose exactly what had gone wrong.

Now we know — the graphics chip, designed in-house, chronically overheated and eventually gave up the ghost.

It can seem hard to believe that a company with so many resources can make such an expensive mistake. Yet in electronics design, there is no shortage of hidden problems that can elude every reasonable effort to find them before launch. Chip design is not the exact science you might imagine.

I've been there myself. Here's how it can go wrong. In the late 1980s, I worked for a small company with big ambitions. We started off by building a cheap PC network — non-standard, but built around a few low-cost off-the-shelf chips used in an ingenious way. That sold well enough that it was decided to make a higher-performance version around a custom chip design. Our hardware designer (and co-owner) was a very experienced, creative and effective engineer, one of the most capable people I know: the project seemed very doable.

The prototyping went well. At the time, chips were designed in four main stages. First, you design the actual circuit in a CAD package, which outputs a netlist: effectively a script that describes which logic gates to use and how they're connected. Then, you run the netlist through a software simulator that applies electrical rules as if the circuit were running: you feed it a file of fake signals and check the output against what you expect.
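To make the netlist-and-simulator idea concrete, here is a minimal sketch in Python. The netlist format, gate names and signal names are all invented for illustration; they are not the format of any real CAD package of the period. The sketch models only logic values, not the timing or electrical rules a real simulator would apply, but it shows the principle of feeding in test vectors and comparing the outputs with what you expect.

```python
# Toy gate-level simulator. The netlist format here is hypothetical,
# made up for illustration; real CAD tools use their own formats.

# Each gate type maps to a function of its input bits.
GATES = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}

# A netlist: (output signal, gate type, input signals).
# Gates are listed so every input is defined before it is used.
# This one describes a half adder.
HALF_ADDER = [
    ("sum", "XOR", ("a", "b")),
    ("carry", "AND", ("a", "b")),
]


def simulate(netlist, inputs):
    """Evaluate every gate in order and return all signal values."""
    signals = dict(inputs)
    for out, gate, ins in netlist:
        signals[out] = GATES[gate](*(signals[name] for name in ins))
    return signals


def run_test_vectors(netlist, vectors):
    """Apply each (inputs, expected) pair; return the list of mismatches."""
    failures = []
    for inputs, expected in vectors:
        result = simulate(netlist, inputs)
        for name, want in expected.items():
            if result[name] != want:
                failures.append((inputs, name, want, result[name]))
    return failures


# The "file of fake signals": input stimulus plus expected outputs.
VECTORS = [
    ({"a": 0, "b": 0}, {"sum": 0, "carry": 0}),
    ({"a": 0, "b": 1}, {"sum": 1, "carry": 0}),
    ({"a": 1, "b": 0}, {"sum": 1, "carry": 0}),
    ({"a": 1, "b": 1}, {"sum": 0, "carry": 1}),
]

failures = run_test_vectors(HALF_ADDER, VECTORS)
print("all vectors passed" if not failures else failures)
```

A wiring mistake in the netlist, such as using OR where AND was intended for the carry, would show up as an entry in the list of failures, but only if one of the test vectors happens to exercise it.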

Because simulators are always very slow compared to hardware, you can only check a small subset of possible conditions before building a hardware prototype. This can be a collection of many standard logic chips, perhaps hundreds, wired together by hand to mimic your design's internals: it's slow to build, hard to get exactly right, and difficult to make multiple copies of, let alone plug into a PC.
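A rough back-of-envelope calculation suggests why a simulator can only cover a small subset of conditions. The figures below (the input widths and a rate of 1,000 test vectors per second) are assumptions picked purely for illustration, not numbers from the project.

```python
# Back-of-envelope estimate of exhaustive simulation time.
# All numbers are hypothetical, chosen only to illustrate the scale.

SECONDS_PER_DAY = 24 * 60 * 60


def exhaustive_days(input_bits, vectors_per_second):
    """Days needed to try every combination of input bits once."""
    combinations = 2 ** input_bits
    return combinations / vectors_per_second / SECONDS_PER_DAY


for bits in (16, 32, 48):
    days = exhaustive_days(bits, vectors_per_second=1000)
    print(f"{bits} input bits: {days:,.2f} days")
```

Each extra input bit doubles the time, and this counts only single input combinations. A real chip also has internal state, so the order in which inputs arrive matters, and the number of possible sequences is far larger still.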

Or you can take the fast and expensive path and go for an e-beam lithography prototype: this is a way of building a full custom chip by firing a carefully steered beam of electrons at a properly prepared bit of silicon. You feed your netlist into the e-beam process at one end and end up with a fully functioning (you hope) real working prototype, same size and speed as the final part.

These are far too expensive for production — e-beam is the equivalent of hand-lettering an illuminated manuscript, as opposed to the printing press of standard chip fabrication — but a great way of creating final test systems that work exactly as the finished design.

Our e-beam litho prototypes came back from the fab, we plugged them in, held our breath, turned on the PCs and loaded the software. There's absolutely nothing like that moment: months of past work and an entire future hang on it.

It worked just fine. All we had to do then was send the netlist to a company that made proper Asics (Application Specific Integrated Circuits). These are made in large numbers very cheaply; they cost a lot more to set up than e-beam litho, but when that's done you can churn them out like so many sausages. We knew the circuit worked; the Asic was just another way to build something we'd now tested in many different ways.

And at first, all went to plan. The chips were made, the network boards produced, software finished (well, I say finished...), the product launched and we started to take the punters' money.

Then reports started to come in from the field that there was an uncommon but far too frequent failure mode where PCs locked up solid in mid-network transaction. We were still a small company with very limited resources: it doesn't matter how smart you are, once things start going wrong you can only do so much firefighting. But time is tight: it's at this point that you learn by heart the number of every local late-night fast food delivery service.

At first, we couldn't even replicate the problem; everything ran fine in the lab. It transpired after a while that certain kinds of PC were more vulnerable than others, so we collected examples. The next problem was finding a way of making the error happen repeatedly and often enough for us to investigate it. That took a while: our collection of Sancho's pizza boxes grew to mountainous proportions before we had a sequence of network transactions that could crash the bleeder on command. There didn't seem to be anything special about that sequence, but at least we could hook up our rather meagre collection of test equipment and start gathering real data.

It's worth remembering what the state of PC hardware was in the late 1980s, when the 8086 and 80286 ran the show and the 80386 was just coming onto the market. There were hundreds of different brands, many of them with custom motherboards, each trying with more or less success to emulate the IBM PC standard. Compatibility was a big issue: most (but by no means all) clones worked well out of the box. What happened when you plugged in an expansion card was a different matter.

The original IBM PC design was remarkable for a largely forgotten fact: hardware and software, it was open source. PC-DOS wasn't: that was Microsoft's. But a listing of the Bios and all the circuit diagrams were available...

Talkback

You could be excused, since yours was a small company without the kind of resources that Microsoft has. Also, these were the early days of computing, when most people would've been happy just to have a computer. Nowadays, we expect seamless performance!

Microsoft's mistake in this case is applying the same approach to hardware that they've used in software for years.
What I mean by that is that in many cases Microsoft products come with their fair share of bugs, and the company spends the first post-release year sending out patches, service packs and so on. On this front, delivery is easy and cost-free: your computer simply downloads them.

You can't remotely fix hardware, and Microsoft are learning this the hard way. Next time they'll have to spend more time testing because, unlike software, with hardware you have to get it right the first time!

harpless 16 June, 2008 16:51
Give this man a pay-rise

rimbaud 17 June, 2008 10:53