Blue Gene/L is an exercise in powers of two, starting with each of the 65,536 compute nodes. Each of the two processors on a compute node has two "floating point units," engines for performing mathematical calculations. Each node's chip is 121 square millimetres and built on a manufacturing process with 130-nanometre features, Pulleyblank said. That compares with 267 square millimetres for IBM's current flagship processor, the Power4+ used in its top-end Unix servers. The small size of Blue Gene's chips is crucial: it keeps them from emitting so much waste heat that engineers could not pack them densely enough.
Two nodes are mounted onto a module; 16 modules fit into a chassis; and 32 chassis are mounted into a rack. A total of 64 racks will be installed at the Livermore lab by the end of 2004, with the first 512-node half-rack prototype to be built this autumn at IBM's Thomas J. Watson Research Center. "We're going to have first hardware this year. We are actually fabricating chips for this machine," Pulleyblank said.
All nodes are created equal, but 1,024 of them will have a more important task than the rest, Pulleyblank said. These so-called input-output, or I/O, nodes will each run an instance of Linux and assign calculations to a stable of 64 compute nodes. These underling nodes won't run Linux, but instead a custom operating system stripped to its bare essentials, he said. When they have to perform a task they're not equipped to handle, they can pass the job up the pecking order to one of the I/O nodes. "It will look like it has 1,024 I/O nodes, each of which manages a gang of 64 compute nodes," Pulleyblank said.
Running Linux, a move made possible by using the comparatively ordinary 440GX processor, was crucial to making the system useful. "It was absolutely clear by making it run Linux, we were opening it up to a broad range of applications we couldn't get otherwise," Pulleyblank said.
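The packaging and I/O figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the constant and function names are made up for this example, and the numbers are the ones Pulleyblank gave, not values taken from any IBM specification.

```python
# Packaging figures as quoted in the article.
NODES_PER_MODULE = 2
MODULES_PER_CHASSIS = 16
CHASSIS_PER_RACK = 32
RACKS = 64

# I/O hierarchy as quoted in the article.
IO_NODES = 1024
COMPUTE_NODES_PER_IO_NODE = 64


def nodes_per_rack() -> int:
    """Nodes in one full rack: modules and chassis multiplied out."""
    return NODES_PER_MODULE * MODULES_PER_CHASSIS * CHASSIS_PER_RACK


def total_nodes() -> int:
    """Nodes in the full 64-rack installation."""
    return nodes_per_rack() * RACKS


def is_power_of_two(n: int) -> bool:
    """True when n is a positive integer with exactly one bit set."""
    return n > 0 and n & (n - 1) == 0


if __name__ == "__main__":
    print("nodes per rack:", nodes_per_rack())
    print("nodes per half rack:", nodes_per_rack() // 2)
    print("total nodes:", total_nodes())
    print("nodes managed via I/O nodes:", IO_NODES * COMPUTE_NODES_PER_IO_NODE)
    print("total is a power of two:", is_power_of_two(total_nodes()))
```

Multiplying the levels out gives 1,024 nodes per rack, which agrees with the 512-node half-rack prototype, and 64 racks give the 65,536 nodes cited at the start. The same total falls out of 1,024 I/O nodes each managing 64 compute nodes.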
Of the two processors on each node, one will be devoted to number-crunching and the other to communicating with the rest of the system. In this configuration, the system should be able to perform at a rate of 180 teraflops, or 180 trillion calculations per second. In some cases where minimal communication between nodes is required, both processors of each node can concentrate on maths, bringing the system performance to 360 teraflops, Pulleyblank said.
Communication among the nodes is a challenge IBM tackled by employing two primary networks. The first network is a mesh that connects each node to every other one, with a message travelling from one node to another having to hop across a maximum of 64 nodes in between. The second network is a branching tree structure that can quickly deliver messages to the entire collection of nodes or gather information from them. When a message needs to be sent, "we automatically decide the better way to route it," Pulleyblank said. "Also interesting is that if one network fails, we can still completely run with the other network, but slower."
In addition, a third network uses conventional 1-gigabit-per-second Ethernet technology. There are also two management networks, one to help boot nodes and one to monitor and control them.
Blue Gene has some unusual features, but IBM has tried as much as possible to anchor the system to more mainstream technology. Staying on the beaten path is the best way to take advantage of technology that's improving fastest, Pulleyblank said, and it also makes it easier to create products out of the Blue Gene research. "Our direction has been as much as possible to exploit these standard components," he said.
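As a rough check on the performance figures, the quoted 180 teraflops can be divided across the 65,536 nodes to see what it implies per processor, and then doubled for the case where both processors compute. This is a back-of-the-envelope sketch only: the function names are hypothetical, and the per-processor rate is derived from the article's totals rather than from any published clock speed.

```python
TOTAL_NODES = 65_536           # node count quoted in the article
QUOTED_ONE_CPU_TFLOPS = 180.0  # one processor per node doing maths
TERA = 1e12
GIGA = 1e9


def per_processor_gflops(system_tflops: float, nodes: int,
                         compute_procs_per_node: int) -> float:
    """Per-processor rate implied by a system-wide teraflops figure."""
    total_flops = system_tflops * TERA
    return total_flops / (nodes * compute_procs_per_node) / GIGA


def system_tflops(per_proc_gflops: float, nodes: int,
                  compute_procs_per_node: int) -> float:
    """System-wide rate when each node devotes this many processors to maths."""
    return per_proc_gflops * GIGA * nodes * compute_procs_per_node / TERA


if __name__ == "__main__":
    per_proc = per_processor_gflops(QUOTED_ONE_CPU_TFLOPS, TOTAL_NODES, 1)
    print(f"implied per-processor rate: {per_proc:.2f} gigaflops")
    print(f"one processor per node:  {system_tflops(per_proc, TOTAL_NODES, 1):.0f} teraflops")
    print(f"two processors per node: {system_tflops(per_proc, TOTAL_NODES, 2):.0f} teraflops")
```

Under these assumptions each processor contributes roughly 2.75 gigaflops, and putting both processors on every node to work doubles the total to the 360 teraflops Pulleyblank cited.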





