Sunday, April 18, 2004
Paying Too Much for a Bad Machine
By Chris Mechels
Guest Commentary
In a March 15 column in the Journal North, John Morrison, leader of Los Alamos National Laboratory's computer division, argued that a recent story raising concerns about the performance of the ASCI Q supercomputer was off the mark, and that Q was really quite successful in many ways. He claimed that "The Q system is delivering everything it's been asked to do," and that outside experts "endorsed Q's accuracy, performance and reliability." Unfortunately, Mr. Morrison relied on argument and hyperbole, not facts, and he did not provide any data or cite any documents that would allow us to check his claims. I have recent information that challenges many of them.
First, let's examine Morrison's claim that Q's hardware problems are "routine in every experimental, high-performance computer system." A commonly used figure of merit for hardware reliability for large supercomputers such as the 20-teraflop Q is 50 hours mean time to failure. (One teraflop equals one trillion floating-point operations per second.) In fact, the upcoming Sandia 40-teraflop "Red Storm" computer, scheduled for 2004 delivery, has such a requirement. So does the LANL contract for Q, which I have. The 2003 figure for Q is nowhere near 50 hours mean time to failure; in fact it is 3.82 hours. This information is from LANL's own records, obtained through a public records request. This incredibly bad reliability is driven by processor failures, which average 107 per month. To put this in perspective, the LANL Blue Mountain machine, Q's predecessor, has more processors than Q but averaged about one processor failure every other month. Morrison claims that this extreme failure rate is due to the large number of Q processors and "Los Alamos' high altitude." This is simply not true. The Pittsburgh Supercomputing Center machine uses the same processor as Q and has a mean time to failure of 11 hours, showing the same order of reliability problems as Q. Pittsburgh is not at a high altitude.
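To make the comparison concrete, the figures quoted above can be set side by side. What follows is only an illustrative sketch: the function names are hypothetical, and it assumes an average month of about 730 hours, a figure that does not come from the LANL records.

```python
# Figures quoted in this column.
REQUIRED_MTTF_HOURS = 50.0            # common figure of merit, also in the Q contract
Q_MTTF_HOURS = 3.82                   # Q's 2003 mean time to failure
PITTSBURGH_MTTF_HOURS = 11.0          # Pittsburgh Supercomputing Center machine
Q_PROCESSOR_FAILURES_PER_MONTH = 107  # Q's average processor failures per month

# Assumption (not from the column): an average month is 365 * 24 / 12 hours.
HOURS_PER_MONTH = 730.0


def shortfall_factor(required_hours, actual_hours):
    """How many times shorter the actual mean time to failure is than required."""
    return required_hours / actual_hours


def hours_between_failures(failures_per_month, hours_per_month=HOURS_PER_MONTH):
    """Average number of hours separating failures at a given monthly rate."""
    return hours_per_month / failures_per_month


if __name__ == "__main__":
    q_factor = shortfall_factor(REQUIRED_MTTF_HOURS, Q_MTTF_HOURS)
    psc_factor = shortfall_factor(REQUIRED_MTTF_HOURS, PITTSBURGH_MTTF_HOURS)
    gap = hours_between_failures(Q_PROCESSOR_FAILURES_PER_MONTH)
    print(f"Q falls short of the 50-hour figure by a factor of {q_factor:.1f}")
    print(f"Pittsburgh falls short by a factor of {psc_factor:.1f}")
    print(f"Q processor failures alone arrive about every {gap:.1f} hours")
```

On these numbers Q misses the 50-hour figure by roughly a factor of 13, and processor failures alone would interrupt the machine about every seven hours.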
Compensation costs
The failures are due to a Compaq design decision, which Morrison mentions. LANL knew, very early on, that Q would be unreliable because of the design but elected to proceed with the procurement. While one can compensate for this bad design, the compensation itself costs performance and is not foolproof. The result is that computing tasks are lost to failures, and some large tasks become impractical because of the unreliability. For a fuller picture of Q's problems, see http://www.cs.sandia.gov/SOS7/presentations/morrison.ppt, a presentation by Morrison himself. It contains a frank discussion of Q's problems, quite unlike Morrison's comments in the Journal.
Let's look at the claim that "as the home of the first Cray machines," LANL has experience in dealing with the problems of "essentially unique machines." I was the Cray employee most responsible for the first Cray at LANL in 1976, before I joined LANL, and I observed that for the first four years the machine was essentially useless for its intended purposes because of poor or unavailable software. It was sort of like trying to use your PC without Windows. It was not until LANL's sister lab, Lawrence Livermore, got its first Cray in 1980 that the LANL machine became useful, because we brought in software from Livermore. So the LANL expertise, which Morrison lauded, did not measure up. Livermore saved the day...
Not a unique machine
As to Morrison's claim that problems are "inevitable in machines built to run applications of unprecedented size and complexity," this does not apply to Q. First, Q is not unique. Two previous installations (at facilities here and in France) exposed the problems of this machine before LANL's was installed. There was time to change horses, but LANL chose not to. Pacific Northwest Laboratory took delivery of a supercomputer from the same vendor, Hewlett-Packard, but with a different processor type, and it seems to work quite well. The Livermore supercomputers, from IBM, and the Sandia "Red" supercomputer, from Intel, do not have the reliability problems exhibited by Q.
The JASON report, which Morrison claims gave Q glowing reviews, did no such thing. Go look for yourself at http://www.fas.org/irp/agency/dod/jason/asci.pdf. In fact, one of the JASON members, in an interview, thought it remarkable that LANL was getting useful work from Q "given its poor reliability." Section 5.5.3 of the report is critical of another serious LANL shortcoming, in software development: "There was a striking difference between the high quality of software engineering at Sandia as compared to Los Alamos National Laboratory" and "at Los Alamos in particular, better ways ... must be found." Los Alamos has long resisted modern software development practices, such as software engineering and software configuration management.
It has always been risky to accept LANL claims about supercomputing successes, and this continues with Q. For four decades, LANL has made claims that cannot be supported by data. The U.S. taxpayers did not get what they paid for in Q; quite the contrary. LANL paid $168 million for a very unreliable machine, delivered over one year late... Almost immediately after completing the Q acquisition, in the summer of 2003, LANL bought another supercomputer, an 11-teraflop machine, for $10 million. This machine is also much more reliable and usable. I conclude that LANL spent $150 million more than necessary for a machine of Q's capability, and got a very unreliable machine in the bargain. We can hope that LANL, and Morrison, learned from the experience, but their history offers little encouragement. We are left to wonder how much money they will waste on the next supercomputer.
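One way to check the $150 million figure is to price Q's 20 teraflops at the rate paid for the second machine. The sketch below is only that, a sketch: it assumes cost scales linearly with teraflops, which the column does not state, and the function names are hypothetical.

```python
# Figures quoted in this column.
Q_COST_MILLIONS = 168.0
Q_TERAFLOPS = 20.0
SECOND_MACHINE_COST_MILLIONS = 10.0
SECOND_MACHINE_TERAFLOPS = 11.0


def cost_per_teraflop(cost_millions, teraflops):
    """Price in millions of dollars for each teraflop of capability."""
    return cost_millions / teraflops


def excess_cost(q_cost, q_teraflops, ref_cost, ref_teraflops):
    """Amount paid for Q beyond its capability priced at the reference rate.

    Assumption: price scales linearly with teraflops.
    """
    return q_cost - q_teraflops * cost_per_teraflop(ref_cost, ref_teraflops)


if __name__ == "__main__":
    q_rate = cost_per_teraflop(Q_COST_MILLIONS, Q_TERAFLOPS)
    ref_rate = cost_per_teraflop(SECOND_MACHINE_COST_MILLIONS,
                                 SECOND_MACHINE_TERAFLOPS)
    extra = excess_cost(Q_COST_MILLIONS, Q_TERAFLOPS,
                        SECOND_MACHINE_COST_MILLIONS, SECOND_MACHINE_TERAFLOPS)
    print(f"Q: ${q_rate:.2f} million per teraflop")
    print(f"Second machine: ${ref_rate:.2f} million per teraflop")
    print(f"Excess paid for Q at that rate: ${extra:.0f} million")
```

Under that linear assumption, Q cost $8.40 million per teraflop against about $0.91 million for the second machine, and the difference over 20 teraflops comes to roughly $150 million.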
Mechels is a retired Los Alamos National Laboratory employee.