Once every minute and for no good reason, a bit flips in a supercomputer at Los Alamos National Laboratory, causing an error. All of a sudden, say, 1 + 1 = 3.
Bits are the basic currency of all digital information. They come in two flavors, zeroes and ones. As a computer does its work, bits are called up from disk storage, zip through processors and park temporarily in memory. When a bit randomly jumps from 0 to 1, it might alter a calculation or hide a piece of information. Computer engineers call it a single-event upset or a fault.
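A toy sketch makes the mechanism concrete: a single-event upset behaves like XOR-ing one bit of a stored value. Here the correct answer to 1 + 1, the binary value 0b10, has its lowest bit flipped and becomes 3 (the `flip_bit` helper is illustrative, not how any particular hardware models faults).

```python
def flip_bit(value: int, position: int) -> int:
    """Return `value` with the bit at `position` inverted, as a stray neutron might."""
    return value ^ (1 << position)

total = 1 + 1                       # the correct answer, 2, stored as binary 0b10
corrupted = flip_bit(total, 0)      # a single-event upset flips the lowest bit
print(bin(total), "->", bin(corrupted))  # 0b10 -> 0b11
print(total, "->", corrupted)            # 2 -> 3
```

One flipped bit is enough to change the answer of an otherwise flawless calculation, which is exactly what makes these faults so insidious.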
These upsets are tripping computers of all sizes more frequently, not just at the lab, but in the broader computing world, too. For Los Alamos, with a dozen-plus supercomputers running jobs vital to national security and other important science missions, single-event upsets are a fact of life because of the density of components.
A fault can play out a few ways. Sometimes nothing happens – the hardware corrects itself. Other times, the program or even the entire system crashes in a detectable event. It’s like a flat tire, wasting lots of time as system administrators restore programs and data. In the worst case, an upset goes unnoticed. A scientific calculation might come back with the wrong answer, but nobody knows. That’s rare, but it happens.
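The difference between a detectable fault and a silent one can be sketched with the simplest form of error detection, an even-parity bit (real memory systems use stronger error-correcting codes; this is a minimal illustration, not the hardware's actual scheme). A single flipped bit changes the parity and is caught; two flips cancel out and slip past unnoticed, which is how the worst case arises.

```python
def parity(bits: int) -> int:
    """Even-parity bit: 1 if `bits` contains an odd number of ones, else 0."""
    return bin(bits).count("1") % 2

word = 0b1011_0010          # an eight-bit word as written to memory
stored_parity = parity(word)

# A single-event upset flips one bit...
one_flip = word ^ (1 << 4)
print(parity(one_flip) != stored_parity)   # True: mismatch, fault detected

# ...but two flips restore the parity, and the corruption goes unnoticed.
two_flips = one_flip ^ (1 << 1)
print(parity(two_flips) != stored_parity)  # False: silent corruption
```

This is why engineers care about the upset rate and not just upset detection: the more flips occur, the more often a rare undetectable combination sneaks through.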
Upsets can result from excessive heat, a voltage spike, or – get ready for it – particles from outer space. Those particles are neutrons, and a team of physicists, space scientists and computer engineers at Los Alamos is researching just how much trouble they cause. Are cosmic-ray neutrons a major culprit or a minor irritant? Understanding that will help the team create strategies for best managing the upsets.
Normally, protons and neutrons stick together to form the nucleus at the center of an atom. The trouble starts when high-energy cosmic rays – mostly protons from remote cosmic cataclysms – knock neutrons and other particles loose from atoms in the atmosphere. Every hour, eighty-some cosmic-ray neutrons strike a surface the size of a computer’s central processing unit, or CPU.
Most neutrons miss the nuclei of atoms in a CPU and pass right through. Eventually, though, a neutron hits a nucleus. If it’s a very high-energy, or fast, neutron, it bounces the struck nucleus right out of its home in the silicon chip. An upset occurs, corrupting data. Upsets are more likely to happen in supercomputers because they are densely packed with tens of thousands of CPUs – that’s what makes them super.
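A back-of-envelope calculation shows how the numbers add up. The neutron rate comes from the article; the CPU count and the hit probability below are illustrative assumptions, with the hit fraction chosen simply so that the result lands near the once-a-minute rate mentioned at the start.

```python
# Figures: ~80 neutron strikes per CPU-sized area per hour (from the article);
# the CPU count and hit fraction are assumed, illustrative values.
neutrons_per_cpu_per_hour = 80
cpus_in_supercomputer = 20_000      # "tens of thousands" -- assumed value

strikes_per_hour = neutrons_per_cpu_per_hour * cpus_in_supercomputer
print(f"{strikes_per_hour:,} neutron strikes per machine-hour")  # 1,600,000

# Most neutrons pass straight through. If, hypothetically, 1 strike in
# 25,000 actually hit a nucleus and caused an upset:
upsets_per_hour = strikes_per_hour / 25_000
print(f"~{upsets_per_hour:.0f} upsets per hour")  # ~64, roughly one a minute
```

Even a vanishingly small per-strike probability, multiplied across tens of thousands of chips running around the clock, yields a steady drumbeat of faults.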
As the backbone of the nation’s Stockpile Stewardship Program, the largest Los Alamos supercomputers hum away night and day in a data center the size of a football field. Their primary job is running the physics simulations related to assuring the nuclear stockpile is a safe, secure and effective deterrent in the absence of nuclear testing, but they also run jobs for a wide range of other scientific disciplines. Engineers track the errors. Often when an upset strikes during the billions of calculations happening every second, complex engineering in the hardware and software corrals the problem. But that can mean lost time and diminished productivity.
As the Los Alamos team began studying single-event upsets, they saw more faults at the top of the vertically stacked component racks than at the bottom. They wondered, was it because the computers are cooled from the bottom to the top? Or are the top racks exposed to more fast neutrons while also, in effect, shielding the lower racks? Is it both?
The Lab team approached the problem from a few angles. To measure and benchmark error rates caused by fast neutrons, they bombarded computer parts in the neutron beam at the Los Alamos Neutron Science Center (LANSCE). Another part of the team purchased neutron detectors the size of one-liter soda bottles and will soon deploy them in the supercomputing center to measure the background fast-neutron rates. Others on the team are applying a computer code developed for modeling nuclear physics to simulate how cosmic-ray neutrons interact with the supercomputing center building and the computers inside it.
Using data from the detectors, the team will compare the number of neutrons hitting the computers to the number of faults in the componentry. Information from the LANSCE tests will tell the team how many of those faults were likely caused by fast neutrons. Having a more complete picture of the neutrons and their impact will support developing new ways of detecting the faults, blocking them and cleaning up the computer systems afterwards.
Keeping the Los Alamos supercomputers running at peak efficiency directly supports national security. Someday, single-event upsets will be everyone’s problem, as miniaturization packs CPUs and memory chips ever more densely. The upward-trending curve of errors gets steeper as computers spread into our phones, tablets, smart-house systems, airplanes, cars and all the controllers in the internet of things – the list seems endless. A better understanding of this neutron bombardment from space is a strong first step toward keeping our digital world humming along.
Suzanne Nowicki is a nuclear physicist in the Space Science and Applications group at Los Alamos who designs instruments for spacecraft. Nathan DeBardeleben is a computer engineer in High Performance Computing-Design who studies resilience and radiation effects in high-performance computing systems.