A slow neutron beats a flipping fast bit - Albuquerque Journal

Once every minute and for no good reason, a bit flips in a supercomputer at Los Alamos National Laboratory, causing an error. All of a sudden, say, 1 + 1 = 3.

Uh-oh.

Bits are the basic currency of all digital information. They come in two flavors, zeroes and ones. As a computer does its work, bits are called from disk storage, zip through processors and park temporarily in memory. When a bit randomly flips from 0 to 1, or from 1 to 0, it might alter a calculation or hide a piece of information. Computer engineers call it a single-event upset or a fault.
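To make the opening example concrete, here is a minimal Python sketch of how a single flipped bit turns 1 + 1 into 3. The helper name `flip_bit` is made up for this illustration. It imitates an upset in software and is not how the Los Alamos team models faults.

```python
def flip_bit(value: int, position: int) -> int:
    """Return `value` with the bit at `position` inverted (0 = least significant bit)."""
    return value ^ (1 << position)


correct = 1 + 1                # 2, which is 10 in binary
upset = flip_bit(correct, 0)   # the lowest bit flips from 0 to 1, giving 11 in binary

print(f"correct: {correct} ({correct:02b})")
print(f"after one flipped bit: {upset} ({upset:02b})")
```

Flipping the same bit a second time restores the original value, which is why finding the flipped bit is enough to repair it.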

These upsets are tripping computers of all sizes more frequently, not just at the lab, but in the broader computing world, too. For Los Alamos, with a dozen-plus supercomputers running jobs vital to national security and other important science missions, single-event upsets are a fact of life because of the density of components.

A fault can play out a few ways. Sometimes nothing happens – the hardware corrects itself. Other times, the program or even the entire system crashes in a detectable event. It’s like a flat tire, wasting lots of time as system administrators restore programs and data. In the worst case, an upset goes unnoticed. A scientific calculation might come back with the wrong answer, but nobody knows. That’s rare, but it happens.
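The article does not say how the hardware corrects itself. One widely used approach in computer memory is an error-correcting code, which stores a few extra check bits alongside the data. The sketch below uses the textbook Hamming(7,4) code as an assumption for illustration, not a description of the Los Alamos machines. The function names are hypothetical.

```python
def encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword (bit positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # checks positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # checks positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # checks positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def correct_and_decode(codeword):
    """Repair at most one flipped bit, then return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4   # 0 means no error, otherwise the bad position
    if syndrome:
        c[syndrome - 1] ^= 1          # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]


word = encode([1, 0, 1, 1])
word[5] ^= 1                          # simulate a single-event upset in one bit
print(correct_and_decode(word))       # the original data bits are recovered
```

A code like this repairs one flipped bit per codeword. Two flips in the same codeword would be repaired incorrectly, which is one way a fault can slip through unnoticed.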

Upsets can result from excessive heat, a voltage spike, or – get ready for it – particles from outer space. Those particles are neutrons, and a team of physicists, space scientists and computer engineers at Los Alamos is researching just how much trouble they cause. Are cosmic-ray neutrons a major culprit or a minor irritant? Understanding that will help the team create strategies for best managing the upsets.

Normally, protons and neutrons stick together to form the nucleus at the center of an atom. The trouble starts when high-energy cosmic rays – mostly protons from remote cosmic cataclysms – knock neutrons and other particles loose from atoms in the atmosphere. Every hour, eighty-some cosmic-ray neutrons strike a surface the size of a computer’s central processing unit, or CPU.

Most neutrons miss the nuclei of atoms in a CPU and pass right through. Eventually, though, a neutron hits a nucleus. If it’s a very high-energy, or fast, neutron, it bounces the struck nucleus right out of its home in the silicon chip. An upset occurs, corrupting data. Upsets are more likely to happen in supercomputers because they are densely packed with tens of thousands of CPUs – that’s what makes them super.
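A rough back-of-the-envelope calculation shows why scale matters. The per-CPU rate comes from the "eighty-some" figure above. The CPU count of 20,000 is a hypothetical stand-in for "tens of thousands", not a number from the lab.

```python
NEUTRONS_PER_CPU_PER_HOUR = 80   # "eighty-some" per hour, rounded down
ASSUMED_CPU_COUNT = 20_000       # hypothetical; the article says only "tens of thousands"


def strikes_per_hour(cpu_count: int, rate_per_cpu: float = NEUTRONS_PER_CPU_PER_HOUR) -> float:
    """Total cosmic-ray neutrons crossing all CPU surfaces in one hour."""
    return cpu_count * rate_per_cpu


total = strikes_per_hour(ASSUMED_CPU_COUNT)
print(f"{total:,.0f} neutrons per hour across the machine")
print(f"about {total / 3600:,.0f} per second")
```

This counts neutrons crossing the chips, not upsets. Most of those neutrons pass through without hitting a nucleus, so the number of actual faults is far smaller.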

As the backbone of the nation’s Stockpile Stewardship Program, the largest Los Alamos supercomputers hum away night and day in a data center the size of a football field. Their primary job is running the physics simulations that help assure the nuclear stockpile is a safe, secure and effective deterrent in the absence of nuclear testing, but they also run jobs for a wide range of other scientific disciplines. Engineers track the errors. Often when an upset strikes during the billions of calculations happening every second, complex engineering in the hardware and software corrals the problem. But that can mean lost time and diminished productivity.

As the Los Alamos team began studying single-event upsets, they saw more faults at the top of the vertically stacked component racks than at the bottom. They wondered, was it because the computers are cooled from the bottom to the top? Or are the top racks exposed to more fast neutrons while also, in effect, shielding the lower racks? Is it both?

The Lab team approached the problem from a few angles. They needed to measure and benchmark error rates caused by fast neutrons, so they bombarded the computer parts in the neutron beam at the Los Alamos Neutron Science Center (LANSCE). Another part of the team purchased neutron detectors the size of a one-liter soda bottle and will soon start using them in the supercomputing center to measure the background fast-neutron rates. Others on the team are applying a computer code developed for modeling nuclear physics to study how cosmic rays interact with computers and buildings, which will help understand how the neutrons from outer space interact with the supercomputing center and the computers in it.

Using data from the detectors, the team will compare the number of neutrons hitting the computers to the number of faults in the componentry. Information from the LANSCE tests will tell the team how many of those faults were likely caused by fast neutrons. Having a more complete picture of the neutrons and their impact will support developing new ways of detecting the faults, blocking them and cleaning up the computer systems afterwards.

Keeping the Los Alamos supercomputers running at peak efficiency directly supports national security. These single-event upsets will be everyone’s problem someday as miniaturization increases the density of CPUs and memory chips. The upward-trending curve of errors gets steeper as computers become more widespread in our phones, tablets, smart-house systems, airplanes, cars, all the controllers in the internet of things – the list seems endless. Gaining a better understanding of this neutron bombardment from space is a strong first step toward keeping our digital world humming along.

Suzanne Nowicki is a nuclear physicist in the Space Science and Applications group at Los Alamos who designs instruments for spacecraft. Nathan DeBardeleben is a computer engineer in High Performance Computing-Design who studies resilience and radiation effects in high-performance computing systems.

 

