Using 1 trillion files helps scientist find a needle in a haystack

A Los Alamos National Laboratory scientist struggling with a stubborn research problem needed some help: his simulations had generated a sea of data, but searching it took so long that he couldn’t find the information he needed. He found himself looking for the proverbial needle in a haystack. At the same time, the lab’s storage research team had been hard at work on another classic big data problem: creating massive numbers of files as quickly as possible. The day the team met with the scientist, you could say that Big Science and Big Data put their heads together – and now they’re making history.

A modern laptop computer typically has just short of a million files in all of its folders, and the Trinity supercomputer can create a million files in about 5 seconds. But in extreme scale simulation, researchers often deal with quantities far beyond a million. This physicist’s simulation needed to generate a trillion particles – a million times a million – and then follow the trajectories of only a few of them. Imagine you’re standing in the Sahara looking at trillions of grains of sand around your feet. Your challenge is to locate just one of them, then track its every movement as a dust devil whips through.

If the lab scientist tried to create a file for each of those trillion particles using Trinity, it would take 57 days just to create the files in that folder – and the supercomputer wouldn’t be doing anything else during that time. Trinity is too important to the lab’s stockpile stewardship mission to simply perform this one task for 57 days. A typical day in the life of Trinity supports multiple scientists, each pursuing important research projects in materials science, plasma physics, fluid dynamics – you name it. There had to be a better solution.
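For readers who want to check the 57-day figure, it follows from the rate quoted earlier: about a million files every 5 seconds. The short Python sketch below redoes that arithmetic. It assumes the rate stays constant over the whole run and uses the article's rounded numbers, so it is a back-of-the-envelope estimate rather than a measurement; the function and variable names are illustrative only.

```python
# Rounded figures from the article (assumed constant over the whole run).
FILES_PER_BATCH = 1_000_000          # Trinity creates about a million files...
SECONDS_PER_BATCH = 5                # ...in about 5 seconds
TARGET_FILES = 1_000_000_000_000     # one file per particle, a trillion particles

SECONDS_PER_DAY = 24 * 60 * 60


def days_to_create(target_files, files_per_batch, seconds_per_batch):
    """Estimate how many days it takes to create target_files at a steady rate."""
    files_per_second = files_per_batch / seconds_per_batch
    total_seconds = target_files / files_per_second
    return total_seconds / SECONDS_PER_DAY


days = days_to_create(TARGET_FILES, FILES_PER_BATCH, SECONDS_PER_BATCH)
print(f"{days:.1f} days")  # a little under 58 days
```

At 200,000 files a second, a trillion files takes 5 million seconds, which is where the roughly 57 days comes from.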

The Ultrascale Systems Research Center in the lab’s High Performance Computing Division is tasked with realizing the next generation of supercomputing. With efforts in storage research, novel computer architectures and extreme scale platform management, the center is uniquely positioned to tackle these seemingly impossible computing challenges. In particular, a collaboration between the center and Carnegie Mellon University had developed an experimental file system designed to support unprecedented numbers of files and folders. It wasn’t obvious it would work, but it seemed like a chance worth taking.

In February and March this year, the scientist began using the experimental file system to track particles on Trinity. It had been a long journey with many obstacles to overcome, but success was finally in sight. Still, it was not until May 2018 that Trinity churned out a trillion files in about two minutes for the first time. That staggering rate translates to about 7 billion files a second, approximately 20,000 times faster than running on Trinity without the new file system. Days later, the pace had jumped to two trillion files in two and a half minutes. The team never set out to create a trillion files. They simply wanted to improve data management for scientists. But when they looked down and saw the trillion files, they felt a brief moment of satisfaction: High-performance computing at Los Alamos continues to lead the way on extreme scale science.

In the high-performance computing universe, speed and efficiency in handling mind-boggling amounts of information are everything. Supercomputers enable previously impossible science, turning lifetimes of data-gathering into minutes. With new tools, scientists can manage ever-growing data streams faster and more efficiently than ever before. And the future of research depends on it.

Next challenge? So-called exascale computing. Running 50 times faster than today’s fastest supercomputers, exascale machines will help scientists simulate complex natural and engineered systems that range from the atomic to the cosmic. That research will include grand challenges in biology, astrophysics, materials and earth systems. Projects like the trillion-file effort are steps toward the exascale goal, which is almost here. The U.S. Department of Energy’s Exascale Computing Project plans to have the next superfast generation of computers running by 2021.

Stay tuned.

Bradley Wade Settlemyer is a systems programmer and leads the storage systems research efforts at Los Alamos National Laboratory’s Ultrascale Systems Research Center. The team that enabled the trillion-file milestone includes Settlemyer and Gary Grider from Los Alamos, and collaborators from Carnegie Mellon University. This article was provided by LANL.

