At the U.S. Department of Energy's Thomas Jefferson National Accelerator Facility (Jefferson Lab), data scientists and developers are testing the latest artificial intelligence (AI) techniques to make high-performance computing more reliable and cost-efficient.
Their focus? Training artificial neural networks to monitor and predict the performance of scientific computing clusters—massive systems where enormous amounts of data are constantly being processed.
The objective is clear: help system administrators detect and resolve problematic computing jobs faster, minimizing downtime for scientists who rely on these systems to analyze experimental data, Newswise reports.
But instead of a typical software deployment, this effort takes on the flair of a competition. These machine learning (ML) models are put through their paces in a head-to-head challenge to determine which one best adapts to the ever-changing demands of experimental datasets.
However, unlike America’s Next Top Model and its international spin-offs, this contest doesn’t take an entire season to declare a winner. Here, a new “champion model” is selected every 24 hours based on its ability to learn and adapt to the latest data.
“We’re trying to understand characteristics of our computing clusters that we haven’t seen before,” said Bryan Hess, scientific computing operations manager at Jefferson Lab and one of the study’s lead investigators. “It’s looking at the data center in a more holistic way, and going forward, that’s going to be some kind of AI or ML model.”
While these AI models may not be gracing magazine covers anytime soon, the project has gained recognition in the research community. It was recently featured in IEEE Software, a peer-reviewed journal, as part of a special issue on machine learning operations (MLOps) in data centers.
AI Meets Big Science
Large-scale scientific instruments—such as particle accelerators, light sources, and radio telescopes—are essential for groundbreaking discoveries. Facilities like Jefferson Lab’s Continuous Electron Beam Accelerator Facility (CEBAF), a DOE Office of Science User Facility, serve a worldwide community of over 1,650 nuclear physicists.
At Jefferson Lab, experimental detectors capture subtle traces of tiny particles produced by CEBAF’s electron beams. Since the accelerator operates 24/7, it generates an enormous amount of data—tens of petabytes per year, enough to fill an average laptop’s hard drive every minute.
Processing these vast amounts of information requires high-throughput computing clusters, which run specialized software tailored to each experiment.
With so many complex jobs running simultaneously, failures are inevitable. Some computing tasks or hardware issues can cause anomalies, such as fragmented memory or input/output (I/O) bottlenecks, which can delay scientists’ ability to process and analyze their data.
“When compute clusters get bigger, it becomes tough for system administrators to keep track of all the components that might go bad,” explained Ahmed Hossam Mohammed, a postdoctoral researcher at Jefferson Lab and an investigator in the study. “We wanted to automate this process with a model that flashes a red light whenever something weird happens.”
“That way, system administrators can take action before conditions deteriorate even further.”
DIDACT: AI for Data Center Management
To address these challenges, the team developed a machine learning-based management system called DIDACT (Digital Data Center Twin). The name is a clever play on didactic, a term meaning something designed to teach—in this case, AI models learning about computing systems.
Funded by Jefferson Lab’s Laboratory Directed Research & Development (LDRD) program, DIDACT aims to detect system anomalies and pinpoint their causes using an AI technique called continual learning.
In continual learning, ML models are trained on incoming data in an incremental fashion—similar to how humans and animals learn over time. The DIDACT team applies this method by training multiple models, each capturing different system behaviors, and selecting the best-performing one based on the latest data.
These models are built using unsupervised neural networks known as autoencoders. One of them integrates a graph neural network (GNN), which examines relationships between various system components.
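To make the idea concrete, here is a minimal sketch in Python (using PyTorch) of how an unsupervised autoencoder might be updated incrementally on each day's cluster metrics and then used to flag unusual readings by their reconstruction error. The layer sizes, synthetic data, and threshold are illustrative assumptions, not DIDACT's actual configuration, and the graph neural network variant is not shown.

```python
import torch
import torch.nn as nn

# Tiny autoencoder standing in for one of the competing models. The eight input
# features are placeholders for per-node metrics (CPU load, memory use, I/O rates).
model = nn.Sequential(nn.Linear(8, 3), nn.ReLU(), nn.Linear(3, 8))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def update(model, batch):
    """Continual-learning step: fine-tune on the newest window of data
    instead of retraining from scratch."""
    model.train()
    optimizer.zero_grad()
    loss = ((model(batch) - batch) ** 2).mean()
    loss.backward()
    optimizer.step()
    return loss.item()

def reconstruction_error(model, batch):
    """Per-sample error; readings the model reconstructs poorly look anomalous."""
    model.eval()
    with torch.no_grad():
        return ((model(batch) - batch) ** 2).mean(dim=1)

# Stand-in for a week of daily metric windows (random numbers, not real telemetry).
daily_batches = [torch.randn(64, 8) for _ in range(7)]
for day, batch in enumerate(daily_batches):
    update(model, batch)
    errors = reconstruction_error(model, batch)
    threshold = errors.mean() + 3 * errors.std()  # simple illustrative cutoff
    flagged = int((errors > threshold).sum())
    print(f"Day {day}: flagged {flagged} of {len(batch)} samples")
```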
“They compete using known data to determine which had lower error,” said Diana McSpadden, a Jefferson Lab data scientist and lead on the MLOps study. “Whichever won that day would be the ‘daily champion.’”
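A sketch of that daily contest might look like the following, where each candidate is scored on the most recent known data and the lowest-error model is promoted. The candidate architectures, scoring function, and synthetic data are assumptions for illustration, not the DIDACT interfaces.

```python
import torch
import torch.nn as nn

# Two hypothetical candidates with different bottleneck sizes.
candidates = {
    "autoencoder_wide": nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 8)),
    "autoencoder_narrow": nn.Sequential(nn.Linear(8, 2), nn.ReLU(), nn.Linear(2, 8)),
}

def mean_error(model, data):
    """Average reconstruction error on data whose behavior is already known."""
    model.eval()
    with torch.no_grad():
        return ((model(data) - data) ** 2).mean().item()

todays_known_data = torch.randn(128, 8)  # stand-in for the latest known-good metrics

scores = {name: mean_error(m, todays_known_data) for name, m in candidates.items()}
daily_champion = min(scores, key=scores.get)
print(f"Daily champion: {daily_champion} (error {scores[daily_champion]:.4f})")
```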
This approach has the potential to significantly reduce downtime in data centers and optimize critical resources—leading to cost savings and improved efficiency in scientific research.
The Next Top AI Model
To train these models without disrupting daily computing operations, the DIDACT team developed a dedicated test environment known as the sandbox. Think of it as a proving ground where models are evaluated based on their learning capabilities.
The DIDACT system integrates open-source and custom-built software to develop and manage ML models, monitor the sandbox cluster, and log key data. All this information is displayed on an interactive dashboard for easy visualization.
The system operates through three main ML pipelines, tied together in the sketch after this list:
- Offline Development – A testing phase, akin to a dress rehearsal.
- Continual Learning – The real-time competition where models battle for top performance.
- Live Monitoring – The best-performing model becomes the main system monitor—until it’s dethroned by the next day’s champion.
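Read together, the three pipelines form a daily cycle. The toy loop below, with made-up data and model sizes, sketches how candidates built offline keep learning on each day's data and how the best one is promoted to live monitoring until it is dethroned; none of the names come from DIDACT itself.

```python
import torch
import torch.nn as nn

def make_candidate(bottleneck):
    return nn.Sequential(nn.Linear(8, bottleneck), nn.ReLU(), nn.Linear(bottleneck, 8))

def fit(model, data, steps):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((model(data) - data) ** 2).mean()
        loss.backward()
        opt.step()

def error(model, data):
    with torch.no_grad():
        return ((model(data) - data) ** 2).mean().item()

# Offline development: candidates are built and exercised on archived metrics.
archive = torch.randn(512, 8)
candidates = [make_candidate(b) for b in (2, 3, 4)]
for model in candidates:
    fit(model, archive, steps=50)

# Continual learning and live monitoring: each "day," every candidate updates on
# the newest data, the lowest-error model becomes the monitor, and it holds the
# job only until the next day's champion dethrones it.
live_monitor = None
for day in range(3):
    todays_data = torch.randn(128, 8)
    for model in candidates:
        fit(model, todays_data, steps=10)
    champion = min(candidates, key=lambda m: error(m, todays_data))
    if champion is not live_monitor:
        live_monitor = champion
    print(f"Day {day}: champion error {error(live_monitor, todays_data):.4f}")
```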
“DIDACT represents a creative stitching together of hardware and open-source software,” said Hess, who is also the infrastructure architect for the High Performance Data Facility Hub being developed at Jefferson Lab in collaboration with DOE’s Lawrence Berkeley National Laboratory. “It’s a combination of things that you normally wouldn’t put together, and we’ve shown that it can work. It really draws on the strength of Jefferson Lab’s data science and computing operations expertise.”
Looking ahead, the DIDACT team plans to extend their research to explore how machine learning can optimize data center energy usage. This could involve reducing water consumption in cooling systems or adjusting processor activity based on real-time computing demands.
“The goal is always to provide more bang for the buck,” Hess said, “more science for the dollar.”
With AI-driven automation, the future of scientific computing is looking smarter, faster, and more efficient than ever.