Machine Learning Helps Diagnose Supercomputer Problems
Posted on: Dec 07, 2017

Engineers are leveraging machine learning to both uncover problems with supercomputers and fix them, all without human intervention.

Computer scientists and engineers from Sandia National Laboratories and Boston University recently earned the Gauss Award at the International Supercomputing Conference. They were honored for their work using machine learning to automatically diagnose, and potentially fix, problems in supercomputers.

It turns out that supercomputers, which are relied on for everything from forecasting the weather to cancer research to ensuring U.S. nuclear weapons are safe and reliable, can have bad days. They contain a complex collection of interconnected parts and processes that can go wrong. For example, parts can break, previous programs can leave “zombie processes” running that gum up the works, network traffic can cause bottlenecks, or a computer code revision can instigate problems. These problems often result in programs not running to completion and wasting valuable supercomputer time.

So the team came up with a list of issues they had encountered when working with supercomputers and then wrote code to re-create those problems, or anomalies. They ran a variety of programs with and without the anomaly codes on two systems: a supercomputer at Sandia and a public cloud system operated by Boston University.
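As an illustration of what such anomaly injection can look like, the sketch below spawns a synthetic "CPU hog" alongside a normal workload so that a run can be deliberately contaminated for training data. This is a minimal, hypothetical example; the team's actual anomaly suite is not described in the article and may work quite differently.

```python
# Hypothetical sketch of a synthetic anomaly injector (not the team's code):
# spawn busy-loop processes that steal CPU cycles from co-located jobs.
import argparse
import multiprocessing
import time


def cpu_hog(duration_s: float) -> None:
    """Spin in a tight loop to mimic a runaway or 'zombie' process."""
    end = time.time() + duration_s
    while time.time() < end:
        pass  # busy-wait on purpose


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Inject a synthetic CPU anomaly")
    parser.add_argument("--workers", type=int, default=2,
                        help="number of hog processes to spawn")
    parser.add_argument("--duration", type=float, default=60.0,
                        help="seconds each hog process runs")
    args = parser.parse_args()

    procs = [multiprocessing.Process(target=cpu_hog, args=(args.duration,))
             for _ in range(args.workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```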

While the programs were running, the researchers collected data on each run, monitoring how much energy, processor power, and memory each node used. Monitoring more than 700 metrics consumed less than 0.005 percent of the supercomputer's processing power. This is where machine learning comes in.
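To make the idea of lightweight per-node telemetry concrete, here is a minimal sketch that periodically samples a handful of coarse node metrics and writes them to a CSV file. It uses the psutil library as an assumed stand-in; the study's actual collector and its 700-plus metrics are not shown.

```python
# Minimal per-node metric sampler (illustrative only), using psutil.
import csv
import time

import psutil


def sample_node_metrics(interval_s: float = 1.0, samples: int = 60,
                        out_path: str = "node_metrics.csv") -> None:
    """Record a few coarse node-level metrics to a CSV file."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "cpu_percent", "mem_percent",
                         "ctx_switches", "net_bytes_sent"])
        for _ in range(samples):
            writer.writerow([
                time.time(),
                psutil.cpu_percent(interval=interval_s),  # blocks for interval_s
                psutil.virtual_memory().percent,
                psutil.cpu_stats().ctx_switches,
                psutil.net_io_counters().bytes_sent,
            ])


if __name__ == "__main__":
    sample_node_metrics()
```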

Machine learning is a broad collection of computer algorithms that find patterns in data without being explicitly programmed about which features matter. The team wrote several machine learning algorithms that detect anomalies by comparing data from normal program runs with data from runs containing anomalies, and tested the algorithms to see which was best at correctly diagnosing the anomalies. For example, one technique, called Random Forest, was particularly adept at analyzing the vast quantities of monitored data, deciding which metrics were important, and then determining whether the supercomputer was being affected by an anomaly.
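The following sketch shows the general shape of that approach, assuming scikit-learn and placeholder data: a random forest is trained to separate normal runs from anomalous ones, and its feature importances indicate which monitored metrics it relied on. It is not the authors' actual pipeline.

```python
# Illustrative random-forest anomaly classifier on placeholder data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder data: rows are program runs, columns are summary statistics
# of monitored metrics (e.g., mean CPU, peak memory); labels mark whether
# an anomaly was injected during the run.
X = rng.normal(size=(500, 20))
y = rng.integers(0, 2, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

print("held-out accuracy:", clf.score(X_test, y_test))

# Rank metrics by how heavily the forest relied on them.
top = np.argsort(clf.feature_importances_)[::-1][:5]
print("most informative metric indices:", top)
```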