Creating Safe, Secure and Intelligent Systems

Diagnosing Defective Data When Developing AI Models

Research findings recently published by an NC State Ph.D. student and their professor demonstrate a novel technique that can help correct common errors in AI models even when you don’t know the cause.

April 2, 2025 Matt Simpson

Image: Three wood blocks with a magnifying glass sitting on top of them. The blocks read check sign, AI, and X.

If an artificial intelligence model isn’t acting all that intelligently, spurious correlations could be the culprit. Spurious correlations — when an AI model makes decisions based on unimportant, potentially misleading information — generally result from simplicity bias.

For example, say you’re training an AI model to identify photographs of dogs. Throughout the training process, the model will look for specific features dogs have, which it can then use to identify them. But what if a bunch of the dogs pictured in the training set happen to be wearing collars? Since collars are typically less complex than features such as ears or fur, the AI model might mistakenly use collars as a simple way to identify dogs.

“And if the AI uses collars as the factor it uses to identify dogs, the AI may identify cats wearing collars as dogs,” says Jung-Eun Kim, an assistant professor of computer science at NC State University.
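The collar example can be reproduced in miniature. The sketch below is a toy illustration only, not the researchers' method or data: the rows are invented, the feature names `wears_collar` and `floppy_ears` are hypothetical, and a one-feature rule picker stands in for a real image model. It assumes a training set where every dog wears a collar and no cat does, while ear shape is a slightly noisier signal.

```python
# Toy illustration of a spurious correlation (hypothetical data, not from the paper).
# Each row is (wears_collar, floppy_ears, label), where label 1 = dog and 0 = cat.
TRAIN = [
    (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 0, 1),  # dogs, all collared
    (0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 1, 0),  # cats, none collared
]
TEST = [
    (1, 0, 0), (1, 0, 0),  # cats wearing collars
    (0, 1, 1), (0, 1, 1),  # dogs without collars
]
FEATURES = ["wears_collar", "floppy_ears"]
LABEL_IDX = 2


def accuracy(rows, feature_idx):
    """Fraction of rows where the rule 'feature == 1 means dog' matches the label."""
    return sum(row[feature_idx] == row[LABEL_IDX] for row in rows) / len(rows)


def fit_single_feature_rule(rows):
    """Pick the one feature that best matches the labels on the given rows.

    This stands in for a model that latches onto the simplest predictive cue.
    """
    return max(range(len(FEATURES)), key=lambda i: accuracy(rows, i))


chosen = fit_single_feature_rule(TRAIN)
print("feature chosen:", FEATURES[chosen])
print("train accuracy:", accuracy(TRAIN, chosen))
print("test accuracy:", accuracy(TEST, chosen))
print("test accuracy using ears instead:", accuracy(TEST, FEATURES.index("floppy_ears")))
```

On this invented data the rule picker settles on the collar, because it separates the training rows perfectly. Once cats wear collars and dogs do not, the same rule misclassifies every test row, while the ear feature would have classified them all correctly.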

If you know what’s causing the spurious correlations — collars, in the case of our example — then it’s possible to correct the problem. It’s not always that easy, though.

Findings from researchers at NC State show that it’s sometimes impossible to determine the cause of spurious correlations, which effectively renders conventional solutions useless. In other words, unless you know which specific features were behind the spurious correlations, you’re pretty much out of luck.

Until now, that is.

Thankfully for fellow AI practitioners, the new research from NC State also demonstrated a novel technique that “can be used even when you have no idea what spurious correlations the AI is relying on,” according to Kim, who’s a corresponding author of a paper on the work.

“Our goal with this work was to develop a technique that allows us to sever spurious correlations even when we know nothing about those spurious features,” Kim says.

The research team showed that the new technique improved performance even compared with previous work on models where the spurious features were identifiable.

“If you already have a good idea of what the spurious features are, our technique is an efficient and effective way to address the problem,” Kim says. “However, even if you are simply having performance issues, but don’t understand why, you could still use our technique to determine whether a spurious correlation exists and resolve that issue.”

The first author of the peer-reviewed paper, titled “Severing Spurious Correlations with Data Pruning,” is NC State Ph.D. student Varun Mulchandani. The paper will be presented at the International Conference on Learning Representations (ICLR), which takes place April 24-28 in Singapore.

This article is based on a news release from NC State University.