AI Researchers Tackle Longstanding ‘Data Heterogeneity’ Problem for Federated Learning

July 11, 2022 | Matt Shipman

For Immediate Release

Researchers from North Carolina State University have developed a new approach to federated learning that allows them to develop artificial intelligence (AI) models more quickly and accurately. The work focuses on a longstanding problem in federated learning that occurs when there is significant heterogeneity in the various datasets being used to train the AI.

Federated learning is an AI training technique that allows AI systems to improve their performance by drawing on multiple sets of data without compromising the privacy of that data. For example, federated learning could be used to draw on privileged patient data from multiple hospitals in order to improve diagnostic AI tools, without the hospitals having access to data on each other’s patients.

Federated learning is a form of machine learning involving multiple devices, called clients. The clients and a centralized server all start with a basic model designed to solve a specific problem. From that starting point, each of the clients then trains its local model using its own data, modifying the model to improve its performance. The clients then send these “updates” to the centralized server. The centralized server draws on these updates to create a hybrid model, with the goal of having the hybrid model perform better than any of the clients on their own. The central server then sends this hybrid model back to each of the clients. This process is repeated until the system’s performance has been optimized or reaches an agreed-upon level of accuracy.
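The train-locally, then-average loop described above can be sketched in a few lines of Python. This is a minimal illustration of conventional federated averaging, not the researchers' code: the one-parameter model `y ≈ w * x`, the helper names (`local_train`, `fedavg_round`, `make_client`), the learning rate and the synthetic data are all assumptions made for the example.

```python
import random

def local_train(w, data, lr=0.1, epochs=5):
    """Client side: a few gradient-descent epochs on local data for y ≈ w * x."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def fedavg_round(w_global, clients):
    """Server side: average the clients' trained weights, weighted by dataset size."""
    total = sum(len(data) for data in clients)
    return sum(len(data) / total * local_train(w_global, data) for data in clients)

def make_client(lo, hi, n, rng, slope=2.0):
    """Hypothetical client whose inputs fall in [lo, hi]; labels follow y = slope * x."""
    xs = [rng.uniform(lo, hi) for _ in range(n)]
    return [(x, slope * x) for x in xs]

rng = random.Random(0)
# Three clients with different input ranges and different amounts of data.
clients = [make_client(-1, 0, 20, rng), make_client(0, 1, 30, rng), make_client(1, 2, 50, rng)]

w = 0.0  # shared starting model
for _ in range(100):  # each iteration is one communication round
    w = fedavg_round(w, clients)
print(round(w, 4))
```

In this toy setup every client's data follow the same underlying relationship, so the averaged model settles on the shared slope. It does not reproduce the failure mode discussed in this release, where client datasets differ enough that local updates pull the hybrid model in conflicting directions.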

“However, sometimes the nature of a client’s personal data results in changes to the local model that work well only for the client’s own data, but don’t work well when applied to other data sets,” says Chau-Wai Wong, corresponding author of a paper on the new technique and an assistant professor of electrical and computer engineering at NC State. “In other words, if there is enough heterogeneity in the data of the clients, sometimes a client modifies its local model in a way that actually hurts the performance of the hybrid model.”

“Our new approach allows us to resolve the heterogeneity problem more efficiently than previous techniques, while still preserving privacy,” says Kai Yue, first author of the paper and a Ph.D. student at NC State. “In addition, if there is enough heterogeneity in the client data, it can be effectively impossible to develop an accurate model using traditional federated learning approaches. But our new approach allows us to develop an accurate model regardless of how heterogeneous the data are.”

In the new approach, the updates clients send to the centralized server are reformatted in a way that preserves data privacy, but gives the central server more information about the data characteristics that are relevant to model performance. Specifically, the client sends information to the server in the form of Jacobian matrices. The central server then plugs these matrices into an algorithm that produces an improved model. The central server then distributes the new model to the clients. This process is then repeated, with each iteration leading to model updates that improve system performance.
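As a rough illustration of that flow, the sketch below has each client report per-sample Jacobian rows (the derivative of the model output with respect to the weights) instead of a locally trained model, and has the server build a kernel matrix from them to update the weights. It is a toy under stated assumptions, not the algorithm from the paper: the two-parameter model `f(x; w) = w[0]*x + w[1]`, the function names `client_report` and `server_update`, the step size and the discrete-time update loop are all invented for the example. For this simple model the Jacobian rows reveal the inputs directly, so the sketch says nothing about the privacy safeguards the researchers describe.

```python
def client_report(w, data):
    """Client side: no local training, only per-sample Jacobians, predictions, labels."""
    jac = [[x, 1.0] for x, _ in data]            # d f / d w for f(x; w) = w[0]*x + w[1]
    preds = [w[0] * x + w[1] for x, _ in data]
    labels = [y for _, y in data]
    return jac, preds, labels

def server_update(w, reports, lr=0.1, steps=500):
    """Server side: stack the Jacobians, form K = J J^T, and evolve the model with it."""
    jac = [row for rep in reports for row in rep[0]]
    resid = [p - y for rep in reports for p, y in zip(rep[1], rep[2])]
    n = len(resid)
    # Empirical kernel matrix: inner products between per-sample Jacobian rows.
    kernel = [[sum(a * b for a, b in zip(ri, rj)) for rj in jac] for ri in jac]
    delta = [0.0] * len(w)
    for _ in range(steps):
        # The weight change accumulates J^T applied to the current residuals ...
        delta = [d - lr / n * sum(jac[i][k] * resid[i] for i in range(n))
                 for k, d in enumerate(delta)]
        # ... while the residuals themselves evolve through the kernel matrix.
        resid = [resid[i] - lr / n * sum(kernel[i][j] * resid[j] for j in range(n))
                 for i in range(n)]
    return [wk + dk for wk, dk in zip(w, delta)]

# Two hypothetical clients with non-overlapping inputs; labels follow y = 3x - 1.
client_a = [(x, 3 * x - 1) for x in (-2.0, -1.5, -1.0)]
client_b = [(x, 3 * x - 1) for x in (1.0, 1.5, 2.0)]

w0 = [0.0, 0.0]
w1 = server_update(w0, [client_report(w0, d) for d in (client_a, client_b)])
print([round(v, 4) for v in w1])
```

Because this toy model is linear in its weights, a single exchange is enough for the server to recover the shared relationship even though the two clients never see overlapping inputs. For a neural network the Jacobian-based picture is only a local approximation, which is consistent with the process being repeated over several rounds.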

“One of the central ideas is to avoid iteratively training the local model at each client, instead letting the server directly produce an improved hybrid model based on clients’ Jacobian matrices,” says Ryan Pilgrim, a co-author of the paper and former graduate student at NC State. “In doing so, the algorithm not only sidesteps multiple communication rounds, but also keeps divergent local updates from degrading the model.”

The researchers tested their new approach against industry-standard data sets used to assess federated learning performance, and found the new technique was able to match or surpass the accuracy of federated averaging, which is the benchmark for federated learning. What’s more, the new approach was able to match that accuracy while reducing the number of communication rounds between the server and clients by an order of magnitude.

“For example, it takes federated averaging 284 rounds of communication to reach an accuracy of 85% in one of the test data sets,” Yue says. “We were able to reach 85% accuracy in 26 rounds.”
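The two round counts quoted above can be turned into a ratio with a quick calculation. The variable names are chosen for illustration only.

```python
fedavg_rounds = 284  # rounds federated averaging needed to reach 85% accuracy
new_rounds = 26      # rounds the new approach needed to reach the same accuracy

ratio = fedavg_rounds / new_rounds
reduction = 1 - new_rounds / fedavg_rounds
print(f"{ratio:.1f}x fewer rounds, a {reduction:.1%} reduction")
```

That works out to roughly an 11-fold difference on this one test data set, in line with the "order of magnitude" description.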

“This is a new, alternative approach to federated learning, making this exploratory work,” Wong says. “We’re effectively repurposing analytical tools for practical problem-solving. We look forward to getting feedback from the private sector and from the broader federated learning research community about its potential.”

The paper, “Neural Tangent Kernel Empowered Federated Learning,” will be presented at the 39th International Conference on Machine Learning (ICML), which is being held in Baltimore, Md., July 17-23. The paper was co-authored by Richeng Jin, a former postdoctoral researcher at NC State; Dror Baron, an associate professor of electrical and computer engineering at NC State; Huaiyu Dai, a professor of electrical and computer engineering at NC State; and Ryan Pilgrim, a former graduate student at NC State.

-shipman-

Note to Editors: The study abstract follows.

“Neural Tangent Kernel Empowered Federated Learning”

Authors: Kai Yue, Richeng Jin, Chau-Wai Wong, Dror Baron and Huaiyu Dai, North Carolina State University; Ryan Pilgrim, independent scholar

Presented: 39th International Conference on Machine Learning (ICML), Baltimore, Md., July 17-23

Abstract: Federated learning (FL) is a privacy-preserving paradigm where multiple participants jointly solve a machine learning problem without sharing raw data. Unlike traditional distributed learning, a unique characteristic of FL is statistical heterogeneity, namely, data distributions across participants are different from each other. Meanwhile, recent advances in the interpretation of neural networks have seen a wide use of neural tangent kernels (NTKs) for convergence analyses. In this paper, we propose a novel FL paradigm empowered by the NTK framework. The paradigm addresses the challenge of statistical heterogeneity by transmitting update data that are more expressive than those of the conventional FL paradigms. Specifically, sample-wise Jacobian matrices, rather than model weights/gradients, are uploaded by participants. The server then constructs an empirical kernel matrix to update a global model without explicitly performing gradient descent. We further develop a variant with improved communication efficiency and enhanced privacy. Numerical results show that the proposed paradigm can achieve the same accuracy while reducing the number of communication rounds by an order of magnitude compared to federated averaging.