Large-scale computer hosting infrastructures offer a variety of services to computer users, including cloud computing – which offers users access to powerful computers and software applications hosted by remote groups of servers. But when these infrastructures run into problems – like bottlenecks that slow their operating speed – it can be costly for both the infrastructure provider and the user. New research from North Carolina State University will allow these infrastructure providers to more accurately predict such anomalies, and address them before they become a major problem.
“Previously, something bad would happen and you’d be left trying to figure out what took place. Often, you’d be unable to recreate the exact conditions that created the problem,” says Dr. Xiaohui (Helen) Gu, an assistant professor of computer science and co-author of a paper describing the new research. “However, if you can predict an anomaly, you are able to track the exact conditions that are leading up to a problem, diagnose what is wrong and put corrective actions into place much more quickly.”
At issue are anomalies, or problems, that can affect hosting infrastructures that support services like cloud computing or data centers. These anomalies can result in slowed response times, lower user capacity and host failures – all of which are bad news for a host’s clients. They can create significant problems for the host company as well, since violations of its service agreements can lead to financial penalties or a loss of clients.
In order for a program to accurately predict an anomaly, it has to know what constitutes normal behavior. That can be tricky for large-scale hosting infrastructure. These infrastructures host a variety of different applications for their clients, and many of these applications are operating in dynamic contexts.
For example, one application may be hosting a Web site that can go from being very busy to essentially idle. And, because hosting infrastructures serve multiple clients simultaneously, the computing resources available to a specific client are also variable – depending on the number of clients using the infrastructure at any given time and what those clients are trying to do.
This variability makes it difficult for a program to predict abnormal behavior, because what counts as normal behavior keeps changing.
To predict abnormalities accurately, the researchers crafted a collection of models that examine system activity in a variety of contexts. In other words, the models are able to determine what constitutes normal behavior under many different circumstances. Because the models do a good job of defining normal behavior, they are able to accurately identify abnormal behavior.
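The general idea of judging behavior against a context-specific notion of "normal" can be illustrated with a minimal sketch. This is not the researchers' ALERT system, which predicts anomalies in advance and whose internals are not described here; it is a toy example that only flags a reading as abnormal relative to its context. The class name `ContextAwareDetector`, the "busy" and "idle" context labels, the CPU-utilization readings and the three-standard-deviation threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


class ContextAwareDetector:
    """Toy illustration (hypothetical, not ALERT): one baseline of
    'normal' per context label, instead of one global baseline."""

    def __init__(self, threshold=3.0):
        # Assumed cutoff: how many standard deviations counts as abnormal.
        self.threshold = threshold
        # Maps a context label to the readings observed in that context.
        self.samples = defaultdict(list)

    def train(self, context, value):
        """Record a reading that was observed during normal operation."""
        self.samples[context].append(value)

    def is_abnormal(self, context, value):
        """Compare a reading against the baseline for its own context."""
        history = self.samples.get(context)
        if not history or len(history) < 2:
            return False  # no usable baseline for this context yet
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            return value != mu
        return abs(value - mu) / sigma > self.threshold


detector = ContextAwareDetector()

# Made-up CPU utilization readings (percent) for two contexts.
for reading in [88, 90, 92, 91, 89]:
    detector.train("busy", reading)
for reading in [4, 5, 6, 5, 5]:
    detector.train("idle", reading)

# The same reading is normal in one context and abnormal in the other.
normal_when_busy = detector.is_abnormal("busy", 90)
abnormal_when_idle = detector.is_abnormal("idle", 90)
```

In this sketch, a reading of 90 percent is treated as normal while the hosted Web site is busy but as abnormal while it is idle, which is the distinction a single global baseline would miss.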
“Our ‘context aware’ prediction approach improved our accuracy significantly,” says Gu. “We were 50 percent more accurate at predicting anomalies than any existing programs, and had an 80 percent lower rate of false alarms.”
The research, “Adaptive System Anomaly Prediction for Large-Scale Hosting Infrastructures,” was co-authored by Gu, NC State Ph.D. student Yongmin Tan, and Haixun Wang of Microsoft Research Asia. The work was funded by the National Science Foundation, the U.S. Army Research Office and IBM. The paper will be presented July 27 at the ACM Symposium on Principles of Distributed Computing in Zurich, Switzerland.
NC State’s Department of Computer Science is part of the university’s College of Engineering.
Note to Editors: The presentation abstract follows.
“Adaptive System Anomaly Prediction for Large-Scale Hosting Infrastructures”
Authors: Yongmin Tan, Xiaohui Gu, North Carolina State University; Haixun Wang, Microsoft Research Asia
Presented: July 27, 2010, at the ACM Symposium on Principles of Distributed Computing, Zurich, Switzerland
Abstract: Large-scale hosting infrastructures require automatic system anomaly management to achieve continuous system operation. In this paper, we present a novel adaptive runtime anomaly prediction system, called ALERT, to achieve robust hosting infrastructures. In contrast to traditional anomaly detection schemes, ALERT aims at raising advance anomaly alerts to achieve just-in-time anomaly prevention. We propose a novel context-aware anomaly prediction scheme to improve prediction accuracy in dynamic hosting infrastructures. We have implemented the ALERT system and deployed it on several production hosting infrastructures such as IBM System S stream processing cluster and PlanetLab. Our experiments show that ALERT can achieve high prediction accuracy for a range of system anomalies and impose low overhead to the hosting infrastructure.