Predicting IT Infrastructure Issues

Many organizations have entire departments responsible for monitoring operations to find and fix issues. Without sufficient tools and the insights needed to get ahead of these problems, companies are forced to purchase redundant systems to avoid affecting their customers and employees. In truth, all of the information needed to eliminate downtime already exists in the machine-generated log files scattered across their operational data stores; most companies simply don't know how to find what they need.

The trouble with logs is that they can be overwhelming: there is simply too much information. More precisely, there is too much irrelevant information and too little of it in a useful form. People can't wade through terabytes of logs to find the parts they need if they don't know what they are looking for. In their raw form, logs offer low visibility and observability: the information is there, but it is very hard to find. Historically, this has limited their usefulness in diagnosing problems, and when the log level is raised to DEBUG, the volume grows larger still.

Even with text-search tools like grep, developers and testers spend hours pulling relevant clues out of the logs. The goal is to find not just single data points but patterns. If a problem is intermittent, the trick is to discover what differs between the times it occurs and the times it doesn't. Assembling the log data that explains such situations is a challenge.
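To see why this is tedious by hand, consider a minimal sketch of the manual approach: grepping for a keyword and bucketing the hits by hour to spot when an intermittent problem clusters. The file path, timestamp format, and `ERROR` keyword here are illustrative assumptions, not a real deployment.

```python
import re
from collections import Counter

def error_counts_by_hour(path, pattern=r"ERROR"):
    """Count lines matching `pattern`, bucketed by hour.

    Assumes each line starts with an ISO-8601 timestamp such as
    '2023-04-01T13:45:02' (a hypothetical format for illustration).
    """
    counts = Counter()
    regex = re.compile(pattern)
    with open(path) as f:
        for line in f:
            if regex.search(line):
                counts[line[:13]] += 1  # 'YYYY-MM-DDTHH' prefix
    return counts
```

This only answers one pre-formed question about one keyword; discovering which question to ask in the first place is the part that consumes the hours.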

SliceUp has developed a system that automatically finds patterns in log files. With that capability, we can quickly identify the static and variable parts of a log line, parameterize the variables, and detect anomalies without any human intervention. An anomaly, by itself, is just something that differs from the norm, so we add sentiment analysis and other historical findings to narrow the anomalies down to the ones that matter.
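The idea of separating static and variable parts can be sketched in a few lines. This is a rough, hand-written illustration of the general technique (log template extraction), not SliceUp's actual pipeline: the masking rules and the frequency threshold below are assumptions, whereas a production system learns the templates automatically.

```python
import re
from collections import Counter

# Hand-written masks for common variable tokens; a real system
# infers these from the data rather than relying on fixed rules.
MASKS = [
    (re.compile(r"\b\d+\.\d+\.\d+\.\d+\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def template(line):
    """Collapse a raw log line to its static template."""
    for regex, placeholder in MASKS:
        line = regex.sub(placeholder, line)
    return line.strip()

def rare_templates(lines, threshold=0.05):
    """Flag templates whose relative frequency falls below `threshold`.

    Rarity alone only marks something as *different*; deciding which
    of these candidates actually matter requires further context.
    """
    counts = Counter(template(line) for line in lines)
    total = sum(counts.values())
    return {t: c for t, c in counts.items() if c / total < threshold}
```

Once every line maps to a template, anomaly detection reduces to statistics over template frequencies, which is far cheaper than searching raw text.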

SliceUp can analyze up to 200,000 log lines per second with our on-premises solution (a cloud version is also available). Within the first week of operation, we have enough data to train our ML/DL models, find anomalies, and confirm our findings with subject-matter experts. Our typical results show a 96% true-positive anomaly detection rate with an extremely low false-negative rate. All this allows you to focus on the problems that matter.