One of today’s most popular software development and operations methodologies, DevOps aims to streamline delivery and seamlessly integrate software engineers and IT operations specialists to provide maximum value to the business.
A DevOps implementation generates large amounts of data that can be used to simplify workflows, orchestration, monitoring, troubleshooting, and other tasks. The problem is that there is simply too much of it: server logs alone can accumulate hundreds of megabytes per week, and monitoring tools add megabytes and gigabytes more in a short period of time.
The result is predictable: engineers do not analyze the data as it is; instead, they set threshold values. In effect, they look for exceptions rather than doing data analytics. And even with the help of modern analytical tools, you still need to know what to look for in your data sets.
Most of the data created in the DevOps process relates directly to application deployment: application monitoring adds to the server logs and generates error messages and transaction traces. The only reasonable way to analyze this data and draw the right conclusions in real time is to use machine learning (ML).
Although ML can take a while to implement, once the algorithms and network architectures are aligned correctly, the machine learning system starts producing results that correspond to reality. In essence, the neural network “learns”, or models, the relationship between the data and the outcomes, and this model can then be used to make estimates on future data.
ML as a rescue ranger for DevOps
Machine learning algorithms allow you to monitor information objects (e.g., databases, applications, etc.) and build profiles of normal (error-free) system operation. In case of any deviation (anomaly), for example, when response time increases, the application freezes, or transactions slow down, the system records the situation and sends a notification, which allows you to take measures to prevent such anomalies going forward.
How difficult is it to train such a system, how long does it take, and how much effort is required? Basically, no manual training is needed: the system learns from the data sets itself, with no explicit programming, and can infer the relationships between them. This removes the “human factor” and speeds things up by eliminating manual processes such as identifying data correlations and dependencies.
The system determines on its own how the monitored objects should behave, and parameterization mechanisms suffice for additional adjustments. However, although machine learning is a very powerful tool, it needs time to accumulate data: the number of false positives decreases over time, and you can reduce it further with slight “fine tuning”.
Adjustment mechanisms help make the algorithms more accurate and adapt them to specific needs, so accuracy keeps improving as statistics accumulate.
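As a minimal sketch of this idea, the snippet below fits an anomaly detector to a set of hypothetical response-time samples and then scores new measurements against the learned profile. It uses scikit-learn’s IsolationForest only as one possible technique; the metric values and the `contamination` setting (the “fine tuning” knob mentioned above) are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical history of response times in milliseconds: mostly normal,
# with a few slow outliers mixed in.
rng = np.random.default_rng(42)
normal = rng.normal(loc=120, scale=15, size=500)   # typical responses
spikes = rng.normal(loc=900, scale=100, size=5)    # occasional freezes
history = np.concatenate([normal, spikes]).reshape(-1, 1)

# The model builds a profile of "normal" operation from the data itself.
# `contamination` is the fine-tuning knob: a lower value means fewer
# points are treated as anomalies (fewer false positives).
detector = IsolationForest(contamination=0.01, random_state=0)
detector.fit(history)

# Score newly observed response times against the learned profile.
new_samples = np.array([[118.0], [131.0], [950.0]])
labels = detector.predict(new_samples)   # 1 = normal, -1 = anomaly
for value, label in zip(new_samples.ravel(), labels):
    status = "anomaly" if label == -1 else "ok"
    print(f"response time {value:.0f} ms -> {status}")
```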
Algorithmic approaches aim to identify anomalies, cluster and correlate data, and improve forecasting. They help answer questions such as: What is the cause of the problem? How can it be prevented? Is this behavior normal or abnormal? What can be improved in the application? What should I look at immediately? How should the load be balanced?
As far as DevOps goes, ML can have many use cases.
Machine Learning Use Cases in DevOps
Application tracking
Activity data from DevOps tools (for example, Jira, Git, Jenkins, SonarQube, Puppet, Ansible, etc.) provides transparency into the software delivery process. ML can reveal anomalies in this data, such as unusually large code changes, long build times, extended release and code-review cycles, and it can identify many other “deviations” in the development process, including inefficient use of resources, frequent task switching, or a general slowdown.
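As an illustration of the simplest kind of check, the sketch below flags unusually long builds in a hypothetical export of CI build durations using a robust median-based baseline. The numbers and the deviation factor are made up, and a real pipeline would pull this data from the Jenkins API or similar tooling.

```python
import pandas as pd

# Hypothetical build durations (in minutes) exported from a CI tool such as
# Jenkins; in practice these would come from its API or log files.
builds = pd.DataFrame({
    "build_id": range(1, 11),
    "duration_min": [6.1, 5.8, 6.4, 6.0, 5.9, 6.3, 14.5, 6.2, 6.0, 15.2],
})

# Robust baseline: median and median absolute deviation (MAD) of past builds.
median = builds["duration_min"].median()
mad = (builds["duration_min"] - median).abs().median()

# Flag builds whose duration deviates from the baseline by more than a chosen
# factor -- a simple, data-driven "deviation" check instead of a fixed limit.
factor = 5.0  # tuning knob, chosen arbitrarily for the example
builds["is_outlier"] = (builds["duration_min"] - median).abs() > factor * mad

print(builds[builds["is_outlier"]])
```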
Application quality assurance
By analyzing test results, ML can help identify new errors and build a library of test patterns based on those findings. This helps ensure thorough testing of each release and improves the quality of the delivered applications.
User behavior patterns
Patterns of user behavior can be as unique as fingerprints. Applying ML-based behavior analysis to both Dev and Ops can help identify anomalies that indicate malicious activity: abnormal access patterns to important repositories, users who intentionally or accidentally apply known “bad” patterns (for example, code containing backdoors), unauthorized code deployment, or theft of intellectual property.
Operation management
Analyzing an application in production is an area where machine learning can really prove itself, since you have to deal with large amounts of data, huge numbers of users, transactions, and so on. DevOps specialists can use ML to analyze user behavior, resource usage, transaction throughput, etc., in order to detect “abnormal” patterns (for example, DDoS attacks or memory leaks).
Notification management
A simple and practical use of ML is managing the mass flow of warnings (alerts) in the systems being operated. Sometimes alerts are related through a common transaction identifier, a common set of servers, or a common subnet; in other cases the relationship is more complex and requires the system to “learn” over time to recognize “known good” and “known bad” warnings. Either way, this allows alerts to be filtered.
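A minimal sketch of the grouping step is shown below: hypothetical raw alerts that share a transaction identifier and a subnet are collapsed into candidate incidents with pandas. The fields and values are invented; a real system would read alerts from its monitoring backend and layer learned “known good” / “known bad” classification on top.

```python
import pandas as pd

# Hypothetical stream of raw alerts; in a real system these would come from
# a monitoring or alerting backend.
alerts = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-05-01 10:00:01", "2024-05-01 10:00:03", "2024-05-01 10:00:04",
        "2024-05-01 10:05:10", "2024-05-01 10:05:12",
    ]),
    "transaction_id": ["tx-42", "tx-42", "tx-42", "tx-77", "tx-77"],
    "subnet": ["10.0.1.0/24", "10.0.1.0/24", "10.0.1.0/24",
               "10.0.2.0/24", "10.0.2.0/24"],
    "message": ["db timeout", "db timeout", "queue backlog",
                "disk 90% full", "disk 91% full"],
})

# Collapse the alert storm: group alerts that share a transaction id and a
# subnet, and keep one summary row per group.
incidents = (
    alerts.groupby(["transaction_id", "subnet"])
    .agg(first_seen=("timestamp", "min"),
         alert_count=("message", "size"),
         sample_message=("message", "first"))
    .reset_index()
)

print(incidents)  # 5 raw alerts reduced to 2 candidate incidents
```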
Troubleshooting and analytics
This is another area where modern machine learning technologies perform well. ML can automatically detect and triage “known problems” and even some unknown ones. For example, ML tools can detect anomalies in “normal” processing and then analyze the logs to relate the problem to a new configuration or deployment. Other automation tools can use ML to alert the operations team, open a ticket (or a chat window), and assign it to an appropriate resource. Over time, ML can even suggest a better solution.
Preventing disruptions during operation
ML allows you to go far beyond simple resource planning to prevent system crashes. For instance, it can be used to predict the best configuration for a desired level of performance, the number of customers who will use a new feature, infrastructure requirements, and so on. ML also reveals “early signs” in systems and applications, allowing developers to start fixing or avoiding problems in advance.
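As a rough sketch of such an “early sign”, the snippet below fits a linear trend to hypothetical memory-usage samples and projects when a container limit would be hit, long before an actual crash. The numbers and the limit are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical hourly memory usage (in MB) for a service that is slowly
# leaking memory; real values would come from a monitoring system.
hours = np.arange(24)
memory_mb = 512 + 8.5 * hours + np.random.default_rng(1).normal(0, 5, 24)

# Fit a simple linear trend to the recent history.
slope, intercept = np.polyfit(hours, memory_mb, deg=1)

# Project forward to estimate when the container limit would be exceeded,
# giving an "early sign" long before the crash actually happens.
limit_mb = 1024.0
hours_until_limit = (limit_mb - intercept) / slope - hours[-1]
print(f"usage grows ~{slope:.1f} MB/hour; "
      f"limit of {limit_mb:.0f} MB reached in ~{hours_until_limit:.0f} hours")
```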
Business impact analysis
Understanding how a code release affects business goals is crucial to DevOps success. By synthesizing and analyzing actual usage data, ML systems can detect good and bad patterns and serve as an “early warning system” when applications have issues. For example, ML can report an increased rate of shopping-cart abandonment or specific impediments in the customer journey.
Machine learning allows you to work with large data sets and helps draw informed conclusions. Identifying statistically significant anomalies makes it possible to spot abnormal behavior of infrastructure objects. In addition, machine learning can reveal not only anomalies in the processes but also illegitimate actions.
Recognizing and grouping records based on common templates helps you focus on important data and cut off background noise. Analyzing the records immediately preceding and following an error makes it easier to find the root cause, while constant monitoring of applications means issues are identified and eliminated quickly in production.
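The following sketch shows one simple way to group records by common templates: volatile fields such as numbers and IP addresses are masked, and identical templates are counted. The log lines are invented for illustration; production log-template mining is usually more sophisticated.

```python
import re
from collections import Counter

# Hypothetical raw log lines; real ones would be read from server log files.
logs = [
    "user 1041 logged in from 10.0.1.17",
    "user 2230 logged in from 10.0.3.90",
    "payment 558812 failed: timeout after 30s",
    "payment 558907 failed: timeout after 31s",
    "user 1041 logged out",
]

def to_template(line: str) -> str:
    """Mask volatile fields (numbers, IP addresses) to expose the template."""
    line = re.sub(r"\d+\.\d+\.\d+\.\d+", "<ip>", line)  # IP addresses
    line = re.sub(r"\d+", "<num>", line)                 # any remaining numbers
    return line

# Group records by template and count them, so repetitive "background"
# messages can be set aside and rare ones stand out.
template_counts = Counter(to_template(line) for line in logs)
for template, count in template_counts.most_common():
    print(f"{count:3d}  {template}")
```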
The following data types have a predictable format and are perfectly suited for ML: user data, diagnostics and transaction data, metrics (e.g., from applications, virtual machines, containers, servers), infrastructure data, and so on.
Improving DevOps with ML
Regardless of whether you buy a commercial application or build it from scratch, there are several ways to use machine learning to improve DevOps.
From thresholds to predictive analytics
Since there is a lot of data, DevOps specialists rarely view and analyze the entire data set. Instead, they set thresholds, i.e., conditions that trigger some action. In effect, they discard most of the collected data and focus on deviations. Machine learning applications are capable of more: they can be trained on all of the data, and at run time they can observe entire data streams and draw conclusions. This is what enables predictive analytics.
Search for trends, not errors
It follows from the above that, when learning from all of the data, a machine learning system can show more than just the issues already identified. By analyzing data trends, DevOps experts can see what will happen over time, that is, observe trends and make predictions.
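As a small example of looking at trends rather than single-point thresholds, the sketch below fits a regression line through a hypothetical month of daily error counts and reports whether the drift is statistically meaningful; the data is fabricated for illustration.

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical daily error counts over four weeks; a fixed threshold on any
# single day would miss the slow upward drift visible in the trend.
days = np.arange(28)
errors = np.array([
    12, 10, 14, 11, 13, 12, 15, 14, 13, 16, 15, 17, 16, 18,
    17, 19, 18, 20, 21, 20, 22, 23, 22, 24, 25, 24, 26, 27,
])

# Fit a line through the whole series instead of checking one threshold.
result = linregress(days, errors)
print(f"errors grow by ~{result.slope:.2f} per day (p-value {result.pvalue:.3g})")

if result.slope > 0 and result.pvalue < 0.05:
    print("upward trend detected: worth investigating before it becomes an outage")
```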
Analysis and correlation of data sets
Most of the data comes as time series, and a single variable is easy to trace. But many trends are the result of several factors combined. For example, response time may increase when multiple transactions simultaneously perform the same action. Such trends are almost impossible to detect “with the naked eye” or with traditional analytics, but properly trained applications will pick up these correlations and trends.
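The sketch below illustrates this with hypothetical per-minute metrics where latency is driven by both concurrency and cache behavior; a simple correlation matrix already reveals the combined relationships that a single-series view would hide. All values are simulated for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical per-minute metrics: response time depends on both the number
# of concurrent transactions and the cache hit rate, which is hard to see
# by eyeballing any single series.
rng = np.random.default_rng(7)
n = 200
concurrent_tx = rng.integers(10, 200, size=n)
cache_hit_rate = rng.uniform(0.6, 0.99, size=n)
response_ms = 40 + 0.8 * concurrent_tx - 50 * cache_hit_rate + rng.normal(0, 5, n)

metrics = pd.DataFrame({
    "concurrent_tx": concurrent_tx,
    "cache_hit_rate": cache_hit_rate,
    "response_ms": response_ms,
})

# The correlation matrix exposes which factors move together with latency.
print(metrics.corr().round(2))
```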
Historical data context
One of the biggest problems in DevOps is learning from mistakes. Even where a strategy of constant feedback exists, it is usually something like a wiki that describes the problems encountered and what was done to investigate them, with restarting the server or the application as the common fix. Machine learning systems can analyze the data and clearly show what happened yesterday, last week, last month, or last year. You can see seasonal or daily trends, and at any moment get a realistic picture of your application’s performance.
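A minimal illustration of using historical context: the sketch below builds an hour-of-day profile from two weeks of hypothetical request counts and compares the latest observation with what is typical for that hour, instead of with a static threshold. The data is generated purely for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical request counts for two weeks at hourly resolution, with a
# clear daily cycle; real data would come from historical monitoring storage.
idx = pd.date_range("2024-04-01", periods=14 * 24, freq="h")
rng = np.random.default_rng(3)
daily_cycle = 1000 + 600 * np.sin(2 * np.pi * idx.hour.to_numpy() / 24)
requests = pd.Series(daily_cycle + rng.normal(0, 50, len(idx)), index=idx)

# Build an hour-of-day profile from history: the "usual" load at each hour.
profile = requests.groupby(requests.index.hour).mean()

# Compare the latest observation with the same hour in the historical profile
# instead of with an arbitrary static threshold.
latest = requests.iloc[-1]
expected = profile[requests.index[-1].hour]
print(f"latest hour: {latest:.0f} requests, typical for this hour: {expected:.0f}")
```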
Correlation between monitoring tools
In DevOps, several tools are often used simultaneously to view and process data. Each of them monitors application performance in its own way, but none of them can find the relationships across the data coming from the other tools. Machine learning systems can collect all of these disparate data streams, use them as raw input, and create a more accurate and reliable picture of the state of your application.
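A simple sketch of combining streams from two tools is shown below: hypothetical latency samples from an APM tool are time-aligned with CPU readings from an infrastructure tool using pandas. The tool roles, fields, and values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical streams from two different monitoring tools: an APM tool
# reporting latency and an infrastructure tool reporting CPU load.
apm = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-05-01 10:00:02", "2024-05-01 10:00:32",
                                 "2024-05-01 10:01:05"]),
    "latency_ms": [120, 480, 510],
})
infra = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-05-01 10:00:00", "2024-05-01 10:00:30",
                                 "2024-05-01 10:01:00"]),
    "cpu_percent": [35, 92, 95],
})

# Align the two streams on time so each latency sample is paired with the
# most recent CPU reading -- one combined view instead of two separate tools.
combined = pd.merge_asof(apm.sort_values("timestamp"),
                         infra.sort_values("timestamp"),
                         on="timestamp", direction="backward")
print(combined)
```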
Orchestration efficiency
If metrics exist for the orchestration process, machine learning can be used to determine how effectively that orchestration is performed. Inefficiency can be the result of wrong methods or poor orchestration itself, so studying these characteristics can help both in the choice of tools and in the organization of processes.
Optimization of a specific metric
Are you looking to increase uptime? Maintain performance standards? Or reduce the time between deployments? An adaptive machine learning system can help. Adaptive systems are systems without a single predefined answer or result; their goal is to take input data and optimize certain characteristics.
For example, airline ticketing systems try to fill planes and optimize revenue by changing ticket prices up to three times a day. DevOps processes can be optimized in a similar way: the system is trained to maximize (or minimize) a value rather than to reach a known result, which allows it to change its parameters during operation and gradually converge on the best outcome.
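As a minimal stand-in for such an adaptive optimizer, the sketch below uses a simple epsilon-greedy strategy (rather than a neural network) to pick between hypothetical deployment configurations and gradually converge on the one with the highest observed throughput; the configurations and throughput numbers are invented for the example.

```python
import random

# Three hypothetical configurations (e.g., different worker-pool sizes) with
# unknown "true" throughput; the optimizer only sees noisy observations.
TRUE_THROUGHPUT = {"small": 180.0, "medium": 240.0, "large": 210.0}

def observe(config: str) -> float:
    """Simulate measuring throughput (requests/sec) for one deployment."""
    return random.gauss(TRUE_THROUGHPUT[config], 20.0)

# Epsilon-greedy loop: mostly exploit the best configuration seen so far,
# occasionally explore others, and keep updating the running averages.
estimates = {c: 0.0 for c in TRUE_THROUGHPUT}
counts = {c: 0 for c in TRUE_THROUGHPUT}
epsilon = 0.1

random.seed(0)
for step in range(500):
    if random.random() < epsilon or step < len(estimates):
        config = random.choice(list(estimates))      # explore
    else:
        config = max(estimates, key=estimates.get)   # exploit
    reward = observe(config)
    counts[config] += 1
    estimates[config] += (reward - estimates[config]) / counts[config]

print("estimated throughput per config:",
      {c: round(v) for c, v in estimates.items()})
print("best configuration so far:", max(estimates, key=estimates.get))
```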
Your ultimate goal is to improve, in a measurable way, DevOps methods from concept to deployment and decommissioning.
Deploying applications with Docker, microservices, cloud technologies, and APIs, and ensuring their high reliability, requires new approaches. That is why it is important to use smart tools, and why DevOps tool vendors are integrating smart features into their products to further simplify and speed up software development processes.
Of course, ML is no substitute for intelligence, experience, creativity, and hard work. But we already see ample opportunities for its use and even greater potential in the future.