March 1, 2016


Continuous Testing: A use case on Automated Bug Detection using Machine Learning

Organizations adopting agile software development and DevOps are releasing software more often. Continuous delivery (CD) is an approach that automates the delivery process and brings together different processes so that they can be executed more quickly and more frequently. For instance, Paddy Power, an Irish organization offering services in regulated markets through betting shops, phones, and the Internet, adopted the CD approach, and an application that used to be released once every one to six months is now released once a week on average [1]. The continuous delivery model cannot be fully realized without investing in continuous testing. One source states that a bug caught in production may be ten times costlier than one caught during the continuous testing phase [2]. This makes the case for continuous testing even stronger.

Testing code in large-scale engineering projects is tedious, time consuming, and resource intensive. At Paddy Power, mentioned above, approximately 30% of the workforce was used to fix bugs before the CD approach was adopted [1]. Long hours spent on manual testing also carry an opportunity cost [3]. If testing takes too much time, it hinders the objective of a quick time to market. Alternatively, cutting testing short may lead to more frequent releases, but with more defects. Hence, there is a need to streamline continuous delivery by bringing more efficient automated bug detection into the continuous deployment releases. This is a step towards making delivery automation a success.

The need to automate more manual testing activities is real, but the available automation is far from perfect. Most of the available tools and technologies are not mature enough to implement continuous testing on large-scale software engineering projects. Another important question is how accurate the testing output will be once some of the manual testing processes are automated [4]. If the output contains many ‘false positives’ (changes flagged as containing a bug when they actually do not), the purpose of automating the testing process is diluted. This is because debugging the cases flagged as ‘positives’ by the automation system may take as much time as performing the manual testing itself.

Against this background, we approach the problem of automated bug detection with a machine learning (ML) based method that assigns a probability score to each file change, available in the form of a ‘diff’ file fetched from GitHub. A snapshot of a diff file is shown in Figure 1 below. The generated probability score indicates how likely the diff file is to contain a bug. This helps testing teams prioritize which code changes require further manual intervention and which require little or none. It also gives the project manager of a code repository advance information on whether the repository is at increased risk of having open ‘issues’. This, in turn, helps teams meet the tight timelines of the agile sprint delivery framework and achieve delivery automation.


Figure 1: A snapshot of a sample diff file from the ‘pandas’ repository

We extracted a sufficiently large number of diff files from the ‘pandas’ repository to make accurate and generalizable predictions [5]. The first approach we tried was to select the twenty thousand most frequent word n-grams (a combination of unigrams, bigrams, and trigrams) derived from the diff files. In ML, an n-gram is a contiguous sequence of ‘n’ items from a given sequence of text. Unigrams, bigrams, and trigrams correspond to n values of 1, 2, and 3 respectively. For instance, the text ‘ML in CD’ has three unigrams, {‘ML’, ‘in’, ‘CD’}, two bigrams, {‘ML in’, ‘in CD’}, and one trigram, {‘ML in CD’}. This approach results in three density matrices (standard row-by-column matrices), each with one column per n-gram (twenty thousand columns) and one row per diff file. We extended this further by building three types of density matrices, which capture code changes in the form of code additions, code deletions, and both. A code addition or deletion is indicated by the first character of a line in the diff file being ‘+’ or ‘-’ respectively.
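The n-gram step can be illustrated with a minimal Python sketch. It is not the pipeline used for the results in this post: the function names (`split_diff`, `word_ngrams`, `ngram_count_matrix`), the whitespace tokenization, and the tiny `example.py` diff are all assumptions made for illustration.

```python
from collections import Counter


def split_diff(diff_text):
    """Return (added, removed) code lines from a unified diff."""
    added, removed = [], []
    for line in diff_text.splitlines():
        # Simplification: treat '+++' / '---' lines as file headers, not code.
        if line.startswith(("+++", "---")):
            continue
        if line.startswith("+"):
            added.append(line[1:])
        elif line.startswith("-"):
            removed.append(line[1:])
    return added, removed


def word_ngrams(text, max_n=3):
    """Return all word n-grams for n = 1..max_n (whitespace tokenization)."""
    tokens = text.split()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def ngram_count_matrix(documents, vocab_size=20000):
    """Build a count matrix: one row per document, one column per n-gram."""
    per_doc = [Counter(word_ngrams(doc)) for doc in documents]
    totals = Counter()
    for counts in per_doc:
        totals.update(counts)
    # Keep only the most frequent n-grams across all documents.
    vocab = [gram for gram, _ in totals.most_common(vocab_size)]
    matrix = [[counts[gram] for gram in vocab] for counts in per_doc]
    return vocab, matrix


print(word_ngrams("ML in CD"))
# ['ML', 'in', 'CD', 'ML in', 'in CD', 'ML in CD']

# Hypothetical diff content, used only to exercise the functions.
sample_diff = "\n".join([
    "--- a/example.py",
    "+++ b/example.py",
    "@@ -1,2 +1,2 @@",
    "-total = a + b",
    "+total = a - b",
    " return total",
])
added, removed = split_diff(sample_diff)
vocab, matrix = ngram_count_matrix([" ".join(added), " ".join(removed)])
```

Here the two rows of `matrix` stand in for an "additions" document and a "deletions" document; in practice each row would correspond to one diff file, and a separate matrix would be built for additions, deletions, and both.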

The n-grams approach is somewhat contextual, so as an alternative we perform feature engineering that gives information on the semantics of the code. As in the n-grams approach, we first filter the content of the diff file and keep only the code lines whose first character is either ‘+’ or ‘-’. On the filtered code content we generate features that capture the count of tokens (words) consisting only of alphabetic characters, numeric characters, non-alphanumeric characters, lowercase characters, uppercase characters, ASCII characters, or non-ASCII characters. In addition, we capture the count of tokens having 1 to 10 characters each, tokens having lowercase and numeric characters, tokens having uppercase and numeric characters, and other permutations and combinations. This exercise is done separately for code changes in the form of ‘additions’ and ‘deletions’, and also for both combined. We have built reusable custom functions in the R programming language that accept any diff file as input and generate these features.
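The R functions themselves are not shown here. As a rough Python sketch of the same idea, the code below counts tokens by character class and by length. The feature names, the whitespace tokenization, and the exact definition of each class (for example, Python's Unicode-aware `str.isalpha`) are assumptions, not the definitions used in the original functions.

```python
def _letters_plus_digits(tok, letter_test):
    """True if tok is only digits plus letters passing letter_test, with both present."""
    letters = [ch for ch in tok if ch.isalpha()]
    digits = [ch for ch in tok if ch.isdigit()]
    return (
        bool(letters)
        and bool(digits)
        and len(letters) + len(digits) == len(tok)
        and all(letter_test(ch) for ch in letters)
    )


def token_features(lines):
    """Count tokens of various character classes in a list of code lines."""
    tokens = [tok for line in lines for tok in line.split()]
    features = {
        "n_tokens": len(tokens),
        "alpha_only": sum(tok.isalpha() for tok in tokens),
        "numeric_only": sum(tok.isdigit() for tok in tokens),
        "non_alphanumeric_only": sum(
            not any(ch.isalnum() for ch in tok) for tok in tokens
        ),
        # islower/isupper: every cased character has that case.
        "lowercase": sum(tok.islower() for tok in tokens),
        "uppercase": sum(tok.isupper() for tok in tokens),
        "ascii": sum(tok.isascii() for tok in tokens),
        "non_ascii": sum(not tok.isascii() for tok in tokens),
        "lower_and_numeric": sum(
            _letters_plus_digits(tok, str.islower) for tok in tokens
        ),
        "upper_and_numeric": sum(
            _letters_plus_digits(tok, str.isupper) for tok in tokens
        ),
    }
    # Token length histogram for lengths 1 to 10.
    for length in range(1, 11):
        features["len_%d" % length] = sum(len(tok) == length for tok in tokens)
    return features


def diff_features(added, removed):
    """Generate the features separately for additions, deletions, and both."""
    features = {}
    groups = (("add", added), ("del", removed), ("all", added + removed))
    for prefix, lines in groups:
        for name, value in token_features(lines).items():
            features[prefix + "_" + name] = value
    return features


# Hypothetical added/removed lines, as if already filtered from a diff file.
row = diff_features(added=["x1 = 42", "NAME = caf\u00e9"], removed=["y = 7"])
```

Each call to `diff_features` produces one feature row per diff file; stacking the rows gives the table on which a classifier can be trained.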

The above exercise demonstrates different dimensions that can potentially capture how well the code is written and whether that is correlated with producing a bug. In ML, this exercise is called feature engineering. As a next step, we applied different supervised learning classifiers to the generated features. We knew which of the extracted diff files actually contained a bug, which made it possible to apply supervised learning classifiers. The output of an ML classifier is a numeric probability score between 0 and 1, indicating how likely the diff file is to contain a bug. A score of 0 indicates no possibility of a bug, and a score of 1 indicates a 100% possibility.

Note that automated bug detection reduces, but does not eliminate, the need for manual testing, since any ML classifier rests on certain assumptions and performs with different accuracy in different contexts. In our context, we were able to achieve an accuracy of around 80%.

As a next step, the classifier can be tested further to check whether higher probability scores from the ML classifier are correlated with active ‘issues’ on GitHub. We believe a high correlation between the two bodes well for generalizing the existing solution to contexts beyond ‘pandas’. For contexts where the solution turns out not to generalize, we have built a reusable pipeline that takes any diff file as input and performs feature engineering, an exercise that accounts for more than two thirds of the time spent in building any supervised ML classifier. We consider this exercise a step towards making the continuous delivery model more robust, automated, and streamlined.
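To make the scoring step concrete, here is a minimal Python sketch of one possible supervised classifier, a logistic regression trained by gradient descent. The post does not say which classifiers were used, so the choice of logistic regression is an assumption, and the feature values, labels, and diff names below are made-up toy data, not results from the ‘pandas’ repository.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(row, weights, bias):
    """Probability score in (0, 1) that a feature row belongs to a buggy diff."""
    return sigmoid(sum(w * v for w, v in zip(weights, row)) + bias)


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(X, y):
            error = predict_proba(row, weights, bias) - label
            for j, value in enumerate(row):
                grad_w[j] += error * value
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


def rank_by_risk(names, scores):
    """Order diff names from highest to lowest probability score."""
    return [name for _, name in sorted(zip(scores, names), reverse=True)]


# Toy data (hypothetical): two scaled features per diff, label 1 = contained a bug.
X = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.7]]
y = [0, 0, 1, 1]
names = ["diff_a", "diff_b", "diff_c", "diff_d"]

weights, bias = train_logistic(X, y)
scores = [predict_proba(row, weights, bias) for row in X]
review_order = rank_by_risk(names, scores)
```

The list `review_order` is the kind of output a testing team could use to decide which changes to inspect manually first. In a real setting the scores would be evaluated on diff files held out from training, not on the training rows as in this toy example.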

