
Predicting higher order mutation score based on machine learning

Pages 57-70 | Received 21 Apr 2023, Accepted 21 Aug 2023, Published online: 31 Aug 2023

ABSTRACT

In software testing, the quality of the test suite plays a very important role, not only for the effectiveness of testing but also for the quality assurance of the software. Mutation testing is regarded as a practical, automatic and very effective technique for detecting weaknesses in a set of test cases, such as missing or redundant test cases. However, when mutation testing is applied in practice, the generation of a large number of mutants leads to very high computational costs. This raises the question of whether we can reliably and accurately predict the mutation score without running the mutants. If we can, a great deal of time and effort will be saved while the effectiveness of mutation testing is preserved. In this paper, we propose an approach that uses machine learning to perform mutation score cross-prediction for software that is new and completely different from the software used to generate the data (mutants) for model training and testing. The experimental results show that our proposed approach achieves positive results and is highly feasible. We therefore believe that the approach can be applied to significantly reduce the cost of mutation testing.

1. Introduction

Software testing plays a very important role in detecting potential software defects so that they can be reported and fixed before the software is put into use, ensuring that the software meets all specified requirements. During test execution, designing and creating test cases, including the test data generation phase, is considered a key step. The success of testing, it can be said, depends entirely on the quality of the test data set (also known as the input data) and the expected results. In particular, the quality of the test data set determines the ability to detect defects in the developed software. A good test data set covers all possible failure cases of the software, in addition to verifying whether the software functions according to the specified requirements. In other words, the quality of the software is directly proportional to the quality of the test data set used. The mutation testing technique was proposed as a method to evaluate the test data set generated for a specific programme and to answer the question of whether that data set is good enough. The technique inserts defects into the original programme by applying mutation operators to produce faulty versions (called mutants) of the original programme. Generating mutants with a single difference from the original programme is called First Order Mutation Testing (FOMT), and the mutants are called First Order Mutants (FOMs) (Acree et al., Citation1979; DeMillo et al., Citation1978; Hamlet, Citation1977). Meanwhile, mutants with two or more differences (called Higher Order Mutants – HOMs) from the original programme are generated in Higher Order Mutation Testing (HOMT) (Jia & Harman, Citation2008).

Since they were first introduced in the years 1977–1979 (Acree et al., Citation1979; DeMillo et al., Citation1978; Hamlet, Citation1977), mutation testing in general and higher order mutation testing (first defined in Jia and Harman (Citation2008)) in particular have proved to be efficient, highly automated techniques for evaluating the quality of test data sets via a ratio called the Mutation Score (MS) (DeMillo et al., Citation1978). Later, Madeyski (Citation2007) proposed a variation of this ratio, named the Mutation Score Indicator (MSI). In meta-analyses of mutation testing research, Jia and Harman (Citation2011), as well as Vu and Madeyski (Citation2014), showed that one of the three major problems in applying mutation testing is its very high computational cost, because the original programme and all its generated mutants must be run on all test cases.

Nowadays, Python has become one of the most popular programming languages and is widely used in software development, especially in artificial intelligence applications. The main limitation of Python is its slow execution speed. Therefore, testing in general, and mutation testing in particular, of Python software incurs very high computational costs, arguably higher than for software written in other programming languages.

Basically, the mutation testing process starts once the programme under test is confirmed to be bug-free through execution on the test suite built specifically for that programme. The process then has the following three main steps:

  • Step 1: Generate mutants, which are variants of the original programme. Each variant has one modification (first order mutation (Acree et al., Citation1979; DeMillo et al., Citation1978; Hamlet, Citation1977)) or multiple modifications (higher order mutation (Jia & Harman, Citation2008)) relative to the original programme.

  • Step 2: Execute, in turn, the original programme and all its mutants against a given set of test cases. The computational cost of mutation testing is very high due to the executions in this step.

  • Step 3: Calculate the mutation score (MS or MSI) and then evaluate the quality of the given set of test cases based on it. The mutation score lies between 0 and 1. A mutation score of 1 is ideal, meaning that all changes in the original programme are detected. Conversely, a mutation score of 0 means that the quality of the test suite is poor, as no changes are detected.
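As a concrete illustration of Step 3, the score can be computed as the fraction of killed mutants; a minimal sketch, assuming (as the paper does later when calculating the MS) that only Killed and Survived mutants are counted:

```python
def mutation_score(killed: int, survived: int) -> float:
    """Mutation score over the mutants that were either killed or survived.

    Incompetent and timed-out mutants are excluded, following the paper's
    convention of counting only Killed and Survived mutants.
    """
    total = killed + survived
    if total == 0:
        raise ValueError("no killed or survived mutants to score")
    return killed / total

# A score of 1.0 means every mutant was detected; 0.0 means none were.
print(mutation_score(killed=90, survived=10))  # → 0.9
```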

In this paper, we pose the research problem of how to calculate (predict) the mutation score without performing the second step above. This reduces execution time and computational cost while preserving the expected effectiveness of mutation testing. The details and results of our research on applying machine learning to mutation score prediction for Python programmes are presented in the following sections. Previous studies on this issue and our research questions are presented in Section 2. The selection of supporting tools, the algorithms used to train the model and make predictions, and the software under test are briefly presented in Section 3, while the proposed procedure used to conduct the experiment is presented in Section 4. The experimental results and our observations on them are given in Section 5. Finally, the discussion, conclusions and suggested future research directions to confirm the effectiveness of the proposed approach are presented in Section 6.

2. Related work and our research questions

According to Derezinska and Halas (Citation2014), theirs was the first study to evaluate the effectiveness of mutation testing for Python programmes. The study showed that the execution speed of programmes written in Python, and hence the execution time of mutation testing for such software, is very slow. They therefore proposed a solution that detects 'incompetent mutants' and removes them at the programme's execution stage (runtime). Incompetent mutants, as defined by the authors, are mutants that can be distinguished (killed) from the original programme by static analysis at the compile stage, so there is no need to include them in the mutation testing execution. Their experimental results showed that this is an effective way to improve the performance of mutation testing without any loss of accuracy.

Since 2012, there has been some work on applying machine learning to predict the mutation score without executing step 2 of the process described in the Introduction. However, to the best of our knowledge, only a few studies exist in this field so far. First, Strug and Strug (Citation2012), based on an analysis and evaluation of the similarity of mutants, introduced a solution that uses a classification algorithm to predict whether mutants can be killed (i.e. found to differ from the original programme) by a given test set. However, their solution still includes the mutation testing execution step (step 2 above) and only focuses on reducing the number of mutants executed on the test set. For each group of homologous mutants identified by the classification algorithm, one mutant is selected as a representative to be executed. The mutation score calculated for these representative mutants is, according to their study, taken as the mutation score for the whole mutation testing process.

Chekam et al. (Citation2018) proposed a solution that applies machine learning with a set of static programme features to select 'fault-revealing mutants' for the mutation testing process. According to them, fault-revealing mutants are those which represent (or are equivalent to) true faults and are more likely to lead to test cases capable of detecting faults. Their results showed that the mutation testing technique can be applied more effectively without executing all mutants on all test cases.

Another study, introduced by Zhang et al. (Citation2019), is in our opinion the first in the field of predicting the mutation score without performing step 2 of the basic mutation testing process. In that study, they proposed a solution using classification models based on 14 features of the mutants and test cases. They conducted experiments with both methods (full mutation testing, and mutation testing without step 2) on real projects written in the Java programming language. Analysing the experimental results, they showed that mutation testing without step 2 is more than 150 times more efficient in terms of execution time and computational cost, with a correct prediction rate of about 60% or more. Similarly, Tan et al. (Citation2019) suggested a method to predict the mutation score of each statement in the programme using dynamic execution features. The authors conducted an experiment with five different models: Logistic Regression, Artificial Neural Network, Random Forest, Support Vector Machine and Symbolic Regression. Based on an evaluation and comparison of the experimental results, they concluded that their solution predicts mutation testing results with high accuracy, especially when using an Artificial Neural Network.

In a study published in 2020, Alireza and Mirian-Hosseinabadi (Citation2020) proposed a solution to improve the prediction of mutation testing results. They posed three research questions: (1) How do killed mutants affect prediction performance? (2) In what way is their proposed solution better than previously proposed solutions? (3) Which features are important for prediction? They first analysed the advantages and disadvantages of previously proposed solutions and then devised a solution that predicts the mutation score by combining two algorithms (Random Forest and Gradient Boosting) in their machine learning model. The experimental results of that study showed that, by the authors' assessment, the approach is superior to previous solutions in terms of AUC (Area Under the Curve), MCC (Matthews Correlation Coefficient) and BA (Balanced Accuracy). Kim et al. (Citation2022) proposed a solution using the Predictive Mutation Analysis (PMA) technique to predict the entire kill matrix. They used a machine learning model to exploit the natural language channel in the source code and learn the relationship between the syntax and semantics of the test cases and the mutants that each test case could kill. They then used the predictive model to predict mutation testing results for later versions of the software under test, including versions with changes to the source code and test cases. The results showed that it is possible to accurately predict all killed mutants and the mutation score for different versions of the software under test. However, the authors did not report whether the model is effective for new software.

The analysis of the related studies above leads to the main research question of our study: can we achieve the results and efficiency of higher order mutation testing without executing the generated mutants against the given set of test cases? More specifically, can we propose a solution that reliably and accurately predicts the percentage of higher order mutants that will be killed, without performing the mutation testing executions?

To answer these questions, in this paper we propose a solution and conduct experiments on software projects written in Python, a popular language with a wide range of applications but slow execution speed. Answering the questions will help us evaluate the quality of test data sets for software written in Python (and, of course, the approach can be applied to other programming languages) without performing step 2 of the basic mutation testing process, thereby reducing the execution time and cost of mutation testing without reducing its effectiveness.

3. Proposed approach

3.1. Creating dataset

Many studies (Jia & Harman, Citation2008, Citation2011; Madeyski, Citation2007; Van-Nho et al., Citation2019, Citation2021, Citation2022, May; Vu & Madeyski, Citation2014, Citation2015, Citation2016a, Citation2016b, Citation2016c, Citation2017) have shown that higher order mutation testing can simultaneously overcome three main limitations of mutation testing: the large number of generated mutants (leading to very high computational cost), the low realism of generated mutants and the equivalent mutant problem. In other words, higher order mutation testing is an approach that can improve the effectiveness of mutation testing. Like first order mutation testing, the higher order mutation testing process also has three main steps (see the Introduction). Therefore, in this study, we propose an approach that uses machine learning to predict the MS for higher order mutation testing. Our goal is to build a high-precision predictive model that can be applied to higher order mutation testing without executing step 2 of the three-step mutation testing process.

In our experiment, we use Second Order Mutation Testing (SOMT), which belongs to the family of higher order mutation testing, to illustrate our proposal. To generate as many Second Order Mutants (SOMs) as possible for use in the predictive models, we generate SOMs in two steps. First, for the given programme under test, we generate as many FOMs as possible. Then we produce SOMs by combining two different generated FOMs. With this method, we obtain about n(n − 1)/2 mutants for each programme under test, where n is the total number of FOMs generated in the first step.
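The pairing step can be sketched with `itertools.combinations`; the FOM identifiers below are placeholders for illustration, not real mutants:

```python
from itertools import combinations

# Hypothetical FOM identifiers; in practice each would be a mutated programme.
foms = ["fom_1", "fom_2", "fom_3", "fom_4"]

# Each SOM combines two distinct FOMs, giving n(n - 1)/2 unordered pairs.
soms = list(combinations(foms, 2))

n = len(foms)
assert len(soms) == n * (n - 1) // 2
print(len(soms))  # → 6
```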

3.2. Supporting tools, algorithms and software under test

3.2.1. Supporting tools

In our experiment, we used MutPy (https://pypi.org/project/MutPy/) to perform mutation testing of software written in Python. It is a mutation testing tool that supports unit test execution, mutant generation and mutation testing. With MutPy, FOMs are generated based on the abstract syntax tree (AST) of the programme. More specifically, FOMs are generated by mutating the nodes of the AST representing the software under test: a node is replaced by another node, a new node is added, or a node is deleted. Then, following the method described above, we built our own tool to generate SOMs by combining two different generated FOMs.
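The node-replacement idea can be illustrated with Python's standard `ast` module; this is a simplified sketch of an AOR-style (Arithmetic Operator Replacement) mutation, not MutPy's actual implementation (requires Python 3.9+ for `ast.unparse`):

```python
import ast

source = "def f(a, b):\n    return a + b\n"
tree = ast.parse(source)

class AORMutator(ast.NodeTransformer):
    """Illustrative Arithmetic Operator Replacement: swap '+' for '-'.

    This mimics the node-replacement idea behind MutPy's AST-based
    mutation; MutPy's real operator classes are more elaborate.
    """
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()  # replace the '+' node with a '-' node
        return node

mutant_tree = AORMutator().visit(tree)
ast.fix_missing_locations(mutant_tree)
print(ast.unparse(mutant_tree))  # the mutant returns a - b
```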

In addition, the Scikit-learn library (https://scikit-learn.org/stable/) is used in the experiment. It is distributed under a BSD licence and is currently considered one of the most powerful Python libraries for machine learning. Scikit-learn provides a set of tools for machine learning and statistical modelling, including classification, regression, clustering and dimensionality reduction.

3.2.2. Algorithms

In our experiment, we used the following algorithms to train the model and predict which mutants will be killed. These are all effective and popular algorithms in machine learning:

  • Logistic Regression: a classification algorithm that assigns objects to a set of discrete values by producing a probability estimate; it was originally developed and popularised by Berkson (Citation1944, September). It is commonly used for prediction and classification problems and is suitable for the problem in our study.

  • Random Forest: a supervised machine learning algorithm widely used for both classification and regression. It builds decision trees on different samples and combines their outputs, taking the majority vote for classification or the average for regression, to reach a single result. It was first introduced by Breiman (Citation2001).

  • XGBoost: an optimised distributed gradient boosting library designed to be highly efficient, flexible and portable (The XGBoost Contributors, Citation2023). It implements machine learning algorithms under the Gradient Boosting framework and provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately.

  • LightGBM: This is a gradient boosting framework that uses tree-based learning algorithms (Microsoft, Citation2023). It is designed to be distributed and efficient with the following advantages: faster training speed and higher efficiency; lower memory usage; better accuracy; support of parallel, distributed and GPU learning; capable of handling large-scale data.
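All four models expose the same `fit`/`predict` interface; a sketch on synthetic data (a stand-in feature table, not the paper's dataset) using the two scikit-learn models, into which `lightgbm.LGBMClassifier` and `xgboost.XGBClassifier` could be dropped unchanged:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the mutant feature table (Killed=1 / Survived=0);
# the study extracts 13 AST- and coverage-based features per SOM.
X, y = make_classification(n_samples=2000, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# lightgbm.LGBMClassifier and xgboost.XGBClassifier expose the same
# fit/predict interface and could be added to this tuple unchanged.
scores = {}
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=42)):
    model.fit(X_tr, y_tr)
    scores[type(model).__name__] = f1_score(y_te, model.predict(X_te))

print(scores)
```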

3.2.3. Software under test

The project used to create the dataset for model training, testing and prediction is the CPython project, version 3.7 (https://github.com/python/cpython/tree/3.7). Meanwhile, the 5 projects used for prediction (cross-prediction) are real projects downloaded from the following links:

(1) https://github.com/ElijahLawal-7/AirBnBclone

(2) https://github.com/avocado-framework/avocado

(3) https://github.com/tvaroglu/connectfour

(4) https://github.com/Maillol/cricri

(5) https://github.com/CleanCut/green

3.2.4. Features used

Based on the study by Zhang et al. (Citation2019), out of the 14 features used by the authors, we found that some are readily available and effective for generating second/higher order mutants, as well as effective in our proposed predictive models. The choice of these features also draws on the experimental results of Alireza and Mirian-Hosseinabadi (Citation2020). Therefore, in this study, we use the following 8 features in our experiments:

(1) typeStatement: the type of the statement that is mutated. This feature is extracted from the mutated node or the parent nodes of the mutated node.

(2) typeOperator: the type of mutation operator used, one of the following:

  • AOD – Arithmetic Operator Deletion

  • AOR – Arithmetic Operator Replacement

  • ASR – Assignment Operator Replacement

  • BCR – Break Continue Replacement

  • COD – Conditional Operator Deletion

  • COI – Conditional Operator Insertion

  • CRP – Constant Replacement

  • DDL – Decorator Deletion

  • EHD – Exception Handler Deletion

  • EXS – Exception Swallowing

  • IHD – Hiding Variable Deletion

  • IOD – Overriding Method Deletion

  • IOP – Overridden Method Calling Position Change

  • LCR – Logical Connector Replacement

  • LOD – Logical Operator Deletion

  • LOR – Logical Operator Replacement

  • ROR – Relational Operator Replacement

  • SCD – Super Calling Deletion

  • SCI – Super Calling Insert

  • SIR – Slice Index Remove

  • CDI – Class method Decorator Insertion

  • OIL – One Iteration Loop

  • RIL – Reverse Iteration Loop

  • SDI – Static method Decorator Insertion

  • SDL – Statement Deletion

  • SVD – Self Variable Deletion

  • ZIL – Zero Iteration Loop

(3) typeReturn: the return type of the mutated operator.

(4) depthOnAST: the depth of the mutated node in the AST. Before performing the mutation, we add an attribute that stores the depth of each node in the AST, so that when generating mutants we can read this attribute from the mutated node.

(5) distanceBetweenTwoMutants: the distance between the two mutated nodes on the AST; we use the LCA (Least Common Ancestor) method to calculate this distance.

(6) result: the status of the mutant when executed against the test cases. There are four statuses: Killed (some test case detects the difference between the original programme and the mutant), Survived (no test case detects the difference between the original and mutant programmes), Incompetent (the mutant is eliminated at runtime (Derezinska & Halas, Citation2014)) and Timeout (the execution time exceeds the allowed threshold). Our research focuses on the mutants whose status is Killed or Survived because, when calculating the MS (DeMillo et al., Citation1978), we only need to consider the number of mutants in these two statuses.

(7) line: the first line of the code block affected by the applied mutation operators.

(8) end_line: the last line of the code block affected by the applied mutation operators.

In addition, we used the tool 'Coverage.py' (https://coverage.readthedocs.io/en/7.2.2), which measures the code coverage of Python programmes. With this tool, we can monitor the programme and determine which source code has been executed. More specifically, we used 'Coverage.py' to run the programme under test against each test case and check which lines of code in the programme under test are 'covered' by that test case; in other words, which specific lines of code are 'affected' by each specific test case. From this, we propose the following 5 additional features:

(9) ancestor: whether the two FOMs have a parent-child relationship.

(10) start_line_coverage: whether the first line of the mutant is covered by the test cases.

(11) start_line_test_coverage: the number of test cases that cover the first line of the mutant, out of the given total number of test cases.

(12) body_coverage: the number of lines of the mutant covered by test cases divided by the total number of lines of source code of the mutant.

(13) body_test_coverage: the total number of test cases covering each line of code in the mutant, divided by the number of test cases multiplied by the number of lines of the mutant.
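Features (4) and (5) can be reconstructed, in simplified form, from a parent-annotated AST; the helper names below are our own illustration, not code from the paper:

```python
import ast

def annotate(tree):
    """Attach parent and depth attributes to every AST node
    (cf. features depthOnAST and distanceBetweenTwoMutants)."""
    tree.parent, tree.depth = None, 0
    for node in ast.walk(tree):  # breadth-first: parents before children
        for child in ast.iter_child_nodes(node):
            child.parent = node
            child.depth = node.depth + 1
    return tree

def ancestors(node):
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain

def lca_distance(a, b):
    """Path length between two nodes through their least common ancestor."""
    seen = set(map(id, ancestors(a)))
    lca = next(n for n in ancestors(b) if id(n) in seen)
    return (a.depth - lca.depth) + (b.depth - lca.depth)

tree = annotate(ast.parse("def f(x):\n    return x + 1\n"))
binop = next(n for n in ast.walk(tree) if isinstance(n, ast.BinOp))
name = next(n for n in ast.walk(tree) if isinstance(n, ast.Name))
print(binop.depth, lca_distance(binop, name))  # → 3 1
```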

4. Experimental process

The experimental procedure is carried out in the following two sub-processes.

4.1. Sub-process 1: model training and testing

Step 1.1. Generate FOMs for the software under test: CPython project version 3.7 (https://github.com/python/cpython/tree/3.7).

Step 1.2. Generate SOMs for the programme under test by combining any two different FOMs generated in Step 1.1.

Step 1.3. Execute the original programme and all generated SOMs on the given test cases. Then, calculate the mutation score (referred to as MS1).

Step 1.4. Use the algorithms to train the model (*) (using 80% of the SOMs generated in Step 1.2 that have been labelled 'Killed' or 'Survived').

Step 1.5. After completing the training, predict which mutants will be killed (using the remaining 20% of the labelled SOMs) and calculate the mutation score from the predictions (referred to as MS2).

Step 1.6. Compare MS1 with MS2, and match the mutants actually killed in Step 1.3 against those predicted to be killed in Step 1.5.

4.2. Sub-process 2: cross-prediction performing

Step 2.1. Generate FOMs for the 5 projects used for prediction.

Step 2.2. Generate SOMs for each programme under test by combining any two different FOMs generated in Step 2.1.

Step 2.3. Predict which mutants will be killed using the model (*) trained in Sub-process 1, and calculate the MS based on the number of SOMs predicted to be killed and the total number of SOMs generated in Step 2.2 (referred to as MS3).

Step 2.4. Execute the original programme and all generated SOMs on the given test cases. Then, calculate the MS based on the number of actually killed SOMs and the total number of SOMs generated in Step 2.2 (referred to as MS4).

Step 2.5. Compare MS3 with MS4, and match the mutants predicted to be killed in Step 2.3 against those actually killed in Step 2.4.
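The MS comparisons in Steps 1.6 and 2.5 reduce to comparing a score computed from actual kill labels with one computed from predicted labels; a toy sketch with made-up labels, not the paper's data:

```python
def mutation_score(labels):
    """MS over Killed/Survived mutants: the fraction labelled 'killed'."""
    return sum(1 for status in labels if status == "killed") / len(labels)

# Hypothetical actual vs. predicted kill labels for six SOMs.
actual    = ["killed", "killed", "survived", "killed", "survived", "killed"]
predicted = ["killed", "survived", "survived", "killed", "survived", "killed"]

ms_actual = mutation_score(actual)        # e.g. MS1/MS4: from real executions
ms_predicted = mutation_score(predicted)  # e.g. MS2/MS3: from the classifier
match_rate = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(ms_actual, ms_predicted, match_rate)
```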

5. Analysis and evaluation of experimental results

In this section, we will present the analysis and evaluation of the experimental results in two separate processes: (1) Model training and testing, corresponding to sub-process 1; and (2) Cross-prediction performing, corresponding to sub-process 2.

In our experiments, we used a computer with an RTX 3090 Ti GPU (24 GB VRAM) and a Core i7-13900 CPU with 32 GB RAM to run our empirical studies.

5.1. Model training and testing

As described in the subsection 'Software under test' of the Section 'Proposed approach', we used the CPython project version 3.7 (https://github.com/python/cpython/tree/3.7) to generate the model training dataset as well as the test data. These datasets consist of SOMs split as follows: 80% of the SOMs are used as the model training dataset and the remaining 20% as test data.

The total number of FOMs generated is 64,128, and from these FOMs more than 2 billion SOMs were generated, covering all four statuses: Killed, Survived, Incompetent and Timeout. Because our study focuses only on predicting the MS, we are only interested in the SOMs in the Killed and Survived groups. There are 14,194,665 such SOMs, of which 11,355,732 were used for model training and 2,838,933 for prediction testing.

In terms of time performance, with a training dataset of about 4 GB, the LightGBM and XGBoost algorithms are much faster (total training time of about 10 minutes each) than the Logistic Regression and Random Forest Classifier algorithms (about 240-300 minutes each).

The experimental results are presented in Table 1.

Table 1. Experimental results for model training and prediction testing.

After training the model and testing the prediction (with the 11,355,732 training SOMs) using the four algorithms Logistic Regression, Random Forest Classifier, LightGBM and XGBoost in turn, we obtained correct prediction rates of 81%, 85%, 90.21% and 91.44%, respectively. This accuracy is high enough to apply the models to predict whether each of the remaining 2,838,933 SOMs will be Killed or Survived, with the following results. However, the experimental results presented in Table 1 show that the predictions of the Logistic Regression and Random Forest Classifier algorithms are not good enough in terms of their Recall and F1-score values. That, together with the training time advantage described above, is why we focus on analysing the effectiveness of predictive testing using LightGBM and XGBoost only.

5.1.1. Predictive testing using the LightGBM algorithm

In this experiment, 1,634,353 mutants were predicted to be 'Survived' out of a total of 1,712,101 actually survived mutants. Meanwhile, only 1,126,832 mutants were actually killed, whereas the number of mutants predicted to be killed is 1,204,580. The prediction rates (Precision values) are 90% (for predicting the surviving mutants) and 91% (for predicting the killed mutants), respectively.

Out of the 1,634,353 mutants predicted to be 'Survived', 1,533,959 actually survived; the rest were predicted wrongly because they were actually killed mutants. The correct prediction rate (Recall value) for this group is 94%. For the group of killed SOMs, 1,026,438 mutants were correctly predicted as killed, out of the 1,204,580 mutants predicted to be killed. The match rate (Recall value) for this group is 85%. The accuracy (F1-score values) for these two predictions is 90% (for the surviving mutants) and 88% (for the killed mutants), respectively.
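For reference, the three reported metrics computed from confusion-matrix counts under their standard definitions (the counts below are illustrative, not the paper's data):

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts (not the paper's data): of 1,000 mutants predicted
# killed, 900 really were killed, and 50 killed mutants were missed.
p, r, f1 = prf(tp=900, fp=100, fn=50)
print(round(p, 3), round(r, 3), round(f1, 3))  # → 0.9 0.947 0.923
```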

5.1.2. Predictive testing using the XGBoost algorithm

With predictive testing using the XGBoost algorithm, we obtained similar results. The number of SOMs predicted to survive is 1,633,032, whilst the number of SOMs that actually survived is 1,688,759. The corresponding numbers for the killed mutants are 1,150,174 (actually killed) and 1,205,901 (predicted to be killed). The prediction rates (Precision values) of this model are 91% (for predicting the surviving mutants) and 92% (for predicting the killed mutants), respectively.

In fact, only 1,539,280 of the 1,633,032 mutants predicted to survive were predicted correctly. Meanwhile, 1,056,422 SOMs were correctly predicted as killed, out of the 1,205,901 predicted to be killed. The correct prediction rates (Recall values) for the surviving and killed mutants using the XGBoost algorithm are 94% and 88%, respectively. The accuracy (F1-score values) for the surviving and killed mutants is 93% and 90%, respectively.

From the data matched, calculated and presented above, we find that the mutants predicted to be Survived or Killed coincide with the actually surviving or killed mutants at a very high rate, with mean Recall values of 90% or more for both algorithms (LightGBM and XGBoost). Moreover, the F1-score values are very high (around 90%). This shows that our proposed approach achieves positive results and is feasible for cross-prediction, that is, predicting the mutation score of software that is new and completely different from the software used for model training and testing.

5.2. Cross-prediction performing

The experimental results for cross-prediction with both algorithms (LightGBM and XGBoost) are presented in Tables 2 and 3. The predictions of survived and killed mutants match the actual outcomes quite well (Recall values from 78% to 92% for the LightGBM algorithm and from 83% to 93% for the XGBoost algorithm). The F1-score values are also very high, at about 88%. These positive results demonstrate that our proposed prediction approach is reliable and highly accurate.

Table 2. Experimental results for cross-prediction using the LightGBM algorithm.

Table 3. Experimental results for cross-prediction using the XGBoost algorithm.

Finally, the above experimental results and analysis answer the posed research question: we can accurately predict the number of higher order mutants that will be killed, as well as the number that will survive, without executing the generated mutants against the given set of test cases.

6. Discussions, conclusions and future works

Since its first appearance in 1977, mutation testing has become one of the most studied and practised testing techniques thanks to its advantages in terms of ease of automation. Many approaches have been proposed to improve its effectiveness, such as reducing the number of generated mutants and the running time, or selecting high-quality mutation operators. Higher order mutation was proposed to solve three problems at the same time: the large number of mutants, the reflection of real faults, and equivalent mutants. Recently, a few studies have started to predict killed mutants or the mutation score instead of running all the mutants. However, none of these studies focuses on higher order mutation.

In this paper, we propose applying different machine learning algorithms, including Logistic Regression, Random Forest Classifier, LightGBM and XGBoost, to predict killed second order mutants for Python programmes. Our approach analyses source code to extract features related to higher order mutation, and the machine learning algorithms are employed to predict killed mutants without running them. The predicted mutation score is compared to the actual mutation score computed by running the mutants, and we also check the predicted killed mutants against the actual killed mutants. The experimental results show that the LightGBM and XGBoost algorithms are very effective (the F1-score is about 0.88), which allows the approach to be applied in practice to reduce the effort of mutation testing.
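The comparison of predicted against actual mutation score described above can be sketched as follows. The per-mutant "killed" predictions from the model are aggregated into a score, while the actual score comes from executing the mutants; the data and function names here are illustrative, not the paper's pipeline, and equivalent mutants are assumed to have been excluded from the totals.

```python
def mutation_score(killed_flags):
    """Mutation score = killed mutants / total (non-equivalent) mutants."""
    return sum(killed_flags) / len(killed_flags)

# Illustrative per-mutant kill verdicts for six second order mutants.
predicted_killed = [True, True, False, True, False, True]   # model output
actual_killed    = [True, True, False, False, False, True]  # from execution

print(mutation_score(predicted_killed))  # 4/6 ~ 0.667
print(mutation_score(actual_killed))     # 3/6 = 0.5
```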

In future work, we intend to study additional machine learning algorithms and to apply the approach to other programming languages.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • Microsoft (2023). LightGBM Documentation [Computer software]. Microsoft.
  • Acree, A. T., Budd, T. A., DeMillo, R. A., Lipton, R. J., & Sayward, F. G. (1979). Mutation analysis (Technique Report GIT-ICS-79/08). Georgia Institute of Technology.
  • Alireza, A., & Mirian-Hosseinabadi, S. H. (2020). The threat to the validity of predictive mutation testing: The impact of uncovered mutants, ArXiv abs/2005.11532.
  • Berkson, J. (1944, September). Application of the logistic function to bio-assay. Journal of the American Statistical Association, 39(227), 357–365.
  • Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
  • Chekam, T. T., Papadakis, M., Bissyandé, T., & Le Traon, Y. (2018). Predicting the fault revelation utility of mutants. In 2018 IEEE/ACM 40th international conference on software engineering: Companion (ICSE-Companion) (pp. 408–409).
  • DeMillo, R. A., Lipton, R. J., & Sayward, F. G. (1978). Hints on test data selection: Help for the practicing programmer. IEEE Computer, 11(4), 34–41. https://doi.org/10.1109/C-M.1978.218136
  • Derezinska, A., & Halas, K. (2014). Experimental evaluation of mutation testing approaches to python programs. In 2014 IEEE seventh international conference on software testing, verification and validation workshops (pp. 156–164).
  • Hamlet, R. G. (1977). Testing programs with the aid of a compiler. IEEE Transactions on Software Engineering, SE-3(4), 279–290. https://doi.org/10.1109/TSE.1977.231145
  • Jia, Y., & Harman, M. (2008). Constructing subtle faults using higher order mutation testing. In Proc. eighth int'l working conf. source code analysis and manipulation (pp. 249–258).
  • Jia, Y., & Harman, M. (2011). An analysis and survey of the development of mutation testing. IEEE Transactions on Software Engineering, 37(5), 649–678. https://doi.org/10.1109/TSE.2010.62
  • Kim, J., Jeon, J., Hong, S., & Yoo, S. (2022). Predictive mutation analysis via the natural language channel in source code. ACM Transactions on Software Engineering and Methodology, 31(4), 1–27. https://doi.org/10.1145/3510417
  • Madeyski, L. (2007). On the effects of pair programming on thoroughness and fault-finding effectiveness of unit tests. In J. Münch & P. Abrahamsson (Eds.), Product-Focused Software Process Improvement (pp. 207–221). 4589 of Lecture notes in computer science Springer.
  • Strug, J., & Strug, B. (2012). Machine learning approach in mutation testing. In: B. Nielsen, C. Weise (Eds.), Testing software and systems, ICTSS 2012 (pp. 200–214). Lecture notes in computer science: Vol. 7641 Springer.
  • Tan, L., Gong, Y., & Wang, Y. (2019). A model for predicting statement mutation scores. Mathematics, 7(9), 778. https://doi.org/10.3390/math7090778
  • The XGBoost Contributors (2023). XGBoost Documentation [Computer software]. XGBoost.
  • Van-Nho, D., Vu, N. Q., & Binh, N. T. (2019). A solution for improving the effectiveness of higher order mutation testing. In 2019 IEEE-RIVF international conference on computing and communication technologies (RIVF) (pp. 1–5).
  • Van-Nho, D., Vu, N. Q., & Binh, N. T. (2021). Evaluating mutation operator and test case effectiveness by means of mutation testing. In N.T. Nguyen, S. Chittayasothorn, D. Niyato, B. Trawiński (Eds.), Intelligent information and database systems, ACIIDS 2021 (pp. 837–850) Lecture notes in computer science: Vol. 12672 Springer.
  • Van-Nho, D., Vu, N. Q., & Binh, N. T. (2022, May). Toward improving the quality of mutation operator and test case effectiveness in higher-order mutation testing. Vietnam Journal of Computer Science, 2196–8888.
  • Vu, N. Q., & Madeyski, L. (2014). Problems of mutation testing and higher order mutation testing. In Advanced computational methods for knowledge engineering (pp. 157–172). Advances in intelligent systems and computing: Vol. 282.
  • Vu, N. Q., & Madeyski, L. (2015). Searching for strongly subsuming higher order mutants by applying multiobjective optimization algorithm. Advanced computational methods for knowledge engineering (pp. 391–402). Advances in intelligent systems and computing: Vol. 358.
  • Vu, N. Q., & Madeyski, L. (2016a). Higher order mutation testing to drive development of new test cases: An empirical comparison of three strategies. In Lecture notes in computer science (Vol. 9621, pp. 235–244).
  • Vu, N. Q., & Madeyski, L. (2016b). On the relationship between the order of mutation testing and the properties of generated higher order mutants. In Lecture notes in computer science (pp. 245–254).
  • Vu, N. Q., & Madeyski, L. (2016c). Empirical evaluation of multi-objective optimization algorithms searching for higher order mutants. Cybernetic and Systems: An International Journal, 47(1–2), 48–68.
  • Vu, N. Q., & Madeyski, L. (2017). Addressing mutation testing problems by applying multi-objective optimization algorithms and higher order mutation. Journal of Intelligent & Fuzzy Systems, 32(2), 1173–1182. https://doi.org/10.3233/JIFS-169117
  • Zhang, J., Zhang, L., Harman, M., Hao, D., Jia, Y., & Zhang, L. (2019). Predictive mutation testing. IEEE Transactions on Software Engineering, 45(9), 898–918. https://doi.org/10.1109/TSE.32