Detection method for software self-admitted technical debt

Document No.: 7441    Publication date: 2021-09-17

1. A method for detecting software self-admitted technical debt (SATD), comprising the steps of:

step 1: acquiring and processing a data set;

adopting a public source code comment data set and dividing its data into a plurality of project data sets, then performing tokenization and stop-word removal on each project data set, and then performing text representation to convert the comments into a comment matrix;

inputting the comment matrix into an Embedding layer of the neural network, training the Embedding layer, and outputting word vectors; concatenating the word vectors in comment order to obtain a word vector matrix;

step 2: constructing an SATD detection model;

the SATD detection model comprises three parallel base classifiers;

the first base classifier is a CNN;

the second base classifier is a CNN-LSTM hybrid model;

the third base classifier is a DPCNN;

inputting the word vector matrix into each of the three base classifiers and training each base classifier independently, each base classifier outputting, after training is finished, the probability that the comment data is SATD;

fusing the outputs of the three base classifiers to obtain the final probability that the comment data is SATD;

setting a classification threshold, and if the probability of being SATD is greater than the classification threshold, judging that the comment data is SATD; if the probability is less than or equal to the classification threshold, judging that the comment data is non-SATD;

step 3: inputting the comment data to be detected into the SATD detection model, and outputting a result indicating whether the comment data to be detected is SATD.

2. The method for detecting software self-admitted technical debt according to claim 1, wherein the public source code comment data set is the data set compiled by Maldonado and Shihab.

3. The method for detecting software self-admitted technical debt according to claim 1, wherein each base classifier is trained independently using the Adam algorithm, with a focal loss function as the loss function.

4. The method for detecting software self-admitted technical debt according to claim 1, wherein the classification threshold is 0.5.

Background

Technical debt is a metaphor for the use of incomplete, temporary, or suboptimal solutions during software development. The concept was first proposed by Cunningham in 1992, who regarded "not-quite-right code" as a form of debt. Later, Potdar and Shihab studied source code comments that point to debt instances; by manually inspecting a large number of comments, they found that developers tend to state in comments what may be wrong with the code. Based on this finding, Potdar and Shihab proposed Self-Admitted Technical Debt (SATD) and summarized 62 patterns for identifying it. Researchers have since continued to study SATD, discussing and improving various methods and techniques. Identification is the primary problem in current research, and better analysis and management of SATD is also a current research focus.

Existing research on SATD identification falls into four main approaches: pattern-based methods, natural language processing methods, text mining methods, and deep neural network (CNN) methods. Among them, Ren et al. made a preliminary attempt to identify SATD using convolutional neural networks. They analyzed five characteristics of SATD comment text that affect the performance, generalizability, and applicability of pattern-based and traditional text-mining-based SATD detection. To improve SATD identification accuracy, particularly cross-project accuracy, and to improve the interpretability of machine-learning-based identification results, their method learns informative text features for the SATD identification task from comment data. However, when a convolutional neural network is used for SATD identification, it pays excessive attention to local features and loses part of the global features. Furthermore, a single-model approach learns from training data of known classes to obtain a single classifier, which is then used to classify unknown data. Compared with combining multiple models, its classification accuracy is lower and its generalization ability weaker.

Disclosure of Invention

To overcome the shortcomings of the prior art, the invention provides a method for detecting software self-admitted technical debt (SATD). First, a data set is acquired and processed. Then an SATD detection model is constructed, comprising three parallel base classifiers: a CNN, a CNN-LSTM hybrid model, and a DPCNN. The word vector matrix is input into each of the three base classifiers, and each base classifier outputs the probability that the comment data is SATD. The classification results output by the three base classifiers are fused to obtain the final probability that the comment data is SATD. Finally, this probability is compared with the classification threshold, and a result indicating whether the comment data is SATD is output. The method reduces the high false positive rate, identifies more SATD than other methods, and reduces the bias of a single model, thereby addressing the false positives and low accuracy of SATD identification.

The technical scheme adopted by the invention for solving the technical problem comprises the following steps:

step 1: acquiring and processing a data set;

adopting a public source code comment data set and dividing its data into a plurality of project data sets, then performing tokenization and stop-word removal on each project data set, and then performing text representation to convert the comments into a comment matrix;

inputting the comment matrix into an Embedding layer of the neural network, training the Embedding layer, and outputting word vectors; concatenating the word vectors in comment order to obtain a word vector matrix;

step 2: constructing an SATD detection model;

the SATD detection model comprises three parallel base classifiers;

the first base classifier is a CNN;

the second base classifier is a CNN-LSTM hybrid model;

the third base classifier is a DPCNN;

inputting the word vector matrix into each of the three base classifiers and training each base classifier independently, each base classifier outputting, after training is finished, the probability that the comment data is SATD;

fusing the outputs of the three base classifiers to obtain the final probability that the comment data is SATD;

setting a classification threshold, and if the probability of being SATD is greater than the classification threshold, judging that the comment data is SATD; if the probability is less than or equal to the classification threshold, judging that the comment data is non-SATD;

step 3: inputting the comment data to be detected into the SATD detection model, and outputting a result indicating whether the comment data to be detected is SATD.

Preferably, the public source code comment data set is the data set compiled by Maldonado and Shihab.

Preferably, each base classifier is trained independently using the Adam algorithm, with a focal loss function as the loss function.

Preferably, the classification threshold is 0.5.

The invention has the following beneficial effects:

1. The invention reduces the high misclassification rate and identifies more SATD than other methods.

2. The invention ensembles the CNN, LSTM and DPCNN deep learning models, which makes the extracted features more comprehensive and reduces the bias of a single model, thereby addressing the misclassification and low accuracy of SATD identification.

3. The invention was applied to ten projects, with intra-project and cross-project experiments carried out separately. The results show that the method is effective, especially for intra-project prediction: it clearly alleviates the low identification accuracy reported in previous research, particularly the difficulty of identification under data imbalance.

4. In addition, the improvement is more pronounced on data sets with clear data imbalance. Because data imbalance is common in real projects, the method is better suited to practical use than other methods.

Drawings

FIG. 1 is a framework diagram of the SATD detection model of the present invention.

FIG. 2 is a schematic diagram of the data set processing of the present invention.

FIG. 3 shows the F-Measure results of the method of the present invention and other methods on 10 projects.

FIG. 4 is a visualization of important SATD features according to an embodiment of the invention.

Detailed Description

The invention is further illustrated with reference to the following figures and examples.

When a traditional convolutional neural network model is used for SATD identification, it pays excessive attention to local features and loses part of the global features. Furthermore, a single-model approach learns from training data of known classes to obtain a single classifier, which is then used to classify unknown data; compared with combining multiple models, its classification accuracy is lower and its generalization ability weaker. In the method of the invention, ensemble learning combines several learners and thereby improves the generalization ability of the model, and compared with a single model the ensemble improves identification accuracy. Ensembling deep neural networks combines the high performance of deep learning with the generalization ability of ensemble learning.

The framework of the detection model of the present invention is shown in FIG. 1. An advantage of ensemble classifiers over single classifiers is their ability to correct the errors of individual ensemble members and so improve overall performance. Research has also shown that ensemble classifiers are more reliable than single classifiers and can improve prediction accuracy over base classifiers by reducing the risk of selecting a poorly learned model. The invention selects three well-performing deep learning models for the ensemble: a CNN model, a CNN-LSTM model and a DPCNN model. Each model is trained and optimized, the outputs are then fused at the output layer, and the fused result determines the final classification as SATD or non-SATD.

A method for detecting software self-admitted technical debt (SATD), comprising the steps of:

step 1: acquiring and processing a data set;

adopting a public source code comment data set, such as the Maldonado and Shihab data set, and dividing its data into a plurality of project data sets, then performing tokenization and stop-word removal on each project data set, and then performing text representation to convert the comments into a comment matrix;

inputting the comment matrix into an Embedding layer of the neural network, training the Embedding layer, and outputting word vectors; concatenating the word vectors in comment order to obtain a word vector matrix;

step 2: constructing an SATD detection model;

the SATD detection model comprises three parallel base classifiers;

the first base classifier is a CNN;

the second base classifier is a CNN-LSTM hybrid model;

the third base classifier is a DPCNN;

inputting the word vector matrix into each of the three base classifiers and training each base classifier independently using the Adam algorithm, with a focal loss function as the loss function; after training is finished, each base classifier outputs the probability that the comment data is SATD;

fusing the outputs of the three base classifiers to obtain the final probability that the comment data is SATD;

setting a classification threshold, and if the probability of being SATD is greater than 0.5, judging that the comment data is SATD; if the probability is less than or equal to 0.5, judging that the comment data is non-SATD;

step 3: inputting the comment data to be detected into the SATD detection model, and outputting a result indicating whether the comment data to be detected is SATD.

The specific embodiment is as follows:

1. Acquisition and processing of data sets

The data set used by the present invention comes from Maldonado and Shihab; it has been published on GitHub and is used by many researchers. The data set selects 33090 source code comments from ten projects: Ant, ArgoUML, Columba, EMF, Hibernate, JEdit, JFreeChart, JMeter, JRuby and SQuirrel. Detailed project information, including the project release, the number of contributors, the number of source lines of code (SLOC), the application domain and the number of comments, is shown in Table 1; the number and proportion of SATD comments are shown in Table 2. Maldonado and Shihab used an open-source plug-in to parse the source code and extract comments, filtered them with five filtering heuristics to obtain 62566 comments, and classified these 62566 comments as SATD or non-SATD. The SATD comments are divided into five categories: design debt, defect debt, documentation debt, requirement debt and test debt.

Table 1 Project-related information

Project      Release    # of Contributors    SLOC
Ant          1.7.0      74                   115,881
ArgoUML      0.34       87                   176,839
Columba      1.4        9                    100,200
EMF          2.4.1      30                   228,191
Hibernate    3.3.2GA    226                  173,467
JEdit        4.2        57                   88,583
JFreeChart   1.0.19     19                   132,296
JMeter       2.10       33                   81,307
JRuby        1.4.0      328                  150,060
SQuirrel     3.0.3      46                   215,234

Table 2 Project-related data statistics

FIG. 2 shows the data preparation process: the source code data is divided into ten project data sets, each of which is tokenized, stripped of stop words, and so on, and then represented as text and converted into a comment matrix. Specifically, stop words in the source code comments are removed and the offset is cut off, which makes word vector representation easier and supports better semantic representation. For example, words such as "a" and "can" are useless for text analysis, so they are removed; this both reduces the size of the data files and improves processing efficiency.
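The preparation steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the invention: the stop-word list, the regular-expression tokenizer, the reserved index 0 for padding, and the fixed row length `max_len` are all introduced here for the example.

```python
import re

# Hypothetical stop-word list; the text only names "a" and "can" as examples.
STOP_WORDS = {"a", "an", "the", "can", "is", "to", "of"}


def tokenize(comment):
    """Lowercase a comment and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", comment.lower())


def remove_stop_words(tokens):
    """Drop tokens that carry little meaning for text analysis."""
    return [t for t in tokens if t not in STOP_WORDS]


def build_vocabulary(token_lists):
    """Map every distinct token to an integer index; 0 is reserved for padding."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def to_comment_matrix(token_lists, vocab, max_len):
    """Encode each comment as a fixed-length row of indices (truncate, then pad with 0)."""
    matrix = []
    for tokens in token_lists:
        row = [vocab[t] for t in tokens][:max_len]
        row += [0] * (max_len - len(row))
        matrix.append(row)
    return matrix


comments = ["TODO: this is a hack, fix it later", "Returns the name of a node"]
token_lists = [remove_stop_words(tokenize(c)) for c in comments]
vocab = build_vocabulary(token_lists)
comment_matrix = to_comment_matrix(token_lists, vocab, max_len=6)
print(comment_matrix)
```

Each row of `comment_matrix` stands for one comment, so rows of equal length can be stacked and passed to an embedding layer.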

To address the high dimensionality and high sparsity of traditional one-hot encoding, the invention uses word vectors. Unlike images, comment text has no rich high-dimensional vector representation and cannot be fed directly into a neural network for feature extraction. An important part of SATD identification is therefore to train word vectors that convert the comment text into features the network can process, and then to learn semantic information. In the invention, word vectors are trained with the Embedding layer provided by the neural network, and the word vector matrix obtained by concatenating the generated word vectors in comment order is used as the input of the convolutional layer for better feature extraction.
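To make the lookup step concrete, the sketch below shows how an embedding table turns one row of word indices into a word vector matrix. It is an illustration only: the table is filled with random values and is not trained, whereas in the invention the Embedding layer is trained together with the network. The vector dimension, the value range, and the all-zero padding row are assumptions made for the example.

```python
import random


def init_embedding_table(vocab_size, dim, seed=0):
    """Create one small random vector per index; row 0 (padding) stays all zeros."""
    rng = random.Random(seed)
    table = [[0.0] * dim]
    for _ in range(vocab_size):
        table.append([rng.uniform(-0.05, 0.05) for _ in range(dim)])
    return table


def embed(index_row, table):
    """Look up the vector of each index and stack the vectors in comment order."""
    return [table[i] for i in index_row]


table = init_embedding_table(vocab_size=9, dim=4)
word_vector_matrix = embed([7, 8, 9, 0, 0, 0], table)
# The result has 6 rows (one per token position) and 4 columns (one per dimension).
print(len(word_vector_matrix), len(word_vector_matrix[0]))
```

A dense table of this kind needs `vocab_size * dim` values in total, while a one-hot encoding needs a vector of length `vocab_size` for every single token.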

2. Selection and training of the ensemble model

The convolutional neural network has strong feature extraction capability: by using kernels of several different sizes, it extracts key information from a sentence and captures local correlations. CNNs perform well in text classification. Compared with pattern-based and text mining methods, a CNN used for SATD identification extracts features automatically, which reduces manual effort and improves identification accuracy. For this reason, the invention uses a convolutional neural network as a base classifier in the ensemble model. However, although a CNN can extract text features and classify text according to them, it attends only to local features and not to global features. This easily leads to one comment being assigned to several categories, which affects the classification result.

The LSTM processes its input sequentially and links context well, so it identifies the short comment texts of the SATD task accurately. However, its structure is complex and its computational cost grows with the amount of input text, which weakens the link between preceding and following context and lowers identification accuracy. The invention therefore combines the respective strengths of CNN and LSTM and constructs a CNN-LSTM hybrid model as the second base classifier in the ensemble model.
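The idea of sliding kernels of different sizes over a sentence can be sketched in plain Python. The example is a toy: the word vectors and kernel weights are chosen by hand, the ReLU activation and max-over-time pooling are common choices assumed here, and the function name `conv1d_max` is hypothetical. It also assumes the sentence is at least as long as the kernel.

```python
def conv1d_max(matrix, kernel):
    """Slide one kernel over consecutive word vectors and keep the largest response.

    matrix: list of word vectors (seq_len x dim)
    kernel: list of weight vectors (kernel_size x dim)
    """
    k = len(kernel)
    responses = []
    for start in range(len(matrix) - k + 1):
        window = matrix[start:start + k]
        # Dot product between the kernel and the current window of k words.
        score = sum(w * x for kr, xr in zip(kernel, window) for w, x in zip(kr, xr))
        responses.append(max(score, 0.0))  # ReLU activation
    return max(responses)  # max-over-time pooling


# Toy word vector matrix: 4 words, 2 dimensions each.
matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

# One kernel spanning 2 words and one spanning 3 words.
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
]

features = [conv1d_max(matrix, kernel) for kernel in kernels]
print(features)
```

Each kernel sees only `k` neighbouring words at a time, which is why the extracted features are local.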

Finally, a CNN cannot capture long-distance dependencies in text through convolution alone, whereas a DPCNN extracts long-distance dependencies by progressively deepening the network. Increasing the network depth yields higher accuracy without adding much computational cost. DPCNN is therefore chosen as the third base classifier in the ensemble model.
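One way to see why depth helps with long-distance dependencies is to follow the sequence length through repeated downsampling. The sketch below assumes max pooling with window size 3 and stride 2 between blocks and omits padding and the convolutions themselves; these details are simplifications for illustration and are not taken from the source.

```python
def max_pool_stride2(sequence, size=3):
    """Max pooling with window `size` and stride 2 over a 1-D feature sequence."""
    return [max(sequence[i:i + size]) for i in range(0, len(sequence) - size + 1, 2)]


sequence = list(range(32))  # stand-in for 32 token positions
lengths = [len(sequence)]
while len(sequence) >= 3:
    sequence = max_pool_stride2(sequence)
    lengths.append(len(sequence))

print(lengths)
```

Because the length roughly halves at every level, a kernel of fixed size covers about twice as much of the original text after each pooling step, and the total work stays bounded as depth grows.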

Every model in the ensemble must be trained, and each is optimized over multiple iterations. In each iteration, the model predicts labels for the training data, measures the loss between the predicted labels and the true labels, and adjusts its parameters to reduce this loss step by step; the Adam algorithm is used to solve the optimization problem. In addition, to address data imbalance, the invention uses the focal loss function, inspired by its use by Kaiming He and colleagues to handle sample imbalance in object detection. The focal loss lowers the weight of majority-class samples so that the model concentrates on minority-class samples during training.
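The effect of the focal loss can be illustrated with a small per-sample calculation. The sketch uses the common binary form, in which the cross-entropy term is scaled by `(1 - p_t) ** gamma` and by a class weight `alpha`; the default values `gamma=2.0` and `alpha=0.25` are assumptions for the example, since the source does not state which values were used.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one sample.

    p: predicted probability that the comment is SATD (0 < p < 1)
    y: true label, 1 for SATD and 0 for non-SATD
    gamma: focusing parameter; gamma = 0 switches the down-weighting off
    alpha: weight of the positive class; the negative class gets 1 - alpha
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(p, y):
    """Plain binary cross-entropy, for comparison."""
    return -math.log(p if y == 1 else 1.0 - p)


easy = focal_loss(0.95, 1, alpha=1.0)  # confidently correct prediction
hard = focal_loss(0.30, 1, alpha=1.0)  # poorly classified SATD comment
print(easy / cross_entropy(0.95, 1), hard / cross_entropy(0.30, 1))
```

With `gamma = 2`, the well-classified sample keeps only a small fraction of its cross-entropy loss while the poorly classified one keeps about half, so training is driven mainly by the hard samples.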

3. Detection of SATD

Once trained, the ensemble model can identify whether an unclassified comment is SATD. For a comment to be classified, each model first uses the established dictionary to convert it into a numeric matrix of index sequences, then obtains word vectors through the Embedding layer, and finally classifies it through the feature extraction of the respective trained model. For example, a convolutional neural network extracts features through its convolutional and pooling layers and then classifies with a softmax classifier. The output layer of each model gives its classification result, the outputs of the models are fused, and the SATD probability of the comment is obtained. If the probability is greater than 0.5, the comment is considered SATD; otherwise it is not.

Because the data set carries the class labels SATD and non-SATD, this is a binary classification problem. Identifying each comment produces one of four outcomes, as shown in Table 3. When the predicted class matches the true class, the result is a correct classification: a True Positive (TP) or a True Negative (TN). When they do not match, the result is a misclassification: a False Positive (FP) or a False Negative (FN). TP means the comment is predicted to be SATD and truly is SATD; TN means it is predicted not to be SATD and truly is not; FP means it is predicted to be SATD but truly is not; FN means it is predicted not to be SATD but truly is SATD. Different classification models give different experimental results, and the performance of each is judged by computing its accuracy, recall, precision and F1 value.
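The fusion and thresholding step can be sketched as follows. The source does not say how the three outputs are fused, so the simple average used here is an assumption for illustration, and the probabilities are made-up values rather than real classifier outputs.

```python
def fuse(probabilities):
    """Fuse base-classifier outputs by simple averaging (an assumed fusion rule)."""
    return sum(probabilities) / len(probabilities)


def is_satd(probabilities, threshold=0.5):
    """Return True only if the fused probability is strictly above the threshold."""
    return fuse(probabilities) > threshold


# Hypothetical outputs of the CNN, CNN-LSTM and DPCNN base classifiers.
print(is_satd([0.9, 0.4, 0.6]))  # fused probability is about 0.633
print(is_satd([0.5, 0.5, 0.5]))  # fused probability equals the threshold
```

The comparison is strict, so a fused probability of exactly 0.5 is classified as non-SATD, in line with the rule that values less than or equal to the threshold are non-SATD.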

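Table 3 Classification result matrix

                   Predicted SATD         Predicted non-SATD
Actual SATD        True Positive (TP)     False Negative (FN)
Actual non-SATD    False Positive (FP)    True Negative (TN)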

Precision: the proportion of samples predicted as positive by the model that are truly positive, as shown in formula (1):

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{1}$$

Recall: the proportion of truly positive samples that the model predicts as positive, as shown in formula (2):

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{2}$$

F1 value: neither recall nor precision alone gives a comprehensive evaluation, so F1, the harmonic mean of the two, is introduced. The harmonic mean stays close to the smaller of the two values, so a high F1 ensures that both precision and recall are high. The formula of F1 is shown in (3):

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{3}$$
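The three measures can be computed directly from the counts of the classification result matrix. The sketch below uses made-up counts for illustration, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
def precision(tp, fp):
    """Share of comments predicted as SATD that really are SATD."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Share of real SATD comments that the model finds."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
tp, fp, fn = 8, 2, 4
print(precision(tp, fp), recall(tp, fn), f1_score(tp, fp, fn))
```

In this example the precision is 0.8 and the recall is about 0.667; the F1 value of about 0.727 lies below their arithmetic mean, which shows how the harmonic mean leans toward the smaller value.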

The method of the invention ensembles the CNN, CNN-LSTM and DPCNN neural network models. To verify its effectiveness against existing SATD identification methods, a comparative experiment on intra-project prediction was carried out with five methods: first, the method of the invention; second, the CNN method; third, Huang's text mining (TM) method; fourth, the natural language processing (NLP) method of Maldonado and Shihab; fifth, Guo's pattern recognition method.

The results show that the average precision of the method of the invention reaches 0.800. Compared with 0.637 for the CNN method, 0.702 for the TM method and 0.537 for the NLP method, this is an improvement of 25.59%, 13.96% and 48.97%, respectively. The average recall of the five methods is 0.736, 0.705, 0.714, 0.604 and 0.612, respectively; the method of the invention improves on the CNN, TM and NLP methods by 4.40%, 3.08% and 21.85%, respectively. FIG. 3 shows the F-Measure results of the method of the invention and the four other methods on the 10 projects. The average F-Measure scores of the five methods are 0.763, 0.662, 0.697, 0.561 and 0.689, respectively, so the method of the invention improves on the other four by 15.26%, 9.47%, 36.01% and 10.74%, respectively. All five methods achieve relatively high F1 scores on the ArgoUML project, and the F1 score on Columba reaches the highest value, 0.913. Notably, on the JEdit project the method of the invention improves on the other methods by 23.2%, 22.71%, 34.20% and 101.97%, respectively. There is also a large improvement on the EMF project. The method of the invention is thus significantly better than the other methods on JEdit and EMF, the projects where the SATD ratio is low.
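The improvement percentages quoted above are relative differences between the score of the invention and each baseline score. The short check below recomputes them for the average F-Measure values given in the text; the dictionary keys are shorthand labels introduced here for the four comparison methods.

```python
def relative_improvement(ours, baseline):
    """Percentage by which `ours` exceeds `baseline`, rounded to two decimals."""
    return round((ours - baseline) / baseline * 100, 2)


# Average F-Measure scores quoted in the text.
ours = 0.763
baselines = {"CNN": 0.662, "TM": 0.697, "NLP": 0.561, "Guo": 0.689}

improvements = {name: relative_improvement(ours, score) for name, score in baselines.items()}
print(improvements)
```

The recomputed values agree with the 15.26%, 9.47%, 36.01% and 10.74% reported for the F-Measure.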

To improve the interpretability of the method, the invention proposes a visualization of important SATD features based on the Attention mechanism. Compared with the CNN model, the CNN-LSTM model focuses more on sentence-level features, and compared with the DPCNN model it is simpler and converges more easily. Attention is now widely used in NLP. The Attention mechanism encodes sequence data using importance scores assigned to each element, and visualizing the Attention matrix reveals which parts of the input the neural network attends to when performing a task. It can markedly improve model performance on a range of tasks, especially for recurrent neural network structures, and it also improves interpretability. For this reason, Attention was added to the CNN-LSTM model to verify interpretability. This is done by building a heat map over the whole sentence, in which the intensity of each word is proportional to the normalized attention score it receives. That is, the larger a word's contribution, the darker its color, which shows that some words are more important than others.

FIG. 4 shows the visualization result. Attention is applied to comments that contain SATD, so all 10 sentences shown in FIG. 4 are SATD. The weights that Attention assigns to words let the model focus on certain words during training and measure how important each word is to the classification of the whole sentence. Attention thus extracts the words in a sentence that contribute most to classification and displays them in a way that helps human understanding, which improves interpretability to some extent. Sentences containing words such as "todo" and "workaround" are very likely to be SATD, and these words were also identified in the experiments. Words such as "todo" and "fixme" also proved to be strong patterns in Guo's experiments, which achieved good results identifying SATD with only four patterns. In addition, Table 4 lists the top 20 feature words identified with this method for each project.

Table 4 Top 20 keywords of the 10 projects extracted with the CNN-LSTM attention mechanism
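The mapping from attention scores to heat-map intensities can be sketched as follows. The raw scores are invented for the example, softmax is assumed as the normalization, and the five discrete shades stand in for the color scale of a real heat map; none of these details come from the source.

```python
import math


def softmax(scores):
    """Normalize raw attention scores so that they are positive and sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def heat_levels(weights, levels=5):
    """Map each weight to an integer shade from 0 to levels - 1, relative to the largest weight."""
    top = max(weights)
    return [round(w / top * (levels - 1)) for w in weights]


# Hypothetical raw attention scores for the words of one comment.
words = ["todo", "remove", "this", "workaround", "later"]
scores = [2.5, 0.4, 0.1, 2.0, 0.3]

weights = softmax(scores)
shades = heat_levels(weights)
for word, weight, shade in zip(words, weights, shades):
    print(f"{word:<12}{weight:.3f}  {'#' * (shade + 1)}")
```

Words with a larger normalized score receive a darker shade, shown here as a longer bar.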
