Java method name recommendation method based on a two-stage framework

Document No.: 8599. Published: 2021-09-17.

1. A Java method name recommendation method based on a two-stage framework is characterized by comprising the following steps:

1) a preprocessing stage, in which heuristic rules based on the characteristics of the input Java method filter out getter/setter methods and delegation methods, wherein a getter method is a method dedicated to reading a non-static private field (attribute) of a Java class, and a setter method is a method dedicated to assigning a value to a non-static private field of a Java class;

2) a method classification stage, in which the Java method is classified by the feed-forward neural network FastText, wherein the classifier input is the token stream of the Java method body, the token stream is a sequence of key-value pairs obtained by lexical analysis, and each key-value pair comprises a token type and the token itself; the prefix of the method name serves as the class label, so the classifier predicts the method name prefix;

3) a method name generation stage, in which the method name is generated in different ways depending on the predicted prefix: for a Java method whose name the classifier predicts to begin with "get", "set", "is", or "test", the method name is generated by a frequency-based heuristic rule, and for a Java method whose name the classifier predicts to begin with any other prefix, the method name is generated by a neural network.

2. The method according to claim 1, wherein in step 1), Java program source code is read and parsed to obtain a list of methods; a method is judged to be a getter method if its body contains only one code statement and that statement returns the value of a private non-static field of the class, in the form "return this.${non-static private field of the enclosing class};" or "return ${non-static private field of the enclosing class};"; a method is judged to be a setter method if its body contains only one code statement and that statement assigns a value to a non-static private field of the enclosing class, in the form "this.${non-static private field of the enclosing class} = ${assigned value};" or "${non-static private field of the enclosing class} = ${assigned value};".

3. The method of claim 1, wherein a delegation method is a method whose body contains only one code statement, and that statement calls another method of the enclosing class, in the form "return ${name of another method in the enclosing class}(parameter list);" or "${name of another method in the enclosing class}(parameter list);".

4. The method according to claim 1, wherein method names are assigned to the getter/setter methods: the name generated for a getter method has the form get${non-static private field of the enclosing class, with its first letter capitalized}, and the name generated for a setter method has the form set${non-static private field of the enclosing class, with its first letter capitalized}; the name generated for a delegation method has the form ${name of the other method in the enclosing class that it calls}.

5. The method as claimed in claim 1, wherein in step 2), the token stream input to FastText consists of key-value pairs derived by lexical analysis; FastText classifies Java methods into 5 categories according to the method name prefix, namely methods whose names begin with the special prefixes "test", "is", "get", or "set", and methods whose names begin with any other prefix ("others"); and a word-level bigram model is used during training.

6. The method of claim 1, wherein in step 3), different method name recommendation methods are employed depending on the method name prefix; for Java methods whose names begin with the prefixes "test", "is", "get", or "set", a heuristic rule is used to name the method; the heuristic naming rule is as follows: given the prefix predicted by the classifier, the identifier that occurs most frequently in the identifier stream of the method body is extracted as the suffix, and the prefix and suffix are joined according to the camel-case naming convention to obtain the Java method name.

7. The method as claimed in claim 1, wherein the context of the method is extracted; code fragments are split according to the camel-case rule; the method context is obtained by concatenating the enclosing class, the identifiers in the method body, the parameter list of the method, and the return value; and the method context features are input into a Recursive RNN model, which predicts the Java method name.

Background

With the continued advance of informatization and the arrival of the internet era, nearly every aspect of daily life (clothing, food, housing, and transport) increasingly depends on software. Because demand for software is endless, modern software offers ever more complex functionality and an ever larger number of modules, so its complexity keeps rising. For increasingly complex software, the difficulty and high cost of maintenance have become a major problem. Software maintenance is commonly reported to account for about 70% or more of total software cost, and an important reason for this high cost is poor understandability: developers often spend more than 50% of their effort understanding code. Software can be hard to understand for many reasons, and poorly chosen names for variables, methods, and parameters are often an important one.

High-quality naming improves the readability and maintainability of a program, but naming is a hard problem in software engineering. Naming identifiers is one of the most difficult tasks a programmer must perform. Names (that is, identifiers) are ubiquitous in programs, covering concepts such as classes, methods, and variables. In practice, developers often produce inconsistent names for a number of reasons. Besides differences in developers' own skill, these include the lack of a unified synonym table and conflicting naming styles among collaborating developers. Naming is therefore of great importance in software development and maintenance, yet because it depends on human factors and lacks a uniform style and constraints, inconsistent naming arises easily.

Method naming is one such naming task. In practice, a method name is regarded as a brief description of what the method body does, and it is the most direct and important information developers use to understand the behavior of a program or API. Inconsistent method names therefore make a program harder to understand and maintain and can even lead to software bugs. Developers often guess what a method does from its name and call it on that basis, so a poor name easily leads to API misuse and, in turn, to software defects.

In summary, in order to improve the maintainability of software and reduce the maintenance cost of software, a method is urgently needed in the field of software engineering to efficiently recommend high-quality method names, further improve the readability of programs, reduce the time and effort of developers in understanding codes, and finally improve the maintainability of software.

Disclosure of Invention

The invention aims to provide a Java method name recommendation method based on a two-stage framework, which improves the readability and the intelligibility of a project code and helps a developer to quickly know the functions realized by the method through the method name, thereby reducing the cost of software maintenance and improving the efficiency of software development.

In order to achieve the above purpose, the invention adopts the following method:

A Java method name recommendation method based on a two-stage framework comprises the following stages:

The first stage: a method classification stage.

The second stage: a method name generation stage.

Specifically, in the data preprocessing process, heuristic rules based on the characteristics of the input Java method filter out getter/setter methods and delegation methods. A getter method is a method dedicated to reading a non-static private field (attribute) of a Java class, and a setter method is a method dedicated to assigning a value to a non-static private field of a Java class; both have fixed formats under Java programming conventions. A delegation method contains only one program statement, and that statement calls another method of the enclosing class.

Further, the Java program source code is read and parsed to obtain a list of methods. A method is judged to be a getter method if its body contains only one code statement and that statement returns the value of a private non-static field of the class, in the form "return this.${non-static private field of the enclosing class};" or "return ${non-static private field of the enclosing class};". A method is judged to be a setter method if its body contains only one code statement and that statement assigns a value to a non-static private field of the enclosing class, in the form "this.${non-static private field of the enclosing class} = ${assigned value};" or "${non-static private field of the enclosing class} = ${assigned value};".

Preferably, a method is judged to be a delegation method if its body contains only one code statement and that statement invokes another method of the enclosing class, in the form "return ${name of another method in the enclosing class}(parameter list);" or "${name of another method in the enclosing class}(parameter list);".

Further, the getter/setter methods are named as follows: the name generated for a getter method has the form get${non-static private field of the enclosing class, with its first letter capitalized}, and the name generated for a setter method has the form set${non-static private field of the enclosing class, with its first letter capitalized}. The name generated for a delegation method has the form ${name of the other method in the enclosing class that it calls}.
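
The preprocessing rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation described in this document: the function name name_trivial_method, the regular expressions, and the use of a semicolon count to approximate "only one code statement" are all assumptions, and a real system would work on the parsed syntax tree instead of raw text.

```python
import re

# Hypothetical patterns mirroring the one-statement formats described above.
GETTER = re.compile(r"^return\s+(?:this\.)?(\w+)\s*;$")
SETTER = re.compile(r"^(?:this\.)?(\w+)\s*=\s*\w+\s*;$")
DELEGATION = re.compile(r"^(?:return\s+)?(?:this\.)?(\w+)\s*\(.*\)\s*;$")


def capitalize_first(name):
    """Upper-case only the first letter, leaving the rest unchanged."""
    return name[0].upper() + name[1:]


def name_trivial_method(body, private_fields, class_methods):
    """Return a name for a getter/setter/delegation body, or None otherwise."""
    statement = body.strip()
    # Assumption: exactly one ';' stands in for "only one code statement".
    if statement.count(";") != 1:
        return None
    match = GETTER.match(statement)
    if match and match.group(1) in private_fields:
        return "get" + capitalize_first(match.group(1))
    match = SETTER.match(statement)
    if match and match.group(1) in private_fields:
        return "set" + capitalize_first(match.group(1))
    match = DELEGATION.match(statement)
    if match and match.group(1) in class_methods:
        return match.group(1)  # a delegation method takes the callee's name
    return None


print(name_trivial_method("return this.name;", {"name"}, set()))
print(name_trivial_method("this.age = age;", {"age"}, set()))
print(name_trivial_method("return compute(x, y);", set(), {"compute"}))
```

Methods for which the function returns None would continue to the classification stage.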

Specifically, in the first-stage method classification process, the feed-forward neural network FastText classifies the Java method. The classifier input is the token stream of the Java method body; the token stream is a sequence of key-value pairs obtained by lexical analysis, each comprising a token type and the token itself. The method name prefix serves as the class label, so the classifier predicts the prefix of the method name.

Further, the token stream input to FastText consists of key-value pairs derived by lexical analysis. For example, for the program fragment "int a;", the resulting token stream is "[PrimitiveType, int, Variable, a]". FastText classifies Java methods into 5 categories according to the method name prefix: methods whose names begin with the special prefixes "test", "is", "get", or "set", and methods whose names begin with any other prefix ("others"). A word-level bigram model is used during training.
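
To make the token stream format concrete, the following Python sketch reproduces the "int a;" example with a toy lexer and then forms the word-level bigrams a bigram model would see. It is only an approximation under stated assumptions: the "Keyword" type, the keyword subset, and the rule that every other word is a "Variable" are invented for illustration, and a real lexer would distinguish many more token types.

```python
import re

PRIMITIVE_TYPES = {"int", "long", "short", "byte", "char",
                   "float", "double", "boolean"}
# Illustrative subset only; the "Keyword" type is an assumption.
KEYWORDS = {"return", "if", "else", "for", "while", "new", "this"}


def token_stream(code):
    """Flatten code into [type, token, type, token, ...] pairs."""
    stream = []
    for word in re.findall(r"[A-Za-z_]\w*", code):
        if word in PRIMITIVE_TYPES:
            kind = "PrimitiveType"
        elif word in KEYWORDS:
            kind = "Keyword"
        else:
            kind = "Variable"  # simplification: everything else is a variable
        stream.extend([kind, word])
    return stream


def bigrams(stream):
    """Adjacent pairs of stream entries, as used by a word-level bigram model."""
    return list(zip(stream, stream[1:]))


stream = token_stream("int a;")
print(stream)
print(bigrams(stream))
```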

Specifically, in the second-stage method name generation process, different method name recommendation methods are applied depending on the method name prefix. For methods whose names begin with the prefixes "test", "is", "get", or "set", a heuristic rule names the Java method. The heuristic naming rule is as follows: given the prefix predicted by the classifier, the identifier that occurs most frequently in the identifier stream of the method body is extracted as the suffix, and the prefix and suffix are joined according to the camel-case naming convention to obtain the Java method name.
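
The frequency-based rule can be illustrated with a short Python sketch. The function name heuristic_name is hypothetical, and two behaviors are assumptions not specified in this document: ties are broken in favor of the identifier seen first, and an empty identifier stream returns the bare prefix.

```python
from collections import Counter


def heuristic_name(prefix, identifiers):
    """Join the predicted prefix with the most frequent body identifier."""
    if not identifiers:
        return prefix  # assumption: nothing to use as a suffix
    counts = Counter(identifiers)
    # max() returns the first maximal key, so ties go to the earliest identifier.
    suffix = max(counts, key=counts.get)
    return prefix + suffix[0].upper() + suffix[1:]


print(heuristic_name("get", ["user", "id", "user", "cache"]))
```

Here "user" occurs twice and every other identifier once, so it becomes the suffix and is capitalized when joined to the prefix.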

Further, for methods with other prefixes, the context of the method is extracted. Code fragments, comprising the enclosing class, the identifiers in the method body, the parameter list of the method, and the return value, are split according to the camel-case rule and concatenated to obtain the method context. The method context features are input into a Recursive RNN model, which predicts the Java method name. The Recursive RNN is a special seq2seq model based on an encoder-decoder mechanism. The model extracts important features from the method context and selects likely tokens to compose the method name. The project chose the Recursive RNN mainly because it feeds the current output back as the next input: by considering the sequence generated so far together with the input sequence, it gives the decoder a wider context for producing the next token, which improves the semantic consistency of the generated method name.
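
As a rough illustration of how the method context might be assembled, the Python sketch below splits each name on camel-case boundaries and concatenates the pieces in the order listed above. The function names, the regular expression, and the lower-casing of subtokens are assumptions made for this example.

```python
import re


def split_camel(identifier):
    """Split an identifier on camel-case boundaries and lower-case the parts."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [part.lower() for part in parts]


def build_context(class_name, body_identifiers, parameters, return_type):
    """Concatenate subtokens of the class, body identifiers, parameters, return type."""
    tokens = []
    for name in [class_name, *body_identifiers, *parameters, return_type]:
        tokens.extend(split_camel(name))
    return tokens


print(split_camel("parseHTTPResponse"))
print(build_context("UserService", ["userRepo", "findById"], ["userId"], "User"))
```

The resulting token list is the kind of sequence that would be fed to the model as the method context.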

The architecture of the Recursive RNN is shown in FIG. 4. Each token in the input context is first converted into a one-hot vector and then fed into a long short-term memory (LSTM) network. The model uses two LSTM networks with the same structure: one processes the embedded representation of the method context, and the other processes the embedded representation of the method name token stream. The two embedded representations are concatenated into a single vector, which is passed to a dense layer to obtain the probability of selecting each word in the vocabulary. The model appends the predicted token to the method name sequence and feeds it back as input. This process continues until the model outputs an end marker.
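
The feedback loop described above (predict a token, append it to the name sequence, repeat until an end marker) can be sketched independently of the network itself. In the Python example below, score_next stands in for the two LSTMs plus the dense layer; toy_scorer, the END marker string, and the max_len safety cap are all hypothetical and exist only so the loop can be run.

```python
END = "<end>"  # hypothetical end-marker token


def generate_name(context, score_next, max_len=8):
    """Greedy decoding: feed each predicted token back until END is produced."""
    generated = []
    while len(generated) < max_len:  # max_len is an assumed safety cap
        scores = score_next(context, generated)  # maps token -> score
        token = max(scores, key=scores.get)
        if token == END:
            break
        generated.append(token)
    if not generated:
        return ""
    # Join the tokens in camel case to form the method name.
    return generated[0] + "".join(t.capitalize() for t in generated[1:])


def toy_scorer(context, generated):
    """Stand-in for the trained model: always prefers find, user, then END."""
    target = ["find", "user", END]
    wanted = target[len(generated)] if len(generated) < len(target) else END
    return {tok: (1.0 if tok == wanted else 0.0) for tok in set(context) | {END}}


print(generate_name(["find", "user", "repo"], toy_scorer))
```

Because the scorer receives the tokens generated so far, each prediction is conditioned on both the method context and the partial method name, which is the property the text attributes to the Recursive RNN.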

The invention has the following beneficial effects:

(1) By recommending high-quality Java method names, the method lets developers quickly learn what a method does from its name when calling it, reducing the effort spent on understanding program behavior and thereby improving development efficiency.

(2) Because developers can better understand what a Java method does from its name, the likelihood of API misuse and of software defects is reduced, and the quality of the code as a whole improves.

(3) The method saves developers cost in the software maintenance stage and improves the overall understandability and maintainability of the software.

(4) Recommending high-quality method names to developers eases the problem of personnel turnover during development and maintenance: developers who newly join a project can quickly understand the function of each module, which lowers the overall project cost.

(5) The quality of the whole software code can be improved due to the high-quality Java method name, and the quality of other projects depending on the project is indirectly improved due to the improvement of the code quality of the project.

Drawings

The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.

(1) FIG. 1 is a diagram of the overall design of the present invention.

(2) FIG. 2 is a flow chart illustrating data preprocessing according to the present invention.

(3) FIG. 3 is a flowchart illustrating a two-stage framework-based Java method name recommendation method according to the present invention.

(4) FIG. 4 is a diagram illustrating the architecture of the Recursive RNN model according to the present invention.

Detailed Description

In order to more clearly illustrate the present invention, the present invention is further described below with reference to preferred embodiments and drawings. Similar parts in the figures are denoted by the same reference numerals. It is to be understood by persons skilled in the art that the following detailed description is illustrative and not restrictive, and is not to be taken as limiting the scope of the invention.

First, the technical concept behind the disclosed solution is explained. The functions of modern software are increasingly complex, and the difficulty and high cost of maintaining such software are a pressing problem. Maintenance accounts for the largest share of total software cost, and an important reason is that software is hard to understand, so developers spend a great deal of effort understanding code. Software can be hard to understand for many reasons, and poorly chosen names for variables, methods, and parameters are often an important one. Under schedule pressure, developers tend to concentrate on implementing functionality and neglect naming, which later costs a great deal of effort in understanding the program during the maintenance phase.

The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.

Fig. 1 is a general design diagram of a Java method name recommendation method based on a two-stage framework according to an embodiment of the present invention. Referring to fig. 1, the present embodiment provides a method for recommending Java method names based on a two-stage framework, including:

Environment configuration: configuring the environment for the Java method name recommendation method based on a two-stage framework;

Git repository data mining: collecting Java project repositories from well-known open-source organizations, then parsing them to extract high-quality Java methods;

Java method name recommendation based on a two-stage framework: the recommendation of Java method names is realized with a two-step neural network and mainly comprises two stages, preceded by a preprocessing process in which getter/setter methods and delegation methods are named. The first stage is method classification, in which Java methods are classified according to the prefix of the method name. The second stage is Java method name generation, which has two parts: for Java methods whose names begin with specific prefixes, heuristic rules generate the method name, and for Java methods whose names begin with other prefixes, a neural network generates the method name.

In this embodiment, the environment is configured as follows. The language versions are Python 3.6 and JDK 1.8. The server runs Ubuntu 16.04.2. The integrated development environments are PyCharm and IDEA. The crawler framework is Scrapy. The Java source code parsing framework is JavaParser. FastText and the Recursive RNN are implemented with PyTorch.

In this embodiment, Git repository data mining proceeds as shown in FIG. 2. During crawling, a crawler program deployed on a cloud server resolves the addresses of repositories and clones them locally. The local repositories are then cleaned to obtain the program source files, which are subsequently processed and analyzed. The processed data can be stored on the cloud server.

GitHub has anti-crawling mechanisms that block requests that are too numerous or too frequent, so the crawler combines three strategies: simulated request headers, pauses between requests, and a token pool. First, GitHub often uses the request header to judge whether a visitor is a human user or a robot, and robots are usually subject to corresponding restrictions; to appear as a normal user, the program sets the request headers of its access threads to those of different user clients. Second, to avoid being limited for accessing the site too frequently, the project adds delays to lower the access frequency; this reduces crawl throughput but avoids triggering the limit. Finally, GitHub has a token mechanism, under which the site restricts what each user may do, for example how often a user may clone repositories within a period of time. The project therefore first builds a token pool, and multiple threads take turns using different tokens from the pool to complete the crawl.

The task scheduler creates and schedules the crawler threads: it first places crawl tasks into a task queue, then assigns the tasks to different threads, and coordinates the threads to speed up crawling of GitHub repositories.
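
The scheduling scheme (a task queue, worker threads, a delay between requests, and tokens used in rotation) can be sketched with the Python standard library alone. This is a simplified sketch, not the Scrapy-based crawler of the embodiment: the crawl function, its parameters, and the fetch callable are hypothetical, and fetch performs no network access here.

```python
import itertools
import queue
import threading
import time


def crawl(repo_urls, tokens, fetch, workers=3, delay=0.0):
    """Distribute crawl tasks over threads, rotating tokens from a shared pool."""
    tasks = queue.Queue()
    for url in repo_urls:
        tasks.put(url)
    token_cycle = itertools.cycle(tokens)  # the token pool, used in turn
    lock = threading.Lock()
    results = []

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # no tasks left for this thread
            with lock:
                token = next(token_cycle)
            time.sleep(delay)  # pause to lower the access frequency
            outcome = fetch(url, token)
            with lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# fetch is a stand-in that only records which token handled which task.
results = crawl(["a", "b", "c", "d"], ["t1", "t2"], lambda url, tok: (url, tok))
print(sorted(results))
```

Drawing tokens under a lock keeps the rotation even across threads, so with two tokens and four tasks each token is used exactly twice.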

Further, for each crawled Git repository, the project first traverses the repository directory to extract the source code, then uses the Java parsing library JavaParser to build the abstract syntax tree (AST) of the source code. By traversing the AST, it obtains each method's name, its body, and its context (the name of the enclosing class, the parameter list of the Java method, the return value of the Java method, and the implementation code of the method body).

Specifically, the project converts each parsed Java method into three data formats. The first format consists of the individual tokens of the Java method body obtained by lexical analysis; it is used in the preprocessing stage to generate Java method names according to heuristic rules. The second format is a sequence of key-value pairs obtained by lexical analysis; it is used in the method classification stage to classify Java methods by method name prefix. The third format is the context of the Java method (the name of the enclosing class, the parameter list of the Java method, the return value of the Java method, and the implementation code of the method body); it is used in the method name generation stage.

As shown in FIG. 3, this embodiment implements the Java method name recommendation method based on a two-step neural network. The core idea is to recommend method names according to the characteristics of different kinds of methods: methods are first classified by the feed-forward neural network FastText, and a different recommendation algorithm is then applied to each kind. After the data preprocessing stage, the process divides into two stages. The first stage is method classification, in which a feed-forward neural network classifies methods by method name prefix. The second stage is method naming: for methods whose names begin with special prefixes, heuristic rules generate the method name, and for methods whose names begin with other prefixes, the project uses a Recursive RNN to generate the method name.

Specifically, in the data preprocessing stage, if a method is a getter/setter method or a delegation method, the project hands its data format 1 to the corresponding heuristic generation program, which generates the method name; otherwise, the project hands the method's data format 2 to the feed-forward network classifier. The classifier assigns the method to a category according to its method name prefix. For methods whose names begin with a special prefix (test, is, get, set), the project hands data format 3 to the corresponding rule generator, which applies heuristic rules to generate the method name; for methods with other prefixes, it hands data format 3 to the Recursive RNN to generate the method name.
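
The routing just described reduces to a small dispatch function. The Python sketch below is illustrative only: route, is_trivial, and classify_prefix are hypothetical names, and the two callables are stand-ins for the preprocessing check and the FastText classifier.

```python
SPECIAL_PREFIXES = {"get", "set", "is", "test"}


def route(method_body, is_trivial, classify_prefix):
    """Decide which name generator handles a method."""
    if is_trivial(method_body):
        return "preprocess"   # getter/setter/delegation rules
    prefix = classify_prefix(method_body)
    if prefix in SPECIAL_PREFIXES:
        return "heuristic"    # frequency-based rule generator
    return "rnn"              # Recursive RNN generator


# Stand-in callables used only to exercise the dispatch.
looks_trivial = lambda body: body.strip().startswith("return this.")
fake_classifier = lambda body: "is" if "valid" in body else "others"

print(route("return this.name;", looks_trivial, fake_classifier))
print(route("boolean ok = valid(x); return ok;", looks_trivial, fake_classifier))
print(route("list.add(x); list.sort();", looks_trivial, fake_classifier))
```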

Further, the main purpose of the classification step is to use a classifier to identify the special methods for which heuristic rules are suited to generating the method name. The classifier assigns methods to categories by method name prefix (such as get, set, test, is). To identify methods named with a special prefix, the project uses FastText as the first-stage classifier, because it is very fast in both training and testing while maintaining high classification accuracy.

Further, the first-stage classifier in effect divides methods into two main classes: methods of the first class have names that begin with a special prefix, and methods of the second class do not. The project treats a method as first-class if the prefix of its method name is "get", "set", "is", or "test". For first-class methods, the project recommends the method name using heuristic rules. For second-class methods, the project uses a Recursive RNN to generate the method name.

Specifically, the project provides the following heuristic rule. For first-class methods, the project regards the method name as the concatenation of a prefix and a suffix. For a given method body mb, the project converts the method into data format 2, extracts the identifier sequence from it, and selects the identifier s with the highest frequency of occurrence as the suffix. With p denoting the prefix obtained by classification in the first stage, the project joins p and s according to the camel-case naming convention to obtain the method name.

Further, for second-class methods, the project uses a Recursive RNN, a special seq2seq model based on an encoder-decoder mechanism, to generate the method name. The model extracts important features from the method context and selects likely tokens to compose the method name. The project chose the Recursive RNN mainly because it feeds the current output back as the next input: by considering the sequence generated so far together with the input sequence, it gives the decoder a wider context for producing the next token, which improves the semantic consistency of the generated method name.

It will be understood by those skilled in the art that all or part of the steps in the method for implementing the invention described above can be implemented by programs.

The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.
