A method for automatic generation of annotations based on multi-modal code representations
By employing a multimodal code representation method, and utilizing graph encoders, tree encoders, and field encoders to comprehensively model the code, the problem of low annotation quality in existing technologies is solved, and high-quality annotation generation is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SICHUAN UNIV
- Filing Date
- 2022-11-30
- Publication Date
- 2026-06-30
AI Technical Summary
Existing methods for automatically generating code comments lack a holistic perspective on the code and fail to fully explore the semantic and structural features of the code, resulting in low-quality comment generation.
A multimodal code representation method is adopted, which extracts multiple modal information of the code through graph encoder, tree encoder and field encoder respectively, and generates high-quality annotations through feature fusion layer. Comprehensive modeling is carried out using API context graph, abstract syntax tree and field sequence.
It significantly improves the quality of generated code comments, enhances the overall learning of code representation, and generates more accurate and comprehensive comments.
Figure CN115756597B_ABST
Description
Technical Field
[0001] This invention relates to the field of code annotation generation technology, specifically a method for automatically generating annotations based on multimodal code representation.
Background Technology
[0002] As software size and complexity increase with iterations, the difficulty of maintenance and modification increases significantly. Experience shows that developers spend 58% of their time on program understanding throughout the software development lifecycle. Code understanding is a prerequisite for code reuse and software maintenance. Code comments help developers intuitively understand the function of code snippets without fully reviewing the code, effectively improving software development productivity. However, manually maintaining comments is labor-intensive and time-consuming. With each software iteration, comments often become lost or outdated. Therefore, inventing a method to automatically generate code comments using models is essential.
[0003] The goal of automatic code annotation generation is to generate natural language descriptions of the functions of code snippets. Initially, these methods were mainly based on manual templates and keyword retrieval. Recently, with the increasing size of accumulated code on open-source platforms (GitHub), data-driven methods based on neural networks have gradually become mainstream. These methods mostly employ the encoder-decoder framework of Neural Machine Translation (NMT). The encoder, as the component learning code representation, is responsible for providing a continuous representation of the code, while the decoder parses this intermediate information to generate annotations step by step. The main difference between these encoder-decoder framework-based methods lies in how they learn the code representation. In fact, code can be represented in multiple modalities, including sequences, trees, graphs, etc., each focusing on extracting specific aspects of the code's information. As code representation becomes increasingly complex, using multiple encoders to learn code representations can significantly improve model performance. Therefore, multimodal methods are highly promising and can drive the development of automatic code annotation generation methods.
[0004] However, existing automatic code annotation generation methods still face the following challenges: (1) Lack of a holistic perspective on code modeling. The use of Application Programming Interfaces (APIs) is strongly correlated with the functionality of code snippets. However, if only the sequence of APIs is extracted or the code is treated as plain text, this crucial clue for generating annotations may be missed. Moreover, API usage is often accompanied by control units (e.g., if, for, while, switch, etc.), which contain the overall logic of the program. Therefore, considering a holistic approach to representing the code may improve model performance. (2) Insufficient exploitation of code semantics. Code naturally possesses rich semantic information (e.g., variable names, API call names, syntax information, etc.), and also exhibits distinct structural features (e.g., loops, control flow information, function calls, etc.). It is very difficult to extract all aspects of code information using only one encoder. Therefore, an alternative approach is to incorporate multiple encoders into the model.
Summary of the Invention
[0005] To address the aforementioned problems, the present invention proposes an automatic multimodal code annotation generation method. The method incorporates multiple encoders within the model, each producing an intermediate representation vector for its own code modality. A fusion layer combines these intermediate vectors, and finally a joint decoder generates high-quality annotations word by word. The technical solution is as follows:
[0006] An automatic annotation generation method based on multimodal code representation includes the following steps:
[0007] Step 1: Code data preprocessing: Constructing three representation modalities for code snippets: API context graph, abstract syntax tree, and its field sequence; the API context graph represents the relationship between code snippets and control units, describing the overall logic of the program; the abstract syntax tree represents the structural and syntactic information in the code snippet, and the field sequence represents the natural semantic information in the code snippet;
[0008] Step 2: Representation learning for the multimodal code architecture: Based on an encoder-decoder framework, a graph encoder is used to represent the API context graph, a tree encoder to represent the abstract syntax tree, and a field encoder to represent the field sequence. The graph encoder includes a node embedding layer and a graph attention module. The graph encoder encodes the input API context graph into graph intermediate feature vectors; the tree encoder encodes the input abstract syntax tree into tree intermediate feature vectors; and the field encoder encodes the input field sequence into field intermediate feature vectors.
[0009] Step 3: Feature fusion and decoding: The intermediate feature vectors output by each encoder are fused through the feature fusion layer to generate a single joint vector. Finally, the joint decoder outputs the code annotation word by word.
[0010] Furthermore, constructing the API context graph in step 1 includes the following steps:
[0011] Step 11: Given a code snippet, extract the function name as the start node of the API context graph (ACG); all statements containing the "return" keyword are mapped to a single end node;
[0012] Step 12: If the statement contains an API method call, create an ACG node for the statement; if the parameter of the API method call is also an API method call, first create nodes for the parameter in order from right to left.
[0013] Step 13: If the current statement is a control statement, then create a control node, a Condition node, and a Body node for the control unit, as well as edges connecting each node;
[0014] Step 14: Systematically analyze the control dependencies between nodes and connect the edges of the corresponding nodes.
[0015] Furthermore, step 2 specifically includes the following steps:
[0016] Step 21: Obtain the graph embedding representation of the API context graph via the node embedding layer: $I_{ACG} = \{I_1, I_2, \ldots, I_n\}$;
[0017] Step 22: Map the input graph embedding representation to two feature spaces, the query space and the value space, through linear transformations to obtain the query vector and the value vector, respectively; the formulas are as follows:
[0018] $f(I_{ACG}) = I_{ACG} W_q$
[0019] $g(I_{ACG}) = I_{ACG} W_k$ (1)
[0020] where $f(\cdot)$ is the linear transformation mapping to the query space; $g(\cdot)$ is the linear transformation mapping to the value space; $W_q$ and $W_k$ are learnable parameters;
[0021] Step 23: Obtain the attention matrix $\alpha_{ij}$ by calculating the dot product of the query vector and the value vector. The specific calculation formula is as follows:
[0022]
[0023] $s_{ij} = f(I_i) \odot g(I_j)^{T}$ (3)
[0024] where $B$ is the adjacency matrix of the ACG and $E_n$ is the identity matrix; $\alpha_{ij}$ is the attention coefficient of node j with respect to node i when composing the representation vector of node i; $s_{ij}$ is the attention score of node j for node i; $\mathrm{softmax}_j(\cdot)$ is the normalized exponential function;
[0025] Step 24: Output the ensemble representation of each node through the graph attention module, calculated as follows:
[0026]
[0027] where $W_v$ is a learnable parameter; the remaining symbols denote, respectively, the intermediate representation of the ACG after passing through the attention module and the set of neighboring nodes of node i; ReLU(·) is the piecewise linear activation function;
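Steps 22 to 24 can be illustrated with a minimal sketch in plain Python. Formulas (2) and (4) are not reproduced in this text, so the sketch assumes two things that the source does not confirm: that the adjacency matrix plus the identity matrix acts as a mask restricting each node's attention to itself and its neighbors, and that each node's output is the ReLU of the attention-weighted sum of the neighbor embeddings projected by $W_v$. The names `matmul` and `graph_attention` are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def graph_attention(I, B, Wq, Wk, Wv):
    """Masked attention over ACG node embeddings (assumed form, see lead-in).

    I  : n x d node embeddings
    B  : n x n adjacency matrix of the ACG (0/1 entries)
    Wq, Wk, Wv : d x d projection matrices (learnable in a real model)
    """
    n = len(I)
    Q = matmul(I, Wq)  # f(I_ACG) = I_ACG W_q
    K = matmul(I, Wk)  # g(I_ACG) = I_ACG W_k
    V = matmul(I, Wv)
    out = []
    for i in range(n):
        # Assumption: node i may attend to itself and its neighbors (B + E_n).
        allowed = [j for j in range(n) if B[i][j] or i == j]
        scores = [sum(q * k for q, k in zip(Q[i], K[j])) for j in allowed]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha = [e / total for e in exps]  # softmax over the allowed nodes
        agg = [sum(a * V[j][d] for a, j in zip(alpha, allowed))
               for d in range(len(V[0]))]
        out.append([max(0.0, x) for x in agg])  # ReLU
    return out

# Toy example: three isolated nodes, identity projections.
IDENTITY = [[1, 0], [0, 1]]
NO_EDGES = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
EMBEDDINGS = [[1, 0], [0, 1], [1, 1]]
isolated = graph_attention(EMBEDDINGS, NO_EDGES, IDENTITY, IDENTITY, IDENTITY)
```

With no edges, every node attends only to itself, so the output equals the input embeddings; adding edges to `B` mixes in neighbor information.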
[0028] Step 25: Connect the graph attention module using a BiLSTM network to obtain the graph intermediate feature vector output by the graph encoder.
[0029]
[0030] where $h_{i-1}^{ACG}$ is the ACG intermediate feature vector of node i-1 output by the graph encoder; $h^{ACG}$ is the output of the hidden layer of the last unit of the graph encoder; and $h_i^{ACG}$ is the ACG intermediate feature vector of node i output by the graph encoder;
[0031] Step 26: Process the embedded abstract syntax tree vector $I_{AST}$ using a Tree-LSTM network. The calculation is as follows:
[0032]
[0033] where $h^{AST}$ is the intermediate vector of the AST output by the tree encoder, i.e., the output of the hidden layer of the last unit of the encoder; the two child terms are the inputs for the left and right child nodes of the AST node, respectively;
[0034] Step 27: Process the embedded field sequence vector $I_{TOKEN}$ using a BiLSTM network. The calculation is as follows:
[0035]
[0036] where $h^{TOKEN}$ is the intermediate vector of the field sequence output by the field encoder, i.e., the output of the hidden layer of the last unit of the encoder; the input term is the vector representation of the i-th field in the code field sequence.
[0037] Furthermore, step 3, feature fusion and decoding, specifically includes the following steps:
[0038] Step 31: Given the graph intermediate feature vectors, a self-attention mechanism is introduced, and the graph context vector is calculated according to the following formula.
[0039]
[0040] where the attention coefficient matrix gives the weights used when composing the graph context vector at each time step; the tree context vector and the field context vector are calculated in the same way according to formula (8).
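As a rough illustration of Step 31, the sketch below computes a context vector as an attention-weighted sum of encoder outputs. Formula (8) is not reproduced in this text, so the scoring function is an assumption: plain dot products between a query vector (for example, the decoder state at the current time step) and each encoder output, normalized with softmax. The function names are hypothetical.

```python
import math

def softmax(scores):
    """Normalized exponential function over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(query, states):
    """Return (context, weights) for one decoding time step.

    query  : vector for the current time step (assumed: decoder state)
    states : intermediate feature vectors output by one encoder
    """
    scores = [sum(q * h for q, h in zip(query, state)) for state in states]
    beta = softmax(scores)  # attention coefficients for this time step
    dim = len(states[0])
    context = [sum(b * state[d] for b, state in zip(beta, states))
               for d in range(dim)]
    return context, beta

# A zero query scores every state equally, so the context is their mean.
states = [[1.0, 0.0], [0.0, 1.0]]
context, beta = context_vector([0.0, 0.0], states)
```

The same function would be applied to the tree and field encoder outputs to obtain the tree and field context vectors.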
[0041] Step 32: Given the tree context vector and the field context vector, calculate the weighted average of the two.
[0042]
[0043]
[0044] where the weighting coefficient represents the contribution of each of the two vectors to the result at each time step of the decoding process; tanh(·) is the hyperbolic tangent activation function; $W_p$ is a learnable parameter; $b$ is a bias variable; δ(·) is the sigmoid activation function;
[0045] Step 33: Considering the distinctness of the feature space of the graph context vector, it is directly concatenated with the weighted average to obtain the final fused feature representation, namely:
[0046]
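Steps 32 and 33 can be sketched as a gated average followed by concatenation. Formulas (9) and (10) are not reproduced in this text, so the gate below is only one plausible reading of the listed ingredients (tanh, sigmoid, $W_p$, $b$): a scalar gate computed from the tanh of the two context vectors, used to mix them, with the graph context vector then concatenated in front. The function `fuse` and its exact gate form are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse(c_graph, c_tree, c_field, w_p, b):
    """Fuse three context vectors (assumed form, see lead-in).

    c_tree and c_field must have the same length; w_p has one weight per
    element of their concatenation. Returns (fused_vector, gate).
    """
    hidden = [math.tanh(x) for x in c_tree + c_field]
    gate = sigmoid(sum(w * h for w, h in zip(w_p, hidden)) + b)
    # Weighted average of the tree and field context vectors.
    mixed = [gate * t + (1.0 - gate) * f for t, f in zip(c_tree, c_field)]
    # Direct concatenation of the graph context vector with the average.
    return c_graph + mixed, gate

# With zero weights and zero bias the gate is 0.5, i.e. a plain average.
fused, gate = fuse([1.0], [2.0, 4.0], [0.0, 0.0], [0.0, 0.0, 0.0, 0.0], 0.0)
```

Keeping the graph context vector separate until the concatenation preserves its own feature space, which is the motivation given in Step 33.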
[0047] Step 34: Use an LSTM network as the decoder, whose initial state is the average of the last-unit outputs of the tree encoder and the field encoder. During the model training phase, the embedding vector $S_t$ of the reference annotation is input to drive the decoding process, which is calculated as follows:
[0048]
[0049]
[0050] where $h_{INIT}$ is the initial state of the decoder; $W_s$ and $W_t$ are learnable parameters; and the remaining symbols denote the output probability distribution and the output of the decoder, respectively;
[0051] Based on the output probability distribution, the word with the highest probability is selected as the result of the current decoding step, and the final iterative output is the target sequence $Y = \{Y_1, Y_2, \ldots, Y_T\}$.
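The decoding loop of Step 34 can be sketched as greedy decoding: at each step the word with the highest probability is kept and fed back as the next input. The LSTM itself is replaced here by a hypothetical `step_fn` backed by a made-up lookup table, and the start/end tokens and the `max_len` cutoff are assumptions added for the illustration.

```python
def initial_state(h_tree, h_field):
    """Average the last-unit outputs of the tree and field encoders."""
    return [(a + b) / 2 for a, b in zip(h_tree, h_field)]

def greedy_decode(step_fn, state, start_token, end_token, max_len):
    """Pick the most probable word at every step until the end token."""
    tokens = []
    prev = start_token
    for _ in range(max_len):
        probs, state = step_fn(prev, state)  # probs: word -> probability
        word = max(probs, key=probs.get)
        if word == end_token:
            break
        tokens.append(word)
        prev = word
    return tokens

# Made-up next-word distributions standing in for a trained decoder.
TOY_TABLE = {
    "<s>": {"returns": 0.7, "the": 0.2, "</s>": 0.1},
    "returns": {"the": 0.6, "sum": 0.3, "</s>": 0.1},
    "the": {"sum": 0.8, "returns": 0.1, "</s>": 0.1},
    "sum": {"</s>": 0.9, "the": 0.1},
}

def toy_step(prev_word, state):
    # A real decoder would update the state; the toy one passes it through.
    return TOY_TABLE[prev_word], state

h0 = initial_state([1.0, 3.0], [3.0, 1.0])
annotation = greedy_decode(toy_step, h0, "<s>", "</s>", max_len=10)
```

During training, the reference annotation embedding would be fed in at each step instead of the previously predicted word.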
[0052] The beneficial effects of this invention are:
[0053] 1) This invention models source code from a holistic perspective, focusing on API usage and associated control flow units in the code, which can significantly enhance the learning of code representation.
[0054] 2) This invention proposes an automatic code annotation generation method based on multimodal representation, which uses multiple encoders to more comprehensively uncover latent information in the code, thereby generating high-quality annotations.
Description of the Drawings
[0055] Figure 1 is a diagram illustrating the overall architecture of the method of the present invention.
[0056] Figure 2 is an example of an API context graph (ACG) for a code snippet.
[0057] Figure 3 is a schematic diagram of the attention module in the graph encoder.
[0058] Figure 4 is a schematic diagram of the Tree-LSTM used in the tree encoder.
[0059] Figure 5 is a schematic diagram of the Bi-LSTM used in the field encoder.
[0060] Figure 6 is a line graph showing the effect of sequence length on the results.
Detailed Implementation
[0061] The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. The main idea of the invention is inspired by the following practice: to understand code semantics, developers first scan the entire program (function names, API calls, control flow units, etc.) and then browse more comprehensive code information (syntax, identifier semantics, etc.). Therefore, the present invention proposes an automatic code annotation generation method based on multimodal representation. First, the source code is represented by its API usage within the control flow graph (CFG), called the API context graph (ACG), and processed by a graph encoder. Then, two additional encoders (a tree encoder and a field encoder) are incorporated into this multimodal network architecture to more comprehensively uncover the latent information in the code. Finally, a joint decoder generates the target code annotation word by word based on the fused encoding result. The overall architecture of the present invention is shown in Figure 1; the method consists of three main parts: data processing, multimodal code representation, and feature fusion and decoding.
[0062] Step 1: Given a code snippet, the method obtains three code representation modalities through data processing: the ACG, the AST, and the token sequence. Specifically, an ACG is systematically constructed from the source code: the source code is analyzed line by line, and the nodes and edges of the ACG are constructed accordingly. An ACG can be regarded as the skeleton of the program, presenting a joint overview of the code's semantic and structural information, as shown in Figure 2. The specific construction steps are described below:
[0063] Step 11: Given a code snippet, extract the function name as the start node of the ACG; all statements containing the "return" keyword are mapped to a single end node.
[0064] Step 12: If the statement contains API method calls, create an ACG node for that statement. Note that if the parameters of an API method call are also API method calls, then nodes for the parameters are created first, from right to left.
[0065] Step 13: If the current statement is a control statement, create a separate control node for the control unit, along with several other nodes (Condition and Body nodes) and the edges connecting them.
[0066] Step 14: Systematically analyze the control dependencies between nodes and connect the edges of the corresponding nodes.
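Steps 11 to 14 can be sketched on a simplified, pre-parsed statement list. The dictionary format below (`kind`, `api`, `args`, `keyword`, `condition`, `body`) is a hypothetical stand-in for the output of a real parser, and linking each statement to the previous one is a simplifying assumption about how the control dependencies of Step 14 are connected.

```python
class ACG:
    """Toy API context graph: node labels plus directed edges."""

    def __init__(self, function_name):
        self.nodes = [function_name]  # Step 11: start node is the function name
        self.edges = set()
        self.end = None               # shared end node for all "return"s

    def add_node(self, label):
        self.nodes.append(label)
        return len(self.nodes) - 1

def _add_call(acg, call, prev):
    # Step 12: nested API-call arguments get their nodes first, right to left.
    for arg in reversed(call.get("args", [])):
        prev = _add_call(acg, arg, prev)
    node = acg.add_node(call["api"])
    acg.edges.add((prev, node))
    return node

def _add_block(acg, statements, prev):
    for stmt in statements:
        if stmt["kind"] == "call":
            prev = _add_call(acg, stmt, prev)
        elif stmt["kind"] == "control":
            # Step 13: control node with Condition and Body nodes and edges.
            ctrl = acg.add_node(stmt["keyword"])
            cond = acg.add_node("Condition")
            body = acg.add_node("Body")
            acg.edges.update({(prev, ctrl), (ctrl, cond), (ctrl, body)})
            _add_block(acg, stmt.get("condition", []), cond)
            _add_block(acg, stmt.get("body", []), body)
            prev = ctrl
        elif stmt["kind"] == "return":
            # Step 11: every return statement maps to the same end node.
            if acg.end is None:
                acg.end = acg.add_node("return")
            acg.edges.add((prev, acg.end))
    return prev

def build_acg(function_name, statements):
    acg = ACG(function_name)
    _add_block(acg, statements, prev=0)  # Step 14 (simplified edge linking)
    return acg

example = build_acg("foo", [
    {"kind": "call", "api": "print",
     "args": [{"kind": "call", "api": "a"}, {"kind": "call", "api": "b"}]},
    {"kind": "control", "keyword": "if",
     "condition": [{"kind": "call", "api": "isEmpty"}],
     "body": [{"kind": "return"}]},
    {"kind": "return"},
])
```

In the example, the nested calls `a` and `b` inside `print` are visited right to left, so `b` receives its node before `a`, and both `return` statements point at one shared end node.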
[0067] Step 2: Based on the three code representation modalities in Step 1, three encoders (graph encoder, tree encoder, and field encoder) are used to obtain a multimodal representation of the code. The computation process of each encoder is described in detail below.
[0068] The graph encoder, which contains a node embedding layer and a graph attention module, encodes the input ACG into a graph intermediate feature vector. Figure 3 provides a schematic diagram of the attention module, and the specific steps are described below:
[0069] Step 21: Obtain the graph embedding representation of the ACG through a node embedding layer: $I_{ACG} = \{I_1, I_2, \ldots, I_n\}$
[0070] Step 22: Map the input graph embedding representation to two feature spaces through linear transformations, one representing the query space and the other representing the value space. The formulas are as follows:
[0071] $f(I_{ACG}) = I_{ACG} W_q$
[0072] $g(I_{ACG}) = I_{ACG} W_k$
[0073] where $f(\cdot)$ is the linear transformation mapping to the query space; $g(\cdot)$ is the linear transformation mapping to the value space; $W_q$ and $W_k$ are learnable parameters, as are all $W_*$ that appear later in the text.
[0074] Step 23: Obtain the attention matrix $\alpha_{ij}$ by calculating the dot product of the query vector and the value vector. The specific calculation formula is as follows:
[0075]
[0076] $s_{ij} = f(I_i) \odot g(I_j)^{T}$
[0077] where $B$ is the adjacency matrix of the ACG and $E_n$ is the identity matrix. $\alpha_{ij}$ can be interpreted as the attention coefficient of node j with respect to node i when composing the representation vector of node i.
[0078] Step 24: The graph attention module outputs the ensemble representation of each node, calculated as follows:
[0079]
[0080] where the symbols denote, respectively, the intermediate representation of the ACG after passing through the attention module and the set of neighboring nodes of node i; ReLU(·) is the piecewise linear activation function.
[0081] Step 25: Feed the output of this graph attention module into a BiLSTM network to obtain the graph intermediate feature vector output by the graph encoder.
[0082]
[0083] The tree encoder encodes the input AST into tree intermediate feature vectors. The specific steps are summarized as follows:
[0084] Step 26: Process the embedded AST vector $I_{AST}$ using a Tree-LSTM network. Figure 4 gives a schematic diagram of the Tree-LSTM network, and its calculation process is summarized as follows:
[0085]
[0086] where $h^{AST}$ is the intermediate vector of the AST output by the tree encoder, i.e., the output of the hidden layer of the last unit of the encoder; the two child terms are the inputs for the left and right child nodes of the AST node, respectively.
[0087] The field encoder encodes the input field sequence into field intermediate feature vectors. The specific steps are summarized as follows:
[0088] Step 27: Process the embedded field sequence vector $I_{TOKEN}$ using a BiLSTM network. Figure 5 gives a schematic diagram of the Bi-LSTM network, and its calculation process is as follows:
[0089]
[0090] where $h^{TOKEN}$ is the intermediate vector of the field sequence output by the field encoder, i.e., the output of the hidden layer of the last unit of the encoder.
[0091] Step 3: After obtaining the intermediate representation vectors of the three code modalities in Step 2, the three encoding results are then fused and transmitted to a joint decoder to output annotations word by word. The detailed process of feature fusion and decoding is described below:
[0092] The feature fusion layer fuses the intermediate features output by each encoder to generate a single joint vector.
[0093] Step 31: Given the graph intermediate feature vectors, the traditional self-attention mechanism is introduced,
[0094] and the graph context vector is calculated according to the following formula.
[0095]
[0096] where $\beta_{tj}$ is also an attention coefficient matrix, giving the weights used when composing the graph context vector at time step t. The other two vectors, the tree context vector and the field context vector, can be calculated similarly.
[0097] Step 32: Given the tree context vector and the field context vector, calculate the weighted average of the two.
[0098]
[0099]
[0100] where the weighting coefficient represents the contribution of each of the two vectors to the result at each time step of the decoding process. tanh(·) is the hyperbolic tangent activation function. $W_p$ is a learnable parameter; $b$ is a bias variable; δ(·) is the sigmoid activation function.
[0101] Step 33: Considering the distinctness of the feature space of the graph context vector, it is directly concatenated with the weighted average to obtain the final fused feature representation, namely:
[0102]
[0103] Step 34: Use an LSTM network as the decoder, whose initial state is the average of the last-unit outputs of the tree encoder and the field encoder. During the model training phase, the embedding vector $S_t$ of the reference annotation is input to drive the decoding process, which is calculated as follows:
[0104]
[0105]
[0106] Based on the output probability distribution, the word with the highest probability is selected as the result of the current decoding step, and the final iterative output is the target sequence $Y = \{Y_1, Y_2, \ldots, Y_T\}$.
[0107] To evaluate the effectiveness of the method presented in this invention, experiments were conducted on two open-source datasets (JHD and CSD). To show that the annotations generated by this method are competitive with those of other annotation generation techniques, it was compared against several advanced baseline methods (CodeNN, Tree-LSTM, TL-CodeSum, DeepCom, and ConvGNN). The evaluation metrics were BLEU, METEOR, and ROUGE.
[0108] Table 1. Comparison results of HCCS and baseline methods
[0109]
[0110] Table 1 reports the main results of the proposed method (HCCS) against the five baseline methods. The proposed method achieves the best results on all three evaluation metrics. On the JHD dataset, HCCS improves upon the other methods by 9.5%, 7.1%, and 6.4% on the three metrics, respectively; on the CSD dataset, the corresponding improvements are 13.6%, 15.1%, and 17.8%. The table also shows that CodeNN and Tree-LSTM do not perform well on either dataset: both employ a single encoder, which makes it extremely challenging for the model to extract the various aspects of information in the code. In summary, the proposed method can learn a holistic and comprehensive representation of code, thereby fully mining the latent semantics within the source code. These results demonstrate that the proposed method can effectively generate high-quality annotations.
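For readers unfamiliar with BLEU, the sketch below shows a simplified sentence-level variant: clipped n-gram precisions combined by a geometric mean and multiplied by a brevity penalty. It is an illustration only; the source does not state which BLEU implementation, n-gram order, or smoothing was used in the experiments, and `simple_bleu` is a hypothetical name.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against a single reference."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any empty precision gives zero
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty discourages candidates shorter than the reference.
    if len(candidate) > len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(candidate))
    return penalty * math.exp(sum(log_precisions) / max_n)

perfect = simple_bleu("returns the sum of two numbers".split(),
                      "returns the sum of two numbers".split())
short = simple_bleu("the cat".split(), "the cat sat".split(), max_n=2)
```

An exact match scores 1.0, while the shorter candidate keeps perfect n-gram precision but is reduced by the brevity penalty.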
[0111] Table 2. Impact of different multimodal fusion methods on the results
[0112]
[0113] The method of this invention incorporates three encoders in the model, resulting in three context vectors: the graph, tree, and field context vectors. This experiment explores the impact of different feature combination methods on the results. Table 2 presents the results of the proposed method under different combinations. Here, "+" denotes a weighted addition of two vectors, and ";" denotes a direct concatenation of two vectors. The table shows that fusing all three context vectors yields better results than the other cases, which further demonstrates the effectiveness of multimodal code representation. Specifically, comparing row 1 with row 6 shows an 8.81% improvement in the BLEU score, from which we conclude that holistic code modeling of API usage effectively improves the results. Comparing row 5 and row 6, we can conclude that the optimal feature fusion method is...
[0114] The performance of the method varies with input length. This experiment analyzes the impact of input sequence length on the accuracy of annotation word prediction. To this end, the average BLEU score was calculated for samples of different input sequence lengths on the JHD dataset, and the results of several models (Tree-LSTM, DeepCom, ConvGNN, and HCCS) were recorded. As shown in Figure 6, all methods initially show an upward trend as the code sequence length increases. Because very short code snippets with fewer than 20 fields may lack complete calling logic, all methods perform poorly at first.
[0115] The method of this invention exhibits acceptable performance in most cases and is sufficiently competitive. However, when the code length exceeds 200 fields, the results of all methods show a downward trend. This phenomenon can be explained as follows: long code segments tend to contain more noise, making it difficult for the method to consistently capture valid key clues; simultaneously, limited by the maximum sequence length, these methods struggle to capture long-distance dependencies between fields. However, the downward trend exhibited by the method of this invention is the least pronounced. This is because the encoding of ACG allows the method to learn the overall code representation, thus enabling the method to overcome noise interference and long-distance dependency problems to some extent.
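The length analysis above amounts to grouping samples by input length and averaging a score per group. A minimal sketch follows; the bin width of 20 fields and the sample values are made up for illustration and are not the experimental data.

```python
from collections import defaultdict

def mean_score_by_length(samples, bin_width=20):
    """Average score per length bin.

    samples : iterable of (input_length, score) pairs
    Returns a dict mapping the start of each bin to the mean score in it.
    """
    buckets = defaultdict(list)
    for length, score in samples:
        buckets[(length // bin_width) * bin_width].append(score)
    return {start: sum(scores) / len(scores)
            for start, scores in sorted(buckets.items())}

# Made-up (length, score) pairs, only to show the grouping.
toy_samples = [(5, 0.25), (15, 0.75), (25, 0.5)]
per_bin = mean_score_by_length(toy_samples)
```

Plotting the resulting means against the bin starts gives a curve of the kind described for Figure 6.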
Claims
1. A method for automatically generating annotations based on multimodal code representation, characterized in that it includes the following steps:
Step 1: Code data preprocessing: constructing three representation modalities for a code snippet: the API context graph, the abstract syntax tree, and the field sequence; the API context graph represents the relationship between code snippets and control units, describing the overall logic of the program; the abstract syntax tree represents the structural and syntactic information in the code snippet, and the field sequence represents the natural semantic information in the code snippet;
Step 2: Representation learning for the multimodal code architecture: based on an encoder-decoder framework, a graph encoder is used to represent the API context graph, a tree encoder to represent the abstract syntax tree, and a field encoder to represent the field sequence; the graph encoder includes a node embedding layer and a graph attention module; the graph encoder encodes the input API context graph into graph intermediate feature vectors; the tree encoder encodes the input abstract syntax tree into tree intermediate feature vectors; and the field encoder encodes the input field sequence into field intermediate feature vectors;
Step 3: Feature fusion and decoding: the intermediate feature vectors output by each encoder are fused through the feature fusion layer to generate a single joint vector; finally, the joint decoder outputs the code annotation word by word;
Step 3, feature fusion and decoding, specifically includes the following steps:
Step 31: Given the graph intermediate feature vectors, a self-attention mechanism is introduced, and the graph context vector is calculated according to formula (8); where the attention coefficient matrix gives the weights used when composing the graph context vector at each time step; the tree context vector and the field context vector are calculated in the same way according to formula (8);
Step 32: Given the tree context vector and the field context vector, the weighted average of the two is calculated according to formula (9); where the weighting coefficient represents the contribution of each of the two vectors to the result at each time step of the decoding process; tanh(·) is the hyperbolic tangent activation function; the weight matrices are learnable parameters; b is a bias variable; δ(·) is the sigmoid activation function; and the field-encoder term is the output of the hidden layer of the last unit of the field encoder;
Step 33: Considering the distinctness of the feature space of the graph context vector, it is directly concatenated with the weighted average to obtain the final fused feature representation, according to formula (10);
Step 34: An LSTM network is used as the decoder, whose initial state is the average of the last-unit outputs of the tree encoder and the field encoder; during the model training phase, the embedding vector of the reference annotation is input to drive the decoding process, which is calculated according to formulas (11) and (12); where the symbols in formulas (11) and (12) denote, respectively, the initial state of the decoder, the learnable parameters, the output probability distribution, the output of the decoder, and the output of the hidden layer of the last unit of the tree encoder; based on the output probability distribution, the word with the highest probability is selected as the result of the current decoding step, and the target sequence is finally output iteratively.
2. The automatic annotation generation method based on multimodal code representation according to claim 1, characterized in that the construction of the API context graph in step 1 includes the following steps:
Step 11: Given a code snippet, extract the function name as the start node of the ACG; all statements containing the "return" keyword are mapped to a single end node;
Step 12: If the statement contains an API method call, create an ACG node for the statement; if a parameter of the API method call is also an API method call, first create nodes for the parameters in order from right to left;
Step 13: If the current statement is a control statement, create a control node, a Condition node, and a Body node for the control unit, as well as the edges connecting these nodes;
Step 14: Systematically analyze the control dependencies between nodes and connect the edges of the corresponding nodes.
3. The automatic annotation generation method based on multimodal code representation according to claim 1, characterized in that step 2 specifically includes the following steps:
Step 21: Obtain the graph embedding representation of the API context graph via the node embedding layer;
Step 22: Map the input graph embedding representation to two feature spaces, the query space and the value space, through linear transformations according to formula (1) to obtain the query vector and the value vector, respectively; where the two mappings are the linear transformations to the query space and to the value space, and their weight matrices are learnable parameters;
Step 23: Obtain the attention matrix by calculating the dot product of the query vector and the value vector according to formulas (2) and (3); where B is the adjacency matrix of the ACG and the accompanying matrix is the identity matrix; the attention coefficient is that of node j with respect to node i when composing the representation vector of node i; the attention score is that of node j for node i; and the normalization is the normalized exponential function;
Step 24: Output the aggregated representation of each node through the graph attention module according to formula (4); where the weight matrix is a learnable parameter; the remaining symbols denote the intermediate representation of the ACG after passing through the attention module and the set of neighboring nodes of node i; and the activation is a piecewise linear activation function;
Step 25: Connect the graph attention module to a BiLSTM network to obtain the graph intermediate feature vector output by the graph encoder according to formula (5); where the symbols denote, respectively, the ACG intermediate feature vector of node i-1 output by the graph encoder, the output of the hidden layer of the last unit of the graph encoder, and the ACG intermediate feature vector of node i output by the graph encoder;
Step 26: Process the embedded abstract syntax tree vector using a Tree-LSTM network according to formula (6); where the output is the intermediate vector of the AST output by the tree encoder, and the two child terms are the inputs for the left and right child nodes of the AST node, respectively;
Step 27: Process the embedded field sequence vector using a BiLSTM network according to formula (7); where the output is the intermediate vector of the field sequence output by the field encoder, and the input term is the vector representation of the i-th field in the code field sequence.