A code generation method and system integrating two-dimensional code representation and dependency encoding
By modeling code snippets as two-dimensional structures and using sparse autoencoders for inter-line dependency encoding, the invention addresses the shortcomings of existing code generation models in generalization ability and structural understanding, thereby improving the accuracy of code generation and the ability to process long sequences.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NANJING UNIV OF AERONAUTICS & ASTRONAUTICS
- Filing Date
- 2025-03-13
- Publication Date
- 2026-06-12
AI Technical Summary
Existing code generation models lack generalization ability when dealing with code structure and long sequences, ignoring the two-dimensional structure of code and inter-line dependencies, resulting in inaccurate generation results.
By modeling code snippets as two-dimensional structures, encoding inter-line dependencies with sparse autoencoders, and using self-attention mechanisms and multilayer perceptrons to compute intermediate character representations and model predictions, the generalization and structural understanding capabilities of the code generation model are improved.
It improves the ability of code generation models to model code structure and relationships, enhances performance in long context scenarios, and significantly outperforms traditional methods in code modeling, long sequence understanding, and context retrieval tasks.
Figure CN120255874B_ABST
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence technology, and in particular to a code generation method and system that integrates two-dimensional code representation and dependency encoding.
Background Technology
[0002] Existing code generation models, such as Codex and CodeLlama, are typically based on traditional Natural Language Processing (NLP) models and are obtained through incremental training on large-scale code datasets. These models treat code snippets as one-dimensional byte sequences and capture the sequence's order information through positional encoding (such as RoPE and ALiBi). However, this approach also has some significant drawbacks:
[0003] 1. Poor generalization of positional encoding: While positional indexing can effectively capture the relative positional relationships between characters, it ignores the potential translation invariance of the code itself in certain aspects. Specifically, functions, classes, or modules defined in many code snippets are semantically and functionally equivalent, even if they are in different positions in the code. Therefore, changing the position of these code snippets does not affect the execution logic of the program. However, existing positional encoding methods can only learn the distribution patterns between fixed positions, which limits the generalization ability of the model when faced with changes in code structure or order.
[0004] 2. Lack of code structure information: Existing code generation models typically treat code as a one-dimensional linear sequence, ignoring its essential structural features. Code is not merely a linear string of single characters or tags, but has a clear hierarchical structure, including functions, classes, conditional statements, etc. The logical relationships between these elements constitute the program's functionality and execution flow. Traditional serialization methods fail to effectively capture the inter-line logical flow and inline operations in the code, making it difficult for the model to understand the overall structure of the code, thus affecting the correctness and rationality of the generated results.
[0005] 3. Insufficient long sequence processing capability: Traditional positional encoding methods typically suffer a significant performance drop when processing sequences longer than those seen in training. This is mainly because they usually represent positions with fixed-length indices: when the input sequence exceeds the training length, the model cannot model the positional relationships between characters beyond that length. Therefore, for tasks that require processing long code snippets or complex code structures, such as repository-level code generation, the effectiveness of traditional positional encoding methods is limited.
Summary of the Invention
[0006] The purpose of this invention is to overcome the shortcomings of the prior art and provide a code generation method and system that integrates two-dimensional code representation and dependency encoding. By modeling code fragments as two-dimensional structures and combining them with sparse autoencoders (SAE) for interline dependency encoding, the generalization ability, structural understanding ability and long context processing ability of the code generation model are improved. This can effectively improve the model's ability to model code structure and relationships, thereby improving the model's performance.
[0007] To solve the above-mentioned technical problems, the present invention is implemented using the following technical solution:
[0008] In a first aspect, the present invention provides a code generation method that integrates two-dimensional code representation and dependency encoding, comprising:
[0009] The acquired code data is modeled into a two-dimensional structure to obtain two-dimensional code;
[0010] The two-dimensional code is input into the trained code generation model: word embeddings are performed on the two-dimensional code through an embedding layer to obtain word embedding vectors; inter-line masking is performed on the two-dimensional code through inline attention masking to obtain a mask matrix; dependency modeling is performed on the word embedding vectors through a sparse autoencoder to obtain dependency encoding of inter-line dependencies in the code; attention is calculated using a self-attention mechanism based on the word embedding vectors, mask matrix, and dependency encoding to obtain an intermediate representation of each character; based on the intermediate representation of each character, model prediction is performed through a multilayer perceptron and a softmax layer to obtain the generated code.
[0011] Optionally, modeling the acquired code data into a two-dimensional structure includes:
[0012] Obtain the position of each newline character in the code data to get the newline position encoding $P = \{p_1, p_2, \dots, p_n\}$, where $n$ denotes the number of newline characters;
[0013] Based on the newline position encoding $P$, for any row index $k$ such that positions $i$ and $j$ both lie between $p_k$ and $p_{k+1}$, the characters $x_i$ and $x_j$ are considered to belong to the same row, where $k$ denotes the row index, $i$ and $j$ denote the positions of characters in the code data $x$, and $p_k$ and $p_{k+1}$ denote the positions of two adjacent lines of code.
[0014] Optionally, the mask matrix M is calculated using the following formula:
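[0015] $$M_{ij} = \begin{cases} 0, & \text{if characters } x_i \text{ and } x_j \text{ belong to the same row} \\ -\infty, & \text{otherwise} \end{cases}$$
[0016] where $M_{ij}$ denotes the accessibility between character $x_i$ and character $x_j$; inaccessible pairs are assigned a value as small as possible, so that the attention weight between the two characters is 0 after the softmax calculation.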
[0017] Optionally, the data processing procedure for dependency modeling includes:
[0018] Based on the word embedding vectors, the word embedding $s_k$ of the last character of code line $k$ is taken as the semantic anchor of that line, where the code line comprises the characters belonging to row $k$;
[0019] According to the semantic anchor $s_k$, the dependency encoding of code line $k$ is calculated using a sparse autoencoder (SAE) as follows:
[0020] $$a_k = \sigma(W_{enc}\, s_k + b), \qquad d_k = W_{dec}\, a_k$$
[0021] where $W_{enc}$, $W_{dec}$ and $b$ are learnable parameters of the sparse autoencoder (SAE); $W_{enc}$ is the encoder of the SAE, used to transform the input into feature activation values; $W_{dec}$ is the feature dictionary, whose features are activated according to their activation values; $b$ is the bias; $\sigma$ is the activation function; $a_k$ denotes the activation values of the dictionary features for code line $k$; and $d_k$ denotes the dependency encoding of code line $k$.
[0022] Optionally, the dependency modeling includes distance-based activation value enhancement, implemented through the following formula:
[0023]
[0024] where the quantities in the formula are, in order: the activation values of the dictionary features for the code line; a clipping operation that takes the larger of its two arguments; the input length of the model; and the position of the semantic anchor of the code line.
[0025] Optionally, the data processing procedure for attention calculation includes:
[0026] The attention score between characters is calculated based on the word embedding vectors, dependency encoding, and mask matrix M, specifically using the following formula:
[0027] $$\tilde{e}_i = e_i + d_i, \quad \tilde{e}_j = e_j + d_j, \quad q_i = W_Q \tilde{e}_i, \quad k_j = W_K \tilde{e}_j, \quad A_{ij} = q_i^\top k_j + M_{ij}$$
[0028] where $e_i$ is the word embedding of character $x_i$; $d_i$ is the dependency encoding of the code line containing character $x_i$; $\tilde{e}_i$ is the word embedding of character $x_i$ after dependency encoding; $e_j$, $d_j$ and $\tilde{e}_j$ are the corresponding quantities for character $x_j$; $q_i$ is the query vector in the self-attention mechanism; $W_Q$ is the query mapping matrix; $k_j$ is the key vector in the self-attention mechanism; $W_K$ is the key mapping matrix; $A_{ij}$ is the attention score between characters $x_i$ and $x_j$; and $M_{ij}$ is the accessibility between them;
[0029] The intermediate representation of each character is calculated based on the attention scores between the characters, specifically using the following formula:
[0030] $$o_i = \sum_j \mathrm{softmax}_j(A_{ij})\, v_j$$
[0031] where $o_i$ is the intermediate representation of character $x_i$ generated through the self-attention mechanism; $v_j$ is the value vector in the self-attention mechanism, obtained with the value mapping matrix $W_V$; and $\mathrm{softmax}$ is the normalization function.
[0032] Optionally, the self-attention mechanism alleviates the attention distraction problem through MLP, specifically through the following formula:
[0033]
[0034] where the quantities in the formula are: the attention score between the two characters; a set of multilayer perceptrons; and a dimension concatenation operation.
[0035] Optionally, the model prediction using a multilayer perceptron and a softmax layer is achieved through the following formula:
[0036]
[0037] where the quantities in the formula are: the probability distribution of the model's final prediction; the normalization function (softmax), which transforms predicted values into a probability distribution; a set of multilayer perceptrons; and the probability mapping matrix, which converts the model output into unnormalized scores over the vocabulary.
[0038] Optionally, the code generation model uses the following loss function:
[0039]
[0040] where the quantities in the formula are: the total loss of the model; the cross-entropy loss, which penalizes the difference between the model's predictions and the true labels; the sparsity loss, which controls the sparsity of the activation vectors in the sparse encoding so that the model focuses on key dependencies; the hyperparameters that control activation sparsity in the sparse autoencoder; and the activation value of a feature in the sparse autoencoder.
[0041] In a second aspect, the present invention provides a code generation system that integrates two-dimensional code representation and dependency encoding, comprising:
[0042] The 2D structure modeling module is used to: model the acquired code data into a 2D structure to obtain 2D code;
[0043] The word embedding module is used to: embed words into the two-dimensional code through the embedding layer to obtain word embedding vectors;
[0044] The inline masking module is used to: perform inline masking on the two-dimensional code using inline attention masking to obtain a mask matrix;
[0045] The dependency modeling module is used to: perform dependency modeling on the word embedding vectors using a sparse autoencoder to obtain the dependency encoding of inter-line dependency relationships in the code;
[0046] The attention calculation module is used to: perform attention calculation using a self-attention mechanism based on the word embedding vector, mask matrix, and dependency encoding to obtain the intermediate representation of each character;
[0047] The model prediction module is used to: perform model prediction based on the intermediate representation of each character through a multilayer perceptron and a softmax layer to obtain the generated code.
[0048] Compared with existing technologies, the beneficial effects achieved by this invention are as follows:
[0049] 1. The code generation method integrating two-dimensional code representation and dependency encoding provided by this invention is based on the Transformer model. It treats code as a two-dimensional structure and proposes dependency encoding to model the dependencies within this structure, effectively improving the model's ability to model code structure and relationships, thereby enhancing model performance. The code fragment is modeled as a two-dimensional structure, where the vertical dimension represents the logical flow between code lines, and the horizontal dimension represents fine-grained meta-operations on characters within a line. In the shallow layer of the Transformer model, an in-line attention mask is set to extract the functional semantic representation of each line of code. In the deep layer of the Transformer model, a sparse autoencoder (SAE) is used to extract features from inter-line dependencies and generate corresponding dependency encodings, achieving the embedding of inter-line dependencies. Dependency embedding and character embedding are fused, attention scores between characters are calculated, and attention fusion and position-aware feature enhancement are applied to the attention scores to ensure the method's performance in long-context scenarios.
[0050] 2. The code generation system that integrates two-dimensional code representation and dependency encoding provided by this invention comprises a two-dimensional structure modeling module, a word embedding module, an inline masking module, a dependency modeling module, an attention calculation module, and a model prediction module. With these modules it significantly outperforms existing systems in tasks such as code modeling, long sequence understanding, functional correctness, and context retrieval, and it has practical significance and good application prospects.
Attached Figure Description
[0051] Figure 1 is a flowchart of a code generation method that integrates two-dimensional code representation and dependency encoding according to an embodiment of the present invention.
Detailed Implementation
[0052] The technical solution of the present invention will be described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the embodiments and specific features in the embodiments are detailed descriptions of the technical solution of the present application, rather than limitations thereof. In the absence of conflict, the embodiments and technical features in the embodiments can be combined with each other.
[0053] It should be noted that the term "and / or" in this article is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, or B existing alone. Additionally, the character " / " in this article generally indicates that the preceding and following related objects have an "or" relationship.
[0054] Embodiment 1:
[0055] This invention discloses a code generation method that integrates two-dimensional code representation and dependency encoding. As shown in Figure 1, the specific steps are as follows:
[0056] S1, model the acquired code data into a two-dimensional structure to obtain two-dimensional code;
[0057] S2, the two-dimensional code is input into the trained code generation model: word embedding is performed on the two-dimensional code through an embedding layer to obtain word embedding vectors; inter-line masking is performed on the two-dimensional code through an inline attention mask to obtain a mask matrix; dependency modeling is performed on the word embedding vectors through a sparse autoencoder to obtain dependency encoding of inter-line dependencies in the code; attention is calculated using a self-attention mechanism based on the word embedding vectors, mask matrix, and dependency encoding to obtain an intermediate representation of each character; based on the intermediate representation of each character, model prediction is performed through a multilayer perceptron and a softmax layer to obtain the generated code.
[0058] Specifically,
[0059] In step S1, the code snippet is modeled as a two-dimensional structure, where the vertical dimension represents the logical flow between code lines and the horizontal dimension represents the fine-grained meta-operations of characters within a line. Current large-scale code models primarily treat code snippets as ordinary text. While this facilitates the direct application of natural language processing techniques to code, it fundamentally ignores the hierarchical and modular structure of source code. In actual programming, developers writing or reading code focus more on the dependencies between code lines than on the specific position of individual characters in the input sequence. From this perspective, representing code as a two-dimensional structure is more meaningful than reducing character positions to a one-dimensional sequence. In this embodiment, the code is represented as a two-dimensional structure organized by code lines (vertical dimension) and by the characters within each line (horizontal dimension). Specifically, the vertical dimension encodes the logical flow of the program across lines, covering elements such as control structures, function declarations, and inter-line dependencies, while the horizontal dimension captures fine-grained data operations within each line. By reflecting the spatial organization of the code, this representation is closer to how developers naturally read, understand, and write code. A simple way to represent source code as a two-dimensional structure is to use an image-like grid; however, since the lengths of source code lines usually vary greatly, this introduces a large number of padding placeholders and reduces computational efficiency. Therefore, this embodiment still processes the source code as a one-dimensional sequence but emphasizes its line structure, specifically:
[0060] Given an input sequence $x$, the source code is divided into lines by the newline character \n. The position of each newline character in the code data is obtained to give the newline position encoding $P = \{p_1, p_2, \dots, p_n\}$ (with boundary conditions assumed at the two ends of the sequence), where $n$ denotes the number of newline characters;
[0061] Based on the newline position encoding $P$, for any row index $k$ such that positions $i$ and $j$ both lie between $p_k$ and $p_{k+1}$, the characters $x_i$ and $x_j$ are considered to belong to the same row, where $k$ denotes the row index, $i$ and $j$ denote the positions of characters in the code data $x$, and $p_k$ and $p_{k+1}$ denote the positions of two adjacent lines of code.
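The line partition described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function names `newline_positions` and `row_indices` are hypothetical, positions are 0-based, and each newline character is assumed to belong to the line it terminates.

```python
from bisect import bisect_left


def newline_positions(code: str) -> list[int]:
    """Return the 0-based position of every newline character in `code`."""
    return [i for i, ch in enumerate(code) if ch == "\n"]


def row_indices(code: str) -> list[int]:
    """Assign to each character the 0-based index of the line it belongs to.

    Assumption: a newline is the last character of the line it terminates.
    """
    positions = newline_positions(code)
    # bisect_left counts the newlines located strictly before position i,
    # which is the row index of the character at position i.
    return [bisect_left(positions, i) for i in range(len(code))]


code = "a=1\nb=a\n"
print(newline_positions(code))  # [3, 7]
print(row_indices(code))        # [0, 0, 0, 0, 1, 1, 1, 1]
```

The sequence stays one-dimensional; only a per-character row index is added, which avoids the padding that an image-like grid would need.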
[0062] In step S2, an inline attention mask is set to extract the semantic representation of code lines. At the shallow level of the model, the goal is to extract the semantic information of each code line so that dependency analysis can be performed on these lines at a deeper level. To this end, this embodiment restricts attention to operations within each line, using a mask matrix M to achieve this. By decoupling inline dynamics from inter-line relationships, the model can achieve a more structured understanding of each line of code, thereby enabling the modeling of translation invariance between code lines. The mask matrix M is calculated using the following formula:
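[0063] $$M_{ij} = \begin{cases} 0, & \text{if characters } x_i \text{ and } x_j \text{ belong to the same row} \\ -\infty, & \text{otherwise} \end{cases}$$
[0064] where $M_{ij}$ denotes the accessibility between character $x_i$ and character $x_j$.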
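As a rough illustration of the inline attention mask, the sketch below builds a mask from per-character row indices and shows its effect after softmax. The names `inline_mask` and `softmax` are hypothetical, the use of negative infinity as the "value as small as possible" is an assumption, and any causal masking a real autoregressive model would add is left out.

```python
import math


def inline_mask(rows: list[int]) -> list[list[float]]:
    """M[i][j] is 0.0 when characters i and j share a row, else -inf."""
    n = len(rows)
    return [[0.0 if rows[i] == rows[j] else -math.inf for j in range(n)]
            for i in range(n)]


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax; math.exp(-inf) evaluates to 0.0."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


rows = [0, 0, 1, 1, 1]          # two characters on line 0, three on line 1
M = inline_mask(rows)
raw_scores = [1.0] * len(rows)  # identical raw scores for every pair
weights = softmax([s + m for s, m in zip(raw_scores, M[0])])
print(weights)                  # [0.5, 0.5, 0.0, 0.0, 0.0]
```

Because the diagonal of the mask is always 0, every row keeps at least one finite score, so the softmax never divides by zero.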
[0065] Many code snippets, such as functions, classes, or independent modules, exhibit semantic translation invariance, meaning that rearranging these elements generally does not change their underlying logic. However, code language models that rely on fixed-position encoding struggle to capture this invariance, making it difficult to accurately understand the semantics of the code. Nevertheless, sequential relationships exist in source code (e.g., variables must be defined before they can be used), indicating that completely abandoning positional encoding is not an ideal solution. Instead, a more flexible, structured, and coarse-grained approach must be taken to view relationships between characters, balancing global invariance with adherence to local order constraints. Furthermore, positional information lacking a semantic foundation is inherently unreliable. For example, source code often contains non-functional elements such as comments, indentation, and newlines, which enhance readability but, in most cases, do not directly affect program logic. These elements can interfere with positional indexing, making positional encoding more complex and unreliable.
[0066] To this end, this embodiment proposes dependency encoding to capture the hierarchical relationships between lines of code, using the newline character \n as the anchor point for extracting dependencies of subsequent lines of code, since it marks the end of each line of code and the beginning of the next line. The data processing procedure for dependency modeling includes:
[0067] Based on the word embedding vectors, the word embedding $s_k$ of the last character of code line $k$ is taken as the semantic anchor of that line, where the code line comprises the characters belonging to row $k$;
[0068] According to the semantic anchor $s_k$, the dependency encoding of code line $k$ is calculated using a sparse autoencoder (SAE) as follows:
[0069] $$a_k = \sigma(W_{enc}\, s_k + b), \qquad d_k = W_{dec}\, a_k$$
[0070] where $W_{enc}$, $W_{dec}$ and $b$ are learnable parameters of the sparse autoencoder (SAE); $W_{enc}$ is the encoder of the SAE; $W_{dec}$ is the feature dictionary, in which each column represents a single feature; $b$ is the bias; $\sigma$ is the activation function; $a_k$ denotes the activation values of the dictionary features for code line $k$; and $d_k$ denotes the dependency encoding of code line $k$. The dependency encoding is fused with the original word embedding additively, in the manner of an additive positional encoding.
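The encode-then-decode step of the sparse autoencoder can be sketched with toy numbers. This is an illustration under stated assumptions, not the patented model: ReLU is assumed as the activation function (the text only says "activation function"), the weights are hand-picked rather than learned, and `dependency_encoding`, `matvec` and `relu` are hypothetical helper names.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    """Element-wise ReLU (assumed activation function)."""
    return [max(0.0, t) for t in v]


def dependency_encoding(anchor, W_enc, b, W_dec):
    """Return (activations, dependency encoding) for one line's semantic anchor."""
    pre = [p + bi for p, bi in zip(matvec(W_enc, anchor), b)]
    a = relu(pre)          # sparse feature activations
    d = matvec(W_dec, a)   # weighted sum of dictionary columns
    return a, d


# Toy sizes: embedding dimension 2, dictionary of 3 features.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # 3 x 2 encoder
b = [0.0, -0.5, 0.0]                            # encoder bias
W_dec = [[1.0, 0.0, 2.0], [0.0, 1.0, 2.0]]      # 2 x 3, one column per feature
anchor = [1.0, 0.25]                            # embedding of a line's last character

a, d = dependency_encoding(anchor, W_enc, b, W_dec)
print(a)  # [1.0, 0.0, 0.0]  only one feature is active
print(d)  # [1.0, 0.0]       the corresponding dictionary column
```

With only one active feature, the dependency encoding reduces to a single dictionary column, which is the sparse behaviour the sparsity constraint is meant to encourage.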
[0071] This embodiment also introduces a distance-based activation enhancement technique. Compared to location-based attention decay, this mechanism dynamically adjusts the activation values of the activation features in the sparse autoencoder based on the length of the context. This allows the model to focus on key dependencies within a longer context, rather than emphasizing only local information; specifically, this is achieved through the following formula:
[0072]
[0073] where the quantities in the formula are, in order: the activation values of the dictionary features for the code line; a clipping operation that takes the larger of its two arguments; the input length of the model; and the position of the semantic anchor of the code line.
[0074] As context length increases (e.g., in repository-level code generation tasks), the model's attention becomes scattered, leading to performance degradation. Traditional distance-based attention decay methods can avoid this scattering, but they inevitably lose long-distance dependencies. This embodiment instead uses a uniform dependency encoding for all characters within the same line of code; as dependencies grow, the number of characters covered by attention also increases. Consider the embeddings $e_i$ and $e_j$ of two characters, and let $d_i$ and $d_j$ denote the dependency encodings extracted by the SAE for their lines. To obtain the output of the model's attention layer, $e_i + d_i$ and $e_j + d_j$ are each linearly mapped, a dot-product attention score is calculated, and a weighted sum based on the attention scores then gives the intermediate representation of each character; the details are as follows:
[0075] The attention score between characters is calculated based on the word embedding vectors, dependency encoding, and mask matrix M, specifically using the following formula:
[0076] $$\tilde{e}_i = e_i + d_i, \quad \tilde{e}_j = e_j + d_j, \quad q_i = W_Q \tilde{e}_i, \quad k_j = W_K \tilde{e}_j, \quad A_{ij} = q_i^\top k_j + M_{ij}$$
[0077] where $e_i$ is the word embedding of character $x_i$; $d_i$ is the dependency encoding of the code line containing character $x_i$; $\tilde{e}_i$ is the word embedding of character $x_i$ after dependency encoding; $e_j$, $d_j$ and $\tilde{e}_j$ are the corresponding quantities for character $x_j$; $q_i$ is the query vector in the self-attention mechanism; $W_Q$ is the query mapping matrix; $k_j$ is the key vector in the self-attention mechanism; $W_K$ is the key mapping matrix; $A_{ij}$ is the attention score between characters $x_i$ and $x_j$; and $M_{ij}$ is the accessibility between them;
[0078] The intermediate representation of each character is calculated based on the attention scores between the characters, specifically using the following formula:
[0079] $$o_i = \sum_j \mathrm{softmax}_j(A_{ij})\, v_j$$
[0080] where $o_i$ is the intermediate representation of character $x_i$ generated through the self-attention mechanism; $v_j$ is the value vector in the self-attention mechanism, obtained with the value mapping matrix $W_V$; and $\mathrm{softmax}$ is the normalization function.
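The attention computation described above (add the dependency encoding to each word embedding, map to queries and keys, add the mask, normalize, and take a weighted sum of values) can be sketched as below. Several details are assumptions made for illustration: a single head, no scaling of the dot product, value vectors computed from the fused embedding, identity projection matrices, and the hypothetical function name `attend`.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax; math.exp(-inf) evaluates to 0.0."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(E, D, M, W_q, W_k, W_v):
    """E[i]: word embedding; D[i]: dependency encoding of the line holding i."""
    n = len(E)
    H = [[e + d for e, d in zip(E[i], D[i])] for i in range(n)]  # additive fusion
    Q = [matvec(W_q, h) for h in H]
    K = [matvec(W_k, h) for h in H]
    V = [matvec(W_v, h) for h in H]
    outputs = []
    for i in range(n):
        scores = [sum(q * k for q, k in zip(Q[i], K[j])) + M[i][j]
                  for j in range(n)]
        weights = softmax(scores)
        outputs.append([sum(weights[j] * V[j][c] for j in range(n))
                        for c in range(len(V[0]))])
    return outputs


NEG = -math.inf
I2 = [[1.0, 0.0], [0.0, 1.0]]                   # identity projections (toy choice)
E = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]        # three characters
D = [[0.5, 0.0], [0.5, 0.0], [0.0, 0.5]]        # first two share a line
M = [[0.0, 0.0, NEG], [0.0, 0.0, NEG], [NEG, NEG, 0.0]]

out = attend(E, D, M, I2, I2, I2)
print(out)
```

In this toy input the third character is alone on its line, so its output is just its own value vector, while the first two characters average over each other.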
[0081] Intuitively, expanding the attention score yields four terms. The first term represents semantic attention, which captures the semantic interaction between the word embeddings of the two characters. The second and third terms represent cross-attention between semantic content and dependency encoding. The fourth term represents dependency attention, that is, the direct interaction of inter-line dependencies. Although the fourth term directly reflects the dependencies between code lines, in this embodiment characters within the same code line share the same dependency information. This may uniformly raise the attention scores between all characters of strongly dependent lines, unintentionally spreading the model's attention over more characters and causing attention dispersion. To solve this problem, this embodiment suppresses direct positional attention (i.e., the fourth term) and treats its attention score as an independent feature: the attention scores are flattened along the head dimension, the fourth-term scores from each head are concatenated, and the concatenated features are processed by an MLP to alleviate attention dispersion, specifically through the following formula:
[0082]
[0083] where the quantities in the formula are: the attention score between the two characters; a set of multilayer perceptrons; and a dimension concatenation operation.
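The four-term reading of the attention score follows from bilinearity once the dependency encoding is added to the word embedding. The expansion below is a sketch in assumed notation ($e_i$ for the word embedding of character $i$, $d_i$ for the dependency encoding of its line, $W_Q$ and $W_K$ for the query and key mapping matrices); the mask term and any scaling factor are omitted.

```latex
\begin{aligned}
(e_i + d_i)^\top W_Q^\top W_K \,(e_j + d_j)
  &= \underbrace{e_i^\top W_Q^\top W_K\, e_j}_{\text{(1) semantic attention}}
   + \underbrace{e_i^\top W_Q^\top W_K\, d_j}_{\text{(2) content-to-dependency}} \\
  &\quad + \underbrace{d_i^\top W_Q^\top W_K\, e_j}_{\text{(3) dependency-to-content}}
   + \underbrace{d_i^\top W_Q^\top W_K\, d_j}_{\text{(4) dependency attention}}
\end{aligned}
```

Term (4) takes the same value for every pair of characters drawn from the same two lines, which is why it can raise scores uniformly across whole lines and is the term handled separately by the MLP.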
[0084] In this embodiment, model prediction is performed using a multilayer perceptron and a softmax layer, specifically implemented through the following formula:
[0085]
[0086] where the quantities in the formula are: the probability distribution of the model's final prediction; the normalization function (softmax); a set of multilayer perceptrons; and the probability mapping matrix.
[0087] In this embodiment, the code generation model uses the following loss function:
[0088]
[0089] where the quantities in the formula are: the total loss of the model; the cross-entropy loss of the model; the sparsity loss of the model; the hyperparameters of the model; and the activation value of a feature in the sparse autoencoder.
[0090] This loss function not only encourages the model to focus on modeling key relationships and enhances the model's generalization ability, but also enhances the interpretability of the learned features due to the sparsity constraint.
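A loss of this kind, cross-entropy plus a weighted sparsity penalty on the SAE activations, might look like the sketch below. The exact form in the patent is not recoverable from this text, so two assumptions are made and should be read as such: the sparsity term is taken to be an L1 penalty, and a single weighting hyperparameter `lam` is used. All function names are hypothetical.

```python
import math


def cross_entropy(probs, targets):
    """Mean negative log-probability assigned to the target indices."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)


def l1_sparsity(activations):
    """Mean L1 norm of the activation vectors (assumed sparsity penalty)."""
    return sum(sum(abs(a) for a in vec) for vec in activations) / len(activations)


def total_loss(probs, targets, activations, lam=0.1):
    """Cross-entropy plus lam times the sparsity penalty (assumed weighting)."""
    return cross_entropy(probs, targets) + lam * l1_sparsity(activations)


probs = [[0.5, 0.5], [0.25, 0.75]]             # predicted distributions
targets = [0, 1]                               # true next characters
activations = [[1.0, 0.0, 0.0], [0.0, -2.0, 0.0]]

print(total_loss(probs, targets, activations))
```

Raising `lam` pushes more activations toward zero at the cost of prediction accuracy, which is the trade-off the hyperparameter is described as controlling.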
[0091] This embodiment verifies the effectiveness of the code generation method that integrates two-dimensional code representation and dependency encoding through a code modeling task. The code modeling task is to predict the next character in a programming-related context. Perplexity is used to reflect the likelihood between the model's predicted sequence and the target sequence, while accuracy is used to evaluate the proportion of characters correctly predicted by the model. The results are shown in Table 1.
[0092] Table 1: Comparison of perplexity and accuracy between this method and the baseline method
[0093]
[0094] Experiments show that, in code modeling, this method understands and generates long texts better than the baseline methods.
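The two metrics used in the code modeling task can be computed as in the following sketch. It uses the standard definitions (perplexity as the exponential of the mean negative log-probability of the true next characters, accuracy as the fraction of positions predicted correctly); the function names and the toy inputs are illustrative and are not the values of Table 1.

```python
import math


def perplexity(target_probs):
    """exp of the mean negative log-probability assigned to the true characters."""
    nll = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(nll)


def accuracy(predicted, reference):
    """Fraction of positions where the predicted character equals the reference."""
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


print(perplexity([0.5, 0.5, 0.5]))  # close to 2.0
print(accuracy("abcd", "abed"))     # 0.75
```

Lower perplexity and higher accuracy both indicate a better next-character model.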
[0095] In summary, the code generation method proposed in this embodiment, which integrates two-dimensional code representation and dependency encoding, solves the problems of poor generalization, poor code structure awareness, and insufficient extrapolation ability of traditional positional encoding methods in the code generation process. It is the first to propose modeling code fragments as a two-dimensional structure, decomposing code from the one-dimensional modeling methods of traditional language models into vertical-dimensional logical flows and horizontal-dimensional meta-operations. Furthermore, it utilizes a sparse autoencoder (SAE) to capture semantic dependencies between code lines, replacing the traditional positional encoding method based on absolute position numbers. The steps include: a hierarchical Transformer model architecture design, where the shallow layer extracts semantic information of meta-operations by setting attention masks to avoid the influence of irrelevant context; and a deep integration of dependency encoding, where a sparse autoencoder is set to extract a dictionary of feature relationships between code lines, model the dependencies between code lines, and assign corresponding dependency codes. Experiments show that this method significantly outperforms existing methods in tasks such as code modeling, long sequence understanding, functional correctness, and context retrieval.
[0096] Embodiment 2:
[0097] Based on the same inventive concept as Embodiment 1, this embodiment of the invention discloses a code generation system that integrates two-dimensional code representation and dependency encoding, comprising:
[0098] The 2D structure modeling module is used to: model the acquired code data into a 2D structure to obtain 2D code;
[0099] The word embedding module is used to: embed words into the two-dimensional code through the embedding layer to obtain word embedding vectors;
[0100] The inline masking module is used to: perform inline masking on the two-dimensional code using inline attention masking to obtain a mask matrix;
[0101] The dependency modeling module is used to: perform dependency modeling on the word embedding vectors using a sparse autoencoder to obtain the dependency encoding of inter-line dependency relationships in the code;
[0102] The attention calculation module is used to: perform attention calculation using a self-attention mechanism based on the word embedding vector, mask matrix, and dependency encoding to obtain the intermediate representation of each character;
[0103] The model prediction module is used to: perform model prediction based on the intermediate representation of each character through a multilayer perceptron and a softmax layer to obtain the generated code.
[0104] The specific functions of each module described above are explained in the relevant content of the method in Embodiment 1, and will not be repeated here.
[0105] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0106] The present invention is described with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each process and/or block of the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
[0107] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
[0108] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, causing a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
[0109] The embodiments of the present invention have been described above with reference to the accompanying drawings. However, the present invention is not limited to the specific embodiments described above, which are merely illustrative and not restrictive. Guided by the present invention, those skilled in the art can devise many other forms without departing from the spirit of the invention and the scope of the claims, and all of these forms fall within the protection scope of the present invention.
Claims
1. A code generation method integrating two-dimensional code representation and dependency encoding, characterized in that it comprises: modeling the acquired code data into a two-dimensional structure to obtain two-dimensional code; and inputting the two-dimensional code into a trained code generation model, in which: word embedding is performed on the two-dimensional code through an embedding layer to obtain word embedding vectors; inline attention masking is performed on the two-dimensional code to obtain a mask matrix; dependency modeling is performed on the word embedding vectors using a sparse autoencoder to obtain a dependency encoding of the inter-line dependencies in the code; attention is calculated using a self-attention mechanism based on the word embedding vectors, the mask matrix, and the dependency encoding to obtain an intermediate representation of each character; and model prediction is performed through a multilayer perceptron and a softmax layer based on the intermediate representation of each character to obtain the generated code; wherein modeling the acquired code data into a two-dimensional structure includes: acquiring the position of each line break in the code data to obtain a line break position encoding indexed by line break number; and, according to the line break position encoding, for any line index, treating the characters whose positions in the code data lie between the positions of the two adjacent line breaks for that line index as belonging to the same line.
2. The code generation method integrating two-dimensional code representation and dependency encoding according to claim 1, characterized in that the mask matrix M is calculated by a formula in which each entry of M indicates the accessibility between a pair of characters.
3. The code generation method integrating two-dimensional code representation and dependency encoding according to claim 1, characterized in that the data processing procedure for dependency modeling includes: based on the word embedding vectors, taking the word embedding of the last character of each code line as the semantic anchor of that code line, the code line consisting of the characters it contains; and, according to the semantic anchor, calculating the dependency encoding of the code line using a sparse autoencoder (SAE), by a formula defined in terms of: the learnable parameters of the SAE, which are real-valued and of given dimensions; the encoder of the SAE; a feature dictionary; a bias value; an activation function; the activation values of the dictionary features for the code line; and the dependency encoding of the code line.
4. The code generation method integrating two-dimensional code representation and dependency encoding according to claim 3, characterized in that the dependency modeling includes distance-based activation enhancement, implemented by a formula defined in terms of: the activation values of the dictionary features for a code line; a clipping operation that limits a value to a given range; the input length of the model; and the position of the semantic anchor.
5. The code generation method integrating two-dimensional code representation and dependency encoding according to claim 3, characterized in that the data processing procedure for attention calculation includes: calculating the attention score between characters based on the word embedding vectors, the dependency encoding, and the mask matrix M, by a formula defined in terms of: for each of the two characters, its word embedding, the dependency encoding of the code line containing it, and its dependency-encoded word embedding; the query vector and the query mapping matrix of the self-attention mechanism; the key vector and the key mapping matrix of the self-attention mechanism; the attention score between the two characters; and the accessibility between the two characters; and calculating the intermediate representation of each character based on the attention scores between the characters, by a formula defined in terms of: the intermediate representation of the character generated through the self-attention mechanism; the value vector and the value mapping matrix of the self-attention mechanism; and a normalization function.
6. The code generation method integrating two-dimensional code representation and dependency encoding according to claim 5, characterized in that the self-attention mechanism alleviates the attention distraction problem by means of multilayer perceptrons (MLP), by a formula defined in terms of: the attention score between two characters; a set of multilayer perceptrons; and a dimension-wise concatenation operation.
7. The code generation method integrating two-dimensional code representation and dependency encoding according to claim 5, characterized in that the model prediction using a multilayer perceptron and a softmax layer is performed by a formula defined in terms of: the probability distribution of the model's final prediction; a normalization function; a set of multilayer perceptrons; and a probability mapping matrix.
8. The code generation method integrating two-dimensional code representation and dependency encoding according to claim 1, characterized in that the code generation model uses a loss function defined in terms of: the total loss of the model; the cross-entropy loss of the model; the sparsity loss of the model; a hyperparameter of the model; and the activation value of a feature in the sparse autoencoder.
9. A code generation system integrating two-dimensional code representation and dependency encoding, characterized in that it comprises: a 2D structure modeling module, used to model the acquired code data into a two-dimensional structure to obtain two-dimensional code; a word embedding module, used to perform word embedding on the two-dimensional code through the embedding layer to obtain word embedding vectors; an inline masking module, used to apply inline attention masking to the two-dimensional code to obtain a mask matrix; a dependency modeling module, used to perform dependency modeling on the word embedding vectors using a sparse autoencoder to obtain the dependency encoding of the inter-line dependencies in the code; an attention calculation module, used to perform attention calculation using a self-attention mechanism based on the word embedding vectors, the mask matrix, and the dependency encoding to obtain the intermediate representation of each character; and a model prediction module, used to perform model prediction through a multilayer perceptron and a softmax layer based on the intermediate representation of each character to obtain the generated code; wherein modeling the acquired code data into a two-dimensional structure includes: acquiring the position of each line break in the code data to obtain a line break position encoding indexed by line break number; and, according to the line break position encoding, for any line index, treating the characters whose positions in the code data lie between the positions of the two adjacent line breaks for that line index as belonging to the same line.
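The dependency encoding, masked attention, and loss named in claims 3, 5, and 8 can be pictured with the following minimal NumPy sketch. The claim formulas are not reproduced in this text, so the sketch rests on stated assumptions rather than on the patented equations: a ReLU activation in the sparse autoencoder, additive fusion of each character's word embedding with the dependency encoding of its line, scaled dot-product scores, a causal same-line accessibility mask, and an L1 sparsity penalty. All function and parameter names are hypothetical.

```python
import numpy as np


def sae_dependency_encoding(emb, line_ids, W_enc, b_enc, D):
    """Return (activations, dependency encodings), one row per code line."""
    n_lines = int(line_ids.max()) + 1
    # Semantic anchor: word embedding of the last character of each line.
    last = np.array([np.flatnonzero(line_ids == l)[-1] for l in range(n_lines)])
    anchors = emb[last]                                # (n_lines, d)
    acts = np.maximum(anchors @ W_enc + b_enc, 0.0)    # ReLU is an assumption
    dep = acts @ D                                     # mix dictionary features
    return acts, dep


def masked_attention(emb, dep, line_ids, mask, W_q, W_k, W_v):
    """Intermediate representation of each character under the mask."""
    x = emb + dep[line_ids]                            # additive fusion (assumed)
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(q.shape[-1])            # scaling is an assumption
    scores = np.where(mask, scores, -np.inf)           # block inaccessible pairs
    scores -= scores.max(axis=-1, keepdims=True)       # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def total_loss(ce_loss, acts, lam):
    """Cross-entropy plus a weighted L1 sparsity penalty (L1 is assumed)."""
    return ce_loss + lam * np.abs(acts).sum()


rng = np.random.default_rng(0)
line_ids = np.array([0, 0, 0, 1, 1, 1])                # two lines, three chars each
n, d, m = len(line_ids), 8, 16                         # chars, embed dim, dictionary size
emb = rng.normal(size=(n, d))
W_enc, b_enc, D = rng.normal(size=(d, m)), np.zeros(m), rng.normal(size=(m, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
same_line = line_ids[:, None] == line_ids[None, :]
mask = same_line & np.tril(np.ones((n, n), dtype=bool))

acts, dep = sae_dependency_encoding(emb, line_ids, W_enc, b_enc, D)
hidden = masked_attention(emb, dep, line_ids, mask, W_q, W_k, W_v)
loss = total_loss(ce_loss=1.0, acts=acts, lam=0.01)
```

Because every row of the mask keeps at least the diagonal entry, the softmax is always well defined. The distance-based activation enhancement of claim 4 and the MLP-based score adjustment of claim 6 are left out, since their formulas cannot be recovered from the text.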