A code defect detection method based on multi-view graph convolutional neural network

By constructing and cropping code graphs using multi-view graph convolutional neural networks and combining them with attention mechanisms for feature fusion, the problem of insufficient multi-view semantic information fusion in existing technologies is solved, thereby improving the accuracy and robustness of code defect detection.

CN122240446A · Pending · Publication Date: 2026-06-19 · BEIJING INST OF TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: BEIJING INST OF TECH
Filing Date: 2026-02-05
Publication Date: 2026-06-19

IPC: G06F11/3604; G06F8/75; G06N3/045; G06N3/042; G06N3/0464; G06F18/213; G06F18/25; G06F18/243

AI Tagging

Application Domain

Error detection/correction; Biological models

Technology Topics

Data stream; Algorithm

AI Technical Summary

Technical Problem

Existing code defect detection methods fail to fully integrate multi-perspective semantic information, resulting in limited ability of the model to understand complex vulnerability semantics, and the graph structure is noisy, affecting detection accuracy and robustness.

Method used

A multi-view graph convolutional neural network is adopted. By constructing multiple types of code graphs and introducing a semantically aware graph pruning mechanism, abstract syntax trees, control flow graphs and data flow graphs are constructed respectively. Cross-view node alignment and feature learning are performed, and feature fusion is combined with an attention mechanism to improve detection accuracy and robustness.

Benefits of technology

It significantly improves the detection accuracy in complex defect scenarios, reduces the false alarm rate, and enhances the adaptability to different defect types and coding styles.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention belongs to the field of code defect detection technology, specifically relating to a code defect detection method based on a multi-view graph convolutional neural network. It includes: constructing a multi-view code graph, comprising an abstract syntax tree, a control flow graph, and a data flow graph; performing semantically aware pruning operations on the three graph views respectively; aligning cross-view nodes under role consistency constraints and constructing a unified node embedding representation for nodes in each view; constructing independent graph convolutional neural networks to learn features from the code graphs under different views, obtaining view-level representation vectors; and further fusing them to obtain a unified code representation vector, which is then input into a classification module to determine whether the code has defects.

Description

Technical Field

[0001] This invention belongs to the field of code defect detection technology, specifically relating to a code defect detection method based on a multi-view graph convolutional neural network.

Background Technology

[0002] In recent years, deep learning technology has been used for code semantic modeling and defect detection, among which graph neural networks (GNNs) have attracted attention due to their ability to capture structural information between programs. During compilation or interpretation, programs naturally generate various graph structures, such as abstract syntax trees (ASTs), control flow graphs (CFGs), and data flow graphs (DFGs), which respectively describe syntactic structure, path structure, and variable dependencies. However, existing research mostly focuses on only one or a few graph structures, failing to fully integrate multi-perspective semantic information, resulting in limited ability of models to understand complex vulnerability semantics.

[0003] Specifically, while AST captures the hierarchical structure of code, it struggles to represent the process of variable data transfer; CFG can reflect the execution order of statements but lacks descriptions of semantic relationships between statements; DFG can describe the production and dependency relationships of variables but lacks control semantics. A single graph structure cannot fully express the overall behavior of a program, limiting the performance improvement of defect detection models.

[0004] For example, the Devign method (Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks) proposes a code defect detection method. This method takes source code functions as the analysis object. First, it constructs a code attribute graph from the program, unifying the abstract syntax tree, control flow relationships, and data dependencies into a heterogeneous graph structure containing various types of nodes and edges. Based on this, Devign uses graph neural networks to perform feature propagation and aggregation on the entire code graph. By learning the structural and semantic information of nodes and their neighborhoods, it obtains a function-level graph representation and ultimately performs a binary classification to determine whether the code contains vulnerabilities.

[0005] The Devign method enhances code semantic modeling capabilities to some extent by integrating multiple program structural relationships into a single graph model, exhibiting stronger representational power compared to traditional static analysis methods. However, this method directly overlays structural relationships at different semantic levels onto the same graph, failing to differentiate between different types of program semantics and lacking explicit mechanisms for handling graph structural noise. Consequently, it still has certain limitations when facing complex code structures and diverse vulnerability patterns.

[0006] In summary, existing code representation methods often have the following shortcomings:

[0007] First, existing methods typically aggregate different types of code structure relationships, such as abstract syntax relationships, control flow relationships, and data dependency relationships, into the same code graph for representation. Different structural relationships are superimposed on each other in the same graph, resulting in a complex graph structure and making it difficult to distinguish the specific role of each type of relationship in the defect formation process. Secondly, since the above code graphs are usually constructed in full, they do not effectively filter out syntactic structures, control paths and data dependencies that are not related to the semantics of defects. This results in a large number of redundant nodes and edges in the graph representation, with high graph structure noise, which increases the difficulty of feature learning and reduces the accuracy of defect identification. Third, under the single fusion graph modeling method, different types of structural relationships cannot be modeled and analyzed separately, lacking the ability to characterize code features from different perspectives such as syntax structure, control flow behavior and data dependencies, thus making it difficult to fully explore the manifestation characteristics of complex defects at multiple structural levels.

[0008] There is an urgent need for a method that integrates multiple code structure views to achieve comprehensive modeling of code syntax, control semantics, and data dependencies, thereby improving the accuracy and robustness of vulnerability detection.

Summary of the Invention

[0009] To address the aforementioned issues, this invention proposes a code defect detection method based on a multi-view graph convolutional neural network. By constructing multiple types of code graphs, introducing a semantically aware graph pruning mechanism, and utilizing a multi-view graph convolutional model to achieve deep cross-graph semantic fusion, the method improves the accuracy, generalization, and interpretability of code defect detection.

[0010] The technical solution for implementing the present invention is as follows. In a first aspect, the present invention provides a code defect detection method based on a multi-view graph convolutional neural network, the specific process of which is as follows:

Multi-view code graph construction and pruning: For the source code to be analyzed, three graph structures are constructed using functions as the basic input unit: an abstract syntax tree, a control flow graph, and a data flow graph. For the abstract syntax tree, graph pruning is performed based on a predefined set of syntax node types, while maintaining the connectivity of the syntax structure. For the control flow graph, control decision nodes in the program are identified, and continuous basic block paths that only represent linear sequential execution and do not contain control decision information are compressed or merged to achieve graph pruning. For the data flow graph, key variables related to defect risks are identified based on static rules, and data dependencies unrelated to the key variables are removed to achieve graph pruning.

Cross-view node alignment and node embedding construction: Nodes corresponding to the same code entity in the abstract syntax tree, control flow graph, and data flow graph are identified, and the cross-view node alignment relationship is established; a unified node embedding representation is constructed for the nodes in each view.

Feature learning, fusion, and defect discrimination: Independent graph convolutional neural networks are constructed to learn features from the code graphs under different views to obtain view-level representation vectors; these vectors are then fused to obtain a unified code representation vector, which is input into the classification module to determine whether the code has defects.

[0011] Optionally, the present invention, for the abstract syntax tree, identifies key syntax nodes related to conditional judgments, loop control, function calls, memory access, and return statements based on a predefined set of syntax node types. Under the constraint of maintaining the connectivity of the syntax structure, only the key syntax nodes and their necessary upstream and downstream syntax paths are retained to form a semantically focused syntax subtree structure.

[0012] Optionally, for the control flow graph, the present invention identifies control decision nodes in the program, including branch nodes, loop entry nodes, and exception handling-related nodes. Continuous basic block paths that only represent linear sequential execution and do not contain control decision information are compressed or merged to form a pruned control flow graph.

[0013] Optionally, for the data flow graph, the present invention identifies key variables related to defect risks based on static rules, retains only the definition-use dependency chains of the key variables and their associated data flow paths, and removes data dependencies unrelated to the key variables, forming a data flow subgraph centered on the key variables.

[0014] Optionally, the present invention identifies nodes corresponding to the same code entity in the abstract syntax tree, control flow graph, and data flow graph, and assigns functional role labels to the nodes to characterize the functional attributes of the nodes in the program; when nodes from different views are in the same code position and their functional role labels meet preset consistency or compatibility constraints, the cross-view node alignment relationship is established.

[0015] Optionally, the node embedding described in this invention includes two parts of information: node category information and code semantic information. First, based on the type of the node in the corresponding graph view, the node type is one-hot encoded, and different types of nodes are mapped to discrete structural feature vectors, which are used to characterize the functional attributes of the node in the abstract syntax structure, control flow structure, or data dependency structure. Second, the source code fragments or code token sequences corresponding to the nodes are extracted, and the code fragments are encoded using the pre-trained open-source code semantic representation model CodeBERT to generate code semantic embedding vectors corresponding to the nodes, which are used to characterize the contextual semantic information contained in the nodes. Finally, the one-hot encoded vector of the node type is concatenated with the code semantic embedding vector and linearly transformed to form the final embedded representation of the node. This node embedding is then used as the initial feature input of the node in the corresponding view to the subsequent graph convolutional neural network for feature learning.

[0016] Optionally, when constructing independent graph convolutional neural networks to learn features from code graphs under different views and obtain view-level representation vectors, the graph convolutional neural networks perform local feature aggregation and nonlinear transformation on the graph structure, so that each node continuously integrates structural and semantic information within its neighborhood during the layer-by-layer propagation process, thereby realizing the modeling of code context relationships. After obtaining the final node representation under each view, the node-level features are aggregated into a fixed-dimensional view-level graph representation vector through graph-level pooling operations.

[0017] Optionally, the present invention inputs the graph representation of each view into the attention fusion module to adaptively model the relative importance of different views in the current code sample, obtains a unified code representation vector, and inputs it into the classification module to determine whether there are defects in the code.

[0018] Optionally, for the graph representation vector $g_v$ obtained from the $v$-th view, the present invention first calculates its importance score using an attention scoring function:

$$e_v = q^{\top} f(W_a g_v + b_a)$$

[0019] where $W_a$ and $b_a$ are learnable parameters, $q$ is the attention query vector, and $f(\cdot)$ is the activation function. Subsequently, the attention scores of each view are normalized to obtain the corresponding attention weights $\alpha_v$. Based on the attention weights, the graph representation vectors of each view are weighted and fused to obtain a unified code representation vector $h = \sum_v \alpha_v g_v$. Finally, the code representation is input into the classification module, and through a fully connected layer and a Sigmoid activation function, the binary classification result indicating whether the code has defects is output:

$$\hat{y} = \sigma(W_c h + b_c)$$

[0020] where $\hat{y}$ represents the predicted probability that the code contains a defect, $W_c$ and $b_c$ are the learnable parameters of the classification layer, and $\sigma(\cdot)$ represents the Sigmoid function.
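The attention fusion and classification described in this aspect can be illustrated with a minimal pure-Python sketch. It is not the patented implementation: the choice of `tanh` as the activation and softmax as the normalization are assumptions (the text only says "activation function" and "normalized"), and the names `W_a`, `b_a`, `q`, `w_c`, and `b_c` are illustrative labels for the learnable parameters, here set by hand rather than trained.

```python
import math

def attention_fuse(view_vecs, W_a, b_a, q):
    """Score each view vector, normalize the scores, and return (weights, fused vector)."""
    scores = []
    for g in view_vecs:
        # hidden = tanh(W_a g + b_a); tanh is an assumed activation
        hidden = [math.tanh(sum(w * x for w, x in zip(row, g)) + b)
                  for row, b in zip(W_a, b_a)]
        scores.append(sum(qi * hi for qi, hi in zip(q, hidden)))
    # Softmax normalization (assumed), shifted by the max for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(view_vecs[0])
    fused = [sum(a * g[d] for a, g in zip(weights, view_vecs)) for d in range(dim)]
    return weights, fused

def classify(fused, w_c, b_c):
    """Fully connected layer followed by a Sigmoid, giving a defect probability."""
    z = sum(w * x for w, x in zip(w_c, fused)) + b_c
    return 1.0 / (1.0 + math.exp(-z))

# Toy view-level vectors standing in for the AST, CFG, and DFG representations
views = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, fused = attention_fuse(views, W_a=[[1.0, 0.0], [0.0, 1.0]],
                                b_a=[0.0, 0.0], q=[1.0, 1.0])
prob = classify(fused, w_c=[0.5, -0.5], b_c=0.0)
```

With these hand-set parameters the third view receives the largest weight because its score is the highest, which is the adaptive behavior the text contrasts with equal-weight fusion.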

[0021] In a second aspect, the present invention provides a code defect detection device based on a multi-view graph convolutional neural network, comprising:

A multi-view code graph construction and pruning module, used to construct three graph structure representations for the source code to be analyzed, using functions as the basic input unit: an abstract syntax tree, a control flow graph, and a data flow graph. For the abstract syntax tree, graph pruning is performed based on a predefined set of syntax node types, while maintaining the connectivity of the syntax structure. For the control flow graph, control decision nodes in the program are identified, and continuous basic block paths that only represent linear sequential execution and do not contain control decision information are compressed or merged to achieve graph pruning. For the data flow graph, key variables related to defect risks are identified based on static rules, and data dependencies unrelated to the key variables are removed to achieve graph pruning.

A cross-view node alignment and node embedding construction module, used to identify nodes corresponding to the same code entity in the abstract syntax tree, control flow graph, and data flow graph, establish the cross-view node alignment relationship, and build a unified node embedding representation for the nodes in each view.

A feature learning, fusion, and defect discrimination module, which uses independent graph convolutional neural networks to learn features from the code graphs under different views to obtain view-level representation vectors; these vectors are then fused to obtain a unified code representation vector, which is input into the classification module to determine whether the code has defects.

[0022] Beneficial effects: First, this invention can significantly improve the detection accuracy in complex defect scenarios. Compared with existing technologies that rely solely on a single code structure or model multiple relationships in a mixed manner, this invention can comprehensively analyze code behavior from multiple complementary perspectives, effectively alleviating the problem of misjudgment caused by incomplete structural expression, and exhibiting higher discrimination accuracy, especially in complex defect scenarios involving multiple statements and multiple path interactions.

[0023] Second, this invention reduces interference from irrelevant structures, thus decreasing the false positive rate. Existing code graph-based defect detection methods often directly use complete graph structures for modeling, making them susceptible to the influence of numerous nodes and edges unrelated to the defect. This invention suppresses redundant structures during the graph construction phase, allowing the model to focus more on key code regions relevant to the defect, thereby effectively reducing the false positive rate while maintaining detection recall.

[0024] Third, this invention enhances adaptability to different defect types and coding styles. Existing technologies typically employ fixed structural modeling and fusion strategies, resulting in limited adaptability to different defect patterns. This invention, however, can automatically adjust the influence of different code perspectives on the final judgment result based on the structural feature differences manifested in specific code samples, calculating their importance scores through an attention scoring function. This maintains stable detection performance even when facing diverse defect types and different coding styles.

Attached Figure Description

[0025] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0026] Figure 1 is a flowchart of the method of the present invention; Figure 2 is a schematic diagram of node embedding.

Detailed Implementation

[0027] The embodiments of the present invention will now be described in detail with reference to the accompanying drawings.

[0028] It should be noted that, unless otherwise specified, the following embodiments and features can be combined with each other; and, based on the embodiments of this disclosure, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this disclosure.

[0029] It should be noted that various aspects of embodiments within the scope of the appended claims are described below. It will be apparent that the aspects described herein can be embodied in a wide variety of forms, and any particular structure and / or function described herein is merely illustrative. Based on this disclosure, those skilled in the art will understand that one aspect described herein can be implemented independently of any other aspect, and two or more of these aspects can be combined in various ways. For example, any number of aspects set forth herein can be used to implement the device and / or practice the method. Additionally, this device and / or method can be implemented using structures and / or functionalities other than one or more of the aspects set forth herein.

[0030] This application proposes a code defect detection method based on multi-view graph structure representation and attention fusion. The method constructs three code graph views from the source code: an abstract syntax tree, a control flow graph, and a data flow graph. A semantically aware pruning strategy is introduced during the graph construction phase, and node alignment relationships under role consistency constraints are established between the multiple views. Subsequently, graph embedding learning is performed on each view through a multi-channel graph neural network, and an attention mechanism is used to achieve adaptive fusion of multi-view features, ultimately outputting a binary classification result for code defects. The specific technical solution includes the following steps, as shown in Figure 1.

[0031] Step 1: Constructing the multi-view code graph. For the source code to be analyzed, three graph structures are constructed, namely an abstract syntax tree, a control flow graph, and a data flow graph, using functions as the basic input unit. The abstract syntax tree depicts the syntactic organization of the code and is more sensitive to dangerous API uses, insecure syntax structures, and code pattern defects that violate safe coding standards, such as CWE-783 (Operator Precedence Logic Error), CWE-480 (Use of Incorrect Operator), and CWE-242 (Use of Inherently Dangerous Function). The control flow graph depicts the execution paths and control transfer relationships of a program, making it more suitable for detecting defects that originate from certain paths, such as CWE-457 (Use of Uninitialized Variable), CWE-476 (NULL Pointer Dereference), and CWE-252 (Unchecked Return Value). The data flow graph depicts the definition and usage dependencies between variables, making it more suitable for detecting vulnerabilities related to data propagation and state dependencies, such as CWE-120 (Buffer Overflow), CWE-190 (Integer Overflow or Wraparound), and CWE-416 (Use After Free). These three types of graphs reflect the structural characteristics of code from different perspectives, providing a foundation for subsequent multi-perspective feature learning and defect detection.
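As a rough illustration of what two of these views contain, the sketch below uses Python's standard `ast` module on a small Python function. This is an assumption made for convenience: the patent does not name a parser or a target language, and the CWE examples suggest C-like code. The helper names `ast_edges` and `def_use_edges` are hypothetical, and the data-flow part only handles straight-line code with no branches.

```python
import ast

SOURCE = """
def f(n):
    buf = [0] * n
    i = n + 1
    buf[i] = 1
    return buf
"""

def ast_edges(source):
    """Parent-to-child edges of the syntax tree, labeled by node type."""
    tree = ast.parse(source)
    return [(type(parent).__name__, type(child).__name__)
            for parent in ast.walk(tree)
            for child in ast.iter_child_nodes(parent)]

def def_use_edges(source):
    """(definition line, use line, variable) edges for a straight-line function body."""
    func = ast.parse(source).body[0]
    # Parameters count as definitions on the function's own line
    last_def = {arg.arg: func.lineno for arg in func.args.args}
    edges = set()
    for stmt in func.body:
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        # Record uses before definitions so `x = x + 1` links to the earlier x
        for n in names:
            if isinstance(n.ctx, ast.Load) and n.id in last_def:
                edges.add((last_def[n.id], stmt.lineno, n.id))
        for n in names:
            if isinstance(n.ctx, ast.Store):
                last_def[n.id] = stmt.lineno
    return sorted(edges)

AST_EDGES = ast_edges(SOURCE)
DFG_EDGES = def_use_edges(SOURCE)
```

The edge from the definition of `i` to the write `buf[i] = 1` is the kind of index dependency that the data flow view exposes and the syntax tree alone does not.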

[0032] Step 2: Semantically aware multi-view graph pruning. After the initial graphs are constructed, semantically aware pruning operations are performed on the three graph views to reduce redundant structures that are not related to the semantics of code defects.

[0033] For the abstract syntax tree (AST), based on a predefined set of syntax node types, key syntax nodes related to conditional statements, loop control, function calls, memory access, and return statements are identified. While maintaining the connectivity of the syntax structure, only these key nodes and their necessary upstream and downstream syntax paths are retained, thus forming a semantically focused syntax subtree structure. Specifically: First, the syntax types to be retained are defined, such as conditional control, memory-related, and function calls. The AST is traversed to search for these nodes, retaining their child nodes and the shortest syntax path to the root, ultimately forming the pruned AST.
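The AST pruning rule above (keep key nodes, their child nodes, and the path to the root) can be sketched as follows. The tree encoding (a node-type table plus a parent table), the node type names, and the set `KEY_TYPES` are all hypothetical choices for illustration; the patent does not list the predefined type set.

```python
# Hypothetical set of syntax node types to retain
KEY_TYPES = {"If", "While", "Call", "Subscript", "Return"}

def prune_ast(node_types, parent):
    """Return the ids of nodes kept after pruning.

    node_types: node id -> type name
    parent:     node id -> parent id (the root maps to None)
    """
    children = {}
    for node, par in parent.items():
        if par is not None:
            children.setdefault(par, []).append(node)
    keep = set()
    for node, node_type in node_types.items():
        if node_type not in KEY_TYPES:
            continue
        keep.add(node)
        keep.update(children.get(node, []))   # retain the key node's child nodes
        par = parent[node]
        while par is not None:                # retain the path up to the root
            keep.add(par)
            par = parent[par]
    return keep

NODE_TYPES = {0: "FunctionDef", 1: "Assign", 2: "Constant", 3: "If", 4: "Compare",
              5: "Expr", 6: "Call", 7: "Name", 8: "Assign", 9: "Constant"}
PARENT = {0: None, 1: 0, 2: 1, 3: 0, 4: 3, 5: 0, 6: 5, 7: 6, 8: 0, 9: 8}

kept = prune_ast(NODE_TYPES, PARENT)   # {0, 3, 4, 5, 6, 7}
```

Because every kept node's ancestors are also kept, the result is a connected subtree rooted at the original root, which matches the connectivity constraint in the text.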

[0034] For the control flow graph, the control decision nodes in the program are identified, including branch nodes, loop entry nodes, and exception handling-related nodes. Continuous basic block paths that only represent linear sequential execution and do not contain control decision information are compressed or merged, thereby highlighting the key control structures related to program execution branches and path selection, weakening the noise interference of sequential execution structures, and forming a pruned control flow graph.
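One way to read the control flow pruning is as chain contraction: a node is merged into its predecessor when the predecessor has exactly one successor, the node has exactly one predecessor, and neither is a control decision node. The sketch below follows that reading, which is an assumption about the merge criterion; the graph encoding and the function name `compress_cfg` are hypothetical.

```python
def compress_cfg(succ, decision):
    """Merge linear chains of non-decision nodes.

    succ:     node -> list of successors (every node must appear as a key)
    decision: set of control decision nodes that must never be merged
    Returns (compressed successor map, merged-node membership per surviving node).
    """
    succ = {u: list(vs) for u, vs in succ.items()}
    members = {u: [u] for u in succ}
    changed = True
    while changed:
        changed = False
        preds = {u: [] for u in succ}
        for u, vs in succ.items():
            for v in vs:
                preds[v].append(u)
        for u in list(succ):
            if u in decision or len(succ[u]) != 1:
                continue
            v = succ[u][0]
            if v == u or v in decision or len(preds[v]) != 1:
                continue
            succ[u] = succ.pop(v)            # u inherits v's outgoing edges
            members[u] += members.pop(v)     # remember which blocks were merged
            changed = True
            break                            # predecessor map is stale, rebuild it
    return succ, members

CFG = {"entry": ["a"], "a": ["b"], "b": ["if"], "if": ["c", "d"],
       "c": ["e"], "d": ["e"], "e": ["exit"], "exit": []}
compressed, members = compress_cfg(CFG, decision={"if"})
```

In this toy graph the sequence `entry`, `a`, `b` collapses into one node while the branch node `if` and the join at `e` are preserved.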

[0035] For the data flow graph, key variables related to defect risks are identified based on static rules, including pointer variables, array index variables, length or size-related variables, function return values, and external input parameters. Only the definition-use dependency chains of these key variables and their associated data flow paths are retained, and data dependencies that are not related to the key variables are removed, forming a data flow subgraph centered on the key variables.
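A minimal sketch of the data flow pruning, assuming that "related to the key variables" means reachable from a key variable by following dependency edges forward or backward. That reachability criterion, the edge encoding, and the names `reach` and `prune_dfg` are assumptions for illustration; the static rules that select key variables are not modeled and the key set is passed in directly.

```python
from collections import deque

def reach(start, adj):
    """All nodes reachable from the start set by following adj edges."""
    seen, queue = set(start), deque(start)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def prune_dfg(edges, key_vars):
    """Keep only dependency edges lying on chains through the key variables."""
    forward, backward = {}, {}
    for u, v in edges:
        forward.setdefault(u, []).append(v)
        backward.setdefault(v, []).append(u)
    keep = reach(key_vars, forward) | reach(key_vars, backward)
    return [(u, v) for u, v in edges if u in keep and v in keep]

# (definition, use) dependency edges; "idx" plays the array index variable
EDGES = [("n", "len"), ("len", "idx"), ("idx", "buf[idx]"), ("a", "b"), ("b", "c")]
pruned = prune_dfg(EDGES, key_vars={"idx"})
```

The chain from `n` through `len` to the indexed write survives, while the unrelated `a`, `b`, `c` dependencies are removed.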

[0036] Through the semantically aware pruning operation described above, the graph structure of different views can be significantly reduced in terms of structural complexity while retaining the core semantic information.

[0037] Step 3: Cross-view node alignment under role consistency constraints. Based on the multi-view code graph structure after semantically aware pruning, this invention establishes cross-view node alignment relationships for nodes from the same source code location, in order to ensure consistency in node feature representations across different views and to guarantee that the graph representation tensors input to subsequent models have the same dimension. Through this cross-view node alignment, corresponding nodes in the abstract syntax tree view, control flow graph view, and data flow graph view can adopt a common node embedding construction method, thereby providing the multi-view graph convolutional neural networks with consistent node feature input dimensions.

[0038] Specifically, based on information such as source code line numbers, statement identifiers, or the location of code snippets in the source file, nodes corresponding to the same code entity in the abstract syntax tree, control flow graph, and data flow graph are identified, and functional role labels are assigned to these nodes to characterize their functional attributes in the program. These functional role labels include at least control roles, data definition roles, data usage roles, memory operation roles, and function call roles.

[0039] The cross-view node alignment relationship is established only when nodes from different views are in the same code position and their functional role labels meet the preset consistency or compatibility constraints. This introduces role consistency constraints between multiple views, avoids incorrect alignment of semantically inconsistent nodes, and ensures the consistency of the embedded representation of multi-view nodes at both the semantic and dimensional levels.
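The alignment rule (same code position and compatible role labels) can be sketched as below. Using the source line number as the position key, the role label strings, and the contents of the `COMPATIBLE` table are all hypothetical; the patent states that preset consistency or compatibility constraints exist but does not enumerate them.

```python
from itertools import combinations

# Hypothetical pairs of distinct roles that are still treated as compatible
COMPATIBLE = {
    frozenset({"data_definition", "memory_operation"}),
    frozenset({"data_use", "memory_operation"}),
}

def roles_match(role_a, role_b):
    """Roles are consistent if equal, or compatible if listed in COMPATIBLE."""
    return role_a == role_b or frozenset({role_a, role_b}) in COMPATIBLE

def align(views):
    """views: view name -> list of (node id, source line, role label)."""
    pairs = []
    for (name_a, nodes_a), (name_b, nodes_b) in combinations(views.items(), 2):
        for id_a, line_a, role_a in nodes_a:
            for id_b, line_b, role_b in nodes_b:
                if line_a == line_b and roles_match(role_a, role_b):
                    pairs.append(((name_a, id_a), (name_b, id_b)))
    return pairs

VIEWS = {
    "ast": [("a1", 3, "control"), ("a2", 5, "memory_operation")],
    "cfg": [("c1", 3, "control"), ("c2", 5, "function_call")],
    "dfg": [("d1", 5, "data_use")],
}
alignment = align(VIEWS)
```

Here `a2` and `c2` share line 5 but are not aligned, because their roles are neither equal nor listed as compatible, which is the mis-alignment the role constraint is meant to prevent.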

[0040] Step 4: Constructing the node embedding. Based on the multi-view code graph after semantically aware pruning and cross-view node alignment, a unified node embedding representation is constructed for the nodes in each view to serve as the input features for the subsequent graph convolutional neural networks.

[0041] Node embedding includes two parts of information: node category information and code semantic information. The overall process is shown in Figure 2. First, the node type is encoded based on the type of the node in the corresponding graph view. One-hot encoding is used to map different types of nodes into discrete structural feature vectors, which are used to characterize the functional attributes of the node in the abstract syntax structure, control flow structure, or data dependency structure.

[0042] Second, the source code fragments or code token sequences corresponding to the nodes are extracted, and the code fragments are encoded using the pre-trained open-source code semantic representation model CodeBERT to generate code semantic embedding vectors corresponding to the nodes, which are used to characterize the contextual semantic information contained in the nodes.

[0043] Finally, the one-hot encoded vector of the node type is concatenated with the code semantic embedding vector and linearly transformed to form the final embedded representation of the node. This node embedding is then used as the initial feature input of the node in the corresponding view to the subsequent graph convolutional neural network for feature learning.
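The concatenate-then-project construction can be sketched in pure Python. To keep the example self-contained, the CodeBERT embedding is replaced by a short hand-written placeholder vector, so no model is loaded; the node type vocabulary `NODE_TYPES`, the output dimension, and the weight values are likewise hypothetical.

```python
# Hypothetical node type vocabulary for one view
NODE_TYPES = ["If", "Call", "Assign", "Return"]

def one_hot(node_type):
    """Discrete structural feature vector for the node type."""
    return [1.0 if t == node_type else 0.0 for t in NODE_TYPES]

def linear(x, weight, bias):
    """y = weight @ x + bias, with weight given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def node_embedding(node_type, semantic_vec, weight, bias):
    """Concatenate the type one-hot with the semantic vector, then project."""
    return linear(one_hot(node_type) + semantic_vec, weight, bias)

# Placeholder for the code semantic embedding of the node's code fragment
semantic_vec = [0.2, -0.1, 0.5]
# Hand-set projection from 4 + 3 input dimensions to 2 output dimensions
weight = [[1.0] * 7,
          [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]]
bias = [0.0, 0.0]

embedding = node_embedding("Call", semantic_vec, weight, bias)
```

Since every view uses the same output dimension after the projection, nodes from the three graphs enter their respective networks with features of identical size, as Step 3 requires.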

[0044] Step 5: Multi-channel graph embedding and feature learning. For the pruned and node-aligned abstract syntax tree, control flow graph, and data flow graph, independent graph convolutional neural networks are constructed to learn features from the different views of the code graph, fully exploring the differences in code syntax structure, control behavior, and data dependencies. Graph convolutional neural networks perform local feature aggregation and nonlinear transformations on the graph structure, enabling each node to continuously integrate structural and semantic information from its neighborhood during layer-by-layer propagation, thereby modeling code context relationships.

[0045] In the $(l+1)$-th graph convolution layer, the node representation is updated by normalizing and weighting the features of the node itself and its neighboring nodes, and then applying a linear transformation and a nonlinear mapping. The calculation process is shown in Formula 1.

[0046] $$H_v^{(l+1)} = \phi\left(\tilde{D}_v^{-1/2}\,\tilde{A}_v\,\tilde{D}_v^{-1/2}\,H_v^{(l)}\,W_v^{(l)}\right) \tag{1}$$

[0047] where $H_v^{(l)}$ is the node representation matrix of the $l$-th layer; $\tilde{A}_v = A_v + I$ represents the adjacency matrix after adding self-loops, used to preserve node information during feature propagation; $\tilde{D}_v$ is the corresponding degree matrix, used to symmetrically normalize the adjacency relationships and alleviate the problem of inconsistent feature scales caused by differences in node degree; $W_v^{(l)}$ is the learnable parameter matrix of the $l$-th layer, used for linear mapping of the aggregated features; $\phi(\cdot)$ is a nonlinear activation function used to enhance the expressive power of the model; and $v$ indicates the view index.
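A single graph convolution layer of the kind described above (self-loops, symmetric degree normalization, linear map, nonlinearity) can be written out in pure Python for a tiny graph. ReLU is an assumed choice of activation, since the text only calls it a nonlinear activation function, and the weight matrix is set by hand instead of being learned.

```python
import math

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def gcn_layer(adj, feats, weight):
    """One layer: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adj)
    # Add self-loops so each node keeps its own features
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    # Symmetric normalization by the degrees of both endpoints
    a_hat = [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
             for i in range(n)]
    out = matmul(matmul(a_hat, feats), weight)
    return [[max(0.0, x) for x in row] for row in out]   # ReLU (assumed)

# Two nodes joined by one undirected edge
adj = [[0.0, 1.0], [1.0, 0.0]]
feats = [[1.0, 0.0], [0.0, 1.0]]
weight = [[1.0, -2.0], [1.0, 1.0]]

updated = gcn_layer(adj, feats, weight)   # [[1.0, 0.0], [1.0, 0.0]]
```

In the multi-view setting each of the three graphs would have its own `weight` matrices, so parameters are not shared between the syntax, control flow, and data flow views.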

[0048] Through multi-layer graph convolution operations, node representations are progressively expanded from local neighborhoods to a wider structural context, enabling the model to simultaneously capture the local features of nodes and their positional relationships within the overall graph structure. Since abstract syntax trees, control flow graphs, and data flow graphs differ significantly in structural form and semantic emphasis, each view is modeled using graph convolutional networks with independent parameters, thus avoiding interference between different structural information during the feature learning stage.

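[0049] After the final node representations $h_{v,i}^{(L)}$ (node $i$ of view $v$ after the last layer $L$) are obtained under each view, the node-level features are aggregated into a fixed-dimensional view-level graph representation vector $g_v$ through a graph-level pooling operation over the node set $\mathcal{V}_v$ of that view, as shown in Formula 2.

[0050] $$g_v = \mathrm{Pool}\left(\left\{ h_{v,i}^{(L)} : i \in \mathcal{V}_v \right\}\right) \tag{2}$$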

[0051] The graph-level pooling operation performs global aggregation of node features to obtain a vector representation that can characterize the structural features of the entire code segment, providing a unified input format for subsequent multi-view feature fusion and code defect identification.
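The pooling step can be illustrated with a short sketch. The text does not state which pooling operator is used, so mean pooling is assumed here purely for illustration; the function name `mean_pool` and the sample values are likewise hypothetical.

```python
def mean_pool(node_feats):
    """Average node vectors column-wise into one graph-level vector."""
    if not node_feats:
        raise ValueError("graph has no nodes")
    n = len(node_feats)
    dim = len(node_feats[0])
    return [sum(row[d] for row in node_feats) / n for d in range(dim)]

# Two graphs with different node counts yield vectors of the same dimension.
small = [[1.0, 2.0], [3.0, 4.0]]
large = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, 1.0]]
g_small = mean_pool(small)  # [2.0, 3.0]
g_large = mean_pool(large)  # [1.0, 1.0]
```

Whatever operator is chosen, the property that matters for the later fusion step is that the output dimension does not depend on the number of nodes in the graph.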

[0052] Step 6: Multi-view feature fusion and defect identification based on an attention mechanism. After the view-level graph representation vectors corresponding to the abstract syntax tree, control flow graph, and data flow graph are obtained, the graph representation of each view is input into the attention fusion module to adaptively model the relative importance of the different views for the current code sample. Through the attention mechanism, the model can dynamically allocate the contribution weight of each view to the final judgment based on the differences in code structure features, thereby avoiding simple equal-weight fusion of information from different views.

[0053] Specifically, for the graph representation vector $g_v$ obtained from the $v$-th view, its importance score $e_v$ is first calculated using the attention scoring function, as shown in Formula 3.

[0054] $$e_v = q^{\top}\,\phi\left(W_a\, g_v + b_a\right) \tag{3}$$

[0055] where $W_a$ and $b_a$ are learnable parameters, $\phi(\cdot)$ is an activation function, and $q$ is the attention query vector, used to measure the correlation between the features of each view and the defect discrimination task.

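[0056] Subsequently, the attention scores of the views are normalized to obtain the corresponding attention weights, as shown in Formula 4.

[0057] $$\alpha_v = \frac{\exp(e_v)}{\sum_{u}\exp(e_u)} \tag{4}$$

[0058] where $e_v$ is the attention score of the $v$-th view, $\alpha_v$ denotes the importance weight of the $v$-th view in the current code sample, the index $u$ ranges over all views, and the weights of all views sum to 1.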
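The scoring and normalization steps can be sketched as follows. The sketch assumes a scoring function of the form query vector dotted with an activated linear map of the view vector, uses tanh as the activation, and uses softmax for normalization; the activation is not fixed by the text, and all parameter values and function names here are hypothetical.

```python
import math

def attention_score(g, w, b, q):
    """Score one view vector as q . tanh(W g + b) (tanh is an assumption)."""
    hidden = [math.tanh(sum(w_row[j] * g[j] for j in range(len(g))) + b_i)
              for w_row, b_i in zip(w, b)]
    return sum(q_i * h_i for q_i, h_i in zip(q, hidden))

def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical view-level vectors for the AST, CFG, and DFG views.
views = {"ast": [0.2, 0.1], "cfg": [0.9, 0.4], "dfg": [0.5, 0.5]}
w = [[1.0, 0.0], [0.0, 1.0]]  # illustrative stand-in for learned weights
b = [0.0, 0.0]
q = [1.0, 1.0]

scores = [attention_score(g, w, b, q) for g in views.values()]
weights = softmax(scores)
```

With these toy values the control flow view receives the largest weight, which shows how the weighting can differ per sample instead of being fixed at one third per view.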

[0059] Based on the attention weights $\alpha_v$, the graph representation vectors $g_v$ of the views are weighted and fused to obtain a unified code representation vector $z$; the calculation is shown in Formula 5.

[0060] $$z = \sum_{v} \alpha_v\, g_v \tag{5}$$

[0061] The fused code representation vector $z$ comprehensively characterizes the code from multiple perspectives, including syntactic structure, control behavior, and data dependencies. Finally, the code representation is input into the classification module, which outputs, through a fully connected layer and a Sigmoid activation function, a binary classification result indicating whether the code has defects. The calculation is shown in Equation 6.

[0062] $$\hat{y} = \mathrm{Sigmoid}\left(W_c\, z + b_c\right) \tag{6}$$

[0063] where $\hat{y}$ is the predicted probability that the code contains a defect, $W_c$ and $b_c$ are the learnable parameters of the classification layer, and $\mathrm{Sigmoid}(\cdot)$ is the Sigmoid function.
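A minimal sketch of the weighted fusion and the final classification follows. The attention weights, classifier parameters, and the 0.5 decision threshold are assumed values chosen for illustration; the text specifies only a fully connected layer followed by a Sigmoid function.

```python
import math

def fuse(view_vectors, weights):
    """Weighted sum of the view-level vectors."""
    dim = len(view_vectors[0])
    return [sum(a * g[d] for a, g in zip(weights, view_vectors))
            for d in range(dim)]

def predict_defect(z, w_c, b_c):
    """Single-output fully connected layer followed by a sigmoid."""
    logit = sum(w * x for w, x in zip(w_c, z)) + b_c
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical view vectors (AST, CFG, DFG) and attention weights summing to 1.
view_vectors = [[0.2, 0.1], [0.9, 0.4], [0.5, 0.5]]
weights = [0.2, 0.5, 0.3]

z = fuse(view_vectors, weights)
p = predict_defect(z, w_c=[1.0, -1.0], b_c=0.0)
is_defective = p >= 0.5  # the threshold is an assumption, not from the source
```

Because the weights sum to 1, the fused vector stays on the same scale as the individual view vectors, so the classifier input does not grow with the number of views.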

[0064] Compared with the prior art, the present invention has the following technical advantages: First, a semantically aware multi-view code graph pruning mechanism. After constructing the abstract syntax tree, control flow graph, and data flow graph, this invention introduces a graph pruning strategy based on code semantics and functional roles to remove nodes and edges that are irrelevant to defect detection or contribute little, thereby reducing graph structure noise and highlighting key semantic relationships.

[0065] Second, independent modeling and parallel embedding of multi-view code graphs. The pruned abstract syntax tree, control flow graph, and data flow graph are modeled as independent views, and a multi-channel graph convolutional neural network is used to perform parallel embedding learning on each view to characterize code defect features from different structural levels.

[0066] Third, adaptive fusion and discrimination of multi-view features based on attention mechanism. By assigning adaptive weights to the graph-level representations of different views through attention mechanism, weighted fusion of multi-view features is achieved, and binary classification of defects is completed based on the fused code representation.

[0067] In summary, the above are merely preferred embodiments of the present invention and are not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A code defect detection method based on a multi-view graph convolutional neural network, characterized in that the specific process is as follows: Multi-view code graph construction and pruning: for the source code to be analyzed, using functions as the basic input unit, three graph structures are constructed: an abstract syntax tree, a control flow graph, and a data flow graph; for the abstract syntax tree, graph pruning is performed based on a predefined set of syntax node types while maintaining the connectivity of the syntax structure; for the control flow graph, control decision nodes in the program are identified, and consecutive basic-block paths that only represent linear sequential execution and contain no control decision information are compressed or merged to achieve graph pruning; for the data flow graph, key variables related to defect risks are identified based on static rules, and data dependencies unrelated to the key variables are removed to achieve graph pruning. Cross-view node alignment and node embedding construction: nodes corresponding to the same code entity in the abstract syntax tree, control flow graph, and data flow graph are identified, and the cross-view node alignment relationship is established; a unified node embedding representation is constructed for the nodes in each view. Feature learning, fusion, and defect discrimination: independent graph convolutional neural networks are constructed to learn features from the code graphs under the different views to obtain view-level representation vectors; these vectors are fused to obtain a unified code representation vector, which is input into the classification module to determine whether the code has defects.

2. The code defect detection method based on a multi-view graph convolutional neural network according to claim 1, characterized in that, for the abstract syntax tree, key syntax nodes related to conditional statements, loop control, function calls, memory access, and return statements are identified based on the predefined set of syntax node types; while maintaining the connectivity of the syntax structure, only the key syntax nodes and their necessary upstream and downstream syntax paths are retained to form a semantically focused syntax subtree structure.

3. The code defect detection method based on a multi-view graph convolutional neural network according to claim 1, characterized in that, for the control flow graph, the control decision nodes in the program are identified, including branch nodes, loop entry nodes, and exception-handling-related nodes; consecutive basic-block paths that only represent linear sequential execution and contain no control decision information are compressed or merged to form a pruned control flow graph.

4. The code defect detection method based on a multi-view graph convolutional neural network according to claim 1, characterized in that, for the data flow graph, key variables related to defect risk are identified based on static rules; only the definition-use dependency chains of the key variables and their associated data flow paths are retained, and data dependencies unrelated to the key variables are removed, forming a data flow subgraph centered on the key variables.

5. The code defect detection method based on a multi-view graph convolutional neural network according to claim 1, characterized in that, The process involves identifying nodes in the abstract syntax tree, control flow graph, and data flow graph that correspond to the same code entity, and assigning functional role labels to these nodes to characterize their functional attributes within the program. When nodes from different views are in the same code position and their functional role labels meet preset consistency or compatibility constraints, the cross-view node alignment relationship is established.

6. The code defect detection method based on a multi-view graph convolutional neural network according to claim 1, characterized in that the node embedding includes two parts of information: node category information and code semantic information. First, based on the type of the node in the corresponding graph view, the node type is one-hot encoded, and different types of nodes are mapped to discrete structural feature vectors, which characterize the functional attributes of the node in the abstract syntax structure, control flow structure, or data dependency structure. Second, the source code fragments or code token sequences corresponding to the nodes are extracted, and the code fragments are encoded using the pre-trained open-source code semantic representation model CodeBERT to generate code semantic embedding vectors corresponding to the nodes, which characterize the contextual semantic information contained in the nodes. Finally, the one-hot encoded vector of the node type is concatenated with the code semantic embedding vector and linearly transformed to form the final embedded representation of the node. This node embedding is then used as the initial feature input of the node in the corresponding view to the subsequent graph convolutional neural network for feature learning.

7. The code defect detection method based on a multi-view graph convolutional neural network according to claim 6, characterized in that, The method involves constructing independent graph convolutional neural networks to learn features from code graphs under different views and obtain view-level representation vectors. The graph convolutional neural networks perform local feature aggregation and nonlinear transformation on the graph structure, enabling each node to continuously integrate structural and semantic information within its neighborhood during the layer-by-layer propagation process. This achieves the modeling of code context relationships. After obtaining the final node representations under each view, the node-level features are aggregated into fixed-dimensional view-level graph representation vectors through graph-level pooling operations.

8. The code defect detection method based on a multi-view graph convolutional neural network according to claim 7, characterized in that, The graph representation vectors of each view are input into the attention fusion module to adaptively model the relative importance of different views in the current code sample, obtain a unified code representation vector, and input it into the classification module to determine whether there are defects in the code.

9. The code defect detection method based on a multi-view graph convolutional neural network according to claim 1, characterized in that, for the graph representation vector $g_v$ obtained from the $v$-th view, its importance score is first calculated using an attention scoring function: $e_v = q^{\top}\phi(W_a g_v + b_a)$, where $W_a$ and $b_a$ are learnable parameters, $q$ is the attention query vector, and $\phi(\cdot)$ is an activation function; subsequently, the attention scores of the views are normalized to obtain the corresponding attention weights $\alpha_v$; based on the attention weights, the graph representation vectors of the views are weighted and fused to obtain a unified code representation vector $z = \sum_{v}\alpha_v g_v$; finally, the code representation is input into the classification module, and through a fully connected layer and a Sigmoid activation function, the binary classification result indicating whether the code has defects is output: $\hat{y} = \mathrm{Sigmoid}(W_c z + b_c)$, where $\hat{y}$ is the predicted probability that the code contains a defect, $W_c$ and $b_c$ are the learnable parameters of the classification layer, and $\mathrm{Sigmoid}(\cdot)$ is the Sigmoid function.

10. A code defect detection device based on a multi-view graph convolutional neural network, characterized in that it comprises: a multi-view code graph construction and pruning module, used to construct three graph structures (an abstract syntax tree, a control flow graph, and a data flow graph) for the source code to be analyzed, using functions as the basic input unit; for the abstract syntax tree, graph pruning is performed based on a predefined set of syntax node types while maintaining the connectivity of the syntax structure; for the control flow graph, control decision nodes in the program are identified, and consecutive basic-block paths that only represent linear sequential execution and contain no control decision information are compressed or merged to achieve graph pruning; for the data flow graph, key variables related to defect risks are identified based on static rules, and data dependencies unrelated to the key variables are removed to achieve graph pruning; a cross-view node alignment and node embedding construction module, used to identify nodes corresponding to the same code entity in the abstract syntax tree, control flow graph, and data flow graph, to establish the cross-view node alignment relationship, and to construct a unified node embedding representation for the nodes in each view; and a feature learning, fusion, and defect discrimination module, which uses independent graph convolutional neural networks to learn features from the code graphs under the different views to obtain view-level representation vectors; these vectors are fused to obtain a unified code representation vector, which is input into the classification module to determine whether the code has defects.