Binary function similarity detection method using hypergraph twin neural network
By constructing directed heterogeneous hypergraphs and deep hypergraph twin neural networks, the problem of binary function similarity detection across compilation environments and architectures is solved, effectively capturing common semantic and structural features of functions and improving detection accuracy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING INST OF TECH
- Filing Date
- 2023-12-06
- Publication Date
- 2026-06-23
Abstract
Description
Technical Field
[0001] This invention relates to a binary function similarity detection method using a hypergraph twin neural network, belonging to the field of computer and information science and technology.
Background Technology
[0002] Binary function similarity detection is a key technology in security fields such as vulnerability detection, malware detection, and patch analysis. To improve software development efficiency, code is reused extensively in open-source projects, introducing potential security risks. Vulnerabilities in a code snippet can spread widely through code reuse, causing significant harm. Due to software intellectual property protection, source code is often difficult to obtain. Binary function similarity detection methods compare binary target functions with known vulnerable binary functions. This allows for automated analysis of existing projects for known vulnerable functions even without source code, possessing significant theoretical and practical value for effectively monitoring and analyzing the reuse of vulnerable functions in projects.
[0003] In existing research, deep learning technology has been widely applied to binary function similarity detection, demonstrating significant advantages over traditional methods, such as higher accuracy and faster speed. Existing deep learning-based binary function similarity detection methods can be mainly divided into the following two categories:
[0004] 1. Using the method of embedding assembly instruction sequences
[0005] Inspired by natural language processing techniques, this method first disassembles binary functions into a sequence of assembly instructions, treats it as natural language, and then uses a pre-trained language model (such as the Transformer-based BERT model) to embed the instructions into a vector space to capture the semantic information between instructions and achieve binary function similarity detection.
[0006] These methods exhibit excellent performance in detection scenarios within a single compilation environment (compiler type, version, optimization options) and architecture. However, their performance degrades significantly in scenarios involving cross-compilation environments or cross-architecture scenarios, resulting in numerous false negatives. The extraction of semantic information from these methods relies entirely on the assembly instruction sequences obtained from disassembling the binary code. Due to inherent compiler conventions and differences in instruction set architectures, the same source code will produce significantly different binary code under different compilation environments and architectures, leading to significant differences in the corresponding assembly instruction sequences. Therefore, methods utilizing assembly instruction sequence embedding cannot accurately capture the common semantic features of functions across compilation environments and architectures, making it difficult to accurately identify similar function pairs and resulting in poor usability in complex and ever-changing real-world application scenarios.
[0007] 2. Using graph embedding methods
[0008] The graph embedding method is used to first extract various graph representations of binary functions; then, basic block features are extracted through feature engineering and other techniques to construct attribute graph representations; then, graph neural networks are used to embed the attribute graphs into the vector space to generate graph embedding vectors; finally, the similarity between two graph embedding vectors is calculated and compared with a threshold to achieve similarity detection.
[0009] These methods introduce multiple attribute graphs, extracting some common structural features of functions across compilation environments or architectures compared to directly using assembly instruction sequences. This makes them more robust to changes, but accuracy still decreases. Furthermore, existing methods use shallow graph neural networks, focusing only on local neighboring node information, making it difficult to effectively capture and model relationships between distant nodes. They also ignore the mutual influence between distant statements within a function, resulting in insufficient graph embedding vector information and affecting the method's detection accuracy.
[0010] In summary, existing binary function similarity detection methods mainly suffer from the following problems: (1) The assembly instruction sequence embedding method fails to generate sufficient common information representations of function semantics in cross-compilation environments or cross-architecture scenarios, resulting in a significant decrease in detection accuracy and a large number of false negatives. The graph embedding method also falls short in extracting common structural information of functions in the aforementioned scenarios; (2) The graph embedding method uses shallow graph neural networks, focusing only on local neighboring statement information within the function, making it difficult to accurately model the mutual influence between distant statements, resulting in insufficient graph embedding vector information and ultimately a decrease in detection accuracy. Therefore, this invention proposes a binary function similarity detection method using a hypergraph twin neural network.
Summary of the Invention
[0011] The purpose of this invention is to address two problems of existing methods: they fail to generate sufficient common information representations of function semantics and structure in cross-compilation environments or cross-architecture scenarios, and, because graph embedding focuses only on local neighboring statements within a function, they have difficulty accurately modeling the mutual influence between distant statements. To this end, this invention proposes a binary function similarity detection method using a hypergraph twin neural network.
[0012] The design principle of this invention is as follows: First, the binary function is decompiled into pseudocode, and its Abstract Syntax Tree (AST), Control Flow Graph (CFG), and Data Flow Graph (DFG) are extracted. Based on the abstract syntax tree, a directed heterogeneous hypergraph representation is constructed that includes data flow hyperedges, control flow hyperedges, and three other types of information-enhancing hyperedges. Because the hypergraph is built on pseudocode, it is independent of the compilation environment and architecture and can effectively capture the common semantic and structural features of functions under different compilation environments and architectures. Then, a deep hypergraph Siamese network is used to obtain hypergraph embedding vectors, and the cosine distance is calculated as the similarity measure for function pairs. This network integrates two improved deep hypergraph neural networks with identical structures and shared parameters. By increasing the network depth, multi-layer convolutions are used to effectively learn the structural and semantic features of distant statements, capturing the interrelationships between them. Initial residual connections and identity mappings are introduced to preserve the initial features of statements and avoid the performance degradation caused by stacking network layers, mitigating the inherent oversmoothing and model degradation problems of deep graph neural networks. Finally, the similarity of a function pair is compared with a threshold to determine whether the two functions are similar.
[0013] The technical solution of the present invention is achieved through the following steps:
[0014] Step 1: For a given binary function f in the training and test sets, construct its directed heterogeneous hypergraph representation DHHG.
[0015] Step 1.1: Use the disassembler IDA Pro 7.5 and its Hex-Rays decompiler plugin to decompile the binary function f into C-style pseudocode.
[0016] Step 1.2: Use the disassembler IDA Pro 7.5 and the IDAPython plugin to write scripts that analyze the pseudocode and extract its Abstract Syntax Tree (AST), Control Flow Graph (CFG), and Data Flow Graph (DFG).
[0017] Step 1.3: Based on the Abstract Syntax Tree (AST), and combining the Control Flow Graph (CFG) and Data Flow Graph (DFG), define various types of nodes and directed edges to construct a Directed Heterogeneous Graph (DHG).
[0018] Step 1.4: Based on the directed heterogeneous graph DHG, extend its regular directed edges into directed hyperedges that can connect any number of nodes, thus constructing the directed heterogeneous hypergraph DHHG.
[0019] Step 2: Construct and train a deep hypergraph twin neural network.
[0020] Step 2.1: Integrate two structurally identical, parameter-sharing deep hypergraph neural networks, DHHGN1 and DHHGN2, to construct a deep hypergraph twin neural network. Then input the directed heterogeneous hypergraph representations $\mathrm{DHHG}_p$ and $\mathrm{DHHG}'_p$ of each training-set function pair $(f_p, f'_p)$ (p = 1, 2, ..., P, where P is the total number of samples in the training set) into the two networks respectively, and execute steps 2.2 to 2.4 in parallel.
[0021] Step 2.2: Calculate the initial feature vectors of each node and hyperedge of $\mathrm{DHHG}_p$ and $\mathrm{DHHG}'_p$.
[0022] Step 2.2.1: Use embedding layers to embed the discrete feature values of the nodes and hyperedges of $\mathrm{DHHG}_p$ and $\mathrm{DHHG}'_p$ into a vector space, generating the initial embedding vectors $\beta_n$ and $\beta_e$.
[0023] Step 2.2.2: Apply linear projection and nonlinear activation to $\beta_n$ and $\beta_e$ to refine the initial embeddings of the different types of nodes and hyperedges and place them in the same vector space, obtaining the initial feature vectors of the nodes and hyperedges.
[0024] Step 2.3: Iteratively update the node feature vectors of $\mathrm{DHHG}_p$ and $\mathrm{DHHG}'_p$ in each convolutional layer l; the final node feature vectors are obtained after L iterations.
[0025] Step 2.3.1: Calculate the information vector passed from node n to hyperedge e, and apply the multi-head scaled dot-product attention mechanism to aggregate the information of all head nodes n∈H(e) and tail nodes n∈T(e) to hyperedge e.
[0026] Step 2.3.2: Calculate the information vector that hyperedge e passes back to node n, and apply the multi-head scaled dot-product attention mechanism to aggregate the information of all hyperedges e associated with node n back to node n.
[0027] Step 2.3.3: In the node feature vector update stage, an initial residual connection and identity mapping are introduced to update the node feature vectors.
[0028] Step 2.3.4: For each convolutional layer l, repeat steps 2.3.1 to 2.3.3, undergoing a total of L iterations to update the final node feature vector.
[0029] Step 2.4: Use attention pooling to aggregate all node feature vectors of $\mathrm{DHHG}_p$ and $\mathrm{DHHG}'_p$, obtaining the two graph embedding vectors.
[0030] Step 2.5: Calculate the cosine distance between the two graph embedding vectors and use it as the similarity measure $\mathrm{Sim}(f_p, f'_p)$ of the function pair $(f_p, f'_p)$.
[0031] Step 2.6: To train the model, combine the similarity measure with the label $y_p$ of the function pair $(f_p, f'_p)$ ($y_p = 1$ indicates that the functions are similar, $y_p = -1$ indicates that they are dissimilar), and optimize the objective function using stochastic gradient descent to train the model parameters.
[0032] Step 3: Input the function pairs $(f_q, f'_q)$ in the test set (q = 1, 2, ..., Q, where Q is the total number of samples in the test set) into the trained hypergraph twin neural network, repeat steps 2.2 to 2.5 to obtain the similarity measure $\mathrm{Sim}(f_q, f'_q)$ of each pair, and compare it with the set threshold ε to determine whether the input function pair $(f_q, f'_q)$ is similar.
[0033] Beneficial effects
[0034] Compared to other binary function similarity detection methods, this invention proposes using directed heterogeneous hypergraphs as information representations of binary functions. First, the directed heterogeneous hypergraph is built based on pseudocode, independent of the compilation environment and architecture, and exhibits strong robustness to its changes, effectively capturing the common semantic and structural features of functions under different compilation environments or architectures. Second, by combining abstract syntax trees, control flow graphs, data flow graphs, and other information-enhancing edges, it achieves deep fusion of rich syntactic, structural, and semantic information of the code. Finally, by defining different types of directed hyperedges, it effectively captures higher-order correlations between nodes. This invention effectively improves the problem of existing methods failing to generate sufficient information representations of function semantics and structure in detection scenarios across compilation environments or architectures, significantly improving detection accuracy in cross-compilation environment and cross-architecture scenarios, and enhancing usability in practical applications.
[0035] Compared to other structural feature-based methods, this invention proposes using a deep directed heterogeneous hypergraph Siamese network to generate embedding vectors for binary function directed heterogeneous hypergraphs. This network integrates two improved deep directed heterogeneous hypergraph neural networks with identical structures and shared parameters. By increasing network depth, the receptive field of nodes is expanded, and multi-layer convolutions are used to deeply mine and integrate structural associations and semantic context between distant nodes, accurately capturing potential connections in complex graph structures. An initial residual connection and identity mapping are introduced into this deep hypergraph neural network to effectively alleviate the inherent oversmoothing and model degradation problems of deep graph neural networks. This invention effectively improves upon existing methods that primarily focus on local neighboring information within functions during graph embedding, making it difficult to accurately model the mutual influence between distant statements, thus improving the quality of generated graph embedding vectors and increasing detection accuracy.
Attached Figure Description
[0036] Figure 1 is a schematic diagram of the binary function similarity detection method using a hypergraph twin neural network according to the present invention.
[0037] Figure 2 is an example diagram of the directed heterogeneous hypergraph representation of a binary function extracted in this invention.
Detailed Implementation
[0038] To better illustrate the purpose and advantages of the present invention, the implementation methods of the present invention will be further described in detail below with reference to examples.
[0039] The experiment was conducted on a server with the following configuration: CPU: AMD EPYC 7543, frequency: 2.8GHz, number of cores: 32; RAM: 80GB; GPU: A40; Operating system: Linux Ubuntu 64-bit.
[0040] The experiment uses two datasets, the details of which are as follows:
[0041] (1) Dataset I is used for model training and evaluation and contains 7 open source projects (see Table 1 for details).
[0042] Table 1. Details of the selected open-source projects in Dataset I
[0043]
[0044] Each open-source project uses two compilers (each containing four different versions) and five optimization options, compiling 32-bit and 64-bit versions for three different architectures respectively. See Table 2 for specific settings.
[0045] Table 2. Specific compilation requirements for open source projects
[0046]
[0047] The above compilation process generates 5489 binary files, which contain 2.68M binary functions. Functions with ≥5 basic blocks are selected, deduplicated, and divided into training and test sets. The ratio of similar function pairs to dissimilar function pairs is 1:1. Detailed information on the training and test sets is shown in Table 3.
[0048] Table 3. Details of the training set, validation set, and test set
[0049]
[0050] (2) Dataset II was used to evaluate the model's performance in the binary vulnerability function clone detection task (one of the main application tasks of binary function similarity detection methods), and consisted of the following parts: 1) 1766 binary functions in the libcrypto library of the Netgear R7000 (ARM) firmware; 2) 4 vulnerability functions in OpenSSL 1.0.2d, details of which are shown in Table 4. Compilation was performed for x86, x64, ARM and MIPS architectures to obtain 16 binary vulnerability functions as query functions; 3) 28256 test function pairs generated based on the above binary functions (16 query functions × 1766 firmware functions).
[0051] Table 4. Details of vulnerable functions in Dataset II
[0052]
[0053] The experiment used the area under the receiver operating characteristic curve (AUC), recall@K at different thresholds K, and mean reciprocal rank (MRR) as evaluation metrics.
[0054] (1) AUC: The area under the receiver operating characteristic curve (ROC curve), often used to evaluate the performance of binary classification models. The ROC curve is a graphical representation of the performance of a model in a binary classification problem. Its horizontal axis is the false positive rate (FPR) (number of false positives / total number of actual negative samples), and its vertical axis is the true positive rate (TPR) (number of true positives / total number of actual positive samples). The AUC value is between 0 and 1. The closer it is to 1, the better the model's performance, reflecting the overall performance of the model under different classification thresholds.
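As an illustration of the AUC definition above, the following minimal sketch computes AUC through its equivalent ranking interpretation: the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair, with ties counted as one half. The function name `roc_auc` and the sample scores are hypothetical and are not taken from the experiment.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Labels follow the convention of the method: 1 = similar, -1 = dissimilar.
example_auc = roc_auc([1, 1, -1, -1], [0.9, 0.4, 0.5, 0.1])
```

The quadratic loop is chosen for readability; for large test sets a sort-based computation would be preferable.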
[0055] (2) Recall@K: The proportion of positive samples retrieved in the first K results out of all positive samples. The value of Recall@K is between 0 and 1. The closer it is to 1, the stronger the model's ability to identify positive samples in the first K results. The calculation method is shown in formula (1).
[0056]
[0057] (3) MRR: The mean of the reciprocal rank values of all test samples. The MRR value is between 0 and 1. The closer it is to 1, the stronger the model's ability to quickly and accurately detect relevant results. The calculation method is shown in formula (2).
[0058] $\mathrm{MRR} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{p_i}$ (2)
[0059] Where N is the total number of test samples, and $p_i$ represents the ranking result of the i-th test sample.
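The two ranking metrics can be sketched as follows. This is a minimal illustration that follows the textual definitions of Recall@K and MRR given above; the function names and the sample inputs are hypothetical.

```python
def recall_at_k(ranked_labels, k):
    """ranked_labels: labels (1 = positive) sorted by descending similarity."""
    total_pos = sum(1 for y in ranked_labels if y == 1)
    if total_pos == 0:
        return 0.0
    hits = sum(1 for y in ranked_labels[:k] if y == 1)
    return hits / total_pos


def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the correct result for each test sample."""
    return sum(1.0 / p for p in ranks) / len(ranks)


# Three positives in total, two of them inside the first three results.
r_at_3 = recall_at_k([1, -1, 1, -1, 1], 3)
# Correct results ranked 1st, 2nd and 4th for three queries.
mrr = mean_reciprocal_rank([1, 2, 4])
```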
[0060] The specific process is as follows:
[0061] Step 1: For a given binary function f in the training and test sets, construct its directed heterogeneous hypergraph representation DHHG.
[0062] Step 1.1: Use the disassembler IDA Pro 7.5 and its Hex-Rays decompiler plugin to decompile the binary function f into C-style pseudocode.
[0063] Step 1.2: Use the disassembler IDA Pro 7.5 and the IDAPython plugin to write scripts that analyze the pseudocode and extract its Abstract Syntax Tree (AST), Control Flow Graph (CFG), and Data Flow Graph (DFG).
[0064] Different subtrees in an Abstract Syntax Tree (AST) correspond to different ranges of code. Each node represents a specific structure (statement or expression) of the code. Different structures belong to different AST node types. A node has 0, 1, or more attributes. When a node's attributes include other nodes, the latter is called a child node of the former.
[0065] A control flow graph consists of basic block nodes (a basic block is a sequence of instructions executed sequentially with no branches or jumps in between), with edges representing possible execution flows between basic blocks, depicting the program execution logic and all possible execution paths within a function.
[0066] The data flow graph is also composed of basic block nodes. Edges represent the flow of data and operations between basic blocks, depicting the flow of variables within functions and their dependencies. If an instruction in one basic block writes data to a memory address, and an instruction in another basic block reads data from the same address, it indicates that there is data flow interaction between the two basic blocks.
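The write/read rule described above can be illustrated with a small sketch. It assumes each basic block has already been summarized as a set of written and a set of read locations; this summary format and the names `data_flow_edges`, `writes`, and `reads` are hypothetical, and the sketch ignores execution order and reachability, which a real analysis would take into account.

```python
def data_flow_edges(blocks):
    """blocks: dict mapping block name -> {"writes": set, "reads": set}.

    Returns the set of (source, destination) pairs where the source block
    writes a location that the destination block reads.
    """
    edges = set()
    for src, s in blocks.items():
        for dst, d in blocks.items():
            if src != dst and s["writes"] & d["reads"]:
                edges.add((src, dst))
    return edges


blocks = {
    "B0": {"writes": {"v1"}, "reads": set()},
    "B1": {"writes": {"v2"}, "reads": {"v1"}},
    "B2": {"writes": set(), "reads": {"v1", "v2"}},
}
edges = data_flow_edges(blocks)
```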
[0067] Step 1.3: Based on the Abstract Syntax Tree (AST), and combining the Control Flow Graph (CFG) and Data Flow Graph (DFG), define various types of nodes and directed edges to construct a Directed Heterogeneous Graph (DHG).
[0068] A directed heterogeneous graph contains two or more types of nodes and directed edges, and can accurately depict the diverse entities within a function and their complex relationships.
[0069] The node type of DHG is defined as follows:
[0070] Define nodes in the AST other than identifiers as “AST” node type, and define the value as the node type name (e.g., Assign, Expr, etc.); define identifiers in the AST as “identifier” node type, and define the value as its content (e.g., a, b, c, etc.).
[0071] The directed edge type of DHG is defined as follows:
[0072] First, the original parent-child relationship in the AST is represented as directed edges. The direction is from the child node to the parent node, and the type is defined as the corresponding attribute name (e.g., body, value, etc.), which is used to reflect the special semantics of each directed edge as well as the program hierarchy and structure.
[0073] Second, add control flow edges and data flow edges from the control flow graph (CFG) and data flow graph (DFG) to the corresponding nodes in the AST. The direction is the same as the original flow direction, and the types are defined as "CFI" and "DFI" respectively, to represent the control flow and data flow information of the function.
[0074] Third, in order to more comprehensively represent the semantic and structural information of the function, the following three types of information enhancement edges are added to the AST.
[0075] (1) Connect each pair of consecutive occurrences of the same variable, with the direction from the left node to the right node. The type is defined as "NU" to represent the usage pattern and context information of the variable.
[0076] (2) Connect two adjacent sibling nodes, with the direction from the left node to the right node. The type is defined as "NB" to represent the order information of the child nodes.
[0077] (3) Connect two adjacent terminal nodes, with the direction from the left node to the right node. The type is defined as "NT", which is used to represent the token sequence information of the function.
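The three information enhancement edge types can be sketched on a toy syntax tree as follows. The `AstNode` class and the function `enhancement_edges` are hypothetical illustrations rather than the scripts used in the invention, and the sketch assumes that identifiers always appear as terminal (leaf) nodes.

```python
from dataclasses import dataclass, field


@dataclass
class AstNode:
    kind: str                      # "AST" or "identifier"
    value: str                     # type name or identifier content
    children: list = field(default_factory=list)


def enhancement_edges(root):
    """Return (NU, NB, NT) edge lists as (left_node, right_node, type)."""
    nb, terminals = [], []

    def walk(node):
        # NB: adjacent siblings, left to right.
        for left, right in zip(node.children, node.children[1:]):
            nb.append((left, right, "NB"))
        if not node.children:
            terminals.append(node)
        for child in node.children:
            walk(child)

    walk(root)
    # NT: adjacent terminal nodes in left-to-right order.
    nt = [(a, b, "NT") for a, b in zip(terminals, terminals[1:])]
    # NU: consecutive occurrences of the same identifier.
    last_seen, nu = {}, []
    for t in terminals:
        if t.kind == "identifier":
            if t.value in last_seen:
                nu.append((last_seen[t.value], t, "NU"))
            last_seen[t.value] = t
    return nu, nb, nt


# Toy tree for the statement "a = b + a".
tree = AstNode("AST", "Assign", [
    AstNode("identifier", "a"),
    AstNode("AST", "BinOp", [AstNode("identifier", "b"),
                             AstNode("identifier", "a")]),
])
nu, nb, nt = enhancement_edges(tree)
```

In this example the single NU edge links the two occurrences of `a`, the NB edges link `a` to `BinOp` and `b` to `a`, and the NT edges follow the token order a, b, a.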
[0078] Step 1.4: Based on the directed heterogeneous graph DHG, extend its regular directed edges into directed hyperedges that can connect any number of nodes, thus constructing the directed heterogeneous hypergraph DHHG.
[0079] First, regular directed edges are expanded into directed hyperedges, preserving their original type and direction. Then, directed hyperedges of the same type that share the same child nodes or the same parent nodes are merged to form new directed hyperedges containing multiple directly related nodes, thereby capturing higher-order correlations between nodes.
[0080] The constructed directed heterogeneous hypergraph DHHG can be represented as (N, E), where N represents the set of nodes in the graph and E represents the set of directed hyperedges in the graph. Each node in N can be represented as n = (α, x), where α represents the type of node n (i.e., "AST" node type or "identifier" node type), and x represents the value of node n (i.e., AST node type name or identifier content). Each directed hyperedge in E can be represented as e = (τ, T(e), H(e)), with the direction from T(e) to H(e), where τ represents the type of the directed hyperedge e (body, value, CFI, DFI, NU, NB, NT, etc.), $T(e) = \{n_1, n_2, \ldots, n_{|T(e)|}\}$ denotes the set of tail nodes of the directed hyperedge e, and $H(e) = \{n_1, n_2, \ldots, n_{|H(e)|}\}$ denotes the set of head nodes of the directed hyperedge e.
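The representation e = (τ, T(e), H(e)) and the merging step can be sketched as follows. This is a simplified illustration under stated assumptions: it merges only edges that share both their type and their destination node, which is one of the two merging cases described above, and the names `HyperEdge` and `to_hyperedges` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class HyperEdge:
    etype: str          # τ, e.g. "body", "CFI", "NB"
    tail: frozenset     # T(e): nodes the hyperedge starts from
    head: frozenset     # H(e): nodes the hyperedge points to


def to_hyperedges(edges):
    """edges: iterable of (src, dst, etype) with direction src -> dst."""
    tails_by_key = defaultdict(set)
    for src, dst, etype in edges:
        tails_by_key[(etype, dst)].add(src)
    return {
        HyperEdge(etype, frozenset(tails), frozenset([dst]))
        for (etype, dst), tails in tails_by_key.items()
    }


# Nodes are plain integers here; 0 is a parent with children 1, 2 and 3.
plain_edges = [(1, 0, "body"), (2, 0, "body"), (3, 0, "value"), (1, 2, "NB")]
hyperedges = to_hyperedges(plain_edges)
```

Here the two `body` edges that point to node 0 collapse into one hyperedge with tail set {1, 2}, while the `value` and `NB` edges remain hyperedges with a single tail node.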
[0081] Step 2: Construct and train a deep hypergraph twin neural network.
[0082] Step 2.1: Integrate two structurally identical, parameter-sharing deep hypergraph neural networks, DHHGN1 and DHHGN2, to construct a deep hypergraph twin neural network. Then input the directed heterogeneous hypergraph representations $\mathrm{DHHG}_p$ and $\mathrm{DHHG}'_p$ of each training-set function pair $(f_p, f'_p)$ (p = 1, 2, ..., P, where P is the total number of samples in the training set) into the two networks respectively, and execute steps 2.2 to 2.4 in parallel.
[0083] Step 2.2: Calculate the initial feature vectors of each node and hyperedge of $\mathrm{DHHG}_p$ and $\mathrm{DHHG}'_p$.
[0084] Step 2.2.1: Use embedding layers to embed the discrete feature values of the nodes and hyperedges of $\mathrm{DHHG}_p$ and $\mathrm{DHHG}'_p$ into a vector space, generating the initial embedding vectors $\beta_n$ and $\beta_e$.
[0085] For each node n, generate the node embedding vector $\beta_n$ based on the node type α and the value x; see formula (3):
[0086] $\beta_n = \mathrm{Embedding}_\alpha(x)$ (3)
[0087] Here, $\mathrm{Embedding}_\alpha$ denotes the embedding layer corresponding to node type α. To capture the heterogeneity of different node types, a separate embedding layer is used for each node type. The dimension of $\beta_n$ is $D_N$ (a hyperparameter set in the experiment).
[0088] For each hyperedge e, generate the hyperedge embedding vector $\beta_e$ according to its type τ; see formula (4):
[0089] $\beta_e = \mathrm{Embedding}_E(\tau)$ (4)
[0090] Here, $\mathrm{Embedding}_E$ denotes the embedding layer used for hyperedge embedding; all types of hyperedges share the same embedding layer. The dimension of $\beta_e$ is $D_E$ (a hyperparameter set in the experiment).
[0091] Step 2.2.2: Apply linear projection and nonlinear activation to $\beta_n$ and $\beta_e$ to refine the initial embeddings of the different types of nodes and hyperedges and place them in the same vector space, obtaining the initial feature vectors of the nodes and hyperedges.
[0092] The calculation methods are shown in formulas (5) and (6):
[0093]
[0094]
[0095] Here, $W_\alpha$ and $b_\alpha$ represent the weight matrix and bias vector corresponding to node type α, respectively; $W_\tau$ and $b_\tau$ represent the weight matrix and bias vector corresponding to hyperedge type τ, respectively; σ represents the nonlinear activation function, with ReLU used by default. The dimension of the initial feature vectors of the nodes and hyperedges is $D_C$ (a hyperparameter set in the experiment).
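For orientation, a form of formulas (5) and (6) that is consistent with the symbol definitions above (a type-specific linear projection followed by the activation σ) can be written as below. The output symbols $h_n^{(0)}$ and $h_e^{(0)}$ are assumed names introduced only for this illustration; the notation used in the original formulas may differ.

```latex
h_n^{(0)} = \sigma\left(W_\alpha \beta_n + b_\alpha\right) \qquad (5)

h_e^{(0)} = \sigma\left(W_\tau \beta_e + b_\tau\right) \qquad (6)
```

Under this reading, $W_\alpha$ maps from dimension $D_N$ to $D_C$ and $W_\tau$ maps from dimension $D_E$ to $D_C$, which is what places node and hyperedge features in the same vector space.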
[0096] Step 2.3: Iteratively update the node feature vectors of $\mathrm{DHHG}_p$ and $\mathrm{DHHG}'_p$ in each convolutional layer l; the final node feature vectors are obtained after L iterations.
[0097] Step 2.3.1: Calculate the information vector passed from node n to hyperedge e, and apply the multi-head scaled dot-product attention mechanism to aggregate the information of all head nodes n∈H(e) and tail nodes n∈T(e) to hyperedge e.
[0098] The calculation method for the information vector passed from node n to hyperedge e is shown in formula (7):
[0099]
[0100] Here, one weight matrix and bias vector are used for head nodes and another weight matrix and bias vector are used for tail nodes; the result is the information vector passed from node n to hyperedge e. Using different weight matrices and bias vectors distinguishes head nodes from tail nodes, enabling the model to learn how nodes at different positions propagate information.
[0101] During aggregation, the method uses a total of I attention heads. In the i-th head, the attention coefficient of each node with respect to the hyperedge is calculated as shown in formula (8):
[0102]
[0103] Here, two weight matrices transform the query and the key, respectively; the attention score is scaled to adapt to changes in dimensionality; and Softmax is used to normalize the attention weights.
[0104] The output vector of the i-th head is calculated as shown in formula (9):
[0105]
[0106] Here, a further weight matrix transforms the value; σ represents the nonlinear activation function, with ReLU used by default.
[0107] The output vectors of all attention heads are aggregated to give the output vector of the multi-head scaled dot-product attention mechanism, calculated as shown in formula (10):
[0108]
[0109] Here, $W_{o1}$ is a weight matrix that applies a linear transformation to the result, enhancing the expressive power of the model.
[0110] This output vector is used directly as the feature vector of hyperedge e; see formula (11):
[0111]
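The core of the aggregation in step 2.3.1 is scaled dot-product attention. The following single-head sketch shows only that core: it omits the learned query, key, and value projections, the ReLU activation, the combination of the I heads, and the output matrix $W_{o1}$. The function `attend` and the choice of what serves as the query are assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(query, keys, values):
    """Single-head scaled dot-product attention.

    Returns the weighted sum of `values` and the attention weights.
    """
    scale = math.sqrt(len(query))          # scale by sqrt of the dimension
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values))
              for j in range(dim)]
    return output, weights


# Two node messages with identical keys receive equal attention.
out, w = attend([1.0, 0.0],
                [[1.0, 0.0], [1.0, 0.0]],
                [[2.0, 0.0], [4.0, 2.0]])
```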
[0112] Step 2.3.2: Calculate the information vector that hyperedge e passes back to node n, and apply the multi-head scaled dot-product attention mechanism to aggregate the information of all hyperedges e associated with node n back to node n.
[0113] The calculation method for the information vector passed back to node n by the hyperedge e is shown in formula (12):
[0114]
[0115] Here, one weight matrix and bias vector are used from the hyperedge to head nodes and another weight matrix and bias vector are used from the hyperedge to tail nodes; the result is the information vector that hyperedge e passes back to node n.
[0116] The aggregation method is similar to that in step 2.3.1, see formulas (13), (14), and (15):
[0117]
[0118]
[0119]
[0120] Step 2.3.3: In the node feature vector update stage, an initial residual connection and identity mapping are introduced to update the node feature vectors.
[0121] Update the feature vector of each node according to formula (16):
[0122]
[0123] Here, $\rho_l$ and $\lambda_l$ are the hyperparameters corresponding to the initial residual connection and the identity mapping, respectively; I represents the identity matrix; the weight matrix of the l-th layer is used for transforming the node feature vectors; σ represents the nonlinear activation function, with ReLU used by default.
[0124] The initial residual connection constructs a skip connection from the input layer to each layer. When updating the node feature vector, it combines the information vector from the hyperedge with the initial feature vector of the node to ensure that after stacking multiple convolutional layers, the final representation of each node still contains some information from the input layer. This effectively alleviates the problem of oversmoothing, where the node feature vectors tend to become homogeneous as the number of layers increases, making them difficult to distinguish.
[0125] Identity mapping is achieved by adding an identity matrix to the weight matrix of each layer and controlling the weight of the identity matrix to reduce the influence of the weight matrix in the deep layers. This ensures that the deep model can at least achieve the performance of the shallow model, effectively alleviates the model degradation problem, and avoids the performance decline caused by the increase in the number of layers.
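The combined effect of the two mechanisms can be sketched as follows. The sketch assumes a GCNII-style update, in which the aggregated message is mixed with the initial feature vector and then multiplied by a blend of the identity matrix and the layer weight matrix; this form, the function name `update_node`, and the way `rho` and `lam` weight each term are assumptions rather than a transcription of formula (16).

```python
def relu(v):
    return [max(0.0, x) for x in v]


def update_node(message, h0, weight, rho, lam):
    """One node update with initial residual and identity mapping.

    message: vector aggregated from the hyperedges of the node
    h0:      initial feature vector of the node
    weight:  square layer weight matrix (list of rows)
    rho:     share of the initial features kept (initial residual)
    lam:     share of the learned weight matrix (identity mapping)
    """
    d = len(h0)
    # Initial residual: blend the incoming message with the layer-0 features.
    mixed = [(1.0 - rho) * m + rho * h for m, h in zip(message, h0)]
    # Identity mapping: (1 - lam) * I + lam * W.
    eff = [[(1.0 - lam) * (1.0 if i == j else 0.0) + lam * weight[i][j]
            for j in range(d)] for i in range(d)]
    out = [sum(mixed[i] * eff[i][j] for i in range(d)) for j in range(d)]
    return relu(out)


# With rho = 1 and lam = 0 the layer returns ReLU of the initial features.
h = update_node([5.0, 5.0], [1.0, -2.0], [[9.0, 9.0], [9.0, 9.0]], 1.0, 0.0)
```

The limiting case in the example shows why stacking layers need not hurt: when the learned weight has little influence, each layer passes the blended features through largely unchanged.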
[0126] Step 2.3.4: For each convolutional layer l, repeat steps 2.3.1 to 2.3.3, undergoing a total of L iterations to update the final node feature vector.
[0127] Step 2.4: Use attention pooling to aggregate all node feature vectors of $\mathrm{DHHG}_p$ and $\mathrm{DHHG}'_p$, obtaining the two graph embedding vectors.
[0128] First, calculate the attention score for each node, as shown in formula (17):
[0129]
[0130] Here, $A^T$ is a learnable weight vector.
[0131] Then, aggregate the node feature vectors, as shown in formulas (18) and (19):
[0132]
[0133]
[0134] Here, $N_p$ denotes the node set of $\mathrm{DHHG}_p$, and $N'_p$ denotes the node set of $\mathrm{DHHG}'_p$.
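Attention pooling can be sketched as follows. The sketch assumes that each node score is the inner product of the learnable vector with the node feature vector and that the scores are normalized with Softmax over the node set; the exact form of formulas (17) to (19) may differ, and the function name `attention_pool` is hypothetical.

```python
import math


def attention_pool(node_vectors, a):
    """Pool node feature vectors into one graph embedding vector.

    node_vectors: list of node feature vectors (all of equal length)
    a:            learnable weight vector of the same length
    """
    scores = [sum(x * w for x, w in zip(h, a)) for h in node_vectors]
    m = max(scores)                              # for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(node_vectors[0])
    return [sum(w * h[j] for w, h in zip(weights, node_vectors))
            for j in range(dim)]


# With a zero weight vector every node is weighted equally (mean pooling).
g = attention_pool([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
```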
[0135] Step 2.5: Calculate the cosine distance between the two graph embedding vectors and use it as the similarity measure $\mathrm{Sim}(f_p, f'_p)$ of the function pair $(f_p, f'_p)$; see formula (20).
[0136]
[0137] Step 2.6: To train the model, combine the similarity measure with the label $y_p$ of the function pair $(f_p, f'_p)$ ($y_p = 1$ indicates that the functions are similar, $y_p = -1$ indicates that they are dissimilar), and use stochastic gradient descent to optimize the following objective function to train the model parameters; see formula (21):
[0138]
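One common objective for twin networks with labels of 1 and -1 is the squared difference between the similarity measure and the label. The sketch below uses that form purely as an assumption to make the training signal concrete; it is not taken from formula (21), and the function name is hypothetical.

```python
def pair_loss(similarities, labels):
    """Mean squared difference between similarity measures and pair labels.

    similarities: similarity measure of each function pair, in [-1, 1]
    labels:       1 for similar pairs, -1 for dissimilar pairs
    """
    if len(similarities) != len(labels):
        raise ValueError("one label is required per function pair")
    # Assumed objective: push similar pairs toward 1 and dissimilar toward -1.
    return sum((s - y) ** 2 for s, y in zip(similarities, labels)) / len(labels)
```

A stochastic gradient descent step would then adjust the shared network parameters to reduce this value on each mini-batch of function pairs.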
[0139] Step 3: Input the function pairs $(f_q, f'_q)$ in the test set ($q = 1, 2, \ldots, Q$, where $Q$ is the total number of samples in the test set) into the trained hypergraph twin neural network, repeat steps 2.2 to 2.5 to obtain the similarity measure $\mathrm{Sim}(f_q, f'_q)$ of each pair, and compare it with the set threshold $\varepsilon$ to determine whether the input function pair $(f_q, f'_q)$ is similar.
[0140] When $\mathrm{Sim}(f_q, f'_q) \geq \varepsilon$, $(f_q, f'_q)$ is judged similar; when $\mathrm{Sim}(f_q, f'_q) < \varepsilon$, $(f_q, f'_q)$ is judged not similar.
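The decision rule can be sketched as follows. The function name and the threshold value used in the usage lines are hypothetical; the value of the threshold actually used is not assumed here.

```python
def classify_pairs(similarities, epsilon):
    """Return True for each pair whose similarity measure reaches the threshold."""
    return [sim >= epsilon for sim in similarities]


# Usage with made-up similarity measures and a made-up threshold.
example_decisions = classify_pairs([0.92, 0.10, -0.45], epsilon=0.5)
```

A pair whose similarity measure equals the threshold exactly is treated as similar, in line with the greater-than-or-equal condition.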
[0141] The experimental hyperparameter settings are shown in Table 5.
[0142] Table 5. Experimental Hyperparameter Settings
[0143]
[0144] The above detailed description further illustrates the purpose, technical solution, and beneficial effects of the invention. It should be understood that the above description is only a specific embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
Claims
1. A binary function similarity detection method using a hypergraph twin neural network, characterized in that... The method includes the following steps:
Step 1: For the given binary functions in the training and test sets, first decompile them into pseudocode using open-source tools and extract their abstract syntax tree (AST), control flow graph (CFG), and data flow graph. Second, add CFG edges, data flow graph edges, and three types of information augmentation edges to the AST, and define multiple node types and directed edge types to construct a directed heterogeneous graph representation. The node types in the directed heterogeneous graph representation fall into two categories, "AST" and "identifier": non-identifier nodes are of type "AST" and use their type name as the node value, and identifier nodes are of type "identifier" and use their identifier content as the node value. The directed edges include the original parent-child relationship edges of the AST, control flow graph edges, data flow graph edges, and information augmentation edges. The type of a parent-child relationship edge is its corresponding attribute name in the AST, and its direction is from the child node to the parent node. The control flow graph edges and data flow graph edges are of type "CFI" and "DFI" respectively, and their direction is consistent with the original flow direction. The information augmentation edges comprise edges connecting the positions of two occurrences of the same variable, edges connecting two adjacent sibling nodes, and edges connecting two adjacent terminal nodes; their types are "NU", "NB", and "NT" respectively, and their direction is from the left node to the right node. Then, each ordinary directed edge is extended into a directed hyperedge that retains its original type and direction. Directed hyperedges that have the same type and share the same child node or the same parent node are merged into a new directed hyperedge containing multiple directly related nodes, thereby constructing the directed heterogeneous hypergraph representation of the function.
Step 2: Construct and train a deep hypergraph Siamese neural network, which integrates two deep hypergraph neural networks with identical structures and shared parameters. First, calculate the initial feature vectors of the nodes and hyperedges in the hypergraph. Second, set up multi-layer convolution to iteratively update the node feature vectors, introducing initial residual connections and identity mappings during the node feature vector update stage. Then, use attention pooling to aggregate all node feature vectors in the graph to obtain the graph embedding vectors. Finally, calculate the cosine distance between the two graph embedding vectors as the similarity measure of the function pair and, combined with the function pair labels, use stochastic gradient descent to optimize the objective function and train the model parameters.
Step 3: Based on the directed heterogeneous hypergraphs of the test set function pairs obtained in Step 1, use the hypergraph twin neural network trained in Step 2 to calculate the similarity measure of each function pair, and compare it with the set threshold to determine whether the function pair is similar.
2. The binary function similarity detection method using a hypergraph twin neural network according to claim 1, characterized in that: In step 2, a deep hypergraph Siamese neural network architecture is constructed, which integrates two deep hypergraph neural networks with identical structures and shared parameters. Based on the Siamese network architecture, the deep hypergraph neural network originally used for graph classification tasks is trained so that it can be adapted to binary function similarity detection tasks.
3. The binary function similarity detection method using a hypergraph twin neural network according to claim 1, characterized in that: In step 2, a deep hypergraph twin neural network is constructed. This involves expanding the receptive field of nodes by increasing the network depth, setting up multi-layer convolutions to iteratively update node feature vectors, deeply mining and integrating the structural relationships and semantic context between distant nodes, effectively capturing the interrelationships between distant statements, and accurately capturing the potential connections in complex graph structures.
4. The binary function similarity detection method using a hypergraph twin neural network according to claim 1, characterized in that: In step 2, during the node feature vector update stage of the directed heterogeneous hypergraph network, initial residual connections and identity mappings are introduced to preserve the initial features of the statements and avoid the performance degradation caused by stacking network layers, thereby alleviating the oversmoothing and model degradation problems inherent in deep graph neural networks. In the formula for the updated node feature vector, two hyperparameters correspond to the initial residual connection and the identity mapping, respectively; the formula further involves the identity matrix, the weight matrix used to transform node feature vectors in the l-th layer, and a non-linear activation function, for which ReLU is used by default.