Binary code vulnerability detection method based on dangerous function parameter dependency

By extracting dangerous function unions and parameter dependencies from binary functions and using a neural network model for fine-grained analysis, the problem of high false alarm rate in existing technologies is solved, and more accurate binary code vulnerability detection is achieved.

CN115033884B · Active · Publication Date: 2026-06-26 · ZHONGYUAN ENGINEERING COLLEGE

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ZHONGYUAN ENGINEERING COLLEGE
Filing Date
2022-05-17
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing binary function similarity comparison methods have a high false positive rate in vulnerability detection, mainly due to coarse analysis granularity and interference caused by changes in function functionality, making it difficult to accurately detect vulnerabilities in binary code.

Method used

By extracting the dangerous function union from the binary function and utilizing the parameter dependencies of the dangerous functions, semantic vectors are calculated and compared using a neural network model, thus reducing the false alarm rate.

Benefits of technology

It improves the accuracy of binary code vulnerability detection, reduces the false positive rate, and enables fine-grained vulnerability detection across platforms.


Figure CN115033884B_ABST

Abstract

The application provides a binary code vulnerability detection method based on dangerous function parameter dependency. The method comprises the following steps. Step 1: given an unknown binary file, all binary functions in the file are obtained and all dangerous functions are extracted. Step 2: the function union of each dangerous function is extracted, where a function union consists of the function name and the parameter slice of the dangerous function, and the parameter slice is the set of instructions in the binary function that have a data dependency relationship with the dangerous function. Step 3: the semantic vector of the parameter slice of each function union is calculated. Step 4: the function name of each function union and the semantic vector of its parameter slice are compared with the function names and semantic vectors in a pre-built vulnerability library, and a vulnerability report for the unknown binary file is obtained. Because the comparison involves only the core instructions related to vulnerabilities, the analysis granularity is finer and the accuracy of binary code vulnerability detection can be effectively improved.

Description

Technical Field

[0001] This invention relates to the field of binary vulnerability technology, and in particular to a binary code vulnerability detection method based on dangerous function parameter dependency.

Background Technology

[0002] With the acceleration of digital transformation and the widespread application of IoT devices, attacks targeting the software supply chain are rapidly increasing, leading to widespread concern about software supply chain security. The inaccessibility of source code and the diversity of platform architectures present new challenges for detecting firmware vulnerabilities in IoT devices.

[0003] Currently, binary function code similarity techniques are widely used in vulnerability detection. The main idea is to compare the semantic information of the binary code under test with that of binary code known to carry vulnerabilities. To narrow the semantic gap caused by differing platform architectures, Yaniv David et al. lifted binaries from different architectures to an intermediate language based on LLVM IR and compared similarity at the granularity of binary functions, effectively resolving the instruction differences between architectures. The Gemini approach introduced graph neural networks into binary function similarity comparison: it manually extracts basic-block attributes from binary functions and uses Structure2vec to compute function semantic vectors, improving the efficiency of function similarity comparison.

[0004] Vulnerability detection methods based on binary function similarity comparison have a high false positive rate. Figure 3 shows the code before and after the CVE-2018-14714 vulnerability was patched. The patch replaced the dangerous function `sprintf` with `snprintf`. Because the patch changed very little, vulnerability detection based on the similarity of the entire binary function is prone to false positives. The inventors therefore believe that false positives have two main causes: first, the granularity of binary-function-level analysis is too coarse to detect minor modifications within a binary function, which is the main cause; second, changes to a function's functionality introduce additional interference into binary-function-level similarity comparison.

[0005] Therefore, in order to improve the accuracy of binary vulnerability detection and reduce the false positive rate, there is an urgent need for a new binary code vulnerability detection method.

Summary of the Invention

[0006] To reduce the false positive rate of binary vulnerability mining using binary function similarity, this invention provides a binary code vulnerability detection method based on dangerous function parameter dependency.

[0007] The binary code vulnerability detection method based on dangerous function parameter dependency provided by this invention includes:

[0008] Step 1: Given an unknown binary file, obtain all binary functions in the unknown binary file, and then extract all dangerous functions from each binary function;

[0009] Step 2: Extract the function union of each dangerous function and name it the dangerous function union. The dangerous function union consists of the function name of the dangerous function and the parameter slice. The parameter slice refers to the set of instructions in the binary function that form a data dependency relationship with the dangerous function.

[0010] Step 3: Calculate the semantic vector of the parameter slice for each dangerous function union;

[0011] Step 4: Compare the function name and semantic vector of the corresponding parameter slice of the dangerous function union with the function name and semantic vector in the pre-built vulnerability library to obtain the vulnerability report of the unknown binary file.

[0012] Furthermore, in step 2, extracting each dangerous function union specifically includes:

[0013] Obtain all simple paths in the binary function control flow graph (CFG) and trace all parameters of dangerous functions along these simple paths;

[0014] According to the method of generating parameter slices, obtain the set of all slices for each parameter, and for each parameter, only keep the set of slices with the most elements for the next operation;

[0015] The set of slices for each parameter is taken as a branch of the tree, and the parameter slices of the dangerous function are obtained by combining all branches.

[0016] Furthermore, the specific method for generating the slice set of the parameters includes:

[0017] Suppose a dangerous function carries a parameter A, and there exists an instruction n that assigns a value to the parameter variable A. The slice set $S_{\langle prede(n),\,A\rangle}$ is defined as:

[0018]-[0020]

$$S_{\langle prede(n),\,A\rangle} = \{n\} \;\cup \bigcup_{V_i \in refs(n)} S_{\langle prede(n),\,V_i\rangle} \;\cup \bigcup_{V_i \in points(n)} S_{\langle prede(n),\,V_i\rangle}$$

[0021] where the function parameter A is the starting variable for slice generation, prede(n) is the set of predecessor instructions of instruction n, $S_{\langle prede(n),\,V_i\rangle}$ is the slice set of variable $V_i$ in prede(n), refs(n) is the set of basic-type variables in instruction n, and points(n) is the set of pointer-type variables in instruction n.

[0022] Further, step 3 specifically includes:

[0023] The tree depth-first traversal algorithm is used to transform the parameter slices of each dangerous function from a tree structure into a sequential structure, and the sequential structure is named the parameter sequence.

[0024] Each instruction is treated as a word, and each parameter slice is treated as a sentence. The word2vec model is used to obtain the word embedding vector of each instruction in the parameter sequence.

[0025] The word embedding vectors of each instruction in the parameter sequence are sequentially used as row vectors of the word embedding matrix to obtain the word embedding matrix corresponding to the parameter slice.

[0026] The word embedding matrix is used as input to a trained neural network model, and the output of the neural network model is the semantic vector of the parameter slice.
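
The first three sub-steps of step 3 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Node` class, the pre-order visiting order, the zero-vector fallback for unknown instructions, and the hand-written `table` that stands in for a trained word2vec model are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """One instruction in the parameter-slice tree (hypothetical structure)."""
    instr: str
    children: List["Node"] = field(default_factory=list)


def to_sequence(root: Node) -> List[str]:
    """Flatten the tree into a parameter sequence by depth-first (pre-order) traversal."""
    sequence, stack = [], [root]
    while stack:
        node = stack.pop()
        sequence.append(node.instr)
        # Push children in reverse so the leftmost branch is visited first.
        stack.extend(reversed(node.children))
    return sequence


def embedding_matrix(sequence: List[str], table: Dict[str, List[float]], dim: int) -> List[List[float]]:
    """Stack one embedding per instruction as the rows of the matrix.

    `table` stands in for a trained word2vec lookup; instructions missing
    from it get a zero vector (an assumption of this sketch).
    """
    return [list(table.get(instr, [0.0] * dim)) for instr in sequence]


# Toy tree: the dangerous call is the root, each parameter contributes a branch.
tree = Node("call FUNC(aa, ab)", [
    Node("aa = add ac, 4", [Node("ac = load MEM")]),
    Node("ab = load MEM"),
])
table = {"call FUNC(aa, ab)": [1.0, 0.0], "ab = load MEM": [0.0, 1.0]}
sequence = to_sequence(tree)
matrix = embedding_matrix(sequence, table, dim=2)
```

The resulting `matrix` has one row per instruction, in sequence order, which is the shape the trained network expects as input.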

[0027] Furthermore, after converting the parameter slices of each dangerous function from a tree structure to a sequential structure using a depth-first traversal algorithm, the algorithm also includes:

[0028] Each instruction in the parameter sequence is normalized, including the normalization of variables, addresses, and binary function names in the instruction.

[0029] Furthermore, the two parameter sequences are simultaneously input into a Siamese network containing a bidirectional LSTM for learning, thus obtaining the neural network model.

[0030] Furthermore, step 1 also includes: performing data preprocessing on all binary functions.

[0031] Furthermore, the data preprocessing for all binary functions specifically includes:

[0032] Each binary function is converted into several assembly instructions using a disassembler; these assembly instructions are then converted into several corresponding VEX IR instructions; and finally, these VEX IR instructions are converted into several corresponding LLVM IR instructions.

[0033] Furthermore, it also includes:

[0034] The compiler backend optimizer is used to optimize several LLVM IR instructions obtained from the conversion.

[0035] The beneficial effects of this invention are:

[0036] This invention provides a binary code vulnerability detection method based on dangerous function parameter dependency. By preprocessing binary functions and converting them into an intermediate language, it enables cross-platform binary function similarity comparison. By analyzing dangerous functions and their parameter dependencies, it obtains the core code related to the vulnerability. Compared with traditional binary-function-level similarity comparison, this invention compares only the core instructions related to the vulnerability, providing finer-grained analysis and effectively improving the accuracy of binary code vulnerability detection.

Attached Figure Description

[0037] Figure 1 is a flowchart of the binary code vulnerability detection method based on dangerous function parameter dependency provided in an embodiment of the present invention;

[0038] Figure 2 shows the relationship between binary functions and dangerous functions provided in an embodiment of the present invention;

[0039] Figure 3 is an example of vulnerability CVE-2018-14714 provided in an embodiment of the present invention;

[0040] Figure 4 shows the process of extracting function unions provided in an embodiment of the present invention;

[0041] Figure 5 is a schematic diagram of a generated parameter slice provided in an embodiment of the present invention;

[0042] Figure 6 shows the process of calculating the semantic vector of a parameter slice provided in an embodiment of the present invention;

[0043] Figure 7 is a schematic diagram of the training process of the neural network model provided in an embodiment of the present invention.

Detailed Implementation

[0044] To make the objectives, technical solutions, and advantages of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort are within the scope of protection of the present invention.

[0045] To facilitate understanding of the technical solution of this invention, the relevant technical terms involved in this invention will be explained.

[0046] Binary file: An executable file generated by compiling source code.

[0047] Binary code: Machine code generated by the compiler from source code that can be directly executed by the CPU.

[0048] Binary function: The compiler compiles each user-defined function in the source code into corresponding binary code, which is called a binary function.

[0049] Dangerous functions: Functions that may cause program vulnerabilities due to improper use.

[0050] Function Union (FU): FU = (FuncName, ParamSli), where FuncName is the function name and ParamSli is the parameter slice. This term is specific to this invention.

[0051] Parameter slice: The set of instructions in a binary function that have a data dependency relationship with a dangerous function. This term is specific to this invention.

[0052] Example 1

[0053] As shown in Figure 1, this embodiment of the invention provides a binary code vulnerability detection method based on dangerous function parameter dependency, including the following steps:

[0054] S101: Given an unknown binary file, obtain all binary functions in the unknown binary file, and then extract all dangerous functions from each binary function;

[0055] S102: Extract the function union of each dangerous function and name it dangerous function union. The dangerous function union consists of the function name of the dangerous function and parameter slices. The parameter slices refer to the set of instructions in the binary function that form a data dependency relationship with the dangerous function.

[0056] S103: Calculate the semantic vector of the parameter slice for each dangerous function union;

[0057] S104: Compare the function name and semantic vector of the corresponding parameter slice of the dangerous function union with the function name and semantic vector in the pre-built vulnerability library to obtain the vulnerability report of the unknown binary file.

[0058] Specifically, vulnerabilities in binary files are manually marked. The marked binary files are processed to obtain the function names and semantic vectors of the parameter slices of the dangerous function unions. These function names and semantic vectors are stored in a database to create a vulnerability database. For each dangerous function union in an unknown binary file, a comparison is made. First, functions with the same function names as the dangerous function unions are searched in the vulnerability database. Second, semantic vectors in the vulnerability database that are semantically similar to the parameter slices of the function unions are found. Finally, a vulnerability report is generated.
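
The two-stage lookup described above (filter by function name, then compare semantic vectors) can be sketched as follows. The patent does not name a similarity measure or a decision threshold, so the cosine similarity, the `threshold` value, and the `LibraryEntry` layout are assumptions of this sketch.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class LibraryEntry:
    """One record of the vulnerability library (hypothetical layout)."""
    func_name: str            # dangerous function name, e.g. "sprintf"
    vector: Tuple[float, ...]  # semantic vector of the parameter slice
    vuln_id: str              # identifier assigned when the binary was marked


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match(func_name: str, vector: Sequence[float],
          library: List[LibraryEntry], threshold: float = 0.9):
    """Return (vuln_id, score) hits, best first, for one function union."""
    hits = []
    for entry in library:
        if entry.func_name != func_name:   # stage 1: same dangerous function
            continue
        score = cosine(vector, entry.vector)
        if score >= threshold:             # stage 2: similar parameter slice
            hits.append((entry.vuln_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


library = [LibraryEntry("sprintf", (1.0, 0.0), "CVE-2018-14714")]
vulnerable_hits = match("sprintf", (1.0, 0.0), library)
patched_hits = match("snprintf", (1.0, 0.0), library)
```

Because the name filter runs first, a patched call to `snprintf` yields no hit against a `sprintf` entry even when the surrounding parameter slices are identical.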

[0059] Compared with vulnerability detection methods based on binary function similarity, the method provided in this embodiment of the invention has finer granularity, and dangerous function parameter dependency describes binary vulnerabilities more accurately. Figure 2 shows the relationship between binary functions and dangerous functions.

[0060] Take the vulnerability CVE-2018-14714 shown in Figure 3 as an example. The patch for this vulnerability changes the dangerous function sprintf to snprintf, thereby limiting the number of characters copied. Using the method provided in this embodiment of the invention, the function names and parameter slices of the dangerous functions sprintf and snprintf are extracted separately. They form two completely different function unions, so the vulnerable code and the patched code can be accurately distinguished.

[0061] Example 2

[0062] Based on the above embodiments, this invention provides a binary code vulnerability detection method based on dangerous function parameter dependency, comprising the following steps:

[0063] S201: Given an unknown binary file, obtain all binary functions in the unknown binary file, perform data preprocessing on all binary functions, and extract all dangerous functions from the processed binary functions.

[0064] Specifically, even the same software produces different binary code under different system architectures, compilers, and optimization levels, so directly analyzing the disassembled instructions introduces significant errors. Data preprocessing converts instructions from different architectures into a unified intermediate language and then uses the compiler's back-end optimizer to optimize the binary functions.

[0065] As one possible implementation, as shown in Figure 4, the data preprocessing for all binary functions specifically includes:

[0066] Each binary function is converted into several assembly instructions using a disassembler; these assembly instructions are then converted into corresponding VEX IR instructions; these VEX IR instructions are further converted into corresponding LLVM IR instructions; and finally, the LLVM IR instructions obtained from the conversion are optimized using a compiler backend optimizer.

[0067] It should be noted that converting assembly instructions to LLVM IR masks the differences between instruction sets and platforms. The optimization step applies the same compiler optimization rules to binary functions that were originally built with different optimization rules. These optimizations include those for code branches, constants, and expressions; those for registers and instructions; and those for loops and inline functions.

[0068] S202: Extract each dangerous function union;

[0069] Specifically, as shown in Figure 4, the steps include: obtaining all simple paths in the control flow graph (CFG) of the binary function and tracing all parameters of the dangerous function along these simple paths; obtaining all slice sets for each parameter according to the parameter slice generation method, and retaining only the slice set with the most elements for each parameter for the next operation; and using the slice set of each parameter as a branch of a tree and combining all branches to obtain the parameter slice of the dangerous function.

[0070] In this step, one parameter can retrieve multiple slice sets, and the slice set with the most elements contains the most complete data dependency information.
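
The selection and combination rule can be sketched as follows, assuming each candidate slice set is a set of `(instruction index, instruction text)` pairs collected along one simple path. The dictionary-based tree, the function names, and the tie-breaking rule (first candidate wins) are illustrative assumptions, not details given in the patent.

```python
from typing import Dict, List, Set, Tuple

Slice = Set[Tuple[int, str]]  # (position in the function, instruction text)


def select_largest(candidates: Dict[str, List[Slice]]) -> Dict[str, Slice]:
    """For each parameter keep only the slice set with the most elements.

    Ties are resolved in favour of the first candidate (an assumption).
    """
    return {param: max(sets, key=len) for param, sets in candidates.items() if sets}


def combine(call_name: str, selected: Dict[str, Slice]) -> dict:
    """Attach one branch per parameter under the dangerous call as the root."""
    return {
        "root": call_name,
        # Sorting by instruction index restores program order inside a branch.
        "branches": {param: sorted(s) for param, s in selected.items()},
    }


candidates = {
    "dst": [
        {(0, "aa = alloca"), (2, "store ab, aa")},                    # path 1
        {(0, "aa = alloca"), (1, "ab = load MEM"), (2, "store ab, aa")},  # path 2
    ],
    "src": [{(3, "ac = load MEM")}],
}
tree = combine("sprintf", select_largest(candidates))
```

Here the second path for `dst` is kept because it carries one more dependent instruction than the first.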

[0071] As one possible implementation method, the method for generating the slice set of parameters specifically includes:

[0072] Suppose a dangerous function carries a parameter A, and there exists an instruction n that assigns a value to the parameter variable A: A = *B + C. The slice set of variable A in prede(n), written $S_{\langle prede(n),\,A\rangle}$, is expressed as:

[0073]-[0075]

$$S_{\langle prede(n),\,A\rangle} = \{n\} \;\cup \bigcup_{V_i \in refs(n)} S_{\langle prede(n),\,V_i\rangle} \;\cup \bigcup_{V_i \in points(n)} S_{\langle prede(n),\,V_i\rangle}$$

[0076] where the function parameter A is the starting variable for slice generation, prede(n) is the set of predecessor instructions of instruction n, and $S_{\langle prede(n),\,V_i\rangle}$ is the slice set of variable $V_i$ in prede(n).

[0077] refs(n) is the set of basic-type variables read by instruction n (e.g., {B, C}), and points(n) is the set of addresses that instruction n can point to (e.g., {B, C}). Furthermore, if prede(n) contains an instruction that writes to the content pointed to by a pointer-type variable in instruction n, then the variable storing that content is an address that instruction n can point to. A pointer-type variable is a variable that stores a memory address; a basic-type variable is a variable that stores a basic data type (such as an integer, floating-point, character, or Boolean value).

[0078] As the above formula shows, the parameter slice set is the union of three instruction sets: the current instruction n; the slice sets of the basic-type variables referenced in instruction n; and the slice sets of the content pointed to by the pointer-type variables in instruction n.
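
The recursive definition can be illustrated on a single simple path. The sketch below is an assumption-laden simplification: instructions are plain records, the content pointed to by B is modelled as a location named `*B`, and the defining instruction of a variable is taken to be its nearest earlier write on the path. None of these representation choices come from the patent.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Instr:
    """A simplified instruction (hypothetical representation)."""
    text: str
    dest: Optional[str] = None      # variable or memory location written
    refs: Tuple[str, ...] = ()      # refs(n): variables read
    points: Tuple[str, ...] = ()    # points(n): pointed-to locations read


def slice_set(path: List[Instr], end: int, var: str) -> Set[Instr]:
    """Slice of `var` among path[:end], following the three-way union.

    Finds the nearest earlier instruction n that writes `var`, then returns
    {n} united with the slices of everything in refs(n) and points(n),
    searched only among the predecessors of n.
    """
    for i in range(end - 1, -1, -1):
        n = path[i]
        if n.dest == var:
            result = {n}
            for v in n.refs + n.points:
                result |= slice_set(path, i, v)
            return result
    return set()  # no definition on this path


path = [
    Instr("B = alloca", dest="B"),
    Instr("store 7, B", dest="*B", refs=("B",)),
    Instr("C = 4", dest="C"),
    Instr("D = 9", dest="D"),  # unrelated to A, must not appear in the slice
    Instr("A = *B + C", dest="A", refs=("B", "C"), points=("*B",)),
    Instr("call sprintf(A)", refs=("A",)),
]
param_slice = slice_set(path, len(path) - 1, "A")
```

The recursion terminates because each nested call searches a strictly shorter prefix of the path.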

[0079] Each parameter yields a slice set like the one shown in Figure 5. If the parameter originates from a memory variable (such as a temporary variable that resides in memory), the slice set contains both the source of the address at which the variable is stored (aa = call malloc(ab) in Figure 5) and the source of the variable's value (ae = add ad, -176 in Figure 5).

[0080] S203: Calculate the semantic vector of the parameter slice for each dangerous function union;

[0081] Specifically, as shown in Figure 6, this includes: using a tree depth-first traversal algorithm to convert the parameter slice of each dangerous function from a tree structure into a sequential structure, and naming the sequential structure a parameter sequence;

[0082] Normalize each instruction in the parameter sequence in sequence, including normalizing the variables, addresses and binary function names in the instructions;

[0083] It's important to note that instructions contain many variables, function calls, and addresses. If these instructions are directly converted into a vector dictionary, the numerous sparse words they contain will negatively impact accuracy. Therefore, before using the Word2vec model to convert instructions into a vector dictionary, they need to be normalized. For example, all variables in the instructions should be numbered, such as aa, ab, ..., ba, bb, ... according to their order of appearance. Addresses and binary function names should also be normalized, such as replacing addresses with MEM and binary function names with FUNC.
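
A minimal sketch of this normalization follows. It assumes LLVM-IR-like text in which variables start with `%`, function names start with `@`, and addresses are hexadecimal literals; those syntactic conventions and the choice to replace every `@name` with FUNC are assumptions of the example. The two-letter naming scheme supports at most 676 distinct variables per sequence.

```python
import re
from itertools import product
from string import ascii_lowercase
from typing import Dict, Iterator, List


def _fresh_names() -> Iterator[str]:
    """Yield aa, ab, ..., az, ba, bb, ... in order."""
    for first, second in product(ascii_lowercase, repeat=2):
        yield first + second


def normalize(sequence: List[str]) -> List[str]:
    """Normalize function names, addresses, and variables in a parameter sequence."""
    names = _fresh_names()
    mapping: Dict[str, str] = {}

    def rename(match: "re.Match") -> str:
        # Variables are numbered by order of first appearance in the sequence.
        variable = match.group(0)
        if variable not in mapping:
            mapping[variable] = next(names)
        return mapping[variable]

    normalized = []
    for instr in sequence:
        instr = re.sub(r"@[\w.]+", "FUNC", instr)        # function names
        instr = re.sub(r"0x[0-9a-fA-F]+", "MEM", instr)  # addresses
        instr = re.sub(r"%[\w.]+", rename, instr)        # variables
        normalized.append(instr)
    return normalized


raw = ["%5 = call @helper(%3)", "store %5, 0x4010a0", "%7 = add %5, -176"]
clean = normalize(raw)
```

After normalization, two slices that differ only in register numbering or load addresses map to the same token sequence, which keeps the word2vec vocabulary small.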

[0084] Each instruction $i_i$ is treated as a word and each parameter sequence is treated as a sentence, and the Word2vec model is used to obtain the word embedding vector $I_i$ of each instruction in the parameter slice. The word embedding vectors of the instructions in the parameter slice are used, in order, as the row vectors of a word embedding matrix, giving the word embedding matrix corresponding to the parameter slice. The word embedding matrix is used as the input of the trained neural network model, and the output of the neural network model is the semantic vector V of the parameter slice of the function union.

[0085] As shown in Figure 7, two parameter sequences are simultaneously input into a Siamese network containing a bidirectional LSTM for learning, resulting in the neural network model.

[0086] Specifically, the source code of the same function in the same file is compiled using different compilers and optimization levels. The corresponding dangerous function unions extracted from the resulting binary functions form pairs whose label is set to 1. Pairs of different dangerous function unions are selected at random, and their label is set to 0. Each dangerous function union in a pair is processed by the bidirectional LSTM network to calculate a semantic vector V, and the two vectors are compared. The comparison result is then compared with the label to calculate the loss. The model parameters are adjusted and the process is iterated, completing gradient descent.

[0087] For example, multiple open-source software such as OpenSSL and HTTPD can be compiled with different optimization levels, and their labels can be determined by the file source, thus forming a training set.
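
The labelling scheme for the training set can be sketched as follows. The `samples` layout (one key per source-level dangerous call, one parameter sequence per build configuration), the one-negative-per-positive ratio, and the fixed random seed are assumptions of this illustration; the network and loss are not shown.

```python
import random
from itertools import combinations
from typing import Dict, List, Tuple

Sequence_ = List[str]
Key = Tuple[str, str]  # (source file, function) identifying one dangerous call site


def build_pairs(samples: Dict[Key, List[Sequence_]], seed: int = 0):
    """Build (sequence_a, sequence_b, label) training pairs.

    Label 1: two builds of the same source-level function union.
    Label 0: function unions drawn from different sources at random.
    """
    rng = random.Random(seed)
    keys = sorted(samples)
    pairs = []
    for key in keys:
        others = [k for k in keys if k != key]
        for a, b in combinations(samples[key], 2):
            pairs.append((a, b, 1))
            if others:
                negative = rng.choice(samples[rng.choice(others)])
                pairs.append((a, negative, 0))
    return pairs


samples = {
    ("ssl_lib.c", "load_cert"): [["aa = load MEM"], ["aa = load MEM", "ab = add aa, 4"]],
    ("http_core.c", "parse_uri"): [["aa = call FUNC(ab)"], ["aa = call FUNC(ab)", "ac = add aa, 1"]],
}
pairs = build_pairs(samples)
```

The file and function names above are placeholders rather than real identifiers from OpenSSL or HTTPD.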

[0088] Step S204 is the same as step S104 in Example 1, and will not be repeated here.

[0089] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A binary code vulnerability detection method based on dangerous function parameter dependency, characterized in that it includes:

Step 1: Given an unknown binary file, obtain all binary functions in the unknown binary file, and then extract all dangerous functions from each binary function;

Step 2: Extract the function union of each dangerous function and name it the dangerous function union. The dangerous function union consists of the function name of the dangerous function and the parameter slice. The parameter slice refers to the set of instructions in the binary function that form a data dependency relationship with the dangerous function. In step 2, extracting the function union of each dangerous function specifically includes: obtaining all simple paths in the binary function control flow graph (CFG) and tracing all parameters of dangerous functions along these simple paths; according to the parameter slice generation method, obtaining the set of all slices for each parameter and, for each parameter, retaining only the slice set with the most elements for the next operation. The specific method for generating the parameter slice set includes: suppose a dangerous function carries a parameter A, and there exists an instruction n that assigns a value to the parameter variable A; its slice set $S_{\langle prede(n),\,A\rangle}$ is defined as

$$S_{\langle prede(n),\,A\rangle} = \{n\} \;\cup \bigcup_{V_i \in refs(n)} S_{\langle prede(n),\,V_i\rangle} \;\cup \bigcup_{V_i \in points(n)} S_{\langle prede(n),\,V_i\rangle}$$

where the function parameter A is the starting variable for slice generation, prede(n) is the set of predecessor instructions of instruction n, $S_{\langle prede(n),\,V_i\rangle}$ is the slice set of variable $V_i$ in prede(n), refs(n) is the set of basic-type variables in instruction n, and points(n) is the set of pointer-type variables in instruction n; the slice set of each parameter is taken as a branch of a tree, and the parameter slice of the dangerous function is obtained by combining all branches;

Step 3: Calculate the semantic vector of the parameter slice for each dangerous function union. Step 3 specifically includes: using a tree depth-first traversal algorithm to transform the parameter slice of each dangerous function from a tree structure into a sequential structure, and naming the sequential structure the parameter sequence; treating each instruction as a word and each parameter sequence as a sentence, and using the word2vec model to obtain the word embedding vector of each instruction in the parameter sequence; taking the word embedding vectors of the instructions in the parameter sequence, in order, as the row vectors of a word embedding matrix, thereby obtaining the word embedding matrix corresponding to the parameter sequence; and using the word embedding matrix as the input to a trained neural network model, the output of which is the semantic vector of the parameter slice; wherein two parameter sequences are simultaneously input into a Siamese network containing a bidirectional LSTM for learning, and the neural network model is thereby obtained;

Step 4: Compare the function name and semantic vector of the corresponding parameter slice of the dangerous function union with the function name and semantic vector in the pre-built vulnerability library to obtain the vulnerability report of the unknown binary file.

2. The binary code vulnerability detection method based on dangerous function parameter dependency according to claim 1, characterized in that, after converting the parameter slice of each dangerous function from a tree structure to a sequential structure using the depth-first traversal algorithm, the method further includes: normalizing each instruction in the parameter sequence, including normalizing the variables, addresses, and binary function names in the instruction.

3. The binary code vulnerability detection method based on dangerous function parameter dependency according to claim 1, characterized in that step 1 further includes: performing data preprocessing on all binary functions.

4. The binary code vulnerability detection method based on dangerous function parameter dependency according to claim 3, characterized in that the data preprocessing for all binary functions specifically includes: converting each binary function into several assembly instructions using a disassembler; converting these assembly instructions into several corresponding VEX IR instructions; and finally converting these VEX IR instructions into several corresponding LLVM IR instructions.

5. The binary code vulnerability detection method based on dangerous function parameter dependency according to claim 4, characterized in that it further includes: using a compiler backend optimizer to optimize the LLVM IR instructions obtained from the conversion.