A method, system, device, and medium for adversarial program obfuscation
By fusing target code and adversarial code into an intermediate language to generate an executable file, the problem of obfuscation methods being easily identifiable and unable to withstand artificial intelligence analysis in existing technologies is solved, thereby improving the effectiveness of code obfuscation and software security.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SICHUAN UNIV
- Filing Date
- 2023-09-26
- Publication Date
- 2026-06-30
Figure CN117311715B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of software security technology, and in particular to an adversarial program obfuscation method, system, device, and medium.
Background Technology
[0002] Software obfuscation is a technique that uses the transformation and reconstruction of binary code to hide the true intent and implementation details of software. This technique is widely used in the field of software security to prevent attacks such as reverse engineering, decompilation, and code analysis.
[0003] Control flow obfuscation is a common software obfuscation technique that randomizes the control flow of a program, thereby increasing its security and making it difficult for attackers to understand the program's execution flow. Control flow obfuscation is typically implemented by inserting randomized jump statements between basic blocks of the program, thus altering the execution order. A basic block is a contiguous sequence of instructions with no internal branches: control flow enters at its first instruction and leaves at its last. Commonly used control flow obfuscation techniques include randomizing conditional statements, randomizing jump statements, flattening the control flow, and randomizing logical expressions to make the program's logic difficult for attackers to understand. During randomization, the original program's real blocks are often cloned, and some instructions are randomly replaced to generate branch blocks that will never be executed. This approach generates a large number of junk instructions, making it difficult for attackers to reverse engineer the program's logic.
[0004] String obfuscation increases the difficulty of reverse engineering and improves software security by using techniques such as encryption, splitting, embedding, and dynamic decryption of readable strings.
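As a non-limiting illustration of how encrypting readable strings changes the statistical profile of program data, the following sketch compares the Shannon entropy of readable text with that of the same bytes after encryption. The cipher used here (`xor_stream_encrypt`, a seeded pseudo-random XOR stream) and the sample string are assumptions made for the example only and are not taken from the source.

```python
import math
import random
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def xor_stream_encrypt(data: bytes, seed: int) -> bytes:
    """Toy stand-in cipher: XOR each byte with a seeded pseudo-random byte."""
    rng = random.Random(seed)
    return bytes(b ^ rng.randrange(256) for b in data)


plain = b"Failed to open configuration file; please check the path. " * 40
cipher = xor_stream_encrypt(plain, seed=1234)

plain_entropy = shannon_entropy(plain)    # readable text: low entropy
cipher_entropy = shannon_entropy(cipher)  # ciphertext: close to 8 bits per byte
```

A data section made up entirely of ciphertext therefore stands out from ordinary program data, which is one way an analyst could recognize that string encryption was applied.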
[0005] However, due to the random and uncontrollable nature of control flow obfuscation in existing technologies, some features are repeated. Furthermore, in the string obfuscation process, all readable strings are obfuscated and encrypted into ciphertext, resulting in a distinct characteristic in the program's information entropy. Therefore, current obfuscation schemes make it easy for humans to determine the obfuscation method during reverse engineering and are unable to effectively counter feature and similarity analysis by artificial intelligence.
Summary of the Invention
[0006] The purpose of this invention is to provide an adversarial program obfuscation method, system, device, and medium, which improves the effectiveness of code obfuscation and thus improves software security.
[0007] To achieve the above objectives, the present invention provides the following solution:
[0008] An adversarial program obfuscation method includes:
[0009] A code compiler is used to convert the target code into a target code intermediate language;
[0010] Adversarial code is randomly selected from a set of adversarial code; the set of adversarial code is a collection of open-source code filtered based on usage features and code features.
[0011] A code compiler is used to convert the selected adversarial code into an adversarial code intermediate language;
[0012] The target code intermediate language and the adversarial code intermediate language are fused to obtain a fused intermediate language;
[0013] The fused intermediate language is compiled using the code compiler to generate an executable file.
[0014] Optionally, the construction of the adversarial code set includes:
[0015] Download open-source code repositories from the internet;
[0016] Discard any open-source code repositories downloaded that do not belong to the target code language or are different from the target code compilation platform, and obtain the first code set;
[0017] The first characteristic of each open-source code repository in the first code set is calculated based on usage characteristics; the usage characteristics include the number of people who have collected the code, the number of people who have copied it, and the number of times the code has been updated.
[0018] Calculate the second feature of each open-source code repository in the first code set based on each code feature;
[0019] Discard open-source code repositories in the first code set whose first feature is lower than a first set value or whose second feature is lower than a second set value, to obtain a second code set;
[0020] Add the second code set to the adversarial code set.
[0021] Optionally, the code features include an index of the number of readable strings, an index of the number of array variables, an index of the number of decision branches, an index of function call complexity, and an index of code dependencies.
[0022] Optionally, both the target code intermediate language and the adversarial code intermediate language are LLVM IR.
[0023] Optionally, the target code intermediate language and the adversarial code intermediate language are fused to obtain a fused intermediate language, specifically including:
[0024] For each string in the target code intermediate language: randomly select a string from the adversarial code intermediate language as a key, encrypt the string in the target code intermediate language using a preset encryption algorithm, replace the string in the target code intermediate language with the encryption result, and obtain the first fusion result;
[0025] For each array variable in the first fusion result: randomly select an array variable from the adversarial code intermediate language as a key, encrypt the array variable in the first fusion result using a preset encryption algorithm, and replace the array variable in the first fusion result with the encrypted result to obtain the second fusion result;
[0026] Insert a decryption function corresponding to the preset encryption algorithm at a random position in the second fusion result;
[0027] By traversing each instruction in the second fusion result after inserting the decryption function, the call position of each variable in the second fusion result after inserting the decryption function is obtained, and each call position is replaced with the decryption function to obtain the third fusion result;
[0028] The instructions in the third fusion result are traversed by an iterator to construct the control flow graph of the third fusion result and obtain the basic blocks corresponding to the control flow graph. The basic blocks corresponding to the control flow graph are recorded as the first basic blocks.
[0029] Randomly select multiple first basic blocks, insert a conditional branch that is never reachable between each selected first basic block, and insert a second basic block after the conditional branch that is never reachable. The second basic block is an adversarial code intermediate language basic block. The conditional branch that is never reachable is a placeholder that is always false or always true, denoted as a comparison placeholder.
[0030] Generate a mixed comparison formula with 1 to 5 random numbers by repeatedly generating constants, operators, Boolean operations, and comparison operators;
[0031] The mixed comparison formula with 1 to 5 random numbers is combined with AND logic and OR logic to generate a predicate corresponding to the comparison placeholder; and the comparison placeholder is replaced with the predicate to obtain the fused intermediate language.
[0032] Optionally, the preset encryption algorithm is one of XOR encryption, shift encryption, and AES encryption.
[0033] This invention also discloses an adversarial program obfuscation system, comprising:
[0034] The target code conversion module is used to convert target code into target code intermediate language using a code compiler.
[0035] The adversarial code selection module is used to randomly select adversarial codes from the set of adversarial codes;
[0036] The adversarial code conversion module is used to convert selected adversarial code into an adversarial code intermediate language using a code compiler;
[0037] A code fusion processor is used to fuse the target code intermediate language and the adversarial code intermediate language to obtain a fused intermediate language;
[0038] An executable file generation module is used to compile the fused intermediate language using the code compiler to generate an executable file.
[0039] The present invention also discloses an electronic device, characterized in that it includes a memory and a processor, the memory being used to store a computer program, and the processor running the computer program to cause the electronic device to perform the adversarial program obfuscation method according to the invention.
[0040] The present invention also discloses a computer-readable storage medium, characterized in that it stores a computer program, which, when executed by a processor, implements the adversarial program obfuscation method as described above.
[0041] According to specific embodiments provided by the present invention, the present invention discloses the following technical effects:
[0042] This invention discloses an adversarial program obfuscation method, which randomly selects adversarial code from a set of adversarial code; the set of adversarial code is a collection of open-source code filtered based on usage characteristics and code characteristics; the intermediate language of the target code and the intermediate language of the adversarial code are fused to obtain a fused intermediate language; a code compiler is used to compile the fused intermediate language to generate an executable file, which improves the randomness of the adversarial code, improves the effectiveness of code obfuscation, increases the difficulty of reverse engineering, and improves the security of the software.
Description of the Drawings
[0043] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0044] Figure 1 is a schematic flowchart of an adversarial program obfuscation method provided in an embodiment of the present invention;
[0045] Figure 2 is a simplified flowchart of an adversarial program obfuscation method provided in an embodiment of the present invention;
[0046] Figure 3 is a schematic diagram of the adversarial code set construction process provided in an embodiment of the present invention;
[0047] Figure 4 is a schematic diagram of the structure of the code fusion processor provided in an embodiment of the present invention;
[0048] Figure 5 is a schematic diagram of the variable obfuscation process of the code fusion processor provided in an embodiment of the present invention;
[0049] Figure 6 is a schematic diagram of the control flow obfuscation process of the code fusion processor provided in an embodiment of the present invention;
[0050] Figure 7 is a schematic diagram of an adversarial program obfuscation system provided in an embodiment of the present invention.
Detailed Implementation
[0051] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0052] The purpose of this invention is to provide an adversarial program obfuscation method, system, device, and medium, which improves the effectiveness of code obfuscation.
[0053] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0054] Example 1
[0055] As shown in Figure 1 and Figure 2, this embodiment provides an adversarial program obfuscation method, which includes the following steps.
[0056] Step 101: Use a code compiler to convert the target code into target code intermediate language.
[0057] Step 102: Randomly select adversarial code from the adversarial code set; the adversarial code set is a collection of open-source code filtered based on usage features and code features.
[0058] As a specific implementation method, the target code is C++.
[0059] As shown in Figure 3, the construction of the adversarial code set includes:
[0060] Use a code collector to download open-source code repositories from the internet. Specifically, a code collector is a code crawler. The downloaded open-source code repositories are then filtered to obtain a set of adversarial code.
[0061] The code language is determined by the file extension of the open-source code repository file, and files that are not written in the target language are discarded. Platform-specific libraries and frameworks, as well as platform-specific application programming interfaces (APIs), used in the code are obtained through description files, build scripts, and the actual code; open-source code repositories that are not compatible with the target code's platform are discarded. More specifically, this includes:
[0062] Discard any downloaded open-source code repositories that are not written in the target code language or that target a different compilation platform from the target code, to obtain the first code set.
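As a non-limiting illustration of the extension-based language check described above, the following sketch keeps only target-language files and drops repositories that contain none. The extension set `CPP_EXTENSIONS`, the repository names, and the representation of a repository as a list of file paths are assumptions for the example; the platform compatibility check is not modelled.

```python
from pathlib import PurePosixPath

# Assumed extension set for a C++ target; the source does not list extensions.
CPP_EXTENSIONS = {".cpp", ".cc", ".cxx", ".hpp", ".hh", ".h"}


def keep_target_language_files(paths, extensions=CPP_EXTENSIONS):
    """Keep only files whose extension marks them as target-language code."""
    return [p for p in paths if PurePosixPath(p).suffix.lower() in extensions]


def build_first_code_set(repos, extensions=CPP_EXTENSIONS):
    """Drop repositories left with no target-language files after filtering."""
    first_set = {}
    for name, paths in repos.items():
        kept = keep_target_language_files(paths, extensions)
        if kept:
            first_set[name] = kept
    return first_set


# Hypothetical downloaded repositories, each given as a list of file paths.
repos = {
    "json-parser": ["src/parser.cpp", "include/parser.hpp", "README.md"],
    "web-ui": ["index.js", "style.css"],
}
first_code_set = build_first_code_set(repos)
```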
[0063] The first feature of each open-source code repository in the first code set is calculated based on the usage features; the usage features include the number of people who have collected the code, the number of people who have copied it, and the number of times the code has been updated.
[0064] The formula for calculating the first feature is:
[0065]
[0066] Where score1 represents the first feature; x_A is the number of people who have collected the repository, x_B is the number of copies, x_C is the number of code updates, and x_D is the number of times the repository is referenced by other code repositories; f(x_A), f(x_B), f(x_C), and f(x_D) are the values of x_A, x_B, x_C, and x_D mapped to the interval [0, 1]; and w_A, w_B, w_C, and w_D are the weights of the number of collectors, the number of copies, the number of code updates, and the number of references, respectively.
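The formula itself is not reproduced in this text. As a non-limiting illustration consistent with the symbol definitions above, the following sketch assumes that score1 is a weighted sum of the mapped usage counts. The mapping function `squash`, its half-saturation constants, and the weight values are all hypothetical choices made for the example.

```python
def squash(x: float, half: float) -> float:
    """Assumed mapping f: send a non-negative count into [0, 1); 0.5 at x == half."""
    return x / (x + half)


def score1(collectors, copies, updates, references,
           weights=(0.4, 0.3, 0.2, 0.1)):
    """Assumed form: w_A*f(x_A) + w_B*f(x_B) + w_C*f(x_C) + w_D*f(x_D)."""
    w_a, w_b, w_c, w_d = weights
    return (w_a * squash(collectors, 1000)
            + w_b * squash(copies, 200)
            + w_c * squash(updates, 500)
            + w_d * squash(references, 100))


unused_repo = score1(0, 0, 0, 0)
typical_repo = score1(1000, 200, 500, 100)
popular_repo = score1(50000, 8000, 12000, 3000)
```

Because the assumed weights sum to 1 and each mapped value lies in [0, 1), the score also lies in [0, 1), which makes it directly comparable with a fixed set value.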
[0067] Calculate the second feature of each open-source code repository in the first code set based on each code feature.
[0068] The code characteristics include the index of the number of readable strings, the index of the number of array variables, the index of the number of decision branches, the index of function call complexity, and the index of code dependencies.
[0069] The formula for calculating the second feature is expressed as follows:
[0070]
[0071] Where score2 represents the second feature; IF is the weighted score of all features; N_rs is the score for the number of readable strings, N_fv is the score for the number of array variables, N_dc is the score for the number of decision branches, CCF_fc is the score for function call complexity, and CCF_cd is the score for code dependencies; k1, k2, k3, k4, and k5 are the weight coefficients corresponding to N_rs, N_fv, N_dc, CCF_fc, and CCF_cd, respectively; IF_min is the minimum weighted score; and IF_max is the maximum weighted score.
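The formula itself is not reproduced in this text. As a non-limiting illustration consistent with the symbol definitions above, the following sketch assumes that IF is a weighted sum of the five code feature scores and that score2 is IF normalized between IF_min and IF_max over the first code set. The weight values and the sample feature scores are hypothetical.

```python
def weighted_feature_score(features, k=(0.3, 0.2, 0.2, 0.15, 0.15)):
    """Assumed IF: k1*N_rs + k2*N_fv + k3*N_dc + k4*CCF_fc + k5*CCF_cd."""
    return sum(ki * fi for ki, fi in zip(k, features))


def score2_all(repo_features):
    """Assumed score2: (IF - IF_min) / (IF_max - IF_min) for each repository."""
    raw = {name: weighted_feature_score(f) for name, f in repo_features.items()}
    if_min, if_max = min(raw.values()), max(raw.values())
    if if_max == if_min:
        return {name: 0.0 for name in raw}  # all repositories score the same
    return {name: (v - if_min) / (if_max - if_min) for name, v in raw.items()}


# Hypothetical (N_rs, N_fv, N_dc, CCF_fc, CCF_cd) scores per repository.
repo_features = {
    "a": (10, 5, 20, 3.0, 2.0),
    "b": (50, 30, 80, 9.0, 6.0),
    "c": (25, 10, 40, 5.0, 3.0),
}
scores = score2_all(repo_features)
```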
[0072] Discard open-source code repositories in the first code set whose first feature is lower than a first preset value or whose second feature is lower than a second preset value (corresponding to the decision "whether the score is greater than the set value" in Figure 3), to obtain the second code set.
[0073] As a specific implementation method, codes with scores greater than a set value undergo a second manual screening (randomly selecting adversarial codes based on a manually set code structure) to obtain a second code set.
[0074] The second code set is added to the adversarial code set. The code in the adversarial code set has obvious identifiable features.
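As a non-limiting illustration of the threshold filtering described above, the following sketch keeps a repository only when neither of its two scores falls below its set value. The repository names, scores, and set values are hypothetical, and the optional manual screening step is not modelled.

```python
def build_second_code_set(scored_repos, first_set_value, second_set_value):
    """Discard repositories whose first or second feature is below its set value."""
    return [name for name, (first, second) in scored_repos.items()
            if first >= first_set_value and second >= second_set_value]


# Hypothetical (first feature, second feature) pairs.
scored_repos = {
    "json-parser": (0.82, 0.64),
    "tiny-demo": (0.12, 0.70),   # rarely used: fails the first set value
    "net-lib": (0.75, 0.20),     # few usable code features: fails the second
    "img-codec": (0.66, 0.58),
}
second_code_set = build_second_code_set(scored_repos, 0.5, 0.5)

adversarial_code_set = []
adversarial_code_set.extend(second_code_set)
```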
[0075] Step 103: Use a code compiler to convert the selected adversarial code into an adversarial code intermediate language.
[0076] Both the target code intermediate language and the adversarial code intermediate language are LLVM IR.
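As a non-limiting illustration, when the code compiler is Clang (an assumption; the source names only LLVM IR), textual LLVM IR can be emitted with the `-S -emit-llvm` options. The following sketch only builds the command lines; the file names are hypothetical.

```python
def emit_llvm_ir_command(source_path, ir_path):
    """Build a clang++ command that lowers one C++ source file to textual LLVM IR."""
    return ["clang++", "-S", "-emit-llvm", source_path, "-o", ir_path]


# Steps 101 and 103: the target code and the selected adversarial code
# are lowered to the same intermediate language.
target_cmd = emit_llvm_ir_command("target.cpp", "target.ll")
adversarial_cmd = emit_llvm_ir_command("adversarial.cpp", "adversarial.ll")
```

Each command list can be passed to `subprocess.run(cmd, check=True)` on a machine where Clang is installed.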
[0077] Step 104: Fuse the target code intermediate language and the adversarial code intermediate language to obtain the fused intermediate language.
[0078] Step 104 specifically includes: using a code fusion processor to process strings, variables, and control flow to fuse the target code intermediate language and the adversarial code intermediate language. The principle of the code fusion processor is shown in Figure 4.
[0079] As shown in Figure 5 and Figure 6, step 104 specifically includes:
[0080] Iterate through the strings and array variables in the target code intermediate language.
[0081] For each string in the target code intermediate language: randomly select a string from the adversarial code intermediate language as a key, encrypt the string in the target code intermediate language using a preset encryption algorithm, replace the string in the target code intermediate language with the encryption result, and obtain the first fusion result.
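As a non-limiting illustration of this string step, the following sketch uses XOR encryption as the preset algorithm and draws each key at random from the strings of the adversarial code. Strings are modelled as plain byte strings rather than LLVM IR global constants, and all sample values are hypothetical.

```python
import random
from itertools import cycle


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR data with a repeating key; applying it twice restores the input."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


def encrypt_target_strings(target_strings, adversarial_strings, rng):
    """Encrypt each target string under a key drawn from the adversarial strings."""
    keys = [s for s in adversarial_strings if s]  # an empty key cannot be cycled
    encrypted = []
    for plain in target_strings:
        key = rng.choice(keys)
        encrypted.append((xor_bytes(plain, key), key))
    return encrypted


target_strings = [b"license check failed", b"/etc/app/secret.conf"]
adversarial_strings = [b"usage: %s [options] <file>", b"out of memory", b""]
first_fusion = encrypt_target_strings(
    target_strings, adversarial_strings, random.Random(7))
```

Because the key is itself a readable string that already belongs to the adversarial code, the fused program keeps ordinary-looking strings alongside the ciphertext.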
[0082] For each array variable in the first fusion result: randomly select an array variable from the adversarial code intermediate language as a key, encrypt the array variable in the first fusion result using a preset encryption algorithm, and replace the array variable in the first fusion result with the encrypted result to obtain the second fusion result.
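As a non-limiting illustration of this array step, the following sketch uses shift encryption as the preset algorithm, interpreted here (an assumption) as element-wise modular addition of a repeating key taken from an adversarial array. The 32-bit element width and all sample values are hypothetical.

```python
import random

WORD = 2 ** 32  # assumed 32-bit unsigned array elements


def shift_encrypt(values, key):
    """Add the repeating key to each element, wrapping at the word size."""
    return [(v + key[i % len(key)]) % WORD for i, v in enumerate(values)]


def shift_decrypt(values, key):
    """Subtract the repeating key from each element, wrapping at the word size."""
    return [(v - key[i % len(key)]) % WORD for i, v in enumerate(values)]


rng = random.Random(11)
adversarial_arrays = [[3, 1, 4, 1, 5], [0x9E3779B9, 0x7F4A7C15]]
target_array = [10, 20, 30, 4294967295]

key = rng.choice(adversarial_arrays)             # key comes from the adversarial code
second_fusion_array = shift_encrypt(target_array, key)
```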
[0083] Insert a decryption function corresponding to the preset encryption algorithm at a random position in the second fusion result.
[0084] By traversing each instruction in the second fusion result after inserting the decryption function, the call position of each variable in the second fusion result after inserting the decryption function is obtained, and each call position is replaced with the decryption function to obtain the third fusion result.
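As a non-limiting illustration of this rewriting step, the following sketch walks a toy instruction list and replaces every load of an encrypted variable with a call to the decryption function. The tuple-based instruction format, the function name `decrypt`, and the variable names are hypothetical simplifications and do not reflect the LLVM API.

```python
def wrap_uses_with_decrypt(instructions, encrypted_keys, decrypt_name="decrypt"):
    """Replace each load of an encrypted variable with a decryption call."""
    rewritten = []
    for op, dest, operand in instructions:
        if op == "load" and operand in encrypted_keys:
            key_name = encrypted_keys[operand]
            rewritten.append(("call", dest, (decrypt_name, operand, key_name)))
        else:
            rewritten.append((op, dest, operand))
    return rewritten


instructions = [
    ("load", "%0", "@msg"),      # use of an encrypted string
    ("load", "%1", "@counter"),  # ordinary variable, left untouched
    ("load", "%2", "@table"),    # use of an encrypted array
]
# Maps each encrypted variable to the adversarial variable used as its key.
encrypted_keys = {"@msg": "@adv_str_3", "@table": "@adv_arr_1"}
third_fusion = wrap_uses_with_decrypt(instructions, encrypted_keys)
```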
[0085] The code of the third fusion result is accessed through the LLVM analysis framework. All instructions and functions are analyzed to construct the control flow graph of the third fusion result and obtain the basic blocks corresponding to the control flow graph. The basic blocks corresponding to the control flow graph are recorded as the first basic blocks.
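As a non-limiting illustration of control flow graph construction, the following sketch derives the successor list of each basic block from the `label %name` operands of its terminator instruction. It is a simplified textual model that handles only `br` and `ret` terminators; an actual implementation would use the LLVM analysis framework mentioned above. The block names and instructions are hypothetical.

```python
import re


def build_cfg(blocks):
    """Map each basic block name to the blocks its terminator can jump to."""
    cfg = {}
    for name, instructions in blocks.items():
        terminator = instructions[-1]  # a basic block ends with its terminator
        cfg[name] = re.findall(r"label %([\w.]+)", terminator)
    return cfg


# Hypothetical function body, one list of textual IR instructions per block.
blocks = {
    "entry": ["%c = icmp sgt i32 %n, 0", "br i1 %c, label %then, label %else"],
    "then": ["br label %exit"],
    "else": ["br label %exit"],
    "exit": ["ret i32 0"],
}
cfg = build_cfg(blocks)
first_basic_blocks = list(cfg)
```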
[0086] The iterator traverses all first basic blocks and randomly selects multiple first basic blocks. It inserts an unreachable conditional branch between each selected first basic block and inserts a second basic block after the unreachable conditional branch. The second basic block is an adversarial code intermediate language basic block. The unreachable conditional branch is a placeholder for always false (1!=1) or always true (1==1), denoted as a comparison placeholder.
[0087] The basic blocks of the adversarial code intermediate language are the basic blocks that constitute the adversarial code intermediate language.
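As a non-limiting illustration of this insertion step, the following sketch works on a toy control flow graph in which each block maps to a list of successors. After each selected block it inserts a guard holding the always-false placeholder `1 != 1`, with an adversarial block on the branch that is never taken. The block names, the restriction to single-successor blocks, and the helper `executed_path` are assumptions for the example.

```python
import random


def insert_bogus_branches(cfg, adversarial_blocks, rng, count):
    """Insert never-taken branches into adversarial blocks after selected blocks."""
    new_cfg = {name: list(succs) for name, succs in cfg.items()}
    placeholders, bogus_bodies = {}, {}
    candidates = [name for name, succs in cfg.items() if len(succs) == 1]
    chosen = rng.sample(candidates, min(count, len(candidates)))
    for i, name in enumerate(chosen):
        real_next = new_cfg[name][0]
        guard, bogus = f"guard{i}", f"bogus{i}"
        new_cfg[name] = [guard]
        new_cfg[guard] = [bogus, real_next]  # [taken if true, taken if false]
        new_cfg[bogus] = [real_next]
        placeholders[guard] = "1 != 1"       # comparison placeholder, always false
        bogus_bodies[bogus] = rng.choice(adversarial_blocks)
    return new_cfg, placeholders, bogus_bodies


def executed_path(cfg, placeholders, start):
    """Follow the blocks that actually run, taking the false edge at every guard."""
    path, node = [], start
    while True:
        if node in placeholders:
            node = cfg[node][1]
            continue
        path.append(node)
        if not cfg[node]:
            return path
        node = cfg[node][0]


cfg = {"entry": ["work"], "work": ["exit"], "exit": []}
fused_cfg, placeholders, bogus_bodies = insert_bogus_branches(
    cfg, ["adv_block_a", "adv_block_b"], random.Random(3), count=2)
path = executed_path(fused_cfg, placeholders, "entry")
```

The executed path is unchanged by the insertion, while the graph seen by a static analyzer now contains the adversarial blocks.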
[0088] A mixed comparison expression with 1 to 5 random numbers is generated by repeatedly generating constants, operators, Boolean operations, and comparison operators. For example: if((x&y)+(x^y)==0||(x|y)==0), where x and y stand for generated constants or for sub-expressions built from operators, Boolean operations, and comparison operators.
[0089] The mixed comparison formula with 1 to 5 random numbers is combined with AND logic and OR logic to generate a predicate corresponding to the comparison placeholder; and the comparison placeholder is replaced with the predicate to obtain the fused intermediate language.
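As a non-limiting illustration of predicate generation, the following sketch builds an always-false predicate from 1 to 5 clauses joined at random with AND or OR. It relies on the identity (x & y) + (x ^ y) == (x | y), so each clause is false whenever x and y are nonzero. The clause templates, the constant range, and the C-style output text are assumptions for the example; with literal constants a compiler could fold such a predicate away, so a practical implementation would likely take x and y from runtime values.

```python
import random

# Each entry pairs C-style clause text with a function computing its truth value.
CLAUSES = [
    ("(({x} & {y}) + ({x} ^ {y}) == 0)", lambda x, y: (x & y) + (x ^ y) == 0),
    ("(({x} | {y}) == 0)", lambda x, y: (x | y) == 0),
]


def make_false_predicate(rng):
    """Return (text, value) for a random predicate that always evaluates to false."""
    text, value = None, None
    for _ in range(rng.randint(1, 5)):
        x, y = rng.randint(1, 2 ** 16), rng.randint(1, 2 ** 16)  # nonzero constants
        template, evaluate = rng.choice(CLAUSES)
        clause_text, clause_value = template.format(x=x, y=y), evaluate(x, y)
        if text is None:
            text, value = clause_text, clause_value
        elif rng.random() < 0.5:
            text, value = f"({text} && {clause_text})", value and clause_value
        else:
            text, value = f"({text} || {clause_text})", value or clause_value
    return text, value


predicate_text, predicate_value = make_false_predicate(random.Random(42))
```

Since every clause is false, any AND or OR combination of the clauses is false as well, so the generated text can replace an always-false comparison placeholder without changing program behavior.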
[0090] The preset encryption algorithms include, but are not limited to, XOR encryption, shift encryption, and AES encryption.
[0091] The number of adversarial code blocks inserted by the code fusion processor is randomly controlled by the program.
[0092] Step 105: Compile the fused intermediate language using the code compiler to generate an executable file.
[0093] Specifically, step 105 includes: compiling the final intermediate language generated by the code fusion processor into an executable file for the corresponding runtime platform using a code compiler.
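As a non-limiting illustration of step 105, when the code compiler is Clang (an assumption), the fused LLVM IR file can be passed directly to the compiler driver, optionally with a `--target=` triple for the runtime platform. The following sketch only builds the command line; the file names and the triple are hypothetical.

```python
def compile_fused_ir_command(fused_ir_path, output_path, target_triple=None):
    """Build a clang++ command that compiles fused LLVM IR into an executable."""
    command = ["clang++", fused_ir_path, "-o", output_path]
    if target_triple is not None:
        command.insert(1, f"--target={target_triple}")
    return command


native_cmd = compile_fused_ir_command("fused.ll", "app")
cross_cmd = compile_fused_ir_command("fused.ll", "app", "x86_64-pc-linux-gnu")
```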
[0094] As shown in Figure 2, the output of the code fusion processor is compiled into an executable file.
[0095] This invention provides an adversarial program obfuscation method and system to address the problems of random and uncontrollable instruction substitution, easily identifiable obfuscation methods, and ineffective resistance to feature and similarity analysis under artificial intelligence in existing technologies. The method uses a code collector to filter and download open-source code from the internet, and performs secondary filtering to obtain code with clearly identifiable features. A code compiler is used to convert the target code and adversarial code into LLVM IR intermediate language. A code fusion processor processes strings, variables, and control flow to fuse the intermediate languages of the target code and adversarial code. Finally, the code compiler compiles the fused intermediate language into an executable file. This method employs an adversarial code collector, an adversarial code selector, a code compiler, and a code fusion processor. The adversarial code collector crawls relevant code and filters code that meets the requirements to add it to the adversarial code set. The adversarial code selector selects relevant adversarial code during the compilation of the target code. The code compiler compiles the target code and adversarial code into LLVM IR intermediate language and generates an executable file. The code fusion processor includes a variable module and a control flow module, used to fuse the strings, array variables, and obfuscated control flow of the target code and adversarial code. Compared to existing technologies, this invention uses a combination of randomness and manual intervention to specifically address the uncontrollable randomness of instruction substitution and spurious jumps. 
By fusing and compiling code from the adversarial code set together with the target code, it effectively solves the problems that obfuscation methods are easily identified by humans and that existing obfuscation methods cannot withstand feature and similarity analysis under artificial intelligence.
[0096] Example 2
[0097] As shown in Figure 7, this embodiment provides an adversarial program obfuscation system, including:
[0098] The target code conversion module 201 is used to convert target code into target code intermediate language using a code compiler.
[0099] The adversarial code selection module 202 is used to randomly select adversarial codes from the set of adversarial codes.
[0100] The adversarial code conversion module 203 is used to convert selected adversarial code into adversarial code intermediate language using a code compiler.
[0101] The code fusion processor 204 is used to fuse the target code intermediate language and the adversarial code intermediate language to obtain a fused intermediate language.
[0102] The executable file generation module 205 is used to compile the fused intermediate language using the code compiler to generate an executable file.
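As a non-limiting illustration of how modules 201 to 205 could be composed, the following sketch wires them together as interchangeable callables. The class name `ObfuscationSystem`, its field names, and the stub functions used in the example are hypothetical; real modules would invoke the code compiler and the code fusion processor.

```python
import random
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class ObfuscationSystem:
    to_ir: Callable[[str], str]        # modules 201 and 203: code -> intermediate language
    fuse: Callable[[str, str], str]    # module 204: fuse the two intermediate languages
    build: Callable[[str], str]        # module 205: fused intermediate language -> executable
    adversarial_set: Sequence[str]
    rng: random.Random

    def run(self, target_code: str) -> str:
        target_ir = self.to_ir(target_code)
        adversarial_code = self.rng.choice(self.adversarial_set)  # module 202
        adversarial_ir = self.to_ir(adversarial_code)
        return self.build(self.fuse(target_ir, adversarial_ir))


# Stub modules that only record the order in which they were applied.
system = ObfuscationSystem(
    to_ir=lambda code: f"ir({code})",
    fuse=lambda a, b: f"fuse({a},{b})",
    build=lambda ir: f"exe[{ir}]",
    adversarial_set=["adv"],
    rng=random.Random(0),
)
result = system.run("target")
```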
[0103] Example 3
[0104] This example provides an electronic device, including a memory and a processor. The memory is used to store a computer program, and the processor runs the computer program to cause the electronic device to perform the adversarial program obfuscation method according to Example 1.
[0105] This example also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the adversarial program obfuscation method as described in Example 1.
[0106] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. For the systems disclosed in the embodiments, since they correspond to the methods disclosed in the embodiments, the descriptions are relatively simple; relevant parts can be referred to the method section.
[0107] This document uses specific examples to illustrate the principles and implementation methods of the present invention. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of the present invention. Furthermore, those skilled in the art will recognize that, based on the ideas of the present invention, there will be changes in the specific implementation methods and application scope. Therefore, the content of this specification should not be construed as a limitation of the present invention.
Claims
1. An adversarial program obfuscation method, characterized in that the method comprises:
using a code compiler to convert target code into a target code intermediate language;
randomly selecting adversarial code from an adversarial code set, the adversarial code set being a collection of open-source code filtered based on usage features and code features;
using a code compiler to convert the selected adversarial code into an adversarial code intermediate language;
fusing the target code intermediate language and the adversarial code intermediate language to obtain a fused intermediate language;
compiling the fused intermediate language using the code compiler to generate an executable file;
wherein the construction of the adversarial code set includes:
downloading open-source code repositories from the internet;
discarding any downloaded open-source code repositories that do not belong to the target code language or that differ from the target code compilation platform, to obtain a first code set;
calculating a first feature of each open-source code repository in the first code set based on usage features, the usage features including the number of people who have collected the code, the number of people who have copied it, and the number of times the code has been updated;
calculating a second feature of each open-source code repository in the first code set based on each code feature;
discarding open-source code repositories in the first code set whose first feature is lower than a first set value or whose second feature is lower than a second set value, to obtain a second code set;
adding the second code set to the adversarial code set;
and wherein fusing the target code intermediate language and the adversarial code intermediate language to obtain a fused intermediate language specifically includes:
for each string in the target code intermediate language: randomly selecting a string from the adversarial code intermediate language as a key, encrypting the string in the target code intermediate language using a preset encryption algorithm, and replacing the string in the target code intermediate language with the encryption result, to obtain a first fusion result;
for each array variable in the first fusion result: randomly selecting an array variable from the adversarial code intermediate language as a key, encrypting the array variable in the first fusion result using the preset encryption algorithm, and replacing the array variable in the first fusion result with the encrypted result, to obtain a second fusion result;
inserting a decryption function corresponding to the preset encryption algorithm at a random position in the second fusion result;
traversing each instruction in the second fusion result after the decryption function is inserted to obtain the call position of each variable, and replacing each call position with the decryption function, to obtain a third fusion result;
traversing the instructions in the third fusion result with an iterator to construct the control flow graph of the third fusion result and obtain the basic blocks corresponding to the control flow graph, the basic blocks corresponding to the control flow graph being recorded as first basic blocks;
randomly selecting multiple first basic blocks, inserting a conditional branch that is never reachable between each selected first basic block, and inserting a second basic block after the conditional branch that is never reachable, the second basic block being an adversarial code intermediate language basic block, and the conditional branch that is never reachable being a placeholder that is always false or always true, denoted as a comparison placeholder;
generating a mixed comparison formula with 1 to 5 random numbers by repeatedly generating constants, operators, Boolean operations, and comparison operators;
combining the mixed comparison formula with 1 to 5 random numbers using AND logic and OR logic to generate a predicate corresponding to the comparison placeholder, and replacing the comparison placeholder with the predicate to obtain the fused intermediate language.
2. The adversarial program obfuscation method according to claim 1, characterized in that the code features include an index of the number of readable strings, an index of the number of array variables, an index of the number of decision branches, an index of function call complexity, and an index of code dependencies.
3. The adversarial program obfuscation method according to claim 1, characterized in that both the target code intermediate language and the adversarial code intermediate language are LLVM IR.
4. The adversarial program obfuscation method according to claim 1, characterized in that the preset encryption algorithm is one of XOR encryption, shift encryption, and AES encryption.
5. An adversarial program obfuscation system, characterized in that the adversarial program obfuscation system employs the adversarial program obfuscation method according to any one of claims 1 to 4, and the adversarial program obfuscation system comprises: a target code conversion module, used to convert target code into a target code intermediate language using a code compiler; an adversarial code selection module, used to randomly select adversarial code from the adversarial code set; an adversarial code conversion module, used to convert the selected adversarial code into an adversarial code intermediate language using a code compiler; a code fusion processor, used to fuse the target code intermediate language and the adversarial code intermediate language to obtain a fused intermediate language; and an executable file generation module, used to compile the fused intermediate language using the code compiler to generate an executable file.
6. An electronic device, characterized in that the device includes a memory and a processor, the memory being used to store a computer program, and the processor running the computer program to cause the electronic device to perform the adversarial program obfuscation method according to any one of claims 1 to 4.
7. A computer-readable storage medium, characterized in that it stores a computer program that, when executed by a processor, implements the adversarial program obfuscation method according to any one of claims 1 to 4.