Code obfuscation method and dataset enhancement method based on LLVM compiler

By using a code obfuscation method based on the LLVM compiler, the problem of insufficient expansion of malware datasets was solved, generating executable files that are difficult to reverse engineer, thus achieving efficient and low-cost dataset expansion and improvement of malware detection models.

CN117421713B · Active · Publication Date: 2026-06-26 · YUNNAN UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: YUNNAN UNIV
Filing Date: 2023-10-31
Publication Date: 2026-06-26

Application Information

IPC: G06F21/14; G06F8/41; G06F8/53

CPC: G06F21/14; G06F8/41; G06F8/53

AI Tagging

Technology Topics

Data set Theoretical computer science

AI Technical Summary

Technical Problem

Existing malware detection suffers from three problems: the collected malware datasets are too small; GAN-based malware augmentation is limited to executable files and cannot augment data at the source code level; and traditional manual code obfuscation raises development costs and is inefficient.

Method used

The code obfuscation method based on the LLVM compiler is adopted. The source code is converted into an intermediate representation by the compiler under the LLVM framework. The code is compiled using a custom optimizer and seven obfuscation methods in the obfuscation module to generate an executable file that is difficult to reverse engineer and disassemble, and a large number of obfuscated variant samples are generated from the source code level.

Benefits of technology

It achieves automatic code obfuscation during the software compilation stage, and the generated executable file has strong reverse engineering and disassembly resistance capabilities. It can quickly expand the dataset, reduce development costs, and improve the learning ability and detection efficiency of malware detection models.

Abstract

The application provides a code obfuscation method and a dataset enhancement method based on the LLVM compiler. First, a compiler under the LLVM framework preprocesses the source code to obtain an intermediate representation in a uniform file format. Then, the code of the intermediate representation is classified, and an obfuscation module obfuscates it. Finally, a back-end assembler assembles the obfuscated file into an obfuscated executable file. Because it is built on the LLVM compiler framework, the application performs code obfuscation automatically, efficiently and accurately during the compilation stage. It can quickly and automatically generate a large number of obfuscated malware samples from the source code level, providing an automatic and efficient malware sample enhancement method. The application enhances the ability of malicious software to resist reverse engineering and disassembly, and improves the accuracy and robustness of a malicious software detection model.

Description

Technical Field

[0001] This invention belongs to the field of software security technology, and in particular relates to a code obfuscation method and a dataset enhancement method based on the LLVM compiler.

Background Technology

[0002] Code obfuscation is a classic software protection method. The traditional approach to code obfuscation involves reading and analyzing the code, then manually modifying portions of it to achieve the desired obfuscation. This manual approach requires software developers to possess certain knowledge and skills in software protection, as well as software security professionals to have a thorough understanding of the project's source code. Therefore, this manual protection method undoubtedly increases the cost of software development and software dataset expansion, and reduces the efficiency of software development.

[0003] The development of code obfuscation techniques has a significant impact on malware detection. Currently, malware detection primarily relies on intelligent and automated detection technologies based on machine learning. Machine learning algorithms and models are crucial for malware detection; simultaneously, a high-quality malware sample dataset greatly influences the model's training results. However, the main methods for collecting malware samples currently rely on publicly available datasets or manual collection, resulting in datasets far smaller than the number and types of existing malware. Therefore, dataset augmentation has been widely proposed. Currently, the mainstream approach to augmenting malware datasets is using Generative Adversarial Networks (GANs). By optimizing the generator and discriminator in GANs, the executable files of malware can be expanded and enhanced, ultimately achieving the goal of malware enhancement. However, GAN-based malware augmentation methods are still limited to the original malicious code executable files and cannot expand and enhance data at the source code level.

[0004] To address the shortcomings of existing malware enhancement techniques, this invention focuses on the source code level and designs a code obfuscation method based on the LLVM compiler architecture. Using this invention to compile source code eliminates the need for software developers to protect the software during the source code writing process, generating an executable file protected by code obfuscation. Compared to executable files generated by traditional compilers, the enhanced file produced by this invention has stronger resistance to reverse engineering and disassembly, without requiring additional code writing for software security. Furthermore, this invention can rapidly generate a large number of obfuscated and variant samples from the source code of malware, thereby expanding the dataset and demonstrating significant practical value and promising application prospects.

Summary of the Invention

[0005] The purpose of this invention is to provide a code obfuscation method based on the LLVM compiler, which generates an executable file protected by code obfuscation by compiling the source code.

[0006] Another objective of this invention is to provide a dataset augmentation method based on LLVM compiler code obfuscation, which can rapidly generate a large number of obfuscated variant samples from the source code level of malware.

[0007] To solve the above-mentioned technical problems, the technical solution adopted by the present invention is a code obfuscation method based on the LLVM compiler, which specifically includes the following steps:

[0008] S1. Use the compiler under the LLVM framework to convert the source code into a unified intermediate file representation;

[0009] S2. Classify the code and obfuscation methods, and call the obfuscation methods encapsulated in the obfuscation module to obfuscate the intermediate representation of the file obtained in S1;

[0010] S3. The obfuscated intermediate representation of the file is assembled by the back-end assembler to obtain the obfuscated executable file.

[0011] Furthermore, the obfuscation module in S2 encapsulates a custom optimizer and seven obfuscation methods: control flow flattening, spurious random control flow, instruction substitution, constant substitution, string obfuscation, function name obfuscation, and anti-symbolic-execution obfuscation. The obfuscation module customizes the rules by which the custom optimizer modifies the code, so that the source code is obfuscated during compilation in the form of compiler optimization passes.

[0012] Furthermore, in S2, when calling the obfuscation method encapsulated in the obfuscation module, the user sets parameters to call a specific obfuscation method; different parameters are used to refer to obfuscation methods; the order of parameters selected by the user when using the obfuscation module is used to generate an obfuscation unit pipeline, and the obfuscation unit pipeline is loaded by the optimizer under LLVM during obfuscation.

[0013] Furthermore, the code classification standard in S2 is as follows: the code in the disassembled code is divided into statements and basic blocks according to the granularity level.

[0014] Furthermore, the classification criteria for obfuscation methods in S2 are as follows: based on different granularity levels, obfuscation methods are divided into data flow obfuscation and control flow obfuscation.

[0015] Furthermore, data flow obfuscation is applied to statements and control flow obfuscation is applied to basic blocks; data flow obfuscation methods are invoked before control flow obfuscation methods.

[0016] Furthermore, the specific methods for invoking the obfuscation module include:

[0017] A. Seven specific obfuscation methods are encapsulated separately, and the obfuscation module can select one or more of them by parameters;

[0018] B. When the obfuscation module calls multiple obfuscation methods and calls both control flow flattening and spurious random control flow obfuscation methods, control flow flattening and spurious random control flow obfuscation methods are used last.

[0019] C. The obfuscation module obtains the control flow graph of the executable file as a feature and generates different control flow graphs by modifying the order of control flow obfuscation methods.

[0020] A dataset augmentation method based on LLVM compiler code obfuscation is proposed as follows: a malicious file detection model upgrade framework is constructed, and an initial model is obtained by learning from ordinary samples; samples are taken from obfuscated samples, and the initial model is used to predict the samples; data that can escape the detection model are selected based on the prediction results of the initial model as obfuscation augmentation dataset; the original malicious file detection model dataset and the obfuscation augmentation dataset are combined to form an enhanced dataset.

[0021] Furthermore, the obfuscated sample is specifically an obfuscated sample obtained by obfuscating each sample in the original malware detection model dataset using the code obfuscation method based on the LLVM compiler described above.

[0022] The beneficial effects of this invention are as follows: Based on the LLVM compiler framework, this invention can efficiently, quickly, and accurately automate code obfuscation during the software code compilation stage. Software developers do not need to protect the software while writing the source code; a code-obfuscated executable file can be generated immediately. Compared with executable files generated by traditional compilers, the enhanced file generated by this invention has stronger resistance to reverse engineering and disassembly, without requiring additional code writing for software security. Using this invention, a large number of obfuscated and mutated samples can be quickly generated from the source code of malware, thereby expanding the dataset and demonstrating significant practical value and application prospects.

Attached Figure Description

[0023] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0024] Figure 1 is a system execution flowchart of the present invention.

[0025] Figure 2 is a schematic diagram of the framework of the enhanced malware detection model implemented in this invention.

[0026] Figure 3 is a schematic diagram illustrating the operation of the code obfuscation module in this invention.

[0027] Figure 4 compares the control flow of an executable file before and after control flow flattening obfuscation, where (a) is before obfuscation and (b) is after obfuscation.

[0028] Figure 5 compares the control flow of an executable file before and after spurious random control flow obfuscation, where (a) is before obfuscation and (b) is after obfuscation.

[0029] Figure 6 compares the effect of obfuscating malware according to an embodiment of the present invention, where (a) is normal compilation and (b) is after obfuscation.

Detailed Implementation

[0030] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of the present invention.

[0031] As shown in Figure 1, this invention provides a code obfuscation method based on the LLVM compiler. First, a compiler front end from the LLVM framework (such as Clang for C/C++, swiftc for Swift, or rustc for Rust) translates the source code into an intermediate representation (IR) with a unified file format. The LLVM framework includes compilers for various programming languages, and their front ends all convert source code into the same intermediate representation. By processing the intermediate representation, code-level obfuscation can be achieved.

[0032] Secondly, the obfuscation module of this invention obfuscates the IR file. The transformation preserves the program's normal functionality while making the code hard to read and logically complex, which completes the obfuscation of the source code. To ensure the strength and flexibility of code obfuscation, the obfuscation module incorporates seven specific code obfuscation methods: control flow flattening, spurious random control flow, instruction substitution, constant substitution, string obfuscation, function name obfuscation, and anti-symbolic-execution obfuscation (3x+1 obfuscation). Each obfuscation method is encapsulated separately, so the desired methods can be selected flexibly through parameters and several can be applied to the same source code at once.

[0033] Finally, the obfuscated IR file is handed over to the backend assembler, and the target platform (X86 / X64, ARM, MIPS, etc.) is specified for assembly, linking and other operations. The result is an executable file with obfuscated code, which has a certain ability to resist reverse engineering and disassembly.
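The three stages described above (front end, obfuscation in the optimizer, back end) can be sketched as a sequence of tool invocations. The sketch below only builds the command lines and does not run them. `clang -S -emit-llvm`, `opt -load-pass-plugin` and `opt -passes` are standard LLVM options; the plugin path `./libObfuscation.so` and the pass names are hypothetical placeholders, since the patent does not describe how its obfuscation module is loaded.

```python
from pathlib import Path


def build_commands(source, passes, plugin="./libObfuscation.so"):
    """Return the three command lines for one obfuscated build.

    `plugin` and the names in `passes` are hypothetical: the patent does not
    name its pass plugin or say how its parameters map to opt pass names.
    """
    src = Path(source)
    ir = src.with_suffix(".ll")                   # plain IR from the front end
    obf_ir = src.with_name(src.stem + "_obf.ll")  # IR after obfuscation
    exe = src.with_suffix("")                     # final executable
    return [
        # S1: the front end lowers the source to textual LLVM IR.
        ["clang", "-S", "-emit-llvm", str(src), "-o", str(ir)],
        # S2: the optimizer loads the plugin and runs the passes in order.
        ["opt", f"-load-pass-plugin={plugin}", f"-passes={','.join(passes)}",
         str(ir), "-S", "-o", str(obf_ir)],
        # S3: the back end assembles and links the obfuscated IR.
        ["clang", str(obf_ir), "-o", str(exe)],
    ]
```

Each returned list could be passed to `subprocess.run` on a machine with an LLVM toolchain and a real pass plugin; a target platform would be selected with additional back-end options that are omitted here.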

[0034] This invention leverages the LLVM compiler framework, which allows the compiler to modify the implementation of source code according to defined rules. By developing a code obfuscation module and customizing the optimizer's rules for modifying the code, the source code is obfuscated during compilation in the form of compiler optimization, ultimately generating an obfuscated executable file. This invention effectively resists reverse engineering and disassembly analysis of executable files. Furthermore, it can be applied to models for detecting malicious code samples. By generating a large number of random, obfuscated malicious samples through compilation, it expands and enhances the dataset, thereby assisting malicious code detection research. Generating malicious samples in this way avoids the drawbacks of traditional enhancement methods, which are time-consuming, costly, and of limited effectiveness in expanding datasets. By constructing a higher-quality malicious code dataset, it expands the learning capabilities of malicious file detection models.

[0035] This invention modifies the source code using a custom optimizer, obfuscates the source code, and generates an obfuscated executable file. The processing of the dataset and the hardening steps in this invention are as follows:

[0036] Step 1: Divide the disassembled code into statements (data) and basic blocks according to granularity. A basic block consists of multiple statements. Based on different granularities, categorize obfuscation methods into data flow obfuscation and control flow obfuscation. Data flow obfuscation applies to statements (data), while control flow obfuscation applies to basic blocks. The order in which the two types of obfuscation are called is also specified: data flow obfuscation should be called before control flow obfuscation to improve obfuscation efficiency.

[0037] Step 2: Set up the basic calling method, in which the user sets parameters to call specific code obfuscation methods. In this invention, all supported obfuscation methods are encapsulated as independent obfuscation units, with no dependencies between them. Each parameter denotes one obfuscation method (e.g., RFC for function name obfuscation, FLA for control flow flattening, BRCF for spurious random control flow, SUB for instruction substitution, CSU for constant substitution, SR for string obfuscation, and CMPX for 3x+1 obfuscation). The order of the parameters selected by the user when using the obfuscation module generates an obfuscation unit pipeline, which the LLVM optimizer loads during obfuscation. Between obfuscation categories, data flow obfuscation has a higher priority than control flow obfuscation. Within data flow obfuscation, string obfuscation has a higher priority than the other data flow obfuscation methods. Within control flow obfuscation, 3x+1 obfuscation has a higher priority than the other control flow obfuscation methods. Obfuscation schemes containing these methods all follow this principle. The order in which different methods are combined affects the final obfuscation effect on the file, and thus the dataset quality and the detection rate of the malicious file detection model. Taking the symbol hiding methods in data flow obfuscation as an example, string obfuscation should precede function name obfuscation to achieve the best symbol hiding effect. Similarly, in control flow obfuscation, the 3x+1 obfuscation method should be placed before the other control flow obfuscation methods, so that they hide the 3x+1 loop code and obfuscation efficiency improves. This step completes the core logic of project obfuscation, ultimately generating an IR file with obfuscation protection.
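The priority rules in this step can be sketched as a stable sort over the user's parameter list. This is only one possible way to enforce them. The parameter names come from the description; the function name `build_pipeline` is hypothetical, and placing SUB and CSU in the data flow tier is an assumption, since the text names only string and function name obfuscation explicitly as data flow methods.

```python
# Lower number = runs earlier. Tiers follow the stated rules:
# string obfuscation first, then other data flow methods, then 3x+1,
# then control flow flattening / spurious random control flow last.
PRIORITY = {
    "SR": 0,                       # string obfuscation
    "RFC": 1, "SUB": 1, "CSU": 1,  # other data flow methods (SUB/CSU assumed)
    "CMPX": 2,                     # 3x+1 obfuscation
    "FLA": 3, "BRCF": 3,           # flattening and spurious control flow
}


def build_pipeline(selected):
    """Order the selected parameters into an obfuscation unit pipeline.

    sorted() is stable, so the user's relative order is kept inside a tier,
    e.g. the order of FLA and BRCF still varies the resulting control flow.
    """
    unknown = [p for p in selected if p not in PRIORITY]
    if unknown:
        raise ValueError(f"unknown obfuscation parameters: {unknown}")
    return sorted(selected, key=PRIORITY.__getitem__)
```

For example, `build_pipeline(["BRCF", "FLA", "SR"])` and `build_pipeline(["FLA", "BRCF", "SR"])` both start with `SR` but keep the two control flow methods in the order the user gave them.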

[0038] Step 3: An augmentation dataset is created by combining the malware detection model dataset with the obfuscated dataset used to enhance the model. The malware detection model trained on this augmentation dataset is then tested, the effects of different combinations on model accuracy and robustness are analyzed, and further improvements are made to meet specific requirements. As shown in Figure 2, a model upgrade framework is constructed as follows. The malware detection model to be enhanced is selected. First, the model is obtained by learning from ordinary samples. Samples are then drawn from the obfuscated samples and predicted using this ordinary malware detection model. From the prediction results, a ranking of the obfuscations that are useful to the current model is obtained, and the obfuscated dataset used to enhance the model is updated. This obfuscated dataset consists of the obfuscated samples that escape the detection model, where the obfuscated samples are obtained by obfuscating each sample in the original program dataset. The model is retrained and re-evaluated using the obfuscation-enhanced dataset. The above process is repeated to finally obtain the enhanced model.
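One round of the selection step in this framework can be sketched as follows. The function name `augment_dataset` and the callback `is_detected` are hypothetical stand-ins for the detection model's prediction interface, which the patent does not specify; retraining is left out.

```python
def augment_dataset(original, obfuscated, is_detected):
    """Return (enhanced_dataset, evading_samples) for one augmentation round.

    original     -- samples the initial model was trained on
    obfuscated   -- obfuscated variants of those samples
    is_detected  -- hypothetical callback: True if the current model flags
                    the sample as malicious
    """
    # Keep only the obfuscated variants that escape the current model.
    evading = [sample for sample in obfuscated if not is_detected(sample)]
    # The enhanced dataset is the original data plus the evading variants.
    return list(original) + evading, evading
```

In the iterative framework, the model would be retrained on the enhanced dataset and the function called again with the retrained model's predictor until few or no variants evade detection.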

[0039] Figure 3 is a schematic diagram illustrating the specific operation of the code obfuscation module of this invention. IR stands for Intermediate Representation under the LLVM compiler framework. In the compiler's middle end, the optimizer calls obfuscation units to make specific modifications to the IR. The code obfuscation module is essentially a collection of commonly used obfuscation units that together obfuscate the IR file. Parameters can be used to select and invoke some or all of the code obfuscation methods. To evaluate these code obfuscation methods, the same source code is compiled using both a traditional compiler and this invention, and the results are compared. The code obfuscation methods and their specific steps are as follows:

[0040] (1): Control flow flattening;

[0041] Control flow flattening modifies the program's control flow graph to obscure the logical relationships between basic blocks, essentially using a switch statement to implement the conditional jumps. First, all basic blocks of the program are collected; in the original program they execute in their original order. Then the basic blocks are randomly numbered, and each block is made to jump to the switch statement when it finishes. Next, following the original program's block execution order, the end of each block is modified to assign the dispatch variable `var` the number of the block that should execute next, so that the switch statement transfers control to that block and the original execution flow remains unchanged. Figure 4 compares the control flow graphs of the executable file before and after obfuscation with this method. Compared with the control flow graph before obfuscation (Figure 4(a)), the obfuscated executable (Figure 4(b)) uses a switch statement to control the jumps between basic blocks, thus flattening the program's flow structure. This obfuscation method can obscure the logical relationships between basic program blocks. The malicious code dataset generated in this way helps to expand the model's ability to identify malicious code through the logical relationships between basic blocks.
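The transformation can be illustrated at source level with a small function. This is a hand-written sketch of the idea, not output of the patented module: the block numbers 7, 2 and 5 are arbitrary stand-ins for the random numbering, and the `if`/`elif` chain plays the role of the switch statement.

```python
def gcd_original(a, b):
    """Euclid's algorithm with ordinary structured control flow."""
    while b != 0:
        a, b = b, a % b
    return a


def gcd_flattened(a, b):
    """Same computation after control flow flattening.

    Every basic block returns to one dispatcher and sets `var` to the
    number of the block that must run next.
    """
    COND, BODY, EXIT = 7, 2, 5     # arbitrary block numbers
    var = COND
    while True:                    # dispatcher loop
        if var == COND:            # loop-condition block
            var = BODY if b != 0 else EXIT
        elif var == BODY:          # loop-body block
            a, b = b, a % b
            var = COND
        elif var == EXIT:          # exit block
            return a
```

Both functions compute the same result for non-negative inputs, but in the flattened version every block has the dispatcher as its only successor, so the loop structure is no longer visible in the control flow graph.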

[0042] (2): Spurious random control flow;

[0043] This approach involves randomly adding specific conditional jump statements to each basic block in the program, thus complicating the program's control flow graph. An opaque predicate is added to the jumps between each basic block; a true predicate results in a normal jump, while a false predicate leads to the cloned basic block. When the opaque predicate is a perpetually true expression, it's called spurious control flow. When the jump condition is a random expression, it's called random control flow. First, all basic blocks are traversed, splitting each basic block into three parts (head block, middle block, and tail block). The middle block is copied as a clone, resulting in identical code between the two basic blocks. An opaque predicate is inserted before the jump statements in the head and middle blocks, modifying the jump statements to conditional jumps. A perpetually false jump condition is set for the cloned block, preventing its execution. Mutation operations can also be performed on the cloned block code. The tautology here incorporates several conclusions from number theory. One example is illustrated below: Given two distinct prime numbers (p1 and p2), two distinct positive integers (a1 and a2), two random values (x and y), and ∨ representing disjunction, the following inequality holds:

[0044] $p_1 \cdot (x \lor a_1)^2 \neq p_2 \cdot (y \lor a_2)^2$
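A minimal sketch of such an opaque predicate follows, assuming that ∨ denotes bitwise OR and that the superscript 2 is a square. The default values of `p1`, `p2`, `a1` and `a2` are illustrative choices, not values from the patent.

```python
def opaque_predicate(x, y, p1=3, p2=5, a1=1, a2=2):
    """Always True for distinct primes p1, p2 and positive a1, a2.

    (x | a1) and (y | a2) are never zero because a1 and a2 have at least
    one bit set. The prime p1 then divides the left side an odd number of
    times and the right side an even number of times, so the two sides
    cannot be equal.
    """
    return p1 * (x | a1) ** 2 != p2 * (y | a2) ** 2


def guarded_block(x, y):
    """Conditional jump whose false branch (the cloned block) never runs."""
    if opaque_predicate(x, y):
        return "original block"
    return "cloned block"          # unreachable in practice
```

The argument relies on Python's unbounded integers; with fixed-width machine arithmetic, wraparound on overflow would have to be considered separately.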

[0045] Figure 5 compares the control flow before and after this obfuscation method. After obfuscation (Figure 5(b)), the control flow graph contains more basic blocks, and the program must pass through multiple branches to complete execution. Compared with the original program (Figure 5(a)), the number of basic blocks of both normal and malicious code increases, and the code in the cloned blocks can be mutated into normal code, thereby reducing the proportion of malicious code in the program. The malicious code dataset generated in this way helps to enhance the model's ability to extract malicious features from basic blocks, thereby obtaining more feature relationships in the program.

[0046] (3): Instruction substitution;

[0047] This method replaces computational instructions in the program with more complex forms. Computational instructions (arithmetic operations such as addition, subtraction, multiplication, and division, and bitwise operations such as AND, OR, and NOT) are replaced with more complex expressions that produce the same result; these substitutions are constructed using mixed Boolean-arithmetic expressions. The obfuscation therefore changes the actual code that the program executes without changing its results. By complicating the expressions, this method makes it harder to judge what a function does and can cause that judgment to be wrong. The dataset generated by this method can improve the model's adaptability.
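A few well-known mixed Boolean-arithmetic identities illustrate the kind of substitution described above. The patent does not list the identities it uses, so the ones below are examples only; they hold for Python's integers and for two's complement machine integers.

```python
def add_mba(a, b):
    """a + b rewritten as XOR plus twice the carry bits."""
    return (a ^ b) + 2 * (a & b)


def sub_mba(a, b):
    """a - b rewritten as XOR minus twice the borrow bits."""
    return (a ^ b) - 2 * (~a & b)


def xor_mba(a, b):
    """a ^ b rewritten with OR, AND and subtraction."""
    return (a | b) - (a & b)


def or_mba(a, b):
    """a | b rewritten with XOR, AND and addition."""
    return (a ^ b) + (a & b)
```

Substitutions like these can be nested, for instance by rewriting the `+` inside `or_mba` with `add_mba`, which makes the resulting expression progressively harder to recognize.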

[0048] (4): Constant substitution;

[0049] Constants in the program are replaced with complex expressions. Whether it's a macro-defined constant or a constant used for counting in a loop, it can be replaced with a more complex expression. For example, the constant 100 can be replaced by the expression 99+1. Of course, the constant substitution formulas used in practice will be quite complex to confuse the analyst. First, all instructions are scanned, and each instruction is evaluated. When an instruction's operation satisfies the substitution scheme, the instruction is replaced. This operation can be repeated to make the constants more complex. Similar to instruction substitution, constant substitution also increases the difficulty of analysis for the analyst. The difference is that this method only confuses constants appearing in the program, not the operational expressions. At the same time, the dataset generated by this method weakens the role of specific numerical features in the discrimination process, which helps improve the robustness of the model.
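The repeated replacement of a constant can be sketched as a small expression generator. The two rewrite rules used here (split by addition, split by XOR) and the function name `obscure_constant` are illustrative assumptions; the patent does not disclose its substitution formulas.

```python
import random


def obscure_constant(value, rounds, rng):
    """Return a Python expression string that evaluates to `value`.

    Each round replaces the current constant c either by (c - k) + k or by
    (c ^ k) ^ k for a random k, then recurses on the new inner constant.
    """
    if rounds == 0:
        return str(value)
    k = rng.randrange(1, 1 << 16)
    if rng.random() < 0.5:
        inner = obscure_constant(value - k, rounds - 1, rng)
        return f"({inner} + {k})"
    inner = obscure_constant(value ^ k, rounds - 1, rng)
    return f"({inner} ^ {k})"
```

Calling `obscure_constant(100, 3, random.Random())` yields a nested expression with three random constants whose value is still 100; more rounds give deeper nesting.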

[0050] (5): String obfuscation;

[0051] Strings are stored in the program in encrypted form, decrypted when used, and encrypted again after use. This minimizes the opportunity to use strings as a starting point for reverse engineering. First, the global variables in the IR file are traversed and the strings among them are processed with an encryption function, so that only the encrypted strings are stored. Then a corresponding decryption function is added to the program's initial startup function, ensuring that the correct strings are obtained during dynamic execution. String obfuscation encrypts the string constants appearing in the program, for example the string constant arguments passed to the `printf` function. During reverse engineering, such a string still appears at the address of the original string, but its content is no longer the original string, which effectively increases the difficulty of static analysis. Backdoor viruses contain strings related to network communication and local file paths. Malicious code detection models often infer from the presence of these strings that remote communication and information transmission take place, and thus identify the program as malicious code. However, techniques such as self-modifying code (SMC) and string concatenation can mask string information and cause errors in the model's judgment. This method can likewise mask the string information of the program. The dataset generated by this method reduces the influence of the strings themselves on the judgment of malicious code, biasing model training towards behavioral meaning and improving the generalization ability of the model.
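The store-encrypted, decrypt-on-use idea can be sketched as follows. The patent does not name a cipher, so a repeating-key XOR is assumed here purely for illustration, and the class name `ProtectedString` is hypothetical.

```python
def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR `data` with a repeating key; applying it twice restores the data."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


class ProtectedString:
    """Holds only the encrypted form of a string constant."""

    def __init__(self, plaintext: str, key: bytes):
        if not key:
            raise ValueError("key must not be empty")
        self._key = key
        # Only ciphertext is stored, as it would be in the binary's data.
        self._blob = xor_bytes(plaintext.encode("utf-8"), key)

    @property
    def stored(self) -> bytes:
        """What a static analyst would see in place of the string."""
        return self._blob

    def use(self) -> str:
        """Decrypt into a temporary for one use; the stored blob is unchanged."""
        return xor_bytes(self._blob, self._key).decode("utf-8")
```

A call such as `ProtectedString("/etc/passwd", b"\x5a\xa7").use()` returns the original path at run time, while the stored bytes no longer contain it.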

[0052] (6): Function name obfuscation;

[0053] This method obfuscates the names of developer-defined functions. Randomly generated strings replace the original function names, removing any hint of a function's purpose that its name carries and giving reverse engineering less to work with. However, it is not suitable for the main function, because the operating system identifies the program's entry point by its name; modifying the name of main would prevent the operating system from finding the entry point. The process iterates through all functions and replaces the name of every function except main with a string of random length. The key to this modification is that it must be applied globally; otherwise, calls to the renamed functions will no longer resolve. For example, the name of the custom function `encrypt` is replaced with a randomly generated string, which prevents an analyst from using the function name to infer the function's purpose and thus protects the program. The dataset generated in this way reduces the influence of function names on malicious code detection, because function names do not represent what functions do; the model should focus more on function behavior and the call relationships between functions.
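The global renaming can be sketched on a toy call graph, where each defined function maps to the list of functions it calls. The data structure and the name `rename_functions` are illustrative assumptions; the real transformation operates on LLVM IR.

```python
import secrets


def rename_functions(call_graph, keep=("main",)):
    """Rename every defined function except those in `keep`.

    Returns (renamed_graph, mapping). Callees that are not defined in the
    graph (e.g. library functions) keep their names.
    """
    used = set(keep) | set(call_graph)
    mapping = {}
    for name in call_graph:
        if name in keep:
            mapping[name] = name
            continue
        # Random name of random length; retry on the unlikely collision.
        new = "f_" + secrets.token_hex(secrets.randbelow(8) + 4)
        while new in used:
            new = "f_" + secrets.token_hex(secrets.randbelow(8) + 4)
        used.add(new)
        mapping[name] = new
    # Apply the same mapping to definitions and to every call site.
    renamed = {
        mapping[func]: [mapping.get(callee, callee) for callee in callees]
        for func, callees in call_graph.items()
    }
    return renamed, mapping
```

Because definitions and call sites are rewritten with one shared mapping, the renamed graph has the same shape as the original, which is the property the global-replacement requirement protects.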

[0054] (7): Countering symbolic execution tools;

[0055] The "3x+1" method is used to protect if statements, increasing the time cost for symbolic execution tools to collect constraints and reach specified code within the program. A loop is added to each if statement, with a large random number variable generated to drive the loop's termination condition. If the condition of the if statement is true, execution jumps to the normally executed code; if it is false, execution jumps to the loop. Each iteration checks the random number variable: if it is even, it is divided by 2; if it is odd, it is multiplied by 3 and increased by 1; the loop ends when the variable equals 1. First, all basic blocks are obtained in preparation for the subsequent code insertion and function flow modifications. Conditional jump statements are found and separated from their original basic blocks. The basic blocks needed for the loop (x generation block, x judgment block, x division block, even x processing block, odd x processing block) are created, and code logic is established between these blocks to handle the jump relationships, forming a 3x+1 loop. This method traps symbolic execution attacks in numerous loops, increasing their time cost. After obfuscation, the program's control flow graph changes, gaining many branches and a loop. When a symbolic execution tool (such as angr) was used to locate and brute-force the target code, the unobfuscated program took 43 seconds while the obfuscated program took 475 seconds, roughly 11 times as long, which effectively increases the time cost of symbolic execution tools. For malicious code to achieve its purpose, its behavior must manifest itself in the code. Detection models commonly analyze API calls to understand malware behavior, but many malware programs add redundant API calls to mislead model analysis. This obfuscation method disrupts the order of API calls and can insert redundant APIs. The generated dataset improves the model's adaptability and discrimination capabilities for more generalized situations.
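The inserted loop can be sketched in a few lines. The helper names `collatz_steps` and `guarded_branch` are hypothetical, and the source-level form is only an illustration of what the pass builds out of basic blocks in the IR.

```python
def collatz_steps(x):
    """Run the 3x+1 loop from x and return the number of iterations."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    steps = 0
    while x != 1:
        # Even: halve. Odd: multiply by 3 and add 1.
        x = x // 2 if x % 2 == 0 else 3 * x + 1
        steps += 1
    return steps


def guarded_branch(condition, on_true, seed):
    """Protected if statement.

    True  -> run the normal code.
    False -> enter the 3x+1 loop, which a symbolic executor has to unroll
             iteration by iteration before it can leave this path.
    """
    if condition:
        return on_true()
    collatz_steps(seed)   # decoy work; its result is not used
    return None
```

Starting from 27, for example, the loop runs 111 iterations before reaching 1, so even small seeds can force long paths; the pass described above uses a large random starting value.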

[0056] When calling the code obfuscation module, the specific calling method is as follows:

[0057] A: The seven specific obfuscation methods are encapsulated separately, and one or more of them can be selected to be applied to the project before compilation via parameters.

[0058] B: If multiple obfuscation methods are selected for a project, control flow flattening and spurious random control flow should be applied last, to improve obfuscation efficiency and to keep the final executable file from becoming too large.

[0059] C: Control flow flattening and spurious random control flow are two code obfuscation methods that each hide the other's features within their own. The model takes the control flow graph of the executable file as a feature. Changing the order in which the control flow obfuscation methods are applied generates different control flow graphs and thereby expands the dataset, avoiding an excess of near-identical control flow graphs that could cause the model to overfit.
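Rules A to C can be sketched as a small pipeline builder. The parameter names below (`flatten`, `bogus_cf`, and so on) and the helper functions are hypothetical; the patent does not give the actual parameter spellings, and this sketch only orders names without invoking any LLVM tooling.

```python
from itertools import permutations

# Hypothetical parameter names for the seven methods (not from the patent).
ALL_PASSES = ("flatten", "bogus_cf", "inst_subst", "const_subst",
              "str_obf", "func_rename", "anti_symex")
# Rule B: these two control flow methods are applied last.
RUN_LAST = ("flatten", "bogus_cf")


def split_passes(selected):
    """Validate the selection and split it into (early, late) groups."""
    unknown = [p for p in selected if p not in ALL_PASSES]
    if unknown:
        raise ValueError(f"unknown obfuscation parameter(s): {unknown}")
    early = [p for p in selected if p not in RUN_LAST]
    late = [p for p in selected if p in RUN_LAST]
    return early, late


def build_pipeline(selected):
    """Rules A and B: keep the user's order, but move the late group to the end."""
    early, late = split_passes(selected)
    return early + late


def pipeline_variants(selected):
    """Rule C: one pipeline per ordering of the selected control flow methods."""
    early, late = split_passes(selected)
    return [early + list(order) for order in permutations(late)]


print(build_pipeline(["flatten", "str_obf", "bogus_cf", "const_subst"]))
# prints ['str_obf', 'const_subst', 'flatten', 'bogus_cf']
```

With both control flow methods selected, `pipeline_variants` returns two pipelines that differ only in the order of the last two entries, which corresponds to the two different control flow graphs that rule C relies on.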

[0060] Example 1

[0061] In this embodiment, a backdoor virus named kaiten was selected as an example. It was compiled both normally and with obfuscation, yielding a normal executable file and an obfuscated executable file. VirusTotal was then used to test the executable files before and after obfuscation. The test results are shown in Figure 6. Compared with the sample before obfuscation (Figure 6(a)), the sample after obfuscation (Figure 6(b)) has a lower detection rate. This indicates that the method of this invention can enable malicious code to escape detection, and that training a model with the malicious code dataset generated by this invention can improve the model's generalization ability and its detection rate for malicious code.

[0062] Existing methods for generating obfuscated and mutated malicious code at the source code level mainly involve manually modifying the source code, for example by adding junk code, adding self-modifying code, or encrypting code blocks. These methods require authors to manipulate the source code by hand or to introduce code obfuscation frameworks. As the number of obfuscated and mutated samples required grows, the author's workload grows with it, reducing efficiency. With this invention, the author does not need to work out how to modify the source code while retaining functionality; simply changing the parameter order yields an obfuscated and mutated program that retains its functionality. Furthermore, the effects of the obfuscation methods are cumulative and can be combined, which avoids the loss of author efficiency that would otherwise come with an increased demand for obfuscated and mutated samples.

[0063] As can be seen, code obfuscation with this invention is more efficient, easier to use, and has a lower barrier to entry than traditional code obfuscation methods. It can directly obfuscate existing source code without requiring developers to have security-related knowledge. Used as a data augmentation method to improve model robustness, it adds the obfuscated samples to the original dataset, so that the features extracted by the detection method generalize better, thereby improving the detection model's discriminative ability.
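The augmentation step, as set out in claim 5, keeps the obfuscated samples that the initial model fails to detect and merges them with the original dataset. The following is a minimal sketch under stated assumptions: samples are plain strings, label 1 means malicious, `toy_initial_model` is a hypothetical stand-in for a trained detector, and every obfuscated sample is evaluated rather than a random subset.

```python
def augment_dataset(original, obfuscated_malware, predict_malicious):
    """Return (enhanced dataset, evasive samples).

    original: list of (sample, label) pairs, label 1 = malicious.
    obfuscated_malware: obfuscated variants of malicious samples.
    predict_malicious: callable returning True if the initial model flags a sample.
    """
    # Keep only the obfuscated samples that escape the initial model.
    evasive = [s for s in obfuscated_malware if not predict_malicious(s)]
    # Merge them, labeled malicious, with the original dataset.
    enhanced = list(original) + [(s, 1) for s in evasive]
    return enhanced, evasive


def toy_initial_model(sample):
    """Hypothetical detector: flags a sample that contains a known marker."""
    return "encrypt" in sample


original = [("call encrypt; call send", 1), ("call print", 0)]
obfuscated = ["call qZxWv; call send", "call encrypt; jmp loop"]
enhanced, evasive = augment_dataset(original, obfuscated, toy_initial_model)
```

Only the first obfuscated sample evades the toy detector, so the enhanced dataset grows by one malicious entry. Retraining on the enhanced dataset is outside the scope of this sketch.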

[0064] The various embodiments in this specification are described in a related manner. Similar or identical parts of the embodiments can be cross-referenced, and each embodiment focuses on its differences from the others. In particular, the system embodiments are basically similar to the method embodiments, so their description is relatively brief; for relevant parts, refer to the descriptions of the method embodiments.

[0065] The above description is merely a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention are included within the scope of protection of the present invention.

Claims

1. A code obfuscation method based on the LLVM compiler, characterized in that it comprises the following steps: S1. Use the compiler under the LLVM framework to convert the source code into a unified intermediate representation file; S2. Classify the code and the obfuscation methods, and call the obfuscation methods encapsulated in the obfuscation module to obfuscate the intermediate representation file obtained in S1; the obfuscation module encapsulates a custom optimizer and seven obfuscation methods: control flow flattening, spurious random control flow, instruction substitution, constant substitution, string obfuscation, function name obfuscation, and symbolic execution countermeasures. The obfuscation module works by using the custom optimizer to modify the code according to rules, so that the source code is obfuscated during compilation in the manner of a compiler optimization. When calling an obfuscation method encapsulated in the obfuscation module, the user sets parameters to select it; different parameters refer to different obfuscation methods; the order of the parameters selected by the user when using the obfuscation module is used to generate an obfuscation unit pipeline, and the obfuscation unit pipeline is loaded by the optimizer under LLVM during obfuscation; the specific methods for calling the obfuscation module include: A. The seven specific obfuscation methods are encapsulated separately, and the obfuscation module can select one or more of them by parameters; B. When the obfuscation module calls multiple obfuscation methods, including both control flow flattening and spurious random control flow, control flow flattening and spurious random control flow are applied last; C. The obfuscation module obtains the control flow graph of the executable file as a feature and generates different control flow graphs by modifying the order of the control flow obfuscation methods; S3. The obfuscated intermediate representation file is assembled by the back-end assembler to obtain the obfuscated executable file.

2. The code obfuscation method based on the LLVM compiler according to claim 1, characterized in that the code classification standard in S2 is as follows: the code in the disassembled code is divided into statements and basic blocks according to granularity level.

3. The code obfuscation method based on the LLVM compiler according to claim 1, characterized in that the classification criteria for obfuscation methods in S2 are as follows: based on granularity level, obfuscation methods are divided into data flow obfuscation and control flow obfuscation.

4. The code obfuscation method based on the LLVM compiler according to claim 3, characterized in that data flow obfuscation is used for statement obfuscation; control flow obfuscation is used for basic block obfuscation; and data flow obfuscation methods are called before control flow obfuscation methods.

5. A dataset augmentation method based on LLVM compiler code obfuscation, characterized in that the specific method is as follows: build a malicious file detection model upgrade framework, and train on ordinary samples to obtain an initial model; draw samples from the obfuscated samples, use the initial model to predict each drawn sample, and, based on the prediction results of the initial model, select the data that escapes the detection model as the obfuscation enhancement dataset; combine the original malicious file detection model dataset and the obfuscation enhancement dataset to form the enhanced dataset; the obfuscated samples are obtained by obfuscating each sample in the original malicious file detection model dataset using the code obfuscation method based on the LLVM compiler as described in any one of claims 1 to 4.