Method for evaluating performance of disassembling tool based on dynamic multi-path test
By combining fuzz testing and dynamic symbolic execution with path exploration and basic block-level instrumentation, and in conjunction with static disassembly, a high-coverage benchmark truth value is constructed. This solves the problems of inaccurate benchmark truth values and insufficient coverage in the performance evaluation of disassembler tools, and enables quantitative analysis and multi-tool comparison in complex obfuscated scenarios.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- XIDIAN UNIV
- Filing Date
- 2026-04-02
- Publication Date
- 2026-06-16
Abstract
Description
Technical Field
[0001] This invention belongs to the field of software security technology and, more specifically, relates to a performance evaluation method for disassemblers based on dynamic multi-path testing within binary program analysis. The invention can be applied to disassembler performance analysis, software security assessment, and binary program analysis, providing high-precision, low-overhead automated evaluation that requires no prior knowledge of the program, even in complex obfuscated environments.
Background Technology
[0002] With the continuing development of software security analysis, reverse engineering, and binary program understanding, disassemblers play a crucial role in vulnerability discovery, malware analysis, and program behavior recovery. In practice, however, target programs are often protected by complex obfuscation such as control flow flattening, instruction substitution, bogus control flow, and even virtualization, which leads to significant deviations in disassembly results. How to evaluate disassembler performance accurately and objectively without source code has therefore become a pressing problem in binary analysis. Existing evaluation methods typically take compiler-generated information or the output of existing disassemblers as the baseline truth. Compiler-based methods, however, often require a customized compiler or additional debugging information, which makes them costly to implement and hard to scale. Moreover, in obfuscated scenarios the output of existing disassemblers cannot itself be guaranteed correct, so a baseline truth derived from it is biased and the evaluation results become unreliable. In addition, traditional precision and recall metrics do not reflect differences in program structural complexity and cannot show the actual impact of obfuscation on analysis results.
[0003] In the paper "Evaluating Disassembly Errors With Only Binaries" (Proceedings of the 20th ACM Asia Conference on Computer and Communications Security, 2025: 1741-1755), Lambang Akbar Wijayadi proposed a method that identifies disassembly errors by dynamically instrumenting instruction traces and comparing them with disassembly results, relying solely on binary programs. This method improves applicability to closed-source binaries, and because every collected instruction originates from an actual execution path, the collected instructions are guaranteed to be correct. However, the method still has shortcomings. Its execution paths are driven entirely by program inputs, chiefly the test inputs shipped with the program itself. It lacks an effective path expansion mechanism, so deep or complex branches are difficult to cover and the constructed reference set has insufficient coverage. Furthermore, it does not consider differences in program structural complexity under obfuscation and does not establish a unified evaluation metric system, which makes it difficult to support quantitative comparison of different disassemblers in obfuscated scenarios.
[0004] The patent document "Software Supply Chain Code Security Detection Method Based on Static and Dynamic Hybrid Program Slicing" (Application Date: 2024-06-28, Application No.: 202410340689.6, Publication No.: CN 118260758 A) discloses a software supply chain code security detection method based on static and dynamic hybrid program slicing. This method constructs a control flow graph by combining static and dynamic analysis, and extracts control dependencies and data dependencies to generate a program dependency graph. Driven by fuzz testing, this method introduces dynamic instrumentation to obtain indirect jump information and combines it with static analysis of the control flow graph to improve the completeness and coverage of program analysis. However, this method still has shortcomings. Firstly, its core objective is code security detection, and its analysis focuses on control dependencies and data dependencies. It does not construct a baseline truth value for disassembly tool evaluation, and therefore cannot be used to measure the accuracy of disassembly results. Secondly, although this method combines dynamic analysis to obtain execution path information, the dynamic information is mainly used to supplement the control flow graph structure and is not further used to construct the instruction sequence on the complete execution path. Furthermore, this method does not incorporate symbolic execution to further expand the execution path of fuzz testing. Its dynamic analysis coverage remains limited by the finite execution paths provided by the program input samples, making it difficult to cover complex branching structures. Finally, during dynamic instrumentation, this method does not optimize for efficiency. Frequent context switching between the instrumentation logic and the target program may introduce significant performance overhead, thus affecting the overall analysis efficiency.
[0005] Agricultural Bank of China Limited disclosed a binary code semantic parsing method in its patent application "A Binary Code Semantic Parsing Method, Apparatus, Electronic Device and Medium" (Application Date: 2025-12-02, Application No.: 202511151575.8, Publication No.: CN 121050732 A). This method constructs a target control flow graph by combining static disassembly information and dynamic instrumentation information, introduces a semantic model to semantically represent the nodes of the control flow graph, and then combines a heterogeneous attention mechanism to classify and perform adversarial analysis on the nodes, thereby achieving the separation of code and data in binary code. Although this method incorporates dynamic instrumentation, it still has shortcomings. It lacks mechanisms for optimizing path coverage, such as fuzz testing and symbolic execution. Furthermore, it does not introduce structural complexity weights or a unified evaluation standard, making it difficult to measure the structural complexity of different binary programs. In addition, its overhead control mainly relies on limiting the number of instrumented functions, rather than optimizing at the instrumentation granularity level, such as using basic block-level instrumentation to more efficiently obtain key execution information without affecting the original functionality of the program.
[0006] In summary, the shortcomings of existing disassembler evaluation techniques are as follows:
[0007] 1. Existing methods for constructing baseline truth values based on compiler or debugging information usually require customized modifications to the compiler or rely on additional symbol information. The implementation process is complex and costly, and it is difficult to promote and apply them when facing unknown compilation environments.
[0008] 2. Existing methods for constructing benchmark truth values based on the output of disassembly tools essentially rely on the analytical capabilities of existing tools. In complex and obfuscated scenarios, the tools themselves may have high errors, resulting in inaccurate benchmark truth values. This further affects the objectivity and reliability of the evaluation results, making it difficult to serve as a unified standard to support fair comparisons between multiple tools.
[0009] 3. Existing methods for constructing baseline truth values through dynamic instrumentation typically rely on program test input driver execution, lacking a path exploration mechanism. This makes it difficult to cover deep branches and complex control flow structures in the program, resulting in insufficient coverage of the collected instruction or basic block set, thus affecting the integrity of the baseline truth values.
[0010] 4. Existing evaluation methods generally use statistical indicators such as precision and recall to measure disassembly results. These indicators can only reflect the overall recognition accuracy and do not take into account the differences in program structure complexity and the impact of obfuscation intensity on the analysis difficulty. They cannot reasonably distinguish between samples of different complexities.
[0011] 5. Most existing methods have not built a unified automated evaluation system. They still rely on human experience to analyze and interpret results during evaluation, which makes it difficult to support large-scale, systematic evaluation of multiple tools.
Summary of the Invention
[0012] The purpose of this invention is to address the shortcomings of the existing technologies mentioned above by proposing a performance evaluation method for disassemblers based on dynamic multi-path testing. This method aims to solve the problems of unreliable benchmark truth acquisition, inability of evaluation metrics to reflect program structure complexity, lack of effective support for obfuscated scenarios, and reliance on human experience in the performance evaluation process of existing disassemblers.
[0013] The technical approach to achieving the objectives of this invention is as follows. First, this invention employs a path exploration method that combines fuzz testing with dynamic symbolic execution. When fuzz testing enters a path starvation state, symbolic execution is introduced to solve the constraints of key branches in a targeted manner, and the resulting high-quality inputs are fed back into the fuzz testing seed pool. The key advantage is this: traditional methods rely solely on existing test inputs or randomly mutated inputs and therefore struggle to reach deep, complex branch paths, whereas symbolic execution can construct inputs that satisfy the conditions of uncovered branches directly from the path constraints. This overcomes the coverage bottleneck of input-driven methods in the prior art, significantly improves execution path coverage, provides a sufficient data foundation for a more complete benchmark, and addresses the problem of evaluation results that depend on incomplete execution paths.
Second, this invention adopts a strategy that combines concrete execution with dynamic symbolic execution along actual execution paths. Symbolic modeling and constraint solving are performed only on the key branches of paths reached by fuzz testing, which suppresses path explosion and reduces computational overhead while preserving analysis accuracy. The key feature is this: traditional full-path symbolic execution must enumerate a large number of potential paths, with the constraint size growing exponentially with the number of paths. In contrast, this invention selectively symbolizes only the key branches of the actual execution path, limiting the scope of constraint solving to the vicinity of paths that are actually reachable. This keeps path exploration effective while reducing the overhead of redundant path analysis.
Third, this invention uses dynamic instrumentation at basic block granularity. Recording only the entry address of each basic block is enough to obtain the program's execution trajectory, which significantly reduces instrumentation overhead while preserving the integrity of control flow information and avoids the performance bottleneck of instruction-by-instruction instrumentation. The key feature is this: a basic block is the smallest execution unit without internal control transfer, so its entry address uniquely determines the range of the instruction sequence it contains. Unlike instruction-by-instruction instrumentation, the instrumentation logic does not have to be triggered for every instruction, which greatly reduces the number of instrumentation events and the context switching overhead and improves runtime efficiency while keeping the key control flow information in the execution trajectory intact.
Finally, this invention combines dynamic instrumentation with static linear disassembly to construct the benchmark truth value. Starting from each basic block's entry address, it recovers the complete instruction sequence within the block, thereby constructing a high-precision benchmark truth value that covers the actual execution paths and is unaffected by obfuscation. Benchmark truth generation is thus automated and requires no source code. The key feature is this: the execution paths obtained by dynamic instrumentation originate from actual program execution and are therefore inherently correct, while static linear disassembly, given the known boundaries of a basic block, can uniquely determine the instruction sequence within it. Combining the two avoids the misjudgments of purely static analysis in obfuscated scenarios and removes the dependence on compiler or debugging information, thus solving the problem of inaccurate or unavailable benchmark truth values in the prior art.
[0014] To achieve the above objectives, the specific implementation steps of the present invention include the following:
[0015] Step 1: Alternately iterate between fuzz testing and symbolic execution on the evaluation dataset, and perform symbolic modeling on the key branches on the execution path of the fuzz test-generated input to form a high-coverage input set;
[0016] Step 2: Dynamically instrument the binary files in the evaluation dataset based on the dynamic binary instrumentation framework to form a basic block address set covering multi-path execution;
[0017] Step 3: Construct the baseline truth value based on basic block-level dynamic instrumentation and static analysis;
[0018] Step 4: Based on the baseline truth values, perform static analysis on the binary files in the evaluation dataset to construct a control flow graph;
[0019] Step 5: Based on the control flow graph, calculate the binary file structure feature parameters, including cyclomatic complexity and average number of instructions per basic block; assign weights to different binary files in the evaluation dataset based on the structure feature parameters.
[0020] Step 6: Use the disassembler to be evaluated to perform static disassembly on the binary files in the evaluation dataset;
[0021] Step 7: Compare the static disassembly results that the disassembler to be evaluated produces for the binary files in the evaluation dataset with the benchmark truth values. Based on the differences between the two, calculate the disassembler's evaluation metrics for each binary file.
[0022] Step 8: Perform a weighted average of the evaluation metrics based on the weights of the binary files in the evaluation dataset to obtain the comprehensive performance metrics of the disassembler to be evaluated on the evaluation dataset.
[0023] Furthermore, the steps of alternately iterating between fuzz testing and symbolic execution, and of symbolically modeling the key branches on the execution path of the fuzz test-generated input, are as follows:
[0024] The first step is to use a fuzzing tool to mutate the initial seed inputs for the binary files in the evaluation dataset, generate diverse test cases, drive execution of the binary files, and record the execution path coverage information.
[0025] The second step is to determine that the path starvation state has been entered when the fuzz test no longer finds a new control flow path.
[0026] The third step is to take a concrete input generated by the current fuzz test as the starting point, perform dynamic symbolic execution on the corresponding actual execution path, symbolically model only the input bytes related to the branch conditions on that path, and collect the path constraints.
[0027] The fourth step is to negate the corresponding path constraints for the uncovered branches and generate concrete inputs that satisfy the new path conditions through constraint solving.
[0028] The fifth step is to add the new inputs generated by symbolic execution to the fuzz testing seed pool to continue mutation and path exploration;
[0029] The sixth step involves iteratively expanding the program execution path through alternating fuzz testing and symbolic execution until the expected coverage percentage, preset time, or number of rounds is achieved, thus forming a high-coverage input set.
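The six steps above can be sketched as a small, self-contained loop. This is only an illustration under strong assumptions, not the implementation used by the invention: the target is a toy function that checks four magic bytes, the "fuzzer" applies single random byte mutations, and the "symbolic execution" step is replaced by a trivial solver that works only because every branch is an equality test on one input byte. All names (`run_target`, `solve_negated`, `hybrid_explore`, `stall_limit`) are hypothetical.

```python
import random

MAGIC = b"FUZZ"  # hypothetical nested branch conditions of the toy target


def run_target(data: bytes):
    """Toy target: nested magic-byte checks. Returns the branch trace as
    (byte_index, expected_byte, taken) tuples, standing in for the path
    constraints a concolic engine would collect along the executed path."""
    trace = []
    for i, expected in enumerate(MAGIC):
        taken = i < len(data) and data[i] == expected
        trace.append((i, expected, taken))
        if not taken:
            break
    return trace


def coverage_of(trace):
    """Branch coverage: the set of (branch, direction) pairs on the trace."""
    return {(i, taken) for i, _, taken in trace}


def mutate(seed: bytes, rng: random.Random) -> bytes:
    """Replace one random byte (a minimal stand-in for real fuzzer mutators)."""
    buf = bytearray(seed)
    buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)


def solve_negated(seed: bytes, trace, covered):
    """Negate the first branch whose other direction is uncovered and build a
    concrete input for it. Single-byte equality makes solving trivial here."""
    for i, expected, taken in trace:
        if (i, not taken) not in covered:
            buf = bytearray(seed.ljust(i + 1, b"\x00"))
            buf[i] = (expected + 1) % 256 if taken else expected
            return bytes(buf)
    return None


def hybrid_explore(seeds, max_rounds=20, stall_limit=50, rng_seed=0):
    rng = random.Random(rng_seed)
    pool = list(seeds)
    covered = set()
    for s in pool:
        covered |= coverage_of(run_target(s))
    for _ in range(max_rounds):
        # Fuzzing phase: stop once stall_limit mutations in a row find
        # nothing new, which is treated as the path starvation state.
        stalled = 0
        while stalled < stall_limit:
            candidate = mutate(rng.choice(pool), rng)
            new = coverage_of(run_target(candidate)) - covered
            if new:
                covered |= new
                pool.append(candidate)
                stalled = 0
            else:
                stalled += 1
        # Solving phase: negate uncovered branches on paths already reached
        # and feed the solved inputs back into the seed pool.
        progress = False
        for seed in list(pool):
            solved = solve_negated(seed, run_target(seed), covered)
            if solved is not None:
                new = coverage_of(run_target(solved)) - covered
                if new:
                    covered |= new
                    pool.append(solved)
                    progress = True
        if not progress:
            break
    return pool, covered


pool, covered = hybrid_explore([b"AAAA"])
```

Starting from the single seed `b"AAAA"`, random mutation alone rarely passes even the first check, while the solving phase unlocks one nested branch per round; the loop ends once no uncovered branch direction remains or the round budget is spent.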
[0030] Furthermore, the instrumentation of binary files based on the dynamic binary instrumentation framework refers to using a dynamic binary instrumentation tool to insert instrumentation processing logic at the entry point of a basic block during program execution; when the program executes under different input drives of the input set, the instrumentation processing logic records the address of the first instruction of the basic block actually executed by the CPU in real time; the instrumentation is performed at the basic block granularity, and recording is only performed at the entry point of the basic block, without instrumenting each instruction, so as to reduce runtime overhead and reduce the number of context switches; after multiple rounds of program execution, the basic block entry addresses recorded during all execution processes are summarized to form a set of basic block addresses covering multi-path execution.
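As a rough illustration of block-granularity recording, the sketch below simulates a program as a table of basic blocks and invokes a hook once per block entry. It does not use any real instrumentation framework; the block addresses, instruction counts, and the names `BLOCKS`, `run_with_block_hook`, and `collect_block_entries` are all hypothetical. It also counts how many hook invocations occur compared with the number of instructions executed, which is the quantity a per-instruction hook would pay for.

```python
# Hypothetical toy program: entry address -> (instruction count, successor chooser).
BLOCKS = {
    0x1000: (4, lambda x: 0x1010 if x > 0 else 0x1020),
    0x1010: (6, lambda x: 0x1030),
    0x1020: (3, lambda x: 0x1030),
    0x1030: (2, lambda x: None),
}


def run_with_block_hook(x, on_block_entry):
    """Execute the toy program on input x, calling the hook once per block.
    Returns the number of instructions that were (notionally) executed."""
    addr = 0x1000
    executed = 0
    while addr is not None:
        on_block_entry(addr)          # the only instrumentation point
        count, next_block = BLOCKS[addr]
        executed += count
        addr = next_block(x)
    return executed


def collect_block_entries(inputs):
    """Run every input and merge the recorded entry addresses into one set."""
    entries = set()
    callbacks = 0
    instructions = 0

    def hook(addr):
        nonlocal callbacks
        callbacks += 1
        entries.add(addr)             # the set deduplicates repeated entries

    for x in inputs:
        instructions += run_with_block_hook(x, hook)
    return entries, callbacks, instructions


entries, callbacks, instructions = collect_block_entries([5, -1])
```

With the two inputs shown, both sides of the branch at `0x1000` are exercised, so all four block entries end up in the set after six hook calls, while the same runs execute 21 instructions in total.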
[0031] Furthermore, the steps for constructing the baseline truth value based on dynamic instrumentation and static analysis are as follows:
[0032] The first step is to deduplicate the basic block address set to obtain a unique set of basic block start addresses;
[0033] The second step involves performing linear disassembly and parsing based on the instruction set architecture of the binary files in the evaluation dataset, starting from the entry address of each basic block.
[0034] The third step is to parse the instruction byte stream one by one according to the instruction decoding rules until a control flow transfer instruction, the start instruction of another basic block, or the code section termination position is encountered, thereby restoring the complete instruction sequence in the corresponding basic block.
[0035] The fourth step is to perform the above instruction recovery process on all basic blocks and integrate the instruction sequences of each basic block to construct a complete instruction set covering the actual execution path;
[0036] The fifth step is to use the complete instruction set as the benchmark truth for evaluating the performance of the disassembler.
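The recovery procedure in these five steps can be illustrated with a made-up instruction set. The opcode table below is purely hypothetical (it is not x86 or any real architecture), and a real implementation would delegate decoding to a disassembly library for the target instruction set; the point of the sketch is only the three stop conditions and the merging of per-block results.

```python
# Hypothetical ISA: opcode -> (mnemonic, length in bytes, is control transfer).
OPCODES = {
    0x01: ("nop", 1, False),
    0x02: ("mov", 2, False),
    0x03: ("add", 3, False),
    0xE0: ("jmp", 2, True),
    0xC3: ("ret", 1, True),
}


def recover_block(code, base, entry, block_entries):
    """Linearly decode from a recorded block entry until a control transfer,
    the start of another recorded block, or the end of the code section."""
    insns = []
    addr = entry
    end = base + len(code)
    while addr < end:
        opcode = code[addr - base]
        if opcode not in OPCODES:
            break                      # undecodable byte: stop conservatively
        mnemonic, length, is_transfer = OPCODES[opcode]
        if addr + length > end:
            break                      # instruction would run past the section
        insns.append((addr, mnemonic))
        addr += length
        if is_transfer or addr in block_entries:
            break
    return insns


def build_ground_truth(code, base, recorded_entries):
    """Deduplicate the entries, recover each block, and merge the addresses."""
    entries = set(recorded_entries)
    truth = set()
    for entry in sorted(entries):
        for addr, _ in recover_block(code, base, entry, entries):
            truth.add(addr)
    return truth


# mov (2 bytes), nop, jmp (2 bytes) | add (3 bytes), ret | nop that never runs
CODE = bytes([0x02, 0xAA, 0x01, 0xE0, 0x05, 0x03, 0x01, 0x02, 0xC3, 0x01])
truth = build_ground_truth(CODE, 0x1000, [0x1000, 0x1005, 0x1000])
```

In this example the operand bytes of `add` happen to look like opcodes, but they are never decoded as instructions because decoding only starts at recorded entries; the trailing `nop` at `0x1009` is absent from the result because no execution reached it.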
[0037] Furthermore, the steps for performing static analysis on the binary file and constructing the control flow graph are as follows:
[0038] The first step is to perform static disassembly on the binary file to identify the instruction addresses and control flow transfer relationships in the program;
[0039] The second step is to extract basic blocks based on the disassembly results and determine the jump relationships between basic blocks based on control flow transfer instructions;
[0040] The third step is to construct the program's control flow graph, using basic blocks as nodes and control flow transfer relationships as edges.
[0041] The fourth step is to perform a structured representation of the control flow graph for the calculation of structural characteristic parameters of cyclomatic complexity and average number of instructions per basic block.
[0042] Furthermore, the cyclomatic complexity is obtained by the following formula:
[0043] $M = E - N + 2P$;
[0044] Where M represents cyclomatic complexity, E represents the number of edges in the control flow graph, N represents the number of nodes in the control flow graph, and P represents the number of connected subgraphs.
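A minimal sketch of this calculation, assuming the standard definition of cyclomatic complexity (edges minus nodes plus twice the number of connected components) and a control flow graph given as a node list and an edge list. The function name and the example graphs are hypothetical.

```python
def cyclomatic_complexity(nodes, edges):
    """Return E - N + 2P, where P is the number of connected components
    of the graph when edge direction is ignored."""
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)

    components = len({find(n) for n in nodes})
    return len(edges) - len(nodes) + 2 * components


# One if/else diamond: 4 blocks, 4 edges, 1 component.
diamond_nodes = ["A", "B", "C", "D"]
diamond_edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
m_diamond = cyclomatic_complexity(diamond_nodes, diamond_edges)

# The same diamond plus a separate two-block loop: 6 blocks, 6 edges, 2 components.
m_both = cyclomatic_complexity(
    diamond_nodes + ["E", "F"],
    diamond_edges + [("E", "F"), ("F", "E")],
)
```

The diamond yields a complexity of 2, and adding the disconnected loop raises the total to 4, since each component contributes its own value.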
[0045] Furthermore, the average number of instructions per basic block is obtained by the following formula:
[0046] $\mathrm{AvgInst} = \frac{1}{m} \sum_{j=1}^{m} I_j$;
[0047] Where AvgInst represents the average number of instructions per basic block, m represents the number of basic blocks, and $I_j$ represents the number of instructions contained in the j-th basic block.
[0048] Furthermore, the weighting of different binary files in the evaluation dataset is achieved by the following formula:
[0049] (The weighting formula is not reproduced in this text.)
[0050] The terms of the formula are, in order: the weight of the i-th binary file in the evaluation dataset; n, the number of binary files in the evaluation dataset; the cyclomatic complexity of the i-th binary file in the evaluation dataset; and the normalized average number of instructions per basic block of the i-th binary file in the evaluation dataset;
[0051] The normalized average number of instructions per basic block is as follows:
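[0052] $\mathrm{AvgInst}'_i = \frac{\mathrm{AvgInst}_i - \mathrm{AvgInst}_{\min}}{\mathrm{AvgInst}_{\max} - \mathrm{AvgInst}_{\min}}$;
[0053] Where $\mathrm{AvgInst}'_i$ represents the normalized value, $\mathrm{AvgInst}_i$ represents the average number of instructions per basic block in the i-th binary file of the evaluation dataset, and $\mathrm{AvgInst}_{\min}$ and $\mathrm{AvgInst}_{\max}$ represent the minimum and maximum values of this indicator in the evaluation dataset, respectively.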
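The two quantities can be computed as in the following sketch. It assumes the normalization is ordinary min-max scaling, which the references to the dataset minimum and maximum suggest, and it maps the degenerate case where all files share the same average to 0.0, which is a choice made here rather than something the text specifies. The function names and sample block sizes are hypothetical.

```python
def average_instructions(block_sizes):
    """Average number of instructions per basic block for one binary file."""
    return sum(block_sizes) / len(block_sizes)


def min_max_normalize(values):
    """Scale values into [0, 1] using the dataset minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: with no spread in the dataset, every file maps to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Instruction counts per basic block for three hypothetical binaries.
per_file_blocks = [[4, 6, 2], [10, 10], [7]]
averages = [average_instructions(blocks) for blocks in per_file_blocks]
normalized = min_max_normalize(averages)
```

For the sample data the per-file averages are 4.0, 10.0, and 7.0, which normalize to 0.0, 1.0, and 0.5.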
[0054] Furthermore, the evaluation metrics of the disassembler to be evaluated on each binary file are calculated as follows:
[0055] $\mathrm{Precision}_i = \frac{|D_i \cap G_i|}{|D_i|}$;
[0056] $\mathrm{Recall}_i = \frac{|D_i \cap G_i|}{|G_i|}$;
[0057] Where $\mathrm{Precision}_i$ represents the precision of the disassembler being evaluated on the i-th binary file in the evaluation dataset, $\mathrm{Recall}_i$ represents its recall on the i-th binary file, $|\cdot|$ denotes the number of elements in a set, $\cap$ denotes set intersection, $D_i$ represents the set of instruction boundaries identified by the disassembler on the i-th binary file, and $G_i$ represents the baseline truth value of the i-th binary file.
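A minimal sketch of the per-file comparison, treating both the disassembler output and the baseline truth as sets of instruction start addresses. The function name, the handling of empty sets, and the example addresses are assumptions made for illustration.

```python
def precision_recall(detected, truth):
    """Precision and recall of detected instruction boundaries against
    the baseline truth, both given as sets of instruction addresses."""
    hits = len(detected & truth)
    # Assumption: an empty set yields 0.0 rather than a division error.
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical baseline truth and disassembler output for one binary file.
truth = {0x1000, 0x1002, 0x1003, 0x1005, 0x1008}
detected = {0x1000, 0x1002, 0x1003, 0x1006, 0x1008, 0x1009}
precision, recall = precision_recall(detected, truth)
```

Here four of the six reported boundaries match the baseline, so precision is 4/6, and four of the five baseline boundaries were found, so recall is 0.8.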
[0058] Furthermore, the weighted average of the evaluation indicators is obtained by the following formula:
[0059] (The weighted-average formula is not reproduced in this text.)
[0060] Where Score represents the overall performance metric of the disassembler being evaluated, n represents the number of binary files in the evaluation dataset, and $\sum$ denotes summation.
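The aggregation step can be sketched as a plain weighted average. Because the exact weighting formula is not available in this text, the sketch takes the per-file weights as given inputs and divides by their sum, so the result is a weighted average whether or not the weights already sum to one. The function name and the sample values are hypothetical.

```python
def weighted_score(metrics, weights):
    """Weighted average of per-file metric values (e.g. precision or recall)."""
    if len(metrics) != len(weights):
        raise ValueError("metrics and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * m for w, m in zip(weights, metrics)) / total


# Hypothetical per-file recall values and structure-derived weights.
recalls = [0.9, 0.6, 0.8]
weights = [1.0, 3.0, 1.0]
score = weighted_score(recalls, weights)
```

With these values the structurally heaviest file dominates, pulling the score to 0.7 instead of the unweighted mean of about 0.77.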
[0061] Compared with the prior art, the present invention has the following advantages:
[0062] First, this invention combines fuzzing with dynamic symbolic execution. When fuzzing enters a path-starved state, symbolic execution is introduced to solve key branches in a targeted manner, and the generated high-quality input is fed back into the fuzzing seed pool. This enhances path exploration capabilities and overcomes the limitation of existing technologies that rely solely on test cases provided by the program itself, resulting in insufficient coverage. This invention can more fully cover deep control flow paths in the program and continuously improve execution path coverage. The resulting dynamic execution path set is more comprehensive, providing higher-coverage data support for the construction of subsequent benchmark truth values, thereby improving the completeness and reliability of disassembler performance evaluation results.
[0063] Second, by adopting a strategy that combines actual execution with dynamic symbolic execution, this invention only performs symbolic modeling and constraint solving on branches on the actual execution path of fuzzing, effectively avoiding the computational overhead caused by the path explosion problem in traditional symbolic execution. This allows the invention to reduce computational overhead while ensuring the accuracy of path exploration, thereby improving the overall analysis efficiency. This strategy achieves more efficient path expansion under limited resource conditions, enabling the evaluation process to cover more effective paths within a controllable time, thus improving the efficiency and scalability of disassembler performance evaluation.
[0064] Third, this invention employs a basic block-level dynamic instrumentation method, recording only the entry address of the basic block and combining it with static linear disassembly to recover the complete instruction sequence. This avoids the performance bottleneck caused by instruction-by-instruction instrumentation in traditional techniques, significantly reducing runtime overhead and performance loss from frequent context switching compared to instruction-by-instruction instrumentation methods. This allows for efficient data acquisition while ensuring the integrity of critical execution information. This method can significantly improve data acquisition efficiency while maintaining the accuracy of the execution trajectory, enabling the evaluation process to support large-scale binary file analysis, thereby enhancing the practicality and engineering feasibility of disassembly tool performance evaluation.
[0065] Fourth, by combining the real execution path information obtained by dynamic instrumentation with the static disassembly recovery mechanism, this invention constructs a high-precision benchmark truth value that covers the actual running trajectory and is not affected by obfuscation. Compared with evaluation methods that rely on compiler information or existing disassembly tool results, it can provide a more accurate and objective evaluation benchmark under the condition of no source code.
[0066] Fifth, by introducing program structure features such as cyclomatic complexity and the average number of instructions per basic block, this invention assigns weights to binary files, enabling the evaluation results to reflect the differences in structural complexity and analysis difficulty among different binary programs. This overcomes the problem that traditional Precision and Recall metrics cannot characterize complexity differences, thus improving the rationality and discriminativeness of the performance evaluation results for disassemblers.
[0067] Sixth, this invention constructs a complete automated evaluation framework of "high-coverage input generation - dynamic execution trajectory acquisition - benchmark truth construction - structure-weighted evaluation", which realizes quantitative analysis of the performance of disassemblers in complex obfuscated scenarios. Compared with existing methods, it has stronger universality, scalability and practical value.
Attached Figure Description
[0068] Figure 1 is a flowchart of the method of the present invention;
[0069] Figure 2 is a flowchart illustrating how fuzz testing and symbolic execution work together in this invention;
[0070] Figure 3 is a flowchart of the dynamic instrumentation process in this invention.
Detailed Implementation
[0071] The implementation steps of the present invention will be further described in detail below with reference to the accompanying drawings and embodiments.
[0072] Example 1: Automated evaluation method for disassemblers.
[0073] Referring to Figure 1, the overall process framework of this invention includes the stages of benchmark truth extraction, construction of disassembler evaluation metrics, and disassembler evaluation. The method includes the following steps:
[0074] Step 1: Baseline truth extraction.
[0075] Step 1.1, Input generation and path expansion based on fuzz testing and symbolic execution.
[0076] Figure 2 describes how fuzz testing and symbolic execution work together to expand the explored paths.
[0077] This step is used to generate a high-coverage set of program inputs, thereby covering as many diverse execution paths of the target program as possible.
[0078] In this embodiment, binary programs processed by several obfuscation tools are selected as the evaluation dataset, including:
[0079] A SPEC dataset obfuscated with OLLVM;
[0080] A SPEC dataset obfuscated with desync-cc;
[0081] A dataset of cryptographic algorithms obfuscated with Tigress;
[0082] A dataset of cryptographic algorithms obfuscated with VMProtect.
[0083] First, an initial input set is generated by fuzzing with AFL++ in QEMU mode. Starting from the seed inputs, new test cases are continuously generated using mutation strategies such as truncation, insertion, and byte flipping. These inputs include both valid inputs that drive normal execution and malformed inputs that may trigger abnormal paths, forming a diverse set of candidates for path exploration.
[0084] During path exploration, fuzzing serves as the first phase and is responsible for broad coverage of the program's control flow. When fuzzing fails to discover new control flow edges over several consecutive iterations, it is judged to have entered a path starvation state. At this point, symbolic execution with angr is introduced for supplementary analysis.
[0085] Specifically, starting with the concrete input generated by fuzzing, dynamic symbolic execution is performed on the corresponding real execution path. The input is abstracted into symbolic variables, and branch constraints are collected along the execution path. When an uncovered branch is encountered, the corresponding constraint is inverted, and a constraint solver is used to generate concrete input that satisfies the conditions of the new path.
[0086] To control computational overhead, this invention employs a strategy combining concrete execution and symbolic execution. Symbolic modeling is performed only on the critical branches of the execution path taken by the fuzz-generated input, rather than enumerating all paths. New inputs generated by symbolic execution are returned to the fuzz testing seed pool to participate in subsequent mutations, thereby guiding the program into deeper control flow regions.
[0087] By iteratively alternating between fuzz testing and symbolic execution, path coverage can be gradually improved, ultimately resulting in a high-coverage input set.
[0088] Step 1.2: Construction of the baseline truth based on basic block-level dynamic instrumentation and static analysis.
[0089] Figure 3 describes the process of obtaining the starting addresses of basic blocks through dynamic instrumentation.
[0090] This step is used to construct accurate instruction-level baseline truth values based on the actual execution trajectory.
[0091] After obtaining the high-coverage input set generated in the previous step, dynamic instrumentation technology is used to collect the execution trajectory of the target program. Specifically, the instrumentation tool of the corresponding platform is used to insert instrumentation logic during program execution and record the address of the basic block start instruction actually executed by the CPU.
[0092] During each input-driven execution, the instrumentation module records the basic block entries on the corresponding execution path. After multiple rounds of program execution, the basic block entry addresses recorded in all execution trajectories are merged and deduplicated to obtain the set of basic block starting addresses.
[0093] Subsequently, by combining static disassembly methods and according to the decoding rules of the instruction set architecture, a linear scan is performed starting from the entry address of each basic block to recover the instructions one by one until a control flow transfer instruction is encountered, thereby obtaining the complete instruction sequence within the basic block.
[0094] By performing the above process on all basic blocks, a complete set of instructions covering the dynamic execution path is finally constructed as the baseline truth value.
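The merge, deduplication, and per-block linear scan described above can be sketched as follows. This is a simplified illustration, not the invention's implementation: `TOY_ISA` is a hypothetical variable-length instruction set invented for the example (a real system would use the decoder of the target architecture), and `recover_block` and `build_ground_truth` are hypothetical helper names. In addition to stopping at a control flow transfer instruction, the scan also stops at the start of another recorded basic block or at the end of the code section.

```python
# Hypothetical toy ISA: opcode -> (mnemonic, length in bytes, is control-flow transfer).
TOY_ISA = {
    0x01: ("mov", 3, False),
    0x02: ("add", 2, False),
    0x90: ("nop", 1, False),
    0x74: ("jz", 2, True),
    0xC3: ("ret", 1, True),
}


def recover_block(code, base, entry, block_entries):
    """Linear scan from one block entry; returns [(address, mnemonic, length), ...]."""
    instructions = []
    end = base + len(code)
    addr = entry
    while base <= addr < end:
        if addr != entry and addr in block_entries:
            break  # reached the start of another recorded basic block
        decoded = TOY_ISA.get(code[addr - base])
        if decoded is None:
            break  # undecodable byte: stop rather than guess
        mnemonic, length, is_transfer = decoded
        if addr + length > end:
            break  # instruction would run past the end of the code section
        instructions.append((addr, mnemonic, length))
        if is_transfer:
            break  # a control flow transfer closes the basic block
        addr += length
    return instructions


def build_ground_truth(code, base, traces):
    """Merge and deduplicate traced block entries, then recover every block."""
    block_entries = set().union(*traces)
    truth = {}
    for entry in sorted(block_entries):
        for addr, mnemonic, length in recover_block(code, base, entry, block_entries):
            truth[addr] = (mnemonic, length)
    return truth
```

Because scanning only ever starts at addresses that were actually executed, bytes that no trace reaches (for example junk bytes inserted by an obfuscator) never enter the resulting instruction set.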
[0095] It should be noted that this invention employs a basic block-level instrumentation strategy, rather than instruction-by-instruction instrumentation. Instruction-by-instruction instrumentation requires triggering the instrumentation logic before each instruction is executed, leading to frequent context switching during program execution and resulting in significant performance overhead. In contrast, basic block-level instrumentation significantly reduces performance loss while ensuring the acquisition of critical control flow information, offering better execution efficiency and engineering feasibility.
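The difference in overhead can be made concrete by counting how often the instrumentation logic would fire at each granularity. The sketch below is only a counting model under simple assumptions: `hook_invocations` is a hypothetical helper, and a trace is represented as the sequence of executed basic block entry addresses together with a table of block sizes.

```python
def hook_invocations(block_trace, block_sizes):
    """Return (per-instruction hook count, per-block hook count) for one trace.

    block_trace: executed basic block entry addresses, in execution order.
    block_sizes: mapping from block entry address to its number of instructions.
    """
    per_instruction = sum(block_sizes[entry] for entry in block_trace)
    per_block = len(block_trace)
    return per_instruction, per_block
```

For a trace that executes a 5-instruction block once and then an 8-instruction loop body 100 times, instruction-level instrumentation fires 805 times while basic block-level instrumentation fires 101 times, yet both observe the same set of block entry addresses.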
[0096] Step 2: Construction of evaluation metrics for disassemblers.
[0097] This step is used to establish an evaluation index system that takes into account the complexity of the program structure.
[0098] For each binary program in the evaluation dataset, its control flow and structural features are first extracted through static analysis, including metrics such as cyclomatic complexity and average number of instructions per basic block. Cyclomatic complexity measures the complexity of program control flow branches, while the average number of instructions per basic block characterizes the local structural density.
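The two structural features can be computed from a control flow graph as sketched below. The sketch assumes the standard definition of cyclomatic complexity, M = E - N + 2P, with E edges, N nodes, and P connected components; the graph representation (a dictionary from node to successor list) and the function names are hypothetical choices for illustration.

```python
def cyclomatic_complexity(cfg):
    """Compute M = E - N + 2P for a CFG given as {node: [successor, ...]}."""
    nodes = set(cfg)
    for successors in cfg.values():
        nodes.update(successors)
    edges = sum(len(successors) for successors in cfg.values())

    # Count connected components on the undirected view of the graph.
    neighbours = {node: set() for node in nodes}
    for src, successors in cfg.items():
        for dst in successors:
            neighbours[src].add(dst)
            neighbours[dst].add(src)
    seen, components = set(), 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(neighbours[node] - seen)
    return edges - len(nodes) + 2 * components


def average_instructions(block_sizes):
    """Average number of instructions per basic block, given {block: instruction count}."""
    return sum(block_sizes.values()) / len(block_sizes)
```

For an if/else diamond such as `{"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}`, there are 4 edges, 4 nodes, and 1 component, so the function returns 2, which matches the single decision point in the graph.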
[0099] Based on the aforementioned structural characteristics, a weight coefficient is assigned to each binary program to reflect its analytical difficulty and importance in the overall evaluation. By introducing a weighting mechanism, the excessive influence of structurally simple programs on the overall evaluation results can be avoided, making the evaluation results more reasonable.
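A possible shape for the weighting step is sketched below. The min-max normalization of the average number of instructions per basic block follows the claims, but the rule that combines cyclomatic complexity with the normalized average is not recoverable from this text, so it is left as a caller-supplied `combine` function. All names here are hypothetical, and the combination used in the usage note is an assumption chosen purely for illustration.

```python
def min_max_normalize(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def program_weights(cyclomatic, avg_inst, combine):
    """Per-program weights that sum to 1.

    cyclomatic: cyclomatic complexity of each binary.
    avg_inst:   average instructions per basic block of each binary.
    combine:    ASSUMED difficulty rule, combine(cc, normalized_avg) -> positive float.
    """
    normalized = min_max_normalize(avg_inst)
    raw = [combine(cc, a) for cc, a in zip(cyclomatic, normalized)]
    total = sum(raw)
    if total <= 0:
        raise ValueError("combined difficulty scores must have a positive sum")
    return [r / total for r in raw]
```

With the illustrative rule `lambda cc, a: cc * (1.0 + a)`, the inputs `[2, 10, 6]` and `[4.0, 8.0, 6.0]` give raw scores 2, 20, and 9, so the structurally simplest program receives the smallest share of the total weight.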
[0100] Step 3: Disassembler evaluation.
[0101] This step is used to evaluate and compare the performance of different disassemblers.
[0102] In this embodiment, the following typical disassemblers are selected as evaluation objects:
[0103] XDA; DeepDi; bi-RNN; IDA Pro.
[0104] First, the disassembler to be evaluated is applied to the target binary program to obtain its disassembled output. Then, this output is compared instruction-by-instruction with the baseline truth value constructed in step 1 to identify correct and incorrect parsing in the disassembled result.
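The instruction-by-instruction comparison can be sketched as a set comparison over instruction start addresses. This is a minimal sketch assuming that both the disassembler output and the baseline truth value are reduced to sets of instruction boundaries, as the claims describe; `precision_recall` is a hypothetical helper name.

```python
def precision_recall(predicted, truth):
    """Compare predicted instruction start addresses against the baseline truth.

    Returns (precision, recall); an empty set on either side yields 0.0 for the
    corresponding ratio instead of dividing by zero.
    """
    predicted, truth = set(predicted), set(truth)
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall
```

For example, if the baseline truth is `{0x1000, 0x1003, 0x1005, 0x1006}` and a tool reports `{0x1000, 0x1003, 0x1004, 0x1006}`, three boundaries match, so both precision and recall are 0.75. The misplaced boundary at `0x1004` counts once as a false positive and the missed `0x1005` once as a false negative.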
[0105] Based on this, the evaluation results of each test sample are weighted and summarized in combination with the program weights calculated in step 2 to obtain the comprehensive performance index of the disassembler on the entire evaluation dataset.
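The weighted summary can be sketched as a weighted average over the per-program results. The sketch assumes one scalar metric per binary; which metric is aggregated (precision, recall, or a combination of the two) is not stated in this passage, so the function is agnostic, and `weighted_score` is a hypothetical name.

```python
def weighted_score(per_file_metric, weights):
    """Weighted average of per-binary metrics; weights need not be pre-normalized."""
    if len(per_file_metric) != len(weights):
        raise ValueError("one weight is required per binary")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must have a positive sum")
    weighted_sum = sum(w * m for w, m in zip(weights, per_file_metric))
    return weighted_sum / total_weight
```

With per-binary results `[0.9, 0.6]` and weights `[0.25, 0.75]`, the score is approximately 0.675 rather than the unweighted mean of 0.75, because the harder program (larger weight) dominates the result.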
[0106] Under the above dataset and evaluation process, the weighted evaluation results of each disassembler are as follows:
[0107] XDA: 95.86; bi-RNN: 92.79; DeepDi: 95.56; IDA Pro: 66.51.
[0108] The evaluation results show that XDA and DeepDi achieved the best overall performance. This is mainly due to their structural advantages: XDA, based on the Transformer architecture, can fully model long-range contextual dependencies; DeepDi, on the other hand, uses superset disassembly combined with graph neural networks, which can more comprehensively characterize the structural relationships and semantic information between instructions. Therefore, these two methods are more adaptable to complex control flow and various obfuscation patterns, maintaining high disassembly accuracy even in highly obfuscated scenarios. In contrast, bi-RNN, due to its relatively simple model structure, has limited ability to express long-range dependencies and complex control flow, resulting in a slight performance drop when facing strongly obfuscated structures. IDA Pro, relying primarily on heuristic rules and traditional static analysis strategies, struggles to accurately recover the true instruction sequence when facing complex obfuscation techniques such as control flow flattening, spoofed control flow, and virtualization, and therefore receives a considerably lower overall evaluation score.
[0109] This evaluation process allows for an objective comparison of different disassemblers under a unified standard, thereby accurately reflecting their actual analytical capabilities in complex and obfuscated binary environments.
Claims
1. A performance evaluation method for disassemblers based on dynamic multi-path testing, characterized in that the evaluation method comprises the following steps: Step 1: alternately iterate fuzz testing and symbolic execution on the evaluation dataset, performing symbolic modeling on the key branches along the execution paths of fuzz-generated inputs, to form a high-coverage input set; Step 2: dynamically instrument the binary files in the evaluation dataset using a dynamic binary instrumentation framework to form a basic block address set covering multi-path execution; Step 3: construct the baseline truth value based on basic block-level dynamic instrumentation and static analysis; Step 4: based on the baseline truth value, perform static analysis on the binary files in the evaluation dataset to construct a control flow graph; Step 5: based on the control flow graph, calculate the structural feature parameters of each binary file, including cyclomatic complexity and the average number of instructions per basic block, and assign weights to the binary files in the evaluation dataset based on these structural feature parameters; Step 6: use the disassembler to be evaluated to perform static disassembly on the binary files in the evaluation dataset; Step 7: compare the static disassembly results produced by the disassembler to be evaluated for the binary files in the evaluation dataset with the baseline truth values and, based on the differences, calculate the evaluation metric of the disassembler to be evaluated for each binary file; Step 8: perform a weighted average of the evaluation metrics based on the weights of the binary files in the evaluation dataset to obtain the comprehensive performance metric of the disassembler to be evaluated on the evaluation dataset.
2. The disassembler performance evaluation method according to claim 1, characterized in that the alternating iteration of fuzz testing and symbolic execution in step 1, with symbolic modeling of the key branches along the execution paths of fuzz-generated inputs, comprises the following steps: The first step is to use a fuzzing tool to mutate the initial seed inputs for the binary files in the evaluation dataset, generate diverse test cases, drive execution, and record execution path coverage information; The second step is to determine that a path starvation state has been entered when fuzz testing no longer discovers new control flow paths; The third step is to take a concrete input generated by the current fuzz testing as the starting point, perform dynamic symbolic execution along the corresponding actual execution path, symbolically model only the input bytes related to the branch conditions on that path, and collect the path constraints; The fourth step is to negate the corresponding path constraints for uncovered branches and generate concrete inputs that satisfy the new path conditions through constraint solving; The fifth step is to add the new inputs generated by symbolic execution to the fuzzing seed pool to continue mutation and path exploration; The sixth step is to iteratively expand the program execution paths by alternating fuzz testing and symbolic execution until the expected coverage percentage, the preset time, or the preset number of rounds is reached, thereby forming a high-coverage input set.
3. The disassembler performance evaluation method according to claim 2, characterized in that instrumenting the binary files based on a dynamic binary instrumentation framework in step 2 means using a dynamic binary instrumentation tool to insert instrumentation logic at the entry point of each basic block during program execution. When the program executes under the different inputs from the input set, the instrumentation logic records in real time the address of the first instruction of each basic block actually executed by the CPU. The instrumentation is performed at basic block granularity, recording only at basic block entry points rather than instrumenting each instruction individually, to reduce runtime overhead and context switching frequency. After multiple rounds of program execution, all basic block entry addresses recorded during execution are aggregated to form a set of basic block addresses covering multiple execution paths.
4. The disassembler performance evaluation method according to claim 3, characterized in that the construction of the baseline truth value based on dynamic instrumentation and static analysis in step 3 comprises the following steps: The first step is to deduplicate the basic block address set to obtain a unique set of basic block start addresses; The second step is to perform linear disassembly, starting from the entry address of each basic block, according to the instruction set architecture of the binary files in the evaluation dataset; The third step is to decode the instruction byte stream one instruction at a time according to the instruction decoding rules until a control flow transfer instruction, the start instruction of another basic block, or the end of the code section is encountered, thereby recovering the complete instruction sequence of the corresponding basic block; The fourth step is to perform the above instruction recovery process on all basic blocks and merge the instruction sequences of all basic blocks to construct a complete set of instructions covering the actual execution paths; The fifth step is to use this complete set of instructions as the baseline truth value for evaluating the performance of the disassembler.
5. The disassembler performance evaluation method according to claim 4, characterized in that, The steps for performing static analysis on the binary file and constructing the control flow graph in step 4 are as follows: The first step is to perform static disassembly on the binary file to identify the instruction addresses and control flow transfer relationships in the program; The second step is to extract basic blocks based on the disassembly results and determine the jump relationships between basic blocks based on control flow transfer instructions; The third step is to construct the program's control flow graph, using basic blocks as nodes and control flow transfer relationships as edges. The fourth step is to perform a structured representation of the control flow graph for the calculation of structural characteristic parameters of cyclomatic complexity and average number of instructions per basic block.
6. The disassembler performance evaluation method according to claim 5, characterized in that the cyclomatic complexity mentioned in step 5 is obtained from the following formula: $M = E - N + 2P$; where M represents cyclomatic complexity, E represents the number of edges in the control flow graph, N represents the number of nodes in the control flow graph, and P represents the number of connected subgraphs.
7. The disassembler performance evaluation method according to claim 6, characterized in that the average number of instructions per basic block mentioned in step 5 is obtained by the following formula: $AvgInst = \frac{1}{m}\sum_{j=1}^{m} Inst_j$; where AvgInst represents the average number of instructions per basic block, m represents the number of basic blocks, and $Inst_j$ represents the number of instructions contained in the j-th basic block.
8. The disassembler performance evaluation method according to claim 7, characterized in that the weighting of different binary files in the evaluation dataset, as described in step 5, is achieved by the following formula: ; where $w_i$ represents the weight of the i-th binary file in the evaluation dataset, n represents the number of binary files in the evaluation dataset, $M_i$ represents the cyclomatic complexity of the i-th binary file in the evaluation dataset, and $\widetilde{AvgInst}_i$ represents the normalized average number of instructions per basic block of the i-th binary file in the evaluation dataset; the normalized average number of instructions per basic block is obtained as follows: $\widetilde{AvgInst}_i = \frac{AvgInst_i - AvgInst_{\min}}{AvgInst_{\max} - AvgInst_{\min}}$; where $AvgInst_i$ represents the average number of instructions per basic block of the i-th binary file in the evaluation dataset, and $AvgInst_{\min}$ and $AvgInst_{\max}$ represent the minimum and maximum values of this indicator in the evaluation dataset, respectively.
9. The disassembler performance evaluation method according to claim 8, characterized in that the formulas for calculating the evaluation metrics of the disassembler to be evaluated for each binary file, as described in step 7, are as follows: $P_i = \frac{|D_i \cap G_i|}{|D_i|}$; $R_i = \frac{|D_i \cap G_i|}{|G_i|}$; where $P_i$ represents the precision of the disassembler to be evaluated on the i-th binary file in the evaluation dataset, $R_i$ represents the recall of the disassembler to be evaluated on the i-th binary file in the evaluation dataset, $|\cdot|$ denotes the number of elements in a set, $\cap$ denotes set intersection, $D_i$ represents the set of instruction boundaries identified by the disassembler on the i-th binary file in the evaluation dataset, and $G_i$ represents the baseline truth value of the i-th binary file in the evaluation dataset.
10. The disassembler performance evaluation method according to claim 9, characterized in that the weighted average of the evaluation metrics described in step 8 is obtained by the following formula: ; where Score represents the overall performance metric of the disassembler to be evaluated, n represents the number of binary files in the evaluation dataset, and $\sum$ denotes the summation operation.