A method, apparatus and device for testing a disassembler
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: ALIPAY (HANGZHOU) INFORMATION TECH CO LTD
- Filing Date: 2026-03-11
- Publication Date: 2026-06-30
Figure CN122309356A_ABST
Abstract
Description
Technical Field
[0001] This document relates to the field of computer technology, and in particular to a testing method, apparatus and device for a disassembler.
Background Technology
[0002] Disassembly is the process of translating machine code into human-readable assembly code, and it is a core step in binary reverse engineering. Disassembly underpins many important downstream security applications, such as decompilation, binary rewriting, and vulnerability detection and private-data leak detection for programs that are available only in binary form.
[0003] Disassembly tasks involve a series of disassembly primitives, primarily instruction recovery, function boundary detection, function signature recognition, control flow graph recovery, and function call graph recovery, with instruction recovery forming the foundation for the subsequent primitives. If instructions are not recognized accurately, function boundaries (i.e., function entry and exit instructions) are difficult to determine. Similarly, the jump relationships between basic blocks in the control flow graph are difficult to identify accurately (i.e., errors appear on the edges of the control flow graph). Disassemblers such as IDA Pro and Ghidra are typically used to identify the instructions within basic blocks. However, these disassemblers often misidentify inline data as code or code as inline data, which leads to wrong operands in instruction calculations and therefore to incorrect results. Sometimes non-existent "ghost" instructions even appear, causing variable mismatches and control flow desynchronization. A testing scheme for disassemblers is therefore needed to improve the efficiency and accuracy with which they are tested.
Summary of the Invention
[0004] The purpose of the embodiments in this specification is to provide a testing scheme for disassemblers, thereby improving the testing efficiency and accuracy of disassemblers.
[0005] To achieve the above purpose, the embodiments in this specification are implemented as follows: This specification provides a method for testing a disassembler tool, comprising: acquiring a target binary file; disassembling the binary file using multiple different disassemblers to obtain assembly code corresponding to each disassembler, wherein the multiple different disassemblers include a target disassembler whose disassembly effect needs to be tested; performing differential test analysis on the obtained assembly code using a first intelligent agent to determine the code analysis result corresponding to the binary file, wherein the code analysis result corresponding to the binary file includes one or more of the following: abnormal assembly code blocks, descriptive information of assembly code with differences, and heuristic information corresponding to the multiple assembly codes; wherein the heuristic information is a series of semantic constraints that together constitute a reference for judging correctness; and determining the test result of the disassembly effect of the target disassembler tool based on the code analysis result corresponding to the binary file.
[0006] This specification provides an embodiment of a testing apparatus for a disassembler tool. The apparatus includes: a first file acquisition module for acquiring a target binary file; a disassembly module for disassembling the binary file using multiple different disassemblers to obtain assembly code corresponding to each disassembler, wherein the multiple different disassemblers include the target disassembler whose disassembly effect needs to be tested; a differential test analysis module for performing differential test analysis on the obtained assembly code using a first intelligent agent to determine the code analysis result corresponding to the binary file, wherein the code analysis result corresponding to the binary file includes one or more of the following: abnormal assembly code blocks, descriptive information of assembly code with differences, and heuristic information corresponding to the multiple assembly codes; wherein the heuristic information is a series of semantic constraints that together constitute a reference for judging correctness; and a test result determination module for determining the test result of the disassembly effect of the target disassembler tool based on the code analysis result corresponding to the binary file.
[0007] This specification provides an embodiment of a testing device for a disassembler tool. The testing device includes a processor and a memory configured to store computer-executable instructions. When executed, the executable instructions cause the processor to: acquire a target binary file; disassemble the binary file using multiple different disassemblers to obtain assembly code corresponding to each disassembler, wherein the multiple different disassemblers include a target disassembler whose disassembly effect needs to be tested; perform differential test analysis on the obtained assembly code using a first intelligent agent to determine the code analysis result corresponding to the binary file, wherein the code analysis result includes one or more of the following: abnormal assembly code blocks, descriptive information of assembly code with differences, and heuristic information corresponding to the multiple assembly codes; wherein the heuristic information is a series of semantic constraints that together constitute a reference for judging correctness; and determine the test result of the disassembly effect of the target disassembler tool based on the code analysis result corresponding to the binary file.
[0008] This specification also provides a storage medium for storing computer-executable instructions. When executed by a processor, the executable instructions implement the following process: acquiring a target binary file; disassembling the binary file using multiple different disassemblers to obtain assembly code corresponding to each disassembler, wherein the multiple different disassemblers include a target disassembler whose disassembly effect needs to be tested; performing differential test analysis on the obtained assembly code using a first intelligent agent to determine the code analysis result corresponding to the binary file, wherein the code analysis result corresponding to the binary file includes one or more of the following: abnormal assembly code blocks, descriptive information of assembly code with differences, and heuristic information corresponding to the multiple assembly codes; wherein the heuristic information is a series of semantic constraints that together constitute a reference for judging correctness; and determining the test result of the disassembly effect of the target disassembler based on the code analysis result corresponding to the binary file.
[0009] This specification also provides a computer program product, including a computer program that, when executed by a processor, implements the following process: acquiring a target binary file; disassembling the binary file using multiple different disassemblers to obtain assembly code corresponding to each disassembler, wherein the multiple different disassemblers include a target disassembler whose disassembly effect needs to be tested; performing differential test analysis on the obtained assembly code using a first intelligent agent to determine the code analysis result corresponding to the binary file, wherein the code analysis result corresponding to the binary file includes one or more of the following: abnormal assembly code blocks, descriptive information of assembly code with differences, and heuristic information corresponding to the multiple assembly codes; wherein the heuristic information is a series of semantic constraints that together constitute a reference for judging correctness; and determining the test result of the disassembly effect of the target disassembler based on the code analysis result corresponding to the binary file.
Attached Figure Description
[0010] To illustrate the technical solutions in the embodiments of this specification or in the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some of the embodiments recorded in this specification; those skilled in the art can obtain other drawings from them without creative effort.
Figure 1 is a schematic diagram of the structure of a test system for a disassembler tool described in this specification;
Figure 2 is a schematic diagram of a testing process for a disassembler tool described in this specification;
Figure 3 is a schematic diagram of another testing process for a disassembler tool described in this specification;
Figure 4 is a schematic diagram of another testing process for a disassembler tool described in this specification;
Figure 5 is a schematic diagram of another testing process for a disassembler tool described in this specification;
Figure 6 is a schematic diagram of another testing process for a disassembler tool described in this specification;
Figure 7 is a schematic diagram of another testing process for a disassembler tool described in this specification;
Figure 8 is a schematic diagram of a test apparatus for a disassembler tool described in this specification;
Figure 9 is a schematic diagram of a test device for a disassembler tool described in this specification.
Detailed Implementation
[0011] This specification provides a testing method, apparatus, and device for a disassembler.
[0012] To enable those skilled in the art to better understand the technical solutions in this specification, the technical solutions in the embodiments of this specification will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this specification, and not all embodiments. Based on the embodiments in this specification, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of this specification.
[0013] This specification provides a fuzzing mechanism for disassemblers based on differential testing. The disassembly task includes a series of disassembly primitives, mainly instruction recovery, function boundary detection, function signature recognition, control flow graph recovery, and function call graph recovery, among which instruction recovery is the foundation of the subsequent series of primitives. If accurate instruction recognition cannot be performed, function boundaries will be difficult to determine; similarly, the jump relationships between basic blocks in the control flow graph will also be difficult to accurately identify. In the task of identifying instructions within basic blocks, disassemblers such as IDA Pro and Ghidra are usually used for identification processing. However, these disassemblers often identify inline data as code, or misidentify code as inline data, resulting in incorrect operands in instruction calculations and thus incorrect results. Sometimes, even non-existent ghost instructions may appear, causing variable result mismatches and even control flow desynchronization.
[0014] Tests have been conducted on these problems, with the results analyzed and the findings and possible attributions presented. However, current tests construct their benchmarks from standard test sets and common software (such as SPEC2000 and MySQL) and do not cover real-world malware (malicious binary files). Real-world malware often employs obfuscation, packing, stripping, and other protection mechanisms, so the control flow of the generated assembly code differs significantly from that of the binary file. Current testing methods directly compare disassembler results with test elements collected from the compilation process or from debug information. This is not suitable for testing disassemblers on malware or protected binary files, because test elements from the compilation process cannot easily be recovered for stripped executables that come without source code. This specification provides a testing method for disassemblers that improves the efficiency and accuracy of disassembler testing. Specific processing details can be found in the following embodiments.
[0015] The testing method for disassemblers provided in one or more embodiments of this specification is applicable to a disassembler testing environment. As shown in Figure 1, the implementation environment includes at least a client 100 and a server 200. The server 200 may further include a first intelligent agent 210 and various algorithms. The client 100 can run on a terminal device, which can be a mobile phone, a personal computer, a tablet, an e-book reader, a wearable device, a device that exchanges information based on AR (Augmented Reality) or VR (Virtual Reality), a laptop computer, and so on. The client 100 can be installed on the terminal device and can be an application, a browser, or a subroutine embedded in an application.
[0016] The server 200 is a server-side program. It can run on server hardware, which can be a single server, a server cluster consisting of several servers, or a cloud server of a cloud computing platform. The server 200 can be installed on that hardware and can be an application or a subroutine embedded in an application. The first intelligent agent 210 and the various algorithms can be integrated into the server 200, or the server 200 can call the first intelligent agent 210 and any one or more of the algorithms to perform the corresponding operations.
[0017] In addition, the implementation environment may include a database 300, which can be located either on the server hardware on which the server 200 runs or outside it. The database 300 can store binary files, assembly code, and other related information.
[0018] In this implementation environment, when technicians need to test a disassembler (i.e., the target disassembler), they can initiate a test request for the target disassembler through the client 100. The server 200 can obtain the target binary file and disassemble it using multiple different disassemblers to obtain the assembly code corresponding to each disassembler. The multiple disassemblers include the target disassembler whose disassembly effect needs to be tested. The first intelligent agent can then perform differential test analysis on the multiple assembly codes to determine the code analysis result corresponding to the binary file. The code analysis result includes one or more of the following: abnormal assembly code blocks, descriptive information of assembly code with differences, and heuristic information corresponding to the multiple assembly codes. The heuristic information is a series of semantic constraints that together constitute a reference for judging correctness (i.e., reference information that serves as a common basis for judging what is correct and what is erroneous). Finally, the test result of the disassembly effect of the target disassembler can be determined based on the code analysis result corresponding to the binary file.
[0019] As shown in Figure 2, an embodiment of this specification provides a testing method for a disassembler tool. The method can be executed by a terminal device or by a server. The terminal device can be a mobile terminal device such as a mobile phone or tablet computer, a computer device such as a laptop or desktop computer, or an IoT device (for example, a smartwatch or an in-vehicle device). The server can be a single server or a server cluster composed of multiple servers, and can be a backend server in fields such as finance or online shopping, or the backend server of an application. This embodiment is described in detail with a server as the execution subject. Where the execution subject is a terminal device, refer to the server case below; the details are not repeated here. The method may specifically include the following steps:
In step S202, the target binary file is obtained.
[0020] The target binary file can be any binary file capable of performing any function and containing arbitrary content, such as bin, elf, axf, or hex format binary files. Furthermore, the target binary file can be a binary file containing source assembly code, or it can be a binary file of a malicious application. Moreover, the target binary file can be a binary file processed using obfuscation, packing, stripping, or other protection mechanisms. In practical applications, these protection mechanisms may not be necessary; the choice depends on the specific circumstances. Additionally, the source of the target binary file can be diverse. For example, it could be a binary file from a specified open-source database, a binary file crawled by a web crawler, or a binary file written according to testing requirements.
[0021] In practice, when testing the disassembly effect of a target disassembler, the target binary file can be obtained. Specifically, when technicians need to test the disassembly effect of a target disassembler, they can launch a specified test program installed on the terminal device. This test program may include an information provision box for the disassembler and a test submit button. Technicians can input relevant information about the target disassembler through the information provision box, such as the target disassembler's identifier, version information, download address (or upload the target disassembler), etc. After inputting the information, they can click the test submit button. At this time, the terminal device can obtain the relevant information of the target disassembler entered by the technician and generate a test request based on it. This test request can then be sent to the server. When the server receives the test request, it responds to the test request and obtains the target binary file. Alternatively, technicians can directly obtain the target binary file based on the target disassembler to be tested, without any information input or other operations. The specific settings can be configured according to the actual situation.
[0022] Target binary files can be obtained in various ways. For example, a certain number of binary files can be crawled from the Internet using a web crawler, and the crawled binary files can be used as target binary files, or binary files that meet specified requirements can be selected from the crawled binary files as target binary files. Alternatively, one or more binary files can be selected from a specified open-source database, and the selected binary files can be used as target binary files. Or, binary files of malicious programs that have been intercepted or collected can be used as target binary files, etc. The specific settings can be configured according to the actual situation.
[0023] In step S204, the binary file is disassembled using a variety of different disassemblers to obtain the assembly code corresponding to each disassembler. Among the various disassemblers is the target disassembler that needs to be tested for disassembly effect.
[0024] A disassembler is a tool that converts machine code (such as a binary file) into assembly instructions for the target processor (that is, assembly code); disassembly is the reverse of the assembly or cross-assembly process. Its core functions include instruction decoding, dynamic analysis, and symbolic execution, and it supports static disassembly to parse a program's logical structure and translate it back into assembly mnemonics. Examples of disassemblers include IDA Pro, W32Dasm, Ghidra, and MSILDisassembler.
[0025] In implementation, as shown in Figure 3, multiple disassemblers can be acquired according to the actual situation. They can be of types different from the target disassembler or, in practice, of the same type as the target disassembler; the choice depends on specific needs. The binary file can then be input into any one of the disassemblers, which disassembles the binary code in the file to obtain the corresponding assembly code. The binary file can then be input into another disassembler, which likewise disassembles the binary code to obtain its assembly code. Repeating this process yields the assembly code output by each disassembler.
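The loop described in step S204 can be sketched as follows. This is a minimal illustration, not the implementation of this specification: the `Instruction` record, the `Backend` interface, and the two toy backends are hypothetical stand-ins for wrappers around real disassemblers.

```python
from typing import Callable, Dict, List, Tuple

# One decoded instruction: (address, size in bytes, text). Hypothetical record shape.
Instruction = Tuple[int, int, str]
# A backend maps raw binary bytes to its instruction listing. Hypothetical interface.
Backend = Callable[[bytes], List[Instruction]]


def disassemble_with_all(binary: bytes, backends: Dict[str, Backend]) -> Dict[str, List[Instruction]]:
    """Feed the same binary to every backend and collect one listing per backend."""
    listings: Dict[str, List[Instruction]] = {}
    for name, backend in backends.items():
        try:
            listings[name] = backend(binary)
        except Exception:
            # A crash is itself a finding; record an empty listing instead of aborting.
            listings[name] = []
    return listings


# Toy backends for illustration only (they do not wrap any real tool).
def linear_backend(binary: bytes) -> List[Instruction]:
    # Pretends every byte is a one-byte instruction.
    return [(addr, 1, f"op_{b:02x}") for addr, b in enumerate(binary)]


def skipping_backend(binary: bytes) -> List[Instruction]:
    # Pretends 0x00 bytes are inline data and skips them.
    return [(addr, 1, f"op_{b:02x}") for addr, b in enumerate(binary) if b != 0]


listings = disassemble_with_all(
    b"\x90\x00\xc3",
    {"reference": linear_backend, "target": skipping_backend},
)
```

Keeping every listing keyed by tool name means the target disassembler's output sits alongside the others in one structure, which is the input the later differential analysis needs.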
[0026] In step S206, the first intelligent agent performs differential test analysis on the obtained multiple assembly codes to determine the code analysis results corresponding to the binary file. The code analysis results corresponding to the binary file include one or more of the following: assembly code blocks with anomalies, description information of assembly codes with differences, and heuristic information corresponding to multiple assembly codes.
[0027] One approach is to perform differential test analysis on the multiple assembly codes. Differential testing is a special form of testing that is essentially random testing: it compares the results of multiple tests to determine whether the object under test (DUT) behaves anomalously. It can be divided into two types: differential testing of multiple DUTs against a single test case, and differential testing of a single DUT against multiple test cases. In the narrow sense, differential testing refers to the first type. It reveals differences and potential defects by comparing the execution results of multiple functionally identical or similar DUTs under the same test case, and its research focus is how to determine, by comparing highly similar systems, whether a DUT is anomalous. Specifically, two or more functionally identical test objects are selected, a single test case that meets their requirements is constructed and applied to each of them, and their execution results under that test case are compared to determine the correct and incorrect behaviors of the tools under test. The first intelligent agent can include multiple different models and / or algorithms, and can use them to perform differential test analysis on the obtained assembly code and thereby determine the correct and incorrect behaviors of the multiple different disassemblers. Heuristic information is a series of semantic constraints that together constitute a reference for judging correctness (i.e., reference information that serves as a common basis for judging what is correct and what is erroneous).
[0028] In implementation, as shown in Figure 3, a first intelligent agent can be defined in advance according to the actual situation. Large Language Models (LLMs) have recently demonstrated significant capabilities in code understanding and reasoning: when provided with contextual information, an LLM can interpret error patterns, identify their root causes, and generate descriptive information about the errors. An LLM can therefore be set in the first intelligent agent, and the first intelligent agent can use it to perform differential test analysis on the obtained assembly code. Specifically, the multiple assembly codes can be input into the first intelligent agent, and the LLM within it can classify the differential test results using a majority voting mechanism. The basic principle is that if a majority reaches consensus on a classification decision, that decision is taken to be correct. By directly comparing the multiple assembly codes, differential testing can automatically detect and identify the correct and incorrect behaviors of the various disassemblers. The resulting differential test analysis result is the code analysis result corresponding to the binary file, which can include one or more of the following: abnormal assembly code blocks, descriptive information of assembly code with differences, and heuristic information corresponding to the multiple assembly codes.
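The majority voting principle can be illustrated with a small deterministic sketch. In the scheme above the vote is carried out by the LLM inside the first intelligent agent; the plain-Python function below only shows the voting rule itself. The per-address listing format, the `ABSENT` marker, and the tool labels are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

ABSENT = "<no instruction>"  # vote cast by a tool that decodes nothing at an address


def majority_vote(listings: Dict[str, Dict[int, str]]) -> Dict[int, Tuple[Optional[str], List[str]]]:
    """For each address, return (majority decoding or None, tools that disagree)."""
    addresses = sorted({addr for listing in listings.values() for addr in listing})
    verdicts: Dict[int, Tuple[Optional[str], List[str]]] = {}
    for addr in addresses:
        votes = {tool: listing.get(addr, ABSENT) for tool, listing in listings.items()}
        winner, count = Counter(votes.values()).most_common(1)[0]
        if 2 * count <= len(votes):
            # No strict majority: no reference decoding, every tool stays suspect.
            verdicts[addr] = (None, sorted(votes))
        else:
            dissenters = sorted(tool for tool, text in votes.items() if text != winner)
            verdicts[addr] = (winner, dissenters)
    return verdicts


# Illustrative listings: address -> decoded text, one mapping per tool label.
listings = {
    "ref_a": {0: "push rbp", 1: "mov rbp, rsp"},
    "ref_b": {0: "push rbp", 1: "mov rbp, rsp"},
    "target": {0: "push rbp", 2: "db 0x89"},
}
verdicts = majority_vote(listings)
```

Here the two reference listings agree at address 1 and see nothing at address 2, so the target is reported as the dissenter at both addresses. A majority can still be wrong, which is why the text treats the vote as a tendency rather than as ground truth.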
[0029] In step S208, based on the code analysis results corresponding to the binary file, the test results of the disassembly effect of the target disassembler are determined.
[0030] In practice, one or more of the following items in the code analysis result corresponding to the binary file can be analyzed: abnormal assembly code blocks, descriptive information of differing assembly code, and heuristic information corresponding to the multiple assembly codes. This analysis establishes where the assembly code is abnormal and where it differs between disassemblers; combining these findings yields the test result of the disassembly effect of the target disassembler.
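One possible way to turn a code analysis result into a test result is sketched below. The specification does not fix a scoring rule, so everything here is an assumption for illustration: the `CodeAnalysisResult` and `DisassemblerVerdict` containers, the ratio of abnormal blocks as the metric, and the `max_abnormal_ratio` threshold.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CodeAnalysisResult:
    # Hypothetical container mirroring the three kinds of output named in the text.
    abnormal_blocks: List[int]       # start addresses of blocks flagged for the target
    difference_notes: List[str]      # descriptions of differing assembly code
    heuristics: List[str]            # semantic constraints used as the correctness reference
    total_blocks: int                # number of blocks that were compared


@dataclass
class DisassemblerVerdict:
    abnormal_ratio: float
    passed: bool
    notes: List[str] = field(default_factory=list)


def summarize(result: CodeAnalysisResult, max_abnormal_ratio: float = 0.05) -> DisassemblerVerdict:
    """Assumed rule: the target passes if few enough of its blocks are abnormal."""
    if result.total_blocks == 0:
        ratio = 0.0
    else:
        ratio = len(result.abnormal_blocks) / result.total_blocks
    return DisassemblerVerdict(ratio, ratio <= max_abnormal_ratio, list(result.difference_notes))


analysis = CodeAnalysisResult(
    abnormal_blocks=[0x401020],
    difference_notes=["target decodes inline data at 0x401020 as code"],
    heuristics=["every direct jump must land on an instruction start"],
    total_blocks=10,
)
verdict = summarize(analysis)
```

With one abnormal block out of ten, the ratio exceeds the assumed 5% limit and the sketch reports a failure, carrying the difference descriptions along so that a reviewer can see why.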
[0031] This specification provides a method for testing disassemblers. By acquiring a target binary file, disassembling the binary file using multiple different disassemblers, assembly code corresponding to each disassembler is obtained. Among the various disassemblers is the target disassembler whose disassembly performance needs to be tested. Then, a first intelligent agent performs differential testing analysis on the obtained assembly code to determine the code analysis result corresponding to the binary file. The code analysis result includes one or more of the following: abnormal assembly code blocks, descriptive information of differing assembly code, and heuristic information corresponding to multiple assembly codes. Finally, based on the code analysis result corresponding to the binary file, the test result of the disassembly performance of the target disassembler can be determined. In this way, by disassembling the binary file using multiple different disassemblers and analyzing the obtained assembly code through differential testing, the correct and incorrect behaviors of various disassemblers can be determined, thereby improving the testing efficiency and accuracy of disassemblers.
[0032] In practical applications, target binary files can be constructed in a variety of different ways. The following is an optional processing method, which may include the following steps A2 and A4.
[0033] In step A2, the initial binary file is obtained.
[0034] In implementation, as shown in Figure 4, initial binary files can be obtained in various ways. For example, a certain number of binary files can be crawled from the Internet using a web crawler, and either all of the crawled files or only those that meet specified requirements can be used as initial binary files. Alternatively, one or more binary files can be selected from a specified open-source database and used as initial binary files. Binary files of intercepted or collected malicious programs can also be used as initial binary files. The specific settings can be configured according to the actual situation.
[0035] In step A4, the second intelligent agent performs metadata extraction and / or architecture identification processing on the initial binary file, and generates a structured target binary file based on consensus-based instruction boundaries and semantic verification.
[0036] The second intelligent agent may include a variety of different models and / or algorithms. The second intelligent agent may use the various different models and / or algorithms set therein to perform metadata extraction processing and / or architecture recognition processing on the initial binary file and generate a structured target binary file.
[0037] In implementation, as shown in Figure 4, an initial binary file can be input into the second intelligent agent, which can use the various models and / or algorithms set therein. For example, a specified neural network model can be constructed and trained on a large amount of sample data so that the trained model can perform metadata extraction on the initial binary file. Similarly, a model capable of performing architecture recognition on the initial binary file can be trained. The trained models can then be used to perform metadata extraction and / or architecture recognition on the initial binary file, extracting its metadata and architecture-related content. Based on this information, combined with consensus-based instruction boundaries and semantic verification, a structured target binary file can be generated.
[0038] In practical applications, there are many ways to process step A4 above. Here is another optional processing method, which may include the processing of steps A42 and A44.
[0039] In step A42, metadata extraction and architecture identification are performed on the initial binary file using the pyelftools and / or Capstone tools preset in the second intelligent agent, yielding data that includes one or more of the following: instruction boundaries, function entry points, and architecture-specific features.
[0040] Instruction boundaries can take various forms, including function boundaries. These boundaries are often ambiguous and can be determined through decoding heuristics or control flow analysis. Identifying instruction boundaries is a crucial part of instruction recovery in disassembly, whose basic tasks are identifying instruction boundaries and decoding byte sequences into valid assembly code (or assembly instructions). Function boundary detection primarily involves identifying function entry and exit points for higher-level program analysis. Function entry points can also take various forms, including entry points within function boundaries, and can be defined according to the actual situation. Architecture-specific features can include various types, such as control flow and data flow. Control flow includes intra-procedural control flow (e.g., transitions between basic blocks within a function) and inter-procedural control flow (e.g., function call relationships); precise control flow is crucial for program understanding and security analysis. As for data flow, although assembly code is not in static single assignment (SSA) form, identifying data dependencies and variable usage patterns is essential for understanding program semantics and detecting vulnerabilities. Functions provide the main abstraction for inter-procedural analysis, decompilation, and code understanding.
[0041] In implementation, to accurately identify the metadata and architecture of the initial binary file, the pyelftools and / or Capstone tools can be preset in the second intelligent agent. pyelftools is used to parse and analyze the header, sections, symbol table, program header table, and other structures of ELF-format binary files. It can also extract DWARF debugging information and has no external dependencies. Capstone is a lightweight, multi-platform, multi-architecture disassembly framework that can provide detailed information about disassembled instructions, including some of their semantics. By using the pyelftools and / or Capstone tools in the second intelligent agent to perform metadata extraction and architecture identification on the initial binary file, data including one or more of instruction boundaries, function entry points, and architecture-specific features can be obtained.
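To make the architecture identification step concrete, the sketch below reads the relevant fields straight from an ELF header. It is a standard-library stand-in written so that it runs without third-party packages; in the scheme above this information would come from pyelftools. The function name `identify_elf` and the subset of machine codes are choices made for the example, while the field offsets and `e_machine` values follow the ELF header layout.

```python
import struct
from typing import Dict

# Subset of e_machine values defined for the ELF header.
MACHINES = {3: "x86", 8: "MIPS", 40: "ARM", 62: "x86-64", 183: "AArch64"}


def identify_elf(header: bytes) -> Dict[str, object]:
    """Read word size, byte order, machine and entry point from an ELF header."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is_64 = header[4] == 2                      # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if header[5] == 1 else ">"     # EI_DATA: 1 = little, 2 = big
    # e_type and e_machine (2 bytes each) follow the 16-byte e_ident array.
    _e_type, e_machine = struct.unpack_from(endian + "HH", header, 16)
    # e_entry starts at offset 24 and is 8 bytes wide for ELF64, 4 for ELF32.
    (entry,) = struct.unpack_from(endian + ("Q" if is_64 else "I"), header, 24)
    return {
        "bits": 64 if is_64 else 32,
        "little_endian": endian == "<",
        "machine": MACHINES.get(e_machine, f"unknown ({e_machine})"),
        "entry": entry,
    }


# Synthetic 64-bit little-endian x86-64 header, built only for this example.
e_ident = b"\x7fELF" + bytes([2, 1, 1, 0]) + b"\x00" * 8
sample_header = e_ident + struct.pack("<HHIQ", 2, 62, 1, 0x401000)
info = identify_elf(sample_header)
```

The word size and machine recovered here are what a later step would need in order to pick the matching decoder mode, and the entry address gives a first function entry point from which instruction recovery can start.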
[0042] In step A44, based on the obtained data, a target binary file in ELF format is generated through consensus-based instruction boundaries and semantic verification.
[0043] In practical applications, the specific processing method of step S204 can vary. The following provides another optional processing method, which may specifically include the processing of steps S2042 and S2044. On the basis of Figure 2 above, the specific steps included in this method can be as shown in Figure 5.
[0044] In step S2042, the binary file is disassembled by various disassemblers set in the third agent to obtain the assembly code corresponding to each disassembler set in the third agent. The target disassembler is not included in the various disassemblers set in the third agent.
[0045] In step S2044, the binary file is disassembled using the target disassembler to obtain the assembly code corresponding to the target disassembler.
[0046] In practical applications, the various disassemblers set up in the third intelligent agent include one or more of the following: IDA Pro disassembler, Ghidra disassembler, Radare2 disassembler, angr disassembler, Binary Ninja disassembler, Ddisasm disassembler, XDA disassembler, DeepDi disassembler, D-ARM disassembler, DASSA disassembler, DisasLLM disassembler, Ddisam WIS disassembler, Tady disassembler, Disa disassembler, and disassemblers built based on machine learning models.
[0047] In practical applications, the specific processing method of step S206 can vary. The following provides another optional processing method, which may specifically include the processing of steps S20602 to S20610. On the basis of Figure 2 above, the specific steps included in this method can be as shown in Figure 6.
[0048] In step S20602, the first sub-agent in the first intelligent agent performs consensus analysis on the instruction boundaries in the multiple assembly codes obtained, and determines the instruction boundaries in the multiple assembly codes with a confidence level higher than a preset threshold.
[0049] In practice, obtaining accurate assembly code for a real binary file typically relies on the source assembly code and debugging information. However, many real binary files (malicious binary files, for example) are stripped, obfuscated, or lack accessible source assembly code. Therefore, the embodiments in this specification employ a progressive approach that systematically extracts and verifies assembly code from executable binary files through several complementary methods. Specifically, as shown in Figure 7, multiple different disassemblers can be run in parallel to disassemble the target binary file, and consensus analysis can be applied to the results to identify the instruction boundaries whose confidence level exceeds a preset threshold (which can be set according to actual conditions, such as 90% or 80%). In this way, statistical consistency across disassemblers can compensate for the errors of any individual disassembler.
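One possible reading of this consensus analysis is a simple vote over instruction start addresses, sketched below. The specification does not fix how the confidence level is computed; here it is assumed to be the fraction of disassemblers that report an instruction starting at a given address. The function name, the tool names, and the addresses are hypothetical.

```python
from collections import Counter


def consensus_boundaries(starts_by_tool, threshold=0.8):
    """Return {address: agreement} for instruction starts whose agreement exceeds threshold.

    starts_by_tool maps a disassembler name to the set of addresses at which
    that disassembler reported the start of an instruction.
    """
    if not starts_by_tool:
        return {}
    votes = Counter(addr for starts in starts_by_tool.values() for addr in starts)
    total = len(starts_by_tool)
    return {
        addr: count / total
        for addr, count in sorted(votes.items())
        if count / total > threshold
    }


# Hypothetical output of three disassemblers; tool_c decodes one boundary differently.
starts_by_tool = {
    "tool_a": {0x1000, 0x1001, 0x1004, 0x1009},
    "tool_b": {0x1000, 0x1001, 0x1004, 0x1009},
    "tool_c": {0x1000, 0x1001, 0x1003, 0x1009},
}
result = consensus_boundaries(starts_by_tool, threshold=0.6)
```

With three tools and a threshold of 0.6, the boundary at 0x1004 (two of three votes) is kept, while 0x1003 (one vote) is dropped, so the minority reading of a single tool does not enter the reference.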
[0050] In step S20604, the second sub-agent in the first intelligent agent performs alignment processing on multiple assembly codes based on the determined instruction boundaries.
[0051] In implementation, as shown in Figure 7, a second sub-agent can be set in the first agent to align the assembly codes. Specifically, the determined instruction boundaries can be input into the second sub-agent, which aligns the multiple assembly codes against those boundaries through its model and / or algorithm.
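The model or algorithm used by the second sub-agent is not specified. As a minimal sketch under that caveat, alignment can be treated as keying every listing on the agreed boundary addresses, so that each row holds what every disassembler decoded at the same address. The function name and the sample listings are hypothetical.

```python
def align_listings(listings, boundaries):
    """Build one row per agreed boundary: (address, {tool: decoded text or None}).

    listings maps a disassembler name to {address: instruction text}.
    A value of None means the tool reported no instruction starting there.
    """
    tools = sorted(listings)
    return [
        (addr, {tool: listings[tool].get(addr) for tool in tools})
        for addr in sorted(boundaries)
    ]


# Hypothetical listings from two disassemblers.
listings = {
    "tool_a": {0x1000: "push rbp", 0x1001: "mov rbp, rsp", 0x1004: "ret"},
    "tool_b": {0x1000: "push rbp", 0x1001: "mov rbp, rsp", 0x1005: "ret"},
}
rows = align_listings(listings, [0x1000, 0x1001, 0x1004])
```

Keeping an explicit `None` for a missing instruction preserves the disagreement for the later differential step instead of silently dropping it.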
[0052] In step S20606, the third sub-agent in the first agent extracts consistent and inconsistent code segments from multiple aligned assembly codes based on the determined instruction boundaries using differential testing.
[0053] In implementation, as shown in Figure 7, a third sub-agent can be set in the first agent to extract the code on which the disassemblers agree and the code on which they differ. Specifically, the determined instruction boundaries and the multiple aligned assembly codes can be provided to the third sub-agent, which, through its model and / or algorithm and based on the determined instruction boundaries, uses differential testing to extract the code segments with consistent content and the code segments with inconsistent content from the multiple aligned assembly codes.
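The extraction step can be sketched as grouping consecutive aligned rows by whether all disassemblers produced the same text. This is an illustrative assumption about how differential testing might be applied here; the row format, the whitespace and case normalization, and the sample data are hypothetical.

```python
from itertools import groupby


def _agrees(row):
    """True when every tool decoded the same instruction text at this address."""
    _, by_tool = row
    readings = {
        None if text is None else " ".join(text.lower().split())
        for text in by_tool.values()
    }
    return len(readings) == 1 and None not in readings


def split_segments(rows):
    """Group consecutive rows into consistent and inconsistent code segments."""
    segments = []
    for consistent, group in groupby(rows, key=_agrees):
        group = list(group)
        segments.append({
            "consistent": consistent,
            "start": group[0][0],
            "end": group[-1][0],
            "rows": group,
        })
    return segments


# Hypothetical aligned rows: (address, {tool: decoded text or None}).
rows = [
    (0x1000, {"tool_a": "push rbp", "tool_b": "push rbp"}),
    (0x1001, {"tool_a": "mov rbp, rsp", "tool_b": "MOV  rbp, rsp"}),
    (0x1004, {"tool_a": "jmp 0x1010", "tool_b": ".byte 0xeb"}),
    (0x1006, {"tool_a": "nop", "tool_b": None}),
    (0x1007, {"tool_a": "ret", "tool_b": "ret"}),
]
segments = split_segments(rows)
summary = [(s["consistent"], s["start"], s["end"]) for s in segments]
```

Here the rows at 0x1004 and 0x1006 form a single inconsistent segment, which is the kind of region where one tool may have treated inline data as code.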
[0054] In step S20608, the fourth sub-agent in the first agent performs semantic verification on the extracted code segments with consistent content and code segments with inconsistent content to obtain the corresponding verification results.
[0055] In implementation, as shown in Figure 7, a fourth sub-agent can be set in the first agent to perform semantic verification on the extracted code segments. Specifically, the fourth sub-agent can verify the extracted code segments with consistent content and those with inconsistent content through one or more of the following methods: control flow analysis, register usage pattern analysis, and architecture consistency checks, and thereby obtain the corresponding verification results.
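As one hedged example of a control flow check, the sketch below flags direct branches whose target falls inside the code region but does not coincide with any decoded instruction start, which is a common symptom of a desynchronized decode. The textual x86-style syntax, the short mnemonic list, and the function name are assumptions; the specification does not describe the checks at this level of detail.

```python
import re

# Direct branches with a hexadecimal target, in a simplified x86-style syntax.
BRANCH = re.compile(r"^(jmp|je|jne|jz|jnz|call)\s+(0x[0-9a-f]+)$")


def check_branch_targets(listing, code_range):
    """Report branches that land inside code_range but not on an instruction start.

    listing maps an address to the instruction text decoded at that address.
    """
    problems = []
    for addr, text in sorted(listing.items()):
        match = BRANCH.match(text.strip().lower())
        if not match:
            continue
        target = int(match.group(2), 16)
        if target in code_range and target not in listing:
            problems.append(
                f"{addr:#x}: {match.group(1)} targets {target:#x}, "
                "which is not an instruction start"
            )
    return problems


# Hypothetical listing: 0x1005 lies in the middle of the instruction at 0x1003.
listing = {
    0x1000: "push rbp",
    0x1001: "jmp 0x1005",
    0x1003: "mov eax, 1",
    0x1008: "call 0x2000",
    0x100D: "ret",
}
problems = check_branch_targets(listing, range(0x1000, 0x100E))
```

The call to 0x2000 is not reported because its target lies outside the examined range and so cannot be judged from this listing alone.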
[0056] In step S20610, the fifth sub-agent in the first agent, based on the above verification results and the extracted code segments with consistent and inconsistent content, annotates multiple assembly codes to obtain the code analysis results corresponding to the binary file.
[0057] In implementation, as shown in Figure 7, a fifth sub-agent can be set in the first agent to annotate and align the extracted code segments. Specifically, the fifth sub-agent can use one or more of the following methods to create the code analysis results corresponding to the binary file: semantic tagging, alignment of function boundaries and cross-references, and annotation of instruction sequences.
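A possible shape for the annotated output is sketched below: one record per aligned address, carrying the agreement status, the verification outcome, and semantic tags. The record fields and tag names are hypothetical and chosen only to illustrate the idea of annotation.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    address: int
    status: str      # "consistent" or "inconsistent"
    verified: bool   # False when semantic verification flagged this address
    readings: dict = field(default_factory=dict)  # tool name -> decoded text
    tags: list = field(default_factory=list)


def annotate(rows, failed_addresses, function_entries):
    """Turn aligned rows into annotated records for the code analysis result."""
    annotations = []
    for addr, by_tool in rows:
        readings = set(by_tool.values())
        agreed = len(readings) == 1 and None not in readings
        tags = []
        if addr in function_entries:
            tags.append("function_entry")
        if addr in failed_addresses:
            tags.append("semantic_check_failed")
        annotations.append(Annotation(
            address=addr,
            status="consistent" if agreed else "inconsistent",
            verified=addr not in failed_addresses,
            readings=dict(by_tool),
            tags=tags,
        ))
    return annotations


# Hypothetical inputs from the earlier alignment and verification steps.
rows = [
    (0x1000, {"tool_a": "push rbp", "tool_b": "push rbp"}),
    (0x1004, {"tool_a": "jmp 0x1010", "tool_b": ".byte 0xeb"}),
]
annotated = annotate(rows, failed_addresses={0x1004}, function_entries={0x1000})
```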
[0058] In practical applications, a more comprehensive dataset can be constructed to provide a basis for the evaluation and improvement of disassembly tools or other intelligent agents. For details, please refer to the processing steps B2 to B6 below.
[0059] In step B2, the code analysis results corresponding to the binary file and the corresponding binary file are stored in a preset dataset.
[0060] In implementation, since binary files may or may not have source assembly code available (binary files corresponding to malicious applications, for example, typically do not), the dataset can collect code analysis results for both kinds of binary files, which enables the construction of a more comprehensive dataset. In practical applications, as shown in Figure 4, a fourth intelligent agent can also be set up. The fourth intelligent agent preprocesses and classifies the binary files and their corresponding code analysis results, which can then be stored in the preset dataset. In this way, the fourth intelligent agent maintains a persistent knowledge base containing error patterns, disassembler weaknesses, and improvement strategies, enabling continuous learning and intelligent generation of test cases that target specific failure modes.
[0061] In step B4, when a test request for the target agent or the preset disassembler is received, the binary file and the corresponding code analysis results used to test the target agent or the preset disassembler can be obtained from the dataset as test samples.
[0062] The target intelligent agent can include various types, such as an intelligent agent used for risk prevention and control, or an intelligent agent used for instant communication. The testing of the target intelligent agent can include various types, such as testing the risk prevention and control capabilities of the target intelligent agent, or testing the ability of the target intelligent agent to analyze the vulnerabilities or interoperability of closed-source software, etc. The specific settings can be set according to the actual situation.
[0063] In practice, when a test needs to be performed on the target intelligent agent or the preset disassembler, a test request can be initiated through the terminal device. The server can receive the test request and then obtain a certain number of binary files and corresponding code analysis results from the dataset. The obtained binary files and corresponding code analysis results can be used as test samples for testing the target intelligent agent or the preset disassembler.
[0064] In step B6, the target agent or a preset disassembler is tested using the acquired test samples.
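Steps B2 to B6 can be pictured with a small in-memory store, given here only as a sketch. The class name `SampleStore`, keying records by a SHA-256 digest of the binary, and the random sampling strategy are assumptions; the specification only requires that binary files and their code analysis results be stored and later retrieved as test samples.

```python
import hashlib
import random


class SampleStore:
    """Keeps binary files together with their code analysis results."""

    def __init__(self):
        self._records = {}

    def add(self, binary: bytes, analysis: dict, has_source: bool) -> str:
        """Step B2: store a binary and its analysis, keyed by content digest."""
        key = hashlib.sha256(binary).hexdigest()
        self._records[key] = {
            "binary": binary,
            "analysis": analysis,
            "has_source": has_source,
        }
        return key

    def draw(self, count: int, seed=None) -> list:
        """Step B4: pick up to `count` stored records as test samples."""
        rng = random.Random(seed)
        keys = sorted(self._records)
        chosen = rng.sample(keys, min(count, len(keys)))
        return [self._records[key] for key in chosen]


store = SampleStore()
store.add(b"\x7fELF-sample-1", {"inconsistent_segments": 2}, has_source=True)
store.add(b"\x7fELF-sample-2", {"inconsistent_segments": 5}, has_source=False)
store.add(b"\x7fELF-sample-1", {"inconsistent_segments": 2}, has_source=True)  # same content
samples = store.draw(5, seed=0)
```

Keying on a content digest means that storing the same binary twice keeps a single record, which avoids counting one test case several times in step B6.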
[0065] The aforementioned multi-agent system, consisting of a first agent, a second agent, a third agent, and a fourth agent, transforms disassembly evaluation from a passive assessment to proactive improvement guidance, providing disassemblers with system weakness identification and targeted enhancement strategies.
[0066] In practical applications, target binary files include binary files that do not contain source assembly code, and target binary files are binary files that have been processed by one or more protection strategies such as obfuscation, packing, and stripping.
[0067] This specification provides a method for testing disassemblers. By acquiring a target binary file, disassembling the binary file using multiple different disassemblers, assembly code corresponding to each disassembler is obtained. Among the various disassemblers is the target disassembler whose disassembly performance needs to be tested. Then, a first intelligent agent performs differential testing analysis on the obtained assembly code to determine the code analysis result corresponding to the binary file. The code analysis result includes one or more of the following: abnormal assembly code blocks, descriptive information of differing assembly code, and heuristic information corresponding to multiple assembly codes. Finally, based on the code analysis result corresponding to the binary file, the test result of the disassembly performance of the target disassembler can be determined. In this way, by disassembling the binary file using multiple different disassemblers and analyzing the obtained assembly code through differential testing, the correct and incorrect behaviors of various disassemblers can be determined, thereby improving the testing efficiency and accuracy of disassemblers.
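How the test result is derived from the code analysis result is left open by this specification. One simple possibility, shown below purely as an assumption, is to compare the instruction starts reported by the target disassembler against the consensus reference and report precision and recall together with the spurious and missed boundaries.

```python
def score_boundaries(target, reference):
    """Compare a target disassembler's instruction starts with a reference set."""
    target, reference = set(target), set(reference)
    matched = len(target & reference)
    return {
        "precision": matched / len(target) if target else 0.0,
        "recall": matched / len(reference) if reference else 0.0,
        "spurious": sorted(target - reference),  # reported by the target only
        "missed": sorted(reference - target),    # present in the reference only
    }


# Hypothetical data: the target tool invents 0x1003 and misses 0x1004.
reference = {0x1000, 0x1001, 0x1004, 0x1009}
target = {0x1000, 0x1001, 0x1003, 0x1009}
report = score_boundaries(target, reference)
```

The spurious entries correspond to the ghost instructions mentioned in the background section, while the missed entries indicate code that the target tool skipped or treated as data.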
[0068] The above describes the testing method for disassemblers provided in the embodiments of this specification. Based on the same idea, the embodiments of this specification also provide a testing apparatus for disassemblers, as shown in Figure 8.
[0069] The testing apparatus for the disassembler includes a first file acquisition module 801, a disassembly module 802, a differential test analysis module 803, and a test result determination module 804, wherein: the first file acquisition module 801 acquires the target binary file; the disassembly module 802 disassembles the binary file using a variety of different disassemblers to obtain the assembly code corresponding to each disassembler, the variety of different disassemblers including a target disassembler whose disassembly effect needs to be tested; the differential test analysis module 803 performs, through the first intelligent agent, differential test analysis on the multiple assembly codes obtained to determine the code analysis result corresponding to the binary file, the code analysis result including one or more of the following: assembly code blocks with anomalies, descriptive information of assembly codes with differences, and heuristic information corresponding to the multiple assembly codes, where the heuristic information is a series of semantic constraints that together constitute a reference for judging right and wrong; and the test result determination module 804 determines the test result of the disassembly effect of the target disassembler based on the code analysis result corresponding to the binary file.
[0070] In the embodiments described in this specification, the device further includes: a second file acquisition module, which acquires the initial binary file; and a generation module, which performs metadata extraction and / or architecture identification processing on the initial binary file through a second intelligent agent, and generates a structured target binary file based on consensus-based instruction boundaries and semantic verification.
[0071] In the embodiments of this specification, the generation module includes: an extraction unit, which performs metadata extraction and architecture identification processing on the initial binary file using the pyelftools and / or Capstone tools pre-configured in the second agent, to obtain data including one or more of the following: instruction boundaries, function entry points, and architecture-specific features; and a generation unit, which generates the target binary file in ELF format based on the obtained data and through consensus-based instruction boundaries and semantic verification.
[0072] In the embodiments of this specification, the disassembly module 802 includes: The first disassembly unit disassembles the binary file using various disassembly tools set in the third intelligent agent, obtaining assembly code corresponding to each of the various disassembly tools set in the third intelligent agent. The target disassembly tool is not included among the various disassembly tools set in the third intelligent agent. The second disassembly unit performs disassembly processing on the binary file using the target disassembly tool to obtain the assembly code corresponding to the target disassembly tool.
[0073] In the embodiments of this specification, the various disassemblers configured in the third intelligent agent include one or more of the following: IDA Pro disassembler, Ghidra disassembler, Radare2 disassembler, angr disassembler, Binary Ninja disassembler, Ddisasm disassembler, XDA disassembler, DeepDi disassembler, D-ARM disassembler, DASSA disassembler, DisasLLM disassembler, Ddisam WIS disassembler, Tady disassembler, Disa disassembler, and disassemblers built based on machine learning models.
[0074] In the embodiments of this specification, the differential test analysis module 803 includes: a boundary determination unit, which, through the first sub-agent in the first agent, performs consensus analysis on the instruction boundaries in the multiple assembly codes obtained and determines the instruction boundaries whose confidence level is higher than a preset threshold; an alignment unit, which, through the second sub-agent in the first agent, aligns the multiple assembly codes based on the determined instruction boundaries; an extraction unit, which, through the third sub-agent in the first agent, uses differential testing to extract code segments with consistent content and code segments with inconsistent content from the multiple aligned assembly codes based on the determined instruction boundaries; a semantic verification unit, which, through the fourth sub-agent in the first agent, performs semantic verification on the extracted code segments with consistent content and code segments with inconsistent content to obtain the corresponding verification results; and an annotation unit, which, through the fifth sub-agent in the first agent, annotates the multiple assembly codes based on the verification results and the extracted code segments with consistent and inconsistent content to obtain the code analysis results corresponding to the binary file.
[0075] In the embodiments described in this specification, the device further includes: The storage module stores the code analysis results corresponding to the binary file and the binary file itself in a preset dataset. The test sample determination module, when receiving a test request for a target intelligent agent or a preset disassembler, can obtain binary files and corresponding code analysis results from the dataset for testing the target intelligent agent or the preset disassembler as test samples. The testing module tests the target intelligent agent or a preset disassembler using the acquired test samples.
[0076] In the embodiments described in this specification, the target binary file includes a binary file that does not contain the original assembly code, and the target binary file is a binary file that has been processed by one or more protection strategies, such as obfuscation, packing, and stripping.
[0077] For ease of description, the above devices are described by dividing them into various modules or units based on their functions. Of course, when implementing one or more embodiments of this specification, the functions of each module or unit can be implemented in one or more software and / or hardware components, or a module that performs the same function can be implemented by a combination of multiple sub-modules or sub-units, etc. The device embodiments described above are merely illustrative; the division of each module and unit is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or modules can be combined or integrated into another system, or some features can be ignored or not executed, etc.
[0078] This specification provides a testing device for disassemblers. It acquires a target binary file and disassembles it using multiple different disassemblers to obtain assembly code for each tool. Among these disassemblers is the target disassembler whose disassembly performance needs to be tested. Then, a first intelligent agent performs differential testing analysis on the obtained assembly code to determine the code analysis result corresponding to the binary file. This result includes one or more of the following: abnormal assembly code blocks, descriptions of differing assembly code, and heuristic information corresponding to the multiple assembly codes. Finally, based on the code analysis result of the binary file, the test result for the disassembly performance of the target disassembler can be determined. By disassembling the binary file using multiple different disassemblers and analyzing the obtained assembly code through differential testing, the correct and incorrect behaviors of the various disassemblers can be determined, thereby improving the testing efficiency and accuracy of disassemblers.
[0079] The above describes the testing apparatus for the disassembler provided in the embodiments of this specification. Based on the same idea, the embodiments of this specification also provide testing equipment for the disassembler, as shown in Figure 9.
[0080] The testing equipment for the disassembler tool can be a terminal device or server, as described in the above embodiments.
[0081] The testing equipment for disassemblers can vary considerably depending on configuration and performance. It may comprise a communication interface 902, a user interface 904, a processor 906, and data storage 908. These components are interconnected and communicate with each other via a system bus, network, or other connection mechanism 910. The communication interface 902 enables the testing equipment 900 of the disassembler to communicate with other devices, access networks, and transmission networks via analog or digital modulation. For example, the communication interface 902 may include a chipset and antenna for wireless communication with a radio access network or access point. Furthermore, the communication interface 902 can be a wired interface such as Ethernet, Token Ring, or a USB port, or a wireless interface such as Wi-Fi, Bluetooth, Global Positioning System (GPS), or a wide-area wireless interface (e.g., WiMAX or LTE). Of course, the communication interface 902 may also support other forms of physical layer interfaces and standard or proprietary communication protocols. The communication interface 902 may also include multiple physical communication interfaces, such as Wi-Fi, Bluetooth, and wide-area wireless interfaces.
[0082] User interface 904 is used to receive user input and to provide output to the user. Accordingly, user interface 904 may include input components such as a keypad, keyboard, touch-sensitive or presence-sensitive panel, computer mouse, trackball, joystick, microphone, still camera, and video camera, and output components such as a display screen (which may be combined with a touch-sensitive panel), CRT, LCD, LED, display using DLP technology, printer, and other similar devices known or developed in the future. User interface 904 may also generate auditory output via speakers, speaker jacks, audio output ports, audio output devices, headphones, and other similar devices known or developed in the future. In some embodiments, user interface 904 may include software, circuitry, or other forms of logic capable of transmitting data to and receiving data from external user input / output devices. Additionally or alternatively, the test device 900 of the disassembler may support remote access from other devices via communication interface 902 or another physical interface (not shown). User interface 904 may be configured to receive user input, the position and movement of which may be indicated by indicators or cursors described herein. User interface 904 can also be configured as a display device for rendering or displaying text fragments.
[0083] The processor 906 may contain one or more general-purpose processors and / or special-purpose processors.
[0084] Data storage 908 may include one or more volatile and / or non-volatile storage components and may be integrated wholly or partially with processor 906. Data storage 908 may include removable and non-removable components.
[0085] Processor 906 is capable of executing program instructions 918 (e.g., compiled or uncompiled program logic and / or machine code) stored in data storage 908 to perform the various functions described herein. Data storage 908 may contain a non-transitory computer-readable medium on which program instructions are stored, which, when executed by test device 900 of a disassembler, enable test device 900 of the disassembler to perform any methods, processes, or functions disclosed in this specification and / or the accompanying drawings. Execution of program instructions 918 by processor 906 may result in processor 906 using data 912.
[0086] For example, program instructions 918 may contain an operating system 922 (e.g., an operating system kernel, device drivers, and / or other modules) and one or more applications 920 (e.g., a browser, social application, or game application) installed on the test device 900 of the disassembler. Similarly, data 912 may contain operating system data 916 and application data 914. Operating system data 916 is primarily accessible to the operating system 922, while application data 914 is primarily accessible to one or more applications 920. Application data 914 may reside in a file system visible or hidden from the user on the test device 900 of the disassembler.
[0087] Application 920 can communicate with operating system 922 through one or more application programming interfaces (APIs). These APIs help application 920 read and / or write application data 914, transmit or receive information via communication interface 902, receive or display information on user interface 904, etc.
[0088] In some terminology, application 920 may be simply referred to as "app". Furthermore, application 920 can be downloaded to the disassembler's test device 900 via one or more online app stores or app markets. However, the application can also be installed on the disassembler's test device 900 in other ways, such as through a web browser or a physical interface (e.g., a USB port) on the disassembler's test device 900.
[0089] Specifically, in this embodiment, the testing device 900 of the disassembler includes a data storage 908 and one or more program instructions 918, wherein one or more program instructions 918 are stored in the data storage 908, and one or more program instructions 918 are configured to be executed by one or more processors. The one or more program instructions include computer-executable instructions for performing the following: Obtain the target binary file; The binary file is disassembled using a variety of different disassemblers to obtain assembly code for each disassembler. Among the various disassemblers are target disassemblers for which disassembly effect testing is required. The first intelligent agent performs differential testing analysis on multiple assembly codes to determine the code analysis result corresponding to the binary file. The code analysis result corresponding to the binary file includes one or more of the following: assembly code blocks with anomalies, descriptive information of assembly codes with differences, and heuristic information corresponding to the multiple assembly codes. Among them, the heuristic information is a series of semantic constraints that together constitute a reference for judging right and wrong. Based on the code analysis results corresponding to the binary file, the test results of the disassembly effect of the target disassembler are determined.
[0090] The various embodiments in this specification are described in a progressive manner. For parts that are the same or similar between embodiments, reference can be made from one embodiment to another, and each embodiment focuses on its differences from the others. In particular, the embodiment of the testing device for the disassembler is described relatively briefly because it is fundamentally similar to the method embodiment; for relevant parts, refer to the description of the method embodiment.
[0091] This specification provides a testing device for disassemblers. By acquiring a target binary file, the device disassembles the binary file using multiple different disassemblers, obtaining assembly code corresponding to each disassembler. Among these disassemblers is the target disassembler whose disassembly performance needs to be tested. Then, a first intelligent agent performs differential testing analysis on the obtained assembly code to determine the code analysis result corresponding to the binary file. The code analysis result includes one or more of the following: abnormal assembly code blocks, descriptive information of differing assembly code, and heuristic information corresponding to multiple assembly codes. Finally, based on the code analysis result corresponding to the binary file, the test result of the disassembly performance of the target disassembler can be determined. In this way, by disassembling the binary file using multiple different disassemblers and analyzing the obtained assembly code through differential testing, the correct and incorrect behaviors of various disassemblers can be determined, thereby improving the testing efficiency and accuracy of disassemblers.
[0092] Furthermore, based on the methods shown in Figures 1 to 7 above, one or more embodiments of this specification also provide a storage medium for storing computer-executable instruction information. In one specific embodiment, the storage medium may be a USB flash drive, optical disc, hard disk, etc. When the computer-executable instruction information stored in the storage medium is executed by a processor, the following process can be realized: obtaining the target binary file; disassembling the binary file using a variety of different disassemblers to obtain the assembly code corresponding to each disassembler, the variety of different disassemblers including a target disassembler whose disassembly effect needs to be tested; performing, through the first intelligent agent, differential test analysis on the multiple assembly codes obtained to determine the code analysis result corresponding to the binary file, the code analysis result including one or more of the following: assembly code blocks with anomalies, descriptive information of assembly codes with differences, and heuristic information corresponding to the multiple assembly codes, where the heuristic information is a series of semantic constraints that together constitute a reference for judging right and wrong; and determining the test result of the disassembly effect of the target disassembler based on the code analysis result corresponding to the binary file.
[0093] The various embodiments in this specification are described in a progressive manner. Similar or identical parts between embodiments can be referred to mutually. Each embodiment focuses on describing the differences from other embodiments. In particular, the above-described storage medium embodiment is basically similar to the method embodiment, so the description is relatively simple; relevant parts can be referred to the description of the method embodiment.
[0094] This specification provides a storage medium that acquires a target binary file and disassembles it using multiple different disassemblers to obtain assembly code for each disassembler. Among these disassemblers is a target disassembler whose disassembly performance needs to be tested. Then, a first intelligent agent performs differential testing analysis on the obtained assembly code to determine the code analysis result corresponding to the binary file. This result includes one or more of the following: abnormal assembly code blocks, descriptive information of differing assembly code, and heuristic information corresponding to multiple assembly codes. Finally, based on the code analysis result of the binary file, the test result of the disassembly performance of the target disassembler can be determined. In this way, by disassembling the binary file using multiple different disassemblers and analyzing the obtained assembly code through differential testing, the correct and incorrect behaviors of the various disassemblers can be determined, thereby improving the testing efficiency and accuracy of the disassemblers.
[0095] Furthermore, based on the methods shown in Figures 1 to 7 above, one or more embodiments of this specification also provide a computer program product, including a computer program which, when executed by a processor, can perform the following process: obtaining the target binary file; disassembling the binary file using a variety of different disassemblers to obtain the assembly code corresponding to each disassembler, the variety of different disassemblers including a target disassembler whose disassembly effect needs to be tested; performing, through the first intelligent agent, differential test analysis on the multiple assembly codes obtained to determine the code analysis result corresponding to the binary file, the code analysis result including one or more of the following: assembly code blocks with anomalies, descriptive information of assembly codes with differences, and heuristic information corresponding to the multiple assembly codes, where the heuristic information is a series of semantic constraints that together constitute a reference for judging right and wrong; and determining the test result of the disassembly effect of the target disassembler based on the code analysis result corresponding to the binary file.
[0096] The various embodiments in this specification are described in a progressive manner. Similar or identical parts between embodiments can be referred to mutually. Each embodiment focuses on describing the differences from other embodiments. In particular, the above-described embodiment of a computer program product is relatively simple in description because it is fundamentally similar to the method embodiment; relevant parts can be referred to the description of the method embodiment.
[0097] With the computer program product provided in this specification, a target binary file is obtained and disassembled using multiple different disassemblers to obtain the assembly code corresponding to each disassembler. These disassemblers include a target disassembler whose disassembly effect is to be tested. A first intelligent agent then performs differential testing analysis on the obtained assembly codes to determine the code analysis result corresponding to the binary file. This result includes one or more of the following: assembly code blocks with anomalies, descriptive information of the assembly codes that differ, and heuristic information corresponding to the multiple assembly codes. Finally, the test result of the disassembly effect of the target disassembler is determined based on the code analysis result of the binary file. By disassembling the binary file with multiple different disassemblers and analyzing the resulting assembly codes through differential testing, the correct and incorrect behaviors of the individual disassemblers can be identified, thereby improving the efficiency and accuracy of disassembler testing.
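As an illustration only, the differential testing idea summarized above could be sketched in Python as follows. The sketch assumes that each disassembler's output has already been parsed into a mapping from instruction address to instruction text. The function names `differential_analysis` and `score_target`, the data format, and the simple majority vote that stands in for the first intelligent agent are all assumptions made for this example and are not taken from this specification.

```python
from collections import Counter


def _all_addresses(outputs):
    """Collect every address reported by at least one disassembler."""
    return sorted(set().union(*(listing.keys() for listing in outputs.values())))


def differential_analysis(outputs):
    """Report the addresses at which the disassemblers disagree.

    outputs: {disassembler_name: {address: instruction_text}}
    (a hypothetical, already-parsed format).
    """
    report = []
    for addr in _all_addresses(outputs):
        # None marks a disassembler that decoded no instruction at this address.
        decodings = {name: listing.get(addr) for name, listing in outputs.items()}
        if len(set(decodings.values())) > 1:
            report.append({"address": addr, "decodings": decodings})
    return report


def score_target(outputs, target):
    """Fraction of addresses where the target agrees with the majority
    of the other disassemblers (a stand-in for the agent's judgment)."""
    others = [listing for name, listing in outputs.items() if name != target]
    addresses = _all_addresses(outputs)
    if not addresses or not others:
        return 1.0
    agree = 0
    for addr in addresses:
        votes = Counter(listing.get(addr) for listing in others)
        majority, _ = votes.most_common(1)[0]
        if outputs[target].get(addr) == majority:
            agree += 1
    return agree / len(addresses)
```

A caller would pass one listing per disassembler, including the target, and treat the returned report as raw material for the code analysis result; a majority vote can itself be wrong when most tools share the same mistake, which is one reason the specification relies on additional semantic constraints.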
[0098] The foregoing has described specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps described in the claims may be performed in a different order than those shown in the embodiments and still achieve the desired results. Furthermore, the processes depicted in the drawings do not necessarily require the specific order shown, or a sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are possible or may be advantageous. Moreover, although one or more embodiments of this specification provide method steps as described in the embodiments or flowcharts, the order of steps listed in the embodiments or flowcharts is merely one of many possible execution orders and does not represent the only execution order. Therefore, when method steps are involved in the claims, adjustments to the order of those steps, or parallelism between steps, are also within the scope of protection of the claims.
[0099] In the 1990s, improvements to a technology could be clearly distinguished as either hardware improvements (e.g., improvements to the circuit structure of diodes, transistors, switches, etc.) or software improvements (improvements to the methodology). However, with technological advancements, many methodological improvements today can be considered direct improvements to the hardware circuit structure. Designers almost always obtain the corresponding hardware circuit structure by programming the improved methodology into the hardware circuit. Therefore, it cannot be said that a methodological improvement cannot be implemented using hardware physical modules. For example, a Programmable Logic Device (PLD) (such as a Field Programmable Gate Array (FPGA)) is such an integrated circuit whose logic function is determined by the user programming the device. Designers can program and "integrate" a digital system onto a PLD themselves, without needing chip manufacturers to design and manufacture dedicated integrated circuit chips. Furthermore, nowadays, instead of manually manufacturing integrated circuit chips, this programming is mostly implemented using "logic compiler" software. Similar to the software compiler used in program development, the original code before compilation must also be written in a specific programming language, called a Hardware Description Language (HDL). There are many HDLs, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language). Currently, the most commonly used are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. 
Those skilled in the art should also understand that by simply performing some logic programming on the method flow using one of these hardware description languages and programming it into an integrated circuit, the hardware circuit implementing the logical method flow can be easily obtained.
[0100] The controller can be implemented in any suitable manner. For example, it can take the form of a microprocessor or processor and a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, application-specific integrated circuits (ASICs), programmable logic controllers, and embedded microcontrollers. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller can also be implemented as part of the control logic of the memory. Those skilled in the art will also recognize that, in addition to implementing the controller in purely computer-readable program code form, the same functionality can be achieved by logically programming the method steps to make the controller take the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, and embedded microcontrollers. Therefore, such a controller can be considered a hardware component, and the means included therein for implementing various functions can also be considered as structures within the hardware component. Alternatively, the means for implementing various functions can be considered as both software modules implementing the method and structures within the hardware component.
[0101] The systems, devices, modules, or units described in the above embodiments can be implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer. Specifically, a computer can be, for example, a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or any combination of these devices.
[0102] For ease of description, the above apparatus is described by dividing it into various functional units. Of course, when implementing one or more embodiments of this specification, the functions of the units can be implemented in one or more pieces of software and / or hardware.
[0103] Those skilled in the art will understand that the embodiments of this specification can be provided as methods, systems, or computer program products. Therefore, one or more embodiments of this specification may take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0104] Embodiments in this specification are described with reference to flowcharts and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this specification. It will be understood that each flow and / or block in the flowcharts and / or block diagrams, and combinations of flows and / or blocks in the flowcharts and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0105] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which implement the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0106] These computer program instructions can also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0107] In a typical configuration, a computing device includes one or more processors (CPU), input / output interfaces, network interfaces, and memory.
[0108] Memory may include non-persistent storage in computer-readable media, such as random access memory (RAM) and / or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of computer-readable media.
[0109] Computer-readable media include both permanent and non-permanent, removable and non-removable media that can store information using any method or technology. Information can be computer-readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
[0110] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical or equivalent elements in the process, method, article, or apparatus that includes said element. Furthermore, "a," "an," and "the" are not limited to the singular and may include plural forms. Ordinal numbers such as "first," "second," etc., do not necessarily indicate order; they are generally used to distinguish objects. For example, "first server" and "second server" usually refer to two servers, named in this way to differentiate them; however, in some cases the two may be the same server. Moreover, in this specification, unless explicitly stated otherwise, "receiving and sending data" does not necessarily mean direct receiving and sending; it can be indirect receiving and sending (i.e., receiving and sending through one or more intermediate entities). Similarly, in this specification, unless otherwise stated, the relationships between structures can be direct or indirect.
[0111] Furthermore, the specific terms used in this specification to describe embodiments, such as "an embodiment," "one embodiment," or "some embodiments," refer to a particular feature, structure, or characteristic related to at least one embodiment of this specification. Therefore, it should be emphasized and noted that "an embodiment," "one embodiment," or "an alternative embodiment" mentioned twice or more in different locations in this specification do not necessarily refer to the same embodiment. Moreover, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of those different embodiments or examples, without contradiction.
[0112] Those skilled in the art will understand that the embodiments of this specification can be provided as methods, systems, or computer program products. Therefore, one or more embodiments of this specification may take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0113] One or more embodiments of this specification can be described in the general context of computer-executable instructions, such as program modules, that are executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform a specific task or implement a specific abstract data type. One or more embodiments of this specification can also be practiced in distributed computing environments where tasks are performed by remote processing devices connected via a communication network. In distributed computing environments, program modules can reside in local and remote computer storage media, including storage devices.
[0114] The embodiments in this specification are described in a progressive manner. For parts that are the same or similar between embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant parts, refer to the descriptions of the method embodiments.
[0115] The above description is merely an embodiment of this specification and is not intended to limit this document. Various modifications and variations can be made to this specification by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this specification should be included within the scope of the claims in this document.
Claims
1. A method for testing a disassembler, the method comprising: obtaining a target binary file; disassembling the binary file using a plurality of different disassemblers to obtain assembly code corresponding to each disassembler, wherein the plurality of disassemblers includes a target disassembler whose disassembly effect is to be tested; performing, by a first intelligent agent, differential testing analysis on the obtained plurality of assembly codes to determine a code analysis result corresponding to the binary file, wherein the code analysis result includes one or more of the following: assembly code blocks with anomalies, descriptive information of assembly codes with differences, and heuristic information corresponding to the plurality of assembly codes, the heuristic information being a series of semantic constraints that together constitute a reference for judging right and wrong; and determining, based on the code analysis result corresponding to the binary file, a test result of the disassembly effect of the target disassembler.
2. The method according to claim 1, further comprising: obtaining an initial binary file; and performing, by a second intelligent agent, metadata extraction processing and / or architecture identification processing on the initial binary file, and generating a structured target binary file through consensus-based instruction boundaries and semantic verification.
3. The method according to claim 2, wherein performing metadata extraction processing and / or architecture identification processing on the initial binary file by the second intelligent agent and generating the structured target binary file comprises: performing metadata extraction processing and architecture identification processing on the initial binary file using the pyelftools and / or Capstone tools preset in the second intelligent agent, to obtain data including one or more of the following: instruction boundaries, function entry points, and architecture-specific features; and generating, based on the obtained data, a target binary file in ELF format through consensus-based instruction boundaries and semantic verification.
4. The method according to claim 1, wherein disassembling the binary file using a plurality of different disassemblers to obtain assembly code corresponding to each disassembler comprises: disassembling the binary file using a plurality of disassemblers configured in a third intelligent agent to obtain the assembly code corresponding to each disassembler configured in the third intelligent agent, wherein the target disassembler is not among the disassemblers configured in the third intelligent agent; and disassembling the binary file using the target disassembler to obtain the assembly code corresponding to the target disassembler.
5. The method according to claim 4, wherein the disassemblers configured in the third intelligent agent include one or more of the following: IDA Pro disassembler, Ghidra disassembler, Radare2 disassembler, angr disassembler, BinaryNinja disassembler, Ddisasm disassembler, XDA disassembler, DeepDi disassembler, D-ARM disassembler, DASSA disassembler, DisasLLM disassembler, Ddisasm-WIS disassembler, Tady disassembler, Disa disassembler, and disassemblers built based on machine learning models.
6. The method according to claim 1, wherein performing, by the first intelligent agent, differential testing analysis on the obtained plurality of assembly codes to determine the code analysis result corresponding to the binary file comprises: performing, by a first sub-agent in the first intelligent agent, consensus analysis on the instruction boundaries in the obtained plurality of assembly codes to determine the instruction boundaries whose confidence level is higher than a preset threshold; aligning, by a second sub-agent in the first intelligent agent, the plurality of assembly codes based on the determined instruction boundaries; extracting, by a third sub-agent in the first intelligent agent, through differential testing and based on the determined instruction boundaries, code segments with consistent content and code segments with inconsistent content from the aligned plurality of assembly codes; performing, by a fourth sub-agent in the first intelligent agent, semantic verification on the extracted code segments with consistent content and code segments with inconsistent content to obtain corresponding verification results; and annotating, by a fifth sub-agent in the first intelligent agent, the plurality of assembly codes based on the verification results and the extracted code segments with consistent content and inconsistent content, to obtain the code analysis result corresponding to the binary file.
7. The method according to any one of claims 1-4, further comprising: storing the code analysis result corresponding to the binary file, together with the binary file, in a preset dataset; upon receiving a test request for a target intelligent agent or a preset disassembler, obtaining from the dataset, as test samples, binary files and corresponding code analysis results for testing the target intelligent agent or the preset disassembler; and testing the target intelligent agent or the preset disassembler using the obtained test samples.
8. The method of claim 7, wherein the target binary file comprises a binary file that does not contain the original assembly code, and the target binary file is a binary file processed by one or more protection strategies selected from obfuscation, packing, and stripping.
9. A testing apparatus for a disassembler, the apparatus comprising: a first file acquisition module configured to acquire a target binary file; a disassembly module configured to disassemble the binary file using a plurality of different disassemblers to obtain assembly code corresponding to each disassembler, wherein the plurality of disassemblers includes a target disassembler whose disassembly effect is to be tested; a differential test analysis module configured to perform, through a first intelligent agent, differential testing analysis on the obtained plurality of assembly codes to determine a code analysis result corresponding to the binary file, wherein the code analysis result includes one or more of the following: assembly code blocks with anomalies, descriptive information of assembly codes with differences, and heuristic information corresponding to the plurality of assembly codes, the heuristic information being a series of semantic constraints that together constitute a reference for judging right and wrong; and a test result determination module configured to determine, based on the code analysis result corresponding to the binary file, a test result of the disassembly effect of the target disassembler.
10. A testing device for a disassembler, the testing device comprising: a processor; and a memory configured to store computer-executable instructions which, when executed, cause the processor to: obtain a target binary file; disassemble the binary file using a plurality of different disassemblers to obtain assembly code corresponding to each disassembler, wherein the plurality of disassemblers includes a target disassembler whose disassembly effect is to be tested; perform, through a first intelligent agent, differential testing analysis on the obtained plurality of assembly codes to determine a code analysis result corresponding to the binary file, wherein the code analysis result includes one or more of the following: assembly code blocks with anomalies, descriptive information of assembly codes with differences, and heuristic information corresponding to the plurality of assembly codes, the heuristic information being a series of semantic constraints that together constitute a reference for judging right and wrong; and determine, based on the code analysis result corresponding to the binary file, a test result of the disassembly effect of the target disassembler.
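For illustration only, and without limiting the claims, the consensus analysis, alignment, and extraction of consistent and inconsistent code segments recited in claim 6 could be sketched in Python as follows. The sketch assumes that each disassembler's output is available as a mapping from instruction start address to instruction text, and it takes the confidence of a boundary to be the fraction of disassemblers that report it. The function names, the data format, and this confidence measure are assumptions made for the example; the claims do not prescribe them.

```python
from collections import Counter


def consensus_boundaries(listings, threshold=0.5):
    """Return {address: confidence} for instruction starts whose
    confidence is higher than the threshold.

    listings: {tool_name: {start_address: instruction_text}}
    (a hypothetical, already-parsed format).
    """
    total = len(listings)
    counts = Counter(addr for listing in listings.values() for addr in listing)
    return {
        addr: count / total
        for addr, count in sorted(counts.items())
        if count / total > threshold
    }


def align_and_split(listings, boundaries):
    """Align the listings on the consensus boundaries and split the rows
    into those with consistent content and those with inconsistent content."""
    consistent, inconsistent = [], []
    for addr in boundaries:
        # None marks a tool that reported no instruction starting here.
        row = {tool: listing.get(addr) for tool, listing in listings.items()}
        if len(set(row.values())) == 1:
            consistent.append((addr, row))
        else:
            inconsistent.append((addr, row))
    return consistent, inconsistent
```

In this sketch the inconsistent rows are the candidates that a later semantic verification step would examine; boundaries reported by only a minority of tools are dropped before alignment, so a lower threshold keeps more candidate boundaries at the cost of more noise.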