Composite dynamic vulnerability mining system for native Android platform programs
By designing a composite dynamic vulnerability mining system for native Android platform programs, combining fuzz testing, information fusion, and symbolic execution, the problem of low test coverage is solved, and efficient vulnerability mining of native Android platform programs is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NAT UNIV OF DEFENSE TECH
- Filing Date
- 2023-04-26
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies struggle to efficiently improve test coverage of native Android applications and lack support for information fusion and interaction between fuzzing and symbolic execution methods, resulting in low vulnerability discovery efficiency.
Design a composite dynamic vulnerability mining system for native Android programs, including a fuzzing module, an information fusion scheduling module, and a symbolic execution module. The fuzzing module sets up a test case pool, the symbolic execution module performs symbolic execution and collects path constraints, and the information fusion module merges static and runtime information to optimize test case generation.
It significantly improved test coverage, enabled in-depth vulnerability discovery of native Android programs, and improved the efficiency and accuracy of vulnerability discovery.
Figure CN116541846B_ABST
Description
Technical Field
[0001] This invention relates to the field of information security technology, specifically to a composite dynamic vulnerability mining system for native Android platform programs.
Background Technology
[0002] The Android platform offers a diverse range of software. Sometimes, simply using the Android SDK to write applications in Java may not meet specific functional and performance requirements, such as implementing computationally intensive algorithms or protecting code copyright. Therefore, the Android NDK is typically used to combine native C and C++ code with a graphical interface, encapsulating high-performance, core functional modules within native code modules. However, when developing native Android applications, the use of insecure programming languages like C and C++, coupled with developers' focus on designing and promoting new applications, often results in insufficient or inadequate security measures. This makes vulnerability discovery significantly more difficult than for ordinary applications. Malicious program designers exploit these vulnerabilities to develop malware that steals user information, consumes bandwidth and network resources, spreads viruses, and implants remote control Trojans. Consequently, research on the security of native Android applications has garnered widespread attention and remains a hot topic in security research both domestically and internationally.
[0003] Vulnerabilities in native Android programs are a special class of software attributes, characterized by latency, triggerability, and conditionality, which makes them extremely difficult to discover. Latency means that, although such vulnerabilities are widespread, they are not exposed immediately but are revealed gradually through user activity or attacker analysis. Triggerability means that there is always an execution path that triggers the vulnerability and makes it manifest through observable phenomena, such as abnormal termination or functional malfunction. Conditionality means that the paths that trigger vulnerabilities generally correspond to different input conditions. Triggerability indicates that these vulnerabilities can be discovered, while conditionality indicates that discovering them is difficult, because there is currently no efficient method for determining the required input conditions. Given these characteristics, vulnerability discovery typically combines static program analysis with dynamic testing to find unknown security vulnerabilities. The challenge lies in ensuring the accuracy, completeness, and efficiency of the results.
[0004] Dynamic penetration testing uses automated tools or manual methods to simulate hacker input to perform offensive tests on application systems, identifying runtime security vulnerabilities. This type of testing is characterized by its realism and effectiveness; the problems it finds are generally accurate and serious. However, a fatal flaw of penetration testing is that the simulated test data only reaches a limited number of test points, resulting in very low coverage. In dynamic testing techniques, the execution of relevant code is a necessary condition for triggering software vulnerabilities. Therefore, the industry widely adopts the strategy of "increasing the probability of vulnerability discovery by improving test coverage" to guide vulnerability discovery. In practice, this heuristic strategy has shown good testing results, and many vulnerability discovery tools based on this strategy have successfully uncovered a large number of vulnerabilities in real software. Therefore, effectively improving test coverage is key to enhancing vulnerability discovery capabilities, especially for fuzzing techniques, where test coverage has a significant impact on vulnerability discovery results. To improve the test coverage of fuzzing engines, researchers have proposed many improvement schemes, but many insurmountable path constraints still exist in the testing process, limiting the improvement of test coverage and hindering in-depth software testing.
[0005] Existing vulnerability discovery solutions typically target ordinary Java programs written with the SDK, and generally use only a single dynamic testing method. For example, the dynamic analysis module of patent CN107832619B constructs fuzz test case data from the results of static analysis, then instruments the decompiled Smali file, and finally runs the application for fuzz testing, without considering test coverage. Another example is patent CN108268371B, which designs an intelligent fuzzing method for Android applications that combines reverse symbolic execution of the application with fuzzing. However, this method only uses symbolic execution to assist fuzzing and does not consider information fusion and interaction between fuzzing and symbolic execution, which limits test coverage. Furthermore, this method only supports testing ordinary Java applications written with the SDK and lacks support for testing native Android platform programs.
Summary of the Invention
[0006] This invention proposes a composite dynamic vulnerability mining system for native Android platform programs to address the technical problems of insufficient test coverage and lack of consideration for information fusion and interaction between fuzzing and symbolic execution methods.
[0007] To address the aforementioned technical problems, this invention provides a composite Android platform native program dynamic vulnerability mining system, characterized by including a fuzzing module, an information fusion scheduling module, and a symbolic execution module.
[0008] The fuzz testing module is used to set up a test case pool; select test cases from the test case pool for testing; track coverage information when test cases mutate; and add the mutated test cases to the test case pool when coverage improves.
[0009] The symbolic execution module is used to perform symbolic execution and collect path constraints. When an uncovered branch is encountered, constraint optimization is performed to generate new test cases and add them to the test case pool.
[0010] The information fusion scheduling module is used to fuse the static and runtime information of the target Android program into an attribute graph EIMap = {(e,hit,input,input_offset)|input∈T,e∈G}, where e represents an edge of the attribute graph, hit represents the number of hits, input represents the input test case, and input_offset represents the offset of the key node associated with the edge. Based on the coverage of the attribute graph, the module guides the mutation process of the fuzzing module to cover the specified control flow nodes. When the fuzzing module cannot cover the nodes, the module optimizes the test case selection and switches to the symbolic execution module for targeted coverage.
[0011] Preferably, the method by which the information fusion scheduling module guides the mutation process of the fuzz testing module includes the following steps:
[0012] Step S11: Dynamically instrument the target Android program and collect operand information of comparison instructions during program execution;
[0013] Step S12: Mark the key bytes of the current test case and mark the comparison instructions affected by the bytes;
[0014] Step S13: Construct a key byte set S = {(ti,ci,offsets)|ti∈T,ci∈C}, where T represents the set of test cases, C represents the set of comparison instructions, and offsets represents the byte position offset;
[0015] Step S14: Calculate the fitness value of an individual during the mutation process by comparing the changes in the operand information of the target comparison instruction, so as to guide the mutation process.
[0016] Preferably, the fitness value is calculated in step S14 as follows: obtain the operand information of the comparison instruction, obtain the values of the two operands of the comparison instruction, and accumulate the absolute values of the difference between the two operands, with the accumulated result serving as the fitness value.
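The fitness calculation described in step S14 can be illustrated with a minimal sketch. The function name `fitness` and the sample operand values are assumptions introduced only for illustration; the operand pairs stand for the values that instrumentation would record for the target comparison instructions.

```python
from typing import Iterable, Tuple


def fitness(operand_pairs: Iterable[Tuple[int, int]]) -> int:
    """Accumulate |a - b| over the two operands of each recorded comparison.

    A value of 0 means every recorded comparison saw equal operands.
    """
    return sum(abs(a - b) for a, b in operand_pairs)


# Hypothetical operand values recorded during one execution of a test case.
observed = [(0x41, 0x7F), (10, 10), (3, 8)]
score = fitness(observed)
```

A smaller value indicates that the operands are closer to equality, so under this reading the mutation process would prefer individuals with lower fitness values.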
[0017] Preferably, the method for symbolic execution by the symbolic execution module includes the following steps:
[0018] Step S21: Load the target Android program and add callback instrumentation to all code before execution to simulate execution;
[0019] Step S22: Determine whether symbolic execution is needed through the callback instrumentation. If it is needed, proceed to step S23; otherwise, do not perform any processing.
[0020] Step S23: For programs that require symbolic execution, construct a dynamic binary analysis framework instance of Triton to perform symbolic execution.
[0021] Preferably, the program that needs to be symbolically executed includes: a set of branches that contain only symbolic data and are not covered.
[0022] Preferably, the method for the symbolic execution module to perform constraint optimization solutions includes:
[0023] The global control flow graph is constructed dynamically, and loop structures are detected automatically at runtime. When the number of times a loop has been covered exceeds a set threshold, constraint collection for that loop structure is skipped directly.
[0024] If the current branch is unsolvable due to constraint conflicts, all constraints are traversed and the preceding conflicting constraints are eliminated one by one, after which the path constraints are solved again.
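One possible reading of the conflict elimination step is sketched below. It is only an illustration under stated assumptions: the brute-force `solve` function stands in for a real constraint solver, constraints are modeled as Python predicates over a single input byte, and the policy of dropping the most recent path constraint first is an assumed interpretation, not a detail given in the text.

```python
from typing import Callable, List, Optional

Constraint = Callable[[int], bool]


def solve(constraints: List[Constraint]) -> Optional[int]:
    """Brute-force stand-in for a constraint solver over one input byte."""
    for x in range(256):
        if all(c(x) for c in constraints):
            return x
    return None


def solve_with_elimination(path: List[Constraint],
                           target: Constraint) -> Optional[int]:
    """Solve path + target; on conflict, drop preceding constraints one by one.

    The target branch constraint is always kept. Path constraints are
    removed starting from the most recent one (assumed policy).
    """
    kept = list(path)
    while True:
        model = solve(kept + [target])
        if model is not None or not kept:
            return model
        kept.pop()  # eliminate the previous (most recent) path constraint


# Hypothetical path constraints collected before the uncovered branch.
path = [lambda x: x > 10, lambda x: x < 50, lambda x: x % 2 == 0]
relaxed = solve_with_elimination(path, lambda x: x == 77)
direct = solve_with_elimination([lambda x: x > 10], lambda x: x < 20)
```

Because constraints are dropped, a generated input is not guaranteed to follow the original path, which is consistent with the embodiment handing new test cases back to the fuzz testing module for verification.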
[0025] Preferably, the method by which the information fusion scheduling module determines that the fuzz testing module cannot achieve coverage includes:
[0026] Step S31: Obtain the hit count H_i of the predecessor node of edge E_ik;
[0027] Step S32: When the hit count H_i exceeds the preset threshold T while the hit count H_ik of edge E_ik is 0, it is determined that the fuzz testing module cannot cover the edge.
[0028] Preferably, the method for optimizing test case selection by the information fusion scheduling module includes the following steps:
[0029] Step S41: Obtain the dominator node B_i-dom of the start node B_i of edge E_ik;
[0030] Step S42: Obtain the reachable path set I_Bi-dom of the dominator node B_i-dom;
[0031] Step S43: Label the reachable path set I_Bi-dom with solvability tags;
[0032] Step S44: Use machine learning algorithms to predict solvability and obtain the test cases with the highest solvability probability.
[0033] Preferably, the machine learning algorithm includes: decision tree, random forest, support vector machine and neural network.
[0034] The beneficial effects of this invention include at least the following: designing a composite dynamic vulnerability mining method, intelligently scheduling the input space and path space processing engine, maximizing the coverage of the testing process, and achieving the goal of deep vulnerability mining.
[0035] By designing an uncovered branch-oriented strategy, a dynamic set of uncovered branches affected by input is constructed based on the combination of static analysis and runtime information, along with corresponding test cases that can reach related sibling branches. Then, the best input test case is selected to drive concrete symbolic execution and generate new test cases that can reach uncovered branches, thereby quickly discovering more security vulnerabilities, increasing the contribution of concrete symbolic execution to hybrid fuzzing, and improving the overall testing efficiency of hybrid testing methods.
[0036] Furthermore, current methods for discovering vulnerabilities in Android platform applications typically only analyze Java programs written using the SDK, lacking methods that directly support vulnerability discovery in native Android platform programs. To address this, this invention designs a vulnerability discovery method for native Android platform programs based on full system simulation execution. It achieves in-depth vulnerability discovery in native Android platform programs by combining fuzz testing and symbolic execution, providing an execution scheme.
Attached Figure Description
[0037] Figure 1 is a schematic diagram of the overall system of the present invention;
[0038] Figure 2 is a schematic diagram of the execution flow of the fuzz testing module in an embodiment of the present invention;
[0039] Figure 3 is a schematic diagram of the process of using Qiling for fuzz testing in an embodiment of the present invention;
[0040] Figure 4 is a schematic diagram of the mutation process in an embodiment of the present invention;
[0041] Figure 5 is a schematic diagram of the execution flow of the symbolic execution module in an embodiment of the present invention.
Detailed Implementation
[0042] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the protection scope of the present invention.
[0043] As shown in Figure 1, this embodiment of the invention provides a composite dynamic vulnerability mining system for native Android platform programs, including a fuzzing module, an information fusion scheduling module, and a symbolic execution module.
[0044] Specifically, the overall process of this invention is as follows. First, a static analysis engine analyzes and processes the target native program to obtain global information. Next, conventional fuzz testing is performed based on the AFL framework, and the combined information is updated dynamically through an incremental information fusion module. Then, based on the fused information, the set of uncovered branches guides the concrete symbolic execution process: test cases are selected to drive symbolic execution based on full system simulation, and new test cases are generated through constraint solving to supplement the fuzz test case pool, thereby improving test coverage.
[0045] The fuzz testing module is used to set up a test case pool; select test cases from the test case pool for testing; track coverage information when test cases mutate; and add the mutated test cases to the test case pool when coverage improves.
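The behavior of the test case pool can be sketched as a small coverage-guided loop. This is a simplified illustration, not the AFL implementation: `run_target` is a hypothetical stand-in for an instrumented execution that reports covered edges, and the single-byte `mutate` operator is an assumption.

```python
import random
from typing import List, Set, Tuple


def run_target(data: bytes) -> Set[str]:
    """Hypothetical instrumented run: returns the IDs of the covered edges."""
    edges = {"entry"}
    if len(data) > 0 and data[0] == ord("A"):
        edges.add("A")
        if len(data) > 1 and data[1] == ord("B"):
            edges.add("AB")
    return edges


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Replace one randomly chosen byte with a random value."""
    buf = bytearray(data)
    buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)


def fuzz(seeds: List[bytes], iterations: int,
         rng: random.Random) -> Tuple[List[bytes], Set[str]]:
    pool = list(seeds)                      # the test case pool
    covered: Set[str] = set()
    for seed in pool:
        covered |= run_target(seed)
    for _ in range(iterations):
        parent = rng.choice(pool)           # select a test case from the pool
        child = mutate(parent, rng)
        new_edges = run_target(child) - covered
        if new_edges:                       # coverage improved: keep the mutant
            covered |= new_edges
            pool.append(child)
    return pool, covered


pool, covered = fuzz([b"\x00\x00"], 2000, random.Random(0))
```

Only mutants that add coverage enter the pool, so the pool size is bounded by the number of distinct edges discovered.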
[0046] To achieve dynamic testing of native Android applications, the target application can be loaded and run on a physical device or on an emulator. However, both methods make it difficult to dynamically monitor the target native application during execution and make it difficult to utilize existing mature fuzzing frameworks.
[0047] Specifically, as shown in Figure 2, this embodiment of the invention uses the binary analysis framework Qiling to emulate the execution of native Android programs and then adds fuzzing instrumentation, thereby integrating with the existing fuzzing framework AFL, which drives the entire fuzzing process. Furthermore, this invention relies on incremental information fusion for key byte marking and optimized test case search. The fuzz testing process is shown in Figure 3.
[0048] Since branch coverage feedback has been proven to be a cost-effective testing approach, this invention builds on the existing fuzz testing framework, additionally collects the operand information of comparison instructions during incremental information fusion, and incorporates a genetic algorithm into the mutation-based generation of test cases. The gray-marked section of Figure 2 implements this search-based test case generation method. Based on the information fusion module, the search proceeds as follows. First, by dynamically instrumenting the target native Android program, the operand information of comparison instructions is collected during program execution. Second, the key bytes of the current input test case are marked, indicating which input bytes affect which comparison instructions, and a key byte set S = {(ti,ci,offsets)|ti∈T,ci∈C} is constructed, where T is the test case set, C is the comparison instruction set, and offsets is the byte position offset, thus recording the influence of every test case on the different comparison instructions. Third, within the relatively small input space defined by the key bytes, a genetic algorithm heuristically searches for new test cases; during the search, the fitness value of an individual is represented by the change in the operand information of the target comparison instruction. Finally, when the search finds a test case that covers a new branch, it is added to the test case pool.
[0049] Specifically, as shown in Figure 4, the search aims to drive the operands of all comparison instructions toward equality, and the sum of the deviations caused by operand inequality is used as the fitness value of the genetic algorithm. Following the classic genetic algorithm model, different test cases represent the individuals of the algorithm. For a failed comparison instruction, the corresponding key bytes are encoded as 0/1 genes. To obtain an individual's fitness value, the operand information of the corresponding comparison instructions is acquired dynamically during execution: the values of the two operands of each comparison instruction are obtained, and the absolute values of their differences are accumulated. The accumulated result is used as the fitness value of the current individual. Based on this fitness value, the classic genetic algorithm process is iterated multiple times to find the optimal test case.
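The key byte set S can be illustrated with a small sketch. The text does not state how key bytes are identified, so the approach here, flipping one input byte at a time and observing which comparison operands change, is an assumption chosen for illustration. The names `trace_comparisons`, `key_bytes`, and the comparison IDs are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# comparison instruction ID -> (operand 1, operand 2)
Operands = Dict[str, Tuple[int, int]]


def trace_comparisons(data: bytes) -> Operands:
    """Hypothetical instrumented target reporting comparison operands."""
    return {
        "cmp_magic": (data[0], 0x7F),
        "cmp_len": (data[2] + data[3], 16),
    }


def key_bytes(test_id: str, data: bytes,
              trace: Callable[[bytes], Operands]
              ) -> List[Tuple[str, str, List[int]]]:
    """Build the (t_i, c_i, offsets) entries of S for one test case."""
    baseline = trace(data)
    influence: Dict[str, List[int]] = {cmp_id: [] for cmp_id in baseline}
    for offset in range(len(data)):
        probe = bytearray(data)
        probe[offset] ^= 0xFF               # flip every bit of one byte
        changed = trace(bytes(probe))
        for cmp_id, operands in baseline.items():
            if changed.get(cmp_id) != operands:
                influence[cmp_id].append(offset)
    return [(test_id, cmp_id, offsets)
            for cmp_id, offsets in influence.items() if offsets]


S = key_bytes("t0", bytes([1, 2, 3, 4]), trace_comparisons)
```

In this example byte 0 is recorded as a key byte of `cmp_magic`, and bytes 2 and 3 as key bytes of `cmp_len`, while byte 1 affects no comparison and is left out of the search space.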
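The search described above can be sketched as a small genetic algorithm over the key bytes. This is a minimal illustration under assumptions: the `operands` function is a hypothetical stand-in for instrumented execution, and the population size, elitist selection, one-point crossover, and single-bit mutation are illustrative choices that the text does not specify.

```python
import random
from typing import List, Tuple


def operands(data: bytes) -> List[Tuple[int, int]]:
    """Hypothetical target: operands an instrumented run would report."""
    return [(data[0], 0x7F), (data[1], 0x45)]


def fitness(data: bytes) -> int:
    """Accumulated absolute operand differences (lower is better)."""
    return sum(abs(a - b) for a, b in operands(data))


def to_bits(data: bytes, offsets: List[int]) -> List[int]:
    """Encode the key bytes as 0/1 genes, most significant bit first."""
    return [(data[o] >> (7 - k)) & 1 for o in offsets for k in range(8)]


def from_bits(seed: bytes, offsets: List[int], bits: List[int]) -> bytes:
    """Write the genes back into a copy of the seed test case."""
    buf = bytearray(seed)
    for i, o in enumerate(offsets):
        value = 0
        for bit in bits[8 * i: 8 * i + 8]:
            value = (value << 1) | bit
        buf[o] = value
    return bytes(buf)


def search(seed: bytes, offsets: List[int], rng: random.Random,
           pop_size: int = 30, generations: int = 200) -> bytes:
    def score(genes: List[int]) -> int:
        return fitness(from_bits(seed, offsets, genes))

    base = to_bits(seed, offsets)
    population = [base] + [[rng.randint(0, 1) for _ in base]
                           for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=score)
        if score(population[0]) == 0:
            break                                # all operands are equal
        parents = population[: pop_size // 2]    # elitist selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(base))    # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(len(child))] ^= 1  # single-bit mutation
            children.append(child)
        population = parents + children
    return from_bits(seed, offsets, min(population, key=score))


seed = b"\x00\x00AAAA"
best = search(seed, [0, 1], random.Random(1))
```

Because the better half of each generation is retained, the best fitness value never gets worse from one generation to the next, and bytes outside the key byte offsets are left unchanged.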
[0050] Dynamic symbolic execution is a path-sensitive analysis technique. However, traditional symbolic execution methods do not consider the relationship between internal code and driving inputs when attempting to test the integrity of the target software, leading to problems such as state space explosion and difficulty in constraint solving, making it impossible to test truly large-scale software. This invention proposes to construct an uncovered set of branches based on an incremental fusion module, controlling symbolic execution within uncovered control flow nodes, thus achieving controllable scale of the symbolic execution process.
[0051] In this embodiment of the invention, the symbolic execution module is used to perform symbolic execution and collect path constraints. When an uncovered branch is encountered, constraint optimization is performed to generate new test cases and add them to the test case pool.
[0052] Specifically, this embodiment of the invention implements symbolic execution based on the Triton dynamic binary analysis framework. First, an adapter plugin is implemented to acquire the driving test cases and to complete repetitive tasks such as program loading and initialization through the Qiling platform; upon encountering symbolic data, it synchronizes state information to the Triton engine and performs symbolic execution. Second, a test case generation plugin is implemented to perform constraint pruning and consistency solving. Third, a detection plugin with a vulnerability model is implemented to uncover potential security vulnerabilities. Finally, the entire symbolic execution engine is modularly encapsulated for unified scheduling.
[0053] The symbolic execution design based on Qiling framework emulation is shown in Figure 5. The program is first loaded with the Qiling framework, and callback instrumentation is added to all code before execution. Emulated execution then proceeds, and the callback instrumentation determines whether symbolic execution is needed. In this embodiment of the invention, the system switches to symbolic execution only when symbolic data is being processed, and only the set of uncovered branches is processed, thereby effectively controlling the scale of the symbolic execution process.
[0054] When symbolic execution is required, a Triton instance is constructed and the execution state of the Qiling instance is synchronized to it. Symbolic execution is then performed and path constraints are collected. When an uncovered branch is reached, constraint optimization and solving are performed to generate new test cases, which are stored in the test case pool. The new test cases are verified in conjunction with the fuzzing module, and the set of uncovered branches is updated.
[0055] For the symbolic execution process, the optimized solving in this embodiment of the invention includes the following. After a driving input is selected, state forking is allowed only after the specified uncovered branch, based on the information fusion result. To handle loops that generate many constraints, the attribute graph is constructed dynamically and loop structures are detected automatically at runtime; when the number of times a loop has been covered exceeds a threshold, constraint collection for that loop structure is skipped directly, thereby reducing the complexity of the constraints that are finally solved. If the current branch is unsolvable due to constraint conflicts, all constraints are traversed and the preceding conflicting constraints are eliminated one by one. The path constraints are then solved again, and the generated test cases are stored in the test case pool and handed over to the fuzz testing module for verification.
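The decision made inside the callback instrumentation can be sketched without the Qiling or Triton libraries. The following is a library-free illustration under assumptions: the `Instruction` record, the register and memory names, and the propagation rule are hypothetical simplifications of what an emulator callback and a symbolic engine would provide.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Set


@dataclass(frozen=True)
class Instruction:
    address: int
    reads: FrozenSet[str]                  # registers or memory cells read
    writes: FrozenSet[str] = frozenset()   # registers or memory cells written
    is_branch: bool = False


def simulate(trace: Iterable[Instruction], symbolic_inputs: Set[str],
             uncovered: Set[int]) -> List[int]:
    """Return the branch addresses at which constraint solving is attempted."""
    symbolic = set(symbolic_inputs)
    solve_at: List[int] = []
    for insn in trace:
        if insn.reads & symbolic:
            # Callback decision: the instruction processes symbolic data,
            # so it is handed to the symbolic engine (modeled here as
            # simple propagation of the symbolic marking).
            symbolic |= insn.writes
            if insn.is_branch and insn.address in uncovered:
                solve_at.append(insn.address)
        else:
            # Purely concrete instruction: emulation continues unchanged and
            # overwritten locations lose their symbolic marking.
            symbolic -= insn.writes
    return solve_at


trace = [
    Instruction(0x1000, frozenset({"input[0]"}), frozenset({"r0"})),
    Instruction(0x1004, frozenset({"r1"}), frozenset({"r2"})),
    Instruction(0x1008, frozenset({"r0"}), is_branch=True),
    Instruction(0x100C, frozenset({"r2"}), is_branch=True),
    Instruction(0x1010, frozenset({"r0"}), is_branch=True),
]
targets = simulate(trace, {"input[0]"}, {0x1008, 0x100C})
```

In this example only the branch at 0x1008 both depends on symbolic data and belongs to the uncovered set, so it is the only point at which solving would be attempted.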
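The loop handling described above can be sketched as follows. This is a simplified illustration under assumptions: a basic block that is revisited within one trace is treated as part of a loop, constraints are represented as plain strings, and the names `collect_constraints` and `threshold` are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def collect_constraints(branch_trace: List[Tuple[int, str]],
                        threshold: int) -> List[str]:
    """Collect path constraints, skipping loop iterations past the threshold.

    branch_trace holds (basic block address, constraint) pairs in
    execution order.
    """
    visits: Counter = Counter()
    collected: List[str] = []
    for block, constraint in branch_trace:
        visits[block] += 1
        if visits[block] > threshold:
            continue  # loop covered often enough: skip its constraint
        collected.append(constraint)
    return collected


# Hypothetical trace: one branch, five loop iterations, one final check.
trace = ([(0x10, "n > 0")]
         + [(0x20, f"i{k} < n") for k in range(5)]
         + [(0x30, "buf[0] == 0x7F")])
constraints = collect_constraints(trace, threshold=2)
```

With a threshold of 2, only the first two loop iterations contribute constraints, which keeps the final constraint set small at the cost of possibly under-constraining the path.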
[0056] The information fusion scheduling module is used to fuse the static and runtime information of the target Android program into an attribute graph EIMap = {(e,hit,input,input_offset)|input∈T,e∈G}, where e represents an edge in the attribute graph, hit represents the number of hits, input represents the input test case, and input_offset represents the offset of the key node associated with the edge. Based on the coverage of the attribute graph, the module guides the mutation process of the fuzzing module to cover specified control flow nodes. When the fuzzing module cannot cover the target nodes, test case selection optimization is performed, and the system switches to the symbolic execution module for targeted coverage. The guidance and optimization processes have been described in the fuzzing and symbolic execution modules above.
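The attribute graph EIMap can be pictured as a mapping from each edge to its hit count, the test cases that reached it, and the associated key byte offsets. The sketch below is one possible in-memory layout; the class and method names are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[int, int]  # (source basic block, destination basic block)


@dataclass
class EdgeInfo:
    hit: int = 0
    inputs: List[str] = field(default_factory=list)         # test case IDs
    input_offsets: List[int] = field(default_factory=list)  # key byte offsets


class EIMap:
    """Edges known from static analysis, annotated with runtime information."""

    def __init__(self, static_edges: Iterable[Edge]) -> None:
        self.edges: Dict[Edge, EdgeInfo] = {e: EdgeInfo() for e in static_edges}

    def record(self, edge: Edge, test_id: str,
               offsets: Iterable[int] = ()) -> None:
        """Incrementally fuse one runtime observation into the graph."""
        info = self.edges.setdefault(edge, EdgeInfo())
        info.hit += 1
        if test_id not in info.inputs:
            info.inputs.append(test_id)
        for offset in offsets:
            if offset not in info.input_offsets:
                info.input_offsets.append(offset)

    def uncovered(self) -> List[Edge]:
        return [e for e, info in self.edges.items() if info.hit == 0]


eimap = EIMap([(1, 2), (1, 3), (2, 4)])
eimap.record((1, 2), "t0", [0])
eimap.record((1, 2), "t1", [0, 1])
```

Edges with a hit count of 0 form the set of uncovered branches that the scheduling module can use as targets.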
[0057] Specifically, in optimizing the mutation strategy of fuzzing modules, traditional testing methods, while also using coverage as a guide, employ an unplanned qualitative approach. They consider input test cases valid as long as they trigger new control flow nodes and add them to the test case pool for the next round of mutation, resulting in low path discovery efficiency. This invention, based on existing program semantics, dynamically determines the next control flow node to be covered based on the currently covered control flow nodes, and guides the fuzzing process with this node as the target. Then, based on the input test cases and control flow mapping information, keyword section information, and candidate dictionary information obtained through incremental information fusion, the mutation strategy is dynamically adjusted to generate input test cases that can cover the specified control flow node more quickly.
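Choosing the next control flow node to target can be sketched as computing the frontier of the covered region. The control flow graph, node names, and the function `frontier` are assumptions for illustration; the text does not specify how one candidate is chosen among several.

```python
from typing import Dict, List, Set


def frontier(cfg: Dict[str, List[str]], covered: Set[str]) -> List[str]:
    """Uncovered nodes that are direct successors of covered nodes."""
    return sorted({succ
                   for node in covered
                   for succ in cfg.get(node, [])
                   if succ not in covered})


# Hypothetical control flow graph and current coverage.
cfg = {"entry": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
candidates = frontier(cfg, {"entry", "a"})
```

Each candidate is one branch away from code that existing test cases already reach, which makes it a natural target for directed mutation.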
[0058] Regarding the optimization of the switching timing of the symbolic execution module, the general switching strategy is: when the fuzzing module still does not find a new state after a specified number of mutations, the symbolic execution engine is started to analyze all test cases generated by the fuzzing that can find a new state in turn, which is inefficient.
[0059] The present invention proposes to schedule module switching precisely based on the incremental information fusion results. The coverage of each edge in the attribute graph is monitored in real time. If the hit count H_i of the predecessor node of some edge E_ik is greater than the preset threshold T while the hit count H_ik of edge E_ik is still 0, the fuzzing engine is considered unable to break through the branch constraint. The system then switches to the symbolic execution module to perform targeted coverage of edge E_ik and quickly generate test cases that cover it. The specific value of parameter T can be adjusted dynamically according to the type of test target.
[0060] Regarding the optimization of test case selection, assume that the edge to be broken through is E_ik. When N test cases can hit E_ij but cannot hit edge E_ik, the symbolic execution module is used for solving, but not all of these N test cases are solvable at basic block B_i. To achieve optimal results and avoid invalid symbolic execution, the test case with the highest probability of being solvable at basic block B_i must be selected from the N test cases to drive the analysis process of the symbolic execution module, thereby improving the efficiency of test case generation.
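The switching condition can be written down directly. In the sketch below, the dictionaries `node_hits` and `edge_hits` and the function name `stuck_edges` are assumptions; the condition itself follows the text: the predecessor node has been hit more than T times while the edge has never been hit.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (predecessor node, successor node)


def stuck_edges(node_hits: Dict[str, int], edge_hits: Dict[Edge, int],
                threshold: int) -> List[Edge]:
    """Edges E_ik with H_i > T and H_ik == 0."""
    return [edge for edge, hits in edge_hits.items()
            if hits == 0 and node_hits.get(edge[0], 0) > threshold]


# Hypothetical counters collected during fuzzing.
node_hits = {"B1": 5000, "B2": 10}
edge_hits = {("B1", "B2"): 10, ("B1", "B3"): 0, ("B2", "B4"): 0}
targets = stuck_edges(node_hits, edge_hits, threshold=1000)
```

Here only the edge from B1 to B3 qualifies: B2 has an unhit outgoing edge as well, but B2 itself has not been reached often enough to conclude that fuzzing is stuck there.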
[0061] Specifically, it includes the following steps:
[0062] Step S41: Obtain the dominator node B_i-dom of the start node B_i of edge E_ik;
[0063] Step S42: Obtain the reachable path set I_Bi-dom of the dominator node B_i-dom;
[0064] Step S43: Label the reachable path set I_Bi-dom with solvability tags;
[0065] Step S44: Use machine learning algorithms to predict solvability and obtain the test cases with the highest solvability probability.
[0066] The machine learning algorithms in this embodiment of the invention include: decision tree, random forest, support vector machine and neural network.
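Steps S41 to S44 can be sketched as follows. This is an illustration under assumptions: dominators are computed with the standard iterative set-intersection algorithm, the dominator B_i-dom is taken to be the immediate dominator, and the trained model is replaced by a hypothetical `predict` callable that returns a solvability score for a path. The path-length heuristic used in the example is purely illustrative and is not a real model.

```python
from typing import Callable, Dict, List, Optional, Set


def dominators(cfg: Dict[str, List[str]], entry: str) -> Dict[str, Set[str]]:
    """Iterative dominator sets: dom(n) = {n} | intersection of dom(preds)."""
    nodes = set(cfg) | {s for succs in cfg.values() for s in succs}
    preds: Dict[str, Set[str]] = {n: set() for n in nodes}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].add(n)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n], changed = new, True
    return dom


def immediate_dominator(dom: Dict[str, Set[str]], node: str) -> Optional[str]:
    """The strict dominator that every other strict dominator dominates."""
    strict = dom[node] - {node}
    for d in strict:
        if strict <= dom[d]:
            return d
    return None


def select_driver(paths: Dict[str, List[str]], dom_node: str,
                  predict: Callable[[List[str]], float]) -> Optional[str]:
    """Pick the test case reaching dom_node with the highest predicted score."""
    reachable = {t: p for t, p in paths.items() if dom_node in p}
    if not reachable:
        return None
    return max(reachable, key=lambda t: predict(reachable[t]))


# Hypothetical control flow graph and executed paths of three test cases.
cfg = {"entry": ["A"], "A": ["B", "C"], "B": ["D"], "C": ["D"],
       "D": ["E"], "E": []}
dom = dominators(cfg, "entry")
b_dom = immediate_dominator(dom, "D")
paths = {"t0": ["entry", "A", "B", "D"],
         "t1": ["entry", "A", "C", "D", "E"],
         "t2": ["entry"]}
chosen = select_driver(paths, b_dom, lambda p: 1.0 / len(p))
```

In practice the `predict` callable would wrap whichever of the listed algorithms has been trained on paths labeled with solvability tags.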
[0067] The aforementioned system fills a gap in research on vulnerability discovery in native Android applications. By achieving intelligent scheduling of fuzzing and symbolic execution methods through incremental information fusion, it better leverages the advantages of both methods. By guiding the fuzzing and symbolic execution processes with uncovered branch sets, the generated test cases are more targeted and efficient. Furthermore, it possesses excellent applicability, enabling dynamic testing of native Android applications without relying on real devices or emulators.
[0068] The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; only preferred embodiments of the present invention are illustrated. The descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of the present invention. As long as the combination of these technical features does not contradict each other, it should be considered within the scope of this specification.
[0069] It should be noted that those skilled in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the scope of protection of this invention. Therefore, the scope of protection of this patent should be determined by the appended claims.
Claims
1. A composite Android platform native program dynamic vulnerability mining system, characterized in that: It includes a fuzz testing module, an information fusion scheduling module, and a symbolic execution module; The fuzz testing module is used to set up the test case pool; Test cases are selected from the test case pool for testing. When test cases are mutated, coverage information is tracked. When coverage is improved, the mutated test cases are added to the test case pool. The symbolic execution module is used to perform symbolic execution and collect path constraints. When an uncovered branch is encountered, constraint optimization is performed to generate new test cases and add them to the test case pool. The information fusion scheduling module is used to fuse the static and runtime information of the target Android program into an attribute graph EIMap={(e,hit,input,input_offset)|input∈T,e∈G}, where e represents an edge of the attribute graph, hit represents the number of hits, input represents the input test case, and input_offset represents the offset of the key node associated with the edge. Based on the coverage of the attribute graph, the module guides the mutation process of the fuzzing module to cover the specified control flow nodes. When the fuzzing module cannot cover the nodes, the module optimizes the test case selection and switches to the symbolic execution module for targeted coverage. 
The method by which the information fusion scheduling module guides the mutation process of the fuzz testing module includes the following steps: Step S11: Dynamically instrument the target Android program and collect operand information of comparison instructions during program execution; Step S12: Mark the key bytes of the current test case and mark the comparison instructions affected by the bytes; Step S13: Construct a key byte set S = {(ti, ci, offsets) | ti ∈ T, ci ∈ C}, where T represents the set of test cases, C represents the set of comparison instructions, and offsets represents the byte position offset; Step S14: Calculate the fitness value of an individual during the mutation process by comparing the changes in the operand information of the target comparison instruction, so as to guide the mutation process; The method for optimizing test case selection in the information fusion scheduling module includes the following steps: Step S41: Obtain the dominator node B_i-dom of the start node B_i of edge E_ik; Step S42: Obtain the reachable path set I_Bi-dom of the dominator node B_i-dom; Step S43: Label the reachable path set I_Bi-dom with solvability tags; Step S44: Use machine learning algorithms to predict solvability and obtain the test cases with the highest solvability probability.
2. The system according to claim 1, characterized in that the fitness value in step S14 is calculated as follows: obtain the operand information of the comparison instruction, obtain the values of the two operands of the comparison instruction, and accumulate the absolute values of the difference between the two operands; the accumulated result is used as the fitness value of the individual.
3. The system according to claim 1, characterized in that the symbolic execution module performs symbolic execution using the following steps: Step S21: Load the target Android program and add callback instrumentation to all code before execution to simulate execution; Step S22: Determine whether symbolic execution is needed through the callback instrumentation. If it is needed, proceed to step S23; otherwise, do not perform any processing; Step S23: For programs that require symbolic execution, construct a dynamic binary analysis framework instance of Triton to perform symbolic execution.
4. The system according to claim 3, characterized in that the programs that require symbolic execution include: the set of branches that contain only symbolic data and are not covered.
5. The system according to claim 1, characterized in that the method for constraint optimization solving by the symbolic execution module includes: the global control flow graph is constructed dynamically, and loop structures are detected automatically at runtime; when the number of times a loop has been covered exceeds a set threshold, constraint collection for that loop structure is skipped directly; if the current branch is unsolvable due to constraint conflicts, all constraints are traversed and the preceding conflicting constraints are eliminated one by one, after which the path constraints are solved again.
6. The system according to claim 1, characterized in that the method by which the information fusion scheduling module determines that the fuzz testing module cannot achieve coverage includes: Step S31: Obtain the hit count H_i of the predecessor node of edge E_ik; Step S32: When the hit count H_i is greater than the preset threshold T while the hit count H_ik of edge E_ik is 0, it is determined that the fuzz testing module cannot cover the edge.
7. The system according to claim 1, characterized in that the machine learning algorithms include: decision trees, random forests, support vector machines, and neural networks.