Test case selection method based on Markov decision process state grid clustering

By discretizing the state space of a Markov decision process into grid cells and evaluating novelty by counting, the problem of redundant test cases is solved, more efficient test case selection is achieved, and the effectiveness and efficiency of agent testing are improved.

CN122240477A · Pending · Publication Date: 2026-06-19 · INST OF SOFTWARE - CHINESE ACAD OF SCI

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
INST OF SOFTWARE - CHINESE ACAD OF SCI
Filing Date
2026-03-11
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies for testing Markov decision process agents suffer from numerous redundant test cases, leading to low testing effectiveness and efficiency. Traditional methods have failed to effectively reduce redundant evaluations of similar states.

Method used

By discretizing the continuous state space into grid cells, a grid clustering method is used to count and evaluate the novelty of states and state-action pairs, thereby filtering out test cases that can trigger abnormal model behavior and avoiding repeated execution.

Benefits of technology

It increases the diversity of test cases, reduces redundant tests, improves test effectiveness and system efficiency, and reduces computational complexity.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a test case selection method based on Markov decision process state grid clustering. The steps include: discretizing each dimension of the continuous Markov decision state space into a series of grid cells and generating indices; mapping the continuous states of the agent to the corresponding grid cells according to the indices to obtain the discretized states s; selecting test cases for this round of testing to test the agent, generating a decision sequence and forming the actual state x; recording the number of times the agent visits state s or state-action pairs during execution, obtaining the count values of state s and state-action pairs; evaluating the novelty reward of state s and state-action pairs based on the count values; determining whether the test case meets the conditions based on the novelty reward; if it meets the conditions, adding the actual state x and the decision sequence as a test case to the test case set. This invention can significantly improve testing effectiveness and system efficiency.

Description

Technical Field

[0001] This invention belongs to the field of testing complex decision-making intelligent agents, and relates to a test case selection method based on Markov decision process state grid clustering.

Background Technology

[0002] Markov Decision Processes (MDPs) are a core mathematical framework in reinforcement learning used to model sequential decision-making problems. At the core of an MDP is a quintuple (S, A, P, R, γ), which describes the process by which an agent maximizes cumulative reward by choosing actions in an environment with uncertain state transitions. Specifically:

(1) State set (S): The set of all possible states of the system.

[0003] (2) Action set (A): Actions that the agent can perform in each state.

[0004] (3) State transition probability (P): P(s'|s, a) represents the probability of transitioning to state s' after performing action a in state s, satisfying ∑P(s'|s, a)=1.

[0005] (4) Reward function (R): R(s, a, s') is the scalar reward of environmental feedback.

[0006] (5) Discount factor (γ): γ∈[0,1], weighing immediate and future rewards (γ=0 only focuses on the present; γ≈1 focuses on long-term benefits).
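To make the role of the discount factor concrete, the following minimal Python sketch computes the discounted return of a finite reward sequence. The function name `discounted_return` and the sample rewards are illustrative assumptions and do not appear in the patent text.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum over t of gamma**t * rewards[t] for a finite reward sequence."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Accumulate backwards using G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]  # hypothetical per-step rewards
print(discounted_return(rewards, 0.0))  # only the immediate reward counts
print(discounted_return(rewards, 1.0))  # all rewards count equally
```

With γ = 0 only the first reward contributes, while with γ = 1 every reward in the sequence contributes equally.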

[0007] Random network distillation (RND) is a technique used to enhance the exploratory capabilities of reinforcement learning agents, particularly suitable for environments with sparse or absent external rewards. It improves learning efficiency by introducing an intrinsic reward mechanism, encouraging the agent to actively explore novel states. The core idea of RND is to use the prediction error of a neural network as an intrinsic reward. Its specific mechanisms include:

(1) Dual network structure. RND contains two neural networks: a target network (fixed random initialization), whose parameters remain unchanged during training and which generates a fixed target feature representation for the input state; and a prediction network (trainable), which learns to predict the target features by minimizing the difference (such as the mean squared error) between its own output and the target network output.

[0008] (2) Calculation of intrinsic reward. When the agent encounters a new state, the prediction error of the prediction network is large, and the system converts this error into intrinsic reward; when the state is visited repeatedly, the prediction error decreases and the reward decreases.
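The two mechanisms above can be illustrated with a deliberately simplified sketch. It assumes linear "networks" (a single weight matrix each) in place of the deep networks a real RND implementation would use, and the class name `LinearRND`, the learning rate, and the sample state are all hypothetical. It is meant only to show that the prediction error for a state shrinks as that state is revisited.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def linear_features(weights: Matrix, state: Vector) -> Vector:
    """Stand-in for a network forward pass: one matrix-vector product."""
    return [sum(w * x for w, x in zip(row, state)) for row in weights]


class LinearRND:
    def __init__(self, state_dim: int, feature_dim: int,
                 lr: float = 0.05, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.target = random_matrix(feature_dim, state_dim, rng)     # fixed, never trained
        self.predictor = random_matrix(feature_dim, state_dim, rng)  # trainable
        self.lr = lr

    def intrinsic_reward(self, state: Vector) -> float:
        """Mean squared error between predictor and target features."""
        t = linear_features(self.target, state)
        p = linear_features(self.predictor, state)
        return sum((pi - ti) ** 2 for pi, ti in zip(p, t)) / len(t)

    def update(self, state: Vector) -> None:
        """One gradient-descent step on the prediction error for this state."""
        t = linear_features(self.target, state)
        p = linear_features(self.predictor, state)
        for row, pi, ti in zip(self.predictor, p, t):
            grad_scale = 2.0 * (pi - ti) / len(t)
            for j, x in enumerate(state):
                row[j] -= self.lr * grad_scale * x


rnd = LinearRND(state_dim=3, feature_dim=4)
state = [0.5, -0.2, 0.1]  # hypothetical state vector
before = rnd.intrinsic_reward(state)
for _ in range(50):       # simulate 50 repeated visits to the same state
    rnd.update(state)
after = rnd.intrinsic_reward(state)
print(before, after)
```

Because the reward is a learned quantity, it depends on the predictor's current weights rather than on an explicit visit count.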

[0009] When testing dynamic decision-making agents in complex environments, the most commonly used testing method is fuzz testing. Fuzz testing generates input data by mutating test cases, aiming to discover potential decision-making failures of the model when faced with abnormal or unforeseen inputs. It is suitable for handling dynamic and complex decision-making processes.

[0010] Fuzz testing techniques for Markov decision process models assess the robustness of agents under complex and uncertain conditions by varying the environmental state, thereby uncovering potential decision-making flaws and failure modes.

[0011] In the MDP model agent testing framework, the framework samples test cases and generated decision sequences as the initial test case set within a pre-set time period. After sampling, the testing framework tests the agent according to the following process.

[0012] (1) Select test cases according to energy allocation. In the process of agent testing, test case mutation and test number are specified by energy allocation; the higher the energy, the more likely it is to be selected for testing in this round of testing.

[0013] (2) Mutation of test cases. After selecting the test cases for this round of testing, the testing framework will perturb the test cases so that they can trigger more potential failures of the agent.

[0014] (3) Execute test cases in the MDP model framework. In the agent execution environment, start the execution of test cases, assign the relevant parameters of the agent to the environment, and let the agent make decisions in the environment for a finite number of steps to generate a decision sequence.

[0015] (4) Obtain the MDP environment data and feedback data after executing the test cases in this round of testing, including the total reward and decision sequence obtained after one round of agent testing.

[0016] (5) Evaluate test cases based on the decision state sequence and reward data, and update the test case set. Based on the feedback data obtained, the testing framework will determine whether the test case meets the relevant conditions and update the test case set accordingly.

[0017] (6) Repeat the above process until the preset test time for this round of testing is exhausted.
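The six-step process above can be sketched as a generic loop. Everything here is an illustrative assumption: the names `FuzzCase` and `fuzz_loop`, the callables passed in, and the toy mutation and execution functions are hypothetical stand-ins, not the interfaces of MDPFuzz, CureFuzz, or the framework of this invention. A round limit is added next to the time budget so that the sketch terminates quickly.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = List[float]
Trace = Tuple[List[State], float]  # (decision sequence, total reward)


@dataclass
class FuzzCase:
    initial_state: State
    energy: float = 1.0


def fuzz_loop(corpus: List[FuzzCase],
              mutate: Callable[[State, random.Random], State],
              execute: Callable[[State], Trace],
              is_interesting: Callable[[Trace], bool],
              time_budget_s: float,
              max_rounds: int,
              seed: int = 0) -> List[FuzzCase]:
    rng = random.Random(seed)
    deadline = time.monotonic() + time_budget_s
    rounds = 0
    while time.monotonic() < deadline and rounds < max_rounds:
        # (1) Select a test case with probability proportional to its energy.
        weights = [case.energy for case in corpus]
        parent = rng.choices(corpus, weights=weights, k=1)[0]
        # (2) Mutate the selected test case.
        candidate = mutate(parent.initial_state, rng)
        # (3)-(4) Execute it and collect the feedback data.
        trace = execute(candidate)
        # (5) Evaluate the feedback and update the test case set.
        if is_interesting(trace):
            corpus.append(FuzzCase(candidate))
        rounds += 1  # (6) Repeat until the budget is exhausted.
    return corpus


def toy_mutate(state: State, rng: random.Random) -> State:
    return [x + rng.gauss(0.0, 0.1) for x in state]


def toy_execute(state: State) -> Trace:
    return [state], -sum(abs(x) for x in state)


result = fuzz_loop([FuzzCase([0.0, 0.0])], toy_mutate, toy_execute,
                   lambda trace: trace[1] < -0.1,
                   time_budget_s=5.0, max_rounds=20)
print(len(result))
```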

[0018] For intelligent decision-making agents based on Markov Decision Processes (MDPs), the main fuzz testing frameworks currently available are MDPFuzz and CureFuzz. The MDPFuzz framework treats the initial environment state as a test case, testing the agent after mutating the initial environment, and uses a Gaussian mixture model and dynamic expectation-maximization algorithm to evaluate the novelty of the decision sequence. The CureFuzz framework evaluates the novelty of the sequence by using random network distillation, and in this way adds an intrinsic reward to the agent to encourage it to explore more untested sequence states, thus making the testing process more thorough.

[0019] In existing technologies, the MDPFuzz framework uses Gaussian mixture models and dynamic expectation-maximization algorithms to evaluate the novelty of decision sequences, while the CureFuzz framework uses random network distillation to evaluate sequence novelty. These two typical implementations can increase the novelty of test cases; however, neither considers reducing a large number of redundant test cases, which can severely impact test effectiveness and system efficiency.

[0020] The MDPFuzz framework calculates the novelty of a test case by considering all states throughout the entire decision sequence. However, calculating sequence diversity in this way can easily introduce redundancy, because diversity of decision paths does not equate to diversity of outcomes. Test cases may generate different sequences but ultimately trigger the same failure state, leading to decreased testing efficiency and wasted resources.

[0021] While RND-based state novelty evaluation methods have shown potential in exploratory reinforcement learning within the CureFuzz framework, their underlying mechanisms still have limitations. The core principle of RND is to quantify the prediction error as a reward by comparing the feature outputs of the target network and the prediction network for the same state. RND treats each state as an independent sample, relying on instantaneous prediction error to quantify novelty. This leads to neighboring or semantically similar states potentially having significantly different reward values due to the network's oversensitivity to small input perturbations. RND does not explicitly cluster similar states together; it merely promotes exploration through novelty rewards. This is not equivalent to the explicit clustering operation performed by traditional clustering algorithms (such as K-Means clustering). Traditional clustering algorithms are specifically designed to divide data points into different clusters, resulting in high similarity among data points within the same cluster and high dissimilarity between data points in different clusters. RND essentially processes each state individually, without grouping similar states together like clustering algorithms. Therefore, RND is insufficient in handling redundant states and may not effectively streamline similar test sequences when processing states.

Summary of the Invention

[0022] To address the problems existing in the prior art, the purpose of this invention is to provide a test case selection method based on Markov decision process state grid clustering, which can greatly enhance test case diversity and improve test effectiveness and system efficiency. The core idea of this invention's test case diversity selection method is to evaluate the uniqueness of the decision sequence to filter out test cases that can trigger abnormal model behavior, while avoiding the repeated execution of redundant test cases to improve efficiency.

[0023] The technical solution of this invention is as follows: A test case selection method based on Markov decision process state grid clustering, the steps of which include: Discretize each dimension of the continuous Markov decision state space into a series of grid cells and generate an index; The continuous states of the agent are mapped to the corresponding grid cells according to the index to obtain the discretized state s; Select the test cases used in this round of testing, test the agent, generate a decision sequence and form the actual state x; record the number of times the agent visits the state s or state-action pair during the execution process, and obtain the count value N(s) of the state s and the count value N(s, a) of the state-action pair; a is the action performed by the agent in the state s. The novelty reward of state s is evaluated based on the count value N(s), and the novelty reward of state-action pair is evaluated based on the count value N(s, a). The novelty reward determines whether the test case meets the conditions. If it does, the agent's actual state x and decision sequence are added to the test case set as a selected test case.

[0024] Preferably, the novelty reward is inversely proportional to the count value.

[0025] Preferably, the novelty reward calculation function is a decreasing function or a negative power function.

[0026] Preferably, the test cases used in this round of testing are selected from the test case set, mutated, and assigned to the environment, and the agent makes decisions in the environment for a finite number of steps to generate a decision sequence and form the actual state x.

[0027] Preferably, the indexes of each dimension are combined to form a unique discrete state identifier, so that each grid cell has a unique index.

[0028] Preferably, the novelty value of the state s or state-action pair accessed by the agent during execution is obtained through grid clustering or statistical methods.

[0029] Preferably, the novelty reward is normalized, and if the novelty reward of state s or the novelty reward of the state-action pair is greater than a set threshold, then the current test case is determined to meet the conditions.

[0030] A test case selection system based on Markov decision process state grid clustering is characterized by comprising a grid partitioning module, a mapping module, a testing and counting module, a novelty reward module, and a filtering module. The grid partitioning module is used to discretize each dimension of the continuous Markov decision state space into a series of grid cells and generate an index. The mapping module is used to map the continuous state of the agent to the corresponding grid cell according to the index, so as to obtain the discretized state s. The testing and counting module is used to select test cases for this round of testing, test the agent, generate a decision sequence and form the actual state x; record the number of times the agent visits state s or state-action pair during execution, and obtain the count value N(s) of state s and the count value N(s, a) of state-action pair; a is the action performed by the agent in state s; The novelty reward module is used to evaluate the novelty reward of state s based on the count value N(s), and to evaluate the novelty reward of the state-action pair based on the count value N(s, a). The filtering module is used to determine whether the test case meets the conditions based on the novelty reward. If it meets the conditions, the agent's actual state x and decision sequence are added to the test case set as a selected test case.

[0031] A server is characterized by comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program including instructions for performing the methods described above.

[0032] A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program implements the above-described method when executed by a processor.

[0033] This invention, by gridding and counting the states of complex decision-making agents, evaluates the novelty of test cases while clustering the final states of test cases, reduces the probability of similar test cases being selected, reduces redundant test cases, increases the diversity of test cases, and ultimately improves the effectiveness and overall efficiency of testing complex decision-making agents.

[0034] In intelligent decision-making frameworks, agents interact with the environment through a series of states, actions, and reward feedback. Markov Decision Processes (MDPs) assume that the current state can fully describe future state transitions without relying on historical information; their core objective is to find the optimal policy that allows the agent to maximize its expected cumulative reward during the decision-making process. MDPs are suitable for systems with large state spaces and dynamic changes, such as complex system simulation, autonomous driving, and Go.

[0035] Because dynamic decision-making agents in these complex environments face real-world complexity and unpredictability, the test input space for these agents is enormous. To ensure that agents can cope with diverse extreme scenarios, it is necessary to select test cases that can effectively reveal potential vulnerabilities in order to achieve the most comprehensive agent testing possible.

[0036] This invention proposes an agent test case selection method based on Markov decision state grid clustering. By discretizing the final state of the test case decision sequence and counting the states, the method enables testing to simultaneously measure the diversity of test cases and reduce the probability of selecting redundant test cases. Compared to traditional methods that only measure test case novelty based on decision paths, this method can more effectively ensure the diversity of final failure states, reduce repeated testing, and improve testing efficiency.

[0037] Key point 1: By spatial discretization, similar states are clustered and merged into the same grid, which effectively solves the problem of fluctuation in prediction error of similar states in the RND method.

[0038] To address the limitations of RND methods in state novelty evaluation, this method constructs a dimensionally adjustable discretized representation framework by uniformly dividing the continuous state space into equidistant grid cells. Each grid cell maintains an independent state access counter. When the agent reaches the termination state, it first maps its state vector to the corresponding grid coordinates, and then calculates the novelty reward based on the historical access frequency of that grid cell.

[0039] Key point 2: Introducing grid granularity adjustment parameters allows for a dynamic balance between exploration efficiency and accuracy based on state space characteristics.

[0040] Using the novelty of a sequence's state can lead to exploration being overly scattered across path details. Therefore, this method uses the novelty of the sequence's final state to guide agent failure. The novelty of the sequence's final state focuses on outcome diversity, concentrating on the agent's final state after executing a certain number of steps, rather than every intermediate state along the path. For guiding failure, the goal is to find more diverse states that lead to agent failure. The novelty of the sequence's final state helps discover different failure points (i.e., final states), thus more effectively increasing the types of failures. It is computationally inexpensive and efficient. Since only the novelty of the sequence's final state needs to be evaluated, computational and storage requirements are low, making it suitable for application in high-dimensional state spaces.

[0041] Key point 3: Using a hash table to store grid access records reduces the novelty calculation complexity from O(n) of RND to O(1).

[0042] Based on the key points of this application, the advantages and benefits of this method are as follows: Key point 1: By spatial discretization, similar states are clustered and merged into the same grid, which effectively solves the problem of fluctuation in prediction error of similar states in the RND method.

[0043] By pre-setting the grid granularity (such as uniform division or adaptive subdivision), neighboring states in the physical or feature space are forced into the same grid cell, which essentially realizes unsupervised clustering. Although this hard-boundary clustering loses microscopic differences, it can characterize the structural features of the state distribution at the macroscopic level, avoiding the neighbor-state reward oscillation problem caused by the sensitivity of neural networks in RND.
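The clustering effect of a uniform grid can be seen in a short sketch. It assumes a two-dimensional unit square divided into 10 cells per dimension, and the helper name `grid_cell` and the sample states are hypothetical. Two nearby states fall into the same cell and would therefore share one counter, while a distant state does not.

```python
import math
from typing import Sequence, Tuple


def grid_cell(state: Sequence[float], lows: Sequence[float],
              highs: Sequence[float], bins: Sequence[int]) -> Tuple[int, ...]:
    """Map a continuous state to the tuple of per-dimension cell indices."""
    cell = []
    for x, lo, hi, n in zip(state, lows, highs, bins):
        width = (hi - lo) / n
        k = math.floor((x - lo) / width)
        # Assumption: clamp so that boundary values stay inside the grid.
        cell.append(min(max(k, 0), n - 1))
    return tuple(cell)


lows, highs, bins = [0.0, 0.0], [1.0, 1.0], [10, 10]
a = grid_cell([0.412, 0.731], lows, highs, bins)
b = grid_cell([0.418, 0.739], lows, highs, bins)  # close to the first state
c = grid_cell([0.912, 0.131], lows, highs, bins)  # far from both
print(a, b, c)
```

The hard boundaries also mean that two close states on opposite sides of a cell edge are counted separately, which is the loss of microscopic differences mentioned above.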

[0044] Key point 2: Introducing grid granularity adjustment parameters allows for a dynamic balance between exploration efficiency and accuracy based on state space characteristics.

[0045] As the grid becomes finer, the resolution of state counting increases, leading to a more accurate estimation of the agent's state coverage. This invention introduces a grid granularity adjustment parameter to balance exploration efficiency and accuracy.
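The effect of the granularity parameter can be explored with a small experiment. This sketch assumes 200 uniformly random two-dimensional states in the unit square and the hypothetical helper `occupied_cells`; it counts how many distinct grid cells those states occupy at several granularities.

```python
import math
import random
from typing import List, Sequence


def occupied_cells(states: Sequence[Sequence[float]], low: float,
                   high: float, bins_per_dim: int) -> int:
    """Count the distinct grid cells hit by the given states."""
    width = (high - low) / bins_per_dim
    cells = set()
    for state in states:
        cell = tuple(min(math.floor((x - low) / width), bins_per_dim - 1)
                     for x in state)
        cells.add(cell)
    return len(cells)


rng = random.Random(0)
states: List[List[float]] = [[rng.random(), rng.random()] for _ in range(200)]
for n in (2, 8, 32):
    # A coarse grid merges many states; a fine grid separates most of them.
    print(n, occupied_cells(states, 0.0, 1.0, n))
```

A coarse grid merges more states into each cell and so suppresses more near-duplicates, while a fine grid distinguishes more states at the cost of a larger table of counters.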

[0046] Key point 3: Using a hash table to store grid access records reduces the novelty calculation complexity from the O(n) of RND to O(1).

[0047] The grid access counts only need to be stored in a hash table or array, and the query complexity is O(1). The computational cost is far lower than that of the forward inference required by RND (O(D), where D is the number of network parameters).

Attached Figure Description

[0048] Figure 1 is a flowchart of the method of the present invention.

[0049] Figure 2 is a diagram of the testing framework for an agent based on a Markov decision process model.

[0050] Figure 3 is a system diagram of the present invention.

Detailed Implementation

[0051] The present invention will now be described in further detail with reference to the accompanying drawings. The examples given are only for explaining the present invention and are not intended to limit the scope of the present invention.

[0052] As shown in Figure 1, an optional embodiment of the present invention provides a test case selection method based on Markov decision process state grid clustering, the steps of which include: discretizing each dimension of the continuous Markov decision state space into a series of grid cells and generating an index; mapping the continuous states of the agent to the corresponding grid cells according to the index to obtain the discretized state s; selecting the test cases used in this round of testing, testing the agent, generating a decision sequence and forming the actual state x; recording the number of times the agent visits the state s or state-action pair during the execution process, and obtaining the count value N(s) of the state s and the count value N(s, a) of the state-action pair, where a is the action performed by the agent in the state s; evaluating the novelty reward of state s based on the count value N(s), and the novelty reward of the state-action pair based on the count value N(s, a); and determining, based on the novelty reward, whether the test case meets the conditions. If it does, the agent's actual state x and decision sequence are added to the test case set as a selected test case.

[0053] This invention provides an agent testing framework based on Markov decision process models, as shown in Figure 2. In one optional embodiment, the present invention constructs a dimensionally adjustable discretized representation framework by uniformly dividing the continuous state space into equidistant grid cells. Each grid cell maintains an independent state access counter. When the agent reaches the terminal state, the method first maps its state vector to the corresponding grid coordinates, then calculates the novelty reward based on the historical access frequency of that grid cell, and finally selects more diverse test cases.

[0054] 1. Discretize the state space using a grid. Grid partitioning divides the continuous Markov decision state space into a series of small grid cells using a discretization method. This discretization method is suitable for discretized state representation in reinforcement learning problems, simplifying the complex continuous state space into a finite, enumerable set of states, which helps the algorithm explore the state space and make decisions.

[0055] Suppose the continuous state space S is a d-dimensional space, which can be represented as $S \subseteq \mathbb{R}^d$, where d represents the dimension of the state space. The goal of grid partitioning is to divide this continuous state space into several non-overlapping sub-regions (grid cells), such that each sub-region can be regarded as a discrete state.

[0056] To partition S into a grid, it is divided along each dimension. For example, a one-dimensional space S = [a, b] can be divided into N equally spaced intervals with $\Delta x = (b - a)/N$, where $\Delta x$ is the length of each grid cell. This yields N intervals, and the range of the i-th interval $e_i$ (i = 1, ..., N) is $[a + (i-1)\Delta x,\ a + i\Delta x]$. For a multidimensional space, each dimension is independently partitioned, and then a multidimensional grid is generated. For example, in a two-dimensional state space S = [a1, b1] × [a2, b2], if the two dimensions are divided into N1 and N2 intervals respectively, the total number of grid cells generated is N1 × N2. More generally, for a d-dimensional state space, if the i-th dimension is divided into $N_i$ intervals, then the total number of grid cells is $N_{total} = \prod_{i=1}^{d} N_i$. This transforms the discretization problem of a high-dimensional space into a multi-dimensional discrete grid.

[0057] 2. Determine the range index for each dimension. For each dimension i, the i-th dimension space is represented as $[a_i, b_i]$. Dividing it into $N_i$ grid cells according to the above method gives $\Delta_i = (b_i - a_i)/N_i$, the length of the interval in the i-th dimension.

[0058] Suppose the agent is in a certain state $x = (x_1, x_2, \ldots, x_d)$, where $x_i \in [a_i, b_i]$ is the state value in the i-th dimension. By determining the interval index of the i-th dimension, the state can be mapped to the corresponding discrete grid cell. The interval index $k_i$ of the i-th dimension is calculated as $k_i = \lfloor (x_i - a_i)/\Delta_i \rfloor$, where $\Delta_i = (b_i - a_i)/N_i$ is the length of the interval in the i-th dimension and $\lfloor \cdot \rfloor$ denotes the floor operation, which maps continuous values to the index of discrete intervals.
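A minimal sketch of the per-dimension index computation, assuming the floor formula $k_i = \lfloor (x_i - a_i)/\Delta_i \rfloor$ with $\Delta_i = (b_i - a_i)/N_i$. The function name `interval_index` is hypothetical, and the handling of the upper boundary (assigning $x_i = b_i$ to the last interval) is an assumption that the text does not specify.

```python
import math


def interval_index(x: float, a: float, b: float, n: int) -> int:
    """Return the zero-based interval index of x in [a, b] split into n parts."""
    if n <= 0 or b <= a:
        raise ValueError("need n > 0 and a < b")
    if not a <= x <= b:
        raise ValueError("x lies outside [a, b]")
    delta = (b - a) / n                 # interval length in this dimension
    k = math.floor((x - a) / delta)     # floor maps the value to its interval
    # Assumption: x == b would give k == n, so it is placed in the last interval.
    return min(k, n - 1)


print(interval_index(3.9, 0.0, 10.0, 5))
print(interval_index(10.0, 0.0, 10.0, 5))
```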

[0059] 3. Combine range indexes across all dimensions. The indices of all d dimensions are combined to form a unique discrete state identifier. A common approach is to combine these per-dimension indices into a linear index. For example, for the two-dimensional indices $(k_1, k_2)$, a unique discrete state can be obtained using the following formula: $s = k_1 \cdot N_2 + k_2$. Here, $N_1$ and $N_2$ are the numbers of grid cells in the first and second dimensions. This method can be generalized to higher dimensions. This linear combination method ensures that each discrete state has a unique index, making it easy to store and search in the discretized state space.
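One common way to build such a linear index is row-major (mixed-radix) encoding, sketched below. The ordering of dimensions is an assumption, since the text only requires that the result be unique, and the names `flatten_index` and `unflatten_index` are hypothetical. The inverse function is included only to demonstrate that no two cells share an identifier.

```python
from typing import List, Sequence


def flatten_index(indices: Sequence[int], bins: Sequence[int]) -> int:
    """Combine per-dimension indices into one integer (row-major order)."""
    if len(indices) != len(bins):
        raise ValueError("indices and bins must have the same length")
    flat = 0
    for k, n in zip(indices, bins):
        if not 0 <= k < n:
            raise ValueError("index out of range for its dimension")
        flat = flat * n + k   # for two dimensions this is k1 * N2 + k2
    return flat


def unflatten_index(flat: int, bins: Sequence[int]) -> List[int]:
    """Recover the per-dimension indices from a flat index."""
    indices = []
    for n in reversed(bins):
        flat, k = divmod(flat, n)
        indices.append(k)
    return indices[::-1]


print(flatten_index([2, 3], [4, 5]))
print(unflatten_index(13, [4, 5]))
```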

[0060] 4. Continuous state mapping of an agent. After grid partitioning, the agent's continuous states are mapped to the corresponding discrete grid cells. The agent's actual state is denoted as $x = (x_1, x_2, \ldots, x_d)$, where $x_i \in [a_i, b_i]$ is the actual state value in the i-th dimension. The actual state x can be mapped to the corresponding discrete grid cell by determining the interval index in each dimension and combining the indices of all dimensions. The discretized state is denoted as s, and is represented by the unique discrete state identifier mentioned above, corresponding to a unique index value.

[0061] 5. Agent state counting. State counting is a method for estimating the frequency of occurrence of states or state-action pairs, which helps assess the familiarity and novelty of states during the decision-making process. By recording the number of times an agent visits a state during execution, this method provides useful information for designing and optimizing exploration policies. The basic idea of state counting is to record the number of times each discrete state or state-action pair is visited during execution.

[0062] Suppose the discretized state is s. Then each state s ∈ S corresponds to a count value N(s), which represents the number of times state s is visited. The count of state-action pairs can be similarly represented as N(s, a), where a is the action performed by the agent in state s.

[0063] State count update. Whenever the agent visits a state s, the count value of that state is incremented by 1: N(s) ← N(s) + 1.

[0064] State-action pair count update. When the agent performs action a in state s, the state-action pair count is incremented by 1: N(s, a) ← N(s, a) + 1.

[0065] These two counting methods quantify how frequently an agent has accessed a particular state or state-action pair during past executions. This frequency information is crucial for designing exploration strategies because it helps the agent determine which states or state-action pairs are common and which are rare, thereby better optimizing exploration behavior.
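The two update rules can be sketched with Python's `collections.Counter`, which returns 0 for keys that have never been seen. The class name `VisitCounter` and the sample state identifiers and actions are hypothetical.

```python
from collections import Counter
from typing import Hashable


class VisitCounter:
    """Keeps N(s) and N(s, a) for discretized states and actions."""

    def __init__(self) -> None:
        self.state_counts: Counter = Counter()         # N(s)
        self.state_action_counts: Counter = Counter()  # N(s, a)

    def record(self, s: Hashable, a: Hashable) -> None:
        self.state_counts[s] += 1               # N(s) <- N(s) + 1
        self.state_action_counts[(s, a)] += 1   # N(s, a) <- N(s, a) + 1


counter = VisitCounter()
for s, a in [(13, "left"), (13, "right"), (13, "left"), (7, "left")]:
    counter.record(s, a)
print(counter.state_counts[13], counter.state_action_counts[(13, "left")])
```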

[0066] During the testing of an agent, the grid partitioning and counting of its discretized state vector is computed as follows:

[0067] Algorithm: State Vector Discretization Grid Partitioning Count

Input:
    state vector state = [s_1, s_2, ..., s_d]
    number of partitions for each dimension n_bins = [n_1, n_2, ..., n_d]
    minimum value of each dimension min_state = [min_1, min_2, ..., min_d]
    maximum value of each dimension max_state = [max_1, max_2, ..., max_d]
    grid counting hash table grid
Output:
    updated grid counting hash table grid

function STATECOUNT(state, n_bins, min_state, max_state, grid)
    index ← 0
    for i ← 1 to d do
        Δ_i ← (max_i - min_i) / n_i        // unit length of the i-th dimension
        k_i ← ⌊(s_i - min_i) / Δ_i⌋        // index of the i-th dimension
        index ← index · n_i + k_i          // update the main index
    end for
    grid[index] ← grid[index] + 1          // state counting using a hash table
    return grid
end function
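A runnable Python sketch of the state counting algorithm follows. It makes three assumptions that the pseudocode leaves open: the per-dimension index uses the floor formula from paragraph [0058], the main index is accumulated in row-major order so that distinct cells never collide, and values on or beyond the range limits are clamped to the border cells. Unlike the pseudocode, the function returns the cell index for convenience and updates the hash table in place.

```python
import math
from collections import defaultdict
from typing import DefaultDict, Sequence


def state_count(state: Sequence[float], n_bins: Sequence[int],
                min_state: Sequence[float], max_state: Sequence[float],
                grid: DefaultDict[int, int]) -> int:
    """Increment the visit count of the grid cell containing `state`."""
    index = 0
    for x, n, lo, hi in zip(state, n_bins, min_state, max_state):
        delta = (hi - lo) / n              # unit length of this dimension
        k = math.floor((x - lo) / delta)   # index within this dimension
        k = min(max(k, 0), n - 1)          # assumption: clamp to border cells
        index = index * n + k              # row-major main index
    grid[index] += 1                       # O(1) average-case hash table update
    return index


grid: DefaultDict[int, int] = defaultdict(int)
bins, lows, highs = [4, 4], [0.0, 0.0], [1.0, 1.0]
first = state_count([0.25, 0.75], bins, lows, highs, grid)
second = state_count([0.30, 0.80], bins, lows, highs, grid)  # same cell
third = state_count([0.90, 0.10], bins, lows, highs, grid)   # different cell
print(first, second, third, dict(grid))
```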

[0068] 6. Calculate the novelty reward. In exploration-based algorithms in reinforcement learning, such as policies with novelty rewards, state counting can be used to evaluate the novelty of states or state-action pairs. Typically, the exploration reward is inversely proportional to the state count; that is, the more frequently a state is visited, the lower its novelty reward. This approach encourages the agent to explore less frequently visited states, thus better covering the state space. Therefore, this method defines the calculation function for the novelty reward as a decreasing function of the count N(s), inversely proportional to it, or more generally as a negative power function: $r_{novelty}(s) = N(s)^{-\alpha}$. Here, α is a parameter that controls the decay rate of the novelty reward. As the number of times a state is visited, N(s), increases, the novelty reward $r_{novelty}$ decreases, which guides the agent to shift its attention to other states. This exploratory incentive mechanism can improve the efficiency of algorithms in policy optimization, enabling agents to find novel states more quickly.
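A minimal sketch of a count-based novelty reward, assuming the negative power form $r_{novelty}(s) = N(s)^{-\alpha}$ with $\alpha > 0$ and a count of at least 1 (the state is counted before its reward is evaluated). The function name `novelty_reward` and the default α are hypothetical choices.

```python
def novelty_reward(count: int, alpha: float = 1.0) -> float:
    """Return count ** (-alpha); alpha = 1 gives the inverse-proportional case."""
    if count < 1:
        raise ValueError("count must be at least 1")
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    return float(count) ** (-alpha)


for visits in (1, 2, 4, 8):
    # The reward shrinks as the same cell is visited more often.
    print(visits, novelty_reward(visits), novelty_reward(visits, alpha=0.5))
```

Under these assumptions the reward always lies in (0, 1], and a larger α makes it decay faster.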

[0069] 7. Select efficient test cases. The test environment framework used in this method continuously generates and selects test cases until the test termination condition is met. Before executing the agent test, the test environment framework samples the test environment state and the agent's execution sequence within a pre-set time period to form a test case set. During subsequent agent test execution, the test cases used in this round of testing are first selected, mutated, and assigned to the environment. The agent then makes decisions in the new environment for a finite number of steps, generating a decision sequence and forming the final actual state $x = (x_1, x_2, \ldots, x_d)$, where $x_i \in [a_i, b_i]$ is the state value in the i-th dimension. Then, the novelty value of the corresponding state s in the discrete space is obtained using the above statistical method based on grid clustering. The novelty value is normalized to the interval (0, 1). A higher novelty value indicates a stronger incentive for the agent's exploration. When any reward value reaches the set threshold, the test case is considered to meet the diversity condition, and the agent's actual state x and the decision sequence are selected as a test case and added to the test case set; otherwise, the test case is not selected. By continuously inputting test cases from the test case set into the test framework, the dynamic testing process of the MDP agent is continuously carried out.
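The selection rule can be sketched as follows. The sketch assumes that the final state has already been mapped to a discrete cell identifier, that the reward takes the form $N^{-\alpha}$, and that a test case is kept when either the state reward or the state-action reward is strictly greater than the threshold, as in paragraph [0029]. The class name `NoveltySelector`, the threshold of 0.5, and the sample inputs are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


class NoveltySelector:
    def __init__(self, threshold: float = 0.5, alpha: float = 1.0) -> None:
        self.threshold = threshold
        self.alpha = alpha
        self.state_counts: Counter = Counter()   # N(s)
        self.pair_counts: Counter = Counter()    # N(s, a)
        self.selected: List[Tuple[Sequence[float], Sequence[Hashable]]] = []

    def offer(self, cell: Hashable, last_action: Hashable,
              actual_state: Sequence[float],
              decisions: Sequence[Hashable]) -> bool:
        """Count the visit, then keep the test case if it is novel enough."""
        self.state_counts[cell] += 1
        self.pair_counts[(cell, last_action)] += 1
        r_state = self.state_counts[cell] ** (-self.alpha)
        r_pair = self.pair_counts[(cell, last_action)] ** (-self.alpha)
        if r_state > self.threshold or r_pair > self.threshold:
            self.selected.append((actual_state, decisions))
            return True
        return False


selector = NoveltySelector(threshold=0.5)
kept_first = selector.offer(7, "left", [0.31, 0.82], ["up", "left"])
kept_repeat = selector.offer(7, "left", [0.33, 0.81], ["down", "left"])
kept_new_action = selector.offer(7, "right", [0.32, 0.80], ["up", "right"])
print(kept_first, kept_repeat, kept_new_action, len(selector.selected))
```

In this run the second test case ends in an already visited cell with an already seen action, so both rewards fall to the threshold and it is discarded; the third reaches the same cell through a new action and is kept.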

[0070] As shown in Figure 3, an optional embodiment of the present invention provides a test case selection system based on Markov decision process state grid clustering, characterized in that it includes a grid partitioning module, a mapping module, a testing and counting module, a novelty reward module, and a filtering module. The grid partitioning module is used to discretize each dimension of the continuous Markov decision state space into a series of grid cells and generate an index. The mapping module is used to map the continuous state of the agent to the corresponding grid cell according to the index, so as to obtain the discretized state s. The testing and counting module is used to select test cases for this round of testing, test the agent, generate a decision sequence and form the actual state x; and to record the number of times the agent visits state s or a state-action pair during execution, obtaining the count value N(s) of state s and the count value N(s, a) of the state-action pair, where a is the action performed by the agent in state s. The novelty reward module is used to evaluate the novelty reward of state s based on the count value N(s), and to evaluate the novelty reward of the state-action pair based on the count value N(s, a). The filtering module is used to determine whether the test case meets the conditions based on the novelty reward. If it meets the conditions, the agent's actual state x and decision sequence are added to the test case set as a selected test case.

[0071] An optional embodiment of the present invention provides a server, characterized in that it includes a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program including instructions for performing the method.

[0072] An optional embodiment of the present invention provides a computer-readable storage medium having a computer program stored thereon, characterized in that the computer program implements the method when executed by a processor.

[0073] Although specific embodiments of the invention have been disclosed for illustrative purposes to aid in understanding and implementing the invention, those skilled in the art will understand that various substitutions, variations, and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to the content disclosed in the preferred embodiments, and the scope of protection claimed by the invention is defined by the claims.

Claims

1. A test case selection method based on Markov decision process state grid clustering, comprising the following steps: discretizing each dimension of the continuous Markov decision state space into a series of grid cells and generating an index; mapping the continuous state of the agent to the corresponding grid cell according to the index to obtain the discretized state; selecting the test cases to be used in this round of testing, testing the agent, generating a decision sequence, and forming the actual state; recording the number of times the agent visits a state or state-action pair during execution to obtain the count value of the state and the count value of the state-action pair, and evaluating the novelty reward of the state and the novelty reward of the state-action pair; and determining, based on the novelty reward, whether the test case meets the conditions, and if so, adding the agent's actual state and decision sequence to the test case set as a selected test case.

2. The method according to claim 1, characterized in that the novelty reward is inversely proportional to the count value.

3. The method according to claim 2, characterized in that the novelty reward is calculated using a decreasing function or a negative power function.

4. The method according to claim 1, 2, or 3, characterized in that the test cases to be used in this round of testing are selected from the test case set, mutated, and assigned to the environment; the agent makes decisions in the environment for a finite number of steps to generate the decision sequence and form the actual state.

5. The method according to claim 1, 2, or 3, characterized in that the indexes of each dimension are combined to form a unique discrete state identifier, so that each grid cell has a unique index.

6. The method according to claim 1, 2, or 3, characterized in that the novelty values of the states or state-action pairs visited by the agent during execution are obtained through grid clustering or statistical methods.

7. The method according to claim 1, characterized in that the novelty reward is normalized, and if the novelty reward of a state or the novelty reward of a state-action pair is greater than a set threshold, the current test case is determined to meet the conditions.

8. A test case selection system based on Markov decision process state grid clustering, characterized in that it includes a grid partitioning module, a mapping module, a testing and counting module, a novelty reward module, and a screening module; the grid partitioning module is used to discretize each dimension of the continuous Markov decision state space into a series of grid cells and generate an index; the mapping module is used to map the continuous state of the agent to the corresponding grid cell according to the index, so as to obtain the discretized state; the testing and counting module is used to select the test cases to be used in this round of testing, test the agent, generate a decision sequence, and form the actual state, and to record the number of times the agent visits a state or state-action pair during execution, obtaining the count value of that state and the count value of that state-action pair; the novelty reward module is used to evaluate the novelty reward of the state and the novelty reward of the state-action pair based on the count values; and the screening module is used to determine, based on the novelty reward, whether the test case meets the conditions, and if so, to add the agent's actual state and decision sequence to the test case set as a selected test case.

9. A server, characterized in that it includes a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program including instructions for performing the method of any one of claims 1 to 7.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the method of any one of claims 1 to 7.