Reinforcement learning simulation step control method based on closed-loop adaptive noise injection and entropy increasing optimization strategy

By employing a reinforcement learning simulation step size control method based on closed-loop adaptive noise injection and entropy increase optimization strategies, the problems of low circuit simulation efficiency and poor stability in existing technologies are solved, achieving a more efficient simulation process.

CN120255376B · Active · Publication Date: 2026-06-19 · SOUTHEAST UNIV


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SOUTHEAST UNIV
Filing Date: 2025-03-28
Publication Date: 2026-06-19


IPC: G05B17/02

AI Tagging

Application Domain

Simulator control

Technology Topics

Transient analysis; Algorithm

Technical Efficacy Phrases

Improve exploration ability; Explore efficiency




AI Technical Summary

Technical Problem

Existing reinforcement learning pseudo-transient analysis methods are prone to overfitting, oscillating curves, and have low simulation efficiency and insufficient exploratory power when dealing with large-scale circuits and highly nonlinear circuits.

Method used

A reinforcement learning simulation step size control method based on closed-loop adaptive noise injection and an entropy increase optimization strategy is adopted. The method comprises initialization, step size action type conversion, entropy regularization term calculation, adaptive exploration noise injection, and gradient-based updating of the policy network parameters. Adaptive noise injection and the entropy increase optimization strategy together improve simulation stability and efficiency.

Benefits of technology

It effectively avoids solution curve oscillations and local optima, improves simulation efficiency and robustness, and enhances the exploratory and convergent nature of the algorithm.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a reinforcement learning simulation step size control method based on closed-loop adaptive noise injection and entropy increase optimization strategy, belonging to integrated circuit computer-aided design technology. Specifically, the method involves: first, inputting the circuit netlist and interacting with the simulator using file read / write; next, establishing two new network output layers to convert the deterministic actions of the policy network's output into a probability distribution; then, obtaining the entropy regularization term based on the probability density function and weight coefficients, and adding it to the policy loss function; next, generating Gaussian-distributed exploration noise and adjusting the standard deviation of the noise at each step using a PID controller; finally, adjusting the policy network parameters using a gradient update method and injecting the exploration noise into the output actions to obtain the final time step. Using this invention helps enhance the simulation stability of pseudo-transient analysis, improves simulation efficiency, and provides a new method for DC analysis.


Description

Technical Field

[0001] This invention belongs to the field of computer-aided design technology for integrated circuits, specifically involving a reinforcement learning simulation step size control method based on closed-loop adaptive noise injection and entropy increase optimization strategies.

Background Technology

[0002] DC analysis is crucial in the design and simulation of integrated circuit chips. It not only provides the necessary conditions for transient and AC analysis of circuits but also ensures proper circuit operation, accurately assesses static power consumption, and promptly identifies potential circuit problems. Among various DC analysis methods, pseudo-transient analysis (PTA) has become the most commercially promising DC analysis algorithm due to its stability and ease of implementation.

[0003] Reinforcement learning is a class of methods in which an agent interacts with its environment and learns a policy that maximizes reward or achieves a specific goal. In recent years, the maturity of reinforcement learning technology has provided a new approach to pseudo-transient analysis. The reinforcement-learning time step control method for pseudo-transient analysis exemplified by patent CN202111297554.9 has significantly improved simulation efficiency. However, its dual-agent structure, in which each agent has one policy network and two evaluation networks, makes the algorithm sensitive to the environment and to hyperparameters, so it is prone to overfitting and solution curve oscillations on large-scale circuits and highly nonlinear circuits. Furthermore, because that method simply takes the smallest Q-value estimated by the evaluation networks for an action, its policy exploration is insufficient, resulting in low simulation efficiency.

[0004] To address these issues, this invention proposes a reinforcement learning simulation step size control method based on closed-loop adaptive noise injection and entropy increase optimization strategies. This method improves the stability and efficiency of pseudo-transient analysis simulation through adaptive noise injection and entropy increase optimization strategies.

Summary of the Invention

[0005] The purpose of this invention is to provide a reinforcement learning simulation step size control method based on closed-loop adaptive noise injection and entropy increase optimization strategy, which can not only avoid solution curve oscillation and getting trapped in local optima, but also improve simulation efficiency.

[0006] To achieve the above-mentioned objectives, the present invention adopts the following technical solution: a reinforcement learning simulation step size control method based on closed-loop adaptive noise injection and entropy increase optimization strategy, comprising the following steps:

[0007] S1: Initialization, input circuit netlist information and interact with the simulator using file read / write;

[0008] S2: Step size action type conversion, establish two new network output layers, and convert the deterministic actions of the policy network output into a probability distribution;

[0009] S3: Entropy regularization term calculation: compute the probability density of the action and then the policy entropy, multiply the entropy by the weight coefficient, and add the result to the policy loss function;

[0010] S4: Adaptive exploration noise injection: generate Gaussian-distributed exploration noise and use a PID controller to control the noise standard deviation σ(t+1) of the next action;

[0011] S5: Output the final step size. Use the gradient update method to continuously adjust the policy network parameters and add exploration noise to the output action to obtain the final time step size.

[0012] In step S1 above, the circuit netlist information is first input to obtain the initial circuit flags, including the NR convergence flag f, the PTA convergence flag, the residual δ, the rate of change of the solution γ, and the number of NR iterations. At each step, the algorithm and the simulator read the status information and write the status information respectively, realizing file read and write interaction.
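The file-based exchange in step S1 can be illustrated with a minimal sketch. The patent does not specify the file format, so the JSON layout, the file name, and the field names below are all hypothetical; only the five flags themselves come from the text.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class CircuitStatus:
    """Flags named in step S1 (field names are illustrative assumptions)."""
    nr_converged: bool       # NR convergence flag f
    pta_converged: bool      # PTA convergence flag
    residual: float          # residual delta
    solution_change: float   # rate of change of the solution gamma
    nr_iterations: int       # number of NR iterations


def write_status(path: str, status: CircuitStatus) -> None:
    """Simulator side: write the status, then rename so readers never see a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(asdict(status), fh)
    os.replace(tmp_path, path)


def read_status(path: str) -> CircuitStatus:
    """Algorithm side: read the status written by the simulator."""
    with open(path, "r", encoding="utf-8") as fh:
        return CircuitStatus(**json.load(fh))


def roundtrip_demo() -> CircuitStatus:
    """Write one status record to a temporary file and read it back."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "status.json")  # hypothetical file name
        write_status(path, CircuitStatus(True, False, 1e-3, 0.25, 7))
        return read_status(path)
```

In a real coupling the simulator and the agent would be separate processes taking turns on such a file; the write-then-rename step is one common way to keep each read consistent.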

[0013] In step S2 above, firstly, the flag information is read to characterize the circuit state s; then, a set of mean output layers, policy_mu(x), is established. Based on state s, the mean of the probability distribution of the policy network output action is calculated, and its calculation formula is as follows:

[0014] $$\mathrm{policy\_mu}(s) = W_L h_{L-1} + b_L$$

[0015] where $W_L$ is the weight matrix of the L-th layer of the mean network, $b_L$ is the bias vector, and $h_{L-1}$ represents the output features of the (L-1)-th layer of the neural network. The weight matrix and bias vector are automatically initialized when the layer is instantiated in the PyTorch environment and are automatically updated during training via backpropagation so as to minimize the loss function. After the output is obtained, the tanh activation function is used to restrict its range to (-1, 1), calculated as follows:

[0016] μ = tanh(policy_mu(s))

[0017] where tanh() is the hyperbolic tangent function, used to limit the output range of the mean;

[0018] Establish a network standard deviation output layer, policy_log_std(x), and calculate the log-standard deviation of the probability distribution of the policy network output actions based on the state s. The calculation formula is as follows:

[0019] $$\mathrm{policy\_log\_std}(s) = W'_L h_{L-1} + b'_L$$

[0020] where $W'_L$ is the weight matrix of the L-th layer of the standard deviation network and $b'_L$ is the bias vector. The weight matrix and bias vector are set and updated in the same way as for the mean output layer. The difference is that the log standard deviation does not need to be range-limited using the hyperbolic tangent function.
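A minimal sketch of the two output heads in step S2 follows. It uses plain Python in place of PyTorch linear layers so that it runs without dependencies. The `LinearHead` class, the uniform initialization, the feature vector, and the conversion σ = exp(log std) are assumptions made for illustration and are not taken from the patent.

```python
import math
import random


class LinearHead:
    """Single linear output layer with scalar output: y = W h + b."""

    def __init__(self, in_features: int, rng: random.Random) -> None:
        # Assumed initialization: uniform in [-1/sqrt(n), 1/sqrt(n)].
        bound = 1.0 / math.sqrt(in_features)
        self.weight = [rng.uniform(-bound, bound) for _ in range(in_features)]
        self.bias = rng.uniform(-bound, bound)

    def __call__(self, features):
        return sum(w * x for w, x in zip(self.weight, features)) + self.bias


def policy_distribution(features, policy_mu, policy_log_std):
    """Map hidden features to the (mu, sigma) of a Gaussian over the step size action."""
    mu = math.tanh(policy_mu(features))   # mean restricted to (-1, 1)
    log_std = policy_log_std(features)    # no tanh range limit on this head
    sigma = math.exp(log_std)             # assumption: sigma = exp(log standard deviation)
    return mu, sigma


rng = random.Random(0)
policy_mu = LinearHead(4, rng)
policy_log_std = LinearHead(4, rng)
mu, sigma = policy_distribution([0.5, -1.0, 2.0, 0.1], policy_mu, policy_log_std)
```

Exponentiating the second head keeps the standard deviation positive without any explicit constraint, which is one common reason for predicting the logarithm rather than the standard deviation itself.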

[0021] In step S3 above, the probability density function of the action is first calculated based on the mean and standard deviation of the action. The calculation formula is shown below:

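[0022] $$\pi(a \mid s) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(a-\mu)^2}{2\sigma^2}\right)$$

[0023] where σ is the standard deviation of the step size given by policy_log_std(s); then, the policy entropy is calculated, as shown in the following formula:

[0024] $$H(\pi(\cdot \mid s)) = -\mathbb{E}_{a \sim \pi}\left[\log \pi(a \mid s)\right]$$

[0025] where π() is the policy function;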

[0026] The policy entropy at each step is multiplied by a weight coefficient that decays exponentially with the number of iterations. The resulting entropy regularization term is added to the policy loss function to obtain a new loss function, calculated as follows:

[0027] $$L(\theta) = -\mathbb{E}_{s}\left[Q_{\varphi_1}(s, \pi_\theta(s))\right] - \alpha\, H(\pi_\theta(\cdot \mid s))$$

[0028] The first term represents the negative of the Q-value network expectation, $\varphi_1$ is the Q-network parameter, the second term represents the entropy regularization term under the current policy parameters, α is the temperature coefficient, $\pi_\theta$ is the policy network, and θ denotes the policy network parameters.
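Step S3 can be sketched as follows. The sketch assumes a univariate Gaussian policy, for which the entropy has the closed form ½·ln(2πeσ²), and a loss of the form −E[Q] − α·H, with the sign chosen so that minimizing the loss increases entropy. The initial temperature `alpha0` and the decay rate are hypothetical values; the patent only states that the weight decays exponentially with the iteration count.

```python
import math


def gaussian_log_prob(action, mu, sigma):
    """Log of the Gaussian probability density of an action."""
    return (-0.5 * ((action - mu) / sigma) ** 2
            - math.log(sigma)
            - 0.5 * math.log(2.0 * math.pi))


def gaussian_entropy(sigma):
    """Closed form of -E[log pi(a|s)] for a univariate Gaussian."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)


def temperature(iteration, alpha0=0.2, decay=1e-3):
    """Exponentially decaying entropy weight (alpha0 and decay are assumed values)."""
    return alpha0 * math.exp(-decay * iteration)


def policy_loss(q_values, sigmas, iteration):
    """Batch estimate of -E[Q] - alpha * H over sampled states."""
    n = len(q_values)
    mean_q = sum(q_values) / n
    mean_entropy = sum(gaussian_entropy(s) for s in sigmas) / n
    return -mean_q - temperature(iteration) * mean_entropy
```

With this sign convention a wider action distribution lowers the loss early in training, and the decaying temperature gradually shifts the objective back toward the Q-value term.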

[0029] In step S4 above, first, the advance target reward r_target1 and the rollback target reward r_target2 are set. Then, in each simulation step, whether the current solution meets the NR convergence requirement is judged from the flag information. If NR converges, the difference between the actual reward of the advancing agent and its target reward is used as the error signal e(t). If NR does not converge, the rollback agent generates the error signal e(t) in the same way. Finally, the PID controller outputs the change in noise standard deviation based on the error signal. The formula for PID control is shown below:

[0030] $P(t) = K_p\, e(t)$

[0031] $I(t) = I(t-1) + K_i\, e(t)\, \Delta t, \qquad D(t) = K_d\, \dfrac{e(t) - e(t-1)}{\Delta t}$

[0032] $u(t) = P(t) + I(t) + D(t)$

[0033] where P(t), I(t), and D(t) are the proportional, integral, and derivative terms, used respectively to adjust the system's response speed, eliminate steady-state error, and suppress oscillations during dynamic processes, and u(t) is the output of the PID controller; $K_p$, $K_i$, and $K_d$ are the proportional, integral, and derivative coefficients, e(t) represents the error term input at time t, and Δt is the time interval between time t and t-1. The noise standard deviation for the next step, σ', is obtained by taking the difference between the original noise standard deviation and the PID output. The new exploration noise is Gaussian noise with a mean of 0 and a standard deviation of σ'.
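The closed-loop noise adjustment in step S4 can be sketched with a discrete PID controller. The discrete integral and derivative forms, the update σ' = σ − u(t), the lower bound `sigma_min`, and all gain values are assumptions for illustration; the patent text states only that the next standard deviation is the difference between the original standard deviation and the PID output.

```python
class NoisePID:
    """Discrete PID controller acting on the reward error e(t)."""

    def __init__(self, kp: float, ki: float, kd: float, dt: float = 1.0) -> None:
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0    # running value of I(t)
        self.prev_error = 0.0  # e(t-1), taken as 0 before the first step

    def update(self, error: float) -> float:
        """Return u(t) = P(t) + I(t) + D(t) for the current error."""
        p_term = self.kp * error
        self.integral += self.ki * error * self.dt
        d_term = self.kd * (error - self.prev_error) / self.dt
        self.prev_error = error
        return p_term + self.integral + d_term


def error_signal(actual_reward, nr_converged, r_target1, r_target2):
    """Actual reward minus the advance target (NR converged) or rollback target."""
    target = r_target1 if nr_converged else r_target2
    return actual_reward - target


def next_sigma(sigma, u, sigma_min=1e-6):
    """Assumed update sigma' = sigma - u(t), floored so the result stays positive."""
    return max(sigma - u, sigma_min)


pid = NoisePID(kp=0.5, ki=0.1, kd=0.2)
u_first = pid.update(1.0)   # P = 0.5, I = 0.1, D = 0.2
u_second = pid.update(1.0)  # P = 0.5, I = 0.2, D = 0.0
```

The floor on σ' is not part of the patent text; it is added here because a Gaussian noise source needs a non-negative standard deviation.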

[0034] In step S5 above, first, the objective function is optimized using gradient descent to obtain the policy network parameters θ_i at the i-th update. The formula is as follows:

[0035]

[0036] where the Q term is the predicted Q value of the action output by the policy network. Then, the exploration noise generated in step S4 is injected into the time step a(t) output by the policy network to obtain the noise-injected time step a′(t).
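The noise injection at the end of step S5 can be sketched in a few lines. Clipping the result to [-1, 1] is an assumption motivated by the tanh-limited mean; the patent does not state how out-of-range values are handled, nor how the normalized action maps to a physical time step.

```python
import random


def inject_noise(action, sigma, rng, low=-1.0, high=1.0):
    """Return a'(t) = a(t) + N(0, sigma^2), clipped to [low, high] (clipping is assumed)."""
    noisy = action + rng.gauss(0.0, sigma)
    return min(max(noisy, low), high)


rng = random.Random(2)
noisy_action = inject_noise(0.3, 0.001, rng)
```

A standard deviation of zero returns the policy output unchanged, so the same function covers a purely deterministic evaluation run.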

[0037] An electronic device includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements the reinforcement learning simulation step size control method based on closed-loop adaptive noise injection and entropy increase optimization strategy.

[0038] A computer-readable storage medium storing computer instructions that, when executed by a processor, implement the reinforcement learning simulation step size control method based on closed-loop adaptive noise injection and entropy increase optimization strategy.

[0039] Compared with the prior art, the advantages of the present invention are as follows:

[0040] 1) This invention uses noise injection technology to enhance the exploratory nature of the algorithm, enabling the agent to explore more state-action space more efficiently in the early stages of exploration, and to rely more on the environmental information obtained by the agent in the later stages of simulation, thereby achieving stable convergence and improving simulation efficiency.

[0041] 2) This invention applies adaptive control to the injected noise, so that the noise is adjusted automatically for circuits of different sizes and characteristics. If the solution curve enters an oscillating state, the algorithm adjusts the magnitude of the injected noise according to real-time performance until the simulation breaks away from the oscillation and reconverges, thereby enhancing the robustness of the algorithm and improving simulation efficiency.

[0042] 3) This invention uses an entropy increase optimization strategy, which enables the policy network to focus not only on the estimation of Q-values at the level of loss function and parameter update, but also on a wider range of explorations, thereby improving the efficiency of simulation.

Attached Figure Description

[0043] Figure 1 is a block diagram of the step size control system of the present invention.

[0044] Figure 2 is a flowchart illustrating the actual application of the present invention.

Detailed Implementation

[0045] To enhance understanding of the present invention, further description of the invention is provided below in conjunction with the accompanying drawings and specific embodiments.

[0046] Example: The reinforcement learning simulation step size control method based on closed-loop adaptive noise injection and entropy increase optimization strategy of the present invention is shown in Figure 1. The specific implementation steps include the following:

[0047] S1: Initialization, input circuit netlist information and interact with the simulator using file read / write;

[0048] S2: Step size action type conversion, establish two new network output layers, and convert the deterministic actions of the policy network output into a probability distribution;

[0049] S3: Entropy regularization term calculation: compute the probability density of the action and then the policy entropy, multiply the entropy by the weight coefficient, and add the result to the policy loss function;

[0050] S4: Adaptive exploration noise injection: generate Gaussian-distributed exploration noise and use a PID controller to control the noise standard deviation σ(t+1) of the next action;

[0051] S5: Output the final step size. Use the gradient update method to continuously adjust the policy network parameters and add exploration noise to the output action to obtain the final time step size.

[0052] In step S1 above, the circuit netlist information is first input to obtain the initial circuit flags, including the NR convergence flag f, the PTA convergence flag, the residual δ, the rate of change of the solution γ, and the number of NR iterations. At each step, the algorithm and the simulator read the status information and write the status information respectively, realizing file read and write interaction.

[0053] In step S2 above, firstly, the flag information is read to characterize the circuit state s; then, a set of mean output layers, policy_mu(x), is established. Based on state s, the mean of the probability distribution of the policy network output action is calculated, and its calculation formula is as follows:

[0054] $$\mathrm{policy\_mu}(s) = W_L h_{L-1} + b_L$$

[0055] where $W_L$ is the weight matrix of the L-th layer of the mean network, $b_L$ is the bias vector, and $h_{L-1}$ represents the output features of the (L-1)-th layer of the neural network. The weight matrix and bias vector are automatically initialized when the layer is instantiated in the PyTorch environment and are automatically updated during training via backpropagation so as to minimize the loss function. After the output is obtained, the tanh activation function is used to restrict its range to (-1, 1), calculated as follows:

[0056] μ = tanh(policy_mu(s))

[0057] Here, tanh() is the hyperbolic tangent function, used to limit the output range of the mean. Then, a standard deviation output layer, policy_log_std(x), is established. Based on the state s, the log standard deviation of the probability distribution of the policy network's output actions is calculated as follows:

[0058] $$\mathrm{policy\_log\_std}(s) = W'_L h_{L-1} + b'_L$$

[0059] where $W'_L$ is the weight matrix of the L-th layer of the standard deviation network and $b'_L$ is the bias vector. The weight matrix and bias vector are set and updated in the same way as for the mean output layer. The difference is that the log standard deviation does not need to be range-limited using the hyperbolic tangent function.

[0060] In step S3 above, the probability density function of the action is first calculated based on the mean and standard deviation of the action. The calculation formula is shown below:

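[0061] $$\pi(a \mid s) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(a-\mu)^2}{2\sigma^2}\right)$$

[0062] where σ is the standard deviation of the step size given by policy_log_std(s); then, the policy entropy is calculated, as shown in the following formula:

[0063] $$H(\pi(\cdot \mid s)) = -\mathbb{E}_{a \sim \pi}\left[\log \pi(a \mid s)\right]$$

[0064] where π() is the policy function;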

[0065] The policy entropy at each step is multiplied by a weight coefficient that decays exponentially with the number of iterations. The resulting entropy regularization term is added to the policy loss function to obtain a new loss function, calculated as follows:

[0066] $$L(\theta) = -\mathbb{E}_{s}\left[Q_{\varphi_1}(s, \pi_\theta(s))\right] - \alpha\, H(\pi_\theta(\cdot \mid s))$$

[0067] The first term represents the negative of the Q-value network expectation, $\varphi_1$ is the Q-network parameter, the second term represents the entropy regularization term under the current policy parameters, α is the temperature coefficient, $\pi_\theta$ is the policy network, and θ denotes the policy network parameters.

[0068] In step S4 above, first, the advance target reward r_target1 and the rollback target reward r_target2 are set. Then, in each simulation step, whether the current solution meets the NR convergence requirement is judged from the flag information. If NR converges, the difference between the actual reward of the advancing agent and its target reward is used as the error signal e(t). If NR does not converge, the rollback agent generates the error signal e(t) in the same way. Finally, the PID controller outputs the change in noise standard deviation based on the error signal. The formula for PID control is shown below:

[0069] $P(t) = K_p\, e(t)$

[0070] $I(t) = I(t-1) + K_i\, e(t)\, \Delta t, \qquad D(t) = K_d\, \dfrac{e(t) - e(t-1)}{\Delta t}$

[0071] $u(t) = P(t) + I(t) + D(t)$

[0072] where P(t), I(t), and D(t) are the proportional, integral, and derivative terms, used respectively to adjust the system's response speed, eliminate steady-state error, and suppress oscillations during dynamic processes, and u(t) is the output of the PID controller; $K_p$, $K_i$, and $K_d$ are the proportional, integral, and derivative coefficients, e(t) represents the error term input at time t, and Δt is the time interval between time t and t-1. The noise standard deviation for the next step, σ', is obtained by taking the difference between the original noise standard deviation and the PID output. The new exploration noise is Gaussian noise with a mean of 0 and a standard deviation of σ'.

[0073] In step S5 above, first, the objective function is optimized using gradient descent to obtain the policy network parameters θ_i at the i-th update. The formula is as follows:

[0074]

[0075] where the Q term is the predicted Q value of the action output by the policy network. Then, the exploration noise generated in step S4 is injected into the time step a(t) output by the policy network to obtain the noise-injected time step a′(t).

[0076] In practical circuit applications, the entire process of using pseudo-transient analysis to solve for the DC operating point, including the step size control process, is shown in Figure 2. The algorithm checks each output step size in turn for NR convergence and PTA convergence. If an output step size satisfies convergence, it becomes the final step size. Before the algorithm ends, the state-action information obtained for each step size is stored in the sample pool, and step size control is performed according to the present invention based on whether the simulation advances or rolls back.

[0077] Example 1:

[0078] The experimental dataset contains a circuit netlist of 132 benchmark circuits, including MOS circuits, BJT circuits, oscillator circuits, difficult circuits, and transistor circuits. The training set circuits are used to train the model, while the remaining circuits serve as the test set to evaluate the model's predictive performance on unknown circuits. This embodiment uses benchmark circuits for testing; some test circuit information is shown in Table 1.

[0079] Table 1. Basic information of the reference circuits in the test set.

[0080]

[0081] Table 1 (continued) Basic information of the reference circuits in the test set

[0082]
gm19 | 0 | 1 | 0 | 17
fadd32 | 0 | 25 | 102 | 161
memplus | 0 | 14274 | 7454 | 2865
opampal | 28 | 4 | 0 | 71
voter | 1 | 460 | 4243 | 1708
toronto | 0 | 33 | 0 | 25
suntraction | 0 | 0 | 0 | 10
ring | 0 | 1 | 34 | 18

[0083] This experiment uses the TD3 algorithm for time step control in pseudo-transient analysis. The random seed is set to 2, the learning rate of both the Q network and the policy network is 8e-4, the interval between policy and target network delay updates is 3, the exploration noise scale is 0.001, the evaluation noise scale is 0.0005, and the reward scale is 1.0.

[0084] The present invention is compared with the WSPICE algorithm and the TD3 algorithm, as shown in Table 2. The best performance is highlighted in bold. The method of the present invention achieves the best speed performance in terms of the simulation efficiency index NR iterations. It should be noted that if the simulation does not converge, i.e., the simulation fails, it is indicated by "-" in the table.

[0085] Table 2 Comparison of NR iteration counts for different algorithms in pseudo-transient analysis simulation.

[0086]

[0087]

[0088] Example 2:

[0089] In Example 2, the method is also tested on industrial large-scale transistor circuits. The test circuit information is shown in Table 3.

[0090] Table 3. Basic Information on Industrial Large Scale Transistor Circuits

[0091]
Circuit | Number of resistors | Number of capacitors | Number of MOS devices | Number of diodes | Total number of devices | Number of nodes
CKT1 | 0 | 0 | 4415 | 0 | 4416 | 2209
CKT2 | 18780 | 47933 | 1956 | 0 | 68669 | 6965
CKT3 | 11253 | 8896 | 6163 | 2 | 26274 | 13789
CKT4 | 1443601 | 1803000 | 360903 | 0 | 19834504 | 722570
CKT5 | 0 | 0 | 4002 | 0 | 4003 | 2002
CKT6 | 24000 | 48000 | 16108 | 0 | 88126 | 40057
CKT7 | 259737 | 733169 | 99090 | 127 | 1003023 | 31142
CKT8 | 317038 | 259200 | 9060 | 0 | 578161 | 145569
CKT9 | 3000003 | 0 | 0 | 0 | 3000004 | 20000
CKT10 | 459120 | 251438 | 35859 | 1405 | 747871 | 177775
CKT11 | 33513 | 24922 | 82543 | 121409 | 262827 | 5072
CKT12 | 0 | 0 | 3159136 | 0 | 3159137 | 11887
CKT13 | 57869 | 452761 | 855174 | 4 | 1091079 | 270208
CKT14 | 0 | 0 | 0 | 0 | 20011 | 10006
CKT15 | 622121 | 1948565 | 79967 | 72 | 2650814 | 2481474

[0092] The present invention is compared with the ALPS algorithm and the TD3 algorithm, as shown in Table 4. The best performance is highlighted in bold. Measured by the number of NR iterations, the simulation efficiency index, the method of the present invention achieves the lowest count on most of the test circuits. A "-" in the table indicates that the simulation did not converge.

[0093] Table 4 Comparison of NR iteration counts for different algorithms in pseudo-transient analysis simulation.

[0094]
Model | This invention | ALPS | TD3
CKT1 | 330 | 390 | 6239

[0095] Table 4 (continued) Comparison of NR iteration counts for different algorithms in pseudo-transient analysis simulation

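[0096]
Model | This invention | ALPS | TD3
CKT2 | 95 | 107 | 97
CKT3 | 3899 | 13544 | 12052
CKT4 | 116 | - | 115
CKT5 | 329 | 390 | 8469
CKT6 | 73 | 79 | 99
CKT7 | 841 | 2050 | 2101
CKT8 | 239 | 255 | 276
CKT9 | 39 | 14 | 41
CKT10 | 37 | 32 | 40
CKT11 | 14 | 18 | 14
CKT12 | 25 | 90 | 69
CKT13 | 4496 | - | 5002
CKT14 | 209 | 215 | 245
CKT15 | 161 | 175 | 193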
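The Table 4 figures can be tallied with a short script. The numbers are copied from the table (CKT1 through CKT15), with `None` standing for a "-" entry; the helper name and the rule that a tie counts as best are choices made for this sketch.

```python
# NR iteration counts from Table 4: (this invention, ALPS, TD3); None marks "-".
TABLE4 = {
    "CKT1": (330, 390, 6239),
    "CKT2": (95, 107, 97),
    "CKT3": (3899, 13544, 12052),
    "CKT4": (116, None, 115),
    "CKT5": (329, 390, 8469),
    "CKT6": (73, 79, 99),
    "CKT7": (841, 2050, 2101),
    "CKT8": (239, 255, 276),
    "CKT9": (39, 14, 41),
    "CKT10": (37, 32, 40),
    "CKT11": (14, 18, 14),
    "CKT12": (25, 90, 69),
    "CKT13": (4496, None, 5002),
    "CKT14": (209, 215, 245),
    "CKT15": (161, 175, 193),
}


def best_or_tied(row):
    """True if the first entry is no larger than every other converged entry."""
    ours, *others = row
    return all(ours <= value for value in others if value is not None)


best_circuits = [name for name, row in TABLE4.items() if best_or_tied(row)]
not_best = [name for name in TABLE4 if name not in best_circuits]
alps_failures = [name for name, row in TABLE4.items() if row[1] is None]
```

On these numbers the method has the lowest or tied-lowest count on 12 of the 15 circuits; the exceptions are CKT4, CKT9, and CKT10, and the two "-" entries both fall in the ALPS column.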

[0097] It should be noted that the above embodiments are not intended to limit the scope of protection of the present invention. Equivalent transformations or substitutions made based on the above technical solutions all fall within the scope of protection of the claims of the present invention.

Claims

1. A reinforcement learning simulation step size control method based on closed-loop adaptive noise injection and entropy increase optimization strategy, characterized in that it includes the following steps:
S1: Initialization: input circuit netlist information and interact with the simulator using file read / write;
S2: Step size action type conversion: establish two new network output layers and convert the deterministic actions output by the policy network into a probability distribution;
S3: Entropy regularization term calculation: compute the probability density of the action and then the policy entropy, multiply the entropy by the weight coefficient, and add the result to the policy loss function;
S4: Adaptive exploration noise injection: generate Gaussian-distributed exploration noise and use a PID controller to control the noise standard deviation σ(t+1) of the next action;
S5: Output the final step size: use the gradient update method to continuously adjust the policy network parameters and add exploration noise to the output action to obtain the final time step size.
In step S1, the circuit netlist information is first input to obtain the initial circuit flags, including the NR convergence flag f, the PTA convergence flag, the residual δ, the rate of change of the solution γ, and the number of NR iterations. During each simulation step, the algorithm reads the state information and the simulator writes it, realizing file read and write interaction.
Step S2 is detailed as follows:
S21: First, read the flag information to represent the circuit state s; then, establish a mean output layer policy_mu(x) and calculate the mean of the probability distribution of the policy network output action based on the state s. The calculation formula is as follows: $\mathrm{policy\_mu}(s) = W_L h_{L-1} + b_L$, where $W_L$ is the weight matrix of the L-th layer of the mean network, $b_L$ is the bias vector, and $h_{L-1}$ represents the output features of the (L-1)-th layer of the neural network. The weight matrix and bias vector are automatically initialized when the layer is instantiated in the PyTorch environment and are automatically updated during training via backpropagation so as to minimize the loss function. After the output is obtained, the tanh activation function is used to restrict its range to (-1, 1), calculated as follows: μ = tanh(policy_mu(s)), where tanh() is the hyperbolic tangent function, used to limit the output range of the mean;
S22: Establish the network standard deviation output layer policy_log_std(x). Based on state s, calculate the log standard deviation of the probability distribution of the policy network output actions. The calculation formula is as follows: $\mathrm{policy\_log\_std}(s) = W'_L h_{L-1} + b'_L$, where $W'_L$ is the weight matrix of the L-th layer of the standard deviation network and $b'_L$ is the bias vector. The weight matrix and bias vector are set and updated in the same way as for the mean output layer. The difference is that the log standard deviation does not need to be range-limited using the hyperbolic tangent function.
Step S3 mainly includes the following steps:
S31: Solving for the probability density: calculate the probability density using the mean and standard deviation of the actions, according to the following formula: $\pi(a \mid s) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(a-\mu)^2}{2\sigma^2}\right)$, where σ is the standard deviation of the step size given by policy_log_std(s);
S32: Calculating the policy entropy: based on the definition of entropy and the probability density function, the entropy is calculated as follows: $H(\pi(\cdot \mid s)) = -\mathbb{E}_{a \sim \pi}\left[\log \pi(a \mid s)\right]$, where π() is the policy function;
S33: Loss function update: multiply the policy entropy at each step by a weight coefficient that decays exponentially with the number of iterations, and add the resulting entropy regularization term to the policy loss function to obtain a new loss function. The calculation formula is as follows: $L(\theta) = -\mathbb{E}_{s}\left[Q_{\varphi_1}(s, \pi_\theta(s))\right] - \alpha\, H(\pi_\theta(\cdot \mid s))$. The first term represents the negative of the Q-value network expectation, $\varphi_1$ is the Q-network parameter, the second term represents the entropy regularization term under the current policy parameters, α is the temperature coefficient, $\pi_\theta$ is the policy network, and θ denotes the policy network parameters.
Step S4 mainly includes the following steps:
S41: Generate action noise that follows a Gaussian distribution and inject it into the time step a(t) output by the policy network to obtain the noise-injected time step a′(t);
S42: Set target rewards: set the advance target reward r_target1 and the rollback target reward r_target2;
S43: Convergence judgment: determine whether the current solution satisfies NR convergence based on the flag information;
S44: Error signal generation: if NR converges, the difference between the actual reward of the advancing agent at each step and its target reward is used as the error signal e(t); if NR does not converge, the rollback agent generates the error signal e(t) in the same way;
S45: PID control: the PID controller outputs the change in noise standard deviation based on the error signal, and the difference is taken to obtain the noise standard deviation σ' of the next step. The formula for PID control is shown below: $P(t) = K_p\, e(t)$, $I(t) = I(t-1) + K_i\, e(t)\, \Delta t$, $D(t) = K_d\, \frac{e(t) - e(t-1)}{\Delta t}$, $u(t) = P(t) + I(t) + D(t)$, where P(t), I(t), and D(t) are the proportional, integral, and derivative terms, used respectively to adjust the system's response speed, eliminate steady-state error, and suppress oscillations during dynamic processes, and u(t) is the output of the PID controller; $K_p$, $K_i$, and $K_d$ are the proportional, integral, and derivative coefficients, e(t) represents the error term input at time t, and Δt is the time interval between time t and t-1. The noise standard deviation for the next step, σ', is obtained by taking the difference between the original noise standard deviation and the PID output. The new exploration noise is Gaussian noise with a mean of 0 and a standard deviation of σ'.
In step S5, first, the objective function is optimized using gradient descent to obtain the policy network parameters θ_i at the i-th update, where the update uses the predicted Q value of the action output by the policy network; subsequently, the exploration noise generated in step S4 is injected into the time step a(t) output by the policy network to obtain the noise-injected time step a′(t).

2. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that: When the processor executes the program, it implements the reinforcement learning simulation step size control method based on closed-loop adaptive noise injection and entropy increase optimization strategy as described in claim 1.

3. A computer-readable storage medium storing computer instructions thereon, characterized in that: When executed by the processor, the computer instruction implements the reinforcement learning simulation step size control method as described in claim 1, which is based on closed-loop adaptive noise injection and entropy increase optimization strategy.