A two-stage parameter identification and stimulus generation method and system for SPMe lithium batteries based on reinforcement learning
By proposing a two-stage parameter identification and excitation generation method for SPMe lithium batteries based on reinforcement learning, the problem of complying with multidimensional hard safety constraints in the excitation optimization of lithium-ion batteries is solved. This method achieves zero limit violation control and efficient parameter identification, thereby improving the physical consistency and experimental safety of the battery excitation signal.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANDONG UNIV
- Filing Date
- 2026-04-28
- Publication Date
- 2026-06-30
AI Technical Summary
Existing lithium-ion battery excitation optimization methods have limitations in handling multidimensional hard safety constraints, leading to biased parameter identification results and overspending of experimental budgets. They cannot effectively cover the dynamic characteristics of the battery throughout its entire operating cycle, and existing reinforcement learning methods lack strict adherence to multidimensional hard safety constraints.
A two-stage parameter identification and excitation generation method for SPMe lithium batteries based on reinforcement learning is adopted. A low-fidelity proxy model and a high-fidelity verification model are constructed, constant phase elements are introduced to describe fractional polarization characteristics, and the Lagrange duality mechanism is used to achieve strict compliance with multidimensional hard safety constraints.
It achieves zero-violation control, reduces the voltage violation rate, improves parameter identification accuracy and information output efficiency, covers the low-frequency diffusion and mid-frequency polarization characteristics of the battery, and enhances experimental safety.
Figure CN122113683B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of lithium-ion battery control technology, and in particular to a two-stage parameter identification and excitation generation method and system for SPMe lithium batteries based on reinforcement learning.
Background Technology
[0002] In the intelligent operation and maintenance system of modern new energy vehicle power batteries and electrochemical energy storage systems, real-time sensing of the internal microstate of lithium-ion batteries is a key technology for ensuring system safety boundaries and extending cycle life. Among them, reduced-order electrochemical mechanism models, represented by the Single Particle Model (SPM) and its electrolyte-extended form (SPMe), have become the most widely used model foundation in embedded BMS because they can significantly reduce the computational dimension while retaining core physical processes (such as solid-phase diffusion, liquid-phase transport, and electrochemical reactions). In order to accurately invert key physical parameters that are difficult to measure directly inside the model (such as the volume fraction of positive electrode active material and solid-phase diffusion coefficient) from external observation data (voltage, current, temperature), it is usually necessary to apply a specific current excitation signal to the battery, i.e., optimal experimental design (OED).
[0003] Specifically, most existing excitation optimization methods are based on idealized integer-order differential equation models for policy search. Traditional SPMe models, to meet the real-time requirements of embedded computing, often simplify the complex solid-liquid interface reaction kinetics within porous electrodes to idealized Butler-Volmer equations. This simplification disrupts the inherent integrity of the battery as a complex electrochemical system, neglecting the "long memory" diffusion effect and fractional-order polarization characteristics prevalent in real batteries, caused by uneven distribution of electrode material surface roughness and porosity. When excitation signals generated based on this simplified model are applied to real physical batteries with complex fractional-order kinetics, severe "model mismatch" often occurs. That is, excitation signals that perform well in simulation environments fail to effectively excite dynamic responses in the target frequency band on real batteries, even leading to significant deviations in parameter identification results.
[0004] More seriously, existing technologies have significant limitations in handling multidimensional hard safety constraints during battery operation. The charging and discharging process of lithium-ion batteries is subject to strict physical boundaries such as terminal voltage, state of charge (SOC), current rate, and current change rate. Traditional gradient-based optimization methods or pseudo-random binary sequence (PRBS) methods typically employ static truncation strategies to handle constraints, which directly destroys the dynamic integrity and spectral characteristics of the excitation signal. Emerging reinforcement learning (RL)-based excitation design methods mostly adopt "soft constraint" mechanisms, that is, they only add constraint violations to the objective function as negative rewards with fixed weights. This mechanism cannot provide deterministic safety boundary guarantees to the agent, leading to frequent violations such as overcharging, over-discharging, or high-frequency current surges when exploring high-information extreme conditions (such as steep open-circuit voltage segments in the high SOC range). Furthermore, existing methods lack the ability to globally plan ampere-hour (Ah) throughput and energy consumption throughout the entire experimental cycle. As a result, excessive excitation in the early stages of the experiment often exhausts the allowable test budget prematurely, making it impossible for the identification experiment to cover the dynamic characteristics of the battery throughout its entire operating cycle.
[0005] Therefore, there is an urgent need to develop a novel excitation generation method that deeply integrates constrained reinforcement learning. Such a method can ensure the physical consistency of the policy by introducing a fractional-order model with constant phase elements (CPEs) for verification, while using the Lagrange duality mechanism to strictly adhere to multidimensional hard safety constraints.
Summary of the Invention
[0006] To address the aforementioned problems, this invention provides a two-stage parameter identification and stimulus generation method and system for SPMe lithium batteries based on reinforcement learning.
[0007] In a first aspect, the present invention provides a two-stage parameter identification and stimulus generation method for SPMe lithium batteries based on reinforcement learning, which adopts the following technical solution:
[0008] A two-stage parameter identification and stimulus generation method for SPMe lithium batteries based on reinforcement learning includes:
[0009] Obtain lithium battery parameter information;
[0010] Data preprocessing is performed based on the acquired lithium battery parameter information;
[0011] A two-stage excitation model for SPMe lithium batteries is constructed, including the construction of a low-fidelity proxy model, a composite state space, a safety constraint, and a high-fidelity verification model.
[0012] Agent interaction training is carried out based on the constructed composite state space and safety constraints, including low-fidelity proxy model training, candidate policy parameter set construction, and agent interaction training.
[0013] The trained two-stage excitation model of SPMe lithium battery is used to generate current excitation.
[0014] Furthermore, the construction of the low-fidelity proxy model includes building a battery proxy environment with low computational cost and fast rollout while preserving physical interpretability. The single-particle model with electrolyte extension (SPMe) is used as the low-fidelity proxy model, in which the lithium-ion diffusion process within the positive and negative electrode particles is described by Fick's second law in spherical coordinates:
[0015] $\frac{\partial c_s}{\partial t} = \frac{D_s}{r^2}\frac{\partial}{\partial r}\left(r^2\frac{\partial c_s}{\partial r}\right)$, where $c_s$ is the lithium-ion concentration in the electrode particle, $D_s$ is the solid-phase diffusion coefficient, and $r$ is the radial coordinate of the particle. The boundary conditions are that the flux at the particle center is zero and the surface flux is determined by the interfacial reaction flux, i.e.:
[0016] $\left.\frac{\partial c_s}{\partial r}\right|_{r=0} = 0$; $\left.-D_s\frac{\partial c_s}{\partial r}\right|_{r=R_s} = \frac{j}{a_s F}$, where $R_s$ is the particle radius, $j$ is the interfacial reaction flux density, $a_s$ is the specific surface area, and $F$ is the Faraday constant. To adapt to high-frequency calls in reinforcement learning environments, the solid-phase diffusion model is discretized using finite difference, orthogonal collocation, and moment matching. The electrolyte concentration change is represented using a reduced-order method, retaining its influence on the mid-to-low frequency terminal voltage response. The terminal voltage output is written as the sum of the open-circuit voltage difference, reaction overpotential, electrolyte polarization term, and ohmic voltage drop:
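[0017] $V = U_p\left(c_{s,p}^{\mathrm{surf}}\right) - U_n\left(c_{s,n}^{\mathrm{surf}}\right) + \eta_p - \eta_n + \Delta\phi_e + R_\Omega I$;
[0018] where $c_{s,p}^{\mathrm{surf}}$ and $c_{s,n}^{\mathrm{surf}}$ are the surface concentrations of the positive and negative electrode particles, $U_p$ and $U_n$ are the corresponding open-circuit potentials, $\eta_p$ and $\eta_n$ are the reaction overpotentials of the two electrodes, $\Delta\phi_e$ is the concentration polarization term of the electrolyte, and $R_\Omega$ is the equivalent ohmic resistance, with the sign of the ohmic term following the adopted current sign convention. Considering the cost of directly evaluating the Butler–Volmer relation in high-frequency training, a piecewise linear approximation is made near the current operating point to obtain a state-space expression suitable for discrete-time iteration: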
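[0019] $x_{k+1} = f_L(x_k, I_k, \theta) + w_k$
[0020] $V_k = h_L(x_k, I_k, \theta) + v_k$
[0021] The subscript $L$ indicates the low-fidelity model, $x_k$ is the internal state vector, $\theta$ is the parameter vector to be identified, and $w_k$ and $v_k$ are the process noise and measurement noise, respectively. The state vector is written as follows: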
[0022] ;
[0023] where $x_k$ represents the internal state vector of the model; $\theta$ represents the vector of parameters to be identified; $s_k$ represents the state vector in the reinforcement learning environment; $a_k$ represents the normalized action output by the policy network; $r_k$ represents the immediate reward; $c_k$ represents the safety cost vector; $\lambda$ represents the Lagrange dual variable vector; $J_k$ represents the Jacobian, i.e., the sensitivity of the output to the parameters; $F_k$ represents the Fisher information matrix; $Q_k$ represents the cumulative ampere-hour throughput; and $E_k$ represents the cumulative energy consumption.
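As an illustration of the finite-difference option mentioned above, the following is a minimal sketch of one explicit finite-volume step of spherical Fickian diffusion. It is not the discretization used in the patent; the function name `diffuse_step`, the shell layout, and the explicit time stepping are assumptions made for the example.

```python
def diffuse_step(c, D, R, flux_out, dt):
    """One explicit finite-volume step of spherical Fickian diffusion.

    c        : shell-averaged concentrations, ordered centre to surface
    D        : solid-phase diffusion coefficient (m^2/s)
    R        : particle radius (m)
    flux_out : molar flux leaving the particle surface (mol/m^2/s)
    dt       : time step (s); keep it well below dr**2 / (3 * D) for stability
    """
    n = len(c)
    dr = R / n
    # Face radii r_0 = 0 ... r_n = R. The centre face has zero area, so the
    # zero-flux symmetry condition at r = 0 is satisfied automatically.
    faces = [i * dr for i in range(n + 1)]
    # Diffusive flux (positive outward) through each interior face.
    flux = [0.0] * (n + 1)
    for i in range(1, n):
        flux[i] = -D * (c[i] - c[i - 1]) / dr
    flux[n] = flux_out  # surface boundary condition
    new_c = []
    for i in range(n):
        # Shell volume and net inflow, both divided by the common factor 4*pi.
        vol = (faces[i + 1] ** 3 - faces[i] ** 3) / 3.0
        net_in = faces[i] ** 2 * flux[i] - faces[i + 1] ** 2 * flux[i + 1]
        new_c.append(c[i] + dt * net_in / vol)
    return new_c
```

Because the interior fluxes cancel between neighbouring shells, the scheme conserves total lithium content exactly when the surface flux is zero, which is a useful sanity check for any reduced-order replacement.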
[0024] Furthermore, the construction of the composite state space includes transforming the battery excitation design problem into a constrained Markov decision process and constructing a composite state vector for direct use by the reinforcement learning network. By explicitly introducing information sensitivity and budget features into the state, an active decision-making capability for parameter identification tasks is formed. First, the state vector is defined as $s_k = [\mathrm{SOC}_k, V_k, I_{k-1}, \tilde{Q}_k, \tilde{E}_k, d_{V,k}, d_{\mathrm{SOC},k}, \tilde{T}_k, \phi_k]^\top$, where $\mathrm{SOC}_k$ is the current state of charge; $V_k$ is the current terminal voltage; $I_{k-1}$ is the input current at the previous moment, used to constrain the rate of change of the action at the current moment; $\tilde{Q}_k$ and $\tilde{E}_k$ are the normalized cumulative ampere-hour consumption and cumulative energy consumption, respectively; $d_{V,k}$ and $d_{\mathrm{SOC},k}$ are the normalized distances from the state to the voltage boundary and the SOC boundary, respectively; and $\tilde{T}_k$ is the normalized temperature state. The information sensitivity state $\phi_k$ is defined as the squared norm of the first-order sensitivity of the output voltage with respect to the parameters to be identified: $\phi_k = \left\|\partial V_k/\partial\theta\right\|_2^2$. When a single parameter is to be identified, it reduces to $\phi_k = \left(\partial V_k/\partial\theta\right)^2$. When multiple parameters are to be identified, $\phi_k$ reflects the overall identifiability of all target parameters under the current operating conditions. The budget features are composed of the accumulated quantities $Q_k$ and $E_k$, whose normalized forms are $\tilde{Q}_k = Q_k/Q_{\max}$ and $\tilde{E}_k = E_k/E_{\max}$.
[0025] The boundary distance state is defined as follows:
[0026]
[0027]
[0028] Finally, the control current is generated from the normalized action through action mapping, using the following saturation limiting relationship:
[0029]
[0030] where the saturation operator denotes the amplitude-limiting function.
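The composite state and the saturation mapping described above can be sketched as follows. This is only an illustration: the exact boundary-distance and saturation formulas are not recoverable from the text, so `boundary_distance`, the temperature normalization by the upper limit, the dictionary keys in `lim`, and the combined amplitude and rate clipping in `map_action` are all assumptions.

```python
def clip(x, lo, hi):
    """Amplitude-limiting (saturation) function."""
    return max(lo, min(hi, x))


def boundary_distance(value, lower, upper):
    """Distance to the nearest bound, scaled by the window width (assumed form)."""
    return min(value - lower, upper - value) / (upper - lower)


def build_state(soc, v, i_prev, q_ah, e_wh, temp, sensitivity, lim):
    """Assemble the composite state vector from measurements and budgets."""
    return [
        soc,
        v,
        i_prev,
        q_ah / lim["q_max"],                      # normalized Ah budget use
        e_wh / lim["e_max"],                      # normalized energy budget use
        boundary_distance(v, lim["v_min"], lim["v_max"]),
        boundary_distance(soc, lim["soc_min"], lim["soc_max"]),
        temp / lim["t_max"],                      # assumed normalization
        sum(g * g for g in sensitivity),          # squared sensitivity norm
    ]


def map_action(a, i_prev, i_max, di_max):
    """Map a normalized action in [-1, 1] to a current that respects both
    the amplitude limit and the per-step change-rate limit."""
    target = clip(a, -1.0, 1.0) * i_max
    lo = max(-i_max, i_prev - di_max)
    hi = min(i_max, i_prev + di_max)
    return clip(target, lo, hi)
```

Feeding the previous current into both the state and the mapping is what lets the rate-of-change limit be enforced at the actuator rather than only penalized after the fact.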
[0031] Furthermore, the construction of the safety constraints includes transforming the multidimensional physical boundaries, budget boundaries, and device boundaries during battery operation into constraints and safety cost functions that can be handled by reinforcement learning. The safety constraints include full-cycle budget boundaries for ampere-hour throughput and energy consumption. The battery operation constraint domain is defined as...
[0032]
[0033] Then, the degree of violation of each constraint is mapped to a non-negative safety cost component, and the safety cost vector is defined as $c_k = [c_k^{V}, c_k^{\mathrm{SOC}}, c_k^{Q}, c_k^{E}, c_k^{\Delta I}, c_k^{T}]^\top$. The voltage cost is:
[0034]
[0035] SOC cost is:
[0036] The budgeted cost for ampere-hours is:
[0037] The energy budget cost is:
[0038] The cost of the rate of change of current is:
[0039] The cost of temperature is: Finally, to fully reflect budget expenditures, the budget variables are updated recursively: This yields cost signals.
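One simple way to map constraint violations to non-negative cost components is a hinge on the amount of overshoot. The sketch below assumes that form; the actual cost expressions of the patent are not recoverable from the text, and the function names and the keys of `lim` are hypothetical.

```python
def hinge(x):
    """Zero inside the allowed region, equal to the overshoot outside it."""
    return max(0.0, x)


def safety_costs(v, soc, q_ah, e_wh, i_now, i_prev, temp, lim):
    """Return the six cost components in the order listed in the text:
    voltage, SOC, Ah budget, energy budget, current change rate, temperature."""
    return [
        hinge(v - lim["v_max"]) + hinge(lim["v_min"] - v),
        hinge(soc - lim["soc_max"]) + hinge(lim["soc_min"] - soc),
        hinge(q_ah - lim["q_max"]),
        hinge(e_wh - lim["e_max"]),
        hinge(abs(i_now - i_prev) - lim["di_max"]),
        hinge(temp - lim["t_max"]),
    ]
```

Every component is zero while the corresponding bound is respected, so the cost vector vanishes along any fully compliant trajectory.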
[0040] Furthermore, the construction of the high-fidelity verification model includes constructing a high-fidelity verification model to compensate for the insufficient description of the complex interface dynamics of real batteries by the low-fidelity proxy model. First, a constant-phase element CPE is introduced on the basis of the SPMe main framework to describe the fractional polarization characteristics exhibited by the porous electrode-electrolyte interface in the mid-to-low frequency region. The terminal voltage in the high-fidelity model is expressed as:
[0041]
[0042] where $V_{\mathrm{CPE}}$ is the additional fractional polarization voltage introduced by the CPE. The frequency-domain impedance of the CPE is expressed as $Z_{\mathrm{CPE}}(j\omega) = \dfrac{1}{C_\alpha (j\omega)^{\alpha}}$, where $C_\alpha$ is the generalized capacitance parameter and $\alpha$ is the fractional order. When $\alpha$ is less than 1, it reflects the diffuse polarization behavior and long-memory effect of the real interface. In the time-domain implementation, the Caputo fractional derivative is used to describe the dynamics of the fractional internal state, namely:
[0043] To address the computational complexity of directly solving fractional derivatives, the memory kernel is discretized using a sum-of-exponentials approximation. This introduces a set of auxiliary states, from which the fractional polarization term is approximated as a weighted sum. Ultimately, this enhances the physical realism of the model while controlling computational costs.
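The sum-of-exponentials idea can be illustrated for a pure CPE, whose voltage is the order-$\alpha$ fractional integral of the current divided by the generalized capacitance. The sketch below assumes that pure-CPE form and a log-spaced quadrature of the identity $t^{\alpha-1}/\Gamma(\alpha) = \frac{\sin(\alpha\pi)}{\pi}\int_0^\infty s^{-\alpha}e^{-st}\,ds$; the node range, node count, and function names are choices made for the example, not values from the patent.

```python
import math


def soe_nodes(alpha, s_min=1e-8, s_max=1e4, m=60):
    """Rates s_i and weights w_i such that
    t**(alpha - 1) / Gamma(alpha) ~ sum_i w_i * exp(-s_i * t)."""
    h = (math.log(s_max) - math.log(s_min)) / (m - 1)
    rates = [s_min * math.exp(i * h) for i in range(m)]
    scale = math.sin(alpha * math.pi) / math.pi
    # Rectangle rule in u = log(s): ds = s du, so the integrand is s**(1 - alpha).
    weights = [scale * s ** (1.0 - alpha) * h for s in rates]
    return rates, weights


def cpe_step(z, current, dt, rates):
    """Advance the auxiliary states dz_i/dt = -s_i * z_i + I over one step,
    exactly for a current held constant during the step."""
    return [zi * math.exp(-s * dt) - math.expm1(-s * dt) / s * current
            for zi, s in zip(z, rates)]


def cpe_voltage(z, weights, c_alpha):
    """Fractional polarization voltage as a weighted sum of auxiliary states."""
    return sum(w * zi for w, zi in zip(weights, z)) / c_alpha
```

Each auxiliary state is an ordinary first-order lag, so the memory of the whole current history is carried by a fixed number of states. Orders close to 1 need a much smaller `s_min` to capture the slow tail of the kernel.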
[0044] Furthermore, the training of the low-fidelity proxy model includes employing the Lagrangian-SAC algorithm, i.e., a maximum-entropy soft actor-critic algorithm based on the Lagrange multiplier method, with the parameter identification information as the core of policy optimization. Let the sensitivity Jacobian of the output voltage with respect to the parameter vector $\theta$ to be identified be $J_k = \partial V_k/\partial\theta$, and let the measurement noise covariance matrix be $\Sigma_v$. Then the single-step Fisher information matrix is $F_k = J_k^\top \Sigma_v^{-1} J_k$, and the instant reward is defined as the trace of this matrix: $r_k = \mathrm{tr}(F_k)$. When a single parameter is to be identified, the reward reduces to the normalized squared sensitivity; when multiple parameters are to be identified, the reward represents the contribution of the current operating condition to the overall information content of the joint multi-parameter identification. Taking the safety cost vector $c_k$ into account, the following Lagrangian maximum-entropy objective is constructed:
[0045]
[0046] where $\pi$ is the Actor policy network, $\lambda$ is the dual variable vector, $\gamma$ is the discount factor, $\alpha_H$ is the entropy weighting coefficient, and $\mathcal{H}$ represents the entropy of the action distribution. An Actor network, two Critic networks, and the corresponding target networks are initialized, and an experience replay pool is established. At each sampling time, the Actor receives the current state $s_k$ and outputs the normalized action $a_k$, and the actual current $I_k$ is obtained after action mapping. The low-fidelity model is rolled forward with $I_k$ to obtain the new internal state, the voltage output, and the composite state $s_{k+1}$ at the next moment. The instant reward $r_k$ and the safety cost $c_k$ are calculated accordingly, and the transition sample $(s_k, a_k, r_k, c_k, s_{k+1})$ is stored in the experience replay pool. A mini-batch is then randomly sampled from the experience replay pool to update the Critic networks, the Actor network, and the dual variables, respectively. The Lagrange dual variables are updated by projected gradient ascent: $\lambda_i \leftarrow \left[\lambda_i + \eta_\lambda\left(\mathbb{E}[c_i] - d_i\right)\right]_+$, where $\eta_\lambda$ is the dual learning rate, $d_i$ is the upper bound on the expected cost allowed for the $i$-th class of constraints, and $[\cdot]_+$ denotes projection onto the non-negative interval.
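The reward and dual-update steps can be sketched for a scalar voltage output as follows. This is a simplified illustration, not the training code of the patent: the central-difference sensitivity, the scalar noise variance, and the function names are assumptions, and the Actor and Critic updates of SAC are omitted.

```python
def sensitivity(voltage_fn, theta, rel_step=1e-4):
    """Central-difference Jacobian of a scalar voltage model w.r.t. theta.
    voltage_fn is a hypothetical callable mapping a parameter list to a voltage."""
    jac = []
    for i, th in enumerate(theta):
        h = rel_step * abs(th) if th != 0 else rel_step
        up = list(theta)
        dn = list(theta)
        up[i] = th + h
        dn[i] = th - h
        jac.append((voltage_fn(up) - voltage_fn(dn)) / (2.0 * h))
    return jac


def fisher_reward(jac, noise_var):
    """Trace of J^T Sigma^-1 J for a scalar output with noise variance Sigma."""
    return sum(g * g for g in jac) / noise_var


def dual_update(lam, mean_costs, limits, lr):
    """Projected gradient ascent on the dual variables: each multiplier grows
    while its expected cost exceeds the allowed bound and never goes negative."""
    return [max(0.0, l + lr * (c - d))
            for l, c, d in zip(lam, mean_costs, limits)]
```

A multiplier that keeps growing signals a constraint the current policy still violates on average, which is how the penalty weight adapts instead of staying fixed as in soft-constraint schemes.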
[0047] Furthermore, the construction of the candidate policy parameter set includes first freezing the Actor network parameters, Critic network parameters, and Lagrange dual variables obtained from training. Among these, the policy network parameters are what actually determine the subsequent excitation trajectory. To avoid relying solely on a single training result for high-fidelity verification, a rollout with no or weak exploration noise is first performed in the low-fidelity proxy environment, and performance metrics such as cumulative reward, cumulative Fisher information, information gain per ampere-hour, and probability of exceeding limits are calculated.
[0048] where $\varepsilon$ is a small positive number that prevents a zero denominator in the information gain per ampere-hour. Subsequently, a candidate strategy set is formed based on the evaluation results: a strategy is included in the candidate strategy set when its probability of exceeding limits in the low-fidelity environment is not higher than a threshold and, at the same time, its information gain per unit ampere-hour is not lower than a threshold. Among them, the Actor parameters with the best performance in the most recent training rounds are retained, and they are packaged together with the performance description corresponding to each candidate policy to form a candidate information packet:
[0049]
[0050] where the violation term represents an estimate of the probability of violation in the low-fidelity environment.
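The candidate screening described above can be sketched as a simple filter over rollout summaries. The record layout (`fisher_traces`, `ah_used`, `violation_prob`) and the threshold names are hypothetical, chosen only to mirror the quantities named in the text.

```python
def info_gain_per_ah(fisher_traces, ah_used, eps=1e-9):
    """Cumulative Fisher information divided by ampere-hours consumed;
    eps guards against a zero denominator."""
    return sum(fisher_traces) / (ah_used + eps)


def select_candidates(rollouts, p_max, gain_min):
    """Keep policies whose violation probability is at most p_max and whose
    information gain per ampere-hour is at least gain_min."""
    kept = []
    for r in rollouts:
        gain = info_gain_per_ah(r["fisher_traces"], r["ah_used"])
        if r["violation_prob"] <= p_max and gain >= gain_min:
            # Package the policy name together with its performance summary.
            kept.append({"name": r["name"],
                         "gain_per_ah": gain,
                         "violation_prob": r["violation_prob"]})
    return kept
```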
[0051] Furthermore, the agent interaction training includes, to improve the robustness of the policy to model uncertainty, preferentially applying bounded random perturbations to the parameters to be identified during the rapid training phase, i.e.:
[0052] where $\mathcal{U}$ denotes the uniform distribution and $\delta$ is the parameter perturbation amplitude. After the rapid training reaches a preset number of rounds, the current policy parameters are frozen and deployed to the high-fidelity model for offline rollout verification. During verification, the initial SOC, initial temperature, budget upper limits, and boundary parameters are kept consistent with those in the training phase. The voltage trajectory, SOC trajectory, temperature trajectory, ampere-hour consumption, energy consumption, and Fisher information accumulation under each candidate policy are recorded throughout the entire cycle, and the information gain per unit ampere-hour and the information gain per unit energy of the policy are calculated. To ensure that the final strategy can be directly deployed in the actual system, a zero-tolerance screening criterion is adopted: if a candidate strategy exceeds any hard constraint limit at any point during the high-fidelity offline verification, it is immediately removed from the candidate set and is not included in the final feasible set. Only a strategy that satisfies all hard boundaries throughout the entire verification cycle is considered feasible and included in the feasible set:
[0053]
[0054] Within the feasible set, the optimal strategy is selected based on the principle of maximizing information gain per unit ampere-hour:
[0055] .
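The zero-tolerance screening and the final selection can be sketched as follows. The trajectory format (a list of voltage, SOC, temperature samples) and the field names are assumptions made for the example; a real verification would also check the current-rate and budget bounds.

```python
def within_limits(trajectory, lim):
    """True only if every sample of the trajectory satisfies all hard bounds."""
    return all(
        lim["v_min"] <= v <= lim["v_max"]
        and lim["soc_min"] <= soc <= lim["soc_max"]
        and temp <= lim["t_max"]
        for v, soc, temp in trajectory
    )


def select_policy(candidates, lim):
    """Discard any candidate with a single violation, then pick the feasible
    candidate with the highest information gain per ampere-hour."""
    feasible = [c for c in candidates if within_limits(c["trajectory"], lim)]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c["gain_per_ah"])
```

Note that a candidate with the highest information gain is still rejected if it crosses a bound even once, so safety acts as a filter before efficiency is compared.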
[0056] Furthermore, the process of generating current excitation using the trained SPMe lithium battery two-stage excitation model includes reading the optimal strategy parameters and performing a complete rollout without exploration noise to obtain the optimal input excitation sequence within the target time horizon. To improve deployment safety, the voltage upper limit used in the training phase is correspondingly tightened to an engineering protection limit, and the temperature upper limit is tightened as well, forming a dual-layer protection structure for training and execution. The final input excitation sequence is sent to the battery testing platform or embedded BMS for execution via CAN bus, Ethernet, or serial port. During execution, voltage, current, temperature, estimated SOC, and budgeted energy consumption are collected continuously in real time, and the system checks whether the secondary safety protection conditions are triggered. The excitation output is terminated immediately when any of the following conditions is met:
[0057]
[0058] After the entire stimulus execution is completed, the data from the entire process is compiled into a parameter identification dataset:
[0059] .
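The execution-time protection layer can be sketched as a guarded loop that stops at the first trip of a protection condition. The `measure` callback stands in for the test platform or BMS interface and is hypothetical, as are the keys of `protect`; the actual termination conditions of the patent are not recoverable from the text.

```python
def run_guarded(sequence, measure, protect):
    """Apply the excitation sequence step by step and stop at the first trip.

    sequence : planned currents (A)
    measure  : hypothetical callback, measure(i_cmd) -> (voltage, temp, soc)
    protect  : tightened execution-time limits
    Returns the recorded log and the index of the trip (None if completed).
    """
    log = []
    for k, i_cmd in enumerate(sequence):
        v, temp, soc = measure(i_cmd)
        log.append((k, i_cmd, v, temp, soc))
        tripped = (
            v > protect["v_max"] or v < protect["v_min"]
            or temp > protect["t_max"]
            or soc > protect["soc_max"] or soc < protect["soc_min"]
        )
        if tripped:
            return log, k
    return log, None
```

The samples recorded up to the trip remain available, so a truncated run can still contribute to the identification dataset.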
[0060] In a second aspect, the present invention provides a two-stage parameter identification and stimulus generation system for SPMe lithium batteries based on reinforcement learning, which includes:
[0061] The data acquisition module is configured to acquire lithium battery parameter information;
[0062] The preprocessing module is configured to perform data preprocessing based on the acquired lithium battery parameter information;
[0063] The model building module is configured to build a two-stage excitation model for SPMe lithium batteries, including low-fidelity proxy model building, composite state space building, safety constraint building, and high-fidelity verification model building.
[0064] The model training module is configured to carry out agent interaction training based on the constructed composite state space and safety constraints, including low-fidelity proxy model training, candidate policy parameter set construction, and agent interaction training.
[0065] The excitation module is configured to generate current excitation using a trained two-stage excitation model for SPMe lithium batteries.
[0066] In a third aspect, the present invention provides a computer-readable storage medium storing a plurality of instructions, the instructions being adapted to be loaded by a processor of a terminal device to execute the aforementioned reinforcement learning-based two-stage parameter identification and stimulus generation method for SPMe lithium batteries.
[0067] In a fourth aspect, the present invention provides a terminal device, including a processor and a computer-readable storage medium, wherein the processor is used to implement the instructions, and the computer-readable storage medium is used to store a plurality of instructions, the instructions being adapted to be loaded by the processor to execute the aforementioned reinforcement learning-based two-stage parameter identification and stimulus generation method for SPMe lithium batteries.
[0068] In summary, the present invention has the following beneficial technical effects:
[0069] Compared with existing technologies (such as the traditional OED method and the Soft-SAC method based on fixed penalty weights), the present invention has the following substantial technical effects:
[0070] First, it achieves "zero violation" control under multi-dimensional hard safety constraints. Existing reinforcement learning methods based on soft constraints treat constraints as fixed negative rewards. When facing high-information-reward regions (usually near voltage boundaries), the agent tends to sacrifice safety for high rewards, resulting in voltage violation rates exceeding 85%. This invention constructs a Lagrange dual update mechanism, causing the penalty weights to grow exponentially as they approach the safety boundary, forming a mandatory "probabilistic barrier." Experimental data confirms that this mechanism reduces the violation rate to 0% (i.e., zero violation) under multi-dimensional constraints such as voltage, SOC, and current change rate, fundamentally ensuring experimental safety.
[0071] Second, this invention solves the frequency domain response mismatch problem caused by the reduced-order model. The traditional SPMe model cannot describe the fractional-order dynamic characteristics of porous electrodes. This invention introduces a CPE-based fractional-order impedance model in the offline verification stage to screen the generated excitation signal for physical consistency. Experiments show that the excitation signal screened by this invention has a root mean square error (RMSE) of only 0.68% in real battery parameter identification, far lower than the 4%-5% of traditional methods, effectively covering the low-frequency diffusion and mid-frequency polarization characteristics of the battery.
[0072] Third, it improves the efficiency of information output under limited experimental budgets. Addressing the problem that existing methods lack full-cycle planning and thus overspend the budget in the early stages of experiments, this invention explicitly introduces a budget state vector into the state space and uses "information gain per unit ampere-hour" as the final screening criterion. Compared with standard constant current (CC) and dynamic stress test (DST) conditions, the strategy generated by this invention increases the accumulated Fisher information by more than 2 times under the same ampere-hour consumption, which significantly improves the time economy of parameter identification.
Attached Figure Description
[0073] Figure 1 is a schematic diagram of the two-stage parameter identification and stimulus generation method for SPMe lithium batteries based on reinforcement learning, according to Embodiment 1 of the present invention;
[0074] Figure 2 is a schematic diagram of the system structure of Embodiment 1 of the present invention;
[0075] Figure 3 is a flowchart of the Lagrangian constrained reinforcement learning algorithm with safety constraints according to Embodiment 1 of the present invention;
[0076] Figure 4 is the safety boundary compliance verification diagram of Embodiment 1 of the present invention;
[0077] Figure 5 is a comparison chart of information output efficiency in Embodiment 1 of the present invention;
[0078] Figure 6 is a comparison chart of the identification accuracy of Embodiment 1 of the present invention.
Detailed Implementation
[0079] The present invention will be further described in detail below with reference to the accompanying drawings.
[0080] Example 1
[0081] Referring to Figure 1, the two-stage parameter identification and stimulus generation method for SPMe lithium batteries based on reinforcement learning of this embodiment includes:
[0082] To ensure consistency in subsequent symbol usage, the main symbols used throughout the text are defined using a unified convention. Let the discrete-time index be $k$ and the sampling and control period be $\Delta t$; $I_k$ denotes the control current applied to the battery at the $k$-th sampling time; $V_k$ denotes the battery terminal voltage; $T_k$ denotes the battery temperature; $\mathrm{SOC}_k$ denotes the battery's state of charge; $x_k$ represents the internal state vector of the model; $\theta$ represents the vector of parameters to be identified; $s_k$ represents the state vector in the reinforcement learning environment; $a_k$ represents the normalized action output by the policy network; $r_k$ represents the immediate reward; $c_k$ represents the safety cost vector; $\lambda$ represents the Lagrange dual variable vector; $J_k$ represents the Jacobian, i.e., the sensitivity of the output to the parameters; $F_k$ represents the Fisher information matrix; $Q_k$ represents the cumulative ampere-hour throughput; and $E_k$ represents the cumulative energy consumption. If a specific battery system defines current as positive on charge or positive on discharge, the corresponding substitution can be made during implementation, but the convention should be kept consistent throughout the same embodiment.
[0083] Step 1: Data Acquisition
[0084] This step establishes the input foundation for the entire method. Its purpose is not simply to collect a few raw measurements, but to uniformly organize all the data required for subsequent model construction, reinforcement learning training, constraint determination, and experimental execution. First, the battery's static parameters, operating status, and constraint boundary information are read through the battery management system, experimental workstation, host computer control software, or database interface. The static parameters include at least the rated capacity, nominal voltage, open-circuit voltage curve, positive and negative electrode geometry, active material parameters, electrolyte characteristic parameters, ohmic internal resistance reference value, and fundamental constants related to SPMe modeling. The initial values or ranges of the parameters to be identified are also read in this step, and the parameter vector is written as follows:
[0085] $\theta = [\varepsilon_{s,p},\, D_{s,p},\, \ldots]^\top$
[0086] where $\varepsilon_{s,p}$ and $D_{s,p}$ represent the initial estimates of the volume fraction of the positive electrode active material and the positive electrode solid-phase diffusion coefficient, respectively. The remaining parameters can be selected or omitted depending on the specific identification task.
[0087] Regarding real-time data, this step continuously acquires the terminal voltage $V_k$, current $I_k$, temperature $T_k$, and timestamp $t_k$ according to a preset sampling period, together with the state-of-charge estimate given by coulomb counting, a Kalman filter, or another state observer. To give the subsequent reinforcement learning state global budget planning capability, this step also accumulates the ampere-hour throughput and energy consumption during the experiment in real time, and their update relationships are written as follows:
[0088]
[0089]
[0090] The two quantities mentioned above are not simply recorded, but are direct inputs for the budget status and budget constraints in subsequent steps three and four.
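A common way to accumulate the two budget quantities is a rectangle-rule sum of the absolute current and absolute power, converted from seconds to hours. The sketch below assumes that form and the units Ah and Wh; the exact update relationships of the patent are not recoverable from the text, and `accumulate_budget` is a hypothetical name.

```python
def accumulate_budget(q_ah, e_wh, current, voltage, dt):
    """Update cumulative ampere-hour throughput and energy consumption.

    current : applied current (A); the absolute value is used so that charge
              and discharge both consume budget
    voltage : terminal voltage (V)
    dt      : sampling period (s)
    """
    q_ah += abs(current) * dt / 3600.0
    e_wh += abs(voltage * current) * dt / 3600.0
    return q_ah, e_wh
```

For example, one hour at a constant 2 A and 3.5 V adds 2 Ah and 7 Wh regardless of the current direction.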
[0091] Safety boundary information is also configured in this step. The system reads or sets the terminal voltage upper limit $V_{\max}$, terminal voltage lower limit $V_{\min}$, SOC upper limit $\mathrm{SOC}_{\max}$, SOC lower limit $\mathrm{SOC}_{\min}$, current amplitude upper limit $I_{\max}$, current change-rate upper limit $\Delta I_{\max}$, ampere-hour budget upper limit $Q_{\max}$, energy budget upper limit $E_{\max}$, and temperature upper limit $T_{\max}$. In one embodiment, they can be set as follows:
[0092]
[0093]
[0094]
[0095]
[0096]
[0097]
[0098] .
[0099] However, the present invention is not limited to the above values, and adjustments can be made according to the cell model, test purpose and safety strategy during actual deployment.
[0100] Considering that industrial field data is often accompanied by noise, asynchronous processing, and sporadic anomalies, this step prioritizes further unified preprocessing. Voltage, current, and temperature sequences are processed using low-pass filtering, median filtering, or moving average to reduce the impact of sampling noise on sensitivity calculation and reward functions. If communication failure or data loss occurs at any point, the most recent valid value is preserved or linear interpolation is used for compensation. Validity checks are performed on static and boundary parameters; if they are found to exceed reasonable physical ranges, a reset or alarm is triggered. After the above processing, this step outputs a standardized input set:
[0101]
[0102] This set runs through the subsequent eight steps and serves as the common input basis for the implementation chain of this invention.
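As a rough illustration of the preprocessing described for this step, the sketch below fills lost samples with the most recent valid value and then applies a sliding median. The function names and the window length are hypothetical; the text equally allows low-pass filtering, moving averages, or linear interpolation.

```python
from statistics import median
from typing import List, Optional


def hold_last_valid(samples: List[Optional[float]]) -> List[float]:
    """Replace missing samples (None) with the most recent valid value."""
    out: List[float] = []
    last: Optional[float] = None
    for x in samples:
        if x is not None:
            last = x
        if last is None:
            raise ValueError("sequence starts with a missing sample")
        out.append(last)
    return out


def median_filter(samples: List[float], window: int = 3) -> List[float]:
    """Sliding median; the window is truncated at both ends of the sequence."""
    half = window // 2
    return [
        median(samples[max(0, i - half): i + half + 1])
        for i in range(len(samples))
    ]


raw = [3.7, None, 9.9, 3.8, 3.8]          # one dropout and one spike
clean = median_filter(hold_last_valid(raw))
```

The spike at 9.9 V is suppressed because a three-sample median ignores a single outlier, which is the property that protects the sensitivity and reward calculations from sporadic anomalies.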
[0103] Step 2: Building a Low-Fidelity Proxy Model
[0104] The task of this step is to construct a battery proxy environment that is computationally lightweight and fast to step while remaining physically interpretable, so that it can support subsequent reinforcement learning interactions with very large numbers of samples. To this end, this invention employs the single-particle model with electrolyte extension (SPMe) as the low-fidelity proxy model. Compared to the complete Doyle–Fuller–Newman model, SPMe significantly reduces the state dimension and solution complexity, while still preserving the key sensitivity relationships of electrode solid-phase diffusion, electrolyte concentration polarization, and terminal voltage response to changes in critical parameters. It is therefore suitable as a training environment for the policy learning phase.
[0105] The lithium-ion diffusion process within the positive and negative electrode particles is described by Fick's second law in spherical coordinates:
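[0106] $$\frac{\partial c_s}{\partial t} = \frac{D_s}{r^{2}}\frac{\partial}{\partial r}\left(r^{2}\frac{\partial c_s}{\partial r}\right)$$
[0107] where $c_s(r,t)$ denotes the lithium-ion concentration in the electrode particles, $D_s$ the solid-phase diffusion coefficient, and $r$ the radial coordinate within the particle. The boundary conditions are zero flux at the particle center and a surface flux determined by the interfacial reaction, i.e.:
[0108] $$\left.\frac{\partial c_s}{\partial r}\right|_{r=0} = 0$$
[0109] $$\left.-D_s\frac{\partial c_s}{\partial r}\right|_{r=R_s} = \frac{j}{a_s F}$$
[0110] where $R_s$ denotes the particle radius, $j$ the interfacial reaction flux density, $a_s$ the specific surface area, and $F$ Faraday's constant.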
[0111] To support high-frequency calls in the reinforcement learning environment, this invention does not solve the original partial differential equations directly at each sampling time. Instead, it discretizes the solid-phase diffusion model using finite differences, orthogonal collocation, moment matching, or an equivalent low-order approximation. The electrolyte concentration dynamics are likewise represented with a reduced-order method that preserves their influence on the mid-to-low-frequency terminal voltage response. The terminal voltage output is written as the sum of the open-circuit voltage difference, the reaction overpotential, the electrolyte polarization term, and the ohmic voltage drop:
[0112] $$V_t = U_p(c_{ss,p}) - U_n(c_{ss,n}) + \eta_p - \eta_n + \Delta\phi_e - R_0 I$$
[0113] where $c_{ss,p}$ and $c_{ss,n}$ denote the surface concentrations of the positive and negative electrode particles, $U_p$ and $U_n$ the corresponding open-circuit potentials, $\eta_p$ and $\eta_n$ the reaction overpotentials of the two electrodes, $\Delta\phi_e$ the electrolyte concentration polarization term, and $R_0$ the equivalent ohmic internal resistance.
[0114] Considering that direct evaluation of the Butler-Volmer relation is costly in high-frequency training, this invention performs local linearization or piecewise linear approximation on it near the current operating point, thereby obtaining a state-space expression suitable for discrete-time iteration:
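[0115] $$x_{L,k+1} = f_L(x_{L,k}, I_k, \theta) + w_k$$
[0116] $$V_{L,k} = h_L(x_{L,k}, I_k, \theta) + v_k$$
[0117] The subscript "L" denotes the low-fidelity model, $x_{L,k}$ is the internal state vector, $I_k$ is the input current, $\theta$ is the parameter vector to be identified, and $w_k$ and $v_k$ represent process noise and measurement noise, respectively. The state vector can be written as: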
[0118]
[0119] Alternatively, an equivalent state set can be used depending on the implementation method, as long as it can reflect the main influence of the parameter to be identified on the output voltage.
[0120] The low-fidelity proxy model constructed in this step is not used directly for the final experimental output; it serves as the main environment for agent interaction training in steps five and eight. Its core requirements are fast single-step inference, reusability, and the ability to withstand hundreds of thousands or even millions of environment interactions during the training phase. The output of this step is therefore not a "high-precision battery digital twin model" but a "policy learning environment that balances physical constraints and training efficiency."
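One of the discretization options named for the solid-phase diffusion equation can be sketched as follows. This is a minimal finite-volume variant with explicit Euler time stepping, chosen here only for brevity; the shell count, the explicit scheme, and the function names are assumptions, and the surface flux is passed directly as an outward molar flux rather than derived from the reaction current.

```python
import math
from typing import List


def step_particle(c: List[float], d_s: float, radius: float,
                  surface_flux: float, dt: float) -> List[float]:
    """One explicit Euler step of spherical diffusion on equal-width shells.

    c[i] is the mean concentration (mol/m^3) of shell i and surface_flux is the
    outward molar flux (mol/m^2/s) at r = radius. The centre face has zero
    area, which enforces the zero-flux condition there.
    """
    n = len(c)
    dr = radius / n
    flux = [0.0] * (n + 1)            # outward flux through face r_i = i * dr
    for i in range(1, n):
        flux[i] = -d_s * (c[i] - c[i - 1]) / dr
    flux[n] = surface_flux
    new_c = []
    for i in range(n):
        r_in, r_out = i * dr, (i + 1) * dr
        volume = 4.0 / 3.0 * math.pi * (r_out ** 3 - r_in ** 3)
        net_in = 4.0 * math.pi * (r_in ** 2 * flux[i] - r_out ** 2 * flux[i + 1])
        new_c.append(c[i] + dt * net_in / volume)
    return new_c


def total_moles(c: List[float], radius: float) -> float:
    """Total amount of lithium in the particle, used as a conservation check."""
    dr = radius / len(c)
    return sum(
        ci * 4.0 / 3.0 * math.pi * (((i + 1) * dr) ** 3 - (i * dr) ** 3)
        for i, ci in enumerate(c)
    )
```

The explicit scheme is only stable for small steps (roughly `dt` below `dr**2 / (3 * d_s)`), so an implicit or moment-matching variant may be preferable when the sampling period is long relative to the shell width.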
[0121] Step 3: Constructing Composite Vectors
[0122] This step formally transforms the battery stimulus design problem into a constrained Markov decision process and constructs a composite state vector for direct use by reinforcement learning networks. Unlike conventional strategies that only use voltage, current, and SOC as states, this invention explicitly introduces information sensitivity and budget features into the states, enabling the strategy to not only know "what condition the battery is currently in," but also "whether the current condition is sensitive to parameter identification" and "how much experimental budget remains," thereby forming an active decision-making capability for parameter identification tasks.
[0123] This invention defines the state vector as:
[0124] $$s_k = \big[\mathrm{SOC}_k,\ V_k,\ I_{k-1},\ S_k,\ b_{Q,k},\ b_{E,k},\ d_{V,k},\ d_{\mathrm{SOC},k},\ \tilde T_k\big]^{\top}$$
[0125] where $\mathrm{SOC}_k$ denotes the current state of charge; $V_k$ the current terminal voltage; $I_{k-1}$ the input current at the previous instant, used to constrain the rate of change of the action at the current instant; $b_{Q,k}$ and $b_{E,k}$ the normalized cumulative ampere-hour consumption and cumulative energy consumption, respectively; $d_{V,k}$ and $d_{\mathrm{SOC},k}$ the normalized distances from the state to the voltage boundary and the SOC boundary, respectively; and $\tilde T_k$ the normalized temperature state. Among these states, the most crucial is the information sensitivity state $S_k$, defined as the squared norm of the first-order sensitivity of the output voltage with respect to the parameters to be identified:
[0126] $$S_k = \left\lVert \frac{\partial V_k}{\partial \theta} \right\rVert_2^{2}$$
[0127] When a single parameter $\theta$ is to be identified, the above formula reduces to:
[0128] $$S_k = \left(\frac{\partial V_k}{\partial \theta}\right)^{2}$$
[0129] When multiple parameters are to be identified, $S_k$ reflects the overall identifiability of all target parameters under the current operating conditions. Through this quantity, the agent can learn to move actively toward highly sensitive regions, rather than blindly exploring the entire voltage range based on experience alone.
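The information sensitivity state can be approximated numerically when no analytic derivative is available. The sketch below uses central finite differences on a voltage function; the name `sensitivity_state`, the relative step size, and the toy voltage model are hypothetical stand-ins for the SPMe output.

```python
from typing import Callable, List, Sequence


def sensitivity_state(voltage_fn: Callable[[Sequence[float]], float],
                      theta: Sequence[float],
                      rel_step: float = 1e-4) -> float:
    """Squared 2-norm of dV/dtheta, estimated by central finite differences."""
    total = 0.0
    for i, th in enumerate(theta):
        h = rel_step * max(abs(th), 1e-12)
        up: List[float] = list(theta)
        dn: List[float] = list(theta)
        up[i] = th + h
        dn[i] = th - h
        dv = (voltage_fn(up) - voltage_fn(dn)) / (2.0 * h)
        total += dv * dv
    return total


def toy_voltage(theta: Sequence[float]) -> float:
    """Toy model V = 3.7 - R0 * I with theta = [R0] and I fixed at 2 A."""
    return 3.7 - theta[0] * 2.0


s_k = sensitivity_state(toy_voltage, [0.05])  # analytic value is (-2)^2 = 4
```

Because parameters usually have different units and magnitudes, an implementation may prefer to scale each derivative by its nominal parameter value before summing; that choice is not specified in the text.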
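[0130] The budget features are built from the cumulative quantities $Q_k$ and $E_k$ obtained in step one; in normalized form they are written as
[0131] $$b_{Q,k} = \frac{Q_k}{Q_{\max}}$$
[0132] $$b_{E,k} = \frac{E_k}{E_{\max}}$$
[0133] The boundary distance states are defined as
[0134] $$d_{V,k} = \frac{\min(V_{\max} - V_k,\ V_k - V_{\min})}{V_{\max} - V_{\min}}$$
[0135] $$d_{\mathrm{SOC},k} = \frac{\min(\mathrm{SOC}_{\max} - \mathrm{SOC}_k,\ \mathrm{SOC}_k - \mathrm{SOC}_{\min})}{\mathrm{SOC}_{\max} - \mathrm{SOC}_{\min}}$$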
[0136] The purpose of this approach is to enable the strategy to perceive at the state level how far the current operating condition is from the danger boundary, so as to automatically change its behavior when approaching the boundary, rather than passively accepting punishment after exceeding the limit.
[0137] The action in this step is the continuous normalized variable $a_k \in [-1, 1]$ output by the policy network. Since the current excitation must satisfy both amplitude and rate-of-change limits, the actual control current is generated through a mapping. This step uses the following saturation limiting relationship:
[0138] $$I_k = \mathrm{sat}\big(I_{k-1} + \mathrm{sat}(a_k I_{\max} - I_{k-1},\ -\Delta I_{\max},\ \Delta I_{\max}),\ -I_{\max},\ I_{\max}\big)$$
[0139] where $\mathrm{sat}(x, a, b) = \min(\max(x, a), b)$ denotes the limiting function. This action construction ensures that the current is continuous, realizable, and feasible for the actuator. If adaptation to a specific charging / discharging sign definition is required, a uniform sign transformation of the action direction can be made in the implementation without affecting the overall method of this invention.
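A possible form of the action mapping is sketched below. The exact composition of the two limits is an assumption: here the normalized action is scaled to a target current, the step from the previous current is rate-limited, and the result is amplitude-limited. The names `map_action`, `i_max`, and `di_max` are illustrative.

```python
def clip(x: float, lo: float, hi: float) -> float:
    """Saturate x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def map_action(a: float, i_prev: float, i_max: float, di_max: float) -> float:
    """Map a normalized action in [-1, 1] to a current command in amperes."""
    target = clip(a, -1.0, 1.0) * i_max           # amplitude scaling
    step = clip(target - i_prev, -di_max, di_max)  # rate-of-change limit
    return clip(i_prev + step, -i_max, i_max)      # final amplitude limit


i_next = map_action(a=1.0, i_prev=0.0, i_max=3.0, di_max=0.5)
```

With these example numbers the policy asks for the full 3 A, but the command moves only 0.5 A from the previous value, which is the continuity property the text attributes to the mapping.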
[0140] After this step, the system obtains a composite state space containing three types of features: physical, informational, and budgetary, as well as a continuous action space compatible with the actuator. This construction directly determines the effectiveness of the subsequent safety constraint mapping in step four and the policy learning in step five, and is an important foundation that distinguishes this invention from conventional reinforcement learning incentive design schemes.
[0141] Step 4: Constructing Safety Constraints
[0142] This step transforms the multidimensional physical boundaries, budget boundaries, and device boundaries during battery operation into constraints and safety cost functions that can be handled by reinforcement learning. This allows subsequent training to move beyond relying on a single soft-penalty empirical setting and instead be built upon a clear, computable, and updatable safety constraint system. Since this invention focuses on parameter identification experiments rather than ordinary charge-discharge control, the safety constraints include not only instantaneous boundaries such as voltage, SOC, and temperature, but also full-cycle budget boundaries such as ampere-hour throughput and energy consumption.
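[0143] The battery operating constraint domain is defined as
[0144] $$\Omega = \big\{V_{\min} \le V_k \le V_{\max},\ \mathrm{SOC}_{\min} \le \mathrm{SOC}_k \le \mathrm{SOC}_{\max},\ |I_k| \le I_{\max},\ |I_k - I_{k-1}| \le \Delta I_{\max},\ Q_k \le Q_{\max},\ E_k \le E_{\max},\ T_k \le T_{\max}\big\}$$
[0145] On this basis, the present invention does not feed the binary variable "limit exceeded or not" into training directly. Instead, it maps the degree of violation of each constraint dimension to a non-negative safety cost component.
[0146] The safety cost vector is defined as:
[0147] $$c_k = \big[c_{V,k},\ c_{\mathrm{SOC},k},\ c_{Q,k},\ c_{E,k},\ c_{\Delta I,k},\ c_{T,k}\big]^{\top}$$
[0148] The voltage cost is written as:
[0149] $$c_{V,k} = \max(0,\ V_k - V_{\max}) + \max(0,\ V_{\min} - V_k)$$
[0150] The SOC cost is written as:
[0151] $$c_{\mathrm{SOC},k} = \max(0,\ \mathrm{SOC}_k - \mathrm{SOC}_{\max}) + \max(0,\ \mathrm{SOC}_{\min} - \mathrm{SOC}_k)$$
[0152] The ampere-hour budget cost is written as:
[0153] $$c_{Q,k} = \max(0,\ Q_k - Q_{\max})$$
[0154] The energy budget cost is written as:
[0155] $$c_{E,k} = \max(0,\ E_k - E_{\max})$$
[0156] The cost of the rate of change of current is written as:
[0157] $$c_{\Delta I,k} = \max(0,\ |I_k - I_{k-1}| - \Delta I_{\max})$$
[0158] The temperature cost is written as:
[0159] $$c_{T,k} = \max(0,\ T_k - T_{\max})$$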
[0160] When no limits are exceeded, all of the above cost components are zero; when a constraint is triggered, the corresponding cost becomes positive and increases with the degree of limit violation. Unlike traditional soft penalty mechanisms that use fixed-weight negative rewards, this invention uses these costs as inputs for subsequent updates to the Lagrange dual variable, thus allowing the penalty intensity to dynamically change according to the current actual degree of violation, rather than being pre-fixed. To fully reflect the budget consumption process, this step also recursively updates the budget variable:
[0161] $$Q_{k+1} = Q_k + |I_k|\,\Delta t$$
[0162] $$E_{k+1} = E_k + |V_k I_k|\,\Delta t$$
[0163] This recursion not only serves the budget cost calculation, but also directly affects the budget status input in step three and the high-fidelity screening indicators in step eight.
[0164] After this step is completed, the system not only obtains the definition of the "allowed region," but also a set of differentiable or piecewise differentiable cost signals. In this way, subsequent policy training can achieve an explicit trade-off between "pursuing high information content" and "maintaining safety and budget control," rather than relying solely on empirical parameter tuning.
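The mapping from constraint violations to non-negative cost components can be illustrated with simple hinge functions. This is a sketch only: the hinge form, the field names, and every numeric limit below are hypothetical and are not the values of the embodiment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Limits:
    v_min: float
    v_max: float
    soc_min: float
    soc_max: float
    di_max: float
    ah_max: float
    wh_max: float
    t_max: float


def hinge(x: float) -> float:
    """Zero inside the allowed region, linear in the amount of violation."""
    return max(0.0, x)


def safety_costs(v: float, soc: float, i: float, i_prev: float,
                 ah: float, wh: float, temp: float, lim: Limits) -> List[float]:
    """Return [voltage, SOC, Ah budget, energy budget, current rate, temperature]."""
    return [
        hinge(v - lim.v_max) + hinge(lim.v_min - v),
        hinge(soc - lim.soc_max) + hinge(lim.soc_min - soc),
        hinge(ah - lim.ah_max),
        hinge(wh - lim.wh_max),
        hinge(abs(i - i_prev) - lim.di_max),
        hinge(temp - lim.t_max),
    ]


# Hypothetical limits, for illustration only.
lim = Limits(v_min=2.5, v_max=4.2, soc_min=0.1, soc_max=0.9,
             di_max=0.5, ah_max=0.5, wh_max=2.0, t_max=45.0)
safe = safety_costs(3.9, 0.5, 1.0, 0.8, 0.2, 0.7, 30.0, lim)
over = safety_costs(4.25, 0.5, 1.0, 0.8, 0.2, 0.7, 30.0, lim)
```

Hinge costs are piecewise differentiable and grow with the size of the violation, which matches the behaviour the text requires of the cost signals; squared or normalized variants would satisfy the same description.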
[0165] Step 5: Intelligent Agent Interaction Training
[0166] This step performs basic agent training in the low-fidelity proxy model environment constructed in step two. Its goal is to learn a policy network that actively selects high-information excitation while satisfying the safety constraints, within the composite state space of step three and the constraint system of step four. To suit the continuous action space, this invention employs the maximum-entropy soft actor-critic algorithm combined with the Lagrangian multiplier method, namely Lagrangian-SAC (L-SAC).
[0167] Unlike conventional control tasks, whose rewards minimize tracking error, energy consumption, or time, this invention uses parameter identification information as the core of policy optimization. Let the sensitivity Jacobian of the output voltage with respect to the parameter vector $\theta$ to be identified be:
[0168] $$J_k = \frac{\partial V_k}{\partial \theta}$$
[0169] With the measurement noise covariance matrix denoted $\Sigma_v$, the single-step Fisher information matrix can be written as:
[0170] $$F_k = J_k^{\top}\,\Sigma_v^{-1}\,J_k$$
[0171] The instant reward is defined as the trace of this matrix:
[0172] $$r_k = \mathrm{tr}(F_k)$$
[0173] When a single parameter is to be identified, the reward reduces to the squared sensitivity normalized by the noise variance; when multiple parameters are to be identified, the reward represents the contribution of the current operating condition to the combined information for joint identification. The agent therefore naturally tends to operate in the range where the terminal voltage is most sensitive to the target parameters.
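For a scalar voltage measurement the Fisher information reward is easy to compute by hand. The sketch below assumes a single output with noise variance `noise_var`, so the Jacobian is one row; the function names are illustrative.

```python
from typing import List, Sequence


def fisher_information(jacobian: Sequence[float], noise_var: float) -> List[List[float]]:
    """Single-step Fisher matrix J^T * (1 / noise_var) * J for a scalar output."""
    return [[ji * jj / noise_var for jj in jacobian] for ji in jacobian]


def trace(matrix: Sequence[Sequence[float]]) -> float:
    """Sum of the diagonal entries."""
    return sum(matrix[i][i] for i in range(len(matrix)))


def information_reward(jacobian: Sequence[float], noise_var: float) -> float:
    """Instant reward defined as the trace of the Fisher information matrix."""
    return trace(fisher_information(jacobian, noise_var))


reward = information_reward([2.0, -1.0], noise_var=0.25)  # (4 + 1) / 0.25
```

For a scalar output the trace equals the sensitivity state divided by the noise variance, which is why the single-parameter case reduces to a normalized squared sensitivity.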
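[0174] Taking the safety cost vector $c_k$ of step four into account, this invention constructs the following Lagrangian maximum-entropy objective:
[0175] $$\max_{\pi_\phi}\ \min_{\lambda \ge 0}\ \mathbb{E}_{\pi_\phi}\left[\sum_{k}\gamma^{k}\Big(r_k + \alpha_{\mathrm{ent}}\,\mathcal{H}\big(\pi_\phi(\cdot \mid s_k)\big)\Big)\right] - \sum_{i}\lambda_i\left(\mathbb{E}_{\pi_\phi}\left[\sum_{k}\gamma^{k} c_{i,k}\right] - d_i\right)$$
[0176] where $\pi_\phi$ is the Actor policy network, $\lambda$ the dual variable vector, $\gamma$ the discount factor, $\alpha_{\mathrm{ent}}$ the entropy weighting coefficient, $\mathcal{H}$ the entropy of the action distribution, and $d_i$ the allowed upper bound on the expected cost of the $i$-th constraint. The objective is to maximize the information reward, keep the safety costs within their bounds, and maintain a moderate level of exploration.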
[0177] At the start of training, the system initializes one Actor network, two Critic networks, and the corresponding target networks, and establishes an experience replay pool. At each sampling instant, the Actor receives the current state $s_k$ and outputs the normalized action $a_k$; the actual current $I_k$ is obtained through the action mapping of step three. The low-fidelity model is advanced with $I_k$ to yield the new internal state, the voltage output, and the composite state $s_{k+1}$ at the next instant. The system then computes the instant reward $r_k$ and the safety cost $c_k$, and stores the transition sample $(s_k, a_k, r_k, c_k, s_{k+1})$ in the experience replay pool. A mini-batch is then sampled at random from the pool to update the Critic networks, the Actor network, and the dual variables, respectively.
[0178] The Lagrange dual variables are updated by projected gradient ascent:
[0179] $$\lambda_i \leftarrow \big[\lambda_i + \beta_\lambda\,(\hat J_{c_i} - d_i)\big]_{+}$$
[0180] where $\beta_\lambda$ is the dual learning rate, $\hat J_{c_i}$ is the estimated expected cost of the $i$-th constraint, $d_i$ is the allowed upper bound on the expected value of the $i$-th class of constraint, and $[\cdot]_{+}$ denotes projection onto the non-negative interval. This mechanism ensures that if a constraint is triggered over a long period, its penalty weight increases automatically and forces the policy distribution back into the safe region; if a constraint remains satisfied over a long period, its penalty weight is not amplified unnecessarily, which avoids excessive suppression of information exploration.
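The dual update is a one-line projected gradient step per constraint. The sketch below assumes the expected costs have already been estimated, for example as a mini-batch average; the function and argument names are hypothetical.

```python
from typing import List, Sequence


def update_duals(lams: Sequence[float], avg_costs: Sequence[float],
                 cost_limits: Sequence[float], lr: float) -> List[float]:
    """Projected gradient ascent: raise a multiplier while its constraint is
    violated on average, lower it otherwise, and never let it go negative."""
    return [
        max(0.0, lam + lr * (cost - limit))
        for lam, cost, limit in zip(lams, avg_costs, cost_limits)
    ]


# First constraint is violated on average (0.3 > 0.1), second is satisfied.
new_lams = update_duals([0.0, 1.0], [0.3, 0.0], [0.1, 0.1], lr=0.5)
```

The first multiplier rises and the second falls, so the penalty strength tracks the observed degree of violation instead of being fixed in advance, which is the stated difference from fixed-weight soft penalties.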
[0181] The training termination condition for this step can be set to reaching a preset number of rounds, or the policy's average reward, average cost, and violation probability converging simultaneously over a series of consecutive rounds. In one implementation, after training termination, it must be ensured that the average violation probability in the low-fidelity environment is below a set threshold, and the information gain per ampere-hour is above a set lower limit. After this step, the agent has a preliminary ability to "actively seek high-sensitivity regions within budget and avoid exceeding limits as much as possible," but since it is still only trained in a low-fidelity environment, subsequent high-fidelity verification is still required.
[0182] Step Six: Preliminary Acquisition of Agent Parameters
[0183] The main purpose of this step is to transform the intermediate results obtained from step five training from "network objects in training" into "a set of candidate policy parameters that can be saved, compared, and sent to a high-fidelity environment for verification." In other words, this step is a transitional link between the low-fidelity training stage and the high-fidelity verification stage.
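[0184] Once the stopping condition of step five is met, the Actor network parameters, Critic network parameters, and Lagrange dual variables are first frozen and denoted as follows:
[0185] $$\big\{\phi^{(0)},\ \psi_1^{(0)},\ \psi_2^{(0)},\ \lambda^{(0)}\big\}$$
[0186] Among them, the policy network parameters are what actually determine the subsequent excitation trajectory. To avoid relying on a single training result for high-fidelity validation, this step preferably performs several evaluation rollouts in the low-fidelity proxy environment with no or weak exploration noise, and computes performance statistics such as cumulative reward, cumulative Fisher information, information gain per ampere-hour, and probability of exceeding limits. Typically, these can be calculated as follows:
[0187] $$F_{\mathrm{cum}} = \sum_{k=0}^{N} \mathrm{tr}(F_k)$$
[0188] $$G_{Ah} = \frac{F_{\mathrm{cum}}}{Q_N + \varepsilon}$$
[0189] $$\hat P_{\mathrm{viol}} = \frac{N_{\mathrm{viol}}}{N_{\mathrm{eval}}}$$
[0190] where $\varepsilon$ is a tiny positive number that prevents a zero denominator, $Q_N$ is the ampere-hour throughput at the end of the rollout, and $N_{\mathrm{viol}}$ is the number of the $N_{\mathrm{eval}}$ evaluation rollouts in which a limit was exceeded.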
[0191] A set of candidate policies is then formed from the evaluation results. A policy is included in the candidate set when its probability of exceeding limits in the low-fidelity environment does not exceed the threshold $p_{\mathrm{th}}$ and, at the same time, its information gain per ampere-hour is not lower than the threshold $G_{\mathrm{th}}$:
[0192] $$\Pi_{\mathrm{cand}} = \big\{\phi_j \ \big|\ \hat P_{\mathrm{viol}}(\phi_j) \le p_{\mathrm{th}},\ G_{Ah}(\phi_j) \ge G_{\mathrm{th}}\big\}$$
[0193] In one implementation, several sets of Actor parameters that achieved the best performance in the most recent training rounds can be retained, rather than a single optimal set, to enlarge the candidate space for high-fidelity screening. The performance description of each candidate policy can be packaged and output with it to form a candidate information packet:
[0194] $$\mathcal{P}_j = \big\{\phi_j,\ F_{\mathrm{cum},j},\ G_{Ah,j},\ \hat P_{\mathrm{viol},j}\big\}$$
[0195] where $\hat P_{\mathrm{viol},j}$ represents the estimated probability of violation in the low-fidelity environment. This information packet is used directly as input for steps seven and eight.
[0196] Through this step, the preliminary learning results obtained in step five are systematically organized into candidate policy objects that can enter the next stage, providing standardized input for subsequent high-fidelity verification and two-stage alternating optimization.
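The evaluation and filtering of candidates can be sketched as below. The data classes, the per-rollout averaging of the gain, and the threshold values in the usage lines are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RolloutStats:
    fisher_sum: float   # cumulative Fisher information of one rollout
    ah: float           # ampere-hour throughput of the rollout
    violated: bool      # True if any limit was exceeded


@dataclass
class CandidateReport:
    name: str
    gain_per_ah: float
    violation_prob: float


def evaluate(name: str, rollouts: List[RolloutStats], eps: float = 1e-9) -> CandidateReport:
    """Average information gain per Ah and empirical violation probability."""
    n = len(rollouts)
    gain = sum(r.fisher_sum / (r.ah + eps) for r in rollouts) / n
    p_viol = sum(1 for r in rollouts if r.violated) / n
    return CandidateReport(name, gain, p_viol)


def select_candidates(reports: List[CandidateReport],
                      p_max: float, gain_min: float) -> List[CandidateReport]:
    """Keep policies that are both safe enough and informative enough."""
    return [r for r in reports
            if r.violation_prob <= p_max and r.gain_per_ah >= gain_min]


reports = [
    evaluate("actor_a", [RolloutStats(100.0, 0.5, False), RolloutStats(120.0, 0.5, False)]),
    evaluate("actor_b", [RolloutStats(200.0, 0.5, True), RolloutStats(180.0, 0.5, False)]),
]
kept = select_candidates(reports, p_max=0.05, gain_min=150.0)
```

In this example `actor_b` is more informative but is dropped for violating limits in half of its rollouts, while `actor_a` passes both thresholds.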
[0197] Step 7: Construction of High-Fidelity Verification Model
[0198] This step constructs a high-fidelity verification model to compensate for the shortcomings of low-fidelity surrogate models in describing the complex interface dynamics of real batteries. The high-fidelity model is not used for massive training throughout the process, but rather for more rigorous physical consistency verification of candidate strategies. Therefore, its focus is on "accurately characterizing interface polarization and long memory effects" rather than "extremely reducing computational load".
[0199] This step introduces a constant-phase element (CPE) into the SPMe framework to describe the fractional-order polarization characteristics of the porous electrode / electrolyte interface in the low-to-mid-frequency region. The terminal voltage in the high-fidelity model can be expressed as:
[0200] $$V_H(t) = V_{\mathrm{SPMe}}(t) - V_{\mathrm{CPE}}(t)$$
[0201] where $V_{\mathrm{SPMe}}$ is the SPMe terminal voltage of step two and $V_{\mathrm{CPE}}$ is the additional fractional polarization voltage introduced by the CPE. The frequency-domain impedance of the CPE is expressed as:
[0202] $$Z_{\mathrm{CPE}}(j\omega) = \frac{1}{C_\alpha\,(j\omega)^{\alpha}}$$
[0203] where $C_\alpha$ is the generalized capacitance parameter and $\alpha$ is the fractional order. When $\alpha$ is less than 1, the model can reflect the dispersive polarization behavior and long-memory effect of the real interface.
[0204] In the time-domain implementation, this invention uses the Caputo fractional derivative to describe the dynamics of the fractional internal state, namely:
[0205] $${}^{C}\!D_t^{\alpha}\,V_{\mathrm{CPE}}(t) = \frac{I(t)}{C_\alpha}$$
[0206] Because solving fractional derivatives directly is computationally expensive, the memory kernel is discretized using a sum-of-exponentials approximation:
[0207] $$\frac{t^{\alpha-1}}{\Gamma(\alpha)} \approx \sum_{i=1}^{M} w_i\,e^{-p_i t}$$
[0208] This allows a set of auxiliary states $z_i$, $i = 1, \dots, M$, to be introduced such that:
[0209] $$\dot z_i(t) = -p_i\,z_i(t) + I(t)$$
[0210] The fractional polarization term is then approximated as:
[0211] $$V_{\mathrm{CPE}}(t) \approx \frac{1}{C_\alpha}\sum_{i=1}^{M} w_i\,z_i(t)$$
[0212] This process allows for significant enhancement of the model's physical realism while controlling computational costs, making it suitable for periodic offline verification in step eight.
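The auxiliary-state form of the CPE lends itself to a cheap recursive update. The sketch below assumes each auxiliary state obeys a first-order decay driven by the current and that the current is held constant over a sampling period; the class name, the weights, and the decay rates are hypothetical and would in practice be fitted to the power-law kernel.

```python
import math
from typing import List, Sequence


class CpeSumOfExponentials:
    """Approximate CPE polarization voltage with M exponentially decaying states."""

    def __init__(self, weights: Sequence[float], rates: Sequence[float],
                 c_alpha: float) -> None:
        self.w: List[float] = list(weights)   # kernel weights w_i
        self.p: List[float] = list(rates)     # decay rates p_i in 1/s
        self.c_alpha = c_alpha                # generalized capacitance
        self.z: List[float] = [0.0] * len(self.w)

    def step(self, current: float, dt: float) -> float:
        """Advance z_i' = -p_i z_i + I exactly for constant I; return V_CPE."""
        for i, p in enumerate(self.p):
            decay = math.exp(-p * dt)
            self.z[i] = decay * self.z[i] + (1.0 - decay) / p * current
        return sum(w * z for w, z in zip(self.w, self.z)) / self.c_alpha


cpe = CpeSumOfExponentials(weights=[1.0, 0.5], rates=[0.1, 1.0], c_alpha=10.0)
v_cpe = 0.0
for _ in range(2000):                 # constant 2 A for 2000 s
    v_cpe = cpe.step(current=2.0, dt=1.0)
```

A finite sum saturates under constant current whereas an ideal CPE does not, so the approximation is only meaningful over the time window (equivalently, the frequency band) covered by the chosen rates.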
[0213] The high-fidelity verification model constructed in this step maintains consistency with the low-fidelity model in step two in terms of initial conditions, boundary conditions, range of parameters to be identified, and excitation input interface, thus ensuring the comparability of candidate strategies in the two environments. Its main uses are: receiving the candidate strategy parameters output from step six and generating voltage, temperature, and state trajectories under a more rigorous physical model; identifying strategies that appear effective in the low-fidelity environment but exhibit overvoltage, amplified lithium plating risk, or polarization response distortion in the high-fidelity environment; and providing a more realistic verification basis for the two-stage alternating training in step eight.
[0214] Step 8: Intelligent Agent Interaction Training
[0215] In this step, the system first performs several rounds of rapid training in the low-fidelity proxy environment to further expand the state-action coverage and let the policy explore more high-information scenarios without significantly increasing computational cost. To improve the policy's robustness to model uncertainty, this invention preferably applies bounded random perturbations to the parameters to be identified during this rapid training phase, i.e.:
[0216] $$\tilde\theta = \theta_0 \odot (1 + \delta), \qquad \delta_i \sim \mathcal{U}(-\varepsilon_\theta,\ \varepsilon_\theta)$$
[0217] where $\theta_0$ is the nominal parameter vector, $\mathcal{U}$ denotes the uniform distribution, and $\varepsilon_\theta$ represents the magnitude of the parameter perturbation. The resulting policy is therefore not tuned to a single nominal parameter set, but remains stable and effective within a certain range of parameter fluctuations.
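The bounded perturbation can be drawn once per training episode. The sketch below assumes a multiplicative perturbation with the same relative magnitude for every parameter; the function name and the example nominal values are hypothetical.

```python
import random
from typing import List, Sequence


def perturb_parameters(theta_nominal: Sequence[float], magnitude: float,
                       rng: random.Random) -> List[float]:
    """Scale each nominal parameter by (1 + delta), delta ~ U(-magnitude, magnitude)."""
    return [th * (1.0 + rng.uniform(-magnitude, magnitude)) for th in theta_nominal]


rng = random.Random(0)                      # seeded for reproducible episodes
theta_nominal = [1.0e-14, 0.03]             # illustrative values only
theta_episode = perturb_parameters(theta_nominal, magnitude=0.1, rng=rng)
```

Passing an explicit `random.Random` instance keeps the perturbation sequence reproducible, which helps when the same episodes need to be replayed in the high-fidelity model.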
[0218] After the rapid training reaches the preset number of rounds, the system freezes the current policy parameters and deploys them to the high-fidelity model built in step seven for offline rolling verification. During verification, the initial SOC, initial temperature, budget ceiling, and boundary parameters are maintained consistent with those used in the training phase. The voltage trajectory, SOC trajectory, temperature trajectory, ampere-hour consumption, energy consumption, and Fisher information accumulation under the candidate policy are recorded throughout the entire cycle. Based on this, the information gain per ampere-hour of the policy is calculated.
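[0219] $$G_{Ah} = \frac{\sum_{k=0}^{N}\mathrm{tr}(F_k)}{Q_N}$$
[0220] Here $\mathrm{tr}(F_k)$ is the trace of the single-step Fisher information matrix and $Q_N$ is the total ampere-hour throughput over the verification cycle. If necessary, the information gain per unit energy can be further calculated, with $E_N$ the total energy consumption:
[0221] $$G_E = \frac{\sum_{k=0}^{N}\mathrm{tr}(F_k)}{E_N}$$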
[0222] To ensure that the final policy can be deployed directly in a real system, this invention employs a zero-tolerance screening criterion. Specifically, if a candidate policy violates a hard constraint at any point during high-fidelity offline verification, for example $V_k > V_{\max}$, $V_k < V_{\min}$, $\mathrm{SOC}_k > \mathrm{SOC}_{\max}$, $\mathrm{SOC}_k < \mathrm{SOC}_{\min}$, or $T_k > T_{\max}$, it is immediately removed from the candidate set and does not enter the final feasible set. A policy is considered feasible and included in the feasible set only if it satisfies all hard boundaries throughout the entire verification period. The feasible set is denoted as:
[0223] $$\Pi_{\mathrm{feas}} = \big\{\phi_{\pi,j}^{\mathrm{frz}} \ \big|\ j = 1, \dots, N_f\big\}$$
[0224] where $\Pi_{\mathrm{feas}}$ represents the set of feasible policies after high-fidelity offline verification; $\phi_{\pi,j}^{\mathrm{frz}}$ denotes the policy network parameters of the $j$-th candidate policy; the subscript $\pi$ indicates that the parameters belong to the policy network (Actor network); the subscript $j = 1, \dots, N_f$ numbers the candidate policies, where $N_f$ is the total number of feasible policies that passed high-fidelity verification and were retained; and the superscript $\mathrm{frz}$ indicates that the parameters are the candidate policy parameters frozen and exported on entering the current high-fidelity screening stage. Subsequently, the optimal policy is selected from the feasible set by maximizing the information gain per ampere-hour:
[0225] $$\phi_{\pi}^{*} = \arg\max_{\phi \in \Pi_{\mathrm{feas}}} G_{Ah}(\phi)$$
[0226] where $\phi_{\pi}^{*}$ represents the network parameters of the finally selected optimal policy; $\arg\max$ over $\Pi_{\mathrm{feas}}$ indicates searching the feasible set for the parameters that maximize the objective function; and $G_{Ah}$ is the information gain per ampere-hour, which characterizes the amount of Fisher information obtained per unit of ampere-hour throughput and is preferably defined as the ratio of the cumulative Fisher information to the total ampere-hour throughput. Through this screening, the policy parameters with the highest information output efficiency are selected from all feasible candidates, subject to all hard safety boundaries being satisfied, and are used as the optimal policy for the subsequent final input excitation generation.
[0227] If the feasible set is empty in the current round, it indicates that the current policy family still relies too heavily on the low-fidelity environment and cannot maintain safety and effectiveness under high-fidelity physical conditions. In this case, the system returns to step five to continue training, or adjusts the range of random perturbations, reward weights, and Lagrange update parameters in the low-fidelity training before re-entering this step. If the feasible set is not empty, the policy that meets the high-fidelity zero-boundary requirement and has the best information efficiency is sent to step nine. Through this step, the seemingly contradictory requirements of fast low-fidelity training and strict high-fidelity verification are unified in the same workflow, making the final policy both training feasible and deployment reliable.
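The zero-tolerance screening and the final selection reduce to a filter followed by an arg-max. The sketch below is illustrative: the record type and its fields are hypothetical, and an empty feasible set is signalled with `None` so the caller can return to low-fidelity training.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VerificationResult:
    name: str
    violated: bool      # True if any hard limit was exceeded at any sample
    fisher_sum: float   # cumulative Fisher information in the high-fidelity run
    ah: float           # total ampere-hour throughput of the run


def select_optimal(results: List[VerificationResult]) -> Optional[VerificationResult]:
    """Drop every policy with a violation, then maximize information per Ah."""
    feasible = [r for r in results if not r.violated]
    if not feasible:
        return None     # caller should go back to low-fidelity training
    return max(feasible, key=lambda r: r.fisher_sum / r.ah)


results = [
    VerificationResult("actor_a", False, 120.0, 0.5),
    VerificationResult("actor_b", True, 400.0, 0.5),   # best gain, but unsafe
    VerificationResult("actor_c", False, 165.0, 0.5),
]
best = select_optimal(results)
```

Here `actor_b` is excluded despite its higher information gain, because a single violation is disqualifying under the zero-tolerance rule.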
[0228] Step Nine: Final Input Stimulus
[0229] This step transforms the optimal strategy parameters obtained in step eight into a final input excitation sequence that can be directly executed by the battery management system or experimental equipment, and completes the excitation distribution, online monitoring, and experimental data feedback. This step is the engineering implementation of the entire method; its output is no longer abstract neural network parameters, but specific, time-ordered current excitation trajectories and corresponding identification datasets.
[0230] First, the system reads the optimal policy parameters $\phi_{\pi}^{*}$ output by step eight and performs a complete rolling inference without exploration noise to obtain the optimal input excitation sequence over the target time horizon:
[0231] $$I^{*}_{0:N} = \{I^{*}_0,\ I^{*}_1,\ \dots,\ I^{*}_N\}$$
[0232] If the policy network outputs a normalized action sequence, it is first restored to the true current amplitude through the limiting mapping of step three. If the output is a piecewise parameterized trajectory, it is discretized in this step into a current command table that matches the device control cycle. To match the actual controller, the system preferably processes the sequence further, including sample-and-hold processing, actuator amplitude limiting correction, and protection margin scaling. For example, to improve deployment safety, the voltage upper limit used in the training phase can be tightened to a more conservative engineering protection limit, and the temperature upper limit can be tightened in the same way, which creates a dual-layer protection structure for training and execution.
[0233] The final input excitation sequence is sent to the battery testing platform or embedded BMS for execution via CAN bus, Ethernet, serial port, or other industrial communication methods. During execution, the system continuously collects voltage, current, temperature, estimated SOC, and budgeted energy consumption in real time, and detects whether secondary safety protection conditions are triggered. The excitation output is immediately terminated when any of the following conditions are met:
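[0234] $$V_k \ge V_{\max}^{\mathrm{prot}}\ \ \text{or}\ \ V_k \le V_{\min}^{\mathrm{prot}}\ \ \text{or}\ \ T_k \ge T_{\max}^{\mathrm{prot}}\ \ \text{or}\ \ \mathrm{SOC}_k \notin [\mathrm{SOC}_{\min},\ \mathrm{SOC}_{\max}]\ \ \text{or}\ \ Q_k \ge Q_{\max}\ \ \text{or}\ \ E_k \ge E_{\max}$$
[0235] The protection thresholds above (superscript "prot") can be the same as the thresholds used during the training phase, or they can be set to more conservative values.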
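The secondary protection during execution amounts to checking each new sample against the protection thresholds and stopping at the first trip. The sketch below is a simplified illustration covering voltage, temperature, and ampere-hour budget only; the type names and all numeric thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

Sample = Tuple[float, float, float]  # (voltage V, temperature degC, cumulative Ah)


@dataclass
class ProtectionLimits:
    v_min: float
    v_max: float
    t_max: float
    ah_max: float


def run_with_protection(samples: Iterable[Sample],
                        limits: ProtectionLimits) -> Tuple[List[Sample], bool]:
    """Return the samples accepted before any trip, and whether a trip occurred."""
    executed: List[Sample] = []
    for v, temp, ah in samples:
        tripped = (v >= limits.v_max or v <= limits.v_min
                   or temp >= limits.t_max or ah >= limits.ah_max)
        if tripped:
            return executed, True   # terminate the excitation immediately
        executed.append((v, temp, ah))
    return executed, False


# Hypothetical protection thresholds, tighter than typical cell limits.
limits = ProtectionLimits(v_min=2.6, v_max=4.15, t_max=45.0, ah_max=0.5)
stream = [(3.9, 25.0, 0.1), (4.1, 26.0, 0.2), (4.16, 26.0, 0.3), (4.0, 26.0, 0.4)]
executed, tripped = run_with_protection(stream, limits)
```

Only the two samples before the 4.16 V reading are kept, so the identification dataset contains no data recorded after the protection condition was met.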
[0236] After the entire excitation has been executed, the system organizes all the data into a parameter identification dataset, represented as follows:
[0237] $$\mathcal{D} = \big\{(t_k,\ I^{*}_k,\ V_k,\ T_k,\ \mathrm{SOC}_k)\big\}_{k=0}^{N}$$
[0238] where $\mathcal{D}$ represents the experimental dataset used for subsequent parameter identification calculations; $t_k$ is the timestamp of the $k$-th sampling instant; $I^{*}_k$ is the optimal input excitation current actually applied to the battery at the $k$-th sampling instant, where the superscript "$*$" indicates that the current is generated by the finally selected optimal policy; $V_k$ is the battery terminal voltage measured at the $k$-th sampling instant; $T_k$ is the battery temperature measured at the $k$-th sampling instant; $\mathrm{SOC}_k$ is the estimated state of charge at the $k$-th sampling instant; and $N$ is the final sampling index within this excitation execution cycle, which is also the end index of this set of time-series data. The dataset $\mathcal{D}$ fully records the time-series mapping from excitation input to voltage, temperature, and state response, and can be used directly as input for subsequent recursive least squares, extended Kalman filter, unscented Kalman filter, particle filter, or other electrochemical parameter estimation algorithms to update and identify the target parameters.
[0239] Experimental verification
[0240] To illustrate the differences between the method of this invention and the comparative methods in terms of voltage boundary compliance, information output efficiency, and parameter identification error, this embodiment constructs a comparative calculation case based on the parameters of an LG 18650 battery and plots Figure 4, Figure 5, and Figure 6, which are explained below. The comparison methods include the traditional constant current (CC) condition, the standard dynamic condition (FUDS), the soft-constraint reinforcement learning method with fixed penalty weights (Soft-SAC), and the L-SAC method proposed in this invention. All methods are compared under the same voltage constraint and the same ampere-hour budget.
[0241] 1. Safety boundary compliance verification (corresponding to Figure 4)
[0242] Figure 4 presents the terminal voltage response trajectories of the different methods under the given operating conditions. Figure 4 shows that the terminal voltage trajectory of Soft-SAC exceeded the safety threshold in both high-voltage regions, which indicates that the method exhibits voltage overshoot under the current comparative conditions. The terminal voltage trajectory of the L-SAC of this invention shows a clear limiting characteristic as it approaches the threshold and remains below the safety line throughout. According to the results in Figure 4, the method of the present invention did not exceed the voltage limit under the comparison conditions.
[0243] 2. Comparison of information output efficiency (corresponding to Figure 5)
[0244] Figure 5 presents the curves of cumulative Fisher information versus ampere-hour throughput. Figure 5 shows that the cumulative Fisher information of the conventional constant current (CC) condition grows the slowest, indicating the lowest information output capability under the same ampere-hour budget. As the ampere-hour throughput increases, the cumulative Fisher information of L-SAC gradually surpasses that of Soft-SAC in the later intervals and reaches the highest value at the end. Near the endpoint of Figure 5, under the same 0.5 Ah budget, the final values of the cumulative Fisher information for L-SAC, Soft-SAC, and the conventional constant current condition are approximately 165, 120, and 25, respectively, indicating that the method of the present invention has a higher information output capability under this budget.
[0245] 3. Comparison of identification accuracy (corresponding to Figure 6)
[0246] Figure 6 presents the root mean square error (RMSE) of parameter identification for the different methods. The results in Figure 6 show that the RMSE is 4.32% for the traditional constant current (CC) condition, 5.18% for the standard dynamic condition (FUDS), 1.15% for Soft-SAC, and 0.68% for the method of this invention (L-SAC). These results indicate that, under the current comparative conditions, the method of this invention has the smallest parameter identification error. Compared to Soft-SAC, the method of this invention reduces the RMSE of parameter identification from 1.15% to 0.68%. Compared to the traditional constant current and standard dynamic conditions, its advantage in identification accuracy is larger still.
[0247] In conclusion, Figure 4, Figure 5, and Figure 6 show that, under the current comparative conditions, the method of this invention combines better voltage boundary compliance, higher information output capability, and lower parameter identification error. It can generate input excitation trajectories that balance safety and information within a limited experimental budget, thereby improving the identification of key parameters of lithium-ion batteries.
[0248] Example 2
[0249] This embodiment provides a two-stage parameter identification and stimulus generation system for SPMe lithium batteries based on reinforcement learning.
[0250] A computer-readable storage medium stores a plurality of instructions adapted to be loaded by a processor of a terminal device to execute the aforementioned reinforcement learning-based two-stage parameter identification and stimulus generation method for SPMe lithium batteries.
[0251] A terminal device includes a processor and a computer-readable storage medium. The processor is used to implement the instructions; the computer-readable storage medium is used to store a plurality of instructions adapted to be loaded by the processor to execute the aforementioned reinforcement learning-based two-stage parameter identification and stimulus generation method for SPMe lithium batteries.
[0252] The above are all preferred embodiments of the present invention and are not intended to limit the scope of protection of the present invention. Therefore, all equivalent changes made in accordance with the structure, shape and principle of the present invention should be covered within the scope of protection of the present invention.
Claims
1. A two-stage parameter identification and stimulus generation method for SPMe lithium batteries based on reinforcement learning, characterized in that it includes: obtaining lithium battery parameter information; performing data preprocessing based on the acquired lithium battery parameter information; constructing a two-stage excitation model for SPMe lithium batteries, including construction of a low-fidelity proxy model, a composite state space, safety constraints, and a high-fidelity verification model; carrying out agent interaction training based on the constructed composite state space and safety constraints, including low-fidelity proxy model training, candidate policy parameter set construction, and agent interaction training; and generating current excitation using the trained two-stage excitation model of the SPMe lithium battery.
The high-fidelity verification model construction includes constructing a high-fidelity verification model to compensate for the low-fidelity proxy model's insufficient description of the complex interface dynamics of real batteries. First, a constant-phase element (CPE) is introduced on the SPMe main framework to describe the fractional polarization characteristics exhibited by the porous electrode-electrolyte interface in the mid-to-low frequency region. The terminal voltage in the high-fidelity model is expressed as the SPMe terminal voltage together with $V_{\mathrm{CPE}}$, the additional fractional polarization voltage introduced by the CPE. The frequency-domain impedance of the CPE is expressed as $Z_{\mathrm{CPE}}(j\omega) = \dfrac{1}{Q\,(j\omega)^{\alpha}}$, where $Q$ is the generalized capacitance parameter and $\alpha$ is the fractional order; when $\alpha$ is less than 1, it reflects the diffuse polarization behavior and long memory effect of the real interface. In the time-domain implementation, the Caputo fractional derivative is used to describe the dynamics of the fractional internal state. To address the computational complexity of directly solving fractional derivatives, the memory kernel is discretized using a sum-of-exponentials approximation, which introduces a set of auxiliary states from which the fractional polarization term is approximated; this enhances the physical realism of the model while controlling computational cost.
The training of the low-fidelity proxy model includes employing the Lagrangian-SAC algorithm, which is based on the Lagrange multiplier method and uses parameter identification information as the core of policy optimization. Let the sensitivity Jacobian of the output voltage with respect to the vector of parameters to be identified $\theta$ be $J_k = \partial V_k / \partial \theta$, and let the measurement noise covariance matrix be $R$; then the single-step Fisher information matrix is $\mathcal{F}_k = J_k^{\top} R^{-1} J_k$. The instant reward is defined as the trace of this matrix: $r_k = \operatorname{tr}(\mathcal{F}_k)$. When the object to be identified is a single parameter, the reward degenerates to the normalized squared sensitivity; when it comprises multiple parameters, the reward represents the contribution of the current operating condition to the comprehensive information content of the multi-parameter joint identification. Considering the safety cost vector $c_k$, a Lagrangian maximum-entropy objective is constructed over the Actor policy network, the dual variable vector $\lambda$, the discount factor, the entropy weighting coefficient, and the action distribution entropy.
An Actor network, two Critic networks, and the corresponding target networks are initialized, and an experience replay pool is established. At each sampling time, the Actor receives the current state $s_k$ and outputs the normalized action $a_k$; the actual current $I_k$ is obtained after action mapping. The low-fidelity model is rolled forward under $I_k$ to yield the new state, the voltage output, and the composite state $s_{k+1}$ at the next moment; the instant reward $r_k$ and safety cost $c_k$ are calculated accordingly, and the transition sample $(s_k, a_k, r_k, c_k, s_{k+1})$ is stored in the experience replay pool. A mini-batch is then randomly sampled from the pool to update the Critic networks, the Actor network, and the dual variables respectively. The Lagrange dual variable is updated by gradient ascent: $\lambda_i \leftarrow \left[\lambda_i + \eta_{\lambda}\,(\hat{C}_i - d_i)\right]_{+}$, where $\eta_{\lambda}$ is the dual learning rate, $\hat{C}_i$ is the estimated expected cost of the $i$-th class of constraints, $d_i$ is the upper bound of the expected value allowed by the $i$-th class of constraints, and $[\cdot]_{+}$ denotes projection onto the non-negative interval.
The construction of the candidate policy parameter set includes first freezing the Actor network parameters, Critic network parameters, and Lagrange dual variables; among these, the policy network parameters are what determine the subsequent excitation trajectory. To avoid relying solely on a single training result for high-fidelity validation, a rollout with no or weak exploration noise is first performed in the low-fidelity proxy environment, and performance metrics are calculated, including cumulative reward, cumulative Fisher information, information gain per ampere-hour, and probability of exceeding limits, where a small positive number $\varepsilon$ in the denominator of the information gain per ampere-hour prevents division by zero. A candidate policy set is then formed based on the evaluation results: a policy is included in the candidate set when its probability of exceeding limits in the low-fidelity environment is not higher than a threshold and its information gain per ampere-hour is not lower than a threshold. Among the candidates, the Actor parameters with the best performance in the most recent training rounds are retained and packaged together with the performance description corresponding to each candidate policy to form a candidate information packet, which includes an estimate of the probability of violation in the low-fidelity environment.
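The reward and dual-variable update described in claim 1 can be illustrated with a minimal sketch. It assumes a scalar voltage output, so the Jacobian is a vector of sensitivities and the noise covariance reduces to a single variance; the function names, argument names, and numeric values are hypothetical and are not taken from the patent.

```python
from typing import List, Sequence


def fisher_trace_reward(sensitivities: Sequence[float], noise_variance: float) -> float:
    """Trace of the single-step Fisher matrix J^T R^{-1} J for a scalar output.

    With one measured voltage, J is a row vector and R is a scalar variance,
    so the trace equals sum(J_i^2) / R.
    """
    if noise_variance <= 0.0:
        raise ValueError("noise_variance must be positive")
    return sum(s * s for s in sensitivities) / noise_variance


def dual_ascent_step(
    lambdas: Sequence[float],
    mean_costs: Sequence[float],
    limits: Sequence[float],
    learning_rate: float,
) -> List[float]:
    """Projected gradient ascent on the dual variables.

    Each multiplier grows when its estimated expected cost exceeds the allowed
    bound, shrinks otherwise, and is clipped at zero.
    """
    return [
        max(0.0, lam + learning_rate * (cost - bound))
        for lam, cost, bound in zip(lambdas, mean_costs, limits)
    ]


# Illustrative numbers only.
reward = fisher_trace_reward([2.0, 1.0], noise_variance=0.25)
new_lambdas = dual_ascent_step([0.1, 0.0], [0.5, 0.0], [0.2, 0.1], learning_rate=0.5)
```

With a single sensitivity the first function returns the squared sensitivity divided by the noise variance, which is the single-parameter case the claim mentions. The clipping in the second function is what keeps a multiplier at zero while its constraint is comfortably satisfied.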
2. The two-stage parameter identification and stimulus generation method for SPMe lithium batteries based on reinforcement learning according to claim 1, characterized in that the construction of the low-fidelity proxy model includes building a battery proxy environment with low computational cost and fast rollout, on the premise of physical interpretability. The single-particle model with electrolyte extension (SPMe) is used as the low-fidelity proxy model, in which lithium-ion diffusion within the positive and negative electrode particles is described by Fick's second law in spherical coordinates: $\dfrac{\partial c_s}{\partial t} = \dfrac{D_s}{r^2}\dfrac{\partial}{\partial r}\left(r^2 \dfrac{\partial c_s}{\partial r}\right)$, where $c_s$ is the lithium-ion concentration in the electrode particles, $D_s$ is the solid-phase diffusion coefficient, and $r$ is the coordinate along the particle radius. The boundary conditions are zero flux at the particle center, $\left.\partial c_s / \partial r\right|_{r=0} = 0$, and a surface flux at $r = R_s$ determined by the interfacial reaction flux, where $R_s$ is the particle radius, $j$ is the interfacial reaction flux density, $a_s$ is the specific surface area, and $F$ is the Faraday constant.
To adapt to high-frequency calls in reinforcement learning environments, the solid-phase diffusion model is discretized using finite difference, orthogonal collocation, and moment matching. The electrolyte concentration change is represented using a reduced-order method that retains its influence on the mid-to-low frequency terminal voltage response. The terminal voltage output is written as the sum of the open-circuit voltage difference evaluated at the surface concentrations $c_{ss,p}$ and $c_{ss,n}$ of the positive and negative electrode particles, the reaction overpotentials $\eta_p$ and $\eta_n$ of the two electrodes, the electrolyte concentration polarization term, and the ohmic voltage drop across the equivalent ohmic resistance.
Considering the cost of directly evaluating the Butler–Volmer relation in high-frequency training, a piecewise linear approximation is made near the current operating point to obtain a state-space expression suitable for discrete-time iteration: $x_{k+1} = f_L(x_k, I_k, \theta) + w_k$, $V_k = h_L(x_k, I_k, \theta) + v_k$, where the subscript $L$ indicates the low-fidelity model, $x_k$ is the internal state vector, $\theta$ is the parameter vector to be identified, and $w_k$ and $v_k$ are the process noise and measurement noise. The symbols are as follows: $x$ represents the internal state vector of the model; $\theta$ represents the vector of parameters to be identified; $s$ represents the state vector in the reinforcement learning environment; $a$ represents the normalized action output by the policy network; $r$ represents the immediate reward; $c$ represents the safety cost vector; $\lambda$ represents the Lagrange dual variable vector; $J$ represents the Jacobian, which indicates the sensitivity of the output to the parameters; $\mathcal{F}$ represents the Fisher information matrix; $Q_{\mathrm{Ah}}$ represents the cumulative ampere-hour throughput; and $E$ represents the cumulative energy consumption.
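To make the solid-phase discretization in claim 2 more tangible, the following is a minimal sketch of one explicit time step of spherical diffusion. It uses a conservative finite-volume scheme on equal-width shells, which is a swapped-in simplification rather than the finite-difference, collocation, or moment-matching treatment the claim names. The outward surface flux is passed in directly as an assumed input, and all names and numbers are hypothetical.

```python
import math
from typing import List, Sequence


def shell_volumes(radius: float, n: int) -> List[float]:
    """Volumes of n concentric shells of equal thickness."""
    dr = radius / n
    return [
        4.0 / 3.0 * math.pi * (((i + 1) * dr) ** 3 - (i * dr) ** 3)
        for i in range(n)
    ]


def diffusion_step(
    conc: Sequence[float],
    diff_coeff: float,
    radius: float,
    surface_flux: float,
    dt: float,
) -> List[float]:
    """One explicit-Euler step; surface_flux is the outward molar flux at r = R_s."""
    n = len(conc)
    dr = radius / n
    vols = shell_volumes(radius, n)
    # Outward flux through each face; face 0 is the particle centre (zero flux).
    face_flux = [0.0] * (n + 1)
    for i in range(1, n):
        face_flux[i] = -diff_coeff * (conc[i] - conc[i - 1]) / dr
    face_flux[n] = surface_flux
    new_conc = []
    for i in range(n):
        area_in = 4.0 * math.pi * (i * dr) ** 2
        area_out = 4.0 * math.pi * ((i + 1) * dr) ** 2
        net_out = area_out * face_flux[i + 1] - area_in * face_flux[i]
        new_conc.append(conc[i] - dt * net_out / vols[i])
    return new_conc


def total_lithium(conc: Sequence[float], radius: float) -> float:
    """Total amount of lithium in the particle (concentration times shell volume)."""
    return sum(c * v for c, v in zip(conc, shell_volumes(radius, len(conc))))
```

Because the scheme is written in flux form, the lithium content changes only through the surface face, which gives a simple sanity check on any rollout. Explicit Euler needs a small time step relative to the shell thickness squared divided by the diffusion coefficient, so a production proxy model would more likely use an implicit or reduced-order form.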
3. The two-stage parameter identification and stimulus generation method for SPMe lithium batteries based on reinforcement learning according to claim 2, characterized in that the construction of the composite state space involves transforming the battery excitation design problem into a constrained Markov decision process and constructing a composite state vector for direct use by the reinforcement learning networks. By explicitly introducing information sensitivity and budget features into the state, an active decision-making capability for the parameter identification task is formed.
The state vector is first defined as $s_k = \left[\mathrm{SOC}_k,\ V_k,\ I_{k-1},\ \tilde{Q}_k,\ \tilde{E}_k,\ d_{V,k},\ d_{\mathrm{SOC},k},\ \tilde{T}_k,\ \phi_k\right]$, where $\mathrm{SOC}_k$ is the current state of charge; $V_k$ is the current terminal voltage; $I_{k-1}$ is the input current at the previous moment, used to constrain the rate of change of the action at the current moment; $\tilde{Q}_k$ and $\tilde{E}_k$ are the normalized cumulative ampere-hour consumption and cumulative energy consumption, respectively; $d_{V,k}$ and $d_{\mathrm{SOC},k}$ are the normalized distances from the state to the voltage boundary and the SOC boundary, respectively; $\tilde{T}_k$ is the normalized temperature state; and $\phi_k$ is the information sensitivity state.
The information sensitivity state is defined as the squared norm of the first-order sensitivity of the output voltage with respect to the parameters to be identified: $\phi_k = \left\lVert \partial V_k / \partial \theta \right\rVert_2^2$. When the object to be identified is a single parameter, it degenerates into $\phi_k = \left(\partial V_k / \partial \theta\right)^2$; when the object to be identified has multiple parameters, $\phi_k$ reflects the comprehensive identifiable strength of all target parameters under the current operating conditions. The budget features are composed of the cumulative ampere-hour throughput and the cumulative energy consumption in normalized form, and the boundary distance states are defined for the voltage boundary and the SOC boundary. Finally, the control current is generated from the normalized action through action mapping, using a saturation relationship based on an amplitude limiting function.
4. The two-stage parameter identification and stimulus generation method for SPMe lithium batteries based on reinforcement learning according to claim 3, characterized in that the construction of the safety constraints includes transforming the multidimensional physical boundaries, budget boundaries, and device boundaries during battery operation into constraints and safety cost functions that can be handled by reinforcement learning. The safety constraints include full-cycle budget boundaries for ampere-hour throughput and energy consumption. A battery operation constraint domain is defined; the degree of violation of each constraint is then mapped to a non-negative safety cost component, and the safety cost vector is defined from the following components: the voltage cost; the SOC cost; the ampere-hour budget cost; the energy budget cost; the current rate-of-change cost; and the temperature cost. Finally, to fully reflect budget expenditure, the budget variables are updated recursively, which yields the cost signals.
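One way to picture the mapping from constraint violations to non-negative cost components in claim 4 is a hinge penalty on each boundary. The sketch below is an assumption: the patent's own cost expressions are not reproduced in this text, so the hinge form, the `Limits` fields, and every numeric value are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Limits:
    """Hypothetical operating boundaries and budgets."""
    v_min: float
    v_max: float
    soc_min: float
    soc_max: float
    ah_max: float
    wh_max: float
    di_max: float
    t_max: float


def hinge(x: float) -> float:
    """Zero inside the boundary, linear in the amount of violation outside it."""
    return max(0.0, x)


def update_budget(
    ah_used: float, wh_used: float, current: float, voltage: float, dt_s: float
) -> Tuple[float, float]:
    """Recursive budget update over one sampling interval of dt_s seconds."""
    hours = dt_s / 3600.0
    return ah_used + abs(current) * hours, wh_used + abs(current * voltage) * hours


def safety_costs(
    v: float, soc: float, ah_used: float, wh_used: float,
    i_now: float, i_prev: float, temp: float, lim: Limits,
) -> Dict[str, float]:
    """Assumed hinge-style cost vector, one component per constraint class."""
    return {
        "voltage": hinge(v - lim.v_max) + hinge(lim.v_min - v),
        "soc": hinge(soc - lim.soc_max) + hinge(lim.soc_min - soc),
        "ah": hinge(ah_used - lim.ah_max),
        "energy": hinge(wh_used - lim.wh_max),
        "rate": hinge(abs(i_now - i_prev) - lim.di_max),
        "temperature": hinge(temp - lim.t_max),
    }


# Illustrative limits and operating points only.
lim = Limits(2.5, 4.2, 0.1, 0.9, 0.5, 2.0, 1.0, 45.0)
safe = safety_costs(3.7, 0.5, 0.1, 0.4, 1.0, 0.8, 25.0, lim)
unsafe = safety_costs(4.25, 0.5, 0.6, 0.4, 3.0, 0.5, 50.0, lim)
```

Every component is zero while the operating point stays inside its boundary, so the dual variables in the Lagrangian update only respond to actual violations. Other penalty shapes (squared or indicator costs) would fit the same interface.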
5. The two-stage parameter identification and stimulus generation method for SPMe lithium batteries based on reinforcement learning according to claim 4, characterized in that the agent interaction training includes, to improve the robustness of the policy to model uncertainty, first applying bounded random perturbations to the parameters to be identified during the fast training phase, with the perturbation drawn from a uniform distribution whose range is set by the parameter perturbation amplitude. After the fast training reaches a preset number of rounds, the current policy parameters are frozen and deployed to the high-fidelity model for offline rollout verification. During verification, the initial SOC, initial temperature, budget upper limits, and boundary parameters are kept consistent with those of the training phase. The voltage trajectory, SOC trajectory, temperature trajectory, ampere-hour consumption, energy consumption, and Fisher information accumulation under each candidate policy are recorded over the entire cycle, and the information gain per unit ampere-hour and the information gain per unit energy of the policy are calculated.
To ensure that the final policy can be deployed directly in the actual system, a zero-tolerance screening criterion is adopted. If a candidate policy exceeds any hard constraint limit at any point during the high-fidelity offline verification, it is immediately removed from the candidate set and does not enter the final feasible set. Only a policy that satisfies all hard boundaries throughout the entire verification cycle is considered feasible and included in the feasible set. The feasible set $\Theta_{\mathrm{feas}}$ is the set of policies retained after high-fidelity offline verification; its elements $\theta_{\pi,i}$ are the policy network parameters of the $i$-th candidate policy, where the subscript $\pi$ indicates that the parameters belong to the policy network, the subscript $i$ numbers the candidate policies, $N$ is the total number of feasible policies that passed high-fidelity verification and were retained, and a superscript marks the parameters as the candidate policy parameters frozen and exported on entering the current high-fidelity screening stage.
The optimal policy is then selected from the feasible set according to the principle of maximizing information gain per unit ampere-hour: $\theta_{\pi}^{\star} = \arg\max_{\theta \in \Theta_{\mathrm{feas}}} \eta_{\mathrm{Ah}}(\theta)$, where $\theta_{\pi}^{\star}$ represents the network parameters of the finally selected optimal policy, $\arg\max$ denotes searching the feasible set $\Theta_{\mathrm{feas}}$ for the parameters that maximize the objective function, and $\eta_{\mathrm{Ah}}$ represents the information gain index per unit ampere-hour.
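The zero-tolerance screening and the final selection in claim 5 can be sketched as a filter followed by a maximization. The data layout below (a dictionary per candidate with recorded trajectories), the field names, and the sample numbers are hypothetical; the small `eps` mirrors the small positive number the claims use to guard the denominator.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Limits = Dict[str, Tuple[float, float]]


def within(values: Sequence[float], low: float, high: float) -> bool:
    """True only if every sample stays inside [low, high]."""
    return all(low <= x <= high for x in values)


def is_feasible(candidate: dict, limits: Limits) -> bool:
    """Zero tolerance: a single out-of-bounds sample rejects the candidate."""
    return all(within(candidate[key], *limits[key]) for key in limits)


def info_per_ah(candidate: dict, eps: float = 1e-9) -> float:
    """Cumulative Fisher information divided by ampere-hour consumption."""
    return candidate["fisher_total"] / (candidate["ah_used"] + eps)


def select_policy(candidates: List[dict], limits: Limits) -> Optional[dict]:
    """Return the feasible candidate with the highest information gain per Ah."""
    feasible = [c for c in candidates if is_feasible(c, limits)]
    if not feasible:
        return None
    return max(feasible, key=info_per_ah)


# Illustrative verification records only.
limits = {"voltage": (2.5, 4.2), "soc": (0.1, 0.9), "temperature": (0.0, 45.0)}
candidates = [
    {"name": "A", "voltage": [3.6, 4.1], "soc": [0.5, 0.6],
     "temperature": [25.0, 30.0], "fisher_total": 100.0, "ah_used": 0.5},
    {"name": "B", "voltage": [3.6, 4.25], "soc": [0.5, 0.6],
     "temperature": [25.0, 30.0], "fisher_total": 300.0, "ah_used": 0.5},
    {"name": "C", "voltage": [3.5, 4.0], "soc": [0.4, 0.6],
     "temperature": [25.0, 32.0], "fisher_total": 120.0, "ah_used": 0.4},
]
best = select_policy(candidates, limits)
```

Candidate B has the largest Fisher information but is discarded for a single voltage excursion, which is the behavior the zero-tolerance criterion calls for. Returning `None` when nothing survives makes the empty feasible set explicit instead of silently falling back to an unsafe policy.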
6. The two-stage parameter identification and stimulus generation method for SPMe lithium batteries based on reinforcement learning according to claim 5, characterized in that the process of generating current excitation using the trained two-stage excitation model for SPMe lithium batteries includes reading the optimal policy parameters and performing a complete rollout without exploration noise to obtain the optimal input excitation sequence over the target time horizon. To improve deployment safety, the voltage upper limit used in the training phase is tightened to an engineering protection upper limit, and the temperature upper limit is correspondingly tightened, forming a dual-layer protection structure for training and execution.
The final input excitation sequence is sent to the battery testing platform or embedded BMS for execution via CAN bus, Ethernet, or serial port. During execution, voltage, current, temperature, estimated SOC, and budgeted energy consumption are collected continuously in real time, and the system checks whether any secondary safety protection condition is triggered; the excitation output is terminated immediately when any such condition is met.
After the entire excitation has been executed, the data from the whole process is organized into a parameter identification dataset $\mathcal{D} = \left\{\left(t_k,\ I_k^{\star},\ V_k,\ T_k,\ \widehat{\mathrm{SOC}}_k\right)\right\}_{k \le K}$, where $\mathcal{D}$ represents the experimental dataset used for subsequent parameter identification calculations; $t_k$ is the timestamp of the $k$-th sampling moment; $I_k^{\star}$ is the optimal input excitation current actually applied to the battery at the $k$-th sampling time, the superscript indicating that the current was generated by the finally selected optimal policy; $V_k$ is the battery terminal voltage measured at the $k$-th sampling time; $T_k$ is the battery temperature measured at the $k$-th sampling time; $\widehat{\mathrm{SOC}}_k$ is the estimated state of charge at the $k$-th sampling time; and $K$ is the final sampling index within this excitation execution cycle.
7. A two-stage parameter identification and stimulus generation system for SPMe lithium batteries based on reinforcement learning, executing the two-stage parameter identification and stimulus generation method for SPMe lithium batteries based on reinforcement learning as described in claim 1, characterized in that it includes: a data acquisition module configured to acquire lithium battery parameter information; a preprocessing module configured to perform data preprocessing based on the acquired lithium battery parameter information; a model building module configured to build a two-stage excitation model for SPMe lithium batteries, including low-fidelity proxy model building, composite state space building, safety constraint building, and high-fidelity verification model building; a model training module configured to carry out agent interaction training based on the constructed composite state space and safety constraints, including low-fidelity proxy model training, candidate policy parameter set construction, and agent interaction training; and an excitation module configured to generate current excitation using the trained two-stage excitation model for SPMe lithium batteries.