Battery soc online estimation method based on deep reinforcement learning and related device

CN121522465B · Active · Publication Date: 2026-06-26 · JIANGSU ZENIO NEW ENERGY BATTERY TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: JIANGSU ZENIO NEW ENERGY BATTERY TECH CO LTD
Filing Date: 2025-09-12
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing SOC estimation methods, such as the ampere-hour integration method, heavily rely on the accuracy of the initial SOC value, leading to increased cumulative errors and making it difficult to achieve accurate state of charge estimation in fields such as electric vehicles.

Method used

A deep reinforcement learning-based approach is adopted, using a dual reinforcement learning model that combines a CNN-BiLSTM network and a dual deep Q network. By utilizing historical state parameters such as voltage, current, and temperature, an experience pool is constructed, and the model is trained by prioritizing sampling through the PER experience replay mechanism to output the current SOC of the battery.

Benefits of technology

It achieves accurate estimation of battery state of charge, reduces dependence on initial SOC value, enhances adaptability and robustness under different conditions, avoids error accumulation, and improves the adaptive capability of battery management.

✦ Generated by Eureka AI based on patent content.

Abstract

The application provides a battery SOC online estimation method based on deep reinforcement learning, and related equipment. Historical state parameters of a target battery are obtained. An action space corresponding to the historical state parameters is defined, in which each action represents an SOC adjustment amplitude of the target battery. A dual reinforcement learning model combining a CNN-BiLSTM network and a dual deep Q-network is constructed, and experience data from interaction with the environment is generated by the model to construct an experience pool. The model is trained cyclically, with priority sampling through the PER (prioritized experience replay) mechanism, until training reaches a preset condition. A current state parameter of the target battery is then obtained and input into the trained model, which outputs the current SOC of the target battery. With the method provided in the application, the state of charge of the battery can be accurately predicted by means of deep reinforcement learning.

Description

Technical Field

[0001] This application relates to the field of battery technology, and in particular to an online battery SOC estimation method and related equipment based on deep reinforcement learning.

Background Technology

[0002] The State of Charge (SOC) of a battery refers to the percentage of its remaining charge at a specific point in time. Accurately estimating a battery's SOC is crucial for many applications, particularly in electric vehicles, energy storage systems, and portable electronic devices. Precise SOC estimation not only helps optimize battery efficiency and extend its lifespan but also ensures system safety and reliability. For example, in electric vehicles, accurate SOC estimation can prevent overcharging or over-discharging, thus avoiding potential safety risks and increasing the vehicle's driving range.

[0003] Currently, one of the widely used methods for SOC estimation is the ampere-hour integration method (Coulomb Counting). This method calculates the battery's charging or discharging capacity by measuring the current flowing through the battery and integrating it over time, thereby estimating the battery's SOC. However, the ampere-hour integration method heavily relies on an accurate initial SOC value. If the initial SOC value is inaccurate, even if subsequent current measurements are very precise, the accumulated error will gradually increase, leading to deviations in the entire SOC estimation process.

Summary of the Invention

[0004] The technical problem this application aims to solve is to provide an online battery SOC estimation method and related equipment based on deep reinforcement learning, which can accurately predict the battery's state of charge. The specific solution is as follows:

[0005] An online battery SOC estimation method based on deep reinforcement learning includes:

[0006] Obtain historical state parameters of the target battery, including at least voltage, current, temperature, and the SOC label corresponding to the historical state parameters;

[0007] Define the action space corresponding to the historical state parameters. The action space includes multiple action types, each of which represents the adjustment range of the SOC of the target battery. The action types include decreasing SOC, keeping SOC unchanged, or increasing SOC.

[0008] A dual reinforcement learning model combining a CNN-BiLSTM network and a dual deep Q-network is constructed, wherein the first CNN-BiLSTM network in the dual reinforcement learning model is used to select actions in the action space and update parameters, and the second CNN-BiLSTM network is used to calculate the target Q-value.

[0009] Based on the dual reinforcement learning model, experience data on interaction with the environment is generated to construct an experience pool;

[0010] The dual reinforcement learning model is trained cyclically by prioritizing sampling through the PER experience replay mechanism until the training of the dual reinforcement learning model reaches the preset conditions.

[0011] The current state parameters of the target battery are obtained and input into the trained dual reinforcement learning model to output the current SOC of the target battery. The current state parameters do not include the current SOC.

[0012] Optionally, the CNN-BiLSTM network described above includes a CNN network, a BiLSTM network, and a fully connected layer; the CNN network is used to extract local spatial features of the parameters other than the SOC label in the historical state parameters; the BiLSTM network is used to extract temporal features of the parameters other than the SOC label in the historical state parameters; and the fully connected layer is used to map the local spatial features and temporal features of the historical state parameters to the action space.

[0013] Optionally, in the above method, the CNN network and the BiLSTM network are connected in series, and the output of the CNN network is used as the input of the BiLSTM network.

[0014] Optionally, in the above method, the step of generating experiential data on interaction with the environment based on the dual reinforcement learning model to construct an experience pool includes:

[0015] Obtain the current state parameters of the target battery, and use the first CNN-BiLSTM network to select the current action corresponding to the current state parameters;

[0016] The next state parameters of the target battery after performing the current action are obtained, and experience data on interaction with the environment is generated; wherein, the experience data includes the current state parameters of the target battery, the current action of the target battery, the reward value, the next state parameters of the target battery, and the completion flag;

[0017] The experience data is stored in the experience pool.

[0018] Optionally, during the training of the dual reinforcement learning model, an ε-greedy strategy is used to randomly select actions; after the dual reinforcement learning model is trained, an ε-greedy strategy is used to select the optimal action.

[0019] Optionally, after obtaining the next state parameter of the target battery after performing the current action, the above method further includes:

[0020] The current Q value is determined based on the current state parameter, the current action, and the first CNN-BiLSTM network, and the next action corresponding to the next state parameter is selected based on the first CNN-BiLSTM network.

[0021] The target Q-value is determined based on the next state parameter, the next action, and the second CNN-BiLSTM network;

[0022] The model parameters of the first CNN-BiLSTM network are updated based on the current Q value and the target Q value.

[0023] Optionally, the above method, which prioritizes sampling and iteratively trains the model using the PER experience replay mechanism, includes:

[0024] Determine the TD error of each piece of experience data stored in the experience pool, and determine the priority of each piece of experience data based on the TD error;

[0025] The sampling probability of each piece of experience data is determined based on its priority.

[0026] The experience pool is sampled according to the sampling probability, and the weight of each sampled piece of experience data is calculated.

[0027] The loss function value is determined based on the sampled experience data, and the loss function value is weighted by the weights.

[0028] The model parameters of the first CNN-BiLSTM network are updated using the weighted loss function value, and the process returns to the step of determining the TD error of each piece of experience data stored in the experience pool.

[0029] Upon reaching a preset time, the model parameters in the first CNN-BiLSTM network are copied to the second CNN-BiLSTM network.
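The prioritized sampling steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the applicant's implementation: the exponents `alpha` and `beta`, the normalisation of weights by the largest weight in the batch, the squared-error loss, and the example TD errors are all assumptions borrowed from common prioritized experience replay practice and are not stated in the text.

```python
import random

def priorities_from_td_errors(td_errors, eps=1e-6):
    """Priority p_i = |delta_i| + eps, so no experience has zero priority."""
    return [abs(d) + eps for d in td_errors]

def sampling_probabilities(priorities, alpha=0.6):
    """P(i) = p_i^alpha / sum_k p_k^alpha (alpha = 0 would be uniform sampling)."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]

def importance_weights(probs, indices, beta=0.4):
    """w_i = (N * P(i))^(-beta), divided by the largest weight in the batch."""
    n = len(probs)
    raw = [(n * probs[i]) ** (-beta) for i in indices]
    largest = max(raw)
    return [w / largest for w in raw]

def sample_batch(td_errors, batch_size, rng, alpha=0.6, beta=0.4):
    """Draw indices according to priority and return them with their weights."""
    probs = sampling_probabilities(priorities_from_td_errors(td_errors), alpha)
    indices = rng.choices(range(len(probs)), weights=probs, k=batch_size)
    return indices, importance_weights(probs, indices, beta)

def weighted_loss(td_errors, indices, weights):
    """Mean of w_i * delta_i^2 over the sampled batch (assumed loss form)."""
    return sum(w * td_errors[i] ** 2 for i, w in zip(indices, weights)) / len(indices)

td = [0.5, 0.1, 0.02, 1.2]   # hypothetical TD errors of four stored experiences
idx, w = sample_batch(td, batch_size=3, rng=random.Random(0))
loss = weighted_loss(td, idx, w)
```

Experiences with a large TD error are drawn more often, and the weights scale their loss terms down so that the frequent draws do not dominate the parameter update.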

[0030] An online battery SOC estimation device based on deep reinforcement learning, comprising:

[0031] The acquisition unit is used to acquire historical state parameters of the target battery, wherein the historical state parameters include at least voltage, current, temperature, and the SOC label corresponding to the historical state parameters;

[0032] An execution unit is used to define the action space corresponding to the historical state parameters. The action space includes multiple action types, each of which represents the adjustment range of the SOC of the target battery. The action types include decreasing SOC, keeping SOC unchanged, or increasing SOC.

[0033] The first building unit is used to build a dual reinforcement learning model that combines a CNN-BiLSTM network and a dual deep Q network. In the dual reinforcement learning model, the first CNN-BiLSTM network is used to select actions in the action space and update parameters, and the second CNN-BiLSTM network is used to calculate the target Q value.

[0034] The second construction unit is used to generate experiential data on interaction with the environment based on the dual reinforcement learning model, so as to construct an experience pool;

[0035] The training unit is used to preferentially sample and cyclically train the dual reinforcement learning model through the PER experience replay mechanism until the dual reinforcement learning model training reaches the preset conditions.

[0036] An online estimation unit is used to obtain the current state parameters of the target battery, input them into the trained dual reinforcement learning model, and output the current SOC of the target battery. The current state parameters do not include the current SOC.

[0037] A storage medium comprising stored instructions, wherein, when the instructions are executed, the device in which the storage medium resides executes the online battery SOC estimation method based on deep reinforcement learning as described above.

[0038] An electronic device includes a memory and one or more instructions, wherein the one or more instructions are stored in the memory and are configured to be executed by one or more processors to perform the online battery SOC estimation method based on deep reinforcement learning as described above.

[0039] Based on the above, this application provides an online battery SOC estimation method and related equipment based on deep reinforcement learning. The method includes: acquiring historical state parameters of the target battery, the historical state parameters including at least voltage, current, temperature, and SOC labels corresponding to the historical state parameters; defining an action space corresponding to the historical state parameters, the action space including multiple action types, each action representing the adjustment magnitude of the SOC of the target battery, and the action types including decreasing SOC, keeping SOC unchanged, or increasing SOC; constructing a dual reinforcement learning model combining a CNN-BiLSTM network and a dual deep Q-network, in which a first CNN-BiLSTM network is used to select actions in the action space and update parameters, and a second CNN-BiLSTM network is used to calculate the target Q-value; generating experience data on interaction with the environment based on the dual reinforcement learning model to construct an experience pool; training the dual reinforcement learning model cyclically, with priority sampling through the PER (prioritized experience replay) mechanism, until the training meets preset conditions; and obtaining the current state parameters of the target battery, inputting them into the trained dual reinforcement learning model, and outputting the current state of charge (SOC) of the target battery, where the current state parameters do not include the current SOC. By applying the method provided in this application, the state of charge of a battery can be accurately predicted based on the dual reinforcement learning model.

Attached Figure Description

[0040] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of this application. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.

[0041] Figure 1 is a flowchart of an online battery SOC estimation method based on deep reinforcement learning provided in this application;

[0042] Figure 2 is a schematic diagram of the structure of a first CNN-BiLSTM network provided in this application;

[0043] Figure 3 is a schematic diagram of the data processing flow of an agent provided in this application;

[0044] Figure 4 is a flowchart of the training process of an online network provided in this application;

[0045] Figure 5 is a schematic diagram of the learning result of a neural network provided in this application;

[0046] Figure 6 is a schematic diagram of the performance test results of a neural network provided in this application;

[0047] Figure 7 is a schematic diagram of the structure of an online battery SOC estimation device based on deep reinforcement learning provided in this application;

[0048] Figure 8 is a schematic diagram of the structure of an electronic device provided in this application.

Detailed Implementation

[0049] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0050] In this application, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0051] This application provides an online battery SOC estimation method based on deep reinforcement learning. The method can be applied to electronic devices such as computers, tablets, smartphones, and smart wearable devices. As shown in Figure 1, the method specifically includes:

[0052] S101: Obtain the historical state parameters of the target battery. The historical state parameters include at least voltage, current, temperature, and the corresponding state of charge (SOC) label.

[0053] In this embodiment, historical state parameters may also include one or more parameters such as power, internal resistance, and equivalent cycle count. By integrating the advantages of multiple influencing factors, the accuracy, robustness, and adaptability of SOC estimation for the battery under dynamic, complex, and aging conditions are improved.

[0054] S102: Define the action space corresponding to the historical state parameters. The action space includes multiple action types. Each action is used to represent the adjustment range of the target battery's SOC. The action types include decreasing SOC, keeping SOC unchanged, or increasing SOC.

[0055] In this embodiment, each action type in the action space can include different adjustment values to effectively cover the needs of dynamic battery adjustment, for example: decreasing SOC by x1, decreasing SOC by x2, ..., keeping SOC unchanged, increasing SOC by y1, increasing SOC by y2, ... The adjustment value for each action is discrete. Discretizing the action space and fixing the action pattern simplifies the decision space, reduces model complexity, improves inference efficiency, and reduces extreme predictions by the model on unseen data.
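As a minimal sketch of such a discrete action space, the Python below maps each action index to a fixed SOC adjustment. The specific step sizes in `ACTION_DELTAS` and the clamping to the 0 to 100 range are illustrative assumptions; the text only states that the adjustment values are discrete and cover decreasing, holding, and increasing the SOC.

```python
# Hypothetical adjustment steps in percentage points of SOC; the source only
# states that the values are discrete (decrease / keep unchanged / increase).
ACTION_DELTAS = (-1.0, -0.5, -0.1, 0.0, 0.1, 0.5, 1.0)

def apply_action(soc_estimate, action_index):
    """Return the SOC estimate after applying one discrete adjustment."""
    new_soc = soc_estimate + ACTION_DELTAS[action_index]
    # Clamp to the physically meaningful range (an added safeguard,
    # not something the source describes).
    return min(100.0, max(0.0, new_soc))

print(apply_action(50.0, 0))   # decrease by 1 point -> 49.0
print(apply_action(99.8, 6))   # increase is clamped -> 100.0
```

Because the network only has to rank a handful of fixed adjustments, its output layer needs one Q-value per entry in the table.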

[0056] S103: Construct a dual reinforcement learning model that combines a CNN-BiLSTM network and a dual deep Q-network. In the dual reinforcement learning model, the first CNN-BiLSTM network is used to select actions in the action space and update parameters, and the second CNN-BiLSTM network is used to calculate the target Q-value.

[0057] In this embodiment, the first CNN-BiLSTM network in the dual reinforcement learning model can be the online network in the DDQN algorithm, which can be represented as $Q(s, a; \theta)$, where $\theta$ denotes the network parameters of the first CNN-BiLSTM network; $a$ represents the action, which is the adjustment value of SOC, taken from the action space; and $s$ is the current state, which includes battery state parameters such as voltage, current, temperature, power, internal resistance, and equivalent cycle number.

[0058] Optionally, the second CNN-BiLSTM network can be the target network in the DDQN algorithm, represented as $Q(s, a; \theta^{-})$, where $\theta^{-}$ represents the network parameters of the second CNN-BiLSTM network.

[0059] S104: Generate experiential data on interaction with the environment based on a dual reinforcement learning model to build an experience pool.

[0060] In this embodiment, the current battery state parameters in the environment are input into the dual reinforcement learning model, which selects an action. After the environment executes the action, a reward value and the next battery state parameters of the environment are obtained, thus completing the interaction with the environment. The experience data may include the current battery state parameters, the action selected by the dual reinforcement learning model, and the next battery state parameters.

[0061] Optionally, the experience pool stores multiple experience data.

[0062] S105: Prioritize sampling through the PER experience replay mechanism, and cyclically train the dual reinforcement learning model until the dual reinforcement learning model training reaches the preset conditions.

[0063] In this embodiment, the PER experience replay mechanism is used for priority sampling, and a dual reinforcement learning model is trained based on the experience data sampled from the experience pool.

[0064] In this embodiment, the preset condition may be that the loss function converges or that the number of training iterations of the dual reinforcement learning model is greater than a training iteration threshold.

[0065] S106: Obtain the current state parameters of the target battery, input them into the trained dual reinforcement learning model, and output the current SOC of the target battery. The current state parameters do not include the current SOC.

[0066] In this embodiment, once training of the dual reinforcement learning model has reached the preset conditions, the model can be used to estimate the battery's state of charge (SOC) online. Specifically, the current state parameters of the target battery are input into the dual reinforcement learning model to obtain the current SOC of the target battery. It should be noted that the current state parameters do not include the current SOC itself; the SOC is instead estimated from other measurement data related to the battery state. Because the SOC is not used directly as input, dependence on the initial SOC is reduced, and adaptability to different battery types and usage conditions is enhanced. In practical applications, accurately obtaining the initial SOC is often difficult. Not using the initial SOC as input keeps the model from falling into a circular dependency, which enhances the model's adaptability and robustness and prevents error accumulation.

[0067] In some embodiments, after obtaining the current SOC of the target battery, at least one operation such as optimizing the charging and discharging strategy, fault diagnosis, and energy management of the target battery can be performed using the current SOC of the target battery.

[0068] By applying the method provided in the embodiments of this application, the state of charge of the battery can be accurately identified based on the dual reinforcement learning model, thereby enhancing the adaptability and robustness of SOC estimation under different initial SOC conditions.

[0069] In one embodiment provided in this application, based on the above scheme, optionally, the CNN-BiLSTM network includes a CNN network, a BiLSTM network, and a fully connected layer; the CNN network is used to extract the local spatial features of the parameters other than the SOC label in the historical state parameters; the BiLSTM network is used to extract the temporal features of the parameters other than the SOC label in the historical state parameters; and the fully connected layer is used to map the local spatial features and temporal features of the historical state parameters to the action space.

[0070] In this embodiment, the CNN network includes multiple convolutional layers, such as a first convolutional layer and a second convolutional layer. The parameters in the historical state parameters, excluding the SOC label, are first extracted via the first convolutional layer, followed by a batch normalization layer (BatchNorm1d, bn1) to normalize the data output from the convolution, maintaining the stability of the output data distribution during training; then, a non-linear activation function, ReLU, is applied. The second convolutional layer and its connected batch normalization layer and activation function are arranged sequentially in the same manner. Further, a dropout layer is introduced after each convolutional layer and its corresponding activation function to suppress overfitting by randomly masking some neuron connections during forward propagation with a preset probability, where the preset probability is a configurable hyperparameter. Finally, the local spatial features of the parameters in the historical state parameters, excluding the SOC label, are obtained.

[0071] Optionally, the BiLSTM network includes a bidirectional long short-term memory (LSTM) network layer, with a configurable number of hidden units $H_{lstm}$ and a configurable number of layers $n_{layers}$. The output of the bidirectional LSTM network is the final hidden state, which is formed by concatenating the forward hidden state $h_{forward}$ and the backward hidden state $h_{backward}$ along the feature dimension. Through this structure, the module can effectively capture the long-term temporal dependencies in historical state parameters and simultaneously utilize historical and future contextual information to achieve accurate prediction of SOC. Thus, by using multiple historical time steps as input, combined with the selective memory function of the LSTM structure, the model can balance the use of short-term and long-term information for comprehensive SOC estimation. Furthermore, the model does not directly use SOC as input, forcing it to learn and infer SOC from measurement data, avoiding circular dependencies and error accumulation.

[0072] In this embodiment, a fully connected layer maps a high-dimensional feature vector that integrates spatial and temporal features to a predefined action space and outputs the state-action value function Q-value for each action. In this reinforcement learning framework, the training objective is to maximize the long-term cumulative reward. For any state s, the agent is configured to select action a based on the current Q-value, which is used to evaluate the expected cumulative reward obtained by performing action a in state s. Through iterative training, the agent gradually learns to select actions that maximize the Q-value, thereby determining the optimal action in a given state and ultimately obtaining a battery management strategy that maximizes the cumulative reward.

[0073] In one embodiment provided in this application, based on the above scheme, optionally, the CNN network and the BiLSTM network are connected in series, and the output of the CNN network is used as the input of the BiLSTM network.

[0074] In this embodiment, as shown in Figure 2, battery state parameters from the historical states, such as voltage, current, temperature, and power, can be input into the CNN-BiLSTM network. The CNN portion of the network then extracts features from these historical state parameters to obtain local spatial features. Specifically, the input data dimension is set to $D_{in} = B \times L \times F$, where $B$ represents the batch size, $L$ represents the history length, and $F$ represents the number of features (such as voltage, current, and temperature). The CNN network first processes the data through two convolutional layers, Conv1 and Conv2, and the output data dimension is $B \times C_{out2} \times L$, where $C_{out2}$ is the number of output channels of the second convolutional layer.

[0075] Optionally, after the CNN network portion of the CNN-BiLSTM network extracts spatial features, these features are then input into the BiLSTM network within the CNN-BiLSTM network to obtain temporal features. The input data dimension of the BiLSTM network portion is $B \times L \times C_{out2}$, and the output dimension is $B \times (H_{lstm} \times 2)$.

[0076] Furthermore, the output of the BiLSTM network is used as the input to the fully connected layer, and the output dimension of the fully connected layer is $B \times |A|$, where $|A|$ is the cardinality of the action space $A$, which includes, but is not limited to, multiple discrete actions such as decreasing SOC, keeping SOC constant, and increasing SOC. By fusing the local spatial features extracted by the convolutional neural network with the bidirectional temporal dependency features captured by the bidirectional long short-term memory network, the effective information in the historical state sequence can be fully extracted, significantly improving the estimation accuracy of SOC and the reliability of policy generation.
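The dimension bookkeeping of the three stages can be checked with a small helper. The sketch below only tracks tensor shapes and does not build a network. It assumes that both convolutions use length-preserving padding (so the sequence length stays L) and that the fully connected layer emits one Q-value per action; the concrete sizes in the usage line are hypothetical.

```python
def trace_shapes(batch, history_len, n_features, c_out2, h_lstm, n_actions):
    """Return the tensor shape after each stage of a CNN-BiLSTM Q-network."""
    shapes = {}
    shapes["input"] = (batch, history_len, n_features)      # B x L x F
    # The 1-D convolutions treat features as channels; length-preserving
    # padding is assumed, so the sequence length remains L.
    shapes["cnn_out"] = (batch, c_out2, history_len)        # B x C_out2 x L
    # Channels and time are swapped before the recurrent layer.
    shapes["bilstm_in"] = (batch, history_len, c_out2)      # B x L x C_out2
    # Final forward and backward hidden states are concatenated.
    shapes["bilstm_out"] = (batch, 2 * h_lstm)              # B x (H_lstm x 2)
    # One Q-value per discrete action.
    shapes["q_values"] = (batch, n_actions)                 # B x |A|
    return shapes

for stage, shape in trace_shapes(32, 20, 3, 64, 128, 7).items():
    print(f"{stage:>10}: {shape}")
```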

[0077] In one embodiment provided in this application, based on the above-described scheme, optionally, experience data on interaction with the environment is generated according to a dual reinforcement learning model to construct an experience pool, including:

[0078] Obtain the current state parameters of the target battery, and use the first CNN-BiLSTM network to select the current action corresponding to the current state parameters;

[0079] Obtain the next state parameters of the target battery after it performs the current action, and generate experience data on interaction with the environment; wherein, the experience data includes the current state parameters of the target battery, the current action of the target battery, the reward value, the next state parameters of the target battery, and the completion flag;

[0080] Experience data is stored in the experience pool.

[0081] In this embodiment, by executing the current action, the current state of charge (SOC) of the battery can be increased, maintained, or decreased to obtain a processed SOC. This processed SOC is then used as the new estimated SOC of the target battery. In some embodiments, an adjustment value in the current action can be obtained, and the adjusted SOC of the target battery can be calculated based on the adjustment value and the current SOC.

[0082] In some embodiments, when the current action is a decrease or increase in SOC (State of Charge) action type, the current state of charge (SOC) can be changed by an adjustment value in the current action. This adjustment value (the amount of increase or decrease) can be determined according to a preset action execution strategy. The adjustment value can be a fixed value or a dynamic value determined based on at least one of the target battery's usage scenario and operating state. For example, if the target battery's current SOC is 50%, and the current action is a decrease in SOC with an adjustment value of 1%, the target battery's current SOC can be reduced to 49%.

[0083] In this embodiment, the target battery's current state parameter $s$, the target battery's current action $a$, the reward value $r$, the target battery's next state parameter $s'$, and the completion flag can be used as a piece of experience data. This experience data is stored in the experience pool and is represented as $(s, a, r, s', d)$, where $d$ is the completion flag.
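One way to hold such experience data is a named tuple stored in a bounded queue, as in the sketch below. The fixed capacity with oldest-first eviction and the sample readings are assumptions made for illustration; the text does not specify how the pool is bounded.

```python
from collections import deque, namedtuple

# One piece of experience data: state, action, reward, next state, done flag.
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)

class ExperiencePool:
    """Fixed-capacity pool; the oldest entries are dropped when it is full."""

    def __init__(self, capacity):
        self._buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self._buffer.append(Experience(state, action, reward, next_state, done))

    def __len__(self):
        return len(self._buffer)

    def __getitem__(self, index):
        return self._buffer[index]

pool = ExperiencePool(capacity=2)
# Hypothetical (voltage in V, current in A, temperature in degC) readings.
pool.store((3.70, 1.2, 25.0), 3, -0.4, (3.69, 1.2, 25.1), False)
pool.store((3.69, 1.2, 25.1), 2, -0.3, (3.68, 1.1, 25.1), False)
pool.store((3.68, 1.1, 25.1), 4, -0.1, (3.67, 1.1, 25.2), True)
print(len(pool), pool[0].action, pool[-1].done)   # prints: 2 2 True
```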

[0084] Optionally, when storing experience data in an experience pool, the priority of that experience data can be set, as follows:

[0085] $p_i = |\delta_i| + \epsilon$

[0086] Where $p_i$ is the priority of experience data $i$, $\delta_i$ is the temporal-difference (TD) error of experience data $i$, and $\epsilon$ is a small positive number used to prevent the priority from being zero.

[0087] In one embodiment provided in this application, based on the above scheme, optionally, during the training process of the dual reinforcement learning model, an ε-greedy strategy is used to randomly select actions; after the dual reinforcement learning model is trained, an ε-greedy strategy is used to select the optimal action.

[0088] In this embodiment, during the training phase, the dual-reinforcement learning model employs an ε-greedy strategy to perform random action selection, aiming to balance the exploration of unknown states with the utilization of existing knowledge. This strategy may select a non-optimal action with a certain probability, thereby promoting a thorough exploration of the state space. As the training process progresses, the dual-reinforcement learning model gradually converges to a better action strategy, allowing the optimal action to be selected directly after training, thus improving the overall estimation accuracy.
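The exploration behaviour described above can be sketched as follows. The linear decay schedule and its constants are assumptions added for illustration; the text only says that random selection is used during training and that the optimal action is selected directly afterwards, which corresponds here to calling the selector with an exploration rate of zero.

```python
import random

def select_action(q_values, epsilon, rng):
    """Explore with probability epsilon, otherwise take the highest-Q action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

def decayed_epsilon(step, start=1.0, end=0.05, decay_steps=10_000):
    """Linear decay from start to end (the schedule itself is an assumption)."""
    fraction = min(1.0, step / decay_steps)
    return start + fraction * (end - start)

rng = random.Random(0)
q = [0.1, 0.7, 0.3]            # hypothetical Q-values for three actions
train_action = select_action(q, decayed_epsilon(step=500), rng)
deploy_action = select_action(q, 0.0, rng)   # epsilon = 0: always greedy
```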

[0089] Through reinforcement learning, the CNN-BiLSTM network ultimately obtains the optimal SOC adjustment strategy and optimizes the strategy based on reward signals (such as SOC estimation error). This enables the model to dynamically and adaptively adjust the SOC estimation strategy, thereby effectively responding to different battery state changes and environmental conditions.

[0090] In one embodiment provided in this application, based on the above-described solution, optionally, after obtaining the next state parameter of the target battery after performing the current action, the method further includes:

[0091] The current Q value is determined based on the current state parameters, the current action, and the first CNN-BiLSTM network, and the next action corresponding to the next state parameter is selected based on the first CNN-BiLSTM network.

[0092] The target Q-value is determined based on the next state parameters, the next action, and the second CNN-BiLSTM network.

[0093] Update the model parameters of the first CNN-BiLSTM network based on the current Q value and the target Q value.

[0094] In this embodiment, the target Q-value is a reference value used by the model to update its parameters during training. It provides the model with a clear objective, enabling it to optimize its parameters by minimizing the difference between the target Q-value and the current Q-value.

[0095] In this embodiment, the current state parameters and the current action can be input into the first CNN-BiLSTM network to obtain the current Q value; the next action corresponding to the next state parameter can be selected based on the first CNN-BiLSTM network and the current Q value.

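[0096] The target Q value for the next action is evaluated using the second CNN-BiLSTM network, as follows:

[0097] $$y = r + \gamma\,(1 - d)\,Q_{\text{target}}(s', a'), \qquad a' = \arg\max_{a} Q_{\text{online}}(s', a)$$

[0098] Where r is the reward value, γ is the discount factor, d is the completion flag (done), y is the target Q value, a' is the next action selected by the first CNN-BiLSTM network, and $Q_{\text{online}}$ and $Q_{\text{target}}$ denote the Q values output by the first and second CNN-BiLSTM networks, respectively.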
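The division of labor between the two networks can be illustrated with a short Python sketch. It assumes the standard double deep Q network target, in which the first (online) network chooses the next action and the second (target) network scores it, and the bootstrap term is dropped when the completion flag is set. The function name, the `QFunction` alias, and the default `gamma` are hypothetical; plain callables stand in for the two CNN-BiLSTM networks.

```python
from typing import Callable, Sequence

# A Q function maps a state to one Q value per action (stand-in for a network).
QFunction = Callable[[Sequence[float]], Sequence[float]]


def double_dqn_target(reward: float, next_state: Sequence[float], done: bool,
                      online_q: QFunction, target_q: QFunction,
                      gamma: float = 0.99) -> float:
    """Target y = r + gamma * (1 - d) * Q_target(s', argmax_a Q_online(s', a))."""
    if done:
        # No future return once the episode has finished.
        return reward
    online_next = online_q(next_state)
    # Action selection uses the online network ...
    next_action = max(range(len(online_next)), key=lambda i: online_next[i])
    # ... while action evaluation uses the target network.
    return reward + gamma * target_q(next_state)[next_action]
```

Because the action is chosen by one network and valued by the other, a spuriously high estimate in the online network is less likely to be copied straight into the target.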

[0099] Optionally, the reward value r can be calculated as follows:

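[0100] $$r = q \cdot r_{\text{error}} + e \cdot r_{\text{smooth}}$$

[0101] Where r is the reward value; $r_{\text{error}}$ is the accuracy reward, $r_{\text{error}} = -|SOC_{\text{actual}} - SOC_{\text{pred}}|$; $r_{\text{smooth}}$ is the stability reward, $r_{\text{smooth}} = -|SOC_{\text{pred1}} - SOC_{\text{pred0}}|$; q is the weight of $r_{\text{error}}$; e is the weight of $r_{\text{smooth}}$; $SOC_{\text{actual}}$ is the actual SOC corresponding to the current state parameter; $SOC_{\text{pred}}$ is the current SOC of the target battery predicted by the first CNN-BiLSTM network; $SOC_{\text{pred1}}$ is the SOC obtained by executing the next action; and $SOC_{\text{pred0}}$ is the SOC obtained by executing the current action.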
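For illustration, the reward can be computed as in the sketch below, which reads the reward as a weighted sum of an accuracy term (negative absolute error between actual and predicted SOC) and a stability term (negative absolute change between consecutive predicted SOC values). The function name and the default weights `q=1.0` and `e=0.1` are assumptions made for the example; this application does not state weight values.

```python
def soc_reward(soc_actual: float, soc_pred: float,
               soc_pred_next: float, soc_pred_current: float,
               q: float = 1.0, e: float = 0.1) -> float:
    """Weighted sum of an accuracy term and a stability term (both <= 0)."""
    # Accuracy term: penalizes the gap between the actual and predicted SOC.
    r_error = -abs(soc_actual - soc_pred)
    # Stability term: penalizes jumps between consecutive SOC predictions.
    r_smooth = -abs(soc_pred_next - soc_pred_current)
    return q * r_error + e * r_smooth
```

Both terms are at most zero, so the largest possible reward is zero, reached when the prediction matches the actual SOC and does not change between steps.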

[0102] In one embodiment provided in this application, based on the above-described scheme, optionally, the model is trained cyclically by prioritizing sampling through a PER experience replay mechanism, including:

[0103] Determine the TD error of each piece of experience data stored in the experience pool, and determine the priority of each piece of experience data based on the TD error;

[0104] Determine the sampling probability of each piece of experience data based on its priority;

[0105] Sample the experience pool according to the sampling probability, and calculate the weight of each piece of sampled experience data;

[0106] Determine the loss function value based on the sampled experience data, and weight the loss function value by the weights;

[0107] Update the model parameters of the first CNN-BiLSTM network using the weighted loss function value, and return to the step of determining the TD error of each piece of experience data stored in the experience pool;

[0108] When the preset time is reached, copy the model parameters of the first CNN-BiLSTM network to the second CNN-BiLSTM network.

[0109] In this embodiment, the data in the experience pool can be sampled to update the network parameters of the first CNN-BiLSTM network. The sampling probability of each piece of experience data is determined according to its priority $p_i$, as follows:

[0110] $$P(i) = \frac{p_i^{\alpha}}{\sum_{k=1}^{K} p_k^{\alpha}}$$

[0111] Where P(i) is the sampling probability of experience data i, and α controls the influence of priority on the sampling probability, representing the sensitivity of the sampling probability to the TD error. When α=0, all experiences have equal sampling probabilities, which is equivalent to random sampling; in this case, prioritized experience replay behaves like an ordinary experience replay mechanism. When α=1, sampling is performed entirely according to the magnitude of the TD error, so experiences with larger errors have a higher probability of being sampled. K represents the total amount of data in the experience pool.
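The effect of α can be checked numerically with a small sketch. It assumes the proportional form in which each sampling probability is the priority raised to the power α, divided by the sum of all such powers over the pool. The function names are hypothetical, and a simple linear-time weighted draw is used in place of any specialized data structure.

```python
import random
from typing import List, Sequence


def sampling_probabilities(priorities: Sequence[float], alpha: float) -> List[float]:
    """P(i) = p_i**alpha / sum_k p_k**alpha over the whole pool."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_indices(probabilities: Sequence[float], batch_size: int,
                   rng: random.Random) -> List[int]:
    """Draw a batch of pool indices (with replacement) according to P(i)."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=batch_size)
```

With `alpha=0.0` every entry receives the same probability, and with `alpha=1.0` the probabilities are directly proportional to the priorities.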

[0112] In some embodiments, to correct the bias introduced by prioritized sampling, an importance sampling weight $w_i$ can be introduced:

[0113] $$w_i = \left(\frac{1}{N \cdot P(i)}\right)^{\beta}$$

[0114] Where N is the size of the experience pool, P(i) is the sampling probability of experience data i, and β controls the degree of weighting and gradually increases from its initial value to 1.

[0115] Thus, prioritized experience replay makes experiences with larger TD errors more likely to be sampled, which lets the model focus on experiences with high learning value and improves sampling and learning efficiency. At the same time, the importance sampling weights correct the bias introduced by prioritized sampling, helping to keep the estimate unbiased.

[0116] In prioritized experience replay, some experiences are assigned a higher sampling probability based on their TD error, while the sampling probability of other experiences may be reduced. Therefore, β is introduced to adjust the sampling weights, allowing the model to use sampled experiences more fairly, especially samples with lower sampling probabilities. The initial value of β is usually small (close to 0); it gradually increases during training and eventually approaches 1. This allows the model to focus on samples with larger TD errors in the early stages of training, while preventing it from becoming biased by over-focusing on high-error samples as training progresses.
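A minimal sketch of the weighting and of a β schedule is given below. It assumes the usual importance sampling form in which the weight is the reciprocal of (pool size times sampling probability), raised to the power β. The linear schedule, the default `beta_start=0.1`, and the optional normalization by the largest weight are assumptions made for the example; this application only states that β starts small and approaches 1.

```python
from typing import List, Sequence


def importance_weights(probabilities: Sequence[float], pool_size: int,
                       beta: float, normalize: bool = False) -> List[float]:
    """w_i = (1 / (N * P(i))) ** beta for each sampled experience."""
    weights = [(1.0 / (pool_size * p)) ** beta for p in probabilities]
    if normalize:
        # Optional (assumed) step: scale so that the largest weight equals 1.
        w_max = max(weights)
        weights = [w / w_max for w in weights]
    return weights


def annealed_beta(step: int, total_steps: int, beta_start: float = 0.1) -> float:
    """Linearly raise beta from beta_start to 1 over total_steps (assumed schedule)."""
    frac = min(step / total_steps, 1.0)
    return beta_start + frac * (1.0 - beta_start)
```

With `beta=0.0` all weights equal 1 and no correction is applied; with `beta=1.0` rarely sampled experiences receive proportionally larger weights.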

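[0117] In this embodiment, the loss function is calculated using the sampled experience data, and the loss function values are weighted by the importance sampling weights, as follows:

[0118] $$L(\theta) = \mathbb{E}_{(s,a,r,s',d)\sim D}\left[\, w_i \left(y - Q(s, a; \theta)\right)^2 \right]$$

[0119] Where $L(\theta)$ represents the weighted loss function value, θ denotes the parameters of the first CNN-BiLSTM network, $Q(s, a; \theta)$ is the current Q value, y is the target Q value, and $\mathbb{E}_{(s,a,r,s',d)\sim D}$ represents the expected value under the experience sampling distribution D of state, action, reward, next state, and completion flag.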

[0120] Optionally, after calculating the weighted loss function value, the network parameters of the first CNN-BiLSTM network can be updated using optimization algorithms such as gradient descent.
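The weighting step can be sketched as follows. The squared TD error used here is an assumption (it is the usual choice for deep Q networks; the exact form of the loss is not recoverable from the text), and the function name is hypothetical. The function also returns the per-sample TD errors, since these are what a priority update would use.

```python
from typing import List, Sequence, Tuple


def weighted_td_loss(current_q: Sequence[float], target_q: Sequence[float],
                     weights: Sequence[float]) -> Tuple[float, List[float]]:
    """Batch mean of w_i * (y_i - Q_i)**2, plus the raw TD errors."""
    # TD error per sample: target Q value minus current Q value.
    td_errors = [y - q for q, y in zip(current_q, target_q)]
    # Each squared error is scaled by its importance sampling weight.
    loss = sum(w * d * d for w, d in zip(weights, td_errors)) / len(td_errors)
    return loss, td_errors
```

For example, with current Q values `[1.0, 2.0]`, target Q values `[1.5, 1.0]`, and weights `[1.0, 0.5]`, the TD errors are `[0.5, -1.0]` and the weighted loss is `(0.25 + 0.5) / 2 = 0.375`.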

[0121] In some embodiments, the priority of each piece of experience data in the experience pool is updated according to the new TD error.

[0122] See Figure 3, which is a schematic diagram of the data processing flow of an agent provided in this application. The agent includes an online network (the first CNN-BiLSTM network) and a target network (the second CNN-BiLSTM network). Both networks adopt the CNN-BiLSTM architecture and share the same model structure for simplicity, but their network parameters are updated independently. The online network is used to select actions and update parameters, while the target network is used to calculate the target Q value, with its network parameters held fixed for a period of time. Decoupling action selection from action evaluation through two independent value functions reduces the overestimation problem, and the target network provides a stable learning target, improving the stability and convergence speed of training. In state s, the online network selects action a and the reward value r is calculated. The state s, action a, reward value r, actual next state s', and completion flag d are then stored in a memory pool (the experience pool). When the amount of data stored in the memory pool meets the quantity condition, the memory pool is sampled to select experience data for training. Specifically, the online network selects the next action, the target network evaluates the Q value of that action, and this Q value is used to calculate the loss function value. The online network is then updated based on the loss function value. Finally, the network parameters of the online network are periodically copied to the target network.

[0123] Prioritized Experience Replay (PER) aims to improve the efficiency of experience sampling, making experiences with larger TD errors more likely to be sampled, thereby accelerating learning. After the online network updates the model parameters, the priority of the experience data in the memory pool can be updated.

[0124] See Figure 4, which is a flowchart of the training process of the online network provided in this application. The online network processes the state s of the environment to obtain the Q value of each action in the action space, and action a is then selected according to an ε-greedy policy. Action a is executed to obtain the next state s', the reward r, and the completion flag d, and this data is stored in the experience pool. When the amount of data in the experience pool reaches the batch size, training begins: a batch of experience is sampled from the experience pool and the sampling weights are calculated; the target Q value and the current Q value are calculated; the loss is calculated; and the online network parameters are updated. The priorities and sampling weights are then updated, and the online network parameters are periodically copied to the target network. The next state s' then becomes the current state s. After training, the model performance is evaluated using test data, and evaluation metrics (such as MSE, MAE, RMSE, and R²) are calculated.

[0125] Taking the energy storage battery in a vehicle as an example, the model training and testing process is introduced as follows:

[0126] First, data from four vehicle driving-condition discharge cycles were collected, totaling 4 × 10,000 records (sampled at 1-second intervals). To reduce training cost and speed up training and testing, one record was extracted every 20 points, resulting in 4 × 500 records (an effective time step 20 times larger). These four sets of data were then used for training one set at a time rather than merged. Since the algorithm adjusts the SOC incrementally at each step, directly merging the data sets could lead to excessive adjustments in the initial stage and reduce training efficiency. To avoid forgetting and prevent overfitting, the four sets of data were selected randomly and alternately for mixed training across the total epochs.
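The data preparation described above can be sketched in a few lines of Python. The function names are hypothetical, and drawing one cycle uniformly at random per epoch is only one possible reading of "randomly and alternately selected"; the exact selection rule is not specified in the text.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def downsample(records: Sequence[T], stride: int = 20) -> List[T]:
    """Keep every stride-th record, e.g. 10,000 records -> 500 records."""
    return list(records[::stride])


def cycle_schedule(num_cycles: int, num_epochs: int, rng: random.Random) -> List[int]:
    """For each epoch, pick which driving cycle to train on (assumed uniform draw)."""
    return [rng.randrange(num_cycles) for _ in range(num_epochs)]
```

Applied to a cycle of 10,000 one-second records, `downsample` returns 500 records spaced 20 seconds apart, and `cycle_schedule(4, 100, rng)` yields one cycle index per epoch for 100 epochs.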

[0127] During the model testing phase, considering the significant dynamic changes in data during actual driving cycles, data from 20 points were randomly selected from each of the four cycles, forming a test set of 500 time points for each group. Furthermore, to evaluate the model's autoregressive capability, the initial state of charge (SOC) in the test set was set to 60%, in order to observe its impact on model estimation and the model's performance when the initial SOC is unknown or inaccurate. The experimental results are shown in Figure 5, where the blue curve represents the true SOC and the orange curve represents the predicted SOC. After training for only 100 epochs, the R² between the true SOC and the model-estimated SOC is ≥0.99 for all four cycles of the training set, indicating that the model can effectively learn the SOC changes under different operating conditions. The performance on the test set is also excellent, demonstrating the model's good generalization ability. Even with the initial SOC set to 60%, the model converges to the true SOC trend within 100 time steps.

[0128] Figure 6 shows the distribution of current and voltage parameters in the training and test sets for the first two cycles. The blue line represents the change of the actual SOC value over time, while the orange line represents the change of the algorithm-predicted SOC value over time. The initial SOC was set to 60% to evaluate the model's ability to quickly self-correct, demonstrating its adaptability under unknown initial SOC conditions. The figure also shows the changes in current and voltage data, which vary significantly during the driving cycles and are sufficiently complex to verify the model's prediction accuracy under complex conditions. The voltage density distribution shows the distribution of voltage data, and the current density distribution shows the distribution of current data. Peak values represent the most common voltage or current values, and the curve width reflects the degree of data dispersion. Furthermore, the voltage histogram and current histogram in Figure 6, which are overlaid with density curves, further illustrate the data dispersion and show that the training set and the test set cover the same value ranges but differ in how frequently each value occurs. Cyan represents the training data, and pink represents the test data.

[0129] The training set is used to teach the model how to make SOC predictions, while the test set is used to evaluate the model's actual prediction performance on unseen data, essentially simulating real-world performance. This approach allows for a comprehensive evaluation of the model's performance and generalization ability. The results show a certain difference between the two sets of data, with a wide distribution, further validating the robustness of the data.

[0130] In this embodiment, the metrics used to evaluate model performance include MSE (mean squared error), MAE (mean absolute error), RMSE (root mean squared error), and R² (coefficient of determination). These metrics quantify the deviation between the model's predicted values and the true values, and together give a comprehensive picture of the model's accuracy. MSE is the average of the squared differences between predicted and true values; it is particularly sensitive to larger errors and therefore reflects the model's deviation on individual samples. MAE is the average of the absolute values of all prediction errors, with each error contributing equally; it directly reflects the model's average error and suits applications where all errors should be treated equally. RMSE is the square root of MSE; it restores the error to the same units as the original data while retaining sensitivity to large errors, which makes the result easier to interpret. R² measures how well the model explains the variation in the data and typically takes values between 0 and 1: 1 indicates a perfect fit, 0 indicates that the model predicts no better than the mean of the true values, and values close to 1 indicate strong predictive ability (R² can be negative when the model predicts worse than the mean).
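As a concrete reference for these definitions, the four metrics can be computed as in the following sketch, using the standard textbook formulas. The function name and the dictionary keys are hypothetical.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Compute MSE, MAE, RMSE, and R^2 for paired true and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    # Total sum of squares; zero if y_true is constant, in which case R^2 is undefined.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": mse, "mae": mae, "rmse": rmse, "r2": r2}
```

Predicting the mean of the true values for every sample gives an R² of 0, and a perfect prediction gives an R² of 1 with MSE, MAE, and RMSE all equal to 0.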

[0131] In this embodiment, the specific evaluation metrics of the model are shown in Table 1:

[0132] Table 1

[0133]

[0134] Corresponding to the method shown in Figure 1 and described above, this application also provides an online battery SOC estimation device based on deep reinforcement learning, which is used for the specific implementation of the method in Figure 1. A schematic diagram of the device is shown in Figure 7. The device includes:

[0135] The acquisition unit 701 is used to acquire historical state parameters of the target battery, wherein the historical state parameters include at least voltage, current, temperature, and the SOC tag corresponding to the historical state parameters.

[0136] The execution unit 702 is used to define the action space corresponding to the historical state parameters. The action space includes multiple action types, and each action is used to represent the SOC adjustment range of the target battery. The action types include decreasing SOC, keeping SOC unchanged, or increasing SOC.

[0137] The first building unit 703 is used to build a dual reinforcement learning model that combines a CNN-BiLSTM network and a dual deep Q network. The first CNN-BiLSTM network in the dual reinforcement learning model is used to select actions in the action space and update parameters, and the second CNN-BiLSTM network is used to calculate the target Q value.

[0138] The second construction unit 704 is used to generate experience data on interaction with the environment based on the dual reinforcement learning model in order to construct an experience pool;

[0139] Training unit 705 is used to preferentially sample and cyclically train the dual reinforcement learning model through the PER experience replay mechanism until the training of the dual reinforcement learning model reaches the preset conditions.

[0140] The online estimation unit 706 is used to obtain the current state parameters of the target battery, input them into the trained dual reinforcement learning model, and output the current SOC of the target battery.
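One way to picture how the action space and the online estimation unit fit together is the sketch below. It assumes that the SOC estimate is carried from step to step and nudged by the selected action, which is an interpretation of "SOC adjustment range" rather than a detail confirmed by the text. The adjustment amplitudes in `ACTION_DELTAS`, the clamping to the interval from 0 to 1, and the function names are hypothetical; `select_action` stands in for the trained model and receives only the measured state, not the SOC.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed amplitudes for the three action types: decrease, keep unchanged, increase.
ACTION_DELTAS: Tuple[float, ...] = (-0.01, 0.0, 0.01)


def apply_action(soc: float, action: int,
                 deltas: Sequence[float] = ACTION_DELTAS) -> float:
    """Adjust the SOC estimate by the chosen amplitude, clamped to [0, 1]."""
    return min(1.0, max(0.0, soc + deltas[action]))


def estimate_online(initial_soc: float, states: Sequence[Sequence[float]],
                    select_action: Callable[[Sequence[float]], int]) -> List[float]:
    """Run the policy over a sequence of measured states and track the SOC estimate."""
    soc = initial_soc
    trace = []
    for state in states:              # state: e.g. (voltage, current, temperature)
        action = select_action(state)  # the state does not contain the SOC itself
        soc = apply_action(soc, action)
        trace.append(soc)
    return trace
```

Starting from an inaccurate initial value such as 0.60, a policy that keeps choosing the "increase" action moves the estimate upward by one amplitude per time step until it is clamped at 1.0.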

[0141] In one embodiment provided in this application, based on the above scheme, optionally, the CNN-BiLSTM network includes a CNN network, a BiLSTM network, and a fully connected layer; the CNN network is used to extract the local spatial features of the parameters other than the SOC label in the historical state parameters; the BiLSTM network is used to extract the temporal features of the historical state parameters; and the fully connected layer is used to map the local spatial features and temporal features of the historical state parameters to the action space.

[0142] In one embodiment provided in this application, based on the above scheme, optionally, the CNN network and the BiLSTM network are connected in series, and the output of the CNN network is used as the input of the BiLSTM network.

[0143] In one embodiment provided in this application, based on the above-described solution, optionally, the second building unit includes:

[0144] The first acquisition subunit is used to acquire the current state parameters of the target battery and use the first CNN-BiLSTM network to select the current action corresponding to the current state parameters.

[0145] The second acquisition subunit is used to acquire the next state parameters of the target battery after it performs the current action, and generate experience data on interaction with the environment; wherein, the experience data includes the current state parameters of the target battery, the current action of the target battery, the reward value, the next state parameters of the target battery, and the completion flag;

[0146] A storage subunit is used to store the experience data into an experience pool.

[0147] In one embodiment provided in this application, based on the above scheme, optionally, during the training process of the dual reinforcement learning model, an ε-greedy strategy is used to randomly select actions; after the dual reinforcement learning model is trained, an ε-greedy strategy is used to select the optimal action.

[0148] In one embodiment provided in this application, based on the above-described solution, optionally, the second building unit further includes:

[0149] The selection subunit is used to determine the current Q value based on the current state parameter, the current action, and the first CNN-BiLSTM network, and to select the next action corresponding to the next state parameter based on the first CNN-BiLSTM network.

[0150] The first determining subunit is used to determine the target Q value based on the next state parameter, the next action, and the second CNN-BiLSTM network;

[0151] An update subunit is used to update the model parameters of the first CNN-BiLSTM network based on the current Q value and the target Q value.

[0152] In one embodiment provided in this application, based on the above-described scheme, optionally, the training unit 705 includes:

[0153] The second determining subunit is used to determine the TD error of each piece of experience data stored in the experience pool, and to determine the priority of each piece of experience data based on the TD error;

[0154] The third determining subunit is used to determine the sampling probability of each piece of experience data according to the priority of each piece of experience data;

[0155] A sampling subunit is used to sample the experience pool according to the sampling probability and calculate the weight of each piece of sampled experience data;

[0156] The fourth determining subunit is used to determine the loss function value based on the sampled experience data, and to weight the loss function value using the weights;

[0157] The first execution subunit is used to update the model parameters of the first CNN-BiLSTM network using the weighted loss function value, and return to the step of determining the TD error of each piece of experience data stored in the experience pool;

[0158] The second execution subunit is used to copy the model parameters in the first CNN-BiLSTM network to the second CNN-BiLSTM network when a preset time is reached.

[0159] The specific principles and execution processes of each unit and module in the battery SOC online estimation device based on deep reinforcement learning disclosed in the above embodiments of this application are the same as those of the battery SOC online estimation method based on deep reinforcement learning disclosed in the above embodiments of this application. Please refer to the corresponding parts of the battery SOC online estimation method based on deep reinforcement learning provided in the above embodiments of this application, and they will not be repeated here.

[0160] This application embodiment also provides a storage medium, the storage medium including stored instructions, wherein, when the instructions are executed, the device where the storage medium is located is controlled to execute the above-described online battery SOC estimation method based on deep reinforcement learning.

[0161] This application also provides an electronic device, a structural schematic diagram of which is shown in Figure 8. The electronic device specifically includes a memory 801 and one or more instructions 802, wherein the one or more instructions 802 are stored in the memory 801 and are configured to be executed by one or more processors 803 to perform the above-mentioned online battery SOC estimation method based on deep reinforcement learning.

[0162] It should be noted that the various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. For apparatus embodiments, since they are basically similar to method embodiments, the description is relatively simple; relevant parts can be referred to the descriptions in the method embodiments.

[0163] Finally, it should be noted that in this document, relational terms such as first and second are used only to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations.

[0164] For ease of description, the above device is described in terms of units divided by function. Of course, in implementing this application, the functions of each unit can be implemented in one or more pieces of software and/or hardware.

[0165] As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus necessary general-purpose hardware platforms. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in various embodiments or some parts of the embodiments of this application.

[0166] The above provides a detailed description of an online battery SOC estimation method based on deep reinforcement learning provided in this application. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of this application. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this application. Therefore, the content of this specification should not be construed as a limitation of this application.

Claims

1. A battery SOC online estimation method based on deep reinforcement learning, characterized in that the method includes:
obtaining historical state parameters of the target battery, the historical state parameters including at least voltage, current, temperature, and the SOC label corresponding to the historical state parameters;
defining the action space corresponding to the historical state parameters, wherein the action space includes multiple action types, each action represents the adjustment range of the SOC of the target battery, and the action types include decreasing SOC, keeping SOC unchanged, or increasing SOC;
constructing a dual reinforcement learning model combining a CNN-BiLSTM network and a dual deep Q network, wherein the first CNN-BiLSTM network in the dual reinforcement learning model is used to select actions in the action space and update parameters, and the second CNN-BiLSTM network is used to calculate the target Q value;
generating, based on the dual reinforcement learning model, experience data on interaction with the environment to construct an experience pool;
training the dual reinforcement learning model cyclically with prioritized sampling through the PER experience replay mechanism until the training of the dual reinforcement learning model reaches the preset conditions;
obtaining the current state parameters of the target battery and inputting them into the trained dual reinforcement learning model to output the current SOC of the target battery, wherein the current state parameters do not include the current SOC.

2. The method according to claim 1, characterized in that the CNN-BiLSTM network includes a CNN network, a BiLSTM network, and a fully connected layer; the CNN network is used to extract local spatial features of the parameters other than the SOC label in the historical state parameters; the BiLSTM network is used to extract the temporal features of the parameters other than the SOC label in the historical state parameters; and the fully connected layer is used to map the local spatial features and temporal features of the historical state parameters to the action space.

3. The method according to claim 2, characterized in that the CNN network and the BiLSTM network are connected in series, with the output of the CNN network serving as the input of the BiLSTM network.

4. The method according to claim 2, characterized in that the step of generating experience data on interaction with the environment based on the dual reinforcement learning model to construct an experience pool includes:
obtaining the current state parameters of the target battery, and using the first CNN-BiLSTM network to select the current action corresponding to the current state parameters;
obtaining the next state parameters of the target battery after it performs the current action, and generating experience data on interaction with the environment, wherein the experience data includes the current state parameters of the target battery, the current action of the target battery, the reward value, the next state parameters of the target battery, and the completion flag;
storing the experience data in the experience pool.

5. The method according to claim 4, characterized in that during the training of the dual reinforcement learning model, an ε-greedy strategy is used to randomly select actions; and after the dual reinforcement learning model is trained, an ε-greedy strategy is used to select the optimal action.

6. The method according to claim 4, characterized in that after obtaining the next state parameter of the target battery after performing the current action, the method further includes:
determining the current Q value based on the current state parameter, the current action, and the first CNN-BiLSTM network, and selecting the next action corresponding to the next state parameter based on the first CNN-BiLSTM network;
determining the target Q value based on the next state parameter, the next action, and the second CNN-BiLSTM network;
updating the model parameters of the first CNN-BiLSTM network based on the current Q value and the target Q value.

7. The method according to claim 4, characterized in that the step of cyclically training the model with prioritized sampling through the PER experience replay mechanism includes:
determining the TD error of each piece of experience data stored in the experience pool, and determining the priority of each piece of experience data based on the TD error;
determining the sampling probability of each piece of experience data based on its priority;
sampling the experience pool according to the sampling probability, and calculating the weight of each piece of sampled experience data;
determining the loss function value based on the sampled experience data, and weighting the loss function value by the weights;
updating the model parameters of the first CNN-BiLSTM network using the weighted loss function value, and returning to the step of determining the TD error of each piece of experience data stored in the experience pool;
upon reaching a preset time, copying the model parameters of the first CNN-BiLSTM network to the second CNN-BiLSTM network.

8. A battery SOC online estimation device based on deep reinforcement learning, characterized in that the device includes:
an acquisition unit, used to acquire historical state parameters of the target battery, wherein the historical state parameters include at least voltage, current, temperature, and the SOC label corresponding to the historical state parameters;
an execution unit, used to define the action space corresponding to the historical state parameters, wherein the action space includes multiple action types, each action represents the adjustment range of the SOC of the target battery, and the action types include decreasing SOC, keeping SOC unchanged, or increasing SOC;
a first building unit, used to build a dual reinforcement learning model that combines a CNN-BiLSTM network and a dual deep Q network, wherein the first CNN-BiLSTM network in the dual reinforcement learning model is used to select actions in the action space and update parameters, and the second CNN-BiLSTM network is used to calculate the target Q value;
a second building unit, used to generate experience data on interaction with the environment based on the dual reinforcement learning model, so as to construct an experience pool;
a training unit, used to cyclically train the dual reinforcement learning model with prioritized sampling through the PER experience replay mechanism until the training of the dual reinforcement learning model reaches the preset conditions;
an online estimation unit, used to obtain the current state parameters of the target battery, input them into the trained dual reinforcement learning model, and output the current SOC of the target battery, wherein the current state parameters do not include the current SOC.

9. A storage medium, characterized in that the storage medium includes stored instructions, wherein, when the instructions are executed, the device in which the storage medium resides is controlled to execute the method as described in any one of claims 1-7.

10. An electronic device, characterized in that it includes a memory and one or more instructions, wherein the one or more instructions are stored in the memory and are configured to be executed by one or more processors to perform the method as described in any one of claims 1-7.