A control method of a liquid-cooled battery thermal management system

By combining a multilayer perceptron model and the D2SAC algorithm, the problem of lithium battery sensitivity to temperature was solved, and adaptive control of the liquid-cooled battery thermal management system was realized, improving the performance and safety of the battery pack.

CN119419417B · Active · Publication Date: 2026-06-26 · BEIJING NORMAL UNIV AT ZHUHAI

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING NORMAL UNIV AT ZHUHAI
Filing Date
2024-10-24
Publication Date
2026-06-26


Abstract

The application relates to a control method for a liquid-cooled battery thermal management system. A multilayer perceptron is used to predict the state of a lithium battery and its environment under a given control action, and deep reinforcement learning is used to simulate the interaction between the current battery state and the environmental conditions. After preliminary offline training, the reinforcement learning model is tested on an energy storage system test platform, yielding a dynamic algorithm model that can predict and regulate the future state of the lithium battery from its current state. The application provides basic theory and key technical support for the application of deep learning and reinforcement learning in the field of lithium battery thermal management.

Description

Technical Field

[0001] This invention belongs to the technical field of energy storage liquid cooling systems, and specifically relates to a control method for an energy storage liquid cooling battery thermal management system.

Background Technology

[0002] As a new type of electrochemical energy storage unit, lithium batteries are widely used in energy storage due to their advantages such as high energy density, long cycle life, low self-discharge rate, and no memory effect. However, lithium-ion batteries have strict requirements for their operating temperature range. High temperatures and large temperature differences can not only shorten the battery's lifespan but also lead to thermal runaway, triggering a chain reaction that can cause fires or even explosions, resulting in serious economic losses and personal injury. Therefore, a reliable lithium battery thermal management system is crucial for practical production and daily life scenarios.

Summary of the Invention

[0003] The purpose of this invention is to provide a control method for a liquid-cooled battery thermal management system, which effectively solves the problems mentioned in the background art.

[0004] To achieve the above objectives, the present invention provides the following technical solution.

[0005] A control method for a liquid-cooled battery thermal management system includes the following steps:

[0006] 1) Environment Setup: A multilayer perceptron (MLP) model consisting of an input layer, hidden layers, and an output layer is established to build a deep reinforcement learning training environment. The input layer is responsible for receiving data, and the number of nodes is consistent with the number of features in the input data. Each neuron in the hidden layer is connected to all nodes in the previous layer, and the nodes are weighted and summed. The output layer uses a linear activation function f(x) = x to output the regression result. The multilayer perceptron (MLP) model is used as a simulator of the environment. It can predict the future state of the battery based on the current battery state (such as SOC, SOH, temperature, voltage, current, etc.) and control actions (such as liquid cooling temperature setting, liquid flow pressure, etc.). This prediction model is the foundation for training the deep reinforcement learning model.

[0007] Furthermore, the environment setup includes:

[0008] A. Establish a data collection platform that communicates in real time with both the battery management system and the liquid cooling control system. Collect the parameters and operational data of the battery from the energy storage system experimental platform, upload the data to a cloud server database, and then have the training equipment extract the data from the cloud server and preprocess it. The preprocessing includes data denoising, supplementing missing data, correcting erroneous or out-of-range data, and data normalization. Furthermore, by establishing a physical simulation model of the battery and liquid cooling system, obtain additional battery operational information under more realistic conditions, thus achieving data augmentation.

[0009] B. Based on the processed data, an MLP model is established to receive the current state of the battery and the control actions as inputs, and output the predicted new state of the battery, thereby modeling the lithium battery and its surrounding environment.
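As a minimal illustrative sketch of the MLP environment simulator described above, the following shows a fully connected forward pass with weighted sums in the hidden layers and the linear output f(x) = x. The layer sizes, the ReLU hidden activation, the random initialization, and the function names are assumptions made only for illustration; the text does not specify them.

```python
import math
import random


def init_mlp(layer_sizes, seed=0):
    """Create Gaussian random weights and zero biases for a fully connected network."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params


def mlp_forward(params, x):
    """Weighted sum per neuron; ReLU on hidden layers (assumed), identity on the output."""
    for i, (weights, biases) in enumerate(params):
        z = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        is_output = i == len(params) - 1
        x = z if is_output else [max(0.0, v) for v in z]
    return x


def predict_next_state(params, state, action):
    """Concatenate the battery state and the control action, then regress the next state."""
    return mlp_forward(params, list(state) + list(action))


# Hypothetical normalized inputs:
# state  = [SOC, SOH, temperature, voltage, current]
# action = [liquid cooling temperature setting, liquid flow pressure]
params = init_mlp([7, 16, 16, 5])
next_state = predict_next_state(params, [0.8, 0.95, 0.4, 0.6, 0.1], [0.3, 0.5])
```

In practice the weights would be fitted to the preprocessed platform data before the model is used as a training environment; the untrained network here only demonstrates the data flow.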

[0010] 2) Model Training: Using the deep reinforcement learning training environment built from the aforementioned MLP model, the Deep Diffusion Soft Actor-Critic (D2SAC) algorithm interacts with the training environment to train the deep reinforcement learning model. The D2SAC policy network (Actor) uses a diffusion-based algorithm: during the forward pass, Gaussian noise is progressively added to the action distribution to increase its randomness, and during the backward pass the noise is gradually removed through learning to recover an optimal action distribution, from which an action is sampled. Here $s_t$ is the current state and $\pi_\theta(s_t)$ is the action generated by the policy network. The value function of each action is estimated through a double Q-network (value network) to reduce overestimation, and is updated using a target value function, where $r_t$ is the reward, $\gamma$ is the discount factor, and $d_t$ is the termination flag. The policy network is updated with an objective function that maximizes the expected reward plus the entropy term of the policy, where $\alpha$ is the weight of the entropy and $\log \pi_\theta(a_t|s_t)$ is the entropy term of the action. The liquid cooling strategy is dynamically adjusted according to the entropy level of the current policy;

[0011] The Deep Diffusion Soft Actor-Critic (D2SAC) adaptive model predictive control algorithm includes the following steps: algorithm initialization, action sampling, experience storage, sampling and calculating the target Q-value, updating the Critic network, updating the Actor network, automatically adjusting the entropy coefficient, and softly updating the target Q-network. These steps are repeated until the policy converges or the stopping condition is met. The current Actor network (policy network) generates the probability distribution of new actions, i.e., it generates the expectation and variance of the new distribution to determine the Gaussian distribution used to sample new actions. The Critic network estimates the value function of each action, employing a Double Critic Network (also known as a Double Q-network). The smaller of the two Q-values is used to update the target Q-network to reduce the possibility of overestimation and improve the algorithm's stability and performance. The target Q-network (target network) evaluates the value of actions. It is separated from the Critic network and uses soft updates to avoid rapid fluctuations in Q-values during training and maintain update stability.

[0012] Furthermore, the algorithm initialization operation includes:

[0013] A. Initialize the environment, generate a policy network (Actor network) based on the diffusion model and use a double Q network as the value network (Critic network), and then set the neural network parameters, including initializing the noise parameters of the diffusion model (such as the number of steps T and the noise level σ) and using a Gaussian distribution to randomly initialize the weights of the double Q network.

[0014] B. Initialize the target Q-network, whose parameters are usually the same as those of the double Q-network:

[0015] $\theta_{\text{target}} \leftarrow \theta_{\text{main}}$

[0016] where $\theta_{\text{target}}$ denotes the parameters of the target Q-network and $\theta_{\text{main}}$ denotes the parameters of the double Q-network;

[0017] C. Initialize an experience replay pool to store experience samples generated by the agent's interaction with the environment, including but not limited to: state s, action a, reward r, next state s′, and termination flag d;
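The initialization of the target parameters and the experience replay pool can be sketched as follows. This is a minimal sketch under stated assumptions: the class name `ReplayPool`, the fixed capacity, and the dictionary layout of the parameters are hypothetical and are not taken from the text.

```python
import copy
import random
from collections import deque


class ReplayPool:
    """Fixed-capacity pool of (s, a, r, s_next, d) tuples; the oldest entries are dropped first."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, s, a, r, s_next, d):
        self.buffer.append((s, a, r, s_next, d))

    def sample(self, batch_size):
        # Uniform random sampling without replacement.
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


# Hypothetical double Q-network parameters (two independent critics).
theta_main = {"q1": [0.1, -0.2], "q2": [0.05, 0.3]}
# Hard copy at initialization: theta_target <- theta_main
theta_target = copy.deepcopy(theta_main)

pool = ReplayPool(capacity=3)
for step in range(5):
    pool.store([float(step)], [0.0], -1.0, [float(step + 1)], False)
batch = pool.sample(2)
```

A deep copy is used so that later updates to the main parameters do not silently change the target parameters.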

[0018] The action sampling operation includes:

[0019] A. Given the current state s, initialize a random initial vector from the diffusion model of the policy network. During the forward pass, noise is gradually added to the data:

[0020]

[0021] where $\alpha_t$ controls the intensity of the noise, and the added term is random noise.

[0022] In each step t, a deep neural network is used to infer the mean and variance of the denoised distribution. Based on the current state s and time step t, the mean and variance of the denoised distribution are output:

[0023]

[0024] A Gaussian distribution is constructed from the mean and variance, and a normally distributed perturbation term is added in order to randomly sample an action a.

[0025] B. Input the sampled action a into the environment to obtain the next state s′, the reward r, and a flag indicating whether to terminate;
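The sampling steps above can be illustrated with a small sketch. It assumes the commonly used forward noising form $x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon$, which is an assumption consistent with the description of $\alpha_t$ and the random noise, not a formula confirmed by the text. The function `toy_denoiser` is a hypothetical stand-in for the learned deep neural network, and skipping the perturbation in the last reverse step is a common convention assumed here.

```python
import math
import random


def forward_noise(x_prev, alpha_t, rng):
    """Assumed forward step: scale the data and add Gaussian noise controlled by alpha_t."""
    keep = math.sqrt(alpha_t)
    spread = math.sqrt(1.0 - alpha_t)
    return [keep * x + spread * rng.gauss(0.0, 1.0) for x in x_prev]


def toy_denoiser(x, state, t):
    """Hypothetical stand-in for the learned network: returns (mean, variance)."""
    target = [0.5 * state[0]] * len(x)  # pretend the preferred action depends on the state
    mean = [xi + 0.5 * (ti - xi) for xi, ti in zip(x, target)]
    variance = 0.01 * t
    return mean, variance


def sample_action(state, action_dim, num_steps, rng, denoiser=toy_denoiser):
    """Reverse process: start from Gaussian noise and denoise for num_steps steps."""
    x = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
    for t in range(num_steps, 0, -1):
        mean, variance = denoiser(x, state, t)
        # Perturb with normally distributed noise, except in the final step.
        noise_scale = math.sqrt(variance) if t > 1 else 0.0
        x = [m + noise_scale * rng.gauss(0.0, 1.0) for m in mean]
    return x


action = sample_action(state=[0.6], action_dim=2, num_steps=5, rng=random.Random(1))
```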

[0026] The experience storage operation includes:

[0027] A. Store the current state s, action a, reward r, next state s′, and termination flag d in the experience replay pool;

[0028] The sampling and calculation of the target Q-value includes:

[0029] A. Randomly sample a batch of data from the experience replay pool to calculate the target Q-value and update the network parameters;

[0030] B. Calculate the Q-value of the next state with the target Q-network, where the action a′ is generated through the reverse diffusion process:

[0031] $Q_{\text{target}}(s', a') = \text{TargetQNetwork}(s', a')$

[0032] C. Choose the smaller Q-value to avoid overestimation:

[0033] $Q_{\min}(s', a') = \min\big(Q_1(s', a'),\ Q_2(s', a')\big)$

[0034] D. Based on the objective function of D2SAC, combine the reward $r$ and the discount factor $\gamma$ to calculate the target Q-value:

[0035] $y = r + \gamma\, Q_{\min}(s', a')$
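The target computation can be sketched in a few lines. The formula in the text is $y = r + \gamma\,Q_{\min}(s', a')$; the optional `done` mask, which drops the bootstrap term for terminal transitions, is an assumption added here because a termination flag d is stored with each sample. The function name and default values are hypothetical.

```python
def target_q_value(reward, q1_next, q2_next, gamma=0.99, done=False):
    """y = r + gamma * min(Q1, Q2); the bootstrap term is dropped when done (assumed)."""
    q_min = min(q1_next, q2_next)
    bootstrap = 0.0 if done else 1.0
    return reward + gamma * bootstrap * q_min


y = target_q_value(reward=1.0, q1_next=2.0, q2_next=3.0, gamma=0.5)
y_terminal = target_q_value(reward=1.0, q1_next=2.0, q2_next=3.0, gamma=0.5, done=True)
```

Taking the minimum of the two estimates keeps a single optimistic critic from inflating the target.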

[0036] The aforementioned Critic network update operation includes:

[0037] A. Using the mean squared error loss function, calculate the difference between the Q-value output by the Critic double Q network and the target Q-value:

[0038] $L(\theta_{\text{main}}) = \mathbb{E}\big[\big(Q_{\theta_{\text{main}}}(s, a) - y\big)^2\big]$

[0039] B. Use the backpropagation algorithm to minimize the loss function and update the parameters of the Critic double Q-network:

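[0040] $\theta_{\text{main}} \leftarrow \theta_{\text{main}} - \eta\, \nabla_{\theta_{\text{main}}} L(\theta_{\text{main}})$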

[0041] Where η is the learning rate;
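A minimal sketch of one Critic update follows, using a linear Q-function so that the gradient of the mean squared error can be written by hand. The linear model, the feature vectors, and the function names are assumptions for illustration only; the method described in the text uses neural networks trained by backpropagation.

```python
def q_value(weights, features):
    """Linear stand-in for a Q-network: Q(s, a) = w . phi(s, a)."""
    return sum(w * x for w, x in zip(weights, features))


def critic_step(weights, batch, learning_rate):
    """One gradient-descent step on the mean squared error between Q(s, a) and y."""
    n = len(batch)
    grad = [0.0] * len(weights)
    loss = 0.0
    for features, y in batch:
        err = q_value(weights, features) - y
        loss += err * err / n
        for i, x in enumerate(features):
            grad[i] += 2.0 * err * x / n  # d/dw of (w . x - y)^2
    new_weights = [w - learning_rate * g for w, g in zip(weights, grad)]
    return new_weights, loss


# Each item pairs hypothetical features phi(s, a) with a target Q-value y.
batch = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]
weights, loss_before = critic_step([0.0, 0.0], batch, learning_rate=0.5)
_, loss_after = critic_step(weights, batch, learning_rate=0.5)
```

For a double Q-network the same step would be applied to each of the two critics against the shared target y.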

[0042] The aforementioned Actor network update operation includes:

[0043] A. Calculate the policy loss of the Actor network based on the minimum Q-value output of the Critic double Q-network:

[0044] $L(\theta) = -\mathbb{E}_{s \sim \rho,\ a \sim \text{diffusion process}}\big[Q_{\min}(s, a) + \alpha H(\pi_\theta(a|s))\big]$

[0045] where $\pi_\theta(a|s)$ is the probability distribution of the actions generated through denoising, $H(\pi_\theta(a|s))$ is the entropy of the policy $\pi$, $Q_{\min}(s, a)$ is the minimum Q-value for the current state and action, and $\alpha$ is the entropy coefficient. The goal of the loss function is to maximize the Q-value of the action in the given state; it includes an entropy regularization term to encourage randomness in the policy and avoid getting trapped in local optima;

[0046] B. Minimize the policy loss to update the parameters of the Actor network;
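The policy loss can be estimated from a sampled batch as sketched below. The sketch assumes the entropy is approximated per sample by $-\log \pi_\theta(a|s)$, as is done for the entropy coefficient loss elsewhere in the text; the function name and the example numbers are hypothetical.

```python
def actor_loss(q1_values, q2_values, log_probs, alpha):
    """Batch estimate of L(theta) = -E[Q_min(s, a) + alpha * H], with H ~ -log pi(a|s)."""
    total = 0.0
    for q1, q2, log_prob in zip(q1_values, q2_values, log_probs):
        q_min = min(q1, q2)
        total += q_min + alpha * (-log_prob)
    return -total / len(log_probs)


loss = actor_loss(
    q1_values=[1.0, 2.0],
    q2_values=[2.0, 1.0],
    log_probs=[-1.0, -1.0],
    alpha=0.5,
)
```

A lower loss corresponds to actions with a higher minimum Q-value, a more random policy, or both.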

[0047] The automatic adjustment of the entropy coefficient includes:

[0048] A. Calculate the loss of the entropy coefficient based on the difference between the actual entropy value and the target entropy value of the current strategy:

[0049] $L(\alpha) = \alpha \cdot \big(-\log \pi(a|s) - H_{\text{target}}\big)$

[0050] where $-\log \pi(a|s)$ is the actual entropy of the current policy and $H_{\text{target}}$ is the target entropy value;

[0051] B. Calculate the gradient of the loss function $L(\alpha)$ with respect to the entropy coefficient $\alpha$ through backpropagation, and update the entropy coefficient using the Adam algorithm:

[0052] $\alpha \leftarrow \alpha - \eta\, \nabla_{\alpha} L(\alpha)$

[0053] Where η is the learning rate;

[0054] This dynamically adjusts the entropy coefficient, balancing the exploration-exploitation relationship, so that the policy network can both choose behaviors that do not seem optimal at the moment to obtain more information and make optimal decisions based on the currently known information.
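The adjustment of the entropy coefficient can be sketched as follows. For brevity the sketch uses a plain gradient-descent step instead of the Adam algorithm named in the text, and it keeps α above a small positive floor; both choices, along with the function name, are assumptions for illustration.

```python
def alpha_step(alpha, log_probs, target_entropy, learning_rate, floor=1e-6):
    """Gradient step on L(alpha) = alpha * (-log pi(a|s) - H_target), averaged over a batch."""
    # dL/dalpha is the gap between the sampled entropy estimate and the target entropy.
    grad = sum(-lp - target_entropy for lp in log_probs) / len(log_probs)
    return max(alpha - learning_rate * grad, floor)


# Policy more random than the target (entropy 3 > 1): alpha shrinks.
alpha_down = alpha_step(0.2, log_probs=[-3.0, -3.0], target_entropy=1.0, learning_rate=0.05)
# Policy less random than the target (entropy 0.5 < 1): alpha grows.
alpha_up = alpha_step(0.2, log_probs=[-0.5], target_entropy=1.0, learning_rate=0.05)
```

When the policy is already more random than the target, the weight on the entropy term falls and exploitation is favored; when it is less random, the weight rises and exploration is encouraged.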

[0055] The soft update of the target Q-network includes:

[0056] A. Perform a soft update on the parameters of the target Q-network, with the following update rule:

[0057] $\theta_{\text{target}} \leftarrow \tau\, \theta_{\text{main}} + (1 - \tau)\, \theta_{\text{target}}$

[0058] where $\tau$ is a small constant between 0 and 1 (e.g., 0.005) that controls the rate at which the target Q-network parameters $\theta_{\text{target}}$ are updated toward the Critic double Q-network parameters $\theta_{\text{main}}$;

[0059] B. After repeated training is completed, output the final policy network (Actor network), which can generate the optimal action for a given state.
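The soft update rule for the target Q-network can be sketched as follows, treating the parameters as flat lists of floats. The flat-list representation and the function name are assumptions for illustration.

```python
def soft_update(theta_target, theta_main, tau=0.005):
    """theta_target <- tau * theta_main + (1 - tau) * theta_target, element by element."""
    return [tau * m + (1.0 - tau) * t for m, t in zip(theta_main, theta_target)]


theta_target = [0.0, 0.0]
theta_main = [1.0, -2.0]
halfway = soft_update(theta_target, theta_main, tau=0.5)
slow = soft_update(theta_target, theta_main)  # default tau = 0.005
```

With a small τ the target parameters trail the main parameters slowly, which is what keeps the target Q-values from fluctuating rapidly during training.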

[0060] 3) Offline training: The above deep reinforcement learning model is trained offline to obtain an optimized control strategy for the liquid cooling system.

[0061] 4) Actual Control: Based on the optimized liquid cooling system control strategy described above, corresponding commands are sent to the host computer to achieve liquid cooling system control. Details are as follows:

[0062] A. Acquire data and preprocess it;

[0063] B. Divide the data into a training set and a validation (test) set, using 80% of the original data as the training set and the remaining 20% as the test set;

[0064] C. Perform a dimensionality transformation on the data so that it can subsequently be input into the MLP model;

[0065] D. Train with the D2SAC algorithm and, based on the trained model results, send corresponding instructions to the host computer to control the liquid cooling system.

[0066] Furthermore, the Deep Diffusion Soft Actor-Critic (D2SAC) algorithm uses a double Q-network to reduce estimation bias and incorporates an entropy term $H(\pi) = \mathbb{E}_{s \sim \rho,\ a \sim \pi}[-\log \pi(a|s)]$ into the policy optimization to increase the randomness of the policy, where $H(\pi)$ is the entropy of policy $\pi$, $s$ is the state, $a$ is the action, and $\rho$ is the state distribution;

[0067] The Deep Diffusion Soft Actor-Critic (D2SAC) algorithm monitors the entropy level of the current policy and dynamically adjusts the weighting coefficient $\alpha$ of the entropy term by minimizing the loss function $L(\alpha) = \alpha \cdot \big(-\log \pi(a|s) - H_{\text{target}}\big)$, where $-\log \pi(a|s)$ is the actual entropy of the current policy and $H_{\text{target}}$ is the target entropy value;

[0068] Through the above operations, the policy network can both choose behaviors that are not currently optimal to obtain more information and make optimal decisions based on currently known information. The liquid cooling strategy can be dynamically adjusted according to the entropy level of the current policy. For example, when the algorithm greedily chooses higher rewards and risks gradually getting trapped in a local optimum, the entropy term, acting as a penalty term, helps it escape this predicament and encourages it to explore more unknown possibilities. At the same time, because the entropy coefficient is small, the algorithm is not disturbed by excessive exploration and can make full use of the information obtained from training, achieving a balanced exploration-exploitation relationship.

[0069] Furthermore, the model training can also employ the Soft Actor-Critic (SAC) algorithm for interactive training within the training environment to train a deep reinforcement learning model. The SAC algorithm includes a policy network (Actor), a double Q-network (value network), a target value function update, and a policy update. The policy network (Actor) uses a diffusion-based algorithm, progressively adding Gaussian noise to the action distribution during the forward pass to increase its randomness, and progressively removing noise through learning during the backward pass to recover an optimal Gaussian distribution as the action distribution. An action is sampled from this distribution, and the Q-value of each state-action pair is estimated through the double Q-network (value network) and updated using the target value function. SAC updates the Actor network by maximizing the sum of the expected reward of the policy and the entropy term, repeating the above process until the termination condition is met.

[0070] The training process of the Soft Actor-Critic (SAC) algorithm is as follows:

[0071] A. At the start of training, initialize the parameters of the Actor and the double Q-network, as well as the experience replay pool D;

[0072] B. For each training iteration, obtain the initial state from the environment;

[0073] C. At each time step, the Actor generates the optimal Gaussian distribution as the action distribution based on the current state through a denoising process, samples an action from the distribution, executes the action, and observes the new state and reward;

[0074] D. Use the new state and reward information to update the parameters of the double Q-network;

[0075] E. Use the current output of the double Q-network and the entropy regularization term to update the Actor's parameters;

[0076] F. Repeat the above process until the training termination condition is met.

[0077] Current domestic research on liquid cooling control schemes for battery thermal management systems mainly relies on traditional technologies such as PID control, fuzzy control, and model predictive control, and lacks the application of machine learning technologies such as deep learning and reinforcement learning. This invention introduces deep reinforcement learning into the liquid cooling control technology of lithium battery thermal management systems, providing fundamental theoretical and key technical support for the application of deep learning and reinforcement learning in the field of lithium battery thermal management. Furthermore, building upon existing technologies, the adaptive model predictive control algorithms proposed in this invention, Deep Diffusion Soft Actor-Critic (D2SAC) and Soft Actor-Critic (SAC), can further improve the efficiency and effectiveness of liquid cooling control by exploring different cooling strategies, reducing estimation bias, and adaptively adjusting the control strategy.

Attached Figure Description

[0078] Figure 1 is a schematic diagram of the process of Embodiment 1 of the present invention;

[0079] Figure 2 is a flowchart of Embodiment 2 of the present invention.

Detailed Implementation

[0080] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0081] In electric vehicles and energy storage systems, the efficiency and stability of the liquid-cooled battery thermal management system play a crucial role in the overall performance and lifespan of the battery pack. Because batteries are affected by various factors during use, such as charge-discharge cycles, ambient temperature, and load changes, their internal parameters (e.g., SOC, SOH) and external conditions (e.g., temperature, current, voltage) constantly change. Therefore, developing a liquid-cooling control strategy that can adapt to these changes is particularly important. The Adaptive Model Predictive Control (D2SAC) algorithm of this invention is an innovative solution designed to address this need. This algorithm combines the predictive power of Model Predictive Control (MPC) with the adaptive learning capability of Deep Reinforcement Learning (DRL). Through continuous learning and optimization, it can automatically follow the changes in the parameters of the liquid-cooled battery thermal management system, thereby achieving dynamic adjustment of the liquid-cooling control strategy. The D2SAC algorithm's workflow steps are as follows:

[0082] 1. Data Collection and Preprocessing

[0083] The system collects various parameters of the battery and liquid cooling system in real time from the energy storage system test platform, including key data such as temperature, current, and voltage.

[0084] The collected data is preprocessed, including noise reduction, normalization, and handling of missing values, to ensure data quality.

[0085] Determine the state space and control action space of the battery and liquid cooling system to provide a foundation for subsequent modeling and training.
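The preprocessing operations named above (noise reduction, handling of missing values, correction of out-of-range readings, and normalization) can be sketched for a single sensor channel as follows. The specific choices, a trailing moving average, filling with the nearest earlier valid reading, clipping to fixed limits, and min-max scaling, are assumptions for illustration; the text does not prescribe particular methods.

```python
def fill_missing(values):
    """Replace None with the previous valid reading (or the first valid one at the start)."""
    last = next((v for v in values if v is not None), None)
    if last is None:
        raise ValueError("no valid readings to fill from")
    filled = []
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def clip_range(values, low, high):
    """Correct out-of-range readings by clamping them to the permitted limits."""
    return [min(max(v, low), high) for v in values]


def moving_average(values, window=3):
    """Simple denoising with a trailing moving average."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def min_max_normalize(values):
    """Scale readings to [0, 1]; a constant channel maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0] * len(values)
    return [(v - low) / (high - low) for v in values]


# Hypothetical temperature readings in degrees Celsius with a gap and a sensor glitch.
raw = [None, 25.0, None, 27.0, 200.0, 26.0]
clean = min_max_normalize(moving_average(clip_range(fill_missing(raw), -20.0, 60.0)))
```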

[0086] 2. Dataset partitioning

[0087] The preprocessed data is divided into a training set and a validation set (or test set), typically using 80% of the data as the training set and the remaining 20% as the test set. The held-out portion is used to evaluate the algorithm's ability to generalize to unseen data.
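A minimal sketch of the 80/20 partition is given below. Shuffling before the split and the fixed seed are assumptions; for strongly time-correlated battery data a chronological split may be preferable, which the text does not specify.

```python
import random


def split_dataset(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the samples and cut it into training and test portions."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


samples = list(range(10))  # stand-in for preprocessed (state, action, next state) records
train_set, test_set = split_dataset(samples)
```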

[0088] 3. Establish a Multilayer Perceptron (MLP) model

[0089] A multilayer perceptron (MLP) is used as the basic model to model the lithium battery and its surrounding environment. MLPs can handle nonlinear relationships and are suitable for predicting new states of the battery.

[0090] The current state of the battery (such as SOC, SOH, temperature, etc.) and control actions (such as liquid cooling temperature, liquid flow pressure, etc.) are used as inputs to the MLP, and the output is the predicted new state of the battery (such as future temperature, voltage, etc.).

[0091] Perform dimensionality transformation on the data to adapt it to the input requirements of the MLP model.

[0092] 4. Build a deep reinforcement learning training environment

[0093] A training environment for deep reinforcement learning (DRL) is built using an MLP model, where the MLP model is used to simulate the interaction between the current state of the battery and environmental conditions.

[0094] The D2SAC algorithm is designed, which combines reinforcement learning and entropy maximization to find a balance between exploration and exploitation, while encouraging policies with higher entropy (i.e., more randomness) to enhance the algorithm's robustness.

[0095] During training, the policy network (actor) generates noisy actions to increase the diversity of exploration. The value function of each action is estimated using a double Q-network and updated using the target value function to stabilize the training process.

[0096] The policy network is updated by maximizing the sum of the expected reward and the entropy term, allowing the policy to maintain a certain degree of randomness while pursuing high rewards. The liquid cooling strategy is dynamically adjusted based on the current entropy level of the policy to adapt to changes in system parameters.
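The way a state predictor can be wrapped into a training environment is sketched below. The reward (negative distance of the temperature from a setpoint), the episode length, the class name `BatteryEnv`, and the `toy_predictor` function standing in for the trained MLP are all hypothetical; the text does not define the reward function.

```python
class BatteryEnv:
    """Training environment whose dynamics come from a state predictor (e.g., the MLP)."""

    def __init__(self, predictor, initial_state, target_temp=0.5, max_steps=100):
        self.predictor = predictor
        self.initial_state = list(initial_state)
        self.target_temp = target_temp
        self.max_steps = max_steps
        self.reset()

    def reset(self):
        self.state = list(self.initial_state)
        self.steps = 0
        return list(self.state)

    def step(self, action):
        """Apply a control action and return (next_state, reward, done)."""
        self.state = self.predictor(self.state, action)
        self.steps += 1
        temperature = self.state[2]
        reward = -abs(temperature - self.target_temp)  # assumed reward shape
        done = self.steps >= self.max_steps
        return list(self.state), reward, done


def toy_predictor(state, action):
    """Hypothetical stand-in for the trained MLP: cooling effort lowers the temperature."""
    soc, soh, temperature = state
    return [soc, soh, temperature - 0.1 * action[0]]


env = BatteryEnv(toy_predictor, initial_state=[0.8, 0.95, 0.7], max_steps=2)
state, reward, done = env.step([1.0])
```

An agent would call `reset` at the start of each episode and `step` at each time step, storing the resulting transitions in the experience replay pool.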

[0097] 5. Real-time control

[0098] The trained model is deployed to the actual system, and corresponding instructions are sent to the host computer based on the model's results.

[0099] After receiving the command, the host computer controls the liquid cooling system to make corresponding adjustments in order to effectively manage parameters such as battery temperature.

[0100] Through the above steps, the D2SAC algorithm can achieve adaptive control of the liquid-cooled battery thermal management system, dynamically adjusting the control strategy according to changes in system parameters, thereby improving the overall performance and lifespan of the battery pack.

[0101] Example 1

[0102] The purpose of this embodiment is to provide a control method for real-time intelligent adjustment of the temperature of an energy storage liquid cooling system. The liquid-cooled battery thermal management system of this invention includes the following parameters: battery state of charge (SOC), battery state of health (SOH), temperature, current, voltage, liquid cooling temperature, and liquid flow pressure. The intelligent control method of the liquid-cooled battery thermal management system of this invention includes the following steps:

[0103] 1. Collect various parameters of the battery and liquid cooling system from the energy storage system experimental platform, such as temperature, current, and voltage data, preprocess the data, and determine the corresponding state and control action space.

[0104] 2. Obtain the preprocessed data and divide it into a training set and a validation (test) set, using 80% of the original data as the training set and the remaining 20% as the test set.

[0105] 3. Establish a multilayer perceptron (MLP) model, which receives the current state of the battery and control actions as inputs and outputs the predicted new state of the battery to model the lithium battery and its surrounding environment; perform dimensionality transformation on the data and input it into the MLP model.

[0106] 4. A deep reinforcement learning training environment is built using a multilayer perceptron (MLP) model. The D2SAC algorithm is designed to simulate the interaction between the current state of the battery and environmental conditions, and the reinforcement learning model is trained offline to continuously optimize the control strategy.

[0107] The D2SAC algorithm's policy network (actor) generates noisy actions; it estimates the value function of each action using a double-Q network and updates it using the objective value function; it updates the policy network by maximizing the sum of the expected reward and the entropy term; and it dynamically adjusts the liquid cooling policy based on the current policy's entropy level.

[0108] 5. Based on the trained model results, send corresponding instructions to the host computer to control the liquid cooling system for adjustment.

[0109] In step 1, the parameters for the operation of the battery and liquid cooling system may include: water pump operating status, unit outlet water temperature, replenishment water pump status, compressor operating status, unit return water temperature, fault alarm code, compressor operating status, unit outlet water pressure, electric heating operating status, unit return water pressure, condenser fan operating status, unit ambient temperature, total voltage, minimum cell voltage, maximum cell temperature, fan relay, discharge allowance control, status of internal fire suppression system, total current, module number of maximum cell voltage, module number of minimum cell temperature, operating indication, maximum allowable charging current, status of internal fire suppression system, battery state of charge (SOC), module number of minimum cell voltage, main positive relay, alarm indication, maximum allowable discharge current, SOH, maximum cell temperature, main negative relay, fault indication, maximum allowable charging power, maximum cell voltage, minimum cell temperature, pre-charge relay, charging allowance control, and maximum allowable discharge power.

[0110] The data preprocessing in step 1 includes data denoising, supplementing missing data, correcting erroneous or out-of-range data, and data normalization.

[0111] In step 3, the multilayer perceptron (MLP) model consists of an input layer, a hidden layer, and an output layer. The input layer is responsible for receiving data, and the number of nodes is consistent with the number of features in the input data. Each neuron in the hidden layer is connected to all nodes in the previous layer and is weighted and summed. The output layer uses a linear activation function to output the regression result.

[0112] In step 4, the deep reinforcement learning model is trained using the D2SAC algorithm. The D2SAC algorithm uses a policy network based on a diffusion model and a double Q-network as the value network to reduce estimation bias. Additionally, an entropy term $H(\pi) = \mathbb{E}_{s \sim \rho,\ a \sim \pi}[-\log \pi(a|s)]$ is added to the policy optimization to increase the randomness of the policy, where $H(\pi)$ is the entropy of policy $\pi$, $s$ is the state, $a$ is the action, and $\rho$ is the state distribution. Simultaneously, D2SAC monitors the entropy level of the current policy and dynamically adjusts the weighting coefficient $\alpha$ of the entropy term by minimizing the loss function $L(\alpha) = \alpha \cdot \big(-\log \pi(a|s) - H_{\text{target}}\big)$, where $-\log \pi(a|s)$ is the actual entropy of the current policy and $H_{\text{target}}$ is the target entropy value. Through the above operations, the policy network can both choose actions that are not currently optimal to obtain more information and make optimal decisions based on currently known information, that is, balance the exploration-exploitation relationship.

[0113] In step 4, the D2SAC algorithm training operation steps include: algorithm initialization, sampling action, storing experience, sampling and calculating the target Q value, updating the Critic double Q network, updating the Actor network, automatically adjusting the entropy coefficient, and softly updating the target Q network.

[0114] The algorithm initialization operation includes:

[0115] A. Initialize the environment, generate a policy network (Actor network) based on the diffusion model and use a double Q network as the value network (Critic network), and then set the neural network parameters, including initializing the noise parameters of the diffusion model (such as the number of steps T and the noise level σ) and using a Gaussian distribution to randomly initialize the weights of the double Q network.

[0116] B. Initialize the target Q-network, whose parameters are usually the same as those of the Critic double Q-network:

[0117] $\theta_{\text{target}} \leftarrow \theta_{\text{main}}$

[0118] where $\theta_{\text{target}}$ denotes the parameters of the target Q-network and $\theta_{\text{main}}$ denotes the parameters of the Critic double Q-network;

[0119] C. Initialize an experience replay pool to store experience samples generated by the agent's interaction with the environment, including but not limited to: state s, action a, reward r, next state s′, and termination flag d;

[0120] The action sampling operation includes:

[0121] A. Given the current state s, initialize a random initial vector from the diffusion model of the policy network. During the forward pass, noise is gradually added to the data:

[0122]

[0123] where $\alpha_t$ controls the intensity of the noise, and the added term is random noise.

[0124] In each step t, a deep neural network is used to infer the mean and variance of the denoised distribution. Based on the current state s and time step t, the mean and variance of the denoised distribution are output:

[0125]

[0126] A Gaussian distribution is constructed from the mean and variance, and a normally distributed perturbation term is added in order to randomly sample an action a.

[0127] B. Input the sampled action a into the environment to obtain the next state s′, the reward r, and a flag indicating whether to terminate;

[0128] The experience storage operation includes:

[0129] A. Store the current state s, action a, reward r, next state s′, and termination flag d in the experience replay pool;

[0130] The sampling and calculation of the target Q-value includes:

[0131] A. Randomly sample a batch of data from the experience replay pool to calculate the target Q-value and update the network parameters;

[0132] B. Calculate the Q-value of the next state with the target Q-network, where the action a′ is generated through the reverse diffusion process:

[0133] $Q_{\text{target}}(s', a') = \text{TargetQNetwork}(s', a')$

[0134] C. Choose the smaller Q-value to avoid overestimation:

[0135] $Q_{\min}(s', a') = \min\big(Q_1(s', a'),\ Q_2(s', a')\big)$

[0136] D. Based on the objective function of D2SAC, combine the reward $r$ and the discount factor $\gamma$ to calculate the target Q-value:

[0137] $y = r + \gamma\, Q_{\min}(s', a')$

[0138] The aforementioned Critic dual-Q network update operation includes:

[0139] A. Using the mean squared error loss function, calculate the difference between the Q-value output by the Critic double Q network and the target Q-value:

[0140] L(θ_main) = E[(Q_i(s, a) - y)^2], i = 1, 2

[0141] B. Use the backpropagation algorithm to minimize the loss function and update the parameters of the Critic double-Q network:

[0142] θ_main ← θ_main - η·∇L(θ_main)

[0143] where η is the learning rate and ∇L(θ_main) is the gradient of the loss with respect to θ_main;
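The Critic update can be illustrated with a deliberately simplified Python sketch. It assumes a single linear Q function Q = w·x in place of the neural networks, with hand-derived gradients; the features, targets, and learning rate are invented for the example.

```python
def critic_step(weights, features, targets, lr):
    """One gradient-descent step on the mean squared error between Q = w . x and y."""
    n = len(features)
    grad = [0.0] * len(weights)
    loss = 0.0
    for x, y in zip(features, targets):
        q = sum(w * xi for w, xi in zip(weights, x))
        err = q - y
        loss += err * err / n
        for j, xi in enumerate(x):
            grad[j] += 2.0 * err * xi / n     # derivative of the MSE with respect to w_j
    new_weights = [w - lr * g for w, g in zip(weights, grad)]
    return new_weights, loss                  # loss is measured before the step


features = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical (state, action) features
targets = [2.0, -1.0]                 # target Q values y
w = [0.0, 0.0]
losses = []
for _ in range(50):
    w, loss = critic_step(w, features, targets, lr=0.1)
    losses.append(loss)
```

With a double-Q network, the same step would be applied to each of the two Q functions against the shared target y.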

[0144] The aforementioned Actor network update operation includes:

[0145] A. Calculate the policy loss of the Actor network based on the minimum Q-value output of the Critic double Q-network:

[0146] L(θ) = -E_{s~ρ, a~diffusion process}[Q_min(s, a) + α·H(π_θ(a|s))]

[0147] where π_θ(a|s) is the probability distribution of the actions generated through denoising, H(π_θ(a|s)) is the entropy of policy π_θ, Q_min(s, a) is the minimum Q value for the current state and action, and α is the entropy coefficient. Minimizing this loss maximizes the Q value of the action taken in each state, while the entropy regularization term encourages randomness in the policy and helps it avoid getting trapped in local optima;

[0148] B. Minimize the policy loss to update the parameters of the Actor network;
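A minimal Python sketch of how the policy loss above could be estimated from a sampled batch. It assumes the entropy is approximated per sample by -log π(a|s), as in the entropy-coefficient step; the function name and batch values are illustrative.

```python
def policy_loss(q1_values, q2_values, log_probs, alpha):
    """Monte Carlo estimate of L = -E[Q_min(s, a) + alpha * H], with H ~ -log pi(a|s)."""
    total = 0.0
    for q1, q2, logp in zip(q1_values, q2_values, log_probs):
        total += min(q1, q2) + alpha * (-logp)
    return -total / len(log_probs)


# Hypothetical batch of two (state, action) pairs.
loss = policy_loss(
    q1_values=[1.0, 2.0],
    q2_values=[1.5, 1.0],
    log_probs=[-0.5, -1.5],
    alpha=0.2,
)
```

A lower loss corresponds to actions with a higher minimum Q value, a higher-entropy policy, or both.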

[0149] The automatic adjustment of the entropy coefficient includes:

[0150] A. Calculate the loss of the entropy coefficient based on the difference between the actual entropy of the current policy and the target entropy:

[0151] L(α) = α·(-log π(a|s) - H_target)

[0152] where -log π(a|s) is the actual entropy of the current policy and H_target is the target entropy;

[0153] B. Calculate the gradient of the loss function L(α) with respect to the entropy coefficient α through backpropagation, and update the entropy coefficient using the Adam algorithm:

[0154] α ← α - η·∇L(α)

[0155] where η is the learning rate;

[0156] This dynamically adjusts the entropy coefficient, balancing the exploration-exploitation relationship, so that the policy network can both choose behaviors that do not seem optimal at the moment to obtain more information and make optimal decisions based on the currently known information.
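The entropy-coefficient adjustment can be sketched in Python as below. Note the swapped-in technique: the text specifies the Adam algorithm, whereas this sketch uses a plain gradient step for brevity. The clamp at zero and the sample values are also assumptions.

```python
def update_alpha(alpha, log_probs, target_entropy, lr):
    """One plain gradient step on L(alpha) = alpha * (-log pi(a|s) - H_target)."""
    entropy_estimate = -sum(log_probs) / len(log_probs)
    grad = entropy_estimate - target_entropy   # dL/d(alpha), averaged over the batch
    return max(alpha - lr * grad, 0.0)         # keep alpha non-negative (assumption)


# Policy entropy below the target: alpha grows, rewarding more exploration.
alpha_after_low_entropy = update_alpha(0.2, [-0.2, -0.4], target_entropy=1.0, lr=0.1)
# Policy entropy above the target: alpha shrinks, favoring exploitation.
alpha_after_high_entropy = update_alpha(0.2, [-2.0, -3.0], target_entropy=1.0, lr=0.1)
```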

[0157] The soft update target Q network operation includes:

[0158] A. Perform a soft update on the parameters of the target Q network according to the following rule:

[0159] θ_target ← τ·θ_main + (1 - τ)·θ_target

[0160] where τ is a small constant between 0 and 1 (e.g., 0.005) that controls the rate at which the target Q-network parameters θ_target move toward the Critic double-Q network parameters θ_main.

[0161] B. After repeated training is complete, output the final policy network (Actor network), which can generate the optimal action for a given state.
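The soft update rule in the operation above can be illustrated with a short Python sketch, treating each network's parameters as a flat list of floats (a simplifying assumption).

```python
def soft_update(target_params, main_params, tau=0.005):
    """theta_target <- tau * theta_main + (1 - tau) * theta_target, element-wise."""
    return [tau * m + (1.0 - tau) * t for t, m in zip(target_params, main_params)]


theta_target = [0.0, 1.0]   # hypothetical target Q-network parameters
theta_main = [1.0, 1.0]     # hypothetical Critic double-Q network parameters
theta_target = soft_update(theta_target, theta_main, tau=0.005)
```

Because τ is small, the target parameters move only a fraction of the way toward the main parameters on each call, which keeps the target Q values slowly varying.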

[0162] Since the parameters of the liquid-cooled battery thermal management system itself will change with the operating conditions and environment, the adaptive model predictive control Deep Diffusion Soft Actor-Critic (D2SAC) algorithm of this invention can automatically follow the changes in system parameters and continuously optimize the liquid cooling control strategy.

[0163] Alternatively, in the above embodiments, the Soft Actor-Critic (SAC) algorithm can be used to interact with the training environment and train the deep reinforcement learning model. The SAC algorithm comprises a policy network (Actor), a double-Q network (value network), a target value function update, and a policy update. The policy network (Actor) generates noisy actions; the double-Q network (value network) estimates the value function of each action and is updated using the target value function. SAC updates the Actor network by maximizing the sum of the policy's expected reward and the entropy term, and dynamically adjusts the liquid cooling policy according to the policy's current entropy level.

[0164] The training process of the Soft Actor-Critic (SAC) algorithm is as follows:

[0165] A. At the start of training, initialize the parameters of the Actor and the double-Q network;

[0166] B. For each training iteration, obtain the initial state from the environment;

[0167] C. At each time step, the Actor generates a noisy action based on the current state, executes the action, and observes the new state and reward;

[0168] D. Use the new state and reward information to update the parameters of the double-Q network;

[0169] E. Use the current output of the double-Q network and the entropy regularization term to update the Actor's parameters;

[0170] Repeat the above process until the training termination condition is met.

[0171] Example 2

[0172] The purpose of this embodiment is to provide a control method for real-time intelligent adjustment of the temperature of an energy storage liquid cooling system. The method uses the Soft Actor-Critic (SAC) algorithm to train the model and includes the following steps:

[0173] 1. Collect various parameters of the battery and liquid cooling system from the energy storage system experimental platform, such as temperature, current, and voltage data, preprocess the data, and determine the corresponding state and control action space.

[0174] 2. Obtain the preprocessed data and divide it into a training set and a test set, using 80% of the original data as the training set and the remaining 20% as the test set.

[0175] 3. Establish a multilayer perceptron (MLP) model, which receives the current state of the battery and control actions as inputs and outputs the predicted new state of the battery to model the lithium battery and its surrounding environment; perform dimensionality transformation on the data and input it into the MLP model.

[0176] 4. A deep reinforcement learning training environment is built using a multilayer perceptron (MLP) model. The SAC algorithm is designed to simulate the interaction between the current state of the battery and environmental conditions, and the reinforcement learning model is trained offline to continuously optimize the control strategy.

[0177] The SAC algorithm's policy network (Actor) generates noisy actions; a double-Q network estimates the value function of each action and is updated using the target value function; the policy network is updated by maximizing the sum of the expected reward and the entropy term; and the liquid cooling policy is dynamically adjusted based on the current policy's entropy level.

[0178] 5. Based on the trained model results, send corresponding instructions to the host computer to control the liquid cooling system for adjustment.
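The 80%/20% split in step 2 can be sketched in Python as follows. Shuffling before the split and the fixed seed are assumptions; for time-series data a chronological split may be preferable.

```python
import random


def split_dataset(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the samples and split it into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


samples = list(range(10))   # placeholder for preprocessed records
train_set, test_set = split_dataset(samples)
```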

[0179] In step 1, the parameters for the operation of the battery and liquid cooling system may include: water pump operating status, unit outlet water temperature, replenishment water pump status, compressor operating status, unit return water temperature, fault alarm code, compressor operating status, unit outlet water pressure, electric heating operating status, unit return water pressure, condenser fan operating status, unit ambient temperature, total voltage, minimum cell voltage, maximum cell temperature, fan relay, discharge allowance control, status of internal fire suppression system, total current, module number of maximum cell voltage, module number of minimum cell temperature, operating indication, maximum allowable charging current, status of internal fire suppression system, battery state of charge (SOC), module number of minimum cell voltage, main positive relay, alarm indication, maximum allowable discharge current, SOH, maximum cell temperature, main negative relay, fault indication, maximum allowable charging power, maximum cell voltage, minimum cell temperature, pre-charge relay, charging allowance control, and maximum allowable discharge power.

[0180] The data preprocessing in step 1 includes data denoising, supplementing missing data, correcting erroneous or out-of-range data, and data normalization.
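A minimal Python sketch of three of these preprocessing operations on a single signal: filling missing readings, correcting out-of-range values, and normalization. The forward-fill strategy, the clipping limits, and normalizing against those fixed limits are assumptions; denoising is omitted.

```python
def preprocess(values, lower, upper):
    """Fill gaps, clip to [lower, upper], then scale to [0, 1]."""
    first_valid = next(v for v in values if v is not None)
    cleaned, last = [], first_valid
    for v in values:
        if v is None:
            v = last                      # forward-fill a missing reading
        v = min(max(v, lower), upper)     # pull out-of-range readings back to the limits
        cleaned.append(v)
        last = v
    return [(v - lower) / (upper - lower) for v in cleaned]


# Hypothetical cell temperatures with one gap and one implausible spike.
temps = [25.0, None, 30.0, 120.0, 20.0]
normalized = preprocess(temps, lower=0.0, upper=50.0)
```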

[0181] In step 3, the multilayer perceptron (MLP) model consists of an input layer, a hidden layer, and an output layer. The input layer receives the data, and its number of nodes matches the number of features in the input data. Each neuron in the hidden layer is connected to all nodes in the previous layer and computes a weighted sum of their outputs. The output layer uses a linear activation function to output the regression result.
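The forward pass of such an MLP can be sketched in plain Python. The ReLU hidden activation, the layer sizes, the random initialization, and the state and action layout are assumptions for illustration; the text specifies only fully connected layers and a linear output.

```python
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def mlp_forward(x, params):
    hidden = relu(dense(x, params["w1"], params["b1"]))   # hidden layer (ReLU assumed)
    return dense(hidden, params["w2"], params["b2"])      # linear output for regression


def init_params(n_in, n_hidden, n_out, rng):
    def matrix(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {"w1": matrix(n_hidden, n_in), "b1": [0.0] * n_hidden,
            "w2": matrix(n_out, n_hidden), "b2": [0.0] * n_out}


rng = random.Random(0)
# Hypothetical layout: state = [max cell temp, min cell temp, current], action = [pump duty].
state, action = [0.6, 0.5, 0.3], [0.8]
params = init_params(n_in=4, n_hidden=8, n_out=3, rng=rng)
next_state = mlp_forward(state + action, params)   # predicted next state (untrained)
```

The input concatenates the current state and the control action, and the output has the same dimension as the state, matching the model's role as the training environment.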

[0182] In step 4, the deep reinforcement learning model is trained using the SAC algorithm. The SAC algorithm uses a double-Q network to reduce estimation bias, entropy regularization to enhance exploration, and automatic adjustment of the entropy coefficient to balance exploration and exploitation. In addition to the double-Q network, the SAC algorithm incorporates an entropy term H(π) = E_{s~ρ, a~π}[-log π(a|s)] into the policy optimization to increase the randomness of the policy, where H(π) is the entropy of policy π, s is the state, a is the action, and ρ is the state distribution. At the same time, SAC monitors the entropy level of the current policy and dynamically adjusts the weighting coefficient α of the entropy term by minimizing the loss function L(α) = α·(-log π(a|s) - H_target), where -log π(a|s) is the actual entropy of the current policy and H_target is the target entropy. Through these operations, the policy network can both choose actions that are not currently optimal in order to obtain more information and make optimal decisions based on currently known information, that is, it balances exploration and exploitation.

[0183] In step 4, the SAC algorithm training steps include: algorithm initialization, sampling action, storing experience, sampling and calculating the target Q value, updating the Critic double Q network, updating the Actor network, automatically adjusting the entropy coefficient, and softly updating the target Q network.

[0184] The algorithm initialization operation includes:

[0185] A. Initialize the environment and set the parameters for the Actor network (policy network) and the dual-Q network (value network);

[0186] B. Initialize the target Q-network, whose parameters are usually the same as those of the Critic dual Q-network;

[0187] θ_target ← θ_main

[0188] where θ_target denotes the parameters of the target Q network and θ_main denotes the parameters of the Critic double-Q network.

[0189] C. Initialize an experience replay pool to store the experience samples generated by the agent's interaction with the environment, including but not limited to: state s, action a, reward r, next state s′, and termination flag d;

[0190] The sampling operation includes:

[0191] A. Given the current state s, initialize a random initial vector from the diffusion model of the policy network. During the forward pass, noise is gradually added to the data:

[0192]

[0193] where α_t controls the intensity of the noise and ε is random noise.

[0194] At each step t, a deep neural network takes the current state s and the time step t and outputs the mean and variance of the denoised distribution:

[0195]

[0196] A Gaussian distribution is constructed from the mean and variance, a normally distributed perturbation term is added, and an action a is randomly sampled from the result.

[0197] B. Input the sampled action a into the environment to obtain the next state s′, the reward r, and the termination flag d;

[0198] The storage experience operation includes:

[0199] A. Store the current state s, action a, reward r, next state s′, and termination flag d in the experience replay pool;

[0200] The sampling and calculation of the target Q value includes:

[0201] A. Randomly sample a batch of data from the experience replay pool to calculate the target Q value and update the network parameters;

[0202] B. Calculate the Q value of the next state from the target Q network:

[0203] Q_target(s′, a′) = TargetQNetwork(s′, a′)

[0204] C. Choose the smaller Q value to avoid overestimation:

[0205] Q_min(s′, a′) = min(Q_1(s′, a′), Q_2(s′, a′))

[0206] D. Based on the objective function of SAC, combine the reward r and the discount factor γ to calculate the target Q value:

[0207] y = r + γ·Q_min(s′, a′)

[0208] The aforementioned Critic dual-Q network update operation includes:

[0209] A. Using the mean squared error loss function, calculate the difference between the Q-value output by the Critic double Q network and the target Q-value:

[0210] L(θ_main) = E[(Q_i(s, a) - y)^2], i = 1, 2

[0211] B. Use the backpropagation algorithm to minimize the loss function and update the parameters of the Critic double-Q network:

[0212] θ_main ← θ_main - η·∇L(θ_main)

[0213] where η is the learning rate and ∇L(θ_main) is the gradient of the loss with respect to θ_main.

[0214] The aforementioned Actor network update operation includes:

[0215] A. Calculate the policy loss of the Actor network based on the minimum Q-value output of the Critic double Q-network:

[0216] L(θ) = -E_{s~ρ, a~π_θ}[Q_min(s, a) + α·H(π_θ(a|s))]

[0217] where H(π_θ(a|s)) is the entropy of policy π_θ, Q_min(s, a) is the minimum Q value for the current state and action, and α is the entropy coefficient. Minimizing this loss maximizes the Q value of the action taken in the current state, while the entropy regularization term encourages randomness in the policy and helps it avoid getting trapped in local optima.

[0218] B. Minimize the policy loss to update the parameters of the Actor network;

[0219] The automatic adjustment of the entropy coefficient includes:

[0220] A. Calculate the loss of the entropy coefficient based on the difference between the actual entropy of the current policy and the target entropy:

[0221] L(α) = α·(-log π(a|s) - H_target)

[0222] B. Calculate the gradient of the loss function L(α) with respect to the entropy coefficient α through backpropagation, and update the entropy coefficient using the Adam algorithm:

[0223] α ← α - η·∇L(α)

[0224] where η is the learning rate.

[0225] This allows for dynamic adjustment of the entropy coefficient, enabling the policy network to both select behaviors that are not currently optimal to obtain more information and make optimal decisions based on currently known information, thus balancing the exploration-exploitation relationship.

[0226] The soft update target Q network operation includes:

[0227] A. Perform a soft update on the parameters of the target Q network according to the following rule:

[0228] θ_target ← τ·θ_main + (1 - τ)·θ_target

[0229] where τ is a small constant between 0 and 1 (e.g., 0.005) that controls the rate at which the target Q-network parameters θ_target move toward the Critic double-Q network parameters θ_main.

[0230] B. After repeated training is complete, output the final policy network (Actor network), which can generate the optimal action for a given state.

[0231] Both Embodiments 1 and 2 aim to solve the decision-making problem of liquid-cooled battery thermal management systems in complex environments by combining deep learning and reinforcement learning methods. They both employ a dual-Q network to reduce the possibility of overestimating the Q value and incorporate an entropy term in policy optimization to enhance exploration capabilities. This allows for the efficient training of a liquid-cooled battery thermal management strategy capable of coping with complex environmental changes. This strategy can dynamically adjust control actions according to different system states and environmental conditions to optimize the overall performance and lifespan of the battery pack.

[0232] The D2SAC algorithm in Example 1 employs a policy network based on a diffusion model, which can potentially generate more diverse and stochastic policies. In liquid cooling control, this means the algorithm can explore a wider range of cooling strategies, including combinations of parameters such as coolant flow rate, temperature setpoint, and pump speed, to find a better cooling solution. The combination of the diffusion model and entropy regularization in Example 1 enables the D2SAC algorithm to exhibit stronger adaptability in liquid cooling control. As system operating conditions change (such as increased load, ambient temperature fluctuations, and changes in the battery pack's chemical properties over time), the algorithm can automatically adapt to these changes and adjust the cooling strategy to ensure the system always operates in an optimal state as much as possible.

[0233] The SAC algorithm in Example 2 strengthens exploration through entropy regularization with an automatically adjusted entropy coefficient. In liquid cooling control, this means that the method can dynamically adjust its exploration behavior based on the current operating status of the system and performance feedback. When the Q value tends to stabilize, the entropy term pushes the policy away from its current behavior so that it explores unknown regions of the action space and may discover better strategies; at the same time, the model still tends to choose actions with higher rewards, exploiting what it has already learned to improve performance.

[0234] The above is a detailed description of the present invention in conjunction with specific embodiments, and it should not be construed that the specific embodiments of the present invention are limited to these descriptions. For those skilled in the art, any equivalent substitutions or obvious modifications made without departing from the concept of the present invention, and which have the same performance or use, should be considered to fall within the patent protection scope defined by the submitted claims.

Claims

1. A control method for a liquid-cooled battery thermal management system, characterized in that it includes the following steps: 1) Environment setup: collect various data on the operation of the battery and liquid cooling system, and establish a multilayer perceptron (MLP) model consisting of an input layer, hidden layers, and an output layer to build a deep reinforcement learning training environment; the input layer is responsible for receiving data, and its number of nodes is consistent with the number of features in the input data; each neuron in the hidden layers is connected to all nodes in the previous layer and computes a weighted sum over those connections; the output layer uses a linear activation function to output the regression results; 2) Model training: a deep reinforcement learning training environment is built using the aforementioned MLP model, and an adaptive model predictive control algorithm (D2SAC) is used to train the deep reinforcement learning model through interaction within the training environment; the policy network of the D2SAC algorithm uses a diffusion-based algorithm: during the forward pass, Gaussian noise is gradually added to the action distribution to increase its randomness; during the reverse pass, the noise is gradually removed through learning to recover an optimal Gaussian distribution as the action distribution, and an action is sampled from this distribution, where s is the current state, a is the action generated by the policy network, and σ represents the noise level; the value function of each action is estimated using a double-Q network as the value network to reduce overestimation; the value network is updated using the target value function, where r is the reward, γ is the discount factor, and d is the termination flag; the policy network is updated using an objective function composed of the expected reward of the policy and the entropy term, where α is the entropy coefficient and H is the entropy term of the action; the action probability distribution is updated with the updated policy network to generate a new action, the new action is executed and the new state and reward are observed, and the training is repeated iteratively until convergence; 3) Offline training: the above deep reinforcement learning model is trained offline to obtain an optimized control strategy for the liquid cooling system; 4) Actual control: based on the optimized liquid cooling system control strategy described above, send corresponding instructions to the host computer to realize the control of the liquid cooling system.

2. The control method for a liquid-cooled battery thermal management system according to claim 1, characterized in that, The environment setup includes: 1) Build a data collection platform that communicates in real time with the battery management system and liquid cooling control system. Collect various parameters and operating data of the battery from the energy storage system experimental platform, upload the data to the cloud server database, and the training equipment extracts data from the cloud server and performs data preprocessing. The data preprocessing includes: data denoising, supplementing missing data, correcting erroneous or data exceeding the permissible range, and data normalization. By establishing a physical simulation model of the battery and liquid cooling system, obtain more battery operating information under more conditions than the real data to achieve data augmentation. 2) Based on the processed data, an MLP model is established to receive the current state of the battery and control actions as inputs and output the predicted new state of the battery, thereby modeling the lithium battery and its surrounding environment.

3. The control method for a liquid-cooled battery thermal management system according to claim 1, characterized in that, The operation of the adaptive model predictive control algorithm includes algorithm initialization, sampling action, storing experience, sampling and calculating the target Q value, updating the double Q network, updating the policy network, automatically adjusting the entropy coefficient, and softly updating the target Q network; repeat the above operations until the policy converges or the stopping condition is met.

4. The control method for a liquid-cooled battery thermal management system according to claim 1, characterized in that the adaptive model predictive control algorithm employs a policy network based on a diffusion model and uses a double-Q network as the value network for alternating estimation to reduce estimation bias; an entropy term H(π) = E_{s~ρ, a~π}[-log π(a|s)] is also incorporated into the policy optimization to increase the randomness of the policy, where H(π) is the policy entropy, s is the state, a is the action, and ρ is the state distribution; the adaptive model predictive control algorithm monitors the entropy level of the current policy and dynamically adjusts the entropy coefficient α by minimizing the loss function L(α) = α·(-log π(a|s) - H_target), where -log π(a|s) is the actual entropy of the current policy and H_target is the target entropy; through the above operations, the policy network can both choose behaviors that are not currently optimal to obtain more information and make optimal decisions based on currently known information, thus balancing exploration and exploitation.

5. The control method for a liquid-cooled battery thermal management system according to claim 3, characterized in that the algorithm initialization operation includes: A. initialize the environment, initialize the policy network generated based on the diffusion model, use a double-Q network as the value network, and set the neural network parameters, including initializing the noise parameters of the diffusion model and randomly initializing the weights of the double-Q network using a Gaussian distribution, the noise parameters being the number of steps and the noise level σ; B. initialize the target Q network, whose parameters are usually the same as those of the double-Q network: θ_target ← θ_main, where θ_target denotes the parameters of the target Q network and θ_main denotes the parameters of the double-Q network; C. initialize an experience replay pool to store the experience samples generated by the agent's interaction with the environment, including: state s, action a, reward r, next state s′, and termination flag d; the sampling operation includes: A. given the current state s, initialize a random initial vector from the diffusion model of the policy network, with noise gradually added to the data during the forward pass, where α_t controls the intensity of the noise and ε is random noise; at each step t, a deep neural network infers the mean and variance of the denoised distribution from the current state s and the time step t; a Gaussian distribution is constructed from the mean and variance, a normally distributed perturbation term is added, and an action a is randomly sampled from the result; B. input the sampled action a into the environment to obtain the next state s′, the reward r, and the termination flag d; the storage experience operation includes: A. store the current state s, action a, reward r, next state s′, and termination flag d in the experience replay pool; the sampling and calculation of the target Q value includes: A. randomly sample a batch of data from the experience replay pool to calculate the target Q value and update the network parameters; B. calculate the Q value of the next state from the target Q network, where the action a′ is generated through the reverse diffusion process: Q_target(s′, a′) = TargetQNetwork(s′, a′); C. choose the smaller Q value to avoid overestimation: Q_min(s′, a′) = min(Q_1(s′, a′), Q_2(s′, a′)); D. based on the objective function of D2SAC, combine the reward r and the discount factor γ to calculate the target Q value: y = r + γ·Q_min(s′, a′); the double-Q network update operation includes: A. using the mean squared error loss function, calculate the difference between the Q value output by the double-Q network and the target Q value y; B. use the backpropagation algorithm to minimize the loss function and update the parameters of the double-Q network, where η is the learning rate; the policy network update operation includes: A. calculate the policy loss of the policy network based on the minimum Q value output by the double-Q network: L(θ) = -E_{s~ρ, a~diffusion process}[Q_min(s, a) + α·H(π_θ(a|s))], where π_θ(a|s) is the probability distribution of the actions generated through denoising, H(π_θ(a|s)) is the policy entropy, Q_min(s, a) is the minimum Q value for the current state and action, and α is the entropy coefficient; the goal of the loss function is to maximize the Q value of the action in the state, and it includes an entropy regularization term to encourage randomness in the policy and avoid getting trapped in local optima; B. minimize the policy loss to update the parameters of the policy network; the automatic adjustment of the entropy coefficient includes: A. calculate the loss of the entropy coefficient based on the difference between the actual entropy of the current policy and the target entropy: L(α) = α·(-log π(a|s) - H_target), where -log π(a|s) is the actual entropy of the current policy and H_target is the target entropy; B. calculate the gradient of the loss function L(α) with respect to the entropy coefficient α through backpropagation and update the entropy coefficient using the Adam algorithm, where η is the learning rate; this allows the entropy coefficient to be dynamically adjusted to balance exploration and exploitation, meaning that the policy network can both choose behaviors that do not seem optimal at the moment to obtain more information and make optimal decisions based on currently known information; the soft update target Q network operation includes: A. perform a soft update on the parameters of the target Q network according to the following rule: θ_target ← τ·θ_main + (1 - τ)·θ_target, where τ is a small constant between 0 and 1 that controls the rate at which the target Q-network parameters θ_target move toward the double-Q network parameters θ_main; B. after repeated training is complete, output the final policy network, which can generate the optimal action for a given state.

6. The control method for a liquid-cooled battery thermal management system according to claim 1, characterized in that the steps of the liquid cooling system control strategy are as follows: A. obtain the preprocessed data; B. divide the data into a training set and a test set, using 80% of the original data as the training set and the remaining 20% as the test set; C. perform a dimensionality transformation on the data so that it can subsequently be input into the MLP model; D. use the SAC algorithm to train the MLP model, and send corresponding instructions to the host computer based on the trained model results to control the liquid cooling system for regulation.