Method for optimizing control strategy of thermal management system in low-temperature fast-charging scenario and storage medium

By optimizing the control strategy of the electric vehicle thermal management system with a reinforcement learning algorithm, the method addresses the conflict between charging speed and passenger cabin thermal comfort in low-temperature environments, balancing charging time, passenger cabin thermal comfort, and energy efficiency and improving the overall energy efficiency of the system.

CN120995718B · Active · Publication Date: 2026-06-19 · CHONGQING UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHONGQING UNIV
Filing Date
2025-09-30
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In low-temperature environments, the thermal management system of electric vehicles struggles to balance charging speed with passenger cabin thermal comfort. Existing rule-based control strategies cannot precisely adjust these factors, leading to reduced energy efficiency.

Method used

The agent is trained using reinforcement learning algorithms. By constructing Actor and Critic networks, the control strategy of the thermal management system is optimized. By combining the action space and state space, a multi-objective reward function is formulated to achieve a balance between charging time, passenger cabin thermal comfort, and thermal management system energy efficiency.

Benefits of technology

It achieves multi-objective optimization control of charging time, passenger cabin thermal comfort, and thermal management system energy consumption in low-temperature fast charging scenarios. Compared with traditional methods, it more accurately balances multiple needs and improves overall energy efficiency.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention relates to a method for optimizing the control strategy of a thermal management system in a low-temperature fast charging scenario, comprising the following steps: Establishing a training environment: determining the operating mode of the electric vehicle's thermal management system in a low-temperature fast charging scenario, and establishing a vehicle simulation model; constructing an action space and a state space, and determining the value range of the action space and the scope of consideration of the state space; formulating training conditions, using uniformly distributed sampling to determine the ambient temperature and the set temperature of the passenger compartment at the start of each training round; constructing a reward function, establishing an Actor network and a Critic network, and performing reinforcement learning training. This invention also proposes a storage medium. This invention trains the agent using a reinforcement learning algorithm to find a balance between charging time, passenger compartment thermal comfort, and thermal management system energy efficiency in a low-temperature fast charging scenario. It can solve the multi-objective optimization control problem of charging time, passenger compartment thermal comfort, and thermal management system energy consumption in a low-temperature fast charging scenario.

Description

Technical Field

[0001] This invention relates to the field of electric vehicle technology, specifically to a method for optimizing the control strategy of a thermal management system in low-temperature fast charging scenarios and to a storage medium.

Background Technology

[0002] Electric vehicles, powered by electricity, offer numerous advantages such as low operating costs, good power performance, and low noise, leading to their increasing popularity in recent years. For everyday use, AC slow charging is sufficient to meet the battery replenishment needs of electric vehicles. However, in scenarios requiring urgent charging, such as long-distance highway driving or temporary long-distance trips, DC fast charging becomes a crucial technological solution.

[0003] Compared to normal temperature environments, power batteries exhibit higher internal resistance at low temperatures, resulting in a significant decrease in charging performance. Therefore, to achieve faster charging speeds during DC fast charging in low-temperature environments, thermal management systems are often used to heat the power battery, reducing its internal resistance and restoring its charging performance. However, in many low-temperature scenarios, the passenger compartment also requires heating. In these cases, the thermal management system needs to allocate energy simultaneously for both power battery heating and passenger compartment heating, creating a conflict between the demands for charging speed and the need for passenger compartment thermal comfort.

[0004] In existing technologies, most electric vehicle thermal management systems employ rule-based control strategies to address the aforementioned conflicts. However, this rule-based control strategy suffers from the following technical problems:

[0005] Rule-based control strategies can only distribute heat in a fixed proportion, which can only meet the functional requirements. Moreover, the calibration of key parameters in such rule-based control strategies (such as battery heating exit temperature) is highly dependent on the experience of engineers, making it difficult to accurately balance the charging speed requirements with the thermal comfort requirements of the passenger cabin.

[0006] Adopting a rule-based control strategy may lead to a decrease in the overall energy efficiency of the thermal management system due to unreasonable heat distribution.

Summary of the Invention

[0007] The purpose of this invention is to propose a method for optimizing the control strategy of a thermal management system and a storage medium in a low-temperature fast charging scenario, so as to alleviate or eliminate at least one of the above-mentioned technical problems.

[0008] The present invention provides a method for optimizing the control strategy of a thermal management system in a low-temperature fast charging scenario, comprising the following steps:

[0009] S100: Set up the training environment: Determine the working mode of the thermal management system of electric vehicles in low temperature fast charging scenarios, and build a whole vehicle simulation model, which includes the power battery, passenger compartment and thermal management system.

[0010] S200: Constructing the action space: Taking the adjustment variables of the actuators that need to be controlled in the working mode as action variables, determining the value range of the action variables, the action variables including the charging rate;

[0011] S300: Constructing the state space: Define direct state variables as information that can be directly measured by on-board sensors, define indirect state variables as information constructed based on the direct state variables, and determine the scope of consideration for the direct and indirect state variables. The direct state variables include the state of charge of the power battery, the temperature of the power battery, the temperature of the passenger compartment, and the electrical power of the thermal management system. The indirect state variables include the rate of change of the state of charge of the power battery, the passenger compartment temperature error, and the integral of the passenger compartment temperature error.

[0012] S400: Set training conditions: Determine the initial state of charge of the battery at the start of charging and the final state of charge of the battery at the end of charging, determine the ambient temperature range and the passenger compartment set temperature range, and sample uniformly within these two ranges to determine the ambient temperature and the passenger compartment set temperature at the start of each training round.

[0013] S500: Constructing a reward function: Constructing a reward function with multiple objectives for optimization, including power battery charging speed, power battery temperature, passenger cabin thermal comfort, and thermal management system energy consumption;

[0014] S600: Constructing the Actor network and Critic network: Constructing the Actor network and Critic network based on the number of action variables and the number of state variables, wherein the state variables include the direct state variables and the indirect state variables;

[0015] S700: Reinforcement Learning Training: Determine the reinforcement learning algorithm, determine the hyperparameters and training termination conditions of the reinforcement learning algorithm, and based on the training environment, the training conditions and the reward function, use an agent that includes the Actor network and the Critic network, takes the action space as the output space and the state space as the input space, to interact with the training environment and perform reinforcement learning training with the goal of maximizing the cumulative reward until the training converges or the training termination condition is reached.

[0016] Optionally, determining the operating mode of the electric vehicle's thermal management system in a low-temperature fast-charging scenario includes the following steps:

[0017] Determine the state of topology components related to the thermal management system of electric vehicles in a low-temperature fast charging scenario, wherein the topology components are parts that can change the topology of the thermal management system;

[0018] Identify the actuators that need to be controlled in the thermal management system of electric vehicles under low-temperature fast charging scenarios.

[0019] Optionally, the topology components include a battery three-way valve, an HVAC three-way valve, and an electric-drive three-way valve; the action variables also include the PTC speed, the fan speed, the blower speed, the battery water pump speed, the HVAC water pump speed, and the electric-drive water pump speed.

[0020] Optionally, the reinforcement learning training further includes the following steps:

[0021] The output range of the agent is defined as [-1, 1]. The output range of the agent is converted to the value range of the action variable and then input into the training environment.

[0022] The input range of the agent is defined as [-1, 1]. The state variable is converted to the input range of the agent and then input to the agent.

[0023] Optionally, setting the training conditions further includes the following step: under the lowest ambient temperature condition, a simulation is performed based on the vehicle simulation model with a constant charging rate of 1 and the thermal management system turned off, and the simulation duration is rounded down to a multiple of the agent's decision period Ts to determine the training duration for a single round.

[0024] Optionally, the reward function is:

[0025]

[0026] Where DiffSOC is the rate of change of state of charge of the power battery, and RBatt1 is a function of the rate of change of state of charge of the power battery, and the calculation formula is the first formula as follows:

[0027]

[0028] TBatt represents the battery temperature, and RBatt2 is a function of the battery temperature as the independent variable. The calculation formula is the second formula below:

[0029]

[0030] TErr represents the cabin temperature error, and RCabin is a function with cabin temperature error as the independent variable. The calculation formula is the third formula below:

[0031]

[0032] TmsPwr represents the electrical power of the thermal management system, and RTmsPwr is a function of the electrical power of the thermal management system as the independent variable. The calculation formula is the fourth formula below:

[0033]

[0034] The construction of the reward function includes the following steps:

[0035] Substituting (-DiffSOCMax, R1), (0, 0) and (DiffSOCMax, R1) into the first formula, we can obtain the values of a, b and c. R1 is the reward when the state of charge change rate of the power battery reaches the maximum value DiffSOCMax.

[0036] Substituting (TAmb1, -0.1*R2) and (TBattOpt1, 0) into the second formula, we can obtain the values of k1 and d1. Substituting (TBattOpt2, 0) and (TBattMax, -R2) into the second formula, we can obtain the values of k2 and d2. R2 is the reward for the power battery temperature being in the optimal charging temperature range (TBattOpt1, TBattOpt2), TBattMax is the maximum charging temperature, and TAmb1 is the minimum ambient temperature.

[0037] Substituting (TErrMin, -0.1*R3) and (TErrOpt1, 0) into the third formula, we can obtain the values of k3 and d3; substituting (TErrOpt2, 0) and (TErrMax, -R3) into the third formula, we can obtain the values of k4 and d4. R3 is the reward for the passenger compartment temperature error being within the optimal passenger compartment temperature error range (TErrOpt1, TErrOpt2), TErrMin is the lower limit threshold of the passenger compartment temperature error, and TErrMax is the upper limit threshold of the passenger compartment temperature error.

[0038] Substituting (0, 0) and (Pmax, -R4) into the fourth formula, we can obtain the values of e and f. -R4 is the reward when the electrical power of the thermal management system reaches its maximum value Pmax.

[0039] Optionally, the Actor network includes a first state input layer, a first fully connected layer, a first ReLU activation function layer, a second fully connected layer, and a Tanh activation function layer connected in sequence. The Actor network also includes a first branch and a second branch. The first branch includes a mean fully connected layer, a second ReLU activation function layer, and a mean output fully connected layer connected in sequence. The second branch includes a standard deviation fully connected layer, a third ReLU activation function layer, and a SoftPlus activation function layer connected in sequence. The mean fully connected layer and the standard deviation fully connected layer are both connected to the output of the Tanh activation function layer.

[0040] The number of units in the first fully connected layer is NumObs*NumAct*N*2, the number of units in the second fully connected layer is NumObs*NumAct*N, the number of units in the mean fully connected layer is NumAct, the number of units in the mean output fully connected layer is NumAct, and the number of units in the standard deviation fully connected layer is NumAct, where NumObs is the number of state variables, NumAct is the number of action variables, and N is the training parameter;

[0041] The Critic network comprises a second state input layer, a third fully connected layer, a fourth ReLU activation function layer, a state output fully connected layer, an action input layer, an action output fully connected layer, an addition layer, a fifth ReLU activation function layer, and a value fully connected layer. The second state input layer, the third fully connected layer, the fourth ReLU activation function layer, and the state output fully connected layer are connected in sequence. The action input layer and the action output fully connected layer are connected in sequence. The addition layer, the fifth ReLU activation function layer, and the value fully connected layer are connected in sequence. The input of the addition layer is connected to the output of the action output fully connected layer and the output of the state output fully connected layer.

[0042] The number of units in the third fully connected layer is NumObs*NumAct*N*2, the number of units in the state output fully connected layer is NumObs*NumAct*N, the number of units in the value fully connected layer is 1, and the number of units in the action output fully connected layer is NumObs*NumAct*N.
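The Critic structure described above can be sketched in plain Python to make the two input paths and the addition layer concrete. This is only an illustrative sketch: the helper names (`dense`, `relu`, `init`, `Critic`), the uniform weight initialization, and the example sizes NumObs = 7 and NumAct = 7 are assumptions, not details given in the patent text.

```python
import random


def dense(x, w, b):
    """Fully connected layer: one output per weight row."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(v):
    return [max(0.0, a) for a in v]


def init(rng, n_in, n_out):
    """Hypothetical initialization: small uniform weights, zero biases."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


class Critic:
    def __init__(self, num_obs, num_act, n, seed=0):
        rng = random.Random(seed)
        h = num_obs * num_act * n                   # NumObs*NumAct*N
        self.fc3 = init(rng, num_obs, 2 * h)        # third fully connected layer
        self.state_out = init(rng, 2 * h, h)        # state output fully connected layer
        self.action_out = init(rng, num_act, h)     # action output fully connected layer
        self.value = init(rng, h, 1)                # value fully connected layer

    def q_value(self, state, action):
        # State path: input -> FC -> ReLU -> FC
        s = dense(relu(dense(state, *self.fc3)), *self.state_out)
        # Action path: input -> FC
        a = dense(action, *self.action_out)
        # Addition layer -> ReLU -> value FC (single unit)
        merged = relu([si + ai for si, ai in zip(s, a)])
        return dense(merged, *self.value)[0]


critic = Critic(num_obs=7, num_act=7, n=1)
q = critic.q_value([0.0] * 7, [0.0] * 7)
```

Both paths end in a layer of NumObs*NumAct*N units, which is what allows the addition layer to sum them element by element.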

[0043] Optionally, the reinforcement learning training includes the following steps:

[0044] The convergence criterion is set as the stability of the reward in the most recent 20 training rounds being less than 10%, and the training termination condition is set as the number of training rounds reaching 300. The stability of the reward is equal to the standard deviation of the reward divided by the mean of the reward.

[0045] The training parameter N starts from 1, and the value of N is gradually increased when the convergence criterion is not met, N=2, 3, 4, ..., until the training converges or the training termination condition is met.
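As a minimal sketch, the convergence check and the stepwise increase of N could look like the following. Several points are assumptions rather than statements from the patent: the population standard deviation is used, the mean is taken in absolute value so that negative returns still give a positive ratio, each value of N is assumed to get its own run of up to 300 rounds, and `train_fn` and `max_n` are hypothetical names.

```python
from statistics import mean, pstdev

WINDOW = 20        # most recent training rounds considered
THRESHOLD = 0.10   # stability must be below 10 %
MAX_ROUNDS = 300   # training termination condition (rounds per run, assumed)


def reward_stability(returns, window=WINDOW):
    """Standard deviation divided by mean over the last `window` returns."""
    recent = returns[-window:]
    if len(recent) < window:
        return None  # not enough rounds yet to judge convergence
    m = mean(recent)
    if m == 0:
        return None  # ratio undefined
    return pstdev(recent) / abs(m)


def has_converged(returns):
    s = reward_stability(returns)
    return s is not None and s < THRESHOLD


def search_network_size(train_fn, max_n=8):
    """Try N = 1, 2, 3, ... until a training run meets the criterion.

    `train_fn(n)` is a hypothetical callable that trains with parameter N
    for at most MAX_ROUNDS rounds and returns the list of episode returns.
    """
    for n in range(1, max_n + 1):
        returns = train_fn(n)
        if has_converged(returns):
            return n, returns
    return None, None
```

Because the layer widths scale with N, this loop starts from the smallest network and only grows it when the smaller one fails to converge.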

[0046] Optionally, the reinforcement learning algorithm is the SAC algorithm.

[0047] The present invention also proposes a storage medium storing one or more computer-readable programs, which, when executed by one or more controllers, can implement the steps of the thermal management system control strategy optimization method described above for low-temperature fast charging scenarios.

[0048] The present invention has the following advantages:

[0049] This invention provides a method and storage medium for optimizing the control strategy of a thermal management system in low-temperature fast charging scenarios based on reinforcement learning training. By training an agent using a reinforcement learning algorithm, it seeks a balance between charging time, passenger cabin thermal comfort, and thermal management system energy efficiency in low-temperature fast charging scenarios. This method can solve the multi-objective optimization control problem of charging time, passenger cabin thermal comfort, and thermal management system energy consumption in low-temperature fast charging scenarios. Compared with traditional rule-based control strategies, it can more accurately obtain the balance between various requirements in a quantitative manner.

[0050] This invention provides a composite reward function framework. By adjusting the weight coefficients in the reward function framework, control strategies with different priorities can be optimized, providing a basis for comparing schemes of different styles.

[0051] For practical training, a normalization method for the value range of the action space and the scope of consideration of the state space is proposed, along with methods for constructing the Actor and Critic networks. Reasonable training conditions and training methods are also set to ensure the effectiveness of agent training and improve training efficiency.

Attached Figure Description

[0052] Figure 1 is a flowchart of the thermal management system control strategy optimization method in a low-temperature fast charging scenario described in some embodiments;

[0053] Figure 2 is a schematic diagram of the working principle of the thermal management system described in some embodiments;

[0054] Figure 3 is a schematic diagram of the vehicle simulation model described in some embodiments;

[0055] Figure 4 is a table of maximum charging rates described in some embodiments;

[0056] Figure 5 is a schematic diagram of the Actor network described in some embodiments;

[0057] Figure 6 is a schematic diagram of the Critic network described in some embodiments;

[0058] Figure 7 is the return curve described in some embodiments;

[0059] Figure 8 is the return stability curve for the most recent 20 rounds described in some embodiments.

Detailed Implementation

[0060] The embodiments of the present invention will be described below with reference to the accompanying drawings and preferred embodiments. Those skilled in the art can easily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention can also be implemented or applied through other different specific embodiments, and various details in this specification can also be modified or changed based on different viewpoints and applications without departing from the spirit of the present invention. It should be understood that the preferred embodiments are only for illustrating the present invention and not for limiting the scope of protection of the present invention.

[0061] It should be noted that the illustrations provided in the following embodiments are only schematic representations of the basic concept of the present invention. The illustrations only show the components related to the present invention and are not drawn according to the actual number, shape and size of the components in the actual implementation. In the actual implementation, the form, quantity and proportion of each component can be arbitrarily changed, and the layout of the components may also be more complex.

[0062] This application proposes an optimization method for the control strategy of a thermal management system in low-temperature fast charging scenarios. First, the operating mode of the thermal management system is determined, and a vehicle simulation model including the power battery, passenger compartment, and thermal management system is built. Second, an action space and a state space are constructed, and the value range of the action space and the scope of consideration of the state space are determined. Then, training conditions are formulated, and the ambient temperature and the set temperature of the passenger compartment at the start of each training round are determined by uniformly distributed sampling. Next, a reward function is constructed, comprehensively considering the influence of charging speed, battery temperature, and other weighted terms on the total reward. Finally, an Actor network and a Critic network are built, the hyperparameters of the reinforcement learning algorithm and the training termination criteria are set, and training is initiated. This method seeks a balance between charging time, passenger compartment thermal comfort, and thermal management system energy efficiency in low-temperature fast charging scenarios, providing a solution to the comprehensive problems of low-temperature fast charging of electric vehicles with passenger compartment heating.

[0063] In some embodiments, as shown in Figure 1, the optimization method for thermal management system control strategy in low-temperature fast charging scenarios includes the following steps:

[0064] S100: Set up the training environment: Determine the working mode of the thermal management system of electric vehicles in low temperature fast charging scenarios, and build a whole vehicle simulation model, which includes the power battery, passenger compartment and thermal management system.

[0065] S200: Constructing the action space: Taking the adjustment variables of the actuators that need to be controlled in the working mode as action variables, and determining the value range of the action variables, the action variables including the charging rate;

[0066] S300: Constructing the state space: Define direct state variables as information that can be directly measured by on-board sensors, and define indirect state variables as information constructed based on direct state variables. Determine the scope of consideration for direct and indirect state variables. Direct state variables include the state of charge of the power battery, the temperature of the power battery, the temperature of the passenger compartment, and the electrical power of the thermal management system. Indirect state variables include the rate of change of the state of charge of the power battery, the temperature error of the passenger compartment, and the integral of the temperature error of the passenger compartment.

[0067] S400: Define training conditions: Determine the initial state of charge of the battery at the start of charging and the final state of charge of the battery at the end of charging, determine the ambient temperature range and the passenger compartment set temperature range, and sample uniformly within these two ranges to determine the ambient temperature and the passenger compartment set temperature at the start of each training round.

[0068] S500: Constructing a reward function: Constructing a reward function with multiple objectives for optimization, including power battery charging speed, power battery temperature, passenger cabin thermal comfort, and thermal management system energy consumption;

[0069] S600: Constructing Actor and Critic Networks: Constructing Actor and Critic networks based on the number of action variables and the number of state variables, including direct and indirect state variables;

[0070] S700: Reinforcement Learning Training: Determine the reinforcement learning algorithm, its hyperparameters, and training termination conditions. Based on the training environment, training conditions, and reward function, an agent containing an Actor network and a Critic network, with the action space as the output space and the state space as the input space, interacts with the training environment to maximize the cumulative reward during reinforcement learning training until training converges or the training termination conditions are met.
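Steps S300 and S400 can be illustrated with a short sketch that draws the per-round conditions and assembles the state vector from the direct and indirect state variables. The numeric ranges, the sign convention of the temperature error (measured minus set temperature), the finite-difference and rectangular-integration formulas, and all names below are assumptions made for illustration; the patent text does not specify them.

```python
import random
from dataclasses import dataclass

# Hypothetical ranges in degrees Celsius; the source gives no numeric values.
AMBIENT_RANGE_C = (-20.0, 0.0)
CABIN_SETPOINT_RANGE_C = (18.0, 26.0)


def sample_episode_conditions(rng):
    """Draw ambient and set temperature uniformly at the start of a round."""
    return rng.uniform(*AMBIENT_RANGE_C), rng.uniform(*CABIN_SETPOINT_RANGE_C)


@dataclass
class DirectState:
    soc: float        # state of charge of the power battery
    t_batt: float     # power battery temperature
    t_cabin: float    # passenger compartment temperature
    tms_power: float  # electrical power of the thermal management system


class StateBuilder:
    """Builds the state vector (4 direct + 3 indirect variables) each step."""

    def __init__(self, t_set, ts):
        self.t_set = t_set        # passenger compartment set temperature
        self.ts = ts              # agent decision period in seconds
        self.prev_soc = None
        self.err_integral = 0.0

    def step(self, d):
        # Rate of change of SOC by finite difference (assumed formula)
        diff_soc = 0.0 if self.prev_soc is None else (d.soc - self.prev_soc) / self.ts
        # Temperature error; the sign convention is an assumption
        t_err = d.t_cabin - self.t_set
        # Integral of the error by rectangular rule (assumed formula)
        self.err_integral += t_err * self.ts
        self.prev_soc = d.soc
        return [d.soc, d.t_batt, d.t_cabin, d.tms_power,
                diff_soc, t_err, self.err_integral]


rng = random.Random(0)
t_amb, t_set = sample_episode_conditions(rng)
builder = StateBuilder(t_set=t_set, ts=10.0)
state = builder.step(DirectState(soc=0.2, t_batt=t_amb, t_cabin=t_amb, tms_power=0.0))
```

Sampling both temperatures anew for every round exposes the agent to the whole operating range instead of a single fixed condition.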

[0071] By employing the aforementioned technical solution, reinforcement learning algorithms are used to train the agent to find a balance between charging time, passenger cabin thermal comfort, and thermal management system energy efficiency in low-temperature fast charging scenarios. This approach can solve the multi-objective optimization control problem of charging time, passenger cabin thermal comfort, and thermal management system energy consumption in low-temperature fast charging scenarios. Compared to traditional rule-based control strategies, it can more accurately obtain the balance between various needs in a quantitative manner.

[0072] The aforementioned reinforcement learning training process optimizes the control strategy through continuous interaction between the agent and the vehicle simulation model. Based on the current state of the training environment, such as the battery state of charge, battery temperature, passenger compartment temperature, and thermal management system power, the agent generates control actions, such as charging rate and water pump speed, via the Actor network. After the training environment executes the action, it returns the next state and a comprehensive reward (using a reward function to balance charging speed, battery temperature, thermal comfort, and energy consumption). The Critic network evaluates the value of the state-action pair and guides the Actor network updates. Through numerous rounds of exploration and learning, training converges toward a control strategy that autonomously balances charging speed, safety, comfort, and efficiency.

[0073] As a specific example, determining the operating mode of an electric vehicle's thermal management system in a low-temperature fast-charging scenario includes the following steps:

[0074] Determine the state of topology components related to the thermal management system of electric vehicles in low-temperature fast charging scenarios. Topology components are parts that can change the topology of the thermal management system.

[0075] Identify the actuators that need to be controlled in the thermal management system of electric vehicles under low-temperature fast charging scenarios.

[0076] As a specific example, the topology components include a battery three-way valve, an HVAC three-way valve, and an electric-drive three-way valve; the action variables also include the PTC speed, fan speed, blower speed, battery water pump speed, HVAC water pump speed, and electric-drive water pump speed. Determining appropriate topology components and action variables helps ensure the effectiveness of agent training and improves training efficiency.

[0077] In some embodiments, reinforcement learning training further includes the following steps:

[0078] Define the output range of the agent as [-1, 1], and convert the agent's output to the value range of the action variables before inputting it into the training environment;

[0079] Define the input range of the agent as [-1, 1], and convert the state variables to the input range of the agent before inputting them to the agent. Normalizing the value range of the action space and the scope of consideration of the state space in this way helps ensure the effectiveness of agent training and improves training efficiency.
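The two conversions can be sketched as simple linear maps. The function names, the clipping of the agent output, and the example ranges in the usage lines are assumptions for illustration; the patent only states that both sides of the agent work in [-1, 1].

```python
def to_action_range(u, low, high):
    """Map an agent output u in [-1, 1] to the actuator range [low, high]."""
    u = max(-1.0, min(1.0, u))  # clipping is an added safeguard (assumption)
    return low + (u + 1.0) * 0.5 * (high - low)


def to_agent_range(x, low, high):
    """Map a state variable from [low, high] to the agent input range [-1, 1]."""
    return 2.0 * (x - low) / (high - low) - 1.0


# Hypothetical ranges: a pump speed of 0 to 5000 rpm, a state of charge of 0 to 1.
pump_rpm = to_action_range(0.0, 0.0, 5000.0)
soc_in = to_agent_range(0.5, 0.0, 1.0)
```

Keeping every input and output on the same scale prevents variables with large physical magnitudes, such as electrical power in watts, from dominating variables such as state of charge.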

[0080] As a specific example, defining training conditions also includes the following step: Under the lowest ambient temperature condition, a simulation is performed based on the whole vehicle simulation model with a constant charging rate of 1 and the thermal management system turned off. The simulation duration is rounded down to a multiple of the agent's decision period Ts to determine the training duration per round. Appropriately setting training conditions helps ensure the effectiveness of agent training and improves training efficiency.
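Reading "rounded down" as rounding down to a whole number of decision periods, which is an interpretation rather than something the text states explicitly, the per-round duration could be computed as in this small sketch. The function name and the example figures are hypothetical.

```python
import math


def episode_duration(sim_duration_s, ts_s):
    """Round a baseline charging time down to a whole number of decision periods.

    Returns the training duration in seconds and the number of agent decisions.
    """
    steps = math.floor(sim_duration_s / ts_s)
    return steps * ts_s, steps


# Hypothetical numbers: a 3725 s baseline simulation and a 10 s decision period.
duration_s, num_steps = episode_duration(3725.0, 10.0)
```

Using the unheated, coldest-case charging time as the round length gives a duration that any strategy with battery heating should be able to finish within.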

[0081] In some embodiments, the reward function is:

[0082]

[0083] Where DiffSOC is the rate of change of state of charge of the power battery, and RBatt1 is a function of the rate of change of state of charge of the power battery, and the calculation formula is the first formula as follows:

[0084]

[0085] TBatt represents the battery temperature, and RBatt2 is a function of the battery temperature as the independent variable. The calculation formula is the second formula below:

[0086]

[0087] TErr represents the cabin temperature error, and RCabin is a function with cabin temperature error as the independent variable. The calculation formula is the third formula below:

[0088]

[0089] TmsPwr represents the electrical power of the thermal management system, and RTmsPwr is a function of the electrical power of the thermal management system as the independent variable. The calculation formula is the fourth formula below:

[0090]

[0091] Constructing the reward function involves the following steps:

[0092] Substituting (-DiffSOCMax, R1), (0, 0) and (DiffSOCMax, R1) into the first formula, we can obtain the values of a, b and c. R1 is the reward when the state of charge change rate of the power battery reaches the maximum value DiffSOCMax.

[0093] Substituting (TAmb1, -0.1*R2) and (TBattOpt1, 0) into the second formula, we can obtain the values of k1 and d1. Substituting (TBattOpt2, 0) and (TBattMax, -R2) into the second formula, we can obtain the values of k2 and d2. R2 is the reward for the power battery temperature being in the optimal charging temperature range (TBattOpt1, TBattOpt2), TBattMax is the maximum charging temperature, and TAmb1 is the minimum ambient temperature.

[0094] Substituting (TErrMin, -0.1*R3) and (TErrOpt1, 0) into the third formula, we can obtain the values of k3 and d3; substituting (TErrOpt2, 0) and (TErrMax, -R3) into the third formula, we can obtain the values of k4 and d4. R3 is the reward for the passenger compartment temperature error being within the optimal passenger compartment temperature error range (TErrOpt1, TErrOpt2), TErrMin is the lower limit threshold of the passenger compartment temperature error, and TErrMax is the upper limit threshold of the passenger compartment temperature error.

[0095] Substituting (0, 0) and (Pmax, -R4) into the fourth formula, we can obtain the values of e and f. -R4 is the reward when the electrical power of the thermal management system reaches its maximum value Pmax.
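The formulas themselves are not reproduced in this text, so the following sketch rests on an assumption about their form: three fitted coefficients (a, b, c) suggest a quadratic a*x^2 + b*x + c for the first formula, and each pair (k, d) or (e, f) fitted from two points suggests a straight line k*x + d. Under that assumption the coefficients follow directly from the substituted points. All numeric values below are hypothetical.

```python
def fit_line(p1, p2):
    """Return (k, d) of the line k*x + d through two points."""
    (x1, y1), (x2, y2) = p1, p2
    k = (y2 - y1) / (x2 - x1)
    return k, y1 - k * x1


def fit_quadratic(p1, p2, p3):
    """Return (a, b, c) of the parabola a*x**2 + b*x + c through three points."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    denom = (x1 - x2) * (x1 - x3) * (x2 - x3)
    a = (x3 * (y2 - y1) + x2 * (y1 - y3) + x1 * (y3 - y2)) / denom
    b = (x3**2 * (y1 - y2) + x2**2 * (y3 - y1) + x1**2 * (y2 - y3)) / denom
    c = (x2 * x3 * (x2 - x3) * y1 + x3 * x1 * (x3 - x1) * y2
         + x1 * x2 * (x1 - x2) * y3) / denom
    return a, b, c


# Hypothetical weights and limits, chosen only to exercise the fits.
R1, R2, R4 = 1.0, 1.0, 1.0
DIFF_SOC_MAX = 0.002                              # maximum SOC change rate
T_AMB1, T_OPT1, T_OPT2, T_MAX = -20.0, 25.0, 35.0, 50.0
P_MAX = 8000.0                                    # maximum electrical power

a, b, c = fit_quadratic((-DIFF_SOC_MAX, R1), (0.0, 0.0), (DIFF_SOC_MAX, R1))
k1, d1 = fit_line((T_AMB1, -0.1 * R2), (T_OPT1, 0.0))   # below the optimal range
k2, d2 = fit_line((T_OPT2, 0.0), (T_MAX, -R2))          # above the optimal range
e, f = fit_line((0.0, 0.0), (P_MAX, -R4))
```

With the symmetric points given for the first formula, the fit reduces to a = R1/DiffSOCMax^2 and b = c = 0. The asymmetric penalties of -0.1*R2 at the cold end and -R2 at the hot end make exceeding the maximum charging temperature ten times more costly than charging at the minimum ambient temperature.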

[0096] The above technical solution provides a composite reward function framework. By adjusting the weight coefficients in the reward function framework, control strategies with different priorities can be optimized, providing a basis for comparing schemes of different styles. The weight coefficients include R1, R2, R3, and R4.

[0097] In some embodiments, the Actor network includes a first state input layer, a first fully connected layer, a first ReLU activation function layer, a second fully connected layer, and a Tanh activation function layer connected in sequence. The Actor network also includes a first branch and a second branch. The first branch includes a mean fully connected layer, a second ReLU activation function layer, and a mean output fully connected layer connected in sequence. The second branch includes a standard deviation fully connected layer, a third ReLU activation function layer, and a SoftPlus activation function layer connected in sequence. The mean fully connected layer and the standard deviation fully connected layer are both connected to the output of the Tanh activation function layer.

[0098] The number of units in the first fully connected layer is NumObs*NumAct*N*2, the number of units in the second fully connected layer is NumObs*NumAct*N, the number of units in the mean fully connected layer is NumAct, the number of units in the mean output fully connected layer is NumAct, and the number of units in the standard deviation fully connected layer is NumAct, where NumObs is the number of state variables, NumAct is the number of action variables, and N is a training parameter.

[0099] The Critic network consists of a second state input layer, a third fully connected layer, a fourth ReLU activation function layer, a state output fully connected layer, an action input layer, an action output fully connected layer, an addition layer, a fifth ReLU activation function layer, and a value fully connected layer. The second state input layer, the third fully connected layer, the fourth ReLU activation function layer, and the state output fully connected layer are connected in sequence. The action input layer and the action output fully connected layer are connected in sequence. The addition layer, the fifth ReLU activation function layer, and the value fully connected layer are connected in sequence. The input of the addition layer is connected to the output of the action output fully connected layer and the output of the state output fully connected layer.

[0100] The third fully connected layer has NumObs*NumAct*N*2 units, the state output fully connected layer has NumObs*NumAct*N units, the value fully connected layer has 1 unit, and the action output fully connected layer has NumObs*NumAct*N units.
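The layer wiring described above can be illustrated with a small forward pass. This is a minimal pure-Python sketch under stated assumptions: the weights are random placeholders, no training is performed, and the helper names (`init`, `dense`, `relu`, `build_networks`, `actor_forward`, `critic_forward`) are hypothetical. The dictionary keys follow the layer labels used in the detailed embodiment (FC1, FC2, MeanFC, and so on).

```python
import math
import random


def init(n_in, n_out, rng):
    """Placeholder initialisation: small random weights, zero biases."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


def dense(x, layer):
    """Fully connected layer; `layer` is a (weights, biases) pair."""
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(x):
    return [max(0.0, v) for v in x]


def build_networks(num_obs, num_act, n, seed=0):
    """Create Actor and Critic parameters with the unit counts from the text."""
    rng = random.Random(seed)
    wide, narrow = num_obs * num_act * n * 2, num_obs * num_act * n
    actor = {
        "FC1": init(num_obs, wide, rng),
        "FC2": init(wide, narrow, rng),
        "MeanFC": init(narrow, num_act, rng),
        "MeanOutFC": init(num_act, num_act, rng),
        "StdFC": init(narrow, num_act, rng),
    }
    critic = {
        "FC3": init(num_obs, wide, rng),
        "ObsOutFC": init(wide, narrow, rng),
        "ActOutFC": init(num_act, narrow, rng),
        "QValueFC": init(narrow, 1, rng),
    }
    return actor, critic


def actor_forward(p, obs):
    """Return (mean, std) of the action distribution for one state vector."""
    h = relu(dense(obs, p["FC1"]))
    h = [math.tanh(v) for v in dense(h, p["FC2"])]  # shared trunk output
    mean = dense(relu(dense(h, p["MeanFC"])), p["MeanOutFC"])
    # SoftPlus keeps the standard deviation strictly positive.
    std = [math.log1p(math.exp(v)) for v in relu(dense(h, p["StdFC"]))]
    return mean, std


def critic_forward(p, obs, act):
    """Return the scalar value estimate for one state-action pair."""
    s = dense(relu(dense(obs, p["FC3"])), p["ObsOutFC"])  # state path
    a = dense(act, p["ActOutFC"])                          # action path
    merged = relu([si + ai for si, ai in zip(s, a)])       # addition layer, then ReLU
    return dense(merged, p["QValueFC"])[0]
```

The state path and the action path of the Critic must end with the same width (NumObs*NumAct*N) so that the addition layer can sum them element by element.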

[0101] By adopting the above technical solution and constructing reasonable Actor and Critic networks, the effectiveness of agent training can be ensured and the training efficiency can be improved.

[0102] In some embodiments, reinforcement learning training includes the following steps:

[0103] The convergence criterion is set to the stability of the reward in the last 20 training rounds being less than 10%, and the training termination condition is set to the number of training rounds reaching 300. The stability of the reward is equal to the standard deviation of the reward divided by the mean of the reward.

[0104] The training parameter N starts from 1. When the convergence criterion is not met, the value of N is gradually increased, N=2, 3, 4, ..., until the training converges or the training termination condition is met.
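The convergence check and the gradual increase of N can be sketched as follows. This is one possible reading of the procedure, not a definitive implementation. It assumes that each value of N gets a fresh run of up to 300 rounds, that the population standard deviation is used, that the absolute value of the mean is taken so that negative rewards do not trivially pass the test, and that N is capped at a hypothetical `max_n` (the text gives no upper bound). `run_round` is a hypothetical callback that performs one training round and returns its reward.

```python
from statistics import mean, pstdev


def reward_stability(rewards, window=20):
    """Standard deviation divided by mean over the most recent `window` rounds."""
    recent = rewards[-window:]
    m = mean(recent)
    if m == 0:
        return float("inf")  # ratio undefined; treat as not stable
    return pstdev(recent) / abs(m)


def has_converged(rewards, window=20, threshold=0.10):
    """Convergence criterion: stability of the last `window` rounds below 10%."""
    if len(rewards) < window:
        return False
    return reward_stability(rewards, window) < threshold


def train_with_growing_n(run_round, max_rounds=300, max_n=8):
    """Try N = 1, 2, 3, ... and return (N, rounds used) for the first run
    that converges, or None if no run converges within the limits."""
    for n in range(1, max_n + 1):
        rewards = []
        for _ in range(max_rounds):
            rewards.append(run_round(n))
            if has_converged(rewards):
                return n, len(rewards)
    return None
```

Starting from the smallest network and growing it only on failure keeps the Actor and Critic as small as the task allows.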

[0105] By adopting the above technical solution and setting up the training procedure appropriately, the effectiveness of agent training can be ensured and the training efficiency can be improved.

[0106] As a preferred example, the reinforcement learning algorithm is the SAC algorithm.

[0107] Taking the thermal management system of the pure electric sedan shown in Figure 2 as an example, the method for optimizing the thermal management system control strategy in a low-temperature fast charging scenario is explained in more detail below.

[0108] Setting up the training environment:

[0109] As shown in Figure 2, the battery three-way valve 101, HVAC three-way valve 102, and electric three-way valve 103 are defined as topological elements, and the opening of each is set to 50%. The actuators to be controlled are defined as a charge rate actuator, PTC 104, fan 105, blower 106, battery water pump 107, HVAC water pump 108, and electric water pump 109. As shown in Figure 3, a vehicle simulation model 204 containing a power battery 201, a passenger compartment 202, and a thermal management system 203 is built in the system simulation software Amesim.

[0110] Constructing the action space:

[0111] The charging rate, PTC speed, fan speed, blower speed, battery water pump speed, HVAC water pump speed, and electric drive water pump speed are used as action variables, denoted as u1 to u7 respectively. The value range of each action variable is determined as shown in Table 1.

[0112] Table 1 Action variables and their value ranges

[0113]

[0114] The actual charging current I is calculated using the following formula.

[0115]

[0116] Here, CBatt is the 1C rated capacity of the power battery, with a value of 150; ChrgRate is the charging rate; and K is the dynamic gain coefficient, which is determined based on the real-time state of charge (SOC) and temperature (TBatt) of the power battery. The maximum charging rate is obtained from the maximum charging rate table shown in Figure 4.

[0117] The output range of the agent is defined as [-1, 1]. Taking u1 with the value range [0, 1] as an example, the action variable uNorm1 output by the agent is converted into u1 by the following formula and then input to the training environment. The other action variables are converted in the same way.

[0118]
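The conversion from the agent's output range to an actuator range can be illustrated with a linear map. This is a minimal sketch assuming a linear scaling, since the conversion formula itself is not reproduced in this text; the function name `denormalize_action` is hypothetical.

```python
def denormalize_action(u_norm, u_min, u_max):
    """Map an agent output in [-1, 1] linearly onto [u_min, u_max]."""
    return u_min + (u_norm + 1.0) * (u_max - u_min) / 2.0


# For u1 with the value range [0, 1]:
u1_low = denormalize_action(-1.0, 0.0, 1.0)   # lower end of the range
u1_mid = denormalize_action(0.0, 0.0, 1.0)    # midpoint of the range
u1_high = denormalize_action(1.0, 0.0, 1.0)   # upper end of the range
```

Keeping every agent output in the same [-1, 1] interval means that actuators with very different physical ranges contribute on a comparable scale during training.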

[0119] Constructing the state space:

[0120] The state of charge (SOC) of the power battery, the power battery temperature (TBatt), the passenger compartment temperature (TCabin), and the thermal management system power (TmsPwr) are selected as direct state variables; the rate of change of the power battery's state of charge (DiffSOC), the passenger compartment temperature error (TErr), and the integral of the passenger compartment temperature error (ITErr) are constructed as indirect state variables, where TSet is the set temperature of the passenger compartment.

[0121] The range for the state of charge (SOC) of the power battery is determined to be [30, 80], the range for the passenger compartment temperature (TCabin) is [-10, 20], and the range for the passenger compartment set temperature is [0, 10]. Correspondingly, the range for the passenger compartment temperature error (TErr) is [-20, 20], and the range for the passenger compartment temperature error integral (ITErr) is [-20*Tf, 20*Tf], where Tf is the simulation duration.

[0122] The minimum value of the rate of change of the power battery's state of charge (DiffSOC) is 0, and the maximum value (DiffSOCMax) is calculated from the maximum value in the maximum charging rate table (ChrgRateMax=1.6) using the following formula.

[0123]
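The value DiffSOCMax=0.0444 used later in the text is consistent with a simple unit conversion. The following is a minimal sketch assuming that SOC is expressed in percent and time in seconds, so that a charging rate of 1 corresponds to 100 percentage points per 3600 seconds. The formula itself is not reproduced in this text, and the function name `max_soc_rate` is hypothetical.

```python
def max_soc_rate(chrg_rate_max, soc_full=100.0, seconds_per_hour=3600.0):
    """SOC change per second (in percentage points) at a given charging rate."""
    return chrg_rate_max * soc_full / seconds_per_hour


# ChrgRateMax = 1.6 from the maximum charging rate table.
diff_soc_max = max_soc_rate(1.6)
```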

[0124] The minimum value of the power battery temperature TBatt is the lowest ambient temperature TAmb1=-10; the maximum value is the maximum charging temperature TBattMax=50 in the maximum charging rate table.

[0125] The minimum value of the thermal management system power TmsPwr is 0; the maximum value is the maximum operating power Pmax=7600, calculated by the whole vehicle simulation model with all thermal management action variables (u2 to u7) at their maximum values under the lowest ambient temperature TAmb1=-10.

[0126] The input range of the agent is defined as [-1, 1]. Taking the state variable X1, the state of charge (SOC) of the power battery, within the range of [30, 80], as an example, the state variable X1 output from the training environment is converted into XNorm1 using the following formula and then input to the agent. The conversion of other state variables follows the same principle.

[0127]
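The state normalisation can be illustrated in the same spirit. This is a minimal sketch assuming a linear map onto [-1, 1], since the conversion formula is not reproduced in this text. The ranges are those stated in the description; the names `STATE_RANGES`, `normalize_state`, and `normalize_observation` are hypothetical.

```python
TF = 3480.0  # single-round training duration stated in the text

# Value ranges of the seven state variables as given in the description.
STATE_RANGES = {
    "SOC": (30.0, 80.0),
    "TBatt": (-10.0, 50.0),
    "TCabin": (-10.0, 20.0),
    "TmsPwr": (0.0, 7600.0),
    "DiffSOC": (0.0, 0.0444),
    "TErr": (-20.0, 20.0),
    "ITErr": (-20.0 * TF, 20.0 * TF),
}


def normalize_state(x, x_min, x_max):
    """Map a state value in [x_min, x_max] linearly onto [-1, 1]."""
    return 2.0 * (x - x_min) / (x_max - x_min) - 1.0


def normalize_observation(raw):
    """Normalise a dict of raw state values, in the order of STATE_RANGES."""
    return [normalize_state(raw[name], lo, hi)
            for name, (lo, hi) in STATE_RANGES.items()]
```

For example, `normalize_state(55.0, 30.0, 80.0)` maps the midpoint of the SOC range to the midpoint of the agent's input range.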

[0128] Establish training conditions:

[0129] The initial battery state of charge (SOC1) at the start of charging is set to 30, and the final battery state of charge (SOC2) at the end of charging is set to 80. The ambient temperature range is set to [TAmb1, TAmb2] = [-10, 0]. Uniformly distributed sampling is performed within the ambient temperature range [-10, 0] and the passenger compartment set temperature range [0, 10] to determine the ambient temperature and the passenger compartment set temperature at the start of each training round.
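Drawing the conditions for one training round can be sketched as below. This is a minimal illustration using Python's standard random number generator; the function name `sample_round_conditions` and the dictionary layout are hypothetical, while the ranges and SOC limits are those stated above.

```python
import random


def sample_round_conditions(rng, t_amb_range=(-10.0, 0.0), t_set_range=(0.0, 10.0)):
    """Draw the ambient temperature and cabin set temperature for one round."""
    return {
        "TAmb": rng.uniform(*t_amb_range),  # uniform over [TAmb1, TAmb2]
        "TSet": rng.uniform(*t_set_range),  # uniform over the set temperature range
        "SOC1": 30.0,                       # fixed initial state of charge
        "SOC2": 80.0,                       # fixed final state of charge
    }


rng = random.Random(0)  # seeded only to make the example repeatable
conditions = sample_round_conditions(rng)
```

Sampling both temperatures afresh each round exposes the agent to the whole operating envelope rather than a single design point.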

[0130] The training duration of a single round is determined as follows: at the lowest ambient temperature TAmb1=-10, a simulation is run on the whole vehicle simulation model with the charging rate held constant at 1 and the thermal management system turned off. The simulation time at which the state of charge of the power battery reaches 80% is 3474. The decision cycle Ts of the agent is set to 20, and the training duration of a single round, Tf, is therefore determined to be 3480, the smallest multiple of Ts that is not less than 3474.
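The step from 3474 to 3480 can be reproduced with a ceiling operation. This is a minimal sketch assuming that the simulation time is rounded up to a whole number of decision cycles, which is inferred from the two numbers in the text; the function name `round_to_decision_cycle` is hypothetical.

```python
import math


def round_to_decision_cycle(t_sim, ts):
    """Round a duration up to the next whole multiple of the decision cycle."""
    return math.ceil(t_sim / ts) * ts


tf = round_to_decision_cycle(3474, 20)  # single-round training duration Tf
decisions_per_round = tf // 20          # number of agent decisions in one round
```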

[0131] Construct the reward function:

[0132] Construct the reward function as shown in the following formula:

[0133]

[0134] Where DiffSOC is the rate of change of state of charge of the power battery, and RBatt1 is a function of the rate of change of state of charge of the power battery, and the calculation formula is the first formula as follows:

[0135]

[0136] R1=20 is the reward when the rate of change of the power battery's state of charge reaches its maximum value DiffSOCMax=0.0444. Substituting (-0.0444, 20), (0, 0), and (0.0444, 20) into the first formula gives a, b, and c as 10145, 0, and 0, respectively.

[0137] Where TBatt is the power battery temperature, and RBatt2 is a function of the power battery temperature as the independent variable, calculated using the second formula as follows:

[0138]

[0139] R2=2 is the reward when the power battery temperature is within the optimal charging temperature range (25, 40). Substituting (-10, -0.2) and (25, 0) into the second formula gives k1 and d1 as 0.0057143 and -0.14286, respectively; substituting (40, 0) and (50, -2) into the second formula gives k2 and d2 as -0.2 and 8, respectively.

[0140] Where TErr is the cabin temperature error, and RCabin is a function with cabin temperature error as the independent variable, calculated using the third formula below:

[0141]

[0142] R3=10 is the reward when the passenger compartment temperature error is within the optimal passenger compartment temperature error range (-1, 1). Substituting (-20, -1) and (-1, 0) into the third formula gives k3 and d3 as 0.052632 and 0.052632, respectively; substituting (1, 0) and (20, -10) into the third formula gives k4 and d4 as -0.52632 and 0.52632, respectively.

[0143] Where TmsPwr is the electrical power of the thermal management system, and RTmsPwr is a function of the electrical power of the thermal management system as the independent variable, and the calculation formula is the fourth formula as follows:

[0144]

[0145] -R4=-2 is the reward when the electrical power of the thermal management system reaches Pmax. Substituting (0, 0) and (7600, -2) into the fourth formula gives e and f as -0.00026316 and 0, respectively.
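With the coefficients of this embodiment, the four reward terms can be evaluated as sketched below. This is an illustration under several assumptions, because the formulas are not reproduced in this text: the total reward is taken as the plain sum of the four terms; RBatt2 and RCabin are taken to equal R2 and R3 inside their optimal ranges; the linear segments are extended unchanged beyond the stated limits; and d4 is chosen so that its segment passes through (1, 0) and (20, -10). All function names are hypothetical.

```python
R2, R3 = 2.0, 10.0  # rewards inside the optimal ranges (assumed plateau values)


def r_batt1(diff_soc, a=10145.0):
    """Charging-speed term: quadratic in the SOC rate of change (b = c = 0)."""
    return a * diff_soc ** 2


def r_batt2(t_batt):
    """Battery-temperature term: linear penalties outside (25, 40)."""
    if t_batt < 25.0:
        return 0.0057143 * t_batt - 0.14286
    if t_batt > 40.0:
        return -0.2 * t_batt + 8.0
    return R2


def r_cabin(t_err):
    """Cabin-comfort term: linear penalties outside (-1, 1)."""
    if t_err < -1.0:
        return 0.052632 * t_err + 0.052632
    if t_err > 1.0:
        return -0.52632 * t_err + 0.52632
    return R3


def r_tms_pwr(tms_pwr):
    """Energy term: linear penalty on thermal management power."""
    return -0.00026316 * tms_pwr


def reward(diff_soc, t_batt, t_err, tms_pwr):
    """Composite reward, assumed here to be the sum of the four terms."""
    return r_batt1(diff_soc) + r_batt2(t_batt) + r_cabin(t_err) + r_tms_pwr(tms_pwr)
```

Under these assumptions, charging at the maximum rate is worth about 20, which dominates the comfort plateau of 10 and the battery-temperature plateau of 2, so this weighting favors charging speed.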

[0146] Building the Actor network and Critic network:

[0147] The Actor network and Critic network are constructed based on the number of action variables NumAct and the number of state variables NumObs.

[0148] As shown in Figure 5, the Actor network includes a first state input layer ObsIn1, a first fully connected layer FC1, a first ReLU activation function layer ReLU1, a second fully connected layer FC2, and a Tanh activation function layer Tanh, which are connected in sequence. The Actor network also includes a first branch and a second branch. The first branch includes a mean fully connected layer MeanFC, a second ReLU activation function layer ReLU2, and a mean output fully connected layer MeanOutFC, which are connected in sequence. The second branch includes a standard deviation fully connected layer StdFC, a third ReLU activation function layer ReLU3, and a SoftPlus activation function layer StdOut, which are connected in sequence. The mean fully connected layer MeanFC and the standard deviation fully connected layer StdFC are both connected to the output of the Tanh activation function layer Tanh.

[0149] The number of units in the first fully connected layer FC1 is NumObs*NumAct*N*2, the number of units in the second fully connected layer FC2 is NumObs*NumAct*N, the number of units in the mean fully connected layer MeanFC is NumAct, the number of units in the mean output fully connected layer MeanOutFC is NumAct, and the number of units in the standard deviation fully connected layer StdFC is NumAct, where N is the training parameter.

[0150] As shown in Figure 6, the Critic network includes a second state input layer ObsIn2, a third fully connected layer FC3, a fourth ReLU activation function layer ReLU4, a state output fully connected layer ObsOutFC, an action input layer ActIn, an action output fully connected layer ActOutFC, an addition layer Add, a fifth ReLU activation function layer ReLU5, and a value fully connected layer QValueFC. The second state input layer ObsIn2, the third fully connected layer FC3, the fourth ReLU activation function layer ReLU4, and the state output fully connected layer ObsOutFC are connected sequentially. The action input layer ActIn and the action output fully connected layer ActOutFC are connected sequentially. The addition layer Add, the fifth ReLU activation function layer ReLU5, and the value fully connected layer QValueFC are connected sequentially. The input of the addition layer Add is connected to the output of the action output fully connected layer ActOutFC and the output of the state output fully connected layer ObsOutFC. The number of units in the third fully connected layer FC3 is NumObs*NumAct*N*2, the number of units in the state output fully connected layer ObsOutFC is NumObs*NumAct*N, the number of units in the value fully connected layer QValueFC is 1, and the number of units in the action output fully connected layer ActOutFC is NumObs*NumAct*N, where N is the training parameter.

[0151] Reinforcement learning training:

[0152] The SAC reinforcement learning algorithm was used for training, and the hyperparameters were set as shown in Table 2.

[0153] Table 2 SAC Algorithm Hyperparameter Settings

[0154]

[0155] The stability of the reward is defined as the standard deviation of the reward divided by the mean of the reward. The convergence criterion is that the stability of the reward over the most recent 20 training rounds is less than 10%. The training termination condition is training convergence or the number of training rounds reaching 300. With N=1, the numbers of units in layers FC1, FC2, MeanFC, MeanOutFC, and StdFC of the Actor network are 98, 49, 7, 7, and 7, respectively; the numbers of units in layers FC3, ObsOutFC, QValueFC, and ActOutFC of the Critic network are 98, 49, 1, and 49, respectively.
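The unit counts quoted above follow directly from the sizing rules with NumObs=7 (four direct and three indirect state variables) and NumAct=7 (u1 to u7). A minimal sketch, with the hypothetical helper name `layer_units`:

```python
def layer_units(num_obs, num_act, n):
    """Unit counts of the Actor and Critic layers for training parameter n."""
    wide = num_obs * num_act * n * 2
    narrow = num_obs * num_act * n
    return {
        "actor": {"FC1": wide, "FC2": narrow, "MeanFC": num_act,
                  "MeanOutFC": num_act, "StdFC": num_act},
        "critic": {"FC3": wide, "ObsOutFC": narrow, "QValueFC": 1,
                   "ActOutFC": narrow},
    }


units = layer_units(num_obs=7, num_act=7, n=1)
```

Increasing N to 2 would double the two hidden widths to 196 and 98 while leaving the output layers unchanged.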

[0156] Training begins. When the number of training rounds reaches 162, the convergence criterion is met and training ends. The reward curve and the average reward curve over the most recent 20 rounds are shown in Figure 7; the reward stability curve over the most recent 20 rounds is shown in Figure 8.

[0157] The present invention also proposes a storage medium storing one or more computer-readable programs which, when executed by one or more controllers, implement the steps of any of the above methods for optimizing the control strategy of a thermal management system in a low-temperature fast charging scenario.

[0158] The above embodiments are merely preferred embodiments provided to fully illustrate the present invention, and the scope of protection of the present invention is not limited thereto. Equivalent substitutions or modifications made by those skilled in the art based on the present invention are all within the scope of protection of the present invention. In the description of this specification, the reference to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., means that a specific feature, structure, material, or characteristic associated with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described can be combined in any suitable manner in one or more embodiments or examples. Furthermore, those skilled in the art can combine and integrate the different embodiments or examples described in this specification.

Claims

1. A method for optimizing a control strategy of a thermal management system in a low-temperature fast-charging scenario, characterized in that it includes the following steps:
S100: Setting up the training environment: determining the operating mode of the thermal management system of the electric vehicle in the low-temperature fast charging scenario, and building a whole vehicle simulation model that includes the power battery, the passenger compartment, and the thermal management system;
S200: Constructing the action space: taking the adjustment variables of the actuators that need to be controlled in the operating mode as action variables, and determining the value range of the action variables, the action variables including the charging rate;
S300: Constructing the state space: defining direct state variables as information that can be directly measured by on-board sensors, defining indirect state variables as information constructed based on the direct state variables, and determining the scope of consideration for the direct and indirect state variables, wherein the direct state variables include the state of charge of the power battery, the temperature of the power battery, the temperature of the passenger compartment, and the electrical power of the thermal management system, and the indirect state variables include the rate of change of the state of charge of the power battery, the passenger compartment temperature error, and the integral of the passenger compartment temperature error;
S400: Setting the training conditions: determining the initial state of charge of the battery at the start of charging, determining the final state of charge of the battery at the end of charging, determining the ambient temperature range and the passenger compartment set temperature range, and performing uniformly distributed sampling within the ambient temperature range and the passenger compartment set temperature range to determine the ambient temperature and the passenger compartment set temperature at the start of each training round;
S500: Constructing a reward function: constructing a reward function with multiple optimization objectives, including power battery charging speed, power battery temperature, passenger compartment thermal comfort, and thermal management system energy consumption;
S600: Constructing the Actor network and Critic network: constructing the Actor network and Critic network based on the number of action variables and the number of state variables, wherein the state variables include the direct state variables and the indirect state variables;
S700: Reinforcement learning training: determining the reinforcement learning algorithm, determining the hyperparameters and training termination conditions of the reinforcement learning algorithm, and, based on the training environment, the training conditions, and the reward function, using an agent that includes the Actor network and the Critic network and that takes the action space as its output space and the state space as its input space to interact with the training environment and perform reinforcement learning training with the goal of maximizing the cumulative reward, until the training converges or the training termination condition is reached;
The reward function is:
where DiffSOC is the rate of change of the state of charge of the power battery, and RBatt1 is a function with DiffSOC as the independent variable, calculated using the first formula as follows:
TBatt is the power battery temperature, and RBatt2 is a function with the power battery temperature as the independent variable, calculated using the second formula as follows:
TErr is the passenger compartment temperature error, and RCabin is a function with the passenger compartment temperature error as the independent variable, calculated using the third formula as follows:
TmsPwr is the electrical power of the thermal management system, and RTmsPwr is a function with the electrical power of the thermal management system as the independent variable, calculated using the fourth formula as follows:
The construction of the reward function includes the following steps:
Substituting (-DiffSOCMax, R1), (0, 0), and (DiffSOCMax, R1) into the first formula yields the values of a, b, and c, where R1 is the reward when the rate of change of the power battery's state of charge reaches its maximum value DiffSOCMax.
Substituting (TAmb1, -0.1*R2) and (TBattOpt1, 0) into the second formula yields the values of k1 and d1; substituting (TBattOpt2, 0) and (TBattMax, -R2) into the second formula yields the values of k2 and d2. R2 is the reward when the power battery temperature is within the optimal charging temperature range (TBattOpt1, TBattOpt2), TBattMax is the maximum charging temperature, and TAmb1 is the minimum ambient temperature.
Substituting (TErrMin, -0.1*R3) and (TErrOpt1, 0) into the third formula yields the values of k3 and d3; substituting (TErrOpt2, 0) and (TErrMax, -R3) into the third formula yields the values of k4 and d4. R3 is the reward when the passenger compartment temperature error is within the optimal passenger compartment temperature error range (TErrOpt1, TErrOpt2), TErrMin is the lower limit threshold of the passenger compartment temperature error, and TErrMax is the upper limit threshold of the passenger compartment temperature error.
Substituting (0, 0) and (Pmax, -R4) into the fourth formula yields the values of e and f, where -R4 is the reward when the electrical power of the thermal management system reaches its maximum value Pmax.

2. The method of claim 1, wherein determining the operating mode of the electric vehicle's thermal management system in a low-temperature fast charging scenario includes the following steps: determining the state of the topology components related to the thermal management system of the electric vehicle in the low-temperature fast charging scenario, wherein the topology components are parts that can change the topology of the thermal management system; and identifying the actuators that need to be controlled in the thermal management system of the electric vehicle in the low-temperature fast charging scenario.

3. The method of claim 2, wherein the topological components include a battery three-way valve, an HVAC three-way valve, and an electric three-way valve; the action variables also include the PTC speed, the fan speed, the blower speed, the speed of the battery water pump, the speed of the HVAC water pump, and the speed of the electric water pump.

4. The method of claim 1, wherein the reinforcement learning training also includes the following steps: the output range of the agent is defined as [-1, 1], and the output of the agent is converted to the value range of the action variable and then input into the training environment; the input range of the agent is defined as [-1, 1], and the state variable is converted to the input range of the agent and then input to the agent.

5. The method of claim 1, wherein setting the training conditions also includes the following steps: under the lowest ambient temperature condition, a simulation is performed based on the whole vehicle simulation model with a constant charging rate of 1 and the thermal management system turned off, and the simulation duration is rounded up to a multiple of the decision period Ts of the agent to determine the training duration of a single round.

6. The method of claim 1, wherein the Actor network includes a first state input layer, a first fully connected layer, a first ReLU activation function layer, a second fully connected layer, and a Tanh activation function layer connected in sequence. The Actor network also includes a first branch and a second branch. The first branch includes a mean fully connected layer, a second ReLU activation function layer, and a mean output fully connected layer connected in sequence. The second branch includes a standard deviation fully connected layer, a third ReLU activation function layer, and a SoftPlus activation function layer connected in sequence. The mean fully connected layer and the standard deviation fully connected layer are both connected to the output of the Tanh activation function layer. The number of units in the first fully connected layer is NumObs*NumAct*N*2, the number of units in the second fully connected layer is NumObs*NumAct*N, the number of units in the mean fully connected layer is NumAct, the number of units in the mean output fully connected layer is NumAct, and the number of units in the standard deviation fully connected layer is NumAct, where NumObs is the number of state variables, NumAct is the number of action variables, and N is the training parameter. The Critic network comprises a second state input layer, a third fully connected layer, a fourth ReLU activation function layer, a state output fully connected layer, an action input layer, an action output fully connected layer, an addition layer, a fifth ReLU activation function layer, and a value fully connected layer. The second state input layer, the third fully connected layer, the fourth ReLU activation function layer, and the state output fully connected layer are connected in sequence. The action input layer and the action output fully connected layer are connected in sequence. 
The addition layer, the fifth ReLU activation function layer, and the value fully connected layer are connected in sequence. The input of the addition layer is connected to the output of the action output fully connected layer and the output of the state output fully connected layer. The number of units in the third fully connected layer is NumObs*NumAct*N*2, the number of units in the state output fully connected layer is NumObs*NumAct*N, the number of units in the value fully connected layer is 1, and the number of units in the action output fully connected layer is NumObs*NumAct*N.

7. The method of claim 6, wherein the reinforcement learning training includes the following steps: the convergence criterion is set as the stability of the reward over the most recent 20 training rounds being less than 10%, and the training termination condition is set as the number of training rounds reaching 300, where the stability of the reward is equal to the standard deviation of the reward divided by the mean of the reward; the training parameter N starts from 1, and the value of N is gradually increased when the convergence criterion is not met, until the training converges or the training termination condition is met.

8. The method of claim 1, wherein the reinforcement learning algorithm is the SAC algorithm.

9. A storage medium, characterized in that it stores one or more computer-readable programs which, when executed by one or more controllers, implement the steps of the method for optimizing a control strategy of a thermal management system in a low-temperature fast-charging scenario according to any one of claims 1 to 8.