Training method, control method and system of induction motor control model

By training the induction motor control model with the TD3 algorithm and optimizing the master policy learning network, the invention overcomes the limitations of PI regulator parameter design, achieving high-precision control and improved stability of the induction motor. Coordinated control of the outer (speed) and inner (current) loops further improves system performance.

CN122247262A · Pending · Publication Date: 2026-06-19 · HITACHI BUILDING TECH GUANGZHOU CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: HITACHI BUILDING TECH GUANGZHOU CO LTD
Filing Date: 2026-03-17
Publication Date: 2026-06-19

Figure CN122247262A_ABST (abstract drawing)

Abstract

This invention discloses a training method, control method, and system for an induction motor control model. By continuously learning and updating the parameters of the master policy learning network, an optimized nonlinear control strategy is obtained. After training, the master policy learning network replaces the conventional current-loop PI regulator, avoiding the low control accuracy caused by the design limitations of PI regulator parameters and thus improving the control accuracy, torque response, and operational stability of the induction motor. Furthermore, this invention also uses the outer-loop speed variable (i.e., the measured speed value) as an observation for reinforcement learning, achieving deep coordination between the outer loop (speed control loop) and the inner loop (current loop). This integrates two otherwise independent control levels into a single, holistically optimized intelligent controller, yielding superior system-level dynamic performance and further improving the control accuracy of the induction motor.

Description

Technical Field

[0001] This invention relates to motor control technology, and more particularly to a training method, control method, and system for an induction motor control model.

Background Technology

[0002] An induction motor, also known as an asynchronous motor, is a motor whose rotor sits in a rotating magnetic field; under the influence of this field, the rotor experiences a torque and rotates. Owing to their high durability, low cost, and ease of maintenance, induction motors have become the "mainstay motor" of industrial production and daily life.

[0003] In vector control of induction motors, the current loop and the speed loop form a typical dual-closed-loop control structure, and the current loop is typically regulated by a PI controller. Although PI controllers are simple in structure and effective under most operating conditions, the performance of a PI-based current loop is usually degraded by parameter variation: the stator and rotor resistances change with temperature, and the stator and rotor inductances change with magnetic saturation. This makes accurate decoupling and feedforward compensation of the current loop difficult. Furthermore, the current PI controller parameters are usually tuned based on the motor's mathematical model. If the model parameters are inaccurate, the carefully tuned current PI parameters are no longer optimal, which degrades the motor's torque response and stability.

Summary of the Invention

[0004] This invention provides a training method, control method, and system for an induction motor control model, which avoids the problem of low control accuracy caused by the design limitations of PI regulator parameters, improves the control accuracy of the induction motor, and enhances the torque response and operational stability of the induction motor.

[0005] In a first aspect, the present invention provides a training method for an induction motor control model, comprising: obtaining the experience samples required for training from an experience pool, where each experience sample includes the state vector at the current time step, the action vector inferred from that state vector by the main policy learning network of the TD3 algorithm model, the state vector at the next time step, and the reward for taking the action vector under the state vector at the current time step. The state vector includes the measured value of the stator current d-axis component, the measured value of the stator current q-axis component, the stator current d-axis component deviation, the stator current q-axis component deviation, the integral of the stator current d-axis component deviation, the integral of the stator current q-axis component deviation, the measured rotational speed, and the rotational speed reference value. The action vector includes the reference values of the stator voltage d-axis component and the stator voltage q-axis component. The TD3 algorithm model is trained using the experience samples, and its parameters are updated until the reward of the TD3 algorithm model exceeds the reward threshold or the number of training rounds reaches the maximum number of training rounds. After training, the main policy learning network is used to predict the reference values of the stator voltage d-axis and q-axis components during operation of the induction motor.
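As a rough illustration of how the eight-element state vector described above might be assembled each control period, the sketch below computes each deviation as reference minus measurement and accumulates the deviation integrals with a forward-Euler rule. The class name `CurrentErrorIntegrator`, the sample time `Ts`, and the Euler integration are assumptions for illustration; the text does not specify the discretization.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CurrentErrorIntegrator:
    """Builds the state vector; Ts is a hypothetical control period in seconds."""
    Ts: float
    int_ed: float = 0.0  # running integral of the d-axis current deviation
    int_eq: float = 0.0  # running integral of the q-axis current deviation

    def build_state(self, id_meas: float, iq_meas: float,
                    id_ref: float, iq_ref: float,
                    speed_meas: float, speed_ref: float) -> List[float]:
        # Deviation = reference value minus measured value.
        e_d = id_ref - id_meas
        e_q = iq_ref - iq_meas
        # Forward-Euler accumulation (assumed integration rule).
        self.int_ed += e_d * self.Ts
        self.int_eq += e_q * self.Ts
        # Eight elements, in the order listed in the description.
        return [id_meas, iq_meas, e_d, e_q,
                self.int_ed, self.int_eq, speed_meas, speed_ref]
```

Keeping the integrals inside a stateful object mirrors the fact that they persist across time steps, whereas the other six elements are recomputed from fresh measurements each period.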

[0006] Optionally, the TD3 algorithm model includes a main policy learning network, a first value evaluation network, a second value evaluation network, a target policy learning network, a first target value evaluation network, and a second target value evaluation network. Training the TD3 algorithm model using the experience samples and updating its parameters until the reward of the TD3 algorithm model exceeds the reward threshold or the number of training rounds reaches the maximum number of training rounds includes: sampling from the experience pool to obtain batch sample data comprising multiple experience samples; inputting the batch sample data into the TD3 algorithm model for training and updating the parameters of the first and second value evaluation networks; updating the parameters of the main policy learning network once every preset number of updates of the first and second value evaluation networks; updating the parameters of the target policy learning network, the first target value evaluation network, and the second target value evaluation network by soft update; and repeating the above steps until the reward of the TD3 algorithm model exceeds the reward threshold or the number of training rounds reaches the maximum number of training rounds. Each round comprises multiple time steps, and each time step covers the full process in which the TD3 algorithm model receives an experience sample, infers an action vector, and returns the reward and the state vector of the next time step.
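The update schedule above resembles standard TD3: both value networks are updated at every step, the policy network less often, and the target networks by Polyak (soft) averaging. The sketch below assumes flat parameter lists and hypothetical names (`run_updates`, `policy_delay`, `tau`, and the `nets` keys). Following the common TD3 convention, it also assumes the soft update happens together with each delayed policy update; the text leaves that cadence open.

```python
from typing import Callable, Dict, List

Params = List[float]


def soft_update(target: Params, source: Params, tau: float) -> None:
    """Polyak averaging in place: theta' <- tau * theta + (1 - tau) * theta'."""
    for i, (t, s) in enumerate(zip(target, source)):
        target[i] = tau * s + (1.0 - tau) * t


def run_updates(num_steps: int, policy_delay: int, tau: float,
                update_critics: Callable[[], None],
                update_actor: Callable[[], None],
                nets: Dict[str, Params]) -> int:
    """Run num_steps critic updates with delayed actor updates.

    Returns how many actor updates were performed.
    """
    actor_updates = 0
    for step in range(1, num_steps + 1):
        update_critics()                 # both value networks, every step
        if step % policy_delay == 0:     # policy only every policy_delay steps
            update_actor()
            actor_updates += 1
            # Soft-update all three target networks from their online twins.
            soft_update(nets["actor_target"], nets["actor"], tau)
            soft_update(nets["critic1_target"], nets["critic1"], tau)
            soft_update(nets["critic2_target"], nets["critic2"], tau)
    return actor_updates
```

Delaying the policy update lets the value estimates settle before the policy chases them, which is the "delayed" part of the TD3 name.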

[0007] Optionally, inputting the batch sample data into the TD3 algorithm model for training and updating the parameters of the first and second value evaluation networks includes: inputting the state vector of the next time step into the target policy learning network for inference to obtain a first action vector; adding clipped (truncated) noise to the first action vector to obtain a target action vector; inputting the target action vector into the first target value evaluation network to calculate a first action value, and into the second target value evaluation network to calculate a second action value; taking the smaller of the first and second action values as the target action value; computing the target evaluation value as the reward plus the discounted target action value; inputting the action vector corresponding to the state vector at the current time step into the first value evaluation network to calculate a third action value, and into the second value evaluation network to calculate a fourth action value; and using gradient descent to update the parameters of the first value evaluation network so as to minimize the error between the target evaluation value and the third action value, and likewise to update the parameters of the second value evaluation network so as to minimize the error between the target evaluation value and the fourth action value.
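A minimal sketch of the target computation described above, assuming plain callables stand in for the target networks. It uses clipped Gaussian noise for the "truncated noise", takes the minimum of the two target action values, and forms $y = r + \gamma \min(Q'_1, Q'_2)$. The function names, default hyperparameters, and action bounds are hypothetical; no terminal (done) mask is applied because the text does not mention one.

```python
import random
from typing import Callable, List, Optional

Vector = List[float]


def td3_target(reward: float, next_state: Vector,
               target_actor: Callable[[Vector], Vector],
               target_q1: Callable[[Vector, Vector], float],
               target_q2: Callable[[Vector, Vector], float],
               gamma: float = 0.99, noise_std: float = 0.2,
               noise_clip: float = 0.5, a_min: float = -1.0,
               a_max: float = 1.0,
               rng: Optional[random.Random] = None) -> float:
    """Target evaluation value y = r + gamma * min(Q1'(s', a~), Q2'(s', a~))."""
    if rng is None:
        rng = random.Random()
    # First action vector from the target policy network at s_{k+1}.
    a_next = target_actor(next_state)
    # Target policy smoothing: clipped Gaussian noise, then clamp to bounds.
    a_tilde = []
    for a in a_next:
        eps = max(-noise_clip, min(noise_clip, rng.gauss(0.0, noise_std)))
        a_tilde.append(max(a_min, min(a_max, a + eps)))
    # Clipped double-Q: the smaller of the two target action values.
    q_target = min(target_q1(next_state, a_tilde),
                   target_q2(next_state, a_tilde))
    return reward + gamma * q_target


def critic_mse(q_pred: List[float], y: List[float]) -> float:
    """Mean squared error that each value network minimizes by gradient descent."""
    return sum((t - q) ** 2 for q, t in zip(q_pred, y)) / len(y)
```

Taking the minimum of two independently trained critics counters the overestimation bias that a single max-seeking critic tends to accumulate.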

[0008] Optionally, updating the parameters of the main policy learning network once every preset number of updates of the first and second value evaluation networks includes: when the parameters of the first and second value evaluation networks have been updated the preset number of times, inputting the state vector of the next time step into the main policy learning network for inference to obtain a second action vector; inputting the second action vector into the first value evaluation network to calculate a fifth action value; and using gradient ascent to calculate the parameters of the main policy learning network that maximize the fifth action value, thereby updating the main policy learning network.
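The gradient-ascent policy update above can be illustrated on a toy scalar problem. In this hypothetical setup the actor is $a = w s$ and the first value network is fixed at $Q_1(s, a) = -(a - 2s)^2$, so gradient ascent on the mean $Q_1$ should drive the gain $w$ toward 2. The key step is the chain rule $\partial Q_1/\partial w = (\partial Q_1/\partial a)(\partial a/\partial w)$; a real implementation would backpropagate through neural networks instead of using these closed-form derivatives.

```python
from typing import List


def actor_ascent(states: List[float], w: float, lr: float, steps: int) -> float:
    """Deterministic policy gradient ascent on a toy linear actor.

    Hypothetical model: actor a = w * s, critic Q1(s, a) = -(a - 2 s)^2.
    """
    for _ in range(steps):
        grad = 0.0
        for s in states:
            a = w * s
            dq_da = -2.0 * (a - 2.0 * s)  # critic gradient w.r.t. the action
            grad += dq_da * s             # chain rule: da/dw = s
        w += lr * grad / len(states)      # ascend the mean Q1 over the batch
    return w
```

Only the first value network drives the policy gradient, matching the text's use of the "fifth action value" from the first value evaluation network.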

[0009] Optionally, repeating the above steps until the reward of the TD3 algorithm model exceeds the reward threshold or the current round number reaches the maximum number of training rounds includes: after the target policy learning network, the first target value evaluation network, and the second target value evaluation network have been soft-updated, determining whether the current round has reached the maximum number of time steps for a single round. If it has not, returning to the step of sampling from the experience pool to obtain batch sample data comprising multiple experience samples. If it has, ending training for the current round and determining whether the number of training rounds has reached the evaluation interval. If the evaluation interval has not been reached, proceeding to determine whether the current round number exceeds the maximum number of training rounds. If the evaluation interval has been reached, taking a test sample from the experience pool and running the TD3 algorithm model for a preset number of rounds; calculating, for each of these rounds, the cumulative discounted reward over all of its time steps as the return; calculating the average of the returns over the preset rounds; and determining whether this average exceeds the return threshold. If so, ending the training process; if not, determining whether the current round number exceeds the maximum number of training rounds. If it does, ending the training process; otherwise, returning to the step of sampling from the experience pool to obtain batch sample data comprising multiple experience samples.
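The evaluation branch above reduces to computing a discounted return for each evaluation round, averaging the returns, and comparing the average with the return threshold. The sketch below assumes rewards arrive as plain lists. The helper names and the use of `>=` for the round limit are assumptions (the text alternates between "reaches" and "greater than").

```python
from typing import Callable, List, Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return G = sum_k gamma**k * r_k for one evaluation round."""
    g = 0.0
    for r in reversed(rewards):  # backward recursion G_k = r_k + gamma * G_{k+1}
        g = r + gamma * g
    return g


def evaluation_passes(rounds: List[Sequence[float]], gamma: float,
                      return_threshold: float) -> bool:
    """Average the per-round returns and compare against the return threshold."""
    returns = [discounted_return(rw, gamma) for rw in rounds]
    return sum(returns) / len(returns) > return_threshold


def training_finished(round_idx: int, max_rounds: int, eval_interval: int,
                      evaluate: Callable[[], bool]) -> bool:
    """Stop early if a scheduled evaluation passes, else at the round limit."""
    if round_idx % eval_interval == 0 and evaluate():
        return True
    return round_idx >= max_rounds
```

Passing `evaluate` as a callable keeps the (expensive) evaluation rollouts from running except at the scheduled interval.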

[0010] Optionally, sampling from the experience pool to obtain batch sample data comprising multiple experience samples includes: initializing the three-phase current of the induction motor and the measured values of rotor speed and rotor position; calculating the measured values of the stator current d-axis and q-axis components from the three-phase current and the rotor position measurement; inputting the measured speed value and the speed reference value into the speed PI controller to calculate the reference values of the stator current d-axis and q-axis components; inputting the state vector, composed of the measured values of the stator current d-axis and q-axis components, the stator current d-axis and q-axis component deviations, the integrals of those two deviations, the measured speed value, and the speed reference value, into the main policy learning network of the TD3 algorithm model for inference to obtain the reference values of the stator voltage d-axis and q-axis components; applying the inverse Park transform to the stator voltage d-axis and q-axis reference values to obtain the stator voltage α-axis and β-axis reference values; and performing space vector pulse width modulation (SVPWM) on the stator voltage α-axis and β-axis reference values to obtain a pulse width modulation signal that drives the inverter of the induction motor.
The reward for taking the action vector at the current time step is then calculated from the stator current d-axis and q-axis component deviations and the reference values of the stator voltage d-axis and q-axis components. The system collects the three-phase current, rotor speed measurement, and rotor position measurement of the induction motor again and returns to the step of calculating the stator current d-axis and q-axis measured values from the three-phase current and the rotor position measurement, thereby obtaining the state vector of the next time step. The state vector of the current time step, the action vector composed of the stator voltage d-axis and q-axis reference values, the reward, and the state vector of the next time step are combined into an experience sample and stored in the experience pool. It is then determined whether the number of experience samples in the experience pool has reached the preset number required for training. If so, the experience pool is sampled to obtain batch sample data comprising multiple experience samples; if not, the process returns to the step of initializing the three-phase current, rotor speed measurement, and rotor position measurement of the induction motor, until the experience pool holds the preset number of samples required for training.
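For the inverse Park transform and SVPWM steps above, one common realization is sketched below. It assumes the amplitude-invariant convention and uses min-max zero-sequence injection, a carrier-based method that produces the same switching pattern as sector-based SVPWM with the zero vectors split equally. The function names and the DC-link voltage parameter `vdc` are hypothetical.

```python
import math
from typing import Tuple


def inverse_park(vd: float, vq: float, theta: float) -> Tuple[float, float]:
    """d-q to alpha-beta: v_alpha = vd cos - vq sin, v_beta = vd sin + vq cos."""
    c, s = math.cos(theta), math.sin(theta)
    return vd * c - vq * s, vd * s + vq * c


def svpwm_duties(v_alpha: float, v_beta: float,
                 vdc: float) -> Tuple[float, float, float]:
    """Phase duty cycles in [0, 1] via min-max zero-sequence injection."""
    # Amplitude-invariant inverse Clarke to phase voltages.
    va = v_alpha
    vb = -0.5 * v_alpha + (math.sqrt(3) / 2.0) * v_beta
    vc = -0.5 * v_alpha - (math.sqrt(3) / 2.0) * v_beta
    # Common-mode offset that centers the active vectors in the PWM period.
    v0 = -0.5 * (max(va, vb, vc) + min(va, vb, vc))
    # Normalize by the DC link and clamp (saturates in overmodulation).
    da, db, dc = (min(1.0, max(0.0, 0.5 + (v + v0) / vdc))
                  for v in (va, vb, vc))
    return da, db, dc
```

In the linear range (voltage vector magnitude up to $v_{dc}/\sqrt{3}$) no clamping occurs, and the duty cycles can be loaded directly into center-aligned PWM compare registers.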

[0011] Optionally, the reward for taking the action vector at the current time step is calculated from the stator current d-axis component deviation, the stator current q-axis component deviation, and the reference values of the stator voltage d-axis and q-axis components, using the following formula:
[formula image not available]
where the left-hand side is the reward for taking the action vector under the state vector of the k-th time step; Q1, Q2, and R are hyperparameters; and the remaining terms are the stator current d-axis component deviation at the k-th time step, the stator current q-axis component deviation at the k-th time step, the reference value of the stator voltage d-axis component at the (k-1)-th time step, and the reference value of the stator voltage q-axis component at the (k-1)-th time step.
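Because the formula in [0011] is not available here, the following sketch uses an assumed quadratic penalty built only from the quantities the text names: $r_k = -\left(Q_1 e_{d,k}^2 + Q_2 e_{q,k}^2 + R\,(u_{d,k-1}^{*2} + u_{q,k-1}^{*2})\right)$. This is a plausible LQR-style form for illustration, not the patent's formula; in particular, weighting the absolute voltage references (rather than their change between steps) is an assumption.

```python
def reward(e_d: float, e_q: float, ud_prev: float, uq_prev: float,
           q1: float, q2: float, r: float) -> float:
    """ASSUMED quadratic reward; not the published formula.

    e_d, e_q: stator current d/q deviations at step k.
    ud_prev, uq_prev: stator voltage d/q references at step k-1.
    q1, q2, r: hyperparameters named Q1, Q2, R in the text.
    """
    tracking_cost = q1 * e_d ** 2 + q2 * e_q ** 2
    effort_cost = r * (ud_prev ** 2 + uq_prev ** 2)
    return -(tracking_cost + effort_cost)  # higher reward = smaller cost
```

Because a current loop needs nonzero voltage in steady state, a nonzero `r` in this assumed form trades a little tracking accuracy for smaller voltage commands; a penalty on the voltage change would avoid that bias.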

[0012] In a second aspect, the present invention also provides an induction motor control method using the main policy learning network trained by the training method of the first aspect, comprising: during operation of the induction motor, determining the three-phase current, rotor speed measurement, and rotor position measurement of the induction motor; calculating the measured values of the stator current d-axis and q-axis components from the three-phase current and the rotor position measurement; inputting the measured speed value and the speed reference value into the speed PI controller to obtain the reference values of the stator current d-axis and q-axis components; inputting the state vector, composed of the measured values of the stator current d-axis and q-axis components, the stator current d-axis and q-axis component deviations, the integrals of those two deviations, the measured speed value, and the speed reference value, into the main policy learning network of the TD3 algorithm model for inference to obtain the reference values of the stator voltage d-axis and q-axis components; applying the inverse Park transform to these reference values to obtain the stator voltage α-axis and β-axis reference values; and performing space vector pulse width modulation on the α-axis and β-axis reference values to obtain a pulse width modulation signal that drives the inverter of the induction motor.
These steps are repeated until a stop signal for the induction motor is received.
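The speed PI controller that feeds the current references in the control loop above could look like the discrete regulator below. The gains, the clamping anti-windup scheme, and the choice to hold `id_ref` constant (rated-flux current) while the PI output sets `iq_ref` are assumptions for illustration; the text only states that the speed PI produces both the d-axis and q-axis references.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SpeedPI:
    """Discrete speed PI with clamping anti-windup (hypothetical parameters)."""
    kp: float
    ki: float
    ts: float          # control period in seconds
    iq_limit: float    # symmetric torque-current limit
    id_ref: float      # assumed constant flux-producing current reference
    integ: float = 0.0

    def step(self, speed_ref: float, speed_meas: float) -> Tuple[float, float]:
        err = speed_ref - speed_meas
        candidate = self.integ + self.ki * err * self.ts
        iq_unsat = self.kp * err + candidate
        iq_ref = max(-self.iq_limit, min(self.iq_limit, iq_unsat))
        # Anti-windup: commit the integrator only when the output is unsaturated.
        if iq_ref == iq_unsat:
            self.integ = candidate
        return self.id_ref, iq_ref
```

Freezing the integrator during saturation keeps the speed loop from overshooting after large speed steps, when the torque-current limit is active for many periods.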

[0013] In a third aspect, the present invention also provides a training system for an induction motor control model, comprising: a digital signal processor that includes a replay buffer for storing the experience pool; a field-programmable gate array (FPGA) communicatively connected to the digital signal processor; and a server configured to execute the training method of the first aspect and to deploy the trained main policy learning network to the field-programmable gate array.

[0014] Optionally, the digital signal processor further includes a speed PI regulator and a reward calculation module, and the field-programmable gate array further includes an inverse Park transform module, a space vector pulse width modulation module, an analog-to-digital converter module, a Clarke transform module, a Park transform module, and a speed and position calculation module. The analog-to-digital converter module acquires the three-phase current of the induction motor and converts it into digital signals. The speed and position calculation module calculates the rotor position measurement and rotor speed measurement from the sensor signals and the three-phase current. The Clarke transform module applies the Clarke transform to the digitized three-phase current to obtain the measured values of the stator current α-axis and β-axis components. The Park transform module applies the Park transform to the stator current α-axis and β-axis measured values to obtain the measured values of the stator current d-axis and q-axis components. The speed PI regulator calculates the reference values of the stator current d-axis and q-axis components from the measured speed value and the speed reference value. The reward calculation module calculates the reward for taking the action vector under the state vector of the current time step from the stator current d-axis and q-axis component deviations and the stator voltage d-axis and q-axis reference values, and calculates the cumulative discounted reward over all time steps of each of the preset rounds as the return.
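The Clarke and Park transform modules described above can be sketched as follows, assuming the amplitude-invariant (2/3-scaled) Clarke convention; power-invariant scaling is also in use, and the function names are illustrative.

```python
import math
from typing import Tuple


def clarke(ia: float, ib: float, ic: float) -> Tuple[float, float]:
    """Amplitude-invariant Clarke transform (abc -> alpha-beta)."""
    i_alpha = (2.0 / 3.0) * (ia - 0.5 * ib - 0.5 * ic)
    i_beta = (2.0 / 3.0) * (math.sqrt(3) / 2.0) * (ib - ic)
    return i_alpha, i_beta


def park(i_alpha: float, i_beta: float, theta: float) -> Tuple[float, float]:
    """Park transform (alpha-beta -> d-q) at field angle theta in radians."""
    c, s = math.cos(theta), math.sin(theta)
    return i_alpha * c + i_beta * s, -i_alpha * s + i_beta * c
```

For balanced sinusoidal currents aligned with `theta`, the d-axis result equals the peak phase current and the q-axis result is zero, which is why the policy network sees near-DC signals in steady state.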

[0015] The training method for the induction motor control model provided by this invention obtains the necessary experience samples from an experience pool. These experience samples include the state vector of the current time step, the action vector inferred by the main policy learning network of the TD3 algorithm model based on the state vector of the current time step, the state vector of the next time step, and the reward for taking the action vector under the state vector of the current time step. The state vector includes the measured values of the stator current d-axis component, the measured values of the stator current q-axis component, the stator current d-axis component deviation, the stator current q-axis component deviation, the integral of the stator current d-axis component deviation, the integral of the stator current q-axis component deviation, the measured speed value, and the speed reference value. The action vector includes the reference values of the stator voltage d-axis component and the stator voltage q-axis component. The TD3 algorithm model is trained using these experience samples, and the parameters of the TD3 algorithm model are updated until the reward of the TD3 algorithm model is greater than the reward threshold or the number of training rounds reaches the maximum. The trained main policy learning network is used to predict the reference values of the stator voltage d-axis component and the stator voltage q-axis component during the operation of the induction motor. By continuously learning and updating the parameters of the master policy learning network, an optimized nonlinear control strategy is obtained. After training, the master policy learning network replaces the existing current PI regulator, avoiding the problem of low control accuracy caused by the design limitations of the PI regulator parameters. This improves the control accuracy of the induction motor, as well as its torque response and operational stability.
Furthermore, this invention also uses the speed variable of the outer loop (i.e., the measured speed value) as an observation value for reinforcement learning, achieving deep collaboration between the outer loop (speed control loop) and the inner loop (current loop). This integrates two independent control levels into a holistic, optimized intelligent control, resulting in superior system-level dynamic performance and further improving the control accuracy of the induction motor.

[0016] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of the present invention, nor is it intended to limit the scope of the invention. Other features of the invention will become readily apparent from the following description.

Attached Figure Description

[0017] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0018] Figure 1 is a flowchart of a training method for an induction motor control model provided by the present invention; Figure 2 is a flowchart of another training method for an induction motor control model provided by the present invention; Figure 3 is a flowchart of an induction motor control method provided by the present invention; Figure 4 is a schematic structural diagram of a training system for an induction motor control model provided by the present invention; Figure 5 is a schematic structural diagram of an electronic device provided by the present invention.

[0019] The accompanying drawings illustrate specific embodiments of this application, which will be described in more detail below. These drawings and descriptions are not intended to limit the scope of the concept in any way, but rather to illustrate the concept of this application to those skilled in the art through reference to particular embodiments.

Detailed Implementation

[0020] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.

[0021] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0022] Figure 1 is a flowchart of a training method for an induction motor control model provided by the present invention. This embodiment is applicable to model training based on the TD3 (Twin Delayed Deep Deterministic Policy Gradient) algorithm; during induction motor operation, the main policy learning network of the trained TD3 algorithm model predicts the reference values of the stator voltage d-axis and q-axis components. The method can be executed by the induction motor control model training device provided by the present invention, which can be implemented in software and/or hardware and is typically configured in an electronic device. As shown in Figure 1, the training method for the induction motor control model includes the following steps: S101. Obtain the experience samples required for training from the experience pool. Each experience sample includes the state vector of the current time step, the action vector inferred from that state vector by the main policy learning network of the TD3 algorithm model, the state vector of the next time step, and the reward for taking the action vector under the state vector of the current time step. The state vector includes the measured value of the stator current d-axis component, the measured value of the stator current q-axis component, the stator current d-axis component deviation, the stator current q-axis component deviation, the integral of the stator current d-axis component deviation, the integral of the stator current q-axis component deviation, the measured rotational speed, and the rotational speed reference value. The action vector includes the reference value of the stator voltage d-axis component and the reference value of the stator voltage q-axis component.

[0023] In the TD3 algorithm, the replay buffer is a limited-capacity cache that stores the agent's history of interaction with the environment (i.e., its experience); it is the core carrier of the experience replay technique in deep reinforcement learning. Each data unit stored in the replay buffer is one complete experience, that is, the transition generated at one time step. In this embodiment of the invention, an experience in the replay buffer is $(s_k, a_k, s_{k+1}, r_k)$, where $s_k$ is the state vector at the k-th time step, $a_k$ is the action vector inferred from $s_k$ by the main policy learning network (Actor) of the TD3 algorithm model, $s_{k+1}$ is the state vector at the next time step, and $r_k$ is the reward for taking action vector $a_k$ in state vector $s_k$. A time step covers the full process in which the TD3 algorithm model receives an experience sample, infers an action vector, and returns the reward and the state vector of the next time step.
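An experience pool holding $(s_k, a_k, s_{k+1}, r_k)$ tuples, together with the warm-up check described in [0026], might be implemented as a fixed-capacity FIFO with uniform random sampling, as sketched below. The class and method names, the capacity, and uniform (rather than prioritized) sampling are assumptions.

```python
import random
from collections import deque
from typing import Deque, List, NamedTuple, Optional


class Experience(NamedTuple):
    state: List[float]       # s_k
    action: List[float]      # a_k = (u_d reference, u_q reference)
    next_state: List[float]  # s_{k+1}
    reward: float            # r_k


class ReplayBuffer:
    """Fixed-capacity experience pool; the oldest experience is evicted first."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        self._data: Deque[Experience] = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, exp: Experience) -> None:
        self._data.append(exp)

    def __len__(self) -> int:
        return len(self._data)

    def ready(self, min_size: int) -> bool:
        # Training starts only once the pool holds the preset number of samples.
        return len(self._data) >= min_size

    def sample(self, batch_size: int) -> List[Experience]:
        # Uniform sampling without replacement breaks temporal correlation.
        return self._rng.sample(list(self._data), batch_size)
```

A bounded `deque` keeps memory fixed on an embedded target such as the DSP mentioned in [0013], at the cost of forgetting the oldest transitions.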

[0024] The state vector includes the measured values of the stator current d-axis component, the measured values of the stator current q-axis component, the stator current d-axis component deviation, the stator current q-axis component deviation, the integral of the stator current d-axis component deviation, the integral of the stator current q-axis component deviation, the measured speed value, and the speed reference value. The action vector includes the reference values of the stator voltage d-axis component and the stator voltage q-axis component.

[0025] In this embodiment of the invention, the experience in the experience pool can be generated in real time during the training process. The agent continuously generates new experiences by interacting with the environment and stores them in the experience pool. Subsequently, it samples from the experience pool to update the network.

[0026] Experience samples are obtained by sampling from the experience pool when the amount of experience in the pool meets the requirements for training. Experience samples include the state vector at a certain time step, the action vector obtained by the main policy learning network of the TD3 algorithm model based on the state vector, the state vector at the next time step, and the reward for taking the action vector under the state vector at a certain time step.

[0027] Here, the reward is the feedback signal obtained after the induction motor, in the condition described by the state vector, operates according to the stator voltage d-axis and q-axis reference values given by the action vector; it measures the immediate quality of taking that action vector in that state. The stator current d-axis component deviation is the difference between the reference value and the measured value of the stator current d-axis component, and the stator current q-axis component deviation is the difference between the reference value and the measured value of the stator current q-axis component. The reference values of the stator current d-axis and q-axis components are calculated from the motor speed. The integral of each deviation is the cumulative value of that deviation over time. The rotor speed measurement can be calculated from the frequency or period of the pulse signal output by a sensor (e.g., an incremental encoder). The rotor position measurement can be calculated from the reference values of the stator current d-axis and q-axis components and the speed measurement.
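The speed and position measurements above could be computed as in the sketch below. Speed comes from the encoder count change over one period. For position, this sketch interprets the text as indirect field-oriented control: it integrates the electrical rotor speed plus the slip frequency $\omega_{sl} = i_q^* / (\tau_r i_d^*)$ obtained from the current references, so the resulting angle is the rotor-flux (field) angle used by the Park transforms. That interpretation, the rotor time constant `tau_r`, the pole-pair count, and `counts_per_rev` are assumptions.

```python
import math


def encoder_speed(delta_counts: int, counts_per_rev: int, ts: float) -> float:
    """Mechanical speed in rad/s from the count change over one period ts.

    counts_per_rev is hypothetical and includes any x4 quadrature decoding.
    """
    return 2.0 * math.pi * delta_counts / (counts_per_rev * ts)


def update_field_angle(theta: float, omega_mech: float, id_ref: float,
                       iq_ref: float, pole_pairs: int, tau_r: float,
                       ts: float) -> float:
    """One Euler step of the indirect-FOC field angle, wrapped to [0, 2*pi).

    Slip w_sl = iq_ref / (tau_r * id_ref) assumes rotor-flux orientation in
    steady state; tau_r (L_r / R_r) is an assumed motor parameter.
    """
    omega_slip = iq_ref / (tau_r * id_ref)
    omega_e = pole_pairs * omega_mech + omega_slip  # electrical rad/s
    return (theta + omega_e * ts) % (2.0 * math.pi)
```

Using the references rather than measured currents in the slip term avoids injecting measurement noise into the angle, but it makes the angle sensitive to errors in `tau_r`, which drifts with rotor temperature.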

[0028] S102. Train the TD3 algorithm model using the experience samples, and update its parameters until the reward of the TD3 algorithm model is greater than the reward threshold or the number of training rounds reaches the maximum number of training rounds. During operation of the induction motor, the trained master policy learning network predicts the reference values of the stator voltage d-axis and q-axis components.

[0029] In this embodiment of the invention, the TD3 algorithm is adopted, and the TD3 algorithm model is trained using the experience samples. Its parameters are continuously learned and updated until the reward of the TD3 algorithm model exceeds the reward threshold or the maximum number of training rounds is reached, yielding an optimized nonlinear control strategy. During operation of the induction motor, the trained master policy learning network predicts the reference values of the stator voltage d-axis and q-axis components, which are then used for space vector modulation of the induction motor. Here, the reward of the TD3 algorithm model means the long-term cumulative benefit. In this embodiment, it is the cumulative discounted reward generated over multiple time steps. A training round is a sequence of multiple consecutive time steps.

[0030] Furthermore, this invention also uses the speed variable of the outer loop (i.e., the speed measurement value) as the observation value for reinforcement learning, realizing deep collaboration between the outer loop (speed control loop) and the inner loop (current loop), integrating the two independent control levels into a whole optimized intelligent control, resulting in superior system-level dynamic performance and further improving the control accuracy of the induction motor.

[0031] The training method for the induction motor control model provided by this invention obtains the necessary experience samples from an experience pool. These experience samples include the state vector of the current time step, the action vector inferred by the main policy learning network of the TD3 algorithm model based on the state vector of the current time step, the state vector of the next time step, and the reward for taking the action vector under the state vector of the current time step. The state vector includes the measured values of the stator current d-axis component, the measured values of the stator current q-axis component, the stator current d-axis component deviation, the stator current q-axis component deviation, the integral of the stator current d-axis component deviation, the integral of the stator current q-axis component deviation, the measured speed value, and the speed reference value. The action vector includes the reference values of the stator voltage d-axis component and the stator voltage q-axis component. The TD3 algorithm model is trained using these experience samples, and the parameters of the TD3 algorithm model are updated until the reward of the TD3 algorithm model is greater than the reward threshold or the number of training rounds reaches the maximum. The trained main policy learning network is used to predict the reference values of the stator voltage d-axis component and the stator voltage q-axis component during the operation of the induction motor. By continuously learning and updating the parameters of the master policy learning network, an optimized nonlinear control strategy is obtained. After training, the master policy learning network replaces the existing current PI regulator, avoiding the problem of low control accuracy caused by the design limitations of the PI regulator parameters. This improves the control accuracy of the induction motor, as well as its torque response and operational stability.
Furthermore, this invention also uses the speed variable of the outer loop (i.e., the measured speed value) as an observation value for reinforcement learning, achieving deep collaboration between the outer loop (speed control loop) and the inner loop (current loop). This integrates two independent control levels into a holistic, optimized intelligent control, resulting in superior system-level dynamic performance and further improving the control accuracy of the induction motor.

[0032] Figure 2 is a flowchart of another training method for an induction motor control model provided by the present invention. This training process is an offline training process conducted in a simulation environment. The simulation environment provides everything reinforcement learning requires: an induction motor simulation model, a three-phase inverter simulation model, an SVPWM (Space Vector Pulse Width Modulation) module, a rotor position calculation module, a slip speed calculation module, a Clarke transform module, an inverse Park transform module, a Park transform and per-unitization module, and a speed PI regulator module. The reinforcement learning TD3-Agent interacts with this environment to train a suitable main policy learning network (Actor). The TD3 algorithm model includes a main policy learning network (Actor), a first value evaluation network (Critic1), a second value evaluation network (Critic2), a target policy learning network (Target Actor), a first target value evaluation network (Target Critic1), and a second target value evaluation network (Target Critic2). Referring to Figure 2, the training method for this induction motor control model includes: 1. Set training parameters.

[0033] The training parameters include: maximum number of training rounds (episodes) MaxEpisode = 1000; maximum number of time steps per round MaxStep = 10000; replay buffer size = 100000; mini-batch size = 512; discount factor γ = 0.9; and learning rate τ = 0.001.

[0034] 2. Initialize the environment and model.

[0035] Initializing the environment covers the induction motor parameters, three-phase inverter parameters, rotor position measurement value, speed PI regulator parameters, speed reference value, rotor flux reference value, speed measurement value, and load torque. The Actor, Critic1, Critic2, Target Actor, Target Critic1, and Target Critic2 networks of the TD3-Agent are also initialized, together with the replay buffer.

[0036] 3. Reset the environment.

[0037] Environment reset ensures the continuity of training. Without a reset, the agent would be stuck in a terminated state after a round, unable to continue exploring. The reset mechanism allows the agent to try repeatedly and continuously collect experience, which is the foundation of round-based task training.

[0038] 4. Generate experience samples and store them in the experience pool.

[0039] (1) Calculate the measured values of the d-axis component of the stator current and the q-axis component of the stator current based on the three-phase current and the rotor position measurement values.

[0040] In this embodiment of the invention, the initial values of the three-phase currents are input into the Clarke transform module for Clarke transform to obtain the measured values of the stator current α-axis component and the stator current β-axis component. The measured values of the stator current α-axis component and the stator current β-axis component are then input into the Park transform and per-unit transformation module for Park transform and per-unit transformation to obtain the measured values of the stator current d-axis component (per-unit value) and the stator current q-axis component (per-unit value).
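To make the two transforms concrete, the following is a minimal Python sketch. It assumes the amplitude-invariant Clarke transform and the standard Park rotation, and per-unitizes by dividing by a hypothetical base current `i_base`. The text does not state which Clarke scaling or base values its modules use, and the function names are illustrative.

```python
import math

def clarke(i_a, i_b, i_c):
    """Amplitude-invariant Clarke transform: three-phase currents -> (alpha, beta)."""
    i_alpha = (2.0 / 3.0) * (i_a - 0.5 * i_b - 0.5 * i_c)
    i_beta = (i_b - i_c) / math.sqrt(3.0)
    return i_alpha, i_beta

def park_per_unit(i_alpha, i_beta, theta, i_base):
    """Rotate (alpha, beta) into the d/q frame at angle theta, then scale to per-unit."""
    c, s = math.cos(theta), math.sin(theta)
    i_d = i_alpha * c + i_beta * s
    i_q = -i_alpha * s + i_beta * c
    return i_d / i_base, i_q / i_base

# Balanced 10 A currents whose space vector lies on the d axis (theta = 0.7 rad).
theta, amp = 0.7, 10.0
i_abc = [amp * math.cos(theta - k * 2.0 * math.pi / 3.0) for k in range(3)]
i_d_pu, i_q_pu = park_per_unit(*clarke(*i_abc), theta, i_base=10.0)
```

With these inputs, `i_d_pu` is 1.0 and `i_q_pu` is 0.0 up to floating-point rounding, because the current space vector is aligned with the d axis.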

[0041] (2) Input the measured speed value and the reference speed value into the speed PI controller, and calculate the reference value of the d-axis component of the stator current and the reference value of the q-axis component of the stator current.

[0042] For example, the speed reference value is compared with the measured speed value, and their difference is input into the speed PI controller. The controller calculates the torque reference value through proportional-integral adjustment: $T_e^* = K_p(\omega_r^* - \omega_r) + K_i \int (\omega_r^* - \omega_r)\,dt$. Here, $T_e^*$ is the torque reference value, $K_p$ and $K_i$ are the parameters (proportional and integral gains) of the speed PI controller, $\omega_r^*$ is the speed reference value, and $\omega_r$ is the measured speed value.
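The speed PI controller might be implemented in discrete time as in the sketch below. It assumes a forward-Euler integral with sample time `ts`. It also adds clamping of the output and the integral at a hypothetical `limit`, which the text does not specify. The class name `SpeedPI` is illustrative.

```python
class SpeedPI:
    """Discrete PI speed controller producing a torque reference T_e*."""

    def __init__(self, kp, ki, ts, limit):
        self.kp, self.ki, self.ts, self.limit = kp, ki, ts, limit
        self.integral = 0.0

    def _clamp(self, x):
        return max(-self.limit, min(self.limit, x))

    def step(self, speed_ref, speed_meas):
        err = speed_ref - speed_meas  # speed reference minus measured speed
        # Forward-Euler integral, clamped to the limit (simple anti-windup).
        self.integral = self._clamp(self.integral + self.ki * self.ts * err)
        # Proportional + integral terms, saturated to the torque limit.
        return self._clamp(self.kp * err + self.integral)

pi = SpeedPI(kp=2.0, ki=10.0, ts=0.001, limit=5.0)
torque_ref = pi.step(speed_ref=1.0, speed_meas=0.5)
```

Clamping the integral as well as the output stops the integral term from winding up while the torque reference is saturated.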

[0043] The rotor flux reference value $\psi_r^*$ determines whether the motor operates in its optimal range, which directly affects efficiency and control stability. Below rated speed, a constant air-gap flux is desirable to make full use of the motor's core. The flux linkage reference value is therefore typically set to a constant rated value, obtained from a lookup table or assigned directly. Above rated speed, the back electromotive force rises with speed, while the maximum voltage the inverter can provide is limited. Field-weakening control is then needed to actively reduce the flux linkage reference value and keep the required voltage within what the inverter can supply, so that the motor can continue accelerating. The field-weakening algorithm dynamically calculates a suitable flux linkage setpoint from the current speed and the voltage margin.
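The sketch below shows one simple way to produce a flux linkage reference with field weakening. It substitutes the common approximation of reducing flux in inverse proportion to speed above rated speed. The algorithm described above works from the current speed and voltage margin instead, so treat this as an assumed stand-in. Names and units are hypothetical.

```python
def flux_reference(speed, rated_speed, rated_flux):
    """Rated flux up to rated speed; above it, flux ~ 1/speed (simple field weakening)."""
    w = abs(speed)  # direction of rotation does not change the flux magnitude
    if w <= rated_speed:
        return rated_flux                 # constant-flux region
    return rated_flux * rated_speed / w   # field-weakening region

psi_low = flux_reference(speed=0.5, rated_speed=1.0, rated_flux=1.0)
psi_high = flux_reference(speed=2.0, rated_speed=1.0, rated_flux=1.0)
```

A voltage-margin-based scheme would instead adjust the flux until the d/q voltage magnitude fits under the inverter limit. The 1/ω rule only approximates this because it ignores load and resistance effects.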

[0044] The reference value $i_{sq}^*$ of the stator current q-axis component is calculated from the torque reference value $T_e^*$ and the rotor flux reference value $\psi_r^*$. The reference value of the stator current d-axis component is calculated as $i_{sd}^* = \psi_r^*/L_m$, where $L_m$ is the mutual inductance.
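As an illustration, the sketch below computes both current references from textbook rotor-flux-oriented relations in SI units, assuming amplitude-invariant transforms: $i_{sd}^* = \psi_r^*/L_m$ and $i_{sq}^* = \frac{2}{3}\frac{L_r}{p L_m}\frac{T_e^*}{\psi_r^*}$. Here $p$ is the number of pole pairs and $L_r$ is the rotor inductance. The q-axis expression is an assumption taken from standard field-oriented control, not a formula confirmed by this text, and the function name is hypothetical.

```python
def current_references(torque_ref, flux_ref, l_m, l_r, pole_pairs):
    """Textbook rotor-flux-oriented current references (SI units, amplitude-invariant)."""
    if flux_ref <= 0.0:
        raise ValueError("flux_ref must be positive")
    i_sd_ref = flux_ref / l_m  # flux-producing component
    # Torque-producing component from T_e = (3/2) * p * (L_m / L_r) * psi_r * i_sq.
    i_sq_ref = (2.0 / 3.0) * l_r / (pole_pairs * l_m) * torque_ref / flux_ref
    return i_sd_ref, i_sq_ref

i_sd_ref, i_sq_ref = current_references(torque_ref=10.0, flux_ref=0.8,
                                        l_m=0.2, l_r=0.21, pole_pairs=2)
```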

[0045] (3) Input the state vector composed of the measured value of the d-axis component of the stator current, the measured value of the q-axis component of the stator current, the deviation of the d-axis component of the stator current, the deviation of the q-axis component of the stator current, the integral of the deviation of the d-axis component of the stator current, the integral of the deviation of the q-axis component of the stator current, the measured value of the speed and the reference value of the speed into the main policy learning network of the TD3 algorithm model for inference, and obtain the reference value of the d-axis component of the stator voltage and the reference value of the q-axis component of the stator voltage.

[0046] In this embodiment of the invention, eight quantities form the state vector $s_k = [i_{sd}, i_{sq}, e_d, e_q, \int e_d, \int e_q, \omega_r, \omega_r^*]$:
- the measured value of the stator current d-axis component $i_{sd}$;
- the measured value of the stator current q-axis component $i_{sq}$;
- the stator current d-axis component deviation $e_d$;
- the stator current q-axis component deviation $e_q$;
- the integral of the d-axis deviation $\int e_d$;
- the integral of the q-axis deviation $\int e_q$;
- the speed measurement value $\omega_r$;
- the speed reference value $\omega_r^*$.

The state vector is input into the main policy learning network (Actor) of the TD3 algorithm model. The Actor infers the action vector $a_k$, which consists of the reference value of the stator voltage d-axis component $u_{sd}^*$ and the reference value of the stator voltage q-axis component $u_{sq}^*$, i.e., $a_k = \pi_\phi(s_k) = [u_{sd}^*, u_{sq}^*]$.
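The following sketch assembles the eight-element state vector in the order listed above and keeps the deviation integrals as running sums. Several details are assumptions: the class name, the sample time `ts` (the integral is a pure cumulative sum when `ts = 1`), and resetting the integrals at each round. It uses the deviation sign convention defined earlier (reference minus measured). The Actor network itself is omitted.

```python
class StateBuilder:
    """Builds the 8-element observation fed to the Actor."""

    def __init__(self, ts=1.0):
        self.ts = ts
        self.int_ed = 0.0
        self.int_eq = 0.0

    def reset(self):
        # Assumed: clear the accumulated deviations when the environment is reset.
        self.int_ed = 0.0
        self.int_eq = 0.0

    def build(self, i_sd, i_sq, i_sd_ref, i_sq_ref, speed, speed_ref):
        e_d = i_sd_ref - i_sd           # d-axis deviation: reference minus measured
        e_q = i_sq_ref - i_sq           # q-axis deviation: reference minus measured
        self.int_ed += e_d * self.ts    # running (discrete) integral of e_d
        self.int_eq += e_q * self.ts    # running (discrete) integral of e_q
        return [i_sd, i_sq, e_d, e_q, self.int_ed, self.int_eq, speed, speed_ref]

builder = StateBuilder(ts=0.5)
s1 = builder.build(0.75, 0.25, 1.0, 0.5, 0.9, 1.0)
s2 = builder.build(0.75, 0.25, 1.0, 0.5, 0.9, 1.0)
```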

[0047] (4) Perform the inverse Park transform on the reference values of the stator voltage d-axis and q-axis components to obtain the reference values of the stator voltage α-axis and β-axis components.

[0048] In this embodiment of the invention, the reference values of the stator voltage d-axis component $u_{sd}^*$ and q-axis component $u_{sq}^*$ are input into the inverse Park transform module. The module performs the inverse Park transform to obtain the reference values of the stator voltage α-axis component $u_{s\alpha}^*$ and β-axis component $u_{s\beta}^*$.
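A minimal sketch of the inverse Park transform follows. It assumes the same angle convention as a standard Park transform, with the d axis at angle θ from the α axis. The function name is hypothetical.

```python
import math

def inverse_park(u_d, u_q, theta):
    """Rotate d/q voltage references back to the stationary alpha/beta frame."""
    c, s = math.cos(theta), math.sin(theta)
    u_alpha = u_d * c - u_q * s
    u_beta = u_d * s + u_q * c
    return u_alpha, u_beta

u_alpha, u_beta = inverse_park(1.0, 0.0, math.pi / 2)
```

Because the transform is a pure rotation, it preserves the voltage vector magnitude. Only the reference frame changes.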

[0049] (5) Based on the reference values of the stator voltage α-axis and β-axis components, perform space vector pulse width modulation to obtain a pulse width modulation signal, which controls the inverter that drives the induction motor.

[0050] In this embodiment of the invention, the reference values of the stator voltage α-axis component $u_{s\alpha}^*$ and β-axis component $u_{s\beta}^*$ are input into the SVPWM module, which performs space vector pulse width modulation to obtain a pulse width modulation signal. This signal is input to the three-phase inverter simulation model. The inverter model converts the DC input into AC power to drive the induction motor simulation model.
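The SVPWM module is not specified in detail. The sketch below substitutes min-max (zero-sequence) injection, a common software implementation that produces the same switching pattern as SVPWM. It converts the α/β references to phase voltages, subtracts the midpoint of the largest and smallest phase voltage, and normalizes by the DC-link voltage `v_dc` to get duty cycles. The function name and signature are illustrative.

```python
import math

def svpwm_duties(u_alpha, u_beta, v_dc):
    """Phase duty cycles (0..1) via min-max zero-sequence injection (SVPWM-equivalent)."""
    # Inverse Clarke transform (amplitude-invariant) to phase voltage references.
    v_a = u_alpha
    v_b = -0.5 * u_alpha + (math.sqrt(3.0) / 2.0) * u_beta
    v_c = -0.5 * u_alpha - (math.sqrt(3.0) / 2.0) * u_beta
    # The zero-sequence offset centres the three references between the DC rails.
    v_0 = -0.5 * (max(v_a, v_b, v_c) + min(v_a, v_b, v_c))
    duties = []
    for v in (v_a, v_b, v_c):
        d = 0.5 + (v + v_0) / v_dc
        duties.append(min(1.0, max(0.0, d)))  # saturate outside the linear range
    return duties

# Edge of the linear range: |u| = v_dc / sqrt(3).
duties = svpwm_duties(1.0 / math.sqrt(3.0), 0.0, v_dc=1.0)
```

The injected offset is common to all three phases, so it cancels in the line-to-line voltages. It extends the linear range from $V_{dc}/2$ (sinusoidal PWM) to $V_{dc}/\sqrt{3}$.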

[0051] (6) Calculate the reward for taking the action vector under the state vector at the current time step based on the stator current d-axis component deviation, stator current q-axis component deviation, stator voltage d-axis component reference value and stator voltage q-axis component reference value.

[0052] In this embodiment of the invention, four quantities are input into the reward function: the stator current d-axis component deviation, the stator current q-axis component deviation, and the reference values of the stator voltage d-axis and q-axis components. The function calculates the reward for taking the action vector under the state vector of the current time step:

$r_k = -\left[Q_1 e_d(k)^2 + Q_2 e_q(k)^2 + R\left(u_{sd}^*(k-1)^2 + u_{sq}^*(k-1)^2\right)\right]$

where:
- $r_k$ is the reward for taking the action vector under the state vector of the k-th time step;
- $Q_1$, $Q_2$, and $R$ are hyperparameters;
- $e_d(k)$ and $e_q(k)$ are the stator current d-axis and q-axis component deviations at the k-th time step;
- $u_{sd}^*(k-1)$ and $u_{sq}^*(k-1)$ are the reference values of the stator voltage d-axis and q-axis components at the (k-1)-th time step.
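As a sketch, the reward can be coded as a negative quadratic penalty on the current deviations and on the previous step's voltage references, weighted by the hyperparameters Q1, Q2, and R. The quadratic form and the default weights below are assumptions, not values taken from the patent.

```python
def reward(e_d, e_q, u_sd_prev, u_sq_prev, q1=5.0, q2=5.0, r=0.1):
    """Negative quadratic cost on current deviations and previous voltage references."""
    tracking = q1 * e_d ** 2 + q2 * e_q ** 2        # penalise current tracking error
    effort = r * (u_sd_prev ** 2 + u_sq_prev ** 2)  # penalise large voltage commands
    return -(tracking + effort)

r_k = reward(e_d=0.5, e_q=-0.5, u_sd_prev=1.0, u_sq_prev=0.0, q1=2.0, q2=2.0, r=0.5)
```

Raising R relative to Q1 and Q2 trades tracking accuracy for smaller, smoother voltage references.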

[0053] (7) Collect the three-phase current, rotor speed measurement value and rotor position measurement value of the induction motor, and return to the step of calculating the stator current d-axis component measurement value and stator current q-axis component measurement value based on the three-phase current and rotor position measurement value to obtain the state vector of the next time step.

[0054] In this embodiment of the invention, the three-phase current, rotor speed measurement, and rotor position measurement of the induction motor are collected again at the next time step. The process then returns to the step of calculating the stator current d-axis and q-axis components from the three-phase current and the rotor position measurement, which yields the state vector $s_{k+1}$ of the next time step.

[0055] The three-phase current of the induction motor at the next time step can be acquired by an analog-to-digital converter. The rotor speed measurement value can be calculated from the frequency or period of the pulse signal output by a sensor (e.g., an incremental encoder). The rotor position measurement value can be calculated from the reference values of the stator current d-axis and q-axis components at the previous time step and the speed measurement value at the current time step. The calculation proceeds in three steps:
1. Calculate the slip speed from the reference values of the stator current d-axis and q-axis components at the previous time step.
2. Add the slip speed to the speed measurement value at the current time step to obtain the synchronous speed.
3. Integrate the synchronous speed to obtain the rotor position measurement value.

The slip speed is calculated as $\omega_{sl} = \frac{R_r}{L_r}\cdot\frac{i_{sq}^*}{i_{sd}^*}$, where $\omega_{sl}$ is the per-unit value of the slip speed, $R_r$ is the rotor resistance, and $L_r$ is the rotor inductance.
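The slip-speed and angle-integration steps can be sketched as follows. The sketch makes several assumptions: electrical angular speeds in rad/s, a fixed sample time `ts`, forward-Euler integration, and wrapping of the angle into [0, 2π). It also uses the usual indirect field-oriented slip expression $\omega_{sl} = (R_r/L_r)\,i_{sq}^*/i_{sd}^*$. Function names are hypothetical.

```python
import math

def slip_speed(i_sd_ref, i_sq_ref, r_r, l_r):
    """Slip speed from the previous step's current references: (R_r / L_r) * i_sq* / i_sd*."""
    if i_sd_ref == 0.0:
        raise ValueError("i_sd_ref must be non-zero")
    return (r_r / l_r) * i_sq_ref / i_sd_ref

def advance_angle(theta, rotor_speed, slip, ts):
    """Integrate synchronous speed (rotor speed + slip) over one sample; wrap to [0, 2*pi)."""
    sync_speed = rotor_speed + slip
    return (theta + sync_speed * ts) % (2.0 * math.pi)

w_sl = slip_speed(i_sd_ref=2.0, i_sq_ref=1.0, r_r=0.5, l_r=0.25)
theta_next = advance_angle(theta=0.0, rotor_speed=99.0, slip=w_sl, ts=0.001)
```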

[0056] (8) Combine four items into an experience sample and store it in the experience pool: the state vector of the current time step; the action vector formed by the reference values of the stator voltage d-axis and q-axis components; the reward; and the state vector of the next time step.

[0057] The state vector $s_k$ of the current time step, the action vector $a_k = [u_{sd}^*, u_{sq}^*]$ (the reference values of the stator voltage d-axis and q-axis components), the reward $r_k$, and the state vector $s_{k+1}$ of the next time step are combined into the experience sample $(s_k, a_k, r_k, s_{k+1})$. This sample is stored in the experience pool.
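A minimal experience pool matching the sizes in the training parameters (capacity 100000, mini-batch 512) could be a bounded deque, as sketched below. The class and method names are assumptions, including `ready()`, which performs the "preset number required for training" check.

```python
import random
from collections import deque

class ReplayBuffer:
    """Bounded experience pool of (s_k, a_k, r_k, s_{k+1}) tuples."""

    def __init__(self, capacity=100_000, seed=None):
        self._buf = deque(maxlen=capacity)  # oldest samples are discarded when full
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self._buf.append((state, action, reward, next_state))

    def __len__(self):
        return len(self._buf)

    def ready(self, min_size):
        # True once enough samples are stored to start training.
        return len(self._buf) >= min_size

    def sample(self, batch_size=512):
        # Uniform sampling without replacement from the stored experience.
        return self._rng.sample(self._buf, batch_size)

rb = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    rb.add([i], [0.0, 0.0], float(i), [i + 1])
batch = rb.sample(2)
```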

[0058] 5. Determine whether the number of experience samples in the experience pool has reached the preset number required for training.

[0059] In this embodiment of the invention, each time an experience sample is stored, the process checks whether the experience pool holds the preset number of samples required for training. The preset number can be the number of experience samples required for one training round. If so, step 6 is executed. Otherwise, the process returns to the step of initializing the three-phase current, rotor speed measurement value, and rotor position measurement value of the induction motor, and repeats until the pool reaches the preset number. In a specific embodiment of the invention, referring to Figure 2, if the preset number has not been reached, the process proceeds to sub-step (1) of step 10.

[0060] 6. Sample from the experience pool to obtain batch sample data including multiple experience samples.

[0061] If the number of experience samples in the experience pool reaches the preset number required for training, batch sample data containing multiple experience samples is sampled from the experience pool.

[0062] 7. Input batch sample data into the TD3 algorithm model for training, and update the parameters of the first value evaluation network and the second value evaluation network.

[0063] In this embodiment of the invention, batch sample data is input into the TD3 algorithm model for training, and the parameters of the first value evaluation network (Critic1) and the second value evaluation network (Critic2) are updated.

[0064] For example, the update process for the first value evaluation network (Critic1) and the second value evaluation network (Critic2) is as follows: (1) Input the state vector of the next time step into the target policy learning network for inference to obtain the first action vector.

[0065] In this embodiment of the invention, the state vector $s_{k+1}$ of the next time step is input into the target policy learning network (Target Actor) for inference to obtain the first action vector $a'_{k+1} = \pi_{\phi'}(s_{k+1})$.

[0066] (2) Add truncated noise to the first action vector to obtain the target action vector.

[0067] Truncated noise is added to the first action vector $a'_{k+1}$ to obtain the target action vector. For example, normally distributed noise is added: $\tilde{a}_{k+1} = a'_{k+1} + \epsilon$, with $\epsilon \sim \mathrm{clip}(\mathcal{N}(0, \sigma), -c, c)$. Here, $\epsilon$ is normally distributed noise truncated to the interval $[-c, c]$. Truncating the noise keeps the target action from deviating too far from the original action.
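Target policy smoothing with truncated Gaussian noise can be sketched as follows. The noise scale `sigma` and the truncation bound `noise_clip` are assumed values. The final clipping of the action to `[a_low, a_high]` is an assumed extra step: it is standard in TD3 but is not stated in the text.

```python
import random

def smoothed_target_action(a_next, sigma=0.2, noise_clip=0.5,
                           a_low=-1.0, a_high=1.0, rng=random):
    """Add clipped Gaussian noise to each action component, then keep it within bounds."""
    target = []
    for a in a_next:
        eps = min(noise_clip, max(-noise_clip, rng.gauss(0.0, sigma)))  # truncated noise
        target.append(min(a_high, max(a_low, a + eps)))                  # assumed action bounds
    return target

a_tilde = smoothed_target_action([0.1, -0.3], rng=random.Random(0))
```

Smoothing the target action in this way makes the critic's target less sensitive to sharp peaks in the learned Q function.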

[0068] (3) Input the target action vector into the first target value evaluation network and calculate the first action value of the first action vector.

[0069] The target action vector $\tilde{a}_{k+1}$ is input into the first target value evaluation network (Target Critic1) to calculate the first action value $Q_{\theta'_1}(s_{k+1}, \tilde{a}_{k+1})$ of the first action vector.

[0070] (4) Input the target action vector into the second target value evaluation network and calculate the second action value of the first action vector.

[0071] The target action vector $\tilde{a}_{k+1}$ is input into the second target value evaluation network (Target Critic2) to calculate the second action value $Q_{\theta'_2}(s_{k+1}, \tilde{a}_{k+1})$ of the first action vector.

[0072] (5) Take the smaller of the first action value and the second action value as the target action value.

[0073] In this embodiment of the invention, the smaller of the first action value $Q_{\theta'_1}(s_{k+1}, \tilde{a}_{k+1})$ and the second action value $Q_{\theta'_2}(s_{k+1}, \tilde{a}_{k+1})$ is used as the target action value, which reduces the risk of overestimation.

[0074] (6) Calculate the discounted sum of the reward and the target action value to obtain the target evaluation value.

[0075] The target action value is multiplied by the discount factor γ, and the product is added to the reward of the current time step to obtain the target evaluation value:

$y_k = r_k + \gamma \min_{i=1,2} Q_{\theta'_i}(s_{k+1}, \tilde{a}_{k+1})$

where:
- $y_k$ is the target evaluation value;
- $r_k$ is the reward of the k-th time step;
- γ is the discount factor;
- $\min_{i=1,2} Q_{\theta'_i}(s_{k+1}, \tilde{a}_{k+1})$, the smaller of the first and second action values, is the target action value.
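For a batch, the clipped double-Q target reduces to one line per sample. This sketch assumes scalar critic outputs and, like the text, applies no terminal-state mask. The function name is illustrative.

```python
def td3_targets(rewards, q1_next, q2_next, gamma=0.9):
    """y_k = r_k + gamma * min(Q1'(s_{k+1}, a~), Q2'(s_{k+1}, a~)) for each sample."""
    return [r + gamma * min(q1, q2) for r, q1, q2 in zip(rewards, q1_next, q2_next)]

y = td3_targets(rewards=[1.0, 0.0], q1_next=[2.0, -1.0], q2_next=[3.0, 5.0])
```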

[0076] (7) Input the action vector corresponding to the state vector of the current time step into the first value evaluation network, and calculate the third action value of the action vector corresponding to the state vector of the current time step.

[0077] The state vector $s_k$ of the k-th time step and its corresponding action vector $a_k$ are input into the first value evaluation network (Critic1) to calculate the third action value $Q_{\theta_1}(s_k, a_k)$.

[0078] (8) Input the action vector corresponding to the state vector of the current time step into the second value evaluation network, and calculate the fourth action value of the action vector corresponding to the state vector of the current time step.

[0079] The state vector $s_k$ of the k-th time step and its corresponding action vector $a_k$ are input into the second value evaluation network (Critic2) to calculate the fourth action value $Q_{\theta_2}(s_k, a_k)$.

[0080] (9) Calculate the parameters of the first value evaluation network that minimizes the error between the target evaluation value and the third action value using the gradient descent algorithm, and update the parameters of the first value evaluation network.

[0081] In this embodiment of the invention, gradient descent is used to find the parameters $\theta_1$ of the first value evaluation network (Critic1) that minimize the error between the target evaluation value $y_k$ and the third action value $Q_{\theta_1}(s_k, a_k)$. The parameters of Critic1 are updated accordingly.

[0082] (10) Calculate the parameters of the second value evaluation network that minimizes the error between the target evaluation value and the fourth action value using the gradient descent algorithm, and update the parameters of the second value evaluation network.

[0083] In this embodiment of the invention, gradient descent is used to find the parameters $\theta_2$ of the second value evaluation network (Critic2) that minimize the error between the target evaluation value $y_k$ and the fourth action value $Q_{\theta_2}(s_k, a_k)$. The parameters of Critic2 are updated accordingly.

[0084] For example, in this embodiment of the invention, the parameters are updated by minimizing the mean squared error $L(\theta_i) = \frac{1}{N}\sum_{k=1}^{N}\left(y_k - Q_{\theta_i}(s_k, a_k)\right)^2$, $i = 1, 2$, where N is the number of experience samples in the batch sample data.
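To show what one gradient-descent step on this mean squared error does, the sketch below replaces each critic network with a hypothetical linear critic $Q_w(x) = w^\top x$ over a feature vector $x$ (for instance, the concatenated state and action). The linear model, the learning rate, and the function name are assumptions made only to keep the gradient explicit.

```python
def linear_critic_step(w, features, targets, lr):
    """One gradient-descent step on L(w) = (1/N) * sum_k (Q_w(x_k) - y_k)^2, Q_w(x) = w . x."""
    n = len(features)
    grad = [0.0] * len(w)
    loss = 0.0
    for x, y in zip(features, targets):
        q = sum(wj * xj for wj, xj in zip(w, x))  # critic estimate Q_w(x_k)
        err = q - y                               # error relative to the target y_k
        loss += err * err / n
        for j, xj in enumerate(x):
            grad[j] += 2.0 * err * xj / n         # dL/dw_j
    new_w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return new_w, loss                            # loss is evaluated before the update

feats, ys = [[1.0, 0.0], [0.0, 1.0]], [1.0, 2.0]
w1, loss1 = linear_critic_step([0.0, 0.0], feats, ys, lr=0.5)
w2, loss2 = linear_critic_step(w1, feats, ys, lr=0.5)
```

A neural critic follows the same pattern: automatic differentiation supplies the gradient of the loss with respect to θ1 or θ2.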

[0085] 8. Update the parameters of the main policy learning network every preset number of times the parameters of the first value evaluation network and the second value evaluation network are updated.

[0086] In this embodiment of the invention, the parameters of the main policy learning network are updated once every preset number of times (e.g., twice) when the parameters of the first value evaluation network (Critic1) and the second value evaluation network (Critic2) are updated. If the number of updates of the first value evaluation network and the second value evaluation network does not meet the update timing of the main policy learning network, sub-step (1) of step 10 is executed.

[0087] For example, the process of updating the parameters of the master policy learning network is as follows: (1) When the parameters of the first value evaluation network and the second value evaluation network are updated a preset number of times, the second state vector is input into the main policy learning network for inference to obtain the second action vector.

[0088] Each time the parameters of the first value evaluation network (Critic1) and the second value evaluation network (Critic2) have been updated a preset number of times (e.g., twice), the state vector $s_{k+1}$ of the next time step is input into the main policy learning network (Actor). The Actor infers the second action vector $\pi_\phi(s_{k+1})$.

[0089] (2) Input the second action vector into the first value evaluation network and calculate the fifth action value of the second action vector.

[0090] The second action vector is input into the first value evaluation network (Critic1), together with the state vector from which it was inferred, to calculate the fifth action value of the second action vector.

[0091] (3) Use the gradient ascent algorithm to calculate the parameters of the main policy learning network that maximizes the value of the fifth action, and update the parameters of the main policy learning network.

[0092] Gradient ascent is used to find the parameters ϕ that maximize the fifth action value Q, and the main policy learning network (Actor) is updated accordingly:

$\nabla_\phi J(\phi) = \frac{1}{N}\sum_{k=1}^{N} \nabla_a Q_{\theta_1}(s_k, a)\big|_{a=\pi_\phi(s_k)}\, \nabla_\phi \pi_\phi(s_k)$

where:
- $\nabla_\phi J(\phi)$ is the gradient of the objective function of the Actor with respect to the network parameters ϕ;
- N is the number of experience samples in the batch sample data;
- $s_k$ is the state vector input into the Actor for the k-th sample;
- $\nabla_a Q_{\theta_1}$ is the gradient of the fifth action value Q with respect to the second action vector $a$;
- $\nabla_\phi \pi_\phi$ is the gradient of the second action vector with respect to the network parameters ϕ.
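The chain rule in this update can be illustrated with a hypothetical linear Actor $a_i = \sum_j \phi_{ij} s_j$ and a critic that supplies $\nabla_a Q$ as a callable. This is a sketch under those assumptions. In the method described here, the Actor and Critic1 are learned networks, and the gradients would come from automatic differentiation.

```python
def actor_ascent_step(phi, states, dq_da, lr):
    """One gradient-ascent step for a linear actor a_i = sum_j phi[i][j] * s[j].

    dq_da(s, a) must return dQ/da (one entry per action component) at a = pi_phi(s).
    """
    n = len(states)
    grad = [[0.0] * len(row) for row in phi]
    for s in states:
        a = [sum(p * x for p, x in zip(row, s)) for row in phi]  # a = pi_phi(s)
        g_a = dq_da(s, a)                                         # critic gradient w.r.t. action
        for i, row in enumerate(phi):
            for j in range(len(row)):
                grad[i][j] += g_a[i] * s[j] / n                   # dQ/da_i * da_i/dphi_ij
    # Ascent: move phi along the averaged policy gradient.
    return [[p + lr * g for p, g in zip(row, g_row)] for row, g_row in zip(phi, grad)]

# Toy critic Q(s, a) = -(a - 1)^2 with 1-D state and action; the best linear actor is phi = [[1.0]].
phi = [[0.0]]
for _ in range(20):
    phi = actor_ascent_step(phi, [[1.0]], lambda s, a: [-2.0 * (a[0] - 1.0)], lr=0.25)
```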

[0093] 9. Update the parameters of the target policy learning network, the first target value evaluation network, and the second target value evaluation network using a soft update method.

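[0094] In this embodiment of the invention, the parameters of the target policy learning network (Target Actor), the first target value evaluation network (Target Critic1), and the second target value evaluation network (Target Critic2) are updated using a soft update method:
- $\theta'_1 \leftarrow \tau\theta_1 + (1-\tau)\theta'_1$
- $\theta'_2 \leftarrow \tau\theta_2 + (1-\tau)\theta'_2$
- $\phi' \leftarrow \tau\phi + (1-\tau)\phi'$

Here, $\theta'_1$, $\theta'_2$, and $\phi'$ are the parameters of Target Critic1, Target Critic2, and Target Actor, respectively, and τ is the learning rate.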
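Applied element by element to flattened parameter lists, the soft update is a weighted average. The sketch below uses the default τ = 0.001 from the training parameters. Representing parameters as flat lists of floats is an assumption for illustration.

```python
def soft_update(target_params, online_params, tau=0.001):
    """Return tau * online + (1 - tau) * target, elementwise."""
    return [tau * p + (1.0 - tau) * t for t, p in zip(target_params, online_params)]

blended = soft_update([0.0, 2.0], [1.0, 0.0], tau=0.5)
copied = soft_update([3.0], [7.0], tau=1.0)
```

Calling it once with τ = 1.0 copies the online parameters, which is a convenient way to initialize the targets. Repeated calls with a small τ then make the target networks track the online networks slowly.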

[0095] During the training process of this invention, the update frequency of the main policy learning network and the target network (including the target policy learning network, the first target value evaluation network, and the second target value evaluation network) is lower than that of the first value evaluation network and the second value evaluation network. This allows the first value evaluation network and the second value evaluation network to converge faster, provide more accurate action evaluation values, avoid drastic policy oscillations, and improve the stability and final performance of the training.

[0096] 10. Repeat the above steps until the reward of the TD3 algorithm model is greater than the reward threshold or the number of training rounds reaches the maximum number of training rounds.

[0097] Steps 1 to 9 above are repeated until the reward of the TD3 algorithm model is greater than the reward threshold or the number of training rounds reaches the maximum number of training rounds. Each round includes multiple time steps. Each time step covers the complete cycle in which the TD3 algorithm model receives an experience sample, infers an action vector, and returns the reward and the state vector of the next time step.

[0098] Specifically, as shown in Figure 2, after the parameters of the target policy learning network, the first target value evaluation network, and the second target value evaluation network are updated using the soft update method, the process further includes: (1) Determine whether the time step of this round has reached the maximum time step of a single round.

[0099] After each soft update of the parameters of the target policy learning network, the first target value evaluation network, and the second target value evaluation network, the process checks whether the time step of this round has reached the maximum time step MaxStep of a single round. If yes, step (2) below is executed. If no, the process returns to the step of sampling from the experience pool to obtain batch sample data including multiple experience samples. In a specific embodiment of the present invention, if no, the process returns to step 4 to generate new experience.

[0100] (2) End the training for this round and determine whether the number of training rounds has reached the evaluation interval.

[0101] If the time step of this round reaches the maximum time step of a single round (MaxStep), the training of this round is complete. The round then ends, and the process checks whether the number of training rounds has reached the evaluation interval. If yes, step (3) below is executed. If no, the process proceeds to the step of determining whether the current round number is greater than the maximum number of training rounds.

[0102] (3) Conduct model performance evaluation to determine whether the model performance meets the training stopping criteria.

[0103] If the number of training rounds reaches the evaluation interval, model performance is evaluated to determine whether the model's performance meets the training stopping criteria. If yes, the training process ends; otherwise, step (4) is executed. Specifically, the model performance evaluation process is as follows: (3.1) Take test samples from the experience pool and run the TD3 algorithm model for a preset number of rounds.

[0104] In this embodiment of the invention, if the number of training rounds reaches the evaluation interval, test samples are taken from the experience pool and the TD3 algorithm model is run continuously for a preset number of rounds (e.g., 3 rounds).

[0105] (3.2) For each of the preset rounds, calculate the cumulative discounted reward generated over all of its time steps as the return.

[0106] In this embodiment of the invention, the cumulative discounted reward generated over all time steps of each preset round is calculated as the return. For example, the return of the i-th round is $G_i = \sum_{k=1}^{T} \gamma^{k-1} r_k$, where:
- $G_i$ is the cumulative discounted reward generated over all time steps of round i;
- T is the total number of time steps in one round;
- $r_k$ is the reward of the k-th time step;
- γ is the discount factor.

[0107] (3.3) Calculate the average of all returns over the preset rounds.

[0108] In this embodiment of the invention, the average of the returns of all preset rounds is calculated.

[0109] (3.4) Determine whether the average of all returns is greater than the return threshold.

[0110] In this embodiment of the invention, it is determined whether the average value of all returns is greater than the return threshold. If yes, it means that the model's performance has reached the training stopping criterion, and the training process ends. If no, it means that the model's performance has not yet reached the training stopping criterion, and the following step (4) is executed.
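Evaluation steps (3.2) to (3.4) can be sketched as below. The sketch assumes each round's rewards are available as a list and that the discounted sum applies γ^0 to the first time step. Function names are hypothetical.

```python
def discounted_return(rewards, gamma=0.9):
    """G_i = sum_k gamma^(k-1) * r_k for one round (k starting at 1)."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def meets_stop_criterion(round_rewards, return_threshold, gamma=0.9):
    """Average the per-round returns and compare the mean with the return threshold."""
    returns = [discounted_return(r, gamma) for r in round_rewards]
    mean_return = sum(returns) / len(returns)
    return mean_return > return_threshold, mean_return

stop, mean_g = meets_stop_criterion([[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]],
                                    return_threshold=0.8, gamma=0.5)
```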

[0111] (4) Determine whether the current round number is greater than the maximum number of training rounds.

[0112] Determine if the current training round number is greater than the maximum training round number MaxEpisode. If yes, end the training process; otherwise, return to the step of sampling from the experience pool to obtain batch sample data including multiple experience samples. In a specific embodiment of the invention, if no, return to step 3, environment reset. That is, the environment needs to be reset after each training round.
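The round and time-step bookkeeping of step 10 can be summarized in a small control-flow skeleton. The callables `reset`, `step`, `update`, and `evaluate` are hypothetical placeholders for step 3, step 4, steps 5 to 9, and evaluation steps (3.1) to (3.4), respectively. The default `eval_interval` and `return_threshold` are assumed values. The skeleton shows only the loop structure.

```python
def train_loop(reset, step, update, evaluate, max_episode=1000, max_step=10000,
               eval_interval=10, return_threshold=0.0):
    """Skeleton of the round / time-step control flow; returns the last round executed."""
    for episode in range(1, max_episode + 1):
        reset()                    # step 3: environment reset at the start of each round
        for _ in range(max_step):  # run until MaxStep time steps in this round
            step()                 # step 4: generate and store one experience sample
            update()               # steps 5-9: train only once the pool is large enough
        if episode % eval_interval == 0 and evaluate() > return_threshold:
            return episode         # (3.4): stopping criterion met
    return max_episode             # (4): maximum number of rounds reached

counts = {"reset": 0, "step": 0, "update": 0, "eval": 0}

def _reset(): counts["reset"] += 1
def _step(): counts["step"] += 1
def _update(): counts["update"] += 1
def _evaluate():
    counts["eval"] += 1
    return float(counts["eval"])  # toy evaluation that improves each time it is called

last = train_loop(_reset, _step, _update, _evaluate, max_episode=10, max_step=5,
                  eval_interval=2, return_threshold=2.5)
```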

[0113] The present invention also provides an induction motor control method based on a master policy learning network trained using the training method of any of the foregoing embodiments of the present invention. Figure 3 is a flowchart of an induction motor control method provided by the present invention. As shown in Figure 3, the induction motor control method includes: S201. During operation of the induction motor, determine the three-phase current, rotor speed measurement value, and rotor position measurement value of the induction motor.

[0114] During operation of the induction motor, the three-phase current, the rotor speed measurement value, and the rotor position measurement value are determined in real time. For example, as mentioned earlier, the three-phase current can be acquired through an analog-to-digital converter module. The rotor speed measurement value can be calculated from the frequency or period of the pulse signal output by a sensor (e.g., an incremental encoder). The rotor position measurement value can be calculated from the stator current d-axis and q-axis reference values of the previous time step and the speed measurement value of the current time step: first, the slip speed is calculated from the stator current d-axis and q-axis reference values of the previous time step; the slip speed is then added to the speed measurement value of the current time step to obtain the synchronous speed; finally, the synchronous speed is integrated to obtain the rotor position measurement value.
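The position calculation of [0114] can be sketched as follows. Because the slip formula is not reproduced here, the code assumes the common indirect field-oriented control (IFOC) relation ω_slip = i_q* / (τ_r · i_d*) with rotor time constant τ_r = L_r / R_r, a mechanical speed converted to electrical speed by the pole-pair count, and forward-Euler integration with sample time `ts`. All names are hypothetical.

```python
import math


def update_flux_angle(theta_prev, omega_mech, i_d_ref_prev, i_q_ref_prev,
                      pole_pairs, tau_r, ts):
    """One step of the position estimate described in [0114] (IFOC-style sketch).

    Assumes omega_slip = i_q_ref / (tau_r * i_d_ref), with tau_r = L_r / R_r,
    and a nonzero d-axis (flux) current reference.
    """
    # Slip speed from the previous step's current references (electrical rad/s).
    omega_slip = i_q_ref_prev / (tau_r * i_d_ref_prev)
    # Synchronous speed = electrical rotor speed + slip speed.
    omega_sync = pole_pairs * omega_mech + omega_slip
    # Forward-Euler integration of the synchronous speed, wrapped to [0, 2*pi).
    theta = (theta_prev + omega_sync * ts) % (2.0 * math.pi)
    return theta, omega_sync
```

The returned angle is the one subsequently used by the Park and inverse Park transforms.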

[0115] S202. Calculate the measured values of the stator current d-axis and q-axis components from the three-phase current and the rotor position measurement value.

[0116] In this embodiment of the invention, the three-phase current is input into the Clarke transform module, whose Clarke transform yields the measured values of the stator current α-axis and β-axis components. These measured values are then input into the Park transform and per-unit module, which performs the Park transform and per-unit conversion to obtain the measured values of the stator current d-axis component (per-unit value) and q-axis component (per-unit value).
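A minimal Python sketch of the Clarke, Park, and per-unit steps, assuming the amplitude-invariant (2/3-scaled) Clarke transform, a d-axis aligned with the angle θ from the position calculation, and a hypothetical base current `i_base` for per-unit scaling:

```python
import math


def clarke(i_a, i_b, i_c):
    """Amplitude-invariant Clarke transform (abc -> alpha/beta)."""
    i_alpha = (2.0 / 3.0) * (i_a - 0.5 * i_b - 0.5 * i_c)
    i_beta = (i_b - i_c) / math.sqrt(3.0)
    return i_alpha, i_beta


def park_pu(i_alpha, i_beta, theta, i_base):
    """Park transform (alpha/beta -> d/q) followed by per-unit scaling."""
    c, s = math.cos(theta), math.sin(theta)
    i_d = i_alpha * c + i_beta * s
    i_q = -i_alpha * s + i_beta * c
    return i_d / i_base, i_q / i_base
```

With balanced sinusoidal currents and θ equal to the current's phase angle, the per-unit d-axis value equals the amplitude divided by `i_base` and the q-axis value is zero, which is a quick sanity check for sign conventions.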

[0117] S203. Input the measured speed value and the speed reference value into the speed PI controller to obtain the reference values of the stator current d-axis and q-axis components.

[0118] For example, the difference between the speed reference value and the measured speed value is calculated and input into the speed PI controller, which calculates the torque reference value through proportional-integral regulation:

$$T_e^* = K_p\left(\omega^* - \omega\right) + K_i \int \left(\omega^* - \omega\right) dt$$

where $T_e^*$ is the torque reference value, $K_p$ and $K_i$ are the parameters of the speed PI controller, $\omega^*$ is the speed reference value, and $\omega$ is the measured speed value.
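The speed PI law above, discretized with forward-Euler integration, might look like the following. The torque limit `t_max` and the clamping anti-windup are added assumptions that the text does not specify, and `SpeedPI` is a hypothetical name.

```python
class SpeedPI:
    """Discrete speed PI: T_ref = Kp*e + Ki*integral(e), with e = omega_ref - omega.

    Assumes forward-Euler integration with sample time ts and a symmetric
    torque limit t_max with simple clamping anti-windup.
    """

    def __init__(self, kp, ki, ts, t_max):
        self.kp, self.ki, self.ts, self.t_max = kp, ki, ts, t_max
        self.integral = 0.0

    def step(self, omega_ref, omega):
        e = omega_ref - omega
        candidate = self.integral + e * self.ts
        t_ref = self.kp * e + self.ki * candidate
        # Anti-windup: commit the new integral only while the output is unsaturated.
        if abs(t_ref) <= self.t_max:
            self.integral = candidate
            return t_ref
        return max(-self.t_max, min(self.t_max, t_ref))
```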

[0119] The choice of the rotor flux linkage reference value $\psi_r^*$ is crucial to keeping the motor in its optimal operating range, as it directly affects efficiency and control stability. Below rated speed, a constant air-gap flux is desirable to maximize utilization of the motor core, so the flux linkage reference value is typically set to a constant rated value, which can be obtained from a look-up table or assigned directly. Above rated speed, the back electromotive force rises with speed until it reaches the maximum voltage the inverter can supply. In this case, field-weakening control actively reduces the flux linkage reference value to keep the required voltage within the available range, allowing the motor to continue accelerating. The field-weakening algorithm dynamically calculates a suitable flux linkage setpoint based on the current speed and the voltage margin.
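As a rough illustration, a flux reference that is constant below base speed and decreases above it can be written as below. The simple 1/|ω| law is a commonly used stand-in, not the speed- and voltage-margin-based field-weakening algorithm the text refers to; `omega_base` and `psi_rated` are assumed parameters.

```python
def flux_reference(omega, psi_rated, omega_base):
    """Rotor flux reference: constant below base speed, proportional to 1/|omega| above it.

    The 1/omega law is a simplified stand-in for a voltage-margin-based
    field-weakening algorithm.
    """
    w = abs(omega)
    if w <= omega_base:
        # Constant-flux region: use the rated flux linkage.
        return psi_rated
    # Field-weakening region: scale the flux so that speed * flux stays roughly constant.
    return psi_rated * omega_base / w
```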

[0120] The reference value of the stator current q-axis component is calculated from the torque reference value and the rotor flux linkage reference value, and the reference value of the stator current d-axis component is calculated from the rotor flux linkage reference value and the mutual inductance $L_m$.
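Since the formulas of [0120] are not reproduced here, the sketch below assumes the standard rotor-flux-oriented relations i_d* = ψ_r* / L_m and T_e = (3/2) n_p (L_m / L_r) ψ_r i_q, solved for i_q*. The rotor inductance `l_r` and the pole-pair count `pole_pairs` are assumptions not named in this passage; in this sketch the measured speed would enter only through the flux reference.

```python
def current_references(t_ref, psi_ref, l_m, l_r, pole_pairs):
    """Stator current d/q references from torque and rotor flux references.

    Assumes steady-state rotor flux orientation and amplitude-invariant transforms:
        i_d_ref = psi_ref / L_m
        i_q_ref = 2 * L_r * T_ref / (3 * n_p * L_m * psi_ref)
    """
    i_d_ref = psi_ref / l_m
    i_q_ref = 2.0 * l_r * t_ref / (3.0 * pole_pairs * l_m * psi_ref)
    return i_d_ref, i_q_ref
```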

[0121] S204. Input the state vector, composed of the measured values of the stator current d-axis and q-axis components, the stator current d-axis and q-axis component deviations, the integrals of the stator current d-axis and q-axis component deviations, the measured speed value, and the speed reference value, into the master policy learning network of the TD3 algorithm model for inference to obtain the reference values of the stator voltage d-axis and q-axis components.

[0122] In this embodiment of the invention, the state vector $s = [i_d, i_q, e_d, e_q, \int e_d\,dt, \int e_q\,dt, \omega, \omega^*]$, composed of the measured stator current d-axis component $i_d$, the measured stator current q-axis component $i_q$, the stator current d-axis deviation $e_d$, the stator current q-axis deviation $e_q$, the integrals of these two deviations, the speed measurement value $\omega$, and the speed reference value $\omega^*$, is input into the master policy learning network (Actor) of the TD3 algorithm model, which infers the action vector. The action vector contains the stator voltage d-axis reference value $u_d^*$ and the stator voltage q-axis reference value $u_q^*$, i.e., $a = [u_d^*, u_q^*]$.
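The state assembly and Actor inference of S204 can be sketched in plain Python. The sign convention (deviation = reference minus measurement), the forward-Euler integration of the deviations, and the single tanh layer scaled by `u_max` are illustrative assumptions; a practical TD3 Actor would typically be a multi-layer network with trained weights.

```python
import math


class StateBuilder:
    """Builds the 8-element state vector of S204 (hypothetical helper).

    Assumes deviation = reference - measurement and forward-Euler integration
    of the deviations with sample time ts.
    """

    def __init__(self, ts):
        self.ts = ts
        self.int_e_d = 0.0
        self.int_e_q = 0.0

    def build(self, i_d, i_q, i_d_ref, i_q_ref, omega, omega_ref):
        e_d, e_q = i_d_ref - i_d, i_q_ref - i_q
        # Running integrals of the current deviations.
        self.int_e_d += e_d * self.ts
        self.int_e_q += e_q * self.ts
        return [i_d, i_q, e_d, e_q, self.int_e_d, self.int_e_q, omega, omega_ref]


def actor(state, weights, biases, u_max):
    """Hypothetical single-layer Actor: [u_d_ref, u_q_ref] = u_max * tanh(W s + b)."""
    return [u_max * math.tanh(sum(w * s for w, s in zip(row, state)) + b)
            for row, b in zip(weights, biases)]
```

The tanh output bounds each voltage reference to [-`u_max`, `u_max`], a common way to respect inverter limits in continuous-action policies.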

[0123] S205. Perform an inverse Park transform on the reference values of the stator voltage d-axis and q-axis components to obtain the reference values of the stator voltage α-axis and β-axis components.

[0124] In this embodiment of the invention, the stator voltage d-axis reference value $u_d^*$ and q-axis reference value $u_q^*$ are input into the inverse Park transform module, which performs the inverse Park transform to obtain the stator voltage α-axis reference value $u_\alpha^*$ and β-axis reference value $u_\beta^*$.
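The inverse Park transform of S205 is a rotation by θ. A minimal sketch, assuming the same d-axis alignment and angle convention as the forward Park transform (the function name is hypothetical):

```python
import math


def inverse_park(u_d, u_q, theta):
    """Inverse Park transform (d/q -> alpha/beta): rotate the dq vector by +theta."""
    c, s = math.cos(theta), math.sin(theta)
    u_alpha = u_d * c - u_q * s
    u_beta = u_d * s + u_q * c
    return u_alpha, u_beta
```

Because it is a pure rotation, the voltage vector magnitude is unchanged, so voltage limits applied in the dq frame carry over directly to the αβ frame.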

[0125] S206. Perform space vector pulse width modulation based on the reference values of the stator voltage α-axis and β-axis components to obtain a pulse width modulation signal, which is used to control the operation of the inverter driving the induction motor.

[0126] In this embodiment of the invention, the stator voltage α-axis and β-axis reference values are input into the SVPWM module, which performs space vector pulse width modulation to generate a pulse width modulation signal. This signal is input to a three-phase inverter, which converts the input DC power into AC power to drive the induction motor.
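A compact way to obtain SVPWM duty cycles, shown here as a substitute for the sector-based calculation an SVPWM module may use, is min-max zero-sequence injection: the αβ references are converted to phase references and shifted by the midpoint of their maximum and minimum, which is widely used to reproduce the centred pattern of symmetric SVPWM. The DC-link voltage `v_dc` and the function name are assumptions.

```python
import math


def svpwm_duties(u_alpha, u_beta, v_dc):
    """Phase duty cycles in [0, 1] via min-max zero-sequence injection."""
    # Inverse amplitude-invariant Clarke: alpha/beta -> phase references.
    v_a = u_alpha
    v_b = -0.5 * u_alpha + (math.sqrt(3.0) / 2.0) * u_beta
    v_c = -0.5 * u_alpha - (math.sqrt(3.0) / 2.0) * u_beta
    # Zero-sequence offset centres the three references within the DC link.
    v_offset = -(max(v_a, v_b, v_c) + min(v_a, v_b, v_c)) / 2.0
    # Map each pole voltage from [-v_dc/2, v_dc/2] to a duty cycle, clamped for overmodulation.
    return tuple(min(1.0, max(0.0, 0.5 + (v + v_offset) / v_dc))
                 for v in (v_a, v_b, v_c))
```

The offset is common to all three phases, so it cancels in the line-to-line voltages seen by the motor while extending the linear range to |u| = v_dc/√3.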

[0127] S207. Repeat the above steps until a stop signal from the induction motor is received.

[0128] The induction motor control method provided by this invention is based on a master policy learning network trained using the training method described in the first aspect of this invention. After training, the master policy learning network predicts the reference values of the stator voltage d-axis and q-axis components during operation of the induction motor. By continuously learning and updating the parameters of the master policy learning network, an optimized nonlinear control strategy is obtained. The trained master policy learning network replaces the conventional current PI regulator, avoiding the low control accuracy caused by the design limitations of PI regulator parameters and thereby improving the control accuracy, torque response, and operational stability of the induction motor. Furthermore, this invention uses the outer-loop speed variable (i.e., the measured speed value) as an observation for reinforcement learning, achieving deep coordination between the outer loop (speed control loop) and the inner loop (current loop). This integrates two independent control levels into a holistically optimized intelligent controller, yielding superior system-level dynamic performance and further improving the control accuracy of the induction motor.

[0129] This invention also provides a training system for an induction motor control model. Figure 4 is a schematic structural diagram of the training system for the induction motor control model provided by the present invention. As shown in Figure 4, the training system for the induction motor control model includes: a digital signal processor (DSP), which includes a replay buffer used to store the experience pool.

[0130] A field-programmable gate array (FPGA), which is connected to the digital signal processor through a communication interface.

[0131] The server is used to execute the training method as described in any of the foregoing embodiments of the present invention and to deploy the trained master policy learning network to a field-programmable gate array.

[0132] The server communicates with the digital signal processor (DSP) to perform functions such as data collection, remote monitoring, model training, and collaborative optimization. Cloud communication transforms the induction motor (IM) control system from an isolated intelligent device into a continuously learning and collaboratively optimizing intelligent network. This not only solves the problem of limited edge computing capabilities but also opens up new possibilities such as adaptive control, predictive maintenance, and global optimization. In some embodiments of this invention, the server can periodically execute the above training process and update the parameters of the master policy learning network (Actor) in the field-programmable gate array (FPGA), enabling the parameters of the master policy learning network (Actor) to dynamically match the actual operating environment of the induction motor.

[0133] In some embodiments of the present invention, as shown in Figure 4, the digital signal processor further includes a speed PI regulator and a reward calculation module, and the field-programmable gate array further includes an inverse Park transform module, a space vector pulse width modulation (SVPWM) module, an analog-to-digital converter (ADC) module, a Clarke transform module, a Park transform module (Park transform and per-unit), and a speed and position calculation module. The analog-to-digital converter module is used to acquire the three-phase current of the induction motor and convert it into digital signals.

[0134] The speed and position calculation module is used to calculate the rotor's position and rotational speed measurements based on the sensor's acquired signals and three-phase current. The specific calculation process can be found in the foregoing embodiments, and will not be repeated here.

[0135] The Clarke transform module is used to perform a Clarke transform on the digital three-phase current signals to obtain the measured values of the stator current α-axis and β-axis components.

[0136] The Park transform module is used to perform a Park transform on the measured values of the stator current α-axis and β-axis components to obtain the measured values of the stator current d-axis and q-axis components. In one specific embodiment, the Park transform module also converts the measured values of the stator current d-axis and q-axis components into per-unit values.

[0137] The speed PI regulator is used to calculate the torque reference value from the measured speed value and the speed reference value, and then to calculate the reference values of the stator current d-axis and q-axis components from the torque reference value, the rotor flux linkage reference value, and the measured speed value. The specific calculation process can be found in the foregoing embodiments and will not be repeated here.

[0138] The reward calculation module is used to calculate the reward for taking the action vector under the state vector of the current time step from the stator current d-axis deviation, the stator current q-axis deviation, the stator voltage d-axis reference value, and the stator voltage q-axis reference value, and to calculate, as the return, the cumulative discounted reward generated over all time steps of each round within the preset number of rounds. The specific calculation process can be found in the foregoing embodiments and will not be repeated here.

[0139] The present invention also provides a training device for an induction motor control model, comprising: an experience sample acquisition module, used to acquire the experience samples required for training from the experience pool, where each experience sample includes the state vector of the current time step, the action vector inferred by the master policy learning network of the TD3 algorithm model from the state vector of the current time step, the state vector of the next time step, and the reward for taking that action vector under the state vector of the current time step; the state vector includes the measured stator current d-axis component, the measured stator current q-axis component, the stator current d-axis deviation, the stator current q-axis deviation, the integral of the stator current d-axis deviation, the integral of the stator current q-axis deviation, the measured speed value, and the speed reference value; and the action vector includes the stator voltage d-axis reference value and the stator voltage q-axis reference value; and a model training module, used to train the TD3 algorithm model with the experience samples and update its parameters until the return of the TD3 algorithm model is greater than the return threshold or the number of training rounds reaches the maximum number of training rounds. After training, the master policy learning network is used to predict the stator voltage d-axis and q-axis reference values during operation of the induction motor.

[0140] In some embodiments of the present invention, the TD3 algorithm model includes a master policy learning network, a first value evaluation network, a second value evaluation network, a target policy learning network, a first target value evaluation network, and a second target value evaluation network. The model training module includes: a sampling submodule, used to sample from the experience pool to obtain batch sample data including multiple experience samples; a first parameter update submodule, used to input the batch sample data into the TD3 algorithm model for training and update the parameters of the first and second value evaluation networks; a second parameter update submodule, used to update the parameters of the master policy learning network each time the parameters of the first and second value evaluation networks have been updated a preset number of times; a third parameter update submodule, used to update the parameters of the target policy learning network and the first and second target value evaluation networks by soft update; and a repeat execution submodule, used to repeat the above steps until the return of the TD3 algorithm model is greater than the return threshold or the number of training rounds reaches the maximum number of training rounds. Each round includes multiple time steps, and each time step covers the full cycle in which the TD3 algorithm model receives an experience sample, infers an action vector, and returns the reward and the state vector of the next time step.
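Two of the mechanisms above, the soft (Polyak) update of the target networks and the delayed policy update, are sketched below on parameters represented as flat lists of floats. The list representation, `tau`, and `policy_delay` are illustrative assumptions; a real implementation would apply the same rules to each network's weight tensors.

```python
def soft_update(target_params, online_params, tau):
    """Polyak averaging: theta_target <- tau * theta + (1 - tau) * theta_target."""
    return [tau * p + (1.0 - tau) * tp for p, tp in zip(online_params, target_params)]


def actor_update_due(critic_update_count, policy_delay):
    """Delayed policy update: update the Actor once every `policy_delay` critic updates."""
    return critic_update_count % policy_delay == 0
```

A small `tau` (for example on the order of 0.005) makes the target networks trail the online networks slowly, which stabilizes the bootstrapped critic targets.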

[0141] In some embodiments of the present invention, the first parameter update submodule includes: a first action vector calculation unit, used to input the state vector of the next time step into the target policy learning network for inference to obtain a first action vector; a noise-adding unit, used to add clipped noise to the first action vector to obtain a target action vector; a first action value calculation unit, used to input the target action vector into the first target value evaluation network to calculate a first action value; a second action value calculation unit, used to input the target action vector into the second target value evaluation network to calculate a second action value; a value-taking unit, used to take the smaller of the first and second action values as the target action value; a target evaluation value calculation unit, used to calculate the discounted sum of the reward and the target action value to obtain the target evaluation value; a third action value calculation unit, used to input the action vector corresponding to the state vector of the current time step into the first value evaluation network to calculate a third action value; a fourth action value calculation unit, used to input the action vector corresponding to the state vector of the current time step into the second value evaluation network to calculate a fourth action value; a first parameter update unit, used to compute, by gradient descent, the parameters of the first value evaluation network that minimize the error between the target evaluation value and the third action value, and to update the first value evaluation network accordingly;
and a second parameter update unit, used to compute, by gradient descent, the parameters of the second value evaluation network that minimize the error between the target evaluation value and the fourth action value, and to update the second value evaluation network accordingly.
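The target computation of [0141] corresponds to TD3's target policy smoothing and clipped double-Q learning. The sketch below shows it for a single sample, with Gaussian noise of standard deviation `sigma` clipped to ±`noise_clip` and actions clipped to [`a_low`, `a_high`]. The Gaussian distribution, the optional terminal flag `done`, and the mean-squared form of the critic error are assumptions, since the text does not specify the noise distribution or the error measure.

```python
import random


def smoothed_target_action(action, sigma, noise_clip, a_low, a_high, rng=random):
    """Target policy smoothing: add clipped Gaussian noise, then clip to action bounds."""
    out = []
    for a in action:
        eps = max(-noise_clip, min(noise_clip, rng.gauss(0.0, sigma)))
        out.append(max(a_low, min(a_high, a + eps)))
    return out


def td3_target(reward, q1_next, q2_next, gamma, done=False):
    """Clipped double-Q target y = r + gamma * min(Q1', Q2'), without bootstrap at terminal."""
    return reward + (0.0 if done else gamma * min(q1_next, q2_next))


def critic_loss(targets, q_values):
    """Mean squared error between target evaluation values and critic outputs."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / len(targets)
```

Taking the minimum of the two target critics counteracts the overestimation bias of a single bootstrapped critic, and the smoothing noise keeps the critic from exploiting sharp peaks in its own value estimate.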

[0142] In some embodiments of the present invention, the second parameter update submodule includes: a second action vector calculation unit, used to input the state vector of the next time step into the master policy learning network for inference to obtain a second action vector once the parameters of the first and second value evaluation networks have been updated a preset number of times; a fifth action value calculation unit, used to input the second action vector into the first value evaluation network to calculate a fifth action value; and a third parameter update unit, used to compute, by gradient ascent, the parameters of the master policy learning network that maximize the fifth action value, and to update the master policy learning network accordingly.

[0143] In some embodiments of the present invention, the repeat execution submodule includes: a first judgment unit, used to determine, after the parameters of the target policy learning network and the first and second target value evaluation networks have been soft-updated, whether the time step of the current round has reached the maximum number of time steps per round; a second judgment unit, used to end the current round if that maximum has been reached and to determine whether the number of training rounds has reached the evaluation interval; a first execution unit, used to return to the step of sampling from the experience pool to obtain batch sample data including multiple experience samples if that maximum has not been reached; a second execution unit, used to perform the step of determining whether the number of training rounds is greater than the maximum number of training rounds if the evaluation interval has not been reached; a model running unit, used to retrieve test samples from the experience pool and run the TD3 algorithm model for a preset number of rounds if the evaluation interval has been reached; a reward calculation unit, used to calculate, as the return, the cumulative discounted reward generated over all time steps of each of the preset rounds; a mean calculation unit, used to calculate the average of all returns over the preset rounds; a third judgment unit, used to determine whether the average of all returns is greater than the return threshold; a first termination unit, used to terminate the training process if it is;
a fourth judgment unit, used to determine, if it is not, whether the current training round number is greater than the maximum number of training rounds; a second termination unit, used to terminate the training process if it is; and a third execution unit, used, if it is not, to return to the step of sampling from the experience pool to obtain batch sample data including multiple experience samples.

[0144] In some embodiments of the present invention, the sampling submodule includes: an initialization unit, used to initialize the three-phase current, rotor speed measurement value, and rotor position measurement value of the induction motor; a first calculation unit, used to calculate the measured stator current d-axis and q-axis components from the three-phase current and the rotor position measurement value; a second calculation unit, used to input the measured speed value and the speed reference value into the speed PI regulator to calculate the stator current d-axis and q-axis reference values; a third calculation unit, used to input the state vector, composed of the measured stator current d-axis and q-axis components, the stator current d-axis and q-axis deviations, the integrals of those deviations, the speed measurement value, and the speed reference value, into the master policy learning network of the TD3 algorithm model for inference to obtain the stator voltage d-axis and q-axis reference values; a fourth calculation unit, used to perform an inverse Park transform on the stator voltage d-axis and q-axis reference values to obtain the stator voltage α-axis and β-axis reference values;
a modulation unit, used to perform space vector pulse width modulation based on the stator voltage α-axis and β-axis reference values to obtain a pulse width modulation signal for controlling the inverter of the induction motor; a reward calculation unit, used to calculate the reward for taking the action vector at the current time step from the stator current d-axis and q-axis deviations and the stator voltage d-axis and q-axis reference values; a fourth execution unit, used to collect the three-phase current, rotor speed measurement value, and rotor position measurement value of the induction motor and return to the step of calculating the measured stator current d-axis and q-axis components from the three-phase current and the rotor position measurement value, so as to obtain the state vector of the next time step; an experience storage unit, used to combine the state vector of the current time step, the action vector composed of the stator voltage d-axis and q-axis reference values, the reward, and the state vector of the next time step into an experience sample and store it in the experience pool; a fifth judgment unit, used to determine whether the number of experience samples in the experience pool has reached the preset number required for training; a sampling unit, used to sample from the experience pool to obtain batch sample data including multiple experience samples if it has; and a fifth execution unit, used, if it has not, to return to the step of initializing the three-phase current, rotor speed measurement value, and rotor position measurement value of the induction motor until the experience pool holds the preset number of samples required for training.
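The experience pool used by the sampling submodule can be sketched as a bounded FIFO buffer. The capacity bound, the `Experience` field names, and uniform random sampling are assumptions; the text specifies only that samples are stored until a preset number is reached and then sampled in batches.

```python
import random
from collections import deque, namedtuple

# Hypothetical record layout: (s_k, a_k, r_k, s_{k+1}).
Experience = namedtuple("Experience", "state action reward next_state")


class ExperiencePool:
    """Bounded replay buffer; the oldest samples are discarded when full."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def store(self, state, action, reward, next_state):
        self.buffer.append(Experience(state, action, reward, next_state))

    def ready(self, min_size):
        """True once the pool holds the preset number of samples required for training."""
        return len(self.buffer) >= min_size

    def sample(self, batch_size, rng=random):
        """Uniformly sample a batch without replacement."""
        return rng.sample(list(self.buffer), batch_size)
```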

[0145] In some embodiments of the present invention, the reward for taking the action vector at the current time step is calculated from the stator current d-axis deviation, the stator current q-axis deviation, the stator voltage d-axis reference value, and the stator voltage q-axis reference value. The reward has the form $r_k = f\left(e_d(k), e_q(k), u_d^*(k-1), u_q^*(k-1)\right)$ with hyperparameters $Q_1$, $Q_2$, and $R$, where $r_k$ is the reward for taking the action vector under the state vector of the k-th time step, $e_d(k)$ and $e_q(k)$ are the stator current d-axis and q-axis deviations at the k-th time step, and $u_d^*(k-1)$ and $u_q^*(k-1)$ are the stator voltage d-axis and q-axis reference values at the (k-1)-th time step.
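Because the reward formula itself is not reproduced here, the following sketch assumes an LQR-style negative quadratic penalty on the current deviations and on the previous step's voltage references. The quadratic form and the sign are assumptions suggested by the Q/R naming of the hyperparameters, not a confirmed reading of the patent.

```python
def reward(e_d, e_q, u_d_prev, u_q_prev, q1, q2, r):
    """Hypothetical reward r_k = -(Q1*e_d^2 + Q2*e_q^2 + R*(u_d_prev^2 + u_q_prev^2)).

    e_d, e_q: stator current d/q deviations at step k.
    u_d_prev, u_q_prev: stator voltage d/q references from step k-1.
    """
    tracking_penalty = q1 * e_d ** 2 + q2 * e_q ** 2
    effort_penalty = r * (u_d_prev ** 2 + u_q_prev ** 2)
    return -(tracking_penalty + effort_penalty)
```

Under this assumed form, larger `q1`/`q2` prioritize current tracking while a larger `r` discourages aggressive voltage commands; the best attainable reward is zero.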

[0146] The training device for the induction motor control model described above can execute the training method for the induction motor control model provided in the foregoing embodiments of the present invention, and has the functional modules and beneficial effects corresponding to that method.

[0147] The present invention also provides an induction motor control device based on a master policy learning network trained according to any of the foregoing embodiments of the present invention, comprising: a parameter determination module, used to determine the three-phase current, rotor speed measurement value, and rotor position measurement value of the induction motor during its operation; a measurement value calculation module, used to calculate the measured stator current d-axis and q-axis components from the three-phase current and the rotor position measurement value; a reference value calculation module, used to input the measured speed value and the speed reference value into the speed PI regulator to obtain the stator current d-axis and q-axis reference values; an inference module, used to input the state vector, composed of the measured stator current d-axis and q-axis components, the stator current d-axis and q-axis deviations, the integrals of those deviations, the speed measurement value, and the speed reference value, into the master policy learning network of the TD3 algorithm model for inference to obtain the stator voltage d-axis and q-axis reference values; an inverse Park transform module, used to perform an inverse Park transform on the stator voltage d-axis and q-axis reference values to obtain the stator voltage α-axis and β-axis reference values;
a space vector modulation module, used to perform space vector pulse width modulation based on the stator voltage α-axis and β-axis reference values to obtain a pulse width modulation signal for controlling the inverter of the induction motor; and a repeat execution module, used to repeat the above steps until a stop signal of the induction motor is received.

[0148] The aforementioned induction motor control device can execute the induction motor control method provided in the foregoing embodiments of the present invention, and has the corresponding functional modules and beneficial effects for executing the induction motor control method.

[0149] Figure 5 is a schematic structural diagram of an electronic device provided by the present invention. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device can also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smartphones, wearable devices (such as helmets, glasses, and watches), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementations of the invention described and/or claimed herein.

[0150] As shown in Figure 5, the electronic device includes at least one processor and a memory communicatively connected to the at least one processor, such as a read-only memory (ROM) or a random access memory (RAM). The memory stores computer programs executable by the at least one processor. The processor can perform various appropriate actions and processes according to the computer programs stored in the ROM or loaded from a storage unit into the RAM. The RAM may also store various programs and data required for the operation of the electronic device. The processor, ROM, and RAM are interconnected via a bus, to which an input/output (I/O) interface is also connected.

[0151] Multiple components of the electronic device are connected to the I/O interface, including: an input unit, such as a keyboard or mouse; an output unit, such as various types of displays and speakers; a storage unit, such as a magnetic disk or optical disc; and a communication unit, such as a network interface card (NIC), modem, or wireless transceiver. The communication unit allows the electronic device to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.

[0152] The processor may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Examples of the processor include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, or microcontroller. The processor performs the methods and processes described above, such as the training method for the induction motor control model or the induction motor control method.

[0153] In some embodiments, the training method for the induction motor control model or the induction motor control method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as the storage unit. In some embodiments, part or all of the computer program may be loaded and/or installed on the electronic device via the ROM and/or the communication unit. When the computer program is loaded into the RAM and executed by the processor, one or more steps of the training method for the induction motor control model or the induction motor control method described above may be performed. Alternatively, in other embodiments, the processor may be configured to perform the training method for the induction motor control model or the induction motor control method by any other suitable means (e.g., by means of firmware).

[0154] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.

[0155] Computer programs used to implement the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, such that when executed by the processor, the computer programs cause the functions/operations specified in the flowcharts and/or block diagrams to be performed. The computer programs may be executed entirely on a machine, partially on a machine, partially on a machine as a standalone software package and partially on a remote machine, or entirely on a remote machine or server.

[0156] In the context of this invention, a computer-readable storage medium can be a tangible medium that may contain or store a computer program for use by or in conjunction with an instruction execution system, apparatus, or device. A computer-readable storage medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination thereof. Alternatively, the computer-readable medium may be a machine-readable signal medium. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.

[0157] To provide interaction with a user, the systems and techniques described herein can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the electronic device. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).

[0158] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as data servers), or middleware components (e.g., application servers), or frontend components (e.g., user computers with graphical user interfaces or web browsers through which users can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., communication networks). Examples of communication networks include local area networks (LANs), wide area networks (WANs), blockchain networks, and the Internet.

[0159] A computing system can include clients and servers. Clients and servers are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs running on the respective computers and having a client-server relationship with each other. The server can be a cloud server, also known as a cloud computing server or cloud host, which is a hosting product in the cloud computing service system that addresses the shortcomings of traditional physical hosts and virtual private server (VPS) services, such as difficult management and weak business scalability.

[0160] This invention also provides a computer program product, including a computer program that, when executed by a processor, implements the training method for the induction motor control model or the induction motor control method provided in any embodiment of this application.

[0161] In implementing the computer program product, computer program code for performing the operations of this invention can be written in one or more programming languages or a combination thereof. Programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as C or similar languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (e.g., via the Internet using an Internet service provider).

[0162] It should be understood that the various forms of processes shown above can be used, with steps reordered, added, or deleted. For example, the steps described in this invention can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution of this invention can be achieved, and this is not limited herein.

[0163] The specific embodiments described above do not constitute a limitation on the scope of protection of this invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this invention should be included within the scope of protection of this invention.

Claims

1. A training method for an induction motor control model, characterized by comprising: acquiring experience samples from an experience pool, wherein each experience sample includes the state vector at the current time step, the action vector inferred by the main policy learning network of a TD3 algorithm model from the state vector at the current time step, the state vector at the next time step, and the reward for taking the action vector under the state vector at the current time step; the state vector includes the measured value of the stator current d-axis component, the measured value of the stator current q-axis component, the stator current d-axis component deviation, the stator current q-axis component deviation, the integral of the stator current d-axis component deviation, the integral of the stator current q-axis component deviation, the measured rotational speed, and the reference rotational speed; and the action vector includes the reference value of the stator voltage d-axis component and the reference value of the stator voltage q-axis component; and training the TD3 algorithm model with the experience samples and updating its parameters until the reward of the TD3 algorithm model exceeds a reward threshold or the number of training rounds reaches the maximum number of training rounds; wherein the trained main policy learning network is used to predict the reference values of the stator voltage d-axis and q-axis components during operation of the induction motor.
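The experience sample of claim 1 can be pictured as a fixed-layout record stored in a bounded pool. The sketch below is illustrative only: the field names (`i_d`, `e_d`, `int_e_d`, `omega_ref`, and so on), the class names `Experience` and `ExperiencePool`, and the oldest-first eviction policy are assumptions, not details taken from the filing.

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Tuple

# Element order follows claim 1; the short names are hypothetical labels.
STATE_FIELDS = (
    "i_d", "i_q",          # measured stator current d/q components
    "e_d", "e_q",          # stator current d/q component deviations
    "int_e_d", "int_e_q",  # integrals of the d/q deviations
    "omega", "omega_ref",  # measured and reference rotational speed
)
ACTION_FIELDS = ("u_d_ref", "u_q_ref")  # stator voltage d/q reference values


@dataclass(frozen=True)
class Experience:
    state: Tuple[float, ...]       # state vector at time step k
    action: Tuple[float, ...]      # action inferred by the main policy network
    reward: float                  # reward for taking `action` in `state`
    next_state: Tuple[float, ...]  # state vector at time step k + 1


class ExperiencePool:
    """Bounded FIFO pool: once full, the oldest sample is dropped (assumed policy)."""

    def __init__(self, capacity: int) -> None:
        self._buf: Deque[Experience] = deque(maxlen=capacity)

    def add(self, exp: Experience) -> None:
        n = len(STATE_FIELDS)
        if len(exp.state) != n or len(exp.next_state) != n:
            raise ValueError(f"state vectors must have {n} elements")
        if len(exp.action) != len(ACTION_FIELDS):
            raise ValueError(f"action vector must have {len(ACTION_FIELDS)} elements")
        self._buf.append(exp)

    def __len__(self) -> int:
        return len(self._buf)

    def sample(self, batch_size: int, rng: random.Random) -> List[Experience]:
        """Uniform random mini-batch without replacement."""
        return rng.sample(list(self._buf), batch_size)
```

In a training loop, sampling would typically begin only after `len(pool)` reaches the preset number of samples required for training, which is the check described in claim 6.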

2. The training method for the induction motor control model according to claim 1, characterized in that the TD3 algorithm model includes a main policy learning network, a first value evaluation network, a second value evaluation network, a target policy learning network, a first target value evaluation network, and a second target value evaluation network; and training the TD3 algorithm model with the experience samples and updating its parameters until the reward of the TD3 algorithm model exceeds the reward threshold or the number of training rounds reaches the maximum number of training rounds includes: sampling from the experience pool to obtain batch sample data comprising multiple experience samples; inputting the batch sample data into the TD3 algorithm model for training, and updating the parameters of the first value evaluation network and the second value evaluation network; updating the parameters of the main policy learning network each time the parameters of the first and second value evaluation networks have been updated a preset number of times; updating the parameters of the target policy learning network, the first target value evaluation network, and the second target value evaluation network by soft update; and repeating the above steps until the reward of the TD3 algorithm model exceeds the reward threshold or the number of training rounds reaches the maximum number of training rounds. Each round includes multiple time steps, and each time step covers the entire process in which the TD3 algorithm model receives an experience sample, infers an action vector, and returns the reward and the state vector of the next time step.
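The two scheduling ideas in claim 2, delayed policy updates and soft updates of the target networks, can be sketched with plain parameter lists. This is a minimal sketch under stated assumptions: parameters are flattened into Python lists, and the helper names (`soft_update`, `should_update_policy`, `update_schedule`) are hypothetical. In the reference TD3 algorithm the target networks are soft-updated inside the delayed policy-update branch; claim 2 lists the soft update as its own step, and the schedule below follows the claim's listed order.

```python
from typing import List


def soft_update(target: List[float], online: List[float], tau: float) -> None:
    """Soft (Polyak) update in place: target <- tau * online + (1 - tau) * target."""
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must lie in (0, 1]")
    if len(target) != len(online):
        raise ValueError("parameter vectors must have the same length")
    for i, (t, o) in enumerate(zip(target, online)):
        target[i] = tau * o + (1.0 - tau) * t


def should_update_policy(critic_updates: int, policy_delay: int) -> bool:
    """Delayed policy update: True on every `policy_delay`-th critic update."""
    return critic_updates > 0 and critic_updates % policy_delay == 0


def update_schedule(num_iterations: int, policy_delay: int) -> List[str]:
    """Order of updates per iteration, following the step order listed in claim 2."""
    log: List[str] = []
    for k in range(1, num_iterations + 1):
        log.append("critics")      # first and second value evaluation networks
        if should_update_policy(k, policy_delay):
            log.append("policy")   # main policy learning network
        log.append("targets")      # soft update of the three target networks
    return log
```

A small `tau` (for example 0.005) makes the target networks track the online networks slowly, which is what stabilises the bootstrapped critic targets.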

3. The training method for the induction motor control model according to claim 2, characterized in that inputting the batch sample data into the TD3 algorithm model for training and updating the parameters of the first value evaluation network and the second value evaluation network includes: inputting the state vector of the next time step into the target policy learning network for inference to obtain a first action vector; adding truncated (clipped) noise to the first action vector to obtain a target action vector; inputting the target action vector into the first target value evaluation network to calculate a first action value of the first action vector; inputting the target action vector into the second target value evaluation network to calculate a second action value of the first action vector; taking the smaller of the first action value and the second action value as the target action value; calculating the sum of the reward and the discounted target action value to obtain a target evaluation value; inputting the action vector corresponding to the state vector at the current time step into the first value evaluation network to calculate a third action value of that action vector; inputting the action vector corresponding to the state vector at the current time step into the second value evaluation network to calculate a fourth action value of that action vector; using a gradient descent algorithm to calculate the parameters of the first value evaluation network that minimize the error between the target evaluation value and the third action value, and updating the parameters of the first value evaluation network accordingly; and using a gradient descent algorithm to calculate the parameters of the second value evaluation network that minimize the error between the target evaluation value and the fourth action value, and updating the parameters of the second value evaluation network accordingly.
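The target computation in claim 3 (target policy smoothing with clipped noise, then the minimum over two target critics) can be sketched with the networks abstracted as plain callables. Assumptions: the critics receive the state as well as the action (standard TD3 practice, although the claim mentions only the action), actions are normalised to `[-action_limit, action_limit]`, terminal-state masking is omitted, and the default hyperparameters are common TD3 values chosen for illustration, not values from the filing.

```python
import random
from typing import Callable, Sequence, Tuple

Vector = Tuple[float, ...]
Critic = Callable[[Vector, Vector], float]


def _clip(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def td3_target(
    reward: float,
    next_state: Vector,
    target_policy: Callable[[Vector], Vector],
    target_q1: Critic,
    target_q2: Critic,
    rng: random.Random,
    gamma: float = 0.99,
    noise_std: float = 0.2,
    noise_clip: float = 0.5,
    action_limit: float = 1.0,
) -> float:
    """Target evaluation value y = r + gamma * min(Q1'(s', a~), Q2'(s', a~))."""
    # First action vector: target policy applied to the next-step state.
    a1 = target_policy(next_state)
    # Target action vector: add clipped Gaussian noise, then keep within limits.
    a_target = tuple(
        _clip(a + _clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip),
              -action_limit, action_limit)
        for a in a1
    )
    # First/second action values from the two target critics; keep the smaller.
    q_min = min(target_q1(next_state, a_target), target_q2(next_state, a_target))
    # Reward plus discounted target action value.
    return reward + gamma * q_min


def critic_loss(q_values: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error that gradient descent minimises for each value network."""
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)
```

Taking the smaller of the two target values counters the overestimation bias of a single critic, and clipping the noise keeps the smoothed target action close to the target policy's output.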

4. The training method for the induction motor control model according to claim 2, characterized in that updating the parameters of the main policy learning network each time the parameters of the first and second value evaluation networks have been updated a preset number of times includes: when the parameters of the first value evaluation network and the second value evaluation network have been updated the preset number of times, inputting the state vector of the next time step into the main policy learning network for inference to obtain a second action vector; inputting the second action vector into the first value evaluation network to calculate a fifth action value of the second action vector; and using a gradient ascent algorithm to calculate the parameters of the main policy learning network that maximize the fifth action value, and updating the parameters of the main policy learning network accordingly.
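The gradient-ascent step of claim 4 follows the deterministic policy gradient: the gradient of the first critic with respect to the action is chained through the policy's parameters. To keep the sketch self-contained, the main policy network is replaced by a hypothetical linear map `a = W s`, and the critic is represented only by a callable returning its action-gradient; neither stand-in comes from the filing. Which batch of states is passed in (the claim specifies next-time-step states) is left to the caller.

```python
from typing import Callable, List, Sequence

Matrix = List[List[float]]
ActionGrad = Callable[[Sequence[float], Sequence[float]], Sequence[float]]


def linear_policy(W: Matrix, s: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for the main policy network: a = W s."""
    return [sum(w * x for w, x in zip(row, s)) for row in W]


def actor_step(W: Matrix, states: Sequence[Sequence[float]],
               dq1_da: ActionGrad, lr: float) -> Matrix:
    """One gradient-ascent step on the batch mean of Q1(s, pi(s)).

    dq1_da(s, a) returns dQ1/da from the first value network. For a = W s the
    chain rule gives dQ1/dW[i][j] = dQ1/da[i] * s[j].
    """
    rows, cols = len(W), len(W[0])
    grad = [[0.0] * cols for _ in range(rows)]
    for s in states:
        g = dq1_da(s, linear_policy(W, s))
        for i in range(rows):
            for j in range(cols):
                grad[i][j] += g[i] * s[j] / len(states)
    # Ascent (+), because the policy maximises the fifth action value.
    return [[W[i][j] + lr * grad[i][j] for j in range(cols)] for i in range(rows)]
```

With a toy critic `Q1 = -(a - 3)**2` and the single state `[1.0]`, repeated calls drive the one weight toward 3, the action that maximises the critic.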

5. The training method for the induction motor control model according to claim 2, characterized in that repeating the above steps until the reward of the TD3 algorithm model exceeds the reward threshold or the number of training rounds reaches the maximum number of training rounds includes: after updating the parameters of the target policy learning network, the first target value evaluation network, and the second target value evaluation network by soft update, determining whether the time step of the current round has reached the maximum number of time steps for a single round; if it has, ending the training for the current round and determining whether the number of training rounds has reached the evaluation interval; if it has not, returning to the step of sampling from the experience pool to obtain batch sample data comprising multiple experience samples; if the number of training rounds has not reached the evaluation interval, proceeding to determine whether the current round count is greater than the maximum number of training rounds; if the number of training rounds has reached the evaluation interval, taking a test sample from the experience pool and running the TD3 algorithm model for a preset number of rounds; calculating, as the return of each of the preset rounds, the cumulative discounted reward generated over all time steps of that round; calculating the average of all returns over the preset rounds; and determining whether the average of all returns is greater than a return threshold; if so, ending the training process; if not, determining whether the current round count is greater than the maximum number of training rounds; if so, ending the training process; and if not, returning to the step of sampling from the experience pool to obtain batch sample data comprising multiple experience samples.
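The end-of-round decision in claim 5 reduces to computing discounted returns for a set of evaluation rounds and comparing their average with a threshold. A minimal sketch, assuming the evaluation rounds are supplied by a hypothetical callable `run_eval_rounds` that returns each round's reward sequence, and reading "reached the evaluation interval" as "is a multiple of the interval":

```python
from typing import Callable, List, Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Cumulative discounted reward of one round: sum_k gamma**k * r_k."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def stop_after_round(
    round_count: int,
    eval_interval: int,
    max_rounds: int,
    run_eval_rounds: Callable[[], List[Sequence[float]]],
    gamma: float,
    return_threshold: float,
) -> bool:
    """End-of-round decision from claim 5; True means training should stop."""
    # "Reached the evaluation interval" is read as a multiple of it (assumption).
    if round_count % eval_interval == 0:
        returns = [discounted_return(ep, gamma) for ep in run_eval_rounds()]
        if sum(returns) / len(returns) > return_threshold:
            return True
    # Claim 5 phrases the budget check as "greater than" the maximum round count.
    return round_count > max_rounds
```

Evaluating only at the interval keeps the relatively expensive test rollouts from running after every training round.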

6. The training method for the induction motor control model according to claim 2, characterized in that sampling from the experience pool to obtain batch sample data comprising multiple experience samples includes: initializing the three-phase current of the induction motor and the measured values of the rotor speed and rotor position; calculating the measured values of the stator current d-axis and q-axis components from the three-phase current and the rotor position measurement; inputting the measured speed value and the speed reference value into a speed PI controller to calculate the reference values of the stator current d-axis and q-axis components; inputting the state vector, composed of the measured value of the stator current d-axis component, the measured value of the stator current q-axis component, the stator current d-axis component deviation, the stator current q-axis component deviation, the integral of the stator current d-axis component deviation, the integral of the stator current q-axis component deviation, the measured speed value, and the speed reference value, into the main policy learning network of the TD3 algorithm model for inference to obtain the reference values of the stator voltage d-axis and q-axis components; performing the inverse Park transform on the reference values of the stator voltage d-axis and q-axis components to obtain the reference values of the stator voltage α-axis and β-axis components; performing space vector pulse width modulation based on the reference values of the stator voltage α-axis and β-axis components to obtain a pulse width modulation signal, which is used to control the inverter of the induction motor; calculating the reward for taking the action vector at the current time step based on the stator current d-axis component deviation, the stator current q-axis component deviation, and the reference values of the stator voltage d-axis and q-axis components; collecting the three-phase current, rotor speed measurement, and rotor position measurement of the induction motor, and returning to the step of calculating the measured values of the stator current d-axis and q-axis components from the three-phase current and the rotor position measurement, so as to obtain the state vector of the next time step; combining the state vector of the current time step, the action vector composed of the reference values of the stator voltage d-axis and q-axis components, the reward, and the state vector of the next time step into an experience sample and storing it in the experience pool; and determining whether the number of experience samples in the experience pool has reached the preset number required for training; if so, sampling from the experience pool to obtain batch sample data comprising multiple experience samples; if not, returning to the step of initializing the three-phase current, rotor speed measurement, and rotor position measurement of the induction motor, until the number of experience samples in the experience pool reaches the preset number required for training.
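The coordinate transforms named in claim 6 are standard. The sketch below uses the amplitude-invariant Clarke convention, which is an assumption since the filing does not state a scaling. The angle `theta` is the angle of the rotating d-q frame; for field-oriented control of an induction motor this is typically the rotor-flux angle, and the claims only say it is derived from the rotor position measurement, so here it is simply an input.

```python
import math
from typing import Tuple


def clarke(i_a: float, i_b: float, i_c: float) -> Tuple[float, float]:
    """Amplitude-invariant Clarke transform: abc -> alpha/beta."""
    i_alpha = (2.0 / 3.0) * (i_a - 0.5 * i_b - 0.5 * i_c)
    i_beta = (i_b - i_c) / math.sqrt(3.0)
    return i_alpha, i_beta


def park(x_alpha: float, x_beta: float, theta: float) -> Tuple[float, float]:
    """Park transform: stationary alpha/beta -> rotating d/q at angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return x_alpha * c + x_beta * s, -x_alpha * s + x_beta * c


def inverse_park(x_d: float, x_q: float, theta: float) -> Tuple[float, float]:
    """Inverse Park transform: rotating d/q -> stationary alpha/beta."""
    c, s = math.cos(theta), math.sin(theta)
    return x_d * c - x_q * s, x_d * s + x_q * c
```

For balanced unit-amplitude currents aligned with `theta`, `park(*clarke(...), theta)` yields `i_d = 1` and `i_q = 0`, and `inverse_park` maps the d-q pair back to the original α-β values.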

7. The training method for the induction motor control model according to claim 6, characterized in that the reward for taking the action vector at the current time step is calculated from the stator current d-axis component deviation, the stator current q-axis component deviation, and the reference values of the stator voltage d-axis and q-axis components using the following formula [formula not reproduced in this text], where the left-hand side is the reward for taking the action vector under the state vector of the k-th time step; Q1, Q2, and R are hyperparameters; and the remaining variables are the stator current d-axis component deviation at the k-th time step, the stator current q-axis component deviation at the k-th time step, the reference value of the stator voltage d-axis component at the (k-1)-th time step, and the reference value of the stator voltage q-axis component at the (k-1)-th time step.
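The formula of claim 7 does not survive in this text, so the sketch below is only one plausible form consistent with the listed variables: a negative weighted sum of squared current deviations at step k and squared voltage references at step k-1, with Q1, Q2 and R as weights. Treat the quadratic form and the sign as assumptions; the filing's actual expression may differ.

```python
def reward(e_d: float, e_q: float, u_d_prev: float, u_q_prev: float,
           q1: float, q2: float, r: float) -> float:
    """ASSUMED quadratic-penalty reward; the filing's exact formula is not shown.

    e_d, e_q:           stator current d/q deviations at step k
    u_d_prev, u_q_prev: stator voltage d/q references at step k - 1
    q1, q2, r:          non-negative weighting hyperparameters (Q1, Q2, R)
    """
    return -(q1 * e_d ** 2 + q2 * e_q ** 2 + r * (u_d_prev ** 2 + u_q_prev ** 2))
```

Under this form, Q1 and Q2 trade off d-axis against q-axis tracking, while R penalises large voltage commands; the reward is never positive and reaches zero only for perfect tracking with zero voltage.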

8. A method for controlling an induction motor, characterized in that it uses the main policy learning network trained by the training method according to any one of claims 1-7 and comprises: during operation of the induction motor, determining the three-phase current, rotor speed measurement, and rotor position measurement of the induction motor; calculating the measured values of the stator current d-axis and q-axis components from the three-phase current and the rotor position measurement; inputting the measured speed value and the speed reference value into a speed PI controller to obtain the reference values of the stator current d-axis and q-axis components; inputting the state vector, composed of the measured value of the stator current d-axis component, the measured value of the stator current q-axis component, the stator current d-axis component deviation, the stator current q-axis component deviation, the integral of the stator current d-axis component deviation, the integral of the stator current q-axis component deviation, the measured speed value, and the speed reference value, into the main policy learning network of the TD3 algorithm model for inference to obtain the reference values of the stator voltage d-axis and q-axis components; performing the inverse Park transform on the reference values of the stator voltage d-axis and q-axis components to obtain the reference values of the stator voltage α-axis and β-axis components; performing space vector pulse width modulation based on the reference values of the stator voltage α-axis and β-axis components to obtain a pulse width modulation signal, which is used to control the inverter of the induction motor; and repeating the above steps until a stop signal for the induction motor is received.
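One iteration of the claim 8 control loop can be sketched as: form the deviations, integrate them, assemble the 8-element state, query the trained policy, and convert its d-q output to α-β. All callables here (`measure`, `speed_pi`, `policy`, `apply_svpwm`, `stop_requested`) are hypothetical stand-ins for the hardware and network interfaces, and the deviation sign (reference minus measurement) and forward-Euler integration with sample time `ts` are assumptions.

```python
import math
from dataclasses import dataclass
from typing import Callable, Tuple

State = Tuple[float, ...]


@dataclass
class DeviationIntegrals:
    int_e_d: float = 0.0
    int_e_q: float = 0.0


def control_step(
    i_d: float, i_q: float, i_d_ref: float, i_q_ref: float,
    omega: float, omega_ref: float, theta: float, ts: float,
    integ: DeviationIntegrals, policy: Callable[[State], Tuple[float, float]],
) -> Tuple[State, Tuple[float, float]]:
    """Build the 8-element state, query the policy, return (state, (u_alpha, u_beta))."""
    e_d, e_q = i_d_ref - i_d, i_q_ref - i_q  # sign convention assumed
    integ.int_e_d += e_d * ts                 # forward-Euler integration (assumed)
    integ.int_e_q += e_q * ts
    state = (i_d, i_q, e_d, e_q, integ.int_e_d, integ.int_e_q, omega, omega_ref)
    u_d, u_q = policy(state)                  # trained main policy network
    c, s = math.cos(theta), math.sin(theta)
    return state, (u_d * c - u_q * s, u_d * s + u_q * c)  # inverse Park


def run_until_stop(
    measure: Callable[[], Tuple[float, float, float, float, float]],
    speed_pi: Callable[[float, float], Tuple[float, float]],
    policy: Callable[[State], Tuple[float, float]],
    apply_svpwm: Callable[[float, float], None],
    stop_requested: Callable[[], bool],
    ts: float,
) -> int:
    """Repeat the claim 8 steps until a stop signal arrives; returns the step count.

    `measure` is assumed to return (i_d, i_q, omega, omega_ref, theta).
    """
    integ = DeviationIntegrals()
    steps = 0
    while not stop_requested():
        i_d, i_q, omega, omega_ref, theta = measure()
        i_d_ref, i_q_ref = speed_pi(omega, omega_ref)
        _, (u_alpha, u_beta) = control_step(i_d, i_q, i_d_ref, i_q_ref, omega,
                                            omega_ref, theta, ts, integ, policy)
        apply_svpwm(u_alpha, u_beta)
        steps += 1
    return steps
```

Keeping the deviation integrals in a small mutable object lets them persist across control periods, which is what gives the policy the integral information a PI regulator would otherwise supply.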

9. A training system for an induction motor control model, characterized by comprising: a digital signal processor, the digital signal processor including a replay buffer used to store an experience pool; a field-programmable gate array (FPGA) communicatively connected to the digital signal processor; and a server configured to execute the training method according to any one of claims 1-7 and to deploy the trained main policy learning network to the field-programmable gate array.

10. The training system for the induction motor control model according to claim 9, characterized in that the digital signal processor further includes a speed PI regulator and a reward calculation module, and the field-programmable gate array further includes an inverse Park transform module, a space vector pulse width modulation module, an analog-to-digital converter module, a Clarke transform module, a Park transform module, and a speed and position calculation module; the analog-to-digital converter module is used to acquire the three-phase current of the induction motor and convert it into digital signals; the speed and position calculation module is used to calculate the rotor position measurement and the rotor speed measurement from sensor signals and the three-phase current; the Clarke transform module is used to perform the Clarke transform on the digital three-phase current signals to obtain the measured values of the stator current α-axis and β-axis components; the Park transform module is used to perform the Park transform on the measured values of the stator current α-axis and β-axis components to obtain the measured values of the stator current d-axis and q-axis components; the speed PI regulator is used to calculate the reference values of the stator current d-axis and q-axis components from the measured speed value and the speed reference value; and the reward calculation module is used to calculate the reward for taking the action vector under the state vector of the current time step based on the stator current d-axis component deviation, the stator current q-axis component deviation, the stator voltage d-axis component reference value, and the stator voltage q-axis component reference value, and to calculate, as the return, the cumulative discounted reward generated over all time steps in each of the preset rounds.
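The claims name a space vector pulse width modulation module without specifying its implementation. A minimal sketch, assuming the widely used carrier-based equivalent: the α-β reference is converted to phase voltages, a min-max zero-sequence offset is added, and the result is normalised by a hypothetical DC-link voltage `v_dc`. In the linear range this yields the same centred switching pattern as sector-based SVPWM; an FPGA implementation would compare these duty cycles against a triangular carrier.

```python
import math
from typing import Tuple


def svpwm_duties(u_alpha: float, u_beta: float, v_dc: float) -> Tuple[float, float, float]:
    """Phase duty cycles via min-max zero-sequence injection (centred SVPWM)."""
    # Inverse Clarke (amplitude-invariant): alpha/beta -> phase voltage references.
    v_a = u_alpha
    v_b = -0.5 * u_alpha + (math.sqrt(3.0) / 2.0) * u_beta
    v_c = -0.5 * u_alpha - (math.sqrt(3.0) / 2.0) * u_beta
    # Zero-sequence offset that centres the largest and smallest phases.
    v_0 = -0.5 * (max(v_a, v_b, v_c) + min(v_a, v_b, v_c))
    # Map to [0, 1] duty cycles; clip if the reference leaves the linear range.
    return tuple(min(1.0, max(0.0, 0.5 + (v + v_0) / v_dc)) for v in (v_a, v_b, v_c))
```

The zero-sequence offset cancels in every line-to-line voltage, so the motor sees the requested voltage while the usable linear range extends to a reference magnitude of `v_dc / sqrt(3)`.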