Deep reinforcement learning based quasi-zero stiffness system control method and system

The PPO-LSTM agent, constructed using deep reinforcement learning and LSTM temporal networks, solves the control problem of quasi-zero stiffness systems under strong nonlinearity and time-varying environmental conditions, achieving adaptive low-frequency vibration isolation and energy management, and improving ship noise suppression.

CN122308101A · Pending · Publication Date: 2026-06-30 · HUNAN UNIV +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: HUNAN UNIV
Filing Date: 2026-04-13
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Traditional control methods for quasi-zero stiffness systems rely on accurate models and adapt poorly to strong nonlinearity and time-varying environmental conditions. This results in poor control performance, especially in suppressing low-frequency ship noise.

Method used

By employing deep reinforcement learning, combining LSTM temporal networks and multi-objective reward functions, and through model-free learning and curriculum learning strategies, a PPO-LSTM agent is constructed to achieve adaptive vibration control.

Benefits of technology

It achieves efficient low-frequency vibration isolation in time-varying environments, has strong generalization ability, balances control energy output and real-time performance, and improves the robustness and engineering reliability of the controller.


Abstract

This invention relates to the field of vibration control technology, specifically to a control method and system for quasi-zero stiffness systems based on deep reinforcement learning. The method involves real-time acquisition of displacement and velocity signals, inputting them into a pre-trained policy network, and outputting optimal control force commands to drive actuators to generate active control forces to suppress vibration. The policy network employs a PPO-LSTM architecture with integrated long short-term memory layers, effectively extracting the temporal characteristics of the system's dynamics. The training process is conducted in a simulation environment containing nonlinear restoring force terms, using model-free reinforcement learning based on generalized advantage estimation, and guided by a multi-objective reward function that integrates displacement suppression, control cost, and smoothness. A curriculum learning strategy is also introduced, transitioning in stages from simple environments to complex time-varying environments with stochastic excitations to enhance the controller's environmental adaptability and generalization ability. This achieves adaptive control of strongly nonlinear systems without the need for precise models, effectively improving low-frequency vibration isolation performance.


Description

Technical Field

[0001] This invention relates to the field of vibration control technology, specifically to a control method and system for quasi-zero stiffness systems based on deep reinforcement learning.

Background Technology

[0002] The statements in this section are merely background information related to the present invention and do not necessarily constitute prior art.

[0003] A quasi-zero stiffness system is a vibration isolation device that achieves high static stiffness and low dynamic stiffness by combining positive and negative stiffness mechanisms. This design allows the system to have extremely low dynamic stiffness near its equilibrium position, thereby effectively isolating low-frequency vibrations while maintaining sufficient load-bearing capacity to support heavy-duty equipment.

[0004] In the field of ship vibration reduction and noise control, the low-frequency mechanical noise of ships (usually below 100Hz) is characterized by concentrated energy, slow attenuation, and long propagation distance, posing a serious threat to the acoustic stealth performance of ships. Quasi-zero stiffness systems, due to their excellent low-frequency vibration isolation performance, have become a key technical means to solve this problem.

[0005] Currently, control methods for quasi-zero stiffness systems are mainly classified into three categories: passive control, semi-active control, and active control. Passive control relies on the system's own mechanical properties and requires no external energy input; semi-active control adapts to environmental changes by adjusting the system's damping or stiffness parameters; and active control actively suppresses vibration by applying control forces through external actuators. Among these, active control has received widespread attention due to its better adaptability and control effect.

[0006] Traditional active control methods (such as PID control) heavily rely on accurate mathematical models of the system. For quasi-zero stiffness systems with strong nonlinear characteristics, accurate modeling is extremely difficult, and model errors can lead to a sharp decline in control performance or even system instability.

[0007] Semi-active control methods are typically based on preset control rules, such as groundhook damping and skyhook damping. These rules may work well under fixed operating conditions, but when environmental parameters change (for example, damping changes caused by speed changes or sudden load changes), the control effect drops significantly, showing a lack of environmental adaptability.

[0008] While data-driven control methods avoid the difficulties of modeling, they require a large amount of high-quality training data, have high requirements for data acquisition conditions, and involve long training cycles. More importantly, these methods lack the ability to generalize to scenarios outside the training data, and the control effect is difficult to guarantee when encountering operating conditions not included in the training set.

Summary of the Invention

[0009] This invention provides a control method for quasi-zero stiffness systems based on deep reinforcement learning, which solves the problem of adaptive vibration control of quasi-zero stiffness vibration isolation systems under time-varying environments. It overcomes the limitations of traditional control methods, such as dependence on accurate models, difficulty in nonlinear control, and insufficient generalization ability, and improves the low-frequency noise suppression effect of ships.

[0010] To achieve the above objectives, the present invention adopts the following technical solution:

[0011] The first aspect of this invention provides a control method for quasi-zero stiffness systems based on deep reinforcement learning, comprising the following steps:

[0012] The current state signal of the quasi-zero stiffness vibration isolation system is obtained. The current state signal and the historical state sequence are used as inputs to the policy network to obtain the optimal control force. Based on the optimal control force, the active actuator of the quasi-zero stiffness vibration isolation system generates an active control force to counteract vibration.

[0013] The state signals include at least displacement and velocity;

[0014] The construction and training process of the policy network includes:

[0015] A dynamic simulation model of a quasi-zero stiffness vibration isolation system containing a nonlinear restoring force term is established as an interactive training environment for intelligent agents.

[0016] In the training environment, a proximal policy optimization algorithm based on generalized advantage estimation is used to drive the agent, which consists of a policy network and an evaluation network, to perform model-free learning. Both the policy network and the evaluation network integrate long short-term memory network layers to handle the temporal dependencies between states and actions.

[0017] During training, a multi-objective reward function is defined based on the system response. The reward function integrates at least displacement suppression, control force cost, and control smoothness indicators to guide the direction of policy optimization.

[0018] By adopting a curriculum learning strategy, the agent is trained in stages, progressing from fixed-parameter environments to time-varying environments with random excitations, until the policy network converges, thereby obtaining a control policy that can directly map the system state time-series information to the optimal control force.

[0019] Furthermore, the dynamic model of the quasi-zero stiffness vibration isolation system is shown in the following equation:

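[0020] $$m\ddot{x} + c(t)\dot{x} + kx + F_{qzs}(x) = F_e(t) + u(t)$$;

[0021] where $m$ is the mass of the system oscillator; $x$, $\dot{x}$, and $\ddot{x}$ are the system displacement, velocity, and acceleration, respectively; $c(t)$ is the time-varying damping coefficient; $k$ is the linear stiffness coefficient; $F_{qzs}(x)$ is the nonlinear quasi-zero stiffness restoring force term; $F_e(t)$ is the external excitation; and $u(t)$ is the control force;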

[0022] The nonlinear quasi-zero stiffness restoring force term is defined as:

[0023]

[0024] where the two parameters are the lateral nonlinear force when the quasi-zero stiffness system is in equilibrium, and the horizontal distance between the centers of the inner and outer magnetic ring sections of the lateral negative stiffness structure of the quasi-zero stiffness system.

[0025] Furthermore, the policy network receives a state time sequence consisting of the current displacement, current velocity, and their historical values, extracts the temporal dynamic features through a long short-term memory network layer, and outputs a normalized continuous control force value.

[0026] Furthermore, a multi-objective reward function is used during the training of the policy network, as shown in the following formula:

[0027] ;

[0028] The components are defined as follows:

[0029] the displacement penalty term, which includes a displacement normalization coefficient;

[0030] the acceleration penalty term, which includes an acceleration normalization coefficient;

[0031] the control force penalty term, which constrains control energy consumption;

[0032] the penalty term for the rate of change of the control force, which suppresses sudden changes in the control force;

[0033] the energy change reward, which is computed from the total mechanical energy of the system and includes a small coefficient that prevents division by zero;

[0034] the sparse success reward: when the system displacement stays below a preset tolerance for a set number of consecutive time steps, a positive reward is given at the final time step;

[0035] each component carries its own weighting coefficient.
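The six reward components listed above can be illustrated with a minimal sketch. The exact expressions are not reproduced in this text, so every functional form below (quadratic penalties, a relative energy decrease, a fixed bonus) is an assumption chosen for illustration, and the names `RewardConfig`, `composite_reward`, and `update_hold_counter` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RewardConfig:
    # All values are illustrative placeholders, not taken from the patent.
    w_x: float = 1.0       # displacement weight
    w_a: float = 0.1       # acceleration weight
    w_u: float = 0.01      # control force weight
    w_du: float = 0.01     # control force rate-of-change weight
    w_e: float = 0.1       # energy change weight
    w_s: float = 10.0      # sparse success bonus
    x_ref: float = 0.01    # displacement normalization coefficient
    a_ref: float = 1.0     # acceleration normalization coefficient
    eps: float = 1e-8      # guards the energy ratio against division by zero
    hold_steps: int = 50   # consecutive in-tolerance steps needed for the bonus


def update_hold_counter(x, count, tol):
    """Count consecutive steps with |x| below the tolerance; reset otherwise."""
    return count + 1 if abs(x) < tol else 0


def composite_reward(x, acc, u, u_prev, energy, energy_prev, steps_in_tol, cfg):
    """Sum of the six components (assumed forms, see lead-in)."""
    r_x = -cfg.w_x * (x / cfg.x_ref) ** 2            # displacement penalty
    r_a = -cfg.w_a * (acc / cfg.a_ref) ** 2          # acceleration penalty
    r_u = -cfg.w_u * u ** 2                          # control force penalty
    r_du = -cfg.w_du * (u - u_prev) ** 2             # smoothness penalty
    r_e = cfg.w_e * (energy_prev - energy) / (energy_prev + cfg.eps)  # energy drop
    r_s = cfg.w_s if steps_in_tol >= cfg.hold_steps else 0.0          # sparse bonus
    return r_x + r_a + r_u + r_du + r_e + r_s
```

In this sketch `u` is the normalized action, so the control penalties stay on the same scale regardless of actuator size; a design that penalizes the physical force instead would need rescaled weights.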

[0036] Furthermore, the policy network consists of the following components: a shared feature extraction layer comprising two fully connected layers, each with 64 neurons, using the Mish activation function and equipped with LayerNorm (layer normalization); a single-layer LSTM temporal processing layer with 32 hidden neurons, supporting long sequence feature extraction; and a policy network output layer comprising one fully connected layer with 16 neurons, using the Tanh activation function, which outputs the action mean. The evaluation network has the same shared feature extraction layer and LSTM temporal processing layer structure as the policy network. The evaluation network output layer contains one fully connected layer with 64 neurons and the ReLU activation function, and outputs the state value.

[0037] Furthermore, the control action $a$ is obtained by sampling around the mean $\mu$ output by the policy network, as follows:

[0038] A raw action $\tilde{a}$ is sampled from a multivariate Gaussian distribution $\mathcal{N}(\mu, \Sigma)$ with covariance matrix $\Sigma$. The raw action is squashed by the Tanh function to obtain the final executed action $a = \tanh(\tilde{a})$, whose range is limited to the interval [-1, 1].

[0039] The log probability $\log \pi(a \mid s)$ of the executed action $a$ under the policy includes a Jacobian correction term and is given by:

[0040] $$\log \pi(a \mid s) = \log \mathcal{N}(\tilde{a}; \mu, \Sigma) - \sum_i \log\left(1 - \tanh^2(\tilde{a}_i) + \epsilon\right)$$;

[0041] where $\log \mathcal{N}(\tilde{a}; \mu, \Sigma)$ is the log probability of the raw action $\tilde{a}$ under the Gaussian distribution, and $\epsilon$ is a small constant that ensures numerical stability. The second term on the right-hand side of the equation is the logarithm of the Jacobian determinant introduced by the Tanh transformation.
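The Tanh-squashed Gaussian log probability described above can be sketched as follows. This is a minimal illustration that assumes a diagonal covariance (independent action dimensions), which the text does not state; the function name `squashed_gaussian_log_prob` and the default `eps` are hypothetical.

```python
import math


def squashed_gaussian_log_prob(raw_action, mean, std, eps=1e-6):
    """Log-density of a = tanh(raw_action), with raw_action ~ N(mean, diag(std^2)).

    Assumes a diagonal covariance, so the density factorizes per dimension.
    """
    log_prob = 0.0
    for z, mu, sigma in zip(raw_action, mean, std):
        # Gaussian log-density of the raw (pre-squash) action.
        gauss = (-0.5 * ((z - mu) / sigma) ** 2
                 - math.log(sigma)
                 - 0.5 * math.log(2.0 * math.pi))
        # Log of the Tanh Jacobian, d tanh(z)/dz = 1 - tanh(z)^2; eps avoids log(0).
        log_jacobian = math.log(1.0 - math.tanh(z) ** 2 + eps)
        log_prob += gauss - log_jacobian
    return log_prob
```

Because the Jacobian of Tanh is at most 1, the correction never lowers the log probability; it grows as the raw action moves into the saturated region, which is where the small constant matters.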

[0042] Furthermore, both the policy network and the evaluation network integrate long short-term memory network layers, forming the PPO-LSTM agent architecture. The long short-term memory network layer encodes the input state temporal information, extracts the dynamic temporal features of the system, and outputs the corresponding temporal feature vector. The fully connected layers of the policy network and the evaluation network output action distribution parameters and state value estimates based on the temporal feature vectors, respectively.

[0043] Furthermore, the optimization objective of the proximal policy optimization (PPO) algorithm based on generalized advantage estimation includes a policy network optimization term, an evaluation network optimization term, and a policy entropy optimization term;

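[0044] Among them, the optimization objective term for the evaluation network is:

[0045] $$L^{V} = \mathbb{E}_t\left[\left(V(s_t) - V_t^{\mathrm{targ}}\right)^2\right]$$

[0046] where $V(s_t)$ is the Critic network's value estimate for state $s_t$, $V_t^{\mathrm{targ}}$ is the target value, and the advantage function $\hat{A}_t$ is calculated using generalized advantage estimation (GAE).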

[0047] The policy entropy optimization term is:

[0048] ;

[0049] where the two quantities are the policy entropy and the adaptive entropy coefficient;

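[0050] The clipped surrogate objective function, used to optimize the policy network, is calculated as follows:

[0051] $$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\ 1-\epsilon_{\mathrm{clip}},\ 1+\epsilon_{\mathrm{clip}}\right)\hat{A}_t\right)\right]$$;

[0052] where $r_t(\theta)$ is the probability ratio between the new and old policies, $\hat{A}_t$ is the advantage estimate, and $\epsilon_{\mathrm{clip}}$ is the clipping hyperparameter.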
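The clipped surrogate objective can be sketched in a few lines. This follows the standard PPO formulation; the function name `ppo_clip_objective` and the default clipping value of 0.2 are assumptions for illustration, not values given in the text.

```python
import math


def ppo_clip_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (to be maximized)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the new and old policies.
        ratio = math.exp(lp_new - lp_old)
        # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic bound: take the smaller of the two surrogate values.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a positive advantage the objective stops rewarding ratios above the upper clip bound, and with a negative advantage it stops rewarding ratios below the lower bound, which limits how far one update can move the policy.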

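[0053] Furthermore, the advantage function is estimated using the generalized advantage estimation (GAE) method:

[0054] $$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$;

[0055] where $\gamma$ is the discount factor, $\lambda$ is the GAE balance parameter, $\delta_t$ is the TD error at time $t$, $r_t$ is the actual reward at time $t$, and $V(s_t)$ is the evaluation network's value estimate for state $s_t$.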
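Generalized advantage estimation is usually computed with a backward recursion over a finite rollout rather than the infinite sum. The sketch below shows that recursion for a single trajectory segment; the function name `compute_gae`, the default values of `gamma` and `lam`, and the omission of episode-termination handling are simplifying assumptions.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Backward-recursive GAE for one rollout segment.

    rewards[t] and values[t] are the reward and critic estimate at step t;
    last_value is the critic estimate for the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Discounted sum of TD errors: A_t = delta_t + gamma * lam * A_{t+1}.
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages
```

Adding each advantage to the matching value estimate gives a return target that can be used for the critic.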

[0056] Furthermore, an adaptive entropy coefficient is used during training. It is dynamically adjusted based on the difference between the actual entropy of the current policy and the target entropy, and the adjustment formula is:

[0057] ;

[0058] where the parameters are the adaptive learning rate and the lower and upper bounds of the adaptive entropy coefficient.
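One way to realize such an adjustment is sketched below. The adjustment formula itself is not reproduced in this text, so the additive update followed by clamping is an assumption, as are the function name `update_entropy_coef` and all default values.

```python
def update_entropy_coef(alpha, entropy, target_entropy,
                        lr=0.01, alpha_min=1e-4, alpha_max=0.1):
    """Nudge the entropy coefficient toward the target entropy (assumed rule).

    If the policy entropy is below the target, alpha increases to encourage
    exploration; if it is above, alpha decreases. The result is clamped to
    [alpha_min, alpha_max].
    """
    alpha_new = alpha + lr * (target_entropy - entropy)
    return max(alpha_min, min(alpha_max, alpha_new))
```

Other update rules, such as a multiplicative or log-space step, would fit the same description; the clamp is the only part the text fixes.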

[0059] Furthermore, training uses a packed-sequence approach, which specifically involves: collecting data from several time steps in each batch, splitting the data according to sequence length, packing the variable-length sequences from different episodes into a PackedSequence object, and inputting it into the LSTM layer all at once, with the hidden state initialized and reset automatically, thereby improving training efficiency and the accuracy of temporal feature extraction.

[0060] Furthermore, the training process employs a curriculum learning strategy comprising the following stages:

[0061] The first stage involves preliminary training in a training environment where system parameters are fixed and external excitations are deterministic.

[0062] The second stage involves training in a training environment where the system's damping and/or stiffness parameters change slowly and periodically over time.

[0063] The third stage involves training in a training environment where system parameters are time-varying and external excitations contain random noise components.

[0064] The subsequent training stage uses the network parameters obtained from the convergence of the previous stage as initial parameters.
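The staged schedule above, with each stage warm-started from the previous one, can be sketched as a simple loop. The names `Stage`, `CURRICULUM`, and `run_curriculum` are hypothetical, and `train_stage` stands in for whatever PPO training routine is used; it is assumed to run until convergence and return the resulting parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    time_varying_params: bool   # damping and/or stiffness vary over time
    noisy_excitation: bool      # excitation contains a random noise component


# The three stages described above, from easiest to hardest.
CURRICULUM = (
    Stage("fixed", time_varying_params=False, noisy_excitation=False),
    Stage("slowly_varying", time_varying_params=True, noisy_excitation=False),
    Stage("stochastic", time_varying_params=True, noisy_excitation=True),
)


def run_curriculum(train_stage, init_params):
    """Train stage by stage, passing converged parameters to the next stage."""
    params = init_params
    for stage in CURRICULUM:
        # train_stage is a caller-supplied function: (stage, params) -> params.
        params = train_stage(stage, params)
    return params
```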

[0065] A second aspect of the present invention provides a system for implementing the above-described method, comprising:

[0066] The state parameter acquisition module is configured to acquire the current state signal of the quasi-zero stiffness vibration isolation system.

[0067] The control module is configured to use the current state signal and the historical state sequence as input to the policy network to obtain the optimal control force. Based on the optimal control force, the active actuator of the quasi-zero stiffness vibration isolation system generates an active control force to counteract the vibration.

[0068] The state signals include at least displacement and velocity;

[0069] The construction and training process of the policy network includes:

[0070] A dynamic simulation model of a quasi-zero stiffness vibration isolation system containing a nonlinear restoring force term is established as an interactive training environment for intelligent agents.

[0071] In the training environment, a proximal policy optimization algorithm based on generalized advantage estimation is used to drive the agent, which consists of a policy network and an evaluation network, to perform model-free learning. Both the policy network and the evaluation network integrate long short-term memory network layers to handle the temporal dependencies between states and actions.

[0072] During training, a multi-objective reward function is defined based on the system response. The reward function integrates at least displacement suppression, control force cost, and control smoothness indicators to guide the direction of policy optimization.

[0073] By adopting a curriculum learning strategy, the agent is trained in stages, progressing from fixed-parameter environments to time-varying environments with random excitations, until the policy network converges, thereby obtaining a control policy that can directly map the system state time-series information to the optimal control force.

[0074] A third aspect of the present invention provides a computer program product including computer-readable instructions that, when executed on an electronic device, cause the electronic device to implement the above-described control method for a quasi-zero stiffness system based on deep reinforcement learning.

[0075] A fourth aspect of the present invention provides an electronic device including at least one processor and a memory connected to the processor, the memory being used to store a computer program; the processor being used to execute the computer program, enabling the electronic device to implement the above-described control method for a quasi-zero stiffness system based on deep reinforcement learning.

[0076] Compared with existing technologies, one or more of the above technical solutions have the following beneficial effects:

[0077] 1. A deep reinforcement learning PPO-LSTM model with integrated temporal feature processing capabilities is employed to learn through real-time interaction with the quasi-zero stiffness system environment. With system displacement and velocity as observation inputs, the dynamic characteristics of the system are analyzed and captured. A multi-objective reward function is used to comprehensively consider system response, control input, and system energy. The vibration isolation problem is modeled as a Markov decision process, and the PPO algorithm adjusts the control strategy to different environmental conditions such as damping, stiffness, and excitation. This ensures vibration isolation efficiency while also considering control energy output and real-time control performance. Furthermore, a phased training strategy based on curriculum learning allows the controller to gradually adapt to various changing conditions, from the simplest fixed environmental parameter conditions to time-varying environments, and then to more complex environments incorporating random excitation and noise, giving it strong generalization ability.

[0078] 2. By constructing a dynamic simulation model containing a nonlinear restoring force term as the training environment, and employing a proximal policy optimization algorithm based on generalized advantage estimation to drive the agent in model-free learning, the policy network can learn control strategies through direct interaction with the environment without the need for precise modeling. This process enables the controller to adaptively adjust the control force output online when faced with changes in system parameters such as time-varying damping and time-varying stiffness, overcoming the performance degradation problem of traditional model-dependent control methods under parameter perturbations, thereby ensuring the stable and efficient operation of the vibration isolation system under complex shipboard conditions.

[0079] 3. By integrating long short-term memory (LSTM) network layers to construct policy and evaluation networks, the temporal dependencies between states and actions can be effectively handled, and dynamic features can be extracted from historical state sequences. This enables the controller not only to respond to the current system state but also to make more forward-looking and stable control decisions based on historical trends. Combined with the optimized guidance of multi-objective reward functions, while suppressing displacement and acceleration, control energy consumption and command smoothness are also considered, thereby fundamentally improving the vibration isolation efficiency and control robustness of quasi-zero stiffness systems in the low-frequency domain, effectively coping with random disturbances.

[0080] 4. During the training phase, this scheme employs a phased learning strategy, gradually transitioning the agent from a simple, fixed-parameter environment to a complex, time-varying environment with random excitations. This process follows an easy-to-difficult learning pattern, ensuring stable convergence of the policy network in the early stages of training and progressively strengthening its ability to cope with uncertainties and unknown disturbances in subsequent stages. Therefore, the finally deployed policy network generalizes well to conditions outside the training set and can be reliably applied to various complex real-world scenarios such as sudden load changes and sea state variations.

[0081] 5. The trained policy network is deployed online as a lightweight controller, requiring only real-time displacement and velocity as input. Through forward propagation, it can quickly generate continuous optimal control force commands. This approach avoids complex online optimization calculations and meets real-time requirements. Simultaneously, by penalizing the magnitude and rate of change of the control force in the reward function, effective vibration isolation is achieved while actively constraining the actuator's energy consumption and amplitude, suppressing potential high-frequency excitation, and ensuring the safety and lifespan of the actuator. This results in a good balance across multiple dimensions, including control performance, real-time performance, energy consumption, and engineering reliability.

Attached Figure Description

[0082] The accompanying drawings, which form part of this invention, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an improper limitation of the invention.

[0083] Figure 1 is a flowchart illustrating an active vibration control method for a quasi-zero stiffness system based on deep reinforcement learning, provided in one or more embodiments of the present invention;

[0084] Figure 2 is a schematic diagram of a quasi-zero stiffness vibration isolation system provided in one or more embodiments of the present invention;

[0085] Figure 3 is a schematic diagram of the principle of a quasi-zero stiffness vibration isolation system provided in one or more embodiments of the present invention;

[0086] Figure 4 is a structural framework diagram of a deep reinforcement learning model provided in one or more embodiments of the present invention.

Detailed Implementation

[0087] The present invention will be further described below with reference to the accompanying drawings and embodiments.

[0088] It should be noted that the following detailed descriptions are exemplary and intended to provide further illustration of the invention. Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains.

[0089] As introduced in the background section, current control methods for quasi-zero stiffness systems are mainly divided into three categories: passive control, semi-active control, and active control. Passive control relies on the system's own mechanical properties and requires no external energy input; semi-active control adapts to environmental changes by adjusting the system's damping or stiffness parameters; and active control actively suppresses vibration by applying control forces through external actuators. Among these, active control has received widespread attention due to its better adaptability and control effect.

[0090] Traditional active control methods (such as PID control) heavily rely on accurate mathematical models of the system. For quasi-zero stiffness systems with strong nonlinear characteristics, accurate modeling is extremely difficult, and model errors can lead to a sharp decline in control performance or even system instability.

[0091] Semi-active control methods are typically based on preset control rules, such as groundhook damping and skyhook damping. These rules may work well under fixed operating conditions, but when environmental parameters change (for example, damping changes caused by speed changes or sudden load changes), the control effect drops significantly, showing a lack of environmental adaptability.

[0092] While data-driven control methods avoid the difficulties of modeling, they require a large amount of high-quality training data, have high requirements for data acquisition conditions, and involve long training cycles. More importantly, these methods lack the ability to generalize to scenarios outside the training data, and the control effect is difficult to guarantee when encountering operating conditions not included in the training set.

[0093] These defects mainly stem from the following reasons:

[0094] The fundamental reasons are strong nonlinearity and modeling difficulties. The mechanical properties of quasi-zero stiffness systems exhibit significant nonlinearity; the stiffness approaches zero near the equilibrium position and increases rapidly away from the equilibrium position. This strong nonlinearity renders traditional linearization modeling methods ineffective, while accurate nonlinear models are difficult to establish, thus limiting the effectiveness of model-based control methods.

[0095] The contradiction between time-varying environmental conditions and fixed control rules. Ships face complex marine environments during actual navigation, where system parameters such as damping and stiffness dynamically change with speed and sea state, and external excitations also contain a large amount of randomness. Fixed control rules or controller parameters cannot adapt to these time-varying characteristics, which is the main reason for the decline in the effectiveness of semi-active and traditional active control.

[0096] The trade-off between generalization ability and data dependence. Existing intelligent control methods often perform well on specific training data, but lack the ability to adapt to unknown operating conditions. This is because these methods essentially learn statistical patterns in the data. When encountering scenarios outside the data distribution, the learned patterns no longer apply, leading to a decline in control effectiveness.

[0097] Therefore, this embodiment proposes a control method and system for quasi-zero stiffness systems based on deep reinforcement learning. It combines deep reinforcement learning with quasi-zero stiffness systems to construct a PPO control framework that integrates LSTM temporal networks. Learning is guided by a multi-objective reward function, and progressive training from simple to complex environments is achieved through a curriculum learning strategy. Ultimately, model-free, adaptive, and highly generalized intelligent vibration control is realized.

[0098] This solution provides an active vibration control method for quasi-zero stiffness systems based on deep reinforcement learning, utilizing proximal policy optimization (PPO) and long short-term memory (LSTM) networks. As shown in Figure 1, the method includes the following steps:

[0099] A dynamic model of a quasi-zero stiffness vibration isolation system is established. The model includes linear stiffness terms, time-varying damping terms, and nonlinear quasi-zero stiffness restoring force terms, which accurately describe the vibration characteristics of the time-varying nonlinear system.

[0100] Define the continuous state space and action space of the control system, and construct a multi-objective reward function that integrates displacement suppression, acceleration suppression, control force cost, control smoothness and energy change. The reward function also includes sparse success reward terms to guide the control strategy to evolve towards the optimal direction.

[0101] A dual-network control model is constructed, which includes a policy (Actor) network and an evaluation (Critic) network. Both the policy network and the evaluation network integrate LSTM time-series processing layers to extract time-series features from the system state sequence and output the control policy.

[0102] The state parameters, consisting of system displacement and velocity, are input into the policy network, and the control force output by the actuator is used as the output variable of the policy network. The state parameters and action parameters are input into the evaluation network, which outputs the corresponding value estimate of the state-action pair.

[0103] The PPO algorithm based on generalized advantage estimation (GAE) is adopted to construct loss functions for policy optimization, value optimization, and policy entropy optimization. Variable-length time-series data is processed by the PackedSequence method. The dual network parameters are optimized with the goal of minimizing the loss function, thereby optimizing the control policy.

[0104] The training process employs an adaptive entropy coefficient adjustment strategy to balance exploration and exploitation; at the same time, a curriculum learning strategy is introduced to complete the training in stages, from a simple fixed-parameter environment to a complex random-excitation environment.

[0105] The trained policy network is deployed to a quasi-zero stiffness vibration isolation system, and continuously adjustable optimal control actions are generated based on the real-time observed system state to achieve environmentally adaptive vibration suppression.

[0106] I. System dynamics modeling and simulation environment construction.

[0107] The dynamic behavior of the quasi-zero stiffness vibration isolation system is described by the following differential equation:

[0108] ;

[0109] As shown in Figure 2 and Figure 3, the quantities in this equation are: the mass of the system oscillator; the system displacement, velocity, and acceleration; the time-varying damping coefficient; the linear stiffness coefficient; the active control force to be designed; and the external excitation.

[0110] The time-varying damping coefficient is realized by setting different damping ratios. Specifically, the damping ratio varies sinusoidally between 0.1 and 0.5, and its variation frequency is taken as 1/10 of the excitation frequency.

[0111] The external excitation has an amplitude range of 0-100 N and a bandwidth of 1-50 Hz, and a sinusoidal excitation mode is adopted. In the third stage of training, a noise term is introduced on this basis.
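The time-varying damping ratio and the sinusoidal excitation described above can be sketched as follows. This is a minimal illustration, not the filing's implementation: the function names, the zero phase offset, and the choice to center the sinusoid midway between 0.1 and 0.5 are assumptions.

```python
import math

def damping_ratio(t, f_exc, zeta_min=0.1, zeta_max=0.5):
    """Damping ratio varying sinusoidally between zeta_min and zeta_max.

    The variation frequency is f_exc / 10, as stated in the text.
    The zero phase offset is an assumption made for illustration.
    """
    mid = 0.5 * (zeta_min + zeta_max)
    amp = 0.5 * (zeta_max - zeta_min)
    return mid + amp * math.sin(2.0 * math.pi * (f_exc / 10.0) * t)

def excitation(t, amplitude, f_exc):
    """Sinusoidal external excitation in newtons (0-100 N, 1-50 Hz)."""
    if not (0.0 <= amplitude <= 100.0 and 1.0 <= f_exc <= 50.0):
        raise ValueError("amplitude or frequency outside the stated range")
    return amplitude * math.sin(2.0 * math.pi * f_exc * t)
```

With an excitation frequency of 10 Hz, the damping ratio in this sketch completes one cycle per second.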

[0112] The nonlinear quasi-zero stiffness restoring force term is defined as:

[0113] ;

[0114] where the parameters are the lateral nonlinear force when the quasi-zero stiffness system is in equilibrium, and the horizontal distance between the inner and outer magnetic rings of the lateral negative stiffness structure in the quasi-zero stiffness system.

[0115] To simulate the real physical process, the fourth-order Runge-Kutta (RK4) numerical integration method is used to solve the dynamic equation. The simulation step size is adjusted adaptively according to the external excitation frequency, with its value chosen to ensure numerical stability and computational accuracy.
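One RK4 step for a second-order oscillator can be sketched as below. The equation form `m*x'' + c(t)*x' + k*x + f_nl(x) = f_ext(t) + u(t)` is an assumption assembled from the quantities listed in the text; the restoring force `f_nl` is left as a user-supplied callable because its expression is not reproduced here, and the `steps_per_period` rule for the step size is hypothetical.

```python
import math

def rk4_step(x, v, t, dt, accel):
    """Advance (x, v) by one RK4 step for x'' = accel(t, x, v)."""
    k1x, k1v = v, accel(t, x, v)
    k2x = v + 0.5 * dt * k1v
    k2v = accel(t + 0.5 * dt, x + 0.5 * dt * k1x, v + 0.5 * dt * k1v)
    k3x = v + 0.5 * dt * k2v
    k3v = accel(t + 0.5 * dt, x + 0.5 * dt * k2x, v + 0.5 * dt * k2v)
    k4x = v + dt * k3v
    k4v = accel(t + dt, x + dt * k3x, v + dt * k3v)
    x_new = x + dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6.0
    v_new = v + dt * (k1v + 2 * k2v + 2 * k3v + k4v) / 6.0
    return x_new, v_new

def make_accel(m, c, k, f_nl, f_ext, u):
    """Assumed form: m*x'' + c(t)*x' + k*x + f_nl(x) = f_ext(t) + u(t)."""
    def accel(t, x, v):
        return (f_ext(t) + u(t) - c(t) * v - k * x - f_nl(x)) / m
    return accel

def step_size(f_exc, steps_per_period=100):
    """Hypothetical rule: a fixed number of steps per excitation period."""
    return 1.0 / (steps_per_period * f_exc)
```

Tying the step size to the excitation period keeps the number of integration points per cycle constant across the 1-50 Hz band, which is one plausible reading of "adaptive adjustment".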

[0116] II. Construction of State Space and Action Space.

[0117] As shown in Figure 4, the state space is designed to enable the agent to fully perceive the dynamic characteristics of the system. The state space of this scheme is a two-dimensional continuous space. The action space is defined as the normalized control force output. At each time step the agent outputs a continuous action $a_t$ within the range $[-1, 1]$, and the actual control force $u_t$ is calculated by the following formula:

[0118] $u_t = a_t \cdot u_{\max}$;

[0119] where $u_{\max}$ is the maximum control force that the actuator can provide.

[0120] III. Reward Function Design.

[0121] The reward function guides the agent to learn the optimal control strategy. This scheme proposes a multi-objective composite reward function, the expression of which is:

[0122] .

[0123] The components are defined as follows:

[0124] The displacement suppression reward encourages the system displacement to approach 0, specifically:

[0125] ;

[0126] where the parameters are the displacement reward weighting coefficient and the displacement normalization factor.

[0127] The acceleration suppression reward reduces the system acceleration, specifically:

[0128] ;

[0129] where the quantities are the instantaneous acceleration of the system, the acceleration reward weighting coefficient, and the acceleration normalization factor.

[0130] The control cost reward constrains the control force and minimizes control energy consumption, specifically:

[0131]

[0132] where the parameter is the control cost reward weighting coefficient.

[0133] The control smoothness reward penalizes drastic changes in the control force and ensures smooth control commands, specifically:

[0134] ;

[0135] where the parameter is the control smoothness reward weighting coefficient.

[0136] The energy change reward encourages the system to reduce its vibrational energy, specifically:

[0137] ;

[0138] where the quantities are the total mechanical energy of the system, the energy reward weighting coefficient, and a small constant that prevents division by zero.

[0139] The sparse success reward gives a one-time large positive reward when the system displacement stays below a preset tolerance for a preset number of consecutive steps, encouraging stable control.
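The sparse success reward is the one component whose logic is fully described in words, so it can be sketched directly. The class name, the argument names, and the choice to pay the bonus at most once per episode (until `reset` is called) are assumptions; the tolerance, step count, and bonus values are not given in the text and must be supplied by the user.

```python
class SparseSuccessReward:
    """One-time bonus once |x| stays below tolerance for n_steps consecutive steps."""

    def __init__(self, tolerance, n_steps, bonus):
        self.tolerance = tolerance
        self.n_steps = n_steps
        self.bonus = bonus
        self.count = 0      # consecutive steps within tolerance
        self.paid = False   # whether the bonus was already granted

    def reset(self):
        """Call at the start of each episode."""
        self.count = 0
        self.paid = False

    def __call__(self, x):
        # Any excursion outside the tolerance restarts the count.
        if abs(x) < self.tolerance:
            self.count += 1
        else:
            self.count = 0
        if not self.paid and self.count >= self.n_steps:
            self.paid = True
            return self.bonus
        return 0.0
```

The returned value would be added to the other reward components at each time step.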

[0140] In this embodiment, the weighting coefficients for each reward are as follows: , , , , .

[0141] IV. Construction of PPO-LSTM control network.

[0142] The policy network consists of the following components: a shared feature extraction layer containing two fully connected layers, each with 64 neurons, the Mish activation function, and LayerNorm (layer normalization); a single-layer LSTM temporal processing layer with 32 hidden units, supporting long-sequence feature extraction; and a policy output layer containing one fully connected layer with 16 neurons and the Tanh activation function, which outputs the action mean. The evaluation network has the same shared feature extraction layer and LSTM temporal processing layer structure as the policy network; its output layer contains one fully connected layer with 64 neurons and the ReLU activation function and outputs the state value.

[0143] As shown in Figure 4, this embodiment constructs a dual-network control model comprising an Actor network and a Critic network. Since vibration control is a continuously time-varying problem, and traditional MLP neural networks struggle to capture dynamic temporal dependencies, this solution proposes an improved Actor-Critic architecture that incorporates an LSTM structure to enhance the agent's decision-making performance in active control. Specifically, LSTM modules are integrated into both the Actor and Critic networks, enabling them to learn temporal relationships from consecutive states. This design strengthens the agent's modeling of the implicit relationships between historical and current states, thereby achieving more accurate decision-making in highly dynamic environments.

[0144] 4.1 Actor Network (Policy Network).

[0145] The Actor network receives the state and its historical sequence, and outputs the mean of the control action. Its structure is as follows:

[0146] Input layer: receives the state vector.

[0147] Shared feature extraction layer: consists of two fully connected layers of size 64, using the Mish activation function and LayerNorm. The Mish activation function is $\mathrm{Mish}(x) = x \cdot \tanh(\ln(1 + e^{x}))$. Compared with the ReLU activation function, Mish can alleviate the vanishing gradient problem when the input value is large or small, making it suitable for learning tasks requiring high model performance, such as vibration isolation control. LayerNorm is a commonly used layer normalization technique in deep learning, used to accelerate network training while improving model stability.

[0148] LSTM layer: consists of one hidden layer with 32 hidden units, used to extract temporal features.

[0149] Output layer: outputs the action mean; the activation function is Tanh.
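The Mish activation used in the shared feature extraction layer can be written in a few lines. This is a scalar sketch in plain Python for illustration; the helper name `softplus` and the overflow-safe formulation are choices made here, not taken from the filing.

```python
import math

def softplus(x):
    """Numerically stable ln(1 + exp(x)), safe for large |x|."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x):
    """Mish(x) = x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))
```

For large positive inputs Mish approaches the identity, and for large negative inputs it approaches zero smoothly rather than cutting off at exactly zero as ReLU does.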

[0150] 4.2 Critic Network (Evaluation Network).

[0151] The Critic network evaluates the state value. Its structure is similar to that of the Actor network, but its parameters are independent, meaning that the two networks are updated and optimized separately. The output layer is a linear layer that outputs the value of the current state.

[0152] The specific parameters of the network model are shown in Table 1.

[0153] Table 1 Parameters of the Critic network

[0154]

[0155] In the optimizer parameters, since the Actor network and Critic network are updated and optimized separately, ActorLearningRate represents the learning rate of the Actor network, and CriticLearningRate represents the learning rate of the Critic network. In the training hyperparameters, Timesteps_per_batch means that 3600 timesteps of data are collected from the environment each time. These data will be divided into multiple sequences, where seq_len represents the length of each sequence. Each iteration randomly selects several sequences of data for updating. Here, minibatch_size is 4 sequences of data. total_timesteps represents the total number of training steps, which is 1e7 here.
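The batching scheme described above (3600 collected timesteps cut into sequences, then updated in minibatches of 4 sequences) can be sketched as follows. The function names are hypothetical, and `seq_len` is left as a parameter because its value from Table 1 is not reproduced in the text.

```python
import random

def split_into_sequences(n_steps, seq_len):
    """Return (start, end) index pairs that cut a rollout into sequences.

    The last sequence may be shorter than seq_len, which is the kind of
    variable-length data a packed-sequence representation is meant to handle.
    """
    return [(s, min(s + seq_len, n_steps)) for s in range(0, n_steps, seq_len)]

def minibatches(sequences, minibatch_size, rng):
    """Yield randomly ordered groups of minibatch_size sequences."""
    order = list(sequences)
    rng.shuffle(order)
    for i in range(0, len(order), minibatch_size):
        yield order[i:i + minibatch_size]
```

For example, with a hypothetical `seq_len` of 100, `split_into_sequences(3600, 100)` gives 36 sequences, which `minibatches(..., 4, random.Random(0))` groups into 9 updates per pass.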

[0156] 4.3 Action sampling.

[0157] After the Actor network outputs the action mean, the controller samples the action in the following way:

[0158] $\tilde{a}_t \sim \mathcal{N}(\mu_t, \Sigma)$;

[0159] $a_t = \tanh(\tilde{a}_t)$;

[0160] A raw action $\tilde{a}_t$ is sampled from a multivariate Gaussian distribution $\mathcal{N}(\mu_t, \Sigma)$ with mean $\mu_t$ and covariance matrix $\Sigma$. The raw action is then compressed by applying the Tanh function to obtain the final executed action $a_t$, whose range is limited to the interval [-1, 1].

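[0161] The calculation of the log probability of the policy includes the Jacobian correction:

[0162] $\log \pi_\theta(a_t \mid s_t) = \log \mathcal{N}(\tilde{a}_t; \mu_t, \Sigma) - \sum_i \log\left(1 - \tanh^2(\tilde{a}_{t,i}) + \epsilon_0\right)$;

[0163] where the first term is the log probability of the raw action $\tilde{a}_t$ under the Gaussian distribution, $\epsilon_0$ is a small constant that ensures numerical stability, and the second term on the right-hand side is the logarithm of the Jacobian determinant introduced by the Tanh transformation.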
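For the scalar control force of this scheme, the Tanh-squashed Gaussian sampling and its corrected log probability can be sketched as below. The function names, the use of a scalar standard deviation `sigma` in place of the covariance matrix, and the default `eps` value are assumptions for illustration.

```python
import math
import random

def squashed_log_prob(raw, mu, sigma, eps=1e-6):
    """Log probability of tanh(raw) when raw ~ N(mu, sigma^2).

    The second term is the log-Jacobian of the Tanh transform;
    eps is a small constant for numerical stability.
    """
    gauss_logp = (-0.5 * ((raw - mu) / sigma) ** 2
                  - math.log(sigma) - 0.5 * math.log(2.0 * math.pi))
    return gauss_logp - math.log(1.0 - math.tanh(raw) ** 2 + eps)

def sample_squashed_action(mu, sigma, rng, eps=1e-6):
    """Sample a raw Gaussian action, squash it to [-1, 1], return its log prob."""
    raw = rng.gauss(mu, sigma)
    action = math.tanh(raw)
    return raw, action, squashed_log_prob(raw, mu, sigma, eps)
```

Storing the raw action alongside the squashed one avoids having to invert Tanh when the log probability is recomputed during the policy update.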

[0164] V. Model parameter update based on PPO algorithm.

[0165] After the above action sampling is completed, the corresponding action is executed, that is, the corresponding active control force is applied to the quasi-zero stiffness system. The system dynamic differential equation is solved according to the RK4 numerical integration method to update the new environmental state. At the same time, the environmental reward corresponding to this state-action is returned according to the multi-objective composite reward function. The above interaction process is then continued until a preset number of Timesteps_per_batch of interaction data is collected, which is 3600 in this case, to form the batch data required for training.

[0166] 1. Critic network model update.

[0167] After obtaining sufficient interaction data, the system begins advantage estimation and target value calculation. The core of this step lies in accurately assessing the value advantage of each state-action pair. By defining a loss function for the value function, the deviation between the Critic network output value and the actual value is measured. When the loss function reaches its minimum value, the expected value estimation can be achieved. The calculation process of the state value function will be used as an example here.

[0168] The loss function is defined as follows:

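[0169] $L_V(\phi) = \mathbb{E}_t\left[\left(V_\phi(s_t) - V_t^{\mathrm{target}}\right)^2\right]$;

[0170] where $V_\phi(s_t)$ is the Critic network's value estimate for state $s_t$, and $V_t^{\mathrm{target}}$ is the target (actual) value calculated through generalized advantage estimation (GAE). The specific calculation process is as follows: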

[0171] First, calculate the temporal-difference error (TD error). The TD error reflects the difference between the current value estimate and the actual outcome, and is the basis for subsequent calculations. The calculation formula is as follows:

[0172] $\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$;

[0173] where $r_t$ is the immediate reward calculated from the reward function at the current time step, and $\gamma$ is the discount factor, which is set to 0.97 here.

[0174] Based on the TD error, the system uses generalized advantage estimation (GAE) to calculate the advantage function:

[0175] $\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\, \delta_{t+l}$;

[0176] This advantage function measures how much better taking action $a_t$ in state $s_t$ is compared with the average level, where $\lambda$ is the GAE balancing parameter, set to 0.94 here. The exponentially decaying weighted summation balances the bias and variance of the advantage estimate and provides a more stable learning signal.

[0177] After obtaining the advantage estimate $\hat{A}_t$, the system further calculates the target value function $V_t^{\mathrm{target}}$:

[0178] $V_t^{\mathrm{target}} = \hat{A}_t + V_\phi(s_t)$;
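The TD error, GAE advantage, and target value computations can be sketched with a single backward pass over one trajectory segment. The function name and argument layout are assumptions, and episode-termination flags are omitted for brevity; the defaults use the stated values of 0.97 and 0.94.

```python
def gae(rewards, values, last_value, gamma=0.97, lam=0.94):
    """Return (advantages, targets) for one trajectory segment.

    values[t] is the critic estimate V(s_t); last_value is V(s_T),
    used to bootstrap the final step of the segment.
    """
    n = len(rewards)
    advantages = [0.0] * n
    running = 0.0
    for t in reversed(range(n)):
        next_value = last_value if t == n - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_value - values[t]   # TD error
        running = delta + gamma * lam * running               # discounted sum of TD errors
        advantages[t] = running
    # Target value = advantage estimate + current value estimate.
    targets = [a + v for a, v in zip(advantages, values)]
    return advantages, targets
```

The backward recursion is equivalent to the exponentially weighted sum of TD errors truncated at the end of the segment.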

[0179] The target value function combines the current value estimate and the advantage information. Substituting it into the loss function yields the specific value of the loss. Taking the partial derivative of the loss function $L_V$ with respect to the Critic parameters $\phi$ gives its gradient, and the parameters are then updated by gradient descent, as shown in the following expression:

[0180] $\phi \leftarrow \phi - \alpha_c \nabla_\phi L_V(\phi)$;

[0181] where $\alpha_c$ is the learning rate of the Critic network, set to 0.001.

[0182] 2. Actor network model update.

[0183] The output of the Actor network is the probability distribution over the actions the agent may take in a given state. The parameter updates of the Actor network employ the PPO clipping mechanism, which is crucial for ensuring training stability.

[0184] First, calculate the advantage function using the same method as for the Critic network, since the advantage function also guides the policy update.

[0185] When the advantage is positive, the current action is better than the average level, and the system increases the probability of selecting that action; when the advantage is negative, the system reduces the probability of selecting that action. This mechanism allows the policy network to evolve in the direction of higher cumulative reward, ensuring consistency between policy evaluation and policy improvement.

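[0186] Then calculate the importance sampling ratio, which measures how far the new policy has moved from the old policy, and is expressed as follows:

[0187] $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$;

[0188] where $\theta$ denotes the Actor network parameters, $\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is the probability of taking action $a_t$ in state $s_t$ under the old policy, and $\pi_\theta(a_t \mid s_t)$ is the probability of taking action $a_t$ in state $s_t$ under the new policy.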

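[0189] Based on the importance sampling ratio and the advantage estimate, the Actor loss function is defined as:

[0190] $L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$;

[0191] $\mathrm{clip}(r,\, 1-\epsilon,\, 1+\epsilon) = \min\left(\max\left(r,\, 1-\epsilon\right),\, 1+\epsilon\right)$;

[0192] where $\epsilon$ is the clipping parameter used to limit the magnitude of policy updates; here it is set to 0.2. Through the clipping mechanism, this loss function prevents the training instability caused by an excessively large single update.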
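The clipped surrogate term for a single sample can be sketched as follows, using the standard PPO form. The function name and the choice to pass log probabilities rather than probabilities are assumptions made for illustration.

```python
import math

def clipped_surrogate(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate objective for one (state, action) sample."""
    ratio = math.exp(log_prob_new - log_prob_old)           # importance sampling ratio
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return min(ratio * advantage, clipped * advantage)
```

With a ratio of 1.5 and a positive advantage, the objective is capped at 1.2 times the advantage, so a larger policy change earns no extra credit. With a negative advantage the unclipped term is kept, so the penalty for an overly likely bad action is not softened.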

[0193] In addition, the Actor loss also includes an entropy regularization term:

[0194] $L^{H}(\theta) = \beta\, H\left(\pi_\theta(\cdot \mid s_t)\right)$;

[0195] where $H$ is the policy entropy and $\beta$ is the adaptive entropy coefficient, which is dynamically adjusted based on the difference between the actual entropy of the current policy and the target entropy, to ensure a balance between exploration and exploitation during training. The adjustment formula is as follows:

[0196] ;

[0197] where the formula uses the exponential moving average of the policy entropy and the target entropy value. The entropy regularization term encourages the policy to maintain a certain degree of randomness, thereby promoting exploration.
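One way an entropy coefficient could be adapted from a smoothed entropy and a target entropy is sketched below. The adjustment formula itself is not reproduced in the text, so the multiplicative update rule, the class name, and every default value here are hypothetical and serve only to illustrate the idea.

```python
import math

class AdaptiveEntropyCoef:
    """Hypothetical rule: raise beta when smoothed entropy is below target, lower it when above."""

    def __init__(self, beta=0.01, target_entropy=0.0, ema_decay=0.9,
                 step_size=0.05, beta_min=1e-4, beta_max=0.1):
        self.beta = beta
        self.target_entropy = target_entropy
        self.ema_decay = ema_decay
        self.step_size = step_size
        self.beta_min = beta_min
        self.beta_max = beta_max
        self.ema = None  # exponential moving average of the policy entropy

    def update(self, entropy):
        if self.ema is None:
            self.ema = entropy
        else:
            self.ema = self.ema_decay * self.ema + (1.0 - self.ema_decay) * entropy
        # Multiplicative step keeps beta positive; the clamp keeps it bounded.
        self.beta *= math.exp(self.step_size * (self.target_entropy - self.ema))
        self.beta = min(max(self.beta, self.beta_min), self.beta_max)
        return self.beta
```

Smoothing the entropy before comparing it with the target keeps the coefficient from reacting to batch-to-batch noise.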

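[0198] Therefore, the total Actor loss is the sum of the clipped surrogate objective $L^{\mathrm{CLIP}}$ and the entropy regularization term $L^{H}$:

[0199] $L^{\mathrm{actor}}(\theta) = L^{\mathrm{CLIP}}(\theta) + L^{H}(\theta)$;

[0200] Taking the partial derivative of $L^{\mathrm{actor}}$ with respect to the network parameters $\theta$ gives the gradient of the loss function, and the Actor network parameters are then updated using the gradient ascent method with Actor learning rate $\alpha_a$:

[0201] $\theta \leftarrow \theta + \alpha_a \nabla_\theta L^{\mathrm{actor}}(\theta)$;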

[0202] VI. Model Training.

[0203] The above outlines the entire algorithm's process of state acquisition, action sampling and interaction, reward construction, and network model parameter updates. To accelerate model training and improve the agent's generalization ability, this scheme incorporates a curriculum learning strategy with the following stages:

[0204] Stage 1: Initial training under fixed environmental parameters;

[0205] Stage 2: Training in an environment where the environmental parameters change slowly and periodically over time;

[0206] Stage 3: Training in an environment where the environmental parameters change slowly and the external excitation contains random noise;

[0207] Stage 4: Training in a random-excitation environment;

[0208] In this process, the network parameters obtained from the previous training stage are used as the initial values for the next stage of training.
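The staged training with parameter carry-over can be sketched as a simple loop. The stage flags below are an interpretation of the four stage descriptions, and `train_stage` is a hypothetical callable standing in for a full PPO training run; neither is taken from the filing.

```python
# Interpretive stage flags; the filing describes the stages only in words.
STAGES = [
    {"name": "stage 1", "time_varying_params": False, "noise": False, "random_excitation": False},
    {"name": "stage 2", "time_varying_params": True, "noise": False, "random_excitation": False},
    {"name": "stage 3", "time_varying_params": True, "noise": True, "random_excitation": False},
    {"name": "stage 4", "time_varying_params": True, "noise": True, "random_excitation": True},
]

def run_curriculum(stages, train_stage, initial_params):
    """Train stage by stage, warm-starting each stage from the previous result.

    train_stage(stage, params) is a user-supplied function that trains in the
    environment described by `stage` and returns the updated parameters.
    """
    params = initial_params
    history = []
    for stage in stages:
        params = train_stage(stage, params)
        history.append((stage["name"], params))
    return params, history
```

Keeping the per-stage results in `history` makes it possible to compare how the policy from each stage performs before moving on.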

[0209] During training, convergence is determined when the system displacement response remains below a minimum threshold for an extended period. In this scheme, the threshold is set to [value missing].

[0210] This scheme employs a deep reinforcement learning PPO-LSTM model with integrated temporal feature processing to learn through real-time interaction with the quasi-zero stiffness system environment. Using system displacement and velocity as observation inputs, it captures the system's dynamic characteristics. The vibration isolation problem is modeled as a Markov decision process and solved with the PPO algorithm, guided by a multi-objective reward function that comprehensively considers system response, control input, and system energy. Under different environmental conditions such as damping, stiffness, and excitation, it adjusts and generates the optimal control strategy, ensuring vibration isolation efficiency while also considering control energy output and real-time control performance. Furthermore, a staged training strategy based on curriculum learning allows the controller to adapt gradually to changing conditions, from the simplest fixed-parameter conditions to time-varying environments, and then to more complex environments incorporating random excitation and noise, demonstrating strong generalization ability.

[0211] This scheme constructs a dynamic simulation model containing a nonlinear restoring force term as the training environment and employs a proximal policy optimization algorithm based on generalized advantage estimation to drive the agent for model-free learning. This allows the policy network to learn control strategies through direct interaction with the environment without the need for precise modeling. This process enables the controller to adaptively adjust the control force output online when faced with changes in system parameters such as time-varying damping and time-varying stiffness, overcoming the performance degradation of traditional model-dependent control methods under parameter perturbations. This ensures the stable and efficient operation of the vibration isolation system under complex shipboard conditions.

[0212] This scheme integrates a long short-term memory (LSTM) network layer to construct a policy network and an evaluation network, effectively handling the temporal dependencies between states and actions and extracting dynamic features from historical state sequences. This enables the controller not only to respond to the current system state but also to make more forward-looking and stable control decisions based on historical trends. Combined with the optimized guidance of a multi-objective reward function, it suppresses displacement and acceleration while balancing control energy consumption and command smoothness, thereby fundamentally improving the vibration isolation efficiency and control robustness of the quasi-zero stiffness system in the low-frequency domain and effectively coping with random disturbances.

[0213] During the training phase, this scheme employs a staged learning strategy, gradually transitioning the agent from a simple, fixed-parameter environment to a complex, time-varying environment with random excitation. This process follows a learning pattern from easy to difficult, ensuring stable convergence of the policy network in the early stages of training and progressively strengthening its ability to cope with uncertainties and unknown disturbances in subsequent stages. Therefore, the final deployed policy network generalizes well to conditions outside the training set and can be applied to complex real-world scenarios such as sudden load changes and sea state variations.

[0214] This solution deploys a trained policy network as a lightweight controller online. It only requires real-time displacement and velocity as input, and through forward propagation, it can quickly generate continuous optimal control force commands. This approach avoids complex online optimization calculations and meets real-time requirements. Simultaneously, by penalizing the magnitude and rate of change of the control force in the reward function, effective vibration isolation is achieved while actively constraining the actuator's energy consumption and amplitude, suppressing potential high-frequency excitation, and ensuring the safety and lifespan of the actuator. Thus, a good balance is achieved across multiple dimensions, including control performance, real-time performance, energy consumption, and engineering reliability.

[0215] Correspondingly, the control system for quasi-zero stiffness systems based on deep reinforcement learning includes:

[0216] The state parameter acquisition module is configured to acquire the current state signal of the quasi-zero stiffness vibration isolation system.

[0217] The control module is configured to use the current state signal and the historical state sequence as input to the policy network to obtain the optimal control force. The active actuator of the quasi-zero stiffness vibration isolation system generates an active control force to counteract the vibration based on the optimal control force.

[0218] The state signals include at least displacement and velocity;

[0219] The construction and training process of the policy network includes:

[0220] A dynamic simulation model of a quasi-zero stiffness vibration isolation system containing a nonlinear restoring force term is established as an interactive training environment for intelligent agents.

[0221] In the training environment, a proximal policy optimization algorithm based on generalized advantage estimation is used to drive the agent, which consists of a policy network and an evaluation network, to perform model-free learning. Both the policy network and the evaluation network integrate long short-term memory network layers to handle the temporal dependencies between states and actions.

[0222] During training, a multi-objective reward function is defined based on the system response. The reward function integrates at least displacement suppression, control force cost, and control smoothness indicators to guide the direction of policy optimization.

[0223] By adopting a curriculum learning strategy, the agent is trained in stages in environments ranging from fixed parameters to time-varying parameters with random excitation until the policy network converges, thereby obtaining a control policy that can directly map the system state time-series information to the optimal control force.

[0224] Correspondingly, a computer program product includes computer-readable instructions that, when executed on an electronic device, cause the electronic device to implement the aforementioned control method for a quasi-zero stiffness system based on deep reinforcement learning.

[0225] Accordingly, an electronic device includes at least one processor and a memory connected to the processor, the memory being used to store a computer program; the processor is used to execute the computer program, enabling the electronic device to implement the aforementioned control method for quasi-zero stiffness systems based on deep reinforcement learning.

[0226] Those skilled in the art will understand that embodiments of this solution can be embodied as a method, system, or computer program product. Therefore, this solution can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this solution can also take the form of a computer program product implemented on one or more computer-readable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0227] This solution is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this solution. It should be understood that each block in the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to create a machine such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, can implement one or more blocks in the flowchart and / or one or more blocks in the block diagram, specifying the function.

[0228] These computer program instructions may also be stored in a computer-readable storage medium capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable storage medium generate an article of manufacture containing instruction means. These instruction means are used to implement the functions specified in one or more flowcharts and / or one or more blocks of a block diagram.

[0229] Furthermore, these computer program instructions can be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to generate a computer-implemented processing procedure. Thus, the instructions that execute on the computer or other programmable equipment will provide steps for implementing one or more processes of the flowchart and / or one or more blocks of the block diagram that specify the functionality.

[0230] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A control method for quasi-zero stiffness systems based on deep reinforcement learning, characterized in that it includes the following steps: the current state signal of the quasi-zero stiffness vibration isolation system is obtained; the current state signal and the historical state sequence are used as inputs to the policy network to obtain the optimal control force; the active actuator of the quasi-zero stiffness vibration isolation system generates an active control force to counteract vibration based on the optimal control force; the state signals include at least displacement and velocity; the construction and training process of the policy network includes: a dynamic simulation model of a quasi-zero stiffness vibration isolation system containing a nonlinear restoring force term is established as an interactive training environment for the agent; in the training environment, a proximal policy optimization algorithm based on generalized advantage estimation is used to drive the agent, which consists of a policy network and an evaluation network, to perform model-free learning; both the policy network and the evaluation network integrate long short-term memory network layers to handle the temporal dependencies between states and actions; during training, a multi-objective reward function is defined based on the system response, and the reward function integrates at least displacement suppression, control force cost, and control smoothness indicators to guide the direction of policy optimization; by adopting a curriculum learning strategy, the agent is trained in stages in environments ranging from fixed parameters to time-varying parameters with random excitation until the policy network converges, thereby obtaining a control policy that can directly map the system state time-series information to the optimal control force.

2. The control method for quasi-zero stiffness systems based on deep reinforcement learning as described in claim 1, characterized in that the dynamic model of the quasi-zero stiffness vibration isolation system is shown in the following equation: ; where the quantities are: the mass of the system oscillator; the system displacement, velocity, and acceleration; the time-varying damping coefficient; the linear stiffness coefficient; the external excitation; and the control force. The nonlinear quasi-zero stiffness restoring force term is defined as: ; where the parameters are the lateral nonlinear force when the quasi-zero stiffness system is in equilibrium, and the horizontal distance between the centers of the inner and outer magnetic ring sections of the lateral negative stiffness structure of the quasi-zero stiffness system.

3. The control method for quasi-zero stiffness systems based on deep reinforcement learning as described in claim 1, characterized in that, The policy network receives a state time sequence consisting of the current displacement, current velocity and its historical values, extracts the temporal dynamic features through a long short-term memory network layer, and outputs a normalized continuous control force value.

4. The control method for quasi-zero stiffness systems based on deep reinforcement learning as described in claim 1, characterized in that the multi-objective reward function used during the training of the policy network is shown in the following formula: ; the components are defined as follows: a displacement penalty term with a displacement normalization coefficient; an acceleration penalty term with an acceleration normalization coefficient; a control force penalty term, which constrains control energy consumption; a penalty term for the rate of change of the control force, used to suppress sudden changes in control force; an energy change reward based on the total mechanical energy of the system, with a small coefficient to prevent division by zero; and a sparse success reward, under which a positive reward is given at the final time step when the system displacement stays below the preset tolerance for multiple consecutive time steps; each component has its own weighting coefficient.

5. The control method for quasi-zero stiffness systems based on deep reinforcement learning as described in claim 1, characterized in that both the policy network and the evaluation network integrate long short-term memory (LSTM) network layers, forming the PPO-LSTM agent architecture. The LSTM network layer encodes the input state temporal information, extracts the dynamic temporal features of the system, and outputs the corresponding temporal feature vector. The fully connected layers of the policy network and the evaluation network output action distribution parameters and state value estimates, respectively, based on the temporal feature vectors.

6. The control method for quasi-zero stiffness systems based on deep reinforcement learning as described in claim 1, characterized in that the optimization objective of the proximal policy optimization algorithm based on generalized advantage estimation includes a policy network optimization term, an evaluation network optimization term, and a policy entropy optimization term; the optimization objective term for the evaluation network is: ; where the quantities are the Critic network's value estimate for the state, the target value, and the advantage function calculated using generalized advantage estimation (GAE); the policy entropy optimization term is: ; where the quantities are the policy entropy and the adaptive entropy coefficient; the clipped surrogate objective function, used to optimize the policy network, is calculated as follows: ; where the quantities are the probability ratio between the new and old policies and the clipping hyperparameter.

7. The control method for quasi-zero stiffness systems based on deep reinforcement learning as described in claim 1, characterized in that the training process employs a curriculum learning strategy with the following stages: the first stage involves preliminary training in a training environment where system parameters are fixed and the external excitation is deterministic; the second stage involves training in a training environment where the system's damping and/or stiffness parameters change slowly and periodically over time; the third stage involves training in a training environment where system parameters are time-varying and the external excitation contains random noise components; each subsequent training stage uses the network parameters obtained at convergence of the previous stage as its initial parameters.

8. A control system for a quasi-zero stiffness system based on deep reinforcement learning, characterized in that it includes: a state parameter acquisition module configured to acquire the current state signal of the quasi-zero stiffness vibration isolation system; and a control module configured to use the current state signal and the historical state sequence as input to the policy network to obtain the optimal control force, where the active actuator of the quasi-zero stiffness vibration isolation system generates an active control force to counteract the vibration based on the optimal control force; the state signals include at least displacement and velocity; the construction and training process of the policy network includes: a dynamic simulation model of a quasi-zero stiffness vibration isolation system containing a nonlinear restoring force term is established as an interactive training environment for the agent; in the training environment, a proximal policy optimization algorithm based on generalized advantage estimation is used to drive the agent, which consists of a policy network and an evaluation network, to perform model-free learning; both the policy network and the evaluation network integrate long short-term memory network layers to handle the temporal dependencies between states and actions; during training, a multi-objective reward function is defined based on the system response, and the reward function integrates at least displacement suppression, control force cost, and control smoothness indicators to guide the direction of policy optimization; by adopting a curriculum learning strategy, the agent is trained in stages in environments ranging from fixed parameters to time-varying parameters with random excitation until the policy network converges, thereby obtaining a control policy that can directly map the system state time-series information to the optimal control force.

9. A computer program product, characterized in that: It includes computer-readable instructions that, when executed on an electronic device, cause the electronic device to implement the quasi-zero stiffness system control method based on deep reinforcement learning as described in any one of claims 1-7.

10. An electronic device, characterized in that: It includes at least one processor and a memory connected to the processor, the memory being used to store a computer program; the processor is used to execute the computer program, enabling the electronic device to implement the quasi-zero stiffness system control method based on deep reinforcement learning as described in any one of claims 1-7.