A Multi-Sensor Continuous Ultrasonic Welding Monitoring Method Based on Reinforcement Learning

What is AI technical title?
AI technical title is built by Patsnap AI team. It summarizes the technical point description of the patent document.
By using a multi-sensor array based on reinforcement learning and a deep reinforcement learning network model, dynamic adaptive monitoring and real-time parameter adjustment of the welding process were achieved. This solved the problems of insufficient dynamic adaptive capability and reliance on labeled data in traditional methods, and improved the stability and adaptability of the welding process.

CN122306145APending Publication Date: 2026-06-30SOUTHWEST JIAOTONG UNIV

View PDF 0 Cites 0 Cited by

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: SOUTHWEST JIAOTONG UNIV
Filing Date: 2026-03-19
Publication Date: 2026-06-30

Application Information

Patent Timeline

19 Mar 2026

Application

30 Jun 2026

Publication

CN122306145A

IPC: G01D21/02; B23K20/10; G05B13/04; G06N3/04; G06N3/088; G06N3/092

AI Tagging

Technology Topics

Sensor arrayServo actuator

Explore More Agents

Novelty Search
Search existing technologies and assess novelty
↗
FTO
Analyze whether a product may infringe others' patents
↗
Design FTO
Check prior-design risk for exterior design
↗
Drafting
Draft patent application text based on a technical solution
↗
Find Solutions with TRIZ
Generate feasible solution to solve your technical challenge
↗

Similar Technology Patents

Get free access to AI patent search and analysis

Check patentability, review prior art and ask IP Agent with full patent context.

AI Technical Summary

Technical Problem

Traditional multi-sensor monitoring methods lack dynamic adaptive capabilities and cannot respond to signal characteristic fluctuations and environmental interference during the welding process, resulting in parameter drift. Furthermore, existing intelligent monitoring methods rely on a large amount of labeled data, have insufficient generalization capabilities, and are difficult to adapt to welding scenarios with different materials and structures.

Method used

A multi-sensor continuous ultrasonic welding monitoring method based on reinforcement learning is adopted. Multi-sensor arrays are used to synchronously collect multi-dimensional physical signals during the welding process, a deep reinforcement learning network model is constructed, and a reward function is set by combining near-end policy optimization and deep deterministic policy gradient to achieve real-time parameter adjustment and closed-loop control.

Benefits of technology

It enables dynamic adaptive monitoring of the welding process, quickly responds to changes in working conditions and environmental interference, improves the stability and quality consistency of the welding process, adapts to welding scenarios with different materials and structures, and significantly enhances the intelligence level of ultrasonic welding.

✦ Generated by Eureka AI based on patent content.

Smart Images

Figure CN122306145A_ABST

Patent Text Reader

Abstract

This invention discloses a multi-sensor continuous ultrasonic welding monitoring method based on reinforcement learning, comprising: synchronously acquiring multi-dimensional physical signals such as vibration, pressure, energy, and displacement during the welding process using a multi-sensor array; extracting calibration features of each signal and fusing them to form a high-dimensional state vector; constructing a deep reinforcement learning network model integrating two algorithm architectures, defining corresponding state and action spaces, and setting a reward function based on quality compliance, abnormal parameter states, and adjustment amplitude; within a single welding cycle, inputting the high-dimensional state vector into the model to generate optimal parameter adjustment action commands, and adjusting the ultrasonic generator output and pressure mechanism setpoints in real time through a servo actuator to form a complete closed-loop control process. This method, through the combination of multi-sensor fusion and reinforcement learning, achieves dynamic adaptive monitoring and precise parameter adjustment of the welding process, improving welding quality consistency and process stability.

Need to check novelty before this filing date? Find Prior Art

Description

Technical Field

[0001] This invention relates to the field of ultrasonic welding monitoring technology, and in particular to a multi-sensor continuous ultrasonic welding monitoring method based on reinforcement learning. Background Technology

[0002] Ultrasonic welding, as a highly efficient and clean joining technology, is widely used in precision manufacturing fields such as automobile manufacturing and electronic packaging. Its welding quality directly affects the structural integrity and service reliability of products. With the increasing demands for welding precision, consistency, and intelligence in industrial production, traditional methods relying on manual experience or single-parameter monitoring are no longer sufficient to meet the needs of continuous production. Multi-sensor fusion technology provides a data foundation for comprehensively characterizing the welding state by simultaneously capturing diverse physical signals such as vibration, pressure, energy, and displacement during the welding process. However, how to achieve dynamic matching of signal characteristics and process parameters through intelligent algorithms and construct a real-time closed-loop control mechanism has become a key technical challenge for improving the stability and quality controllability of the ultrasonic welding process. Reinforcement learning-based monitoring methods address this need, aiming to mine the value of multi-sensor data through intelligent decision-making algorithms to achieve adaptive optimization of the welding process.

[0003] Existing technologies have two significant shortcomings: First, traditional multi-sensor monitoring methods mostly employ static feature extraction and fixed threshold judgment modes, lacking the ability to adapt to dynamic changes in the welding process. They cannot adjust monitoring strategies based on real-time fluctuations in signal characteristics, resulting in delayed responses to factors such as parameter drift and environmental interference under complex working conditions, making it difficult to achieve precise dynamic adjustment of process parameters. Second, existing intelligent monitoring methods mostly rely on supervised learning models, requiring a large amount of labeled quality data for training. However, online labeling of ultrasonic welding quality parameters is difficult and costly, and the model's generalization ability is limited by the distribution of training data. It lacks adaptability to welding scenarios with different materials and structures, leading to a disconnect between monitoring results and actual process adjustments, making it difficult to fundamentally solve the quality fluctuation problem in continuous welding processes. Summary of the Invention

[0004] To overcome the shortcomings and deficiencies of existing technologies, this invention provides a multi-sensor continuous ultrasonic welding monitoring method based on reinforcement learning.

[0005] The technical solution adopted in this invention is a multi-sensor continuous ultrasonic welding monitoring method based on reinforcement learning, comprising the following steps: S1, synchronously acquiring multivariate physical signals during the welding process through a multi-sensor array integrated at the calibration position of the welding equipment, wherein the multivariate physical signals include welding head mechanical vibration state signals, welding pressure signals, input energy signals, and welding head position change signals; S2, extracting features from the multivariate physical signals to obtain calibration features corresponding to each signal, wherein the calibration features include the rising slope and peak value of the power curve, the collapse amount and velocity of the displacement curve, and the stability-related features of the pressure curve, and fusing the calibration features to form a high-dimensional state vector characterizing the current welding process state; S3, constructing a deep reinforcement learning network model, wherein the model adopts a proximal strategy optimization. The Actor-Critic architecture, combined with a deep deterministic policy gradient, defines the model's state space as a high-dimensional state vector and its action space as continuous, small-amplitude adjustments to welding calibration process parameters. S4: A reward function is set for the model, based on weld quality compliance, abnormal process parameter states, and parameter adjustment amplitudes. S5: Within a single welding cycle, the high-dimensional state vector is input into the deep reinforcement learning network model, which generates optimal parameter adjustment action instructions through its internal policy network mapping. S6: The servo actuator receives the parameter adjustment action instructions and adjusts the ultrasonic generator output and pressure mechanism settings in real time, forming a closed-loop control process of perception-decision-execution until the welding cycle terminates.

[0006] Furthermore, the policy update formula for the deep reinforcement learning network model is: ,in, For the updated model parameters, To update the model parameters, For learning rate, For trajectory data, As the current strategy, For the old strategy, For action vectors, For state feature mapping function, For the dominant function, This is the cutting factor.

[0007] Furthermore, the formula for calculating the reward function is as follows: ,in, As a reward value, , These are the weighting coefficients. For indicator functions, These are the actual weld joint quality parameters. For quality threshold, For real-time power parameters, This is the upper limit threshold for power. For real-time displacement parameters, This is the lower limit threshold for displacement. Adjust the vector for the parameters.

[0008] Furthermore, the fusion formula for the high-dimensional state vector is: ,in, It is a high-dimensional state vector. For feature concatenation function, For activation function, This is the weight matrix. These are the feature vectors of vibration, pressure, energy, and displacement signals, respectively. This is the bias vector.

[0009] Furthermore, the action output formula of the model is: ,in, For action vectors, The hyperbolic tangent activation function is used. Here is the weight matrix for the action network. This is the bias vector for the action network. Adjust the upper bound vector for the parameters. Adjust the lower bound vector for the parameters.

[0010] Furthermore, the value function update formula of the model is: ,in, For the state value function, For value network parameters, For instant rewards, As a discount factor, The next state value function, The regularization coefficient is . Let the action value function be... For action value network parameters, This is the next state.

[0011] Further, S3 includes the following sub-steps: S31, determining the core architecture of the deep reinforcement learning network model, selecting the Actor-Critic framework that combines proximal policy optimization and deep deterministic policy gradient, and clarifying the hierarchical structure and neuron number configuration of the policy network and value network; S32, defining the state space of the model, using the high-dimensional state vector generated in S2 as the state input, ensuring that the state space can fully cover the multi-dimensional physical characteristics and parameter interaction information of the welding process; S33, dividing the action space of the model, using the continuous small-amplitude adjustment values of the welding calibration process parameters as the action output, and clarifying the adjustment range and step interval of each parameter; S34, setting the training hyperparameters of the model, including the learning rate, discount factor, number of iterations, and batch size, to provide basic configuration conditions for offline training of the model.

[0012] Further, S4 includes the following sub-steps: S41, setting positive reward rules corresponding to weld quality compliance, and determining specific numerical standards for positive rewards based on the quantitative results of core welding quality indicators; S42, formulating negative reward rules corresponding to abnormal process parameters, and setting triggering conditions and values for immediate negative rewards for parameter states indicating defects such as abnormal power spikes and insufficient displacement; S43, designing penalty rules for parameter adjustment amplitude, and determining the value standard of penalty coefficients based on the absolute amplitude and rate of change of parameter adjustments; S44, integrating positive reward rules, negative reward rules, and penalty rules to construct a complete reward function expression, and clarifying the weight allocation ratio corresponding to each rule.

[0013] Further, S5 includes the following sub-steps: S51, at each control time step of a single welding cycle, receiving the real-time high-dimensional state vector generated in S2 and inputting it into the deep reinforcement learning network model that has completed offline training; S52, the model performs feature processing and mapping calculation on the input state through the policy network, and determines the optimal parameter adjustment direction in the current state by combining the evaluation results of the value network; S53, based on the optimal parameter adjustment direction, generating continuous small-amplitude adjustment commands for the ultrasonic generator output and the pressure mechanism set value, forming an action vector; S54, converting the action vector into a control signal that can be recognized by the servo actuator to ensure the real-time performance and accuracy of signal transmission.

[0014] A reinforcement learning-based multi-sensor continuous ultrasonic welding monitoring method is implemented through different units, including: a multi-dimensional physical signal synchronous acquisition unit, integrated at the calibration position of the welding equipment, used to synchronously capture the mechanical vibration state signal of the welding head, welding pressure signal, input energy signal, and welding head position change signal during the welding process, performing parallel acquisition and transmission of multi-dimensional signals; a multi-dimensional feature extraction and fusion unit, which receives the signals transmitted by the multi-dimensional physical signal synchronous acquisition unit, extracts the calibration features of each signal, and generates a high-dimensional state vector through feature splicing and fusion algorithms to complete the integration processing of signal features; and a deep reinforcement learning model construction and training unit, which adopts an Actor-Critic architecture combining proximal policy optimization and deep deterministic policy gradient, defines the state space, action space, and reward function, and optimizes the model parameters through offline training and iteration to generate a model with decision-making capabilities. The system comprises: an intelligent model for decision-making capabilities; a real-time parameter adjustment decision unit that receives high-dimensional state vectors output by the multi-dimensional feature extraction and fusion unit, utilizes the intelligent model generated by the deep reinforcement learning model construction and training unit, maps and generates optimal parameter adjustment action commands, and outputs decision commands in real time; a servo execution drive unit that establishes a signal connection with the real-time parameter adjustment decision unit, receives parameter adjustment action commands, drives the ultrasonic generator and pressure mechanism to adjust parameters, and completes command execution and feedback; and a closed-loop control coordination unit that establishes communication connections with the multi-dimensional physical signal synchronous acquisition unit, the multi-dimensional feature extraction and fusion unit, the deep reinforcement learning model construction and training unit, the real-time parameter adjustment decision unit, and the servo execution drive unit, coordinating the working sequence of each unit to ensure the closed-loop operation of the perception-decision-execution process, and continuously monitoring and controlling the welding process.

[0015] Beneficial Effects: This invention proposes a multi-sensor continuous ultrasonic welding monitoring method based on reinforcement learning. This method synchronously captures multi-dimensional physical signals during the welding process through a multi-sensor array, extracts and fuses features to form a high-dimensional state vector, and constructs a deep reinforcement learning model by combining a near-end policy optimization and a deep deterministic policy gradient architecture. Through reward function setting and real-time decision generation, the process parameters are dynamically adjusted with the help of a servo actuator, forming a complete closed-loop control process. Its beneficial effects are significant and accurately overcome the defects of existing technologies. To address the lack of dynamic adaptive capabilities in traditional methods, this method relies on the real-time decision-making mechanism of a reinforcement learning model. It can dynamically adjust monitoring and control strategies based on signal characteristic fluctuations, quickly responding to changes in operating conditions, parameter drift, and environmental interference, thus achieving precise dynamic optimization of process parameters. Addressing the shortcomings of existing intelligent monitoring methods, such as reliance on large amounts of labeled data, insufficient generalization ability, and lack of closed-loop mechanisms, this method employs an unsupervised training mode reinforcement learning framework. It does not require large-scale quality labeled data. Through a complete closed-loop design of perception-feature fusion-decision-execution, it strengthens the correlation between monitoring results and process adjustments, improving adaptability to welding scenarios with different materials and structures. Simultaneously, through multi-dimensional feature fusion and continuous small-amplitude parameter adjustments, it ensures the stability and quality consistency of the welding process, effectively solving the quality fluctuation problem in continuous welding and significantly improving the intelligence level and production reliability of ultrasonic welding in the precision manufacturing field. Attached Figure Description

[0016] Figure 1 This is a flowchart illustrating the overall process of the method of the present invention. Figure 2 This is a flowchart of method step S3 of the present invention; Figure 3 This is a flowchart of method step S4 of the present invention; Figure 4 This is a flowchart of step S5 of the method of the present invention; Figure 5 This is a diagram showing the system unit composition of the present invention. Detailed Implementation

[0017] It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other. The application will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0018] like Figure 1 As shown, the multi-sensor continuous ultrasonic welding monitoring method based on reinforcement learning includes the following steps: S1, a multi-sensor array integrated into the calibration position of the welding equipment synchronously collects multiple physical signals during the welding process. The multiple physical signals include welding head mechanical vibration state signal, welding pressure signal, input energy signal and welding head position change signal. Specifically, step S1 achieves synchronous acquisition of multiple physical signals through a multi-sensor array integrated into the calibration position of the welding equipment. The sensor array is specifically deployed at the end of the welding head, the contact surface of the pressure actuator, the output interface of the ultrasonic generator, and the displacement detection area of the worktable. It includes a triaxial accelerometer, a high-precision pressure sensor, a power sensor, and a laser displacement sensor. The sampling frequency of each sensor is uniformly set to 10,000 Hz, and the sampling resolution is 16 bits to ensure high timeliness and high accuracy of the original signal acquisition. Among them, the triaxial accelerometer is used to capture the mechanical vibration state signals of the welding head in the X, Y, and Z directions during the welding process, and the measurement frequency range covers 0 to 1000 Hz; the high-precision pressure sensor acquires the welding pressure signal in real time, with a measurement range of 0 to 50 kN and a measurement error controlled within 0.5%; the power sensor focuses on capturing the input energy signal and simultaneously records the instantaneous power and cumulative energy data, with a measurement range of 0 to 5 kW; the laser displacement sensor is used to detect the position change signal of the welding head, with a measurement range of 0 to 50 mm and a resolution of 1 micrometer. Each sensor achieves timing alignment through synchronous trigger signals, with the time synchronization error controlled within 1 millisecond. The acquired raw signals are transmitted to the data processing module in real time via a high-speed data transmission bus, with a transmission delay of no more than 10 milliseconds. This step comprehensively captures the multi-dimensional physical signals in the welding process, providing complete and synchronous raw data support for subsequent feature extraction and state characterization, ensuring that the physical state changes throughout the welding process can be accurately reflected.

[0019] S2, extract features from the multivariate physical signals to obtain calibration features corresponding to each signal. The calibration features include the rising slope and peak value of the power curve, the collapse amount of the displacement curve and the stability-related features of the velocity and pressure curves. The calibration features are fused to form a high-dimensional state vector characterizing the current welding process state. Specifically, step S2 involves calibrating and extracting features from the synchronously acquired multi-dimensional physical signals and performing fusion processing. In this process, a sliding window method is used to extract features from the input energy signal, with a window length set to 50 milliseconds. The rising slope of the power curve in the initial welding stage is calculated, and an adaptive threshold method is used to extract the peak value of the power curve, with a threshold coefficient set to 1.8 times the signal mean. The peak duration is recorded synchronously. For the weld head position change signal, the instantaneous velocity of the displacement curve is obtained through differential calculation. After eliminating high-frequency noise using a 5th-order polynomial fitting method, the collapse amount in the displacement curve, i.e., the maximum sinking of the weld head, is identified. For the welding pressure signal, the standard deviation and coefficient of variation of the pressure curve in the stable stage are calculated, with the standard deviation threshold set to 2% of the rated pressure and the coefficient of variation threshold set to 1.5%, which are used as relevant features for the stability of the pressure curve. After feature extraction, feature standardization was used to map each feature value to the range of 0 to 1. The final extracted calibration features include 28 feature parameters such as the power curve rise slope, peak power, peak duration, displacement collapse, instantaneous displacement velocity, pressure standard deviation, and pressure coefficient of variation. The feature vectors corresponding to each signal were fused using a feature concatenation algorithm. The fusion process used a weighted average method to allocate weights, with the peak power weight set to 0.3, the displacement collapse weight set to 0.28, the pressure standard deviation weight set to 0.22, and the remaining feature weights totaling 0.2. This resulted in a high-dimensional state vector that comprehensively represents the current welding process state, providing accurate state input for the subsequent reinforcement learning model.

[0020] S3. Construct a deep reinforcement learning network model. The model adopts an Actor-Critic architecture that combines proximal policy optimization and deep deterministic policy gradient. The state space of the model is defined as the high-dimensional state vector, and the action space is defined as the continuous small-amplitude adjustment values of the welding calibration process parameters. Specifically, step S3 constructs a deep reinforcement learning network model. The model adopts an Actor-Critic architecture that combines proximal policy optimization and deep deterministic policy gradient. Both the Actor and Critic networks use deep neural network structures. The Actor network includes an input layer, four hidden layers, and an output layer. The number of neurons in the input layer is 28, consistent with the dimension of the high-dimensional state vector. The number of neurons in the four hidden layers is set to 256, 128, 64, and 32, respectively. The number of neurons in the output layer is 3, consistent with the dimension of the welding calibration process parameter adjustment. The Critic network also includes an input layer, four hidden layers, and an output layer. The number of neurons in the input layer is 31, including a 28-dimensional state vector and a 3-dimensional action vector. The number of neurons in the four hidden layers is set to 256, 128, 64, and 32, respectively. The output layer has one neuron for outputting the state value. The model's state space is explicitly defined as a 28-dimensional high-dimensional state vector generated in step S2, with each dimension mapping to a range of 0 to 1. The action space is defined as the continuous, small-amplitude adjustments to the welding calibration process parameters, specifically including ultrasonic power, welding pressure, and welding time. The ultrasonic power adjustment range is set to -5% to +5% of the rated value, the welding pressure range to -3% to +3%, and the welding time range to -10 milliseconds to +10 milliseconds. The adjustment step size for each parameter is set to 0.1% of the rated value or 1 millisecond. Simultaneously, the model is set to 10,000 offline training iterations, a batch size of 64, and an initial learning rate of 0.0003. The model parameters are optimized using a gradient descent algorithm. The reinforcement learning model constructed in this step provides core algorithmic support for subsequent real-time decision-making, achieving precise mapping between states and actions.

[0021] S4, Set the reward function of the model, the reward function is set based on the weld quality compliance status, abnormal state of process parameters and parameter adjustment range; Specifically, step S4 sets the model's reward function around three core dimensions: weld quality compliance, abnormal process parameters, and parameter adjustment range. In implementation, the core weld quality indicators are first defined as welding strength and penetration depth, with quality compliance thresholds set: 30 MPa for welding strength and 0.5 mm for penetration depth. A positive reward is given when all detected weld quality parameters meet these thresholds, with a base value of 10. For abnormal process parameters, clear anomaly judgment criteria are set: the upper limit threshold for real-time power is set at 110% of rated power, and the lower limit threshold for real-time displacement is set at 80% of rated displacement. A negative reward is given when real-time power exceeds the upper limit or real-time displacement falls below the lower limit, with a single negative reward value of 8. If multiple parameters are abnormal simultaneously, the negative reward values are cumulative. For parameter adjustment range, the sum of squares of the adjustment vectors is used as the penalty, with a penalty coefficient of 0.1. This means that the larger the parameter adjustment range, the greater the penalty, thus preventing excessively drastic parameter adjustments that could lead to welding instability. Simultaneously, the weights of the reward function are defined: a positive reward weight of 0.5 is assigned to weld joint quality compliance, a negative reward weight of 0.3 is assigned to abnormal process parameters, and a penalty weight of 0.2 is assigned to the magnitude of parameter adjustments. The final reward value is then calculated through a weighted sum. This step, by scientifically setting the reward function, provides a clear learning direction for the reinforcement learning model, guiding it to generate optimized decisions that ensure both welding quality and process stability, thus ensuring that model training iterates towards improving both welding quality and process stability.

[0022] S5, within a single welding cycle, the high-dimensional state vector is input into the deep reinforcement learning network model, and the model generates optimal parameter adjustment action instructions through internal policy network mapping; Specifically, step S5 generates a high-dimensional state vector input and optimal parameter adjustment action commands within a single welding cycle. The single welding cycle is set to 500 milliseconds, and the control time step is set to 20 milliseconds, meaning 25 real-time decisions and adjustments are completed within each welding cycle. In practice, at each control time step, the real-time high-dimensional state vector generated in step S2 is first received. The values of each dimension of the vector are checked to ensure they are within a reasonable range of 0 to 1. After successful checking, the vector is input into a deep reinforcement learning network model that has been trained offline. After receiving the state vector, the model performs feature transformation and mapping calculations sequentially through the input and hidden layers of the Actor network. The hidden layer uses the ReLU activation function for nonlinear transformation, and the output layer uses the Sigmoid activation function to map the output value to the corresponding adjustment range in the action space. Simultaneously, the Critic network evaluates the value of the current state and candidate actions, outputting the state value evaluation result. The model combines the action output of the Actor network and the value evaluation result of the Critic network, and uses policy gradient optimization logic to select the optimal parameter adjustment direction, generating continuous small-amplitude adjustment commands for ultrasonic power, welding pressure, and welding time, forming a three-dimensional action vector. After the motion vector is generated, a range limitation check is performed to ensure that the adjustment values of each dimension do not exceed the set adjustment range. After the check is passed, it is converted into a digital control signal that can be recognized by the servo actuator. The signal transmission rate is set to 1000 baud rate. This step achieves dynamic adaptive adjustment of the welding process by receiving status information in real time and quickly generating optimization decisions, providing core decision support for closed-loop control.

[0023] S6, the parameter adjustment action command is received through the servo actuator, and the output of the ultrasonic generator and the set value of the pressure mechanism are adjusted in real time to form a closed-loop control process of perception-decision-execution until the welding cycle ends.

[0024] Specifically, step S6 involves receiving parameter adjustment commands through a servo actuator and completing real-time parameter adjustments, forming a closed-loop control process of perception-decision-execution. The servo actuator includes an ultrasonic power servo controller, a pressure servo driver, and a time control module. The response time of each actuator is set to 5 milliseconds, and the control accuracy is set to 0.1% of the rated value. In the specific implementation process, after receiving the adjustment command, the ultrasonic power servo controller adjusts the output power of the ultrasonic generator in real time. The adjustment process adopts a proportional-integral-derivative control algorithm, with the proportional coefficient set to 0.3, the integral coefficient set to 0.1, and the derivative coefficient set to 0.05 to ensure the smoothness and accuracy of power adjustment. After receiving the command, the pressure servo driver drives the hydraulic or pneumatic system of the pressure mechanism to adjust the output pressure. The pressure adjustment response time is controlled within 8 milliseconds, and the pressure control accuracy reaches 0.05 kN. The time control module corrects the remaining welding cycle time according to the adjustment command to ensure the real-time adjustment of welding time. After the actuator completes parameter adjustment, it synchronously collects the adjusted actual parameter values and feeds them back to the data processing module. The feedback delay is no more than 10 milliseconds. The data processing module compares the feedback parameters with the adjustment command and calculates the adjustment error. If the error exceeds 0.5%, a second fine-tuning is triggered. This step, through the precise execution and feedback verification of the servo actuator, ensures that the parameter adjustment command can be implemented quickly and accurately, forming a complete closed-loop control link. This link is continuously executed in each welding cycle until the welding cycle ends, effectively ensuring the stability and quality consistency of the welding process and solving the problems of lag and insufficient adjustment accuracy in traditional open-loop control.

[0025] Preferably, the policy update formula of the deep reinforcement learning network model is: ,in, For the updated model parameters, To update the model parameters, For learning rate, For trajectory data, As the current strategy, For the old strategy, For action vectors, For state feature mapping function, For the dominant function, This is the cutting factor.

[0026] Specifically, the policy update implementation of the deep reinforcement learning network model employs a gradient ascent algorithm to iteratively optimize model parameters, ensuring that the policy remains exploratory while avoiding excessive deviation from historically effective policies during the update process. During the update process, the learning rate is set to 0.0003, a value verified through multiple experiments to ensure update efficiency while preventing parameter oscillations. Trajectory data is selected from 64 complete welding process state-action-reward sequences in each training batch to ensure data diversity and representativeness. The advantage function is calculated using the temporal difference residual method, quantifying the merit of actions by combining the difference between the current state value and the discounted value of the next state with the immediate reward. The pruning factor is set to 0.2 to limit the magnitude of policy updates and prevent training instability caused by excessive differences between the new and old policies. In practice, state-action pairs are first sampled from historical trajectory data to calculate the advantage function value. Then, the gradient direction and magnitude are adjusted by a scaling factor. Finally, the gradient is superimposed on the original model parameters to complete the update. Each iteration only updates some network layer parameters, and the iteration interval is set to 10 training batches. This update mechanism can effectively balance the model convergence speed and stability, ensuring that the strategy continues to evolve towards the optimal direction.

[0027] Preferably, the formula for calculating the reward function is: ,in, As a reward value, , These are the weighting coefficients. For indicator functions, These are the actual weld joint quality parameters. For quality threshold, For real-time power parameters, This is the upper limit threshold for power. For real-time displacement parameters, This is the lower limit threshold for displacement. Adjust the vector for the parameters.

[0028] Specifically, the calculation logic and implementation rules of the reward function integrate evaluation indicators from different dimensions through weighted summation, providing a clear learning guide for the reinforcement learning model. The weight coefficients are calibrated through offline simulation and actual welding tests. The positive reward weight corresponding to weld quality compliance is set at 0.5, the negative reward weight corresponding to abnormal process parameters is set at 0.3, and the penalty weight corresponding to parameter adjustment magnitude is set at 0.2, ensuring that the reward function prioritizes quality objectives while also considering process stability and adjustment smoothness. Quality thresholds are set according to industry standards and product design requirements: the welding strength compliance threshold is 30 MPa, and the penetration depth compliance threshold is 0.5 mm. A positive reward is triggered only when both indicators are met simultaneously, with a base value of 10. The criteria for judging abnormal parameters are clearly defined: the real-time power upper limit threshold is 110% of the rated power, and the real-time displacement lower limit threshold is 80% of the rated displacement. Exceeding any indicator triggers a negative reward, with a single negative reward of 8. For multiple abnormalities, the reward values are accumulated. The penalty coefficient is set at 0.1, and the penalty is quantified by calculating the sum of the squares of the values of each dimension of the parameter adjustment vector; the larger the adjustment magnitude, the heavier the penalty. During implementation, weld quality parameters and process parameters are collected in real time, and reward and punishment conditions are determined by comparing them with thresholds. The final reward value is calculated according to weights, providing a quantitative basis for updating the model strategy.

[0029] Preferably, the fusion formula for the high-dimensional state vector is: ,in, It is a high-dimensional state vector. For feature concatenation function, For activation function, This is the weight matrix. These are the feature vectors of vibration, pressure, energy, and displacement signals, respectively. This is the bias vector.

[0030] Specifically, the fusion of high-dimensional state vectors integrates the calibration features of multi-sensor signals into a unified state representation through feature mapping and concatenation algorithms, providing comprehensive and accurate state information for model input. During the fusion process, the weight matrices of each signal feature vector are determined through offline training optimization. The weight matrix dimensions for vibration signal feature vectors are 28×64, pressure signal is 28×48, energy signal is 28×32, and displacement signal is 28×24, ensuring a reasonable weight allocation for different signal features. The activation function uses a modified linear unit, which effectively alleviates the gradient vanishing problem and improves the nonlinear expressive power of the feature mapping. The initial values of each element of the bias vector are set to 0.01 to prevent the output values from saturating during network initialization. The implementation steps are as follows: First, the calibration feature vectors of vibration, pressure, energy, and displacement signals are input into the corresponding fully connected layers. Through linear transformation of the weight matrix and bias vector, combined with nonlinear processing of the activation function, high-dimensional mapping features of each signal are generated. Then, a feature concatenation algorithm is used to concatenate the mapping features end to end in the order of vibration, pressure, energy, and displacement to form a high-dimensional state vector with a dimension of 256. During the concatenation process, the dimensions of each feature are normalized to ensure that the numerical range is consistent. This fusion method can fully preserve the unique information of each signal, while mining the potential correlation between features, and improving the completeness and effectiveness of the state representation.

[0031] Preferably, the action output formula of the model is: ,in, For action vectors, The hyperbolic tangent activation function is used. Here is the weight matrix for the action network. This is the bias vector for the action network. Adjust the upper bound vector for the parameters. Adjust the lower bound vector for the parameters.

[0032] Specifically, the generation mechanism of the model's action output transforms the high-dimensional state vector into parameter adjustment instructions that meet the actual process requirements through network computation and range mapping, ensuring the rationality and feasibility of the action output. The action network weight matrix is set to 256×3, corresponding to a 256-dimensional input state vector and a 3-dimensional output action vector. The weight parameters are optimized iteratively through offline training, and the initial values are initialized using the Xavier method to ensure consistent output variance across network layers. The action network bias vector has a dimension of 3, with each element initially set to 0 and gradually adjusted during training. The hyperbolic tangent activation function maps the linear output of the network to the range of -1 to 1, and then the actual adjustment values are obtained by scaling and translating the upper and lower limit vectors of parameter adjustment. The elements of the upper limit vector of parameter adjustment are set to 5%, 3%, and 10 milliseconds of the rated value, corresponding to the maximum adjustment range of ultrasonic power, welding pressure, and welding time, respectively; the elements of the lower limit vector are set to -5%, -3%, and -10 milliseconds of the rated value to ensure that the adjustment range is within the allowable range of the process. During implementation, after the high-dimensional state vector is input into the action network, it undergoes linear calculation of the weight matrix and bias vector, and is processed by the activation function to obtain the intermediate output. Then, it is calculated with the upper and lower limit vectors of parameter adjustment to generate the final action vector. Each element of the action vector corresponds to the adjustment value of ultrasonic power, welding pressure, and welding time, respectively. This output mechanism can ensure that the action command conforms to the model decision logic and meets the constraints of the actual welding process.

[0033] Preferably, the value function update formula of the model is: ,in, For the state value function, For value network parameters, For instant rewards, As a discount factor, The next state value function, The regularization coefficient is . Let the action value function be... For action value network parameters, This is the next state.

[0034] Specifically, the update rule for the model's value function, through temporal difference learning and regularization, improves the accuracy and generalization ability of the value function in assessing state-action value. The value network parameters include a weight matrix and bias vectors. The weight matrix dimensions are 31×128, 128×64, and 64×1, while the bias vector dimensions are 128, 64, and 1, respectively. The parameters are initialized using the He initialization method to adapt to the characteristics of the modified linear unit activation function. The discount factor is set to 0.95, a value that balances immediate and future rewards, ensuring the model focuses on both optimizing the current welding state and considering long-term quality stability. The regularization coefficient is set to 0.01 to penalize the deviation between the action value function and the state value function, preventing the value assessment from overfitting the training data. During implementation, data on the current state, action, immediate reward, and next state are first collected. The value assessment result of the next state is calculated through the target value network, weighted by a discount factor, and then summed with the immediate reward to obtain the target value. Next, the difference between the current state-action value and the current state value is calculated, squared, and multiplied by a regularization coefficient to obtain the regularization term. Finally, the target value and the regularization term are combined, and the value network parameters are updated through the gradient descent algorithm. The learning rate for each parameter update is set to 0.0005, and the batch size is 64. This update mechanism can effectively reduce value assessment error and improve the reliability of model decision-making.

[0035] Preferred, such as Figure 2 The S3 step includes the following sub-steps: S31, determining the core architecture of the deep reinforcement learning network model, selecting the Actor-Critic framework that combines proximal policy optimization and deep deterministic policy gradient, and clarifying the hierarchical structure and neuron number configuration of the policy network and value network; S32, defining the state space of the model, using the high-dimensional state vector generated in S2 as the state input, ensuring that the state space can fully cover the multi-dimensional physical characteristics and parameter interaction information of the welding process; S33, dividing the action space of the model, using the continuous small-amplitude adjustment values of the welding calibration process parameters as the action output, and clarifying the adjustment range and step interval of each parameter; S34, setting the training hyperparameters of the model, including the learning rate, discount factor, number of iterations and batch size, to provide basic configuration conditions for offline training of the model.

[0036] Specifically, the deep reinforcement learning network model construction process in step S3 is implemented in four sub-steps to ensure a reasonable model architecture and scientific parameter configuration. In S31, when determining the core architecture, the Actor-Critic framework, combining proximal policy optimization and deep deterministic policy gradient, is selected. Both the policy network and the value network use deep neural network structures. The policy network is responsible for action generation, and the value network is responsible for value evaluation; they work together to achieve policy optimization. The policy network is defined to include an input layer, four hidden layers, and an output layer. The value network's hierarchical structure is consistent with the policy network, with the number of hidden layer neurons set to 256, 128, 64, and 32 respectively, ensuring sufficient feature extraction and nonlinear mapping capabilities. In S32, when defining the state space, the high-dimensional state vector generated in step S2 is directly used as input. This vector has 28 dimensions and includes calibration features of vibration, pressure, energy, and displacement signals. The values of each dimension are uniformly mapped to the interval 0 to 1, ensuring the standardization and consistency of the state space. In S33, when dividing the action space, the action output is clearly defined as a continuous, small-amplitude adjustment value of three calibrated process parameters: ultrasonic power, welding pressure, and welding time. The ultrasonic power adjustment range is set to -5% to +5% of the rated value, the welding pressure range to -3% to +3% of the rated value, and the welding time range to -10 milliseconds to +10 milliseconds. The adjustment step size for each parameter is set to 0.1% of the rated value or 1 millisecond to ensure the precision and feasibility of the action adjustment. In S34, when setting the training hyperparameters, the learning rate is set to 0.0003, the discount factor to 0.95, the number of iterations to 10,000, and the batch size to 64. These parameter configurations ensure the convergence speed and stability of the model training. These four progressively layered steps comprehensively construct a reinforcement learning model suitable for ultrasonic welding monitoring.

[0037] Preferred, such as Figure 3 The S4 step includes the following sub-steps: S41, setting positive reward rules corresponding to weld quality compliance, and determining specific numerical standards for positive rewards based on the quantitative results of core welding quality indicators; S42, formulating negative reward rules corresponding to abnormal process parameters, and setting triggering conditions and values for immediate negative rewards for parameter states indicating defects such as abnormal power spikes and insufficient displacement; S43, designing penalty rules for parameter adjustment amplitude, and determining the value standard of the penalty coefficient based on the absolute amplitude and rate of change of parameter adjustments; S44, integrating positive reward rules, negative reward rules, and penalty rules to construct a complete reward function expression and clarifying the weight allocation ratio corresponding to each rule.

[0038] Specifically, the reward function setting process in step S4 clarifies the logic and parameter standards for each rule through four sub-steps, ensuring that the reward function can accurately guide model learning. In step S31, when setting the positive reward rule, welding strength and penetration depth are used as core quality indicators. Referring to industry standards and product design requirements, the threshold for welding strength is set at 30 MPa, and the threshold for penetration depth is set at 0.5 mm. A positive reward is triggered only when both indicators are met simultaneously. The base value for the positive reward is set to 10, a value verified through multiple simulations to effectively incentivize the model to generate high-quality actions. In step S42, when setting the negative reward rule, two key abnormal states are focused on abnormal power spikes and insufficient displacement. The upper limit threshold for real-time power is set to 110% of the rated power, and the lower limit threshold for real-time displacement is set to 80% of the rated displacement. If either indicator exceeds the set range, it is considered a parameter abnormality and a negative reward is triggered. The negative reward value for a single abnormality is set to 8, and the reward values are stacked for multiple abnormalities, thereby strengthening the model's awareness of avoiding abnormal states. When designing the penalty rules in S43, the sum of the squares of the parameter adjustment amplitudes is used as the basis for penalty, with a penalty coefficient set at 0.1. This coefficient effectively balances the adjustment effect and process stability, avoiding fluctuations in the welding process due to excessively drastic parameter adjustments, and ensuring smooth and controllable adjustment actions. In S44, when integrating the rules, a weighted summation method is used to construct a complete reward function, where the positive reward weight is set at 0.5, the negative reward weight at 0.3, and the penalty weight at 0.2. This weight allocation clarifies the priority of model learning: first ensuring welding quality, then focusing on process stability, and finally controlling the adjustment amplitude. These four steps are closely linked, forming a scientifically sound reward mechanism.

[0039] Preferred, such as Figure 4 The S5 process includes the following steps: S51, at each control time step of a single welding cycle, receiving the real-time high-dimensional state vector generated in S2 and inputting it into a deep reinforcement learning network model that has completed offline training; S52, the model performs feature processing and mapping calculation on the input state through the policy network, and determines the optimal parameter adjustment direction under the current state by combining the evaluation results of the value network; S53, based on the optimal parameter adjustment direction, generating continuous small-amplitude adjustment commands for the ultrasonic generator output and the pressure mechanism setpoint, forming an action vector; S54, converting the action vector into a control signal that can be recognized by the servo actuator to ensure the real-time performance and accuracy of signal transmission.

[0040] Specifically, the real-time decision generation process in step S5 clarifies the entire process of signal flow, feature processing, action generation, and signal conversion through four sub-steps, ensuring accurate and real-time decision commands. In step S51, within a single welding cycle, the real-time high-dimensional state vector generated in step S2 is received at 20-millisecond control time steps. A single welding cycle is set to 500 milliseconds, meaning 25 decision iterations are completed per cycle. After receiving the vector, a range check is performed to ensure that the values of each dimension are within a reasonable range of 0 to 1. If the check fails, signal correction is triggered. If the check passes, the data is immediately input into a deep reinforcement learning network model that has undergone 10,000 offline training iterations, ensuring the validity of the input data and the maturity of the model. In step S52, after the model receives the state vector, the policy network first performs linear transformation and nonlinear activation processing on the input features. The activation function uses a modified linear unit, and deep features are extracted step-by-step through four hidden layers. Simultaneously, the value network evaluates the value of the current state and candidate actions, outputting the state value evaluation result. Both work together to determine the optimal parameter adjustment direction, ensuring that the generated actions meet both process requirements and possess value advantages. Based on the optimal adjustment direction, S53 generates continuous, minute adjustment commands for ultrasonic power, welding pressure, and welding time, forming a three-dimensional motion vector. During generation, the adjustment range strictly adheres to the motion space settings to avoid exceeding process-allowed thresholds, ensuring the feasibility of the motion commands. S54 converts the motion vector into digital control signals recognizable by the servo actuator. The signal transmission rate is set to 1000 baud, with transmission delay controlled within 10 milliseconds to ensure real-time performance and accuracy. These four interconnected steps achieve rapid response from state input to decision output, providing core support for closed-loop control.

[0041] like Figure 5As shown, a multi-sensor continuous ultrasonic welding monitoring method based on reinforcement learning is characterized by being implemented through different units, including: a multi-dimensional physical signal synchronous acquisition unit, integrated at the calibration position of the welding equipment, used to synchronously capture the mechanical vibration state signal of the welding head, welding pressure signal, input energy signal, and welding head position change signal during the welding process, performing parallel acquisition and transmission of multi-dimensional signals; a multi-dimensional feature extraction and fusion unit, receiving the signals transmitted by the multi-dimensional physical signal synchronous acquisition unit, extracting the calibration features of each signal, and generating a high-dimensional state vector through feature splicing and fusion algorithms to complete the integration processing of signal features; and a deep reinforcement learning model construction and training unit, employing an Actor-Critic architecture combining proximal policy optimization and deep deterministic policy gradient, defining the state space, action space, and reward function, and iteratively optimizing the model parameters through offline training. The system generates an intelligent model with decision-making capabilities; a real-time parameter adjustment decision unit receives a high-dimensional state vector output by the multi-dimensional feature extraction and fusion unit, utilizes the intelligent model generated by the deep reinforcement learning model construction and training unit, maps it to generate the optimal parameter adjustment action command, and outputs the decision command in real time; a servo execution drive unit establishes a signal connection with the real-time parameter adjustment decision unit, receives the parameter adjustment action command, drives the ultrasonic generator and pressure mechanism to adjust parameters, and completes the execution and feedback of the command; a closed-loop control coordination unit establishes communication connections with the multi-dimensional physical signal synchronous acquisition unit, the multi-dimensional feature extraction and fusion unit, the deep reinforcement learning model construction and training unit, the real-time parameter adjustment decision unit, and the servo execution drive unit, respectively, coordinates the working sequence of each unit, ensures the closed-loop operation of the perception-decision-execution process, and performs continuous monitoring and control of the welding process.

[0042] A multi-sensor continuous ultrasonic welding monitoring method based on reinforcement learning was developed, constructing a closed-loop monitoring system that deeply integrates multi-sensor fusion and reinforcement learning. This system comprehensively captures diverse physical signals during the welding process using a multi-sensor array. Feature extraction and fusion are then used to form a high-dimensional information carrier that fully characterizes the process state. Intelligent decision-making is achieved through an advanced hybrid architecture reinforcement learning model, and real-time parameter adjustments are completed via a servo actuator, forming a complete control chain from signal perception to command execution. This method not only achieves comprehensive perception of the diverse states of the welding process, avoiding the limitations of single-signal monitoring, but also ensures the smoothness and stability of process optimization through continuous, small-amplitude parameter adjustments, significantly improving the controllability and quality consistency of the welding process.

[0043] This method addresses the lack of dynamic adaptive capabilities in traditional methods. Through the real-time decision-making mechanism of a reinforcement learning model, it can flexibly adjust the control strategy based on real-time fluctuations in signal characteristics, quickly responding to changes in operating conditions, parameter drift, and environmental interference, thus completely changing the passive mode of traditional static threshold monitoring. Addressing the issues of existing intelligent methods relying on large amounts of labeled data, insufficient generalization ability, and lack of closed-loop mechanisms, this method employs unsupervised training logic of reinforcement learning, achieving model optimization without requiring large-scale quality labeled data. Through a complete closed-loop design of perception-feature fusion-decision-execution, it strengthens the direct correlation between monitoring results and process adjustments, significantly improving adaptability to welding scenarios with different materials and structures, and fundamentally solving the problem of quality fluctuations in continuous welding processes.

[0044] In the description of this invention, it should be noted that, unless otherwise explicitly specified and limited, the terms "set," "install," "connect," "link," and "fix" should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral connection; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; and they can refer to the internal communication between two components. Those skilled in the art will understand the specific meaning of the above terms in this invention based on the specific circumstances.

[0045] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various equivalent changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims

1. A multi-sensor continuous ultrasonic welding monitoring method based on reinforcement learning, characterized in that, Includes the following steps: S1, a multi-sensor array integrated into the calibration position of the welding equipment synchronously collects multiple physical signals during the welding process. The multiple physical signals include welding head mechanical vibration state signal, welding pressure signal, input energy signal and welding head position change signal. S2, extract features from the multivariate physical signals to obtain calibration features corresponding to each signal. The calibration features include the rising slope and peak value of the power curve, the collapse amount of the displacement curve and the stability-related features of the velocity and pressure curves. The calibration features are fused to form a high-dimensional state vector characterizing the current welding process state. S3. Construct a deep reinforcement learning network model. The model adopts an Actor-Critic architecture that combines proximal policy optimization and deep deterministic policy gradient. The state space of the model is defined as the high-dimensional state vector, and the action space is defined as the continuous small-amplitude adjustment values of the welding calibration process parameters. S4, Set the reward function of the model, the reward function is set based on the weld quality compliance status, abnormal state of process parameters and parameter adjustment range; S5, within a single welding cycle, the high-dimensional state vector is input into the deep reinforcement learning network model, and the model generates optimal parameter adjustment action instructions through internal policy network mapping; S6, the parameter adjustment action command is received through the servo actuator, and the output of the ultrasonic generator and the set value of the pressure mechanism are adjusted in real time to form a closed-loop control process of perception-decision-execution until the welding cycle ends.

2. The multi-sensor continuous ultrasonic welding monitoring method based on reinforcement learning according to claim 1, characterized in that, The policy update formula for the deep reinforcement learning network model is: ,in, For the updated model parameters, To update the model parameters, For learning rate, For trajectory data, As the current strategy, This is the old strategy. For action vectors, For state feature mapping function, For the dominant function, This is the cutting factor.

3. The multi-sensor continuous ultrasonic welding monitoring method based on reinforcement learning according to claim 1, characterized in that, The formula for calculating the reward function is as follows: ,in, As a reward value, , These are the weighting coefficients. For indicator functions, These are the actual weld joint quality parameters. For quality threshold, For real-time power parameters, This is the upper limit threshold for power. For real-time displacement parameters, This is the lower limit threshold for displacement. Adjust the vector for the parameters.

4. The multi-sensor continuous ultrasonic welding monitoring method based on reinforcement learning according to claim 1, characterized in that, The fusion formula for the high-dimensional state vector is: ,in, It is a high-dimensional state vector. For feature concatenation function, For activation function, This is the weight matrix. These are the feature vectors of vibration, pressure, energy, and displacement signals, respectively. This is the bias vector.

5. The multi-sensor continuous ultrasonic welding monitoring method based on reinforcement learning according to claim 1, characterized in that, The action output formula of the model is: ,in, For action vectors, The hyperbolic tangent activation function is used. Here is the weight matrix of the action network. This is the bias vector for the action network. Adjust the upper bound vector for the parameters. Adjust the lower bound vector for the parameters.

6. The multi-sensor continuous ultrasonic welding monitoring method based on reinforcement learning according to claim 1, characterized in that, The value function update formula for the model is: ,in, For the state value function, For value network parameters, For instant rewards, As a discount factor, The next state value function, The regularization coefficient is . Let the action value function be... For action value network parameters, This is the next state.

7. The multi-sensor continuous ultrasonic welding monitoring method based on reinforcement learning according to claim 1, characterized in that, S3 includes the following sub-steps: S31, determining the core architecture of the deep reinforcement learning network model, selecting the Actor-Critic framework that combines proximal policy optimization and deep deterministic policy gradient, and clarifying the hierarchical structure and neuron number configuration of the policy network and value network; S32, defining the state space of the model, using the high-dimensional state vector generated in S2 as the state input, ensuring that the state space can fully cover the multi-dimensional physical characteristics and parameter interaction information of the welding process; S33, dividing the action space of the model, using the continuous small-amplitude adjustment values of the welding calibration process parameters as the action output, and clarifying the adjustment range and step interval of each parameter; S34, setting the training hyperparameters of the model, including learning rate, discount factor, number of iterations and batch size, to provide basic configuration conditions for offline training of the model.

8. The multi-sensor continuous ultrasonic welding monitoring method based on reinforcement learning according to claim 1, characterized in that, S4 includes the following sub-steps: S41, setting positive reward rules corresponding to weld quality compliance, and determining specific numerical standards for positive rewards based on the quantitative results of core welding quality indicators; S42, formulating negative reward rules corresponding to abnormal process parameters, and setting triggering conditions and values for immediate negative rewards for parameter states that indicate defects such as abnormal power spikes and insufficient displacement; S43, designing penalty rules for parameter adjustment range, and determining the value standard of the penalty coefficient based on the absolute range and rate of change of parameter adjustments. S44 integrates positive reward rules, negative reward rules, and penalty rules to construct a complete reward function expression and clarifies the weight allocation ratio corresponding to each rule.

9. The multi-sensor continuous ultrasonic welding monitoring method based on reinforcement learning according to claim 1, characterized in that, S5 includes the following sub-steps: S51, at each control time step of a single welding cycle, the real-time high-dimensional state vector generated in S2 is received and input into the deep reinforcement learning network model that has completed offline training; S52, the model performs feature processing and mapping calculation on the input state through the policy network, and determines the optimal parameter adjustment direction in the current state by combining the evaluation results of the value network; S53, based on the optimal parameter adjustment direction, continuous small-amplitude adjustment commands are generated for the ultrasonic generator output and the pressure mechanism set value, forming an action vector; S54, the action vector is converted into a control signal that can be recognized by the servo actuator to ensure the real-time performance and accuracy of signal transmission.

10. The multi-sensor continuous ultrasonic welding monitoring method based on reinforcement learning according to any one of claims 1-9, characterized in that, This method is implemented through different units, including: The multi-signal synchronous acquisition unit is integrated into the calibration position of the welding equipment. It is used to synchronously capture the mechanical vibration status signal of the welding head, the welding pressure signal, the input energy signal and the welding head position change signal during the welding process, and to perform parallel acquisition and transmission of multi-signals. The multi-dimensional feature extraction and fusion unit receives signals transmitted by the multi-dimensional physical signal synchronous acquisition unit, extracts the calibration features of each signal, and generates a high-dimensional state vector through feature splicing and fusion algorithms to complete the integration and processing of signal features. The deep reinforcement learning model building and training unit adopts an Actor-Critic architecture that combines proximal policy optimization and deep deterministic policy gradient. It defines the state space, action space and reward function, and optimizes the model parameters through offline training to generate an intelligent model with decision-making capabilities. The real-time parameter adjustment decision unit receives the high-dimensional state vector output by the multi-dimensional feature extraction and fusion unit, uses the deep reinforcement learning model to construct and train the intelligent model generated by the unit, maps and generates the optimal parameter adjustment action command, and outputs the decision command in real time. The servo execution drive unit establishes a signal connection with the real-time parameter adjustment decision unit, receives parameter adjustment action commands, drives the ultrasonic generator and pressure mechanism to adjust parameters, and completes the execution and feedback of commands. The closed-loop control coordination unit establishes communication connections with the multi-dimensional physical signal synchronous acquisition unit, the multi-dimensional feature extraction and fusion unit, the deep reinforcement learning model construction and training unit, the real-time parameter adjustment decision unit, and the servo execution drive unit, respectively, to coordinate the working sequence of each unit, ensure the closed-loop operation of the perception-decision-execution process, and perform continuous monitoring and control of the welding process.