Adaptive learning method, device and equipment for robot arm control and storage medium

By using an adaptive learning-based robotic arm control method, and combining offline and online datasets to train and generate an adaptive control model, the problem of existing robotic arm control schemes being unable to dynamically adjust stiffness and force feedback is solved, enabling high-precision and high-stability operation of the robotic arm under complex working conditions.

CN122253232A · Pending · Publication Date: 2026-06-23 · SHENZHEN SMARTMORE TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHENZHEN SMARTMORE TECH CO LTD
Filing Date
2026-05-28
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing robotic arm compliant control solutions lack effective utilization of offline prior knowledge and cannot dynamically adjust stiffness and force feedback, resulting in insufficient operational stability and control precision, making it difficult to meet the refined control requirements of complex industrial scenarios.

Method used

An adaptive learning-based robotic arm control method is adopted. A first initial model and a second initial model are jointly trained on an offline dataset to obtain reference models. The reference models are then jointly trained on a hybrid dataset that combines the offline data with online data to obtain target models, from which an adaptive control model that generates and selects control parameters is determined. A shared state encoder keeps the feature space consistent between the flow model and the Q-function network model.

Benefits of technology

It improves the adaptive capability of robotic arm control, enhances operational safety and control precision, and can dynamically output optimal control parameters under different working conditions, achieving high-precision and high-stability compliant control.

✦ Generated by Eureka AI based on patent content.


Abstract

This application relates to an adaptive learning-based robotic arm control method, apparatus, device, and storage medium. The method includes: jointly training a first initial model and a second initial model based on an offline dataset until a first preset condition is met, obtaining a first reference model and a second reference model; obtaining an online dataset corresponding to the robotic arm based on the first and second reference models; determining a hybrid dataset based on the offline and online datasets; jointly training the first and second reference models based on the hybrid dataset until a second preset condition is met, obtaining a first target model and a second target model; determining an adaptive control model based on the first and second target models; the adaptive control model is used to generate and filter optimal parameter pairs for the robotic arm in real time to achieve adaptive control of the robotic arm. Using this application can improve the adaptive capability of robotic arm control.

Description

Technical Field

[0001] This application relates to the field of robotic arm control technology, and in particular to an adaptive learning-based robotic arm control method, apparatus, device, and storage medium.

Background Technology

[0002] In industrial production and intelligent manufacturing scenarios, the precision operation of robotic arms often involves physical contact tasks, and compliant control is a core technological support for ensuring operational safety and accuracy. Currently, in the field of compliant control for robotic arms, traditional control methods mostly employ fixed parameter control, which cannot dynamically adjust key parameters such as stiffness and force feedback according to task requirements and environmental changes. This can easily lead to problems such as excessive contact force and insufficient position control accuracy, affecting operational stability and task success rate.

[0003] While existing learning-based compliant control schemes have been applied to robotic arm control, they still have many shortcomings: on the one hand, they lack effective utilization of offline prior knowledge and fail to fully integrate key experiences such as stiffness selection and motion planning contained in offline data; on the other hand, online training suffers from problems such as large parameter fluctuations, slow convergence speed, and insufficient safety redundancy, and a complete multi-source data fusion and parameter optimization mechanism has not been formed, making it difficult to balance control accuracy and operational safety, and unable to meet the fine control needs of complex industrial scenarios.

[0004] Therefore, improving the adaptive capability of robotic arm control has become an urgent problem to be solved.

Summary of the Invention

[0005] Therefore, it is necessary to provide an adaptive learning robotic arm control method, device, equipment, and storage medium to address the aforementioned technical problems, which can improve the adaptive capability of robotic arm control.

[0006] In a first aspect, this application provides an adaptive learning-based robotic arm control method, applied to the control unit of a robotic arm control system. The system further includes a robotic arm, and the control unit deploys a first initial model and a second initial model that share a state encoder. The method includes:
jointly training the first initial model and the second initial model based on an offline dataset until a first preset condition is met, to obtain a first reference model and a second reference model, where the offline dataset consists of multi-source trajectory data corresponding to the robotic arm;
obtaining an online dataset corresponding to the robotic arm based on the first reference model and the second reference model, where the online dataset consists of multiple trajectory data groups generated during the real-time operation of the robotic arm;
determining a hybrid dataset based on the offline dataset and the online dataset;
jointly training the first reference model and the second reference model based on the hybrid dataset until a second preset condition is met, to obtain a first target model and a second target model;
determining an adaptive control model based on the first target model and the second target model, where the adaptive control model is used to generate and filter optimal parameter pairs of the robotic arm in real time to achieve adaptive control of the robotic arm.

[0007] In a second aspect, this application provides an adaptive learning-based robotic arm control apparatus, applied to the control unit of a robotic arm control system. The system further includes a robotic arm, and the control unit deploys a first initial model and a second initial model that share a state encoder. The apparatus includes:
an offline training module, used to jointly train the first initial model and the second initial model based on an offline dataset until a first preset condition is met, to obtain a first reference model and a second reference model, where the offline dataset consists of multi-source trajectory data corresponding to the robotic arm;
a data acquisition module, used to obtain an online dataset corresponding to the robotic arm based on the first reference model and the second reference model, where the online dataset consists of multiple trajectory data groups generated during the real-time operation of the robotic arm;
a first determination module, used to determine a hybrid dataset based on the offline dataset and the online dataset;
an online fine-tuning module, used to jointly train the first reference model and the second reference model based on the hybrid dataset until a second preset condition is met, to obtain a first target model and a second target model;
a second determination module, used to determine an adaptive control model based on the first target model and the second target model, where the adaptive control model is used to generate and filter optimal parameter pairs of the robotic arm in real time to achieve adaptive control of the robotic arm.

[0008] In a third aspect, this application provides a computer device, which includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the steps of the method described above.

[0009] In a fourth aspect, this application provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the method described above.

[0010] In a fifth aspect, this application provides a computer program product comprising a computer program that, when executed by a processor, implements the steps of the method described above.

[0011] The aforementioned adaptive learning-based robotic arm control method, apparatus, device, and storage medium employ a two-stage joint training strategy that combines offline pre-training with online fine-tuning, and use a shared state encoder to achieve feature space consistency between the flow model and the Q-function network model. In existing technologies, fixed impedance control cannot adapt to dynamic operating conditions, and rule-based adaptive control generalizes poorly. By contrast, this method uses the offline dataset to constrain the model to learn safe operating priors and uses the online dataset to guide the model to adapt to real-time operating conditions. The two models cooperate to accurately generate and select optimal parameter pairs, which addresses the difficulty traditional compliant control schemes have in balancing operational safety, control accuracy, and adaptability to operating conditions, thereby improving the adaptive capability of robotic arm control.

Description of the Drawings

[0012] Figure 1A is an architecture diagram of a robotic arm control system provided in an embodiment of this application;
Figure 1B is an application environment diagram for an adaptive learning-based robotic arm control method provided in an embodiment of this application;
Figure 2 is a flowchart of an adaptive learning-based robotic arm control method provided in an embodiment of this application;
Figure 3 is a schematic flowchart of determining a first total loss value provided in an embodiment of this application;
Figure 4 is a schematic flowchart of filtering candidate parameter pairs provided in an embodiment of this application;
Figure 5 is a structural block diagram of an adaptive learning-based robotic arm control apparatus provided in an embodiment of this application;
Figure 6 is an internal structural diagram of a computer device provided in an embodiment of this application;
Figure 7 is an internal structural diagram of another computer device provided in an embodiment of this application;
Figure 8 is an internal structural diagram of a computer-readable storage medium provided in an embodiment of this application.

Detailed Description

[0013] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0014] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.

[0015] Figure 1A is an architecture diagram of a robotic arm control system provided in an embodiment of this application. The robotic arm control system includes, but is not limited to, a control unit and a robotic arm.

[0016] The robotic arm can be a 6-DOF or 7-DOF arm, equipped with a torque sensor, a vision sensor, and a real-time controller. This type of robotic arm has joint torque feedback capability, which can sense changes in joint load in real time. Combined with an end-effector force sensor, it achieves dual-layer force control, adapting to precision contact operations such as assembly, insertion / removal, and grinding. The torque sensor can be configured as a six-dimensional force / torque sensor at the end of the arm, with a sampling frequency of no less than 100Hz. This torque sensor is used to collect three-dimensional force and torque data during the contact process between the robotic arm's end effector and the environment / workpiece in real time. The vision sensor can be an RGB camera or an RGB-D camera, installed at the end effector of the robotic arm or fixed to the work scene. It is used to perceive the shape characteristics of the workpiece, pose deviations, and environmental obstacle information. The output visual data and torque data are fused to form the multimodal state input of the robotic arm. The real-time controller can be a high-performance real-time controller supporting impedance control, with millisecond-level command response capability. This real-time controller has a built-in impedance control algorithm module, which can dynamically adjust the stiffness matrix according to the stiffness parameters issued by the control unit to achieve balanced control of the robotic arm's end effector position and contact force.
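To make the role of the stiffness parameters concrete, the following is a minimal sketch of a per-axis spring-damper impedance law. The application does not specify the controller's internal law, so the diagonal gains, the damping term, and the name `impedance_force` are all illustrative assumptions.

```python
from typing import List, Sequence


def impedance_force(
    stiffness: Sequence[float],
    damping: Sequence[float],
    pos_error: Sequence[float],
    vel_error: Sequence[float],
) -> List[float]:
    """Per-axis law f_i = k_i * e_i + d_i * de_i (diagonal gains are an assumption)."""
    if not (len(stiffness) == len(damping) == len(pos_error) == len(vel_error)):
        raise ValueError("all inputs must have the same length")
    return [
        k * e + d * de
        for k, d, e, de in zip(stiffness, damping, pos_error, vel_error)
    ]


# The same 2 mm position error on one axis under two stiffness settings
# (hypothetical values in N/m) issued by the control unit.
stiff = impedance_force([2000.0], [50.0], [0.002], [0.0])
soft = impedance_force([200.0], [50.0], [0.002], [0.0])
```

Under these assumed values, lowering the stiffness by a factor of ten lowers the commanded restoring force by the same factor, which is the trade-off between position accuracy and contact force that the stiffness parameters control.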

[0017] The control unit is equipped with a first initial model (flow model) and a second initial model (Q-function network model) with a shared state encoder. The control unit establishes a data interaction link with the robotic arm's sensing module and drive module through a communication interface to receive sensing data, perform model training and inference operations, and issue control commands to achieve adaptive compliant control of the robotic arm.

[0018] Figure 1B is an application environment diagram for an adaptive learning-based robotic arm control method provided in an embodiment of this application. The terminal 102 communicates with the server 104 via a communication network. A data storage system can store the data that the server 104 needs to process. The data storage system can be integrated into the server 104 or located in the cloud or on other network servers. The terminal 102 can be, but is not limited to, a personal computer, laptop, smartphone, tablet, IoT device, or portable wearable device. IoT devices can include smart speakers, smart TVs, smart air conditioners, smart vehicle devices, etc. Portable wearable devices can include smartwatches, smart bracelets, head-mounted devices, etc. The server 104 can be implemented as a standalone server or as a server cluster consisting of multiple servers.

[0019] As shown in Figure 2, an embodiment of this application provides a flowchart of an adaptive learning-based robotic arm control method. The method is described using its application to the control unit in Figure 1A, or to the terminal 102 or the server 104 in Figure 1B, as an example. It is understood that the computer device may include at least one of a terminal and a server. The method includes the following steps:
S101. Based on the offline dataset, jointly train the first initial model and the second initial model until the first preset condition is met, and obtain the first reference model and the second reference model.

[0020] The offline dataset consists of multi-source trajectory data corresponding to the robotic arm, specifically including: expert demonstration trajectory data, simulation scenario trajectory data, historical operation trajectory data, and data-augmented trajectory data. Specifically, the expert demonstration trajectory data comprises the robotic arm operation trajectory completed by the operator via teleoperation, including joint angles, action commands, contact force data, and corresponding stiffness parameters annotated by experts during the operation; the simulation scenario trajectory data consists of robotic arm operation trajectories generated based on a physical simulation platform, covering operation data under different working conditions (such as different loads and different workpiece positions); the historical operation trajectory data consists of complete operation trajectories recorded during the robotic arm's past actual operation, including trajectories of normal task execution and minor fault handling trajectories, eliminating invalid data with safety hazards (such as collisions and overloads) to ensure data reliability; and the data-augmented trajectory data consists of new trajectory samples generated by perturbing the original trajectory data (such as slightly adjusting joint angles and contact force parameters) to expand the dataset size, avoid model overfitting, and improve the model's generalization ability.
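The data-augmented trajectories described above can be illustrated with a minimal sketch that perturbs one trajectory sample with small Gaussian noise. The sample layout (`joint_angles`, `contact_force`, `stiffness`), the noise scales, and the choice to leave the expert stiffness label unchanged are assumptions made for illustration only.

```python
import random
from typing import Dict, List

Sample = Dict[str, List[float]]


def augment_sample(
    sample: Sample, angle_sigma: float, force_sigma: float, rng: random.Random
) -> Sample:
    """Return a perturbed copy of one sample; the input sample is not modified."""
    return {
        "joint_angles": [
            q + rng.gauss(0.0, angle_sigma) for q in sample["joint_angles"]
        ],
        "contact_force": [
            f + rng.gauss(0.0, force_sigma) for f in sample["contact_force"]
        ],
        # Assumption: the expert-annotated stiffness label is kept as-is.
        "stiffness": list(sample["stiffness"]),
    }


original: Sample = {
    "joint_angles": [0.10, -0.52, 1.20],
    "contact_force": [0.0, 0.0, 4.8],
    "stiffness": [800.0, 800.0, 300.0],
}
# Four augmented variants, each drawn with its own seeded generator.
augmented = [
    augment_sample(original, 0.005, 0.1, random.Random(seed)) for seed in range(4)
]
```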

[0021] In some embodiments, step S101, jointly training the first initial model and the second initial model based on the offline dataset until the first preset condition is met to obtain the first reference model and the second reference model, includes:
A1. Sample the offline dataset to obtain multiple offline training samples;
A2. Based on a first offline training sample, jointly train the first initial model and the second initial model to obtain a first total loss value, where the first offline training sample is any one of the multiple offline training samples;
A3. Perform backpropagation based on the first total loss value to update the parameters of the first initial model and the second initial model. If the first preset condition is not met, continue iterative training; if the first preset condition is met, obtain the first reference model and the second reference model.

[0022] Multiple offline training samples can be obtained by uniform random sampling of the offline dataset, which avoids data distribution bias during training. The offline training samples are then selected one at a time in a preset order, and steps A2 and A3 are performed in full on each sample. Each completed pass over one offline training sample constitutes one model iteration. After each iteration, the training effect is verified to determine whether the first preset condition is met. If not, the next offline training sample is selected and iterative optimization continues. If the condition is met, offline training is complete, and the first reference model and the second reference model are obtained. The first preset condition can be that the decrease in the total loss value stays below a preset loss threshold (e.g., 1×10⁻⁶) for a preset number of consecutive rounds (e.g., 50 rounds). Alternatively, it can be that the success rate of the simulation task exceeds a preset success rate threshold (e.g., 90%). No specific limitation is imposed here.
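The two alternative forms of the first preset condition can be sketched as a single check. This is a minimal illustration; the function name `first_condition_met`, its parameters, and the decision to count a loss increase as "no meaningful decrease" are assumptions rather than details given in the application.

```python
from typing import Optional, Sequence


def first_condition_met(
    loss_history: Sequence[float],
    loss_threshold: float,
    patience: int,
    success_rate: Optional[float] = None,
    success_threshold: float = 0.9,
) -> bool:
    """Stop when the simulated success rate is high enough, or when the loss
    has decreased by less than `loss_threshold` for `patience` rounds in a row."""
    if success_rate is not None and success_rate > success_threshold:
        return True
    if len(loss_history) < patience + 1:
        return False  # not enough rounds observed yet
    recent = loss_history[-(patience + 1):]
    # Assumption: an increase in loss also counts as "no meaningful decrease".
    return all(prev - cur < loss_threshold for prev, cur in zip(recent, recent[1:]))
```

For example, `first_condition_met(losses, 1e-6, 50)` would be evaluated after each iteration, with `losses` holding one total loss value per completed iteration.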

[0023] For the first offline training sample, the corresponding robotic arm state data (including visual information, force information, proprioceptive information, and task information) can be input into the state encoder for format standardization, feature extraction, and fusion, yielding a first state feature vector. The first total loss value corresponding to this first state feature vector is then calculated based on the first and second initial models. Next, the backpropagation algorithm is used to calculate the partial derivatives of the first total loss value with respect to all trainable parameters of the state encoder, the first initial model (flow model), and the second initial model (Q-function network model), giving the gradient value of each parameter. A preset gradient descent optimization algorithm (e.g., the Adam optimization algorithm) then synchronously updates the parameters of the state encoder, the flow model, and the Q-function network model based on these gradient values. During the update, the parameters of each model are adjusted step by step according to the settings of the optimization algorithm (e.g., the learning rate) so that the first total loss value gradually decreases, thereby optimizing model performance.
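As an illustration of the synchronous parameter update, the following is a minimal pure-Python sketch of one Adam step over a flat list of parameters. Treating the encoder, flow model, and Q-function network parameters as one flat list, and the default hyperparameter values, are assumptions; a real implementation would normally rely on a deep learning framework's optimizer.

```python
import math
from typing import List, Sequence, Tuple


def adam_step(
    params: Sequence[float],
    grads: Sequence[float],
    m: Sequence[float],
    v: Sequence[float],
    step: int,
    lr: float = 1e-3,
    beta1: float = 0.9,
    beta2: float = 0.999,
    eps: float = 1e-8,
) -> Tuple[List[float], List[float], List[float]]:
    """One Adam update; `step` is the 1-based iteration counter."""
    new_p, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g        # first-moment estimate
        v_i = beta2 * v_i + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m_i / (1.0 - beta1 ** step)          # bias correction
        v_hat = v_i / (1.0 - beta2 ** step)
        new_p.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_p, new_m, new_v
```

Because all parameters pass through the same call with gradients of the same total loss, the shared state encoder receives gradient contributions from both the flow model branch and the Q-function branch in a single update.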

[0024] It is evident that by sampling the offline dataset, training iteratively, and updating the model parameters through backpropagation based on the total loss value of each offline training sample, the first and second initial models can quickly learn the core rules of the expert strategy. This ensures stable and reliable initial model performance and lays a solid foundation for the subsequent fusion and optimization with the online dataset.

[0025] As shown in Figure 3, an embodiment of this application provides a flowchart for determining the first total loss value. The first initial model is a flow matching model, and the second initial model is a Q-function network model. Step A2, jointly training the first initial model and the second initial model based on the first offline training sample to obtain the first total loss value, includes:
B1. Train the flow matching model based on the first offline training sample to obtain a first behavior cloning loss value;
B2. Train the Q-function network model based on the first offline training sample to obtain a first Q-learning loss value;
B3. Determine the first total loss value based on the first behavior cloning loss value, the first Q-learning loss value, and a preset loss weighting coefficient.

[0026] The flow matching model, serving as the parameter generation model, learns the continuous-time evolution path from reference noise to expert action-stiffness parameter pairs based on the state characteristics of the robotic arm. It can fit expert trajectories through multi-time-step interpolation and supports rapid single-step generation of candidate control parameters, making it the core network for safe and stable action-stiffness parameter generation. The Q-function network model, serving as the value evaluation model, evaluates the long-term benefits of the robotic arm's state and the action-stiffness parameter pairs generated by the flow matching model, outputting the corresponding Q-value to characterize the strategy's merits. It can optimize evaluation accuracy through temporal difference objectives, providing a reliable basis for optimal parameter selection. During offline training, the flow matching model and the Q-function network model share the features output by the same state encoder. Through weighted joint optimization of behavioral cloning loss and Q-learning loss, the parameter generation and value evaluation capabilities are simultaneously improved.

[0027] The first total loss value corresponding to the first offline training sample can be calculated according to the following preset total loss function:
L_total = L_BC + λ_Q · L_QL
where L_total represents the first total loss value; L_BC represents the first behavior cloning loss value; L_QL represents the first Q-learning loss value; and λ_Q represents the loss weighting coefficient.

[0028] It is evident that by calculating the behavioral cloning loss and the Q-learning loss separately and combining them with weighting coefficients to determine the total loss, we can balance the fitting ability of the flow matching model to the expert trajectory with the value assessment accuracy of the Q-function network model, thus achieving collaborative optimization and stable training of the two models.
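The weighted combination can be shown with a short sketch. The numeric loss values and the two settings of the weighting coefficient below are hypothetical and serve only to show how λ_Q shifts the balance between the two terms.

```python
def total_loss(bc_loss: float, q_loss: float, lambda_q: float) -> float:
    """L_total = L_BC + lambda_Q * L_QL."""
    return bc_loss + lambda_q * q_loss


# Hypothetical loss values for one offline training sample.
bc, ql = 0.2, 0.5
imitation_heavy = total_loss(bc, ql, lambda_q=0.1)  # Q-learning term weighted lightly
value_heavy = total_loss(bc, ql, lambda_q=2.0)      # Q-learning term dominates
```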

[0029] In some embodiments, step B1, training the flow matching model based on the first offline training sample to obtain the first behavior cloning loss value, includes:
C1. Extract the first reference state, the reference action command, and the reference stiffness parameters of the robotic arm from the first offline training sample;
C2. Construct an interpolation path based on the reference action command, the reference stiffness parameters, a preset reference noise, and multiple time steps, where the multiple time steps are obtained by sampling multiple times within a preset time interval;
C3. Calculate the first true velocity corresponding to the interpolation path;
C4. Input the first reference state, the interpolation path, and the multiple time steps into the flow matching model for processing, and output the first predicted velocity;
C5. Calculate the mean square error between the first predicted velocity and the first true velocity to obtain the first behavior cloning loss value.

[0030] The process involves extracting a first reference state for the robotic arm at the first historical moment, reference motion commands labeled by experts or the simulation system, and reference stiffness parameters from the first offline training samples. This first reference state includes the robotic arm's joint angles, end-effector pose, six-dimensional force / torque information, visual features, and task-related information. Then, starting with a preset reference noise and targeting the extracted reference motion commands and reference stiffness parameters, a continuous interpolation path is constructed across multiple time steps obtained by uniformly sampling over a preset time interval, smoothly transitioning from noise to expert parameters. This interpolation path represents the evolution process of the expert trajectory that the flow model needs to learn. Based on the numerical changes of the interpolation path between adjacent time steps, the rate of change of the interpolation path over time is calculated to obtain the first true velocity. This true velocity is the target that the flow matching model needs to fit, representing the true evolution rate of the expert trajectory at each time step.

[0031] In this process, the first reference state, the constructed interpolation path, and the corresponding multiple time steps are input into the flow matching model. The model performs forward computation on these inputs and outputs its predicted evolution velocity, i.e., the first predicted velocity. The first predicted velocity is then compared with the first true velocity, and the mean squared error between the two is calculated. This mean squared error is the first behavior cloning loss value. It constrains the flow matching model to accurately learn the velocity patterns of expert trajectories, so that the parameters generated by the model stay close to safe and standard expert strategies.
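Steps C2 to C5 can be sketched as follows. This is a minimal illustration that assumes a straight-line interpolation from the reference noise (t = 0) to the expert action-stiffness vector (t = 1) and a finite-difference velocity; the application does not fix the interpolation form. The model is represented by any callable `model(state, x_t, t)`, which is a hypothetical stand-in for the flow matching model.

```python
from typing import Callable, List, Sequence

Vector = List[float]
VelocityModel = Callable[[Sequence[float], Sequence[float], float], Vector]


def interpolate(noise: Sequence[float], target: Sequence[float], t: float) -> Vector:
    """Point on the assumed straight path from noise (t=0) to expert parameters (t=1)."""
    return [(1.0 - t) * n + t * x for n, x in zip(noise, target)]


def true_velocity(
    noise: Sequence[float], target: Sequence[float], t: float, dt: float = 1e-3
) -> Vector:
    """Rate of change of the path between two adjacent time steps t and t + dt."""
    a = interpolate(noise, target, t)
    b = interpolate(noise, target, t + dt)
    return [(bi - ai) / dt for ai, bi in zip(a, b)]


def bc_loss(
    model: VelocityModel,
    state: Sequence[float],
    noise: Sequence[float],
    target: Sequence[float],
    time_steps: Sequence[float],
) -> float:
    """Mean squared error between predicted and true velocity over all time steps."""
    sq_err, count = 0.0, 0
    for t in time_steps:
        x_t = interpolate(noise, target, t)
        v_pred = model(state, x_t, t)
        v_true = true_velocity(noise, target, t)
        sq_err += sum((p - q) ** 2 for p, q in zip(v_pred, v_true))
        count += len(v_true)
    return sq_err / count
```

Here `target` stands for the reference action command concatenated with the reference stiffness parameters. With the straight path assumed above, the true velocity is the same at every time step and equals `target - noise`, so a model that outputs this difference yields a loss close to zero.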

[0032] It is evident that by extracting the expert-labeled state and parameters to construct the interpolation path, and by calculating the mean square error between the true velocity and the predicted velocity to obtain the behavior cloning loss value, the flow matching model can be accurately constrained to learn the continuous evolution of the expert trajectory, thereby improving the safety and reliability of the action command-stiffness parameter pairs generated by the model.

[0033] In some embodiments, step B2, training the Q-function network model based on the first offline training sample to obtain the first Q-learning loss value, includes:
D1. Extract the first execution reward and the second reference state corresponding to the robotic arm from the first offline training sample, where the second reference state is the next state after the first reference state;
D2. Input the first reference state, the reference noise, and a single time step into the flow matching model for processing, and output the second predicted velocity, where the single time step takes a preset fixed value;
D3. Perform single-step sampling based on the second predicted velocity and the reference noise to obtain a first parameter pair, where the first parameter pair includes a first action command and a first stiffness parameter;
D4. Input the first reference state and the first parameter pair into the Q-function network model for processing, and output the first Q-value;
D5. Determine the second Q-value based on the second reference state, the reference noise, and the single time step;
D6. Calculate the target Q-value based on the preset discount factor, the first execution reward, and the second Q-value;
D7. Calculate the mean square error between the first Q-value and the target Q-value to obtain the first Q-learning loss value.

[0034] The first execution reward is obtained after the robotic arm performs the corresponding action and is determined by a combination of factors such as task completion, control safety, and operational efficiency. The second reference state is the complete state information of the robotic arm at the next moment after it performs the corresponding action in the first reference state, and characterizes the state transition result. The first reference state, the preset reference noise, and a single time step with a fixed value (usually 1) are input into the flow matching model. The model performs forward computation on these inputs and outputs the evolution velocity under the corresponding conditions, i.e., the second predicted velocity, which is used for the subsequent single-step generation of control parameters.

[0035] The process involves several steps. First, the second predicted velocity, based on the flow matching model output, is combined with the initial input reference noise to perform single-step flow sampling. Without multi-step interpolation, a complete set of control parameters, the first parameter pair, is directly generated. This first parameter pair includes a first action command for controlling the robotic arm's movement and a first stiffness parameter to ensure control stability. The first reference state and the first parameter pair are then input into a Q-function network model. This model evaluates the long-term cumulative benefit of the first reference state and the first parameter pair, outputting a corresponding value score, the first Q-value, which characterizes the quality of the first parameter pair under the first reference state. Next, using the same method as calculating the first Q-value, the second reference state, the same reference noise, and the same single time step are input into the model. Single-step sampling is used to generate the parameter pair corresponding to the next state, which is then input into the Q-function network model for processing, outputting the optimal value estimate for the next state, the second Q-value.

[0036] In this process, following the target value calculation rule of deep reinforcement learning, a target Q-value is constructed from the preset discount factor, the first execution reward, and the second Q-value:
Target Q-value = First execution reward + Discount factor × Second Q-value
This target Q-value represents the true long-term return of the first parameter pair and is the learning objective that the Q-function network model needs to fit. The first Q-value is then compared with the target Q-value, and the mean squared error between them is calculated. This mean squared error is the first Q-learning loss value. It is used to update the parameters of the Q-function network model through backpropagation, improving the model's accuracy in evaluating the value of the control parameters.
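Steps D2 to D7 can be sketched for one training sample as follows. This is a minimal illustration: the single-step sampling is assumed to be one Euler step of size 1 from the reference noise, and `velocity_model` and `q_model` are hypothetical callables standing in for the flow matching model and the Q-function network model. Gradient handling (for example, whether the target is held fixed) is not specified in the application and is omitted.

```python
from typing import Callable, List, Sequence

Vector = List[float]
VelocityModel = Callable[[Sequence[float], Sequence[float], float], Vector]
QModel = Callable[[Sequence[float], Sequence[float]], float]


def single_step_sample(
    velocity_model: VelocityModel,
    state: Sequence[float],
    noise: Sequence[float],
    t: float = 1.0,
) -> Vector:
    """Assumed Euler step of size 1: parameter pair = noise + predicted velocity."""
    v = velocity_model(state, noise, t)
    return [n + vi for n, vi in zip(noise, v)]


def q_learning_loss(
    velocity_model: VelocityModel,
    q_model: QModel,
    state: Sequence[float],
    next_state: Sequence[float],
    reward: float,
    noise: Sequence[float],
    gamma: float,
) -> float:
    """Squared error between the first Q-value and the target Q-value for one sample."""
    pair = single_step_sample(velocity_model, state, noise)            # D2, D3
    q1 = q_model(state, pair)                                          # D4
    next_pair = single_step_sample(velocity_model, next_state, noise)
    q2 = q_model(next_state, next_pair)                                # D5
    target_q = reward + gamma * q2                                     # D6
    return (q1 - target_q) ** 2                                        # D7
```

Averaging this value over a batch of samples gives the mean squared error described above.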

[0037] It is evident that by extracting state transition and reward information, generating parameter pairs through single-step sampling, and calculating Q-value errors, the value assessment capability of the Q-function network model can be effectively optimized, guiding the model to learn the control strategy with optimal long-term returns.

[0038] S102. Based on the first reference model and the second reference model, obtain the online dataset corresponding to the robotic arm.

[0039] The online dataset consists of multiple trajectory data sets generated during the real-time operation of the robotic arm.

[0040] In some embodiments, S102, obtaining the online dataset corresponding to the robotic arm based on the first reference model and the second reference model, includes:
E1. Obtain the first real-time state corresponding to the robotic arm through the state encoder.
E2. Using the first reference model, generate multiple candidate parameter pairs based on the first real-time state; each candidate parameter pair includes a candidate action command and a candidate stiffness parameter.
E3. Filter the multiple candidate parameter pairs using the second reference model to obtain the target parameter pair.
E4. Obtain the target execution reward corresponding to the robotic arm; the target execution reward is obtained by the robotic arm performing the corresponding operation according to the target parameter pair.
E5. Obtain the second real-time state corresponding to the robotic arm through the state encoder; the second real-time state is the next state after the first real-time state.
E6. Based on the first real-time state, the target parameter pair, the target execution reward, and the second real-time state, determine the first trajectory data group; the first trajectory data group is any one of the multiple trajectory data groups.

[0041] Specifically, a state encoder extracts and encodes features from the robotic arm's raw sensing data at the current moment to obtain a standardized, high-dimensional first real-time state. This first real-time state is then input into a trained and converged flow matching model. Based on the current first real-time state, the model performs multiple independent samplings to generate multiple sets of different control parameter combinations, i.e., multiple candidate parameter pairs. Each candidate parameter pair contains a candidate action command for driving the robotic arm's movement and a candidate stiffness parameter for ensuring compliant control safety.

[0042] In this process, the first real-time state and each candidate parameter pair are sequentially input into the Q-function network model. The model evaluates and filters the candidate parameter pairs to obtain the target parameter pair, which is then sent to the robotic arm for execution, achieving online selection of the optimal control strategy. After the robotic arm performs the specified operation according to the target parameter pair, the target execution reward for this operation is calculated from indicators such as task completion, contact force safety, and motion stability, quantifying the merit of this control strategy. After the operation is completed, the state encoder encodes the robotic arm's sensing data again to obtain the state at the next moment, i.e., the second real-time state, which characterizes the state transition result. Finally, the first real-time state, the target parameter pair, the target execution reward, and the second real-time state are combined in sequence to form a complete first trajectory data group. Repeating E1 to E6 collects multiple trajectory data groups, ultimately forming the online dataset used for online fine-tuning.
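One pass through steps E1 to E6 might be organized as in the following Python sketch. Every callable here (`read_sensors`, `encode`, `propose`, `select`, `execute`) is a hypothetical stand-in for the sensor interface, state encoder, flow matching model, Q-value filter, and robot execution respectively, and the default of 8 candidates is an assumption; the text only says "multiple".

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, ...]
ParamPair = Tuple[Tuple[float, ...], Tuple[float, ...]]  # (action command, stiffness)


@dataclass(frozen=True)
class TrajectoryGroup:
    state: State
    params: ParamPair
    reward: float
    next_state: State


def collect_trajectory_group(
    read_sensors: Callable[[], Sequence[float]],
    encode: Callable[[Sequence[float]], State],
    propose: Callable[[State], ParamPair],
    select: Callable[[State, List[ParamPair]], ParamPair],
    execute: Callable[[ParamPair], float],
    num_candidates: int = 8,  # assumed; the text only says "multiple"
) -> TrajectoryGroup:
    state = encode(read_sensors())                                # E1
    candidates = [propose(state) for _ in range(num_candidates)]  # E2
    target = select(state, candidates)                            # E3
    reward = execute(target)                                      # E4
    next_state = encode(read_sensors())                           # E5
    return TrajectoryGroup(state, target, reward, next_state)     # E6


# Toy stand-ins so the sketch runs without hardware.
readings = iter([[0.0, 0.0], [0.1, 0.0]])
proposals = iter([((0.1,), (50.0,)), ((0.2,), (80.0,))])

group = collect_trajectory_group(
    read_sensors=lambda: next(readings),
    encode=lambda raw: tuple(raw),
    propose=lambda state: next(proposals),
    select=lambda state, cands: max(cands, key=lambda p: p[1][0]),
    execute=lambda params: 1.0,
    num_candidates=2,
)
```

Calling `collect_trajectory_group` repeatedly and appending each result to a list would accumulate the online dataset.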

[0043] It is evident that by collecting the state of the robotic arm in real time, generating and filtering candidate parameter pairs, and recording execution rewards and state transition information to construct a trajectory data set, online data that closely matches actual working conditions can be efficiently accumulated, providing real and reliable sample support for subsequent joint optimization of the model.

[0044] The reward function used during model training is:
r_t = r_task + r_safety + r_efficiency
where r_t represents the total reward obtained by the robotic arm after performing the operation at the current time step t; r_task represents the task reward; r_safety represents the safety reward; and r_efficiency represents the efficiency reward.

[0045] It should be noted that the task reward r_task is designed based on the current task completion rate of the robotic arm. Its core reference metric is the distance between the robotic arm's end effector and the target position: the closer to the target position, the higher the task completion rate and the larger the value of r_task, which guides the robotic arm to accurately complete the preset task. The safety reward r_safety penalizes excessive contact force during robotic arm operation, preventing damage to robotic arm components or to the operated object. The specific expression is:
r_safety = -λ·max(0, ‖F‖ - F_max)
where λ represents a preset penalty coefficient, used to adjust the penalty intensity for exceeding the contact force limit; ‖F‖ represents the magnitude of the actual contact force detected at the robotic arm end effector; and F_max represents a preset contact force safety threshold. When the actual contact force ‖F‖ is greater than the safety threshold F_max, the penalty mechanism is triggered and a negative reward is output; the more the contact force exceeds the limit, the greater the penalty. When the actual contact force does not exceed the limit, r_safety is 0 and no penalty is generated.

[0046] The efficiency reward r_efficiency encourages the robotic arm to complete the preset task quickly and shorten the operation time. Specifically, it is designed to be negatively correlated with the number of operation time steps: the fewer time steps the robotic arm takes to complete the task, the larger the value of r_efficiency, and vice versa, thereby improving the operation efficiency of the robotic arm.
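The three reward terms can be sketched in Python as follows. Only the safety term follows an expression given in the text; the negative-distance form of the task reward and the constant per-step cost used for the efficiency reward are assumptions chosen to match the stated qualitative behavior, and all function and parameter names are illustrative.

```python
import math
from typing import Sequence


def safety_reward(force: Sequence[float], f_max: float, penalty: float) -> float:
    """r_safety = -lambda * max(0, ||F|| - F_max), as given in the text."""
    magnitude = math.sqrt(sum(f * f for f in force))
    return -penalty * max(0.0, magnitude - f_max)


def task_reward(distance_to_target: float) -> float:
    # Assumed form: negative distance, so a closer end effector scores higher.
    return -distance_to_target


def efficiency_reward(step_cost: float = 0.01) -> float:
    # Assumed form: a constant cost per time step, so finishing in fewer
    # steps accumulates a larger total reward.
    return -step_cost


def total_reward(
    distance_to_target: float,
    force: Sequence[float],
    f_max: float,
    penalty: float,
    step_cost: float = 0.01,
) -> float:
    """r_t = r_task + r_safety + r_efficiency."""
    return (
        task_reward(distance_to_target)
        + safety_reward(force, f_max, penalty)
        + efficiency_reward(step_cost)
    )


# Contact force of magnitude 5 against a threshold of 4, penalty coefficient 2.
r = total_reward(0.5, force=[3.0, 4.0, 0.0], f_max=4.0, penalty=2.0, step_cost=0.25)
```

In this example the safety term contributes -2.0, the task term -0.5, and the efficiency term -0.25.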

[0047] As shown in Figure 4, this application embodiment provides a flowchart for screening candidate parameter pairs. E3, screening multiple candidate parameter pairs through the second reference model to obtain the target parameter pair, includes:
F1. Input each candidate parameter pair, together with the first real-time state, into the second reference model for processing, and output multiple candidate Q values.
F2. Select the largest candidate Q value from the multiple candidate Q values.
F3. Determine the candidate parameter pair corresponding to the largest candidate Q value as the target parameter pair.

[0048] In this process, each candidate parameter pair is combined with the first real-time state and input into the second reference model. The second reference model evaluates the long-term returns of each candidate parameter pair and outputs multiple candidate Q values that correspond one-to-one with the candidate parameter pairs.

[0049] It is evident that by evaluating and selecting candidate parameter pairs through Q-value assessment, the control parameters with the best long-term benefits under the current state can be accurately selected, thereby improving the decision-making efficiency and task completion quality of the robotic arm's adaptive control.
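Steps F1 to F3 amount to an argmax over candidate Q values. The following Python sketch shows one way to write it; `q_fn` is a hypothetical stand-in for the second reference model, and the toy scoring function exists only so the example runs.

```python
from typing import Callable, Sequence, Tuple, TypeVar

S = TypeVar("S")
P = TypeVar("P")


def select_target_pair(
    state: S,
    candidates: Sequence[P],
    q_fn: Callable[[S, P], float],
) -> Tuple[P, float]:
    """Return the candidate with the largest Q value, and that Q value."""
    q_values = [q_fn(state, c) for c in candidates]                   # F1
    best_index = max(range(len(q_values)), key=q_values.__getitem__)  # F2
    return candidates[best_index], q_values[best_index]               # F3


# Toy Q function that prefers an action command close to 0.25.
def toy_q(state, pair):
    action, _stiffness = pair
    return 1.0 - abs(action - 0.25)


candidates = [(0.0, 10.0), (0.25, 40.0), (0.5, 80.0)]
best_pair, best_q = select_target_pair((0.0,), candidates, toy_q)
```

When several candidates share the largest Q value, this sketch keeps the first one; the text does not state a tie-breaking rule.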

[0050] S103. Determine the mixed dataset based on the offline dataset and the online dataset.

[0051] During online model fine-tuning, the sampling ratio of the offline and online datasets can be dynamically adjusted according to the training progress to obtain a mixed dataset, thus balancing training stability and policy optimization effectiveness. The total number of training steps is denoted as T_total, and the current training step is denoted as t. When t < 0.2 × T_total, a conservative sampling mode is adopted, and mixed sampling is performed at a ratio of 70% offline data to 30% online data. When 0.2 × T_total ≤ t < 0.6 × T_total, a balanced sampling mode is adopted, at a ratio of 50% offline data to 50% online data. When t ≥ 0.6 × T_total, an aggressive sampling mode is adopted, at a ratio of 30% offline data to 70% online data.
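The three-phase sampling schedule can be sketched as a small Python helper. The thresholds and ratios come from the paragraph above; the function names, the use of sampling with replacement, and the rounding of the offline count are assumptions.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def offline_fraction(step: int, total_steps: int) -> float:
    """Share of offline data in a batch at the given training step."""
    # Integer comparisons avoid floating-point issues at the boundaries.
    if 5 * step < total_steps:        # t < 0.2 * T_total: conservative
        return 0.7
    if 5 * step < 3 * total_steps:    # 0.2 * T_total <= t < 0.6 * T_total: balanced
        return 0.5
    return 0.3                        # t >= 0.6 * T_total: aggressive


def sample_mixed_batch(
    offline: Sequence[T],
    online: Sequence[T],
    batch_size: int,
    step: int,
    total_steps: int,
    rng: random.Random,
) -> List[T]:
    """Draw a shuffled batch that mixes offline and online samples."""
    n_offline = round(batch_size * offline_fraction(step, total_steps))
    # Sampling with replacement is an assumption made for simplicity.
    batch = rng.choices(offline, k=n_offline)
    batch += rng.choices(online, k=batch_size - n_offline)
    rng.shuffle(batch)
    return batch


offline_data = [("offline", i) for i in range(20)]
online_data = [("online", i) for i in range(20)]
batch = sample_mixed_batch(
    offline_data, online_data, batch_size=10, step=0, total_steps=1000,
    rng=random.Random(0),
)
```

At step 0 of 1000 the schedule is in the conservative mode, so the batch of 10 contains 7 offline and 3 online samples.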

[0052] Based on this dynamic sampling ratio, the offline and online datasets are combined, shuffled, and batched to obtain the mixed dataset for joint model training. In addition, the loss weighting coefficient λ_Q is scheduled adaptively during training to balance the behavior cloning loss and the Q-learning loss: during offline training, λ_Q gradually increases from 0.1 to 0.5; during online fine-tuning, λ_Q further increases from 0.5 to 1.0. As a result, the model relies mainly on behavior cloning in the early stages of training to guarantee control stability, and mainly on Q-learning in the later stages to improve the optimality of the control strategy.
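A possible form of the λ_Q schedule is sketched below in Python. The endpoint values are taken from the paragraph above; the linear ramp and the way the two losses are combined (behavior cloning loss plus λ_Q times Q-learning loss) are assumptions, since the text only says that λ_Q "gradually increases" and that the total loss is determined from the two losses and the weighting coefficient.

```python
def loss_weight(step: int, total_steps: int, phase: str) -> float:
    """lambda_Q for the given phase: 0.1 -> 0.5 offline, 0.5 -> 1.0 online."""
    start, end = {"offline": (0.1, 0.5), "online": (0.5, 1.0)}[phase]
    progress = min(max(step / total_steps, 0.0), 1.0)
    # Assumed linear ramp; this form reproduces both endpoints exactly.
    return (1.0 - progress) * start + progress * end


def total_loss(bc_loss: float, q_loss: float, lambda_q: float) -> float:
    # Assumed combination: behavior cloning loss plus weighted Q-learning loss.
    return bc_loss + lambda_q * q_loss


# Halfway through offline training, lambda_Q is about 0.3.
lam = loss_weight(step=50, total_steps=100, phase="offline")
combined = total_loss(bc_loss=1.0, q_loss=2.0, lambda_q=0.5)
```

With this combination, a small λ_Q early in training lets the behavior cloning term dominate, and the Q-learning term gains influence as λ_Q approaches 1.0.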

[0053] S104. Based on the mixed dataset, jointly train the first reference model and the second reference model until the second preset condition is met, and obtain the first target model and the second target model.

[0054] The mixed dataset is input in batches into the overall network structure consisting of a first reference model, a second reference model, and a state encoder. Forward calculation and backpropagation are performed using the total loss function to achieve synchronous updates and parameter optimization of the first and second reference models.

[0055] During training, a dynamically adjusted mixed sampling strategy of offline and online data is employed to ensure training stability and strategy optimization effectiveness. Training stops when the model converges, the loss function value stabilizes, or a second preset condition is met, such as reaching the maximum number of training iterations. This yields the optimized first and second target models.

[0056] S105. Determine the adaptive control model based on the first target model and the second target model.

[0057] The adaptive control model is used to generate and filter the optimal parameter pairs for the robotic arm in real time to achieve adaptive control of the robotic arm.

[0058] Specifically, the first target model, after convergence through joint training on the mixed dataset, can be combined with the second target model to construct the adaptive control model. The first target model is used to quickly generate multiple candidate parameter pairs, each consisting of an action command and a stiffness parameter, based on the real-time state of the robotic arm. The second target model is used to evaluate the value of these candidate parameter pairs and select the optimal pair. Together, they complete real-time inference and control decisions, enabling the adaptive control model to dynamically output the optimal control parameters matching the current task under different operating conditions, achieving high-precision and high-stability adaptive compliant control of the robotic arm.

[0059] It should be understood that although the steps in the flowcharts of the above embodiments are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the above embodiments may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages in other steps.

[0060] Based on the same inventive concept, this application also provides an adaptive learning robotic arm control device. The solution provided by this device is similar to the solution described in the above method. Therefore, the specific limitations of one or more adaptive learning robotic arm control device embodiments provided below can be found in the limitations of the adaptive learning robotic arm control method described above, and will not be repeated here.

[0061] As shown in Figure 5, this application embodiment provides an adaptive learning robotic arm control device 500, applied to the control unit of a robotic arm control system. The system also includes a robotic arm. A first initial model and a second initial model sharing a state encoder are deployed on the control unit. The device 500 includes:
The offline training module 501, used to jointly train the first initial model and the second initial model based on the offline dataset until the first preset condition is met, so as to obtain the first reference model and the second reference model; the offline dataset consists of multi-source trajectory data corresponding to the robotic arm.
The data acquisition module 502, used to obtain the online dataset corresponding to the robotic arm based on the first reference model and the second reference model; the online dataset consists of multiple trajectory data groups generated during the real-time operation of the robotic arm.
The first determining module 503, used to determine the mixed dataset based on the offline dataset and the online dataset.
The online fine-tuning module 504, used to jointly train the first reference model and the second reference model based on the mixed dataset until the second preset condition is met, so as to obtain the first target model and the second target model.
The second determining module 505, used to determine the adaptive control model based on the first target model and the second target model; the adaptive control model is used to generate and filter the optimal parameter pairs of the robotic arm in real time to achieve adaptive control of the robotic arm.

[0062] In some embodiments, in jointly training the first initial model and the second initial model based on the offline dataset until the first preset condition is met to obtain the first reference model and the second reference model, the offline training module 501 is specifically used to: sample the offline dataset to obtain multiple offline training samples; jointly train the first initial model and the second initial model based on a first offline training sample to obtain a first total loss value, the first offline training sample being any one of the multiple offline training samples; perform backpropagation based on the first total loss value to update the parameters of the first initial model and the second initial model; if the first preset condition is not met, continue iterative training; or, if the first preset condition is met, obtain the first reference model and the second reference model.

[0063] In some embodiments, the first initial model is a flow matching model and the second initial model is a Q-function network model. In jointly training the first initial model and the second initial model based on the first offline training sample to obtain the first total loss value, the offline training module 501 is specifically used to: train the flow matching model based on the first offline training sample to obtain a first behavior cloning loss value; train the Q-function network model based on the first offline training sample to obtain a first Q-learning loss value; and determine the first total loss value based on the first behavior cloning loss value, the first Q-learning loss value, and a preset loss weighting coefficient.

[0064] In some embodiments, in training the flow matching model based on the first offline training sample to obtain the first behavior cloning loss value, the offline training module 501 is specifically used to: extract the first reference state, the reference action command, and the reference stiffness parameter corresponding to the robotic arm from the first offline training sample; construct an interpolation path based on the reference action command, the reference stiffness parameter, preset reference noise, and multiple time steps, the multiple time steps being obtained by sampling multiple times within a preset time interval; calculate the first true velocity corresponding to the interpolation path; input the first reference state, the interpolation path, and the multiple time steps into the flow matching model for processing, and output the first predicted velocity; and calculate the mean squared error between the first predicted velocity and the first true velocity to obtain the first behavior cloning loss value.
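The behavior cloning loss computation listed above can be illustrated with the following Python sketch. It assumes a linear interpolation path between the reference noise (t = 0) and the reference parameters (t = 1), for which the true velocity is the constant difference between the two; the text does not state the path's form, so this choice, along with the function names and the concatenation of the reference action command and stiffness parameter into `reference_params`, is an assumption.

```python
from typing import Callable, Sequence

# Hypothetical signature for the flow matching model: (state, x_t, t) -> velocity.
VelocityFn = Callable[[Sequence[float], Sequence[float], float], Sequence[float]]


def behavior_cloning_loss(
    velocity_fn: VelocityFn,
    state: Sequence[float],
    reference_params: Sequence[float],
    noise: Sequence[float],
    time_steps: Sequence[float],
) -> float:
    """Mean squared error between predicted and true velocities along the path."""
    total, count = 0.0, 0
    for t in time_steps:
        # Assumed linear interpolation path from noise (t=0) to reference (t=1).
        x_t = [(1.0 - t) * n + t * x for n, x in zip(noise, reference_params)]
        # For a linear path the true velocity is constant: reference - noise.
        true_velocity = [x - n for n, x in zip(noise, reference_params)]
        predicted = velocity_fn(state, x_t, t)
        total += sum((p - v) ** 2 for p, v in zip(predicted, true_velocity))
        count += len(true_velocity)
    return total / count


# A model that always predicts zero velocity, evaluated at two time steps.
zero_model_loss = behavior_cloning_loss(
    lambda state, x_t, t: [0.0, 0.0],
    state=[0.0],
    reference_params=[1.0, 2.0],
    noise=[0.0, 0.0],
    time_steps=[0.25, 0.75],
)
```

In the toy call the true velocity is [1.0, 2.0] at both time steps, so the zero-velocity model incurs a mean squared error of 2.5.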

[0065] In some embodiments, in training the Q-function network model based on the first offline training sample to obtain the first Q-learning loss value, the offline training module 501 is specifically used to: extract the first execution reward and the second reference state corresponding to the robotic arm from the first offline training sample, the second reference state being the next state after the first reference state; input the first reference state, the reference noise, and a single time step into the flow matching model for processing, and output the second predicted velocity, the single time step being a preset fixed value; obtain a first parameter pair by single-step sampling based on the second predicted velocity and the reference noise, the first parameter pair including a first action command and a first stiffness parameter; input the first reference state and the first parameter pair into the Q-function network model for processing, and output the first Q value; determine the second Q value based on the second reference state, the reference noise, and the single time step; calculate the target Q value based on the preset discount factor, the first execution reward, and the second Q value; and calculate the mean squared error between the first Q value and the target Q value to obtain the first Q-learning loss value.

[0066] In some embodiments, in obtaining the online dataset corresponding to the robotic arm based on the first reference model and the second reference model, the data acquisition module 502 is specifically used to: obtain the first real-time state corresponding to the robotic arm through the state encoder; generate, using the first reference model, multiple candidate parameter pairs based on the first real-time state, each candidate parameter pair including a candidate action command and a candidate stiffness parameter; filter the multiple candidate parameter pairs using the second reference model to obtain the target parameter pair; obtain the target execution reward corresponding to the robotic arm, the target execution reward being obtained by the robotic arm performing the corresponding operation according to the target parameter pair; obtain the second real-time state corresponding to the robotic arm through the state encoder, the second real-time state being the next state after the first real-time state; and determine a first trajectory data group based on the first real-time state, the target parameter pair, the target execution reward, and the second real-time state, the first trajectory data group being any one of the multiple trajectory data groups.

[0067] In some embodiments, in obtaining the target parameter pair by filtering the multiple candidate parameter pairs through the second reference model, the data acquisition module 502 is specifically used to: input each candidate parameter pair, together with the first real-time state, into the second reference model for processing, and output multiple candidate Q values; select the largest candidate Q value from the multiple candidate Q values; and determine the candidate parameter pair corresponding to the largest candidate Q value as the target parameter pair.

[0068] Each module in the aforementioned adaptive learning robotic arm control device 500 can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device in hardware form, or stored in the memory of a computer device in software form, so that the processor can call and execute the operations corresponding to each module.

[0069] In some embodiments, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 6. The computer device includes a processor, memory, input/output (I/O) interfaces, and a communication interface. The processor, memory, and I/O interfaces are connected via a system bus, and the communication interface is also connected to the system bus via the I/O interfaces. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The database stores data related to the adaptive learning robotic arm control method. The I/O interfaces are used for exchanging information between the processor and external devices. The communication interface is used for communicating with external terminals via a network connection. When the computer program is executed by the processor, it implements the steps of the aforementioned adaptive learning robotic arm control method.

[0070] In some embodiments, a computer device is provided, which may be a terminal, and its internal structure may be as shown in Figure 7. The computer device includes a processor, memory, input/output interface, communication interface, display unit, and input device. The processor, memory, and input/output interface are connected via a system bus, and the communication interface, display unit, and input device are also connected to the system bus via the input/output interface. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The input/output interface is used for exchanging information between the processor and external devices. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, mobile cellular networks, NFC (Near Field Communication), or other technologies. When the computer program is executed by the processor, it implements the steps in the adaptive learning robotic arm control method described above. The display unit is used to form a visible image and can be a display screen, projection device, or virtual reality imaging device. The display screen can be an LCD screen or an e-ink screen. The input device of the computer device can be a touch layer covering the display screen, or buttons, trackballs, or touchpads set on the casing of the computer device, or external keyboards, touchpads, or mice, etc.

[0071] Those skilled in the art will understand that the structure shown in Figure 6 or Figure 7 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0072] In some embodiments, a computer device is provided, the computer device including a memory and a processor, the memory storing a computer program, the processor executing the computer program to implement the steps in the above method embodiments.

[0073] In some embodiments, as shown in Figure 8, an internal structure diagram of a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program that, when executed by a processor, implements the steps in the above method embodiments.

[0074] In some embodiments, a computer program product is provided, which includes a computer program that, when executed by a processor, implements the steps in the above method embodiments.

[0075] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of related data must comply with the relevant laws, regulations and standards of the relevant countries and regions.

[0076] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, etc., and are not limited to these.

[0077] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0078] The above embodiments are merely illustrative of several implementation methods of this application, and their descriptions are relatively specific and detailed. However, they should not be construed as limiting the scope of this application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

1. An adaptive learning-based robotic arm control method, characterized in that the method is applied to a control unit of a robotic arm control system, the system further comprising a robotic arm, the control unit being deployed with a first initial model and a second initial model sharing a state encoder, the method comprising:
jointly training the first initial model and the second initial model based on an offline dataset until a first preset condition is met, thereby obtaining a first reference model and a second reference model; the offline dataset consists of multi-source trajectory data corresponding to the robotic arm;
obtaining an online dataset corresponding to the robotic arm based on the first reference model and the second reference model; the online dataset consists of multiple trajectory data groups generated during the real-time operation of the robotic arm;
determining a mixed dataset based on the offline dataset and the online dataset;
jointly training the first reference model and the second reference model based on the mixed dataset until a second preset condition is met, thereby obtaining a first target model and a second target model;
determining an adaptive control model based on the first target model and the second target model; the adaptive control model is used to generate and filter the optimal parameter pairs of the robotic arm in real time to achieve adaptive control of the robotic arm.

2. The method according to claim 1, characterized in that the step of jointly training the first initial model and the second initial model based on the offline dataset until the first preset condition is met to obtain the first reference model and the second reference model includes:
sampling the offline dataset to obtain multiple offline training samples;
jointly training the first initial model and the second initial model based on a first offline training sample to obtain a first total loss value; the first offline training sample is any one of the multiple offline training samples;
performing backpropagation based on the first total loss value to update the parameters of the first initial model and the second initial model;
if the first preset condition is not met, continuing iterative training; or
if the first preset condition is met, obtaining the first reference model and the second reference model.

3. The method according to claim 2, characterized in that the first initial model is a flow matching model and the second initial model is a Q-function network model, and the step of jointly training the first initial model and the second initial model based on the first offline training sample to obtain the first total loss value includes:
training the flow matching model based on the first offline training sample to obtain a first behavior cloning loss value;
training the Q-function network model based on the first offline training sample to obtain a first Q-learning loss value;
determining the first total loss value based on the first behavior cloning loss value, the first Q-learning loss value, and a preset loss weighting coefficient.

4. The method according to claim 3, characterized in that the step of training the flow matching model based on the first offline training sample to obtain the first behavior cloning loss value includes:
extracting the first reference state, the reference action command, and the reference stiffness parameter corresponding to the robotic arm from the first offline training sample;
constructing an interpolation path based on the reference action command, the reference stiffness parameter, preset reference noise, and multiple time steps; the multiple time steps are obtained by sampling multiple times within a preset time interval;
calculating the first true velocity corresponding to the interpolation path;
inputting the first reference state, the interpolation path, and the multiple time steps into the flow matching model for processing, and outputting the first predicted velocity;
calculating the mean squared error between the first predicted velocity and the first true velocity to obtain the first behavior cloning loss value.

5. The method according to claim 4, characterized in that the step of training the Q-function network model based on the first offline training sample to obtain the first Q-learning loss value includes: extracting, from the first offline training sample, a first execution reward and a second reference state corresponding to the robotic arm, the second reference state being the next state after the first reference state; inputting the first reference state, the reference noise, and a single time step into the flow matching model for processing to output a second predicted velocity, the single time step being set to a preset fixed value; performing single-step sampling based on the second predicted velocity and the reference noise to obtain a first parameter pair, the first parameter pair including a first action command and a first stiffness parameter; inputting the first reference state and the first parameter pair into the Q-function network model for processing to output a first Q value; determining a second Q value based on the second reference state, the reference noise, and the single time step; calculating a target Q value based on a preset discount factor, the first execution reward, and the second Q value; and calculating the mean square error between the first Q value and the target Q value to obtain the first Q-learning loss value.
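A minimal sketch of the Q-learning loss in claim 5 follows. Two points are assumptions, not taken from the claim: single-step sampling is implemented as one Euler step of unit size from the noise (pair = noise + velocity), and the second Q value is computed with the same Q-function used for the first (practical systems often use a separate, slowly updated copy). The models are plain callables and every name is hypothetical.

```python
def one_step_sample(velocity_model, state, noise, t_fixed=0.0):
    """Single Euler step from the noise along the predicted velocity."""
    velocity = velocity_model(state, noise, t_fixed)
    return [n + v for n, v in zip(noise, velocity)]


def q_learning_loss(velocity_model, q_model, batch, noise, gamma=0.99, t_fixed=0.0):
    """Mean squared temporal-difference error over (state, reward, next_state) samples."""
    squared_errors = []
    for state, reward, next_state in batch:
        pair = one_step_sample(velocity_model, state, noise, t_fixed)
        q_first = q_model(state, pair)                      # first Q value
        next_pair = one_step_sample(velocity_model, next_state, noise, t_fixed)
        q_second = q_model(next_state, next_pair)           # second Q value
        target = reward + gamma * q_second                  # target Q value
        squared_errors.append((q_first - target) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Toy stand-ins: velocity depends on the state, Q is the sum of the pair.
toy_velocity = lambda state, x, t: [state[0], 0.0]
toy_q = lambda state, pair: sum(pair)
toy_batch = [([2.0], 0.5, [1.0])]
toy_loss = q_learning_loss(toy_velocity, toy_q, toy_batch,
                           noise=[0.0, 1.0], gamma=0.5)
```

In the toy example the first Q value is 3.0, the second is 2.0, the target is 0.5 + 0.5 * 2.0 = 1.5, and the loss is (3.0 - 1.5)^2 = 2.25.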

6. The method according to any one of claims 1-5, characterized in that the step of obtaining the online dataset corresponding to the robotic arm based on the first reference model and the second reference model includes: obtaining a first real-time state of the robotic arm through the state encoder; generating multiple candidate parameter pairs based on the first reference model and the first real-time state, each candidate parameter pair including a candidate action command and a candidate stiffness parameter; filtering the multiple candidate parameter pairs using the second reference model to obtain a target parameter pair; obtaining a target execution reward corresponding to the robotic arm, the target execution reward being obtained when the robotic arm performs the corresponding operation according to the target parameter pair; obtaining a second real-time state corresponding to the robotic arm through the state encoder, the second real-time state being the next state after the first real-time state; and determining a first trajectory data group based on the first real-time state, the target parameter pair, the target execution reward, and the second real-time state, the first trajectory data group being any one of the multiple trajectory data groups.
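The online collection step of claim 6 can be sketched as a single function that produces one trajectory data group. The robotic arm, state encoder, and both reference models are replaced by toy stand-ins; `ToyArm`, `collect_transition`, and the reward definition are hypothetical and only illustrate the order of operations.

```python
import random


class ToyArm:
    """One-dimensional stand-in for the arm; the goal position is 1.0."""

    def __init__(self):
        self.position = 0.0

    def read_state(self):
        # Stands in for the state encoder output.
        return (self.position,)

    def execute(self, pair):
        action, _stiffness = pair      # stiffness is ignored by this toy
        self.position += action
        return -abs(1.0 - self.position)   # reward: closeness to the goal


def collect_transition(read_state, sample_pair, q_value, execute,
                       num_candidates=8, rng=None):
    """Generate candidates, keep the highest-Q pair, execute it, record the result."""
    rng = rng or random.Random(0)
    state = read_state()                                        # first real-time state
    candidates = [sample_pair(state, rng) for _ in range(num_candidates)]
    target_pair = max(candidates, key=lambda pair: q_value(state, pair))
    reward = execute(target_pair)                               # target execution reward
    next_state = read_state()                                   # second real-time state
    return (state, target_pair, reward, next_state)


arm = ToyArm()
transition = collect_transition(
    arm.read_state,
    # Stand-in for the first reference model: random (action, stiffness) pairs.
    lambda state, rng: (rng.uniform(-1.0, 1.0), rng.uniform(10.0, 100.0)),
    # Stand-in for the second reference model: prefers actions reaching the goal.
    lambda state, pair: -abs(1.0 - (state[0] + pair[0])),
    arm.execute,
)
```

Calling `collect_transition` repeatedly and appending each returned tuple to a list would yield the online dataset.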

7. The method according to claim 6, characterized in that the step of filtering the multiple candidate parameter pairs using the second reference model to obtain the target parameter pair includes: inputting each candidate parameter pair together with the first real-time state into the second reference model for processing to output multiple candidate Q values; selecting the largest candidate Q value among the multiple candidate Q values; and determining the candidate parameter pair corresponding to the largest candidate Q value as the target parameter pair.
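The selection rule of claim 7 amounts to an argmax over candidate Q values. A minimal sketch, assuming the second reference model is available as a callable `q_model(state, pair)`; the function name and the toy Q-function are hypothetical.

```python
def select_target_pair(q_model, state, candidates):
    """Return the candidate pair with the largest Q value, plus that Q value."""
    if not candidates:
        raise ValueError("at least one candidate parameter pair is required")
    q_values = [q_model(state, pair) for pair in candidates]
    best_index = max(range(len(q_values)), key=q_values.__getitem__)
    return candidates[best_index], q_values[best_index]


# Toy Q-function: prefers actions near 0.5 and mildly penalizes high stiffness.
toy_q = lambda state, pair: -(pair[0] - 0.5) ** 2 - 0.001 * pair[1]
candidates = [(0.1, 50.0), (0.4, 80.0), (0.9, 20.0)]
best_pair, best_q = select_target_pair(toy_q, (0.0,), candidates)
```

When several candidates share the largest Q value, `max` returns the first of them; the claim does not specify a tie-breaking rule.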

8. An adaptive learning robotic arm control device, characterized in that the device is applied to a control unit of a robotic arm control system, the system further comprising a robotic arm, the control unit being deployed with a first initial model and a second initial model that share a state encoder, the device comprising: an offline training module, used to jointly train the first initial model and the second initial model based on an offline dataset until a first preset condition is met, thereby obtaining a first reference model and a second reference model, the offline dataset consisting of multi-source trajectory data corresponding to the robotic arm; a data acquisition module, used to obtain an online dataset corresponding to the robotic arm based on the first reference model and the second reference model, the online dataset consisting of multiple trajectory data groups generated during real-time operation of the robotic arm; a first determining module, used to determine a hybrid dataset based on the offline dataset and the online dataset; an online fine-tuning module, used to jointly train the first reference model and the second reference model based on the hybrid dataset until a second preset condition is met, thereby obtaining a first target model and a second target model; and a second determining module, used to determine an adaptive control model based on the first target model and the second target model, the adaptive control model being used to generate and filter optimal parameter pairs for the robotic arm in real time to achieve adaptive control of the robotic arm.
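The claim does not say how the first determining module combines the offline and online datasets. One simple possibility, assumed here for illustration only, is to draw each training batch with a fixed fraction of online samples. The function name `sample_hybrid_batch` and the parameter `online_fraction` are hypothetical.

```python
import random


def sample_hybrid_batch(offline, online, batch_size, online_fraction=0.5, rng=None):
    """Draw a batch mixing offline and online samples at a fixed ratio (assumed scheme)."""
    if not 0.0 <= online_fraction <= 1.0:
        raise ValueError("online_fraction must lie in [0, 1]")
    rng = rng or random.Random(0)
    # Fall back to offline-only batches while no online data exists yet.
    n_online = round(batch_size * online_fraction) if online else 0
    n_offline = batch_size - n_online
    batch = []
    if n_offline:
        batch += rng.choices(offline, k=n_offline)   # sampling with replacement
    if n_online:
        batch += rng.choices(online, k=n_online)
    rng.shuffle(batch)
    return batch


offline_data = [("offline", i) for i in range(100)]
online_data = [("online", i) for i in range(10)]
hybrid_batch = sample_hybrid_batch(offline_data, online_data, batch_size=8,
                                   online_fraction=0.25)
```

Sampling with replacement keeps the ratio fixed even while the online dataset is still small.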

9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 7.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 7.