A surgical robot skill learning method based on RS-ACT and dual-stream servoing

By using a hierarchically coupled surgical robot skill learning network, combined with RS-ACT planning and spatial servoing branches, the problems of decision feedback delay and insufficient positioning accuracy of surgical robots in complex surgical field environments are solved. This achieves efficient sub-pixel positioning and robust fine operation, making it suitable for intelligent minimally invasive assisted surgery and automated medical operations.

CN122245691A · Pending · Publication Date: 2026-06-19 · TIANJIN UNIVERSITY OF TECHNOLOGY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
TIANJIN UNIVERSITY OF TECHNOLOGY
Filing Date
2026-02-02
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing surgical robots suffer from problems such as high decision feedback delay, insufficient instantaneous positioning accuracy for fine operations, and unstable long-range motion planning in complex surgical field environments. They also have low recognition rate and weak generalization ability, especially in high-precision scenarios.

Method used

A surgical robot skill learning method based on RS-ACT and dual-stream servoing is adopted. Through a hierarchically coupled surgical robot skill learning network, combined with RS-ACT planning branch and spatial servoing branch, a dynamic weight fusion layer is designed to achieve efficient fusion of action judgment and spatial positioning. Dual-stream sensing is used to filter surgical field noise, and a dynamic switching mechanism optimizes the allocation of computing resources when moving away from and approaching the target.

🎯Benefits of technology

It achieves efficient reasoning and sub-pixel-level localization in complex surgical field environments, reduces computational latency, enhances the robot's fine manipulation capabilities in complex physiological environments, improves robustness and localization accuracy, and is suitable for intelligent minimally invasive assisted surgery and automated medical operations.

✦ Generated by Eureka AI based on patent content.

[Figure: CN122245691A_ABST]

Abstract

This invention discloses a surgical robot skill learning method based on RS-ACT and dual-stream servoing, comprising: S1, acquiring surgical images including the tip of the surgical instrument and the target surgical position from the perspective of the surgical robotic arm; S2, performing dual-stream sensing preprocessing on the surgical images and fusing the step-by-step results to obtain dual-stream sensing surgical images; S3, constructing a hierarchically coupled surgical robot skill learning network; S4, training the hierarchically coupled surgical robot skill learning network; S5, setting a dynamic switching mechanism for the hierarchically coupled surgical robot skill learning network based on the distance between the current surgical instrument and the target surgical position, so that the RS-ACT planning branch is run separately in the long-distance planning stage, and the RS-ACT planning branch and the servo planning branch are run in parallel in the near-distance planning stage. This method reduces the computational latency in simple operation scenarios and alleviates the positioning drift problem at the moment of surgical contact, with high computational efficiency, high positioning accuracy and strong robustness.

Description

Technical Field

[0001] This invention relates to the field of surgical robots and intelligent assisted surgical systems, and in particular to a surgical robot skill learning method based on RS-ACT and dual-stream servoing.

Background Technology

[0002] Precise instrument end-effector positioning and stable surgical path planning are key prerequisites for robotic completion of complex surgical procedures. Minimally invasive surgery typically faces the dual challenges of macroscopic approach navigation and microscopic tissue contact correction.

[0003] On the one hand, the Action Chunking with Transformers (ACT) algorithm performs well in long-sequence prediction, but it has a high computational cost and often suffers from insufficient spatial accuracy and feedback delay in the final stage near the target tissue. On the other hand, although traditional visual servoing methods can achieve local closed-loop correction, they are susceptible to noise interference in complex surgical field environments (such as smoke and reflections).

[0004] Existing adaptive computation time (ACT) mechanisms suffer from low recognition rates and weak generalization ability when handling high-precision scenarios such as bronchoscopic interventions or surgeries in narrow cavities. Achieving efficient allocation of computational resources and multi-dimensional decoupling of visual perception are key to solving these problems.

Summary of the Invention

[0005] The purpose of this invention is to provide a surgical robot skill learning method based on RS-ACT and dual-stream servo to solve the technical problems of high decision feedback delay, insufficient instantaneous positioning accuracy for fine operations, and unstable long-range motion planning in surgical robots in complex surgical field environments, so as to achieve manual skill learning and execution that balances efficient reasoning and sub-pixel-level positioning.

[0006] Therefore, the technical solution of the present invention is as follows:

[0007] A surgical robot skill learning method based on RS-ACT and dual-stream servoing is proposed, which is implemented based on a hierarchically coupled surgical robot skill learning network. The network consists of an RS-ACT planning branch, a spatial servoing branch, and a dynamic weight fusion layer.

[0008] The RS-ACT planning branch is based on action chunking and is formed by adding an action judgment module to its decoding network. The action judgment module consists of an inter-layer difference operator, a feature concatenation module, a residual stable stopping gate, and a logic control module. The inter-layer difference operator is connected to the Transformer decoder; it extracts the feature tensor output by the current layer $l$ and the feature tensor output by layer $l-1$, and calculates their L2 norm residual. The feature concatenation module is connected to the Transformer decoder and the inter-layer difference operator respectively, and concatenates the feature tensor output by the current layer $l$ with the L2 norm residual. The residual stable stopping gate adopts a nonlinear mapping layer, which is connected to the feature concatenation module to generate the stopping probability. The logic control module determines whether to continue or stop the surgical action calculation based on a preset probability threshold and the stopping probability.

[0009] The spatial servo branch consists of a convolutional neural network, a spatial normalization layer, and a coordinate expectation mapping operator connected in sequence. The convolutional neural network uses ResNet18 to output high-dimensional feature maps representing the key features in the image, namely the surgical instrument tip and the target surgical location. The spatial normalization layer applies Softmax to each high-dimensional feature map over the height and width dimensions to generate a probability heatmap. The coordinate expectation mapping operator obtains the image coordinates of the key point corresponding to each key feature as the expected value of the pixel coordinates, that is, the sum of all pixel coordinates in the probability heatmap weighted by their probabilities. Furthermore, by combining the z-axis coordinate obtained from the surgical images, the precise spatial position coordinates of the surgical instrument end are calculated.

The dynamic weight fusion layer fuses the next-moment position output by the action chunking module with the key point coordinate difference output by the spatial servo branch to obtain the position correction result.

[0010] Furthermore, the residual stability stopping gate is composed of a global average pooling operator, a multilayer perceptron, and a nonlinear activation operator connected in sequence; wherein, the multilayer perceptron is composed of a first linear fully connected layer, a ReLU activation layer, and a second linear fully connected layer connected in sequence; the nonlinear activation operator adopts the Sigmoid function.

[0011] Furthermore, the fusion expression for the dynamic weight fusion layer is:

[0012] $$P_{fuse} = w_g \cdot P_{plan} + w_r \cdot \Delta P_{servo},$$

[0013] In the formula, $P_{fuse}$ is the fused position correction result; $w_g$ is the global path weight; $P_{plan}$ is the next-time-step position output by the action chunking module in the RS-ACT planning branch; $w_r$ is the residual adjustment weight; and $\Delta P_{servo}$ is the residual correction amount output by the spatial servo branch, specifically the difference between the target surgical position coordinates and the coordinates of the surgical instrument end.

[0014] A surgical robot skill learning method based on RS-ACT and dual-stream servoing, the implementation steps of which are as follows:

[0015] S1. Based on the perspective of the surgical robotic arm, acquire surgical images including the tip of the surgical instrument and the target surgical location below it;

[0016] S2. Perform dual-stream sensing preprocessing on the surgical images sequentially, including HSV dual-threshold color filtering and binary mask extraction, and fuse the results of each step to obtain dual-stream sensing surgical images.

[0017] S3. Construct a hierarchically coupled surgical robot skill learning network;

[0018] S4. Train the hierarchical coupled surgical robot skill learning network constructed in step S3;

[0019] S5. Based on the distance between the current surgical instruments and the target surgical position, a dynamic switching mechanism is set up for the hierarchical coupled surgical robot skill learning network in practical applications, so that the RS-ACT planning branch runs alone in the long-distance planning stage, while the RS-ACT planning branch and the servo planning branch run in parallel in the near-distance planning stage.

[0020] Furthermore, the specific implementation steps of step S2 are as follows:

[0021] S201. The surgical image is processed using the HSV dual-threshold color filtering operator to extract the surgical instruments by filtering out background noise, and the filtered image is obtained.

[0022] S202. Perform morphological closing and morphological opening operations on the filtered image in sequence, and then use the bitmap mask operator to generate a binary mask.

[0023] S203. Using the merging operator, perform pixel-by-pixel synthesis on the filtered image and the binary mask to generate a dual-stream visual input image.

[0024] Furthermore, in step S4, the training method for the hierarchically coupled surgical robot skill learning network is as follows:

[0025] S401. Construct a surgical robot embodied intelligence dataset: 1) Obtain several sets of surgical images using the same image acquisition method as in step S1, where each set consists of several frames of surgical instrument operation trajectory images from a simulation environment or a real scene; 2) Label every frame of the surgical instrument operation trajectory images in each set with the current pose of the robotic arm end effector and the data of multiple subsequent consecutive action chunks starting from the current position of the robotic arm end effector.

[0026] S402. Set the joint loss function to be minimized during training. Its expression is:

[0027] $$\mathcal{L} = \lambda_{act}\,\mathcal{L}_{act} + \lambda_{stop}\,\mathcal{L}_{stop} + \mathcal{L}_{reg},$$

[0028] In the formula, $\lambda_{act}$ is the weight of the action prediction loss; $\mathcal{L}_{act}$ is the action prediction loss; $\lambda_{stop}$ is the weight of the stopping penalty loss; $\mathcal{L}_{stop}$ is the stopping penalty loss applied to the computation depth; and $\mathcal{L}_{reg}$ is a regularization term based on inter-layer residuals;

[0029] S403. Using the method in step S2, the surgical images in the surgical robot embodied intelligence dataset are processed and then fed into the hierarchically coupled surgical robot skill learning network. With the goal of minimizing the joint loss function, training of both the RS-ACT planning branch and the spatial servo branch is completed simultaneously.

[0030] Furthermore, the specific implementation steps of step S5 are as follows:

[0031] S501. Using a monocular vision-based depth estimation method, the distance $d$ from the end of the surgical instrument to the target surgical position is calculated based on the surgical images acquired in real time;

[0032] S502. Set a distance threshold $d_{th}$ and make the following judgment:

[0033] When the distance $d > d_{th}$, the current motion state of the surgical robotic arm is determined to be in the long-distance planning stage; the hierarchically coupled surgical robot skill learning network runs only the RS-ACT planning branch, and uses the next-moment position output by the action chunking module in the RS-ACT planning branch as the next moving position of the surgical instrument end effector.

[0034] When the distance $d \le d_{th}$, the current motion state of the surgical robotic arm is determined to be in the precision operation stage; the hierarchically coupled surgical robot skill learning network runs the RS-ACT planning branch and the spatial servo branch in parallel, and uses the fusion result of the dynamic weight fusion layer as the next moving position of the surgical instrument end effector.
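The distance-based switching rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `next_position`, the two branch callables, and the default weights are hypothetical, the fusion is written as a plain weighted sum, and the boundary case (distance equal to the threshold) is assigned to the precision stage by assumption. Passing the branches as callables makes explicit that the servo branch is not evaluated at all in the long-distance stage.

```python
from typing import Callable, List, Tuple

Vec = Tuple[float, ...]


def next_position(
    distance: float,
    distance_threshold: float,
    plan_branch: Callable[[], Vec],
    servo_branch: Callable[[], Vec],
    w_global: float = 1.0,    # hypothetical global path weight
    w_residual: float = 0.5,  # hypothetical residual adjustment weight
) -> Tuple[Vec, str]:
    """Return the next end-effector position and the active stage name."""
    planned = plan_branch()  # the planning branch runs in both stages
    if distance > distance_threshold:
        # Long-distance planning stage: the servo branch is skipped entirely.
        return planned, "long_distance"
    # Precision operation stage: run the servo branch and fuse both outputs.
    correction = servo_branch()
    fused = tuple(w_global * p + w_residual * c for p, c in zip(planned, correction))
    return fused, "precision"


servo_calls: List[str] = []


def demo_plan() -> Vec:
    return (10.0, 20.0, 5.0)


def demo_servo() -> Vec:
    servo_calls.append("servo")  # record every servo evaluation
    return (0.2, -0.4, 0.0)


far_result = next_position(50.0, 15.0, demo_plan, demo_servo)
near_result = next_position(8.0, 15.0, demo_plan, demo_servo)
```

In this toy run the servo branch is called once, only for the near-target query, which is the source of the compute saving the text attributes to the switching mechanism.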

[0035] Compared with existing technologies, this surgical robot skill learning method based on RS-ACT and dual-stream servoing achieves high computational efficiency by designing the RS-ACT planning branch to dynamically adjust the inference depth based on inter-layer residuals. Simultaneously, the parallel spatial servo branch enables sub-pixel-level geometric feedback, effectively mitigating positioning drift during movement towards the surgical target location and achieving high positioning accuracy of the surgical instrument end effector. Furthermore, before the surgical images are processed in the network containing the RS-ACT planning and spatial servo branches, dual-stream sensing processing and fusion of the surgical images effectively filters background noise in the surgical field, enhancing the robot's fine manipulation capabilities in complex physiological environments. This makes the method robust and promising for clinical application.

Description of the Drawings

[0036] Figure 1 is a flowchart of the surgical robot skill learning method based on RS-ACT and dual-stream servoing according to the present invention.

[0037] Figure 2 is a schematic diagram illustrating the working principle of the RS-ACT planning branch in the hierarchically coupled surgical robot skill learning network of the present invention.

[0038] Figure 3 is a diagram illustrating the architecture of the spatial servo branch in the hierarchically coupled surgical robot skill learning network of the present invention.

[0039] Figure 4 is a flowchart illustrating the dynamic switching mechanism of the hierarchically coupled surgical robot skill learning network of the present invention in practical applications.

Detailed Implementation

[0040] The present invention will be further described below with reference to the accompanying drawings and specific embodiments, but the following embodiments are by no means intended to limit the present invention.

[0041] Referring to Figure 1, the specific implementation of the surgical robot skill learning method based on residual stability adaptive computation (RS-ACT) and dual-stream servoing is described below.

[0042] S1. Based on the perspective of the surgical robotic arm, acquire surgical images including the tip of the surgical instrument and the target surgical location below it.

[0043] In this embodiment, the camera used to acquire surgical images is mounted on the wrist of the surgical robotic arm, that is, on the joint closest to the end of the surgical robotic arm. Based on the end view of the surgical robotic arm, the camera can simultaneously acquire surgical images including the tip of the surgical instrument and the surgical site below it.

[0044] The surgical image acquisition method in step S1 differs from traditional image acquisition methods. Specifically, traditional surgical image acquisition typically uses a fixed-position external monitoring camera or a separate endoscope holder, which is a hand-eye separation acquisition mode. This mode inevitably suffers from physical occlusion problems, calibration error accumulation, and limited visual resolution during actual image acquisition. In contrast, the applicant's image acquisition adopts a hand-eye integrated method, and based on this acquisition method, the following surgical robot skill learning method has been designed. Compared with the traditional mode, it not only has the advantages of no visual blind spots, enhanced local perception, and reduced computational load, but also creatively proposes a new method for surgical robot skill learning based on the perspective of the surgical robotic arm.

[0045] S2. Perform HSV dual-threshold color filtering and binary mask extraction on the surgical image in sequence, and obtain the dual-stream sensing surgical image by fusing the results of the steps.

[0046] Step S2 is used to perform surgical field feature enhancement and semantic decoupling processing on the surgical images acquired in step S1, so as to obtain images suitable for a hierarchically coupled surgical robot skill learning network.

[0047] In step S2, the method for performing dual-stream sensing preprocessing on the surgical images is as follows:

[0048] S201. The surgical image is processed using the HSV dual-threshold color filtering operator to extract the surgical instruments by filtering background noise, and the filtered image is obtained.

[0049] S202. Using morphological correction operators and bitmap masking operators, perform topology optimization processing on the filtered image obtained in step S201; specifically,

[0050] 1) The filtered image is first processed by a morphological closing operation to fill the mask holes caused by specular reflection from the instrument, and then by a morphological opening operation to filter out background outliers and noise.

[0051] 2) Using the bitmap mask operator, the image obtained from step 1) is processed to generate a semantically connected and edge-smooth binary mask.

[0052] S203. Using the merging operator, perform pixel-by-pixel synthesis on the filtered image and the binary mask to generate a dual-stream sensing surgical image.

[0053] The dual-stream sensing surgical image guides the network to focus its attention on the geometric center of the surgical instrument tip, resolving perceptual interference caused by intraoperative smoke or reflections.
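Steps S201 to S203 can be sketched with NumPy alone. This is an illustration under several assumptions that the text does not fix: the input is already in HSV, the "dual threshold" is a per-channel lower and upper bound, the morphology uses a 3×3 structuring element, and the "merging operator" is taken to mean masking the filtered image and appending the mask as a fourth channel. All function names and threshold values are hypothetical.

```python
import numpy as np


def hsv_in_range(hsv, lower, upper):
    """S201: True where every HSV channel lies within [lower, upper]."""
    return np.all((hsv >= np.asarray(lower)) & (hsv <= np.asarray(upper)), axis=-1)


def _neighbourhood(mask, pad_value):
    """Stack the nine 3x3-neighbourhood shifts of a boolean mask."""
    h, w = mask.shape
    padded = np.pad(mask, 1, mode="constant", constant_values=pad_value)
    return np.stack([padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)])


def dilate(mask):
    return _neighbourhood(mask, False).any(axis=0)


def erode(mask):
    # Pixels outside the image count as foreground so borders are not eroded.
    return _neighbourhood(mask, True).all(axis=0)


def dual_stream_preprocess(hsv, lower, upper):
    in_range = hsv_in_range(hsv, lower, upper)
    filtered = hsv * in_range[..., None]        # S201: filtered image
    closed = erode(dilate(in_range))            # S202: closing fills reflection holes
    mask = dilate(erode(closed))                # S202: opening removes isolated noise
    merged = np.concatenate(                    # S203: assumed merge = masked image + mask
        [filtered * mask[..., None], mask[..., None].astype(hsv.dtype)], axis=-1
    )
    return filtered, mask, merged


# Toy 7x7 scene: tissue background, a 3x3 instrument patch with a specular
# hole at its centre, and one isolated instrument-coloured noise pixel.
hsv = np.tile(np.array([5.0, 200.0, 120.0]), (7, 7, 1))
hsv[2:5, 2:5] = (100.0, 30.0, 200.0)
hsv[3, 3] = (0.0, 0.0, 255.0)
hsv[0, 6] = (100.0, 30.0, 200.0)
filtered, mask, merged = dual_stream_preprocess(hsv, (90, 10, 150), (120, 80, 255))
```

In the toy scene the closing step restores the reflective centre pixel of the instrument and the opening step discards the isolated pixel, so the final mask is exactly the 3×3 instrument patch.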

[0054] S3. Construct a hierarchically coupled surgical robot skill learning network and complete network training.

[0055] As shown in Figure 1, this hierarchically coupled surgical robot skill learning network consists of an RS-ACT planning branch, a spatial servo branch, and a dynamic weight fusion layer.

[0056] The RS-ACT planning branch uses Action Chunking with Transformers (ACT) as its basic framework, improved by adding an action judgment module to the ACT decoding network. Specifically, in the RS-ACT planning branch, the action chunking module estimates future surgical actions based on the input dual-stream sensing surgical image and the current pose of the robotic arm, specifically the trajectory point sequence of the surgical instrument end effector over multiple future time steps. The action judgment module is used to accurately determine the depth of the surgical action inference calculation, thereby determining whether the surgical instrument should continue execution or stop subsequent calculations. This action judgment module adaptively reduces inference latency to control the surgical robotic arm to stop moving more promptly.

[0057] Specifically, as shown in Figure 2, the action judgment module consists of an inter-layer difference operator, a feature concatenation module, a residual stable stopping gate, and a logic control module.

[0058] The inter-layer difference operator is connected to the Transformer decoder. It extracts the feature tensor $F^{l}$ output by the current layer $l$ and the feature tensor $F^{l-1}$ output by layer $l-1$, and calculates the L2 norm residual $\Delta_l$ of the two features, which is used to evaluate semantic convergence.

[0059] The expression for the inter-layer difference operator is:

[0060] $$\Delta_l = \lVert F^{l} - F^{l-1} \rVert_2 = \sqrt{\sum_{i=1}^{H}\sum_{j=1}^{W}\sum_{c=1}^{C}\left(F^{l}_{i,j,c} - F^{l-1}_{i,j,c}\right)^2},$$

[0061] In the formula, $i$ is the index of the feature map in the height direction, $i = 1, \dots, H$; $j$ is the index of the feature map along the width direction, $j = 1, \dots, W$; $c$ is the index of the feature map along the channel direction, $c = 1, \dots, C$; $F^{l}_{i,j,c}$ is the value of feature tensor $F^{l}$ at row $i$, column $j$, channel $c$; and $F^{l-1}_{i,j,c}$ is the value of feature tensor $F^{l-1}$ at row $i$, column $j$, channel $c$.

[0062] The feature concatenation module is connected to the Transformer decoder and the inter-layer difference operator respectively. It concatenates the features $F^{l}$ output by the current layer $l$ with the corresponding L2 norm residual $\Delta_l$ to obtain the concatenated result of the current layer features and the residual.
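The inter-layer difference operator and the feature concatenation module can be sketched as below. The residual is the L2 norm of the element-wise difference over height, width, and channel. How the scalar residual is attached to the feature tensor is not specified in the text; broadcasting it into one extra constant channel is an assumption, and both function names are hypothetical.

```python
import numpy as np


def interlayer_residual(feat_curr, feat_prev):
    """L2 norm of the difference between two (H, W, C) feature tensors."""
    return float(np.sqrt(np.sum((feat_curr - feat_prev) ** 2)))


def concat_feature_and_residual(feat_curr, residual):
    """Assumed concatenation: append the residual as a constant extra channel."""
    h, w, _ = feat_curr.shape
    residual_plane = np.full((h, w, 1), residual, dtype=feat_curr.dtype)
    return np.concatenate([feat_curr, residual_plane], axis=-1)


feat_prev = np.zeros((2, 2, 3))
feat_curr = np.ones((2, 2, 3))
residual = interlayer_residual(feat_curr, feat_prev)  # sqrt of 12 unit differences
concatenated = concat_feature_and_residual(feat_curr, residual)
```

A residual close to zero means the current decoder layer changed the features very little, which is the convergence signal the stopping gate consumes.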

[0063] The residual stabilization stopping gate is a newly designed module that connects to the feature concatenation module. It generates stopping probabilities by inputting the current layer features and the concatenated residuals. Specifically, the residual stabilization stopping gate employs a nonlinear mapping layer. In this embodiment, the residual stabilization stopping gate is composed of a global average pooling operator, a multilayer perceptron, and a nonlinear activation operator connected sequentially.

[0064] The global average pooling operator is used to reduce the dimensionality of the concatenated result output by the feature concatenation module in order to extract the core semantic response vector by compressing the spatial dimension.

[0065] The multilayer perceptron consists of a first fully connected linear layer, a ReLU activation layer, and a second fully connected linear layer connected in sequence. It is used to learn the variation pattern of the feature residual and to map the high-dimensional features to a scalar score.

[0066] The nonlinear activation operator uses the Sigmoid function to convert the above scalar score into a stopping probability $p_l$ in the range 0 to 1, which characterizes the saturation state of the current computational layer's contribution to action prediction.

[0067] The logic control module compares the stopping probability $p_l$ with the preset probability threshold $\tau$: when the stopping probability $p_l$ is greater than the probability threshold $\tau$, the surgical action calculation is stopped; otherwise, the calculation continues. In this embodiment, the probability threshold $\tau$ is set to 0.99.
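The gate (global average pooling, two-layer perceptron, Sigmoid) and the threshold test can be sketched as a lazy loop over decoder layer outputs. This is an illustration only: the gate weights below are hand-picked so that a small residual yields a high stopping probability, whereas in the method they would be learned; the function names, the toy feature values, and the residual-as-extra-channel concatenation are assumptions.

```python
import numpy as np


def stopping_probability(concat, w1, b1, w2, b2):
    pooled = concat.mean(axis=(0, 1))            # global average pooling over H, W
    hidden = np.maximum(0.0, pooled @ w1 + b1)   # first linear layer + ReLU
    score = float(hidden @ w2 + b2)              # second linear layer -> scalar score
    return 1.0 / (1.0 + np.exp(-score))          # Sigmoid -> stopping probability


def run_with_early_exit(layer_outputs, gate, threshold=0.99):
    """Consume layer outputs one by one; stop once the gate exceeds the threshold."""
    prev, used, p = None, 0, 0.0
    for feat in layer_outputs:
        used += 1
        if prev is not None:
            residual = np.sqrt(np.sum((feat - prev) ** 2))
            plane = np.full(feat.shape[:2] + (1,), residual)
            p = stopping_probability(np.concatenate([feat, plane], axis=-1), *gate)
            if p > threshold:
                break  # remaining layers are never evaluated
        prev = feat
    return used, p


# Hand-picked gate: hidden = ReLU(1 - residual), score = 20 * hidden - 10.
gate = (np.array([[0.0], [-1.0]]), 1.0, np.array([20.0]), -10.0)
values = [0.0, 2.0, 2.5, 2.505, 2.505, 2.505]          # toy per-layer feature level
layers = (np.full((2, 2, 1), v) for v in values)       # generator = lazy layers
layers_used, p_stop = run_with_early_exit(layers, gate)
```

With these toy values the features stop changing noticeably at the fourth layer, so the loop exits there and the last two layers are left unevaluated.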

[0068] The spatial servo branch runs in parallel with the RS-ACT planning branch and aims to provide sub-pixel-level geometric feedback. As shown in Figure 3, the spatial servo branch consists of a convolutional neural network, a spatial normalization layer, and a coordinate expectation mapping operator connected in sequence.

[0069] Convolutional neural networks (CNNs) are used to extract features from input dual-stream sensing surgical images and output high-dimensional feature maps. In this embodiment, the CNN specifically uses ResNet18, which typically has two channels to correspond to the features of interest in the surgical image. For example, in suturing surgery, the tip of the forceps and the end of the suture in the surgical image are two key points of interest. The high-dimensional feature map output by the CNN (ResNet18) is set to two channels to be used to extract features from the tip of the forceps and the end of the suture, respectively. The tip of the forceps is the end of the surgical instrument, and the end of the suture is the target surgical location. Furthermore, each channel of the CNN independently outputs a corresponding high-dimensional feature map, which is then used to independently generate the corresponding probability heatmap through a spatial normalization layer, thereby achieving parallel localization of different key targets.

[0070] The spatial normalization layer is used to perform softmax processing on the high-dimensional feature map in both the height and width spatial dimensions, so as to convert all the pixel values in the high-dimensional feature map into probabilities with values between 0 and 1, such that the pixel values of the entire map sum to 1. The purpose of this spatial normalization layer is to suppress interference terms and make the network focus on the brightest and most salient features.

[0071] Unlike traditional methods that directly perform global average pooling on high-dimensional feature maps, the spatial servo branch of this invention uses a spatial normalization layer to probabilistically weight the high-dimensional feature map over its height and width dimensions, generating a probability heatmap. In the probability heatmap, the brightest (highest-value) position is the peak position. Taking the suturing surgery described above as an example, the two high-dimensional feature maps output by the two channels of the convolutional neural network are passed through the spatial normalization layer to generate two probability heatmaps, whose peak positions correspond to the tip of the forceps and the end of the suture, respectively.

[0072] The processing expression for this spatial normalization layer is:

[0073] $$P_{i,j} = \frac{\exp(M_{i,j})}{\sum_{i'=1}^{H}\sum_{j'=1}^{W}\exp(M_{i',j'})},$$

[0074] In the formula, $\exp(M_{i,j})$ is the exponential weight of the high-dimensional feature map $M$ at location $(i, j)$, used to map the original features to non-negative values and enhance the significance of peak features; $\exp(M_{i',j'})$ is the exponential weight of the feature map at location $(i', j')$, summed over the whole map to achieve spatial probability normalization.

[0075] Based on this, the generation of this probability heat map gives the surgical images visualization value, allowing doctors to see which specific location the robot is currently "staring" at.

[0076] The coordinate expectation mapping operator is used to calculate the peak position in a probability heatmap. Specifically, it computes the probability-weighted sum of the coordinates of all pixels in the probability heatmap; the resulting expected value serves as the location of the corresponding key feature, that is, the key point location $(\hat{x}, \hat{y})$;

[0077] Specifically, the expression for the coordinate expectation mapping operator is:

[0078] $$\hat{x} = \sum_{i=1}^{H}\sum_{j=1}^{W} j \cdot P_{i,j},$$

[0079] $$\hat{y} = \sum_{i=1}^{H}\sum_{j=1}^{W} i \cdot P_{i,j},$$

[0080] In the formula, $\hat{x}$ and $\hat{y}$ are the expected values of the x-coordinate and y-coordinate of the key point in the image coordinate system, respectively; $P_{i,j}$ is the pixel probability value at location $(i, j)$ of the probability heatmap output by the spatial normalization layer; $j$ is the horizontal index (column index) of the pixel; $i$ is the vertical index (row index) of the pixel; $W$ is the width of the feature map; and $H$ is the height of the feature map.

[0081] Through the above calculations, the probability heatmap achieves a deterministic mapping from pixel space to geometric space, locating the brightest point (Argmax) in the probability heatmap with sub-pixel accuracy; ultimately, the high-dimensional feature map is compressed into an extremely compact $(\hat{x}, \hat{y})$ coordinate vector.
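The spatial normalization layer and the coordinate expectation mapping operator together form what is often called a spatial soft-argmax. A minimal NumPy sketch for a single-channel feature map follows; the function names are hypothetical, indices are 0-based here, and subtracting the maximum before exponentiation is a standard numerical-stability step that does not change the result.

```python
import numpy as np


def spatial_softmax(feature_map):
    """Softmax over all H x W positions of one channel; the output sums to 1."""
    shifted = feature_map - feature_map.max()  # stability only, result unchanged
    weights = np.exp(shifted)
    return weights / weights.sum()


def expected_coordinates(heatmap):
    """Probability-weighted mean of column (x) and row (y) indices."""
    h, w = heatmap.shape
    rows, cols = np.mgrid[0:h, 0:w]
    x_hat = float((cols * heatmap).sum())
    y_hat = float((rows * heatmap).sum())
    return x_hat, y_hat


# Two equally strong responses in neighbouring columns of row 2:
# the expected x-coordinate falls halfway between them.
feature_map = np.zeros((5, 5))
feature_map[2, 1] = 50.0
feature_map[2, 2] = 50.0
heatmap = spatial_softmax(feature_map)
x_hat, y_hat = expected_coordinates(heatmap)
```

A hard Argmax could only return column 1 or column 2 here, whereas the expectation returns a position between the two pixels, which is where the sub-pixel resolution comes from.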

[0082] Based on the known installation pose and parameters of the image acquisition equipment, the relative distance (z-axis coordinate) from the end of the surgical instrument to the target site is obtained using a monocular depth estimation method and combined with the key point coordinates to form three-dimensional spatial information. This information is transmitted to the surgical robotic arm controller to determine the precise position of the surgical instrument tip in three-dimensional space in real time, thereby achieving millimeter-level alignment and contact.

[0083] In summary, this spatial servo branch ensures the spatial equivariance of the hierarchically coupled surgical robot skill learning network to minute displacements of surgical instruments, thereby providing a high-precision positional closed loop in precision operations such as dissection and suturing, and alleviating the positioning drift problem at the moment of contact.

[0084] The dynamic weight fusion layer fuses the next-time-step position output by the action chunking module in the RS-ACT planning branch with the key point coordinate difference output by the spatial servo branch to obtain the position correction result.

[0085] The fusion expression for the dynamic weight fusion layer is:

[0086] $$P_{fuse} = w_g \cdot P_{plan} + w_r \cdot \Delta P_{servo},$$

[0087] In the formula, $P_{fuse}$ is the fused position correction result; $w_g$ is the global path weight; $P_{plan}$ is the next-time-step position output by the action chunking module in the RS-ACT planning branch; $w_r$ is the residual adjustment weight; and $\Delta P_{servo}$ is the residual correction amount output by the spatial servo branch, specifically the difference between the target surgical position coordinates and the coordinates of the surgical instrument end.
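A small numeric sketch of the fusion step follows. The text does not say how the two weights are set, so they are plain inputs here; `fuse_position` and all numbers are hypothetical.

```python
import numpy as np


def fuse_position(planned_next, target_xyz, tip_xyz, w_global=1.0, w_residual=0.3):
    """Weighted sum of the planned position and the servo residual correction."""
    # Residual correction: target position minus current instrument-end position.
    correction = np.asarray(target_xyz, dtype=float) - np.asarray(tip_xyz, dtype=float)
    return w_global * np.asarray(planned_next, dtype=float) + w_residual * correction


fused = fuse_position(
    planned_next=(10.0, 20.0, 5.0),  # next-step position from the planning branch
    target_xyz=(12.0, 21.0, 4.0),    # target location from the servo branch
    tip_xyz=(11.0, 19.0, 4.0),       # instrument end from the servo branch
)
```

Here the correction is (1, 2, 0), so with a residual weight of 0.3 the planned position is nudged by (0.3, 0.6, 0) toward the target.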

[0088] This dynamic weighted fusion layer achieves precise perception of instruments and tissues in complex surgical environments by fusing the two, thereby enabling millimeter-level precise alignment and contact with the target lesion.

[0089] S4. Train the hierarchical coupled surgical robot skill learning network constructed in step S3.

[0090] For the hierarchically coupled surgical robot skill learning network to output accurate results, both the RS-ACT planning branch and the spatial servo branch need to be trained.

[0091] Specifically, the training steps for step S4 are as follows:

[0092] S401. Construct a dataset of embodied intelligence for surgical robots;

[0093] 1) Several sets of surgical images are obtained using the same image acquisition method as in step S1. Each set of surgical images consists of several frames of surgical instrument operation trajectory images in a simulated environment or a real scene. Preferably, the several sets of surgical images include different starting angles of surgical instruments, different lighting conditions in the surgical field (e.g., but not limited to strong light and weak light), and different interference conditions in the surgical field (e.g., but not limited to smoke and shadow).

[0094] 2) Every frame of the surgical instrument operation trajectory images in each set of surgical images is labeled. Specifically, two tags are added to each frame:

[0095] Tag 1: the current pose of the robotic arm end effector.

[0096] Tag 2: data of multiple consecutive action chunks starting from the current position of the robotic arm end effector (in this embodiment, the positions of the robotic arm end effector over the next 5 time steps).
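The two tags can be generated mechanically from a recorded end-effector trajectory. The sketch below is a hedged illustration: `LabeledFrame` and `label_trajectory` are hypothetical names, the pose is simplified to an (x, y, z) position, and frames too close to the end of a trajectory to have a full 5-step chunk are simply dropped, which is one possible convention among several.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Pose = Tuple[float, float, float]  # simplified pose: position only (assumption)


@dataclass
class LabeledFrame:
    frame_index: int
    current_pose: Pose        # Tag 1: pose at this frame
    action_chunk: List[Pose]  # Tag 2: positions over the next `horizon` steps


def label_trajectory(poses: Sequence[Pose], horizon: int = 5) -> List[LabeledFrame]:
    """Attach Tag 1 and Tag 2 to every frame that has a full future chunk."""
    samples = []
    for t in range(len(poses) - horizon):
        chunk = list(poses[t + 1 : t + 1 + horizon])
        samples.append(LabeledFrame(t, poses[t], chunk))
    return samples


trajectory = [(float(t), 0.0, 10.0 - t) for t in range(8)]  # 8 recorded frames
samples = label_trajectory(trajectory)
```

For an 8-frame trajectory and a 5-step horizon this yields 3 labeled frames; the chunk of frame 0 holds the positions of frames 1 through 5.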

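[0097] S402. Set the joint loss function to be minimized during training. Its expression is:

[0098] $$\mathcal{L} = \lambda_{act}\,\mathcal{L}_{act} + \lambda_{stop}\,\mathcal{L}_{stop} + \mathcal{L}_{reg},$$

[0099] In the formula, $\lambda_{act}$ is the weight of the action prediction loss; $\mathcal{L}_{act}$ is the action prediction loss; $\lambda_{stop}$ is the weight of the stopping penalty loss; $\mathcal{L}_{stop}$ is the stopping penalty loss applied to the computation depth; and $\mathcal{L}_{reg}$ is a regularization term based on inter-layer residuals.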

[0100] in,

[0101] Action prediction loss The expression is:

[0102] ,

[0103] Stop punishing losses The expression is:

[0104] ,

[0105] Introducing a regularization term based on interlayer residuals The expression is:

[0106] ,

[0107] In the above formulas, the quantities are: the total number of layers in the Transformer decoder; the total length of the time-step sequence predicted by the action block; the surgical instrument end-effector pose predicted by the action block at a given time step; the ground-truth expert action obtained from the surgical robot embodied intelligence dataset; the stopping probability generated by the action judgment module at a given layer; and the feature tensors output by a given layer of the Transformer decoder and by its adjacent layer.
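Since the formulas themselves are not reproduced in this text, the following is only one possible form that is consistent with the definitions above. Every symbol is an assumption introduced here for illustration: $\lambda_{\mathrm{act}}$ and $\lambda_{\mathrm{stop}}$ for the two weights, $N$ for the number of decoder layers, $T$ for the action-block length, $\hat{a}_t$ and $a_t$ for the predicted and expert poses, $p_l$ for the stopping probability at layer $l$, and $h_l$ for the feature tensor output by layer $l$. The choice of norms, the normalization constants, and the exact shape of the stopping penalty are likewise assumed.

```latex
\begin{aligned}
\mathcal{L} &= \lambda_{\mathrm{act}}\,\mathcal{L}_{\mathrm{act}}
             + \lambda_{\mathrm{stop}}\,\mathcal{L}_{\mathrm{stop}}
             + \mathcal{L}_{\mathrm{reg}}, \\
\mathcal{L}_{\mathrm{act}} &= \frac{1}{T}\sum_{t=1}^{T}
             \bigl\lVert \hat{a}_t - a_t \bigr\rVert_1, \\
\mathcal{L}_{\mathrm{stop}} &= \frac{1}{N}\sum_{l=1}^{N} \bigl(1 - p_l\bigr), \\
\mathcal{L}_{\mathrm{reg}} &= \frac{1}{N-1}\sum_{l=2}^{N}
             \bigl\lVert h_l - h_{l-1} \bigr\rVert_2^{2}.
\end{aligned}
```

In this assumed form, the stopping penalty grows when layers assign a low stopping probability, which discourages unnecessary computation depth, and the regularization term penalizes large feature changes between adjacent layers.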

[0108] In step S402, the regularization term is added to the joint loss function to constrain the L2-norm residuals of feature tensors between adjacent layers. This forces the network to produce smoother semantic representations across layers during training, so that the residual stable stopping gate learns the computational logic of "stopping when features converge" and detects the moment of feature convergence more sharply. It also balances inference latency against execution accuracy during training, improving the accuracy of computation depth prediction.
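The "stopping when features converge" logic can be sketched in plain Python as follows. This is a simplified illustration, not the patented implementation: the function names, the list-based tensor type, the default threshold of 0.5, and the hand-set gate weights in the usage note are all assumptions, whereas in the method the gate parameters are learned.

```python
import math
from typing import Callable, List, Sequence, Tuple

Tensor = List[List[float]]  # hypothetical layout: tokens x channels


def l2_residual(curr: Tensor, prev: Tensor) -> float:
    """Inter-layer difference: L2 norm of the change between adjacent layers."""
    return math.sqrt(sum((c - p) ** 2
                         for rc, rp in zip(curr, prev)
                         for c, p in zip(rc, rp)))


def concat_residual(curr: Tensor, residual: float) -> Tensor:
    """Feature concatenation: append the residual as one extra channel."""
    return [row + [residual] for row in curr]


def global_avg_pool(x: Tensor) -> List[float]:
    return [sum(col) / len(x) for col in zip(*x)]


def linear(x: Sequence[float], w: Sequence[Sequence[float]],
           b: Sequence[float]) -> List[float]:
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def stop_probability(curr: Tensor, prev: Tensor, w1, b1, w2, b2) -> float:
    """Gate: global average pooling -> linear -> ReLU -> linear -> Sigmoid."""
    pooled = global_avg_pool(concat_residual(curr, l2_residual(curr, prev)))
    hidden = [max(0.0, v) for v in linear(pooled, w1, b1)]
    return sigmoid(linear(hidden, w2, b2)[0])


def decode_with_early_exit(x: Tensor,
                           layers: Sequence[Callable[[Tensor], Tensor]],
                           gate: Callable[[Tensor, Tensor], float],
                           threshold: float = 0.5) -> Tuple[Tensor, int]:
    """Logic control: run layers until the stopping probability reaches the threshold."""
    prev, used = x, 0
    for layer in layers:
        curr = layer(prev)
        used += 1
        if gate(curr, prev) >= threshold:
            return curr, used
        prev = curr
    return prev, used
```

As a usage example, take six toy layers that each halve their input, and a gate with hand-set weights `w1 = [[0.0, -1.0]]`, `b1 = [1.0]`, `w2 = [[10.0]]`, `b2 = [-5.0]`. The stopping probability then rises as the inter-layer residual shrinks, and decoding exits before all six layers have run.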

[0109] S403. The surgical robot embodied intelligence dataset constructed in S401 is first processed by dual-stream perception using the method in step S2 and then fed into the hierarchically coupled surgical robot skill learning network constructed in step S3. With minimization of the joint loss function as the training objective, the RS-ACT planning branch and the spatial servoing branch are trained simultaneously.

[0110] During the training described above, the RS-ACT planning branch learns that, as the feature residual gradually approaches zero (indicating that semantic extraction has entered a "saturation period"), the stopping probability should be raised automatically to trigger the early-stopping mechanism, skipping subsequent redundant computation and reducing end-to-end surgical control latency. Meanwhile, the spatial servoing branch, through regression constraints on the ground-truth key-point coordinates, learns the geometric mapping between key anatomical features and the instrument tip in surgical images. Under the probabilistic constraint of the spatial normalization layer, this branch concentrates image weights on high-response pixel regions and uses the coordinate expectation mapping operator to convert feature responses into high-precision geometric coordinates. This gives the hierarchically coupled surgical robot skill learning network spatially equivariant perception of minute displacements of surgical instruments and provides sub-pixel-level geometric feedback, achieving high-precision closed-loop positioning during delicate operations such as dissection and suturing and mitigating the positioning drift that may occur at the moment the instrument contacts the target tissue.
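The spatial normalization layer and the coordinate expectation mapping operator described above can be sketched as follows for a single feature-map channel. The function names are hypothetical, and expressing coordinates in pixel indices is an assumption; the combination with z-axis information is not shown.

```python
import math
from typing import List, Tuple


def spatial_softmax(feature_map: List[List[float]]) -> List[List[float]]:
    """Softmax taken jointly over height and width, giving a probability heatmap."""
    peak = max(max(row) for row in feature_map)  # subtract max for numerical stability
    exps = [[math.exp(v - peak) for v in row] for row in feature_map]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def expected_coordinates(heatmap: List[List[float]]) -> Tuple[float, float]:
    """Probability-weighted mean of pixel coordinates, returned as (x, y)."""
    x = sum(p * col for row in heatmap for col, p in enumerate(row))
    y = sum(p * r for r, row in enumerate(heatmap) for p in row)
    return x, y
```

Because the result is an expectation rather than an arg-max, it can fall between pixel centers. For example, two equally strong responses in adjacent columns 2 and 3 yield an x-coordinate of about 2.5, which is the source of the sub-pixel resolution.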

[0111] In this embodiment, training is based on the Lift task of the robotic simulation platform, which simulates a surgical instrument picking up and lifting a target object. The computing environment uses the Ubuntu operating system and the PyTorch deep learning framework, with NVIDIA RTX series GPUs as the hardware. During training, the AdamW optimizer is used for end-to-end parameter updates, and the weight decay factor is set to... The batch size is set to 16 and training runs for 200 epochs. The convergence of the joint loss function is monitored to ensure that the hierarchically coupled surgical robot skill learning network adaptively optimizes the computation depth while achieving high-precision picking actions.

[0112] S5. Establish a dynamic switching mechanism for the hierarchically coupled surgical robot skill learning network in practical applications.

[0113] Referring to Figure 4, the dynamic switching mechanism of step S5 is set up as follows:

[0114] S501. A monocular vision-based depth estimation method is used to process the surgical images acquired in real time and obtain the current state S_t of the surgical instrument tip, from which the distance from the surgical instrument tip to the target surgical position is calculated;

[0115] S502. Set a distance threshold and make the following judgment:

[0116] When the distance is above the distance threshold, the current motion state of the surgical robotic arm is determined to be the long-distance planning stage, and the hierarchically coupled surgical robot skill learning network enters planning mode: only the RS-ACT planning branch runs, and it dominates the generation of the global approach trajectory to ensure motion continuity. Specifically, the next movement position of the surgical instrument end effector in this stage directly adopts the next-moment position output by the action block in the RS-ACT planning branch;

[0117] When the distance is within the distance threshold, the current motion state of the surgical robotic arm is determined to be the precision operation stage, and the hierarchically coupled surgical robot skill learning network enters fusion mode: the RS-ACT planning branch and the spatial servoing branch run in parallel, and their outputs are fused through the dynamic weight fusion layer to obtain a precisely corrected trajectory. Specifically, the next movement position of the surgical instrument end effector in this stage adopts the fusion result of the dynamic weight fusion layer.

[0118] In practical surgical applications, the distance threshold determines when the robot switches to fine adjustment. It can be a fixed value or be updated in real time according to changes in the spatial scale of the surgical task. Generally, the threshold is set to 0.8 mm to 1.0 mm.
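The switching logic of step S5 can be sketched as follows. The fusion formula is not reproduced in this text, so the weighted-sum form used here (global path weight times the planned position, plus residual adjustment weight times the target-minus-tip difference) is an assumption based on the quantities named in the claims. The function names, the fixed example weights, the 0.9 mm default threshold, and the treatment of the equality case are also assumptions.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def fuse(plan_next: Vec3, target: Vec3, tip: Vec3,
         w_global: float = 1.0, w_residual: float = 0.5) -> Vec3:
    """Assumed fusion: weighted planned position plus weighted residual correction."""
    return tuple(w_global * p + w_residual * (t - s)
                 for p, t, s in zip(plan_next, target, tip))


def next_position(distance_mm: float, plan_next: Vec3, target: Vec3, tip: Vec3,
                  threshold_mm: float = 0.9) -> Tuple[str, Vec3]:
    """Select planning mode far from the target and fusion mode close to it."""
    if distance_mm > threshold_mm:
        # Long-distance planning stage: only the planning branch output is used.
        return "planning", plan_next
    # Precision operation stage: both branches run and their outputs are fused.
    return "fusion", fuse(plan_next, target, tip)
```

In the method the fusion weights are produced dynamically by the fusion layer; constants are used here only to keep the sketch short.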

[0119] Furthermore, to verify the effectiveness of the surgical robot skill learning method based on RS-ACT and dual-stream servoing of the present invention, multiple sets of experiments were conducted on the Lift task of the robotic simulation platform, and the average value was taken as the final result. The action-block model was also run as the baseline model for comparison.

[0120] Specifically, each experiment comprised 50 independent trials, and the task success rate was measured for different training data volumes (1000 sets, 300 sets, 50 sets) and different numbers of training epochs (100 epochs, 200 epochs). The experimental results are shown in Table 1 below.

[0121] Table 1:

[0122]
Training data volume | Training epochs | Baseline model success rate (%) | Success rate of the method of this invention (%)
1000 sets | 100 / 200 | 96 / 96 | 98 / 98
300 sets | 100 / 200 | 86 / 78 | 86 / 94
50 sets | 100 / 200 | 26 / 36 | 74 / 60

[0123] Two conclusions can be drawn from the test results in Table 1. First, the method of this invention shows markedly better few-shot learning ability and reaches a usable success rate with far less training. With only 50 sets of training data and a limited number of training epochs (100 to 200), the baseline model reaches a success rate of only 26% to 36% and has not yet converged effectively, whereas the method of this invention reaches 60% to 74%. This indicates that the geometric motion prior introduced through the RS-ACT mechanism and the spatial servoing branch guides the network to capture the core features of surgical skills quickly, shortening the training cycle and reducing the dependence on large-scale expert demonstration data. This gives the system a practical advantage in real clinical scenarios where high-quality surgical data is scarce and computing resources are limited. Second, the method of this invention raises the upper limit of execution accuracy. With 1000 sets of training data, it achieves a success rate of 98% at both 100 and 200 epochs, compared with 96% for the baseline model. This indicates that the sub-pixel-level feedback provided by the spatial servoing branch compensates for the small positioning deviations that arise at the end of long-range trajectory planning in the Transformer architecture, and that real-time geometric correction secures the robustness and accuracy of the surgical robot in the precision operation stage.

[0124] In summary, the surgical robot skill learning method based on RS-ACT and dual-stream servoing of this invention balances long-range motion planning and fine geometric correction by running the RS-ACT planning branch and the spatial servoing branch in parallel. The experimental results show that the method achieves smooth prediction of surgical action sequences, and it has broad application prospects in intelligent minimally invasive assisted surgery, automated medical operations, and embodied robot intelligence.

Claims

1. A surgical robot skill learning method based on RS-ACT and dual-stream servoing, characterized in that it is implemented on a hierarchically coupled surgical robot skill learning network; the network consists of an RS-ACT planning branch, a spatial servoing branch, and a dynamic weight fusion layer; wherein: the RS-ACT planning branch is based on action blocks and is formed by adding an action judgment module to the decoding network; the action judgment module consists of an inter-layer difference operator, a feature concatenation module, a residual stable stopping gate, and a logic control module; the inter-layer difference operator is connected to the Transformer decoder to extract the feature tensor output by the current layer and the feature tensor output by the adjacent layer and to calculate their L2-norm residual; the feature concatenation module is connected to the Transformer decoder and to the inter-layer difference operator, and concatenates the feature tensor output by the current layer with the L2-norm residual; the residual stable stopping gate is a nonlinear mapping layer connected to the feature concatenation module to generate the stopping probability; the logic control module decides whether to continue or stop the surgical action computation based on a preset probability threshold and the stopping probability; the spatial servoing branch consists of a convolutional neural network, a spatial normalization layer, and a coordinate expectation mapping operator connected in sequence; the convolutional neural network uses ResNet18 to output high-dimensional feature maps representing the key features in the image, namely the surgical instrument tip and the target surgical location; the spatial normalization layer applies Softmax to the high-dimensional feature maps over the height and width dimensions to generate a probability heatmap;
the coordinate expectation mapping operator obtains the coordinates of the key points corresponding to the key features by taking the probability-weighted expectation of the coordinates of all pixels in the probability heatmap, and, by combining the z-axis coordinates obtained from the surgical images, calculates the precise spatial position coordinates of the surgical instrument end; the dynamic weight fusion layer fuses the next-moment position output by the action block with the key-point coordinate difference vector output by the spatial servoing branch to obtain the position correction result.

2. The surgical robot skill learning method based on RS-ACT and dual-stream servoing according to claim 1, characterized in that the residual stable stopping gate is composed of a global average pooling operator, a multilayer perceptron, and a nonlinear activation operator connected in sequence; wherein the multilayer perceptron is composed of a first linear fully connected layer, a ReLU activation layer, and a second linear fully connected layer connected in sequence; and the nonlinear activation operator adopts the Sigmoid function.

3. The surgical robot skill learning method based on RS-ACT and dual-stream servoing according to claim 1, characterized in that the fusion expression of the dynamic weight fusion layer is: , where the terms are: the global path weight; the next-time-step position output by the action block in the RS-ACT planning branch; the residual adjustment weight; and the residual correction output by the spatial servoing branch, namely the difference between the target surgical position coordinates and the coordinates of the surgical instrument end.

4. The surgical robot skill learning method based on RS-ACT and dual-stream servoing according to any one of claims 1 to 3, characterized in that the steps are as follows: S1. From the viewpoint of the surgical robotic arm, acquire surgical images that include the tip of the surgical instrument and the target surgical location below it; S2. Perform dual-stream perception preprocessing on the surgical images in sequence, including HSV dual-threshold color filtering and binary mask extraction, and fuse the results of each step to obtain dual-stream perception surgical images; S3. Construct the hierarchically coupled surgical robot skill learning network; S4. Train the hierarchically coupled surgical robot skill learning network constructed in step S3; S5. Based on the distance between the current surgical instrument and the target surgical position, set up a dynamic switching mechanism for the hierarchically coupled surgical robot skill learning network in practical applications, so that the RS-ACT planning branch runs alone in the long-distance planning stage, while the RS-ACT planning branch and the spatial servoing branch run in parallel in the near-distance planning stage.

5. The surgical robot skill learning method based on RS-ACT and dual-stream servoing according to claim 4, characterized in that the specific implementation steps of step S2 are as follows: S201. Process the surgical image with the HSV dual-threshold color filtering operator to filter out background noise and extract the surgical instruments, obtaining the filtered image; S202. Perform a morphological closing operation and then a morphological opening operation on the filtered image, and then use the bitmap mask operator to generate a binary mask; S203. Using the merging operator, combine the filtered image and the binary mask pixel by pixel to generate the dual-stream visual input image.

6. The surgical robot skill learning method based on RS-ACT and dual-stream servoing according to claim 4, characterized in that, in step S4, the training method for the hierarchically coupled surgical robot skill learning network is as follows: S401. Construct a surgical robot embodied intelligence dataset: 1) obtain several sets of surgical images using the same image acquisition method as in step S1, each set consisting of several frames of surgical instrument operation trajectory images in a simulation environment or a real scene; 2) set labels for every frame of the surgical instrument operation trajectory images in each set of surgical images, including the current pose of the robotic arm end and the data of the subsequent multiple consecutive action blocks based on the current position of the robotic arm end; S402. Define the joint loss function to be minimized during training, whose expression is: , where the terms are: the weight of the action prediction loss; the action prediction loss; the weight of the stopping penalty loss; the stopping penalty loss, which acts on the computation depth; and the regularization term based on inter-layer residuals; S403. Using the method in step S2, process the surgical images in the surgical robot embodied intelligence dataset and feed them into the hierarchically coupled surgical robot skill learning network; with minimization of the joint loss function as the training objective, the RS-ACT planning branch and the spatial servoing branch are trained simultaneously.

7. The surgical robot skill learning method based on RS-ACT and dual-stream servoing according to claim 4, characterized in that the specific implementation steps of step S5 are as follows: S501. Using a monocular vision-based depth estimation method, calculate the distance from the end of the surgical instrument to the target surgical position based on the surgical images acquired in real time; S502. Set a distance threshold and make the following judgment: when the distance is above the distance threshold, the current motion state of the surgical robotic arm is determined to be the long-distance planning stage, the hierarchically coupled surgical robot skill learning network runs only the RS-ACT planning branch, and the next-moment position output by the action block in the RS-ACT planning branch is used as the next movement position of the surgical instrument end effector; when the distance is within the distance threshold, the current motion state of the surgical robotic arm is determined to be the precision operation stage, the hierarchically coupled surgical robot skill learning network runs the RS-ACT planning branch and the spatial servoing branch in parallel, and the fusion result of the dynamic weight fusion layer is used as the next movement position of the surgical instrument end effector.