Dual teacher semi-supervised semantic segmentation method and device for stereo endoscopic surgery

By constructing a dual-teacher semi-supervised framework and generating complementary pseudo-labels using geometric and semantic teacher models of stereoscopic endoscopy, the problem of insufficient information utilization between stereoscopic views is solved, achieving high-precision and high-generalization semantic segmentation, which is applicable to stereoscopic endoscopic surgery.

CN122223331A · Pending · Publication Date: 2026-06-16 · HARBIN INST OF TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: HARBIN INST OF TECH
Filing Date: 2026-03-27
Publication Date: 2026-06-16

IPC: G06V10/26; G06V10/82; G06V10/764; G06V10/80; G06N3/096; G06N3/0895; G06N3/0464; G06N3/09

AI Tagging

Application Domain

Character and pattern recognition; Biological models

Abstract

The application provides a dual-teacher semi-supervised semantic segmentation method and device for stereo endoscopic surgery, belonging to the technical field of medical image processing and computer vision. The method comprises the following steps: obtaining rectified stereo image pairs in which only the left view is labeled; a geometric teacher generates a geometric pseudo-label with confidence for the right view based on disparity estimation; a semantic teacher, updated through an exponential moving average, generates a stable semantic pseudo-label and confidence for the right view; the two confidences are normalized and pixel-level gated fusion is then performed to generate a soft supervision distribution for the right view; and a student network is trained with a total loss constructed from the left-view labels and the fused right-view supervision distribution. The device comprises a data acquisition module, a student model module, a geometric teacher module, a semantic teacher module, a fusion module and a model training module. The application explicitly uses the geometric correspondence between stereo views to mine additional supervision signals, and the geometric and semantic pseudo-labels complement each other to reduce noise and bias, thereby improving segmentation accuracy and generalization under single-view labeling.

Description

Technical Field

[0001] This invention relates to the fields of medical image processing and computer vision technology, and in particular to a dual-teacher semi-supervised semantic segmentation method and apparatus for stereoscopic endoscopic surgery.

Background Technology

[0002] Semantic segmentation of surgical endoscopic images is a crucial foundation for tasks such as scene perception, autonomous operation, and safety control of surgical robots, requiring pixel-level precision in segmenting instruments, tissues, and other targets within the images. However, pixel-level fine-grained annotation of endoscopic images relies on specialized knowledge and significant manual labor costs, making it difficult to acquire at scale. This results in a scarcity of high-quality annotated data, hindering the performance improvement of semantic segmentation models and their stable application in clinical scenarios.

[0003] In recent years, to alleviate the problem of scarce labeled data, semi-supervised learning techniques have been widely applied to semantic segmentation tasks. By jointly utilizing a small amount of labeled data and a large amount of unlabeled data for model training, the problem of label scarcity has been alleviated to some extent. Existing semi-supervised semantic segmentation methods mostly employ strategies such as consistency regularization and pseudo-label learning to effectively utilize unlabeled samples. However, in robot-assisted laparoscopic surgery, the system typically uses a stereo endoscope to simultaneously acquire left and right views, while the actual annotation often only covers one view (e.g., the left view), leaving the other view (e.g., the right view) unlabeled. This results in a large amount of unlabeled data that has a geometric correspondence with the labeled views. Existing methods usually treat this unlabeled view as an independent sample, without explicitly modeling and utilizing the geometric constraints and complementary information relationships between stereo views. This makes it difficult to correct pseudo-label errors and easily leads to the typical confirmation bias in semi-supervised learning during training, thus limiting further improvements in segmentation accuracy and generalization under the semi-supervised paradigm.

Summary of the Invention

[0004] The purpose of this invention is to provide a dual-teacher semi-supervised semantic segmentation method and apparatus for stereoscopic endoscopic surgery. By constructing a dual-teacher semi-supervised semantic segmentation framework consisting of a geometric teacher model and a semantic teacher model, the geometric teacher model generates geometric pseudo-labels by transferring supervisory information using the geometric correspondence of stereoscopic views, while the semantic teacher model outputs stable semantic pseudo-labels through exponential moving average, forming complementary supervision. This solves the problems of existing methods not utilizing stereoscopic view geometric constraints, low reliability of pseudo-labels, and insufficient segmentation accuracy, and significantly improves the accuracy and generalization of semantic segmentation in surgical scenarios.

[0005] To achieve the above objectives, this invention proposes a dual-teacher semi-supervised semantic segmentation method for stereoscopic endoscopic surgery, comprising the following steps:
Step S1: Obtain a dataset of stereo-rectified stereoscopic endoscopic surgical image pairs. Each image pair includes a left view and a right view, where the left view has pixel-level ground truth segmentation annotations and the right view has no annotations.
Step S2: Construct a student segmentation network that predicts the left and right views with shared weights, and obtain the prediction results for the left and right views respectively.
Step S3: Construct a geometric teacher model. Introduce the FoundationStereo stereo matching model into the geometric teacher model. Using stereo endoscopic surgical image pairs as input, predict the disparity map corresponding to each stereo image pair. Use the disparity map to map the pixel-level ground truth labels of the left view to the coordinate system of the right view through geometric transformation, obtaining geometric pseudo-labels on the right view. Convert the geometric pseudo-labels into a geometric pseudo-label distribution in the form of a class probability distribution. Construct the geometric confidence from disparity-guided cross-view reprojection consistency: reproject the left view according to the disparity map to obtain a reconstruction of the right view, evaluate the supervision confidence at each pixel position from the photometric consistency error between the reconstructed right view and the real right view, and generate a pixel-level confidence map of the geometric pseudo-labels of the right view.
Step S4: Construct a semantic teacher model. The network structure of the semantic teacher model is the same as that of the student model, and its parameter weights are updated using the exponential moving average of the parameter weights of the student network. The semantic teacher receives the data-augmented right view as input and outputs a class probability distribution. Based on the maximum class probability at each pixel position, a pixel-level confidence map of the semantic pseudo-label of the right view is generated.
Step S5: Perform confidence-aware fusion of the right-view geometric pseudo-labels and the right-view semantic pseudo-labels to generate a fused supervision distribution.
Step S6: Based on the ground truth segmentation labels of the left view, use cross-entropy loss and Dice loss to calculate the supervised loss for the student model prediction. Based on the fused supervision distribution of the right view, calculate the cross-entropy loss between the student prediction and the fused pseudo-label distribution to obtain the unsupervised loss. Construct the total loss as the weighted sum of the supervised loss and the unsupervised loss, and train the student segmentation network.

[0006] Preferably, in step S2, the left-view prediction is trained under the supervision of the pixel-level ground truth segmentation labels; for the right-view prediction, the two pseudo-labels generated by the geometric teacher and the semantic teacher are combined through confidence-aware fusion to obtain the final supervision distribution used to train the student model.

[0007] Preferably, in step S3, the geometric pseudo-label on the right view is calculated as follows: the geometric pseudo-label at each pixel coordinate of the right view is the output of the geometric transformation function applied to the pixel-level ground truth labels of the left view and the disparity map corresponding to the right view. The geometric pseudo-label distribution is calculated with an indicator function: for each pixel position index and each category index, where the category index runs over the total number of semantic segmentation categories, the probability of the category at that pixel is 1 if the geometric pseudo-label at that pixel equals the category and 0 otherwise. The pixel-level confidence map of the right-view geometric pseudo-label is calculated as a natural exponential function of the photometric error between the pixel value of the real right view and the pixel value of the reconstructed right view at the same pixel position, scaled by a temperature coefficient; the quantities in this formula also include the total number of pixels in the image.
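
One way to write the three quantities of step S3 in symbols is sketched below. The symbol names are assumptions chosen for illustration: $\hat{y}^{R}_{geo}$ for the geometric pseudo-label, $\mathcal{W}$ for the geometric transformation function, $Y^{L}$ for the left-view ground truth, $D^{R}$ for the disparity map, $q^{R}_{geo}$ for the pseudo-label distribution, $w_{geo}$ for the confidence, $I^{R}$ and $\hat{I}^{R}$ for the real and reconstructed right views, and $\tau$ for the temperature coefficient. The exact form of the confidence term is also an assumption: a negative exponential of the per-pixel absolute photometric error is used here, with $N$ (the total number of pixels) appearing only as the range of the pixel index.

```latex
\begin{align}
\hat{y}^{R}_{geo}(p) &= \mathcal{W}\!\left(Y^{L}, D^{R}\right)(p) \\
q^{R}_{geo}(c, i) &= \mathbb{1}\!\left[\hat{y}^{R}_{geo}(i) = c\right],
  \quad c \in \{1, \dots, C\} \\
w_{geo}(i) &= \exp\!\left(-\frac{\left\lvert I^{R}(i) - \hat{I}^{R}(i) \right\rvert}{\tau}\right),
  \quad i \in \{1, \dots, N\}
\end{align}
```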

[0008] Preferably, in step S4, the parameter weights of the semantic teacher model are updated using an exponential moving average of the student network parameter weights: the parameter weights of the semantic teacher model after a given training iteration are obtained by combining, through a smoothing coefficient, the parameter weights of the semantic teacher model after the preceding iteration with the parameters of the student model at the corresponding iteration. Based on the maximum class probability at each pixel location, a pixel-level confidence map of the semantic pseudo-labels for the right view is generated: the confidence at each right-view pixel is the maximum, over all categories, of the class probabilities output by the semantic teacher model for that pixel.
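
In symbols, the update and the confidence of step S4 might be written as follows. This is the standard exponential moving average form; the symbol names ($\theta_{T}$ and $\theta_{S}$ for teacher and student parameters, $t$ for the iteration count, $\alpha$ for the smoothing coefficient, $p^{R}_{T}$ for the teacher's right-view class probabilities, $w_{sem}$ for the semantic confidence) and the choice of pairing $\theta_{T}^{(t)}$ with $\theta_{S}^{(t)}$ are assumptions for illustration.

```latex
\begin{align}
\theta_{T}^{(t)} &= \alpha \, \theta_{T}^{(t-1)} + (1 - \alpha) \, \theta_{S}^{(t)} \\
w_{sem}(i) &= \max_{c \in \{1, \dots, C\}} p^{R}_{T}(c, i)
\end{align}
```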

[0009] Preferably, step S5 specifically includes the following steps:
Step S51: Normalize the confidence scores of the geometric pseudo-labels and the semantic pseudo-labels. For each teacher, where the teacher index denotes either the geometric pseudo-label or the semantic teacher pseudo-label, the normalized pixel-level confidence at a right-view pixel is calculated from the pixel-level confidence maps of the teacher pseudo-labels at that pixel together with a numerical stability constant.
Step S52: Perform pixel-level gated fusion based on the normalized confidence scores to generate the fused supervision distribution. The probability distribution of the fused pseudo-label at each right-view pixel combines the geometric pseudo-label probability distribution at that pixel, weighted by the normalized confidence of the geometric pseudo-label, with the probability distribution of the semantic teacher pseudo-label at that pixel, weighted by the normalized confidence of the semantic pseudo-label.
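
A plausible symbolic form of steps S51 and S52 is given below. It assumes that normalization divides each confidence by the sum of the two confidences plus the stability constant $\varepsilon$, and that gated fusion is a confidence-weighted sum; the symbol names ($\tilde{w}_{k}$, $q^{R}_{sem}$, $q^{R}_{fuse}$) are likewise assumptions for illustration.

```latex
\begin{align}
\tilde{w}_{k}(i) &= \frac{w_{k}(i)}{w_{geo}(i) + w_{sem}(i) + \varepsilon},
  \quad k \in \{geo, sem\} \\
q^{R}_{fuse}(c, i) &= \tilde{w}_{geo}(i) \, q^{R}_{geo}(c, i)
  + \tilde{w}_{sem}(i) \, q^{R}_{sem}(c, i)
\end{align}
```

Under this form, a pixel where the geometric confidence is near zero (for example in an occluded region) is supervised almost entirely by the semantic teacher, and vice versa.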

[0010] Preferably, in step S6, the supervised loss for the student model prediction is calculated from the cross-entropy loss and the Dice loss between the student model's predicted class probabilities at each left-view pixel and the ground truth label at that pixel. The unsupervised loss is calculated from the student model's predicted class probabilities at each right-view pixel and the fused pseudo-label distribution. The total loss is constructed as the weighted sum of the supervised loss and the unsupervised loss, using a supervised loss weight and an unsupervised loss weight.
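
The three losses of step S6 might be written as follows, with Dice loss as the second supervised term, as named in the device description of the supervised loss calculation unit. The symbol names ($p^{L}_{S}$ and $p^{R}_{S}$ for the student's left- and right-view predictions, $y^{L}$ for the left-view ground truth, $q^{R}_{fuse}$ for the fused distribution, $\lambda_{sup}$ and $\lambda_{unsup}$ for the loss weights) are assumptions, as is the averaging of the soft cross-entropy over the $N$ pixels.

```latex
\begin{align}
\mathcal{L}_{sup} &= \mathcal{L}_{CE}\!\left(p^{L}_{S}, y^{L}\right)
  + \mathcal{L}_{Dice}\!\left(p^{L}_{S}, y^{L}\right) \\
\mathcal{L}_{unsup} &= -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C}
  q^{R}_{fuse}(c, i) \, \log p^{R}_{S}(c, i) \\
\mathcal{L}_{total} &= \lambda_{sup} \, \mathcal{L}_{sup}
  + \lambda_{unsup} \, \mathcal{L}_{unsup}
\end{align}
```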

[0011] This invention also proposes a dual-teacher semi-supervised semantic segmentation device for stereoscopic endoscopic surgery, comprising a data acquisition module, a student model module, a geometric teacher module, a semantic teacher module, a fusion module, and a model training module, wherein: The data acquisition module includes a stereo calibration unit and a dataset construction unit, and is communicatively connected to the student model module, the geometric teacher module, and the semantic teacher module. The student model module is a semantic segmentation network with shared weights, including a left view prediction unit and a right view prediction unit, and is communicatively connected to the semantic teacher module and the model training module. The geometric teacher module includes a stereo matching unit, a geometric transformation unit, a probability distribution transformation unit, and a reprojection consistency verification unit, and is communicatively connected to the fusion module; the stereo matching unit has a built-in FoundationStereo stereo matching foundation model. The semantic teacher module includes a data augmentation unit, a parameter update unit, a semantic prediction unit, and a semantic confidence calculation unit, and is communicatively connected to the fusion module. The fusion module includes a confidence normalization unit and a pixel-level gated fusion unit, and is communicatively connected to the model training module. The model training module includes a supervised loss calculation unit, an unsupervised loss calculation unit, a total loss construction unit, and a parameter update unit, and is communicatively connected to the student model module.

[0012] Preferably, the semantic segmentation network of the student model module has the same network structure as that of the semantic teacher module, and the left view prediction unit and the right view prediction unit use completely shared network weights.

[0013] Preferably, the parameter update unit of the model training module updates the network parameters of the student model module based on backpropagation of the total loss function, and synchronizes the updated student model module parameters to the parameter update unit of the semantic teacher module to complete one round of iterative training.

[0014] Preferably, the device uses the left and right views of the stereo image pair only during the training phase, to generate the dual-teacher supervision signals and train the model; during the inference phase it requires only a single endoscopic view to complete semantic segmentation prediction, without introducing additional inference overhead.

[0015] Therefore, this invention proposes a dual-teacher semi-supervised semantic segmentation method and device for stereoscopic endoscopic surgery, with the following beneficial effects: (1) The semi-supervised framework of the present invention explicitly utilizes the geometric correspondence of stereo views under single-view annotation. The geometric teacher transfers the ground truth information of the annotated view to the unannotated view, exploits the additional supervision signals in the stereo image pair, improves the utilization efficiency of unannotated samples, and reduces the manual cost of medical image annotation.

[0016] (2) This invention generates two complementary pseudo-labels through a geometric teacher and a semantic teacher. The semantic teacher provides the predicted category distribution and confidence, while the geometric teacher provides cross-view label transfer and residual confidence. By using confidence gating fusion, the reliability of pseudo-labels is improved, and the occlusion holes and confirmation bias are alleviated.

[0017] (3) In the model training stage, the present invention uses stereoscopic images of a stereo endoscope to generate supervision signals. In the model inference stage, only a single endoscope view needs to be input to achieve high-precision semantic segmentation, maintaining the simple process of single-view segmentation, without introducing any additional inference overhead, and fully adapting to the real-time scene perception requirements of surgical robots.

[0018] (4) The present invention constructs a total loss by fusing supervised loss and unsupervised loss, so that the model can use labeled samples to ensure segmentation accuracy, while making full use of unlabeled samples to further improve the segmentation accuracy and generalization ability of the model.

Attached Figure Description

[0019] Figure 1 is a schematic diagram of the overall framework of the dual-teacher semi-supervised semantic segmentation method for stereoscopic endoscopic surgery; Figure 2 is a schematic diagram of the geometric teacher model framework; Figure 3 is a schematic diagram of the pseudo-label fusion mechanism; Figure 4 is a diagram illustrating the segmentation visualization results.

Detailed Implementation

[0020] The technical solution of the present invention will be further described below with reference to the accompanying drawings and embodiments.

[0021] Unless otherwise defined, the technical or scientific terms used in this invention shall have the ordinary meaning as understood by one of ordinary skill in the art to which this invention pertains.

[0022] Example 1: As shown in Figure 1, this invention provides a dual-teacher semi-supervised semantic segmentation method for stereoscopic endoscopic surgery, comprising the following steps: Step S1: The training dataset consists of several stereo-rectified stereo image pairs, each image pair including a left view and a right view, wherein the left view has pixel-level ground truth segmentation labels, and the right view has no labels.

[0023] Step S2: Construct a student segmentation network to predict the left and right views with shared weights, and obtain the prediction results for the left and right views respectively. The prediction of the left view is supervised by pixel-level ground truth segmentation labels. The prediction of the right view is based on two pseudo-labels generated by the geometric teacher and the semantic teacher. The final supervised distribution is obtained by confidence-aware fusion and used to train the student model.

[0024] Step S3: Construct the geometric teacher model, whose framework is shown in Figure 2. The FoundationStereo stereo matching model is introduced into the geometric teacher model. Using the left and right views acquired by a stereo endoscope as input, a disparity map corresponding to the stereo image pair is predicted. The pixel-level ground truth labels of the left view are then mapped to the coordinate system of the right view through geometric transformation using the disparity map, resulting in geometric pseudo-labels on the right view: the geometric pseudo-label at each pixel coordinate of the right view is the output of the geometric transformation function applied to the pixel-level ground truth labels of the left view and the disparity map corresponding to the right view. To facilitate subsequent fusion with the pseudo-labels output by the semantic teacher, the geometric pseudo-labels are converted into a geometric pseudo-label distribution in the form of a category probability distribution. This distribution is defined with an indicator function: for each pixel position index and each category index, where the category index runs over the total number of semantic segmentation categories, the probability of the category at that pixel is 1 if the geometric pseudo-label at that pixel equals the category and 0 otherwise. Geometric confidence is constructed based on disparity-guided cross-view reprojection consistency. The left view is reprojected according to the disparity map to obtain the reconstructed right view. The supervision confidence of each pixel position is evaluated by the photometric consistency error between the reconstructed right view and the real right view, and a pixel-level confidence map of the right-view geometric pseudo-label is generated: the confidence at each pixel is a natural exponential function of the error between the pixel value of the real right view and the pixel value of the reconstructed right view at that position, scaled by a temperature coefficient; the quantities in this formula also include the total number of pixels in the image. The pixel-level confidence map of the right-view geometric pseudo-label characterizes the reliability of the geometric pseudo-label in different regions, and it participates in the construction of the final supervision signal as a pixel-level weight when the geometric pseudo-label is subsequently fused with the semantic pseudo-label generated by the semantic teacher.
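
The label transfer and confidence computation of step S3 can be sketched in NumPy under stated assumptions. The disparity map is taken as already predicted (FoundationStereo is not called here). The sign convention x_left = x_right + d, nearest-neighbour sampling, the `IGNORE` marker for pixels without a valid correspondence, the zero confidence assigned to those pixels, the negative-exponential form of the confidence, and all function names are assumptions for illustration, not the patent's implementation.

```python
import numpy as np

IGNORE = -1  # hypothetical marker: right-view pixel with no valid left-view match


def warp_left_to_right(left, disp_right, fill):
    """Sample `left` at (x + d, y) for every right-view pixel (x, y).

    Assumes rectified views, so correspondences lie on the same row.
    Returns the warped array and a boolean mask of in-bounds samples.
    """
    h, w = disp_right.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.rint(xs + disp_right).astype(int)  # nearest-neighbour column
    valid = (src_x >= 0) & (src_x < w)
    out = np.full((h, w) + left.shape[2:], fill, dtype=left.dtype)
    out[valid] = left[ys[valid], src_x[valid]]
    return out, valid


def geometric_teacher(label_left, img_left, img_right, disp_right,
                      num_classes, tau=0.1):
    """Return geometric pseudo-labels, their one-hot distribution, and confidence."""
    # Transfer the left-view ground truth into the right-view coordinate system.
    pseudo, valid = warp_left_to_right(label_left, disp_right, IGNORE)

    # Indicator-function (one-hot) distribution; IGNORE pixels get all zeros.
    q_geo = (pseudo[..., None] == np.arange(num_classes)).astype(float)

    # Reconstruct the right view from the left view and measure photometric error.
    recon, _ = warp_left_to_right(img_left, disp_right, 0.0)
    err = np.abs(img_right - recon)
    if err.ndim == 3:  # average over colour channels if present
        err = err.mean(axis=2)

    # Confidence decays with the error; pixels without a match get zero.
    conf = np.exp(-err / tau) * valid
    return pseudo, q_geo, conf
```

Because labels are sampled with nearest-neighbour lookup rather than interpolation, the warped map contains only class indices that exist in the left-view annotation.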

[0025] The geometric teacher makes full use of the geometric correspondence between the left and right views of a stereo endoscope to transfer the ground truth annotation information from the labeled view to the unlabeled view, providing the student model with pseudo-label supervision under geometric constraints. At the same time, the confidence weights suppress error propagation in occluded regions, unstable matching regions, and boundary regions, improving the quality of supervision in the semi-supervised training process.

[0026] Step S4: Construct the semantic teacher model. The network structure of the semantic teacher model is the same as that of the student model. Its parameter weights are updated using the exponential moving average of the student network parameter weights: the parameter weights of the semantic teacher model after a given training iteration are obtained by combining, through a smoothing coefficient, the parameter weights of the semantic teacher model after the preceding iteration with the parameters of the student model at the corresponding iteration. The semantic teacher receives the augmented right view as input and outputs the class probability distribution. Based on the maximum class probability at each pixel location, a pixel-level confidence map of the semantic pseudo-labels of the right view is generated: the confidence at each right-view pixel is the maximum, over all categories, of the class probabilities output by the semantic teacher model for that pixel.
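
A minimal NumPy sketch of the two operations in step S4 follows. It assumes the model parameters are held as a dictionary of arrays and that the teacher's output has shape (H, W, C); the function names, the dictionary layout, and the default smoothing coefficient are hypothetical choices for illustration.

```python
import numpy as np


def ema_update(teacher, student, alpha=0.99):
    """Exponential moving average of student weights into the teacher.

    `teacher` and `student` map parameter names to arrays of equal shape.
    Returns a new dictionary; the inputs are not modified.
    """
    return {name: alpha * teacher[name] + (1.0 - alpha) * student[name]
            for name in teacher}


def semantic_confidence(probs):
    """Pseudo-label and confidence from teacher probabilities of shape (H, W, C).

    The confidence at each pixel is the maximum class probability.
    """
    return probs.argmax(axis=-1), probs.max(axis=-1)
```

A smoothing coefficient close to 1 makes the teacher change slowly, which is what keeps its pseudo-labels stable between iterations.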

[0027] Step S5: To overcome the holes and misalignments in the geometric pseudo-labels and the confirmation bias of the semantic pseudo-labels, the final supervision target is generated by fusing the two pseudo-labels based on confidence weights, as shown in Figure 3. The specific steps include:
Step S51: Normalize the confidence scores of the geometric pseudo-labels and the semantic pseudo-labels. For each teacher, where the teacher index denotes either the geometric pseudo-label or the semantic pseudo-label, the normalized pixel-level confidence at a right-view pixel is calculated from the pixel-level confidence maps of the teacher pseudo-labels at that pixel together with a numerical stability constant.
Step S52: Perform pixel-level gated fusion based on the normalized confidence scores to generate the fused supervision distribution. The probability distribution of the fused pseudo-label at each right-view pixel combines the geometric pseudo-label probability distribution at that pixel, weighted by the normalized confidence of the geometric pseudo-label, with the probability distribution of the semantic pseudo-label at that pixel, weighted by the normalized confidence of the semantic pseudo-label.
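
Steps S51 and S52 can be sketched in a few lines of NumPy. The sketch assumes that normalization divides each confidence by the sum of both confidences plus a small stability constant, and that gated fusion is a per-pixel weighted sum of the two distributions; the function name `fuse_pseudo_labels` and the value of `eps` are hypothetical.

```python
import numpy as np


def fuse_pseudo_labels(q_geo, conf_geo, q_sem, conf_sem, eps=1e-6):
    """Confidence-weighted fusion of two per-pixel class distributions.

    q_geo, q_sem: arrays of shape (H, W, C).
    conf_geo, conf_sem: arrays of shape (H, W).
    """
    # S51: normalize the two confidences against each other at every pixel.
    total = conf_geo + conf_sem + eps
    w_geo = conf_geo / total
    w_sem = conf_sem / total

    # S52: pixel-level gated fusion, broadcasting weights over the class axis.
    return w_geo[..., None] * q_geo + w_sem[..., None] * q_sem
```

With this form, a pixel whose geometric confidence is zero (for instance an occlusion hole) inherits the semantic teacher's distribution almost unchanged.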

[0028] Step S6: Based on the ground truth segmentation annotations of the left view, calculate the supervised loss for the student model prediction using cross-entropy loss and Dice loss: both are computed between the student model's predicted class probabilities at each left-view pixel and the ground truth label at that pixel. Based on the fused supervision distribution of the right view, the cross-entropy loss is calculated between the student prediction and the fused pseudo-label distribution, with pixel weighting and normalization, to obtain the unsupervised loss, which depends on the student model's predicted class probabilities at each right-view pixel. The total loss is constructed as the weighted sum of the supervised and unsupervised losses, using a supervised loss weight and an unsupervised loss weight, and the student segmentation network is trained with it.
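
The loss construction of step S6 can be illustrated with a small NumPy sketch that operates on probability arrays rather than on a trainable network. The function names, the soft Dice formulation averaged over classes, the mean over pixels as the normalization, and the default loss weights are assumptions for illustration.

```python
import numpy as np


def cross_entropy(probs, target_dist, eps=1e-8):
    """Mean per-pixel cross-entropy between predictions and a target distribution.

    Works for one-hot ground truth and for soft (fused) targets alike.
    Both arrays have shape (H, W, C).
    """
    return float(-(target_dist * np.log(probs + eps)).sum(axis=-1).mean())


def dice_loss(probs, onehot, eps=1e-6):
    """Soft Dice loss averaged over classes."""
    axes = tuple(range(probs.ndim - 1))  # sum over all pixel axes
    inter = (probs * onehot).sum(axis=axes)
    denom = probs.sum(axis=axes) + onehot.sum(axis=axes)
    return float(1.0 - ((2.0 * inter + eps) / (denom + eps)).mean())


def total_loss(p_left, y_left_onehot, p_right, q_fused,
               lam_sup=1.0, lam_unsup=1.0):
    """Weighted sum of the supervised (left) and unsupervised (right) losses."""
    l_sup = cross_entropy(p_left, y_left_onehot) + dice_loss(p_left, y_left_onehot)
    l_unsup = cross_entropy(p_right, q_fused)
    return lam_sup * l_sup + lam_unsup * l_unsup
```

Because the fused target at a pixel sums to at most 1 when built from normalized confidences, pixels where both teachers are unreliable contribute less to the unsupervised term, which acts as an implicit pixel weighting.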

[0029] Example 2: This invention also provides a dual-teacher semi-supervised semantic segmentation device for stereoscopic endoscopic surgery, comprising a data acquisition module, a student model module, a geometric teacher module, a semantic teacher module, a fusion module, and a model training module, wherein: The data acquisition module acquires stereoscopic image pairs from a stereoscopic endoscopic surgical scenario, performs image preprocessing and dataset construction, and provides standardized training data with geometric correspondences for subsequent model training, in which only the left view contains pixel-level ground truth segmentation annotations while the right view is unlabeled. The module includes a stereo calibration unit and a dataset construction unit. The stereo calibration unit performs stereo rectification on the acquired original left and right view images of the stereoscopic endoscope, eliminating geometric distortions and ensuring a row-aligned geometric correspondence between the left and right views, which provides the foundation for the subsequent cross-view label transfer by the geometric teacher. The dataset construction unit organizes the rectified image pairs into a standardized training dataset, dividing it into left views with pixel-level ground truth segmentation annotations and unlabeled right views, unifying the image data format and size, and supporting batch input to the subsequent model modules.

[0030] The student model module, as the core model to be trained in this device, includes a left view prediction unit and a right view prediction unit. It produces the initial semantic segmentation predictions for the left and right views through a segmentation network with shared weights. The output predictions provide the basis for subsequent loss calculation and model parameter updates. The student model is the training target that receives the supervision signals provided by the two teacher models.

[0031] The geometry teacher module utilizes the geometric correspondence between the left and right views of a stereo endoscope to achieve label migration from the labeled left view to the unlabeled right view, generating geometric pseudo-labels with pixel-level confidence and corresponding category probability distributions. This provides the student model with a geometrically constrained supervision signal and suppresses the error propagation of pseudo-labels through confidence. The module includes a stereo matching unit, a geometric transformation unit, a probability distribution conversion unit, and a reprojection consistency verification unit. The stereo matching unit incorporates the FoundationStereo stereo matching model, using calibrated stereo image pairs as input to perform stereo matching calculations and predict the disparity map corresponding to the right view, providing a core basis for cross-view geometric transformation. The geometric transformation unit maps the pixel-level ground truth segmentation labels of the left view to the coordinate system of the right view through geometric transformation based on the disparity map, generating geometric pseudo-labels for the right view and realizing the migration of labeled information to unlabeled views. The probability distribution conversion unit converts the discrete geometric pseudo-labels into a geometric pseudo-label distribution in the form of a category probability distribution, aligning the geometric pseudo-labels with the semantic pseudo-label format output by the semantic teacher, preparing for subsequent fusion. 
The reprojection consistency verification unit reprojects the left view based on the disparity map to obtain the reconstructed right view, calculates the photometric consistency error between the reconstructed right view and the real right view, and generates a pixel-level geometric confidence map corresponding to the geometric pseudo-labels, characterizing the reliability of different pixel regions of the geometric pseudo-labels.

[0032] The semantic teacher module constructs a network model with the same structure as the student model. It achieves stable parameter updates through an exponential moving average strategy, generating stable semantic pseudo-labels and their distributions with pixel-level confidence for the unlabeled right view. These pseudo-labels form a complementary source of supervision with the geometric pseudo-labels, mitigating the confirmation bias problem in semi-supervised training. The module includes a data augmentation unit, a parameter update unit, a semantic prediction unit, and a semantic confidence calculation unit, and communicates with the fusion module. Specifically, the data augmentation unit performs data augmentation processing (such as color adjustment) on the unlabeled right view to improve the generalization of the semantic pseudo-labels and prevent overfitting of the student model. The parameter update unit uses an exponential moving average parameter update strategy to update the semantic teacher model parameters, ensuring that the semantic teacher parameters follow the student model's updates and guaranteeing the stability of the output pseudo-labels. The semantic prediction unit takes the data-augmented right view as input and outputs the semantic pseudo-label category probability distribution of the right view through the semantic teacher network, providing semantic-level supervision signals for the student model. The semantic confidence calculation unit calculates the maximum category probability at each pixel position based on the semantic pseudo-label category probability distribution, generating a pixel-level semantic confidence map corresponding to the semantic pseudo-label, characterizing the reliability of the semantic pseudo-label in different pixel regions.

[0033] The fusion module fuses the two pseudo-labels output by the geometry teacher and the semantic teacher, along with their corresponding pixel-level confidence scores. Through a confidence-aware fusion strategy, it compensates for the holes and misalignment of the geometric pseudo-labels and for the confirmation bias of the semantic pseudo-labels, generating a reliable fusion supervision distribution. This distribution serves as the core supervision target for the unsupervised training of the student model on the right view. The module includes a confidence normalization unit and a pixel-level gated fusion unit. Specifically, the confidence normalization unit normalizes the geometric and semantic confidence maps pixel by pixel, obtaining normalized geometric and semantic confidence scores that serve as standardized weights for pixel-level fusion. The pixel-level gated fusion unit uses the normalized confidence scores as pixel-level weights to perform gated fusion of the geometric and semantic pseudo-label distributions, generating the final fusion supervision distribution.
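A minimal sketch of the confidence normalization and pixel-level gated fusion follows. It assumes that normalization divides each confidence by the sum of the two confidences plus a small stability constant; the exact normalization is not recoverable from this text, so that form, the function names, and `eps=1e-6` are assumptions.

```python
def fuse_pixel(p_geo, p_sem, m_geo, m_sem, eps=1e-6):
    """Blend two class distributions with confidence-derived weights."""
    total = m_geo + m_sem + eps           # eps guards against 0 / 0
    w_geo, w_sem = m_geo / total, m_sem / total
    return [w_geo * g + w_sem * s for g, s in zip(p_geo, p_sem)]


def fuse_maps(geo_probs, sem_probs, geo_conf, sem_conf):
    """Apply fuse_pixel to every pixel of flat (per-pixel) lists."""
    return [fuse_pixel(pg, ps, mg, ms)
            for pg, ps, mg, ms in zip(geo_probs, sem_probs, geo_conf, sem_conf)]


# Pixel 0: both teachers available. Pixel 1: a hole in the geometric pseudo-label.
fused = fuse_maps(
    geo_probs=[[1.0, 0.0], [0.0, 0.0]],
    sem_probs=[[0.2, 0.8], [0.2, 0.8]],
    geo_conf=[0.9, 0.0],
    sem_conf=[0.3, 0.5],
)
```

At pixel 0 the geometric teacher dominates (weight about 0.75), while at pixel 1 the zero geometric confidence hands supervision over to the semantic teacher almost entirely.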

[0034] The model training module calculates supervised and unsupervised losses based on the ground truth labels of the left view and the fusion supervision distribution of the right view, constructs a total loss function, and completes end-to-end training of the student model through backpropagation. The final output is a high-precision, highly generalizable semantic segmentation model for stereoscopic endoscopic surgical images. This module includes a supervised loss calculation unit, an unsupervised loss calculation unit, a total loss construction unit, and a parameter update unit. Specifically, the supervised loss calculation unit calculates the supervised loss (cross-entropy loss and Dice loss) from the pixel-level ground truth segmentation labels of the left view and the student model's left-view prediction, ensuring the student model's segmentation accuracy on labeled samples. The unsupervised loss calculation unit calculates the unsupervised loss from the fusion supervision distribution of the right view and the student model's right-view prediction, so that unlabeled samples are fully used and the model's generalization ability improves. The total loss construction unit sets the supervised and unsupervised loss weight coefficients and constructs the total loss function, balancing the contributions of supervised and unsupervised training. The parameter update unit backpropagates the total loss through the student model's segmentation network, updates the network parameters with an optimizer, iterates training until the model converges, saves the optimal model parameters, and outputs the trained semantic segmentation model.
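The loss construction can be sketched on flat per-pixel lists as below. Several details are assumptions rather than statements from this text: the supervised loss is taken as an unweighted sum of cross-entropy and Dice loss, the Dice loss uses a smoothing term of 1, and the reported weights 0.8 and 0.2 are assigned to the supervised and unsupervised terms in that order. All function names are hypothetical.

```python
import math

EPS = 1e-8  # avoids log(0)


def cross_entropy(probs, labels):
    """Mean negative log-probability of the ground truth class."""
    return -sum(math.log(p[y] + EPS) for p, y in zip(probs, labels)) / len(labels)


def soft_cross_entropy(probs, targets):
    """Mean cross-entropy against a soft (fused) target distribution."""
    total = sum(q * math.log(p + EPS)
                for ps, qs in zip(probs, targets) for p, q in zip(ps, qs))
    return -total / len(probs)


def dice_loss(probs, labels, num_classes, smooth=1.0):
    """1 minus the soft Dice coefficient averaged over classes."""
    dice_sum = 0.0
    for c in range(num_classes):
        inter = sum(p[c] for p, y in zip(probs, labels) if y == c)
        denom = sum(p[c] for p in probs) + sum(1 for y in labels if y == c)
        dice_sum += (2.0 * inter + smooth) / (denom + smooth)
    return 1.0 - dice_sum / num_classes


def total_loss(left_probs, left_labels, right_probs, fused_targets,
               num_classes, w_sup=0.8, w_unsup=0.2):
    """Weighted sum of the left-view supervised and right-view unsupervised loss."""
    sup = (cross_entropy(left_probs, left_labels)
           + dice_loss(left_probs, left_labels, num_classes))
    unsup = soft_cross_entropy(right_probs, fused_targets)
    return w_sup * sup + w_unsup * unsup


# Perfect left-view prediction, uninformative right-view prediction.
loss = total_loss(
    left_probs=[[1.0, 0.0], [0.0, 1.0]], left_labels=[0, 1],
    right_probs=[[0.5, 0.5]], fused_targets=[[0.8, 0.2]],
    num_classes=2,
)
```

In this toy case the supervised term vanishes, so the total loss reduces to the weighted right-view term, 0.2 × ln 2.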

[0035] The invention will be further illustrated below through specific implementation examples.

[0036] 1. Dataset preparation: The public EndoVis2017 laparoscopic surgical instrument segmentation dataset was used. This dataset consists of left and right view images of laparoscopic surgery acquired simultaneously by a stereo endoscope. All images were stereo-rectified to eliminate geometric distortion and vertical parallax offset, ensuring row alignment and geometric correspondence between the left and right views.

[0037] The dimensions of all stereo image pairs are uniformly normalized to … pixels. Augmentation involves normalizing pixel values and adjusting image colors to improve the model's generalization ability.

[0038] 2. Set up the comparison model: To fully verify the performance advantages of the method of this invention, the training configuration of all comparison models is completely consistent with the method of this application to ensure the fairness of the experiment.

[0039] The comparison models include: models based on convolutional networks (U-Net, DeepLabv3+, PSPNet, and HRNet); attention-based models (TransUNet, SegFormer, and LSKANet); and models based on semi-supervised frameworks (MeanTeacher, GCT, CCT, and CPS).

[0040] 3. Determine the evaluation indicators: Segmentation performance is evaluated using two metrics commonly used in medical image semantic segmentation: mean Intersection over Union (mIoU) and mean Dice coefficient (mDice). mIoU is the ratio of the intersection to the union of the predicted segmentation and the ground truth label, averaged over all categories; the higher the value, the higher the segmentation accuracy. mDice measures the overlap between the predicted segmentation and the ground truth label, averaged over all categories; the higher the value, the better the segmentation.
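For concreteness, both metrics can be computed from flat lists of predicted and ground truth class indices as in the sketch below. The function name is hypothetical, and skipping classes that appear in neither the prediction nor the ground truth is an assumption; evaluation protocols differ on that point.

```python
def mean_iou_and_dice(pred, target, num_classes):
    """Return (mIoU, mDice) averaged over the classes present in pred or target."""
    ious, dices = [], []
    for c in range(num_classes):
        tp = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        n_pred = sum(1 for p in pred if p == c)
        n_true = sum(1 for t in target if t == c)
        union = n_pred + n_true - tp
        if union == 0:
            continue  # class absent from both maps: skipped (assumption)
        ious.append(tp / union)
        dices.append(2.0 * tp / (n_pred + n_true))
    return sum(ious) / len(ious), sum(dices) / len(dices)


pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
miou, mdice = mean_iou_and_dice(pred, target, num_classes=2)
```

Here class 0 has IoU 1/2 and Dice 2/3, class 1 has IoU 2/3 and Dice 4/5, so mIoU is 7/12 and mDice is 11/15.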

[0041] 4. Training details and related parameter design: Experiments were conducted with the proposed dual-teacher semi-supervised semantic segmentation method and device for stereoscopic endoscopic surgery, wherein: the student network uses ResNet50-UNet as the base segmentation network; the stereo matching model in the geometry teacher is FoundationStereo (pre-trained, with fixed weights); the semantic teacher is updated through exponential moving average with the momentum coefficient set to 0.99; training uses the AdamW optimizer; and the two loss weights are set to 0.8 and 0.2, respectively.

[0042] 5. Analysis of experimental results: The quantitative segmentation performance of the method of this application and the comparison models is shown in Table 1.

Table 1: Quantitative segmentation performance of the method of this application and the comparison models

[0043] The proposed method achieves the highest segmentation accuracy among all compared methods in terms of overall mIoU, overall mDice, instrument shaft mIoU, and instrument tip mIoU for surgical instrument segmentation, which verifies its effectiveness.

[0044] The segmentation visualization results are shown in Figure 4. The method of this application achieves better segmentation results than the other models and segments surgical instruments more accurately, especially at the wrist and the tip of the instrument, where edges are delineated more precisely.

[0045] It is worth noting that all contents not described in detail in this invention are existing technologies and are well known to those skilled in the art.

[0046] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can still be made to the technical solutions of the present invention, and these modifications or equivalent substitutions cannot cause the modified technical solutions to deviate from the spirit and scope of the technical solutions of the present invention.

Claims

1. A dual-teacher semi-supervised semantic segmentation method for stereoscopic endoscopic surgery, characterized in that it includes the following steps:
Step S1: Obtain a dataset of stereo-rectified stereoscopic endoscopic surgical image pairs. Each image pair includes a left view and a right view, where the left view has pixel-level ground truth segmentation annotations and the right view has no annotations.
Step S2: Construct a student segmentation network that predicts the left and right views with shared weights, obtaining prediction results for the left view and the right view respectively.
Step S3: Construct a geometric teacher model that incorporates the FoundationStereo stereo matching model. Taking the stereo endoscopic surgical image pair as input, predict the corresponding disparity map. Using the disparity map, map the pixel-level ground truth labels of the left view into the coordinate system of the right view through geometric transformation to obtain geometric pseudo-labels on the right view, and convert the geometric pseudo-labels into a geometric pseudo-label distribution in the form of a class probability distribution. Construct the geometric confidence from disparity-guided cross-view reprojection consistency: reproject the left view according to the disparity map to obtain a reconstruction of the right view, evaluate the supervision confidence at each pixel position from the photometric consistency error between the reconstructed right view and the real right view, and generate a pixel-level confidence map of the right-view geometric pseudo-labels.
Step S4: Construct a semantic teacher model whose network structure is the same as that of the student model and whose parameter weights are updated using the exponential moving average of the student network's parameter weights. The semantic teacher receives the data-augmented right view as input and outputs a class probability distribution; based on the maximum class probability at each pixel position, a pixel-level confidence map of the right-view semantic pseudo-labels is generated.
Step S5: Perform confidence-aware fusion of the right-view geometric pseudo-labels and the right-view semantic pseudo-labels to generate a fusion supervision distribution.
Step S6: Based on the ground truth segmentation labels of the left view, calculate the supervised loss of the student model prediction using cross-entropy loss and Dice loss. Based on the fusion supervision distribution of the right view, calculate the cross-entropy loss between the student prediction and the fused pseudo-label distribution to obtain the unsupervised loss. Construct the total loss as a weighted sum of the supervised loss and the unsupervised loss, and train the student segmentation network.

2. The dual-teacher semi-supervised semantic segmentation method for stereoscopic endoscopic surgery according to claim 1, characterized in that: In step S2, the left-view prediction is trained with supervision from the pixel-level ground truth segmentation labels; for the right-view prediction, the two pseudo-labels generated by the geometric teacher and the semantic teacher are combined through confidence-aware fusion into the final supervision distribution used to train the student model.

3. The dual-teacher semi-supervised semantic segmentation method for stereoscopic endoscopic surgery according to claim 1, characterized in that: In step S3, the formula for calculating the geometric pseudo-label on the right view is:

$$\hat{Y}_R^{geo}(\mathbf{x}_R) = \mathcal{W}\left(Y_L, D_R\right)(\mathbf{x}_R);$$

where $\hat{Y}_R^{geo}$ is the geometric pseudo-label on the right view, $\mathbf{x}_R$ is the pixel coordinate of the right view, $\mathcal{W}$ is the geometric transformation function, $Y_L$ is the pixel-level ground truth label of the left view, and $D_R$ is the disparity map corresponding to the right view;

The formula for calculating the geometric pseudo-label distribution is:

$$P_R^{geo}(c \mid i) = \mathbb{1}\left[\hat{Y}_R^{geo}(i) = c\right], \quad c \in \{1, \dots, C\};$$

where $P_R^{geo}(c \mid i)$ is the geometric pseudo-label probability of class $c$ at pixel $i$ of the right view, $c$ is the category index, $C$ is the total number of categories in semantic segmentation, $i$ is the pixel position index, and $\mathbb{1}[\cdot]$ is the indicator function;

The formula for calculating the pixel-level confidence map of the right-view geometric pseudo-label is:

$$M_R^{geo}(i) = \exp\left(-\frac{\left\lVert I_R(i) - \tilde{I}_R(i) \right\rVert_1}{\tau}\right), \quad i \in \{1, \dots, N\};$$

where $M_R^{geo}(i)$ is the pixel-level confidence of the right-view geometric pseudo-label at pixel $i$, $\exp$ is the natural exponential function, $N$ is the total number of pixels in the image, $I_R$ is the real right view, $I_R(i)$ is the pixel value of the real right view at pixel $i$, $\tilde{I}_R$ is the right view reconstruction result, $\tilde{I}_R(i)$ is the pixel value of the reconstructed right view at pixel $i$, and $\tau$ is the temperature coefficient.

4. The dual-teacher semi-supervised semantic segmentation method for stereoscopic endoscopic surgery according to claim 1, characterized in that: In step S4, the parameter weights are updated using an exponential moving average of the student network parameter weights, with the following formula:

$$\theta_t^{T} = \alpha\,\theta_{t-1}^{T} + (1 - \alpha)\,\theta_t^{S};$$

where $\theta_t^{T}$ is the parameter weight of the semantic teacher model after the $t$-th iteration, $\theta_{t-1}^{T}$ is the parameter weight of the semantic teacher model after the $(t-1)$-th iteration, $t$ is the number of training iterations, $\alpha$ is the smoothing coefficient, $\theta_t^{S}$ is the parameter of the student model after the $t$-th iteration, $T$ denotes the semantic teacher model, and $S$ denotes the student model;

Based on the maximum class probability at each pixel location, a pixel-level confidence map of the semantic pseudo-labels for the right view is generated, using the following formula:

$$M_R^{sem}(i) = \max_{c \in \{1, \dots, C\}} P_R^{sem}(c \mid i);$$

where $M_R^{sem}(i)$ is the pixel-level confidence of the right-view semantic pseudo-label at pixel $i$, $\max_{c}$ takes the maximum value among the $C$ categories, and $P_R^{sem}(c \mid i)$ is the probability of category $c$ output by the semantic teacher model at pixel $i$ of the right view.

5. The dual-teacher semi-supervised semantic segmentation method for stereoscopic endoscopic surgery according to claim 1, characterized in that: Step S5 specifically includes the following steps:

Step S51: Normalize the confidence scores of the geometric pseudo-labels and the semantic pseudo-labels using the following formula:

$$\bar{M}_R^{k}(i) = \frac{M_R^{k}(i)}{M_R^{geo}(i) + M_R^{sem}(i) + \epsilon}, \quad k \in \{geo, sem\};$$

where $\bar{M}_R^{k}(i)$ is the pixel-level normalized confidence of the right-view pseudo-label of teacher $k$ at pixel $i$, $M_R^{k}(i)$ is the pixel-level confidence of the right-view pseudo-label of teacher $k$ at pixel $i$, $\epsilon$ is the numerical stability constant, $k$ denotes the teacher pseudo-label type, $geo$ denotes the geometric pseudo-labels, and $sem$ denotes the semantic teacher pseudo-labels;

Step S52: Perform pixel-level gated fusion based on the normalized confidence scores to generate the fusion supervision distribution, using the following formula:

$$P_R^{fuse}(i) = \bar{M}_R^{geo}(i)\,P_R^{geo}(i) + \bar{M}_R^{sem}(i)\,P_R^{sem}(i);$$

where $P_R^{fuse}(i)$ is the fused pseudo-label probability distribution at pixel $i$ of the right view, $\bar{M}_R^{geo}(i)$ is the pixel-level normalized confidence of the right-view geometric pseudo-label at pixel $i$, $P_R^{geo}(i)$ is the geometric pseudo-label probability distribution at pixel $i$ of the right view, $\bar{M}_R^{sem}(i)$ is the pixel-level normalized confidence of the right-view semantic pseudo-label at pixel $i$, and $P_R^{sem}(i)$ is the semantic teacher pseudo-label probability distribution at pixel $i$ of the right view.

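6. The dual-teacher semi-supervised semantic segmentation method for stereoscopic endoscopic surgery according to claim 1, characterized in that: In step S6, the supervised loss is calculated for the student model prediction with the following formula:

$$\mathcal{L}_{sup} = \mathcal{L}_{CE}\left(P_L^{S}(i), Y_L(i)\right) + \mathcal{L}_{Dice}\left(P_L^{S}(i), Y_L(i)\right);$$

where $\mathcal{L}_{sup}$ is the supervised loss of the student model prediction, $\mathcal{L}_{CE}$ is the cross-entropy loss, $P_L^{S}(i)$ is the probability of each category predicted by the student model at pixel $i$ of the left view, $Y_L(i)$ is the ground truth label at pixel $i$ of the left view, and $\mathcal{L}_{Dice}$ is the Dice loss;

The formula for calculating the unsupervised loss is:

$$\mathcal{L}_{unsup} = \mathcal{L}_{CE}\left(P_R^{S}(i), P_R^{fuse}(i)\right);$$

where $\mathcal{L}_{unsup}$ is the unsupervised loss of the student model prediction and $P_R^{S}(i)$ is the probability of each category predicted by the student model at pixel $i$ of the right view;

The total loss is constructed as a weighted sum of the supervised loss and the unsupervised loss, with the following formula:

$$\mathcal{L}_{total} = \lambda_{sup}\,\mathcal{L}_{sup} + \lambda_{unsup}\,\mathcal{L}_{unsup};$$

where $\mathcal{L}_{total}$ is the total loss, $\lambda_{sup}$ is the supervised loss weight, and $\lambda_{unsup}$ is the unsupervised loss weight.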

7. A dual-teacher semi-supervised semantic segmentation device for stereoscopic endoscopic surgery, used to implement the dual-teacher semi-supervised semantic segmentation method for stereoscopic endoscopic surgery as described in any one of claims 1-6, characterized in that: It includes a data acquisition module, a student model module, a geometry teacher module, a semantic teacher module, a fusion module, and a model training module, wherein:
The data acquisition module includes a stereo calibration unit and a dataset construction unit, and is communicatively connected to the student model module, the geometry teacher module, and the semantic teacher module;
The student model module is a semantic segmentation network with shared weights, includes a left view prediction unit and a right view prediction unit, and is communicatively connected to the semantic teacher module and the model training module;
The geometry teacher module includes a stereo matching unit, a geometric transformation unit, a probability distribution transformation unit, and a reprojection consistency verification unit, and is communicatively connected to the fusion module; the stereo matching unit has a built-in FoundationStereo stereo matching foundation model;
The semantic teacher module includes a data augmentation unit, a parameter update unit, a semantic prediction unit, and a semantic confidence calculation unit, and is communicatively connected to the fusion module;
The fusion module includes a confidence normalization unit and a pixel-level gated fusion unit, and is communicatively connected to the model training module;
The model training module includes a supervised loss calculation unit, an unsupervised loss calculation unit, a total loss construction unit, and a parameter update unit, and is communicatively connected to the student model module.

8. A dual-teacher semi-supervised semantic segmentation device for stereoscopic endoscopic surgery according to claim 7, characterized in that: The semantic segmentation network of the student model module has the same network structure as that of the semantic teacher module, and the left view prediction unit and the right view prediction unit use completely shared network weights.

9. A dual-teacher semi-supervised semantic segmentation device for stereoscopic endoscopic surgery according to claim 7, characterized in that: The parameter update unit of the model training module updates the network parameters of the student model module based on backpropagation of the total loss function, and synchronizes the updated student model module parameters to the parameter update unit of the semantic teacher module to complete one round of iterative training.

10. A dual-teacher semi-supervised semantic segmentation device for stereoscopic endoscopic surgery according to claim 7, characterized in that: The device utilizes the left and right views of the stereo image pair to generate dual-teacher supervision signals and train the model only during the training phase. During the inference phase, it can complete semantic segmentation prediction by inputting only a single endoscopic view, without introducing additional inference overhead.