Intelligent image segmentation method, system, device and medium for endoscopic colorectal surgery

By using a multi-branch heterogeneous network and an adaptive channel attention fusion module, combined with a surgical step temporal attention module, a keyframe feature system specifically for laparoscopic colorectal surgery is constructed. This solves the problems of low accuracy and low efficiency in keyframe recognition during laparoscopic colorectal surgery, and enables automatic segmentation and efficient keyframe extraction of arteries such as the inferior mesenteric artery.

CN122244073A · Pending · Publication Date: 2026-06-19 · THE FIRST HOSPITAL OF CHINA MEDICIAL UNIV +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
THE FIRST HOSPITAL OF CHINA MEDICIAL UNIV
Filing Date
2026-03-19
Publication Date
2026-06-19


Abstract

This invention discloses an intelligent segmentation method, system, device, and medium for laparoscopic colorectal surgery images. The method includes: acquiring images during laparoscopic colorectal surgery; performing frame-by-frame processing on the images; and annotating keyframes in the frame-by-frame images and presetting Regions of Interest (ROIs) to achieve segmentation of the laparoscopic colorectal surgery images. This invention, through real-time laparoscopic visualization, provides guidance for locating the inferior mesenteric artery, automatic segmentation of the inferior mesenteric artery, automatic segmentation of the Toldts space, and automatic segmentation of instrument forceps during the inferior mesenteric artery dissection stage, effectively achieving intelligent segmentation of laparoscopic colorectal surgery images. This invention achieves automated keyframe extraction with high processing efficiency and can quickly construct large-scale, standardized keyframe datasets.

Description

Technical Field

[0001] This invention relates to the field of intelligent medical technology, and in particular to an intelligent segmentation method, system, device and medium for laparoscopic colorectal surgery images.

Background Technology

[0002] Laparoscopic colorectal surgery is a minimally invasive procedure performed with the support of a laparoscopic system. Through a small incision in the abdominal wall, a high-definition laparoscope and specialized surgical instruments are inserted into the abdominal cavity to remove colorectal lesions, dissect lymph nodes, and reconstruct the digestive tract. It offers significant advantages such as less trauma, less bleeding, faster postoperative recovery, and a low incidence of complications. It has been widely adopted in medical centers at all levels both domestically and internationally and is gradually being promoted to primary hospitals. However, current laparoscopic colorectal surgery relies mainly on the surgeon's visual observation and clinical experience. The laparoscope provides a two-dimensional planar view without three-dimensional depth perception, so surgeons need long-term practical experience to judge tissue layers and spatial locations accurately, resulting in a long learning curve. Varying skill levels among surgeons make it difficult to standardize surgical quality. Moreover, the inferior mesenteric artery (IMA) is thin, deeply located, and often surrounded by adipose tissue, giving it low contrast with surrounding tissues. It is therefore easily missed or damaged during surgery, which can lead to serious complications such as massive hemorrhage, intestinal ischemia, and anastomotic leakage. Toldt's space, a potential anatomical space between the colorectal mesentery and the retroperitoneum, has no clear visual boundary. Its grayscale and texture are highly similar to those of the surrounding fat and mesenteric tissue, so layer identification depends heavily on the surgeon's experience. Deviation in layer identification can easily damage the ureter, gonadal vessels, and autonomic nerves. Surgical instruments are made of metal, vary in posture, and are easily obscured by tissue or blood, which makes real-time positioning difficult and increases the risk of accidental puncture and incomplete cutting.

[0003] In view of these shortcomings of existing laparoscopic colorectal surgery, researchers have begun to explore intelligent segmentation schemes for laparoscopic colorectal surgery images. In such schemes, the accurate extraction of keyframes from surgical videos is central to surgical AI model training, intraoperative navigation, and surgical process analysis. Existing laparoscopic keyframe recognition technologies were mostly developed for procedures such as gallbladder and uterine surgery, and they have several technical shortcomings when applied to colorectal surgery:
I. Poor adaptability: There is no keyframe feature system specific to laparoscopic colorectal surgery, so steps specific to this surgery, such as the Toldt's space incision and IMA vessel localization, cannot be recognized. General-purpose models match colorectal surgery scenarios poorly.
II. Limited feature utilization: Recognition relies solely on single visual features of tissues or instruments, without integrating the temporal operation features of instruments or the contextual features of the surgical scene. Robustness is therefore poor in complex scenarios such as instrument occlusion, tissue deformation, and cauterization smoke, and keyframe recognition accuracy is generally low.
III. Rigid screening mechanism: Screening keyframes with fixed thresholds easily mis-screens or omits frames that have "fuzzy features but surgical significance," resulting in insufficient keyframe coverage.
IV. Low manual efficiency: In clinical practice, keyframe extraction relies on manual work by doctors, at a rate of only 5-8 frames per hour. Furthermore, annotation standards are subjective and fail to meet the need of AI model training for large-scale, standardized keyframe datasets.
Additionally, existing feature fusion methods often employ simple convolutional concatenation or direct pixel-level fusion, neglecting the differences in importance between modalities, so core features are easily masked by redundant features. Temporal feature processing does not incorporate prior clinical knowledge of surgical steps and therefore fails to capture the temporal dependencies of colorectal surgery procedures effectively.

[0004] The above problems urgently need to be solved.

Summary of the Invention

[0005] To address the related technical problems, the purpose of this invention is to provide a method, system, device, and medium for intelligent segmentation of laparoscopic colorectal surgery images, thereby resolving the issues mentioned in the background section.

[0006] To achieve this objective, the embodiments of the present invention adopt the following technical solutions: In a first aspect, embodiments of the present invention provide an intelligent segmentation method for laparoscopic colorectal surgery images, the method comprising: acquiring images during laparoscopic colorectal surgery; splitting the images acquired during the laparoscopic colorectal surgery into frames; and performing keyframe annotation on the frame-split images and presetting ROIs to achieve segmentation of the laparoscopic colorectal surgery images.

[0007] As an optional implementation, the acquisition of images during laparoscopic colorectal surgery includes: acquiring images during a laparoscopic colorectal surgery that follows a standardized procedure, and selecting the images from the incision point of the inferior mesenteric artery to the completion of the vessel block.

[0008] As an optional implementation, the preset ROI includes, but is not limited to, the inferior mesenteric artery, Toldts' space, and instrument forceps.

[0009] As an optional implementation, the step of performing keyframe annotation and ROI presetting on the frame-split images to achieve segmentation of the laparoscopic colorectal surgery images includes: constructing a keyframe feature system for laparoscopic colorectal surgery; extracting three categories of features in parallel using a multi-branch heterogeneous network to obtain an instrument feature tensor, a tissue feature tensor, and a scene feature tensor; and fusing the heterogeneous feature tensors into a global feature tensor using an adaptive channel attention fusion module.

[0010] As an optional implementation, the step of using an adaptive channel attention fusion module to fuse heterogeneous feature tensors into a global feature tensor further includes: The global feature tensor is encoded into a temporal feature sequence, which is then weighted and fused through the surgical step temporal attention module, and input into the classifier to obtain preliminary keyframes, surgical step labels and confidence scores. The initial keyframes are clustered by feature similarity through scene-adaptive dynamic threshold filtering. The dynamic thresholds for each scene are calculated, and the final keyframe set is output after secondary filtering.

[0011] In a second aspect, embodiments of the present invention provide an intelligent image segmentation system for laparoscopic colorectal surgery, the system comprising: a target image acquisition module, used to acquire images during laparoscopic colorectal surgery; a frame splitting processing module, used to split the images acquired during the laparoscopic colorectal surgery into frames; and an image segmentation module, used to perform keyframe annotation on the frame-split images and to preset ROIs to achieve segmentation of the laparoscopic colorectal surgery images.

[0012] As an optional implementation, the target image acquisition module is specifically used to acquire images during a laparoscopic colorectal surgery that follows a standardized procedure, and to select the images from the incision point of the inferior mesenteric artery to the completion of the vessel block.

[0013] As an optional implementation, the preset ROI includes, but is not limited to, the inferior mesenteric artery, Toldts' space, and instrument forceps.

[0014] In a third aspect, embodiments of the present invention provide an electronic device including a processor and a memory connected to the processor, wherein the memory stores program data, and the processor retrieves the program data stored in the memory to execute the intelligent segmentation method for laparoscopic colorectal surgery images provided in the first aspect above.

[0015] In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the intelligent segmentation method for laparoscopic colorectal surgery images provided in the first aspect above.

[0016] The technical solution proposed in this invention acquires images during laparoscopic colorectal surgery, splits the images into frames, and performs keyframe annotation and ROI presetting on the resulting frames to achieve segmentation of the laparoscopic colorectal surgery images. Using real-time laparoscopic images, this invention can provide guidance for locating the inferior mesenteric artery, automatic segmentation of the inferior mesenteric artery, automatic segmentation of Toldt's space, and automatic segmentation of instrument forceps during the inferior mesenteric artery dissection stage, effectively realizing intelligent segmentation of laparoscopic colorectal surgery images. The technical solution achieves intelligent keyframe recognition based on multimodal feature fusion and temporal attention. It constructs a keyframe feature system and core keyframe classification specific to laparoscopic colorectal surgery, accurately matching the surgical workflow. It integrates three categories of features (visual morphology, instrument operation, and scene context) and uses an attention mechanism to increase the weight of core features, thereby improving both the average recognition accuracy of core keyframes and the recognition accuracy in complex scenes. It adopts scene-adaptive dynamic threshold screening to solve the mis-screening and omission problems caused by fixed thresholds, significantly improving keyframe coverage of core surgical steps without omitting landmark keyframes. It realizes automated keyframe extraction with high processing efficiency and can quickly build large-scale standardized keyframe datasets to meet the needs of AI model training.

Description of the Drawings

[0017] To more clearly illustrate and understand the technical solutions in the embodiments of the present invention, the accompanying drawings used in the background technology and embodiment descriptions of the present invention will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the content of the embodiments of the present invention and these drawings without creative effort.

[0018] Figure 1 is a schematic diagram of the intelligent segmentation method for laparoscopic colorectal surgery images provided in an embodiment of the present invention; Figure 2 is a structural block diagram of the intelligent image segmentation system for laparoscopic colorectal surgery provided in an embodiment of the present invention.

Detailed Implementation

[0019] To make the technical problems solved by the present invention, the technical solutions adopted, and the technical effects achieved clearer, the technical solutions of the embodiments of the present invention will be further described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0020] Example 1
Please refer to Figure 1, which shows a schematic flowchart of the intelligent segmentation method for laparoscopic colorectal surgery images provided in this embodiment of the invention. As shown in the figure, the intelligent segmentation method for laparoscopic colorectal surgery images in this embodiment includes:
S101. Acquire images during laparoscopic colorectal surgery;
S102. Perform frame splitting on the images acquired during the laparoscopic colorectal surgery;
S103. Perform keyframe annotation on the frame-split images and preset ROIs to achieve segmentation of the laparoscopic colorectal surgery images.

[0021] For example, the acquisition of images during laparoscopic colorectal surgery includes: acquiring images during a laparoscopic colorectal surgery that follows a standardized procedure, and selecting the images from the incision point of the inferior mesenteric artery to the completion of the vessel block.

[0022] For example, the preset ROI includes, but is not limited to, the inferior mesenteric artery, Toldts' space, and instrument forceps.

[0023] It should be noted that the extraction and recognition of keyframes in laparoscopic images is the first step in all laparoscopic image recognition. The surgical process can be segmented based on the recognition results of keyframes. Because laparoscopic surgery is lengthy, complex, and involves many steps, keyframe recognition can identify the current surgical steps and surgical scenario, which is a crucial foundation for subsequent development such as surgical guidance, hazard warning, and organ recognition.

[0024] In this implementation, the doctor manually extracts keyframes from the surgical video and defines each keyframe (e.g., start of surgery, start of cauterization, start of dissection); this keyframe information is used as training material. During acceptance testing, the doctor extracts keyframes from the validation set using the same definitions to verify the segmentation accuracy. In one validation example, the Dice coefficient of the inferior mesenteric artery after dissection is >0.8, that of the instrument forceps is >0.8, and that of Toldt's space and other tissues is >0.7.
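The Dice figures quoted in the validation example can be checked with a few lines of code. The following is a minimal sketch, assuming binary masks stored as NumPy arrays; the function names and the dictionary keys are hypothetical labels chosen for illustration and do not come from the patent text.

```python
import numpy as np

def dice_coefficient(pred_mask, true_mask):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks of equal shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    denom = pred.sum() + true.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated here as perfect agreement
    return float(2.0 * np.logical_and(pred, true).sum() / denom)

# Thresholds taken from the validation example in the text;
# the ROI key names are assumptions made for this sketch.
DICE_THRESHOLDS = {"ima": 0.8, "forceps": 0.8, "toldt_space": 0.7}

def passes_acceptance(mean_dice_per_roi):
    """True if every ROI's Dice score strictly exceeds its threshold."""
    return all(mean_dice_per_roi[roi] > thr for roi, thr in DICE_THRESHOLDS.items())
```

A per-ROI score would typically be averaged over the validation keyframes before being passed to `passes_acceptance`; that averaging choice is also an assumption.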

[0025] For example, the step of performing keyframe annotation and ROI presetting on the frame-split images to achieve segmentation of the laparoscopic colorectal surgery images includes: constructing a keyframe feature system for laparoscopic colorectal surgery; extracting three categories of features in parallel using a multi-branch heterogeneous network to obtain an instrument feature tensor, a tissue feature tensor, and a scene feature tensor; and fusing the heterogeneous feature tensors into a global feature tensor using an adaptive channel attention fusion module.

[0026] For example, the step of using an adaptive channel attention fusion module to fuse heterogeneous feature tensors into a global feature tensor further includes: The global feature tensor is encoded into a temporal feature sequence, which is then weighted and fused through the surgical step temporal attention module, and input into the classifier to obtain preliminary keyframes, surgical step labels and confidence scores. The initial keyframes are clustered by feature similarity through scene-adaptive dynamic threshold filtering. The dynamic thresholds for each scene are calculated, and the final keyframe set is output after secondary filtering.

[0027] In this embodiment, the intelligent keyframe recognition method for laparoscopic colorectal surgery, based on multimodal feature fusion and temporal attention, proceeds as follows. A keyframe feature system specific to laparoscopic colorectal surgery is constructed: eight types of core keyframes are defined, and for each frame three categories of features are extracted, namely visual morphological features $F_v$, instrument operation features $F_i$, and surgical scene context features $F_s$. These three categories of features are extracted in parallel by a multi-branch heterogeneous network to obtain the instrument feature tensor $T_{ins}$, the tissue feature tensor $T_{tis}$, and the scene feature tensor $T_{sce}$. The Adaptive Channel Attention Fusion Module (ACAFM) fuses the heterogeneous feature tensors into a global feature tensor $T_g$. The global feature tensor is encoded into a temporal feature sequence, which is weighted and fused by the surgical step temporal attention module (SAMS) and then input into a classifier to obtain preliminary keyframes, surgical step labels, and confidence scores. A scene-adaptive dynamic threshold filtering algorithm then clusters the preliminary keyframes into 3 scenes by feature similarity and calculates a dynamic threshold for each scene. After secondary filtering, the final keyframe set is output.

[0028] Specifically, the multi-branch heterogeneous network includes: a branch for instrument operation feature extraction using YOLO-OBB, a branch for tissue visual feature extraction using an improved U-Net++, and a branch for scene context feature extraction using MobileNetV3+ConvLSTM. The improved U-Net++ uses ResNet50 as the encoder, adds a spatial attention module to the decoding layer, and employs dense convolutional fusion for skip connections.
Specifically, the computation process of the Adaptive Channel Attention Fusion Module (ACAFM) is as follows: first, global average pooling is applied to the initial fused feature tensor $T_{cat}$ to obtain the channel feature vector $z$; then, the channel attention weights $w$ are learned through two fully connected layers; finally, $T_{cat}$ and $w$ are multiplied channel-wise, and the result is reduced in dimension by a 1×1 convolution to obtain the global feature tensor $T_g$.
Specifically, the Surgical Step Temporal Attention Module (SAMS) combines a self-attention mechanism with clinical prior weights. The self-attention weights are calculated as $A = \mathrm{Softmax}\left(QK^{T}/\sqrt{d_k}\right)$, and the final attention weight is obtained by correcting $A$ with the clinical prior weights $W_p$.
Specifically, feature similarity is calculated using cosine similarity: $\mathrm{sim}(f_a, f_b) = \frac{f_a \cdot f_b}{\lVert f_a \rVert \, \lVert f_b \rVert}$, where $f_a$ and $f_b$ are the feature vectors of two frames. The dynamic threshold of each scene is determined by a basic threshold and an adjustment coefficient.
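The scene-adaptive filtering step can be illustrated with a short sketch. The text names cosine similarity, a basic threshold, and an adjustment coefficient, but does not give the values or the exact threshold formula, so everything beyond cosine similarity below is an assumption: the scene centroids, the default values `base_threshold=0.5` and `adjust_coef=0.5`, and the rule that shifts the base threshold by the difference between a scene's mean confidence and the global mean are all hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def assign_scenes(features, centroids):
    """Assign each frame to the scene whose (hypothetical) centroid is most similar."""
    return [int(np.argmax([cosine_similarity(f, c) for c in centroids]))
            for f in features]

def dynamic_thresholds(confidences, scene_ids, base_threshold=0.5, adjust_coef=0.5):
    """ASSUMED rule: threshold_s = base + coef * (mean_conf_s - global_mean_conf)."""
    conf = np.asarray(confidences, dtype=float)
    ids = np.asarray(scene_ids)
    global_mean = conf.mean()
    return {int(s): base_threshold + adjust_coef * (conf[ids == s].mean() - global_mean)
            for s in np.unique(ids)}

def secondary_filter(confidences, scene_ids, **kwargs):
    """Keep the indices of frames whose confidence reaches their scene's threshold."""
    thr = dynamic_thresholds(confidences, scene_ids, **kwargs)
    return [i for i, (c, s) in enumerate(zip(confidences, scene_ids))
            if c >= thr[int(s)]]
```

Under this illustrative rule, a frame with confidence 0.4 in a low-confidence scene can survive filtering even though a fixed threshold of 0.5 would discard it, which is the behaviour the dynamic threshold is meant to provide for frames with fuzzy features but surgical significance.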

[0029] In this embodiment, a keyframe feature system specific to laparoscopic colorectal surgery was constructed, and a four-stage recognition method consisting of multi-branch feature extraction, cross-modal fusion, temporal attention classification, and dynamic threshold filtering was designed. A multi-branch heterogeneous network fusion model architecture was proposed, including an adaptive channel attention fusion module and a surgical step temporal attention module, and an end-to-end intelligent recognition system was built. The model adopts an integrated heterogeneous network architecture with multi-branch input, feature fusion layer, temporal attention classifier, and dynamic filtering layer. The input is a continuous RGB image after image frame decomposition during laparoscopic colorectal surgery, and the output is a standardized keyframe set with surgical step labels and confidence scores. Simultaneously, a keyframe feature system specific to laparoscopic colorectal surgery was constructed, an adaptive channel attention fusion module and a surgical step temporal attention module were designed, and a scene-adaptive dynamic threshold filtering algorithm was proposed to achieve high-precision and high-coverage keyframe recognition.

[0030] Specifically, the keyframe feature system for laparoscopic colorectal surgery is determined based on the clinical operation process of colorectal surgery. Taking lateral approach laparoscopic colorectal surgery as an example, eight core keyframes are defined, covering the core steps of the entire surgical process, as follows:
1. Surgical preparation frame: instruments are in place, the surgical area is exposed, and preparation for core operations begins;
2. Lateral peritoneum localization frame: anatomical localization of the lateral peritoneum is completed, in preparation for the Toldt's space incision;
3. Toldt's space incision frame: the instrument cuts in at the boundary between the lateral peritoneum and the mesentery, and dissection of Toldt's space begins;
4. Toldt's space cauterization frame: Toldt's space is cauterized using cauterization instruments;
5. IMA vessel localization frame: anatomical localization of the inferior mesenteric artery (IMA) is completed, but dissection has not yet begun;
6. IMA vessel dissection frame: IMA vessel dissection begins;
7. Vessel block frame: IMA vessel dissection and block operations are completed;
8. Surgical completion frame: core operations are completed, and the surgical area is cleaned and finished.
Specifically, for each type of keyframe, a feature system with three categories is constructed, comprising visual morphological features ($F_v$), instrument operation features ($F_i$), and surgical scene context features ($F_s$), which provides a dedicated training basis for the model. The feature definitions and included dimensions are as follows:
Visual morphological features ($F_v$): pixel-level low-level visual features of tissues and instruments, including texture features, color features, and contour features, such as the grayish-white texture of Toldt's space after cauterization, the red tubular outline of IMA vessels, and the metallic reflective features of instrument forceps;
Instrument operation features ($F_i$): motion and state characteristics of surgical instruments, including position features, posture features, motion trajectory features, and cross-frame tracking IDs, such as the perpendicular angle between the cauterization instrument and Toldt's space, and the lifting posture of the clamping instrument on the IMA vessel;
Surgical scene context features ($F_s$): higher-order association features between tissues and between instruments and tissues, including tissue anatomical location association features, instrument-tissue interaction features, and scene temporal change features, such as the location of Toldt's space above the intestine and below the mesentery, and the contact relationship between instrument forceps and IMA vessels.
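The eight core keyframe classes can be captured as a small enumeration so that classifier outputs map to named surgical steps. This is an illustrative sketch only: the identifier names and the zero-based index order are assumptions that follow the order of the list above, and `label_from_probs` is a hypothetical helper.

```python
from enum import IntEnum

class KeyframeClass(IntEnum):
    """Eight core keyframe types, indexed in the order listed in the text."""
    SURGICAL_PREPARATION = 0
    LATERAL_PERITONEUM_LOCALIZATION = 1
    TOLDT_SPACE_INCISION = 2
    TOLDT_SPACE_CAUTERIZATION = 3
    IMA_LOCALIZATION = 4
    IMA_DISSECTION = 5
    VESSEL_BLOCK = 6
    SURGICAL_COMPLETION = 7

def label_from_probs(probs):
    """Map an 8-way probability vector to (keyframe class, confidence score)."""
    idx = max(range(len(probs)), key=probs.__getitem__)
    return KeyframeClass(idx), probs[idx]
```

For example, a probability vector whose largest entry sits at index 4 would be labelled `KeyframeClass.IMA_LOCALIZATION` with that entry as its confidence.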

[0031] Specifically, the keyframe recognition model in this embodiment is a multi-branch heterogeneous network fusion architecture, consisting of four core layers: a multi-branch feature extraction layer, a cross-modal feature fusion layer, a temporal attention classification layer, and a dynamic threshold filtering layer. The multi-branch feature extraction layer is the foundational layer of the model. It contains three heterogeneous feature extraction branches that extract the three categories of features of laparoscopic colorectal surgery in parallel and output standardized feature tensors, addressing the poor robustness of single-feature recognition. The network structure, training configuration, and output features of each branch are all customized for the colorectal surgery scenario, as detailed below:
I. Instrument Operation Feature Extraction Branch. The core objective is to achieve real-time detection, tracking, and pose estimation of surgical instruments, and to extract the instrument operation feature tensor. Specifically, the base network uses YOLO-OBB (a rotated object detection variant), an improvement on YOLOv8-medium, optimized for the rotational pose of laparoscopic surgical instruments. The network depth is 36 layers, the activation function is SiLU, and the number of anchor boxes is set to 9 (adapting to 3 types of core instruments: clamps, cauterizers, and ultrasound probes, with 3 anchor boxes of different scales for each type of instrument). The loss function uses CIoU loss + class cross-entropy loss: the total loss is a weighted combination of $L_{CIoU}$ and $L_{cls}$ with weighting coefficient $\lambda$, where $L_{CIoU}$ is the CIoU loss for the rotated bounding box, used to optimize the position and attitude estimation of the instrument; $L_{cls}$ is the category cross-entropy loss, used to optimize instrument category recognition; and the weighting coefficient $\lambda$ was set to 0.5 after experimental optimization. Preprocessing operations: normalization of the input image (pixel values mapped to [0,1]) and adaptive anchor box matching. Output features: instrument operation feature tensor $T_{ins} \in \mathbb{R}^{H \times W \times C}$, where $H = 512$, $W = 512$, $C = 64$. The tensor contains five types of information: instrument category code (16-dimensional), rotated bounding box coordinates (4-dimensional), attitude angle (1-dimensional), cross-frame tracking ID (8-dimensional), and motion trajectory vector (35-dimensional).
II. Tissue Visual Feature Extraction Branch. The core objective is to achieve pixel-level fine segmentation of core anatomical tissues and to extract the tissue visual feature tensor. The base network uses an improved U-Net++, with three core improvements made specifically for colorectal surgery tissue segmentation to enhance segmentation accuracy and feature extraction capability. The encoder uses ResNet50 as the backbone network, replacing the original simple convolutional encoder of U-Net++ to enhance deep feature extraction. The decoding layer incorporates a spatial attention module (SAM) to focus on core tissue regions such as IMA vessels and Toldt's space and to suppress redundant background features. The skip connections employ dense convolutional fusion, densely concatenating the features from each encoder layer to resolve the problem of blurred tissue boundaries. As a pre-stage optimization strategy, the region of interest (ROI) of the surgical area is first extracted using YOLO-OBB, and the ROI is cropped to a 512×512 high-resolution sub-image before being input into U-Net++, which improves segmentation accuracy and increases inference speed threefold. The loss function uses Dice loss + cross-entropy loss to address class imbalance in medical image segmentation. The total loss function is $L_{seg} = L_{Dice} + L_{CE}$, where $L_{Dice} = 1 - \frac{2\sum_{i=1}^{N} p_i g_i}{\sum_{i=1}^{N} p_i + \sum_{i=1}^{N} g_i}$ is the Dice similarity coefficient loss and $L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\left[g_i \log p_i + (1 - g_i)\log(1 - p_i)\right]$ is the cross-entropy loss. In the formula, $p_i$ is the segmentation probability predicted by the model for pixel $i$, $g_i$ is the actual labeled value of pixel $i$, and $N$ is the total number of pixels. Output features: tissue visual feature tensor $T_{tis} \in \mathbb{R}^{H \times W \times C}$, where $H = 512$, $W = 512$, $C = 128$. The tensor contains four types of information: segmentation mask of the core tissues (32-dimensional), texture features (64-dimensional), contour features (16-dimensional), and color features (16-dimensional).
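The combined segmentation loss described for the tissue branch can be sketched in NumPy. This is a minimal illustration that assumes a binary (single-class) mask, an unweighted sum of the two terms, and a small smoothing constant `eps`; the text does not specify these details, and a real training pipeline would use a deep learning framework with automatic differentiation.

```python
import numpy as np

def dice_loss(probs, labels, eps=1e-7):
    """Soft Dice loss: 1 - 2*sum(p*g) / (sum(p) + sum(g))."""
    p = np.asarray(probs, dtype=float).ravel()
    g = np.asarray(labels, dtype=float).ravel()
    return float(1.0 - (2.0 * np.sum(p * g) + eps) / (np.sum(p) + np.sum(g) + eps))

def cross_entropy_loss(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over all pixels."""
    p = np.clip(np.asarray(probs, dtype=float).ravel(), eps, 1.0 - eps)
    g = np.asarray(labels, dtype=float).ravel()
    return float(-np.mean(g * np.log(p) + (1.0 - g) * np.log(1.0 - p)))

def segmentation_loss(probs, labels):
    """Dice loss + cross-entropy loss (equal weighting is an assumption)."""
    return dice_loss(probs, labels) + cross_entropy_loss(probs, labels)
```

The Dice term is driven by region overlap rather than per-pixel counts, which is why it is commonly paired with cross-entropy when the target structure, such as a thin vessel, occupies only a small fraction of the image.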

[0032] III. Scene Context Feature Extraction Branch. The core objective is to capture anatomical location relationships between tissues, instrument-tissue interactions, and temporal changes in the scene, and to extract the scene context feature tensor. The base network combines a lightweight CNN with a temporal feature extraction module, balancing inference speed and temporal feature capture. The CNN backbone uses MobileNetV3-small, with only 1.5M network parameters, which meets real-time processing requirements. The temporal module uses a ConvLSTM with a 3-frame sliding window to capture temporal changes in the scene across 3 consecutive frames. The ConvLSTM has a hidden layer dimension of 32 and a 3×3 kernel size.

[0033] Feature extraction logic: First, MobileNetV3-small is used to extract tissue-instrument interaction features from a single frame. Then, ConvLSTM is used to perform temporal modeling of the interaction features from three consecutive frames, capturing temporal information such as instrument movement and tissue morphology changes. Output features: scene context feature tensor $T_{sce} \in \mathbb{R}^{H \times W \times C}$, where $H = 512$, $W = 512$, $C = 32$. The tensor contains three types of information: inter-tissue anatomical location encoding (12-dimensional), instrument-tissue interaction encoding (10-dimensional), and scene temporal change encoding (10-dimensional).

[0034] The core objective of the cross-modal feature fusion layer is to integrate the heterogeneous feature tensors output from the three branches, $T_{ins}$, $T_{tis}$, and $T_{sce}$, into a unified global feature tensor. This addresses two issues of conventional approaches: the direct concatenation of heterogeneous features and the neglect of differences in feature importance. The fused features contain both local instrument and tissue features and global scene context features. This layer employs a two-stage fusion strategy of "spatial dimension alignment and channel dimension adaptive fusion," with an adaptive channel attention fusion module designed to replace traditional simple convolutional fusion and achieve adaptive weighted fusion of features. The specific steps are as follows:
I. Spatial Dimension Alignment: The feature tensors of the three branches share a spatial resolution of 512×512, so $T_{ins}$, $T_{tis}$, and $T_{sce}$ are concatenated directly at the pixel level along the channel dimension to form the initial fused feature tensor: $T_{cat} = \mathrm{Concat}(T_{ins}, T_{tis}, T_{sce})$, where $C_{cat} = 64 + 128 + 32 = 224$, that is, $T_{cat} \in \mathbb{R}^{512 \times 512 \times 224}$.

[0035] II. Channel Dimension Adaptive Fusion
Channel attention weights are learned to increase the weight of core feature channels (such as channels related to the IMA vessels and Toldt's space) and decrease the weight of redundant feature channels (such as channels related to background and irrelevant tissue), achieving adaptive weighted fusion of features. The module structure is shown in Figure 2, and the specific calculation process is as follows:
Global Average Pooling (GAP): global average pooling is applied to the initial fused feature tensor $F_{cat}$, mapping the two-dimensional feature map of each channel to a single value and yielding the channel feature vector $z \in \mathbb{R}^{224}$: $z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_{cat}(i, j, c)$, where $F_{cat}(i, j, c)$ is the value of channel $c$ at pixel $(i, j)$ and $H = W = 512$.
Attention weight learning: channel attention weights are learned through two fully connected layers, with a Sigmoid output that maps the weight values to [0, 1]: $w = \mathrm{Sigmoid}(FC_2(\mathrm{ReLU}(FC_1(z))))$, where $FC_1$ is the first fully connected layer (dimension 224 to 112), $FC_2$ is the second fully connected layer (dimension 112 to 224), ReLU is the activation function, and Sigmoid is the normalization function.
Weighted fusion: the initial fused feature tensor $F_{cat}$ is multiplied channel by channel with the channel attention weights $w$ to obtain the weighted fused feature tensor: $F_w = F_{cat} \otimes w$, where $\otimes$ denotes element-wise multiplication.
Feature dimensionality reduction: a single 1×1 convolutional layer reduces the channel dimension of $F_w$, decreasing the computational load of subsequent layers while preserving core features and yielding the global feature tensor: $F_{global} = \mathrm{Conv}_{1 \times 1}(F_w)$.
The core objective of the temporal attention classification layer is to convert the global feature tensor of each single frame into a sequence of temporal feature vectors, focus on the landmark frames of the surgical steps through an attention mechanism, classify keyframes and score their confidence, and thereby distinguish keyframes from non-keyframes in continuous frames. The core innovation is the Surgical Step Temporal Attention Module (SAMS), which combines clinical prior knowledge to capture the temporal dependencies of surgical steps.
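
The two fusion stages described above (channel concatenation, global average pooling, two fully connected layers, channel reweighting, and 1×1 reduction) can be sketched in NumPy as follows. This is a minimal illustration, not the patented implementation: the per-branch channel split, the random weights, the small spatial size, and the output width of 64 channels (inferred from the 64-to-128 fully connected layer of the next stage) are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention_fusion(branches, w1, b1, w2, b2, w_reduce):
    """Fuse branch tensors of shape (H, W, C_i) into one (H, W, C_out) tensor."""
    # Stage I: concatenate along channels; all branches share the same H x W.
    f_cat = np.concatenate(branches, axis=-1)        # (H, W, C)
    # Global average pooling: one scalar per channel.
    z = f_cat.mean(axis=(0, 1))                      # (C,)
    # Two fully connected layers: ReLU in between, Sigmoid at the output.
    hidden = np.maximum(z @ w1 + b1, 0.0)            # (C // 2,)
    weights = sigmoid(hidden @ w2 + b2)              # (C,), each in (0, 1)
    # Channel-by-channel reweighting (broadcast over H and W).
    f_weighted = f_cat * weights
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    f_global = f_weighted @ w_reduce                 # (H, W, C_out)
    return f_global, weights

rng = np.random.default_rng(0)
H = W = 8                          # stand-in for 512 x 512 to keep the demo light
branch_channels = (64, 96, 64)     # hypothetical split; only the total (224) is given
C = sum(branch_channels)
branches = [rng.standard_normal((H, W, c)) for c in branch_channels]
w1, b1 = rng.standard_normal((C, C // 2)) * 0.05, np.zeros(C // 2)   # 224 -> 112
w2, b2 = rng.standard_normal((C // 2, C)) * 0.05, np.zeros(C)        # 112 -> 224
w_reduce = rng.standard_normal((C, 64)) * 0.05                       # assumed 224 -> 64

f_global, weights = channel_attention_fusion(branches, w1, b1, w2, b2, w_reduce)
```

Because the attention weights depend only on the pooled channel statistics, the cost of the attention branch is independent of the spatial resolution; only the final reweighting and the 1×1 reduction touch every pixel.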

[0036] This layer consists of three steps: frame feature encoding, temporal attention weighting, and keyframe classification, as detailed below:
I. Frame Feature Encoding
For the global feature tensor $F_{global}$ of each frame, global average pooling combined with a fully connected layer converts the two-dimensional feature tensor into a one-dimensional feature vector, achieving standardized encoding of frame features: $f = FC(\mathrm{GAP}(F_{global}))$, where $FC$ is a fully connected layer with dimension 64 to 128 and $f \in \mathbb{R}^{128}$ is the feature vector of a single frame.

[0037] Taking 16 frames as the time window (a value optimized experimentally to match the operating rhythm of laparoscopic colorectal surgery), the feature vectors of $N = 16$ consecutive frames are combined to form a temporal feature sequence: $S = [f_1, f_2, \ldots, f_N]$, where $f_t$ is the feature vector of the $t$-th frame, $t = 1, 2, \ldots, N$.
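
As a rough illustration of the frame encoding and windowing just described, the sketch below pools each frame's global tensor, projects it from 64 to 128 dimensions, and groups the resulting vectors into 16-frame windows. The function names, the random weights, and the choice of non-overlapping windows are assumptions; the text does not state whether windows overlap.

```python
import numpy as np

def encode_frame(f_global, w_fc, b_fc):
    """(H, W, 64) global tensor -> 128-d frame vector via GAP + one FC layer."""
    pooled = f_global.mean(axis=(0, 1))       # (64,)
    return pooled @ w_fc + b_fc               # (128,)

def make_windows(frame_vectors, window=16):
    """Group consecutive frame vectors into non-overlapping windows.

    Trailing frames that do not fill a whole window are dropped (an assumption).
    """
    frame_vectors = np.asarray(frame_vectors)
    dim = frame_vectors.shape[1]
    n_windows = len(frame_vectors) // window
    usable = frame_vectors[: n_windows * window]
    return usable.reshape(n_windows, window, dim)

rng = np.random.default_rng(0)
w_fc = rng.standard_normal((64, 128)) * 0.1
b_fc = np.zeros(128)
frames = [rng.standard_normal((4, 4, 64)) for _ in range(40)]   # toy video of 40 frames
vectors = [encode_frame(f, w_fc, b_fc) for f in frames]
windows = make_windows(vectors, window=16)
```

With 40 frames and a window of 16, the example yields two full windows and leaves the last 8 frames unused; a sliding window would be an equally plausible reading of the text.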

[0038] II. Temporal Attention Weighting
The core of this module is a self-attention mechanism combined with clinical prior knowledge, which learns the attention weights of the different frames in the temporal sequence, emphasizing the features of keyframes and suppressing the redundant features of non-keyframes. The specific calculation process is as follows:
Self-attention calculation: a simplified multi-head attention with 4 heads captures the temporal dependencies between frames (such as the correlation between the localization frame before dissection and the dissection frame) and yields the self-attention weights $A$: $A = \mathrm{Softmax}\left(\frac{Q K^{T}}{\sqrt{d}}\right)$, where $Q$, $K$, and $V$ are the query, key, and value matrices, $d$ is the dimension of the feature vector, and Softmax is the normalization function. The temporal feature sequence after self-attention weighting is $S' = A V$.
Clinical prior weight correction: prior weights for the surgical steps are introduced (annotated by clinicians; for example, the prior weight of a Toldt's space entry frame is 0.9 and that of a non-keyframe is 0.1) and used to adjust the self-attention weights, improving the clinical relevance of the features.
Temporal feature weighted fusion: the corrected attention weights are multiplied by the self-attention-weighted feature sequence $S'$ to obtain the final weighted temporal feature sequence.
III. Keyframe Classification and Confidence Scoring
The weighted temporal feature sequence is classified, and the surgical step label and confidence score of each frame are output. The specific design is as follows:
Classifier architecture: a lightweight classifier with two fully connected layers and a Dropout layer, which prevents overfitting:
Dropout layer: dropout rate 0.3, randomly discarding 30% of neurons;
FC1: first fully connected layer, dimension 128 to 64, ReLU activation;
FC2: second fully connected layer, dimension 64 to 8, Softmax activation (corresponding to the 8 classes of core keyframes).
Classification and confidence calculation: the Softmax activation yields the probability distribution of each frame over the 8 keyframe classes. The maximum probability is taken as the confidence score of the frame, and the class with the maximum probability is its surgical step label: $c_t = \max_k p_t(k)$, $y_t = \arg\max_k p_t(k)$, where $p_t$ is the probability distribution of the $t$-th frame, $c_t$ is the confidence score of the $t$-th frame, and $y_t$ is the surgical step label of the $t$-th frame.
Preliminary keyframe determination: an initial confidence threshold is set; if the confidence score $c_t$ of a frame reaches this threshold, the frame is taken as a preliminary keyframe, and otherwise as a non-keyframe.
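
The temporal attention and classification steps above can be sketched as follows. This is an illustrative reading under stated assumptions, not the patented computation: the text does not give the prior-correction formula, so the sketch averages the 4 heads' weights, scales the attention paid to each frame by its clinical prior, and renormalises each row. The random weights, the keyframe positions, and the initial threshold of 0.5 are likewise hypothetical, and Dropout is omitted because it is inactive at inference.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(seq, wq, wk, wv, num_heads=4):
    """seq: (N, D). Returns attended sequence (N, D) and weights (heads, N, N)."""
    n, d = seq.shape
    d_head = d // num_heads

    def split_heads(x):
        # (N, D) -> (heads, N, d_head)
        return x.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = (split_heads(seq @ w) for w in (wq, wk, wv))
    # Scaled dot-product attention; scaling by the per-head width is the usual choice.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)
    attended = (attn @ v).transpose(1, 0, 2).reshape(n, d)
    return attended, attn

def correct_with_prior(attn, prior):
    """ASSUMPTION: scale attention to each frame by its prior, renormalise rows."""
    scaled = attn * prior[None, :]
    return scaled / scaled.sum(axis=-1, keepdims=True)

def classify(seq, w1, b1, w2, b2):
    """FC 128->64 with ReLU, then FC 64->8 with Softmax."""
    hidden = np.maximum(seq @ w1 + b1, 0.0)
    probs = softmax(hidden @ w2 + b2)
    return probs, probs.max(axis=-1), probs.argmax(axis=-1)

rng = np.random.default_rng(0)
N, D = 16, 128
seq = rng.standard_normal((N, D))
wq, wk, wv = (rng.standard_normal((D, D)) * 0.05 for _ in range(3))
prior = np.full(N, 0.1)      # non-keyframe prior from the text
prior[[5, 6]] = 0.9          # hypothetical clinician-marked keyframe positions

attended, attn = multi_head_self_attention(seq, wq, wk, wv)
corrected = correct_with_prior(attn.mean(axis=0), prior)   # head averaging is an assumption
weighted = corrected @ attended                            # final weighted sequence (N, D)

w1, b1 = rng.standard_normal((D, 64)) * 0.05, np.zeros(64)
w2, b2 = rng.standard_normal((64, 8)) * 0.05, np.zeros(8)
probs, confidence, labels = classify(weighted, w1, b1, w2, b2)
is_preliminary = confidence >= 0.5                         # hypothetical initial threshold
```

Multiplying by the prior before renormalising keeps each row a valid weight distribution while shifting mass toward the frames that clinicians marked as important.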

[0039] The core objective of the dynamic threshold screening layer is to provide a scene-adaptive dynamic threshold screening algorithm for keyframes in laparoscopic colorectal surgery that have ambiguous features but surgical significance (such as localization frames in which the IMA is not completely dissected, or Toldt's space entry frames under instrument occlusion). This algorithm replaces traditional fixed-threshold screening, improves the coverage and accuracy of keyframes, and addresses the false selections and missed keyframes caused by fixed thresholds.

[0040] This layer consists of three steps: keyframe scene clustering, dynamic threshold calculation, and final keyframe selection, as detailed below:
I. Keyframe Scene Clustering
Based on the complexity of the surgical scene, the preliminary keyframes are divided into three categories. The clustering criterion is the feature similarity $s$ between a keyframe and the standard frame of its surgical step (the closer $s$ is to 1, the clearer the features; the closer $s$ is to 0, the more blurred the features). Feature similarity is calculated as cosine similarity: $s = \frac{f \cdot f_{std}}{\|f\|_2 \, \|f_{std}\|_2}$, where $f$ is the feature vector of the keyframe, $f_{std}$ is the standard frame feature vector for the surgical step (a typical frame annotated by a clinician), and $\|\cdot\|_2$ denotes the L2 norm of a vector.

[0041] Scene clustering criteria:
Simple scenes: no occlusion, no deformation, no oil fog; features are clear.
Medium scenes: slight occlusion or deformation; features are relatively clear.
Complex scenes: severe occlusion, deformation, or oil fog; features are blurred.
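
The similarity-based scene clustering above can be sketched as below. The cosine similarity follows the description in the text; the cut points of 0.8 and 0.5 between the three scene categories are hypothetical placeholders, since the actual similarity ranges are not reproduced in this text.

```python
import numpy as np

def cosine_similarity(f, f_std):
    """Cosine similarity between a keyframe vector and the standard-frame vector."""
    denom = np.linalg.norm(f) * np.linalg.norm(f_std)
    return float(f @ f_std / denom) if denom > 0 else 0.0

def scene_category(similarity, simple_min=0.8, medium_min=0.5):
    """Map a similarity score to a scene category (cut points are hypothetical)."""
    if similarity >= simple_min:
        return "simple"
    if similarity >= medium_min:
        return "medium"
    return "complex"

f_std = np.array([1.0, 2.0, 3.0])
same_direction = cosine_similarity(2.0 * f_std, f_std)
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

Cosine similarity ignores vector magnitude, so a frame whose features point in the same direction as the standard frame scores close to 1 regardless of overall activation strength.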

[0042] II. Dynamic Threshold Calculation
For each scene category, an adaptive threshold function is designed in which the threshold is adjusted dynamically according to the feature similarity, so that keyframes are not missed in complex scenes and low-quality frames are filtered out in simple scenes. The dynamic threshold formula involves the following quantities:
• the scene-adaptive dynamic threshold;
• the baseline confidence threshold;
• the threshold adjustment coefficient (obtained through extensive experimental optimization);
• the feature similarity between the frame and the standard frame.

[0043] Based on the above formula, a separate dynamic threshold is obtained for each of the simple, medium, and complex scene categories.

[0044] III. Final Keyframe Selection and Output
The dynamic threshold of each scene category is used to perform a second screening of the preliminary keyframes, with the following rule: if the confidence score of a frame reaches the dynamic threshold, the frame is determined to be a final keyframe; if the score falls below the dynamic threshold but remains within the low-confidence range, the frame is a low-confidence keyframe, which is marked with a manual review prompt and must be confirmed by a doctor before being included in the keyframe set; otherwise, the frame is not a keyframe and is discarded.

[0045] The final output is a keyframe set that contains, for each frame: the keyframe image, the surgical step label, the confidence score, the feature similarity, the scene complexity label, and a manual review prompt (if applicable).
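
The three-way screening rule and the output record described above might look like the sketch below. The threshold formula itself is not reproduced in this text, so the linear form used here (baseline minus a coefficient times one minus similarity), the baseline of 0.7, the coefficient of 0.2, and the review margin of 0.1 are all assumptions chosen only to show the intended behaviour: lower similarity relaxes the threshold.

```python
from dataclasses import dataclass

@dataclass
class KeyframeRecord:
    """One entry of the output keyframe set (frame_index stands in for the image)."""
    frame_index: int
    step_label: int
    confidence: float
    similarity: float
    scene: str
    needs_review: bool

def dynamic_threshold(similarity, t_base=0.7, alpha=0.2):
    """ASSUMED form: the threshold drops as the frame gets less similar."""
    return t_base - alpha * (1.0 - similarity)

def screen(confidence, similarity, review_margin=0.1):
    """Return 'final', 'review', or 'discard' for a preliminary keyframe."""
    threshold = dynamic_threshold(similarity)
    if confidence >= threshold:
        return "final"
    if confidence >= threshold - review_margin:   # hypothetical low-confidence band
        return "review"
    return "discard"

def build_record(frame_index, step_label, confidence, similarity, scene):
    """Create a record for kept frames; discarded frames yield None."""
    decision = screen(confidence, similarity)
    if decision == "discard":
        return None
    return KeyframeRecord(frame_index, step_label, confidence,
                          similarity, scene, needs_review=(decision == "review"))

clear_frame = screen(0.65, 1.0)      # same confidence, clear scene
blurred_frame = screen(0.65, 0.5)    # same confidence, blurred scene
record = build_record(42, 3, 0.55, 0.5, "complex")
```

In this sketch a confidence of 0.65 is only sent for review in a clear scene but accepted outright in a blurred one, which is the behaviour the text attributes to scene-adaptive thresholds.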

[0046] Specifically, in this embodiment, an end-to-end intelligent recognition system can be built based on the above keyframe recognition method. It can be directly deployed in the laparoscopic surgery workstation of the hospital to realize automated processing of surgical videos, keyframe extraction and dataset export.

[0047] This embodiment proposes an intelligent segmentation method for laparoscopic colorectal surgery images. The method acquires images during laparoscopic colorectal surgery, splits these images into frames, and annotates keyframes and presets Regions of Interest (ROIs) on the frame-split images to achieve segmentation of the laparoscopic colorectal surgery images. Working from real-time laparoscopic images during the inferior mesenteric artery dissection stage, the method provides guidance for locating the inferior mesenteric artery and automatically segments the inferior mesenteric artery, Toldt's space, and the instrument forceps, effectively realizing intelligent segmentation of laparoscopic colorectal surgery images.

[0048] The intelligent segmentation method for laparoscopic colorectal surgery images proposed in this embodiment achieves intelligent keyframe recognition based on multimodal feature fusion and temporal attention. It constructs a keyframe feature system and a core keyframe classification specific to laparoscopic colorectal surgery that closely match the surgical workflow. It integrates features along three dimensions (visual morphology, instrument operation, and scene context) and uses an attention mechanism to increase the weight of core features, improving both the average recognition accuracy of core keyframes and the recognition accuracy in complex scenes. It adopts scene-adaptive dynamic threshold screening to address the false selections and omissions caused by fixed thresholds, improving keyframe coverage of the core surgical steps so that landmark keyframes are not missed. It also automates keyframe extraction with high processing efficiency, so that large-scale, standardized keyframe datasets can be constructed quickly to meet the needs of AI model training.

[0049] Embodiment 2
This embodiment provides an intelligent segmentation system for laparoscopic colorectal surgery images. The system employs the intelligent segmentation method for laparoscopic colorectal surgery images proposed in Embodiment 1 above and includes: a target image acquisition module, used to acquire images during laparoscopic colorectal surgery; a frame splitting processing module, used to split those images into frames; and an image segmentation module, used to annotate keyframes and preset ROIs on the frame-split images to achieve segmentation of the laparoscopic colorectal surgery images.

[0050] For example, the target image acquisition module is specifically used to acquire images during a laparoscopic colorectal surgery performed following a standardized procedure, and to select the images from the incision point of the inferior mesenteric artery to the completion of the block.

[0051] For example, the preset ROI includes, but is not limited to, the inferior mesenteric artery, Toldt's space, and instrument forceps.

[0052] The intelligent segmentation system for laparoscopic colorectal surgery images proposed in this embodiment works from real-time laparoscopic images during the inferior mesenteric artery dissection stage: it provides guidance for locating the inferior mesenteric artery and automatically segments the inferior mesenteric artery, Toldt's space, and the instrument forceps, effectively realizing intelligent segmentation of laparoscopic colorectal surgery images. The system achieves intelligent keyframe recognition based on multimodal feature fusion and temporal attention. It constructs a keyframe feature system and a core keyframe classification specific to laparoscopic colorectal surgery that closely match the surgical workflow. It integrates features along three dimensions (visual morphology, instrument operation, and scene context) and uses an attention mechanism to increase the weight of core features, improving both the average recognition accuracy of core keyframes and the recognition accuracy in complex scenes. It adopts scene-adaptive dynamic threshold screening to address the false selections and omissions caused by fixed thresholds, improving keyframe coverage of the core surgical steps so that landmark keyframes are not missed. It also automates keyframe extraction with high processing efficiency, so that large-scale, standardized keyframe datasets can be constructed quickly to meet the needs of AI model training.

[0053] Embodiment 3
This embodiment provides an electronic device, which includes a processor and a memory connected to the processor. The memory stores program data, and the processor retrieves the program data stored in the memory to execute the intelligent segmentation method for laparoscopic colorectal surgery images as provided in Embodiment 1 above.

[0054] Embodiment 4
This embodiment provides a computer-readable storage medium storing computer-executable instructions. When executed by a processor, these computer-executable instructions implement the intelligent segmentation method for laparoscopic colorectal surgery images as provided in Embodiment 1 above.

[0055] It should be noted that the aforementioned readable storage medium can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk. The readable storage medium can be any available medium accessible to general-purpose or special-purpose computers.

[0056] Note that the above description is merely a preferred embodiment of the present invention and the technical principles employed. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described herein, and various obvious changes, readjustments, and substitutions can be made without departing from the scope of protection of the present invention. Therefore, although the present invention has been described in detail through the above embodiments, the present invention is not limited to the above embodiments, and may include many other equivalent embodiments without departing from the concept of the present invention, the scope of which is determined by the scope of the appended claims.

Claims

1. A method for intelligent segmentation of images in laparoscopic colorectal surgery, characterized in that the method comprises: acquiring images during laparoscopic colorectal surgery; splitting the images acquired during the laparoscopic colorectal surgery into frames; and annotating keyframes and presetting ROIs on the frame-split images to achieve segmentation of the laparoscopic colorectal surgery images.

2. The intelligent segmentation method for laparoscopic colorectal surgery images according to claim 1, characterized in that acquiring the images during laparoscopic colorectal surgery comprises: acquiring images during a laparoscopic colorectal surgery performed following a standardized procedure, and selecting the images from the inferior mesenteric artery to the completion of the block.

3. The intelligent segmentation method for laparoscopic colorectal surgery images according to claim 2, characterized in that the preset ROI includes, but is not limited to, the inferior mesenteric artery, Toldt's space, and instrument forceps.

4. The intelligent segmentation method for laparoscopic colorectal surgery images according to any one of claims 1 to 3, characterized in that the step of annotating keyframes and presetting ROIs on the frame-split images to achieve segmentation of the laparoscopic colorectal surgery images comprises: constructing a keyframe feature system for laparoscopic colorectal surgery; extracting features along three dimensions in parallel using a multi-branch heterogeneous network to obtain an instrument feature tensor, a tissue feature tensor, and a scene feature tensor; and fusing the heterogeneous feature tensors into a global feature tensor using an adaptive channel attention fusion module.

5. The intelligent segmentation method for laparoscopic colorectal surgery images according to claim 4, characterized in that the step of fusing the heterogeneous feature tensors into a global feature tensor using an adaptive channel attention fusion module further comprises: encoding the global feature tensor into a temporal feature sequence, weighting and fusing the sequence through the surgical step temporal attention module, and inputting the result into a classifier to obtain preliminary keyframes, surgical step labels, and confidence scores; and, through scene-adaptive dynamic threshold screening, clustering the preliminary keyframes by feature similarity, calculating the dynamic threshold for each scene, and outputting the final keyframe set after secondary screening.

6. An intelligent image segmentation system for laparoscopic colorectal surgery, characterized in that the system employs the intelligent segmentation method for laparoscopic colorectal surgery images as described in claim 1 and comprises: a target image acquisition module, used to acquire images during laparoscopic colorectal surgery; a frame splitting processing module, used to split the images acquired during the laparoscopic colorectal surgery into frames; and an image segmentation module, used to annotate keyframes and preset ROIs on the frame-split images to achieve segmentation of the laparoscopic colorectal surgery images.

7. The intelligent image segmentation system for laparoscopic colorectal surgery according to claim 6, characterized in that acquiring the images during laparoscopic colorectal surgery comprises: acquiring images during a laparoscopic colorectal surgery performed following a standardized procedure, and selecting the images from the inferior mesenteric artery to the completion of the block.

8. The intelligent image segmentation system for laparoscopic colorectal surgery according to claim 7, characterized in that the preset ROI includes, but is not limited to, the inferior mesenteric artery, Toldt's space, and instrument forceps.

9. An electronic device, characterized in that, The electronic device includes a processor and a memory connected to the processor, wherein the memory stores program data, and the processor retrieves the program data stored in the memory to execute the intelligent segmentation method for laparoscopic colorectal surgery images as described in claim 1.

10. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores computer-executable instructions, which, when executed by a processor, are used to implement the intelligent segmentation method for laparoscopic colorectal surgery images as described in claim 1.