A method and system for stage recognition based on fusion of surgical instrument features
By constructing a spatiotemporal-instrument cross-attention fusion mechanism and adaptive modulation technology, the robustness and accuracy issues of surgical stage recognition methods in complex scenarios were solved, and stable recognition of surgical stages was achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHENGDU WITHAI INNOVATION TECH CO LTD
- Filing Date
- 2026-05-28
- Publication Date
- 2026-06-26
AI Technical Summary
Existing surgical stage identification methods have poor robustness in complex scenarios, struggle to fully integrate instrument and video spatiotemporal features, and fail to adequately utilize multi-scale features, resulting in low identification accuracy.
By constructing a spatiotemporal-instrument cross-attention fusion mechanism, deep spatiotemporal features and instrument features are integrated. The cross-attention mechanism and adaptive modulation technology are used to enhance the interaction between instrument semantics and the video scene, and multi-scale features are extracted and adaptively modulated.
It significantly improves the accuracy and robustness of surgical stage identification, enabling stable and reliable identification of surgical stages in complex scenarios and adapting to the dynamic needs of clinical surgery.
Smart Images

Figure CN122289620A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of video processing technology, and more specifically to a stage recognition method based on surgical instrument feature fusion.
Background Technology
[0002] Surgical stage identification is a key technology in intelligent surgical navigation and intraoperative decision support systems. Its core function is to automatically identify the current surgical stage from a continuous surgical video stream, providing surgeons with auxiliary information such as progress assessment, risk warnings, and postoperative analysis. Accurate identification of surgical stages is crucial for improving surgical safety and optimizing surgical procedures.
[0003] Currently, methods for identifying surgical stages primarily rely on temporal modeling based on the visual features of surgical videos. Common technical approaches include: extracting visual features from single-frame images using convolutional neural networks, then modeling temporal dependencies using recurrent neural networks, temporal convolutional networks, or Transformer models to ultimately classify surgical stages. With the development of deep learning technology, Transformer-based temporal models have achieved good results in this field due to their powerful long-range dependency modeling capabilities.
[0004] However, existing methods still have the following shortcomings. On the one hand, most existing technologies treat surgical stage recognition as a pure video classification task and rely mainly on the visual appearance of the surgical scene. The differences between surgical stages, however, are often reflected in changes in instrument type, instrument combination, and instrument operation mode. Visual features alone struggle to capture this discriminative information, which is strongly tied to instrument operation, resulting in poor recognition robustness in complex scenarios such as instrument occlusion, field-of-view switching, and blood contamination.
[0005] On the other hand, although some studies have attempted to introduce instrument detection information, they usually treat the instrument detection results as auxiliary features and simply concatenate or add them. They lack in-depth modeling of the interaction between instruments and the surgical scene, and the semantic information of instruments is not fully integrated with the video spatiotemporal features, making it difficult to exploit the guiding role of instrument features in stage recognition.
[0006] Furthermore, existing methods still fall short in utilizing multi-scale features. Surgical videos contain multi-level semantic information ranging from global anatomical structures to local instrument manipulations. How to effectively extract and fuse spatiotemporal features at different scales to better serve stage recognition tasks remains a pressing technical problem to be solved in this field.
[0007] Therefore, we propose a method that can fully integrate instrument semantics and video spatiotemporal information to improve the accuracy and robustness of surgical stage identification.
Summary of the Invention
[0008] The purpose of this invention is to provide a stage identification method and system based on surgical instrument feature fusion, which addresses the inadequate use of instrument data in traditional approaches and the resulting low accuracy of surgical stage identification.
[0009] This invention is achieved through the following technical solution: A stage identification method based on surgical instrument feature fusion, specifically including: receiving a surgical video frame sequence, performing preprocessing and preliminary encoding, and generating initial spatiotemporal features; processing the initial spatiotemporal features with a multi-layer neural network to extract deep spatiotemporal features at multiple scales; detecting surgical instruments and extracting instrument features based on the surgical video frame sequence; constructing a spatiotemporal-instrument cross-attention fusion algorithm, and using the cross-attention mechanism in the algorithm to fuse the deep spatiotemporal features and the instrument features to generate fused spatiotemporal features; performing instrument perception adaptive modulation on the fused spatiotemporal features to generate instrument perception fusion features; and outputting the judgment result of the current stage of the surgery based on the instrument perception fusion features.
[0010] Furthermore, the specific steps for detecting the surgical instruments and extracting instrument features are as follows: an instrument feature map of each frame is extracted using an instrument branch feature extraction network; based on the instrument bounding box, instrument region features are cropped from the instrument feature map using the ROIAlign (Region of Interest Align) operation; the instrument region features are converted into instrument embedding vectors through global average pooling and linear mapping; and the geometric information of the instrument bounding box is encoded into a geometric position code, which is added to the instrument embedding vector to generate the instrument feature.
[0011] Furthermore, the specific steps for generating the fused spatiotemporal features are as follows: Using deep spatiotemporal features as queries and instrument features as keys and values, perform instrument-guided video feature enhancement to obtain updated video features; Using instrument features as the query and deep spatiotemporal features as the key and value, perform scene-aware instrument feature completion to obtain updated instrument features; Temporal consistency modeling is performed on updated video features and updated instrument features along the time dimension, and fused spatiotemporal features are output.
[0012] Furthermore, before the instrument-guided video feature enhancement is performed with deep spatiotemporal features as queries and instrument features as keys and values, the method also includes: unfolding the deep spatiotemporal features along the spatial dimension to generate a spatial token sequence.
[0013] Furthermore, the temporal consistency modeling uses a multi-head self-attention mechanism to model the temporal feature sequences of the same spatial location or the same instrument along the time dimension.
[0014] Furthermore, the specific steps for generating the instrument perception fusion features are as follows: aggregating the instrument features to generate an instrument semantic summary; generating channel-level modulation signals based on the instrument semantic summary, and performing channel modulation on the fused spatiotemporal features; constructing a spatial prior distribution of the instruments based on the instrument detection results, and performing spatial modulation on the fused spatiotemporal features; and residually fusing the features after channel modulation and spatial modulation with the fused spatiotemporal features to output the instrument perception fusion features.
[0015] Furthermore, the channel modulation includes: mapping the instrument semantic summary to channel gating weights using the Sigmoid activation function, and applying the weights to each channel of the fused feature map.
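The channel modulation described above can be sketched in plain Python. This is a minimal illustration, not the claimed implementation: the single linear layer before the Sigmoid, and the names `channel_gate`, `apply_channel_gate`, `weight`, and `bias`, are assumptions made for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_gate(summary, weight, bias):
    """Map an instrument semantic summary (length D) to per-channel gates (length C).

    weight is a list of C rows of length D; a single linear layer is an assumption.
    """
    gates = []
    for row, b in zip(weight, bias):
        z = sum(w * s for w, s in zip(row, summary)) + b
        gates.append(sigmoid(z))
    return gates

def apply_channel_gate(feature_map, gates):
    """Scale every value of channel c by gates[c]; feature_map is C x H x W nested lists."""
    return [[[v * g for v in row] for row in channel]
            for channel, g in zip(feature_map, gates)]
```

Because each gate lies in (0, 1), a channel can only be attenuated or passed through, never amplified, which is the usual behavior of Sigmoid gating.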
[0016] Furthermore, the spatial modulation includes: rasterizing the set of instrument bounding boxes into a spatial mask; and concatenating the channel-modulated features with the spatial mask, then applying convolution and Sigmoid activation to generate a spatial gating weight map, which is used to weight the spatial locations of the feature map.
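A small sketch of the spatial modulation may help make the data flow concrete. It assumes boxes are already expressed in feature-map grid coordinates with half-open extents, and it uses a 1×1 convolution as the simplest possible convolution. The text specifies neither the coordinate convention nor the kernel size, and all function and parameter names are hypothetical.

```python
import math

def rasterize_boxes(boxes, height, width):
    """Binary H x W mask: 1.0 where any box (x1, y1, x2, y2) covers the cell.

    Boxes are in grid coordinates; x2 and y2 are exclusive (an assumed convention).
    """
    mask = [[0.0] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        for y in range(max(0, y1), min(height, y2)):
            for x in range(max(0, x1), min(width, x2)):
                mask[y][x] = 1.0
    return mask

def spatial_gate(features, mask, kernel, bias):
    """1x1 convolution over [features; mask] followed by a sigmoid, giving an H x W gate map."""
    channels = features + [mask]              # concatenate along the channel axis
    height, width = len(mask), len(mask[0])
    gate = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            z = bias + sum(k * ch[y][x] for k, ch in zip(kernel, channels))
            gate[y][x] = 1.0 / (1.0 + math.exp(-z))
    return gate

def apply_spatial_gate(features, gate):
    """Weight every spatial location of each channel by the gate map."""
    return [[[v * g for v, g in zip(row, gate_row)] for row, gate_row in zip(ch, gate)]
            for ch in features]
```

With a large positive kernel weight on the mask channel, the gate approaches 1 inside instrument boxes and 0 elsewhere, which is one way the detection prior could steer attention toward instrument regions.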
[0017] Furthermore, the specific steps for outputting the determination result of the current stage of the surgery based on the instrument perception fusion features are as follows: global average pooling is applied to the instrument perception fusion features to generate stage determination feature vectors; temporal modeling of the stage determination feature vectors across multiple frames is performed using a temporal Transformer; and the probability distribution of the surgical stage to which the current frame belongs is output through a linear classification layer.
[0018] A stage recognition system based on surgical instrument feature fusion includes: The video input and encoding unit is used to receive surgical video frame sequences and generate initial spatiotemporal feature representations; The spatiotemporal feature extraction unit is used to extract deep spatiotemporal features at multiple scales. The instrument detection and instrument feature extraction unit is used to detect surgical instruments and extract instrument features based on surgical video frame sequences; A cross-attention fusion unit is used to fuse deep spatiotemporal features with the instrument features using a cross-attention mechanism to generate fused spatiotemporal features. The instrument perception modulation unit is used to adaptively modulate the fused spatiotemporal features to generate instrument perception fusion features. The output unit is used to output the judgment result of the current stage of the surgery based on the instrument perception fusion characteristics.
[0019] The technical solution of the present invention has at least the following advantages and beneficial effects: This invention discloses a stage recognition method and system based on surgical instrument feature fusion. By constructing a spatiotemporal-instrument cross-attention fusion mechanism, deep spatiotemporal features and instrument features are deeply fused, realizing bidirectional interaction between instrument semantics and video scene. This mechanism enables video features to focus on the spatial region related to the current instrument operation, while enabling instrument features to perceive the stage context of the current surgical scene, thereby significantly enhancing the expressive power and discriminative power of stage discrimination features, and effectively improving the accuracy and robustness of surgical stage recognition.
[0020] In addition, this method introduces adaptive modulation of instrument perception on the basis of feature fusion. Through the gating mechanism of channel and spatial dimensions, the fused features are enhanced in a targeted manner, which further highlights the key information related to the surgical stage and suppresses irrelevant background interference, making the stage determination more stable and reliable.
[0021] Furthermore, during the instrument feature extraction process, the ROIAlign operation is used to accurately crop the instrument region features, and combined with the geometric information encoding of the instrument bounding box, the instrument features can carry rich spatial location and scale information, providing a high-quality semantic representation of the instrument for subsequent cross-attention fusion.
Attached Figure Description
[0022] Figure 1 is a schematic diagram of a stage recognition method based on surgical instrument feature fusion according to the present invention; Figure 2 is a schematic diagram of a stage recognition system based on surgical instrument feature fusion according to the present invention; Figure 3 is a schematic diagram of the structure of the cross-attention fusion unit of the present invention; Figure 4 is a schematic diagram of the structure of the instrument perception modulation unit of the present invention.
Detailed Implementation
[0023] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. The components of the embodiments of the present invention described and shown in the accompanying drawings can generally be arranged and designed in various different configurations.
[0024] Example 1. As shown in Figure 1, the stage recognition method based on surgical instrument feature fusion specifically includes: S1. Receive the surgical video frame sequence, perform preprocessing and preliminary encoding, and generate initial spatiotemporal features. The video frame sequence directly carries all the dynamic information of the surgical operation (instrument movement, tissue interaction, changes in the timing of the operation process, etc.). The consistency and validity of the video data directly determine the accuracy of subsequent feature extraction and the real-time performance of system recognition. The input video is therefore standardized first, to eliminate irrelevant interference and unify the data specifications, providing high-quality input for subsequent steps. Moreover, the video frame sequence comes from real-time surgical video streams from different types of endoscopic equipment (laparoscopes, thoracoscopes, etc.). Because of differences in equipment models and surgical viewing angles, these video streams have inconsistent resolutions and frame rates, which can easily lead to deviations in subsequent feature extraction.
[0025] The preprocessing and preliminary encoding processes specifically include: First, all input videos are uniformly adjusted to a resolution of 640×360 pixels, and a combination of center cropping and bilinear interpolation is then used to output a standard input image of 384×224 pixels. This resolution preserves key information such as instrument edges and tissue details while keeping the amount of computation under control, balancing feature extraction accuracy and system real-time performance. Secondly, because surgical videos have a high frame rate, processing every frame would sharply increase computation and recognition delay, whereas too low a sampling frequency would lose critical operation frames and cause misjudgments in stage identification. An adaptive frame sampling strategy is therefore adopted: 3 frames are sampled per second during normal operation, and this is automatically increased to 5 frames per second during operation switching. The operation stage type is determined in real time using frame interpolation, forming a continuous and efficient surgical video frame sequence. This approach balances computational efficiency and temporal resolution while fully capturing the continuous dynamic process of the surgical operation, providing a stable and effective data source for subsequent spatiotemporal feature extraction.
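The adaptive sampling rule (3 frames per second normally, 5 during operation switching) can be illustrated as follows. This is a sketch under stated assumptions: the per-second `switching` flags are a hypothetical input, since the text does not say how a switch is signaled to the sampler, and samples are spaced evenly within each second.

```python
def sample_indices(num_frames, fps, switching, normal_rate=3, switch_rate=5):
    """Pick frame indices second by second, using a denser rate during switching.

    switching[s] is a hypothetical per-second flag saying whether an operation
    switch is in progress during second s.
    """
    indices = []
    for second, is_switch in enumerate(switching):
        rate = switch_rate if is_switch else normal_rate
        start = second * fps
        for j in range(rate):
            idx = start + (j * fps) // rate   # evenly spaced within the second
            if idx < num_frames:
                indices.append(idx)
    return indices
```

For a 30 fps stream, a normal second yields offsets 0, 10, 20 and a switching second yields offsets 0, 6, 12, 18, 24.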
[0026] Finally, two auxiliary preprocessing operations are performed on each sampled image frame to further improve the quality of the input data. First, pixel normalization: the image pixel values are mapped from the [0,255] interval to the [0,1] interval by linear scaling, eliminating interference caused by changes in illumination (adjustment of the endoscope light source angle, reflections from tissue bleeding) and differences in pixel intensity, so that the feature extraction baseline is consistent across frames. Second, abnormal frame screening: the frame difference method and the gray-scale variance method are combined to identify and remove blurred frames, black-screen frames, and motion-blurred frames. The specific criterion is: when the frame difference between two consecutive frames exceeds a preset threshold (adaptively adjusted according to the surgical scenario, ranging from 0.1 to 0.3) and the gray-scale variance is less than 50, the frame is judged to be abnormal. This avoids feature extraction distortion caused by abnormal frames and ensures the validity of the input frame sequence.
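The normalization and screening rule can be written out as a short sketch. Two points are assumptions rather than statements from the text: the frame difference is taken as the mean absolute difference of normalized pixels (which matches the 0.1 to 0.3 threshold range), and the gray-scale variance is computed on the raw 0 to 255 scale (which matches the threshold of 50). Frames are plain nested lists of gray levels.

```python
def normalize(frame):
    """Linearly scale 8-bit gray levels from [0, 255] to [0, 1]."""
    return [[p / 255.0 for p in row] for row in frame]

def mean_abs_diff(a, b):
    """Mean absolute difference between two frames of equal size."""
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))

def gray_variance(frame):
    """Variance of the raw 8-bit gray levels (assumed scale for the threshold of 50)."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) ** 2 for p in pixels) / len(pixels)

def is_abnormal(prev, cur, diff_threshold=0.2, variance_threshold=50.0):
    """Flag cur when it differs strongly from prev AND has low gray-scale variance."""
    diff = mean_abs_diff(normalize(prev), normalize(cur))
    return diff > diff_threshold and gray_variance(cur) < variance_threshold
```

Both conditions must hold, so a sudden cut to a nearly uniform frame (for example a black screen) is rejected, while a large change to a well-textured frame is kept.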
[0027] S2. A multi-layer neural network is used to process the initial spatiotemporal features and extract multi-scale deep spatiotemporal features. Specifically, 3D ConvNeXt is used as the backbone network to extract multi-scale deep spatiotemporal features of surgical video frame sequences. Compared with traditional 3D CNN (Convolutional Neural Network), this network can capture the spatial details of images and the temporal dynamic correlation of videos more efficiently. First, the network input is constructed by selecting 10 consecutive preprocessed standard images to form a tensor with dimensions of 10×224×384×3 (where 10 is the number of time-dimension frames, 224×384 is the image spatial resolution, and 3 is the number of RGB channels). Selecting 10 frames as the time window allows for accurate capture of dynamic changes in short-term surgical operations (such as rapid instrument movement and instantaneous tissue interaction) while avoiding the surge in computational load and feature redundancy caused by an excessively long time window, thus adapting to the dynamic rhythm of clinical surgery.
[0028] Secondly, preliminary 3D feature encoding: The input tensor is feature-encoded using the initial encoding layer of the 3D ConvNeXt network, initially fusing two-dimensional image features with temporal dynamic features to generate an initial spatiotemporal feature map with dimensions of 10×56×96 (where 10 represents the time dimension, and 56 and 96 represent the spatial height and width of the feature map, respectively). This encoding process effectively extracts fundamental spatiotemporal information about the surgical scene, such as the trajectory of instruments, changes in tissue position, and the continuity of surgical actions, laying the foundation for subsequent multi-scale feature extraction.
[0029] Then, multi-scale feature layer extraction is performed: spatiotemporal features are extracted step by step through 5 stages. Each stage contains a different number of Transformer Blocks, and feature maps with resolutions of 1/4, 1/8, 1/16, 1/32, and 1/64 are extracted in sequence. Shallow features (1/4 and 1/8 resolutions) mainly capture low-level features such as instrument edges, tissue contours, and local operation details. Deep features (1/16, 1/32, and 1/64 resolutions) mainly capture high-level features such as the global context of the surgical scene, the temporal correlation of the operation process, and the interaction mode between tissues and instruments. This achieves multi-scale spatiotemporal feature capture from shallow to deep and from local to global, ensuring the comprehensiveness and hierarchy of features.
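As a quick arithmetic check of the five scales, the snippet below computes the spatial size of each stage for the 224×384 input stated earlier. The function name is hypothetical, and the values are exact quotients with no rounding applied.

```python
def stage_shapes(height=224, width=384, strides=(4, 8, 16, 32, 64)):
    """Spatial size of each stage's feature map as exact (possibly fractional) values."""
    return [(height / s, width / s) for s in strides]

for stride, (h, w) in zip((4, 8, 16, 32, 64), stage_shapes()):
    print(f"1/{stride}: {h} x {w}")
```

The 1/4 scale gives 56 × 96, matching the initial feature map size given in the text. At 1/64 the height is 224 / 64 = 3.5, which is not an integer, so an implementation would need padding or rounding at the deepest stage. The text does not say which.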
[0030] Finally, the spatiotemporal feature output is achieved by splicing and fusing the multi-scale features extracted from each stage to output the final multi-scale deep spatiotemporal features. These features contain spatial details and temporal dynamic correlation information of the surgical scene, which can fully reflect the continuous process of the surgical operation and provide high-quality scene feature input for subsequent deep fusion with instrument features.
[0031] This step accurately captures the spatiotemporal dynamic changes of surgical scenes, taking into account both local operational details and global scene context. It addresses the recognition bias caused by traditional feature extraction that focuses only on spatial features and ignores temporal dynamics, and it provides comprehensive scene-level feature support for stage recognition. At the same time, multi-scale feature fusion enhances the expressive power of the features, adapting to the recognition needs of complex surgical scenes.
[0032] S3. Based on the surgical video frame sequence, detect the surgical instruments and extract instrument features. Instruments are the core carriers of surgical operations, and their type, location, and movement correspond directly to surgical stages (for example, scalpels are mainly used in the incision stage, electrocautery in the hemostasis stage, and suture needles and needle holders in the suturing stage). Accurate extraction of instrument features is therefore crucial for improving stage recognition accuracy and distinguishing similar operational stages. The specific steps are as follows: First, for each frame in the video frame sequence, a YOLOv9 model fine-tuned on a surgical instrument dataset outputs the instrument bounding boxes in that frame, up to a maximum number per frame, and these boxes form the bounding box set of the frame.
[0033] Each bounding box in the set is identified by the time frame index and the index of the instrument within that frame, and records the coordinates of the top-left corner of the bounding box, the coordinates of the bottom-right corner, and the confidence level for the instrument type (threshold ≥ 0.7; areas below the threshold are judged to be non-instrument areas). The fine-tuned detection model can adapt to the morphological differences of different surgical instruments and to occlusion scenarios (such as instruments being occluded by tissue or multiple instruments overlapping), with a detection accuracy of ≥ 95%, providing an accurate spatial location basis for subsequent instrument feature extraction.
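The confidence filtering of detector output can be sketched independently of the detector itself. The tuple layout `(x1, y1, x2, y2, confidence)` and the optional `max_boxes` cap are assumptions for illustration. No YOLOv9 API is used or implied here.

```python
def filter_detections(detections, conf_threshold=0.7, max_boxes=None):
    """Keep boxes whose confidence meets the threshold, highest confidence first.

    Each detection is a tuple (x1, y1, x2, y2, confidence); the layout is assumed.
    """
    kept = [d for d in detections if d[4] >= conf_threshold]
    kept.sort(key=lambda d: d[4], reverse=True)
    return kept if max_boxes is None else kept[:max_boxes]
```

Detections below 0.7 are dropped, which corresponds to treating those areas as non-instrument areas.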
[0034] Secondly, an independent instrument branch feature extraction network performs feature extraction on each frame to generate an instrument feature map.
[0035] The dimensions of the instrument feature map are the batch size, the number of feature channels in the instrument branch, and the spatial height and width of the instrument feature map. This feature map captures the semantic information of the instruments (such as instrument type, morphological features, and functional attributes), providing rich basic features for the subsequent generation of instrument embedding vectors.
[0036] Then, the instrument region features are precisely cropped: based on the bounding box of each instrument, the features of the corresponding instrument region are cropped from the instrument feature map via the ROIAlign operation.
[0037] The aligned region features have a fixed spatial width and height. Compared with the traditional ROIPool operation, the ROIAlign operation effectively avoids feature distortion during region cropping, better preserves the detailed features of the instrument (such as the instrument tip and handle shape), and keeps the cropped region features highly consistent with the actual shape and semantic information of the instrument.
[0038] Next, a global average pooling (GAP) operation reduces the dimensionality of the instrument region features to obtain the pooled instrument vector.
[0039] The pooled instrument vector is then transformed into a fixed-dimensional instrument embedding vector through two layers of linear mapping.
[0040] Each of the two linear layers has a weight matrix and a bias term, and the output has the embedding dimension. The mapping compresses the high-dimensional features of the instrument into a low-dimensional vector of fixed dimension, which preserves the core semantic information and morphological features of the instrument while reducing the complexity of subsequent fusion calculations and improving the system's operating efficiency.
[0041] Finally, the geometric information of the instrument bounding box (including the top-left corner coordinates, the bottom-right corner coordinates, and the bounding box area) is encoded into a geometric position encoding vector by a multilayer perceptron (MLP, containing two fully connected layers with a ReLU activation function).
[0042] The geometric position encoding vector and the instrument embedding vector are added element by element, achieving deep fusion of geometric positional information and semantic information and generating the complete instrument feature. The instrument features of a single frame form the instrument feature set of that frame, and the instrument feature sets of all frames form the instrument feature set of the entire video frame sequence, where T is the total number of frames in the video frame sequence. By accurately detecting surgical instruments and comprehensively extracting their semantic and positional features, complete capture of instrument information is achieved, solving the problems of neglecting instrument information and being unable to distinguish similar operation stages in traditional stage recognition. This provides accurate and high-quality instrument feature input for subsequent feature fusion. At the same time, the fusion of geometric position coding improves the spatial representation capability of the instrument features, adapting to complex scenarios such as instrument movement and occlusion.
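The path from cropped region features to the final instrument feature can be sketched in plain Python. This is a minimal sketch under stated assumptions: the region features are taken as already cropped (ROIAlign itself is not implemented here), the two embedding layers are applied without an activation between them because the text calls them linear mappings, and the ReLU in the geometry MLP is placed between its two layers. All parameter names are hypothetical.

```python
def linear(x, weight, bias):
    """y = W x + b, with weight given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def global_average_pool(region):
    """Average each channel of a C x h x w region feature down to one value."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in region]

def geometry_vector(box):
    """(x1, y1, x2, y2, area) for a box given by its corner coordinates."""
    x1, y1, x2, y2 = box
    return [x1, y1, x2, y2, (x2 - x1) * (y2 - y1)]

def instrument_feature(region, box, embed_layers, geo_layers):
    """embed_layers and geo_layers each hold two (weight, bias) pairs (hypothetical)."""
    (w1, b1), (w2, b2) = embed_layers
    embedding = linear(linear(global_average_pool(region), w1, b1), w2, b2)
    (g1, c1), (g2, c2) = geo_layers
    position = linear(relu(linear(geometry_vector(box), g1, c1)), g2, c2)
    # Element-wise addition fuses semantic and geometric information.
    return [e + p for e, p in zip(embedding, position)]
```

The element-wise addition requires the embedding and the geometric position encoding to share the same dimension, which is why both branches end in a layer that outputs the embedding dimension.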
[0043] Furthermore, by constructing spatiotemporal feature extraction branches and instrument feature extraction branches respectively, multi-scale deep spatiotemporal features of the surgical scene and semantic-positional features of surgical instruments are extracted simultaneously. Spatiotemporal features focus on the global dynamic changes and operational sequence correlation of the surgical scene, while instrument features focus on the core tools of the surgical operation (instrument type, position, and motion state). The accurate extraction of these two types of features is the core foundation for subsequent feature fusion and stage recognition, and can effectively distinguish similar surgical stages (such as the suturing stage and the knotting stage, where there are significant differences in instrument type and usage).
[0044] S4. Construct a spatiotemporal-instrument cross-attention fusion algorithm, and use the cross-attention mechanism in this algorithm to fuse deep spatiotemporal features and instrument features to generate fused spatiotemporal features. The specific steps are as follows: First, the multi-scale deep spatiotemporal features are unfolded along the spatial dimension to generate a spatial token sequence. Specifically, the video feature map of each frame is reshaped by flattening its spatial height and width into a single dimension, whose length is the number of spatial tokens. This operation converts two-dimensional spatial features into a one-dimensional token sequence, facilitating subsequent attention calculations while fully preserving the detailed information of the spatial features, so that scene information is not lost during the fusion process. The video token sequences of all frames together form the video token sequence of the entire video frame sequence.
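The unfolding step is a pure reshape, which the following sketch shows on nested lists. Row-major ordering (token index = row × width + column) is an assumption; the text only requires that height and width be flattened into one token axis.

```python
def unfold_spatial(feature_map):
    """Flatten an H x W x D feature map into H * W tokens of length D (row-major)."""
    return [cell for row in feature_map for cell in row]

def token_index(y, x, width):
    """Position of grid cell (y, x) in the row-major token sequence."""
    return y * width + x
```

No values are changed or dropped, so the feature map can be recovered exactly by regrouping the tokens into rows of the original width.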
[0045] Secondly, instrument-guided video feature enhancement: the deep spatiotemporal features (video token sequence) serve as the query, and the instrument features (instrument token sequence) serve as the key and value. A cross-attention mechanism enhances the video features under the guidance of the instruments, focusing the video features on instrument-related areas (such as the interaction area between instrument and tissue, and the area of the instrument's movement trajectory) while suppressing irrelevant background interference (such as invalid areas at the edge of the endoscopic field of view and non-operational areas). The specific expression is as follows:
[0046]
[0047]
[0048] Here, the query, key, and value projections are learnable linear projection matrices (all of dimension D×D) that map the input features into the query, key, and value vector spaces, yielding the query matrix, the key matrix, and the value matrix. The scaling factor is used to stabilize the gradient. The attention weight matrix between spatial tokens and instrument tokens quantifies the degree of association between the instruments and each area of the scene: the higher the degree of association, the greater the weight. The attention-weighted output is the feature update amount. Layer normalization is used to stabilize the training process and avoid vanishing or exploding gradients, and its output is the updated video features after integrating instrument information. This process lets the video features focus on key areas related to the instruments, improving the relevance and discriminative power of the video features.
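The instrument-guided enhancement follows the standard scaled dot-product cross-attention pattern, sketched below in plain Python. Several details are assumptions because the original expressions did not survive: the scaling factor is taken as the square root of D, the update is added residually before layer normalization, and layer normalization is applied without learnable scale and shift. Function and parameter names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply an n x m matrix by an m x p matrix (nested lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(row, eps=1e-5):
    """Normalize one token to zero mean and unit variance (no learnable affine)."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]

def cross_attention(queries_in, context, w_q, w_k, w_v):
    """Update queries_in (N x D) with information from context (K x D)."""
    q = matmul(queries_in, w_q)
    k = matmul(context, w_k)
    v = matmul(context, w_v)
    scale = math.sqrt(len(w_q[0]))                            # assumed sqrt(D) scaling
    scores = matmul(q, [list(col) for col in zip(*k)])        # N x K similarity
    attn = [softmax([s / scale for s in row]) for row in scores]
    update = matmul(attn, v)                                  # N x D update amount
    out = [layer_norm([x + u for x, u in zip(x_row, u_row)])  # residual + LayerNorm
           for x_row, u_row in zip(queries_in, update)]
    return out, attn
```

Calling `cross_attention(video_tokens, instrument_tokens, ...)` gives the instrument-guided video update. Swapping the first two arguments, with a separate set of projection matrices, gives the scene-aware instrument update that the text describes as the second direction.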
[0049] Then, scene-aware instrument feature completion: the instrument features (instrument token sequence) serve as the query, and the deep spatiotemporal features (video token sequence) serve as the key and value. A cross-attention mechanism completes the instrument features with scene information. By combining instrument features with surgical scene information (such as tissue type, operating environment, and the interaction state between instruments and tissues), it enhances the semantic expressive power of the instrument features and resolves the insufficient feature expression caused by instrument occlusion and semantic ambiguity. Together with the previous step, this realizes bidirectional interaction between instrument semantics and scene semantics. The specific expression is as follows:
[0050]
[0051]
[0052] Here, the query, key, and value projections are learnable linear projection matrices (all of dimension D×D). The attention weight matrix between instrument tokens and spatial tokens quantifies how much each region of the scene supplements the semantics of the instruments. The attention-weighted output is the instrument feature update, and the result is the updated instrument features after incorporating scene information. This process allows instrument features to better fit the current surgical scenario, improving their robustness: even when an instrument is partially occluded, scene information can be used to complete its semantics.
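For reference, the two attention directions can be written in one place. The notation is assumed, since the original symbols were lost: $X_t \in \mathbb{R}^{N \times D}$ denotes the video tokens of frame $t$, $E_t \in \mathbb{R}^{K \times D}$ the instrument tokens, $W$ and $W'$ the two sets of D×D projections, and $\operatorname{LN}$ layer normalization. The $\sqrt{D}$ scaling, the residual form of the instrument update, and the use of the original rather than the updated video tokens as keys and values in the second line are assumptions made by symmetry with the first direction.

```latex
\begin{aligned}
A_t^{v \leftarrow e} &= \operatorname{softmax}\!\left(\frac{(X_t W_Q)(E_t W_K)^{\top}}{\sqrt{D}}\right),
& X_t' &= \operatorname{LN}\!\left(X_t + A_t^{v \leftarrow e}\, E_t W_V\right), \\
A_t^{e \leftarrow v} &= \operatorname{softmax}\!\left(\frac{(E_t W_Q')(X_t W_K')^{\top}}{\sqrt{D}}\right),
& E_t' &= \operatorname{LN}\!\left(E_t + A_t^{e \leftarrow v}\, X_t W_V'\right).
\end{aligned}
```

Here $A_t^{v \leftarrow e}$ has shape $N \times K$ (one row per spatial token) and $A_t^{e \leftarrow v}$ has shape $K \times N$ (one row per instrument token), so each direction redistributes information from the other modality without changing the token count of the one being updated.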
[0053] Finally, using Multi-Head Self-Attention (MHSA), temporal consistency modeling is performed on both the updated video features and the updated instrument features along the time dimension. This suppresses feature fluctuations caused by short-term occlusion, rapid instrument movement, and abrupt changes in surgical actions, improving the continuity and stability of features in the time dimension. Surgical operations are continuous dynamic processes, and instrument movement and scene changes have clear temporal correlations. Temporal consistency modeling captures these correlations and avoids recognition errors caused by anomalies in single-frame features. The specific process is as follows: 1) For each spatial token, its feature sequence across all time frames is extracted and modeled over time:
[0054]
[0055] Here, the input is the feature sequence of spatial location n across all time frames, MHSA is the multi-head self-attention mechanism applied along the time dimension, and the output is the temporally enhanced spatial token feature; 2) For each instrument token, extract its feature sequence across all time frames and perform temporal modeling:
[0056]
[0057] Here, the input is the feature sequence of a given instrument across all time frames, and the output is the temporally enhanced instrument token feature. The temporally enhanced updated video features and updated instrument features are then concatenated and integrated to obtain the final fused spatiotemporal features. Together, these two sets of features constitute the fused spatiotemporal features, providing high-quality input for the subsequent instrument perception adaptive modulation.
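As a non-limiting illustration, the temporal consistency modeling can be sketched as self-attention applied to the feature sequence of one token across all frames, with the channels split into heads. The names `attend`, `temporal_mhsa`, and `temporal_consistency`, the `features[t][n]` layout, and the omission of learned projection matrices (identity projections are used) are assumptions made to keep the sketch short.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(seq):
    """Self-attention over a sequence of vectors (identity projections assumed)."""
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, seq)) for j in range(d)])
    return out

def temporal_mhsa(token_seq, num_heads):
    """Split channels into heads, attend along time per head, then concatenate."""
    d = len(token_seq[0])
    assert d % num_heads == 0, "channel count must be divisible by the head count"
    hd = d // num_heads
    heads = [attend([frame[h * hd:(h + 1) * hd] for frame in token_seq])
             for h in range(num_heads)]
    return [[x for head in heads for x in head[t]] for t in range(len(token_seq))]

def temporal_consistency(features, num_heads):
    """features[t][n] is the vector of token n at frame t; each token is modeled over time."""
    num_frames, num_tokens = len(features), len(features[0])
    out = [[None] * num_tokens for _ in range(num_frames)]
    for n in range(num_tokens):
        seq = temporal_mhsa([features[t][n] for t in range(num_frames)], num_heads)
        for t in range(num_frames):
            out[t][n] = seq[t]
    return out

# Toy example: 3 frames, 2 tokens, 4 channels. Token 0 is constant over time,
# token 1 changes abruptly in the middle frame.
features = [
    [[1.0, 0.0, 1.0, 0.0], [0.5, 0.5, 0.5, 0.5]],
    [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]],
    [[1.0, 0.0, 1.0, 0.0], [0.5, 0.5, 0.5, 0.5]],
]
smoothed = temporal_consistency(features, num_heads=2)
```

The same routine applies to spatial tokens and instrument tokens alike, since both are treated as per-token sequences over the time dimension.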
[0058] This step achieves bidirectional deep fusion of spatiotemporal features and instrument features, which not only enhances the targeting of video features (focusing on instrument-related areas) but also completes the semantic information of instrument features (combined with scene context). At the same time, it improves the stability and continuity of features through temporal consistency modeling, solving the problems of low recognition accuracy caused by feature independence, poor temporal correlation, and weak anti-interference ability in traditional feature fusion, and providing more discriminative fused features for subsequent stage judgment.
[0059] S5. Perform instrument perception adaptive modulation on the fused spatiotemporal features to generate instrument perception fusion features. The fused spatiotemporal features are adaptively modulated in both the channel and spatial dimensions to enhance features strongly correlated with instrument operation and suppress interference from irrelevant features. Specifically: first, for each frame, the temporally enhanced instrument features are aggregated by average pooling along the instrument dimension to generate an instrument semantic summary vector representing the current surgical operation state, expressed as:
[0060] This vector can accurately reflect the instrument combination and operation mode at the current stage (such as the incision stage mainly using a scalpel, the hemostasis stage mainly using an electrocoagulator, and the suturing stage mainly using a combination of suture needles and needle holders), providing clear instrument semantic basis for subsequent two-dimensional modulation and ensuring the targeting of the modulation.
[0061] Secondly, based on the instrument semantic summary vector, a channel-level modulation signal is generated to modulate the fused spatiotemporal features, enhancing feature channels strongly correlated with instrument operation and suppressing irrelevant channel information. The feature channels associated with instrument operations differ significantly between surgical stages (e.g., scalpel operation corresponds to edge feature channels, and electrocoagulation corresponds to temperature-related feature channels), so channel-level modulation can selectively enhance the key channels and improve feature discrimination. The specific process is as follows: 1) The instrument semantic summary is mapped to channel gating weights through a multilayer perceptron followed by the Sigmoid activation function, with the following expression:
[0062] Here, the Sigmoid activation function compresses the output to the (0,1) interval, and MLP is a multilayer perceptron used to adaptively learn the relationship between the instrument semantics and the feature channels. 2) The fused spatiotemporal features are converted into a fused feature map with a two-dimensional spatial structure. The channel gating weights are expanded along the spatial dimensions (to match the spatial dimensions H and W of the feature map), and the channels of the fused feature map are weighted through element-wise multiplication to obtain the channel-modulated feature map:
[0063] Here, "⊙" denotes element-wise multiplication, and "None" denotes dimension expansion (inserting singleton axes so that the channel weights broadcast over H and W). This enhances key feature channels and suppresses irrelevant channels.
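As a non-limiting illustration, the instrument semantic summary and the channel-level modulation can be sketched as follows. The names `instrument_summary`, `channel_gate`, and `apply_channel_gate`, the two-layer MLP with a ReLU hidden activation, the `feature_map[c][y][x]` layout, and all toy weights are assumptions for this sketch; at least one detected instrument per frame is also assumed.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def instrument_summary(inst_tokens):
    """Average-pool the instrument tokens of one frame along the instrument dimension."""
    count = len(inst_tokens)                      # assumes at least one instrument
    return [sum(tok[c] for tok in inst_tokens) / count
            for c in range(len(inst_tokens[0]))]

def linear(x, weight, bias):
    """Affine map: one output per (row of weight, bias entry)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def channel_gate(summary, w1, b1, w2, b2):
    """Assumed two-layer MLP (ReLU hidden layer) followed by Sigmoid gating."""
    hidden = [max(0.0, h) for h in linear(summary, w1, b1)]
    return [sigmoid(z) for z in linear(hidden, w2, b2)]

def apply_channel_gate(feature_map, gate):
    """Scale feature_map[c][y][x] by gate[c], broadcast over H and W."""
    return [[[g * v for v in row] for row in plane]
            for plane, g in zip(feature_map, gate)]

# Toy example: two instruments, C = 2 channels, 2 x 2 feature map.
inst_tokens = [[1.0, 3.0], [3.0, 5.0]]
summary = instrument_summary(inst_tokens)         # [2.0, 4.0]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0]
gate = channel_gate(summary, w1, b1, w2, b2)
feature_map = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
modulated = apply_channel_gate(feature_map, gate)
```

With these toy weights the first channel is largely kept and the second is strongly attenuated, which mirrors the enhancement of key channels and suppression of irrelevant ones.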
[0064] Then, based on the instrument detection results, a spatial prior distribution of the instruments is constructed, and spatial modulation is applied to the fused spatiotemporal features, enhancing the feature representation at the instrument locations and their neighborhoods while suppressing irrelevant background interference. The instrument-tissue interaction area is central to surgical stage identification; spatial modulation focuses on this area, improving the specificity of the features while suppressing interference from the background. The specific process is as follows: 1) The set of instrument bounding boxes of each frame is rasterized into a spatial mask with the same spatial dimensions as the fused feature map, expressed as:
[0065] Here, Rasterize(·) is a rasterization operation that generates a continuous spatial mask marking the spatial location of the instruments and the interaction area; 2) The channel-modulated feature map is concatenated with the spatial mask along the channel dimension, and a spatial gating weight map is generated through a convolution operation and a Sigmoid activation function, expressed as follows:
[0066] Here, the concatenation operator denotes concatenation along the channel dimension. The convolution operation fuses the channel-modulated features with the spatial mask information to generate a more accurate spatial weight distribution. 3) The spatial gating weight map is used to weight each spatial location of the feature map, highlighting the features of the instrument area and the interaction area and suppressing irrelevant background interference, thereby further improving the relevance and discriminative power of the features.
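As a non-limiting illustration, the spatial modulation steps can be sketched as follows. Several simplifications are assumed: the mask is binary rather than continuous, the convolution is a 1x1 convolution, boxes are given as `(x1, y1, x2, y2)` in image pixels, and the names `rasterize`, `spatial_gate`, and `apply_spatial_gate` as well as the toy weights are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rasterize(boxes, height, width, img_h, img_w):
    """Binary mask: a cell is 1.0 if its center lies inside any bounding box."""
    mask = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            cx = (x + 0.5) * img_w / width        # cell center in image pixels
            cy = (y + 0.5) * img_h / height
            if any(x1 <= cx <= x2 and y1 <= cy <= y2 for x1, y1, x2, y2 in boxes):
                mask[y][x] = 1.0
    return mask

def spatial_gate(feature_map, mask, weights, bias):
    """Assumed 1x1 convolution over [feature channels ; mask], followed by Sigmoid."""
    stacked = feature_map + [mask]                # concatenation along the channel axis
    h, w = len(mask), len(mask[0])
    return [[sigmoid(sum(wc * plane[y][x] for wc, plane in zip(weights, stacked)) + bias)
             for x in range(w)] for y in range(h)]

def apply_spatial_gate(feature_map, gate):
    """Weight every spatial location of every channel by the gating map."""
    return [[[v * g for v, g in zip(row, gate_row)]
             for row, gate_row in zip(plane, gate)] for plane in feature_map]

# Toy example: one 4 x 4 feature channel over a 64 x 64 image, one instrument box.
boxes = [(0.0, 0.0, 32.0, 32.0)]
mask = rasterize(boxes, height=4, width=4, img_h=64, img_w=64)
feature_map = [[[1.0] * 4 for _ in range(4)]]
gate = spatial_gate(feature_map, mask, weights=[0.0, 4.0], bias=-2.0)
gated = apply_spatial_gate(feature_map, gate)
```

In the toy case the four cells covered by the box receive gate values above 0.5 and the remaining cells fall below 0.5, so the instrument region is emphasized relative to the background.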
[0067] Finally, residual fusion and feature output: the channel-modulated and spatially modulated features are residually fused with the original fused spatiotemporal features (the fused feature map), expressed as:
[0068] This operation enhances instrument perception while maintaining the integrity of the original semantic information, avoiding excessive suppression of background information (in some surgical stages, the state of the background tissue is itself discriminative) and preventing feature distortion. The results for each time frame are then stacked along the time dimension to yield the instrument perception fusion features, which provide accurate and efficient feature input for the subsequent stage determination.
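As a non-limiting illustration, residual fusion can be sketched as an element-wise sum of the original fused feature map and its modulated version, followed by stacking the per-frame results. Plain addition without a learned scaling factor, the `feature_map[c][y][x]` layout, and the names `residual_fuse` and `stack_frames` are assumptions for this sketch.

```python
def residual_fuse(original, modulated):
    """Element-wise sum of the original feature map and its modulated version."""
    return [[[o + m for o, m in zip(o_row, m_row)]
             for o_row, m_row in zip(o_plane, m_plane)]
            for o_plane, m_plane in zip(original, modulated)]

def stack_frames(per_frame_maps):
    """Stack per-frame feature maps along a leading time dimension."""
    return [frame_map for frame_map in per_frame_maps]

# Toy example: one channel, 2 x 2 map. The gates have suppressed every location
# except the top-left one, yet the residual path keeps the background values.
original = [[[1.0, 2.0], [3.0, 4.0]]]
modulated = [[[0.5, 0.0], [0.0, 0.0]]]
fused = residual_fuse(original, modulated)        # [[[1.5, 2.0], [3.0, 4.0]]]
sequence = stack_frames([fused, fused])           # two frames stacked over time
```

The toy values show why the residual path matters: background locations whose modulated value is zero still carry their original features after fusion.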
[0069] By using adaptive modulation in both channel and spatial dimensions, key features strongly correlated with instrument operation are precisely enhanced, effectively suppressing irrelevant interference and feature redundancy, making the fused features more discriminative and targeted. At the same time, residual fusion preserves the original feature information, avoiding feature distortion, and further improving the accuracy and robustness of recognition in subsequent stages, adapting to complex and ever-changing clinical surgical scenarios.
[0070] S6. Output the determination result of the current surgical stage based on the instrument perception fusion features. Through temporal modeling, classification, and stability optimization, the determination result of the current surgical stage is output with the accuracy, stability, and real-time performance needed for real-time decision support in clinical surgery. The specific implementation steps are as follows: first, Global Average Pooling (GAP) is applied to the instrument perception fusion features to generate the stage determination feature vector of each frame, expressed as follows:
[0071] The stage determination feature vectors of each time frame are stacked along the time dimension to form a temporal feature sequence. This operation compresses high-dimensional fusion features into low-dimensional feature vectors, effectively reducing the complexity of classification calculations and improving the real-time performance of the system's recognition while retaining core discriminative information.
[0072] Secondly, a temporal model is used to perform temporal modeling on the stage determination feature vectors of multiple frames, capturing the temporal dependencies between stages during the operation. The transitions between surgical stages have clear temporal logic (e.g., incision → separation → hemostasis → suturing → irrigation). Temporal modeling can capture this logical relationship, avoiding stage misjudgment caused by abnormal features in a single frame (e.g., instrument occlusion, non-standard operating procedures). The expression is:
[0073] Here, the output is the feature sequence after temporal modeling, each element of which is the feature vector corresponding to one time step. The temporal Transformer contains 4 Transformer blocks, each containing a multi-head self-attention layer (8 heads) and a fully connected layer (GELU activation function), which can effectively capture long-range temporal dependencies and adapt to the continuous dynamic process of surgical operations.
[0074] Finally, a linear classification layer is used to classify the feature vectors after temporal modeling, outputting the probability distribution of the surgical stage to which the current frame belongs, expressed as:
[0075] Here, the first parameter is the weight matrix of the linear classification layer, whose size is determined by the number of surgical stage categories (corresponding to the five core stages of laparoscopic surgery: incision, dissection, hemostasis, suturing, and irrigation); the second parameter is the bias vector; and the output is the probability that the current frame belongs to each predefined surgical stage category. The predicted stage is the output result.
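As a non-limiting illustration, the pooling and classification steps can be sketched as follows. The temporal Transformer between the two steps is omitted for brevity, and the softmax normalization, the argmax decision rule, the names `global_average_pool` and `classify`, and the toy weights are assumptions for this sketch; only the five stage names are taken from the description.

```python
import math

STAGES = ["incision", "dissection", "hemostasis", "suturing", "irrigation"]

def global_average_pool(feature_map):
    """Reduce feature_map[c][y][x] to one averaged value per channel."""
    return [sum(sum(row) for row in plane) / (len(plane) * len(plane[0]))
            for plane in feature_map]

def softmax(row):
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def classify(vector, weight, bias):
    """Linear classification layer followed by softmax (assumed normalization)."""
    logits = [sum(w * v for w, v in zip(row, vector)) + b
              for row, b in zip(weight, bias)]
    return softmax(logits)

# Toy example: C = 2 channels, 2 x 2 map, five stage categories.
feature_map = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 2.0], [2.0, 4.0]]]
pooled = global_average_pool(feature_map)         # [1.0, 2.0]
weight = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [-1.0, 0.0], [0.0, -1.0]]
bias = [0.0] * 5
probs = classify(pooled, weight, bias)
predicted = STAGES[probs.index(max(probs))]       # assumed argmax decision rule
```

In a full pipeline the pooled vectors of several frames would first pass through the temporal model, and `classify` would be applied to its output for the current frame.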
[0076] Example 2. As shown in Figures 2-4, the stage recognition system includes: a video input and encoding unit, used to receive surgical video frame sequences and generate initial spatiotemporal feature representations; a spatiotemporal feature extraction unit, used to extract deep spatiotemporal features at multiple scales; an instrument detection and instrument feature extraction unit, used to detect surgical instruments and extract instrument features based on the surgical video frame sequences; a cross-attention fusion unit, used to fuse the deep spatiotemporal features with the instrument features using a cross-attention mechanism to generate fused spatiotemporal features; an instrument perception modulation unit, used to adaptively modulate the fused spatiotemporal features to generate instrument perception fusion features; and an output unit, used to output the determination result of the current stage of the surgery based on the instrument perception fusion features.
[0077] The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
Claims
1. A stage recognition method based on surgical instrument feature fusion, characterized in that it comprises: receiving a surgical video frame sequence, performing preprocessing and preliminary encoding, and generating initial spatiotemporal features; processing the initial spatiotemporal features with a multi-layer neural network to extract deep spatiotemporal features at multiple scales; detecting surgical instruments and extracting instrument features based on the surgical video frame sequence; constructing a spatiotemporal-instrument cross-attention fusion algorithm, and using the cross-attention mechanism in the algorithm to fuse the deep spatiotemporal features and the instrument features to generate fused spatiotemporal features; performing instrument perception adaptive modulation on the fused spatiotemporal features to generate instrument perception fusion features; and outputting the determination result of the current stage of the surgery based on the instrument perception fusion features.
2. The stage identification method based on surgical instrument feature fusion according to claim 1, characterized in that the specific steps for detecting the surgical instruments and extracting the instrument features are as follows: an instrument feature map of each image frame is extracted using an instrument-branch feature extraction network; based on the instrument bounding boxes, instrument region features are cropped from the instrument feature map using the ROIAlign operation; the instrument region features are converted into instrument embedding vectors through global average pooling and linear mapping; and the geometric information of each instrument bounding box is encoded into a geometric position code, which is added to the instrument embedding vector to generate the instrument feature.
3. The stage identification method based on surgical instrument feature fusion according to claim 1, characterized in that: The specific steps for generating the fused spatiotemporal features are as follows: Using deep spatiotemporal features as queries and instrument features as keys and values, perform instrument-guided video feature enhancement to obtain updated video features; Using instrument features as the query and deep spatiotemporal features as the key and value, perform scene-aware instrument feature completion to obtain updated instrument features; Temporal consistency modeling is performed on updated video features and updated instrument features along the time dimension, and fused spatiotemporal features are output.
4. The stage identification method based on surgical instrument feature fusion according to claim 3, characterized in that, before the instrument-guided video feature enhancement is performed using the deep spatiotemporal features as queries and the instrument features as keys and values, the method further includes: unfolding the deep spatiotemporal features along the spatial dimension to generate a spatial token sequence.
5. The stage identification method based on surgical instrument feature fusion according to claim 3, characterized in that: The temporal consistency modeling uses a multi-head self-attention mechanism to model the temporal feature sequences of the same spatial location or the same instrument along the time dimension.
6. The stage identification method based on surgical instrument feature fusion according to claim 1, characterized in that the specific steps for generating the instrument perception fusion features are as follows: the instrument features are aggregated to generate an instrument semantic summary; a channel-level modulation signal is generated based on the instrument semantic summary, and channel modulation is performed on the fused spatiotemporal features; based on the instrument detection results, a spatial prior distribution of the instruments is constructed, and the fused spatiotemporal features are spatially modulated; and the features after channel modulation and spatial modulation are residually fused with the fused spatiotemporal features to output the instrument perception fusion features.
7. The stage identification method based on surgical instrument feature fusion according to claim 6, characterized in that: The channel modulation includes: The device semantic summary is mapped to channel-gated weights by using the Sigmoid activation function, and the weights are applied to each channel of the fused feature map.
8. The stage identification method based on surgical instrument feature fusion according to claim 6, characterized in that the spatial modulation includes: rasterizing the set of instrument bounding boxes into a spatial mask; and concatenating the channel-modulated features with the spatial mask and generating, through convolution and Sigmoid activation, a spatial gating weight map that is used to weight the spatial locations of the feature map.
9. The stage identification method based on surgical instrument feature fusion according to claim 1, characterized in that: The specific steps for outputting the determination result of the current stage of surgery based on the instrument perception fusion features are as follows: Global average pooling is applied to the instrument perception fusion features to generate stage-determined feature vectors; Temporal modeling of stage determination feature vectors across multiple frames is performed using a temporal Transformer. The probability distribution of the surgical stage to which the current frame belongs is output through a linear classification layer.
10. A stage recognition system based on surgical instrument feature fusion, characterized in that it comprises: a video input and encoding unit, used to receive surgical video frame sequences and generate initial spatiotemporal feature representations; a spatiotemporal feature extraction unit, used to extract deep spatiotemporal features at multiple scales; an instrument detection and instrument feature extraction unit, used to detect surgical instruments and extract instrument features based on the surgical video frame sequences; a cross-attention fusion unit, used to construct a spatiotemporal-instrument cross-attention fusion algorithm and to fuse the deep spatiotemporal features and the instrument features using the cross-attention mechanism in the algorithm to generate fused spatiotemporal features; an instrument perception modulation unit, used to adaptively modulate the fused spatiotemporal features to generate instrument perception fusion features; and an output unit, used to output the determination result of the current stage of the surgery based on the instrument perception fusion features.