Surgical phase determination method, apparatus, system, electronic device, and storage medium
By combining local spatial feature extraction networks, global spatial feature fusion networks, and temporal convolutional networks, this method solves the problem of experience-dependent and complex determination of surgical stages in traditional pancreaticoduodenectomy, achieving efficient and accurate classification of surgical stages and reducing surgical risks and time.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: PEKING UNIVERSITY THIRD HOSPITAL (THE THIRD CLINICAL MEDICAL SCHOOL OF PEKING UNIVERSITY)
- Filing Date: 2026-03-11
- Publication Date: 2026-06-19
AI Technical Summary
Traditional methods for determining the surgical stage during pancreaticoduodenectomy rely on the surgeon's experience, which can easily lead to misjudgments by novice surgeons. This increases the complexity of intraoperative monitoring, raises operational risks, and prolongs the operation time, thereby increasing the patient's anesthetic risks and postoperative recovery time.
The method combines a local spatial feature extraction network, a global spatial feature fusion network, and a temporal convolutional network. Local spatial features, global spatial relationships, and time-series features from surgical videos are fused, and a multilayer perceptron model classifies the surgical stage from the fused features.
The approach improves the accuracy of surgical stage classification, efficiently extracts spatiotemporal features from surgical videos, captures the dynamic progression of surgical stages, and reduces the complexity and risk of surgical operations.
Description
[0001] This disclosure claims priority to Chinese Patent Application No. 202511004311.X, filed on July 21, 2025, entitled "Method, Apparatus, System, Electronic Device and Storage Medium for Determining Surgical Stages", the entire contents of which are incorporated herein by reference.
Technical Field
[0002] This disclosure relates to the field of surgical assistance technology, and in particular to a method, apparatus, system, electronic device and storage medium for determining surgical stages.
Background Technology
[0003] Laparoscopic pancreaticoduodenectomy (LPD) is a complex and high-risk surgical procedure primarily used to treat pancreatic cancer, bile duct cancer, and duodenal tumors. This procedure demands a high level of skill, involving the dissection and reconstruction of multiple vital organs and blood vessels. With advancements in surgical techniques, especially the widespread adoption of minimally invasive techniques, pancreaticoduodenectomy has become increasingly common in clinical practice. However, the complexity of the surgery and the lengthy operation time increase intraoperative risks and place extremely high demands on the surgeon's experience and technical expertise.
[0004] Traditional pancreaticoduodenectomy typically involves several stages: tissue transection, pancreaticojejunostomy, choledochojejunostomy, and gastrojejunostomy. Each stage requires precise surgical technique and close monitoring. Currently, surgeons primarily rely on their experience and intraoperative imaging to determine the surgical stage and proceed with the procedure.
[0005] However, traditional methods for determining surgical stages have certain limitations: 1) Reliance on experience: the surgeon's experience and skill level greatly influence the success rate of the surgery, and novice surgeons are prone to misjudgment and errors during the procedure; 2) Difficulty in intraoperative monitoring: during the surgery, the surgeon needs to monitor multiple steps simultaneously, and this complexity increases the surgeon's workload; 3) High operational risk: the procedure is complex and usually requires a long operation time, which increases the patient's anesthesia risk and postoperative recovery time and can easily lead to postoperative complications. Therefore, there is an urgent need for a surgical stage determination method that accurately identifies the surgical stage and assists the surgeon in operating precisely.
Summary of the Invention
[0006] In order to solve the above-mentioned technical problems, or at least partially solve the above-mentioned technical problems, this disclosure provides a method, apparatus, system, electronic device and storage medium for determining the surgical stage.
[0007] In a first aspect, embodiments of this disclosure provide a method for determining the surgical stage, the method comprising: obtaining local spatial features of each frame in a surgical video based on a local spatial feature extraction network; performing global spatial feature modeling based on a global spatial feature fusion network to obtain the global spatial relationship between surgical operation areas; extracting time-series features from the surgical video based on a temporal convolutional network; fusing the local spatial features, the global spatial relationship, and the time-series features to obtain target fused features; and classifying the surgical stages based on the target fused features to determine the category of the surgical stage.
[0008] In some embodiments, the local spatial feature extraction network employs a multi-layer convolutional neural network whose convolution kernels have different sizes, and obtaining the local spatial features of each frame in the surgical video based on the local spatial feature extraction network includes: extracting local features of each frame image using the multi-layer convolutional neural network; selecting the convolution kernel size based on the surgical operation area of each frame image to capture the detailed regions and structural features of each frame image; and generating output feature maps of multiple convolutional layers based on the local features, detailed regions, and structural features of each frame image, wherein the output feature maps include a low-level feature map, a mid-level feature map, and a high-level feature map.
[0009] In some embodiments, performing global spatial feature modeling based on the global spatial feature fusion network to obtain the global spatial relationship between surgical operation areas includes: constructing a graph structure that includes nodes and edges; converting each frame image into nodes of the graph structure, wherein nodes represent surgical operation areas, edges represent the spatial relationships between surgical operation areas, and each node contains local spatial features and temporal information; aggregating the features of each node through a graph neural network to obtain enhanced spatial features of each node; and outputting the global spatial relationship between the surgical operation areas based on the enhanced spatial features of each node.
[0010] In some embodiments, the temporal convolutional network includes multiple convolutional layers, each of which includes causal convolution and dilated convolution, and extracting time-series features from the surgical video based on the temporal convolutional network includes: inputting the local spatial features and the global spatial relationship into the temporal convolutional network as temporal data, and extracting time-dependent features through causal convolution; expanding the receptive field through dilated convolution to capture long-term dependencies; and outputting, based on the time-dependent features and the long-term dependencies, time-series features that characterize the evolution of the surgical stage at each time point.
[0011] In some embodiments, fusing the local spatial features, the global spatial relationship, and the time-series features to obtain the target fused features includes: inputting the local spatial features, the global spatial relationship, and the time-series features into a fully connected layer, which performs feature fusion based on bilinear pooling to obtain the target fused features.
[0012] In some embodiments, classifying the surgical stages based on the target fused features to determine the category of the surgical stage includes: using a multilayer perceptron model as the surgical stage classifier; inputting the target fused features into the surgical stage classifier to obtain a classification result for the surgical stage; and optimizing the classification result using label smoothing to determine the category of the surgical stage.
[0013] In a second aspect, embodiments of this disclosure provide a surgical stage determination device, the device comprising: an acquisition unit, used to acquire the local spatial features of each frame of the surgical video based on the local spatial feature extraction network; an obtaining unit, used to perform global spatial feature modeling based on a global spatial feature fusion network, thereby obtaining the global spatial relationship between surgical operation areas; an extraction unit, used to extract time-series features from the surgical video based on a temporal convolutional network; a fusion unit, used to fuse the local spatial features, the global spatial relationship, and the time-series features to obtain the target fused features; and a determining unit, used to classify the surgical stage based on the target fused features and determine the category of the surgical stage.
[0014] In a third aspect, embodiments of this disclosure provide a surgical stage determination system, which includes a local spatial feature extraction network, a global spatial feature fusion network, a temporal convolutional network, a fully connected layer, and a surgical stage classifier. The local spatial feature extraction network is used to obtain the local spatial features of each frame of the surgical video; the global spatial feature fusion network is used to model global spatial features and obtain the global spatial relationship between surgical operation areas; the temporal convolutional network is used to extract time-series features from the surgical video; the fully connected layer is used to fuse the local spatial features, the global spatial relationship, and the time-series features to obtain the target fused features; and the surgical stage classifier is used to classify surgical stages based on the target fused features and determine the category of the surgical stage.
In a fourth aspect, embodiments of this disclosure provide an electronic device, including: a memory; a processor; and a computer program, wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method described in the first aspect.
[0015] In a fifth aspect, embodiments of this disclosure provide a computer-readable storage medium having a computer program stored thereon, the computer program being executed by a processor to implement the method described in the first aspect.
[0016] In a sixth aspect, embodiments of this disclosure also provide a computer program product comprising a computer program or instructions that, when executed by a processor, implement the method described in the first aspect.
[0017] The surgical stage determination method, apparatus, system, electronic device, and storage medium provided in this disclosure acquire local spatial features of each frame in a surgical video based on a local spatial feature extraction network, perform global spatial feature modeling based on a global spatial feature fusion network to obtain the global spatial relationship between surgical operation areas, extract time-series features of the surgical video based on a temporal convolutional network, fuse the local spatial features, the global spatial relationship, and the time-series features to obtain a target fusion feature, and classify the surgical stages based on the target fusion feature to determine the category of the surgical stage. Compared with the prior art, this disclosure can efficiently extract spatiotemporal features from surgical videos, comprehensively understand the dynamic process of surgical stages, improve the accuracy of surgical stage classification, and handle complex surgical operation steps.
Attached Figure Description
[0018] The accompanying drawings, which are incorporated in and form a part of this specification, illustrate embodiments consistent with this disclosure and, together with the description, serve to explain the principles of this disclosure.
[0019] To more clearly illustrate the technical solutions in the embodiments of this disclosure or the prior art, the accompanying drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, for those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0020] Figure 1 is a flowchart of a surgical stage determination method provided in an embodiment of this disclosure;
Figure 2 is a flowchart of a surgical stage determination method provided in another embodiment of this disclosure;
Figure 3 is a flowchart of a surgical stage determination method provided in another embodiment of this disclosure;
Figure 4 is a schematic diagram of the overall process of the surgical stage determination method provided in an embodiment of this disclosure;
Figure 5 is a flowchart of the training process of the local spatial feature extraction network provided in an embodiment of this disclosure;
Figure 6 is a schematic diagram of global spatial feature fusion provided in an embodiment of this disclosure;
Figure 7 is a flowchart of the training process of the temporal convolutional network provided in an embodiment of this disclosure;
Figure 8 is a schematic diagram of the surgical stage determination device provided in an embodiment of this disclosure;
Figure 9 is a schematic diagram of the structure of an electronic device provided in an embodiment of this disclosure.
Detailed Implementation
[0021] To better understand the above-mentioned objectives, features, and advantages of this disclosure, the solutions disclosed herein will be further described below. It should be noted that, unless otherwise specified, the embodiments and features described herein can be combined with each other.
[0022] Numerous specific details are set forth in the following description in order to provide a full understanding of this disclosure, but this disclosure may also be implemented in other ways different from those described herein; obviously, the embodiments in the specification are only some, and not all, of the embodiments of this disclosure.
[0023] To address this issue, this disclosure provides a method for determining the surgical stage, which will be described below with reference to specific embodiments.
[0024] Figure 1 is a flowchart of the surgical stage determination method provided in this embodiment. The method is executed by an electronic device, which can be a portable mobile device such as a smartphone, tablet, or laptop, or a fixed device such as a personal computer or server. The server can be a single server, a server cluster, a distributed cluster, or a centralized cluster. This method can be applied to scenarios requiring surgical stage determination. It can assist in determining the surgical stage of pancreaticoduodenectomy and can also be used for other surgeries, improving the accuracy of surgical stage determination and assisting the surgeon in operating precisely. It is understood that the surgical stage determination method provided in this embodiment can also be applied to other scenarios.
[0025] The surgical stage determination method shown in Figure 1 is described below. The specific steps are as follows:
S101. Obtain the local spatial features of each frame of the surgical video based on the local spatial feature extraction network.
[0026] In this step, the electronic device constructs a local spatial feature extraction network. As shown in Figure 4, the surgical video is then acquired, and the local spatial features of each frame in the surgical video are obtained based on the local spatial feature extraction network. Optionally, the local spatial feature extraction network can be a convolutional neural network (CNN); this is not specifically limited.
[0027] First, a multi-layer convolutional neural network (CNN) is used to extract local features from each frame of the image. Convolutional layers with variable receptive fields are employed so that the CNN can capture detailed regions (such as surgical instruments and tissue areas) as well as large-scale structural features. In addition, an adaptive CNN adjusts the size and characteristics of the convolutional kernels according to the specific surgical video scene, thereby enhancing the expressive power of the image features.
[0028] S102. Global spatial feature modeling is performed based on a global spatial feature fusion network to obtain the global spatial relationship between surgical operation areas.
[0029] In this step, the global spatial feature fusion network can be a graph neural network (GNN); this is not specifically limited. For example, as shown in Figure 4, the electronic device performs global spatial feature modeling using a GNN. A GNN can model the relationships between surgical video frames through a graph structure, such as the spatial relationship between the surgical operation areas in the current frame and those in the preceding and following frames.
[0030] S103. Extracting time-series features from surgical videos based on temporal convolutional networks.
[0031] In this step, as shown in Figure 4, a temporal convolutional network (TCN) is used to extract time-series features from surgical videos. A TCN processes time-series data through a series of convolutional layers. This mitigates the vanishing gradient problem and allows parallel computation, which accelerates training. A TCN is suitable for long time series and can efficiently extract the temporal dependencies of action steps from surgical videos.
[0032] S104. The local spatial features, the global spatial relationships, and the time series features are fused to obtain the target fused features.
[0033] In this step, after the time-series features of the surgical video have been extracted, the electronic device performs feature fusion as shown in Figure 4: the local spatial features, the global spatial relationships, and the time-series features are fused to obtain the target fused features, giving a comprehensive representation of each stage of the surgical process.
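The summary above states that fusion is performed in a fully connected layer based on bilinear pooling, but it does not specify how the three feature groups are paired. The following is a minimal sketch under stated assumptions: the local and global features are concatenated into one spatial vector, which interacts with the time-series vector through an outer product, followed by signed square root and L2 normalization (a common post-processing step that the source does not mention). The function names `bilinear_pool`, `signed_sqrt_l2`, and `fuse` are hypothetical, and the learned fully connected projection is omitted.

```python
import math

def bilinear_pool(x, y):
    """Outer product of two feature vectors, flattened row by row."""
    return [xi * yj for xi in x for yj in y]

def signed_sqrt_l2(v, eps=1e-12):
    """Signed square root, then L2 normalisation (assumed post-processing)."""
    s = [math.copysign(math.sqrt(abs(a)), a) for a in v]
    norm = math.sqrt(sum(a * a for a in s)) + eps
    return [a / norm for a in s]

def fuse(local_feat, global_feat, temporal_feat):
    """Assumed pairing: (local ++ global) interacts with the temporal feature."""
    spatial = list(local_feat) + list(global_feat)
    return signed_sqrt_l2(bilinear_pool(spatial, temporal_feat))

# Toy per-frame features; real features would come from the three networks.
fused = fuse([1.0, 2.0], [0.5], [1.0, -1.0])
```

The fused vector has length (len(local) + len(global)) × len(temporal), so in practice the inputs would be projected to a small dimension before pooling.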
[0034] S105. Based on the target fusion features, classify the surgical stages to determine the category of the surgical stage.
[0035] In this step, the electronic device classifies the surgical stage based on the target fusion features to determine the category of the surgical stage. Specifically, as shown in Figure 4, the electronic device can perform this classification through a classifier model or a classification network.
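The summary names a multilayer perceptron classifier combined with label smoothing. The sketch below illustrates only the label-smoothing part, assuming the standard training-time formulation in which the one-hot target is mixed with a uniform distribution; the source does not say which form of label smoothing is used, so this is an assumption. The logits are hypothetical classifier outputs, the stage names are taken from the background section, and `epsilon=0.1` is an illustrative value.

```python
import math

STAGES = ["tissue transection", "pancreaticojejunostomy",
          "choledochojejunostomy", "gastrojejunostomy"]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def smooth_labels(true_index, num_classes, epsilon=0.1):
    """Target = (1 - epsilon) on the true class plus epsilon spread uniformly."""
    base = epsilon / num_classes
    return [base + (1.0 - epsilon if k == true_index else 0.0)
            for k in range(num_classes)]

def smoothed_cross_entropy(logits, true_index, epsilon=0.1):
    """Cross-entropy between the smoothed target and the softmax output."""
    probs = softmax(logits)
    target = smooth_labels(true_index, len(logits), epsilon)
    return -sum(t * math.log(p) for t, p in zip(target, probs))

logits = [2.0, 0.5, 0.1, -1.0]  # hypothetical MLP outputs for one frame
predicted = STAGES[max(range(len(logits)), key=logits.__getitem__)]
loss = smoothed_cross_entropy(logits, true_index=0)
```

Smoothing penalizes over-confident predictions, which can help when adjacent surgical stages look similar near their boundaries.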
[0036] In some embodiments, to improve the performance of the local spatial feature extraction network, the global spatial feature fusion network, and the temporal convolutional network, the training process is optimized with the following strategies, which take the characteristics of surgical video data into account:
1) Spatiotemporal data augmentation: In addition to image augmentation (such as rotation and cropping), temporal data augmentation (such as time-slice perturbation and time reversal) is used to improve the robustness of the above network models.
[0037] 2) Hybrid loss function: In addition to the traditional cross-entropy loss, a distance-based contrastive loss is added to encourage the network model to produce similar features for similar time-series data, thereby improving the accuracy of the network model.
[0038] 3) Adaptive learning rate: By using cyclic learning rate scheduling, the learning rate is dynamically adjusted during training, which improves training efficiency and the accuracy of the final model.
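The cyclic learning rate scheduling in strategy 3) can be illustrated with a triangular schedule, one common form of cyclic scheduling. This is a sketch only: the source does not specify the schedule shape, and `base_lr`, `max_lr`, and `half_cycle` are illustrative values rather than values from the disclosure.

```python
def triangular_lr(step, base_lr=1e-4, max_lr=1e-3, half_cycle=100):
    """Triangular cyclic schedule: rise linearly for half_cycle steps, then fall."""
    cycle_pos = step % (2 * half_cycle)
    frac = cycle_pos / half_cycle
    if frac > 1.0:
        frac = 2.0 - frac  # descending half of the cycle
    return base_lr + (max_lr - base_lr) * frac

# Learning rate at the start, quarter, peak, three-quarter, and end of a cycle.
schedule = [triangular_lr(s) for s in (0, 50, 100, 150, 200)]
```

The rate starts at `base_lr`, peaks at `max_lr` after `half_cycle` steps, and returns to `base_lr` at the end of each cycle.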
[0039] Applying spatiotemporal data augmentation and the hybrid loss function reduces overfitting and improves the model's generalization ability across varying surgical scenarios.
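The hybrid loss in strategy 2) can be sketched as cross-entropy plus a weighted pairwise contrastive term. The margin-based contrastive form, the `margin` value, and the mixing `weight` below are assumptions; the source only states that a distance loss is added to the cross-entropy loss.

```python
import math

def cross_entropy(probs, true_index):
    """Negative log-probability assigned to the true stage."""
    return -math.log(probs[true_index])

def contrastive_loss(feat_a, feat_b, same_stage, margin=1.0):
    """Pull same-stage features together; push different-stage features
    at least `margin` apart (margin is an assumed hyperparameter)."""
    d = math.dist(feat_a, feat_b)
    if same_stage:
        return d * d
    return max(0.0, margin - d) ** 2

def hybrid_loss(probs, true_index, feat_a, feat_b, same_stage, weight=0.5):
    """Cross-entropy plus a weighted contrastive term (weight is assumed)."""
    return (cross_entropy(probs, true_index)
            + weight * contrastive_loss(feat_a, feat_b, same_stage))

# Two clip features from the same stage, and the classifier output for one of them.
total = hybrid_loss([0.7, 0.1, 0.1, 0.1], 0, [0.0, 0.0], [0.3, 0.4], same_stage=True)
```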
[0040] This disclosure embodiment obtains local spatial features of each frame in a surgical video based on a local spatial feature extraction network, models global spatial features based on a global spatial feature fusion network to obtain global spatial relationships between surgical operation areas, extracts temporal series features from the surgical video based on a temporal convolutional network, and fuses the local spatial features, global spatial relationships, and temporal series features to obtain target fused features. Based on these target fused features, the surgical stages are classified to determine their categories. Compared to existing technologies, this disclosure embodiment can efficiently extract spatiotemporal features from surgical videos, comprehensively understand the dynamic process of surgical stages, improve the accuracy of surgical stage classification, and handle complex surgical operation steps.
[0041] Figure 2 is a flowchart of a surgical stage determination method provided in another embodiment of this disclosure. As shown in Figure 2, the method includes the following steps:
S201. Extract local features of each frame image through a multi-layer convolutional neural network.
[0042] In this embodiment, the local spatial feature extraction network employs a multi-layer convolutional neural network, with the convolution kernels of the multi-layer convolutional neural network having different sizes. In this step, the electronic device extracts local features from each frame of the image using the multi-layer convolutional neural network. These local features include, but are not limited to, surgical instruments and tissue structures.
[0043] S202. Select the size of the convolution kernel based on the surgical operation area of each frame image to capture the detailed areas and structural features of each frame image.
[0044] In this step, as shown in Figure 5, the convolutional layers are designed so that their receptive fields let the network model capture both detailed and global information. By using convolutional kernels of different sizes to vary the receptive field, the network model can attend to both small-scale details (such as cutting lines and instrument details) and large-scale regions (such as large organs like the pancreas and intestines) when extracting local features. Specifically, the following convolutional kernels are used:
A. Small convolutional kernels (3x3 and 5x5): used to capture detailed regions, helping to identify local details such as surgical instruments, blood vessels, and tissue incisions.
[0045] B. Large convolutional kernels (7x7 and 11x11): used to capture larger structural information, such as the layout of large organs and surgical operation areas.
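For branches with the kernel sizes listed above to be combined, their output feature maps need matching spatial sizes. The sketch below shows the standard output-size arithmetic, assuming stride 1 and "same" zero padding; the 224-pixel input size is an illustrative assumption, not a value from the disclosure.

```python
KERNEL_SIZES = (3, 5, 7, 11)  # small (3, 5) and large (7, 11) kernels from the text

def same_padding(kernel):
    """Zero padding that preserves spatial size for an odd kernel at stride 1."""
    return (kernel - 1) // 2

def conv_output_size(size, kernel, stride=1, padding=0):
    """Standard convolution output-size formula."""
    return (size + 2 * padding - kernel) // stride + 1

# With 'same' padding, every branch keeps the assumed 224-pixel input size.
branch_sizes = {k: conv_output_size(224, k, padding=same_padding(k))
                for k in KERNEL_SIZES}
```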
[0046] In surgical videos, different scenes may have different image features (e.g., features differ between surgical sites). As shown in Figure 5, an adaptive convolutional neural network is designed to dynamically adjust the size and characteristics of the convolutional kernel to adapt to inputs from different scenes.
[0047] Specifically, a learnable kernel size is generated through a convolutional layer, automatically adjusting the receptive field size based on the features of the input image (such as the surgical operation area). For example, when the input image contains small surgical instruments, the network tends to use smaller kernels to extract details; when the image contains large areas of organs or tissues, the network uses larger kernels to extract overall structural features.
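One way to approximate an input-dependent kernel size is soft selection: run branches with different kernel sizes and mix them with softmax weights. The sketch below is an assumption about how this could be realized, not the mechanism claimed in the disclosure. It uses a 1D signal for brevity, and `gate_scores` is set by hand here, whereas in a trained network it would be produced by a small learned layer from the input.

```python
import math

def conv1d_same(signal, kernel):
    """1D cross-correlation with zero padding; output length equals input length."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def adaptive_conv(signal, kernels, gate_scores):
    """Mix the outputs of several kernel sizes with softmax gate weights."""
    weights = softmax(gate_scores)
    branches = [conv1d_same(signal, k) for k in kernels]
    mixed = [sum(w * b[i] for w, b in zip(weights, branches))
             for i in range(len(signal))]
    return mixed, weights

signal = [0.0, 0.0, 1.0, 0.0, 0.0]
kernels = [[1 / 3] * 3, [1 / 5] * 5]  # a small and a larger averaging kernel
mixed, weights = adaptive_conv(signal, kernels, gate_scores=[2.0, 0.0])
```

A higher gate score for the small kernel corresponds to the case in the text where the frame contains small instruments and fine detail matters more.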
[0048] S203. Based on the local features of each frame of the image, the detailed regions and structural features of each frame of the image, generate the output feature maps of multiple convolutional layers.
[0049] In this step, the electronic device generates output feature maps from multiple convolutional layers based on the local features, detailed regions, and structural features of each frame of the image. In some embodiments, a deep convolutional neural network such as ResNet-50 can be used to extract image features progressively. By stacking multiple convolutional layers, the network extracts features step by step, from low-level features (such as edges, corners, and textures) to higher-level semantic information (such as instruments, tissue shapes, and organ locations).
[0050] After the above convolutional layer operations, as shown in Figure 5, the network generates output feature maps from multiple convolutional layers. Optionally, the output feature maps include low-level feature maps, mid-level feature maps, and high-level feature maps. Each feature map represents the spatial information of the input image at a different level. Low-level feature maps capture low-level information such as edges, corners, and textures; mid-level feature maps represent the more complex shapes of local regions, such as instrument shapes and tissue textures; and high-level feature maps represent more abstract semantic information, such as organ boundaries and surgical operation areas.
[0051] In some embodiments, as shown in Figure 5, after the output feature maps of multiple convolutional layers are generated, the method further includes feature map normalization and activation, and pooling layer design.
[0052] For feature map normalization and activation: 1) Normalization: Batch normalization is used to normalize the output of each layer, which helps avoid vanishing gradients and accelerates training. 2) Activation function: The ReLU (Rectified Linear Unit) activation function is used to increase the non-linearity of the network, thereby improving the model's learning and expressive capabilities. For certain special layers, LeakyReLU or ELU activation functions can also be considered to further improve gradient propagation.
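The normalization and activation functions named above can be written out for a single channel as follows. This is a minimal sketch using the training-mode batch statistics only; the running averages used at inference are omitted, and `gamma`, `beta`, `slope`, and `alpha` are shown at commonly used default values, which are assumptions here.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one channel over the batch to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    """Keeps a small gradient for negative inputs."""
    return x if x > 0 else slope * x

def elu(x, alpha=1.0):
    """Smooth negative branch that saturates at -alpha."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

normed = batch_norm([1.0, 2.0, 3.0, 4.0])
activated = [relu(v) for v in normed]
```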
[0053] For pooling layer design, to reduce computational cost and further extract higher-level features, a convolutional layer is usually followed by a pooling layer. The following pooling strategy is used: max pooling, which reduces the spatial dimensions while retaining the most important feature information. A commonly used pooling window is 3x3 with a stride of 2.
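The 3x3, stride-2 max pooling described above can be sketched on a small feature map as follows. No padding is assumed, since the source does not state a padding value, and the 5x5 input is illustrative.

```python
def max_pool2d(fmap, window=3, stride=2):
    """Max pooling over a 2D feature map (list of rows), without padding."""
    rows, cols = len(fmap), len(fmap[0])
    out = []
    for r in range(0, rows - window + 1, stride):
        out_row = []
        for c in range(0, cols - window + 1, stride):
            out_row.append(max(fmap[r + dr][c + dc]
                               for dr in range(window)
                               for dc in range(window)))
        out.append(out_row)
    return out

# 5x5 map holding the values 0..24 in row-major order.
fmap = [[r * 5 + c for c in range(5)] for r in range(5)]
pooled = max_pool2d(fmap)  # 2x2 output: [[12, 14], [22, 24]]
```

The output side length follows (n - 3) // 2 + 1, so a 5x5 map shrinks to 2x2.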
[0054] S204. Global spatial feature modeling is performed based on a global spatial feature fusion network to obtain the global spatial relationship between surgical operation areas.
[0055] Specifically, the implementation process and principle of S204 and S102 are the same, and will not be repeated here.
[0056] S205. The local spatial features and global spatial relationships are used as temporal data and input into a temporal convolutional network to extract time-dependent features through causal convolution.
[0057] In this step, as shown in Figure 7, the electronic device inputs the local spatial features and global spatial relationships as temporal data into a temporal convolutional network (TCN). Extracting time-dependent features through causal convolution ensures that the processing of time-series data respects causality (i.e., the output at the current time step depends only on the current and previous time steps), which avoids leakage of future information. A temporal convolutional network processes time-series data through a series of convolutional layers, which mitigates the vanishing gradient problem and allows parallel computation, helping to accelerate training.
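The causality constraint can be made concrete with a single-channel causal convolution in which the output at time t reads only inputs at t and earlier. This is a minimal sketch: a real TCN layer would operate on multi-channel feature vectors with learned weights, and the averaging weights used here are purely illustrative.

```python
def causal_conv1d(sequence, weights, dilation=1):
    """y[t] = sum_k weights[k] * x[t - k * dilation]; missing past values count as zero."""
    out = []
    for t in range(len(sequence)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # never index into the future or before the start
                acc += w * sequence[idx]
        out.append(acc)
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = causal_conv1d(x, weights=[0.5, 0.5])  # average of current and previous step

# Changing the last input must not change any earlier output.
x_changed = x[:4] + [100.0]
y_changed = causal_conv1d(x_changed, weights=[0.5, 0.5])
no_leak = y_changed[:4] == y[:4]
```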
[0058] The temporal convolutional network includes multiple convolutional layers, each of which includes causal convolution and dilated convolution.
[0059] Temporal Convolutional Networks (TCNs) consist of multiple convolutional layers, each employing a combination of causal and dilated convolutions. The model's output is a feature representation at each time step, capturing the dependencies within the time series. TCNs primarily comprise input layers, convolutional layers, and output layers.
[0060] (1) Input layer: Local spatial features and global spatial relationships are input into TCN as time series data for further time feature modeling.
[0061] (2) Convolutional layers: TCN uses causal convolutional and dilated convolutional layers to extract the temporal dependency features of the surgical stage. Each convolutional layer processes long-term sequences, increases the receptive field, and captures dependencies at different time scales.
[0062] (3) Output layer: The output of TCN is the feature corresponding to each time step, which represents the surgical stage information or action step features at that time point, and is further used for surgical stage identification and prediction.
[0063] S206. Expand the receptive field through dilated convolution to capture long-term dependencies.
[0064] Furthermore, as shown in Figure 7, dilated convolution expands the receptive field and captures dependencies over long periods. This is particularly important for analyzing long-duration steps in surgical videos, enabling the capture of subtle changes between different time intervals during the surgical procedure.
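How far dilation extends the temporal context can be computed directly. The sketch assumes one convolution per layer, a kernel size of 3, and dilations that double per layer; these settings are common in TCN designs but are not stated in the source.

```python
def receptive_field(kernel_size, dilations):
    """Time steps visible to one output after stacking one conv per dilation."""
    return 1 + (kernel_size - 1) * sum(dilations)

doubling = [2 ** i for i in range(4)]       # dilations 1, 2, 4, 8
rf_dilated = receptive_field(3, doubling)   # 31 time steps
rf_plain = receptive_field(3, [1] * 4)      # 9 time steps without dilation
```

With the same four layers and the same number of weights, doubling dilations cover 31 time steps instead of 9, and the coverage grows exponentially as layers are added.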
[0065] In some embodiments, temporal convolutional networks employ parallel computation to accelerate the training process.
[0066] S207. Based on the time-dependent features and the long-term dependency relationship, output time-series features, which are used to characterize the evolutionary features of the surgical stage at each time point.
[0067] In this step, as shown in Figure 7, the electronic device outputs time-series features based on the time-dependent features and the long-term dependencies. The time-series features characterize the evolution of the surgical stage at each time point, that is, the surgical stage information or action step features at that time point, and can be used for the identification and prediction of the surgical stage.
[0068] S208. The local spatial features, the global spatial relationships, and the time series features are fused to obtain the target fused features.
[0069] Specifically, the implementation process and principle of S208 and S104 are the same, and will not be repeated here.
[0070] S209. Based on the target fusion features, classify the surgical stages to determine the category of the surgical stage.
[0071] Specifically, the implementation process and principle of S209 and S105 are the same, and will not be repeated here.
[0072] This embodiment extracts local features from each frame of an image using a multi-layer convolutional neural network. The size of the convolutional kernel is selected based on the surgical operation region of each frame to capture the detailed regions and structural features of each frame. Based on the local features, detailed regions, and structural features of each frame, output feature maps of multiple convolutional layers are generated. Further, a global spatial feature fusion network is used to model global spatial features, obtaining the global spatial relationships between surgical operation regions. These local spatial features and global spatial relationships are used as temporal data and input into a temporal convolutional network. Causal convolution extracts time-dependent features, and dilated convolution expands the receptive field to capture long-term dependencies. Next, based on the time-dependent features and the long-term dependencies, time-series features are output. The local spatial features, global spatial relationships, and time-series features are fused to obtain a target fused feature. Finally, the surgical stage is classified based on the target fused feature to determine the category of the surgical stage. Using this method, the present invention combines multi-layer convolutional networks, graph neural networks, and temporal convolutional networks to efficiently extract spatiotemporal features from surgical videos, comprehensively understand the dynamic process of surgical stages, improve the accuracy of surgical stage classification, and handle complex surgical operation steps.
[0073] Figure 3 is a flowchart of a surgical stage determination method provided in another embodiment of this disclosure. As shown in Figure 3, the method includes the following steps: S301. Obtain the local spatial features of each frame of the surgical video based on a local spatial feature extraction network.
[0074] Specifically, the implementation process and principle of S301 and S101 are the same, and will not be repeated here.
[0075] S302. Construct a graph structure, which includes nodes and edges.
[0076] In this step, as shown in Figure 6, a graph structure including nodes and edges is constructed. Node generation: the features of each operating region (e.g., instrument shape, tissue texture) are used as the input features of the corresponding node. Edge generation: edges are constructed based on the spatial relationships between video frames. For example, edges are generated by calculating the relative positions and temporal information of operating regions in adjacent frames, which models the spatial dependencies between consecutive frames during surgery.
[0077] S303. Convert each frame image into nodes of the graph structure, where nodes represent surgical operation areas and edges represent the spatial relationships between the surgical operation areas. Each node contains local spatial features and temporal information.
[0078] In this step, each frame of the surgical video is transformed into nodes of the graph structure. Each node contains local spatial features and temporal information. Node: each node represents an operating region (e.g., instrument, tissue, organ) in the surgical video. Edge: each edge represents the spatial relationship between two operating regions, indicating their dependency and relative position.
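As a non-limiting illustration, the node and edge generation of S302 and S303 might be sketched as follows. The function name `build_region_graph`, the use of normalized region centers, and the distance threshold `radius` are assumptions made for this sketch; the disclosure does not specify how relative positions are measured.

```python
import numpy as np

def build_region_graph(features, centers, frame_ids, radius=0.25):
    """Build a graph over operating regions detected in a surgical video.

    features:  (N, D) local spatial feature of each region (node features).
    centers:   (N, 2) normalized (x, y) center of each region (assumed input).
    frame_ids: (N,)   index of the frame in which each region appears.
    radius:    hypothetical distance threshold for linking two regions.

    Returns the node features with the frame index appended as temporal
    information, and a symmetric (N, N) 0/1 adjacency matrix.
    """
    features = np.asarray(features, dtype=float)
    centers = np.asarray(centers, dtype=float)
    frame_ids = np.asarray(frame_ids)
    # Node generation: local spatial features plus temporal information.
    nodes = np.hstack([features, frame_ids.reshape(-1, 1).astype(float)])
    # Edge generation: link regions that are close together and lie in the
    # same frame or in adjacent frames.
    diff = centers[:, None, :] - centers[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    frame_gap = np.abs(frame_ids[:, None] - frame_ids[None, :])
    adj = (dist <= radius) & (frame_gap <= 1)
    np.fill_diagonal(adj, False)  # no self-edges at this stage
    return nodes, adj.astype(int)

# Example: three regions; only the first two are close enough to be linked.
nodes, adj = build_region_graph(
    features=[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    centers=[[0.1, 0.1], [0.2, 0.1], [0.9, 0.9]],
    frame_ids=[0, 1, 1],
)
print(adj)
```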
[0079] S304. The features of each node are aggregated through a graph neural network to obtain the enhanced spatial features of each node.
[0080] A graph neural network (GNN) propagates information within a graph structure through graph convolution operations (e.g., GCN or GAT), enabling each node to incorporate features from its neighbors. Through iterative graph convolution operations, the network progressively enhances the spatial features of its nodes. Specifically, as shown in Figure 6, the graph neural network aggregates the features of each node to obtain the enhanced spatial features of each node. This process includes two steps: 1) Node feature aggregation: each node incorporates the information of its neighbors into its own features through graph convolution, thereby enhancing the representation of global spatial relationships. 2) Message passing and updating: the graph neural network updates the node representations through multiple rounds of message passing, so that the representation of each node contains not only its own information but also the spatial relationships associated with it.
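As a non-limiting illustration, one round of node feature aggregation might be sketched as a GCN-style layer with symmetric normalization and self-loops. This particular normalization and the names `gcn_layer` and `enhance_nodes` are assumptions for this sketch; the disclosure only states that GCN or GAT operations may be used.

```python
import numpy as np

def gcn_layer(h, adj, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W).

    h:      (N, D) node features.
    adj:    (N, N) symmetric 0/1 adjacency matrix without self-loops.
    weight: (D, D_out) weight matrix (learned in practice, fixed here).
    """
    a_hat = adj + np.eye(adj.shape[0])   # self-loops keep each node's own features
    deg = a_hat.sum(axis=1)              # always >= 1 because of the self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ h @ weight, 0.0)

def enhance_nodes(h, adj, weights):
    """Multiple rounds of message passing, one weight matrix per round."""
    for w in weights:
        h = gcn_layer(h, adj, w)
    return h

# Example: two linked nodes end up sharing each other's features.
h = np.eye(2)
adj = np.array([[0.0, 1.0], [1.0, 0.0]])
print(gcn_layer(h, adj, np.eye(2)))
```

Each additional round lets a node see neighbors one hop further away, which is how repeated message passing spreads spatial context across the graph.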
[0081] S305. Based on the enhanced spatial features of each node, output the global spatial relationship between the surgical operation areas.
[0082] In this step, as shown in Figure 6, after obtaining the enhanced spatial features of each node, the electronic device can output the global spatial relationship between the surgical operation areas based on the enhanced spatial features of each node, thereby providing richer and more global spatial information for the analysis of the entire surgical video. Specifically, global spatial feature fusion can be performed based on the enhanced spatial features of each node to obtain global spatial fusion features, which are used to characterize the global spatial relationship between the surgical operation areas.
[0083] In some embodiments, as shown in Figure 6, after the global spatial fusion features are obtained, the method further includes normalization and activation, a pooling layer, and feature compression, to further output the fused spatial features.
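As a non-limiting illustration, the normalization, activation, pooling, and feature compression stages might be sketched as follows. The specific choices here (per-node layer-style normalization, ReLU, mean pooling over nodes, and a linear projection `proj`) are assumptions for this sketch and are not specified by the disclosure.

```python
import numpy as np

def fuse_global_spatial(node_feats, proj, eps=1e-5):
    """Normalize, activate, pool, and compress enhanced node features.

    node_feats: (N, D) enhanced spatial features from the graph network.
    proj:       (D, d) hypothetical projection matrix used for compression.
    Returns a (d,) vector summarizing the whole graph.
    """
    node_feats = np.asarray(node_feats, dtype=float)
    mean = node_feats.mean(axis=1, keepdims=True)
    std = node_feats.std(axis=1, keepdims=True)
    normed = (node_feats - mean) / (std + eps)   # normalization per node
    activated = np.maximum(normed, 0.0)          # ReLU activation
    pooled = activated.mean(axis=0)              # mean pooling over nodes
    return pooled @ proj                         # compression to d dimensions

# Example: two nodes with two features each, compressed to one value.
print(fuse_global_spatial([[1.0, 3.0], [2.0, 6.0]], np.array([[1.0], [1.0]])))
```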
[0084] S306. Extract time-series features from the surgical video based on a temporal convolutional network.
[0085] Specifically, the implementation process and principle of S306 and S103 are the same, and will not be repeated here.
[0086] S307. The local spatial features, the global spatial relationships, and the time series features are fused to obtain the target fused features.
[0087] Specifically, the implementation process and principle of S307 and S104 are the same, and will not be repeated here.
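As a non-limiting illustration, fusion based on bilinear pooling followed by a fully connected layer might be sketched as follows. The disclosure does not detail how three feature vectors are combined, so this sketch assumes that the local and global spatial features are concatenated and then bilinearly pooled (outer product) with the time-series features; the signed square root, the L2 normalization, and the name `bilinear_fuse` are likewise assumptions.

```python
import numpy as np

def bilinear_fuse(local_f, global_f, temporal_f, fc_weight, fc_bias):
    """Fuse spatial and temporal features with bilinear pooling.

    local_f, global_f, temporal_f: 1-D feature vectors.
    fc_weight: (out_dim, (len(local_f) + len(global_f)) * len(temporal_f)).
    fc_bias:   (out_dim,).
    """
    spatial = np.concatenate([local_f, global_f])
    # Bilinear pooling: every spatial feature interacts with every temporal one.
    outer = np.outer(spatial, temporal_f).ravel()
    signed_sqrt = np.sign(outer) * np.sqrt(np.abs(outer))
    norm = np.linalg.norm(signed_sqrt)
    pooled = signed_sqrt / norm if norm > 0 else signed_sqrt
    # Fully connected layer producing the target fused feature.
    return fc_weight @ pooled + fc_bias

# Example with tiny vectors and a fixed (untrained) fully connected layer.
fused = bilinear_fuse(
    np.array([1.0]), np.array([0.0]), np.array([4.0, 0.0]),
    fc_weight=np.eye(4)[:2], fc_bias=np.zeros(2),
)
print(fused)
```

The outer product grows with the product of the input dimensions, so a practical system would likely compress the inputs first or use a compact bilinear variant.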
[0088] S308. A multilayer perceptron model is used as the classifier for the surgical stage.
[0089] In this step, a multilayer perceptron (MLP) is used as the classifier for the surgical stage.
[0090] S309. Input the target fusion features into the surgical stage classifier for classification, and obtain the classification result of the surgical stage through the surgical stage classifier.
[0091] In this step, the target fusion features are passed to the surgical stage classifier to classify the surgical stage and obtain the classification result of the surgical stage.
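As a non-limiting illustration, a multilayer perceptron classifier applied to the target fused feature might be sketched as follows. The single hidden layer, the ReLU activation, the softmax output, and the names `softmax` and `mlp_classify` are assumptions for this sketch; in practice the weights would be learned from labeled surgical videos.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mlp_classify(x, w1, b1, w2, b2):
    """One-hidden-layer MLP: returns (predicted stage index, class probabilities).

    x:  (D,) target fused feature.
    w1: (H, D), b1: (H,)  hidden layer parameters.
    w2: (C, H), b2: (C,)  output layer parameters, C = number of stages.
    """
    hidden = np.maximum(w1 @ x + b1, 0.0)
    probs = softmax(w2 @ hidden + b2)
    return int(np.argmax(probs)), probs

# Example with fixed weights and three hypothetical surgical stages.
stage, probs = mlp_classify(
    np.array([1.0, 2.0]),
    w1=np.eye(2), b1=np.zeros(2),
    w2=np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]), b2=np.zeros(3),
)
print(stage, probs)
```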
[0092] S310. Combining label smoothing technology, the classification results of the surgical stage are optimized to determine the category of the surgical stage.
[0093] To improve classification accuracy, label smoothing is employed to optimize the classification results for surgical stages and determine the category of the surgical stage. Smoothing the labels reduces overfitting of the classifier model and improves the model's generalization ability on unseen data.
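As a non-limiting illustration, label smoothing is commonly applied to the training targets of the classifier, replacing each one-hot label with a softened distribution. The sketch below assumes that common formulation; the smoothing factor `epsilon = 0.1` and the function names are hypothetical and are not taken from this disclosure.

```python
import numpy as np

def smooth_labels(label, num_classes, epsilon=0.1):
    """Return a smoothed target: (1 - epsilon) on the true class plus
    epsilon spread uniformly over all classes."""
    target = np.full(num_classes, epsilon / num_classes)
    target[label] += 1.0 - epsilon
    return target

def smoothed_cross_entropy(probs, label, epsilon=0.1):
    """Cross-entropy between predicted probabilities and the smoothed target."""
    probs = np.asarray(probs, dtype=float)
    target = smooth_labels(label, len(probs), epsilon)
    return float(-(target * np.log(probs + 1e-12)).sum())

# Example: four hypothetical surgical stages, true stage index 1.
print(smooth_labels(1, 4))
print(smoothed_cross_entropy([0.25, 0.25, 0.25, 0.25], 1))
```

Because the target never assigns full probability to one class, the classifier is discouraged from producing overconfident predictions, which is the overfitting reduction described above.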
[0094] In this embodiment of the disclosure, local spatial features of each frame of a surgical video are obtained using a local spatial feature extraction network, and a graph structure including nodes and edges is constructed. Each frame image is converted into nodes of the graph structure, where nodes represent surgical operation areas and edges represent the spatial relationships between these areas; each node contains local spatial features and temporal information. The features of each node are then aggregated using a graph neural network to obtain enhanced spatial features for each node, and the global spatial relationships between surgical operation areas are output based on these enhanced spatial features. Next, time-series features of the surgical video are extracted using a temporal convolutional network, and the local spatial features, global spatial relationships, and time-series features are fused to obtain a target fused feature. A multilayer perceptron model serves as the surgical stage classifier: the target fused feature is input into the classifier to obtain the classification result of the surgical stage, and the result is optimized in combination with label smoothing to determine the category of the surgical stage. This method can efficiently extract spatiotemporal features from surgical videos, comprehensively capture the dynamic process of surgical stages, improve the accuracy of surgical stage classification, and handle complex surgical operation steps. Graph structures can effectively capture spatial relationships during surgery, such as the relative positions of instruments and tissues, as well as the temporal dependencies between regions. Fusing global spatial features based on a graph neural network captures the overall process of the surgical operation rather than only local details, which benefits the identification and stage division of complex surgeries.
[0095] Figure 8 is a schematic diagram of the surgical stage determination device provided in an embodiment of this disclosure. The surgical stage determination device may be an electronic device as described in the above embodiments, or it may be a component or assembly within that electronic device. The surgical stage determination device provided in this embodiment of the disclosure can execute the processing flow provided in the surgical stage determination method embodiments. As shown in Figure 8, the surgical stage determination device 40 includes: an acquisition unit 41, an obtaining unit 42, an extraction unit 43, a fusion unit 44, and a determination unit 45; wherein the acquisition unit 41 is used to acquire the local spatial features of each frame of the surgical video based on a local spatial feature extraction network; the obtaining unit 42 is used to perform global spatial feature modeling based on a global spatial feature fusion network to obtain the global spatial relationship between surgical operation areas; the extraction unit 43 is used to extract the time-series features of the surgical video based on a temporal convolutional network; the fusion unit 44 is used to fuse the local spatial features, the global spatial relationship, and the time-series features to obtain a target fused feature; and the determination unit 45 is used to classify the surgical stage based on the target fused feature to determine the category of the surgical stage.
[0096] Optionally, the local spatial feature extraction network employs a multi-layer convolutional neural network, wherein the convolutional kernels of the multi-layer convolutional neural network have different sizes. When the acquisition unit 41 acquires the local spatial features of each frame of the surgical video based on the local spatial feature extraction network, it is specifically used to: extract the local features of each frame image through the multi-layer convolutional neural network; select the size of the convolutional kernel based on the surgical operation area of each frame image to capture the detailed regions and structural features of each frame image; and generate output feature maps of multiple convolutional layers based on the local features, the detailed regions, and the structural features of each frame image; wherein the output feature maps include low-level feature maps, mid-level feature maps, and high-level feature maps.
[0097] Optionally, when the obtaining unit 42 performs global spatial feature modeling based on a global spatial feature fusion network to obtain the global spatial relationship between surgical operation areas, it is specifically used for: constructing a graph structure, the graph structure including nodes and edges; converting each frame image into nodes of the graph structure, where nodes represent surgical operation areas and edges represent the spatial relationship between various surgical operation areas, and each node contains local spatial features and temporal information; aggregating the features of each node through a graph neural network to obtain the enhanced spatial features of each node; and outputting the global spatial relationship between surgical operation areas based on the enhanced spatial features of each node.
[0098] Optionally, the temporal convolutional network includes multiple convolutional layers, each of which includes causal convolution and dilated convolution. When the extraction unit 43 extracts the time-series features of the surgical video based on the temporal convolutional network, it is specifically used to: input the local spatial features and global spatial relationships as temporal data into the temporal convolutional network and extract time-dependent features through causal convolution; expand the receptive field through dilated convolution to capture long-term dependencies; and output time-series features based on the time-dependent features and the long-term dependencies, wherein the time-series features are used to characterize the evolutionary features of the surgical stage at each time point.
[0099] Optionally, when the fusion unit 44 fuses the local spatial features, the global spatial relationship, and the time series features to obtain the target fused features, it is specifically used to: input the local spatial features, the global spatial relationship, and the time series features into a fully connected layer, and perform feature fusion based on bilinear pooling through the fully connected layer to obtain the target fused features.
[0100] Optionally, when the determination unit 45 classifies the surgical stage based on the target fused features to determine the category of the surgical stage, it is specifically used to: use a multilayer perceptron model as the surgical stage classifier; input the target fused features into the surgical stage classifier for classification and obtain the classification result of the surgical stage through the surgical stage classifier; and optimize the classification result of the surgical stage in combination with label smoothing to determine the category of the surgical stage.
[0101] The surgical stage determination device of the embodiment shown in Figure 8 can be used to execute the technical solution of the above method embodiments. Its implementation principle and technical effect are similar, and will not be repeated here.
[0102] This disclosure also provides a surgical stage determination system, which includes a local spatial feature extraction network, a global spatial feature fusion network, a temporal convolutional network, a fully connected layer, and a surgical stage classifier. The local spatial feature extraction network is used to acquire the local spatial features of each frame in the surgical video. The global spatial feature fusion network is used to perform global spatial feature modeling to obtain the global spatial relationship between surgical operation areas. The temporal convolutional network is used to extract the temporal series features of the surgical video. The fully connected layer is used to fuse the local spatial features, the global spatial relationship, and the temporal series features to obtain a target fused feature. The surgical stage classifier is used to classify the surgical stage based on the target fused feature to determine the category of the surgical stage.
[0103] Figure 9 is a schematic diagram of the structure of an electronic device according to an embodiment of this disclosure. Referring to Figure 9, it shows a structure suitable for implementing the electronic device 600 in the embodiments of this disclosure. The electronic device shown in Figure 9 is merely an example and should not be construed as limiting the functionality and scope of the embodiments disclosed herein.
[0104] As shown in Figure 9, electronic device 600 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 601, which can perform various appropriate actions and processes according to a program stored in read-only memory (ROM) 602 or a program loaded from storage device 608 into random access memory (RAM) 603 to implement the surgical stage determination method as described in the embodiments of this disclosure. Various programs and data required for the operation of electronic device 600 are also stored in RAM 603. The processing device 601, ROM 602, and RAM 603 are interconnected via bus 604. Input/output (I/O) interface 605 is also connected to bus 604.
[0105] Typically, the following devices can be connected to I/O interface 605: input devices 606 including, for example, touchscreens, touchpads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, etc.; output devices 607 including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices 608 including, for example, magnetic tapes, hard disks, etc.; and communication devices 609. Communication device 609 allows electronic device 600 to communicate with other devices over wireless or wired connections to exchange data. Although Figure 9 shows an electronic device 600 with various devices, it should be understood that not all of the devices shown are required to be implemented or present. More or fewer devices may alternatively be implemented or present.
[0106] In particular, according to embodiments of this disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of this disclosure include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts, thereby implementing the surgical stage determination method as described above. In such embodiments, the computer program can be downloaded and installed from a network via a communication device 609, or installed from a storage device 608, or installed from a ROM 602. When the computer program is executed by the processing device 601, it performs the functions defined in the methods of embodiments of this disclosure.
[0107] It should be noted that the computer-readable medium described in this disclosure can be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium can be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this disclosure, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In this disclosure, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wires, optical fibers, RF (radio frequency), etc., or any suitable combination thereof.
[0108] In some implementations, clients and servers can communicate using any currently known or future-developed network protocol such as HTTP (Hypertext Transfer Protocol) and can interconnect with digital data communication (e.g., communication networks) of any form or medium. Examples of communication networks include local area networks (“LANs”), wide area networks (“WANs”), the Internet (e.g., the Internet of Things), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.
[0109] The aforementioned computer-readable medium may be included in the aforementioned electronic device; or it may exist independently and not assembled into the electronic device.
[0110] The aforementioned computer-readable medium carries one or more programs that, when executed by the electronic device, cause the electronic device to: obtain local spatial features of each frame of the surgical video based on a local spatial feature extraction network; perform global spatial feature modeling based on a global spatial feature fusion network to obtain the global spatial relationship between surgical operation areas; extract time-series features of the surgical video based on a temporal convolutional network; fuse the local spatial features, the global spatial relationship, and the time-series features to obtain target fused features; and classify the surgical stage based on the target fused features to determine the category of the surgical stage.
[0111] Optionally, when the one or more programs described above are executed by the electronic device, the electronic device may also perform other steps described in the above embodiments.
[0112] Computer program code for performing the operations of this disclosure can be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network—including a local area network (LAN) or a wide area network (WAN)—or can be connected to an external computer (e.g., via the Internet using an Internet service provider).
[0113] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0114] The units described in the embodiments of this disclosure can be implemented in software or hardware. In some cases, the name of a unit does not constitute a limitation on the unit itself.
[0115] The functions described above in this document can be performed at least in part by one or more hardware logic components. For example, exemplary types of hardware logic components that can be used, without limitation, include: field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), system-on-a-chip (SoCs), complex programmable logic devices (CPLDs), and so on.
[0116] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
[0117] The above description is merely a preferred embodiment of this disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of this disclosure is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the above-described concept. For example, it covers technical solutions formed by replacing the above features with (but not limited to) technical features disclosed in this disclosure that have similar functions.
[0118] Furthermore, while the operations are described in a specific order, this should not be construed as requiring these operations to be performed in the specific order shown or in a sequential order. In certain environments, multitasking and parallel processing may be advantageous. Similarly, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of this disclosure. Certain features described in the context of individual embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented individually or in any suitable sub-combination in multiple embodiments.
[0119] Although the subject matter has been described using language specific to structural features and / or methodological logic, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. Rather, the specific features and actions described above are merely illustrative examples of implementing the claims.
Claims
1. A surgical stage determination method, characterized in that the method includes: obtaining local spatial features of each frame of a surgical video based on a local spatial feature extraction network; performing global spatial feature modeling based on a global spatial feature fusion network to obtain the global spatial relationship between surgical operation areas; extracting time-series features of the surgical video based on a temporal convolutional network; fusing the local spatial features, the global spatial relationship, and the time-series features to obtain target fused features; and classifying the surgical stage based on the target fused features to determine the category of the surgical stage.
2. The method of claim 1, wherein the local spatial feature extraction network employs a multi-layer convolutional neural network, and the convolutional kernels of the multi-layer convolutional neural network have different sizes; and obtaining the local spatial features of each frame of the surgical video based on the local spatial feature extraction network includes: extracting local features of each frame image using the multi-layer convolutional neural network; selecting the size of the convolutional kernel based on the surgical operation area of each frame image to capture the detailed regions and structural features of each frame image; and generating output feature maps of multiple convolutional layers based on the local features, the detailed regions, and the structural features of each frame image; wherein the output feature maps include a low-level feature map, a mid-level feature map, and a high-level feature map.
3. The method of claim 1, wherein performing global spatial feature modeling based on the global spatial feature fusion network to obtain the global spatial relationship between surgical operation areas includes: constructing a graph structure, the graph structure including nodes and edges; converting each frame image into nodes of the graph structure, where nodes represent surgical operation areas, edges represent the spatial relationships between the surgical operation areas, and each node contains local spatial features and temporal information; aggregating the features of each node through a graph neural network to obtain the enhanced spatial features of each node; and outputting the global spatial relationship between the surgical operation areas based on the enhanced spatial features of each node.
4. The method of claim 1, wherein the temporal convolutional network includes multiple convolutional layers, each of which includes causal convolution and dilated convolution; and extracting the time-series features of the surgical video based on the temporal convolutional network includes: inputting the local spatial features and the global spatial relationship as temporal data into the temporal convolutional network and extracting time-dependent features through causal convolution; expanding the receptive field through dilated convolution to capture long-term dependencies; and outputting time-series features based on the time-dependent features and the long-term dependencies, wherein the time-series features are used to characterize the evolution of the surgical stage at each time point.
5. The method of claim 1, wherein fusing the local spatial features, the global spatial relationship, and the time-series features to obtain the target fused features includes: inputting the local spatial features, the global spatial relationship, and the time-series features into a fully connected layer, and performing feature fusion based on bilinear pooling through the fully connected layer to obtain the target fused features.
6. The method of claim 1, wherein classifying the surgical stage based on the target fused features to determine the category of the surgical stage includes: using a multilayer perceptron model as a surgical stage classifier; inputting the target fused features into the surgical stage classifier for classification and obtaining the classification result of the surgical stage through the surgical stage classifier; and optimizing the classification result of the surgical stage in combination with label smoothing to determine the category of the surgical stage.
7. A surgical phase determination apparatus, characterized by including: an acquisition unit, used to acquire the local spatial features of each frame of a surgical video based on a local spatial feature extraction network; an obtaining unit, used to perform global spatial feature modeling based on a global spatial feature fusion network to obtain the global spatial relationship between surgical operation areas; an extraction unit, used to extract time-series features of the surgical video based on a temporal convolutional network; a fusion unit, used to fuse the local spatial features, the global spatial relationship, and the time-series features to obtain target fused features; and a determination unit, used to classify the surgical stage based on the target fused features to determine the category of the surgical stage.
8. A surgical stage determination system, characterized in that the surgical stage determination system includes a local spatial feature extraction network, a global spatial feature fusion network, a temporal convolutional network, a fully connected layer, and a surgical stage classifier; the local spatial feature extraction network is used to obtain the local spatial features of each frame of a surgical video; the global spatial feature fusion network is used to perform global spatial feature modeling to obtain the global spatial relationship between surgical operation areas; the temporal convolutional network is used to extract time-series features of the surgical video; the fully connected layer is used to fuse the local spatial features, the global spatial relationship, and the time-series features to obtain target fused features; and the surgical stage classifier is used to classify the surgical stage based on the target fused features to determine the category of the surgical stage.
9. An electronic device, comprising: a memory; a processor; and a computer program; wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of any one of claims 1-6.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, the method of any one of claims 1-6 is implemented.