A surgical instrument segmentation method combining YOLOv8 and SAM2
By combining the YOLOv8 and SAM2 models and introducing frame-by-frame detection and state management mechanisms, the method addresses drift and error accumulation in instrument segmentation for surgical videos, thereby improving the stability and accuracy of instrument segmentation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HARBIN UNIV OF SCI & TECH
- Filing Date
- 2026-03-19
- Publication Date
- 2026-06-19
AI Technical Summary
Existing medical image segmentation methods are prone to mask drift, local omissions, and error accumulation in surgical video scenarios, making it difficult to guarantee the temporal continuity and stability of the segmentation results.
By combining YOLOv8 and SAM2 models, stable spatial priors are provided through frame-by-frame detection, lightweight geometric state maintenance, and missing state management mechanisms, thereby improving the accuracy and stability of instrument segmentation in surgical videos.
In complex surgical videos, the temporal continuity and stability of instrument segmentation were achieved, improving the accuracy and robustness of the segmentation results.
Figure CN122244440A_ABST
Description
Technical Field
[0001] This invention belongs to the fields of medical image processing, computer vision and deep learning technology, and specifically relates to a surgical instrument segmentation method using YOLOv8 combined with SAM2.
Background Technology
[0002] With the development of robot-assisted surgery and endoscopic imaging technology, instrument recognition and segmentation based on surgical videos has become an important research direction in the field of computer-assisted surgery. Instrument segmentation results provide fundamental visual information for tasks such as intraoperative navigation, instrument tracking, motion analysis, and intelligent decision support. Therefore, achieving accurate and stable instrument segmentation in complex surgical scenarios is of great significance. Existing medical image segmentation methods have achieved certain results in static image tasks, but in surgical video scenarios, due to the slender structure, large morphological changes, and high movement speed of instruments, as well as the presence of tissue occlusion, specular reflection, smoke interference, and short-term disappearance and reappearance of targets in the surgical environment, existing methods are prone to mask drift, local omissions, and error accumulation in long video sequence segmentation, making it difficult to guarantee the temporal continuity and stability of the segmentation results.
[0003] SAM2, as a general segmentation model for images and videos, possesses strong segmentation capabilities and cross-frame propagation ability. However, when directly applied to surgical video scenarios, its subsequent frame segmentation results mainly rely on initialization cues and internal temporal propagation mechanisms. In complex endoscopic videos, it remains susceptible to changes in target morphology and external interference, leading to a decline in segmentation performance. Therefore, it is necessary to propose a method to improve the stability of SAM2 instrument segmentation in surgical videos without modifying its original network structure and training weights.
Summary of the Invention
[0004] To overcome the problems of drift, loss, and error accumulation in surgical instrument segmentation during long-term propagation in existing technologies, this invention proposes a surgical instrument segmentation method combining YOLOv8 and SAM2. Without altering the SAM2 network structure and training weights, this method introduces frame-by-frame detection and observation, lightweight geometric state maintenance, and missing state management mechanisms to provide stable spatial priors for the segmentation propagation process, thereby improving the accuracy and stability of instrument segmentation in long-sequence surgical videos.
[0005] The present invention solves the above-mentioned technical problems by adopting the following technical solution:
[0006] A surgical instrument segmentation method combining YOLOv8 and SAM2 includes:
[0007] Step a: Construct the EndoVis 2017 and EndoVis 2018 datasets for training and evaluation, process the surgical videos and their corresponding annotations in the datasets, and divide them according to the evaluation protocol.
[0008] Step b: Construct a detection-guided segmentation framework based on YOLOv8 and SAM2. YOLOv8 is used to output the instrument detection box frame by frame, and SAM2 generates the segmentation result of the current frame according to the prompt information and uses its internal memory mechanism to propagate across frames.
[0009] Step c: In the video sequence reasoning process, a lightweight geometric state maintenance module is introduced to integrate the current frame detection observation and historical state information to obtain a stable geometric representation of the target, and use it as a spatial prior for segmentation propagation.
[0010] Step d: Utilize the missing state management mechanism to update and maintain the target state in consecutive frames, and output the final instrument segmentation mask by combining the video segmentation propagation results of SAM2.
[0011] In the surgical instrument segmentation method using YOLOv8 combined with SAM2 described above, step b specifically includes the following steps:
[0012] Step b1: Select SAM2 as the main model for video surgical instrument segmentation, and use its video segmentation capabilities and cross-frame information propagation capabilities to complete instrument segmentation.
[0013] Step b2: Select YOLOv8 as the frame-by-frame target detection module. During the inference process, output the instrument detection box and its corresponding confidence in the current video frame. In the initialization phase, use the detection box as a prompt input to SAM2 to complete the initial segmentation of the instrument target.
[0014] Step b3: In subsequent frames, YOLOv8 continues to provide spatial observation information of the instrument, and SAM2 performs cross-frame segmentation propagation based on the prompt information and internal memory mechanism.
[0015] In the surgical instrument segmentation method using YOLOv8 combined with SAM2 described above, step c specifically includes the following steps:
[0016] Step c1: Extract the corresponding bounding box from the segmentation mask output by SAM2 at the current time, and use it as the instantaneous geometric observation information of the current frame.
[0017] Step c2: Use the lightweight geometry state maintenance module to perform state modeling on the target center position and target frame scale, and perform fusion and update based on the geometric consistency between the current observation and the historical state.
[0018] Step c3: Use a smooth update mechanism to maintain the stability of the target position and scale between consecutive frames to obtain a continuous and stable target geometric state representation, and use this state as the spatial prior for SAM2 cross-frame segmentation propagation.
[0019] In the surgical instrument segmentation method using YOLOv8 combined with SAM2 described above, step d specifically includes the following steps:
[0020] Step d1: Introduce a missing state management mechanism. When an observation is successfully matched, update the state and clear the missing count. When an observation is missing or the matching fails, accumulate the missing count. When the missing count exceeds a preset threshold, mark the target as missing.
[0021] Step d2: Based on the missing state management results and using the spatial prior information corresponding to the valid targets, guide the cross-frame segmentation propagation of SAM2 and output the surgical instrument segmentation results.
Attached Figure Description
[0022] Figure 1 is an overall framework diagram of the method proposed in this invention.
[0023] Figure 2 is a schematic diagram of the geometrically guided tracking framework in this invention.
[0024] Figure 3 shows qualitative comparisons of the surgical instrument segmentation results of the method of this invention and the comparative method under different complex and challenging scenarios: (a) instrument morphology change; (b) a new instrument entering the field of view; (c) incomplete initialization; (d) segmentation error propagation; and (e) an instrument disappearing for a long time and then briefly reappearing.
Detailed Implementation
[0025] To make the objectives, technical solutions, and features of this invention clearer, the specific embodiments of this invention will be described in further detail below with reference to the accompanying drawings.
[0026] The specific embodiments of the present invention will now be described with reference to the accompanying drawings. This embodiment provides a surgical instrument segmentation method using YOLOv8 combined with SAM2. Its overall framework is shown in Figure 1, and its geometrically guided tracking framework is shown in Figure 2. The specific steps include:
[0027] Step a: Construct a surgical video instrument segmentation dataset for training and evaluation. Preprocess the video frame images and their corresponding instrument annotations, defining instrument pixels as foreground and the remaining regions as background, thus forming a binary instrument segmentation task. The dataset includes the EndoVis 2017 and EndoVis 2018 datasets.
[0028] Step b: Construct a detection-guided inference framework consisting of a YOLOv8 detection module and a SAM2 video segmentation module. During the initialization phase, YOLOv8 outputs the instrument detection box, which serves as a prompt input to SAM2 to complete the target initialization in the first frame. In subsequent video frames, YOLOv8 continuously provides spatial observation information of the instrument. Specifically:
[0029] Step b1: Select SAM2 as the main model for video segmentation. SAM2 generates the segmentation mask for the current frame from the given prompts and uses its internal memory mechanism to complete cross-frame propagation.
[0030] Step b2: Select YOLOv8 as the object detection module. Let the input frame of the video sequence at time $t$ be $I_t$. YOLOv8 outputs the detection set on this frame:
[0031] $$\mathcal{D}_t = \{(b_t^i, c_t^i)\}_{i=1}^{N_t}$$
[0032] where $N_t$ is the number of detection boxes in the current frame, $b_t^i$ is the $i$-th detection box, and $c_t^i$ is the corresponding detection confidence. During the initialization phase, a detection box from $\mathcal{D}_t$ is input to SAM2 as a prompt to obtain the initial target mask.
[0033] Step b3: In subsequent frames, SAM2 outputs a segmentation mask based on the propagation state of the previous time step, while YOLOv8 simultaneously provides detection observations for the current frame, forming a joint inference mechanism of "detection observation and segmentation propagation" to provide external geometric constraints for the target state update in subsequent time steps.
[0034] Step c: During video sequence reasoning, a lightweight geometric state maintenance module is introduced to model the geometric state of the target across consecutive frames. By fusing current frame detection observations with historical state information, a stable target state representation is obtained. Specifically:
[0035] Step c1: Let the target mask output by SAM2 at time $t$ be $M_t$. The corresponding bounding box is extracted from the mask and denoted as the instantaneous geometric observation:
[0036] $$z_t = \mathcal{B}(M_t)$$
[0037] where $\mathcal{B}(\cdot)$ represents the mapping operation from a mask to a bounding box.
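The mask-to-bounding-box mapping of step c1 can be sketched as follows. This is a minimal illustration, assuming the mask is a 2D grid of 0/1 values and that boxes use inclusive pixel indices; the function names `mask_to_bbox` and `bbox_to_cxcywh` are hypothetical and do not come from the source.

```python
from typing import List, Optional, Tuple

# (x_min, y_min, x_max, y_max) with inclusive pixel indices (assumed convention)
Box = Tuple[int, int, int, int]


def mask_to_bbox(mask: List[List[int]]) -> Optional[Box]:
    """Return the tightest box around nonzero pixels, or None for an empty mask."""
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


def bbox_to_cxcywh(box: Box) -> Tuple[float, float, float, float]:
    """Convert an inclusive corner box to (center_x, center_y, width, height)."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0 + 1, y1 - y0 + 1
    return (x0 + w / 2.0, y0 + h / 2.0, float(w), float(h))
```

Returning `None` for an empty mask lets the caller treat the frame as a missing observation rather than fabricating a degenerate box.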
[0038] Step c2: Define the smoothed geometric state of the target at time $t$ as
[0039] $$S_t = (x_t, y_t, w_t, h_t)$$
[0040] where $(x_t, y_t)$ is the center location of the target, and $w_t$ and $h_t$ are the width and height of the target bounding box, respectively. The current observation $o_t$, expressed in the same form, is fused with the historical state, and the target state is updated using an exponential moving average:
[0041] $$S_t = \alpha S_{t-1} + (1 - \alpha)\, o_t$$
[0042] where $\alpha$ is the smoothing coefficient. When the current frame observation successfully matches the historical state, the state is updated according to the above formula; when the observation is missing or the match fails, the state remains unchanged ($S_t = S_{t-1}$) or only a weak update is performed.
[0043] Step c3: Through the above smooth update mechanism, a continuous and stable state representation of the target is maintained under slow changes in position and scale, and this state is used as the geometric prior for SAM2 cross-frame mask propagation to reduce mask drift in long-term propagation.
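The smoothed state update of steps c2 and c3 can be illustrated with a short sketch. It assumes the convention "new state = alpha × old state + (1 − alpha) × observation" and an illustrative default `alpha=0.8`; neither the convention nor the value is stated in the source, and the names `GeoState` and `ema_update` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class GeoState:
    """Smoothed geometric state: box center, width, and height."""
    cx: float
    cy: float
    w: float
    h: float


def ema_update(prev: GeoState,
               obs: Optional[Tuple[float, float, float, float]],
               alpha: float = 0.8) -> GeoState:
    """Blend the previous state with an observation (cx, cy, w, h).

    Assumed convention: alpha weights the history, (1 - alpha) the observation.
    A missing or unmatched observation (None) leaves the state unchanged.
    """
    if obs is None:
        return prev

    def blend(old: float, new: float) -> float:
        return alpha * old + (1.0 - alpha) * new

    return GeoState(blend(prev.cx, obs[0]), blend(prev.cy, obs[1]),
                    blend(prev.w, obs[2]), blend(prev.h, obs[3]))
```

Under this convention, a larger `alpha` makes the prior steadier but slower to follow fast instrument motion, so the value would need tuning per dataset.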
[0044] Step d: Based on detection observations, smoothed geometric states, geometric consistency matching, and missing state management, the target state is maintained frame by frame, and the optimized instrument binary segmentation result is output. Specifically:
[0045] Step d1: For the current frame, YOLOv8 outputs the detection set $\mathcal{D}_t$, SAM2 outputs the segmentation mask $M_t$, the observation box $z_t$ is extracted from the mask, and the tracker maintains the state $S_{t-1}$. The optimal matching detection box for the target at the current moment is defined as:
[0046] $$b_t^* = \arg\max_{b_t^i \in \mathcal{D}_t} \Phi(b_t^i, S_{t-1})$$
[0047] where $\Phi(\cdot)$ represents the geometric consistency matching function.
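[0048] Step d2: The geometric consistency matching function is jointly composed of IoU overlap, normalized center distance, a scale variation constraint, and detection confidence, and can be expressed as:
[0049] $$\Phi(b_t^i, S_{t-1}) = \lambda_1\,\mathrm{IoU}(b_t^i, S_{t-1}) - \lambda_2\, d_c(b_t^i, S_{t-1}) - \lambda_3\, C_s(b_t^i, S_{t-1}) + \lambda_4\, c_t^i$$
[0050] where $\lambda_1, \dots, \lambda_4$ are the weighting coefficients, $d_c$ represents the normalized center distance, $C_s$ represents the scale change cost, and $c_t^i$ is the detection confidence. If the following condition is met for a preset matching threshold $\tau$:
[0051] $$\Phi(b_t^*, S_{t-1}) \ge \tau$$
[0052] then the detection observation is considered to have successfully matched the current state, and $b_t^*$ is used to update the target state; otherwise, it is treated as an unmatched observation.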
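The geometric consistency matching of steps d1 and d2 can be sketched as below. The source names the four ingredients (IoU overlap, normalized center distance, scale cost, detection confidence) but not their exact form, so the diagonal normalization, the log-ratio scale cost, the signs, the weights, and the threshold `tau` are all assumptions; every function name is hypothetical.

```python
import math
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def center_distance(a: Box, ref: Box) -> float:
    """Center distance normalized by the reference box diagonal (assumed form)."""
    dx = (a[0] + a[2]) / 2 - (ref[0] + ref[2]) / 2
    dy = (a[1] + a[3]) / 2 - (ref[1] + ref[3]) / 2
    diag = math.hypot(ref[2] - ref[0], ref[3] - ref[1])
    return math.hypot(dx, dy) / diag if diag > 0 else float("inf")


def scale_cost(a: Box, ref: Box) -> float:
    """Absolute log-ratio of widths plus heights (assumed form of the scale cost)."""
    eps = 1e-6
    rw = max(a[2] - a[0], eps) / max(ref[2] - ref[0], eps)
    rh = max(a[3] - a[1], eps) / max(ref[3] - ref[1], eps)
    return abs(math.log(rw)) + abs(math.log(rh))


def match_score(det: Box, conf: float, state: Box,
                weights: Tuple[float, float, float, float] = (1.0, 0.5, 0.5, 0.2)) -> float:
    """Weighted score: overlap and confidence reward, distance and scale penalize."""
    l1, l2, l3, l4 = weights  # illustrative weights, not from the source
    return (l1 * iou(det, state) - l2 * center_distance(det, state)
            - l3 * scale_cost(det, state) + l4 * conf)


def best_match(dets: List[Tuple[Box, float]], state: Box,
               tau: float = 0.3) -> Optional[int]:
    """Index of the highest-scoring detection, or None if it scores below tau."""
    if not dets:
        return None
    scores = [match_score(box, conf, state) for box, conf in dets]
    best = max(range(len(dets)), key=scores.__getitem__)
    return best if scores[best] >= tau else None
```

With these assumed weights, a confident detection far from the tracked state is rejected, while a slightly shifted box with lower confidence is accepted.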
[0053] Step d3: When an observation match is successful, update the state:
[0054] $$S_t = \alpha S_{t-1} + (1 - \alpha)\, b_t^*$$
[0055] and use $S_t$ as the spatial prior constraint for SAM2 segmentation propagation in the current frame. When the observation does not meet the matching condition, let:
[0056] $$S_t = S_{t-1}$$
[0057] or
[0058] apply only a weak update to the state,
[0059] to reduce the interference of abnormal observations on state estimation.
[0060] Step d4: To address situations such as occlusion, short-term missed detections, and instruments temporarily leaving the field of view, a missing state management mechanism is introduced. Let the missing counter be $m_t$. Its update rule is as follows:
[0061] $$m_t = \begin{cases} 0, & \text{if the observation is matched} \\ m_{t-1} + 1, & \text{otherwise} \end{cases}$$
[0062] When
[0063] $$m_t > T_{\text{miss}}$$
[0064] where $T_{\text{miss}}$ is the preset threshold, the current target is marked as lost, and no further effective updates are made using the current observations.
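The missing state management of step d4 can be sketched as a small counter object. The threshold value and the choice to keep a target lost once it has been marked lost are assumptions for illustration, since the source does not describe re-initialization; `MissTracker` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass
class MissTracker:
    """Counts consecutive unmatched frames and flags the target as lost."""
    max_missing: int = 5   # hypothetical preset threshold
    count: int = 0
    lost: bool = False

    def step(self, matched: bool) -> bool:
        """Update with one frame's match result; return True while the target is valid."""
        if self.lost:
            # Assumption: a lost target is not revived by later observations.
            return False
        if matched:
            self.count = 0
        else:
            self.count += 1
        if self.count > self.max_missing:
            self.lost = True
        return not self.lost
```

The counter tolerates short gaps such as a brief occlusion, because a single successful match resets it before the threshold is exceeded.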
[0065] Step d5: This invention uses region-based and boundary-based metrics to evaluate the segmentation results. The region overlap IoU is defined as:
[0066] $$\mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|}$$
[0067] where $P$ is the predicted mask and $G$ is the ground-truth mask. The Dice coefficient is defined as:
[0068] $$\mathrm{Dice} = \frac{2\,|P \cap G|}{|P| + |G|}$$
[0069] The boundary accuracy is calculated from the precision and recall of the predicted boundary against the true boundary, and the comprehensive index reflects both the quality of region overlap and the boundary fit. CIoU is computed frame by frame: the IoU of each frame is calculated first and then averaged at the video sequence and dataset levels.
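The region-based metrics of step d5 can be computed as in the sketch below, which uses the standard definitions of IoU and Dice on binary masks stored as nested lists. Returning 1.0 when both masks are empty is an assumed convention, and `mean_iou` shows only a simple per-frame average rather than the exact sequence-level and dataset-level protocol; all function names are hypothetical.

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[int]]


def _counts(pred: Mask, gt: Mask) -> Tuple[int, int, int]:
    """Return (intersection, predicted pixels, ground-truth pixels)."""
    inter = p_sum = g_sum = 0
    for prow, grow in zip(pred, gt):
        for p, g in zip(prow, grow):
            p, g = bool(p), bool(g)
            inter += p and g
            p_sum += p
            g_sum += g
    return inter, p_sum, g_sum


def binary_iou(pred: Mask, gt: Mask) -> float:
    """|P ∩ G| / |P ∪ G|; defined as 1.0 when both masks are empty (assumption)."""
    inter, p_sum, g_sum = _counts(pred, gt)
    union = p_sum + g_sum - inter
    return inter / union if union else 1.0


def dice(pred: Mask, gt: Mask) -> float:
    """2 |P ∩ G| / (|P| + |G|); defined as 1.0 when both masks are empty (assumption)."""
    inter, p_sum, g_sum = _counts(pred, gt)
    denom = p_sum + g_sum
    return 2 * inter / denom if denom else 1.0


def mean_iou(pairs: Sequence[Tuple[Mask, Mask]]) -> float:
    """Average the per-frame IoU over (prediction, ground truth) pairs."""
    return sum(binary_iou(p, g) for p, g in pairs) / len(pairs)
```

Because Dice counts the intersection twice, it is never lower than IoU on the same pair of masks, which is consistent with the Dice values in the results being higher than the Binary IoU values.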
[0070] Step d6: By performing frame-by-frame detection and observation, smooth update of geometric state, geometric consistency matching, and management of missing states, stable temporal constraints are provided for the segmentation propagation process of SAM2, thereby reducing the mask drift and short-term target loss problems in long sequence propagation, and finally outputting the optimized instrument binary segmentation result.
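The frame-by-frame procedure summarized in step d6 can be sketched end to end as follows. This is a simplified illustration, not the patented implementation: `detect` and `segment` are hypothetical stand-in callables rather than the YOLOv8 or SAM2 APIs, matching is reduced to an IoU gate, initialization from the most confident detection is an assumption, and the parameter values are illustrative.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Detection = Tuple[Box, float]            # (box, confidence)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def run_sequence(frames: Sequence[Any],
                 detect: Callable[[Any], List[Detection]],
                 segment: Callable[[Any, Box], Any],
                 alpha: float = 0.7,
                 iou_gate: float = 0.3,
                 max_missing: int = 3) -> List[Optional[Any]]:
    """Run detection, matching, smoothing, and miss counting over a sequence.

    Returns one segmentation result per frame, or None where no valid prior exists.
    """
    state: Optional[Box] = None
    missing = 0
    lost = False
    outputs: List[Optional[Any]] = []
    for frame in frames:
        if lost:
            outputs.append(None)
            continue
        dets = detect(frame)
        if state is None:
            if not dets:
                outputs.append(None)
                continue
            # Assumption: initialize from the most confident detection.
            state = max(dets, key=lambda d: d[1])[0]
        else:
            gated = [box for box, _ in dets if iou(box, state) >= iou_gate]
            if gated:
                best = max(gated, key=lambda box: iou(box, state))
                # Exponential moving average on box coordinates.
                state = tuple(alpha * s + (1 - alpha) * o
                              for s, o in zip(state, best))
                missing = 0
            else:
                missing += 1
                if missing > max_missing:
                    lost = True
                    outputs.append(None)
                    continue
        outputs.append(segment(frame, state))
    return outputs
```

Smoothing corner coordinates is equivalent to smoothing the center and size, because the two representations are related by a linear transform. In a real pipeline, `segment` would wrap the SAM2 video predictor and `detect` would wrap a YOLOv8 model.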
[0071] The binary segmentation results for surgical instruments on the EndoVis 2017 and EndoVis 2018 datasets are shown in Tables 1 and 2, respectively, with Binary IoU and Dice as evaluation metrics. On the EndoVis 2017 dataset, the Binary IoU reaches 90.44% and the Dice reaches 94.86%; on the EndoVis 2018 dataset, the Binary IoU reaches 91.24% and the Dice reaches 95.21%. The present invention outperforms U-Net, TernausNet, MF-TAPNet, and GSAM+Cutie on both metrics, and achieves the highest Binary IoU among the SAM-based methods on EndoVis 2018; on EndoVis 2017 its Binary IoU is slightly below that of SAM2-Image with ground-truth box prompts (90.97%). These results indicate that the present invention can accurately perform binary instrument segmentation in complex endoscopic surgical videos with good stability and robustness. Figure 3 shows the surgical instrument segmentation results of the method in complex and challenging scenarios, which further demonstrates good temporal continuity and propagation stability in complex surgical video scenarios.
[0072] Table 1. Surgical instrument segmentation results of different methods on the EndoVis 2017 dataset.
[0073]
| Method | Binary IoU (%) | Dice (%) |
| --- | --- | --- |
| U-Net | 75.44 | 84.37 |
| TernausNet | 83.60 | 90.01 |
| MF-TAPNet | 87.56 | 93.37 |
| GSAM+Cutie | 88.00 | 93.00 |
| SAM Box(GT) | 89.19 | - |
| SAM2-video(GT BOX) | 75.41 | 82.06 |
| SAM2-Image(GT Box) | 90.97 | - |
| Ours | 90.44 | 94.86 |
[0074] Table 2. Surgical instrument segmentation results of different methods on the EndoVis 2018 dataset.
[0075]
| Method | Binary IoU (%) | Dice (%) |
| --- | --- | --- |
| GSAM+Cutie | 81.00 | 88.00 |
| SAM Box(GT) | 89.35 | - |
| SAM2-video(GT BOX) | 89.43 | 93.88 |
| SAM2-Image(GT Box) | 90.18 | - |
| Ours | 91.24 | 95.21 |
Claims
1. A surgical instrument segmentation method using YOLOv8 combined with SAM2, comprising the following steps: Step a: Construct the EndoVis 2017 and EndoVis 2018 datasets for training and evaluation, process the surgical videos and their corresponding annotations in the datasets, and divide them according to the evaluation protocol. Step b: Construct a detection-guided segmentation framework based on YOLOv8 and SAM2. YOLOv8 is used to output the instrument detection box frame by frame, and SAM2 generates the segmentation result of the current frame according to the prompt information and uses its internal memory mechanism to propagate across frames. Step c: In the video sequence reasoning process, a lightweight geometric state maintenance module is introduced to integrate the current frame detection observation and historical state information to obtain a stable geometric representation of the target, and use it as a spatial prior for segmentation propagation. Step d: Utilize the missing state management mechanism to update and maintain the target state in consecutive frames, and output the final instrument segmentation mask by combining the video segmentation propagation results of SAM2.
2. The surgical instrument segmentation method using YOLOv8 combined with SAM2 as described in claim 1, characterized in that, Step b specifically involves: Step b1: Select SAM2 as the main model for video surgical instrument segmentation, and use its video segmentation capabilities and cross-frame information propagation capabilities to complete instrument segmentation. Step b2: Select YOLOv8 as the frame-by-frame target detection module. During the inference process, output the instrument detection box and its corresponding confidence in the current video frame. In the initialization phase, use the detection box as a prompt input to SAM2 to complete the initial segmentation of the instrument target. Step b3: In subsequent frames, YOLOv8 continues to provide spatial observation information of the instrument, and SAM2 performs cross-frame segmentation propagation based on the prompt information and internal memory mechanism.
3. The surgical instrument segmentation method using YOLOv8 combined with SAM2 as described in claim 1, characterized in that, Step c specifically involves: Step c1: Extract the corresponding bounding box from the segmentation mask output by SAM2 at the current time, and use it as the instantaneous geometric observation information of the current frame. Step c2: Use the lightweight geometry state maintenance module to perform state modeling on the target center position and target frame scale, and perform fusion and update based on the geometric consistency between the current observation and the historical state. Step c3: Use a smooth update mechanism to maintain the stability of the target position and scale between consecutive frames to obtain a continuous and stable target geometric state representation, and use this state as the spatial prior for SAM2 cross-frame segmentation propagation.
4. The surgical instrument segmentation method using YOLOv8 combined with SAM2 as described in claim 1, characterized in that, Step d specifically involves: Step d1: Introduce a missing state management mechanism. When an observation is successfully matched, update the state and clear the missing count. When an observation is missing or the matching fails, accumulate the missing count. When the missing count exceeds a preset threshold, mark the target as missing. Step d2: Based on the missing state management results and using the spatial prior information corresponding to the valid targets, guide the cross-frame segmentation propagation of SAM2 and output the surgical instrument segmentation results.