A real-time small target detection method, system and device for microscopic sperm extraction
By improving the YOLOv11m model and combining it with cross-frame consistency processing and optical flow motion consistency verification, the invention addresses the detection of sparse small targets in microscopic sperm extraction, achieving real-time detection with high recall, low redundancy, and high stability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHINA JILIANG UNIV
- Filing Date
- 2026-03-23
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies cannot effectively detect sparse small targets in real time during microscopic sperm extraction, resulting in high false negative rates and poor robustness, especially in low contrast and weak motion conditions where accurate positioning is difficult.
An improved YOLOv11m model is adopted. Convolutional attention modules are inserted into the backbone and neck networks and a lightweight secondary reflow path is introduced, which improves detection accuracy and stability. The model is combined with cross-frame consistency processing and optical flow motion consistency verification.
It enables the detection of sparse sperm under complex backgrounds and weak motility conditions, reduces the false negative rate, improves the accuracy and robustness of detection, and meets the real-time operation requirements of microscopic sperm retrieval.
[Figure: abstract drawing, CN122244861A_ABST]
Abstract
Description
Technical Field
[0001] This invention belongs to the field of computer vision and specifically relates to a method, system, and device for real-time detection of small targets in microscopic sperm extraction.
Background Technology
[0002] In assisted reproductive medicine, microdissection testicular sperm extraction (micro-TESE) is a procedure that requires retrieving extremely sparse mature sperm from complex testicular tissue. Because these small targets are only about 3–5 μm in diameter and typically appear under the microscope with low contrast, against a complex background, and in sparse numbers, doctors must search field by field with the naked eye. The procedure is time-consuming, carries a high risk of missed detection, and is highly subjective.
[0003] With the development of artificial intelligence and computer vision technologies, researchers are attempting to apply small target detection algorithms to microscopic video scenes to assist or partially replace manual searches. Existing work can be broadly categorized as follows. One category is medical image recognition methods based on image segmentation and classification. For example, convolutional neural network structures such as U-Net are used to perform pixel-level segmentation or region recognition of rare sperm under low-magnification microscopic fields. These methods can achieve high accuracy offline, but they are mostly based on static images and lack the ability to model video temporal sequences, so they easily fail in real-time surgical scenarios due to factors such as microscope tremor and focal plane fluctuations. Another category is engineering methods based on real-time target detection frameworks. For example, algorithms such as YOLO are used for rapid detection and localization of small targets in microscopic videos, demonstrating good real-time performance and robustness in standardized tasks such as routine semen analysis. However, these studies typically target laboratory video samples with high sperm density and active motility, where the detection environment is relatively stable. In contrast, in microscopic sperm retrieval, the target density is extremely low, the movement is weak or even static, and complex background interference from tissue cells and debris makes existing methods unsuitable for direct transfer, leading to false positives and false negatives. Furthermore, some studies have attempted to combine motion information (such as optical flow analysis and trajectory tracking) to improve temporal consistency in video tasks. However, in microscopic settings, stage movement, brightness fluctuations, and focal plane jumps often introduce non-target motion. Existing optical flow methods lack specific constraints and easily misinterpret jitter as target motion, so the output is not sufficiently stable.
[0004] While existing technologies provide feasible evidence for the automatic identification of sparse small targets in microscopic videos, they still suffer from several drawbacks. Firstly, under low contrast and weak motion conditions, the detector's sensitivity is insufficient, making it difficult to effectively capture sparsely distributed small targets, resulting in a high false negative rate. Secondly, due to the lack of cross-frame consistency constraints, the same small target is easily identified repeatedly in consecutive video frames, causing flickering interference in the prompts and resulting in redundant and distorted statistical results. Furthermore, existing methods are not robust to unavoidable operational interferences such as instrument jitter and rapid focal plane switching during microsurgery, making it difficult to maintain the stability of the detection results. In summary, existing technologies are limited by these deficiencies in sensitivity, consistency, and robustness, resulting in the inability to accurately detect the location of small targets in real time.
Summary of the Invention
[0005] To address the problem of the inability to accurately detect the location of small targets in real time, this invention provides a method, system, and device for real-time detection of small targets in microscopic sperm extraction techniques.
[0006] To achieve the above objectives, the present invention provides the following technical solution: A first aspect of this invention provides a method for real-time detection of small targets in microscopic sperm extraction, comprising: Real-time video frames from the intraoperative microscope are acquired to obtain the target image to be detected; The target image is input into a pre-trained target detection model, which outputs candidate target images. The target detection model is based on the YOLOv11m model, and a convolutional attention module is inserted after the C3 module of the backbone network and the neck network of the YOLOv11m model. A lightweight secondary backflow path is introduced into the neck network to act on P3. The lightweight secondary backflow path feeds back to P3 step by step through upsampling and concatenation of semantic features to obtain enhanced P3 features. The process of inputting the target image into a pre-trained target detection model and outputting candidate target images includes the following steps: The target image is convolved and downsampled by the backbone network to extract multi-scale features, forming three-scale features P3, P4, and P5. After the three-scale features are fused from top to bottom and reflowed from bottom to top by the neck network, the fused P4 and P5 are fed back to P3 through the lightweight secondary reflow path after upsampling and stitching, resulting in an enhanced P3. The fused P4, P5, and enhanced P3 are then input into the corresponding detection head to generate candidate target images corresponding to the target image. The candidate target image is subjected to cross-frame consistency processing and optical flow motion consistency verification to obtain the final recognition image; Based on the final recognized image, the location of the target to be detected is determined.
[0007] A second aspect of the present invention provides a real-time detection device for small targets in microscopic sperm extraction, comprising: The acquisition module is used to acquire real-time video frames from the intraoperative microscope to obtain the target image to be detected; The processing module is used to input the target image into a pre-trained target detection model and output candidate target images. The target detection model is based on the YOLOv11m model. A convolutional attention module is inserted after the C3 module of the backbone network and the neck network of the YOLOv11m model. A lightweight secondary backflow path is introduced into the neck network to act on P3. The lightweight secondary backflow path feeds back to P3 step by step through semantic feature upsampling and concatenation to obtain enhanced P3 features. The process of inputting the target image into a pre-trained target detection model and outputting candidate target images includes the following steps: The target image is convolved and downsampled by the backbone network to extract multi-scale features, forming three-scale features P3, P4, and P5. After the three-scale features are fused from top to bottom and reflowed from bottom to top by the neck network, the fused P4 and P5 are fed back to P3 through the lightweight secondary reflow path after upsampling and stitching, resulting in an enhanced P3. The fused P4, P5, and enhanced P3 are then input into the corresponding detection head to generate candidate target images corresponding to the target image. The verification module is used to perform cross-frame consistency processing and optical flow motion consistency verification on the candidate target image to obtain the final recognition image. The determination module is used to determine the location of the target to be detected based on the final recognized image.
[0008] A third aspect of the present invention provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the method described in any one of claims 1 to 7.
[0009] A fourth aspect of the present invention provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the method described in any one of claims 1 to 7.
[0010] The real-time detection method for small targets in microscopic sperm extraction provided by this invention has the following beneficial effects: The target image of the target to be detected is acquired to determine its location within the image. This target image is then input into a pre-trained target detection model, which outputs candidate target images. The target detection model comprises a backbone network, a neck network, and a multi-scale detection head module. By redesigning the feature enhancement path at multiple output positions in the neck region and deeply embedding convolutional attention modules (CBAM) multiple times, the visibility and representation ability of low-contrast, weakly motile sperm are significantly improved. These modules fully utilize the information in the target image, thereby enhancing the accuracy of target detection. A secondary reflow path is used to specifically enhance the semantic information from higher-level P4 and P5 layers. The resulting enhanced P3, with its targeted secondary reflow design, retains its detail advantages while gaining strong semantic discriminative power, directly and effectively addressing the core pain point of weak features and easy missed detection of small targets in complex environments. By deduplicating, filtering, and verifying the candidate target images output by the multi-scale detection head module, it is ensured that when the same sperm is stably tracked in consecutive frames, it will not be re-highlighted and counted in every frame. This avoids inflated statistical counts caused by repeated counting of the same target, ensuring the reliability of sperm counting results and maintaining high stability of the system in real surgical environments. This solves the problem of not being able to accurately detect small targets in real time.
Description of the Drawings
[0011] To more clearly illustrate the embodiments and design schemes of the present invention, the accompanying drawings required for this embodiment will be briefly described below. The drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0012] Figure 1 is a flowchart of a method for real-time detection of small targets in microscopic sperm extraction provided in an embodiment of the present invention;
Figure 2 is a schematic diagram of the structure of a target detection model provided in an embodiment of the present invention;
Figure 3 is a schematic diagram of a process for real-time detection of small targets in microscopic sperm extraction provided by an embodiment of the present invention;
Figure 4 is a schematic diagram of a small target real-time detection device for microscopic sperm extraction provided in an embodiment of the present invention;
Figure 5 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention;
Figure 6 is a schematic diagram of the coordinates and confidence information of sperm output by a target detection model provided in an embodiment of the present invention;
Figure 7 is a schematic diagram of sperm distribution in a sparse environment provided by an embodiment of the present invention;
Figure 8 is a schematic diagram of a detection result in a low signal-to-noise ratio scenario provided by an embodiment of the present invention;
Figure 9 is a schematic diagram of sperm detection results between two adjacent frame images provided in an embodiment of the present invention.
Detailed Implementation
[0013] In the following description, specific details such as particular system architectures and techniques are set forth for illustrative purposes and not for limitation, in order to provide a thorough understanding of the embodiments of the invention. However, those skilled in the art will understand that the invention can be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods are omitted so as not to obscure the description of the invention with unnecessary detail.
[0014] The terms "first," "second," etc., used in the specification and claims of this invention are used to distinguish similar objects and not to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that embodiments of the invention can be implemented in orders other than those illustrated or described herein, and the objects distinguished by "first," "second," etc., are generally of the same class and the number of objects is not limited; for example, a first object can be one or more. Furthermore, in the specification and claims, "and / or" indicates at least one of the connected objects, and the character " / " generally indicates that the preceding and following objects are in an "or" relationship.
[0015] Furthermore, it should be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes that element.
[0016] This invention addresses the shortcomings of existing technologies by proposing a real-time small target detection algorithm for sparse sperm in microsurgical sperm retrieval procedures, which solves the following four types of technical problems:
1. Detection issues under low contrast and weak motility: By optimizing the small target real-time detector, the ability to extract features of sperm fine structure is enhanced, ensuring timely and accurate identification of sparse sperm even under complex backgrounds and weak motility conditions.
[0017] 2. Cross-frame repeated identification and redundancy of prompts: Spatiotemporal constraint rules are designed to determine the consistency of a detected target across frames. The identification event is triggered only when the target first appears or when a significant displacement occurs, which prevents the same sperm from being repeatedly identified as a new target in consecutive frames and reduces prompt flickering and counting deviation.
[0018] 3. Stability issues under surgical environment interference: An optical flow motion consistency verification mechanism is introduced to bind the detection results to kinematic features. Amplitude and direction consistency are used to filter out false detections caused by microscope jitter, focus jumps, or afterimages, thereby improving robustness and reliability.
[0019] 4. Real-time performance and closed-loop output: Each module adopts a cascaded architecture to achieve a complete closed loop of detection, deduplication, and verification under a latency constraint of less than 100 milliseconds, ensuring stable and reliable results and truly meeting the clinical needs of real-time surgical operations.
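Item 2 above (an identification event only on first appearance or significant displacement) can be illustrated with a minimal sketch. The class name `CrossFrameDeduplicator`, the nearest-centre matching rule, and the thresholds `match_radius`, `move_radius`, and `max_gap` are assumptions introduced for illustration; the source does not specify the association rule or any numeric values.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Track:
    track_id: int
    cx: float        # centre at the most recent match
    cy: float
    ax: float        # anchor: centre at the last emitted event
    ay: float
    last_frame: int


class CrossFrameDeduplicator:
    """Hypothetical spatiotemporal rule: emit an event only for a new target
    or for a known target that has moved far from its last reported position."""

    def __init__(self, match_radius=12.0, move_radius=40.0, max_gap=30):
        self.match_radius = match_radius  # px, same-target association distance
        self.move_radius = move_radius    # px, "significant displacement"
        self.max_gap = max_gap            # frames before a track is forgotten
        self.tracks = []
        self.next_id = 0

    def update(self, frame_idx, centres):
        """centres: list of (x, y) detections. Returns [(track_id, reason)]."""
        # Forget tracks that have not been seen for too long.
        self.tracks = [t for t in self.tracks
                       if frame_idx - t.last_frame <= self.max_gap]
        events, used = [], set()
        for x, y in centres:
            best, best_d = None, self.match_radius
            for t in self.tracks:
                if t.track_id in used:
                    continue
                d = hypot(x - t.cx, y - t.cy)
                if d <= best_d:
                    best, best_d = t, d
            if best is None:
                t = Track(self.next_id, x, y, x, y, frame_idx)
                self.next_id += 1
                self.tracks.append(t)
                used.add(t.track_id)
                events.append((t.track_id, "first_appearance"))
                continue
            best.cx, best.cy, best.last_frame = x, y, frame_idx
            used.add(best.track_id)
            if hypot(x - best.ax, y - best.ay) >= self.move_radius:
                best.ax, best.ay = x, y   # reset the anchor after reporting
                events.append((best.track_id, "significant_displacement"))
        return events
```

In this sketch a slowly drifting target stays silent until its accumulated displacement from the last reported position reaches `move_radius`, which is one way to avoid both per-frame re-highlighting and inflated counts.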
[0020] In summary, the goal of this invention is to establish a real-time algorithm framework that balances high recall, low redundancy, and strong stability in the actual surgical scenario of micro-TESE, breaking through the bottlenecks of existing technologies and providing a practical artificial intelligence solution for intraoperative retrieval of sparse sperm.
[0021] The following describes in detail, with reference to the accompanying drawings, a method, apparatus, electronic device, and readable storage medium for real-time detection of small targets for microscopic sperm extraction according to embodiments of the present invention.
[0022] Figure 1 is a flowchart of a real-time detection method for small targets in microscopic sperm extraction provided by an embodiment of the present invention. This method can be executed by a terminal device or a server. As shown in Figure 1, the method includes:
[0023] Step S1: Acquire real-time video frames from the intraoperative microscope and obtain the target image corresponding to the real-time video frame.
[0024] The target to be detected refers to a small target whose location needs to be determined using a target detection model. As an example, in a micro-TESE (microdissection testicular sperm extraction) scenario, the target to be detected is a single, viable, mature sperm cell extracted from the patient's testicular tissue, present within the microscope's field of view, and suitable for subsequent intracytoplasmic sperm injection (ICSI). Furthermore, the target image includes multiple consecutive frames or a continuous video clip. This target image can be acquired using an intraoperative microscope camera with a resolution of 1080p and a frame rate of 60 fps (which can be adjusted equivalently). Further, the acquired target image is resized according to the engineering configuration (kept consistent with the training size) to reduce the impact of scale jitter on subsequent features; the temporal frame number is preserved for cross-frame processing by the model.
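The resizing step and the preserved frame number can be sketched as follows. The aspect-preserving "letterbox" scheme, the 640-pixel square input, and the names `FrameMeta` and `letterbox_params` are assumptions for illustration only; the source states just that the size matches the training size and that the frame number is kept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameMeta:
    frame_idx: int   # temporal frame number, kept for cross-frame processing
    scale: float     # source pixels -> model input pixels
    pad_x: int       # left padding in the model input
    pad_y: int       # top padding in the model input
    src_w: int
    src_h: int


def letterbox_params(frame_idx, src_w, src_h, dst=640):
    """Compute the resize scale and padding that fit a src_w x src_h frame
    into a dst x dst model input without changing the aspect ratio."""
    scale = min(dst / src_w, dst / src_h)
    new_w = round(src_w * scale)
    new_h = round(src_h * scale)
    pad_x = (dst - new_w) // 2
    pad_y = (dst - new_h) // 2
    return FrameMeta(frame_idx, scale, pad_x, pad_y, src_w, src_h)


# A 1080p frame (1920 x 1080) scaled into an assumed 640 x 640 input.
meta = letterbox_params(frame_idx=7, src_w=1920, src_h=1080)
```

Using the same scale for every frame keeps apparent target size constant from frame to frame, which is the scale-jitter concern the paragraph raises.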
[0025] Step S2: Input the target image into the pre-trained target detection model to obtain the candidate target image output by the target detection model.
[0026] The target image is convolved and downsampled by the backbone network to extract multi-scale features, forming three-scale features P3, P4, and P5. After the three-scale features are fused from top to bottom and reflowed from bottom to top by the neck network, the fused P4 and P5 are fed back to P3 step by step through the lightweight secondary reflow path after upsampling and splicing to obtain the enhanced P3. The fused P4, P5 and the enhanced P3 are input into the corresponding detection head to generate the candidate target image corresponding to the target image.
[0027] As shown in Figure 2, the target detection model includes a backbone network, a neck network, a multi-scale detection head module, and a verification output module. This target detection model is based on the YOLOv11m model with six convolutional attention modules inserted, all connected after the C3 module. A secondary reflow path is inserted after the feature pyramid fusion unit corresponding to the YOLOv11m neck network. Based on the backbone network, multi-scale features are generated from the target image. These multi-scale features include P3, P4, and P5. These multi-scale features are processed by the feature pyramid fusion unit to obtain fused P3, P4, and P5 features. The fused P3 is then enhanced through the secondary reflow path to obtain an enhanced P3. The multi-scale detection head module is used to detect the enhanced P3 and the fused P4 and P5 features at different scales to obtain candidate target images corresponding to the target image.
[0028] As an example, the secondary reflow path is configured as follows: the lightweight secondary reflow path includes a first upsampling layer, a first concatenation layer, a first C3 layer, a second upsampling layer, a second concatenation layer, and a second C3 layer; the first upsampling layer upsamples the fused P5 features; the first concatenation layer concatenates the upsampled features with the fused P4 features; the first C3 layer performs convolution and feature fusion on the concatenated features; the second upsampling layer further upsamples the features; the second concatenation layer concatenates the features with the fused P3 features; finally, the second C3 layer performs convolution and feature fusion to output an enhanced P3 for small target detection.
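The layer sequence above can be traced at the level of tensor shapes. This is only a shape-bookkeeping sketch: the channel counts, the 640-pixel input (giving 80, 40, and 20 pixel feature maps), and the helper names are assumptions, and `c3` stands in for a real C3 block by changing only the channel count.

```python
def upsample2x(shape):
    """Nearest-neighbour 2x upsampling: channels unchanged, H and W doubled."""
    c, h, w = shape
    return (c, h * 2, w * 2)


def concat(a, b):
    """Channel-wise concatenation; spatial sizes must already agree."""
    assert a[1:] == b[1:], f"spatial mismatch: {a} vs {b}"
    return (a[0] + b[0], a[1], a[2])


def c3(shape, out_channels):
    """Placeholder for a C3 block: keeps H and W, maps to out_channels."""
    return (out_channels, shape[1], shape[2])


def secondary_reflow(p3, p4, p5, mid_channels, out_channels):
    """Trace fused P5 -> P4 -> P3 through the six layers of the reflow path."""
    x = upsample2x(p5)          # first upsampling layer
    x = concat(x, p4)           # first concatenation layer
    x = c3(x, mid_channels)     # first C3 layer
    x = upsample2x(x)           # second upsampling layer
    x = concat(x, p3)           # second concatenation layer
    return c3(x, out_channels)  # second C3 layer -> enhanced P3


# Assumed (channels, height, width) of the fused features for a 640 x 640 input.
enhanced_p3 = secondary_reflow(
    p3=(256, 80, 80), p4=(512, 40, 40), p5=(512, 20, 20),
    mid_channels=256, out_channels=256,
)
```

The trace makes the key property visible: the enhanced P3 keeps the P3 resolution while its inputs include features that originated at the P4 and P5 scales.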
[0029] Preferably, as shown in Figure 2, this improved model inserts six CBAM convolutional attention modules on top of YOLOv11m, located in layers 3 (256 channels), 6 (448 channels), 16 (512 channels), 20 (384 channels), 24 (512 channels), and 28 (768 channels). These CBAM modules immediately follow the C3k2 feature extraction or feature fusion modules to enhance feature representation capabilities. Layers 3 and 6 are located in the Backbone to enhance basic feature extraction, layers 16 and 20 are located in the Top-Down path of the Head to enhance top-down feature fusion, and layers 24 and 28 are located in the Bottom-Up path to enhance bottom-up feature aggregation. In addition, the model introduces a lightweight secondary reflow mechanism, which propagates the high-level semantic features of P5 down through layers 29-34 and fuses them with the intermediate layer features before sending them to the final P3 detection layer. The entire design inserts attention mechanisms at key nodes of multi-scale feature processing, thereby improving detection performance.
[0030] As an example, this model is based on a YOLO detection model with a deep embedded Convolutional Block Attention Module (CBAM), and further designs a lightweight secondary backflow path that operates only on P3. This path feeds back from high-level semantic features to the small target branch of P3 step by step, using a lightweight convolutional structure to reduce latency and avoiding repeated stacking of attention modules at the backflow node, thereby achieving targeted enhancement of sparse small targets.
[0031] Step S3: Perform cross-frame consistency processing and optical flow motion consistency verification on the candidate target image to obtain the final recognition image.
[0032] Specifically, the model-generated, overlaid visual recognition results (recognition images) are transformed into precise, quantified spatial coordinate information that can be used for clinical decision-making and surgical procedures. As an example, the final result is a recognition image with visual aids such as highlighted boxes and crosshairs. The pixel coordinates of these visual aids are parsed and mapped back to the true spatial coordinates of the original microscope field of view. The final output is the digitized location data of one or more targets to be detected.
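Mapping a detection from model-input pixels back to the original field of view can be sketched as the inverse of an aspect-preserving resize with padding. That preprocessing scheme, the function names, and the example numbers (a 1920 x 1080 frame placed in a 640 x 640 input with 140 pixels of top padding) are assumptions; the source does not state how the mapping is performed.

```python
def to_source_coords(box, scale, pad_x, pad_y, src_w, src_h):
    """Map an (x1, y1, x2, y2) box from model-input pixels to source pixels.

    scale is source -> input; pad_x / pad_y are the left / top padding that
    was added in the model input. Results are clipped to the source frame.
    """
    def unmap(x, y):
        sx = (x - pad_x) / scale
        sy = (y - pad_y) / scale
        return min(max(sx, 0.0), float(src_w)), min(max(sy, 0.0), float(src_h))

    x1, y1, x2, y2 = box
    sx1, sy1 = unmap(x1, y1)
    sx2, sy2 = unmap(x2, y2)
    return (sx1, sy1, sx2, sy2)


def box_centre(box):
    """Centre point of a box, e.g. where a crosshair would be drawn."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


# A box near the middle of the model input, mapped back to the 1080p frame.
src_box = to_source_coords((320, 320, 330, 330), scale=1 / 3,
                           pad_x=0, pad_y=140, src_w=1920, src_h=1080)
src_centre = box_centre(src_box)
```

Converting source pixels to physical units would additionally need a microscope calibration factor, which the source does not give and the sketch therefore leaves out.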
[0033] Furthermore, this invention utilizes video temporal information to eliminate repeated identification of the same target in consecutive frames through trajectory association and deduplication. It also analyzes scene motion patterns to distinguish between real targets and artifacts caused by operational jitter and focal length changes. Cross-frame processing ensures the stability of the output results; the same sperm is only indicated and counted once, greatly improving the doctor's visual experience and data accuracy. Optical flow verification efficiently filters out most false targets caused by environmental interference, such as tissue afterimages and jitter blur, ensuring the system maintains extremely high reliability in realistic surgical environments.
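One way to picture the optical flow verification is a per-candidate check on amplitude and direction consistency after removing the global (background) motion. The decision rule below, the function name `motion_consistent`, and all thresholds are assumptions for illustration; the source names only the two criteria and does not define how they are computed. Flow vectors are assumed to come from any dense optical flow estimator.

```python
from math import hypot
from statistics import median


def motion_consistent(box_flow, background_flow, max_global=3.0,
                      max_residual=8.0, min_coherence=0.7, still_eps=0.3):
    """box_flow / background_flow: lists of (dx, dy) flow vectors in px/frame.

    Returns True when the candidate's motion looks like a real target under
    the assumed rule, False when it looks like jitter or an afterimage.
    """
    # Global motion estimate: median background flow (stage or camera motion).
    gx = median(v[0] for v in background_flow)
    gy = median(v[1] for v in background_flow)
    if hypot(gx, gy) > max_global:
        return False                      # frame dominated by jitter or a focus jump

    # Motion of the candidate relative to the background.
    residual = [(dx - gx, dy - gy) for dx, dy in box_flow]
    mags = [hypot(dx, dy) for dx, dy in residual]
    amplitude = median(mags)
    if amplitude > max_residual:
        return False                      # implausibly fast: likely an artifact
    if amplitude < still_eps:
        return True                       # static relative to the tissue: keep

    # Direction consistency: length of the mean unit vector (1.0 = all aligned).
    units = [(dx / m, dy / m) for (dx, dy), m in zip(residual, mags) if m > 0]
    ux = sum(u[0] for u in units) / len(units)
    uy = sum(u[1] for u in units) / len(units)
    return hypot(ux, uy) >= min_coherence
```

Keeping candidates that are static relative to the background is a deliberate choice in this sketch, since the text notes that target movement may be weak or absent.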
[0034] Step S4: Determine the location of the target to be detected based on the final recognized image.
[0035] This invention provides a complete workflow for processing raw images using an end-to-end deep learning model specifically designed for sparse small object detection. Its core function is to transform raw microscopic video frames with low contrast, complex backgrounds, and interference into a stable, clear, and spatiotemporally consistent recognition result. In the highly challenging scenario of microscopic sperm retrieval, it achieves an automated sperm detection solution with high recall, low false positive rate, stable output, and real-time performance, thereby freeing doctors from the tedious and error-prone task of manual visual searching.
[0036] In some embodiments, the backbone network includes a first convolutional layer, a second convolutional layer, a first C3 module, a third convolutional layer, a first convolutional attention calculation module, a second C3 module, a second convolutional attention calculation module, a fourth convolutional layer, a third C3 module, a fifth convolutional layer, a fourth C3 module, a spatial pyramid pooling fast module, and a cross-stage partial spatial attention module connected in sequence. The backbone network is used to generate a first target feature map, a second target feature map, and a third target feature map corresponding to the target image based on the feature image corresponding to the target image. This includes: the target image features are processed sequentially based on the backbone network; the first target feature map is output after processing by the second convolutional attention calculation module; further, the second target feature map is output after processing by the third C3 module; and even further, the third target feature map is output after processing by the cross-stage partial spatial attention module.
[0037] Among them, the three feature maps P3, P4, and P5 generated in the target detection model of the present invention correspond to the first target feature map, the second target feature map, and the third target feature map in the embodiment. The first target feature map has the highest resolution, and the third target feature map has the lowest resolution.
[0038] According to the technical solution provided in the embodiments of the present invention, the backbone network achieves efficient and robust extraction of sparse small target features in microscopic images through a hierarchical and depth-enhanced architecture. Its specific process and effects are as follows:
[0039] The preprocessed input image first undergoes preliminary feature extraction and downsampling through the first and second convolutional layers. This process quickly focuses on basic visual patterns (such as edges and textures) and reduces the data dimensionality, laying the foundation for subsequent depth calculations.
[0040] Subsequently, the features are processed by the first C3 module, which performs powerful nonlinear feature transformation by integrating multiple convolutions and residual connections. Its core function is to enhance the model's representational ability, learn from simple features and combine them into more complex structures. After further downsampling by the third convolutional layer, the features enter the first convolutional attention calculation module. This module dynamically calibrates the feature response by parallel computing channel attention and spatial attention. Its function is to guide the network to focus on the channels and key spatial regions in the image that have richer information, thus achieving "attention focusing" in the early stages of the backbone network.
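The attention computation can be illustrated on a tiny feature map with plain Python lists. This sketch follows the commonly published CBAM formulation, in which channel attention is applied first and spatial attention second; that ordering, the weight shapes, and the function names are assumptions here and may differ from the embodiment.

```python
from math import exp


def sigmoid(v):
    return 1.0 / (1.0 + exp(-v))


def channel_attention(x, w1, w2):
    """x: [C][H][W]; w1: [C//r][C]; w2: [C][C//r]. Returns C weights in (0, 1)."""
    def mlp(v):  # shared two-layer MLP with ReLU
        hidden = [max(0.0, sum(w * a for w, a in zip(row, v))) for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    flat = [[p for row in ch for p in row] for ch in x]
    avg = [sum(f) / len(f) for f in flat]   # global average pooling
    mx = [max(f) for f in flat]             # global max pooling
    return [sigmoid(a + m) for a, m in zip(mlp(avg), mlp(mx))]


def spatial_attention(x, kernel):
    """kernel: [2][k][k], applied to stacked [channel-mean, channel-max] maps
    with zero padding. Returns an [H][W] map of weights in (0, 1)."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    mean_map = [[sum(x[c][i][j] for c in range(C)) / C for j in range(W)]
                for i in range(H)]
    max_map = [[max(x[c][i][j] for c in range(C)) for j in range(W)]
               for i in range(H)]
    k = len(kernel[0])
    r = k // 2
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            s = 0.0
            for m, fmap in enumerate((mean_map, max_map)):
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - r, j + dj - r
                        if 0 <= ii < H and 0 <= jj < W:
                            s += kernel[m][di][dj] * fmap[ii][jj]
            out[i][j] = sigmoid(s)
    return out


def cbam(x, w1, w2, kernel):
    """Reweight channels, then reweight spatial positions."""
    ca = channel_attention(x, w1, w2)
    x = [[[ca[c] * p for p in row] for row in ch] for c, ch in enumerate(x)]
    sa = spatial_attention(x, kernel)
    return [[[sa[i][j] * ch[i][j] for j in range(len(ch[0]))]
             for i in range(len(ch))] for ch in x]
```

Because both attention maps pass through a sigmoid, every output value is the input value scaled by a factor between 0 and 1, so the module can only suppress or preserve responses, never amplify them.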
[0041] Subsequently, the feature flow passes through the second C3 module and the second convolutional attention calculation module for deeper feature transformation and secondary attention optimization. The purpose of this stage is to consolidate and strengthen intermediate features, making the network more sensitive to specific patterns of sperm morphology. By further purifying the feature map, it ensures that key target information is preserved and enhanced when the features are passed to the back end of the network.
[0042] Next, the fourth convolutional layer performs crucial downsampling, reducing the feature map resolution to 1 / 16 of the original image. Its output is processed by the third C3 module and used as the first target feature map. The purpose of this layer is to preserve relatively rich spatial details. The feature stream continues to be downsampled by the fifth convolutional layer, then processed by the fourth C3 module, and output as the second target feature map (P4). The purpose of this layer is to balance semantic information and spatial details.
[0043] Finally, the features are processed by the Spatial Pyramid Pooling Fast (SPPF) module and the cross-stage partial spatial attention (C2PSA) module to output the third target feature map (P5). The SPPF module efficiently aggregates multi-scale contextual information, enabling the network to understand the target's representation in different receptive fields; in the standard YOLO implementation this is done by applying one pooling kernel several times in series, which covers the same receptive fields as parallel pooling kernels of different scales. The C2PSA module establishes global spatial dependencies, allowing any point on the feature map to interact with all locations in the entire map. The combined effect of these two modules is to give the model strong global context awareness and spatial reasoning capabilities. This allows the network to make accurate judgments based on global information even when the target is partially occluded, has extremely low contrast, or is in a complex background, greatly improving the model's robustness and discriminative power in real surgical scenarios.
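The multi-scale pooling in SPPF can be checked numerically on a single-channel map. The sketch assumes the usual SPPF layout (three stride-1 max-pooling steps with a 5 x 5 window applied in series) and omits the surrounding convolutions; the function names are illustrative.

```python
def max_pool_same(fmap, k):
    """Stride-1 max pooling with a k x k window. The window is clipped at the
    borders, which matches padding with negative infinity."""
    h, w, r = len(fmap), len(fmap[0]), k // 2
    return [[max(fmap[ii][jj]
                 for ii in range(max(0, i - r), min(h, i + r + 1))
                 for jj in range(max(0, j - r), min(w, j + r + 1)))
             for j in range(w)]
            for i in range(h)]


def sppf_branches(fmap, k=5):
    """Return the four maps that SPPF concatenates along the channel axis."""
    y1 = max_pool_same(fmap, k)
    y2 = max_pool_same(y1, k)
    y3 = max_pool_same(y2, k)
    return [fmap, y1, y2, y3]


# Deterministic test pattern standing in for one channel of a feature map.
fmap = [[(i * 7 + j * 13) % 17 for j in range(12)] for i in range(12)]
branches = sppf_branches(fmap)

# Serial 5 x 5 pooling reproduces parallel pooling with 9 x 9 and 13 x 13 windows.
same_as_9 = branches[2] == max_pool_same(fmap, 9)
same_as_13 = branches[3] == max_pool_same(fmap, 13)
```

Both comparisons hold because taking the maximum over a window of radius 2 twice is the same as taking it once over a window of radius 4, which is why the serial form gives the same context at lower cost.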
[0044] In summary, this backbone network embodiment, through the combined effect of hierarchical structure and deep embedding attention and context modeling mechanisms, ultimately produces high-quality multi-scale feature maps that contain detailed, semantic, and global information, thus ensuring the high-precision real-time detection of sparse small targets in the entire system.
[0045] In some embodiments, the feature pyramid fusion unit includes a first sampling layer, a first concat module, a fifth C3 module, a third convolutional attention calculation module, a second sampling layer, a second concat module, a sixth C3 module, a fourth convolutional attention calculation module, a sixth convolutional layer, a third concat module, a seventh C3 module, a fifth convolutional attention calculation module, a seventh convolutional layer, a fourth concat module, an eighth C3 module, and a sixth convolutional attention calculation module connected in sequence; the secondary backflow path includes a third sampling layer, a fifth concat module, a ninth C3 module, a fourth sampling layer, a sixth concat module, and a tenth C3 module connected in sequence.
[0046] According to the technical solution provided in the embodiments of the present invention, the feature pyramid fusion unit and the secondary backflow path jointly construct a multi-level, bidirectional information flow network through a series of sampling, concatenation, convolution, and attention enhancement operations, aiming to optimize and integrate multi-scale features from the backbone network. The core function of the feature pyramid fusion unit is to realize a standard top-down and bottom-up bidirectional fusion path. First, upsampling is performed through the first sampling layer to align the high-level semantic features with the low-level features in the spatial dimension. Then, channel concatenation is performed through the first concat module, allowing the high-level semantics and low-level details to initially merge. The fifth C3 module then performs nonlinear transformation and refinement on the concatenated features to learn an effective fusion representation. Next, the third convolutional attention calculation module is applied to the fused features; it adaptively recalibrates the channel and spatial weights, suppresses noise interference from background tissue, and highlights potential candidate target image regions. This top-down path effectively transmits rich semantic context information to the bottom layer of the feature pyramid.
[0047] To supplement the localization details of high-level features, this unit further opens a bottom-up path. Downsampling is performed through the second sampling layer, aligning the detailed low-level features with the mid-level features in scale, and then concatenating them with the mid-level features through the second concat module. The sixth C3 module then refines the concatenated result, and the fourth convolutional attention computation module further optimizes it, ensuring that while returning detailed information to higher levels, the discriminative nature of the features is maintained. Subsequently, through the collaborative operation of the sixth convolutional layer, the third concat module, the seventh C3 module, and the fifth convolutional attention computation module, further bottom-up feature propagation and enhancement are completed. Finally, through processing by the seventh convolutional layer, the fourth concat module, the eighth C3 module, and the sixth convolutional attention computation module, the fused P4 and P5 features are output. This bidirectional fusion mechanism ensures that each scale of the feature pyramid simultaneously possesses the spatial details required for accurate localization and the high-level semantics required for classification and recognition, significantly improving the model's ability to represent multi-scale targets, especially small targets with varying appearance and size.
[0048] Building upon this foundation, the secondary backflow path, as an enhancement design, specifically strengthens the key low-level features in the aforementioned fusion results. This path first upsamples the deeply fused, higher-level P5 features through the third sampling layer so that their resolution matches that of the fused P4 features. It then concatenates the two through the fifth concat module, injecting additional, more global contextual information into the lower-level features. The ninth C3 module learns and extracts more discriminative combined features from the concatenated result. To further aggregate information, the path continues through the fourth sampling layer and the sixth concat module, which bring in the fused P3 features. The tenth C3 module then performs the final feature extraction, outputting an enhanced P3 dedicated to small object detection. The secondary backflow path thus creates a dedicated semantic information delivery channel for the core small object detection branch (which typically relies on high-resolution low-level features), greatly enriching the feature content of this branch. This directly addresses the problem of missed detections of sparse small objects in complex backgrounds due to weak features, improving the system's recall rate.
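The data flow of the secondary backflow path can be sketched as shape bookkeeping, without any deep learning framework. This is an illustrative sketch only: the input size (640x640), the strides (8, 16, 32), the channel widths, and the output widths chosen for the two C3 stages are assumptions, and the helper names are hypothetical.

```python
from typing import NamedTuple


class Shape(NamedTuple):
    channels: int
    height: int
    width: int


def upsample2x(s: Shape) -> Shape:
    """Nearest-neighbour style 2x upsampling keeps channels, doubles H and W."""
    return Shape(s.channels, s.height * 2, s.width * 2)


def concat(a: Shape, b: Shape) -> Shape:
    """Channel concatenation requires identical spatial size."""
    if (a.height, a.width) != (b.height, b.width):
        raise ValueError("spatial sizes must match before concatenation")
    return Shape(a.channels + b.channels, a.height, a.width)


def c3(s: Shape, out_channels: int) -> Shape:
    """A C3-type block changes the channel count and keeps the spatial size."""
    return Shape(out_channels, s.height, s.width)


def secondary_backflow(p3: Shape, p4: Shape, p5: Shape) -> Shape:
    x = concat(upsample2x(p5), p4)  # third sampling layer + fifth concat module
    x = c3(x, p4.channels)          # ninth C3 module (output width assumed)
    x = concat(upsample2x(x), p3)   # fourth sampling layer + sixth concat module
    return c3(x, p3.channels)       # tenth C3 module -> enhanced P3


# Assumed fused feature shapes for a 640x640 input.
fused_p3 = Shape(256, 80, 80)
fused_p4 = Shape(512, 40, 40)
fused_p5 = Shape(512, 20, 20)
enhanced_p3 = secondary_backflow(fused_p3, fused_p4, fused_p5)
```

Under these assumptions the enhanced P3 keeps the resolution and width of the fused P3, so it can feed the small-target detection head unchanged.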
[0049] In some embodiments, before performing cross-frame consistency processing and optical flow motion consistency verification on the candidate target image to obtain the final recognized image, the method further includes: filtering the sparse small target locations corresponding to the candidate target image by confidence thresholding and non-maximum suppression to obtain the preliminary candidate target image corresponding to the target to be detected.
[0050] According to the technical solution provided in the embodiments of the present invention, after the model outputs candidate target images, confidence threshold filtering and non-maximum suppression are performed to refine the detection results. This step cleans up the original detection boxes output by the model by removing false positives and redundancy. Specifically, confidence threshold filtering discards prediction boxes with low confidence, for which the model's judgment is uncertain. These typically correspond to background noise, tissue artifacts, or obvious false detections, so this step eliminates a large number of false targets. Next, the non-maximum suppression algorithm processes the remaining overlapping detection boxes. Its principle is to select, from the densely packed candidate boxes around each target, the box with the highest confidence as the representative, while suppressing redundant boxes that highly overlap with it but have lower confidence.
[0051] The joint processing workflow improves the purity and reliability of the final output candidate target image set. It transforms the coarse results from the original model output, which contain a large amount of noise and duplicate bounding boxes, into a clean, accurate, and unique target list. This not only significantly reduces the system's false positive rate, providing doctors with highly reliable auxiliary information and avoiding being misled by erroneous prompts, but also reduces the processing burden on subsequent verification output modules, as the data sent to later stages has already undergone high purification. This ensures the efficiency and stability of the entire system's processing flow, a crucial prerequisite for ensuring high-quality output of the final recognized images.
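The two filtering stages described above can be sketched in a few lines of pure Python. This is a minimal greedy NMS under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, and the threshold values 0.25 and 0.45 are illustrative defaults, not values given in the specification.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_detections(dets, conf_thres=0.25, iou_thres=0.45):
    """dets is a list of (box, score); returns the surviving detections.

    Step 1: drop boxes whose score is below conf_thres.
    Step 2: greedy NMS, visiting boxes from highest to lowest score and
            keeping a box only if it does not overlap a kept box too much.
    """
    candidates = sorted(
        (d for d in dets if d[1] >= conf_thres),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, score in candidates:
        if all(iou(box, k[0]) <= iou_thres for k in kept):
            kept.append((box, score))
    return kept


raw = [
    ((10, 10, 20, 20), 0.90),
    ((11, 11, 21, 21), 0.60),  # heavy overlap with the first box
    ((50, 50, 58, 58), 0.70),
    ((80, 80, 90, 90), 0.10),  # below the confidence threshold
]
clean = filter_detections(raw)
```

With a single class (sperm), class-agnostic NMS as sketched here is sufficient; both thresholds would need tuning on the actual data.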
[0052] In some embodiments, before inputting the target image into a pre-trained target detection model and obtaining the recognized image output by the target detection model, the method further includes: subjecting the target image to contrast-limited adaptive histogram equalization to obtain the corresponding feature image.
[0053] According to the technical solution provided in the embodiments of the present invention, before the target image is input into the target detection model, it is first processed with Contrast-Limited Adaptive Histogram Equalization (CLAHE). This step improves image quality at the data source. Its benefit is particularly prominent in the scenario of microscopic sperm retrieval: it makes blurry sperm targets that have minimal contrast with the surrounding tissue background clearer and easier to identify.
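The "contrast-limited" part of CLAHE can be illustrated on a single tile. The sketch below is a simplification under stated assumptions: it builds the lookup table for one tile only (full CLAHE computes one table per tile and interpolates between neighbouring tiles), the clip limit of 4.0 is an illustrative value, and `clipped_equalization_lut` is a hypothetical name.

```python
def clipped_equalization_lut(pixels, clip_limit=4.0, levels=256):
    """Build a grey-level lookup table for one tile.

    The histogram is clipped at clip_limit times the mean bin height, the
    clipped excess is spread evenly over all bins, and the cumulative
    histogram is then used as the mapping.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    limit = max(1, int(clip_limit * len(pixels) / levels))
    excess = 0
    for v in range(levels):
        if hist[v] > limit:
            excess += hist[v] - limit
            hist[v] = limit

    # Redistribute the clipped counts so the total stays unchanged.
    share, remainder = divmod(excess, levels)
    for v in range(levels):
        hist[v] += share + (1 if v < remainder else 0)

    total = len(pixels)
    lut, cumulative = [], 0
    for v in range(levels):
        cumulative += hist[v]
        lut.append(round((levels - 1) * cumulative / total))
    return lut


# A low-contrast tile: all grey levels lie between 100 and 109.
tile = [100 + (i % 10) for i in range(400)]
lut = clipped_equalization_lut(tile)
enhanced = [lut[p] for p in tile]
```

On this tile the grey-level spread grows from 9 to 40 levels; without the clip limit the same ten levels would be stretched across almost the full range, which is the over-amplification of noise that the limit is meant to prevent.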
[0054] In some embodiments, performing cross-frame consistency processing and optical flow motion consistency verification on the candidate target image to obtain the final recognized image includes the following. If the intersection over union (IoU) of the first preliminary candidate target image in the current frame and the second preliminary candidate target image in the previous frame is greater than or equal to a first preset threshold, and the center point displacement is less than a second preset threshold, then the first preliminary candidate target image and the second preliminary candidate target image are determined to be the same target to be detected. The optical flow field of the target image in the corresponding consecutive frames is calculated. If the optical flow pattern of the preliminary candidate target image region is consistent with the optical flow pattern of the surrounding background region in terms of direction and amplitude, the candidate is determined to be a false detection and filtered out, yielding the optical flow field determination result. The recognition image corresponding to the target image is determined based on the optical flow field determination result.
[0055] According to the technical solution provided in the embodiments of the present invention, the verification output module performs a set of precise logical judgments combining spatiotemporal information to purify and confirm candidate target images. This process first performs cross-frame consistency processing, the core function of which is to use the temporal continuity of the video to determine the uniqueness and persistence of the target. Specifically, if a candidate target image in the current frame highly overlaps spatially with a target in the previous frame and its position is stable, it is determined to be the same entity. Next, the module performs optical flow motion consistency verification, which distinguishes real sperm from artifacts caused by manipulation from the perspective of motion patterns. By calculating the dense optical flow field between consecutive frames and comparing the consistency between the local motion of the candidate target image and the global background motion, it can effectively identify false targets that are not moving on their own, but rather caused by the overall movement of the microscope stage or by focal length jitter.
[0056] This greatly improves the system's robustness and anti-interference ability in real surgical environments, significantly reduces the false detection rate caused by unavoidable minor vibrations during operation, and ensures that every target marked in the final output recognition image is a highly reliable real sperm that has undergone spatiotemporal dual verification, greatly enhancing the reliability of clinical applications.
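The two decision rules can be written compactly as follows. In this sketch, B_t and B_{t-1} denote the candidate boxes in the current and previous frame, c_t and c_{t-1} their center points, and tau_1 and tau_2 the first and second preset thresholds. The flow test is one possible formalization only: the use of cosine similarity for direction, a magnitude difference for amplitude, and the thresholds tau_theta and tau_m are assumptions, since the specification does not fix these forms.

```latex
\[
\mathrm{IoU}(B_t, B_{t-1}) = \frac{|B_t \cap B_{t-1}|}{|B_t \cup B_{t-1}|},
\qquad
d = \lVert c_t - c_{t-1} \rVert_2
\]
\[
\text{same target} \iff \mathrm{IoU}(B_t, B_{t-1}) \ge \tau_1 \ \text{and}\ d < \tau_2
\]
\[
\text{false detection} \iff
\frac{\bar{v}_{\mathrm{box}} \cdot \bar{v}_{\mathrm{bg}}}
     {\lVert \bar{v}_{\mathrm{box}} \rVert \, \lVert \bar{v}_{\mathrm{bg}} \rVert} \ge \tau_\theta
\ \text{and}\
\bigl|\, \lVert \bar{v}_{\mathrm{box}} \rVert - \lVert \bar{v}_{\mathrm{bg}} \rVert \,\bigr| \le \tau_m
\]
```

Here the two averaged vectors are the summary optical flow inside the candidate box and in the surrounding background region, respectively.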
[0057] In some embodiments, before inputting the target image into a pre-trained target detection model to obtain the candidate target image output by the target detection model, the method further includes: obtaining a training sample set, which includes different samples to be detected and the corresponding real target location maps; and training the target detection model based on the training sample set to obtain a trained target detection model.
[0058] Specifically, the sample to be detected is input into the target detection model under training to obtain the location map output by the target detection model; based on the location map and the real target location map, the loss function of the target detection model is calculated; if the loss is less than a preset value, training is considered complete and the trained target detection model is obtained.
[0059] As an example, the object detection model is trained in an end-to-end manner, optimizing the model parameters by minimizing the loss function between the predicted location and the ground truth label; when the model's performance on the validation set reaches a preset standard, training is stopped and the model parameters are saved.
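The stopping rule (train until the loss falls below a preset value) can be illustrated with a deliberately tiny stand-in. The sketch below fits a single weight by gradient descent on toy data; it is not the detector's training procedure, and the data, learning rate, and threshold are all illustrative assumptions.

```python
def train_until(loss_threshold=1e-4, lr=0.05, max_steps=10_000):
    """Gradient descent on a one-parameter model y = w * x.

    Training stops as soon as the mean squared error drops below
    loss_threshold, mirroring the 'loss less than a preset value' rule.
    """
    samples = [(x, 2.0 * x) for x in (0.0, 0.5, 1.0, 1.5, 2.0)]  # toy ground truth
    w = 0.0
    loss = float("inf")
    for step in range(1, max_steps + 1):
        loss = sum((w * x - y) ** 2 for x, y in samples) / len(samples)
        if loss < loss_threshold:
            return w, loss, step
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= lr * grad
    return w, loss, max_steps


w, loss, steps = train_until()
```

For the actual detector the same loop structure would wrap mini-batch updates, and the check would typically be made on a validation set, as the paragraph above notes.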
[0060] Based on the above inventive concept, the present invention proposes an embodiment that achieves the same technical effect. As shown in Figure 3, it includes the following steps: S1. Input Acquisition and Preprocessing.
[0061] S1-1, Data Acquisition Configuration: Acquisition source: intraoperative microscope camera; resolution 1080p, frame rate 60fps (adjustable equivalently).
[0062] Input specifications: The dimensions are normalized according to the engineering configuration (keeping the same as the training dimensions) to reduce the impact of scale jitter on subsequent features; the temporal frame number is maintained for cross-frame processing.
[0063] S1-2, Preprocessing: The captured video frames are processed using CLAHE to enhance image contrast and detail. The processed frames are then used as input to S2.
[0064] S2. Improved Backbone Feature Extraction.
[0065] S2-1, Basic Path: The input frame is convolved and downsampled, and then enters the C3 series modules step by step to extract multi-scale spatial / semantic features, forming P3, P4 and P5 three-scale feature outputs.
[0066] S2-2, Structural-level reinforcement: Multi-point attention embedding (CBAM): CBAM is deeply embedded in the outputs (aligned channels) at each level of the backbone, jointly modeling the channels and space, significantly enhancing the response of sparse small targets in complex backgrounds. It is not a simple add-on module, but rather a reconstruction of the enhancement path at multiple output points, enabling attention and backbone extraction to be coupled and coordinated.
[0067] High-level context and global spatial dependency: SPPF is connected in series at the high-level output of the Backbone to aggregate multi-scale context, and then C2PSA is connected to model global spatial dependency to improve the separability and detectability of weak signal targets.
[0068] S2-3, Output: The three feature maps P3 / P4 / P5 are generated for subsequent fusion. After the above reconstruction, Backbone has the ability to enhance small targets with "multi-layer attention + global dependency".
[0069] S3, Feature Fusion Neck (FPN+PAN+Lightweight Secondary Backflow).
[0070] S3-1, FPN (Top-Down) fusion: Upsample P5→P4→P3 step by step and align and stitch them with the corresponding low-level features; embed CBAM at key fusion nodes to suppress background noise such as tissue texture, highlight potential sperm regions, and obtain enhanced small-scale representation.
[0071] S3-2, PAN (Bottom-Up) flow: The fused low-level features are then downsampled and passed back along P3→P4→P5 to supplement details and maintain multi-scale consistency. CBAM is also embedded at key nodes to improve cross-scale stability and discriminative power.
[0072] S3-3, Lightweight secondary backflow "acting only on P3": After the standard FPN+PAN is completed, an additional small backflow path is designed from the high-level semantics to P3. Starting from the high-level semantic features, the data is fed back to P3 through upsampling and concatenation. Only the small target branch (P3) is enhanced, and lightweight convolution is used to control latency. To avoid channel mismatch and redundant attention calculations after concatenation, CBAM is not applied again on this path. The enhanced P3 is then used as the detection input for the main small target branch.
[0073] S4, Multi-scale detection head inference.
[0074] S4-1, Three-Scale Detection Head: Detection heads are set up at three scales: P3, P4, and P5, and candidate box coordinates, confidence scores, and class labels (nc=1: sperm / background) are output.
[0075] S4-2, Function Description: Three-scale detection is a common practice in the YOLO series to cover different target sizes; the performance improvement of this invention is mainly due to the structural improvements of S2 / S3, while the detection head maintains compatibility and engineering integrity.
[0076] S5. Post-processing and result verification.
[0077] S5-1, Basic Filtering: The candidate boxes are filtered using confidence thresholds and redundancy is removed using NMS to obtain the first version of candidate results.
[0078] S5-2, Cross-frame repetition detection and suppression: For candidate target images in adjacent frames, a combination of IoU and displacement thresholding is used to determine whether they belong to the "same sperm". If the IoU between a candidate in the current frame and a candidate in the previous frame is greater than or equal to a set threshold and the center displacement is less than the displacement threshold, the candidate is considered a continuation of the same target, and the "new target" event is not triggered. The recognition event and counting are triggered only when the target "appears for the first time" or "significant displacement occurs." This avoids repeated reporting of the same sperm in consecutive frames, eliminating prompt flashing and duplicate statistics. The thresholds are adjustable according to the data distribution and microscopic magnification.
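The event-triggering logic of S5-2 can be sketched as follows. The sketch compares each current-frame box only with the boxes of the immediately preceding frame; the threshold values (IoU 0.5, displacement 10 pixels) and the function names are illustrative assumptions, not values from the specification.

```python
import math


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def center(box):
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def new_target_events(prev_boxes, curr_boxes, iou_thres=0.5, disp_thres=10.0):
    """Return the current-frame boxes that should trigger a 'new target' event.

    A box is a continuation (no event) if some previous-frame box overlaps it
    with IoU >= iou_thres and the centers moved by less than disp_thres.
    """
    events = []
    for cur in curr_boxes:
        cx, cy = center(cur)
        continued = False
        for prev in prev_boxes:
            px, py = center(prev)
            close = math.hypot(cx - px, cy - py) < disp_thres
            if iou(cur, prev) >= iou_thres and close:
                continued = True
                break
        if not continued:
            events.append(cur)
    return events


frames = [
    [(100, 100, 120, 120)],
    [(101, 100, 121, 120)],                      # same sperm, 1 px drift
    [(102, 101, 122, 121), (300, 50, 318, 68)],  # continuation plus a new arrival
]
count, prev = 0, []
for boxes in frames:
    count += len(new_target_events(prev, boxes))
    prev = boxes
```

Three frames containing four boxes yield a count of two sperm, which is the de-duplicated statistic the step is meant to produce.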
[0079] S5-3, Optical Flow Motion Consistency Verification: The optical flow vector field of consecutive frames is calculated, and the average / median optical flow vector within each candidate box is extracted and compared with the global / neighborhood background optical flow. If the motion direction and amplitude of the candidate target image are consistent with those of the background (e.g., consistent with the overall motion of the stage), or are otherwise abnormal relative to it, the candidate is determined to be a false detection and is removed. If the optical flow of the candidate target image is significantly different from the background and consistent with the detection trajectory, the candidate is retained. This mechanism effectively filters out false positives caused by microscope shaking, focus switching and illumination fluctuations, and improves intraoperative stability.
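The comparison in S5-3 can be sketched as follows, assuming a dense flow field has already been computed by some optical flow algorithm and is available as a 2D grid of `(dx, dy)` vectors. The use of the per-component median, the global background region, the cosine-similarity test, and both tolerance values are assumptions for illustration; the specification leaves these choices open.

```python
import math
from statistics import median


def median_flow(flow, cells):
    """Per-component median of the flow vectors at the given (row, col) cells."""
    xs = [flow[y][x][0] for y, x in cells]
    ys = [flow[y][x][1] for y, x in cells]
    return median(xs), median(ys)


def moves_with_background(flow, box, mag_tol=0.5, min_cos=0.9):
    """True if the box's median flow matches the background in direction and amplitude."""
    x1, y1, x2, y2 = box
    h, w = len(flow), len(flow[0])
    inside = [(y, x) for y in range(y1, y2) for x in range(x1, x2)]
    inside_set = set(inside)
    outside = [(y, x) for y in range(h) for x in range(w) if (y, x) not in inside_set]

    fx, fy = median_flow(flow, inside)
    bx, by = median_flow(flow, outside)
    fm, bm = math.hypot(fx, fy), math.hypot(bx, by)

    if abs(fm - bm) > mag_tol:
        return False  # amplitude differs: the candidate moves on its own
    if fm == 0.0 or bm == 0.0:
        return True   # amplitudes match and at least one is zero: no direction to compare
    cos = (fx * bx + fy * by) / (fm * bm)
    return cos >= min_cos


# Synthetic field: uniform stage drift, plus one region with its own motion.
H, W = 20, 20
flow = [[(2.0, 0.0) for _ in range(W)] for _ in range(H)]
for y in range(10, 14):
    for x in range(10, 14):
        flow[y][x] = (2.0, 3.0)

candidates = [(2, 2, 6, 6), (10, 10, 14, 14)]
kept = [b for b in candidates if not moves_with_background(flow, b)]
```

The first candidate drifts exactly with the stage and is removed; the second carries an independent motion component and is retained.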
[0080] S5-4, Results: The candidates after cross-frame deduplication and optical flow verification are the stable and reliable final detection set, which is used for real-time display and subsequent statistics.
[0081] S6. Real-time result output.
[0082] The final detection results are displayed on the eyepiece / external screen as a rectangle with a confidence level overlaid for intraoperative reference; the system is designed with a "millisecond-level closed loop" to meet real-time requirements.
[0083] In "sparse / low contrast / jitter" scenarios, the improvements of S2 / S3 / S5 can provide stable prompts and reduce missed detections and false alarms.
[0084] Experimental verification: The improved network proposed in this invention was validated on a self-built dataset. Compared with the official YOLOv11m model, the SpermBest model trained in this invention has achieved significant improvements in multiple metrics such as mAP, Precision, and Recall, with mAP@0.5:0.95 being improved by more than 21%.
[0085] As shown in Figure 6, under normal microscopic conditions, the target density is moderate and the contrast is good. Using the algorithm of this invention, the detection module outputs the coordinates and confidence information of each sperm in real time, which is highly consistent with the results of manual identification, verifying the accuracy of the algorithm under normal conditions.
[0086] As shown in Figure 7, in sparse scenarios, only a very small number of sperm are present, making manual retrieval time-consuming. The algorithm of this invention can complete target detection and output within milliseconds, significantly shortening retrieval time and improving intraoperative efficiency.
[0087] As shown in Figure 8, in sparse scenes with extremely low contrast or complex backgrounds, some sperm targets are almost impossible to identify by the human eye. This invention, through image preprocessing (CLAHE enhancement) and feature extraction using a small target detection network, can successfully detect these weak-signal sperm and display them stably on the output interface, demonstrating the algorithm's advantages in low signal-to-noise ratio scenarios.
[0088] As shown in Figure 9, when the microscope stage is moved or the focus is adjusted, the same sperm may appear consecutively in multiple frames. Traditional detection methods may misidentify it as multiple new targets, causing redundancy. The cross-frame repetition recognition suppression and optical flow motion consistency verification mechanism of this invention can identify it as the same target, maintain uniqueness and trajectory continuity, and avoid duplicate statistics.
[0089] Key technical differences between this invention and existing technologies: 1. Most existing methods use standard sperm datasets, and the objects of detection are often dense sperm or regular samples. This invention targets the intraoperative scenario of micro-TESE, characterized by sparse sperm count, low contrast, weak motility, and accompanied by microscopic vibration and light fluctuations. Therefore, its application background and applicability are significantly different from existing work, and it is more in line with actual clinical needs.
[0090] 2. Regarding feature modeling, while existing improved YOLO or CNN methods have introduced attention mechanisms or small target detection layers, they often remain at the level of conventional multi-scale designs. This invention embeds CBAM at multiple locations in the backbone and neck, and enhances the visibility and detection sensitivity of small target sperm through the P3 branch and the lightweight secondary backflow, which existing methods have not covered.
[0091] 3. At the temporal level, most existing works employ independent frame-by-frame detection and lack cross-frame consistency, leading to repeated identification of the same sperm in multiple frames. This invention proposes a cross-frame repeated identification suppression mechanism, which uses IoU and displacement thresholds so that identification is triggered only when the target first appears or undergoes significant displacement, thereby ensuring stable output and avoiding flickering and duplicate counting.
[0092] 4. Regarding robustness, existing methods generally rely solely on detector output and lack physical-layer motion constraints. This invention introduces optical flow consistency verification to validate the motion direction and amplitude of the detection results, effectively filtering out false detections caused by microscopic jitter, focus switching, or illumination changes, ensuring continuous and reliable detection results even in complex intraoperative environments.
[0093] 5. Finally, regarding application positioning, existing research has largely focused on offline analysis or research datasets, making it difficult to directly apply to real-time intraoperative detection. This invention, however, is explicitly geared towards real-time clinical applications, maintaining high accuracy while also considering inference speed and practicality, thus better meeting the requirements of surgical scenarios.
[0094] Compared with the closest prior art mentioned above, the beneficial effects achieved by the present invention are mainly as follows: Significantly reduces false negative rate: The attention mechanism improves detection sensitivity in low-contrast, low-motion environments.
[0095] Avoid redundant identification across frames: The cross-frame determination logic ensures that the same sperm is not counted repeatedly in consecutive frames, improving statistical accuracy and avoiding surgical interference.
[0096] Enhanced anti-interference capability: Optical flow consistency verification filters out pseudo-motions such as jitter and focus switching, improving the system's stability in complex surgical environments.
[0097] Meeting real-time clinical needs: The system's closed-loop latency is less than 100 milliseconds, truly achieving "millisecond-level" prompts usable during surgery.
[0098] Overall clinical value: Significantly shortens the time for sparse sperm retrieval during micro-TESE surgery, reduces the risk of missed detection, and improves success rate and surgical efficiency.
[0099] All of the above-mentioned optional technical solutions can be combined in any way to form optional embodiments of the present invention, and will not be described in detail here.
[0100] The following are embodiments of the apparatus of the present invention, which can be used to execute embodiments of the method of the present invention. For details not disclosed in the embodiments of the apparatus of the present invention, please refer to the embodiments of the method of the present invention.
[0101] Figure 4 is a schematic diagram of a real-time detection device for small targets in microscopic sperm extraction, provided by an embodiment of the present invention. As shown in Figure 4, the real-time detection device for small targets in microscopic sperm extraction includes:
[0102] The acquisition module 401 is used to acquire real-time video frames from the intraoperative microscope to obtain the target image to be detected.
[0103] The processing module 402 is used to input the target image into a pre-trained target detection model and output a candidate target image. The target detection model is based on the YOLOv11m model. A convolutional attention module is inserted after the C3 module of the backbone network and the neck network of the YOLOv11m model. A lightweight secondary backflow path is introduced in the neck network to act on P3. The lightweight secondary backflow path feeds back to P3 step by step through upsampling and concatenation of semantic features to obtain enhanced P3 features.
[0104] The process of inputting the target image into a pre-trained target detection model and outputting candidate target images includes the following steps: convolving and downsampling the target image through the backbone network to extract multi-scale features, forming three-scale features of P3, P4, and P5; performing top-down fusion and bottom-up backflow on the three-scale features through the neck network, and then feeding back the fused P4 and P5 to P3 step by step through the lightweight secondary backflow path via upsampling and splicing to obtain an enhanced P3; inputting the fused P4, P5, and enhanced P3 into the corresponding detection head to generate candidate target images corresponding to the target image.
[0105] The verification module 403 is used to perform cross-frame consistency processing and optical flow motion consistency verification on the candidate target image to obtain the final recognition image.
[0106] The determination module 404 is used to determine the location of the target to be detected based on the final recognized image.
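The cooperation of modules 401 to 404 can be sketched as a small pipeline object. This is an architectural illustration only: the class name `DetectionDevice`, the callable interfaces, and the choice of box centers as the reported "location" are assumptions, and the stand-in lambdas replace the camera, the detection model, and the verification logic.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)


@dataclass
class DetectionDevice:
    acquire: Callable[[], object]             # acquisition module 401
    detect: Callable[[object], List[Box]]     # processing module 402
    verify: Callable[[List[Box]], List[Box]]  # verification module 403

    def locate(self) -> List[Tuple[float, float]]:
        """Determination module 404: report a location for each confirmed target."""
        frame = self.acquire()
        candidates = self.detect(frame)
        confirmed = self.verify(candidates)
        return [((x1 + x2) / 2, (y1 + y2) / 2) for x1, y1, x2, y2 in confirmed]


# Stand-in components so the sketch runs end to end.
device = DetectionDevice(
    acquire=lambda: "frame-0",
    detect=lambda frame: [(10, 10, 20, 20), (40, 40, 44, 44)],
    verify=lambda boxes: [b for b in boxes if (b[2] - b[0]) >= 8],
)
locations = device.locate()
```

Keeping each module behind a plain callable makes it straightforward to swap a stub for the trained model or for the cross-frame and optical flow checks.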
[0107] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
[0108] Figure 5 is a schematic diagram of the electronic device 5 provided in an embodiment of the present invention. As shown in Figure 5, the electronic device 5 of this embodiment includes: a processor 501, a memory 502, and a computer program 503 stored in the memory 502 and executable on the processor 501. When the processor 501 executes the computer program 503, it implements the steps in the various method embodiments described above. Alternatively, when the processor 501 executes the computer program 503, it implements the functions of each module / unit in the various device embodiments described above.
[0109] Electronic device 5 can be a desktop computer, laptop, handheld computer, cloud server, or other electronic device. Electronic device 5 may include, but is not limited to, processor 501 and memory 502. Those skilled in the art will understand that Figure 5 is merely an example of electronic device 5 and does not constitute a limitation on electronic device 5. It may include more or fewer components than shown, or different components.
[0110] The processor 501 may be a central processing unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
[0111] The memory 502 can be an internal storage unit of the electronic device 5, such as a hard disk or RAM of the electronic device 5. The memory 502 can also be an external storage device of the electronic device 5, such as a plug-in hard disk, SmartMediaCard (SMC), SecureDigital (SD) card, or FlashCard. The memory 502 can also include both internal and external storage units of the electronic device 5. The memory 502 is used to store computer programs and other programs and data required by the electronic device.
[0112] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0113] This invention is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and / or block diagrams, as well as combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more processes of the flowchart and / or one or more boxes of the block diagram.
[0114] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and / or one or more boxes of the block diagram.
[0115] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and / or one or more boxes of the block diagram.
[0116] It should be noted that the specific embodiments described above enable those skilled in the art to more fully understand the present invention, but do not limit the present invention in any way. Therefore, although the present invention has been described in detail in this specification, those skilled in the art should understand that modifications or equivalent substitutions can still be made to the present invention; and all technical solutions and improvements that do not depart from the spirit and scope of the present invention are covered within the protection scope of the patent of the present invention. No reference numerals in the claims should be construed as limiting the scope of the claims.
Claims
1. A method for real-time detection of small targets in microscopic sperm extraction, characterized in that, The method includes the following steps: Real-time video frames from the intraoperative microscope are acquired to obtain the target image to be detected; The target image is input into a pre-trained target detection model, which outputs candidate target images. The target detection model inserts a convolutional attention module after the C3 module of the backbone network and the neck network of the YOLOv11m model. A lightweight secondary backflow path acting on P3 is introduced into the neck network. The lightweight secondary backflow path feeds back to P3 step by step through upsampling and concatenation of semantic features to obtain enhanced P3 features. The candidate target images are subjected to cross-frame consistency processing and optical flow motion consistency verification to obtain the final recognition image; Based on the final recognized image, the location of the target to be detected is determined; The process of inputting the target image into a pre-trained target detection model and outputting candidate target images includes the following steps: The target image is convolutionally and downsampled through the backbone network, and convolutional calculation is performed through the convolutional attention module to generate three-scale features of P3, P4 and P5. The three-scale features are input into the feature pyramid through the neck network and fused. Convolutional attention is calculated at the fusion node to obtain fused P3, P4, and P5. The fused P4 and P5 are upsampled and concatenated with the fused P3 through the lightweight secondary backflow path to obtain the enhanced P3.
2. The method for real-time detection of small targets in microscopic sperm extraction according to claim 1, characterized in that performing convolution and downsampling on the target image through the backbone network and generating the three-scale features P3, P4 and P5 through the convolutional attention module includes the following steps: the backbone network sequentially processes the feature image corresponding to the target image to generate multi-scale features, wherein P3 is output after processing by the convolutional attention module, P4 is output after processing by the C3 module, and P5 is output after processing by the cross-stage partial spatial attention module.
3. The method for real-time detection of small targets in microscopic sperm extraction according to claim 1, characterized in that the lightweight secondary backflow path includes a first upsampling layer, a first concatenation layer, a first C3 layer, a second upsampling layer, a second concatenation layer, and a second C3 layer; the first upsampling layer upsamples the fused P5 features; the first concatenation layer concatenates the upsampled features with the fused P4 features; the first C3 layer performs convolution and feature fusion on the concatenated features; the second upsampling layer upsamples the output of the first C3 layer; the second concatenation layer concatenates the result with the fused P3 features; and the second C3 layer performs convolution and feature fusion to output the enhanced P3.
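For illustration only, the layer order recited in claim 3 might be traced with the following NumPy sketch. The channel counts, the 80/40/20 spatial sizes, nearest-neighbour upsampling, and the `mix_channels` helper (a random 1x1 convolution standing in for a C3 layer) are all assumptions made here to show the data flow and tensor shapes; none of them is specified in the claims.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def mix_channels(x, out_channels, rng):
    """Hypothetical stand-in for a C3 layer: random 1x1 convolution + ReLU.
    A real C3 block is deeper; this only reproduces the channel change."""
    w = rng.standard_normal((out_channels, x.shape[0])) / np.sqrt(x.shape[0])
    return np.maximum(np.einsum("oc,chw->ohw", w, x), 0.0)

def secondary_backflow(p3, p4, p5, rng):
    """Feed fused P5 and P4 back to P3 in the order given in claim 3."""
    x = upsample2x(p5)                         # first upsampling layer
    x = np.concatenate([x, p4], axis=0)        # first concatenation layer
    x = mix_channels(x, p4.shape[0], rng)      # first C3 layer (stand-in)
    x = upsample2x(x)                          # second upsampling layer
    x = np.concatenate([x, p3], axis=0)        # second concatenation layer
    return mix_channels(x, p3.shape[0], rng)   # second C3 layer (stand-in)

# Assumed fused features for a 640x640 input (strides 8, 16, 32).
rng = np.random.default_rng(0)
p3 = rng.standard_normal((64, 80, 80))
p4 = rng.standard_normal((128, 40, 40))
p5 = rng.standard_normal((256, 20, 20))

enhanced_p3 = secondary_backflow(p3, p4, p5, rng)
print(enhanced_p3.shape)
```

In this sketch the enhanced P3 keeps the resolution and channel count of the fused P3, so it could be passed to the same detection head.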
4. The method for real-time detection of small targets in microscopic sperm extraction according to claim 1, characterized in that, before performing cross-frame consistency processing and optical flow motion consistency verification on the candidate target images to obtain the final recognition image, the method further includes: performing confidence thresholding and non-maximum suppression filtering on the sparse small target positions corresponding to the candidate target images to obtain preliminary candidate target images corresponding to the target to be detected.
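The filtering step of claim 4 could be sketched as below. This is a minimal greedy non-maximum suppression in plain Python; the `(x1, y1, x2, y2)` box format, the function names, and the threshold values 0.25 and 0.5 are assumptions chosen for illustration, since the claim does not give them.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def filter_detections(dets, conf_thresh=0.25, iou_thresh=0.5):
    """dets is a list of (box, score) pairs.
    Drops low-confidence boxes, then keeps the highest-scoring box
    of each overlapping group (greedy NMS). Thresholds are assumed values."""
    candidates = sorted(
        (d for d in dets if d[1] >= conf_thresh),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, score in candidates:
        if all(iou(box, k[0]) < iou_thresh for k in kept):
            kept.append((box, score))
    return kept

dets = [
    ((10, 10, 20, 20), 0.9),   # kept
    ((11, 11, 21, 21), 0.8),   # overlaps the first box, suppressed
    ((50, 50, 60, 60), 0.6),   # kept
    ((70, 70, 80, 80), 0.1),   # below the confidence threshold
]
kept = filter_detections(dets)
print(kept)
```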
5. The method for real-time detection of small targets in microscopic sperm extraction according to claim 1, characterized in that, before inputting the target image into the pre-trained target detection model to obtain the candidate target images output by the target detection model, the method further includes: subjecting the target image to contrast-limited adaptive histogram equalization to obtain the feature image corresponding to the target image.
6. The method for real-time detection of small targets in microscopic sperm extraction according to claim 1, characterized in that the cross-frame consistency processing of the candidate target images includes the following steps: if the intersection over union (IoU) of the current-frame candidate target and the previous-frame candidate target corresponding to the candidate target image is greater than or equal to a first preset threshold, and the center point displacement is less than a second preset threshold, then the current-frame candidate target and the previous-frame candidate target are determined to be the same target to be detected.
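The two-condition test of claim 6 can be written out as a short Python sketch. The box format `(x1, y1, x2, y2)`, the function names, and the default thresholds (`iou_min=0.3`, `max_shift=10.0` pixels) are assumptions for illustration; the claim only calls them the first and second preset thresholds.

```python
import math

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def center(box):
    """Center point of a box."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def same_target(curr, prev, iou_min=0.3, max_shift=10.0):
    """True when IoU >= first threshold AND center displacement < second threshold.
    Both threshold values are assumed, not taken from the patent."""
    (cx, cy), (px, py) = center(curr), center(prev)
    shift = math.hypot(cx - px, cy - py)
    return iou(curr, prev) >= iou_min and shift < max_shift

prev_box = (100, 100, 120, 120)
near_box = (103, 102, 123, 122)   # small drift between frames
far_box = (200, 200, 220, 220)    # unrelated detection

print(same_target(near_box, prev_box))
print(same_target(far_box, prev_box))
```

Requiring both conditions means a box that overlaps the previous one but jumps far in position, or one that stays close but barely overlaps, is not associated with the earlier target.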
7. The method for real-time detection of small targets in microscopic sperm extraction according to claim 1, characterized in that the optical flow motion consistency verification of the candidate target images includes the following steps: calculating the optical flow field of the consecutive frames corresponding to the candidate target image; and, if the optical flow of the candidate target image region is consistent with the optical flow of the surrounding background region in both direction and magnitude, determining that the candidate target corresponding to the candidate target image is a false detection and filtering it out.
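One possible reading of the check in claim 7 is sketched below with NumPy. It assumes a dense flow field of shape `(H, W, 2)` is already available (for example from a dense optical-flow routine such as OpenCV's `calcOpticalFlowFarneback`); the function name, the background ring width `margin`, and the angle and magnitude tolerances are hypothetical values that the claim does not specify.

```python
import numpy as np

def moves_with_background(flow, box, margin=8, max_angle_deg=15.0,
                          max_mag_ratio=0.2, eps=1e-6):
    """Return True when the mean flow inside `box` matches the mean flow of a
    surrounding ring in both direction and magnitude (likely false detection).
    flow: (H, W, 2) array of per-pixel (dx, dy); box: integer (x1, y1, x2, y2)."""
    h, w = flow.shape[:2]
    x1, y1, x2, y2 = box
    inside = np.zeros((h, w), dtype=bool)
    inside[y1:y2, x1:x2] = True
    ring = np.zeros((h, w), dtype=bool)
    ring[max(0, y1 - margin):min(h, y2 + margin),
         max(0, x1 - margin):min(w, x2 + margin)] = True
    ring &= ~inside                      # background ring excludes the box itself

    v_in = flow[inside].mean(axis=0)
    v_bg = flow[ring].mean(axis=0)
    m_in, m_bg = np.linalg.norm(v_in), np.linalg.norm(v_bg)
    if m_in < eps and m_bg < eps:
        return True                      # assumption: two static regions count as consistent
    if m_in < eps or m_bg < eps:
        return False                     # only one of the two regions moves
    cos = np.clip(np.dot(v_in, v_bg) / (m_in * m_bg), -1.0, 1.0)
    angle = np.degrees(np.arccos(cos))
    mag_diff = abs(m_in - m_bg) / max(m_in, m_bg)
    return bool(angle <= max_angle_deg and mag_diff <= max_mag_ratio)

# Synthetic example: the whole field drifts right by one pixel per frame.
drift = np.zeros((64, 64, 2))
drift[..., 0] = 1.0
box = (20, 20, 30, 30)
carried_along = moves_with_background(drift, box)

# Same background, but the boxed region moves downward on its own.
own_motion = drift.copy()
own_motion[20:30, 20:30] = (0.0, 2.0)
independent = moves_with_background(own_motion, box)

print(carried_along, independent)
```

A candidate for which the function returns True would be filtered out as moving with the background, while a candidate with its own motion would be retained.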
8. A real-time detection device for small targets in microscopic sperm extraction, characterized in that the device includes: an acquisition module, used to acquire real-time video frames from the intraoperative microscope to obtain the target image to be detected; a processing module, used to input the target image into a pre-trained target detection model and output candidate target images, wherein the target detection model is based on the YOLOv11m model, a convolutional attention module is inserted after the C3 module of the backbone network and of the neck network of the YOLOv11m model, and a lightweight secondary backflow path acting on P3 is introduced in the neck network, the lightweight secondary backflow path feeding semantic features back to P3 step by step through upsampling and concatenation to obtain enhanced P3 features; wherein inputting the target image into the pre-trained target detection model and outputting candidate target images includes the following steps: the target image is convolved and downsampled by the backbone network to extract multi-scale features, forming three-scale features P3, P4 and P5; after the three-scale features are fused top-down and then bottom-up by the neck network, the fused P4 and P5 are upsampled, concatenated, and fed back to P3 through the lightweight secondary backflow path, yielding the enhanced P3; and the fused P4, the fused P5 and the enhanced P3 are input into the corresponding detection heads to generate the candidate target images corresponding to the target image; a verification module, used to perform cross-frame consistency processing and optical flow motion consistency verification on the candidate target images to obtain the final recognition image; and a determination module, used to determine the location of the target to be detected based on the final recognition image.
9. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method described in any one of claims 1 to 7.
10. A computer device, characterized in that the computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the method described in any one of claims 1 to 7.