Endoscope video intelligent analysis system and method based on multi-agent cooperation

The intelligent endoscopic video analysis system uses multi-agent collaboration to address the time-consuming nature of endoscopic examinations and the privacy risks of deploying AI analysis. It enables efficient and accurate lesion detection and standardized medical report generation while keeping patient data private.

CN122245651A · Pending · Publication Date: 2026-06-19 · JUJIAOXINCHUANG MEDICAL ELECTRONICS (SHANGHAI) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
JUJIAOXINCHUANG MEDICAL ELECTRONICS (SHANGHAI) CO LTD
Filing Date
2026-04-24
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Current endoscopic examinations rely on manual operation and experience-based judgment by doctors, which is time-consuming and carries the risk of missing tiny lesions. Existing AI technology struggles to process the temporal information and anatomical transition patterns of endoscopic videos, and deploying such systems poses privacy and security risks.

Method used

An intelligent endoscopic video analysis system based on multi-agent collaboration is adopted, including an edge computing device, a local server, and a host computer. Multiple agent modules work together to perform lesion target detection, spatiotemporal feature extraction, and medical description generation, linking each lesion to its location and description. Privacy and security are ensured by transmitting only structured metadata.

Benefits of technology

It enables intelligent analysis of the entire endoscopic video process, improving the efficiency and accuracy of lesion detection, generating standardized medical reports, protecting data privacy, and reducing the waste of computing resources and human error.

Generated by Eureka AI based on patent content.

Abstract

This application relates to the field of intelligent medical image analysis technology, and discloses an intelligent analysis system and method for endoscopic videos based on multi-agent collaboration. The system includes an edge computing device, a local server, and a host computer. The edge computing device executes the first and second agent modules in parallel. The first agent module uses a backbone network shared by target detection and instance segmentation to extract lesion segmentation masks, and calculates physical dimensions using dynamic distance compensation. The second agent module uses a spatiotemporal graph convolutional network combined with an anatomical topology finite state machine to output site segmentation results. The edge computing device sends structured metadata to the local server via a message middleware. The local server executes the third and fourth agent modules sequentially, using retrieval-augmented generation to drive a large language model to generate medical description text, and generating a medical examination report based on spatiotemporal alignment matching. The host computer performs visualization rendering and triggers closed-loop regeneration through label correction operations. This invention achieves intelligent analysis of the entire endoscopic video process.

Description

Technical Field

[0001] This application relates to the field of intelligent medical image analysis technology, and in particular to an intelligent analysis and automatic report generation technology for endoscopic videos based on a multi-agent collaborative architecture.

Background Technology

[0002] Gastrointestinal endoscopy is a core tool for the diagnosis and screening of digestive system diseases, playing an irreplaceable role in clinical scenarios such as early colorectal cancer screening and upper gastrointestinal lesion assessment. In a typical clinical workflow, gastroenterologists insert endoscopes (such as gastroscopes and colonoscopes) into the patient's body through natural cavities to observe the morphological characteristics of the digestive tract mucosa in real time. They characterize lesions such as polyps, ulcers, erosions, and bleeding, mark their location, assess their size, and write a structured examination report after the examination.

[0003] However, current endoscopic image analysis and report generation processes still largely rely on doctors' manual operation and experience. Taking colonoscopy as an example, a complete examination typically lasts several tens of minutes, generating a massive amount of video data. After the examination, doctors need to review the video frame by frame to locate key lesions, manually extract images, label the location and characteristics of lesions, and then compile a written report based on memory and notes. This process is not only time-consuming but also carries the risk of missing minute lesions due to human error. Especially in high-volume endoscopy centers, doctors handle a large number of cases daily, posing challenges to both the efficiency and quality of report writing.

[0004] In recent years, research on AI-assisted endoscopic analysis has gradually emerged. Some existing technologies attempt to use a single AI model to assist in examinations, such as using object detection models to identify polyps in static images, or using large language models to generate text descriptions from single-frame images. However, these technical solutions generally have the following limitations. In terms of video analysis, existing systems mostly support only independent recognition of single-frame images and cannot effectively process the temporal information in endoscopic videos, making it difficult to capture the dynamic spatiotemporal distribution characteristics of lesions and the gradual transition patterns of anatomical sites as the endoscope continuously advances in the digestive tract. In terms of report generation, there is no effective correlation mechanism between the detection results of object recognition models and anatomical site information, so the linkage of "lesion-site-clinical description" cannot be achieved automatically in the report, which limits the completeness of report content and its clinical decision support capabilities. In terms of system deployment, a single device cannot simultaneously meet the low latency requirements of visual inspection tasks and the high computing power requirements of large language model inference, and the network transmission of patient image data poses privacy and security risks. In terms of research efficiency, doctors need to manually edit videos and annotate keyframes to organize research materials, which is time-consuming and labor-intensive. Therefore, there is an urgent need for a technical solution that achieves intelligent analysis of the entire endoscopic video process and multi-dimensional information collaboration and linkage, while balancing deployment efficiency and data security.

Summary of the Invention

[0005] The purpose of this application is to provide an intelligent analysis system and method for endoscopic video based on multi-agent collaboration, so as to solve the problems mentioned in the background art.

[0006] This application discloses an intelligent analysis system for endoscopic video based on multi-agent collaboration, including an edge computing device, a local server, and a host computer that are communicatively connected in sequence.

The edge computing device is configured to: acquire a video stream collected by an endoscope device; perform frame extraction and frame quality filtering on the video stream at a preset frequency to obtain a video frame sequence; and execute a first agent module and a second agent module in parallel on the video frame sequence. The first agent module adopts a model in which target detection and instance segmentation share a backbone network, extracts the segmentation mask of the lesion target in the video frame, calculates the physical size of the lesion target in combination with the real-time distance of the lens, and outputs a timestamped target detection result. The second agent module uses a spatiotemporal graph convolutional network to extract spatiotemporal features from multiple consecutive frames, applies state transition constraints through a preset anatomical topology finite state machine, and outputs timestamped site segmentation results. The edge computing device encapsulates the target detection results and the site segmentation results into structured metadata and sends them to the local server through a message middleware, while the original video stream is stored on the edge computing device.

The local server is configured to receive the structured metadata and execute a third agent module and a fourth agent module in sequence. The third agent module uses the structured metadata as search criteria to retrieve similar cases from a pre-built vector database as prompt examples, and inputs these examples together with the structured metadata into a large language model to generate medical description text. The fourth agent module performs spatiotemporal alignment matching between the target detection result and the site segmentation result based on the timestamp, binds the lesion target to the corresponding anatomical site, and integrates the structured metadata and the medical description text to generate a medical examination report.

The host computer is configured to: receive the structured metadata and medical examination report pushed by the local server; perform visual rendering and display; and, in response to the user's label correction operation on the interface, send the correction instruction back to the local server to trigger the third agent module and the fourth agent module to regenerate the corresponding medical description text and medical examination report.

[0007] In a preferred embodiment, the first agent module calculates the physical size of the lesion target in conjunction with the real-time distance of the lens, specifically including: counting the total number of foreground pixels $N$ in the segmentation mask and calculating the pixel-level equivalent diameter $D_{px}$ from $N$; obtaining the real-time distance $Z$ from the endoscope lens to the digestive tract mucosa; calculating the single-frame physical size of the lesion target according to the formula $D_{mm} = K(Z) \cdot D_{px}$, where $K(Z)$ is a dynamic scaling factor calculated from the camera intrinsic parameters and the real-time distance $Z$, represents the physical length of each pixel at a specific distance, is expressed in millimeters per pixel, and increases as $Z$ increases; and maintaining a sliding window of a preset length on the time axis, removing outliers from the single-frame physical sizes of multiple consecutive frames, and taking the median of a preset confidence interval as the final output physical size.
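The size estimation chain described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the equivalent diameter uses the common "circle of equal area" definition, the scaling factor assumes a simple pinhole camera (millimeters per pixel = distance / focal length in pixels), and the outlier rule is a median-absolute-deviation filter chosen for illustration. All function names and parameters are hypothetical.

```python
import math
from statistics import median


def equivalent_diameter_px(foreground_pixels: float) -> float:
    """Diameter of the circle whose area equals the mask's foreground pixel count."""
    return 2.0 * math.sqrt(foreground_pixels / math.pi)


def scale_mm_per_px(distance_mm: float, focal_length_px: float) -> float:
    """Assumed pinhole model: one pixel spans distance / focal_length millimeters,
    so the factor grows as the lens moves away from the mucosa."""
    return distance_mm / focal_length_px


def frame_size_mm(foreground_pixels: float, distance_mm: float, focal_length_px: float) -> float:
    """Single-frame physical size: pixel diameter times the distance-dependent scale."""
    return equivalent_diameter_px(foreground_pixels) * scale_mm_per_px(distance_mm, focal_length_px)


def robust_size_mm(window, k: float = 3.0) -> float:
    """Median of the sliding window after dropping values more than k MADs
    away from the window median (illustrative outlier rule)."""
    med = median(window)
    mad = median(abs(x - med) for x in window)
    if mad == 0:
        return med
    kept = [x for x in window if abs(x - med) <= k * mad]
    return median(kept)
```

For example, a window of per-frame estimates such as `[5.0, 5.1, 4.9, 5.0, 12.0]` would have the 12.0 reading discarded before the median is taken, which is the kind of single-frame jitter the temporal filter is meant to absorb.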

[0008] In a preferred embodiment, the second agent module uses a spatiotemporal graph convolutional network to extract spatiotemporal features from multiple consecutive frames and applies state transition constraints through a preset anatomical topology finite state machine, specifically including the following. A single video frame is divided into multiple spatial regions that serve as spatial nodes, and the feature vector of each spatial node is extracted. Spatial edges are established between spatial nodes that are physically adjacent, and spatial nodes at the same spatial position in multiple consecutive frames are connected along the time axis by temporal edges, constructing a dynamic spatiotemporal graph. Multi-layer spatiotemporal graph convolutional layers aggregate features along the spatial and temporal edges. After global average pooling and a fully connected classifier, the probability distribution over anatomical site categories at the current time step is output. The probability distributions of multiple consecutive frames are stored in a confidence accumulation buffer and summed over time to obtain a cumulative score. When the cumulative score of a target anatomical site exceeds the preset switching threshold and the trend is maintained for multiple consecutive time steps, a state transition candidate is triggered. The anatomical topology finite state machine then checks the candidate against its built-in anatomical topology constraint table. If the candidate target anatomical site and the current anatomical site are not legally adjacent topology nodes, the transition signal is intercepted and the current anatomical site state is forcibly maintained.
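The decision layer described in this paragraph (accumulate, threshold, hold, then check adjacency) can be sketched as below. This is an illustrative sketch only: the adjacency table is a hypothetical colon topology, the buffer uses a plain unweighted sum, and the class name, window length, threshold, and hold count are all assumptions rather than values from the application.

```python
from collections import deque

# Hypothetical topology constraint table: each site lists its legal neighbours.
ADJACENT = {
    "rectum": {"sigmoid"},
    "sigmoid": {"rectum", "descending"},
    "descending": {"sigmoid", "transverse"},
    "transverse": {"descending", "ascending"},
    "ascending": {"transverse", "cecum"},
    "cecum": {"ascending"},
}


class AnatomyFSM:
    def __init__(self, start, window=5, threshold=3.0, hold=2):
        self.state = start
        self.buffer = deque(maxlen=window)  # confidence accumulation buffer
        self.threshold = threshold          # switching threshold on the cumulative score
        self.hold = hold                    # consecutive steps the trend must persist
        self._candidate = None
        self._streak = 0

    def step(self, probs):
        """Consume one per-frame probability dict and return the constrained state."""
        self.buffer.append(probs)
        scores = {}
        for frame_probs in self.buffer:
            for site, p in frame_probs.items():
                scores[site] = scores.get(site, 0.0) + p
        best = max(scores, key=scores.get)

        if best != self.state and scores[best] > self.threshold:
            self._streak = self._streak + 1 if best == self._candidate else 1
            self._candidate = best
        else:
            self._candidate, self._streak = None, 0

        if self._streak >= self.hold:
            # Only transitions between adjacent topology nodes are accepted;
            # anything else is intercepted and the current state is kept.
            if self._candidate in ADJACENT[self.state]:
                self.state = self._candidate
            self._candidate, self._streak = None, 0
        return self.state
```

With this sketch, a sustained high probability for "sigmoid" moves the state out of "rectum", whereas a sustained high probability for "cecum" while in "rectum" is blocked indefinitely because the two sites are not adjacent in the table.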

[0009] In a preferred embodiment, the state transition constraints applied by the second agent module through the preset anatomical topology finite state machine further include the following. The lens motion vector calculated by the optical flow method is used to sense whether the camera is advancing or retreating, so as to dynamically adjust the transition directions allowed by the anatomical topology finite state machine; if the camera is retreating, switching to the distal anatomical site is prohibited. When the highest output confidence is lower than the preset low confidence threshold, or the difference between the two highest confidences is lower than the preset difference threshold, the current time step is marked as a transition state at the junction of anatomical sites. When multiple consecutive frames are in the transition state, the feature sequence is traced back along the time axis and locally smoothed. Until the preset anatomical landmark of the next anatomical site is detected with a confidence greater than the keyframe threshold, the original anatomical site state is maintained.
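Two of the enhancement rules in this paragraph lend themselves to a short sketch: flagging a transition state from the classifier output, and restricting allowed transitions by camera direction. The thresholds and function names below are hypothetical, and the site list is assumed to be ordered along the insertion path so that "deeper" means later in the list. The sketch assumes the retreat rule means that deeper sites are disallowed while the camera is withdrawing.

```python
def is_transition_state(probs, low_conf=0.5, min_margin=0.15):
    """Flag a junction frame: top confidence too low, or top two too close."""
    ranked = sorted(probs.values(), reverse=True)
    top1 = ranked[0]
    top2 = ranked[1] if len(ranked) > 1 else 0.0
    return top1 < low_conf or (top1 - top2) < min_margin


def allowed_targets(current, insertion_order, retreating):
    """Neighbouring sites the state machine may switch to.

    While retreating, only the shallower neighbour is allowed;
    otherwise both neighbours remain legal candidates.
    """
    i = insertion_order.index(current)
    shallower = insertion_order[i - 1:i] if i > 0 else []
    deeper = insertion_order[i + 1:i + 2]
    return shallower if retreating else shallower + deeper
```

In a fuller implementation the `retreating` flag would come from the sign of the optical-flow motion vector, which is outside the scope of this sketch.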

[0010] In a preferred embodiment, the edge computing device performs frame extraction and frame quality filtering on the video stream at a preset frequency, specifically including the following. The preset frequency is set to a fixed frame rate that matches the physical advance speed of the endoscope. The extracted video frames are preprocessed: for areas where the brightness exceeds the set high brightness threshold, a highlight detection mask is extracted and adaptive histogram equalization is applied as a repair step; for areas where the brightness is below the set low brightness threshold, Gamma correction is applied to improve the contrast of dark areas. The preprocessed video frames then pass through a four-level frame quality screening process: out-of-body frames and black screen frames are filtered by a scene-level model; blurred frames, water quality interference frames, and still frames are filtered by a quality-level model; logically abnormal frames are filtered by combining timing counters and anatomical path constraints; and video frames with abnormal size fluctuations are filtered by the confidence interval computed during the physical size calculation of the lesion target. Only video frames that pass all four levels are input to the first agent module and the second agent module in parallel.
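The four-level screening can be viewed as a short-circuiting cascade in which a frame is dropped at the first level that rejects it. The sketch below illustrates only that control flow: the frame is represented by a dictionary of precomputed scores, and every field name and threshold is a hypothetical stand-in for the scene-level model, quality-level model, logic check, and size statistics described above.

```python
from typing import Callable, Dict, List, Optional, Tuple

Frame = Dict[str, float]
Check = Tuple[str, Callable[[Frame], bool]]

# Ordered cheapest-first; each predicate returns True when the frame passes.
CHECKS: List[Check] = [
    ("scene", lambda f: f["in_body_score"] >= 0.5),      # out-of-body / black frames
    ("quality", lambda f: f["sharpness"] >= 0.3),        # blur, fluid interference, still frames
    ("logic", lambda f: f["path_consistent"] > 0),       # timing and anatomical path constraints
    ("statistics", lambda f: abs(f["size_zscore"]) <= 3.0),  # abnormal size fluctuation
]


def first_rejecting_level(frame: Frame, checks: List[Check] = CHECKS) -> Optional[str]:
    """Return the name of the first level that rejects the frame, or None if it passes all."""
    for name, passes in checks:
        if not passes(frame):
            return name
    return None
```

Ordering the levels from coarse to fine means an out-of-body frame never reaches the later, more expensive checks, which matches the stated goal of avoiding wasted computation.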

[0011] In a preferred embodiment, the specific process by which the third agent module generates the medical description text includes the following. The image feature descriptions and diagnostic conclusions in approved historical medical examination reports are converted into feature vectors to construct the vector database. During the inference phase, the lesion targets and anatomical features contained in the structured metadata are used as query conditions to retrieve similar cases from the vector database as prompt examples for few-shot inference. The role-setting instructions, the current structured metadata, and the retrieved prompt examples are assembled into a structured prompt template. Without updating network parameters, the structured prompt template is input into the locally and privately deployed large language model for in-context learning and inference, and the medical description text is output.
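A minimal sketch of the retrieve-then-assemble flow follows. It assumes cases are already embedded as plain float vectors and uses brute-force cosine similarity in place of a real vector database; the prompt layout, field names, and function names are hypothetical, and the call to the language model itself is left out.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_vec, database, k=2):
    """Return the k stored cases most similar to the query vector."""
    ranked = sorted(database, key=lambda case: cosine(query_vec, case["vector"]), reverse=True)
    return ranked[:k]


def build_prompt(role, metadata, examples):
    """Assemble role instructions, few-shot examples, and current metadata into one prompt."""
    lines = [role, "", "Examples:"]
    lines += [f"- {ex['text']}" for ex in examples]
    lines += ["", "Current findings:"]
    lines += [f"{key}: {value}" for key, value in metadata.items()]
    return "\n".join(lines)
```

Because the model's parameters are never updated, improving output quality in this design comes down to what is stored in the database and how the prompt is assembled, both of which can be changed without retraining.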

[0012] In a preferred embodiment, the fourth agent module performs spatiotemporal alignment matching between the target detection result and the site segmentation result based on a timestamp, binding the lesion target to the corresponding anatomical site, specifically including the following. For a lesion target detected by the first agent module at a target timestamp, the site segment sequence with start and end times output by the second agent module is searched for candidate anatomical sites whose absolute time difference from the target timestamp is less than a preset time threshold. If there are multiple candidate anatomical sites at the junction of anatomical sites, the candidate with the highest cumulative confidence is selected for spatiotemporal binding. When generating the medical examination report, the fourth agent module automatically matches corresponding clinical recommendations based on the category and physical size of the lesion target according to a preset rule engine.
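The binding step can be sketched as an interval lookup with a tolerance. The sketch interprets "time difference" as the distance from the lesion timestamp to a segment's start-end interval, which is an assumption; the segment fields, the tolerance value, and the function name are hypothetical.

```python
def bind_site(lesion_ts, segments, tolerance=0.5):
    """Bind a lesion timestamp to an anatomical site.

    A segment is a candidate if the timestamp lies within `tolerance` seconds
    of its [start, end] interval. Among several candidates (typical at a
    junction), the one with the highest cumulative confidence wins.
    """
    candidates = [
        seg for seg in segments
        if seg["start"] - tolerance <= lesion_ts <= seg["end"] + tolerance
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda seg: seg["confidence"])["site"]
```

A lesion seen at 10.2 s, just past a boundary at 10 s, would match both neighbouring segments under a 0.5 s tolerance, and the tie is resolved by cumulative confidence rather than by which interval nominally contains the timestamp.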

[0013] In a preferred embodiment, the model adopted by the first agent module, in which target detection and instance segmentation share a backbone network, includes the following. A shared feature extraction backbone network extracts multi-scale feature maps from the input frame image. The target detection branch receives the feature maps output by the shared feature extraction backbone network, performs lesion target detection, and outputs the lesion category and detection box coordinates. The instance segmentation branch receives the multi-scale feature maps output by the shared feature extraction backbone network, performs pixel-level instance segmentation, and outputs the segmentation mask. During the training phase, a composite loss function is constructed, which includes target detection box localization loss, classification confidence loss, and instance segmentation mask loss. The gradient of the instance segmentation mask loss is backpropagated to the shared feature extraction backbone network, so that the instance segmentation task improves the network's feature perception of the lesion target.
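One common way to write such a composite loss is as a weighted sum of the three terms. The weights $\lambda$ and the symbol $\theta_b$ for the shared backbone parameters are notation introduced here for illustration; the application does not state how the terms are weighted.

```latex
\mathcal{L}_{\text{total}}
  = \lambda_{\text{box}}\,\mathcal{L}_{\text{box}}
  + \lambda_{\text{cls}}\,\mathcal{L}_{\text{cls}}
  + \lambda_{\text{mask}}\,\mathcal{L}_{\text{mask}},
\qquad
\frac{\partial \mathcal{L}_{\text{total}}}{\partial \theta_b}
  = \lambda_{\text{box}}\frac{\partial \mathcal{L}_{\text{box}}}{\partial \theta_b}
  + \lambda_{\text{cls}}\frac{\partial \mathcal{L}_{\text{cls}}}{\partial \theta_b}
  + \lambda_{\text{mask}}\frac{\partial \mathcal{L}_{\text{mask}}}{\partial \theta_b}
```

The second expression makes the mechanism explicit: because the backbone parameters receive the mask-loss gradient alongside the detection gradients, the features used by the detection branch are also shaped by the pixel-level segmentation objective.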

[0014] In a preferred embodiment, the medical examination report generated by the fourth agent module adopts a four-segment structure, including: a preoperative preparation segment, in which the intestinal cleanliness score is assessed based on the identification results of the water quality detection model during the frame quality screening process; an operation process segment, in which the records in the site segmentation results are sorted by timestamp to describe the insertion and withdrawal paths of the endoscope; an observation results segment, in which, based on the binding results of the spatiotemporal alignment matching, the target detection results of the lesion targets are grouped by anatomical site and presented together with the medical description text; and a recommended measures segment, in which the rule engine automatically matches clinical recommendations according to the category and physical size of the lesion target.

[0015] This application also provides a multi-agent collaborative intelligent analysis method for endoscopic videos, applied to the above-mentioned system, comprising the following steps.

Video preprocessing step: the edge computing device acquires the endoscope video stream through the acquisition card, performs frame extraction and frame quality screening at a preset frequency, and obtains a video frame sequence.

Parallel recognition step: the first agent module and the second agent module on the edge computing device process the video frame sequence in parallel. The first agent module uses a model in which target detection and instance segmentation share a backbone network to extract the segmentation mask of the lesion target, calculates the physical size of the lesion target in combination with the real-time distance of the lens, and outputs a timestamped target detection result. The second agent module uses a spatiotemporal graph convolutional network to extract the spatiotemporal features of multiple consecutive frames, applies state transition constraints through an anatomical topology finite state machine, and outputs timestamped site segmentation results.

Metadata transmission step: the edge computing device encapsulates the target detection result and the site segmentation result into structured metadata and sends it to the local server through a message middleware, while the original video stream is stored on the edge computing device.

Description generation step: the third agent module on the local server uses the structured metadata as search conditions to retrieve similar cases from the vector database as prompt examples, and inputs these examples together with the structured metadata into the large language model to generate medical description text.

Report generation step: the fourth agent module on the local server performs spatiotemporal alignment matching between the target detection results and the site segmentation results based on the timestamp, binds the lesion target to the corresponding anatomical site, and integrates the structured metadata and the medical description text to generate a medical examination report.

Visualization and closed-loop feedback step: the host computer receives the structured metadata and the medical examination report, performs visualization rendering and display, and, in response to the user's label correction operation, sends the correction instruction back to the local server, triggering the third agent module and the fourth agent module to regenerate the corresponding medical description text and medical examination report.

[0016] The intelligent endoscopic video analysis system and method based on multi-agent collaboration provided by this invention has achieved significant technical effects in multiple dimensions through its overall architecture design and the coordinated cooperation between its various technical modules. The technical effects obtained are described below from the aspects of system architecture, each agent module, and their collaborative mechanism.

[0017] At the overall system architecture level, this invention deploys the first and second agent modules on the edge computing device side for parallel execution, and deploys the third and fourth agent modules on the local server side for sequential execution. The two ends communicate via a message middleware, transmitting only structured metadata, not the raw video stream. This layered deployment architecture resolves the inherent contradiction in computing power requirements between high-frequency real-time visual detection and large language model semantic generation: the edge computing device undertakes detection tasks sensitive to inference latency, while the local server undertakes generation tasks that are computationally demanding but latency-insensitive. This allows two types of tasks, inherently difficult to handle simultaneously on a single device, to be executed efficiently and without interference. Simultaneously, the bandwidth-optimized design for transmitting only structured metadata enables the system to run smoothly in a typical local area network environment, while the design of keeping the raw video stream locally on the edge computing device ensures the privacy and security of patient image data from a physical architecture perspective. At the multi-agent collaborative communication level, the frame queue, detection result queue, part result queue, description queue, and report queue defined in the message middleware implement asynchronous communication and collaborative scheduling between agent modules in an event-driven manner. The description queue only triggers the third agent module when the target detection result corresponding to the same frame contains a lesion target, avoiding unnecessary large language model inference caused by non-lesion frames, and effectively saving server-side computing resources.
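The conditional trigger on the description queue can be sketched with an in-process queue standing in for the message middleware. The message fields and the function name are hypothetical, and a real deployment would use whatever broker the system adopts rather than `queue.Queue`.

```python
import queue


def route_detection(result, description_queue):
    """Forward a detection result to the description queue only if it contains a lesion.

    Returns True when the message was enqueued, so that frames without lesions
    never wake the language-model agent.
    """
    if result["lesions"]:
        description_queue.put(result)
        return True
    return False


# Usage: only the second message reaches the description agent.
descriptions = queue.Queue()
route_detection({"frame_id": 1, "lesions": []}, descriptions)
route_detection({"frame_id": 2, "lesions": [{"category": "polyp"}]}, descriptions)
```

Filtering at the routing step, rather than inside the description agent, keeps empty frames from ever occupying the server-side queue.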

[0018] At the video preprocessing level, this invention sets the frame extraction frequency to a fixed frame rate that matches the physical advancement speed of the endoscope. At this frame rate, the physical displacement of the lens between adjacent frames is much smaller than the diameter of the small lesions of clinical concern, thus achieving a balance between lesion capture rate and computational resource load. Highlight suppression and low-light enhancement preprocessing performed on the extracted frames address, respectively, the specular reflection of the endoscope light source on the moist mucosal surface and the difficulty of identifying small lesions in shadow areas at intestinal bends. A four-level frame quality screening mechanism progressively removes out-of-body frames, black screen frames, blurred frames, water quality interference frames, still frames, logically abnormal frames, and size-abnormal frames at the scene level, quality level, logic level, and statistical level in turn. This ensures that subsequent agent modules only process high-quality frame sequences, reducing the waste of GPU resources on invalid computation and avoiding interference from low-quality frames on detection and recognition accuracy.

[0019] At the first agent module level, the object detection branch and the instance segmentation branch share the same feature extraction backbone network. This eliminates the need for the two tasks to run feature extractors independently, effectively reducing GPU memory usage at runtime and enabling efficient operation on edge computing devices with limited hardware resources. During the training phase, the gradient of the instance segmentation mask loss is backpropagated to the shared backbone network, forming a "segmentation-driven detection" gradient propagation mechanism: the segmentation task requires the network to make accurate judgments on lesion edge pixels, forcing the backbone network to learn more refined pixel-level feature representations. These refined features, in turn, enhance the object detection branch's ability to perceive small, low-contrast lesions, thereby improving the recall rate for small lesions. A two-level confidence threshold is also used: a lower real-time detection threshold ensures that doctors can see all suspected targets during operation, while a higher quality screening threshold ensures that only high-confidence detection results enter the physical size calculation and final report statistics. This balances real-time auxiliary observation against the reliability of report data.
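The two-level threshold amounts to filtering the same detection list twice. The threshold values and field names below are hypothetical placeholders, not values from the application.

```python
def split_by_confidence(detections, display_thr=0.3, report_thr=0.7):
    """Return (shown, reported): everything above the lower threshold is displayed
    live, and only the subset above the higher threshold feeds size calculation
    and report statistics."""
    shown = [d for d in detections if d["score"] >= display_thr]
    reported = [d for d in shown if d["score"] >= report_thr]
    return shown, reported
```

By construction the reported set is always a subset of the displayed set, so nothing appears in the report that the doctor could not have seen on screen.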

[0020] At the level of lesion physical size estimation, this invention calculates the pixel-level equivalent diameter based on the total number of foreground pixels in the instance segmentation mask. Compared to the coarse conversion scheme based on the area of the rectangular detection box, this method follows the contour of irregular lesions and removes the background area inside the rectangular box from the area estimate. On this basis, a dynamic scaling factor K(Z), based on camera intrinsic parameters and the real-time distance from the lens to the mucosa, is introduced to map pixels to physical size, overcoming the problem that a fixed scaling factor cannot adapt to dynamic changes in the focal length and observation distance of the endoscope lens. The temporal statistical filtering mechanism, which removes outliers from the estimated sizes of multiple consecutive frames within a sliding window and takes the median of a preset confidence interval, further reduces the measurement error caused by single-frame jitter and angle tilt, making the final output physical size more reliable for clinical use. These three stages (precise area extraction from the segmentation mask, dynamic distance compensation, and temporal statistical filtering) build on one another to form a complete accuracy chain from pixel-level measurement to physical-level output.

[0021] At the second agent module level, the spatiotemporal graph convolutional network maintains the inherent adjacency relationships of the digestive tract lumen surface topology by performing spatial graph convolution and temporal convolution on a non-Euclidean graph structure. Compared to traditional grid convolution methods, it can capture the global topological pattern of anatomical structures more accurately. An anatomical topological finite state machine introduced at the decision layer uses the cumulative score of the confidence accumulation buffer after time-weighted summation as the decision signal. It judges the legality of state transition candidates according to the anatomical topological constraint table, allowing only transitions between adjacent topological nodes and blocking illegal jumps. This mechanism allows the temporal smoothing information provided by the spatiotemporal graph convolutional network to cascade with the topological constraints imposed by the finite state machine, effectively suppressing classification jump noise generated in the transition regions of traditional frame-by-frame classification methods. Combined with enhancement mechanisms such as optical flow sensing of camera movement direction to dynamically adjust allowed transition directions, marking transition states at low confidence levels, and keyframe anchoring, the robustness and smoothness of part recognition under various complex operational scenarios are further ensured. The architecture design of a shared backbone network and independent classification heads allows the same model skeleton to be adapted to both colonoscopy and gastroscopy by loading different classification heads, thus avoiding classification confusion between different examination types while reusing common digestive tract mucosal feature representations.

[0022] At the third agent module level, a locally deployed large language model, without updating network parameters, retrieves the historical cases most similar to the current case from a vector database as in-context examples for few-shot inference. These cases are then combined with the current structured metadata to assemble a structured prompt template for inference. This avoids the overfitting risk and the high computational and maintenance costs associated with traditional model fine-tuning, while making full use of the extensive pre-trained knowledge of the base model. The effectiveness of this retrieval-augmented generation mechanism depends directly on the quality of the structured metadata output by the upstream first and second agent modules: high-precision lesion target attributes and temporally consistent site segmentation results provide reliable query conditions for vector retrieval, so that the retrieved cases closely match the current case and guide the large language model toward more accurate and standardized medical descriptions. An optional knowledge graph verification mechanism extracts entities from the generated text and checks them against preset association rules. It can detect necessary related elements that are missing from the description text and trigger regeneration, providing additional assurance for the completeness of the medical description.

[0023] At the fourth agent module level, time-stamp-based spatiotemporal alignment matching binds the lesion targets detected by the first agent module to the corresponding anatomical sites identified by the second agent module, achieving a three-layer linkage of "lesion-site-description". Due to the system's high frame rate and the parallel processing of the same frame by the two agent modules, the timestamp deviation is minimal, ensuring the reliability of spatiotemporal alignment matching. In the four-segment report structure, the preoperative preparation segment automatically assesses intestinal cleanliness based on the identification results of the water quality detection model during frame quality screening, eliminating the need for manual scoring by doctors; the operation process segment automatically describes the insertion and withdrawal paths of the endoscope by timestamp; the observation results segment presents the detection results and medical descriptions in a linked group by anatomical site; and the recommended measures segment automatically matches clinical recommendations based on lesion type and physical size using a rule engine. This structured report generation method ensures the completeness and standardization of the report content, reducing the risk of information omission.

[0024] At the level of host computer visualization and closed-loop feedback, a multi-color segmented progress bar indicates the time interval occupied by each anatomical site in a different color. Combined with the click-to-jump function, this allows doctors to quickly locate the target intestinal segment without manually replaying the video. The abnormal event peak graph uses the cumulative number of lesion occurrences as its data source and, after smoothing and attenuation processing, is superimposed on the time axis, making high-confidence abnormal intervals immediately apparent. The label correction operation triggers regeneration by the third and fourth agent modules, forming a complete closed loop from perception to cognition and from manual feedback to adaptive reconstruction, ensuring that the final report accurately reflects the professional judgment of clinicians. One-click export of keyframe images and GIF animations with overlaid annotation boxes and category labels provides a convenient tool for organizing data in clinical research.

[0025] In summary, this invention realizes the intelligent analysis of endoscopic videos from raw acquisition to structured report output through three elements acting together: the computing power allocation and privacy protection mechanism formed by the layered deployment of edge computing devices and local servers, the technical improvements of the four agent modules, and the cascading collaboration established among them through message middleware, structured metadata transmission, and spatiotemporal alignment matching. It achieves significant technical improvements in multiple dimensions, including analysis efficiency, detection accuracy, size estimation reliability, robustness of site identification, description standardization, and report completeness.

[0026] The specification of this application contains numerous technical features distributed across various technical solutions. Listing all possible combinations of these technical features (i.e., technical solutions) would make the specification excessively lengthy. To avoid this problem, the various technical features disclosed in the above-described invention, the various technical features disclosed in the following embodiments and examples, and the various technical features disclosed in the accompanying drawings can be freely combined to form various new technical solutions (all of which are considered to have been described in this specification), unless such a combination of technical features is technically infeasible. For example, one example discloses feature A+B+C, and another example discloses feature A+B+D+E. Features C and D are equivalent technical means that serve the same function, and technically only one needs to be used; they cannot be used simultaneously. Feature E can technically be combined with feature C. Therefore, the solution A+B+C+D should not be considered as described because it is technically infeasible, while the solution A+B+C+E should be considered as described.

Brief Description of the Drawings

[0027] Figure 1 is a schematic diagram of the structure of an intelligent endoscopic video analysis system based on multi-agent collaboration according to an embodiment of this application.

[0028] Figure 2 is a flowchart of an intelligent endoscopic video analysis method based on multi-agent collaboration according to an embodiment of this application.

Detailed Description

[0029] In the following description, many technical details are presented to help the reader better understand this application. However, those skilled in the art will understand that the technical solutions claimed in this application can be implemented even without these technical details and various variations and modifications based on the following embodiments.

[0030] Explanation of some concepts: Edge computing devices are industrial-grade inference terminals equipped with embedded GPU computing units, deployed close to the data source, used to perform low-latency visual inference tasks locally without transmitting raw data to a remote server for processing.

[0031] Spatiotemporal Graph Convolutional Network (ST-GCN) is a deep learning network architecture that performs convolution operations on a dynamic graph structure consisting of spatial nodes and temporal edges. It can simultaneously capture topological neighborhood relationships in the spatial dimension and feature evolution trends in the temporal dimension.

[0032] A finite state machine (FSM) is a discrete mathematical model that has a finite number of states and transitions between states under specific conditions. In this application, it is used to apply state transition rules to the part classification results according to a preset anatomical topological constraint table.

[0033] Retrieval-Augmented Generation (RAG) is an inference technique that combines information retrieval with a generative language model. Before the generation stage, information fragments related to the current query are retrieved from a pre-built knowledge base or vector database and inserted into the prompt as context, improving the accuracy and relevance of the language model's output without updating the model's network parameters.

[0034] A segmentation mask is a pixel-level binary mask output by an instance segmentation model, where foreground pixels are marked as target regions and background pixels are marked as non-target regions, used to accurately delineate the contour boundaries of target objects.

[0035] Message middleware refers to the asynchronous communication infrastructure between various software modules in a distributed system. It enables loosely coupled data exchange between producers and consumers through message queue mechanisms, allowing each module to run independently and be triggered on demand.

[0036] A host computer is a terminal computer in an industrial control or medical equipment system that performs functions such as human-computer interaction, data visualization, and issuing operating instructions.

[0037] A vector database is a database system specifically designed for storing, indexing, and retrieving high-dimensional feature vectors. It supports approximate nearest neighbor retrieval based on vector similarity. In this application, it is used to store feature vectors of historical cases to support retrieval-augmented generation.

[0038] Chain-of-Thought is a prompting technique that guides a large language model to output its intermediate reasoning steps before generating the final answer, thereby improving output quality and interpretability in tasks requiring multi-step logical reasoning.

[0039] The dynamic scaling factor K(Z) is a conversion factor for the physical length of each pixel, calculated based on the camera intrinsic parameters and the real-time distance Z from the endoscope lens to the digestive tract mucosa. Its value is dynamically adjusted as the distance Z changes, and it is used to map the measurement results in the pixel coordinate system to the actual size in the physical coordinate system.

[0040] A shared backbone network refers to the feature extraction network portion that is used by multiple different functional branches in a multi-task learning architecture. Each branch reuses the feature maps extracted by this network to reduce redundant computation and memory usage.

[0041] The following is a brief summary of some of the innovative aspects of this application. The technical solution of this invention is not a simple juxtaposition or substitution of several known technical means; rather, it systematically addresses the multiple technical contradictions that arise in the specific application scenario of intelligent analysis of the entire endoscopic video process. A layered multi-agent processing architecture with inherent technical synergy is constructed between the edge computing device 10 and the local server 20, so that the technical features form an organic whole that is interdependent and mutually reinforcing, thereby achieving technical effects beyond what would be expected from simply superimposing the individual technical means.

[0042] Specifically, the core technical contradiction faced by this invention lies in the fact that the stringent requirements of high-frequency real-time visual detection tasks on inference latency, and the rigid demands of large-scale computing power on semantic generation tasks driven by large language models, cannot be simultaneously met on a single computing device. To resolve this contradiction, this invention deploys the first agent module 11 and the second agent module 12 on the edge computing device 10 for parallel execution, while deploying the third agent module 21 and the fourth agent module 22 on the local server 20 for sequential execution. The two ends communicate via a message middleware, transmitting only structured metadata S3 instead of the raw video stream. This architecture design is not solely based on bandwidth saving considerations, but is inextricably linked to the system's privacy protection mechanism: because the original video stream is stored locally on the edge computing device 10 without being transmitted over the network, the leakage path of patient image data is eliminated at the physical architecture level; at the same time, it is precisely because only lightweight structured metadata S3 is transmitted that the subsequent spatiotemporal alignment matching based on timestamps by the fourth agent module 22 becomes possible—if the original video frames are transmitted, the fourth agent module 22 will need to re-execute visual analysis to obtain alignment information, resulting in a significant increase in the server-side computing burden, which violates the original intention of layered deployment.
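The edge-to-server hand-off described above can be illustrated with a producer and a consumer that exchange only serialized metadata. This is a sketch under stated assumptions: Python's in-process `queue.Queue` stands in for the message middleware, which the text does not name, and the message fields (`timestamp_s`, `lesions`, `site`) are hypothetical.

```python
import json
import queue
import threading

def edge_producer(out_queue, frames):
    """Edge side: publish structured metadata only; raw pixels never leave this function."""
    for frame in frames:
        message = {
            "timestamp_s": frame["timestamp_s"],
            "lesions": frame["lesions"],
            "site": frame["site"],
        }
        out_queue.put(json.dumps(message))
    out_queue.put(None)  # sentinel marking the end of the stream

def server_consumer(in_queue, received):
    """Server side: consume metadata messages until the sentinel arrives."""
    while True:
        raw = in_queue.get()
        if raw is None:
            break
        received.append(json.loads(raw))

# Hypothetical per-frame results; "pixels" represents the raw image kept at the edge.
frames = [
    {"timestamp_s": 1.00, "site": "rectum", "lesions": [], "pixels": b"\x00" * 16},
    {"timestamp_s": 1.04, "site": "rectum",
     "lesions": [{"type": "polyp", "size_mm": 4.8}], "pixels": b"\x00" * 16},
]

channel = queue.Queue()
received = []
consumer = threading.Thread(target=server_consumer, args=(channel, received))
consumer.start()
edge_producer(channel, frames)
consumer.join()
print(received)
```

Each message carries its own timestamp, so a downstream consumer can align lesion and site records without access to the frames themselves.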

[0043] Furthermore, in the first agent module 11, the model architecture in which object detection and instance segmentation share a backbone network is deeply coupled with the physical size calculation method based on the equivalent diameter of the segmentation mask and dynamic distance compensation. The multi-scale feature maps output by the shared backbone network serve both the object detection branch and the instance segmentation branch. The gradient of the instance segmentation mask loss is backpropagated to the shared backbone network, forcing it to learn a more refined pixel-level feature representation, and the accuracy of the subsequent calculation of the pixel-level equivalent diameter D_pixel from the total number of foreground pixels N_pixels of the segmentation mask depends on precisely this refined representation. If the edge of the segmentation mask is rough or the foreground region is inaccurately determined, the calculation of the pixel-level equivalent diameter introduces a systematic bias, which is then carried through the mapping by the dynamic scaling factor K(Z) into the measurement error of the physical size. In other words, the gradient propagation mechanism of "segmentation promoting detection" not only improves the detection recall rate but also indirectly ensures the accuracy of physical size estimation by improving the quality of the segmentation mask. This synergistic gain across functional modules is not something that those skilled in the art could readily foresee when facing a single technical problem.
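The mapping from a segmentation mask to a physical size can be sketched as below. Two assumptions are made that the text does not state: the equivalent diameter is taken as the diameter of a circle whose area equals the foreground pixel count, and K(Z) follows a pinhole camera model in which one pixel spans Z divided by the focal length in pixels. The numeric inputs are illustrative only.

```python
import math

def equivalent_diameter_px(n_pixels):
    """Diameter (in pixels) of a circle whose area equals the mask's foreground pixel count."""
    if n_pixels < 0:
        raise ValueError("pixel count must be non-negative")
    return 2.0 * math.sqrt(n_pixels / math.pi)

def scale_factor_mm_per_px(z_mm, focal_length_px):
    """K(Z) under an assumed pinhole model: millimetres covered by one pixel at distance Z."""
    if focal_length_px <= 0:
        raise ValueError("focal length must be positive")
    return z_mm / focal_length_px

def physical_diameter_mm(n_pixels, z_mm, focal_length_px):
    """Physical equivalent diameter: K(Z) multiplied by D_pixel."""
    return scale_factor_mm_per_px(z_mm, focal_length_px) * equivalent_diameter_px(n_pixels)

# Illustrative values: a mask of about 314 pixels, lens 20 mm from the mucosa,
# and a hypothetical focal length of 500 pixels.
size_mm = physical_diameter_mm(n_pixels=314, z_mm=20.0, focal_length_px=500.0)
print(round(size_mm, 3))
```

Because D_pixel grows with the square root of N_pixels, a relative error in the foreground pixel count yields roughly half that relative error in the diameter, while an error in Z passes through K(Z) linearly.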

[0044] Meanwhile, the joint design of the spatiotemporal graph convolutional network and the anatomical topology finite state machine in the second agent module 12 constitutes another set of features with inherent technical connections. The spatiotemporal graph convolutional network divides video frames into multiple spatial region blocks and establishes temporal edges along the time axis. It performs multi-layer graph convolution on the dynamic spatiotemporal graph composed of spatial region block nodes and their spatial and temporal edges, thereby achieving spatiotemporal feature aggregation in non-Euclidean space. However, spatiotemporal feature aggregation alone is not enough to eliminate the classification jump noise caused by rapid camera shake or insufficient insufflation. It is therefore necessary to introduce the anatomical topology finite state machine at the decision layer to constrain the state transition path according to the physical adjacency of the digestive organs. The key technical connection is that the input signal of the finite state machine is not the single-frame classification result but the cumulative score obtained by weighting and summing the multi-frame probability distributions output by the spatiotemporal graph convolutional network over the confidence accumulation buffer period. The transition decision of the finite state machine therefore implicitly contains the temporal smoothing information provided by the spatiotemporal graph convolutional network. The two form a cascaded collaborative architecture of "feature-layer temporal aggregation first, decision-layer topological constraint second". 
If the temporal aggregation of the spatiotemporal graph convolutional network is absent and topological constraints are applied directly to the single-frame classification result, the finite state machine will frequently receive noisy trigger signals, and the interception efficiency of the topological constraints will drop significantly. Conversely, if there is only a spatiotemporal graph convolutional network without topological constraints, the decision oscillation caused by the gradual change of visual features in a true site transition region cannot be avoided. The suppression of classification jump noise produced by this double-layer nested design of "implicit topological feature learning and explicit topological rule constraint" cannot be obtained simply by increasing the network capacity or by applying a sliding window average to the classification result. It is rooted in the synergistic use of the same anatomical topological prior knowledge at different abstraction levels by the two mechanisms.
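The cascade of score accumulation followed by topological constraint can be sketched as a small state machine. This is a simplified illustration, not the claimed design: the six-site list, the uniform weights, the window of three frames, and the 0.6 threshold are all hypothetical, and the per-frame probability vectors stand in for the network output.

```python
from collections import deque

# Hypothetical linear ordering of sites; adjacency in this list defines legal transitions.
SITES = ["rectum", "sigmoid colon", "descending colon",
         "transverse colon", "ascending colon", "cecum"]

class AnatomicalFSM:
    """Accept a transition only if it targets an adjacent site and the accumulated score is high."""

    def __init__(self, sites, window=3, threshold=0.6, start=0):
        self.sites = sites
        self.state = start
        self.threshold = threshold
        self.buffer = deque(maxlen=window)  # confidence accumulation buffer

    def step(self, probs):
        """Feed one per-frame probability vector and return the current site name."""
        self.buffer.append(probs)
        if len(self.buffer) < self.buffer.maxlen:
            return self.sites[self.state]
        # Uniformly weighted accumulation over the buffer (the weighting is an assumption).
        n_frames = len(self.buffer)
        scores = [sum(p[i] for p in self.buffer) / n_frames for i in range(len(self.sites))]
        candidate = max(range(len(scores)), key=scores.__getitem__)
        is_adjacent = abs(candidate - self.state) == 1
        if is_adjacent and scores[candidate] >= self.threshold:
            self.state = candidate
            self.buffer.clear()  # restart accumulation in the new state
        return self.sites[self.state]

def peaked(index, n_sites=len(SITES), peak=0.9):
    """Toy probability vector concentrated on one site."""
    rest = (1.0 - peak) / (n_sites - 1)
    return [peak if i == index else rest for i in range(n_sites)]

fsm = AnatomicalFSM(SITES)
for _ in range(3):
    after_noise = fsm.step(peaked(5))  # confident but anatomically impossible jump to the cecum
for _ in range(3):
    after_move = fsm.step(peaked(1))   # sustained evidence for the adjacent sigmoid colon
print(after_noise, "->", after_move)
```

In this toy run the sustained "cecum" signal is rejected because the cecum is not adjacent to the rectum, while the adjacent "sigmoid colon" signal is accepted once its accumulated score crosses the threshold.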

[0045] To further illustrate the aforementioned synergistic effect, a quantitative analysis can be performed. Assume that the single-frame classification error rate is ε (i.e., the probability that, in any given frame, the true anatomical site is misclassified as another site). Under the traditional frame-by-frame independent classification paradigm, the probability of at least one erroneous jump occurring in N consecutive frames is 1-(1-ε)^N. When N is large, this probability approaches 1, meaning that an erroneous jump almost inevitably occurs in a complete examination video. After introducing a spatiotemporal graph convolutional network for temporal aggregation over T frames, the statistical smoothing effect of multi-frame features makes the equivalent error rate after temporal aggregation, ε', significantly lower than ε, but still non-zero. This is especially true in a true site transition region, where the true labels of adjacent frames are indeed changing and temporal aggregation may actually introduce a decision lag. When the topological constraints of a finite state machine are then superimposed, any residual misclassification signal that remains after temporal aggregation is intercepted if the state transition it points to is anatomically illegal. Let the total number of anatomical sites be K. For any given current state, there are only two legal transition targets (the previous site and the next site) plus the option of maintaining the current state, giving three legal options in total, while the remaining K-3 targets are illegal. If misclassified signals are assumed to be spread evenly over the K sites, the topological constraints of the finite state machine intercept all misclassified signals pointing to illegal targets, further reducing the effective jump error rate to ε'' = ε' × 3 / K. 
When K=6 (six colonic segments), the topological constraints of the finite state machine alone can reduce the residual error rate to half of the original; when K=8 (eight subdivisions of the colon including the splenic and hepatic flexures), ε'' = ε' × 3 / 8, and the residual error rate is reduced to approximately 38% of the original. This quantitative analysis shows that the temporal aggregation of spatiotemporal graph convolutional networks and the topological constraints of finite state machines have a multiplicative synergistic relationship in reducing the jump error rate, rather than a simple additive one. The spatiotemporal graph convolutional network is responsible for reducing ε to ε' (reducing the error base), while the finite state machine is responsible for reducing ε' to ε' × 3 / K (proportionally intercepting illegal directions). The combined effect of the two is ε → ε' × 3 / K, which is far superior to the level that any single mechanism can achieve.
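The arithmetic in this analysis can be checked numerically. The sketch below only evaluates the two expressions given in the text; the error rates 0.01 and 0.002 and the frame count 1000 are illustrative values, not measurements, and frame errors are treated as independent as the text assumes.

```python
def prob_at_least_one_jump(eps, n_frames):
    """Probability of at least one erroneous jump in n_frames independent frames: 1-(1-eps)^N."""
    return 1.0 - (1.0 - eps) ** n_frames

def residual_after_fsm(eps_aggregated, k_sites, legal_options=3):
    """Effective jump error rate after topological interception: eps' * 3 / K."""
    if k_sites < legal_options:
        raise ValueError("K must be at least the number of legal options")
    return eps_aggregated * legal_options / k_sites

# Illustrative values only.
print(prob_at_least_one_jump(0.01, 1000))  # close to 1 for long videos
for k in (6, 8):
    ratio = residual_after_fsm(1.0, k)      # fraction of eps' that survives the constraint
    print(k, ratio)
```

With K=6 the surviving fraction is 0.5 and with K=8 it is 0.375, matching the figures stated in the text; the combined rate is the product of the reduction from temporal aggregation and this fraction.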

[0046] Furthermore, the retrieval-augmented generation mechanism in the third agent module 21, which uses similar cases retrieved from the vector database as prompt examples, relies heavily on the quality and completeness of the structured metadata S3 output by the first agent module 11 and the second agent module 12. It is precisely because the first agent module 11 outputs a complete target detection result S1 containing lesion category, detection box coordinates, confidence score, and physical size, and the second agent module 12 outputs a site segmentation result S2 with high temporal consistency after topological constraints, that the third agent module 21 can use these high-quality structured features as query conditions to accurately retrieve the most relevant historical cases from the vector database as in-context examples for few-shot inference. In addition, the doctor's label correction operation on the host computer 30 interface can be transmitted back to the local server 20 via the network, triggering regeneration by the third agent module 21 and the fourth agent module 22, thus forming a complete closed loop of "edge perception, server-side cognition, human feedback, adaptive reconstruction". Under this cascading dependency, in which the output quality of the upstream agents determines the retrieval accuracy of the downstream agents and terminal human feedback drives the reconstruction of the entire chain, the four agent modules are no longer functionally independent parallel components but constitute a complete processing chain in which information is refined step by step and semantics are abstracted layer by layer. The independent optimization of any single agent cannot replace the system-level technical effect brought about by overall collaboration.

[0047] Furthermore, the inventors of this application have discovered through long-term in-depth research that the fundamental reason why existing endoscopic-assisted analysis technologies are unable to meet actual clinical needs is not due to insufficient algorithm performance in a single step, but rather to the lack of organic synergy between key technical steps at the system architecture level. Object recognition, location, semantic description, and report generation each operate as isolated processing nodes, and the information flow between them lacks a unified time coordinate and data format standard, resulting in information fragmentation in the entire analysis process.

[0048] Through analysis of numerous real-world endoscopic examination cases, the inventors further discovered that achieving effective linkage between various stages in the entire video analysis process first faces a fundamental technical contradiction at the hardware level. Real-time assisted analysis of endoscopic videos requires the system to complete frame-by-frame lesion detection and site identification within millisecond-level latency, so as to provide doctors with immediate visual cues during the examination; while generating semantic descriptions and structured reports that conform to medical standards requires calling language models with massive parameter scales for deep reasoning, demanding far more GPU memory and computing power than the former. The inventors deeply understand that the fundamental opposition between the real-time requirements and computing power requirements of these two types of computing tasks makes the traditional approach of centrally deploying all functions on a single device inevitably create an irreconcilable bottleneck between low-latency detection and high-intelligence generation—if real-time detection is prioritized, sufficient GPU memory resources cannot be allocated to large language models, and vice versa. Even more challenging was the fact that, after repeated demonstrations, the inventors realized that if all tasks were centrally deployed on high-performance servers, the original video stream would need to be transmitted over the network in real time. This would not only place extremely high demands on network bandwidth but also expose patient image data to the transmission link, posing a serious risk of privacy breaches. 
Only by deploying different types of computing tasks in a layered manner according to their latency sensitivity and computing power requirements, and designing a reasonable data transmission protocol so that only desensitized structured metadata, rather than the original images, is transmitted between the edge and the server, can the requirements for clinical usability be met simultaneously in terms of performance, bandwidth, and privacy.

[0049] After in-depth analysis of the site identification process, the inventors realized that the digestive tract lumen is a continuous curved surface topology, with gradual transition regions between different anatomical sites, rather than discrete, abrupt boundaries. Traditional frame-by-frame image classification methods treat each frame as an independent classification sample, ignoring the temporal continuity between adjacent frames. This can easily lead to frequent jumps in classification results between adjacent sites when the camera zooms in, rotates, or passes through curved sections—for example, at the junction of the descending colon and sigmoid colon, due to slight fluctuations in the features of a single frame, the classification output may repeatedly switch between the two sites, severely affecting the accuracy of lesion anatomical localization in subsequent reports. The inventors further realized that simply increasing the network capacity of the classification model or applying a simple sliding window average to the output results could not fundamentally solve this problem, because the root cause of the jump lies in the fact that the frame-by-frame classification paradigm itself failed to incorporate the anatomical topological order of the digestive tract as structured prior knowledge into the decision-making process; while the arrangement of the various parts of the digestive tract follows a fixed anatomical sequence—for example, starting from the rectum and passing through the sigmoid colon, descending colon, transverse colon, ascending colon to the cecum—this topological constraint can fundamentally eliminate anatomically impossible site jumps.

[0050] In the lesion size estimation stage, the inventors, through repeated experiments and theoretical derivations, recognized that the traditional coarse conversion method based on the area of a rectangular detection box has two inherent defects. First, the rectangular box cannot fit the irregular contour of the lesion, and the large amount of background it encloses leads to an overestimation of the area. Second, the distance between the endoscope lens and the mucosal surface changes continuously during the examination, so the physical size corresponding to the same pixel area differs significantly at different distances, and a fixed conversion coefficient inevitably introduces systematic errors. The inventors realized that if the precise mask output by the instance segmentation task is used instead of the rectangular box as the basis for area calculation, the mapping coefficient from pixels to physical size is dynamically adjusted according to the real-time distance of the lens, and statistical filtering is applied in the time dimension to eliminate occasional single-frame errors, then size estimates that combine spatial accuracy and temporal stability can be obtained. More importantly, if the instance segmentation branch and the object detection branch are designed as a dual-task architecture sharing the same backbone network, the pixel-level edge accuracy required by the segmentation task forces the backbone network to learn more refined feature representations through gradient backpropagation. This "segmentation-driven detection" training strategy can not only improve the detection performance for small lesions but also provide higher-quality mask boundaries for subsequent size estimation.

[0051] The inventors also discovered through long-term observation of the clinical report writing process that the logic followed by doctors when writing examination reports is essentially a process of "associating and binding lesions with specific attributes found in specific anatomical locations with corresponding clinical recommendations." This process presupposes that a one-to-one correspondence must be established between object recognition results and location recognition results through reliable spatiotemporal alignment. Existing single-frame analysis systems, because they do not process video temporal information, cannot obtain the timestamp of lesion appearance and its correspondence with the location segmentation results on the video timeline, and therefore fundamentally lack the ability to achieve this spatiotemporal linkage.

[0052] Based on the above in-depth research, the inventors of this application propose an intelligent endoscopic video analysis system and method based on multi-agent collaboration: It addresses hardware resource conflicts caused by differences in computational features through a layered deployment architecture of edge computing devices and local servers; it solves the classification jump problem in transitional regions of body parts through a double-layer nested design of spatiotemporal graph convolutional networks and anatomical topological finite state machines; it solves the problem of accurate lesion size estimation through a fusion architecture of target detection and instance segmentation sharing a backbone network and a method based on equivalent diameter of segmentation masks and dynamic distance compensation; it solves the information linkage problem between objects, body parts, descriptions, and reports through message middleware-driven four-agent asynchronous collaboration and timestamp alignment mechanisms; and it achieves final quality control of human-machine collaboration through a closed-loop feedback mechanism of visualization rendering and label correction on the host computer. The implementation process of this invention is described in detail below through specific embodiments.

[0053] Example 1. Referring to Figure 1 and Figure 2, this embodiment provides an intelligent endoscopic video analysis system based on multi-agent collaboration. Applied to gastrointestinal endoscopy, this system enables intelligent analysis of the entire video stream acquired by endoscopic equipment and generates structured medical examination reports. The system comprises three sequentially connected hardware entities: an edge computing device, a local server, and a host computer. These three entities form a complete data processing link via a local area network. The edge computing device handles high-frequency, lightweight visual inference tasks, the local server handles low-frequency, computationally intensive language model generation tasks, and the host computer provides the human-computer interaction interface and visualization functions. The core design concept of this layered deployment architecture is to offload real-time-critical visual detection tasks to the edge, close to the data source, while centralizing computationally demanding but latency-insensitive semantic generation tasks on the server side. This resolves the contradiction that a single computing device cannot simultaneously meet the requirements of low-latency detection and high-intelligence generation. Meanwhile, since the edge computing device transmits only encapsulated structured metadata to the local server, rather than the original video stream, the system can run smoothly in an ordinary gigabit LAN environment, and the original video data always remains at the edge, which protects the privacy and security of patient image data at the physical architecture level.

[0054] The hardware deployment architecture is described below.

[0055] Specifically, in this embodiment, the edge computing device is an industrial-grade inference terminal equipped with an embedded GPU. Its GPU memory must be able to simultaneously accommodate the inference models of both the first and second agent modules. The endoscope outputs the raw video stream to the acquisition card via its video output port. The acquisition card digitizes the video signal and transmits it to the edge computing device. The raw output format of the acquisition card is typically YUV422 or RGB888. In the video processing pipeline of the edge computing device, a format conversion module immediately converts the data to a tensor format that meets the input requirements of the inference engine. The local server is equipped with a high-performance GPU for running computationally intensive tasks such as large language models. The host computer communicates with the local server via a local area network, uses the WebSocket long-connection protocol to receive structured metadata and analysis results pushed in real time, and builds its visualization interface on the Qt framework. The edge computing device pushes inference results to the local server at high frequency via TCP sockets or message queues. After processing, the local server pushes the generated real-time annotation information to the host computer over the WebSocket long connection. After receiving the data, the host computer only performs graphic overlay rendering at the UI layer, without blocking the video playback thread. For the final report, the server generates a structured file and then notifies the host computer to retrieve it via an HTTP interface.

[0056] The following describes the video preprocessing module.

[0057] After the edge computing device acquires the video stream output by the endoscope through the acquisition card, the video preprocessing module first performs frame extraction and frame quality screening to obtain a standardized video frame sequence that can be processed by the downstream agent module.

[0058] Regarding the frame rate setting, this invention sets it to a fixed frame rate that matches the physical advance speed of the endoscope. According to clinical guidelines, the standard advance speed during endoscopic examination is approximately several centimeters per minute (e.g., about 3 to 5 centimeters per minute). At this frame rate, the physical displacement of the lens between adjacent frames is much smaller than the diameter of the small lesions of clinical concern, thus ensuring that lesions are not missed from the field of view even during rapid endoscope retraction. If the frame rate is too low, there is a risk of missing small lesions during rapid endoscope withdrawal or flushing operations; if the frame rate is too high, GPU utilization saturates, and inference latency increases significantly. Therefore, this frame rate selection reflects a balance between lesion capture rate and computational resource load.
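The relationship between advance speed and per-frame lens displacement can be worked through numerically. The text does not state the frame rate, so the value of 25 frames per second below is purely a hypothetical parameter chosen for illustration; only the 3 to 5 centimeters per minute range comes from the paragraph above.

```python
def displacement_per_frame_mm(speed_cm_per_min, frame_rate_fps):
    """Lens displacement between adjacent frames, in millimetres."""
    if frame_rate_fps <= 0:
        raise ValueError("frame rate must be positive")
    speed_mm_per_s = speed_cm_per_min * 10.0 / 60.0
    return speed_mm_per_s / frame_rate_fps

HYPOTHETICAL_FPS = 25.0  # assumption for illustration; not specified in the text

for speed_cm_per_min in (3.0, 5.0):
    step_mm = displacement_per_frame_mm(speed_cm_per_min, HYPOTHETICAL_FPS)
    print(f"{speed_cm_per_min} cm/min -> {step_mm:.4f} mm per frame")
```

Under these assumed numbers the displacement per frame is a few hundredths of a millimetre, which illustrates why it can be much smaller than the diameter of a small lesion; a faster withdrawal or a lower frame rate scales the displacement proportionally.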

[0059] After frame extraction, the system performs image preprocessing on each frame. This preprocessing includes two aspects: highlight suppression and low-light enhancement. For highlight suppression, the endoscopic light source easily creates high-brightness reflections on the moist mucosal surface, leading to model misjudgment. The system extracts regions in the frame image whose brightness exceeds a set high-brightness threshold to generate a highlight detection mask, and then performs adaptive histogram equalization on the mask-covered area to restore the mucosal texture details hidden by strong reflected light. For low-light enhancement, for shadow areas with brightness below a set low-brightness threshold at intestinal bends or behind folds, Gamma correction is used to improve the contrast of dark areas, making small lesions in dark areas easier for subsequent models to identify.
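The two threshold-driven steps above can be sketched as follows, using only the standard library and a grayscale image stored as nested lists. This is a simplified sketch: the adaptive histogram equalization applied inside the highlight mask is omitted, and the threshold and gamma values are hypothetical parameters, not values from the source.

```python
from typing import List

Image = List[List[int]]  # grayscale image, pixel values 0..255


def highlight_mask(img: Image, high_thr: int) -> List[List[bool]]:
    """True where brightness exceeds the high-brightness threshold."""
    return [[px > high_thr for px in row] for row in img]


def gamma_correct_dark(img: Image, low_thr: int, gamma: float) -> Image:
    """Apply gamma correction only to pixels darker than low_thr.

    A gamma below 1 brightens dark pixels; brighter pixels are left unchanged.
    """
    lut = [round(255.0 * (v / 255.0) ** gamma) for v in range(256)]
    return [[lut[px] if px < low_thr else px for px in row] for row in img]


frame = [[10, 128, 250]]
print(highlight_mask(frame, high_thr=240))
print(gamma_correct_dark(frame, low_thr=50, gamma=0.5))
```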

[0060] After preprocessing, the frame sequence enters a four-level frame quality screening process. The first level is scene-level screening, which uses a scene-level model to determine whether the current frame is an out-of-body frame or a black-screen frame; if so, it is directly discarded. The second level is quality-level screening, which uses a quality-level classification model to filter blurry frames and frames degraded by water or fluid interference, and a similarity model to filter duplicate frames caused by static shots. The third level is logic-level screening, which combines a timing counter and anatomical path constraints to filter logically abnormal frames: the timing counter requires that an anomaly persist for a preset number of consecutive frames before it is considered valid, avoiding instantaneous misjudgments, while the anatomical path constraint determines the logical rationality of the frame sequence based on the anatomical topology of the digestive tract. The fourth level is statistical screening, which, during subsequent lesion size calculation, uses a preset confidence interval to filter video frames with abnormal size fluctuations. Only video frames that pass the above four-level frame quality screening are sent to the downstream first and second agent modules for parallel processing. This four-level screening mechanism progressively eliminates invalid frames from coarse to fine, allowing subsequent agent modules to process only high-quality frame sequences. This improves detection accuracy and avoids wasting GPU resources on invalid computations.

[0061] The first agent module (object recognition) is described below.

[0062] The first agent module runs on the GPU of the edge computing device and is responsible for lesion detection and instance segmentation of each frame image, and for estimating the physical size of the detected lesions. This module employs a model architecture in which object detection and instance segmentation share a backbone network. Specifically, the model consists of three parts: a shared feature extraction backbone network, an object detection branch, and an instance segmentation branch. The shared feature extraction backbone network extracts multi-scale feature maps from the input frame image. The object detection branch receives the feature maps output by the backbone network to perform lesion detection, outputting the lesion category and detection box coordinates. The instance segmentation branch receives the multi-scale feature maps output by the backbone network to perform pixel-level instance segmentation, outputting a fine-grained segmentation mask. The two branches share the features extracted by the same backbone network, rather than running feature extractors independently. This shared feature extraction layer effectively reduces GPU memory usage during model runtime, enabling efficient operation on the limited hardware resources of edge computing devices.

[0063] More specifically, during the training phase, the system constructs a composite loss function that includes an object detection bounding box localization loss, a classification confidence loss, and an instance segmentation mask loss. This composite loss function can be expressed as: L_total = α·L_CIoU + β·L_BCE_cls + γ·(L_Dice + L_BCE_mask). Here, L_CIoU is the Complete Intersection over Union (CIoU) localization loss, which is used to optimize the localization accuracy of the detection box for the lesion area. This loss jointly considers the overlap area between the predicted box and the ground-truth box, the distance between their center points, and their aspect ratios, and is especially suitable for the irregular shape of the lesion target. L_BCE_cls is the binary cross-entropy classification loss, used to supervise the confidence prediction of lesion categories; L_Dice is the Dice loss, used to address the extreme imbalance between foreground and background pixels in instance segmentation; L_BCE_mask is a mask-level binary cross-entropy loss used to refine the edge accuracy of the segmentation mask; α, β, and γ are the weight coefficients of the loss terms. It should be noted that the gradient of the instance segmentation mask loss is backpropagated to the shared feature extraction backbone network, forming a "segmentation-to-detection" gradient propagation mechanism: the segmentation task requires the network to make accurate judgments on lesion edge pixels, which forces the backbone network to learn more refined pixel-level feature representations. These refined features, in turn, enhance the ability of the object detection branch to perceive small, low-contrast lesions, thereby improving the detection recall rate for small lesions.
Simultaneously, the refined segmentation mask output by the instance segmentation branch provides an accurate region of interest for target region cropping in subsequent lesion size estimation, replacing the traditional rectangular detection box cropping method, reducing the interference of background noise on the input of the downstream size classification network, and providing the host computer with accurate lesion contour overlay rendering.
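As a hedged illustration of how the named loss terms could be combined, the sketch below evaluates each term on plain Python values. The weighted-sum form alpha·CIoU + beta·BCE(class) + gamma·(Dice + BCE(mask)) is an assumption about how the three weight coefficients are applied, and the function names are hypothetical. An actual training pipeline would compute these terms on tensors in a deep learning framework with automatic differentiation.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def ciou_loss(pred: Box, gt: Box, eps: float = 1e-9) -> float:
    """1 - CIoU: overlap, centre distance and aspect-ratio consistency."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    pw, ph, gw, gh = px2 - px1, py2 - py1, gx2 - gx1, gy2 - gy1
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    iou = inter / (pw * ph + gw * gh - inter + eps)
    # Squared centre distance over squared diagonal of the enclosing box.
    centre2 = ((px1 + px2 - gx1 - gx2) ** 2 + (py1 + py2 - gy1 - gy2) ** 2) / 4.0
    diag2 = (max(px2, gx2) - min(px1, gx1)) ** 2 + (max(py2, gy2) - min(py1, gy1)) ** 2 + eps
    v = (4.0 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(pw / (ph + eps))) ** 2
    trade_off = v / (1.0 - iou + v + eps)
    return 1.0 - iou + centre2 / diag2 + trade_off * v


def bce_loss(probs: Sequence[float], targets: Sequence[float], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over elements."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(probs)


def dice_loss(probs: Sequence[float], targets: Sequence[float], eps: float = 1e-7) -> float:
    """1 - Dice coefficient over flattened mask pixels."""
    inter = sum(p * t for p, t in zip(probs, targets))
    return 1.0 - (2.0 * inter + eps) / (sum(probs) + sum(targets) + eps)


def composite_loss(box_pred, box_gt, cls_prob, cls_target, mask_prob, mask_target,
                   alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """Assumed weighted sum of localization, classification and mask terms."""
    seg = dice_loss(mask_prob, mask_target) + bce_loss(mask_prob, mask_target)
    return (alpha * ciou_loss(box_pred, box_gt)
            + beta * bce_loss(cls_prob, cls_target)
            + gamma * seg)
```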

[0064] Regarding training data, the model employs a hybrid approach combining publicly available datasets from multiple sources with private clinical data from partner hospitals. The publicly available datasets are used for pre-training to enhance the perception of general polyp morphologies, while the private clinical data is used for fine-tuning to adapt to actual examination scenarios. Annotation utilizes a human-machine collaborative process combining AI pre-segmentation and refined physician review: first, a general segmentation model is used to pre-segment the initial detection results to generate candidate masks; then, qualified physicians review and correct the annotations frame by frame to ensure quality. The backbone network is then fine-tuned using transfer learning after loading pre-trained weights from a general object detection dataset.

[0065] Furthermore, the first agent module sets two levels of confidence thresholds to balance sensitivity and accuracy. The first level is the real-time detection threshold, used to render detection results with a confidence level higher than this threshold onto the video screen of the host computer in real time, ensuring that doctors can see all suspected targets during operation, and even if there are a few false alarms, they can be ignored manually. The second level is the high-quality screening threshold, which is higher than the real-time detection threshold. Only when the detection confidence level reaches the high-quality screening threshold is the detection result of that frame sent to the physical size calculation process and included in the final output target detection result, thereby preventing blurry frames or background noise from contaminating the statistical data of the report. This dual-threshold design allows the system to achieve a good balance between assisting doctors in real-time observation and ensuring the reliability of report data.
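The dual-threshold routing described above can be sketched as follows. The class name, function name, and example threshold values are hypothetical; the source does not state numeric thresholds.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    label: str
    confidence: float


def route_detections(dets: List[Detection], realtime_thr: float,
                     quality_thr: float) -> Tuple[List[Detection], List[Detection]]:
    """Split detections into (rendered in real time, forwarded to size calculation)."""
    if quality_thr <= realtime_thr:
        raise ValueError("the high-quality threshold must exceed the real-time threshold")
    rendered = [d for d in dets if d.confidence > realtime_thr]
    reported = [d for d in dets if d.confidence >= quality_thr]
    return rendered, reported


dets = [Detection("polyp", 0.35), Detection("polyp", 0.82), Detection("polyp", 0.10)]
rendered, reported = route_detections(dets, realtime_thr=0.3, quality_thr=0.7)
print(len(rendered), len(reported))
```

Every detection that reaches the report is also rendered, while low-confidence detections are shown to the doctor without affecting the report statistics.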

[0066] Regarding the calculation of lesion physical dimensions, this invention employs an intelligent size assessment method based on multi-model fusion and temporal statistical steady-state constraints. This method includes target tracking, multi-path size information extraction and fusion, temporal steady-state statistics, and joint spatial and size compensation, abandoning the traditional scheme based on rough conversion of rectangular detection box area.

[0067] First, after detecting a lesion, the system establishes a unique target identifier across frames through a target tracking module. This target tracking module runs on an edge computing device and employs a motion geometry matching strategy based on the target's centroid position and Euclidean distance to associate the same lesion detected in consecutive frames with a unified target identifier, enabling subsequent cumulative statistical estimation of the target's size over time.
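A minimal sketch of centroid-based association is given below. It assumes greedy nearest-neighbour matching with a maximum matching distance, which is one simple way to realize matching on centroid position and Euclidean distance; the class name and the `max_distance` parameter are hypothetical, and track expiry is not modelled.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


class CentroidTracker:
    def __init__(self, max_distance: float):
        self.max_distance = max_distance
        self.tracks: Dict[int, Point] = {}  # target identifier -> last centroid
        self._next_id = 0

    def update(self, centroids: List[Point]) -> List[int]:
        """Return one target identifier per centroid, in input order."""
        # All (distance, detection index, track id) pairs, nearest first.
        pairs = sorted(
            (math.dist(c, pos), ci, tid)
            for ci, c in enumerate(centroids)
            for tid, pos in self.tracks.items()
        )
        assigned: Dict[int, int] = {}
        used_tracks = set()
        for dist, ci, tid in pairs:
            if dist > self.max_distance:
                break  # remaining pairs are even farther apart
            if ci in assigned or tid in used_tracks:
                continue
            assigned[ci] = tid
            used_tracks.add(tid)
        # Detections without a nearby track start a new target identifier.
        for ci in range(len(centroids)):
            if ci not in assigned:
                assigned[ci] = self._next_id
                self._next_id += 1
        for ci, c in enumerate(centroids):
            self.tracks[assigned[ci]] = c
        return [assigned[ci] for ci in range(len(centroids))]
```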

[0068] For each target identifier, the system performs two types of size information extraction in parallel. The first is a dual-path size classification network, comprising two different model branches: a first classification sub-network and a second classification sub-network. These branches perform hierarchical prediction on the locally cropped image of the lesion target, each outputting multiple candidate size levels and corresponding confidence scores. In one exemplary implementation, the first classification sub-network divides the lesion size into six discrete levels, corresponding to nominal sizes of approximately 2, 4, 5, 6, 8, and 10 mm; the second classification sub-network divides the lesion size into twelve discrete levels, corresponding to nominal sizes of approximately 2.5, 3.0, 3.5, 4.0, 5.0, 5.6, 6, 7, 8, 10, 14, and 20 mm. The output of the dual-path classification network includes the probability distribution of the discrete categories and the continuous size regression values mapped from them. The second is a feature inference network based on regional statistical features, which takes the geometric statistical features and color distribution features of the target region as input and outputs corresponding size estimates. In one exemplary implementation, the input feature dimension of the feature inference network is thirty-four dimensions.
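One plausible way to map a discrete probability distribution to a continuous size value is the probability-weighted mean of the nominal sizes, sketched below. This mapping is an assumption made for illustration; the source does not specify how the regression value is derived.

```python
from typing import Sequence

# Nominal sizes (mm) of the two exemplary classification sub-networks.
FIRST_BRANCH_SIZES_MM = (2, 4, 5, 6, 8, 10)
SECOND_BRANCH_SIZES_MM = (2.5, 3.0, 3.5, 4.0, 5.0, 5.6, 6, 7, 8, 10, 14, 20)


def expected_size_mm(probs: Sequence[float], nominal_sizes: Sequence[float]) -> float:
    """Probability-weighted mean of the nominal sizes (assumed mapping)."""
    if len(probs) != len(nominal_sizes):
        raise ValueError("one probability is required per size level")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return sum(p * s for p, s in zip(probs, nominal_sizes)) / total


# Probability mass split evenly between the 4 mm and 5 mm levels.
print(expected_size_mm([0, 0.5, 0.5, 0, 0, 0], FIRST_BRANCH_SIZES_MM))
```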

[0069] Subsequently, the system inputs the classification results of the dual-path classification network, the size estimate of the feature inference network, the spatial location of the target detection box in the frame image, and the proportion of the target detection box area to the total frame area (i.e., ROI ratio) into the rule fusion engine. The engine performs fusion processing according to preset fusion rules, outputting the final size estimate and corresponding volume estimate for each frame. The ROI ratio serves both a gating and correction function during the fusion process: it acts as a gating condition to select the applicable fusion rule branch, and as a correction factor to influence the switching between different size result segments.

[0070] To improve measurement stability and clinical interpretability, this invention further introduces a target-identifier-oriented temporal statistics module. Specifically, the system performs sliding window accumulation on consecutive frame size estimates of the same target identifier, calculates quantile statistics and preset confidence intervals, and outputs intervalized size results, including the median and interval values, after reaching a stability condition. In an exemplary embodiment, the preset confidence interval is a 95% confidence interval. This temporal statistical filtering mechanism effectively eliminates measurement errors caused by single-frame jitter and angular tilt, making the final output size data clinically reliable.
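The sliding-window statistics can be sketched as follows. Two assumptions are made here and are not taken from the source: the interval is computed as the empirical 2.5% to 97.5% quantile range of the window, and the stability condition is a minimum sample count together with a maximum interval width. All parameter names and default values are hypothetical.

```python
from collections import deque
from statistics import median
from typing import Dict, List, Optional


def quantile(sorted_vals: List[float], q: float) -> float:
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


class SizeWindow:
    """Per-target sliding window of size estimates (mm)."""

    def __init__(self, window: int = 30, min_samples: int = 10, max_width_mm: float = 2.0):
        self.values = deque(maxlen=window)
        self.min_samples = min_samples
        self.max_width_mm = max_width_mm

    def add(self, size_mm: float) -> Optional[Dict[str, float]]:
        """Add one estimate; return the interval result once stable, else None."""
        self.values.append(size_mm)
        if len(self.values) < self.min_samples:
            return None
        vals = sorted(self.values)
        low, high = quantile(vals, 0.025), quantile(vals, 0.975)
        if high - low > self.max_width_mm:
            return None  # estimates still fluctuate too much
        return {"median": median(vals), "low": low, "high": high}
```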

[0071] Furthermore, to address the systematic errors caused by distortion at the edge of the endoscope's field of view and changes in viewing angle, this invention introduces two-stage processing: spatial position compensation and size segmentation compensation.

[0072] Regarding spatial position compensation, the system divides the field of view into a central region and a non-central region based on the position of the target detection box center point in the image coordinate system, and assigns corresponding position compensation coefficients. A smaller compensation coefficient is used for the central region, and a larger compensation coefficient is used for the non-central region to correct measurement deviations caused by lens edge distortion, viewing angle changes, and imaging non-uniformity. In one exemplary embodiment, the position compensation coefficient is determined based on preset rule parameters, which are obtained through offline calibration. Optionally, the field of view can be divided into a 3×3 grid, with different compensation coefficients assigned to each region based on its position.

[0073] Regarding segmented size compensation, after obtaining the position compensation coefficients, the system performs segmented compensation on the initial size estimate. The target size is divided into at least three intervals: a small size interval, a medium size interval, and a large size interval. For small and medium-sized targets, positive gain correction is used; for large-sized targets, suppressive correction is used to reduce the risk of systematic overestimation under large-size conditions. The segmented compensation function maps the initial size estimate s to the compensated size ŝ, where δp is the spatial position compensation amount and αk is the segment weight corresponding to the size interval k; when the target is located in the large size interval, αk can take a negative value to achieve suppressive correction. In an exemplary embodiment, the small size interval is less than 3 mm, the medium size interval is 3 to 15 mm, and the large size interval is greater than 15 mm.
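The two compensation stages can be sketched as below. The functional form used here, compensated size = s × position coefficient × (1 + segment weight), is an assumption made purely for illustration and is not the formula of the source. The coefficient and weight values are likewise hypothetical placeholders for parameters that would come from offline calibration.

```python
def position_coefficient(cx: float, cy: float, width: int, height: int,
                         center_coef: float = 1.0, edge_coef: float = 1.1) -> float:
    """3x3 grid: the middle cell is the central region, all others are non-central."""
    col = min(int(3 * cx / width), 2)
    row = min(int(3 * cy / height), 2)
    return center_coef if (row, col) == (1, 1) else edge_coef


def segment_weight(size_mm: float, small: float = 0.10, medium: float = 0.05,
                   large: float = -0.05) -> float:
    """Segment weight for the exemplary intervals: <3 mm, 3 to 15 mm, >15 mm."""
    if size_mm < 3.0:
        return small
    if size_mm <= 15.0:
        return medium
    return large  # negative weight: suppressive correction


def compensate(size_mm: float, cx: float, cy: float, width: int, height: int) -> float:
    # ASSUMED functional form, not taken from the source.
    coef = position_coefficient(cx, cy, width, height)
    return size_mm * coef * (1.0 + segment_weight(size_mm))


print(compensate(10.0, 960, 540, 1920, 1080))  # central region, medium interval
print(compensate(20.0, 960, 540, 1920, 1080))  # central region, large interval
```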

[0074] Furthermore, the system inputs the compensated dimensions into the volume conversion module and outputs the target volume estimate, achieving a consistent assessment of dimensions and volume. The volume conversion is achieved jointly through a pre-calibrated volume conversion table and a proportional mapping function: first, a baseline volume is obtained by querying a pre-set volume table established based on phantom calibration and clinical statistics according to the fusion results; then, the volume is scaled according to the final dimensions using a proportional formula.

[0075] The aforementioned closed-loop evaluation chain of "detection → tracking → multi-model fusion → temporal steady state → spatial compensation → interval output" achieves highly robust, traceable, and interpretable evaluation of lesion size and volume.

[0076] The second agent module (part recognition) is described below.

[0077] The second agent module also runs on the edge computing device, processing the same frame in parallel with the first agent module, with no data dependency between them. The second agent module is responsible for identifying the anatomical location of the endoscope based on the video frame sequence and outputting timestamped segmentation results. This module uses a spatiotemporal graph convolutional network to extract spatiotemporal features from multiple consecutive frames and combines this with a predefined anatomical topology finite state machine for state transition constraints. The technical consideration for choosing a graph convolutional network instead of a traditional convolutional neural network or recurrent neural network is that the digestive tract lumen is a continuous curved-surface topology, and traditional grid convolution disrupts the inherent spatial adjacency relationships of organs, whereas graph convolution can operate directly on non-Euclidean space, maintaining the topological consistency of the anatomical structure. Furthermore, recurrent neural networks are prone to vanishing gradients when processing long sequences, leading to the forgetting of early features, while Transformers have excessive computational cost and are insensitive to local details; neither is as suitable for this scenario as a spatiotemporal graph convolutional network.

[0078] The spatiotemporal graph is constructed as follows: a single video frame is divided into multiple non-overlapping spatial regions, each region serving as a spatial node in the graph. A lightweight feature extractor is used to extract the feature vector of each spatial node. Spatial edges are established between physically adjacent spatial nodes, forming a spatial adjacency matrix. Optionally, a self-attention mechanism can be introduced to dynamically adjust edge weights based on the feature similarity of the current frame. If two adjacent regions have significantly different textures, the weights are automatically reduced to avoid erroneous feature aggregation. Simultaneously, spatial nodes at the same spatial location in consecutive frames are connected along the time axis to establish temporal edges. Spatial edges capture the topological relationships of different regions within the same frame, while temporal edges capture the feature evolution trajectory of the same region as the camera moves forward; together, they constitute a dynamic spatiotemporal graph.
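The graph topology described above can be sketched as follows. The sketch assumes a regular grid partition with 4-neighbour spatial adjacency, which is one possible reading of "physically adjacent" regions; feature extraction and attention-based edge weighting are not modelled.

```python
from typing import List, Tuple

Node = Tuple[int, int, int]  # (frame index, grid row, grid column)
Edge = Tuple[Node, Node]


def build_spatiotemporal_edges(num_frames: int, rows: int,
                               cols: int) -> Tuple[List[Edge], List[Edge]]:
    """Return (spatial edges, temporal edges) for a grid of regions per frame."""
    spatial: List[Edge] = []
    temporal: List[Edge] = []
    for t in range(num_frames):
        for r in range(rows):
            for c in range(cols):
                if c + 1 < cols:  # right neighbour in the same frame
                    spatial.append(((t, r, c), (t, r, c + 1)))
                if r + 1 < rows:  # lower neighbour in the same frame
                    spatial.append(((t, r, c), (t, r + 1, c)))
                if t + 1 < num_frames:  # same region in the next frame
                    temporal.append(((t, r, c), (t + 1, r, c)))
    return spatial, temporal


spatial, temporal = build_spatiotemporal_edges(num_frames=2, rows=3, cols=3)
print(len(spatial), len(temporal))
```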

[0079] The constructed dynamic spatiotemporal graph aggregates spatiotemporal features through multiple spatiotemporal graph convolutional layers. Each layer contains a spatial graph convolution operating along the spatial edges and a one-dimensional temporal convolution operating along the time axis, followed by batch normalization and activation functions. The number of output channels in each layer gradually increases from low to high order, forming a hierarchical representation from shallow local texture features to deep global semantic features. After processing by global average pooling and a fully connected classifier, the probability distribution of each anatomical part category at the current time step is output.

[0080] Furthermore, the second agent module adopts an architecture of a shared backbone network and independent classification heads to support different types of endoscopic examinations. The feature vector extractor of the spatial nodes and the multi-layer spatiotemporal graph convolutional layers use the same set of weights as the shared backbone network, making full use of the highly similar texture features of the digestive tract mucosa in the stomach and intestines. After global average pooling, the corresponding independent fully connected classification head is loaded according to the examination type parameter. The colonoscopy classification head outputs the probability distribution of colon anatomical parts, corresponding to anatomical segments such as the cecum, ascending colon, transverse colon, descending colon, sigmoid colon, and rectum; the gastroscopy classification head outputs the probability distribution of stomach anatomical parts, corresponding to anatomical segments such as the esophagus, cardia, fundus, body, antrum, and pylorus. The shared backbone network is first pre-trained on a mixed dataset to learn the general topological evolution rules of the digestive tract, and then the two independent classification heads are fine-tuned using data of the corresponding types, which preserves generalization ability and avoids classification confusion.

[0081] At the decision-making level, to suppress the classification jump noise that traditional frame-by-frame classification methods generate in transition regions, this invention introduces an anatomical topology finite state machine for state transition constraints. The specific process is as follows: the system does not directly accept the maximum-probability class of a single frame, but instead stores the probability distributions of multiple consecutive frames in a confidence accumulation buffer and performs time-weighted summation (higher weight for recent frames, lower weight for older frames) to obtain a cumulative score. When the cumulative score of the target anatomical part exceeds a preset switching threshold and maintains this trend for multiple consecutive time steps, a state transition candidate is triggered. The anatomical topology finite state machine then makes a determination based on a built-in anatomical topology constraint table. This constraint table defines the legal transition relationships between parts according to the anatomical order of the digestive tract; for example, from the rectum, only a transition to the sigmoid colon is allowed, and a direct jump to the transverse colon is strictly prohibited. If the candidate target anatomical part and the current anatomical part are not legally adjacent topological nodes, the transition signal is intercepted and the current anatomical part state is forcibly maintained.

[0082] To more clearly illustrate the specific structure of the anatomical topological constraint table, a specific implementation of the state transition matrix is given below using a colonoscopy scenario as an example. Define the set of colonic anatomical location states S = {S0: rectum, S1: sigmoid colon, S2: descending colon, S3: splenic flexure, S4: transverse colon, S5: hepatic flexure, S6: ascending colon, S7: cecum}. In the direction of endoscope insertion (advancing from the anus to the proximal end), the legal state transition path is: S0→S1→S2→S3→S4→S5→S6→S7, meaning that only transitions from the current state to the anatomically adjacent next location along this path are allowed; jumps across intermediate locations are strictly prohibited. Specifically, the state transition legality matrix T is defined as follows: T[Si, Sj] = 1 if and only if |i - j| ≤ 1; otherwise T[Si, Sj] = 0. For example, T[S0, S1] = 1 (a transition from the rectum to the sigmoid colon is legal), and T[S0, S3] = 0 (a direct jump from the rectum to the splenic flexure is illegal). In the withdrawal direction, the legal transition path is reversed to S7→S6→S5→S4→S3→S2→S1→S0. The system dynamically loads the corresponding transition matrix based on the currently detected advance / retreat direction.
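The legality matrix for the colonoscopy example can be constructed as in the sketch below. The adjacency rule (index difference of at most 1) follows the definition above; the `direction` argument, which restricts the matrix to the insertion or withdrawal path while still permitting a state to remain unchanged, is a hypothetical interpretation of the direction-dependent matrices.

```python
from typing import List

COLON_STATES = [
    "rectum", "sigmoid colon", "descending colon", "splenic flexure",
    "transverse colon", "hepatic flexure", "ascending colon", "cecum",
]


def legality_matrix(n: int, direction: str = "any") -> List[List[int]]:
    """T[i][j] == 1 when the transition from state i to state j is allowed."""
    if direction not in ("any", "insertion", "withdrawal"):
        raise ValueError("unknown direction")
    T = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if abs(i - j) > 1:
                continue  # jumps across intermediate locations are illegal
            if direction == "any" or i == j:
                T[i][j] = 1
            elif direction == "insertion" and j == i + 1:
                T[i][j] = 1
            elif direction == "withdrawal" and j == i - 1:
                T[i][j] = 1
    return T


T = legality_matrix(len(COLON_STATES))
print(T[0][1], T[0][3])  # rectum -> sigmoid colon, rectum -> splenic flexure
```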

[0083] It should be noted that the strict adjacency constraint of |i - j| ≤ 1 mentioned above is the default configuration. In actual clinical applications, considering that the anatomical transition areas of the colonic flexures (such as the splenic flexure S3 and the hepatic flexure S5) are relatively short, these flexure segments can optionally be merged with adjacent straight segments into a composite state (e.g., merging S3 into S2 or S4) to adapt to the different needs of different clinical institutions for anatomical segmentation granularity. Similarly, the state set S' = {S'0: esophagus, S'1: cardia, S'2: fundus, S'3: body, S'4: antrum, S'5: pylorus, S'6: duodenal bulb} in the gastroscopy scenario is constructed according to the same principle.

[0084] The triggering logic for state transition is as follows: Assume the finite state machine at the current time step is in state Si. After time-weighted summation of the confidence accumulation buffer, the cumulative score of the target state Sj is Scorej. State transition is executed only when the following three conditions are met: (i) Scorej exceeds the preset switching threshold θswitch; (ii) Scorej maintains an upward trend or remains above θswitch for T consecutive time steps; (iii) T[Si, Sj]=1 in the transition legality matrix, meaning the transition is anatomically legal. All three conditions must be met simultaneously. If condition (iii) is not met while conditions (i) and (ii) are met, the system determines the signal as classification jump noise and intercepts it. Simultaneously, the abnormal event is recorded in the system log for subsequent algorithm optimization reference.
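The three-condition trigger can be sketched as a small state machine. Several details are assumptions made for the sketch: exponentially decaying weights are used for the time-weighted summation, the scores are normalized by the weight total, and condition (ii) is implemented as "remains above the threshold for a number of consecutive steps". The class and parameter names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Sequence, Tuple


class AnatomyFSM:
    def __init__(self, legality: List[List[int]], start_state: int, theta_switch: float,
                 hold_steps: int, buffer_len: int = 8, decay: float = 0.8):
        self.T = legality
        self.state = start_state
        self.theta = theta_switch
        self.hold_steps = hold_steps
        self.decay = decay
        self.buffer = deque(maxlen=buffer_len)   # confidence accumulation buffer
        self.streak: Dict[int, int] = {}         # consecutive steps above threshold
        self.blocked: List[Tuple[int, int]] = [] # log of intercepted jumps

    def scores(self) -> List[float]:
        """Time-weighted average: newest frame has weight 1, older frames decay."""
        newest_first = list(reversed(self.buffer))
        weights = [self.decay ** age for age in range(len(newest_first))]
        total = sum(weights)
        n = len(newest_first[0])
        return [sum(w * p[k] for w, p in zip(weights, newest_first)) / total
                for k in range(n)]

    def step(self, probs: Sequence[float]) -> int:
        self.buffer.append(list(probs))
        s = self.scores()
        candidates = [j for j in range(len(s)) if j != self.state]
        for j in candidates:
            if s[j] > self.theta:                      # condition (i)
                self.streak[j] = self.streak.get(j, 0) + 1
            else:
                self.streak[j] = 0
        best = max(candidates, key=lambda j: s[j])
        if s[best] > self.theta and self.streak[best] >= self.hold_steps:  # (i) and (ii)
            if self.T[self.state][best] == 1:          # condition (iii)
                self.state = best
                self.streak.clear()
            else:
                self.blocked.append((self.state, best))  # classification jump noise
        return self.state
```

In this sketch an illegal candidate is logged and the current state is kept, mirroring the interception behaviour described above.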

[0085] Furthermore, in the intestinal region identification algorithm for the colonoscopy scenario, in addition to the aforementioned conventional anatomical intestinal segment categories, a bend region category is added to the anatomical location classification set to assist in intestinal region transition judgment and segmental steady-state identification. Specifically, the intestinal region classification set is expanded to define C = {cecum, ascending colon, transverse colon, bend region, descending colon, sigmoid colon}, where the bend region corresponds to the splenic and hepatic flexure segments of the colon.

[0086] The system uses temporal counting and interval constraints to perform steady-state determination of the classification category for each frame output. Let the cumulative count of category k be Nk(t), and the most recent frame number be τk. The cumulative count is incremented when the frame interval satisfies (t - τk) < Δ; otherwise, it is reset to zero and recounted. Here, Δ is a preset allowable interval threshold. Only when Nk(t) ≥ Nth is the category considered temporally valid, which is used to update the intestinal region state, thereby reducing the impact of instantaneous false detections on the segmentation results.
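The counting rule above can be sketched as follows. One interpretation is assumed: when the interval condition fails, the count restarts at 1 so that the current frame is included in the new count. The class and parameter names are hypothetical.

```python
from typing import Dict


class CategoryCounter:
    def __init__(self, max_gap: int, min_count: int):
        self.max_gap = max_gap            # allowable interval threshold (Δ)
        self.min_count = min_count        # validity threshold (Nth)
        self.count: Dict[str, int] = {}      # Nk(t)
        self.last_seen: Dict[str, int] = {}  # τk

    def observe(self, category: str, t: int) -> bool:
        """Record category k at frame t; return True when it is temporally valid."""
        last = self.last_seen.get(category)
        if last is not None and (t - last) < self.max_gap:
            self.count[category] += 1
        else:
            self.count[category] = 1  # reset and restart counting from this frame
        self.last_seen[category] = t
        return self.count[category] >= self.min_count


counter = CategoryCounter(max_gap=3, min_count=3)
print([counter.observe("cecum", t) for t in (0, 1, 2, 10)])
```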

[0087] To handle the bends and transitions from the ascending colon to the transverse colon and from the transverse colon to the descending colon, the algorithm introduces a bend-triggered gating mechanism. Let Nf(t) be the cumulative count of bend category, and tf be the reference bend time. When Nf(t) ≥ Nf_th and (t - tf) > T1, the segmented state is allowed to transition from the ascending colon to the transverse colon; when Nf(t) ≥ Nf_th and (t - tf) > T2, the state is allowed to transition from the transverse colon to the descending colon. Subsequently, combined with the condition of the sigmoid colon category's continued occurrence (NSigmoid(t) ≥ Ns_th and (t - tf) > T3), the state transition from the descending colon to the sigmoid colon is triggered. Through this combined mechanism of "classification result + bend category + time gating," stable identification and natural transition of intestinal segments are achieved, improving the segmentation robustness in complex endoscopic scenarios.

[0088] Optionally, the state transition constraint also includes the following enhancement mechanisms: Firstly, the system dynamically senses the advance and retreat direction of the endoscope by combining the lens motion vector calculated using optical flow, thereby dynamically adjusting the permissible transition direction of the finite state machine. If lens retreat is detected, switching to the distal anatomical site is prohibited. Specifically, the process of sensing the advance and retreat direction using optical flow is as follows: The system calculates a dense optical flow field for two consecutive video frames, obtaining a two-dimensional displacement vector (dx, dy) for each pixel. Since the central region of the image exhibits a radially outward optical flow diffusion pattern (similar to radiation from the center of a tunnel) when the endoscope advances within the lumen, and a radially inward optical flow convergence pattern when the endoscope retreats, the system decomposes the optical flow vector of each pixel into a radial component vr and a tangential component vt, using the geometric center of the frame as the origin. The mean value of the radial component vr of all foreground pixels in the entire frame is calculated to obtain the global radial optical flow score Vradial. 
When Vradial is positive and its absolute value is greater than the preset advance threshold θadvance, it is determined to be in the advance state, and the finite state machine only allows transfer towards the distal anatomical site; when Vradial is negative and its absolute value is greater than the preset retreat threshold θretreat, it is determined to be in the retreat state, and the finite state machine only allows transfer towards the proximal anatomical site and cuts off the transfer path in the distal direction; when the absolute value of Vradial is less than the above two thresholds, it is determined to be in the stationary or rotating state, and the finite state machine locks the current site state without switching.
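The radial decomposition and three-way decision can be sketched as below. The flow vectors are passed in as plain tuples; in practice they would come from a dense optical flow routine (for example, OpenCV's `calcOpticalFlowFarneback`), which is outside this sketch. Function names and threshold values are hypothetical.

```python
import math
from typing import Iterable, Tuple


def radial_flow_score(flow: Iterable[Tuple[float, float, float, float]],
                      width: int, height: int) -> float:
    """Mean radial component of (x, y, dx, dy) vectors about the frame centre."""
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    total, n = 0.0, 0
    for x, y, dx, dy in flow:
        rx, ry = x - cx, y - cy
        norm = math.hypot(rx, ry)
        if norm == 0.0:
            continue  # radial direction is undefined at the exact centre
        total += (dx * rx + dy * ry) / norm  # projection on the outward unit vector
        n += 1
    return total / n if n else 0.0


def motion_state(v_radial: float, theta_advance: float, theta_retreat: float) -> str:
    if v_radial > theta_advance:
        return "advance"
    if v_radial < -theta_retreat:
        return "retreat"
    return "stationary_or_rotating"


outward = [(60, 50, 1, 0), (40, 50, -1, 0), (50, 60, 0, 1), (50, 40, 0, -1)]
v = radial_flow_score(outward, width=101, height=101)
print(v, motion_state(v, theta_advance=0.5, theta_retreat=0.5))
```

A purely rotational flow field has no radial component, so it falls into the stationary-or-rotating branch.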

[0089] Furthermore, the optical flow direction determination result does not directly override the transition decision of the finite state machine, but indirectly affects the state transition by adjusting the gain coefficient applied to the cumulative score of the target part in the confidence accumulation buffer. Specifically, when the optical flow indicates the advance state, the cumulative score of the candidate part in the distal direction is multiplied by a direction gain coefficient αforward greater than 1 to accelerate the switching response in the legal direction, while the cumulative score of the candidate part in the proximal direction is multiplied by a direction attenuation coefficient αbackward less than 1 to suppress reverse misjudgment; the opposite applies in the retreat state. This design maps the physical motion information (optical flow) at the hardware level to the probability adjustment parameters (gain coefficients) at the algorithm level, realizing a direct linkage of "operation action → algorithm constraint". As a result, the state transition decision of the finite state machine relies not only on statistical inference from visual features but also incorporates causal information about the physical motion direction of the lens, thereby enhancing the robustness of part recognition.
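A minimal sketch of the gain adjustment is given below. It assumes that states are indexed along the insertion path, so that a larger index lies further in the advance direction; the gain values 1.2 and 0.8 are hypothetical.

```python
from typing import List, Sequence


def apply_direction_gain(scores: Sequence[float], current: int, motion: str,
                         gain_forward: float = 1.2,
                         gain_backward: float = 0.8) -> List[float]:
    """Scale cumulative scores of candidates according to the motion direction."""
    if motion not in ("advance", "retreat"):
        return list(scores)  # stationary or rotating: no adjustment
    adjusted = []
    for j, s in enumerate(scores):
        if j == current:
            adjusted.append(s)
        elif (j > current) == (motion == "advance"):
            adjusted.append(s * gain_forward)   # candidate lies in the motion direction
        else:
            adjusted.append(s * gain_backward)  # candidate lies against the motion
    return adjusted


print(apply_direction_gain([0.5, 0.5, 0.5], current=1, motion="advance"))
```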

[0090] It should be noted that the optical flow calculation is performed on edge computing devices using lightweight algorithms (such as the Farneback method), and its computational overhead is much smaller than that of inference in spatiotemporal graph convolutional networks, so it will not have a significant impact on the real-time performance of the system.

[0091] Secondly, when the highest confidence score of the output is lower than the preset low confidence threshold, or the difference between the two highest confidence scores is lower than the preset difference threshold, the current time step is marked as a transitional state at the boundary of anatomical sites. In this case, the system does not force the output of a specific site label, but instead marks it as a transitional zone, avoiding erroneous definitive assertions. Thirdly, when multiple consecutive frames are in a transitional state, the system backtracks the feature sequence along the historical timeline for local smoothing filtering, and maintains the original anatomical site state until a preset anatomical marker for the next anatomical site is detected and the confidence score is greater than the keyframe threshold. This "keyframe anchoring" mechanism ensures that site switching is only confirmed when a clear anatomical marker appears, resulting in a smooth and natural recognition curve in the transitional zone, significantly reducing classification jump noise compared to traditional frame-by-frame classification methods.

[0092] To more accurately define the determination mechanism of the transition state, this invention introduces a quantitative index based on the uncertainty of the feature distribution. Specifically, the probability distribution vector P = [p1, p2, ..., pK] (K is the total number of anatomical part categories) output by the last layer of the spatiotemporal graph convolutional network can be regarded as a discrete probability distribution, and its information entropy H is calculated as follows: H = -Σ_{i=1}^{K} p_i · log(p_i). When the information entropy H is greater than the preset high entropy threshold Hhigh, it indicates that the visual features of the current frame are highly ambiguous among multiple anatomical sites, and the system marks this frame as a high-uncertainty transition frame. At the same time, the system calculates the first-order difference of the information entropy sequence of N consecutive frames in the time dimension of the confidence accumulation buffer. If the differences show that the entropy follows an inverted U-shaped trend, rising from low to high and then falling back to low, a complete site transition event is determined, and the peak position corresponds to the center time point of the transition zone.
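The entropy index and the inverted U-shaped check can be sketched as below. The sketch assumes the standard Shannon entropy with the natural logarithm and treats a sequence as a transition event when the first-order differences are non-negative before a single interior peak and non-positive after it; this strict monotonic test is a simplifying assumption.

```python
import math
from typing import List, Optional, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy H = -sum(p * log p), skipping zero probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def is_transition_frame(probs: Sequence[float], h_high: float) -> bool:
    return entropy(probs) > h_high


def transition_peak(entropies: List[float]) -> Optional[int]:
    """Index of the entropy peak if the sequence rises then falls, else None."""
    if len(entropies) < 3:
        return None
    diffs = [b - a for a, b in zip(entropies, entropies[1:])]  # first-order difference
    peak = max(range(len(entropies)), key=lambda i: entropies[i])
    if peak == 0 or peak == len(entropies) - 1:
        return None  # no rise-then-fall shape
    rising = all(d >= 0 for d in diffs[:peak])
    falling = all(d <= 0 for d in diffs[peak:])
    return peak if rising and falling else None


print(entropy([0.5, 0.5]), transition_peak([0.1, 0.4, 0.9, 0.5, 0.2]))
```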

[0093] Upon entering the transition state, the finite state machine freezes its current state, performs no state transitions, and keeps the output site label at the last site label determined before entering the transition state. Simultaneously, the system initiates keyframe anchoring detection. For each anatomical segment, characteristic anatomical landmarks are predefined; these landmarks are highly recognizable anatomical structural features. For colonoscopy scenarios, these include, but are not limited to, the ileocecal valve (a landmark of the cecum), the triangular folds of the hepatic flexure, and the acute-angled bend of the splenic flexure; for gastroscopy scenarios, they include, but are not limited to, the dentate line of the cardia and the circular sphincter of the pylorus. The system performs landmark detection on each frame in the transition state using a pre-trained lightweight landmark detection sub-network. When the sub-network's confidence in detecting a landmark is greater than the preset keyframe threshold θanchor for a preset number of consecutive frames, the system confirms the frame as an "anchor keyframe". The finite state machine is then unfrozen and a transition from the current state to the next anatomical site state is executed (provided that the transition is valid in the adjacency matrix). The timestamp of the anchor keyframe is used as the starting timestamp of the new site segment.

[0094] If no anchor keyframe is detected after the transition state has been in progress for more than the preset maximum number of transition frames Nmax_transition, the system exits the transition mode and reverts to using the accumulated score in the confidence accumulation buffer for regular state transition determination. This revert mechanism avoids the system being permanently locked in the transition state due to missed detections by the marker detection model, ensuring the system's availability in extreme cases.
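
The anchoring and fallback logic of the two paragraphs above can be sketched as a small per-frame tracker. This is an illustrative sketch only: the class name, method name and return strings are hypothetical, and the defaults are the exemplary values given later in this description (θanchor = 0.85, 3 consecutive frames, Nmax_transition = 64).

```python
from dataclasses import dataclass


@dataclass
class TransitionTracker:
    theta_anchor: float = 0.85        # keyframe threshold
    confirm_frames: int = 3           # consecutive hits needed for an anchor
    max_transition_frames: int = 64   # Nmax_transition
    frames_in_transition: int = 0
    consecutive_hits: int = 0

    def step(self, landmark_conf: float) -> str:
        """Process one transition-state frame.

        Returns 'anchor' (unfreeze and switch site), 'fallback' (revert to
        the confidence accumulation buffer) or 'hold' (stay frozen).
        """
        self.frames_in_transition += 1
        if landmark_conf > self.theta_anchor:
            self.consecutive_hits += 1
        else:
            self.consecutive_hits = 0
        if self.consecutive_hits >= self.confirm_frames:
            self._reset()
            return "anchor"
        if self.frames_in_transition > self.max_transition_frames:
            self._reset()
            return "fallback"
        return "hold"

    def _reset(self) -> None:
        self.frames_in_transition = 0
        self.consecutive_hits = 0
```

The caller would still have to check the adjacency matrix before applying an 'anchor' result; that check is omitted here.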

[0095] Through the cascading cooperation of the four-layer mechanism—information entropy quantification, inverted U-shaped trend detection, keyframe anchoring, and maximum transition frame count backoff—the system's recognition curve in the transition region exhibits a smooth and natural transition pattern of "plateau-gradient-plateau," rather than the "sawtooth" high-frequency jumps produced by traditional frame-by-frame classification methods. Combining the aforementioned spatiotemporal graph convolutional network multi-frame feature aggregation (first layer) and finite state machine anatomical topological constraints (second layer), the above-mentioned transition state management mechanism serves as the third layer of defense. These three mechanisms work synergistically at different levels of abstraction to jointly ensure the temporal robustness of the part recognition results.

[0096] For example, the training data comes from several hundred complete colonoscopy videos collected by the endoscopy center of a partner hospital between 2020 and 2024. Anatomical locations were annotated by qualified gastroenterologists (associate chief physicians or above) using second-level timestamps. For transitional areas, soft-labeling was used to reflect realistic anatomical changes. Training employed cross-entropy loss, with KL divergence loss used for transitional areas. The optimizer was Adam, with an initial learning rate of 0.001. Data augmentation strategies included random timeline cropping and random frame dropping to simulate video stuttering and rapid endoscope movements.

[0097] The following explains message middleware and multi-agent collaborative communication.

[0098] See Figure 1. The medical description text is denoted as S4, and the final generated medical examination report is denoted as S5. Asynchronous communication and collaborative scheduling between the agent modules are achieved through a message middleware. Multiple asynchronous communication queues are defined in the message middleware. The frame queue is used to distribute the preprocessed video frame sequence to the first and second agent modules running in parallel. The detection result queue and the location result queue are used to receive the target detection results output by the first agent module and the location segmentation results output by the second agent module, respectively. The description queue is triggered under the following condition: when the local server has received both the target detection result and the location segmentation result corresponding to the same frame, and the target detection result contains a lesion target, the fused data is sent to the description queue to drive the third agent module to generate a medical description. This event-driven triggering mechanism avoids unnecessary large language model inference on frames without lesions, saving server-side computing resources. The report queue is triggered after all video frames have been processed, and is used to package all results and drive the fourth agent module to generate the final report.
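
The trigger condition for the description queue can be sketched as below. The class, its method names and the dictionary keys are hypothetical, and an in-process `deque` stands in for the message middleware, which the description does not name.

```python
from collections import deque


class DescriptionTrigger:
    """Fuses per-frame results and enqueues only frames that contain lesions."""

    def __init__(self):
        self.pending_detections = {}   # frame_id -> list of lesions
        self.pending_sites = {}        # frame_id -> site label
        self.description_queue = deque()

    def on_detection(self, frame_id, lesions):
        self.pending_detections[frame_id] = lesions
        self._try_fuse(frame_id)

    def on_site(self, frame_id, site):
        self.pending_sites[frame_id] = site
        self._try_fuse(frame_id)

    def _try_fuse(self, frame_id):
        # Wait until both results for the same frame have arrived.
        if frame_id in self.pending_detections and frame_id in self.pending_sites:
            lesions = self.pending_detections.pop(frame_id)
            site = self.pending_sites.pop(frame_id)
            if lesions:  # frames without lesions never reach the language model
                self.description_queue.append(
                    {"frame_id": frame_id, "site": site, "lesions": lesions}
                )
```

Because the two results may arrive in either order, both handlers attempt the fusion.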

[0099] The edge computing device encapsulates the target detection results and site segmentation results output by the first and second agent modules into structured metadata (JSON format), and sends it to the local server via the internal network through a message middleware. It should be noted that only structured text metadata is transmitted during this process; the original video frame image data is not transmitted. The original video stream remains in the local storage of the edge computing device, thus ensuring data processing efficiency while eliminating the risk of patient image data leakage at the physical architecture level. This bandwidth-optimized design, which only transmits structured metadata, allows the system to operate smoothly on a standard gigabit LAN, without requiring expensive 10-gigabit network infrastructure.
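
As an illustration of what such structured metadata might look like, the sketch below serializes one frame's results to JSON. The field names are assumptions chosen to mirror the fields this description lists for the detection output (lesion category, bounding box coordinates, confidence score, estimated physical size); the actual schema is not specified here.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class Lesion:
    category: str
    bbox: list          # [x_min, y_min, x_max, y_max] in pixels (assumed layout)
    confidence: float
    size_mm: float


@dataclass
class FrameMetadata:
    timestamp_ms: int
    site: str
    lesions: list = field(default_factory=list)


def encode(meta: FrameMetadata) -> str:
    """Serialize metadata only; no image data is part of the message."""
    return json.dumps(asdict(meta), ensure_ascii=False)
```

A message of this kind is a few hundred bytes per frame, which is consistent with the bandwidth argument made above.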

[0100] The third agent module (description generation) is described below.

[0101] The third agent module runs on a local server and is responsible for generating medically compliant textual descriptions of detected lesions based on the outputs of the first and second agent modules. This module employs a locally deployed, privately owned large language model. Without updating network parameters, it completes inference through retrieval-enhanced generation techniques combined with contextual learning, eliminating the need for traditional model parameter fine-tuning. The technical considerations for choosing this fine-tuning-free approach are to avoid the risks of overfitting and high computational costs associated with fine-tuning, while fully leveraging the extensive pre-training results of the base model on general medical knowledge. The large language model is deployed entirely locally and privately on the server side, without relying on external network APIs, further ensuring data privacy.

[0102] Specifically, the description generation process includes the following steps: First, image feature descriptions and diagnostic conclusions from historical medical examination reports reviewed and approved by senior physicians are converted into feature vectors and stored in a vector database. During the inference phase, using the lesion target attributes and anatomical location features contained in the current structured metadata as query conditions, several historical cases most similar to the current case are retrieved from the vector database as examples of prompt words for few-shot inference. Then, the role setting instructions, the current structured metadata, and the retrieved prompt word examples are assembled into a structured prompt word template. Finally, this prompt word template is input into a large language model for context learning inference, outputting the medical description text. Key configurations for the inference phase include: a low temperature coefficient to suppress randomness and ensure the rigor and consistency of the medical description; an appropriate maximum generation length to cover the standard report length; and a repetition penalty coefficient to prevent statement loops.
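
The retrieval and prompt assembly steps can be sketched as follows. This is a toy illustration under several assumptions: cosine similarity over in-memory vectors stands in for the vector database, the prompt layout and dictionary keys are hypothetical, and the numeric values in `GENERATION_CONFIG` are placeholders, since the description only calls for a low temperature, an appropriate maximum length and a repetition penalty.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, cases, k=2):
    """Return the k historical cases most similar to the query vector."""
    ranked = sorted(cases, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return ranked[:k]


def build_prompt(role, metadata, examples):
    """Assemble role instruction, few-shot examples and the current case."""
    parts = [role, "Examples:"]
    for ex in examples:
        parts.append(f"- Findings: {ex['findings']} | Conclusion: {ex['conclusion']}")
    parts.append(f"Current case: {metadata}")
    return "\n".join(parts)


# Placeholder values only; the description does not give concrete numbers.
GENERATION_CONFIG = {"temperature": 0.1, "max_new_tokens": 256, "repetition_penalty": 1.1}
```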

[0103] Optionally, the third agent module can also perform knowledge graph verification after outputting the medical description text. A knowledge graph for the digestive endoscopy domain is pre-constructed, containing rules governing the relationships between lesion types, anatomical locations, and treatment methods. After entity extraction from the medical description text, the extracted entities are checked for consistency with the relationship rules in the knowledge graph. If the verification finds that necessary related elements are missing from the description text, the missing related elements are added to the prompt word template, triggering the large language model to regenerate the medical description text, with the number of retries not exceeding a preset limit.

[0104] Specifically, the knowledge graph in the field of digestive endoscopy includes the following core relational rules, which are derived from clinical guidelines for digestive endoscopy and expert consensus from senior physicians in collaborating hospitals: Rule R-01: When a polyp is detected and its maximum diameter is less than 0.5 cm, the recommended treatment is cold biopsy removal or regular follow-up observation. Rule R-02: When a polyp is detected and its maximum diameter is between 0.5 cm and 1.0 cm, the associated treatment is to recommend endoscopic mucosal resection; Rule R-03: When a polyp is detected and its maximum diameter is greater than 1.0 cm, the associated management approach is to recommend endoscopic submucosal dissection or surgical consultation for evaluation; Rule R-04: When a suspected tumor or space-occupying lesion is detected, the associated management procedure is to perform a biopsy for pathological examination and recommend further evaluation with enhanced abdominal CT. Rule R-05: When active bleeding is detected, the associated management approach is to recommend immediate endoscopic hemostasis. Rule R-06: When an ulcer is detected, the associated management approach is to recommend multiple biopsies to rule out inflammatory bowel disease, tuberculosis, or tumors, and to administer acid-suppressing or anti-inflammatory treatment. Rule R-07: When a diverticulum is detected and there are no signs of inflammation, the associated management approach is no special surgical treatment, and a high-fiber diet is recommended; Rule R-08: When diverticulitis is detected, the associated management is to recommend anti-infective treatment and to contraindicate colonoscopy until the inflammation subsides. Rule R-09: When no abnormal lesions are found, the associated management approach is to recommend determining the follow-up examination cycle based on the patient's age and family history risk stratification.

[0105] The above rules are stored in the knowledge graph in the form of triples of "lesion type + attribute condition → treatment method". After the third agent module extracts entities from the generated medical description text, it performs consistency verification between the extracted lesion type and attribute and the associated treatment method of the corresponding rule in the knowledge graph.
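
A lookup over rules R-01 to R-09 can be sketched as below. The lesion type strings and the function name are hypothetical. The rules as stated do not say which rule applies at exactly 0.5 cm or 1.0 cm; this sketch assumes both boundary values fall under R-02.

```python
# Rules keyed by lesion type for the cases that need no size attribute.
SIMPLE_RULES = {
    "tumor": "R-04",          # suspected tumor or space-occupying lesion
    "bleeding": "R-05",       # active bleeding
    "ulcer": "R-06",
    "diverticulum": "R-07",   # no signs of inflammation
    "diverticulitis": "R-08",
}


def match_rule(lesion_type=None, diameter_cm=None):
    """Return the identifier of the rule triggered by one lesion.

    Passing lesion_type=None means no abnormal lesion was found (R-09).
    """
    if lesion_type is None:
        return "R-09"
    if lesion_type == "polyp":
        if diameter_cm is None:
            raise ValueError("polyp rules require a diameter")
        if diameter_cm < 0.5:
            return "R-01"
        if diameter_cm <= 1.0:   # assumption: boundaries belong to R-02
            return "R-02"
        return "R-03"
    if lesion_type in SIMPLE_RULES:
        return SIMPLE_RULES[lesion_type]
    raise KeyError(f"no rule for lesion type: {lesion_type}")
```

For verification, the treatment extracted from the generated text would be compared against the treatment attached to the returned rule.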

[0106] The fourth agent module (report generation) is described below.

[0107] The fourth agent module also runs on the local server and starts after all video frames have been processed. It is responsible for integrating the output data from all agent modules to generate a structured medical examination report. The primary task of this module is to perform spatiotemporal alignment matching, based on timestamps, between the target detection results output by the first agent module and the site segmentation results output by the second agent module, thus binding each lesion target to the corresponding anatomical site. Specifically, for a lesion target detected by the first agent module at a certain target timestamp, the site segmentation sequence output by the second agent module is searched for candidate anatomical sites whose start and end times differ from the target timestamp by an absolute value less than a preset time threshold. Because the system has a high frame rate, the first and second agent modules process the same frame essentially synchronously, with minimal timestamp deviation, which ensures sufficient accuracy in spatiotemporal alignment matching. If multiple candidate anatomical sites exist at the boundary of anatomical sites, the candidate anatomical site with the highest cumulative confidence is selected for spatiotemporal binding.
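
The binding step can be sketched as follows. The segment keys (`start_ms`, `end_ms`, `cum_conf`, `site`) and the function name are hypothetical, the 500 ms default is the exemplary threshold given later in this description, and the reading of the time-difference condition (a timestamp inside a segment counts as a zero gap, otherwise the gap to the nearer boundary is used) is an assumption.

```python
def bind_site(lesion_ts_ms, segments, tol_ms=500):
    """Bind a lesion timestamp to an anatomical site segment.

    Returns the site label, or None when no segment lies within tolerance.
    """
    candidates = []
    for seg in segments:
        if seg["start_ms"] <= lesion_ts_ms <= seg["end_ms"]:
            gap = 0
        else:
            gap = min(abs(lesion_ts_ms - seg["start_ms"]),
                      abs(lesion_ts_ms - seg["end_ms"]))
        if gap < tol_ms:
            candidates.append(seg)
    if not candidates:
        return None
    # At a site boundary several segments may qualify; keep the most confident.
    return max(candidates, key=lambda s: s["cum_conf"])["site"]
```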

[0108] After completing spatiotemporal alignment, the fourth agent module uses a large language model combined with chain-of-thought reasoning to generate a four-segment medical examination report. The preoperative preparation segment assesses intestinal cleanliness based on the water quality detection model's recognition results during frame quality screening—this model can distinguish between clear images, water or splash interference, and blurred images, automatically generating cleanliness assessment text accordingly. The operation process segment sorts the timestamps of each record in the site-segmented results and describes the endoscope's insertion and withdrawal paths in chronological order. The observation results segment, based on the spatiotemporal alignment matching binding results, groups by anatomical site and links the detection results of the lesion targets corresponding to each anatomical site with the medical description text generated by the third agent module, achieving a three-layer linkage between lesion, site, and description. The recommended measures segment, based on a preset rule engine, automatically matches corresponding clinical recommended measures according to the lesion target's category and physical size.

[0109] The clinical recommendation rules included in the rule engine are consistent with the association rules (rules R-01 to R-09) defined in the digestive endoscopy domain knowledge graph in the aforementioned third agent module. Specifically, the rule engine matches the trigger conditions of the above rules sequentially based on the category and physical size of the lesion targets bound to each anatomical site after spatiotemporal alignment matching, and automatically fills the recommendation measures section of the medical examination report with the clinical recommendation measures that meet the conditions. When there are multiple lesion targets in the same anatomical site, each lesion target matches the rules independently, and the recommendation measures are arranged in descending order of lesion severity. When no lesion targets are detected, the matching rule R-09 outputs a routine follow-up recommendation.

[0110] The following explains the host computer visualization and closed-loop feedback.

[0111] The host computer uses the Qt framework to build a human-computer interaction interface, and receives structured metadata and medical examination reports pushed by the local server through a WebSocket long connection for visualization rendering and display.

[0112] Regarding the multi-color segmented progress bar, based on the site segmentation results in the structured metadata, preset color combinations corresponding to different anatomical sites are rendered on the video playback progress bar in the interface according to time intervals, generating a multi-color segmented progress bar. Each color corresponds to an anatomical site, and users can jump to the corresponding video frame by clicking anywhere on the multi-color segmented progress bar, making it easy for doctors to quickly locate the target intestinal segment. The correspondence between colors and sites uses a preset color scheme by default, but user-defined adjustments are also supported.

[0113] Regarding the abnormal event peak diagram, below the corresponding timeline position of the multi-color segmented progress bar, an abnormal event peak diagram is overlaid and rendered with video time as the horizontal axis and the confidence score of the lesion target output by the first agent module as the vertical axis. The peak data comes from the cumulative number of lesion occurrences in multiple frames before and after. The system smooths and gradually attenuates the peak amplitude along the time axis, so that the peak diagram presents a smooth and natural envelope shape, making it easier for doctors to quickly locate high-confidence abnormal intervals by using the peak height.

[0114] Regarding label correction and closed-loop control, doctors can correct the detection labels of the first agent module in the host computer interface. Upon receiving the user's label correction operation on the interface, the host computer transmits the correction command back to the local server via the network, triggering the third and fourth agent modules to regenerate the corresponding medical description text and medical examination report based on the corrected labels. This closed-loop control mechanism ensures that the system does not merely output analysis results unidirectionally, but constitutes a complete closed loop of perception, cognition, human feedback, and adaptive reconstruction, guaranteeing that the final report accurately reflects the professional judgment of clinicians. Furthermore, the host computer also supports one-click export, allowing export of keyframe images with overlaid annotation boxes and category labels, as well as GIF animations within a user-selected time range, providing a convenient data processing tool for clinical research.

[0115] Example 2. See Figure 1 and Figure 2. This embodiment provides an intelligent analysis method for endoscopic videos based on multi-agent collaboration, applied to the system described in Example 1. The method includes the following steps: Step 100, Video Preprocessing Step. The edge computing device acquires the endoscope video stream through the acquisition card, performs frame extraction at a preset frequency followed by frame quality filtering, and obtains a video frame sequence. In this step, the acquisition card digitizes the video signal output by the endoscope and transmits it to the edge computing device. The system extracts frames at a fixed frame rate matching the physical insertion speed of the endoscope, and sequentially performs highlight suppression preprocessing, low-light enhancement preprocessing, and four levels of frame quality filtering (scene-level, quality-level, logical-level, and statistical-level) on each frame. Finally, it outputs a standardized video frame sequence that has passed all filtering. The specific preprocessing method and filtering mechanism have been described in detail in Example 1 and will not be repeated here.

[0116] Step 200, Parallel Recognition Step. The first and second agent modules on the edge computing device process the video frame sequence in parallel, with no data dependency between them.

[0117] Step 210: The first agent module performs forward inference on the current frame. A model in which object detection and instance segmentation share a backbone network is used to extract the segmentation mask of the lesion target. For each detected lesion target, the total number of foreground pixels in the segmentation mask is counted, and the pixel-level equivalent diameter is calculated from it using the pixel-level equivalent diameter formula. The real-time distance from the lens to the mucosa and the dynamic scaling factor based on camera intrinsic parameters are then obtained to map the pixel-level equivalent diameter to the physical size. After the size is estimated across multiple frames within a sliding window on the time axis and outliers are removed, the median within a preset confidence interval is taken as the final physical size. The output is timestamped target detection results containing fields such as lesion category, bounding box coordinates, confidence score, and estimated physical size.
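
The size estimation chain of Step 210 can be sketched as below, under stated assumptions. The equivalent diameter is taken as the diameter of a circle with the same area as the mask. The scaling factor is modeled with a simple pinhole camera (millimetres per pixel = distance / focal length in pixels), which is only one possible form of the dynamic scaling factor. The 95% confidence interval is interpreted as mean ± 1.96 standard deviations. All function names are hypothetical.

```python
import math
import statistics


def equivalent_diameter_px(foreground_pixels):
    """Diameter of the circle whose area equals the mask's pixel count."""
    return 2.0 * math.sqrt(foreground_pixels / math.pi)


def scale_mm_per_px(distance_mm, focal_px):
    """Pinhole-model scale at the given lens-to-mucosa distance (assumption)."""
    return distance_mm / focal_px


def robust_size(sizes_mm, z=1.96):
    """Median of the per-frame sizes that fall inside mean ± z * stdev."""
    if len(sizes_mm) < 3:
        return statistics.median(sizes_mm)
    mean = statistics.fmean(sizes_mm)
    sd = statistics.stdev(sizes_mm)
    kept = [s for s in sizes_mm if abs(s - mean) <= z * sd] or sizes_mm
    return statistics.median(kept)
```

For one frame, the physical size would be `equivalent_diameter_px(n) * scale_mm_per_px(z, f)`; `robust_size` is then applied to the values collected over the sliding window.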

[0118] Step 220: The second agent module constructs a dynamic spatiotemporal graph for multiple consecutive frames. A single video frame is divided into multiple spatial regions as spatial nodes. Feature vectors of each spatial node are extracted, and spatial edges are established between physically adjacent spatial nodes. Spatial nodes at the same spatial location in multiple consecutive frames are connected along the time axis to establish temporal edges, thus constructing the dynamic spatiotemporal graph. After spatiotemporal feature aggregation through multiple spatiotemporal graph convolutional layers, the probability distribution over the anatomical site categories is output. The probability distributions of multiple consecutive frames are stored in a confidence accumulation buffer and summed with time weighting. When the accumulated score exceeds a preset switching threshold, a state transition candidate is triggered. The anatomical topology finite state machine determines whether the transition is allowed based on the anatomical topology constraint table. Switching is performed only if the candidate site and the current site are legally adjacent topology nodes. The timestamped site segmentation results are output.
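
The confidence accumulation buffer and the topology check of Step 220 can be sketched as follows. The graph convolution itself is out of scope here; the sketch starts from per-frame probability distributions. The three-site adjacency table, the exponential decay weighting and the normalization of the weighted sum are all assumptions made for illustration, and the defaults for window length and switching threshold are the exemplary values given later in this description.

```python
from collections import deque

# Hypothetical fragment of an anatomical topology constraint table.
ADJACENT = {
    "rectum": {"sigmoid"},
    "sigmoid": {"rectum", "descending"},
    "descending": {"sigmoid"},
}


class SiteStateMachine:
    def __init__(self, start, window=16, switch_threshold=0.7, decay=0.9):
        self.state = start
        self.buffer = deque(maxlen=window)  # confidence accumulation buffer
        self.switch_threshold = switch_threshold
        self.decay = decay

    def update(self, probs):
        """Add one frame's {site: probability} dict and return the current site."""
        self.buffer.append(probs)
        n = len(self.buffer)
        # Newer frames weigh more; weights are normalized to sum to 1.
        weights = [self.decay ** (n - 1 - i) for i in range(n)]
        total = sum(weights)
        scores = {}
        for w, frame in zip(weights, self.buffer):
            for site, p in frame.items():
                scores[site] = scores.get(site, 0.0) + w * p / total
        candidate = max(scores, key=scores.get)
        if (candidate != self.state
                and scores[candidate] > self.switch_threshold
                and candidate in ADJACENT.get(self.state, set())):
            self.state = candidate
        return self.state
```

A confident prediction for a non-adjacent site is ignored, which is how the topology constraint suppresses anatomically impossible jumps.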

[0119] Step 300, Metadata Data Transmission Step. The edge computing device encapsulates the target detection results and part segmentation results output from steps 210 and 220 into structured metadata in JSON format and sends it to the local server via a message middleware. The original video stream remains in the local storage of the edge computing device and is not transmitted over the network.

[0120] Step 400, Description Generation Step. The third agent module on the local server uses the structured metadata as the query condition to retrieve similar cases from a pre-built vector database as prompt examples. It then assembles the role setting instructions, the current structured metadata, and the retrieved prompt examples into a structured prompt template, which is input into a locally deployed large language model for in-context learning inference to generate the medical description text. Optionally, the generated medical description text is finalized only after passing knowledge graph verification.

[0121] Step 500, Report Generation Step. After all video frames have been processed, the fourth agent module on the local server starts. This module first performs spatiotemporal alignment matching between the target detection results and the site segmentation results based on timestamps, binding the lesion target to the corresponding anatomical site. Then, it integrates structured metadata and medical description text according to a four-segment structure to generate a structured medical examination report. The preoperative preparation segment assesses intestinal cleanliness based on water quality testing model results, the operation process segment describes the insertion and withdrawal path of the endoscope in timestamp order, the observation results segment realizes a three-layer linkage presentation of lesion-site-description, and the recommended measures segment automatically matches clinical recommendations through a rule engine.

[0122] Step 600: Visualization and Closed-Loop Feedback. The host computer receives structured metadata and the medical examination report, and performs visualization rendering, including rendering of multi-color segmented progress bars and abnormal event peak graphs. After reviewing the results, if the doctor finds any deviations in the automatic labeling, they can perform correction operations on the interface. In response to the user's label correction operation, the host computer sends the correction instruction back to the local server, triggering the third and fourth agent modules to re-execute steps 400 and 500, generating updated medical description text and the medical examination report, completing the closed-loop feedback control.

[0123] In summary, this invention achieves intelligent end-to-end video analysis and standardized report generation through a layered deployment architecture of edge computing devices and local servers, collaborative division of labor among four functional agent modules, joint site recognition using spatiotemporal graph convolutional networks and finite state machines, high-precision size estimation based on segmentation masks and dynamic distance compensation, and large language model description generation driven by retrieval-enhanced generation. The multi-color segmented progress bar, abnormal event peak graph, and label correction closed-loop control functions provided by the host computer further improve clinical efficiency and research data processing efficiency.

[0124] The following describes exemplary values for the key parameters.

[0125] Furthermore, to facilitate implementation by those skilled in the art, exemplary value ranges for the main preset parameters in this invention are given below. It should be noted that the following values are merely illustrative, and those skilled in the art can adjust them according to the specific type of endoscope, clinical scenario, and hardware configuration.

[0126] Regarding the frame rate, in a colonoscopy scenario, it is typically set to 2 to 4 frames per second.

[0127] For example, the high brightness threshold in frame quality filtering is set to a pixel brightness value of 240 (8-bit grayscale range 0-255); the low brightness threshold is set to a pixel brightness value of 40.

[0128] Regarding the two-level confidence thresholds in the first agent module, the real-time detection threshold is set to 0.3 for example, and the high-quality screening threshold is set to 0.6 for example.

[0129] Regarding the confidence accumulation buffer in the second agent module, the time window length is set to, for example, 16 to 32 frames; the switching threshold is set to, for example, 0.7; and the number of time steps over which the trend must persist is set to, for example, 5 to 8 time steps.

[0130] Regarding the determination of transition states, the low confidence threshold is set to, for example, 0.5; the difference threshold is set to, for example, 0.15; and the high entropy threshold Hhigh is set to, for example, 1.5 (natural logarithm base). The keyframe anchoring threshold θanchor is set to, for example, 0.85, and the number of consecutive frames required for anchor confirmation is set to, for example, 3 frames. The maximum number of transition frames Nmax_transition is set to, for example, 64 frames.

[0131] Regarding optical flow insertion/withdrawal detection, the insertion detection threshold and the withdrawal detection threshold are each set, for example, such that the absolute value of the global radial optical flow score is greater than 0.5 pixels/frame. The directional gain coefficient is set to, for example, 1.2, and the directional attenuation coefficient is set to, for example, 0.6.

[0132] The preset time threshold for spatiotemporal alignment matching in the fourth agent module is set to, for example, 500 milliseconds.

[0133] Regarding the preset length of the sliding window, the sliding window length used for statistical filtering of lesion physical size is exemplarily set to 10 to 20 frames, with a preset confidence interval of 95%.

[0134] Furthermore, the SLAM (Simultaneous Localization and Mapping) module refers to a computational module that calculates the pose changes of a camera (endoscope lens) in three-dimensional space in real time and constructs an environmental geometric model by performing feature matching and motion estimation on consecutive video frames. In this application, the SLAM module runs on an edge computing device and is used to calculate the inter-frame displacement based on visual feature matching between consecutive frames, serving as an auxiliary criterion for the bending-triggered gating mechanism.

[0135] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

[0136] The technical effects are explained below.

[0137] The endoscopic video intelligent analysis system and method based on multi-agent collaboration provided in the above embodiments have produced significant technical effects in multiple dimensions through their overall architecture design and the collaborative cooperation between various technical modules. The technical effects obtained are described below from the aspects of system architecture, each agent module, and their collaborative mechanism.

[0138] At the overall system architecture level, the above embodiment deploys the first and second agent modules on the edge computing device side for parallel execution, and deploys the third and fourth agent modules on the local server side for sequential execution. The two ends communicate via a message middleware, transmitting only structured metadata, not the raw video stream. This layered deployment architecture resolves the inherent contradiction in computing power requirements between high-frequency real-time visual detection and large language model semantic generation: the edge computing device undertakes detection tasks sensitive to inference latency, while the local server undertakes generation tasks that are computationally demanding but latency-insensitive. This allows two types of tasks, inherently difficult to handle simultaneously on a single device, to be executed efficiently and without interference. Simultaneously, the bandwidth optimization design for transmitting only structured metadata enables the system to run smoothly in a typical local area network environment, while the design of keeping the raw video stream locally on the edge computing device ensures the privacy and security of patient image data from a physical architecture perspective. At the multi-agent collaborative communication level, the frame queue, detection result queue, part result queue, description queue, and report queue defined in the message middleware implement asynchronous communication and collaborative scheduling between agent modules in an event-driven manner. The description queue only triggers the third agent module when the target detection result corresponding to the same frame contains a lesion target, avoiding unnecessary large language model inference caused by non-lesion frames, and effectively saving server-side computing resources.

[0139] At the video preprocessing level, the above embodiments set the frame extraction frequency to a fixed frame rate that matches the physical advancement speed of the endoscope. This ensures that the physical displacement of the lens between adjacent frames is much smaller than the diameter of the small lesions of clinical concern at this frame rate, thus achieving a balance between lesion capture rate and computational resource load. Highlight suppression and low-light enhancement preprocessing performed on the extracted frames address, respectively, the specular reflection of the endoscope light source on the moist mucosal surface and the difficulty of identifying small lesions in shadow areas at intestinal bends. The four-level frame quality screening mechanism progressively removes out-of-body frames, black screen frames, blurred frames, water quality interference frames, still frames, logically abnormal frames, and size abnormal frames, moving from the scene level through the quality level and logical level to the statistical level. This ensures that subsequent agent modules only process high-quality frame sequences, reducing the waste of GPU resources on invalid computation and avoiding interference from low-quality frames with detection and recognition accuracy.

[0140] At the first agent module level, the object detection branch and the instance segmentation branch share the same feature extraction backbone network. The two tasks therefore do not need to run separate feature extractors, which reduces GPU memory usage at runtime and enables efficient operation on edge computing devices with limited hardware resources. During the training phase, the gradient of the instance segmentation mask loss is backpropagated to the shared backbone network, forming a "segmentation-driven detection" gradient propagation mechanism: the segmentation task requires the network to make accurate judgments on lesion edge pixels, forcing the backbone network to learn more refined pixel-level feature representations. These refined features, in turn, enhance the object detection branch's ability to perceive small, low-contrast lesions, thereby improving the recall rate for small lesions. The two-level confidence threshold setting serves two purposes: a lower real-time detection threshold ensures that doctors see all suspected targets during operation, while a higher high-quality screening threshold ensures that only high-confidence detection results enter the physical size calculation and the final report statistics. This balances real-time auxiliary observation against the reliability of report data.
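The two-level confidence threshold can be sketched as a simple split of the detection list. The threshold values and the function name `split_detections` are assumptions chosen for illustration only; the embodiment does not disclose concrete values here.

```python
REALTIME_THRESHOLD = 0.3      # assumed value: low, so all suspected targets are displayed
HIGH_QUALITY_THRESHOLD = 0.7  # assumed value: high, gates size calculation and report statistics


def split_detections(detections):
    """Return (display_list, report_list) for a list of dicts with a 'confidence' key."""
    display = [d for d in detections if d["confidence"] >= REALTIME_THRESHOLD]
    report = [d for d in display if d["confidence"] >= HIGH_QUALITY_THRESHOLD]
    return display, report
```

Every detection used for the report is also displayed, but not the reverse, which matches the intent that the display list is the more permissive of the two.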

[0141] At the level of lesion physical size estimation, the above embodiments calculate the pixel-level equivalent diameter from the total number of foreground pixels of the instance segmentation mask. Compared with a coarse conversion based on the area of the rectangular detection box, this follows the contour of irregular lesions and eliminates the influence of the large amount of background inside the rectangular box on the area estimate. On this basis, a dynamic scaling factor K(Z), based on the camera intrinsic parameters and the real-time distance Z from the lens to the mucosa, is introduced to map pixels to physical size, overcoming the inability of a fixed scaling factor to adapt to dynamic changes in the focal length and observation distance of the endoscope lens. The temporal statistical filtering mechanism removes outliers from the estimated sizes of multiple consecutive frames within a sliding window and takes the median within a preset confidence interval, which further suppresses the measurement error caused by single-frame jitter and tilted viewing angles and makes the final output physical size clinically reliable. These three stages (precise area extraction from the segmentation mask, dynamic distance compensation, and temporal statistical filtering) build on one another to form a complete accuracy assurance chain from pixel-level measurement to physical-level output.
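The three stages can be sketched as follows. Several details are assumptions rather than disclosures of the embodiment: the equivalent diameter is taken as the diameter of a circle with the same area as the mask, K(Z) is modeled with a simple pinhole camera as Z divided by the focal length in pixels, and outliers are rejected with a median-absolute-deviation rule. All function and class names are hypothetical.

```python
import math
import statistics
from collections import deque


def equivalent_diameter_px(foreground_pixels: int) -> float:
    """Diameter of the circle whose area equals the mask's foreground pixel count."""
    return 2.0 * math.sqrt(foreground_pixels / math.pi)


def scale_mm_per_px(distance_mm: float, focal_length_px: float) -> float:
    """K(Z) under an assumed pinhole model: one pixel spans Z / f millimetres."""
    return distance_mm / focal_length_px


def single_frame_size_mm(foreground_pixels: int, distance_mm: float,
                         focal_length_px: float) -> float:
    """Single-frame physical size: K(Z) times the pixel-level equivalent diameter."""
    return scale_mm_per_px(distance_mm, focal_length_px) * equivalent_diameter_px(foreground_pixels)


class SlidingSizeFilter:
    """Sliding-window filter: drop outliers, then output the median of what remains."""

    def __init__(self, window: int = 9, max_dev: float = 3.0):
        self.sizes = deque(maxlen=window)
        self.max_dev = max_dev  # allowed deviation in units of MAD (assumed rule)

    def update(self, size_mm: float) -> float:
        self.sizes.append(size_mm)
        med = statistics.median(self.sizes)
        mad = statistics.median(abs(s - med) for s in self.sizes)
        # If MAD is zero, most samples agree exactly; keep everything.
        kept = [s for s in self.sizes if mad == 0 or abs(s - med) <= self.max_dev * mad]
        return statistics.median(kept)
```

Under the pinhole assumption K(Z) grows linearly with Z, which is consistent with the scaling factor increasing as the lens moves away from the mucosa.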

[0142] At the second agent module level, the spatiotemporal graph convolutional network maintains the inherent adjacency relationships of the digestive tract lumen surface topology by performing spatial graph convolution and temporal convolution on a non-Euclidean graph structure. Compared to traditional grid convolution methods, it can capture the global topological pattern of anatomical structures more accurately. An anatomical topological finite state machine introduced at the decision layer uses the cumulative score of the confidence accumulation buffer after time-weighted summation as the decision signal. It judges the legality of state transition candidates according to the anatomical topological constraint table, allowing only transitions between adjacent topological nodes and blocking illegal jumps. This mechanism allows the temporal smoothing information provided by the spatiotemporal graph convolutional network to cascade with the topological constraints imposed by the finite state machine, effectively suppressing classification jump noise generated in the transition regions of traditional frame-by-frame classification methods. Combined with enhancement mechanisms such as optical flow sensing of camera movement direction to dynamically adjust allowed transition directions, marking transition states at low confidence levels, and keyframe anchoring, the robustness and smoothness of part recognition under various complex operational scenarios are further ensured. The architecture design of a shared backbone network and independent classification heads allows the same model skeleton to be adapted to both colonoscopy and gastroscopy by loading different classification heads, thus avoiding classification confusion between different examination types while reusing common digestive tract mucosal feature representations.
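The interaction between the confidence accumulation buffer and the topological constraint table can be sketched as below. The adjacency table follows the usual colonoscopy site order and is an assumption, as are the exponential time weighting, the default parameter values, and the class name `SiteTracker`. The requirement that the trend persist for several time steps is omitted for brevity.

```python
from collections import deque

# Assumed colonoscopy topology; the actual constraint table belongs to the embodiment.
ADJACENT = {
    "rectum": {"sigmoid_colon"},
    "sigmoid_colon": {"rectum", "descending_colon"},
    "descending_colon": {"sigmoid_colon", "transverse_colon"},
    "transverse_colon": {"descending_colon", "ascending_colon"},
    "ascending_colon": {"transverse_colon", "cecum"},
    "cecum": {"ascending_colon"},
}


class SiteTracker:
    def __init__(self, start="rectum", window=5, decay=0.8, switch_threshold=2.0):
        self.state = start
        self.buffer = deque(maxlen=window)  # confidence accumulation buffer
        self.decay = decay                  # assumed time weight; newest frame has weight 1
        self.switch_threshold = switch_threshold

    def cumulative_score(self, site: str) -> float:
        """Time-weighted sum of the buffered probabilities for one site."""
        n = len(self.buffer)
        return sum(self.decay ** (n - 1 - i) * probs.get(site, 0.0)
                   for i, probs in enumerate(self.buffer))

    def step(self, probs: dict) -> str:
        """Consume one per-frame probability distribution and return the current site."""
        self.buffer.append(probs)
        candidate = max(probs, key=self.cumulative_score)
        if candidate != self.state and self.cumulative_score(candidate) > self.switch_threshold:
            # Only transitions between adjacent topology nodes are accepted.
            if candidate in ADJACENT[self.state]:
                self.state = candidate
        return self.state
```

A confident but anatomically impossible jump, such as rectum to transverse colon, leaves the state unchanged no matter how high its cumulative score is.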

[0143] At the third agent module level, a locally deployed large language model, without updating its network parameters, retrieves the historical cases most similar to the current case from a vector database as in-context examples for few-shot inference. These cases are combined with the current structured metadata and assembled into a structured prompt template for inference. This avoids the overfitting risk and the high computational and maintenance costs associated with traditional model fine-tuning, while making full use of the extensive pre-trained knowledge of the base model. The effectiveness of this retrieval-augmented generation mechanism depends directly on the quality of the structured metadata output by the upstream first and second agent modules: high-precision lesion target attributes and temporally consistent site segmentation results provide reliable query conditions for vector retrieval, so that the retrieved cases closely match the current case and guide the large language model toward more accurate and standardized medical descriptions. An optional knowledge graph verification mechanism extracts entities from the generated text and checks them against preset association rules; it can detect necessary related elements missing from the description text and trigger regeneration, providing additional assurance of the completeness of the medical description.
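The retrieval step can be illustrated with a brute-force cosine similarity search. This is only a sketch: a plain Python list stands in for the vector database, the embedding of the structured metadata into `query_vec` is assumed to have been done elsewhere, and the names `cosine` and `retrieve_similar_cases` are hypothetical.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_similar_cases(query_vec, cases, k=2):
    """Return the k stored cases whose 'vector' is most similar to query_vec."""
    ranked = sorted(cases, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return ranked[:k]
```

The retrieved cases would then serve as the few-shot examples in the prompt, with no change to the model's parameters.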

[0144] At the fourth agent module level, time-stamp-based spatiotemporal alignment matching binds the lesion targets detected by the first agent module to the corresponding anatomical sites identified by the second agent module, achieving a three-layer linkage of "lesion-site-description". Due to the system's high frame rate and the parallel processing of the same frame by the two agent modules, the timestamp deviation is minimal, ensuring the reliability of spatiotemporal alignment matching. In the four-segment report structure, the preoperative preparation segment automatically assesses intestinal cleanliness based on the identification results of the water quality detection model during frame quality screening, eliminating the need for manual scoring by doctors; the operation process segment automatically describes the insertion and withdrawal paths of the endoscope by timestamp; the observation results segment presents the detection results and medical descriptions in a linked group by anatomical site; and the recommended measures segment automatically matches clinical recommendations based on lesion type and physical size using a rule engine. This structured report generation method ensures the completeness and standardization of the report content, reducing the risk of information omission.

[0145] At the level of host computer visualization and closed-loop feedback, the multi-color segmented progress bar indicates the time interval occupied by each anatomical site in a different color. Combined with the click-to-jump function, this allows doctors to quickly locate the target intestinal segment without manually replaying the video. The abnormal event peak graph uses the cumulative number of lesion occurrences as its data source and, after smoothing and attenuation processing, is superimposed on the time axis, making high-confidence abnormal intervals immediately apparent. The label correction operation triggers regeneration by the third and fourth agent modules, forming a complete closed loop from perception to cognition and from manual feedback to adaptive reconstruction, and ensuring that the final report accurately reflects the professional judgment of clinicians. One-click export of keyframe images and GIF animations with overlaid annotation boxes and category labels provides a convenient tool for organizing data in clinical research.
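One possible reading of the smoothing and attenuation applied to the abnormal event peak graph is a leaky accumulator over time bins. The exact processing is not specified in the embodiment, so the recurrence, the `decay` parameter, and the function name below are all assumptions.

```python
def event_peak_curve(counts_per_bin, decay=0.5):
    """Leaky accumulation: each bin adds its lesion count to a decayed running level."""
    curve = []
    level = 0.0
    for count in counts_per_bin:
        level = decay * level + count
        curve.append(level)
    return curve
```

With this form, a burst of lesion detections produces a peak that fades gradually instead of dropping to zero in the next bin.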

[0146] In summary, the above embodiments, through the computing power allocation and privacy protection mechanism formed by the layered deployment architecture of edge computing devices and local servers, the technical improvements of the four agent modules, and the cascading collaborative relationship established among them through message middleware, structured metadata data transmission, and spatiotemporal alignment matching, jointly realize the intelligent analysis of endoscopic videos from raw acquisition to structured report output. Significant technical improvements have been achieved in multiple dimensions, including analysis efficiency, detection accuracy, reliability of size estimation, robustness of part identification, description standardization, and report completeness.

[0147] It should be noted that in this patent application, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one" does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element. In this patent application, if it refers to performing an action according to an element, it means performing the action at least according to that element, including two cases: performing the action only according to that element, and performing the action according to that element and other elements. Expressions such as "multiple," "repeatedly," and "various" include two, two times, two kinds, and more than two, more than two times, and more than two kinds.

[0148] All documents mentioned in this application are considered to be incorporated in their entirety into the disclosure of this application so that they can serve as a basis for modifications if necessary. Furthermore, it should be understood that after reading the foregoing disclosure of this application, those skilled in the art can make various alterations or modifications to this application, and these equivalent forms also fall within the scope of protection claimed in this application.

Claims

1. An intelligent analysis system for endoscopic video based on multi-agent collaboration, characterized in that it comprises an edge computing device, a local server, and a host computer that are communicatively connected in sequence. The edge computing device is configured to: acquire a video stream collected by an endoscope device; perform frame extraction and frame quality screening on the video stream at a preset frequency to obtain a video frame sequence; and execute a first agent module and a second agent module in parallel on the video frame sequence. The first agent module adopts a model in which target detection and instance segmentation share a backbone network, extracts the segmentation mask of the lesion target in the video frame, calculates the physical size of the lesion target in combination with the real-time distance of the lens, and outputs the target detection result with a timestamp. The second agent module uses a spatiotemporal graph convolutional network to extract spatiotemporal features from multiple consecutive frames, applies state transition constraints through a preset anatomical topology finite state machine, and outputs site segmentation results with timestamps. The edge computing device encapsulates the target detection results and the site segmentation results into structured metadata and sends them to the local server through a message middleware, while the original video stream is stored on the edge computing device. The local server is configured to receive the structured metadata and execute a third agent module and a fourth agent module in sequence. The third agent module uses the structured metadata as search criteria to retrieve similar cases from a pre-built vector database as prompt examples, and inputs the prompt examples together with the structured metadata into a large language model to generate medical description text. The fourth agent module performs spatiotemporal alignment matching between the target detection results and the site segmentation results based on the timestamp, binds the lesion target to the corresponding anatomical site, and integrates the structured metadata and the medical description text to generate a medical examination report. The host computer is configured to: receive the structured metadata and the medical examination report pushed by the local server; perform visual rendering and display; and, in response to the user's label correction operation on the interface, send the correction instruction back to the local server to trigger the third agent module and the fourth agent module to regenerate the corresponding medical description text and medical examination report.

2. The system according to claim 1, characterized in that the first agent module calculating the physical size of the lesion target in combination with the real-time distance of the lens specifically includes: counting the total number of foreground pixels N in the segmentation mask and calculating the pixel-level equivalent diameter D_pix; obtaining the real-time distance Z from the endoscope lens to the digestive tract mucosa; calculating the single-frame physical size of the lesion target according to the formula D_phys = K(Z) · D_pix, where K(Z) is the dynamic scaling factor calculated from the camera intrinsic parameters and the real-time distance Z, represents the physical length of each pixel at a specific distance, is expressed in millimeters per pixel, and increases as the distance Z increases; and maintaining a sliding window of a preset length on the time axis, removing outliers from the single-frame physical sizes of multiple consecutive frames, and taking the median within a preset confidence interval as the final output physical size.

3. The system according to claim 1, characterized in that the second agent module using a spatiotemporal graph convolutional network to extract spatiotemporal features from multiple consecutive frames and applying state transition constraints through a preset anatomical topology finite state machine specifically includes: dividing a single video frame into multiple spatial regions as spatial nodes and extracting the feature vectors of the spatial nodes; establishing spatial edges between spatial nodes that are physically adjacent; connecting spatial nodes with the same spatial position in multiple consecutive frames along the time axis to establish temporal edges, so as to construct a dynamic spatiotemporal graph; aggregating spatiotemporal features along the spatial and temporal edges using multiple spatiotemporal graph convolutional layers and, after global average pooling and a fully connected classifier, outputting the probability distribution over the anatomical site categories at the current time step; storing the probability distributions of multiple consecutive frames in a confidence accumulation buffer and summing them over time to obtain a cumulative score; triggering a state transition candidate when the cumulative score of the target anatomical site exceeds the preset switching threshold and maintains that trend for multiple consecutive time steps; and having the anatomical topology finite state machine make a determination based on its built-in anatomical topology constraint table, wherein, if the candidate target anatomical site and the current anatomical site are not legally adjacent topology nodes, the transition signal is intercepted and the current anatomical site state is forcibly maintained.
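The construction of the dynamic spatiotemporal graph can be illustrated with a sketch that only builds the edge lists. It assumes the frame is partitioned into a regular grid of regions and that "physically adjacent" means horizontal or vertical grid neighbours; both are illustrative assumptions, and the function name is hypothetical.

```python
def build_spatiotemporal_edges(rows: int, cols: int, frames: int):
    """Return (spatial_edges, temporal_edges) as lists of node-index pairs.

    Node index for region (r, c) in frame t is t * rows * cols + r * cols + c.
    """
    def node(t, r, c):
        return t * rows * cols + r * cols + c

    spatial, temporal = [], []
    for t in range(frames):
        for r in range(rows):
            for c in range(cols):
                if c + 1 < cols:    # right-hand neighbour in the same frame
                    spatial.append((node(t, r, c), node(t, r, c + 1)))
                if r + 1 < rows:    # lower neighbour in the same frame
                    spatial.append((node(t, r, c), node(t, r + 1, c)))
                if t + 1 < frames:  # same region in the next frame
                    temporal.append((node(t, r, c), node(t + 1, r, c)))
    return spatial, temporal
```

The resulting edge lists are the input a graph convolution layer would aggregate over; the convolution itself is outside the scope of this sketch.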

4. The system according to claim 3, characterized in that the state transition constraints applied by the second agent module through the preset anatomical topology finite state machine further include: using the lens motion vector calculated by the optical flow method to sense the advance and retreat directions of the camera, so as to dynamically adjust the allowed transition directions of the anatomical topology finite state machine, wherein, if the camera retreats, switching to the distal anatomical site is prohibited; marking the current time step as a transitional state at the junction of anatomical sites when the highest output confidence is lower than the preset low-confidence threshold, or when the difference between the two highest confidences is lower than the preset difference threshold; when multiple consecutive frames are in the transitional state, backtracking the feature sequence toward the past along the time axis for local smoothing filtering; and maintaining the original anatomical site state until the preset anatomical landmark of the next anatomical site is detected with a confidence greater than the keyframe threshold.
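Two of these constraints, direction-aware transitions and transitional-state marking, can be sketched as below. The site order is expressed in insertion order, the motion direction is passed in as a boolean instead of being derived from optical flow, and the threshold values and function names are assumptions for illustration.

```python
def allowed_targets(current: str, insertion_order: list, advancing: bool) -> set:
    """Sites the state machine may switch to, given the camera's motion direction.

    insertion_order lists sites from the entry point inward (assumed convention).
    While retreating, only the previous site in that order is reachable.
    """
    i = insertion_order.index(current)
    j = i + 1 if advancing else i - 1
    return {insertion_order[j]} if 0 <= j < len(insertion_order) else set()


def is_transitional(probs: dict, low_conf: float = 0.5, min_gap: float = 0.15) -> bool:
    """True if the top confidence is low or the two best confidences are too close."""
    ranked = sorted(probs.values(), reverse=True)
    best = ranked[0]
    second = ranked[1] if len(ranked) > 1 else 0.0
    return best < low_conf or (best - second) < min_gap
```

A frame flagged by `is_transitional` would keep the original site state until a landmark of the next site is confirmed, as the claim describes.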

5. The system according to claim 1, characterized in that the edge computing device performing frame extraction and frame quality screening on the video stream at a preset frequency specifically includes: setting the preset frequency to a fixed frame rate that matches the physical advance speed of the endoscope; preprocessing the extracted video frames, including extracting a highlight detection mask for areas whose brightness exceeds a set high-brightness threshold and repairing them by adaptive histogram equalization, and applying gamma correction to areas whose brightness is below a set low-brightness threshold to improve the contrast of dark areas; and subjecting the preprocessed video frames to four-level frame quality screening, in which out-of-body frames and black-screen frames are filtered by a scene-level model, blurred frames, water quality interference frames, and still frames are filtered by a quality-level model, logically abnormal frames are filtered by combining timing counters and anatomical path constraints, and video frames with abnormal size fluctuations are filtered by calculating confidence intervals when the physical size of the lesion target is calculated; only video frames that pass the four-level frame quality screening are input to the first agent module and the second agent module in parallel.
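The four-level screening can be sketched as an ordered cascade in which a frame is rejected at the first level it fails. In this sketch each frame is a dictionary of boolean flags assumed to have been produced by the respective models and checks; the flag names, `SCREENING_LEVELS`, and the function names are hypothetical.

```python
from typing import Callable, Optional

Frame = dict  # hypothetical frame record carrying precomputed boolean flags

# (level name, predicate returning True when the frame must be rejected)
SCREENING_LEVELS: list[tuple[str, Callable[[Frame], bool]]] = [
    ("scene", lambda f: f.get("out_of_body", False) or f.get("black_screen", False)),
    ("quality", lambda f: f.get("blurred", False)
        or f.get("water_interference", False) or f.get("still", False)),
    ("logic", lambda f: f.get("logic_abnormal", False)),
    ("statistics", lambda f: f.get("size_abnormal", False)),
]


def first_failed_level(frame: Frame) -> Optional[str]:
    """Name of the first level that rejects the frame, or None if it passes all four."""
    for name, rejects in SCREENING_LEVELS:
        if rejects(frame):
            return name
    return None


def screen(frames: list) -> list:
    """Keep only frames that pass every screening level."""
    return [f for f in frames if first_failed_level(f) is None]
```

Ordering the levels from cheap scene checks to the statistical check means that obviously unusable frames are discarded before the more expensive stages see them.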

6. The system according to claim 1, characterized in that the process by which the third agent module generates the medical description text specifically includes: converting the image feature descriptions and diagnostic conclusions in approved historical medical examination reports into feature vectors to construct the vector database; in the inference phase, using the lesion targets and anatomical features contained in the structured metadata as query conditions to retrieve similar cases from the vector database as prompt examples for few-shot inference; assembling the role-setting instructions, the current structured metadata, and the retrieved prompt examples into a structured prompt template; and, without updating network parameters, inputting the structured prompt template into the locally and privately deployed large language model for in-context learning and inference, and outputting the medical description text.
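The assembly of the structured prompt template can be sketched as plain string construction. The layout of the template, the field names `findings` and `description`, and the function name `build_prompt` are assumptions; the claim only states which three ingredients are combined.

```python
def build_prompt(role_instruction: str, metadata: dict, examples: list) -> str:
    """Combine the role instruction, retrieved examples, and current metadata."""
    lines = [role_instruction, "", "Examples:"]
    for i, ex in enumerate(examples, 1):
        lines.append(f"[{i}] Findings: {ex['findings']} -> Description: {ex['description']}")
    lines += ["", "Current case:"]
    lines += [f"- {key}: {value}" for key, value in metadata.items()]
    lines += ["", "Description:"]
    return "\n".join(lines)
```

The returned string would be passed unchanged to the locally deployed model, which completes the text after the final `Description:` line.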

7. The system according to claim 1, characterized in that the fourth agent module performing spatiotemporal alignment matching between the target detection results and the site segmentation results based on the timestamp and binding the lesion target to the corresponding anatomical site specifically includes: for a lesion target detected by the first agent module at a target timestamp, searching the site segment sequence with start and end times output by the second agent module for candidate anatomical sites for which the absolute value of the time difference is less than a preset time threshold; if there are multiple candidate anatomical sites at a junction of anatomical sites, selecting the candidate anatomical site with the highest cumulative confidence for spatiotemporal binding; and, when generating the medical examination report, the fourth agent module automatically matching corresponding clinical recommendations to the category and physical size of the lesion target according to a preset rule engine.
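The timestamp-based binding can be sketched as follows. The time difference is taken as zero when the lesion timestamp falls inside a segment and as the distance to the nearer segment boundary otherwise; this definition, the segment field names, and the function name `bind_site` are assumptions for illustration.

```python
def bind_site(lesion_ts: float, segments: list, tolerance: float = 0.5):
    """Return the site bound to a lesion timestamp, or None if no segment is close enough.

    Each segment is a dict with 'site', 'start', 'end', and 'cumulative_confidence'.
    """
    candidates = []
    for seg in segments:
        if seg["start"] <= lesion_ts <= seg["end"]:
            diff = 0.0  # timestamp lies inside the segment
        else:
            diff = min(abs(lesion_ts - seg["start"]), abs(lesion_ts - seg["end"]))
        if diff < tolerance:
            candidates.append(seg)
    if not candidates:
        return None
    # At a junction several segments may qualify; prefer the most confident one.
    return max(candidates, key=lambda s: s["cumulative_confidence"])["site"]
```

A lesion seen in the short gap between two segments is bound to whichever neighbouring site has the higher cumulative confidence.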

8. The system according to claim 1, characterized in that, The model adopted by the first agent module, which uses a shared backbone network for target detection and instance segmentation, includes: A shared feature extraction backbone network is used to extract multi-scale feature maps from the input frame image; The target detection branch receives the feature map output by the shared feature extraction backbone network, performs lesion target detection, and outputs the lesion category and detection box coordinates. The instance segmentation branch receives feature maps of multiple scales output by the shared feature extraction backbone network, performs pixel-level instance segmentation, and outputs the segmentation mask; During the training phase, a composite loss function is constructed, which includes target detection box localization loss, classification confidence loss, and instance segmentation mask loss. The gradient of the instance segmentation mask loss is backpropagated to the shared feature extraction backbone network, thereby improving the feature perception capability of the lesion target through the instance segmentation task.

9. The system according to claim 7, characterized in that the medical examination report generated by the fourth agent module adopts a four-segment structure, including: a preoperative preparation segment, in which the intestinal cleanliness score is assessed based on the identification results of the water quality detection model during the frame quality screening process; an operation process segment, in which the records in the site segmentation results are sorted by timestamp and the insertion and withdrawal paths of the endoscope are described; an observation results segment, in which, based on the binding results of the spatiotemporal alignment matching, the target detection results of the lesion targets are grouped by anatomical site and presented together with the medical description text; and a recommended measures segment, in which clinical recommendations are automatically matched by the rule engine according to the category and the physical size of the lesion target.
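The four-segment report and the rule engine can be sketched together. The rule table below contains placeholder labels and an illustrative 10 mm boundary only; it is not clinical guidance and is not taken from the embodiment. The names `RULES`, `recommend`, and `build_report` are hypothetical.

```python
# (category, minimum size in mm, recommendation label); illustrative placeholders only.
RULES = [
    ("polyp", 10.0, "RECOMMENDATION_LARGE_POLYP"),
    ("polyp", 0.0, "RECOMMENDATION_SMALL_POLYP"),
]


def recommend(category: str, size_mm: float) -> str:
    """Return the first rule whose category matches and whose size bound is met."""
    for rule_category, min_size_mm, label in RULES:
        if rule_category == category and size_mm >= min_size_mm:
            return label
    return "RECOMMENDATION_DEFAULT"


def build_report(cleanliness_score, site_segments: list, findings: list) -> dict:
    """Assemble the four report segments from already bound findings."""
    path = [s["site"] for s in sorted(site_segments, key=lambda s: s["start"])]
    observations = {}
    for finding in findings:  # group descriptions by anatomical site
        observations.setdefault(finding["site"], []).append(finding["description"])
    return {
        "preparation": cleanliness_score,
        "procedure": path,
        "observations": observations,
        "recommendations": [recommend(f["category"], f["size_mm"]) for f in findings],
    }
```

Because the rules are evaluated in order, the more specific size bound must be listed before the catch-all entry for the same category.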

10. A method for intelligent analysis of endoscopic videos based on multi-agent collaboration, applied to the system according to any one of claims 1 to 9, characterized in that it includes the following steps: a video preprocessing step, in which the edge computing device acquires the endoscope video stream through the acquisition card, performs frame extraction and frame quality screening at a preset frequency, and obtains a video frame sequence; a parallel recognition step, in which the first agent module and the second agent module on the edge computing device process the video frame sequence in parallel, the first agent module using a model in which target detection and instance segmentation share a backbone network to extract the segmentation mask of the lesion target, calculating the physical size of the lesion target in combination with the real-time distance of the lens, and outputting the target detection result with a timestamp, and the second agent module using a spatiotemporal graph convolutional network to extract the spatiotemporal features of multiple consecutive frames, applying state transition constraints through an anatomical topology finite state machine, and outputting the site segmentation result with a timestamp; a metadata transmission step, in which the edge computing device encapsulates the target detection result and the site segmentation result into structured metadata and sends it to the local server through a message middleware, while the original video stream is stored on the edge computing device; a description generation step, in which the third agent module on the local server uses the structured metadata as search conditions to retrieve similar cases from the vector database as prompt examples, and inputs the prompt examples together with the structured metadata into the large language model to generate medical description text; a report generation step, in which the fourth agent module on the local server performs spatiotemporal alignment matching between the target detection results and the site segmentation results based on the timestamp, binds the lesion target to the corresponding anatomical site, and integrates the structured metadata and the medical description text to generate a medical examination report; and a visualization and closed-loop feedback step, in which the host computer receives the structured metadata and the medical examination report, performs visual rendering and display, and, in response to the user's label correction operation, sends the correction instruction back to the local server, triggering the third agent module and the fourth agent module to regenerate the corresponding medical description text and medical examination report.