Image detection, computer-aided diagnosis and model training method, electronic equipment, storage medium and program product

A multi-round iterative interleaved reasoning method, combined with the visual features and location features of medical images, addresses the limited accuracy of single-step reasoning on complex medical images, yielding more accurate lesion detection and more comprehensive information.

CN122199378A · Pending · Publication Date: 2026-06-12 · ALIBABA DAMOYUAN (BEIJING) TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ALIBABA DAMOYUAN (BEIJING) TECH CO LTD
Filing Date
2026-01-23
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing single-step inference methods struggle to accurately detect lesions when processing complex medical images, especially 3D medical images, resulting in inaccurate detection results.

Method used

A multi-round iterative interleaved reasoning method is adopted. By acquiring the visual features and location features of medical images and combining them with the detection task text, multi-round iterative interleaved reasoning guided by location is carried out to gradually determine candidate target locations. Generative base models are then used for training to generate more accurate detection results.

Benefits of technology

It improves the accuracy of lesion detection, avoids omissions, and provides richer and more comprehensive information on the extent of lesion impact.

✦ Generated by Eureka AI based on patent content.


Abstract

Embodiments of the present application provide an image detection, computer-aided diagnosis and model training method, electronic equipment, storage medium and program product, wherein an image detection method comprises: obtaining a medical image to be detected and a detection task text for the medical image; performing visual feature extraction and part segmentation on the medical image to obtain corresponding image visual features and part features of a plurality of parts included in the medical image; based on the detection task text, the image visual features and the part features, performing part-guided multi-round iterative interleaved reasoning through an image detection model to obtain a detection result of the medical image; wherein in the multi-round iterative interleaved reasoning, a candidate target part is determined for a next round of reasoning by a previous round of reasoning, so that the image detection model performs the next round of reasoning based on at least the part features of the candidate target part. Through the embodiments of the present application, more abundant and comprehensive information can be provided for lesion detection, and the accuracy of the detection result is improved.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and in particular to an image detection method, a computer-aided diagnosis method, a model training method, an electronic device, a computer storage medium, and a computer program product.

Background Technology

[0002] Medical images are crucial tools in clinical practice and are frequently used for disease screening and diagnosis. In recent years, artificial intelligence (AI) technology has been widely applied in medical image analysis, for example cancer screening with large language models. However, most large language models currently used in medical image analysis rely on single-step inference to complete downstream tasks. In scenarios such as cancer screening, where processing is based on complex images such as 3D medical images, the high structural and informational complexity of these images makes single-step inference insufficient for accurate judgments, resulting in inaccurate detection results.

Summary of the Invention

[0003] In view of this, embodiments of this application provide an image detection, computer-aided diagnosis, and model training scheme to at least partially solve the above-mentioned problems.

[0004] According to a first aspect of the embodiments of this application, an image detection method is provided, comprising: acquiring a medical image to be detected and a detection task text for the medical image; performing visual feature extraction and part segmentation on the medical image to obtain corresponding image visual features and part features of multiple parts included in the medical image; and performing part-guided multi-round iterative interleaved reasoning through an image detection model based on the detection task text, the image visual features, and the part features to obtain a detection result of the medical image; wherein, in the multi-round iterative interleaved reasoning, the previous round of reasoning determines candidate target parts for the next round of reasoning, so that the image detection model performs the next round of reasoning at least based on the part features of the candidate target parts.
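The control flow of the first aspect can be sketched as a simple loop in which each round proposes the candidate target parts that the next round will focus on. This is only a minimal illustration under stated assumptions: `interleaved_reasoning`, `RoundFn`, `toy_round`, the list-of-floats feature type, and the `max_rounds` cap are all hypothetical names and choices that do not appear in the application, and the image detection model is replaced by a stub callable.

```python
from typing import Callable, Dict, List, Set, Tuple

Feature = List[float]  # hypothetical stand-in for a feature vector
# One round returns (text produced this round, candidate parts for the next round).
RoundFn = Callable[[str, Feature, Dict[str, Feature], List[str]], Tuple[str, List[str]]]


def interleaved_reasoning(
    task_text: str,
    image_feature: Feature,
    part_features: Dict[str, Feature],
    reason_round: RoundFn,
    max_rounds: int = 5,  # assumption: a cap on the number of rounds
) -> List[str]:
    """Run rounds until a round proposes no new candidate target part."""
    transcript: List[str] = []
    visited: Set[str] = set()
    candidates: List[str] = []  # the first round sees only the whole-image features
    for _ in range(max_rounds):
        # Only the part features of the current candidate parts are passed in.
        focus = {p: part_features[p] for p in candidates}
        text, proposed = reason_round(task_text, image_feature, focus, transcript)
        transcript.append(text)
        visited.update(candidates)
        # The previous round determines the candidate parts for the next round.
        candidates = [
            p for p in dict.fromkeys(proposed)
            if p in part_features and p not in visited
        ]
        if not candidates:
            break
    return transcript


def toy_round(task_text, image_feature, focus, transcript):
    """Stub standing in for the image detection model (illustrative only)."""
    if not focus:
        return "round 1: whole image reviewed", ["liver"]
    if "liver" in focus:
        return "round 2: liver inspected", ["portal_lymph_nodes"]
    return "round 3: lymph nodes inspected", []


demo_transcript = interleaved_reasoning(
    "Does the CT image show any liver lesions?",
    [0.0],
    {"liver": [0.1], "portal_lymph_nodes": [0.2]},
    toy_round,
)
```

In this sketch the loop stops either when no new candidate part is proposed or when the round cap is reached; the application itself does not state a stopping rule at this point.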

[0005] According to a second aspect of the embodiments of this application, another image detection method is provided, comprising: displaying a first human-computer interaction interface, and receiving, through the first human-computer interaction interface, a medical image input by a user and a detection task text for the medical image; sending the medical image and the detection task text to a server, so that an image detection model in the server performs part-guided multi-round iterative interleaved reasoning, at least based on the image visual features of the medical image and the detection task text, to obtain a detection result of the medical image; and receiving the detection result returned by the server and displaying the detection result through a second human-computer interaction interface.

[0006] According to a third aspect of the embodiments of this application, a computer-aided diagnostic method is provided, comprising: acquiring a three-dimensional medical image to be detected and a detection task text for the three-dimensional medical image; performing visual feature extraction and site segmentation on the three-dimensional medical image to obtain corresponding image visual features and site features of multiple biological sites included in the three-dimensional medical image; performing site-guided multi-round iterative interleaved reasoning through an image detection model based on the detection task text, the image visual features, and the site features to obtain a detection result of the three-dimensional medical image; wherein, in the multi-round iterative interleaved reasoning, the previous round of reasoning determines candidate target sites for the next round of reasoning, so that the image detection model performs the next round of reasoning at least based on the site features of the candidate target sites; and generating auxiliary diagnostic suggestions corresponding to the three-dimensional medical image based on the detection results.

[0007] According to a fourth aspect of the embodiments of this application, an image detection model training method is provided, comprising: obtaining training samples, the training samples including multiple medical image-text sample pairs and thought chain visual question answering instruction samples corresponding to the medical image samples in the medical image-text sample pairs, wherein each medical image-text sample pair includes a medical image sample, and a detection task text sample and a detection result text sample corresponding to the medical image sample; and using the training samples, performing fine-tuning training on a generative base model based on location-guided multi-round iterative interleaved inference to obtain an image detection model for medical image detection; wherein, in the multi-round iterative interleaved inference, the previous round of inference determines candidate target locations for the next round of inference, so that the generative base model performs the next round of inference at least based on the location features of the candidate target locations.

[0008] According to a fifth aspect of the embodiments of this application, an electronic device is provided, including: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with each other through the communication bus; the memory is used to store at least one executable instruction, wherein the executable instruction causes the processor to perform an operation corresponding to the method described in the first, second, third, or fourth aspects.

[0009] According to a sixth aspect of the embodiments of this application, a computer storage medium is provided that stores a computer program thereon, which, when executed by a processor, implements the method described in the first, second, third, or fourth aspect.

[0010] According to a seventh aspect of the embodiments of this application, a computer program product is provided, including computer instructions that instruct a computing device to perform an operation corresponding to the method described in the first, second, third, or fourth aspects.

[0011] According to the solution provided in the embodiments of this application, when detecting medical images, especially complex images such as three-dimensional medical images, the visual features of the medical image and the part features of the multiple parts contained in it are obtained separately. On this basis, and in combination with the detection task text, multi-round iterative interleaved reasoning is performed, in which each round of reasoning predicts the candidate target parts associated with the parts examined in that round, for use in the next round. Because a lesion in one part may affect not only that part but also other parts associated with it, determining multiple candidate target parts makes it possible to effectively determine the range of influence of the lesion, avoid missed detections, provide richer and more comprehensive information for lesion detection, and improve the accuracy of the detection results.

Description of the Drawings

[0012] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in the embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings.

[0013] Figure 1 is a schematic diagram of an exemplary system to which the embodiments of this application are applicable;
Figure 2A is a flowchart of the steps of an image detection model training method according to an embodiment of this application;
Figure 2B is a schematic structural diagram of an exemplary image detection model according to an embodiment of this application;
Figure 3A is a flowchart of the steps of an image detection method according to an embodiment of this application;
Figure 3B is a schematic diagram of the image detection process performed using the image detection model shown in Figure 2B;
Figure 4A is a flowchart of another image processing method according to an embodiment of this application;
Figure 4B is a schematic diagram of an exemplary scenario of the embodiment shown in Figure 4A;
Figure 5 is a flowchart of the steps of a computer-aided diagnostic method according to an embodiment of this application;
Figure 6 is a schematic structural diagram of an electronic device according to an embodiment of this application.

Detailed Implementation

[0014] To enable those skilled in the art to better understand the technical solutions in the embodiments of this application, the technical solutions in the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art should fall within the protection scope of the embodiments of this application.

[0015] The specific implementation of the embodiments of this application will be further described below with reference to the accompanying drawings.

[0016] Figure 1 shows an exemplary system applicable to embodiments of this application. As shown in Figure 1, the system 100 may include a cloud server 102, a communication network 104, and / or one or more user devices 106. The example in Figure 1 shows multiple user devices.

[0017] The cloud server 102 can be any suitable device for storing information, data, programs, and / or any other suitable type of content, including but not limited to distributed storage system devices, server clusters, computing cloud server clusters, etc. In some embodiments, the cloud server 102 can perform any suitable function. For example, in some embodiments, the cloud server 102 can be used to perform image detection processing of medical images. As an optional example, in some embodiments, the cloud server 102 can acquire the medical image to be detected and the detection task text for the medical image; perform visual feature extraction and site segmentation on the medical image to obtain the corresponding image visual features and site features of multiple sites included in the medical image; based on the detection task text, image visual features, and site features, perform site-guided multi-round iterative interleaved reasoning through an image detection model to obtain the detection result of the medical image; wherein, in the multi-round iterative interleaved reasoning, the previous round of reasoning determines the candidate target site for the next round of reasoning, so that the image detection model performs the next round of reasoning at least based on the site features of the candidate target sites. As another example, in some embodiments, the cloud server 102 can be used to receive a medical image to be detected sent by the user equipment 106; and after obtaining the detection result corresponding to the medical image, send the result to the user equipment 106. As another example, in some embodiments, an image detection model can also be deployed in the cloud server 102 to perform the above-mentioned image detection processing through the image detection model. In other optional embodiments, the cloud server 102 can also perform model training to obtain the image detection model.

[0018] In some embodiments, the communication network 104 can be any suitable combination of one or more wired and / or wireless networks. For example, the communication network 104 can include any one or more of the following: the Internet, an intranet, a wide area network (WAN), a local area network (LAN), a wireless network, a digital subscriber line (DSL) network, a frame relay network, an asynchronous transfer mode (ATM) network, a virtual private network (VPN), and / or any other suitable communication network. The user equipment 106 can be connected to the communication network 104 via one or more communication links (e.g., communication link 112), and the communication network 104 can be linked to the cloud server 102 via one or more communication links (e.g., communication link 114). The communication link can be any communication link suitable for transmitting data between the user equipment 106 and the cloud server 102, such as a network link, a dial-up link, a wireless link, a hardwired link, any other suitable communication link, or any suitable combination of such links.

[0019] User device 106 may include one or more user devices suitable for interacting with a user through a human-computer interaction interface and having the ability to present data and information, such as presenting medical images or detection results of medical images. In some embodiments, user device 106 may include any suitable type of device. For example, in some embodiments, user device 106 may include mobile devices, tablet computers, laptop computers, desktop computers, wearable computers, and / or any other suitable type of user device.

[0020] Based on the above system, this application provides an image detection and model training scheme, which will be described below through several embodiments.

[0021] To facilitate understanding, the training process of the image detection model will be explained first, and then the image detection method and computer-aided diagnosis method of the embodiments of this application will be explained based on this.

[0022] An image detection model training method according to an embodiment of this application is described below with reference to Figure 2A and Figure 2B. Figure 2A shows a flowchart of the image detection model training method, which includes the following steps:

Step S202: Obtain training samples.

[0023] In this embodiment, the training samples include multiple medical image-text sample pairs and thought chain visual question answering instruction samples corresponding to the medical image samples in those pairs. Each medical image-text sample pair contains both image data and the corresponding text data. Specifically, each pair in this embodiment includes a medical image sample, a detection task text sample corresponding to the medical image sample, and a detection result text sample.

[0024] The detection task text sample in the medical image text sample pair describes the detection requirements for the medical image sample in the sample pair, such as "Please detect whether there are lesions in the lungs" or "Does the CT image show any liver lesions?", etc. The detection result text sample is the image detection result (which can be an image detection report, etc.) corresponding to the detection task text sample for that medical image sample.

[0025] Unlike traditional training samples, in this embodiment the medical image sample in each medical image-text sample pair also has a corresponding thought chain visual question answering instruction sample. A thought chain visual question answering instruction introduces the chain-of-thought (CoT) reasoning ability of a Large Language Model (LLM) into a visual question answering (VQA) task. CoT guides the LLM to generate intermediate reasoning steps one by one and then derive the result; VQA outputs an answer through the LLM based on a given image and a task or question about that image. With CoT introduced into VQA, the LLM no longer "guesses the answer at a glance" when faced with a visual problem; instead, like a human, it first identifies key elements in the image, then performs logical deduction, and finally generates the answer. This allows the LLM to handle more complex reasoning tasks, improves interpretability, reduces hallucinations, and makes the answer generation process more accurate, transparent, and logical. During the model training phase, these instructions take the form of thought chain visual question answering instruction samples corresponding to medical image samples. They are paired with the medical image samples and provide step-aligned reasoning and cross-modal alignment information along the trajectory "representation → impression → pathology". This enables standardized evaluation of both the accuracy of LLM-generated detection results and the consistency of the reasoning that produces them.

[0026] In one alternative approach, four types of thought chain visual question answering tasks can be identified based on the text samples of the detection results: localization, lesion attributes, abnormal staging (e.g., T (Tumor), N (Node), and M (Metastasis) staging for cancer-related diseases) prediction, and thought chain report generation. Each task type is further subdivided into sub-tasks based on the substructure of the target site (e.g., the target organ) and the abnormal grade (e.g., tumor grade). The task design can range from simple to complex, employing formats such as multiple-choice and open-ended questions to construct thought chain visual question answering instruction samples. This ensures that each thought chain visual question answering instruction sample contains a traceable reasoning process and summary, thereby improving the reasoning interpretability and cross-modal consistency of LLM through this organizational framework. Corresponding to the above task, in one optional approach, the visual question-answering instruction samples of the thought chain include: question-answering instruction samples for locating parts in medical image samples, question-answering instruction samples for performing abnormal attribute analysis on abnormal parts, question-answering instruction samples for predicting abnormal staging of abnormal parts, and question-answering instruction samples for instructing the generation of thought chain reports. Through these instruction samples, a thought chain for model training can be constructed.
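One way to picture the training data described above is as a record that holds an image, its two text samples, and a list of question-answer pairs tagged with one of the four task types. The following is a sketch only: the class names, field names, and file name are hypothetical and not taken from the application, while the example question and answer strings are the ones quoted in this description.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Set


class TaskType(Enum):
    """The four thought chain VQA task types named in the text."""
    LOCALIZATION = "localization"
    LESION_ATTRIBUTES = "lesion_attributes"
    STAGING_PREDICTION = "staging_prediction"
    REPORT_GENERATION = "report_generation"


@dataclass
class CotVqaSample:
    task_type: TaskType
    question: str
    answer: str


@dataclass
class TrainingSample:
    image_path: str   # hypothetical: where the medical image sample is stored
    task_text: str    # detection task text sample
    result_text: str  # detection result text sample
    cot_vqa: List[CotVqaSample] = field(default_factory=list)

    def missing_tasks(self) -> Set[TaskType]:
        """Task types for which no instruction sample has been built yet."""
        return set(TaskType) - {s.task_type for s in self.cot_vqa}


sample = TrainingSample(
    image_path="ct_0001.nii.gz",  # hypothetical file name
    task_text="Does the CT image show any liver lesions?",
    result_text="(detection report text)",
    cot_vqa=[
        CotVqaSample(
            TaskType.LOCALIZATION,
            "Please locate the liver in the image",
            "(0.08, 0.27, 0.09, 0.39, 0.74, 0.97)",
        )
    ],
)
```

A helper such as `missing_tasks` is just a convenience for checking that all four instruction types have been constructed for a given image; the application does not prescribe such a check.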

[0027] In one example of an embodiment of this application, an interleaved-validation thought chain data engine is designed to construct thought chain visual question answering instruction samples, which form the thought chains used for model training. Furthermore, a knowledge base storing a diagnostic knowledge graph is established. This knowledge graph is constructed from diagnostic guidelines, textbooks, and typical cases, and takes the form of "entity-relationship-entity" triples covering anatomical structures, examination results, diagnostic opinions, histopathology, and risk factors. All site (organ) segmentation criteria follow international conventions, enabling hierarchical substructure analysis and anomaly (e.g., tumor) grading analysis. During the construction of a thought chain, relevant nodes and relationships can be retrieved from this knowledge graph to ensure logical traceability and reliability and to minimize factual and logical errors.
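As a rough illustration of an "entity-relationship-entity" knowledge graph and of retrieving the nodes and relationships related to one entity, consider the sketch below. The `Triple` and `DiagnosticKG` names and the relation labels are hypothetical, and the three toy triples are placeholders chosen for illustration; they are not taken from the application or from any clinical guideline.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


class DiagnosticKG:
    """Tiny in-memory triple store indexed by head entity."""

    def __init__(self, triples: Iterable[Triple]) -> None:
        self._out: Dict[str, List[Triple]] = defaultdict(list)
        for t in triples:
            self._out[t.head].append(t)

    def neighbors(self, entity: str, relation: Optional[str] = None) -> List[Triple]:
        """Return triples leaving `entity`, optionally filtered by relation."""
        return [
            t for t in self._out.get(entity, [])
            if relation is None or t.relation == relation
        ]


# Toy placeholder triples (not medical facts from the source).
kg = DiagnosticKG([
    Triple("liver", "has_substructure", "right_lobe"),
    Triple("liver", "associated_site", "portal_lymph_nodes"),
    Triple("portal_lymph_nodes", "associated_site", "retroperitoneum"),
])

liver_links = kg.neighbors("liver")
liver_parts = kg.neighbors("liver", relation="has_substructure")
```

Indexing by head entity keeps retrieval of "relevant nodes and relationships" a dictionary lookup; a production knowledge base would more likely sit in a graph database, which the application does not specify.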

[0028] Guided by this knowledge graph, and based on the execution of the aforementioned tasks, diverse, fine-grained visual question-answering sample instructions can be generated. For example, this may include the following process: Process 1. Processing based on raw data.

[0029] In this process, the original training samples can first be cleaned to standardize the sample terminology. Further, medical image samples from the training samples can be segmented to obtain region (organ) masks; structured features can be extracted from medical image samples (such as CT scans) and detection result text samples (such as pathology reports). For example, region masks can be obtained through a segmentation model, and structured features (including but not limited to clinical information, lesion attributes (location, shape, edge, density, quantity), etc.) can be extracted using a structured feature extractor. Abnormal staging information such as TNM staging information and the final diagnostic conclusion can also be extracted from pathology reports to ensure the completeness and accuracy of the subsequent thought process.

[0030] Process 2. Importing the diagnostic knowledge graph.

[0031] As mentioned earlier, a diagnostic knowledge graph for each part (organ) can be constructed first (e.g., a diagnostic knowledge graph for the five digestive organs). During the construction of the thought chain, relevant nodes and relationships can be retrieved from the diagnostic knowledge graph as clinical prior knowledge for constructing the thought chain data, ensuring the traceability and high reliability of the thought chain logic.

[0032] Process 3. Construction and validation of thought chain data based on thought chain visual question answering instruction samples.

[0033] For example, this process can use three agents: a CoT Reasoner, a Logic Calibrator, and a Summarizer. The three agents can use the same or different model structures; in this example, a different base model is assigned to each so as to leverage the strengths of different base models. The CoT Reasoner can employ a model with strong language and medical capabilities; in some embodiments, a lightweight model can be used. It receives the multidimensional structured features extracted in Process 1 from pathology reports (detection result text samples), such as CT reports, and combines them with the diagnostic knowledge graph (KG) from Process 2 to associate radiological examination results with diagnostic opinions, generating visual question answering instruction samples at the site (organ) level and the lesion level that conform to medical guidelines. These include question-answering instruction samples for locating sites in medical image samples and question-answering instruction samples for analyzing the abnormal attributes of abnormal sites. The Logic Calibrator automatically verifies the completeness and accuracy of the reasoning process and can use a model with strong reasoning capabilities. If the Logic Calibrator detects a potential problem, it randomly applies one of two prompting strategies (expanding the organ region or providing a suspected cause) to guide the image detection model to re-evaluate its thought chain. The Summarizer, which may also employ a lightweight model, receives structured information extracted from the pathology report and constructs thought chain visual question answering (CoT-VQA) instruction samples. It first summarizes and validates all previously generated thought chains. If the thought chain matches the pathology conclusion, it generates, for each site (organ), question-answering instruction samples for predicting abnormal staging (e.g., TNM staging) of sites with abnormalities, and question-answering instruction samples for instructing the generation of the thought chain report. Otherwise, it triggers the upstream nodes to redo the reasoning.
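The interplay of the three agents can be sketched as a retry loop: the reasoner drafts a chain, the calibrator either accepts it or issues one of the two hints at random, and the summarizer accepts the chain only if it agrees with the pathology conclusion. Everything here is an assumption made for illustration: the function and type names, the attempt cap, the choice to feed the hint back to the reasoner stub, and the stub agents themselves, which stand in for the base models.

```python
import random
from typing import Callable, Dict, Optional

Reasoner = Callable[[Dict[str, str], Optional[str]], str]  # (features, hint) -> chain
Calibrator = Callable[[str], bool]                         # chain -> logic looks sound?
Summarizer = Callable[[str, str], bool]                    # (chain, pathology) -> match?

# The two prompting strategies named in the text.
HINTS = ("expand the organ region", "provide a suspected cause")


def build_chain(
    features: Dict[str, str],
    pathology: str,
    reasoner: Reasoner,
    calibrator: Calibrator,
    summarizer: Summarizer,
    max_attempts: int = 3,  # assumption: bounded number of retries
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Return a validated thought chain, or None if no attempt passes."""
    rng = rng or random.Random(0)
    hint: Optional[str] = None
    for _ in range(max_attempts):
        chain = reasoner(features, hint)
        if not calibrator(chain):
            hint = rng.choice(HINTS)  # randomly pick one of the two strategies
            continue
        if summarizer(chain, pathology):
            return chain
        hint = None  # mismatch with pathology: redo the upstream reasoning
    return None


# Stub agents for demonstration only.
def stub_reasoner(features, hint):
    return "draft" if hint is None else "revised: T2"


def stub_calibrator(chain):
    return chain.startswith("revised")


def stub_summarizer(chain, pathology):
    return pathology in chain


accepted = build_chain({"lesion": "right lobe"}, "T2",
                       stub_reasoner, stub_calibrator, stub_summarizer)
```

With these stubs, the first draft is rejected by the calibrator, a hint is issued, and the revised chain passes both checks on the second attempt.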

[0034] For example, a question-and-answer instruction sample for locating a site in a medical image sample is: "Please locate the liver in the image" - "(0.08, 0.27, 0.09, 0.39, 0.74, 0.97)"; a question-and-answer instruction sample for performing abnormal attribute analysis on abnormal sites is: "Please describe the imaging characteristics of the lesion in the right lobe of the liver" - "An irregular low-density lesion is seen in segment VIII of the right lobe of the liver, approximately 4.5 × 3.8 × 4.2 cm in size, with indistinct borders and a lobulated appearance"; a question-and-answer instruction sample for predicting abnormal staging of abnormal sites is: "Please perform TNM staging of this liver tumor based on its imaging findings." - "Based on imaging findings: [T Staging]: A single mass in the right lobe of the liver, with a maximum diameter of approximately 4.5 cm, and no clear signs of invasion of major blood vessels, is judged as T2 stage (single tumor, maximum diameter >2 cm but ≤5 cm, without invasion of major blood vessels). [N Staging]: No significantly enlarged lymph nodes (short diameter <1 cm) are seen in the porta hepatis and retroperitoneum, and is judged as N0 stage. [M Staging]: No clear metastatic lesions are seen in the lungs, bones, peritoneum, etc., and is judged as M0 stage."; A sample question-and-answer instruction for generating a thought chain report is as follows: "Please generate an imaging diagnosis report for this patient, including the complete reasoning process" - "First, ...; Second, ...; Finally, .... Based on the above findings, ...."

[0035] Based on the aforementioned question-and-answer instruction samples and combined with the diagnostic knowledge graph, the model can perform multiple rounds of reasoning to achieve the aforementioned tasks of localization, lesion attribute prediction, abnormal staging prediction, and thought chain report generation. During the execution of these multiple tasks, if an abnormality is found after analyzing the lesion attribute of the localized site in the current round according to the diagnostic knowledge graph, the model will determine the associated sites based on the diagnostic knowledge graph, and further perform localization and lesion analysis tasks on the associated sites until the diagnostic knowledge graph reaches the final node. Finally, if the results of multiple inferences are consistent with the pathological conclusion, it will generate an abnormal staging prediction question-and-answer and thought chain format report for the target site (organ).
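The expansion to associated sites described above resembles a graph traversal that follows association edges only from sites found to be abnormal. The sketch below assumes a breadth-first order, a plain dictionary in place of the diagnostic knowledge graph, and a caller-supplied `is_abnormal` predicate in place of the lesion attribute analysis; the function name and the toy site names are hypothetical.

```python
from collections import deque
from typing import Callable, Dict, List


def sites_to_examine(
    start_site: str,
    associated: Dict[str, List[str]],
    is_abnormal: Callable[[str], bool],
) -> List[str]:
    """Return sites in the order they would be localized and analyzed."""
    order: List[str] = []
    seen = {start_site}
    queue = deque([start_site])
    while queue:
        site = queue.popleft()
        order.append(site)  # localization + lesion analysis happen here
        if not is_abnormal(site):
            continue        # normal site: do not follow its associations
        for nxt in associated.get(site, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


# Toy association map (placeholder, not clinical knowledge from the source).
toy_associated = {
    "liver": ["portal_vein", "lymph_nodes"],
    "lymph_nodes": ["lungs"],
}
visit_order = sites_to_examine("liver", toy_associated, lambda s: s == "liver")
```

The traversal ends when the queue is empty, which plays the role of reaching the final node of the knowledge graph in this simplified picture.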

[0036] In an alternative approach, the quality of the thought chain constructed based on the above-mentioned question-and-answer instruction sample can also be evaluated, and if the evaluation result meets the preset standard such as the preset score (which can be set by those skilled in the art according to actual needs, for example, 45 points, etc.), the constructed thought chain can be determined as the thought chain to be used.

[0037] For example, the evaluation of the thought chain can be divided into two stages. In the first stage, structured (subject, relation, object) triples are extracted from the text sample of the detection results as the factual basis for the thought chain reasoning. In the second stage, the quality of the reasoning chain of the generated thought chain is evaluated based on these triples.

[0038] The following is a detailed explanation.

[0039] Phase 1: Extraction of the subject-relationship-object triplet.

[0040] This stage is a prerequisite for the evaluation of the reasoning chain. It is used to convert unstructured test result text samples into structured knowledge representations and extract all (subject, relation, object) triples that can reflect medical facts, thereby providing accurate and clear factual units for subsequent reasoning chain evaluation.

[0041] The subject is usually an organ (such as the liver), anatomical structure (such as the blood vessels around the colon), or lesion (such as enlarged lymph nodes); the relationship describes the medical semantic relationship between the subject and the object using verbs or state terms (such as not observed, suggestive, non-existent, supportive); the object is generally a medical discovery, lesion (such as wall thickening, abnormal density), symptom (such as mass-like changes), or other related organs.

[0042] Phase 2: Thought chain evaluation.

[0043] Evaluation can be accomplished by comparing the model's predicted inference chain (Pred) with the reference true inference chain (GT).

[0044] To achieve fine-grained evaluation, in one alternative approach, both the predicted inference chain (Pred) and the reference true inference chain (GT) can be divided into three levels: 1. Finding Chain (FC): first-level findings, formed by basic facts extracted directly from the detection result text sample (e.g., no wall thickening observed, soft tissue mass detected, etc.). These represent independent, objective observations that exist without reasoning.

[0045] 2. Impression Chain (IC): second-level findings, comprising intermediate medical impressions or preliminary suggestions derived from multiple first-level findings in the finding chain (e.g., suggesting local inflammatory changes). This level reflects simple preliminary reasoning.

[0046] 3. Long Reasoning Chain (LRC): third-level findings, representing high-level medical reasoning and conclusions that integrate all findings (first-level) and impressions (second-level) (e.g., consistency with imaging features of malignant tumors).

[0047] Together, the reasoning chains at these three levels present a complete logical deduction with clinical diagnostic value.

[0048] Based on this, the constructed thought chain can be scored. In one example, scoring is performed using a large language model (LLM). The ground-truth reasoning chain (GT) and the model-predicted reasoning chain (Pred) are input into the LLM together with an evaluation prompt that embeds a set of predefined scoring rules. Guided by this prompt, the LLM compares the two reasoning chains and analyzes them in detail. The LLM can also generate a structured JSON object containing the categorized reasoning chains and a numerical score for each metric. The final thought chain evaluation score is a weighted average of the scores of the sub-chains (reasoning levels) included in the thought chain, which balances the relative importance of the different levels of reasoning. An example calculation formula is as follows:
$$\mathrm{Score} = \frac{1}{N}\sum_{i=1}^{N}\left(\alpha\, S_{\mathrm{FC}}^{(i)} + \beta\, S_{\mathrm{IC}}^{(i)} + \gamma\, S_{\mathrm{LRC}}^{(i)}\right) \tag{1}$$
where $S_{\mathrm{FC}}^{(i)}$ denotes the score of sample $i$ in the FC rating, $S_{\mathrm{IC}}^{(i)}$ denotes the score of sample $i$ in the IC rating, $S_{\mathrm{LRC}}^{(i)}$ denotes the score of sample $i$ in the LRC rating, $N$ represents the number of samples, and $\alpha$, $\beta$, and $\gamma$ are the weighting coefficients of the three levels.

[0049] The weighting coefficients satisfy:
$$\alpha + \beta + \gamma = 1 \tag{2}$$
where $\alpha$, $\beta$, and $\gamma$ can be used to adjust the importance of each reasoning chain level during the evaluation process. In one example, $\alpha$ is set to 0.3, $\beta$ is set to 0.3, and $\gamma$ is set to 0.4, which emphasizes the importance of advanced logical reasoning (LRC) while giving equal basic weight to fact finding (FC) and medical impressions (IC).
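The weighted averaging described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `chain_score`, the dictionary keys `"FC"`, `"IC"`, and `"LRC"`, and the score range of 0 to 1 are hypothetical; the weights 0.3, 0.3, and 0.4 are the example values from the text.

```python
def chain_score(samples, w_fc=0.3, w_ic=0.3, w_lrc=0.4):
    """Average over samples of the weighted FC / IC / LRC scores."""
    # The three weights are required to sum to 1.
    if abs(w_fc + w_ic + w_lrc - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    if not samples:
        raise ValueError("at least one sample is required")
    total = 0.0
    for s in samples:
        total += w_fc * s["FC"] + w_ic * s["IC"] + w_lrc * s["LRC"]
    return total / len(samples)


# Hypothetical per-sample scores, e.g. parsed from the judge LLM's JSON output.
scores = [
    {"FC": 1.0, "IC": 0.5, "LRC": 0.0},
    {"FC": 1.0, "IC": 1.0, "LRC": 1.0},
]
overall = chain_score(scores)
```

With these toy scores the first sample contributes 0.45 and the second 1.0, so the average is 0.725.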

[0050] Step S204: Using training samples, fine-tune the generative base model through multi-round iterative interleaved inference based on location guidance to obtain an image detection model for medical image detection.

[0051] The generative base model is a pre-trained model possessing basic language understanding and generation capabilities. Based on this, further fine-tuning is performed using the aforementioned training samples to enable it to perform multi-round iterative interleaved inference for medical images, as described in this application, for accurate medical image detection. In the multi-round iterative interleaved inference, the previous round of inference determines candidate target locations for the subsequent round, ensuring that the generative base model performs subsequent inference based at least on the location features of the candidate target locations.

[0052] Before model training, the medical image samples in the training dataset can optionally be preprocessed. Taking CT images as an example, a soft-tissue window (window level and window width) can be applied to each CT image, followed by voxel intensity normalization to reduce differences between the slices of the CT image, which helps robust feature extraction. To handle the heterogeneity of CT image shapes, dimensional inconsistencies can be avoided through cropping and zero-padding. Furthermore, to meet the input requirements of the visual encoder, all processed CT images can be resized to a fixed size (e.g., 256×256×32, i.e., height × width × depth) so that all input CT images have the same spatial dimensions, improving model compatibility and computational efficiency. Such preprocessing helps ensure the efficiency and quality of model training.
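A minimal sketch of the preprocessing steps above, written with NumPy. Several details are assumptions rather than values from the text: the window level of 40 HU and width of 400 HU are a commonly used soft-tissue setting, the function names are hypothetical, and the fixed size is reached here by center cropping and zero-padding only, with interpolation-based resizing omitted.

```python
import numpy as np


def window_and_normalize(volume_hu, level=40.0, width=400.0):
    """Clip HU values to a window and rescale them to [0, 1]."""
    lo, hi = level - width / 2.0, level + width / 2.0
    clipped = np.clip(volume_hu.astype(np.float32), lo, hi)
    return (clipped - lo) / (hi - lo)


def crop_or_pad(volume, target_shape):
    """Center-crop or zero-pad each axis so the result has target_shape."""
    out = np.zeros(target_shape, dtype=volume.dtype)
    src, dst = [], []
    for size, target in zip(volume.shape, target_shape):
        n = min(size, target)          # voxels kept along this axis
        s0 = (size - n) // 2           # crop offset in the source
        d0 = (target - n) // 2         # pad offset in the destination
        src.append(slice(s0, s0 + n))
        dst.append(slice(d0, d0 + n))
    out[tuple(dst)] = volume[tuple(src)]
    return out


# Synthetic volume standing in for a CT scan (height, width, depth).
rng = np.random.default_rng(0)
ct = rng.integers(-1000, 1000, size=(300, 200, 40)).astype(np.int16)
prepared = crop_or_pad(window_and_normalize(ct), (256, 256, 32))
```

In this sketch the first axis is cropped from 300 to 256, the second is padded from 200 to 256, and the depth is cropped from 40 to 32.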

[0053] Since the image detection model is obtained by training the generative base model, the two share the same structure. For ease of explanation, the model training process is described below with reference to the exemplary structure of the generative base model (or image detection model) shown in Figure 2B.

[0054] As shown in Figure 2B, the generative base model or image detection model is a hybrid model whose exemplary structure includes: a visual encoder, a segmentation model, an auxiliary classification model, a multilayer perceptron (MLP), and a large language model (LLM).

[0055] Here, the visual encoder is used to encode medical image samples to obtain the corresponding image visual features, which represent the overall visual characteristics of the medical image samples. In Figure 2B, the visual encoder is labeled "3D Vision Encoder"; it can be implemented with any suitable model structure capable of visual feature extraction.

[0056] The segmentation model is used to segment medical image samples to obtain the location features of multiple parts included in the medical image sample. For example, a whole organ segmenter (e.g., a machine learning model or model component with segmentation capabilities) can be used to segment 117 organs of the human body and merge them into 56 organs, thereby enabling structured region of interest (ROI) localization for subsequent visual question answering tasks.

[0057] The auxiliary classification model is used to perform preliminary anomaly detection on the features of multiple parts, and obtain preliminary detection results corresponding to each part (the preliminary detection results are classification results that can indicate whether the corresponding part has an anomaly, such as "normal" or "abnormal"). The parts corresponding to the detection results that indicate the presence of anomalies in the preliminary detection results are determined as candidate target parts to be detected, and the preliminary detection results are fused with the image visual features.

[0058] MLPs are used to transform fused image visual features into visual tokens that are consistent with the feature space of the generative base model. For example, a two-layer MLP projector can be used to align the extracted visual features with the LLM input space.
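As a rough illustration of such a two-layer projector, the NumPy sketch below maps encoder features to the LLM embedding width. The class name `TwoLayerProjector`, the GELU activation, the random initialization, and the toy dimensions are all assumptions; the text only states that a two-layer MLP aligns the visual features with the LLM input space.

```python
import numpy as np


def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


class TwoLayerProjector:
    """Maps (num_tokens, d_vis) visual features to (num_tokens, d_llm) tokens."""

    def __init__(self, d_vis, d_hidden, d_llm, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.02, size=(d_vis, d_hidden))
        self.b1 = np.zeros(d_hidden)
        self.w2 = rng.normal(0.0, 0.02, size=(d_hidden, d_llm))
        self.b2 = np.zeros(d_llm)

    def __call__(self, feats):
        # Linear -> GELU -> Linear, applied to every token independently.
        return gelu(feats @ self.w1 + self.b1) @ self.w2 + self.b2


visual_feats = np.ones((8, 16))            # 8 tokens with 16-dim features (toy sizes)
projector = TwoLayerProjector(16, 32, 64)
visual_tokens = projector(visual_feats)    # shape (8, 64)
```

Because the projection acts per token, the number of tokens is unchanged and only the embedding width is adapted.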

[0059] The large language model (LLM) is fine-tuned to perform location-guided multi-round iterative interleaved inference based on the detection task text samples, the visual tokens, and the location features. In this location-guided multi-round iterative interleaved inference, the inference process is not completed all at once but in multiple rounds. The result of each round influences the input of the next round, forming an iteratively optimized inference chain. The "interleaving" is reflected in the alternating use and mutual verification of text and visual features to achieve accurate analysis of complex medical images. In each generation process, multiple thought chain steps are used: the previous round of inference determines the candidate target locations for the next round, so that the image detection model performs the next round of inference based at least on the location features of those candidate target locations. The next round therefore considers not only the overall visual tokens of the medical image (or medical image sample) to be detected and the detection task text (or detection task text sample), but also the location features of the candidate target locations output by the previous round. Optionally, to make the model pay more attention to the candidate target locations in the new round, enhanced prompt words can be generated that direct further attention to them, thus achieving progressive and comprehensive detection. In one feasible approach, a candidate target location can be represented by a region of interest (ROI). For example, the result of the previous thought chain step can indicate the ROI to be explored in the next thought chain step. The next thought chain step uses the result of the previous step as a reference, and this process is iterated until no new ROI appears. Thus, the output of the LLM in each round is used to refine the selection of the ROI (corresponding to a candidate target location), and the features of the obtained ROI (the location features of the candidate target location) are fed back into the next round of inference. Repeating this process enables the LLM to perform multi-round iterative interleaved inference with a built-in self-verification function, thereby maintaining the alignment of visual and textual information in each iteration.
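The control flow of this iteration, stopping when no new ROI appears, can be sketched as follows. Everything here is a hypothetical stand-in: `llm_step` represents one LLM reasoning round and is assumed to return a text result plus the names of newly proposed parts, `toy_llm_step` is a hard-coded stub used only to make the sketch runnable, and `max_rounds` is an added safety limit not mentioned in the text.

```python
def interleaved_reasoning(task_text, global_tokens, part_features, llm_step, max_rounds=8):
    """Run reasoning rounds until a round proposes no unseen candidate part."""
    # First round: only the task text and the global visual tokens are given.
    result, proposed = llm_step(task_text, global_tokens, None, None, None)
    history = [result]
    visited = []
    queue = [p for p in proposed if p in part_features]
    while queue and len(history) < max_rounds:
        part = queue.pop(0)
        if part in visited:
            continue
        visited.append(part)
        prompt = f"Please pay more attention to the {part}."
        # Later rounds also receive the previous result, the ROI features and the prompt.
        result, proposed = llm_step(task_text, global_tokens, result, part_features[part], prompt)
        history.append(result)
        queue.extend(p for p in proposed if p in part_features and p not in visited)
    return history, visited


def toy_llm_step(task_text, global_tokens, previous_result, roi_features, prompt):
    """Hard-coded stub standing in for the LLM: returns (text, proposed parts)."""
    if prompt is None:
        return "Round 1: the liver needs a closer look.", ["liver"]
    if "liver" in prompt:
        return "Round 2: liver finding noted; check the spleen.", ["spleen"]
    return "Round 3: spleen reviewed; no further parts.", []


history, visited = interleaved_reasoning(
    "Is there a liver lesion in this CT image?",
    None,                                   # global tokens are unused by the stub
    {"liver": [0.1], "spleen": [0.2]},      # placeholder part features
    toy_llm_step,
)
```

With the stub, the loop runs three rounds and visits the liver and then the spleen, mirroring the liver-to-spleen example used elsewhere in the description.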

[0060] In this embodiment, for ease of description, multi-round iterative interleaved reasoning is divided into first-round reasoning and non-first-round reasoning. During the training phase, the first-round reasoning is based on the detection task text sample, whereas during the inference phase it is based on the detection task text input by the user. Since this embodiment illustrates the model training process, the first-round reasoning in this step is based on the detection task text sample. Specifically, based on the detection task text sample and the image visual features of the medical image sample, the generative base model performs a reasoning generation operation to obtain an initial reasoning result. This initial reasoning result carries the candidate target locations associated with the initial target location indicated in the detection task text sample. In practical applications, a diagnostic knowledge graph can be used in this process to determine the candidate target locations associated with the initial target location. For example, if the detection task text sample indicates that the liver is to be examined and an abnormality is detected there, the location associated with the liver is first determined to be the spleen according to the diagnostic knowledge graph stored in the knowledge base. Using this information, the generative base model performs reasoning generation to obtain the spleen as the candidate target location, and this information is passed to the next round of reasoning in the thought chain.
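The knowledge-graph lookup in this example can be pictured as a simple adjacency mapping. This is a sketch under assumptions: the dictionary `DIAGNOSTIC_GRAPH` and the function `related_candidates` are hypothetical, and only the liver-to-spleen association comes from the text; a real diagnostic knowledge graph would be far richer.

```python
# Hypothetical diagnostic knowledge graph: part -> parts to examine next
# when that part is abnormal. Only the liver -> spleen edge is from the text.
DIAGNOSTIC_GRAPH = {
    "liver": ["spleen"],
    "spleen": [],
}


def related_candidates(initial_part, is_abnormal, graph=DIAGNOSTIC_GRAPH):
    """Return the parts associated with initial_part, or [] if it looks normal."""
    if not is_abnormal:
        return []
    # Copy the list so callers cannot mutate the graph by accident.
    return list(graph.get(initial_part, []))


candidates = related_candidates("liver", is_abnormal=True)
```

Parts that are absent from the graph simply yield no candidates, so the reasoning would stop after the first round in that case.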

[0061] For non-first-round reasoning, the location features of the candidate target locations obtained in the previous round are first acquired, and enhanced prompt word samples are constructed from the information on those candidate target locations. Then, based on the detection task text sample, the image visual features of the medical image sample, the reasoning result of the previous round, the location features of the candidate target locations, and the enhanced prompt word samples, the generative base model is trained on the current round of reasoning generation to obtain the reasoning result sample of the current round. The enhanced prompt word samples guide the next round of reasoning toward the candidate target locations, for example "Please pay more attention to the spleen." On this basis, the generative base model (e.g., the LLM) combines the detection task text sample, the image visual features of the medical image sample, the reasoning result of the previous round, and the location features of the candidate target locations to carry out the thought chain reasoning for the spleen. This round of reasoning may yield the final result or may produce new candidate target locations.

[0062] The following is a specific computational example to illustrate the above process of model training based on multi-round iterative interleaved inference.

[0063] Step 1: First round of reasoning steps.

[0064] In the first reasoning step, the image visual features obtained by encoding the medical image sample with the visual encoder are denoted $T_g$, and the detection task text sample corresponding to the medical image sample is denoted $Q$. Both are passed into the LLM. This step can be represented as:
$$R^{(1)} = \mathrm{LLM}(T_g, Q) \tag{3}$$
where $R^{(1)}$ represents the initial reasoning result output by the LLM in its first reasoning step.

[0065] Step 2: Based on the results of the previous reasoning steps, conduct interleaved self-reflection and candidate target site localization (organ localization).

[0066] This step includes: (i) Keyword extraction and organ identification. For example, based on the output of the previous reasoning step, the local region of interest (ROI) where the candidate target location lies and its corresponding mask (a local feature obtained through the segmentation model) can be determined according to Formula (4) below, so as to obtain the visual tokens of the local ROI.

[0067]
$$ROI^{(k)} = \mathrm{Extract}\left(T_g, M^{(k)}\right), \quad \text{s.t. } M^{(k)} = \mathrm{Match}\left(R^{(k)}, M\right) \tag{4}$$
where $\mathrm{Match}(\cdot)$ denotes the candidate target location matching operation; $\mathrm{Extract}(\cdot)$ denotes the operation of extracting, based on the mask, local visual tokens $ROI^{(k)}$ of size $n \times K$, that is, $n$ tokens with an embedding dimension of $K$; s.t. denotes a constraint condition; $T_g$ denotes the image visual features; $M$ represents the organ masks produced by the segmentation model; and $M^{(k)}$ represents the mask of the candidate target location matched from $M$ according to the reasoning result $R^{(k)}$ of round $k$.

[0068] It should be noted that Formula (4) applies to extracting the visual tokens of candidate target locations in every round of reasoning, so the round is indicated by the superscript: $ROI^{(1)}$ denotes the visual tokens extracted based on the first round, that is, the initial reasoning step, and $M^{(1)}$ denotes the location mask obtained based on the first round of reasoning.
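The matching and mask-based extraction in Formula (4) can be illustrated with a toy NumPy example. The sketch assumes, beyond what the text states, that each organ mask has already been downsampled to the token grid so that one boolean cell corresponds to one visual token, and that matching is a plain substring search for the organ name; the function names are hypothetical.

```python
import numpy as np


def match_mask(reasoning_text, organ_masks):
    """Return (name, mask) for the first organ name mentioned in the text, else None."""
    text = reasoning_text.lower()
    for name, mask in organ_masks.items():
        if name in text:
            return name, mask
    return None


def extract_roi_tokens(global_tokens, token_mask):
    """Keep the rows of global_tokens whose grid cell lies inside the mask."""
    return global_tokens[token_mask.reshape(-1)]


grid = (2, 2, 2)                                            # toy token grid: 8 tokens
tokens = np.arange(8 * 4, dtype=np.float32).reshape(8, 4)   # embedding dimension 4
liver = np.zeros(grid, dtype=bool)
liver[0, 0, :] = True                                       # liver covers 2 cells
spleen = np.zeros(grid, dtype=bool)
spleen[1, 1, 1] = True                                      # spleen covers 1 cell
masks = {"liver": liver, "spleen": spleen}

name, mask = match_mask("Suspected lesion; please check the spleen.", masks)
roi_tokens = extract_roi_tokens(tokens, mask)               # shape (1, 4)
```

The number of extracted tokens varies with the size of the matched organ, which is why the local token count is a separate quantity from the global one.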

[0069] (ii) Enhanced prompt word sample construction.

[0070] After extracting the location features of the candidate target locations, enhanced prompt word samples can be constructed. An example of an enhanced prompt word sample is:
$$P^{(1)} = \left(\text{"Please pay more attention to"},\ roi^{(1)}\right) \tag{5}$$
Note that $ROI$ in Formula (4) is uppercase while $roi$ in Formula (5) is lowercase, and the two have different meanings: the uppercase $ROI^{(1)}$ denotes the extracted local visual tokens, whereas the lowercase $roi^{(1)}$ denotes the region, identified from the previous reasoning result $R^{(1)}$, where the candidate target location lies.
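A small sketch of how such an enhanced prompt could be assembled from the names of the candidate regions. The function name `build_enhanced_prompt` and the handling of several regions at once are assumptions; the text only gives the single-region pattern "Please pay more attention to ...".

```python
def build_enhanced_prompt(roi_names):
    """Compose an attention prompt naming each candidate region once, in order."""
    unique = list(dict.fromkeys(roi_names))   # drop duplicates, keep order
    if not unique:
        return ""
    if len(unique) == 1:
        listed = unique[0]
    else:
        listed = ", ".join(unique[:-1]) + " and " + unique[-1]
    return f"Please pay more attention to the {listed}."


prompt = build_enhanced_prompt(["spleen"])
```

Returning an empty string when no region is proposed gives the caller a simple signal that no further round is needed.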

[0071] (iii) Generate new input for the next round of iterative reasoning.

[0072] The image visual features (denoted $T_g$), the detection task text sample (denoted $Q$), the result of the previous round of reasoning (such as the initial reasoning result $R^{(1)}$ mentioned above), the location features of the obtained candidate target locations (such as the uppercase $ROI^{(1)}$ mentioned above), and the enhanced prompt word sample (such as $P^{(1)}$, built from the lowercase $roi^{(1)}$ mentioned above) are combined to form the new input for the next round of thought chain reasoning in the LLM. For example, the input for the second round of thought chain reasoning, $X^{(2)}$, can be represented as:
$$X^{(2)} = \left[T_g,\ Q,\ R^{(1)},\ ROI^{(1)},\ P^{(1)}\right] \tag{6}$$

Step 3: Iterative causal reasoning.

[0073] After completing the above self-reflection, the model can continue to perform the next round of iterative reasoning based on the newly generated input. If the next round of iterative reasoning introduces additional relevant candidate target parts, the ROI localization and feature extraction process will be repeated to sequentially obtain the features of all candidate target parts involved in the thought chain for further verification.

[0074] Suppose the reasoning response of the entire thought chain is denoted $R$. It can be represented as:
$$R = \mathrm{LLM}(X), \quad X = \left[T_g,\ Q,\ \left\{ROI^{(k)},\ P^{(k)}\right\}_{k}\right] \tag{7}$$
where $X$ represents all input tokens of the LLM, which can be regarded as the combination of the image visual feature tokens $T_g$, the detection task text sample tokens $Q$, the location features $ROI^{(k)}$ of the candidate target locations, and their corresponding enhanced prompt word samples $P^{(k)}$; and $k$ is the index of a relevant candidate target location inferred by the LLM.

[0075] Based on this, referring to Formulas (6) and (7) above, each round of iterative interleaved reasoning other than the first round can be expressed as:
$$R^{(k+1)} = \mathrm{LLM}\left(T_g,\ Q,\ R^{(k)},\ ROI^{(k)},\ P^{(k)}\right) \tag{8}$$
where $R^{(k)}$ is the reasoning result of round $k$, and $ROI^{(k)}$ and $P^{(k)}$ are the location features and the enhanced prompt word sample derived from it. For example, in a case where a patient is suspected of having cirrhosis and the imaging findings show splenomegaly and altered liver lobe proportions, the model automatically locates the spleen and extracts its features. Before the original question (i.e., the detection task text sample) and the previous answer (i.e., the reasoning result of the previous thought chain step) are input into the LLM, a relevant hint, i.e., an enhanced prompt word sample, is added. The model performs multiple rounds of causal verification on each candidate target location derived from the successive reasoning rounds until all relevant candidate target locations have undergone feature extraction and causal reflection.

[0076] To achieve coordinated optimization among the multiple components of the model, one optional approach is to jointly optimize the auxiliary classification model and the LLM using a loss function. This loss function $\mathcal{L}$ can be represented as:
$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\frac{1}{T}\sum_{t=1}^{T}\log p\left(r_t^{(i)} \mid X^{(i)},\ r_{<t}^{(i)}\right) + \lambda\, \mathcal{L}_{\mathrm{cls}}\left(\hat{y},\ y\right) \tag{9}$$
where $T$ indicates the length of the text in the model's predicted output; $N$ indicates the number of training samples; $X^{(i)}$ is the input of the multi-round iterative interleaved reasoning for sample $i$ and $r_t^{(i)}$ is the $t$-th token of its reasoning text; $\lambda$ is a loss weight; and $\hat{y}$ and $y$ are, respectively, the predicted and actual values (the labeled values in the training samples) of the classification results (normal / abnormal) corresponding to the candidate target parts. During inference, the segmentation model uses the ROI location (the location of the candidate target part) provided by the LLM to extract the visual features (part features) of the candidate target parts, and then feeds these features back to the LLM for self-validation.
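To make the joint objective concrete, the NumPy sketch below combines a text generation loss with a weighted classification loss. It is an illustration under assumptions, not the formula of the application: the text loss is taken to be the mean negative log-likelihood of the target tokens, the classification loss is taken to be binary cross-entropy, and the function names, dictionary keys, and default weight of 0.5 are hypothetical.

```python
import numpy as np


def token_nll(token_probs):
    """Mean negative log-likelihood of the target tokens (one probability per token)."""
    p = np.clip(np.asarray(token_probs, dtype=np.float64), 1e-12, 1.0)
    return float(-np.mean(np.log(p)))


def binary_cross_entropy(pred_abnormal, labels):
    """Mean binary cross-entropy; labels are 1 (abnormal) or 0 (normal)."""
    p = np.clip(np.asarray(pred_abnormal, dtype=np.float64), 1e-12, 1.0 - 1e-12)
    y = np.asarray(labels, dtype=np.float64)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def joint_loss(samples, weight=0.5):
    """Average over samples of text NLL plus weight times the classification loss."""
    per_sample = [
        token_nll(s["token_probs"])
        + weight * binary_cross_entropy(s["cls_probs"], s["cls_labels"])
        for s in samples
    ]
    return float(np.mean(per_sample))


# Toy sample: two target tokens predicted with probability 0.5 each,
# one candidate part labeled abnormal and predicted with probability 0.5.
toy = {"token_probs": [0.5, 0.5], "cls_probs": [0.5], "cls_labels": [1]}
loss = joint_loss([toy], weight=1.0)
```

For the toy sample both terms equal ln 2, so the joint loss with a weight of 1.0 is 2 ln 2.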

[0077] As can be seen from the above process, the segmentation model in the entire model can locate candidate target parts through ROIs, continuously providing accurate local features for the inference process of the entire model; the auxiliary classification model ensures discriminative learning of local features during the entire training and optimization process, enabling the visual encoder to effectively distinguish between abnormal and normal patterns, and preventing subtle anomalies from being masked during LLM training; LLM integrates the inference results and uses the segmentation model for iterative decision-making. Therefore, compared with two-dimensional image detection, the solution of this application embodiment achieves finer-grained alignment in visual representation and text description.

[0078] The supervised fine-tuning training of the generative base model using training samples is performed iteratively multiple times until the training termination condition is met. After training is complete, an image detection model for image detection is obtained.

[0079] As can be seen, this embodiment effectively constructs and defines the thought chain, enabling the obtained image detection model to effectively detect and process medical images, especially 3D medical images. During training, the segmentation model can generate spatial organ masks (localization of candidate target sites); a lightweight auxiliary classification model built based on the features of the localized local organs (site features) can enhance the visual encoder's anomaly detection capability at the organ level; and the Large Language Model (LLM) can perform multimodal integration and advanced clinical reasoning. Thus, it is possible to achieve topologically aware global-local information fusion, reduce hallucinations, and improve end-to-end pathological assessment from examination results to initial impressions and then to pathological diagnosis.

[0080] The image detection method provided in the embodiments of this application, which is based on the image detection model obtained through the above training method, is described below with reference to Figure 3A and Figure 3B.

[0081] First, referring to Figure 3A, a flowchart of an image detection method according to an embodiment of this application is shown. The image detection method includes the following steps:

Step S302: Obtain the medical image to be detected and the detection task text for the medical image.

[0082] In this embodiment, the medical image to be detected can be any form of medical image uploaded by the user through a user device, in particular a three-dimensional medical image. A three-dimensional medical image refers to volumetric data composed of multiple continuous or non-continuous two-dimensional slices and represented in three-dimensional space as a regular or irregular voxel grid. It reflects the spatial distribution and physical / physiological characteristics of the internal structures of the human body in three dimensions, and includes but is not limited to CT, MRI, PET, and ultrasound volumetric images. At the same time, the user inputs a detection task text that matches the medical image. The detection task text describes the detection target, location, or task type in natural language or structured text, indicating which part, organ, or disease of the medical image is to be detected (the initial target location). The user device uploads the medical image and the detection task text to the server through an interface provided by the server, and the server obtains the medical image to be detected and its corresponding detection task text through this interface.

[0083] An example is shown in Figure 3B. In this example, the medical image to be detected is a CT image that includes multiple image slices. The detection task text corresponding to this CT image is "Please analyze whether there is a liver lesion in this CT image?". The detection task text thus carries the initial target location, namely "liver".

[0084] Step S304: Perform visual feature extraction and site segmentation on the medical image to obtain the corresponding visual features of the image and the site features of multiple sites included in the medical image.

[0085] Visual feature extraction extracts features at the level of the whole medical image to obtain its overall visual features. In one optional approach, a visual encoder as shown in Figure 3B (labeled "3D Vision Encoder" in Figure 3B) can be used to encode the medical image and thereby extract its image visual features.

[0086] Since a medical image typically contains more than one tissue or organ, and a lesion in one tissue or organ can potentially affect other adjacent tissues and organs, this embodiment of the application first performs site segmentation on the medical image to obtain the site features of multiple sites. In one optional approach, a segmentation model as shown in Figure 3B (labeled "Segmentation Expert Model" in Figure 3B) can be used to segment the medical image and obtain the site features of the multiple sites included in the medical image, for example site features obtained through site masks (shown as "Organ Mask List" in Figure 3B). For example, the segmentation model can identify the precise position of each site in three-dimensional space based on its site mask (a binary or probability map) and, based on the site mask, extract from the image visual features the local visual features corresponding to each site, i.e., the site features.

[0087] Step S306: Based on the detection task text, image visual features, and location features, perform multi-round iterative interleaved reasoning guided by location through the image detection model to obtain the detection results of the medical image.

[0088] In this embodiment of the application, the image detection model can be a hybrid generative model comprising multiple components. As shown in Figure 3B, it includes a visual encoder (labeled "3D Vision Encoder" in Figure 3B), a segmentation model (labeled "Segmentation Expert Model"), an auxiliary classification model (labeled "Auxiliary Classification Model"), an MLP, and an LLM (labeled "Large Language Model"). These components work collaboratively to perform medical image detection. During detection, location-guided multi-round iterative interleaved reasoning is employed. In this approach, the reasoning process is not completed all at once but in multiple rounds. The result of each round influences the input of the next, forming an iteratively optimized reasoning chain. The "interleaving" is reflected in the alternating use and mutual verification of textual and visual features to achieve accurate analysis of complex medical images. In this multi-round iterative interleaved reasoning, the previous round determines the candidate target locations for the next round, so that the image detection model performs the next round of reasoning based at least on the location features of these candidate target locations.

[0089] During the reasoning process, location guidance starts from the location indicated in the detection task text (the initial target location) and uses, as guiding information, the features of the anatomical locations that the lesion may involve and that are related to the initial target location, so that the image detection model focuses on the relevant locations for further analysis and achieves comprehensive and complete detection of the lesion.

[0090] Furthermore, to enhance the visual encoder's ability to discriminate abnormalities at the site (organ) level and improve lesion detection efficiency, in one optional approach, after the site features corresponding to multiple sites are obtained, preliminary abnormality detection can be performed based on these site features to obtain a preliminary detection result for each site; the sites whose preliminary detection results indicate an abnormality are then identified as candidate target sites to be detected. For example, in the example shown in Figure 3B, an auxiliary classification model (labeled "Auxiliary Classification Model" in Figure 3B) performs the preliminary abnormality detection to identify abnormal sites. The preliminary detection results are then passed to the visual encoder ("3D Vision Encoder") so that it can focus on these abnormal sites when extracting features through encoding.

[0091] In addition, as shown in Figure 3B, in one optional approach, before the location-guided multi-round iterative interleaved reasoning is performed through the image detection model based on the detection task text, image visual features, and part features, the preliminary detection results can be fused with the image visual features, and a multilayer perceptron (labeled "MLP" in Figure 3B) can convert the fused image visual features into global visual tokens consistent with the feature space of the image detection model (shown as "Global Tokens" in Figure 3B). Accordingly, performing location-guided multi-round iterative interleaved reasoning through the image detection model based on the detection task text, image visual features, and part features can be implemented as: performing location-guided multi-round iterative interleaved reasoning through the image detection model based on the detection task text, the visual tokens, and the part features.

[0092] In this embodiment, multi-round iterative interleaved reasoning is divided into first-round reasoning and non-first-round reasoning. Non-first-round reasoning is usually executed multiple times, although in some cases it may be executed only once. Every non-first round of reasoning therefore has a preceding round.

[0093] For the first round of reasoning, in one optional approach, reasoning generation can be performed through the image detection model based on the detection task text and the image visual features of the medical image to obtain an initial reasoning result. This initial reasoning result carries the candidate target locations associated with the initial target location indicated in the detection task text. In the example shown in Figure 3B, the first round of reasoning is shown as "Step I" (reasoning step 1). As can be seen from Figure 3B, the input for this first round includes the detection task text and the image visual features of the medical image (shown as the global visual features "Global Tokens" in Figure 3B). It should be noted, however, that in practical applications the location features of the initial target location can also be taken into account during the first round: based on the detection task text, the image visual features of the medical image, and the location features of the initial target location, reasoning generation is performed through the image detection model to obtain the initial reasoning result.

[0094] For non-first-round reasoning, one optional approach is to obtain the location features of the candidate target locations determined in the previous round of reasoning and to construct enhanced prompt words from the information on those candidate target locations. Based on the detection task text, the image visual features, the reasoning result of the previous round, the location features of the candidate target locations, and the enhanced prompt words, the image detection model performs the current round of reasoning generation to obtain the reasoning result of the current round. The enhanced prompt words guide the LLM to reason about the candidate target locations, improving reasoning efficiency.

[0095] Where the preliminary anomaly detection is used, step S306 can optionally be implemented as follows: based on the detection task text, the image visual features of the medical image, and the location features of the location that the previous round of reasoning selected from the candidate target locations to be detected, location-guided multi-round iterative interleaved reasoning is performed through the image detection model to obtain the detection result of the medical image.

[0096] In Figure 3B, the non-first-round reasoning includes "Step II" (reasoning step 2), "Step III" (reasoning step 3), and so on, up to the last reasoning step, "LastStep". As shown in the figure, the first round of reasoning yields the initial reasoning result, denoted here $R^{(1)}$, and the enhanced prompt words $P^{(1)}$ are built from it. In the second round of reasoning, a new input is constructed from the enhanced prompt words $P^{(1)}$ and the location features of the candidate target site identified from $R^{(1)}$ (the liver in this example), and this new input serves as the input of the second round of LLM reasoning; the second round yields the corresponding reasoning result, shown in Figure 3B as $R^{(2)}$, from which the enhanced prompt words $P^{(2)}$ are built. In the third round of reasoning, a new input is constructed from the enhanced prompt words $P^{(2)}$ and the location features of the candidate target site identified from $R^{(2)}$ (the spleen in this example), and this new input serves as the input of the third round of LLM reasoning; the third round yields the corresponding reasoning result, shown in Figure 3B as $R^{(3)}$, from which the enhanced prompt words $P^{(3)}$ are built. This continues until the final round of reasoning is completed and the final detection result of the medical image is obtained.

[0097] In one example, the detection result can be presented as full text. However, it is not limited to this. For easier understanding and viewing, in an alternative approach, the medical image can be labeled with corresponding body parts based on the detection results, so as to view the condition of at least one body part involved in the lesion.

[0098] In addition, to further enhance the user experience and make the model's reasoning process easier to follow, in one optional approach, for each round of iterative interleaved reasoning, after the reasoning result of the current round is obtained, the medical image is annotated with the corresponding body part so that the model's reasoning process is shown more clearly.

[0099] The following section provides a further explanation of the algorithm for the above-mentioned multi-round iterative interleaved reasoning using examples of the calculation process.

[0100] In this example, separate symbols denote the 3D vision encoder, the segmentation model, the auxiliary classification model, the MLP, the large language model, the detection task text, the medical image, the chain-of-thought response, and the model output. Image detection based on the detection task text and the medical image can then be expressed as Formula (10), in which global and local visual alignment is achieved. For the global visual tokens, i.e., the image visual features, the 3D vision encoder takes the medical image as input and encodes a 3D CT image comprising multiple slices into a series of visual tokens. All visual tokens are then aligned to the LLM space via the MLP, giving the global visual tokens (image visual features), whose size is the number of tokens by the embedding dimension K.
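The alignment of encoder outputs to the LLM space can be illustrated with a small projection sketch. It assumes a two-layer MLP with a ReLU in between, which the text does not specify; `matmul`, `relu`, and `project_tokens` are hypothetical helper names, and plain Python lists stand in for tensors.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of an (n x d) and a (d x k) matrix."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m: Matrix) -> Matrix:
    """Element-wise ReLU."""
    return [[max(0.0, v) for v in row] for row in m]


def project_tokens(tokens: Matrix, w1: Matrix, w2: Matrix) -> Matrix:
    """Map N encoder tokens (N x D) to N tokens in the LLM embedding space (N x K).

    Assumption: a bias-free two-layer MLP; the real projector may differ.
    """
    return matmul(relu(matmul(tokens, w1)), w2)
```

For example, two encoder tokens of width 2 projected with a 2 x 3 output matrix give a 2 x 3 result, i.e. the number of tokens by the embedding dimension K = 3.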

[0101] For the local visual tokens, i.e., the part features, the segmentation model first segments the image into part (organ) masks, which provide fine-grained location information for the multi-round iterative interleaved reasoning. Based on the prompt in each round of reasoning (the detection task text in the first round and the enhanced prompt in subsequent rounds), the multi-round iterative interleaved reasoning matches the parts (organs) against these masks to identify the masks of the candidate target parts (organs) and extracts the corresponding local visual tokens, for example using the aforementioned Formula (4).
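As a rough illustration of matching parts named in a prompt to their masks and pooling features under a mask, consider the sketch below. Formula (4) is not reproduced in this passage, so masked average pooling is used here purely as an assumed stand-in; `match_parts` and `masked_average` are hypothetical names, and the volume is flattened to a list of per-voxel feature vectors for simplicity.

```python
from typing import Dict, List


def match_parts(prompt: str, masks: Dict[str, List[int]]) -> List[str]:
    """Return the segmented part names mentioned in the prompt (case-insensitive)."""
    text = prompt.lower()
    return [name for name in masks if name.lower() in text]


def masked_average(features: List[List[float]], mask: List[int]) -> List[float]:
    """Average the per-voxel feature vectors selected by a binary mask.

    Assumption: simple masked average pooling, not necessarily Formula (4).
    """
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        raise ValueError("mask selects no voxels")
    dim = len(selected[0])
    return [sum(f[i] for f in selected) / len(selected) for i in range(dim)]
```

A prompt mentioning the liver would select the liver mask, and the pooled vector then serves as that part's local feature in the next round.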

[0102] In addition, to strengthen the learning process, this example also introduces an auxiliary classification model. The model maps the local visual tokens to the logits of two classes. This makes it possible to distinguish whether each local part (organ) is normal or abnormal, which significantly enhances the discriminative ability of the visual encoder. When the visual features and the prompt are input into the LLM together, the LLM's focus on key information and abnormal regions can be further improved.
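The two-class mapping can be sketched as a linear head followed by a softmax. The single linear layer, the class order (index 0 normal, index 1 abnormal), and the 0.5 threshold are all assumptions made for illustration; `two_class_logits`, `softmax`, and `is_abnormal` are hypothetical names.

```python
import math
from typing import List


def two_class_logits(feature: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Linear map from a part feature to [normal_logit, abnormal_logit]."""
    return [sum(w * x for w, x in zip(row, feature)) + b for row, b in zip(weights, bias)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_abnormal(feature: List[float], weights: List[List[float]], bias: List[float],
                threshold: float = 0.5) -> bool:
    """Flag a part as abnormal when the abnormal-class probability exceeds the threshold."""
    return softmax(two_class_logits(feature, weights, bias))[1] > threshold
```

Parts flagged this way could then be offered to the reasoning loop as candidate target parts.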

[0103] After all visual and text tokens are acquired, they are combined into interleaved tokens, which serve as the input to the LLM to generate the chain-of-thought (CoT) reasoning result, i.e., to perform interleaved reasoning. In one example, the reasoning result can be obtained using the aforementioned Formula (7).
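One way to picture the combined token sequence is shown below. The ordering (global tokens first, then each round's prompt tokens followed by that round's local tokens) is an assumption chosen for illustration, since Formula (7) is not reproduced here; `Token` and `build_interleaved_sequence` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Token:
    kind: str      # "global", "text", or "local"
    value: object  # embedding or token id; kept opaque in this sketch


def build_interleaved_sequence(
    global_tokens: Sequence[object],
    rounds: Sequence[Tuple[Sequence[object], Sequence[object]]],
) -> List[Token]:
    """Interleave text and local visual tokens after the global visual tokens.

    `rounds` holds one (prompt_tokens, local_tokens) pair per reasoning round.
    """
    sequence = [Token("global", g) for g in global_tokens]
    for prompt_tokens, local_tokens in rounds:
        sequence.extend(Token("text", t) for t in prompt_tokens)
        sequence.extend(Token("local", v) for v in local_tokens)
    return sequence
```

The first round typically carries no local tokens, so its pair would contain an empty local list.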

[0104] Through the above process, a single inference pass uses the collaborative interaction between the large model (LLM) and the small models (visual encoder, segmentation model, auxiliary classification model) to achieve the interleaved alignment of the global visual features, the detection task text (or enhanced prompt), and the visual features of specific local regions (the candidate target parts).

[0105] As can be seen, in this embodiment, in medical image detection scenarios and especially for three-dimensional medical images, the image visual features of the medical image and the part features of the multiple parts contained in it are obtained separately. On this basis, and in combination with the detection task text, multi-round iterative interleaved reasoning is performed, in which each round of reasoning predicts the candidate target parts associated with the part examined in that round. Because a lesion in one part may affect not only that part but also other parts associated with it, determining multiple candidate target parts makes it possible to establish the extent of the lesion's influence, avoid missed detections, provide richer and more comprehensive information for lesion detection, and improve the accuracy of the detection result.

[0106] The image detection method provided in the embodiments of this application is described below from the perspective of the user device, with reference to Figure 4A and Figure 4B.

[0107] As shown in Figure 4A, the image detection method includes the following steps: Step S402: Display the first human-computer interaction interface and receive, through that interface, the medical image input by the user and the detection task text for the medical image.

[0108] To facilitate user access to the image detection model of this application embodiment, a human-computer interaction interface is provided on the user device side. Users can upload medical images and detection task text for those medical images through this interface, which are then processed by the server. The medical images include, but are not limited to, three-dimensional medical images.

[0109] Step S404: Send the medical image and detection task text to the server so that the detection result of the medical image can be obtained by the image detection model on the server, based at least on the image visual features of the medical image and the detection task text, through multi-round iterative interleaved reasoning guided by the location.

[0110] After the user device sends the medical image and detection task text to the server, the aforementioned image detection model deployed on the server can perform multi-round iterative interleaved inference guided by location based on the received data to obtain the detection result of the medical image. The specific implementation of this process can be referred to the relevant description in the foregoing embodiments, and will not be repeated here.

[0111] After obtaining the detection result, the server returns it to the user device.

[0112] Step S406: Receive the detection results returned by the server and display the detection results through the second human-computer interaction interface.

[0113] After receiving the detection results for the medical images returned by the server, the user device can display them to the user through a human-computer interaction interface, i.e., a second human-computer interaction interface. These detection results include, but are not limited to, the size, location, and abnormal characteristics of the affected area.

[0114] In one optional approach, if the server also performs site labeling, the user device will also receive site labeling information from the server based on the detection results in this step. Based on this, and combined with the received detection results, the user device will display the detection results and the site-labeled medical image through a second human-computer interaction interface. This provides a more intuitive presentation of the abnormal sites in the medical image to the user.

[0115] Furthermore, in some cases, to facilitate user understanding of the inference process of the image detection model, the user device may optionally receive multi-round iterative interleaved inference process information returned by the server. Based on this, detection results and process information can also be displayed through a second human-computer interaction interface.
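Steps S402 to S406 on the user device can be sketched as follows. This is a minimal sketch under assumptions: the request and response shapes (`DetectionRequest`, `DetectionResponse`), the `run_detection` and `render` helpers, and the injected `send` callable are hypothetical and stand in for whatever transport and interface the real system uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class DetectionRequest:
    image_bytes: bytes  # uploaded medical image (S402)
    task_text: str      # detection task text (S402)


@dataclass
class DetectionResponse:
    result_text: str                                        # detection result
    annotations: Optional[Dict[str, list]] = None           # optional part labels
    process_info: List[str] = field(default_factory=list)   # optional per-round trace


def render(response: DetectionResponse) -> List[str]:
    """Build the lines shown on the second interface (S406)."""
    lines = [f"Result: {response.result_text}"]
    if response.annotations:
        lines.append("Labeled parts: " + ", ".join(sorted(response.annotations)))
    lines.extend(f"Step {i}: {step}" for i, step in enumerate(response.process_info, 1))
    return lines


def run_detection(image_bytes: bytes, task_text: str,
                  send: Callable[[DetectionRequest], DetectionResponse]) -> List[str]:
    """Validate the inputs, send them to the server (S404), and render the reply."""
    if not image_bytes:
        raise ValueError("a medical image is required")
    if not task_text.strip():
        raise ValueError("a detection task text is required")
    return render(send(DetectionRequest(image_bytes, task_text)))
```

Passing `send` as a parameter keeps the sketch independent of any particular network library; the annotation and process fields are optional, matching the optional approaches described above.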

[0116] The above process is illustrated below by example with reference to Figure 4B. As shown in Figure 4B, the first human-computer interaction interface of the user device includes a control for uploading medical images and a control for entering the detection task text for the medical image. Figure 4B shows an image input box into which users can drag and drop medical images, a "Select Image" button for choosing medical images by browsing files, and a text input box.

[0117] In this example, assume the user drags the medical image to be detected into the image input box and enters "Please detect whether there is a liver lesion in the image" (detection task text) into the text input box. After completing the image and text input, the user clicks the "Detect" button, and the medical image and detection task text in the image input box are uploaded to the server. In this example, the server, upon receiving the medical image and detection task text, directly invokes the image detection model deployed on the server. Based on the user-input medical image and detection task text, and combined with the diagnostic knowledge graph stored in the knowledge base, the image detection model performs multiple rounds of iterative and interleaved reasoning to obtain the detection result for the medical image. The server then returns the detection result to the user's device. Upon receiving the detection result, the user's device displays it through a second human-computer interaction interface.

[0118] This embodiment facilitates the use and interaction of the image detection model, reduces the operational complexity and burden for users in medical image detection, improves the efficiency of medical image detection, and enhances the user experience.

[0119] This application also provides a computer-aided diagnosis method. As shown in Figure 5, the method includes the following steps: Step S502: Obtain the three-dimensional medical image to be detected and the detection task text for the three-dimensional medical image.

[0120] In this embodiment, a CT image for tumor detection is used as an example. The detection task text indicates which area (initial target area) needs to be detected. However, as mentioned earlier, the lesion may involve other areas associated with the initial target area. This embodiment can effectively detect all areas involved in the lesion.

[0121] Step S504: Perform visual feature extraction and site segmentation on the three-dimensional medical image to obtain the corresponding image visual features and site features of multiple biological sites included in the three-dimensional medical image.

[0122] For example, the visual encoder and segmentation model shown in Figure 2B perform visual feature extraction and part segmentation on the three-dimensional medical image to obtain the corresponding image visual features and the part features of multiple parts.

[0123] Step S506: Based on the detection task text, the image visual features, and the part features, perform part-guided multi-round iterative interleaved reasoning through the image detection model to obtain the detection result of the three-dimensional medical image.

[0124] In the first round of reasoning, the image detection model reasons based at least on the detection task text and the image visual features. In each subsequent round, the previous round of reasoning determines the candidate target part for the next round, so that the image detection model performs the next round of reasoning based at least on the part features of that candidate target part.

[0125] Step S508: Based on the detection results, generate auxiliary diagnostic suggestions corresponding to the three-dimensional medical images.

[0126] Once the detection results are obtained, corresponding auxiliary diagnostic suggestions can be generated from them to support actual clinical diagnosis.
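Step S508 can be pictured as a simple mapping from per-part findings to suggestion text. The application does not say how suggestions are produced, so the `PartFinding` structure, the `generate_suggestions` helper, and the wording of the templates below are hypothetical placeholders rather than the described method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PartFinding:
    part: str         # body part name, e.g. "liver"
    abnormal: bool    # whether the detection result flags this part
    description: str  # free-text finding taken from the detection result


def generate_suggestions(findings: List[PartFinding]) -> List[str]:
    """Turn flagged findings into suggestion lines (assumed template wording)."""
    flagged = [f for f in findings if f.abnormal]
    if not flagged:
        return ["No abnormal parts were flagged."]
    return [f"{f.part}: {f.description}. Clinician review is suggested." for f in flagged]
```

Any such output is auxiliary only and is meant to support, not replace, the clinician's diagnosis.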

[0127] The following uses tumor detection as an example to illustrate the effectiveness of the image detection model provided in this application embodiment compared with other models in benchmark tests using specific experimental data, as shown in the table below:

[0128] In the table above, bold and underlined text represent best and second-best performance, respectively; B is used to indicate parameter size, for example, 3B represents 3 billion parameters and 7B represents 7 billion parameters; IIR represents iterative interleaved inference.

[0129] As can be seen from the table above, the image detection model of this application embodiment performs well in various sub-tasks based on the thought chain, achieving a high accuracy in the basic localization task; and compared with other models, it also achieves better scores in the more challenging pathological prediction and report generation tasks.

[0130] The image detection model in this solution demonstrates high accuracy in generating traceable reasoning (fact checking, information integration) and complex reasoning chains (long-range reasoning chains), and its clinical reasoning ability is significantly superior to general-purpose and medical models. Furthermore, the clinically traceable reasoning ability of the image detection model in this solution is stronger than that of commercial models.

[0131] As can be seen, the method provided in this embodiment can be effectively and widely applied to the auxiliary diagnosis of various clinical diseases, improving diagnostic efficiency and accuracy.

[0132] Figure 6 shows a structural schematic of an electronic device according to Embodiment 5 of this application. The specific embodiments of this application do not limit the specific implementation of the electronic device.

[0133] As shown in Figure 6, the electronic device may include: a processor 602, a communication interface 604, a memory 606, and a communication bus 608.

[0134] Wherein: the processor 602, communication interface 604, and memory 606 communicate with each other via the communication bus 608.

[0135] Communication interface 604 is used for communication with other electronic devices or servers.

[0136] The processor 602 is used to execute program 610, specifically to execute the relevant steps of any of the above-described method embodiments.

[0137] Specifically, program 610 may include program code that includes computer operation instructions.

[0138] The processor 602 may be a CPU, a GPU (Graphics Processing Unit), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of this application. The electronic device includes one or more processors, which may be processors of the same type, such as one or more CPUs; or they may be processors of different types, such as one or more CPUs and one or more ASICs.

[0139] Memory 606 is used to store program 610. Memory 606 may include high-speed RAM memory, and may also include non-volatile memory, such as at least one disk storage device.

[0140] Program 610 may include multiple computer instructions. Specifically, program 610 may use multiple computer instructions to cause processor 602 to perform the operation corresponding to any of the methods described in the foregoing multiple method embodiments.

[0141] The specific implementation of each step in program 610 can be found in the corresponding steps and units described in the above method embodiments, and has corresponding beneficial effects, which will not be repeated here. Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working process of the devices and modules described above can be referred to the corresponding process descriptions in the foregoing method embodiments, and will not be repeated here.

[0142] This application also provides a computer storage medium storing a computer program thereon, which, when executed by a processor, implements the method described in any of the foregoing method embodiments. The computer storage medium includes, but is not limited to, compact disc read-only memory (CD-ROM), random access memory (RAM), floppy disk, hard disk, or magneto-optical disk.

[0143] This application also provides a computer program product, including computer instructions that instruct a computing device to perform an operation corresponding to any of the methods in the above-described multiple method embodiments.

[0144] Furthermore, it should be noted that the user-related information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to sample data used for training the model, data used for analysis, stored data, displayed data, experimental data, etc.) involved in the embodiments of this application are all information and data authorized by the user or fully authorized by all parties. Moreover, the collection, use and processing of related data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation entry points are provided for users to choose to authorize or refuse.

[0145] It should be noted that, depending on the implementation needs, the various components / steps described in the embodiments of this application can be broken down into more components / steps, or two or more components / steps or parts of the operation of components / steps can be combined into new components / steps to achieve the purpose of the embodiments of this application.

[0146] The methods described in the embodiments of this application can be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or as computer code that is originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded over a network, and stored in a local recording medium. The methods described herein can thus be processed by such software, stored on a recording medium, using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware (such as an Application Specific Integrated Circuit (ASIC) or a Field Programmable Gate Array (FPGA)). It is understood that the computer, processor, microprocessor controller, or programmable hardware includes storage components (e.g., Random Access Memory (RAM), Read-Only Memory (ROM), flash memory, etc.) capable of storing or receiving software or computer code which, when accessed and executed by the computer, processor, or hardware, implements the methods described herein. Furthermore, when a general-purpose computer accesses code for implementing the methods shown herein, executing the code transforms the general-purpose computer into a dedicated computer for executing those methods.

[0147] Those skilled in the art will recognize that the units and method steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the embodiments of this application.

[0148] The above embodiments are only used to illustrate the embodiments of this application, and are not intended to limit the embodiments of this application. Those skilled in the art can make various changes and modifications without departing from the spirit and scope of the embodiments of this application. Therefore, all equivalent technical solutions also fall within the scope of the embodiments of this application, and the patent protection scope of the embodiments of this application should be defined by the claims.

Claims

1. An image detection method, comprising: acquiring a medical image to be detected and a detection task text for the medical image; performing visual feature extraction and part segmentation on the medical image to obtain corresponding image visual features and part features of a plurality of parts included in the medical image; and, based on the detection task text, the image visual features, and the part features, performing part-guided multi-round iterative interleaved reasoning through an image detection model to obtain a detection result of the medical image; wherein, in the multi-round iterative interleaved reasoning, a previous round of reasoning determines a candidate target part for the next round of reasoning, so that the image detection model performs the next round of reasoning based at least on the part features of the candidate target part.

2. The method according to claim 1, wherein, The process of obtaining the detection result of the medical image by performing multi-round iterative interleaved inference guided by the location through an image detection model, based on the detection task text, the image visual features, and the location features, includes: For non-first round reasoning in the multi-round iterative interleaved reasoning, based on the candidate target parts obtained in the previous round of reasoning, the location features of the candidate target parts are obtained, and enhanced prompt words are constructed based on the information of the candidate target parts; Based on the detection task text, the image visual features, the reasoning result of the previous round of reasoning, the location features of the candidate target parts, and the enhanced prompt words, the current round of reasoning is generated through the image detection model to obtain the reasoning result of the current round of reasoning.

3. The method according to claim 2, wherein, The method further includes: For the first round of reasoning in the multi-round iterative interleaved reasoning, the first round of reasoning is generated based on the detection task text and the image visual features through the image detection model to obtain the initial reasoning result, which carries candidate target parts associated with the initial target part.

4. The method according to any one of claims 1-3, wherein, The method further includes: Preliminary anomaly detection is performed based on the location features of multiple locations, and preliminary detection results are obtained for each location. The locations corresponding to the abnormal detection results in the preliminary detection results are identified as candidate target locations to be detected.

5. The method according to claim 4, wherein, The process of obtaining the detection result of the medical image by performing multi-round iterative interleaved inference guided by the location through an image detection model, based on the detection task text, the image visual features, and the location features, includes: Based on the detection task text, the image visual features, and the location features of the location determined from the candidate target locations to be detected in the previous round of reasoning, the image detection model performs multiple rounds of iterative interleaved reasoning guided by location to obtain the detection results of the medical image.

6. The method according to claim 4, wherein, Before performing multi-round iterative interleaved inference guided by the image detection model based on the detection task text, the image visual features, and the part features, the method further includes: fusing the preliminary detection results with the image visual features; and using a multilayer perceptron to convert the fused image visual features into visual tokens consistent with the image detection model space. The step of performing multi-round iterative interleaved reasoning guided by the image detection model based on the detection task text, the image visual features, and the part features includes: performing multi-round iterative interleaved reasoning guided by the image detection model based on the detection task text, the visual token, and the part features.

7. The method according to any one of claims 1-3, wherein, The method further includes: For each round of iterative interleaved reasoning, after obtaining the reasoning result of the current round, the medical image is labeled with the corresponding body parts; or, Based on the detection results of the medical image, the corresponding body parts are labeled in the medical image.

8. An image detection method, comprising: The system displays a first human-computer interaction interface and receives medical images input by the user and detection task text for the medical images through the first human-computer interaction interface. The medical image and the detection task text are sent to the server so that the detection result of the medical image can be obtained by the image detection model in the server, based at least on the image visual features of the medical image and the detection task text, through multi-round iterative interleaved reasoning guided by the location. The system receives the detection results returned by the server and displays the detection results through a second human-computer interaction interface.

9. The method according to claim 8, wherein, The method further includes: receiving process information of the multi-round iterative interleaved reasoning returned by the server; The step of displaying the detection results through a second human-computer interaction interface includes: displaying the detection results and the process information through a second human-computer interaction interface.

10. The method according to claim 8 or 9, wherein, The method further includes: receiving annotation information from the server that marks the medical image based on the detection results; The step of displaying the detection results through a second human-computer interaction interface includes: displaying the detection results and the medical image with site annotations based on the annotation information through the second human-computer interaction interface.

11. A computer-aided diagnostic method, comprising: Acquire the three-dimensional medical image to be detected and the detection task text for the three-dimensional medical image; Visual features are extracted and parts are segmented from the three-dimensional medical image to obtain the corresponding visual features of the image and the part features of multiple biological parts included in the three-dimensional medical image; Based on the detection task text, the image visual features, and the location features, a location-guided multi-round iterative interleaved reasoning is performed through an image detection model to obtain the detection result of the three-dimensional medical image; wherein, in the multi-round iterative interleaved reasoning, the previous round of reasoning determines the candidate target location for the next round of reasoning, so that the image detection model performs the next round of reasoning based at least on the location features of the candidate target location; Based on the detection results, auxiliary diagnostic suggestions corresponding to the three-dimensional medical image are generated.

12. An image detection model training method, comprising: obtaining training samples, the training samples including multiple medical image-text sample pairs and chain-of-thought visual question-answering instruction samples corresponding to the medical image samples in the medical image-text sample pairs, wherein each medical image-text sample pair includes a medical image sample together with a detection task text sample and a detection result text sample corresponding to the medical image sample; and using the training samples to fine-tune a generative base model through part-guided multi-round iterative interleaved reasoning to obtain an image detection model for medical image detection; wherein, in the multi-round iterative interleaved reasoning, a previous round of reasoning determines a candidate target part for the next round of reasoning, so that the generative base model performs the next round of reasoning based at least on the part features of the candidate target part.

13. The method according to claim 12, wherein, The step of using the training samples to fine-tune the generative base model through part-guided multi-round iterative interleaved inference includes: For non-first round reasoning in the multi-round iterative interleaved reasoning, based on the candidate target parts obtained in the previous round of reasoning, the part features of the candidate target parts are obtained, and enhanced prompt word samples are constructed based on the information of the candidate target parts; Based on the detection task text sample, the image visual features corresponding to the medical image sample, the reasoning result of the previous round of reasoning, the location features of the candidate target location, and the enhanced prompt word sample, the reasoning generation training of the current round is performed through the generative basic model to obtain the reasoning result sample of the current round of reasoning.

14. The method according to claim 13, wherein the generative base model includes a visual encoder, a segmentation model, an auxiliary classification model, a multilayer perceptron (MLP), and a large language model (LLM); wherein: the visual encoder is used to encode the medical image sample to obtain the image visual features corresponding to the medical image sample; the segmentation model is used to perform part segmentation on the medical image sample to obtain the part features of multiple parts included in the medical image sample; the auxiliary classification model is used to perform preliminary anomaly detection on the part features of the multiple parts to obtain preliminary detection results corresponding to the multiple parts respectively, to determine the parts whose preliminary detection results indicate an anomaly as candidate target parts to be detected, and to fuse the preliminary detection results with the image visual features; the MLP is used to convert the fused image visual features into visual tokens consistent with the space of the generative base model; and the LLM is used for fine-tuning training with part-guided multi-round iterative interleaved reasoning based on the detection task text samples, the visual tokens, and the part features.

15. The method according to any one of claims 12-14, wherein the chain-of-thought visual question-answering instruction samples include: question-answering instruction samples for locating parts in the medical image samples, question-answering instruction samples for performing abnormal attribute analysis on abnormal parts, question-answering instruction samples for predicting the abnormality staging of abnormal parts, and question-answering instruction samples for instructing the generation of chain-of-thought reports.

16. An electronic device, comprising: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with each other via the communication bus; and the memory is used to store at least one executable instruction that causes the processor to perform the operations corresponding to the method according to any one of claims 1-15.

17. A computer storage medium having a computer program stored thereon, which, when executed by a processor, implements the method as described in any one of claims 1-15.

18. A computer program product comprising computer instructions that instruct a computing device to perform an operation corresponding to any one of the methods described in claims 1-15.