A reasoning enhanced visual-language large model training and image processing method

By embedding clinical knowledge and guidelines into the visual-language model, and generating image descriptions and instructions with enhanced reasoning, the problem of insufficient clinical knowledge guidance and reasoning ability of the visual-language model in the detection of diabetic retinopathy is solved, achieving higher recognition accuracy and interpretability, and assisting in more accurate medical diagnosis.

CN120654766B · Active · Publication Date: 2026-06-19 · SHANGHAI ARTIFICIAL INTELLIGENCE INNOVATION CENT

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SHANGHAI ARTIFICIAL INTELLIGENCE INNOVATION CENT
Filing Date: 2025-05-14
Publication Date: 2026-06-19

IPC: G06N3/09; G06N3/045; G06N5/04; G06F40/284; G06V40/18; G06V10/764; G06V10/82; G16H50/20; G16H50/30; G16H50/50

AI Technical Summary

Technical Problem

Existing visual-language models lack clinical knowledge guidance and have insufficient reasoning ability in the detection of diabetic retinopathy, resulting in poor interpretability of results and difficulty in meeting the needs of efficient and accurate medical diagnosis.

Method used

By designing prompts that embed clinical knowledge and guidelines, the model generates image descriptions and instructions that enhance reasoning, integrates multimodal data, improves the interpretability of the model's decisions and the accuracy of its recognition, and employs a visual-language model for lesion recognition and DR grading of ultra-wide-angle fundus images.

Benefits of technology

This improved the interpretability and accuracy of the model's decision-making process, enhanced the model's recognition accuracy in DR detection and its fit with clinical practice, and assisted in more accurate and reliable medical diagnosis.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention relates to a reasoning-enhanced visual-language large-scale model training and image processing method. The training method includes the following steps: acquiring an ultra-wide-angle fundus image as the input image; using manually labeled DR grade, manually labeled lesion type, and clinical background from the ultra-wide-angle fundus image as prompt words; generating a reasoning-enhanced image description and the inferred DR grade and lesion type using a visual-language model with reasoning capabilities; using the generated image description and the inferred DR grade and lesion type as instructions; constructing a reasoning-enhanced instruction dataset by combining the ultra-wide-angle fundus image; and fine-tuning the reasoning-enhanced visual-language large-scale model. Compared with existing technologies, this invention has the advantages of effectively integrating clinical knowledge, high recognition accuracy, and strong interpretability.

Description

Technical Field

[0001] This invention relates to the field of medical image processing technology, and in particular to a reasoning-enhanced visual-language large model training and image processing method.

Background Technology

[0002] Diabetic retinopathy (DR), a common and serious complication of diabetes, is a leading cause of blindness in adults. With the continued rise in the number of people with diabetes worldwide, the need for accurate early detection of DR is becoming increasingly urgent. Fundus photography is the primary diagnostic tool for DR. Traditional DR detection relies mainly on ophthalmologists manually interpreting retinal images such as fundus photographs. DR manifests as a series of lesions on the retina; identifying the type and number of lesions according to clinical guidelines can further infer the severity of DR. However, this manual detection method has many limitations. First, manual interpretation depends on the doctor's professional experience and subjective judgment, and diagnostic results may vary significantly between different doctors, making it difficult to guarantee diagnostic consistency. Second, given the ever-increasing number of people with diabetes, manual interpretation is extremely inefficient and cannot meet the needs of large-scale screening. Furthermore, in areas with relatively scarce medical resources, the number of professional ophthalmologists is limited, further exacerbating the difficulty of timely diagnosis of diabetic retinopathy.

[0003] In recent years, with the rapid development of artificial intelligence technology, basic models trained using techniques such as self-supervised learning and visual language pre-training have been increasingly applied to the field of retinal image analysis, providing new ideas and methods for the detection of diabetic retinopathy. These techniques aim to utilize large-scale data for pre-training to learn general image features and semantic information, and then transfer this knowledge to specific downstream tasks to improve model performance and generalization ability. Currently, the relevant existing technical solutions mainly include:

[0004] (1) Pre-trained models based on self-supervised learning: such as RETFound, which uses a large number of unlabeled color fundus photographs and optical coherence tomography images for masked autoencoder pre-training. Through this self-supervised pre-training method, the model can learn the latent features and structural information in retinal images. In subsequent diabetic retinopathy detection tasks, these pre-trained features can be fine-tuned to adapt to specific task requirements. However, when transferring the pre-trained model to specific downstream classification tasks, this method often neglects the integration of multimodal data and the interpretability of model decisions.

[0005] (2) Models using the CLIP-like method: KeepFIT and FLAIR employ a CLIP-like approach (contrastive language-image pre-training), using paired retinal images and text to pre-train the visual encoder for diabetic retinopathy detection tasks. This method enhances the model's ability to recognize and diagnose retinal images by combining image and text information. However, when transferring to downstream classification tasks, the interpretability of the model's decision-making process is not fully considered. The model can only identify the visible lesion type and its grade separately, posing a risk that the co-occurrence relationship between lesions and grades may not conform to clinical guidelines, thus limiting its performance and reliability in practical applications.

[0006] For example, CN117671422A discloses a method for constructing a large ophthalmic model based on ophthalmic visual models and fine-tuning of instruction sets. The method includes constructing and training an ophthalmic visual model set, which includes an ophthalmic disease classification model and an ophthalmic lesion segmentation model; constructing an ophthalmic visual feature extraction and fusion module based on a pre-acquired visual text encoding module and the trained ophthalmic visual model set; constructing a large language model and connecting the ophthalmic visual feature extraction and fusion module with the large language model to obtain a large ophthalmic model; acquiring an ophthalmic instruction set and fine-tuning the large ophthalmic model based on the ophthalmic instruction set to obtain the final trained large ophthalmic model.

[0007] However, the existing methods have the following drawbacks: (1) Lack of clinical knowledge guidance: Clinical knowledge and guidelines are not fully integrated when generating image descriptions and instructions. In the diagnosis of diabetic retinopathy, the grading and identification of lesions need to be based on specific medical standards and clinical experience. However, such image-text pair-based visual-language models cannot accurately integrate this clinical knowledge into the model's decision-making process, which may cause the output results to be out of touch with actual clinical needs. For example, the correlation between the lesions detected by the model and their grading may not conform to diagnostic guidelines and clinical experience. (2) Insufficient reasoning ability leads to poor interpretability of results: Existing visual-language models mainly focus on the matching and generation of images and text. They have limited ability to handle complex medical reasoning tasks, and it is difficult for them to clearly explain their decision-making basis and reasoning process to users. For example, in the detection of diabetic retinopathy, doctors cannot understand which features and logic the model uses to determine the presence and grading of lesions. This reduces doctors' trust in the model results and limits the widespread application of the model in clinical practice.

Summary of the Invention

[0008] The purpose of this invention is to overcome the shortcomings of the existing technology by providing a reasoning-enhanced visual-language large-scale model training and image processing method. This method leverages the advantages of visual-language models in combining visual understanding and natural language generation to achieve automated generation of fundus image descriptions with reasoning logic, enabling them to closely reflect the decision-making process of ophthalmologists. Simultaneously, by utilizing fundus images and generated text, a higher-performance reasoning-enhanced visual-language large-scale model for DR detection is developed, achieving effective integration of multimodal data and improving the model's accuracy, interpretability, and clinical relevance in DR grading and lesion identification tasks, thus assisting in more accurate and reliable medical diagnosis.

[0009] The objective of this invention can be achieved through the following technical solutions:

[0010] The present invention aims to: (1) achieve effective integration of multimodal data. By designing prompts embedded with clinical knowledge and guidelines, and generating inference-enhanced image descriptions and explanations based on DR grading and lesion identification, data from multiple modalities (such as fundus images, medical text knowledge, etc.) are effectively integrated. When constructing an inference-enhanced visual-language large model for diabetic retinopathy detection, synthetic inference-enhanced instruction texts and fundus images are used for training, allowing the model to fully utilize the advantages of multimodal data and improve its performance in diabetic retinopathy grading and lesion identification tasks. (2) enhance the interpretability of model decisions. By utilizing the advanced visual understanding and language reasoning capabilities of the visual-language model, image descriptions and explanations that closely reflect the decision-making process of ophthalmologists are generated. In this way, the decision-making process and basis of the model in diabetic retinopathy grading and lesion identification can be clearly presented, enhancing doctors' trust in the model results, improving the model's practicality and acceptability in clinical practice, and assisting doctors in making more accurate and reliable diagnoses.

[0011] According to a first aspect of the present invention, a method for training an inference-enhanced vision-language large model is provided, wherein the inference-enhanced vision-language large model is used for lesion identification and DR classification of ultra-wide-angle fundus images, the method comprising the following steps:

[0012] An ultra-wide-angle fundus image is acquired as the input image. The manually labeled DR grade, manually labeled lesion type, and clinical background of the ultra-wide-angle fundus image are used as prompt words. A visual-language model with reasoning ability is used to generate a reasoning-enhanced image description and the reasoned DR grade and lesion type. The clinical background includes the grading criteria and lesion interpretation.

[0013] The generated image descriptions, along with the inferred DR grade and lesion type, are used as instructions. Combined with the ultra-wide-angle fundus images, an inference-enhanced instruction dataset is constructed, and the inference-enhanced visual-language large model is fine-tuned.

[0014] As a preferred technical solution, the visual-language model performs three tasks in a clinical tone according to the prompt words: image description, lesion identification, and DR grading. It adopts a structured and logic-driven format. First, it performs descriptive analysis based on the spatial location and appearance of clinically relevant visual features in the ultra-wide-angle fundus image to generate image descriptions. Then, based on the image descriptions and predefined clinical backgrounds, it performs lesion classification and severity grading in sequence.

[0015] As a preferred technical solution, the instructions include a one-stop instruction and a task-specific instruction. The one-stop instruction is a set of question-and-answer pairs corresponding to a certain ultra-wide-angle fundus image. The question in the question-and-answer pair is a predefined comprehensive question, and the answer is an image description generated by a visual-language model, as well as the DR grade and lesion type obtained through inference. The task-specific instruction is three sets of question-and-answer pairs corresponding to a certain ultra-wide-angle fundus image. The first set of question-and-answer pairs is a predefined image description question, and the answer is an image description generated by a visual-language model. The second set of question-and-answer pairs is a predefined lesion classification question, and the answer is the lesion type obtained through inference by a visual-language model. The third set of question-and-answer pairs is a predefined DR grade question, and the answer is the DR grade obtained through inference by a visual-language model.

[0016] As a preferred technical solution, the inference-enhanced vision-language large model includes a visual encoder, a multilayer perceptron, a text segmenter, and a large language model. The ultra-wide-angle fundus image is preprocessed and then input into a pre-trained visual encoder to obtain visual tags. The visual tags are mapped to a language tag space through the multilayer perceptron to obtain projected visual tags. The instructions are processed by the text segmenter to obtain language tags. The language tags and the projected visual tags are connected to obtain instruction tags, which are then input into the pre-trained large language model.

[0017] As a preferred technical solution, the image preprocessing specifically includes:

[0018] The ultra-wide-angle fundus image is evenly divided into multiple non-overlapping first local images according to a preset layout;

[0019] Centered on the centroid of the ultra-wide-angle fundus image, the ultra-wide-angle fundus image is cropped according to a preset side length ratio to obtain multiple second local images corresponding to different side length ratios;

[0020] The first and second local images are resized to a uniform preset size and then stitched together as input to the visual encoder.

[0021] According to a second aspect of the present invention, an image processing method is provided, the method comprising the following steps:

[0022] Acquire ultra-wide-angle fundus images;

[0023] The ultra-wide-angle fundus image and a predefined question are input into a reasoning-enhanced visual-language large model trained using the method described above, which outputs DR classification and lesion type.

[0024] According to a third aspect of the present invention, an electronic device is provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the program to implement the training method / image processing method described above.

[0025] According to a fourth aspect of the present invention, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the training method / image processing method described above.

[0026] Compared with the prior art, the present invention has the following beneficial effects:

[0027] (1) Existing technologies lack effective integration of clinical knowledge. This invention utilizes the advanced visual understanding and language reasoning capabilities of a visual-language model to achieve the automated generation of reasoning-enhanced image descriptions and instructions. By designing and embedding cue words related to clinical knowledge and guidelines concerning DR grading, lesion identification, and their interrelationships, the generated descriptions and instructions can closely simulate the decision-making process of ophthalmologists. This makes the generated content more professional, logical, and clinically practical, and can more accurately reflect the reasoning process in medical diagnosis, rather than a simple description of image content, giving it a greater advantage in the field of medical image analysis.

[0028] (2) In DR detection tasks, other basic models have shortcomings in terms of grading accuracy, lesion recognition ability, interpretability of output results, and consistency with clinical practice. The reasoning-enhanced visual-language large model of the present invention is trained using ultra-wide-angle fundus images and synthesized reasoning-enhanced instructions. Compared with other basic models, this model shows a comprehensive improvement in DR grading and lesion recognition tasks. Specifically, the present invention achieves higher recognition accuracy, enabling more precise detection and grading of lesions; it has better interpretability, making the model's decision-making process and basis easier to understand; and it performs excellently in terms of consistency between output results and clinical practice, enabling more effective assistance to doctors in clinical decision-making, meeting the needs of clinical diagnosis, and being more valuable and reliable in practical applications.

Attached Figure Description

[0029] Figure 1 is a flowchart of the training method of the present invention;

[0030] Figure 2 is a comparison chart showing the predictive accuracy and consistency with clinical standards of the reasoning-enhanced visual-language large model of the present invention against those of existing technologies.

Detailed Implementation

[0031] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.

[0032] Obviously, the accompanying drawings described below are merely some examples or embodiments of this application. Those skilled in the art can apply this application to other similar scenarios based on these drawings without any inventive effort. Furthermore, it is understood that although the efforts made in this development process may be complex and lengthy, for those skilled in the art related to the content disclosed in this application, any changes to design, manufacturing, or production based on the technical content disclosed in this application are merely conventional technical means and should not be construed as insufficient disclosure of the content of this application.

[0033] Details of one or more embodiments of this application are set forth in the following drawings and description to make other features, objects and advantages of this application more readily apparent.

[0034] In this application, the reference to "embodiment" means that a specific feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment that is mutually exclusive with other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described in this application may be combined with other embodiments without conflict.

[0035] Unless otherwise defined, the technical or scientific terms used in this application shall have the ordinary meaning understood by one of ordinary skill in the art to which this application pertains. The terms “a,” “an,” “the,” and similar words used in this application do not indicate quantity limitation and may indicate singular or plural. The terms “comprising,” “including,” “having,” and any variations thereof used in this application are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or modules (units) is not limited to the listed steps or units, but may also include steps or units not listed, or may include other steps or units inherent to these processes, methods, products, or devices. The terms “connected,” “linked,” “coupled,” and similar words used in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. “Multiple” used in this application refers to two or more. “And / or” describes the relationship between related objects, indicating that three relationships may exist; for example, “A and / or B” can represent: A alone, A and B simultaneously, and B alone. The character " / " generally indicates that the preceding and following objects are in an "or" relationship. The terms "first," "second," and "third" used in this application are merely to distinguish similar objects and do not represent a specific ordering of the objects.

[0036] Example 1

[0037] This embodiment provides a method for training an inference-enhanced vision-language large model. This inference-enhanced vision-language large model is used for lesion identification and DR classification in ultra-wide-angle fundus images. As shown in Figure 1, the method includes the following steps:

[0038] Step 1) Obtain an ultra-wide-angle fundus image as the input image. Use the manually labeled DR grade, manually labeled lesion type, and clinical background of the ultra-wide-angle fundus image as prompt words. Use a visual-language model with reasoning ability to generate a reasoning-enhanced image description and the reasoned DR grade and lesion type.

[0039] To ensure that DR diagnosis is consistent with clinical decision-making, this embodiment designs a visual-language model that generates clinically interpretable reports by simulating the two-step diagnostic process used by ophthalmologists: i) systematically identifying retinal abnormalities; ii) systematically synthesizing the findings into a standardized severity grading based on rules.

[0040] In this embodiment, the visual-language model is GPT-4o. In other embodiments, any large model that supports visual-language multimodal input and has natural language generation capabilities can be used to achieve the purpose of this invention.

[0041] In this step, GPT-4o's advanced visual understanding and language reasoning capabilities are leveraged to automate the generation of reasoning-enhanced image descriptions and instructions. By designing a prompt word that incorporates classification labels for DR lesion grading and lesion identification, along with clinical knowledge and guidelines explaining the image features and interrelationships of each category, the generated image descriptions and instructions closely simulate the decision-making process of ophthalmologists. This provides more professional and logical input information for subsequent model training and applications.

[0042] In one embodiment, the designed prompt words can be:

[0043] Imagine that you are a doctor. Analyze the ultra-wide-field fundus image and perform the following tasks: 1) Describe the image in detail. 2) Given the image description, identify and explain any abnormalities from the list: context:...(clinical criteria and explanations)..

[0044] In the above prompt, "microaneurysms" and "Mild NPDR" are the given labels, and "(clinical criteria and explanations)" represents the embedded clinical context. This embodiment prompts GPT-4o with detailed clinical requirements, enabling it to perform three tasks (image description, lesion identification, and DR grading) in a clinical tone. To ensure accuracy, a prompt template $t(\cdot,\cdot,\cdot)$ is used: for each fundus image $u_i$, the grade $g_i \in \{0,1,2,3,4\}$ and the lesion types $l_i \in \{0,1\}^7$ from the doctor's annotation are added, together with the clinical background $c$ (including grading criteria and lesion interpretation), where 0 represents no lesion, 1 represents a lesion, and the superscript 7 indicates that there are 7 lesion types. A description $r_i = \mathrm{GPT4o}(t(c, g_i, l_i), u_i)$ is generated, with an average length of 364.8 words. Using a structured, logic-driven format, the system first performs descriptive analysis based on the spatial location and appearance of clinically relevant visual features in ultra-wide-angle fundus images to generate image descriptions. Then, based on the image descriptions and predefined clinical backgrounds, it sequentially classifies lesions and grades their severity.
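The template $t(c, g_i, l_i)$ described above can be sketched as a small string-building function. This is only an illustration under stated assumptions: the names `build_prompt`, `GRADE_NAMES` and `lesion_names` are hypothetical, the mapping of grades 0 to 4 onto the common five-level DR severity names is assumed (the text only names "Mild NPDR"), and the wording of the third task is invented for the example because the prompt in the text is elided.

```python
from typing import Sequence

# Assumption: grades 0-4 follow the common five-level DR severity naming;
# the text itself only names "Mild NPDR" explicitly.
GRADE_NAMES = ("No DR", "Mild NPDR", "Moderate NPDR", "Severe NPDR", "PDR")


def build_prompt(context: str, grade: int, lesions: Sequence[int],
                 lesion_names: Sequence[str]) -> str:
    """Fill a template t(c, g_i, l_i) with annotated labels and clinical context."""
    if grade not in range(len(GRADE_NAMES)):
        raise ValueError("grade must be in {0, 1, 2, 3, 4}")
    if len(lesions) != len(lesion_names):
        raise ValueError("one 0/1 flag is required per lesion type")
    if any(flag not in (0, 1) for flag in lesions):
        raise ValueError("lesion flags must be 0 or 1")

    # Keep only the lesion types that the annotator marked as present (flag 1).
    present = [name for name, flag in zip(lesion_names, lesions) if flag == 1]
    lesion_text = ", ".join(present) if present else "none"
    return (
        "Imagine that you are a doctor. Analyze the ultra-wide-field fundus "
        "image and perform the following tasks:\n"
        "1) Describe the image in detail.\n"
        "2) Given the image description, identify and explain any "
        f"abnormalities from the list: {lesion_text}.\n"
        # Hypothetical wording: the third task is elided in the text.
        f"3) Explain why the DR level is {GRADE_NAMES[grade]}.\n"
        f"Context: {context}"
    )
```

For instance, calling `build_prompt` with grade 1 and a seven-element flag vector whose only non-zero entry is microaneurysms yields a prompt carrying the labels "microaneurysms" and "Mild NPDR", matching the example in the text.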

[0045] In one embodiment, the results generated by GPT-4o include three aspects: image description, lesion type, and DR grade, as shown in the following example:

[0046] Image description:..The optic disc appears...The macula exhibits..The peripheral retina..Small, round, red dots consistent with microaneurysms are visible in the central to mid-peripheral retina. These are isolated and not widespread.

[0047] Lesion identification: Microaneurysm: Small, round, red dots observed on the retina. These are indicative of localized outpouchings of capillary walls, a hallmark of early diabetic retinopathy.

[0048] DR level classification:..(definition of Mild NPDR)..Based on the presence of microaneurysms and the absence of other abnormalities, the DR level can be classified as Mild NPDR,...(conclusion)...

[0049] Step 2) Use the generated image description and the inferred DR grade and lesion type as instructions, and combine them with the ultra-wide-angle fundus image to construct an inference enhancement instruction dataset, and fine-tune the inference enhancement visual-language large model.

[0050] Step 21) Construct the inference enhancement instruction dataset

[0051] After generating the inference results of image description and lesion type and DR grade in step 1), a hybrid instruction dataset is constructed to fine-tune the inference-enhanced visual-language large model.

[0052] The instructions include one-stop instructions and task-specific instructions.

[0053] A one-stop instruction is a set of question-and-answer pairs corresponding to a specific ultra-wide-angle fundus image. Its purpose is to provide the model with a comprehensive task description, enabling the model to provide complete analysis results based on information from the entire image. For the i-th image, its corresponding one-stop instruction is denoted as $(q, r_i)$, where $q$ represents the one-stop question and $r_i$ represents the image description generated in step 1) together with the inferred DR grade and lesion type. The question in the question-and-answer pair is a predefined comprehensive question, such as:

[0054] Q:Describe this UWF image in detail. Provide the DR level of the image and reason.

[0055] The answer consists of the image description generated by the visual-language model and the DR grade and lesion type obtained through inference; that is:

[0056] A:…(All results generated by the visual-language model)….

[0057] Language models utilize prior contextual information when making incremental predictions. However, directly training the model to generate complete image descriptions in an autoregressive manner can lead to error propagation. For example, misidentifying microaneurysms during lesion detection can severely impact subsequent lesion grading, causing non-DR lesions to be misidentified as mild DR lesions. Therefore, to mitigate the impact of error propagation, the reasoning-enhanced description $r_i$ of the i-th image is divided into three sub-tasks, namely $r_i = r_{i1} \cdot r_{i2} \cdot r_{i3}$, where the "·" symbol represents string concatenation, thus constructing task-specific instructions. This allows the model to focus on specific tasks, reducing over-reliance on previous steps and enhancing its ability to independently handle complex tasks. The task-specific instructions are three sets of question-and-answer pairs corresponding to a specific ultra-wide-angle fundus image, denoted as $(q_i, r_{ij})$, $j \in \{1,2,3\}$, where:

[0058] The first set of question-and-answer pairs consists of predefined image description questions, and the answers are image descriptions generated by a visual-language model, for example:

[0059] Q:Describe the findings on this UWF image.

[0060] A:...(image description)...

[0061] The second set of question-and-answer pairs consists of predefined lesion classification questions, and the answers are the lesion types obtained through visual-language model reasoning, for example:

[0062] Q:Detect abnormalities on this UWF image.

[0063] A: (lesion identification)

[0064] The third set of question-and-answer pairs consists of predefined DR grading questions, and the answers are the DR grades obtained through visual-language model inference, for example:

[0065] Q:What is the DR level of this UWF image?

[0066] A:...(DR level classification)...

[0067] The final reasoning-enhanced instruction dataset D consists of a hybrid of two types of visual question-answering (QA) pairs: one-stop instructions and task-specific instructions. In the definition of D, N denotes the index set of the dataset and $u_i$ denotes the i-th image.
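The construction of the hybrid dataset can be sketched as follows. It is a minimal illustration, not the patented implementation: the function names, the dictionary layout of each sample and the `sep` parameter are hypothetical, while the four questions are taken from the examples given in the text.

```python
from typing import Dict, List

# Predefined questions, taken from the examples in the text.
ONE_STOP_Q = ("Describe this UWF image in detail. "
              "Provide the DR level of the image and reason.")
TASK_QS = (
    "Describe the findings on this UWF image.",
    "Detect abnormalities on this UWF image.",
    "What is the DR level of this UWF image?",
)


def build_instructions(image_path: str, parts: List[str],
                       sep: str = "") -> List[Dict[str, str]]:
    """Return one one-stop QA pair and three task-specific QA pairs for one image.

    parts holds r_i1 (image description), r_i2 (lesion identification)
    and r_i3 (DR level classification), in that order.
    """
    if len(parts) != len(TASK_QS):
        raise ValueError("expected exactly three answer parts")
    # r_i = r_i1 . r_i2 . r_i3 (string concatenation); sep is a hypothetical
    # knob for inserting a separator, empty by default.
    full_answer = sep.join(parts)
    samples = [{"image": image_path, "question": ONE_STOP_Q, "answer": full_answer}]
    for question, answer in zip(TASK_QS, parts):
        samples.append({"image": image_path, "question": question, "answer": answer})
    return samples


def build_dataset(records: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Mix one-stop and task-specific instructions for every image."""
    dataset: List[Dict[str, str]] = []
    for image_path, parts in records.items():
        dataset.extend(build_instructions(image_path, parts))
    return dataset
```

Each image therefore contributes four QA pairs, so a collection of n annotated images yields 4n training samples under this sketch.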

[0068] Step 22) Fine-tuning the reasoning-enhanced visual-language large model

[0069] First, an inference-enhanced visual-language large model (UWF-VLMR) for DR detection is constructed based on a multimodal large model framework (such as InternVL). This model is fine-tuned by low-rank adaptation (LoRA) using ultra-wide-field (UWF) fundus images and the inference-enhanced instructions generated in step 21). After training, the model can analyze the input fundus images and output corresponding image feature descriptions, detailed diagnostic inference processes, DR classifications, and lesion type diagnostic conclusions.

[0070] As shown in Figure 1, the reasoning-enhanced visual-language large model includes a visual encoder, a multilayer perceptron, a text segmenter, and a large language model. Ultra-wide-angle fundus images, after image preprocessing, are input into the pre-trained visual encoder to obtain visual tags. These visual tags are then mapped to the language tag space through the multilayer perceptron to obtain projected visual tags. Instructions are processed by the text segmenter to obtain language tags. Connecting the language tags and the projected visual tags yields instruction tags, which are then input into the pre-trained large language model. In other embodiments, the large language model can be replaced with any open-source large model supporting multimodal visual-language input, including InternVL, without affecting the achievement of the invention's objective. During the training of the reasoning-enhanced visual-language large model, the parameters of the visual encoder are frozen, and only the parameters of the multilayer perceptron and the large language model are fine-tuned.
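The data flow from visual tags to instruction tags can be illustrated with a toy, pure-Python sketch. It assumes a two-layer projector with a ReLU activation and represents each tag as a plain list of floats; the layer count, the activation and all function names are assumptions made for illustration, since the text only states that a multilayer perceptron maps visual tags into the language tag space.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Compute weight @ x + bias for a single tag vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def mlp_project(tag: Vector, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    """Two-layer projector standing in for the multilayer perceptron (ReLU assumed)."""
    hidden = [max(0.0, h) for h in linear(tag, w1, b1)]
    return linear(hidden, w2, b2)


def build_instruction_tags(language_tags: List[Vector], visual_tags: List[Vector],
                           w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> List[Vector]:
    """Connect language tags with projected visual tags into one input sequence."""
    projected = [mlp_project(tag, w1, b1, w2, b2) for tag in visual_tags]
    return language_tags + projected
```

In a real model the visual tags would come from the frozen visual encoder and the language tags from the embedding of the segmented instruction text; during training only the projector weights (`w1`, `b1`, `w2`, `b2` here) and the large language model would receive updates.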

[0071] In the image preprocessing stage, in order to preserve the detailed information in the ultra-wide-angle fundus image to the greatest extent, while taking into account key regions and global context, a hybrid patch stitching strategy was designed, which includes the following steps:

[0072] S1. The ultra-wide-angle fundus image is evenly divided into multiple non-overlapping first local images according to a preset layout; in this embodiment, a 4×3 layout is used, giving a total of 12 first local images.

[0073] S2. With the centroid of the ultra-wide-angle fundus image as the center, the image is cropped according to preset side length ratios to obtain multiple second local images, one per ratio; in this embodiment, four square regions of different sizes are cropped around the centroid of the image, with side lengths of 1.0, 0.75, 0.5, and 0.25 times the short side of the original image, giving 4 second local images.

[0074] S3. The first local images and the second local images (16 image blocks in total) are resized to a uniform preset size of 448×448 and then stitched together as the input of the visual encoder.
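Steps S1 to S3 can be sketched as crop-box arithmetic. The sketch below is illustrative only and makes two assumptions not stated in the source: the 4×3 layout is read as 4 columns by 3 rows, and the centroid is taken to be the geometric centre of the image. Function names and the example image size are hypothetical.

```python
def grid_boxes(width, height, cols=4, rows=3):
    """S1: non-overlapping tiles covering the image, as (left, top, right, bottom)."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            boxes.append((c * width // cols, r * height // rows,
                          (c + 1) * width // cols, (r + 1) * height // rows))
    return boxes


def center_boxes(width, height, ratios=(1.0, 0.75, 0.5, 0.25)):
    """S2: centred squares whose sides are fractions of the short image side."""
    short = min(width, height)
    cx, cy = width / 2, height / 2  # assumption: centroid = geometric centre
    boxes = []
    for ratio in ratios:
        side = int(round(short * ratio))
        left = int(round(cx - side / 2))
        top = int(round(cy - side / 2))
        boxes.append((left, top, left + side, top + side))
    return boxes


def hybrid_boxes(width, height):
    """S1 + S2: the 12 grid tiles followed by the 4 centred squares."""
    return grid_boxes(width, height) + center_boxes(width, height)


# Example with a hypothetical 4000x3000 image.
boxes = hybrid_boxes(4000, 3000)
```

For S3, each box would then be cropped and resized to 448×448 with an image library (for example Pillow's `Image.crop` and `Image.resize`) before the 16 blocks are passed to the visual encoder. The grid tiles preserve fine detail across the whole field, while the nested centre crops add progressively wider context around the central region.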

[0075] To incorporate multimodal information, a pre-trained visual encoder is integrated with the large language model backbone, and the model is fine-tuned with low-rank adaptation (LoRA) for 15 epochs. The training process follows the general LoRA fine-tuning procedure, with training data fed in mini-batches. At each training step, a batch of question-answer pairs with their corresponding ultra-wide-angle (UWF) images, $(x_{\text{text}}, y, x_{\text{image}}) \sim \mathrm{Uniform}(D)$, is sampled from the constructed reasoning-enhanced instruction dataset D as the model input, where Uniform denotes uniform sampling, $x_{\text{text}}$ and $y$ denote the question and the answer, and $x_{\text{image}}$ denotes the corresponding ultra-wide-angle image. The visual encoder is frozen, and a multilayer perceptron (MLP) mapper $M$ maps the visual tags into the language tag space. The language tags $z_{\text{text}}$ and the projected visual tags $M(z_{\text{image}})$ are then concatenated to give the instruction tags $z = [z_{\text{text}}, M(z_{\text{image}})] = [f_{\text{tok}}(x_{\text{text}}), M(f_{\text{ViT}}(x_{\text{image}}))]$, where $f_{\text{tok}}$ is the text segmenter and $f_{\text{ViT}}$ is the visual encoder. Finally, these tags are input into the pre-trained large language model $f_{\text{LLM}}$, which is adapted through low-rank adaptation (LoRA) modules. Fine-tuning is supervised by the cross-entropy loss.
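The source names the loss only as cross-entropy. Assuming the usual autoregressive formulation, in which the answer $y = (y_1, \dots, y_T)$ is predicted tag by tag conditioned on the instruction tags $z$, the objective might be written as below. The symbols $T$ (answer length), $\theta$ (the trainable parameters), and $p_{\theta}$ (the model's predictive distribution) are notation introduced here for illustration and do not appear in the source.

```latex
\mathcal{L}(\theta)
  = -\,\mathbb{E}_{(x_{\text{text}},\, y,\, x_{\text{image}}) \sim \mathrm{Uniform}(D)}
    \left[ \sum_{t=1}^{T} \log p_{\theta}\!\left(y_t \mid z,\, y_{<t}\right) \right],
\qquad
z = \left[ f_{\text{tok}}(x_{\text{text}}),\; M\!\left(f_{\text{ViT}}(x_{\text{image}})\right) \right]
```

Under this reading, gradients flow only into the trainable parameters $\theta$, while the frozen visual encoder $f_{\text{ViT}}$ contributes fixed features.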

[0076] Example 2

[0077] This embodiment provides an image processing method, which includes the following steps:

[0078] Acquire ultra-wide-angle fundus images;

[0079] The ultra-wide-angle fundus image and a predefined question are input into the reasoning-enhanced visual-language large model trained using the method described in Example 1 above, and the DR grade and lesion type are output.

[0080] Based on the above methods, this embodiment verifies the performance of the reasoning-enhanced visual-language large model proposed in this invention as follows:

[0081] 1) Classification performance verification: This embodiment conducted DR grading and lesion classification experiments on an ultra-wide-angle fundus image dataset. As shown in Figure 2, the vertical axis represents the accuracy of DR grading. The results demonstrate that this invention outperforms foundation models such as RETFound, KeepFIT, and FLAIR in both the grading and lesion classification tasks. Notably, zero-shot GPT-4o performs worse, indicating that it lacks prior knowledge of DR detection on ultra-wide-angle fundus images and suggesting that the performance improvement of this invention is not solely due to the refinement of the GPT-4o model. Furthermore, inference results are reported for both one-stop and multi-task instructions, with no significant difference between the two, indicating that the model is applicable to both one-stop DR diagnosis and multi-task DR identification.

[0082] 2) Consistency verification with clinical standards: The methods were further evaluated by the proportion of DR grading and lesion predictions that conform to clinical standards, together with the corresponding grading accuracy. As shown in Figure 2, the horizontal axis represents the proportion of DR grading and lesion prediction results in the test set that conform to clinical standards. Although fundus-image foundation models such as RETFound exhibit strong performance, they do not fully encode the co-occurrence relationship between lesions and grades, so 8.4%–20.9% of their sample predictions are inconsistent with clinical standards. In contrast, this invention effectively captures these correlations, achieving 100% consistency with clinical standards across all test samples and the highest consistency accuracy among the compared methods.
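The consistency evaluation can be sketched as follows. The source does not spell out the clinical rules or the exact definition of consistency accuracy, so both are assumptions here: the rule is a deliberately simple toy (grade 0 if and only if no lesion is predicted), and consistency accuracy is taken to mean grading accuracy over the clinically consistent predictions. Function names and sample values are hypothetical.

```python
def is_consistent(grade, lesions):
    """Toy rule (assumption): grade 0 means no DR, so it must pair with no lesions,
    and any non-zero grade must pair with at least one predicted lesion."""
    return (grade == 0) == (len(lesions) == 0)


def consistency_report(predictions, true_grades):
    """Return (share of consistent predictions, grading accuracy among them).

    predictions is a list of (predicted_grade, predicted_lesions) tuples.
    """
    consistent = [i for i, (grade, lesions) in enumerate(predictions)
                  if is_consistent(grade, lesions)]
    rate = len(consistent) / len(predictions)
    correct = sum(1 for i in consistent if predictions[i][0] == true_grades[i])
    accuracy = correct / len(consistent) if consistent else 0.0
    return rate, accuracy


# Placeholder predictions, not data from the patent.
predictions = [(0, []), (2, ["lesion_a"]), (0, ["lesion_a"]), (1, [])]
true_grades = [0, 1, 0, 1]
rate, accuracy = consistency_report(predictions, true_grades)
```

A real evaluation would replace `is_consistent` with the grade-lesion co-occurrence rules from the clinical grading criteria used to build the instruction dataset.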

[0083] In one embodiment, the electronic device includes a computing unit that can perform various appropriate actions and processes based on a computer program stored in a read-only memory (ROM) or a computer program loaded from a storage unit into a random access memory (RAM). The RAM may also store various programs and data required for device operation. The computing unit, ROM, and RAM are interconnected via a bus. An input/output (I/O) interface is also connected to the bus.

[0084] Multiple components in the electronic device are connected to the I/O interface, including: input units such as keyboards and mice; output units such as various types of displays and speakers; storage units such as disks and optical discs; and communication units such as network interface cards (NICs), modems, and wireless transceivers. The communication unit allows the device to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.

[0085] The computing unit can be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of computing units include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit performs the various methods and processes described above, such as image processing methods and/or model training methods. For example, in some embodiments, the image processing methods and/or model training methods can be implemented as computer software programs tangibly contained in a machine-readable medium, such as a storage unit. In some embodiments, part or all of the computer program can be loaded and/or installed on the device via ROM and/or a communication unit. When the computer program is loaded into RAM and executed by the computing unit, one or more steps of the image processing methods and/or model training methods described above can be performed. Alternatively, in other embodiments, the computing unit can be configured to perform image processing methods and/or model training methods by any other suitable means (e.g., by means of firmware).

[0086] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.

[0087] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0088] It should be understood that steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution disclosed in this disclosure can be achieved, and this is not limited herein.

[0089] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present invention, and these modifications or substitutions should all be covered within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

Claims

1. A method for training a reasoning-enhanced visual-language large model, characterized in that the reasoning-enhanced visual-language large model is used for lesion identification and DR classification in ultra-wide-angle fundus images, and the method includes the following steps: An ultra-wide-angle fundus image is acquired as the input image. The manually labeled DR grade, manually labeled lesion type, and clinical background of the ultra-wide-angle fundus image are used as prompt words. A visual-language model with reasoning ability is used to generate a reasoning-enhanced image description and the reasoned DR grade and lesion type. The clinical background includes the grading criteria and lesion interpretation. The generated image descriptions and the inferred DR grade and lesion type are used as instructions. Combined with the ultra-wide-angle fundus images, a reasoning-enhanced instruction dataset is constructed, and the reasoning-enhanced visual-language large model is fine-tuned. The visual-language model performs three tasks in clinical tone according to prompts: image description, lesion identification, and DR grading. It adopts a structured and logic-driven format. First, it performs descriptive analysis based on the spatial location and appearance of clinically relevant visual features in ultra-wide-angle fundus images to generate image descriptions. Then, based on the image descriptions and predefined clinical backgrounds, it performs lesion classification and severity grading in sequence. The reasoning-enhanced visual-language large model includes a visual encoder, a multilayer perceptron, a text segmenter, and a large language model. The ultra-wide-angle fundus image is preprocessed and then input into a pre-trained visual encoder to obtain visual tags. The visual tags are mapped to the language tag space through the multilayer perceptron to obtain projected visual tags. The instructions are processed by the text segmenter to obtain language tags. The language tags and the projected visual tags are connected to obtain instruction tags, which are then input into the pre-trained large language model.

2. The method of claim 1, characterized in that the instructions include one-stop instructions and task-specific instructions. The one-stop instructions are a set of question-and-answer pairs corresponding to a certain ultra-wide-angle fundus image. The question in the question-and-answer pair is a predefined comprehensive question, and the answer is an image description generated by a visual-language model, as well as the DR grade and lesion type obtained through inference. The task-specific instructions are three sets of question-and-answer pairs corresponding to a certain ultra-wide-angle fundus image. The first set of question-and-answer pairs is a predefined image description question, and the answer is an image description generated by a visual-language model. The second set of question-and-answer pairs is a predefined lesion classification question, and the answer is the lesion type obtained through inference by a visual-language model. The third set of question-and-answer pairs is a predefined DR grade question, and the answer is the DR grade obtained through inference by a visual-language model.

3. The method of claim 1, characterized in that the image preprocessing specifically includes: the ultra-wide-angle fundus image is evenly divided into multiple non-overlapping first local images according to a preset layout; centered on the centroid of the ultra-wide-angle fundus image, the ultra-wide-angle fundus image is cropped according to a preset side length ratio to obtain multiple second local images corresponding to different side length ratios; the first and second local images are resized to a uniform preset size and then stitched together as input to the visual encoder.

4. An image processing method, characterized in that the method includes the following steps: acquiring an ultra-wide-angle fundus image; and inputting the ultra-wide-angle fundus image and a predefined question into a reasoning-enhanced visual-language large model trained using the method described in any one of claims 1 to 3, which outputs the DR grade and lesion type.

5. An electronic device comprising a memory and a processor, said memory having stored thereon a computer program, characterized in that, when the processor executes the program, it implements the method as described in any one of claims 1 to 3.

6. A computer-readable storage medium having stored thereon a computer program, characterized in that, when the program is executed by a processor, it implements the method as described in any one of claims 1 to 3.

7. An electronic device comprising a memory and a processor, said memory having stored thereon a computer program, characterized in that, when the processor executes the program, it implements the method as described in claim 4.

8. A computer-readable storage medium having stored thereon a computer program, characterized in that, when the program is executed by a processor, it implements the method as described in claim 4.

Citation Information

Patent Citations

Visual language model instruction fine tuning method and device
CN117975475A
Eye disease recognition method, device and equipment based on multiple modes and storage medium
CN118298494A
