Reliable reasoning method and system for multimodal large models in medical image-assisted diagnosis

By constructing multimodal and plain-text thought chain data for two-stage supervised fine-tuning, designing fine-grained reward functions, and optimizing large-model training, the method addresses information redundancy and low credibility in multimodal large models for medical image-assisted diagnosis and provides efficient, reliable diagnostic assistance.

CN122290955A · Pending · Publication Date: 2026-06-26 · TONGJI UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
TONGJI UNIV
Filing Date
2026-03-26
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing multimodal large models for medical image-assisted diagnosis suffer from problems such as information redundancy, low reliability, difficulty in balancing accuracy and efficiency, unstable reinforcement learning training and difficulty in convergence, and difficulty in obtaining the multimodal thought chain required for supervised learning fine-tuning.

Method used

The method constructs a multimodal thought chain data generation mechanism, combines the generated data with plain-text thought chain data for two-stage supervised fine-tuning, designs a fine-grained reward function mechanism (rewards for accuracy, format, conciseness, length, and difficulty diversity) with dynamically adjusted weight coefficients, and trains the large model with a reinforcement learning objective function to optimize its output.

Benefits of technology

It improves the accuracy and reliability of model output, enhances inference efficiency, keeps the training process stable and controllable, and adapts the model to medical diagnostic tasks.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention relates to a reliable reasoning method and system for multimodal large models in medical image-assisted diagnosis, comprising: designing a multimodal thought chain data generation mechanism; performing supervised fine-tuning of the large model with plain-text thought chain data and the generated multimodal thought chain data to achieve a cold-start effect; further stimulating the reasoning ability of the large model through reinforcement learning, wherein the reward function covers multiple dimensions; and feeding the reward values calculated for each dimension into a dynamic reward scheduling mechanism, which adaptively adjusts the weight coefficients among the reward functions to balance the optimization of an objective function with multiple reward targets, and outputting a reliably reasoned medical diagnosis result. The invention eliminates redundant information in the answer while accurately retaining the key information needed to solve the problem, thereby ensuring the accuracy and reliability of the answer. Compared with the prior art, the invention offers high accuracy, strong robustness, excellent reliability, and high reasoning efficiency.

Description

Technical Field

[0001] This invention relates to the field of medical image-assisted diagnosis technology, and in particular to a multimodal large-model reliable reasoning method and system for medical image-assisted diagnosis.

Background Technology

[0002] Intelligent assisted diagnosis of medical images is an important component of medical auxiliary diagnosis. Its core lies in utilizing technologies such as artificial intelligence and multimodal fusion to process medical image data from CT and MRI scans. AI models automatically identify, locate, and classify lesions, providing doctors with valuable information. This technology is crucial to patients' health and lives, demanding extremely high accuracy, reliability, and interpretability of the model's diagnostic reasoning.

[0003] Currently, most mainstream systems employ convolutional neural networks or Transformer architectures, which can only perform image feature extraction and simple lesion classification and detection. Their intelligence level and reliability need improvement. The development of multimodal large-scale models offers the possibility of improving diagnostic intelligence. Existing research attempts to utilize them in conjunction with reinforcement learning to optimize reasoning capabilities, thereby improving diagnostic accuracy and reliability. However, existing methods have significant drawbacks: First, most directly apply reinforcement learning to general multimodal large-scale models, neglecting supervised fine-tuning, leading to unstable training, shallow reasoning, and unreliable conclusions. Second, adding supervised fine-tuning can generate excessively long inference chains, reducing the efficiency and readability of the model's responses. Third, the large-scale multimodal medical thought chain data required for supervised fine-tuning is difficult to obtain. These shortcomings result in the inference reliability of existing systems failing to meet clinical needs, limiting the widespread application of related technologies, and necessitating corresponding technical solutions.

[0004] Chinese invention patent CN120636777A discloses a heatstroke auxiliary diagnosis system based on a DeepSeek large model, comprising: a structured knowledge base construction module, a model training module, and an auxiliary judgment module. The structured knowledge base construction module is configured to build a structured knowledge base containing basic clinical feature data and corresponding medical diagnostic text. The model training module is configured to fine-tune and optimize the DeepSeek-R1 model based on the structured knowledge base to form a DeepSeek-R1 model with heatstroke auxiliary diagnosis capabilities. The auxiliary judgment module is configured to process the input vital signs to be diagnosed using the fine-tuned and optimized DeepSeek-R1 model with heatstroke auxiliary diagnosis capabilities to obtain the heatstroke auxiliary diagnosis result. This invention relies on data-driven approaches and incorporates prior medical knowledge, improving the reliability of auxiliary diagnosis and enhancing the trust level of human-machine collaboration. Simultaneously, it enables the model to maintain high performance even in scenarios with sparse labeled data, and the accuracy of diagnostic subtyping is significantly improved compared to general models or simply supervised fine-tuning methods. Furthermore, this method guides the model to generate diagnostic evidence texts that are more in line with medical professional standards, more fluent in language, and more logically rigorous by designing a reward function, providing a clear reasoning path and improving the credibility and interpretability of the model's decisions. However, current models mainly rely on fine-tuning and optimizing pre-trained general reasoning models, and have not yet explored effective paths for reinforcement learning based on instruction-based fine-tuning. 
Moreover, research has largely focused on the pure text domain and has not been effectively extended to multimodal scenarios. In addition, the model output still suffers from information redundancy, leading to reduced text readability and result credibility, making it difficult to balance accuracy and efficiency. It also faces many challenges, such as the instability and convergence difficulties in reinforcement learning training, and the difficulty in obtaining the multimodal thought chains required for supervised learning fine-tuning.

[0005] In summary, there is currently no reliable reasoning method and system for multimodal large models in medical image-assisted diagnosis that can solve or partially solve the above problems.

Summary of the Invention

[0006] The purpose of this invention is to overcome the shortcomings of the existing technology and provide a multimodal large-model reliable reasoning method and system for medical image-assisted diagnosis, so as to solve or partially solve the problems of information redundancy, low reliability, difficulty in balancing accuracy and efficiency in model output, instability and convergence difficulty in training by reinforcement learning, and difficulty in obtaining the multimodal thought chain required for supervised learning fine-tuning.

[0007] The objective of this invention can be achieved through the following technical solutions. According to one aspect of the present invention, a multimodal large-model reliable reasoning method for medical image-assisted diagnosis is provided, the method specifically comprising:
S1. Obtain plain-text thought chain data, construct a thought chain data generation mechanism, and generate multimodal thought chain data, wherein the multimodal thought chain data includes medical images, medical diagnostic questions, true answers, thought processes, and difficulty assessments;
S2. Construct an objective function and use the plain-text thought chain data and the multimodal thought chain data to perform supervised fine-tuning of the preset large model;
S3. Construct a fine-grained reward function mechanism, calculate the reward value of each reward function, adjust the weight coefficient of each reward function through the dynamic reward scheduling mechanism, and construct the reinforcement learning objective function with maximization of the sum of the reward values as the constraint;
S4. Train the supervised fine-tuned large model with the reinforcement learning objective function to obtain the trained large model. The trained large model is then used to generate intermediate results for medical image-assisted diagnosis, which can serve as an auxiliary reference for doctors in making medical diagnoses.

[0008] As a preferred technical solution, the steps for generating the multimodal thought chain data include: providing medical images, medical diagnostic questions, and true answers; inputting the medical images and medical diagnostic questions into a large model to generate a preliminary thought chain and a corresponding answer; comparing the generated answer with the true answer, collecting the preliminary thought chain as a correct thought chain if the answer is correct, and iteratively optimizing the preliminary thought chain if the answer is wrong; and generating standardized data, including the difficulty assessment, the correct thought chain, and the true answer, as the multimodal thought chain data.

[0009] As a preferred technical solution, the objective function in S2 is defined as: [formula not reproduced in the source], where the terms denote, in order: the input to the large model, comprising multimodal medical images and medical diagnostic questions; the ground-truth labels provided by experts, comprising the standard thought process and the true diagnostic answer; the expectation operator; a dataset containing N training samples; the large model; the model parameter space; and the loss function of the supervised fine-tuning stage, which takes the dataset as input and the model parameters as the optimization variable.

[0010] As a preferred technical solution, the supervised fine-tuning specifically includes first-stage supervised fine-tuning and second-stage supervised fine-tuning. In the first stage, the input includes only medical diagnostic questions, and the output includes the standard thought process and the true diagnostic answer. In the second stage, the input combines multimodal medical images with medical diagnostic questions, and the output includes the standard thought process and the true diagnostic answer.

[0011] As a preferred technical solution, the reward function includes an accuracy reward, a format reward, a conciseness reward, a length reward, and a difficulty diversity reward.

[0012] As a preferred technical solution:
The accuracy reward evaluates the validity and correctness of the output answer;
The format reward verifies whether the model response conforms to the preset structure specification;
The conciseness reward comprehensively measures content relevance, surface repetition, and semantic redundancy, and outputs a single reward value;
The length reward defines the problem difficulty levels, calculates a dynamic target length, delineates a difficulty-adapted length interval, and calculates the reward value from the actual length, so that the model generates a reasoning chain whose length and depth both fit the problem;
The difficulty diversity reward uses multi-dimensional indicators to evaluate the diversity and balance of the difficulty labels that the model predicts across the response group generated for the same problem, and is combined with the length reward value.

[0013] As a preferred technical solution, the dynamic reward scheduling mechanism specifically includes: quantifying the training difficulty of each reward, and dynamically adjusting the weight parameters of each reward based on the training difficulty and taking into account the importance of rewards based on human priors.

[0014] As a preferred technical solution, the construction of the reinforcement learning objective function specifically includes: using maximization of the sum of the reward functions as the constraint, the GRPO algorithm guides the model to learn correct reasoning behavior. For the response group corresponding to a question, the total reward value of each response is calculated from the fine-grained reward functions. The mean and standard deviation of all reward values in the group are then calculated to obtain the normalized advantage. The reinforcement learning objective function is constructed from the objective function and the normalized advantage.

[0015] As a preferred technical solution, the reinforcement learning objective function is expressed as: [formula not reproduced in the source], where the terms denote, in order: the expectation operator; the dataset; the number of responses sampled for each question; the i-th sampled response; the question; the old policy; the current policy; the normalized advantage; the clipping threshold for policy updates; the parameter space of the policy model; and the numeric clipping operation, which keeps a value within a bounded range.
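As a minimal sketch of the group-normalized advantage and clipped policy objective described in [0014] and [0015], the calculation could look like the following. It assumes sequence-level log-probabilities, a clipping threshold of 0.2, and the helper names `normalized_advantages` and `clipped_surrogate`, none of which come from the patent. It follows the usual GRPO convention of normalizing each reward by the group mean and standard deviation, and it omits any KL regularization term because the text does not list one.

```python
import math
from statistics import mean, pstdev


def normalized_advantages(rewards, eps=1e-8):
    """Group-relative advantage: (r_i - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Average of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) over the group."""
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # current policy / old policy
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)


# Hypothetical total rewards for four responses sampled for one question.
rewards = [1.0, 0.1, 0.1, 0.0]
advantages = normalized_advantages(rewards)
objective = clipped_surrogate(
    logp_new=[-1.0, -1.2, -0.9, -1.1],
    logp_old=[-1.1, -1.0, -1.0, -1.1],
    advantages=advantages,
)
```

Responses that score above the group mean receive a positive advantage and are reinforced; the clipping keeps any single update from moving the policy too far from the old policy.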

[0016] According to another aspect of the present invention, a multimodal large model reliable reasoning system for medical image-assisted diagnosis is provided, for executing the multimodal large model reliable reasoning method for medical image-assisted diagnosis as described above. The system includes an input module, a supervised learning fine-tuning module, a reinforcement learning fine-tuning module, and an output module, wherein the input module is used to input thought chain data, the thought chain data including plain text thought chain data and generated multimodal thought chain data; The supervised learning fine-tuning module is used to minimize the error between the predicted output of the large model and the standard output of the samples; The reinforcement learning fine-tuning module is used to optimize the objective function by using a fine-grained reward function as a constraint. The output module is used to enable the trained and optimized large model to output reliable inference results for medical image-assisted diagnosis, which can serve as an auxiliary reference for doctors in making medical diagnoses.

[0017] Compared with the prior art, the present invention has at least one of the following beneficial effects: (1) This invention constructs a multi-dimensional reward function to constrain the model output through a fine-grained reward function mechanism, including accuracy reward, format reward, conciseness reward, length reward and difficulty diversity reward. It calculates the value of each reward function and dynamically adjusts the reward function coefficients to maximize the total reward value. It constructs a reinforcement learning objective function to train the large model. While eliminating redundant information in the answer, it accurately retains the key content for solving the diagnostic problem. It solves the problems of information redundancy, low credibility and difficulty in balancing accuracy and efficiency in the model output, and achieves the effect of improving the accuracy and credibility of the model output and the inference efficiency.

[0018] (2) Before reinforcement learning fine-tuning, this invention introduces two-stage supervised fine-tuning using plain-text thought chain data and multimodal thought chain data. This minimizes the error between the model's predicted output and the standard sample output while constraining the model's reasoning norms, achieving a cold-start effect in which the large model masters a normalized reasoning pattern in advance. It solves the instability and convergence difficulty of prior-art approaches that rely solely on reinforcement learning training, and improves the stability and controllability of the overall training process.

[0019] (3) This invention constructs a thought chain data generation mechanism that inputs medical images, medical questions, and the corresponding answers into a large model, outputs thought chains and answers, and iteratively optimizes erroneous thought chains to generate multimodal thought chain data. This solves the problem that the multimodal thought chains required for supervised fine-tuning are difficult to obtain, which in turn prevents high-quality supervised fine-tuning. It provides data support for the subsequent supervised fine-tuning and reinforcement learning fine-tuning and adapts the model to medical diagnosis tasks.

Description of the Drawings

[0020] Figure 1 is a schematic diagram of the main steps of the present invention; Figure 2 is a schematic diagram of the overall architecture of the multimodal large-model reliable reasoning method for intelligent assisted diagnosis of medical images; Figure 3 is a schematic diagram comparing the method of the present invention with existing methods.

Detailed Implementation

[0021] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.

[0022] Example 1

To address the problems in the aforementioned prior art, this embodiment provides a multimodal large-model reliable reasoning method for medical image-assisted diagnosis. As shown in Figure 1, it includes the following steps: S1. Generate multimodal thought chain data and perform first-stage supervised fine-tuning.

[0023] Design a thought chain data generation mechanism with self-checking characteristics, which includes the following three stages. First, given a triplet consisting of the input image, the question, and the true answer, the input image and the question are fed to an external large vision-language model (VLM), which outputs an initial thought chain and a corresponding answer [formula not reproduced in the source]; the generation term in that formula indicates that the initial response is produced under the initial prompt strategy.

[0024] Then, evaluate the correctness of the generated answer. If the answer is correct, that is, it matches the true answer, the thought chain is collected directly as valid data. If the answer is incorrect, one of four optimization strategies is selected at random to iteratively optimize the thought chain [formula not reproduced in the source]; the strategy term in that formula denotes the strategy used in the i-th optimization iteration.

[0025] The four strategies are:
1. Explore an entirely new thought chain and generate a new answer;
2. Backtrack to a previous step and continue the reasoning from there;
3. Re-evaluate the preceding output and generate new reasoning;
4. Critique the preceding output and generate corrected reasoning.

[0026] If the model still cannot output the correct answer after three optimizations, the question and the corresponding true answer are input into the model to guide it in completing the correct thought chain. Finally, standardized data is generated, comprising three parts: difficulty assessment, thought process, and final answer, each delimited by its own tag.
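The self-checking generation loop in [0023] to [0026] can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the `vlm` callable and its keyword arguments, the strategy names, the dictionary keys, and exact string matching for answer checking are all hypothetical stand-ins.

```python
import random
from typing import Callable, Dict, Optional

# Hypothetical names for the four optimization strategies listed in [0025].
STRATEGIES = ("explore_new_chain", "backtrack", "re_evaluate", "critique_and_correct")


def build_sample(image, question: str, true_answer: str,
                 vlm: Callable[..., Dict[str, str]],
                 max_refinements: int = 3,
                 rng: Optional[random.Random] = None) -> Dict[str, str]:
    """Generate one standardized thought chain record with self-checking."""
    rng = rng or random.Random()
    out = vlm(image=image, question=question, strategy="initial",
              previous=None, hint=None)
    for _ in range(max_refinements):
        if out["answer"] == true_answer:  # assumed check: exact match
            break
        strategy = rng.choice(STRATEGIES)  # random pick among the four strategies
        out = vlm(image=image, question=question, strategy=strategy,
                  previous=out, hint=None)
    if out["answer"] != true_answer:
        # Fallback: reveal the true answer so the model completes a correct chain.
        out = vlm(image=image, question=question, strategy="guided",
                  previous=out, hint=true_answer)
    return {"difficulty": out["difficulty"],
            "thinking": out["thinking"],
            "answer": true_answer}


def stub_vlm(image, question, strategy, previous, hint):
    """Toy stand-in for the external VLM: correct only when given the hint."""
    if hint is not None:
        return {"difficulty": "Hard", "thinking": "guided chain", "answer": hint}
    return {"difficulty": "Easy", "thinking": strategy + " chain", "answer": "unknown"}


sample = build_sample(None, "Is a nodule present?", "yes", stub_vlm,
                      rng=random.Random(0))
```

With the stub above, the loop makes one initial call, three refinement calls, and one guided call before producing the record.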

[0027] S2. Supervised learning fine-tuning.

[0028] The core objective of supervised fine-tuning is to minimize the error between the model's predicted output and the standard sample output, while constraining the model's reasoning norms to achieve a cold-start effect and lay a solid foundation for reinforcement learning. The standard sample output includes both the standard thought process and the true answer. The objective function is defined as: [formula not reproduced in the source], where the terms denote, in order: the input to the large model, comprising multimodal medical images and medical diagnostic questions; the ground-truth labels annotated by experts, comprising the standard thought process and the true diagnostic answer; the expectation operator; a dataset containing N training samples; the large model; the model parameter space; and the loss function of the supervised fine-tuning stage, which takes the dataset as input and the model parameters as the optimization variable.
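Because the formula itself does not survive in this text, the following is only one form consistent with the listed terms. It assumes a negative log-likelihood loss, and the symbols $x$, $y$, $D$, $f_{\theta}$, $\Theta$, and $\mathcal{L}_{\mathrm{SFT}}$ are notation chosen here for illustration rather than taken from the patent.

```latex
\theta^{*} = \arg\min_{\theta \in \Theta} \mathcal{L}_{\mathrm{SFT}}(D;\theta),
\qquad
\mathcal{L}_{\mathrm{SFT}}(D;\theta)
  = -\,\mathbb{E}_{(x,y)\sim D}\big[\log f_{\theta}(y \mid x)\big],
\qquad
D = \{(x_i, y_i)\}_{i=1}^{N}
```

Here $x$ is the model input (questions, with or without images), $y$ is the expert label (standard thought process plus true answer), and $f_{\theta}$ is the large model with parameters $\theta$ drawn from the parameter space $\Theta$.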

[0029] Supervised fine-tuning is carried out in two stages, as detailed below. First, the first stage of supervised fine-tuning uses the large amount of existing plain-text thought chain data and focuses on guiding the model to master standardized reasoning logic and processes. In this stage, the model input contains only text medical diagnostic questions, and the output includes the standard thought process and the true diagnostic answer.

[0030] S2. Use multimodal thinking chain data for supervised learning fine-tuning.

[0031] Using the aforementioned high-quality multimodal thought chain data, a second stage of supervised fine-tuning is performed, focusing on improving the model's understanding of multimodal data and its cross-modal reasoning capabilities, laying a solid foundation for subsequent reinforcement learning. In this stage, the model input combines multimodal medical images with medical diagnostic questions, and the output includes the standard thought process and the true diagnostic answer.

[0032] S3. Design a fine-grained reward function mechanism and complete the reward calculation.

[0033] First, design a fine-grained reward function from multiple dimensions, including accuracy reward, format reward, conciseness reward, length reward, and difficulty diversity reward:

1. Accuracy reward.

[0034] The accuracy reward evaluates the validity and correctness of the model's output answer. It provides fine-grained learning guidance through a three-tiered reward and penalty mechanism:

$$R_{\mathrm{acc}} = \begin{cases} 1.0, & \hat{a} = a^{*} \\ 0.1, & \hat{a} \neq \varnothing \ \text{and}\ \hat{a} \neq a^{*} \\ 0, & \hat{a} = \varnothing \end{cases}$$

where $\hat{a}$ is the answer output by the model, $a^{*}$ is the true answer, and $\hat{a} = \varnothing$ means the model produced no output.

[0035] The design prioritizes incentivizing the model to output an answer rather than to have no response. The reward is 0 when there is no output. A small reward of 0.1 is given for cases where there is output but the answer is wrong. The highest reward of 1.0 is given only for cases where the output and the answer completely match the true value.
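The three-tier rule can be written directly as a small function. The sketch below assumes that answers are compared as strings after trimming whitespace and lowercasing; that normalization, and the function name `accuracy_reward`, are illustrative choices not stated in the text.

```python
from typing import Optional


def accuracy_reward(predicted: Optional[str], truth: str) -> float:
    """0.0 for no output, 0.1 for a wrong answer, 1.0 for an exact match."""
    if predicted is None or not predicted.strip():
        return 0.0  # no output at all
    if predicted.strip().lower() == truth.strip().lower():
        return 1.0  # output matches the true answer (assumed normalization)
    return 0.1      # some output, but the answer is wrong


rewards = [accuracy_reward(p, "pneumonia") for p in (None, "fracture", "Pneumonia ")]
# rewards == [0.0, 0.1, 1.0]
```

The small nonzero reward for a wrong answer is what makes answering preferable to staying silent during early training.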

[0036] 2. Format reward.

[0037] The format reward verifies whether the model response conforms to the preset structure specification. It uses a binary reward and penalty mechanism:

$$R_{\mathrm{fmt}} = \begin{cases} 1, & \text{the output strictly follows the specified format} \\ 0, & \text{otherwise} \end{cases}$$

This constrains the model output to a standard form and ensures a consistent response structure.

[0038] The format specification here refers to standardized data with three parts: difficulty assessment, thought process, and final answer. Each part is enclosed in its own type of tag, and the three types of tags separate the three parts of the output.
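A binary format check of this kind is commonly implemented with a regular expression. In the sketch below, the tag names `<difficulty>`, `<think>`, and `<answer>` are hypothetical, because the actual tag names are not preserved in this text; the required order of the three parts is likewise an assumption.

```python
import re

# Hypothetical tag names; the patent's actual tags are not preserved in the text.
FORMAT_PATTERN = re.compile(
    r"\s*<difficulty>(.+?)</difficulty>"
    r"\s*<think>(.+?)</think>"
    r"\s*<answer>(.+?)</answer>\s*",
    re.DOTALL,
)


def format_reward(response: str) -> float:
    """1.0 if the whole response is the three tagged parts in order, else 0.0."""
    return 1.0 if FORMAT_PATTERN.fullmatch(response) else 0.0


good = "<difficulty>Easy</difficulty>\n<think>The lesion is round.</think>\n<answer>benign</answer>"
bad = "<think>The lesion is round.</think><answer>benign</answer>"
scores = (format_reward(good), format_reward(bad))
```

Using `fullmatch` rejects responses that carry extra text before or after the tagged parts.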

[0039] 3. Difficulty-aware length reward.

[0040] The difficulty-aware length reward is a reward mechanism designed for the model's reasoning chain generation process. Its core objective is to make the model generate reasoning chains whose length and depth fit the difficulty of the question. This avoids the expression problems caused by answers that are too short or too long, and prevents both redundant reasoning on simple questions and oversimplified reasoning on complex questions, ultimately improving the conciseness, clarity, and credibility of the model's reasoning output. The mechanism is implemented by defining question difficulty levels, calculating a dynamic target length, defining difficulty-adapted length intervals, and calculating reward values from actual lengths. Each step is expressed with a quantitative formula, as detailed below:

(1) Definition of problem difficulty levels.

[0041] First, problems are divided into four discrete difficulty levels, forming the difficulty set

$$\mathcal{D} = \{\text{Easy}, \text{Medium}, \text{Hard}, \text{Expert}\}$$

For each difficulty label in this set, the proportion of response samples carrying that label is recorded; these proportions provide the weighting basis for the subsequent dynamic length calculation.

[0042] (2) Dynamic target length calculation.

[0043] The dynamic target length is calculated from the minimum valid response length in the response group generated for the problem, combined with the difficulty proportions and the length tuning factors [formula not reproduced in the source]. The formula includes a lower bound on the length, which prevents the model from generating responses too short to carry any reasoning, and a length tuning factor for each difficulty label, which is negative for the Easy and Medium levels and positive for the Hard and Expert levels. When the proportion of easy problems in the response group is high, the target length decreases, guiding the model toward more concise reasoning; when the proportion of complex problems is high, the target length increases, deepening the model's reasoning.

[0044] (3) Difficulty-adapted target length interval.

[0045] Based on the dynamic target length, an adaptive target length interval is defined for each difficulty level [formula not reproduced in the source]. One coefficient controls the position of the left endpoint of the interval, distinguishing the base length across difficulty levels; another determines the width of the interval, giving the reasoning length a reasonable fluctuation range at each difficulty level. Because the interval scales dynamically with the target length, it avoids imposing rigid constraints on reasoning length.

[0046] (4) Length reward value calculation.

[0047] For a model response with a given difficulty label and actual reasoning length, the reward value is calculated from the relationship between the actual length and the target interval for that label, using a piecewise function [formula not reproduced in the source]. The reward rules are as follows: when the actual reasoning length falls within the target interval, the full reward of 1.0 is given; when the length is below the lower limit or above the upper limit of the interval, the reward decreases as the deviation grows, penalizing unsuitable lengths; any other abnormal situation yields a reward of 0.
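The four steps above can be illustrated with a rough sketch. Since the formulas are not preserved in this text, every functional form and constant below is an assumption made only for illustration: the multiplicative scaling of the minimum valid length, the values in `ALPHA`, `LEFT`, and `WIDTH`, the floor of 32 tokens, and the linear decay outside the interval.

```python
from typing import Dict, Tuple

LEVELS = ("Easy", "Medium", "Hard", "Expert")
# All constants are assumed values, not taken from the patent.
ALPHA = {"Easy": -0.3, "Medium": -0.1, "Hard": 0.2, "Expert": 0.4}  # tuning factors
LEFT = {"Easy": 0.5, "Medium": 0.8, "Hard": 1.0, "Expert": 1.3}     # left endpoints
WIDTH = {"Easy": 0.5, "Medium": 0.6, "Hard": 0.8, "Expert": 1.0}    # interval widths


def dynamic_target_length(min_valid_len: float,
                          proportions: Dict[str, float],
                          floor: float = 32.0) -> float:
    """Scale the shortest valid response by the difficulty mix, with a lower bound."""
    scale = 1.0 + sum(proportions.get(d, 0.0) * ALPHA[d] for d in LEVELS)
    return max(floor, min_valid_len * scale)


def target_interval(level: str, target: float) -> Tuple[float, float]:
    """Interval whose endpoints scale with the dynamic target length."""
    low = LEFT[level] * target
    return low, low + WIDTH[level] * target


def length_reward(level: str, actual_len: float, target: float) -> float:
    """1.0 inside the interval, linear decay outside, 0.0 for abnormal input."""
    if level not in LEVELS or actual_len <= 0:
        return 0.0
    low, high = target_interval(level, target)
    if low <= actual_len <= high:
        return 1.0
    deviation = low - actual_len if actual_len < low else actual_len - high
    return max(0.0, 1.0 - deviation / target)


target = dynamic_target_length(100, {"Easy": 1.0})  # all responses labeled Easy
inside = length_reward("Easy", 50, target)
too_long = length_reward("Easy", 105, target)
```

In this toy setting, a group dominated by Easy labels shortens the target, while a group dominated by Expert labels lengthens it, which matches the qualitative behavior the text describes.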

[0048] Overall, the difficulty-aware length reward achieves adaptive matching of inference length to problem difficulty through a fully quantified design of difficulty quantification, dynamic length setting, interval adaptation, and segmented reward and punishment. This allows the model's inference depth to match the problem complexity, balancing the efficiency and credibility of the inference output.

[0049] 4. Difficulty diversity reward.

[0050] The difficulty diversity reward is a reinforcement learning reward mechanism designed to address the noise in difficulty labels and the dynamic changes in the model's difficulty perception after cold-start supervised fine-tuning, and to break free from dependence on static human priors. The core objective is to enable the model to assess problem difficulty in a dynamic and self-consistent way, and to let the difficulty assessment effectively guide the reasoning process, ultimately improving the model's difficulty judgment ability.

[0051] This reward is not a single metric but a multi-dimensional design that promotes the diversity and balance of difficulty label predictions across the model's response group for the same question. It also incorporates the difficulty-aware length reward, adjusting the reasoning depth to the predicted difficulty so that the model can explore effective responses at different difficulty levels. Through comparisons within the group, the model learns better difficulty assessment and avoids generating arbitrary, meaningless difficulty labels. The specific metrics include:

(1) Core basic indicators: construction of the base reward.

[0052] For the difficulty label set containing 4 labels, and based on the proportion of each label within the response group, three indicators are designed and combined by weighting to obtain the base reward:
Entropy: measures the diversity of the distribution of difficulty labels; a small constant is included to ensure numerical stability [formula not reproduced in the source];
Balance: penalizes skewed distributions, pushing the proportion of each label toward 0.25 [formula not reproduced in the source];
Coverage: measures the number of difficulty labels that actually appear in the response group relative to the total number of labels, reflecting the richness of label coverage [formula not reproduced in the source].
The base reward combines the three indicators [formula not reproduced in the source]; the result is normalized to a fixed range, and weighting coefficients balance the contributions of balance and coverage.

[0053] (2) Targeted adjustments to basic rewards to avoid tag convergence and encourage the exploration of rare tags.

[0054] To prevent the model from over-relying on common difficulty labels and ignoring rare ones, the base reward is dynamically adjusted in two ways: if the proportion of a label exceeds an upper threshold, that is, the label is dominant, a decay factor is introduced to down-weight the base reward; if the proportion of a label falls below a lower threshold, that is, the label is rare, a reward bonus is added to the base reward.

[0055] (3) Final difficulty diversity reward.

[0056] A rule-based final reward is set by combining label validity and answer correctness. Rewards are only calculated for responses with valid labels and correct answers; otherwise, the reward is zero. The specific rules are as follows:
Invalid label or incorrect answer: the reward is 0;
Valid label + correct answer + dominant label: the base reward is down-weighted by the decay factor;
Valid label + correct answer + rare label: the reward bonus is added to the base reward, with a maximum limit of 1.0 to prevent reward overflow;
Other valid cases: the base reward is used unchanged.

[0057] By promoting a balanced and diverse prediction of the difficulty label for the same problem, the model learns from reasoning processes at different difficulty levels during the reinforcement learning fine-tuning process. This achieves an effective linkage between difficulty assessment and reasoning depth, ultimately gradually improving the model's autonomous and accurate problem difficulty assessment capabilities.

[0058] 5. Conciseness reward.

[0059] The conciseness reward addresses two problems: the content redundancy that supervised fine-tuning may introduce, and a weakness of the length reward, under which the model may generate repetitive, meaningless content merely to meet the length target, damaging the conciseness and credibility of the content. Its core objective is to guide the model to generate more meaningful and necessary content.

[0060] This reward integrates three core components that jointly measure content relevance, surface redundancy, and semantic redundancy, and outputs a single reward value. The specific composition and calculation method are as follows: (1) Question relevance score.

[0061] The core function of the question relevance score $S_{rel}$ is to penalize sentences irrelevant to the question, ensuring that the content generated by the model stays close to the core issue. As a precondition, the question and the response are each segmented into sentences at punctuation marks, giving a question sentence set $\mathcal{S}_q$ and a response sentence set $\mathcal{S}_o$; a Sentence Transformers model then computes the question sentence embeddings and the response sentence embeddings. The score is calculated as: $S_{rel} = 1 - \frac{1}{|\mathcal{S}_o|}\sum_{s \in \mathcal{S}_o} \mathbb{I}\left[\mathrm{sim}(s, \mathcal{S}_q) < \tau_{rel}\right]$, where $\mathrm{sim}$ is the cosine similarity between the embedding of response sentence $s$ and the question sentence embeddings, $\tau_{rel}$ is the lower threshold below which a sentence is judged irrelevant to the question, and $\mathbb{I}[\cdot]$ is an indicator function that returns 1 if the condition in the brackets is met and 0 otherwise. The subtracted term is the proportion of irrelevant sentences in the response; therefore, the closer $S_{rel}$ is to 1, the stronger the relevance of the response to the question.
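A minimal sketch of the relevance score follows, operating on precomputed embedding vectors so that it runs without any model download. Two points are assumptions rather than statements of the patent: a response sentence is compared against its best-matching question sentence (the maximum cosine similarity), and the threshold value `tau` is a placeholder. In practice the embeddings would come from a Sentence Transformers model.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def relevance_score(question_embs, response_embs, tau=0.3):
    """1 minus the fraction of response sentences judged irrelevant.

    ASSUMPTION: a response sentence is irrelevant when its best cosine
    similarity to any question sentence is below tau (placeholder value).
    """
    if not question_embs or not response_embs:
        return 0.0
    irrelevant = sum(
        1
        for r in response_embs
        if max(cosine(q, r) for q in question_embs) < tau
    )
    return 1.0 - irrelevant / len(response_embs)
```

With one question sentence and two response sentences, one aligned with the question and one orthogonal to it, half of the response is irrelevant and the score is 0.5.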

[0062] (2) Surface repetition penalty.

[0063] The core function of the surface repetition penalty $P_{rep}$ is to penalize repeated sentences and n-grams in the model's response, preventing the model from simply repeating content to pad the length. As a precondition, the response sentences are further segmented into n-grams; $\mathcal{S}_o$ and $\mathcal{N}_o$ denote the set of response sentences and the set of n-grams, respectively.

[0064] The penalty is calculated as $P_{rep} = \frac{N_{sent}^{rep}}{|\mathcal{S}_o|} + \frac{N_{gram}^{rep}}{|\mathcal{N}_o|}$, where the first term is the number of repeated sentences $N_{sent}^{rep}$ divided by the total number of sentences $|\mathcal{S}_o|$, and the second term is the number of repeated n-grams $N_{gram}^{rep}$ divided by the total number of n-grams $|\mathcal{N}_o|$. The higher the value, the more severe the surface repetition and the stronger the penalty.
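The two-term penalty can be illustrated as below. The sketch assumes that a "repeated" item is any occurrence beyond the first of an identical sentence or n-gram, that tokens are whitespace-separated, and that n = 3; none of these choices is fixed by the text.

```python
import re

def split_sentences(text):
    """Split at common Western and CJK sentence punctuation, lower-cased."""
    parts = re.split(r"[.!?;。！？；]+", text)
    return [p.strip().lower() for p in parts if p.strip()]

def ngrams(sentence, n=3):
    """Whitespace-token n-grams of one sentence."""
    tokens = sentence.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def surface_repetition_penalty(text, n=3):
    """Repeated sentences / total sentences + repeated n-grams / total n-grams.

    ASSUMPTION: every occurrence after the first counts as a repeat.
    """
    sentences = split_sentences(text)
    grams = [g for s in sentences for g in ngrams(s, n)]
    sent_term = (len(sentences) - len(set(sentences))) / len(sentences) if sentences else 0.0
    gram_term = (len(grams) - len(set(grams))) / len(grams) if grams else 0.0
    return sent_term + gram_term
```

A response that states the same sentence twice is penalized on both terms, whereas a response with no repeated sentence or n-gram receives a penalty of zero.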

[0065] (3) Semantic redundancy penalty.

[0066] The core function of the semantic redundancy penalty $P_{sem}$ is to penalize semantically similar sentences in the response, preventing the model from generating redundant content that differs in wording but not in meaning. It is calculated from the embedding vectors of the response sentences, using cosine similarity to determine the semantic similarity between sentences.

[0067] In the calculation formula, $\tau_{sem}$ is the threshold for judging semantic redundancy: sentence pairs whose cosine similarity is higher than this threshold are considered semantically redundant. The coefficient at the beginning of the formula is used for normalization, ensuring that $P_{sem}$ stays within a reasonable range. The higher the value, the more severe the semantic redundancy and the stronger the penalty.
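As a rough sketch, the penalty can be computed as the fraction of sentence pairs whose embeddings are more similar than the threshold. The normalization coefficient used here, 2 / (n(n-1)) for n sentences (i.e., division by the number of pairs), and the threshold value are assumptions; the text states only that a leading coefficient normalizes the result.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def semantic_redundancy_penalty(sentence_embs, tau=0.85):
    """Fraction of sentence pairs with cosine similarity above tau.

    ASSUMPTION: normalization by the number of pairs, n * (n - 1) / 2.
    """
    n = len(sentence_embs)
    if n < 2:
        return 0.0
    redundant = sum(
        1 for u, v in combinations(sentence_embs, 2) if cosine(u, v) > tau
    )
    return redundant / (n * (n - 1) / 2)
```

With three sentences of which two share the same direction in embedding space, one of the three pairs is redundant, giving a penalty of one third.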

[0068] (4) Final conciseness reward formula.

[0069] The three components are weighted and fused to obtain the final conciseness reward, with a minimum reward of 0.0 to avoid the unreasonable impact of negative rewards: $R_{con} = \max\left(0,\ \alpha S_{rel} - \beta P_{rep} - \gamma P_{sem}\right)$, where $S_{rel}$ is the question relevance score, $P_{rep}$ is the surface repetition penalty, $P_{sem}$ is the semantic redundancy penalty, and $\alpha$, $\beta$, $\gamma$ are positive weighting coefficients that adjust the contributions of the three components. $\alpha$ strengthens the positive incentive for relevance to the question, while $\beta$ and $\gamma$ strengthen the negative penalties for surface repetition and semantic redundancy, respectively; adjusting the weights adapts the reward to the conciseness requirements of different scenarios.
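The fusion step is small enough to show directly. The default weights below are placeholders chosen for illustration; the text requires only that they be positive.

```python
def conciseness_reward(s_rel, p_rep, p_sem, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted fusion of relevance and the two penalties, floored at 0.0.

    alpha, beta, gamma are positive weights (placeholder defaults).
    """
    if min(alpha, beta, gamma) <= 0:
        raise ValueError("weights must be positive")
    return max(0.0, alpha * s_rel - beta * p_rep - gamma * p_sem)
```

For example, a highly relevant response with light repetition keeps most of its relevance score, while a response whose penalties outweigh its relevance is floored at zero rather than receiving a negative reward.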

[0070] By positively incentivizing content highly relevant to the problem and negatively penalizing repetitive and redundant content, conciseness rewards effectively curb the reward defects that length rewards may cause, solve the content redundancy problem brought about by supervised fine-tuning, guide the model to generate more refined, meaningful and credible responses, and improve the overall performance of the model in synergy with other rewards such as difficulty diversity rewards.

[0071] S4. Feed the reward values into the dynamic reward scheduling mechanism to adjust the weight coefficients.

[0072] The dynamic reward scheduling mechanism is proposed to address the multi-objective learning challenges faced in the joint optimization of multiple reward signals. Its core function is to adaptively adjust the weights of each reward during model training to achieve balanced optimization of all rewards. The mechanism's design is inspired by the core understanding that more difficult and important rewards should receive greater attention. Specifically, it involves: first, quantifying the training difficulty of each reward; and then dynamically adjusting the weight-related parameters of each reward based on the difficulty, thus shifting the training focus towards rewards that are more difficult to optimize, while also considering the relative importance of rewards based on human prior knowledge.

[0073] Assume there are $K$ types of rewards and a sampling group containing $G$ samples. The training difficulty of each reward is quantified through the following steps: 1. Reward cap $c_k$.

[0074] For each reward $r_k$, the cap $c_k$ denotes its current upper limit, i.e., the maximum reward value that can be achieved. It is dynamically adjusted according to the training difficulty and constrained within a preset range. The preset range is defined a priori by humans and reflects the relative importance of the corresponding reward.

[0075] 2. Normalized score $s_k$, the difficulty measure.

[0076] The normalized score $s_k$ measures the training difficulty of reward $r_k$. It is the ratio of the average value of the reward over the samples in the sampling group to the reward's own cap $c_k$: $s_k = \frac{1}{c_k} \cdot \frac{1}{G}\sum_{i=1}^{G} r_k^{(i)}$, where $r_k^{(i)}$ is the value of reward $r_k$ for the $i$-th sample in the sampling group. The score expresses how close the observed average reward is to the reward cap, and its value directly reflects the difficulty of optimizing the reward. The lower $s_k$ is, the harder it is for the reward to reach its cap, the greater the optimization difficulty, and the more training emphasis it should receive. The higher $s_k$ is, the easier it is for the reward to reach its cap, the lower the optimization difficulty, and the less training emphasis it needs.

[0077] 3. Dynamic update rule for the reward cap.

[0078] The core of dynamic scheduling is to adjust the reward cap $c_k$, which indirectly alters the weight of each reward: for rewards that are more difficult to optimize, i.e., those with a lower normalized score $s_k$, the cap is raised to increase attention; for rewards that are easier to optimize, i.e., those with a higher $s_k$, the cap is lowered to reduce attention. The update rule distinguishes two cases and uses two positive coefficients that control the size of the adjustment to $c_k$; the adjustment should be small, to avoid excessive changes that could affect training stability. When $s_k < 1.0$, the reward has not reached its cap and there is room for optimization: the adjustment is positive, and the lower $s_k$ is (the higher the difficulty), the larger the adjustment, the more $c_k$ increases, and the more important the reward becomes. When $s_k = 1.0$, the reward has reached its current cap and the optimization difficulty is extremely low: the adjustment is a fixed negative value that reduces $c_k$, which reduces the training resources devoted to that reward.
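One scheduling step could look like the sketch below. The concrete update sizes are assumptions: the increase is taken to be proportional to the gap `1 - score` with coefficient `alpha`, the decrease is a fixed step `beta`, and the result is clamped to the preset range. All names and default values are hypothetical.

```python
def normalized_score(rewards, cap):
    """Mean reward of the sampling group divided by the current cap."""
    return (sum(rewards) / len(rewards)) / cap

def update_cap(cap, score, cap_min, cap_max, alpha=0.05, beta=0.02):
    """One update of a reward cap from its normalized score.

    ASSUMED rule: raise the cap by alpha * (1 - score) while the reward is
    below its cap; lower it by a fixed beta once the cap is reached; always
    stay inside the human-defined range [cap_min, cap_max].
    """
    if score < 1.0:
        delta = alpha * (1.0 - score)  # harder reward -> larger increase
    else:
        delta = -beta                  # saturated reward -> fixed decrease
    return min(cap_max, max(cap_min, cap + delta))
```

Keeping `alpha` and `beta` small follows the text's requirement that adjustments stay small for training stability; the clamp is what lets the preset range encode the relative importance of each reward.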

[0079] By dynamically quantifying the training difficulty of each reward and adaptively adjusting the reward cap, this mechanism automatically allocates training priorities, giving more attention to rewards that are more difficult to optimize. It thereby addresses the imbalance problem in multi-reward joint optimization, promotes balanced improvement of all rewards during training, and ultimately improves the overall performance of the model.

[0080] S5. The sum of the adjusted reward functions is used as the optimization objective for fine-tuning in reinforcement learning.

[0081] With maximization of the sum of the reward functions as the goal, the GRPO algorithm is used to guide the model to learn the correct reasoning behavior. The algorithm works on the group of responses sampled for each question and constructs the optimization objective function by calculating reward values and normalized advantages. The core idea is to simplify the original formula to achieve faster convergence and more stable training. The specific design steps are as follows: 1. Definition of the response group.

[0082] For each question $q$, a group containing $G$ sampled responses and their corresponding rewards is constructed: $\{(o_i, r_i)\}_{i=1}^{G}$, where $y$ is the true label for question $q$, $o_i$ is the $i$-th sampled response, and $r_i$ is the total reward value corresponding to $o_i$, obtained by summing the 5 reward components.

[0083] 2. Calculation of the total reward value.

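[0084] The total reward $r_i$ of response $o_i$ is the sum of the 5 reward components (accuracy, format, conciseness, length, and difficulty diversity): $r_i = r_i^{acc} + r_i^{fmt} + r_i^{con} + r_i^{len} + r_i^{div}$. 3. Calculation of the average reward and standard deviation.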

[0085] For each response group, the average $\mu$ and standard deviation $\sigma$ of all reward values are calculated, providing the basis for subsequent advantage normalization: $\mu = \frac{1}{G}\sum_{i=1}^{G} r_i$, $\sigma = \sqrt{\frac{1}{G}\sum_{i=1}^{G}(r_i - \mu)^2}$. 4. Definition of the normalized advantage.

[0086] For each response $o_i$ within the group, the normalized advantage is defined from its reward value $r_i$, the group average reward $\mu$, and the standard deviation $\sigma$: $A_i = \frac{r_i - \mu}{\sigma + \delta}$, where $\delta$ is a small constant whose purpose is to avoid a zero denominator and to ensure the stability of the calculation.
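Steps 2 to 4 can be sketched together: sum the five components per response, then standardize the totals within the group. The use of the population standard deviation (division by the group size) and the value of the small constant are assumptions of this sketch.

```python
import math

def total_reward(components):
    """Total reward of one response: the sum of its 5 reward components."""
    if len(components) != 5:
        raise ValueError("expected 5 reward components")
    return sum(components)

def group_advantages(rewards, delta=1e-6):
    """Normalized advantage of each response within one response group.

    ASSUMPTION: population standard deviation; delta guards against a zero
    denominator when all rewards in the group are equal.
    """
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + delta) for r in rewards]
```

When every response in a group earns the same total reward, all advantages are zero and the group contributes no learning signal, which is one reason diverse rewards within a group matter.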

[0087] 5. GRPO optimization objective function.

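[0088] The GRPO algorithm updates the policy by optimizing an objective function built from the old policy $\pi_{\theta_{old}}$, the current policy $\pi_\theta$, and the normalized advantage $A_i$: $\max_{\theta \in \Theta} J(\theta) = \mathbb{E}_{q \sim D,\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(\cdot \mid q)}\left[\frac{1}{G}\sum_{i=1}^{G} \min\left(\rho_i A_i,\ \mathrm{clip}(\rho_i, 1-\varepsilon, 1+\varepsilon)\, A_i\right)\right]$ with $\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)}$, where $D$ is the dataset, $\varepsilon$ is the clipping threshold for policy updates, $\Theta$ is the parameter space of the policy model, and $\mathrm{clip}(\cdot)$ is a numeric clipping operation that keeps the value within $[1-\varepsilon, 1+\varepsilon]$.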
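The clipped surrogate for a single response group can be sketched as follows. For simplicity the sketch works at the sequence level, taking one log-probability per response under the current and old policies; token-level averaging, batching over questions, and gradient computation are omitted, and the clipping threshold is a placeholder value.

```python
import math

def grpo_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective averaged over one response group.

    logp_new / logp_old: log-probability of each sampled response under the
    current and the old policy (sequence-level simplification).
    """
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum removes any incentive to move the ratio
        # outside [1 - clip_eps, 1 + clip_eps].
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)
```

If the current and old policies coincide, every ratio is 1 and the objective reduces to the mean advantage; once a ratio leaves the clipping interval in the direction favored by its advantage, the term stops growing.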

[0089] The large model is trained based on the optimization objective function to obtain the trained large model. The trained large model is then used to generate intermediate results for medical image-assisted diagnosis, which can serve as an auxiliary reference for doctors in making medical diagnoses.

[0090] The overall architecture is shown in Figure 2. First, a multimodal thinking chain data generation mechanism is designed to generate a small amount of high-quality multimodal thinking chain data. Second, plain text thinking chain data and the generated multimodal thinking chain data are combined for supervised fine-tuning of the large model to achieve a cold start. Finally, reinforcement learning is used to further stimulate the reasoning ability of the large model. The reward function design for reinforcement learning covers multiple dimensions: accuracy rewards for correct answers, format rewards for following a specified format, conciseness rewards for concise and effective answers, length rewards for answers within a specified range, and difficulty diversity rewards for generating diverse answers from different difficulty perspectives. The calculated reward values for each dimension are fed into a dynamic reward scheduling mechanism, which adaptively adjusts the weight coefficients between the reward functions to achieve balanced optimization of multiple reward objectives.

[0091] If the above methods are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a personal computer, server, or network device to execute all or part of the steps of the methods of the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0092] To verify the performance of the above method, the following experiment was designed in this embodiment.

[0093] This embodiment uses the OmniMedVQA dataset as the benchmark dataset. The dataset contains 82,059 medical images and 88,996 visual question answering (VQA) pairs, providing a sufficient sample size. It includes eight medical image modalities: Computed Tomography (CT), Magnetic Resonance Imaging (MRI), X-ray Imaging (X-ray), Ultrasound (US), Dermoscopy (Dermo), Fundus Photography (FP), Optical Coherence Tomography (OCT), and Microscopy (Micro). Furthermore, all VQA pairs are categorized into five task types, covering core medical image-related tasks: Anatomy Identification, Disease Diagnosis, Lesion Grading, Modality Recognition, and Other Biological Attributes.

[0094] The overall experimental results are as follows: The proposed method was compared with recent state-of-the-art methods, including: (1) general large models BLIP-2, InstructBLIP, Qwen2-VL and Qwen2.5-VL series models; (2) medical large models LLaVA-Med, RadFM, Med-Flamingo, MedVInT; and (3) baseline models trained with basic supervised fine-tuning and reinforcement fine-tuning methods.

[0095] Table 1. Comparison with state-of-the-art large models across eight medical modalities. Table 2. Comparison with state-of-the-art large models on five medical tasks. Table 1 compares the performance of the proposed method with the state-of-the-art models on the eight medical modality sub-datasets of the OmniMedVQA dataset; Table 2 compares the performance of each model on the five medical task sub-datasets of the same dataset.

[0096] As shown in Figure 3, existing methods that skip supervised fine-tuning and apply reinforcement learning fine-tuning directly to general multimodal large models suffer a rapid decrease in the model's inference length during training, making it difficult to guarantee inference depth and reliability. The method of this invention introduces supervised fine-tuning and combines it with a fine-grained reward function design in the reinforcement learning stage, which keeps the model's inference length within a reasonable range throughout training, ensuring inference depth and stability. The method was tested extensively on the eight medical modality sub-datasets and five medical diagnostic task sub-datasets of the OmniMedVQA dataset. The experimental results show that its overall performance is significantly better than that of existing methods, further verifying the effectiveness and superiority of this invention.

[0097] Experimental results show that in all eight medical modalities and five medical tasks, the proposed method significantly outperforms the general-domain large model, the medical-specific large model, and various models fine-tuned under the Zero-shot setting. Furthermore, in the vast majority of modalities and tasks, the proposed method also significantly outperforms the Med-R1 model, which also employs reinforcement learning fine-tuning. It should be noted that the Med-R1 model directly performs reinforcement learning fine-tuning on the general-domain Qwen2 / 2.5-VL series models. Its training process does not incorporate supervised fine-tuning to provide a good initial starting point for reinforcement learning, and its reward function design only includes accuracy and format rewards, lacking targeted optimization. The above comparative results fully validate the effectiveness and universality of the proposed method.

[0098] Example 2. Based on the foregoing embodiment of the multimodal large model reliable reasoning method for medical image-assisted diagnosis, this embodiment provides a multimodal large model reliable reasoning system for medical image-assisted diagnosis. The system is used to execute the method and specifically includes: an input module, a supervised learning fine-tuning module, a reinforcement learning fine-tuning module, and an output module.

[0099] The input module is used to input thought chain data, which includes plain text thought chain data and generated multimodal thought chain data. The supervised learning fine-tuning module is used to minimize the error between the predicted output of the large model and the standard output of the samples; The reinforcement learning fine-tuning module is used to optimize the objective function by using a fine-grained reward function as a constraint. The output module is used to enable the trained and optimized large model to output reliable inference results for medical image-assisted diagnosis, which can serve as an auxiliary reference for doctors in making medical diagnoses.

[0100] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present invention, and these modifications or substitutions should all be covered within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

Claims

1. A multimodal large-model reliable reasoning method for medical image-assisted diagnosis, characterized in that, The method specifically includes: S1. Obtain plain text thinking chain data, construct a thinking chain data generation mechanism, and generate multimodal thinking chain data, wherein the multimodal thinking chain data includes medical images, medical diagnostic questions, real answers, thinking processes, and difficulty assessments; S2. Construct an objective function and use the plain text thought chain data and the multimodal thought chain data to perform supervised fine-tuning of the preset large model; S3. Construct a fine-grained reward function mechanism, calculate the reward value of each reward function, adjust the weight coefficients of each reward function based on the dynamic reward scheduling mechanism, and construct the reinforcement learning objective function with the constraint of maximizing the sum of the reward values ​​of each reward function; S4. The large model fine-tuned by supervised learning is trained using the reinforcement learning objective function to obtain the trained large model. The trained large model is then used to generate intermediate results for medical image-assisted diagnosis, which can serve as an auxiliary reference for doctors in making medical diagnoses.

2. The multimodal large-model reliable reasoning method for medical image-assisted diagnosis according to claim 1, characterized in that, The steps for generating the multimodal thought chain data include: Set up medical images, medical diagnostic questions, and real answers; input the medical images and medical diagnostic questions into a large model to generate a preliminary thought process and corresponding answers. The generated answer is compared and evaluated with the real answer. If the answer is correct, the preliminary thought chain is collected as the correct thought chain. If the answer is wrong, the preliminary thought chain is iteratively optimized. Standardized data, including difficulty assessment, correct thought chain, and true answer, is generated as multimodal thought chain data.

3. The multimodal large-model reliable reasoning method for medical image-assisted diagnosis according to claim 1, characterized in that, The objective function in S2 minimizes, over the model parameter space $\Theta$, the expected loss $\mathcal{L}_{SFT}(D; \theta)$ of the large model $f_\theta$ on the dataset $D$, wherein $x$ is the input of the large model, including multimodal medical images and medical diagnostic questions; $y$ is the true label provided by experts, including the standard thought process and the true diagnostic answer; $\mathbb{E}$ is the expectation operation; $D$ is a dataset containing N training samples; $f_\theta$ denotes the large model; $\Theta$ is the model parameter space; and $\mathcal{L}_{SFT}(D; \theta)$ is the loss function of the supervised fine-tuning stage, which takes the dataset $D$ as input and the model parameters $\theta$ as the object of optimization.

4. The multimodal large-model reliable reasoning method for medical image-assisted diagnosis according to claim 3, characterized in that, The supervised fine-tuning specifically includes a first-stage supervised fine-tuning and a second-stage supervised fine-tuning: In the first-stage supervised fine-tuning, the input includes only medical diagnostic questions, and the output includes standard thought processes and true diagnostic answers; In the second-stage supervised fine-tuning, the input includes a combination of multimodal medical images and medical diagnostic questions, and the output includes standard thought processes and true diagnostic answers.

5. The multimodal large-model reliable reasoning method for medical image-assisted diagnosis according to claim 1, characterized in that, The reward function includes accuracy reward, format reward, simplicity reward, length reward, and difficulty diversity reward.

6. The multimodal large-model reliable reasoning method for medical image-assisted diagnosis according to claim 5, characterized in that, The accuracy reward assesses the validity and correctness of the output answer; The format reward verifies whether the model's response conforms to the preset structure specification; The conciseness reward comprehensively measures content relevance, surface repetition, and semantic redundancy, and outputs a reward value; The length reward defines the problem difficulty level, calculates a dynamic target length, delineates a difficulty-adapted length range, and calculates the reward value according to the actual length, so that the model generates an inference chain that is adapted in both length and depth; The difficulty diversity reward evaluates, through multi-dimensional indicators, the diversity and balance of the difficulty labels predicted by the model in the response group generated for the same problem, and assesses the difficulty of the problem in combination with the length reward value.

7. The multimodal large-model reliable reasoning method for medical image-assisted diagnosis according to claim 1, characterized in that, The dynamic reward scheduling mechanism specifically includes: quantifying the training difficulty of each reward, and dynamically adjusting the weight parameters of each reward based on the training difficulty and taking into account the importance of rewards based on human priors.

8. The multimodal large-model reliable reasoning method for medical image-assisted diagnosis according to claim 1, characterized in that, The construction of the reinforcement learning objective function specifically includes: With maximization of the sum of the reward functions as the goal, the GRPO algorithm is used to guide the model to learn the correct reasoning behavior. Based on the response group corresponding to the question, the total reward value is calculated from each fine-grained reward function. The average and standard deviation of all reward values are calculated to obtain the normalized advantage. The reinforcement learning objective function is constructed based on the objective function and the normalized advantage.

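9. A multimodal large-model reliable reasoning method for medical image-assisted diagnosis according to claim 1, characterized in that, The reinforcement learning objective function is represented as: $\max_{\theta \in \Theta} J(\theta) = \mathbb{E}_{q \sim D,\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(\cdot \mid q)}\left[\frac{1}{G}\sum_{i=1}^{G} \min\left(\rho_i A_i,\ \mathrm{clip}(\rho_i, 1-\varepsilon, 1+\varepsilon)\, A_i\right)\right]$ with $\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)}$, wherein $\mathbb{E}$ is the expectation operation, $D$ is the dataset, $G$ is the number of responses sampled for each sample, $o_i$ is the response obtained from the $i$-th sampling, $q$ is the question, $\pi_{\theta_{old}}$ is the old policy, $\pi_\theta$ is the current policy, $A_i$ is the normalized advantage, $\varepsilon$ is the clipping threshold for policy updates, $\Theta$ is the parameter space of the policy model, and $\mathrm{clip}(\cdot)$ is a numeric clipping operation that keeps the value within $[1-\varepsilon, 1+\varepsilon]$.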

10. A multimodal large-model reliable reasoning system for medical image-assisted diagnosis, characterized in that, The system is used to execute the multimodal large model reliable reasoning method for medical image-assisted diagnosis as described in any one of claims 1-9. The system includes an input module, a supervised learning fine-tuning module, a reinforcement learning fine-tuning module, and an output module. The input module is used to input thought chain data, which includes plain text thought chain data and generated multimodal thought chain data. The supervised learning fine-tuning module is used to minimize the error between the predicted output of the large model and the standard output of the samples; The reinforcement learning fine-tuning module is used to optimize the objective function by using a fine-grained reward function as a constraint. The output module is used to enable the trained and optimized large model to output reliable inference results for medical image-assisted diagnosis, which can serve as an auxiliary reference for doctors in making medical diagnoses.