Application method and apparatus for multi-modal medical large model, fine-tuning method and apparatus for multi-modal medical large model, and computer device

By integrating multiple source medical models and singular value decomposition technology, a multimodal semantic alignment model and an integrated enhanced medical large model are constructed, which solves the problems of computational resources and data scarcity in multimodal medical large models and improves the diagnostic accuracy and resource utilization efficiency of the models.

WO2026123256A1 · PCT designated stage · Publication Date: 2026-06-18 · SHENZHEN INST OF ADVANCED TECH CHINESE ACAD OF SCI


Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: SHENZHEN INST OF ADVANCED TECH CHINESE ACAD OF SCI
Filing Date: 2024-12-11
Publication Date: 2026-06-18

Application Information

Patent Timeline

11 Dec 2024: Application
18 Jun 2026: Publication (WO2026123256A1)

IPC: G16H50/20

AI Tagging

Application Domain

Medical automated diagnosis

Similar Technology Patents

Vessel key point position determination method and device, computer device, and storage medium
CN116030071B · Image analysis · Medical automated diagnosis
Methods and tools for assessing cardiometabolic health and hidden disease risk among apparently healthy individuals
US20260162828A1 · Metabolism disorder · Health-index calculation
A multi-dimensional non-invasive combined screening method for polycystic ovary syndrome
CN122201726A · Improve screening effect · High sensitivity · Medical data mining · Health-index calculation
A light multi-scale enhanced polyp segmentation system based on spatial channel grouping decoupling
CN122201660A · Image enhancement · Image analysis
Hazard based assessment patterns
US12658318B2 · Health-index calculation · Epidemiological alert systems

AI Technical Summary

⚠Technical Problem

Existing multimodal medical large models face challenges such as high computational resource requirements, data scarcity, and high annotation costs during training and fine-tuning, making it difficult to effectively optimize model performance in resource-limited environments.

⚗Method used

By integrating the knowledge and capabilities of multiple source medical models, a multimodal semantic alignment model and an integrated augmented medical model are constructed. Multimodal feature data is used for semantic alignment and feature fusion to optimize the use of computing resources. Singular value decomposition and parameter quantization are used to reduce the amount of model adjustment.

🎯Benefits of technology

It improves the model's generalization ability and diagnostic accuracy, reduces dependence on data resources, and enhances the processing capability and diagnostic reliability of multimodal medical data.

✦ Generated by Eureka AI based on patent content.

Abstract

The present specification relates to the technical field of natural language processing, and particularly relates to an application method and apparatus for a multi-modal medical large model, a fine-tuning method and apparatus for a multi-modal medical large model, and a computer device. The application method comprises: acquiring multi-modal feature data; and inputting the multi-modal feature data into a pre-trained multi-modal medical large model, so as to obtain an output result, wherein the multi-modal medical large model comprises a multi-modal semantic alignment model and an integrated and enhanced medical large model, and the multi-modal semantic alignment model is obtained by means of training an initial multi-modal semantic alignment model on the basis of multi-modal feature sample vectors, representation extraction sample vectors, and semantic labels corresponding to the multi-modal feature sample vectors. The present invention integrates the knowledge and capabilities of multiple source medical models, and reduces the dependence on data resources, thereby improving the generalization capability of a model, and enhancing the accuracy and reliability of diagnosis; and multi-modal features are extracted, and the semantic alignment effect thereof is optimized, thereby avoiding information loss and semantic deviation.

Description

Multimodal medical large model applications and fine-tuning methods, devices and computer equipment

Technical Field

[0001] This invention relates to the field of natural language processing technology, and in particular to the application and fine-tuning methods, devices and computer equipment for multimodal medical large models.

Background Technology

[0002] Artificial intelligence (AI) technology has demonstrated broad application prospects in the healthcare field. As a representative of AI in medicine, multimodal medical large models have the potential to process complex data and provide precise treatment plans. They can integrate various medical data, including images, genes, and electronic medical records, to support disease diagnosis, risk prediction, and personalized treatment. With the rapid development of medical large models, their parameter scale has expanded from hundreds of millions to tens or even hundreds of billions, enabling a deeper understanding of the mechanisms and individual differences of neuropsychiatric diseases, the discovery of new disease biomarkers, and the prediction of disease risk and prognosis. This technological advancement provides strong support for precision medicine and personalized intervention for neuropsychiatric diseases, greatly promoting the progress of medical research and clinical applications, and bringing new possibilities for the early detection, diagnosis, and treatment of neuropsychiatric diseases.

[0003] However, the training and fine-tuning process of large-scale medical models also faces unprecedented challenges. As the model size increases, the required computational and data resources grow exponentially. High-dimensional medical data typically possesses complex structures and diversity, requiring substantial computational power for processing. Especially in resource-constrained environments, acquiring sufficient high-performance computing devices, such as GPUs, becomes increasingly difficult. Furthermore, the scarcity of multimodal medical data and the high cost of annotation further complicate the construction of large-scale medical models. Therefore, optimizing the use of computational resources and effectively fine-tuning large-scale medical models without sacrificing model performance has become a crucial research direction. The continued development of multimodal medical models will bring more possibilities to precision medicine, helping to deepen the understanding of disease mechanisms, discover new diagnostic biomarkers, and advance the application of personalized medicine and intervention methods.

Summary of the Invention

[0004] To address the problems in the existing technology, this specification provides methods, devices, and computer equipment for the application and fine-tuning of multimodal medical large models.

[0005] This specification provides an embodiment of a method for applying a multimodal medical large model. The method includes: acquiring multimodal feature data; inputting the multimodal feature data into a pre-trained multimodal medical large model to obtain an output result. The multimodal medical large model includes a multimodal semantic alignment model and an integrated enhanced medical large model. The multimodal semantic alignment model is obtained by training an initial multimodal semantic alignment model with multimodal feature sample vectors, representation extraction sample vectors, and semantic labels corresponding to the multimodal feature sample vectors. The integrated enhanced medical large model is obtained by training multiple source medical models based on the prediction distribution matrix obtained from predicting single-modal medical sample data and the true distribution matrix corresponding to each source medical model.

[0006] This specification provides a method for fine-tuning a multimodal medical large model. The method includes: inputting multimodal feature sample vectors and representation extraction sample vectors into an initial multimodal semantic alignment model to obtain the prediction result output by the initial multimodal semantic alignment model, wherein the representation extraction sample vectors are used to extract features from the multimodal feature sample vectors; calculating a multimodal matching loss function based on the prediction result and the semantic labels corresponding to the multimodal feature sample vectors; iteratively updating the parameters in the initial multimodal semantic alignment model based on the multimodal matching loss function to construct a multimodal semantic alignment model; and inputting the output of the multimodal semantic alignment model into a pre-trained ensemble augmented medical large model to obtain the prediction result output by the ensemble augmented medical large model, wherein the ensemble augmented medical large model is trained by a prediction distribution matrix obtained by multiple source medical models predicting single-modal medical sample data and a true distribution matrix corresponding to each source medical model.

[0007] According to one aspect of an embodiment of this specification, the method further includes: setting a fully connected linear projection layer between the multimodal semantic alignment model and the integrated augmented medical model, matching the output of the multimodal semantic alignment model with the feature dimensions of the input of the integrated augmented medical model; optimizing the multimodal semantic alignment model according to the loss function of the masked language model, and determining the output of the multimodal semantic alignment model and the prediction result of the output of the integrated augmented medical model.
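The dimension-matching role of the fully connected linear projection layer can be illustrated with a minimal NumPy sketch. The class name `LinearProjection`, the dimensions 256 and 1024, and the random initialization are hypothetical choices for illustration; the specification does not state the actual sizes or how the layer is initialized.

```python
import numpy as np

class LinearProjection:
    """Fully connected layer mapping alignment-model outputs to the
    input width of the downstream medical model (illustrative sizes)."""

    def __init__(self, in_dim: int, out_dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Scaled random init; the real initialization is not given in the source.
        self.W = rng.normal(scale=in_dim ** -0.5, size=(in_dim, out_dim))
        self.b = np.zeros(out_dim)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # Applies to the last axis, so any leading (batch, token) axes are kept.
        return x @ self.W + self.b

# Hypothetical shapes: 2 samples, 32 aligned tokens, 256-dim alignment output.
aligned = np.ones((2, 32, 256))
proj = LinearProjection(in_dim=256, out_dim=1024)
projected = proj(aligned)   # now matches a 1024-dim model input
```

Only the last axis changes, so the projected sequence can be fed to the downstream model token by token.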

[0008] According to one aspect of the embodiments of this specification, the representation extraction sample vector is determined according to the following formula: $v = u^{\top}(X - \mu)$, where $v$ represents the representation extraction sample vector, $X$ represents the multimodal feature sample vector, $\mu$ represents the feature center vector, and $u$ represents the principal feature direction vector.

[0009] According to one aspect of the embodiments of this specification, a multimodal matching loss function is constructed as follows: a contrastive loss function is determined by maximizing the similarity of pairs of positive multimodal feature vectors and minimizing the similarity of pairs of negative multimodal feature vectors; a cosine distance loss function is determined based on the multimodal feature vectors and their corresponding semantic labels; and a multimodal matching loss function is constructed based on the contrastive loss function and the cosine distance loss function, as shown in the following formula:

[0010] $L_{\text{match}} = \alpha \cdot L_{\text{InfoNCE}} + (1 - \alpha) \cdot L_{\cos}$, where $L_{\text{match}}$ represents the multimodal matching loss function, $L_{\text{InfoNCE}}$ represents the InfoNCE contrastive loss function, $L_{\cos}$ represents the cosine distance loss function, and $\alpha$ represents the regularization coefficient.

[0011] According to one aspect of an embodiment of this specification, the integrated augmented medical model is constructed as follows: single-modal medical sample data is input into multiple source medical models to obtain the predicted distribution matrix output by each source medical model; the cross-entropy between the predicted distribution matrix of each source medical model and the corresponding true distribution matrix is calculated, and the weight of each predicted distribution matrix is determined from it; based on the weights, the predicted distribution matrices output by the source medical models are combined in a weighted sum to obtain a fusion probability distribution matrix; one of the source medical models is selected as the initial medical model; a loss value is calculated from the predicted distribution matrix output by the initial medical model, the fusion probability distribution matrix, and the true distribution matrix, and the parameters of the initial medical model are iteratively trained according to this loss value to construct an initial integrated augmented medical model; and the initial integrated augmented medical model is fine-tuned to obtain the integrated augmented medical model.
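The weighting and fusion steps of this paragraph can be sketched as follows. The specification says only that the weights are determined from the cross-entropy; the softmax over negative cross-entropy used here is an assumption, as are the function names and the toy matrices.

```python
import numpy as np

def cross_entropy(P_true: np.ndarray, P_pred: np.ndarray, eps: float = 1e-12) -> float:
    """Mean cross-entropy between row-wise true and predicted distributions."""
    return float(-np.mean(np.sum(P_true * np.log(P_pred + eps), axis=1)))

def fuse_predictions(P_true: np.ndarray, preds: list) -> tuple:
    """Weight each source model by its fit to the truth, then fuse.

    Assumption: weights = softmax(-cross_entropy). A lower cross-entropy
    therefore yields a larger weight.
    """
    ce = np.array([cross_entropy(P_true, P) for P in preds])
    w = np.exp(-(ce - ce.min()))          # shift by the minimum for stability
    w = w / w.sum()
    fused = sum(wi * P for wi, P in zip(w, preds))   # rows still sum to 1
    return w, fused

# Toy data: 3 samples, 3 classes, one accurate and one uninformative source model.
P_true = np.eye(3)
good = np.full((3, 3), 0.05) + 0.85 * np.eye(3)   # 0.9 on the correct class
poor = np.full((3, 3), 1.0 / 3.0)                 # uniform guess
weights, fused = fuse_predictions(P_true, [good, poor])
```

Because the weights are non-negative and sum to one, the fused matrix is a convex combination of the source predictions and remains a valid probability distribution per row.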

[0012] According to one aspect of the embodiments of this specification, the loss value among the predicted distribution matrix output by the initial medical model, the fusion probability distribution matrix, and the true distribution matrix is calculated as follows: a fusion loss function is determined based on the difference between the true distribution matrix and the fusion probability distribution matrix, and the difference between the fusion probability distribution matrix and the predicted distribution matrix.
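One possible form of such a fusion loss is sketched below. The specification does not say how the two "differences" are measured or combined; this sketch assumes KL divergence for both terms and a hypothetical mixing coefficient `beta`.

```python
import numpy as np

def kl_divergence(P: np.ndarray, Q: np.ndarray, eps: float = 1e-12) -> float:
    """Mean row-wise KL(P || Q) between two distribution matrices."""
    return float(np.mean(np.sum(P * (np.log(P + eps) - np.log(Q + eps)), axis=1)))

def fusion_loss(P_true, P_fused, P_pred, beta: float = 0.5) -> float:
    """Assumed form: beta * D(true, fused) + (1 - beta) * D(fused, pred)."""
    return beta * kl_divergence(P_true, P_fused) + (1 - beta) * kl_divergence(P_fused, P_pred)

# Toy matrices: 2 samples, 2 classes.
P_true = np.array([[1.0, 0.0], [0.0, 1.0]])
P_fused = np.array([[0.8, 0.2], [0.3, 0.7]])
close_pred = P_fused.copy()                      # student matches the fused target
far_pred = np.array([[0.5, 0.5], [0.5, 0.5]])    # student is uninformative

loss_close = fusion_loss(P_true, P_fused, close_pred)
loss_far = fusion_loss(P_true, P_fused, far_pred)
```

In this form only the second term depends on the initial medical model's predictions, so it is the term that drives the parameter updates during iterative training.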

[0013] According to one aspect of an embodiment of this specification, the method further includes: using singular value decomposition to pre-freeze the weight matrix of the initial ensemble augmented medical model to obtain a decomposed singular scale vector and a singular orientation matrix; freezing the singular orientation matrix to generate multiple basic low-rank matrices and multiple intermediate matrices, controlling the number and value of the low-rank matrices, and decomposing the singular orientation matrix into at least one low-rank matrix; adjusting the parameters of the low-rank matrix to obtain a low-rank adjusted singular value matrix; and merging the singular scale vector, the adjusted singular orientation matrix, and the parameter-quantized singular orientation matrix to generate an updated weight matrix.
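The paragraph above can be read in more than one way. The sketch below shows one interpretation, not the claimed procedure: the weight matrix is factored by SVD, the singular values act as the "singular scale vector", the left singular vectors act as a frozen and quantized "singular orientation matrix", and a pair of small trainable low-rank factors adjusts that orientation before everything is merged back. The quantizer, the rank, and the matrix sizes are all hypothetical.

```python
import numpy as np

def quantize(M: np.ndarray, bits: int = 8) -> np.ndarray:
    """Uniform symmetric quantize-dequantize (illustrative only)."""
    scale = np.abs(M).max() / (2 ** (bits - 1) - 1)
    return np.round(M / scale) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))            # toy stand-in for a pretrained weight matrix

# 1. Decompose: s is the scale vector, U and Vt carry the orientation.
U, s, Vt = np.linalg.svd(W, full_matrices=False)   # U: (64, 32), s: (32,), Vt: (32, 32)

# 2. Freeze and quantize the orientation; only A and B would be trained.
U_frozen = quantize(U)
rank = 4
A = rng.normal(scale=0.01, size=(64, rank))
B = np.zeros((rank, 32))                 # zero init, so the update starts as a no-op

def merged_weight(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """3. Merge scale vector, frozen orientation and low-rank adjustment."""
    # Multiplying by s scales each column, which equals right-multiplying by diag(s).
    return ((U_frozen + A @ B) * s) @ Vt

W_updated = merged_weight(A, B)
trainable = A.size + B.size              # 64*4 + 4*32 parameters
full = W.size                            # 64*32 parameters
```

With these toy sizes the trainable factors hold 384 values against 2048 in the full matrix, which is the kind of reduction in adjusted parameters the paragraph aims for.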

[0014] This specification provides an embodiment of a multimodal medical large-scale model application device. The device includes: an acquisition unit for acquiring multimodal feature data; and a prediction unit for inputting the multimodal feature data into a pre-trained multimodal medical large-scale model to obtain an output result. The multimodal medical large-scale model includes a multimodal semantic alignment model and an integrated enhanced medical large-scale model. The multimodal semantic alignment model is obtained by training an initial multimodal semantic alignment model using multimodal feature sample vectors, representation extraction sample vectors, and semantic labels corresponding to the multimodal feature sample vectors. The integrated enhanced medical large-scale model is obtained by training multiple source medical models based on the prediction distribution matrix obtained from predicting single-modal medical sample data and the true distribution matrix corresponding to each source medical model.

[0015] This specification also provides a multimodal medical large-scale model fine-tuning device, the device comprising: a first prediction result acquisition unit, used to input a multimodal feature sample vector, a representation extraction sample vector, and semantic labels corresponding to the multimodal feature sample vector into an initial multimodal semantic alignment model to obtain a prediction result output by the initial multimodal semantic alignment model, wherein the representation extraction sample vector is used to extract features from the multimodal feature sample vector; a multimodal semantic alignment model construction unit, used to calculate a multimodal matching loss function based on the prediction result and the semantic labels, and iteratively update the parameters in the initial multimodal semantic alignment model based on the multimodal matching loss function to construct a multimodal semantic alignment model; and a second prediction result acquisition unit, used to input the output of the multimodal semantic alignment model into a pre-trained integrated augmented medical large-scale model to obtain a prediction result output by the integrated augmented medical large-scale model, wherein the integrated augmented medical large-scale model is trained by a prediction distribution matrix obtained by multiple source medical models predicting single-modal medical sample data and a true distribution matrix corresponding to each source medical model.

[0016] This specification also provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the multimodal medical large model application and fine-tuning method.

[0017] This specification also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the multimodal medical large model application and fine-tuning method.

[0018] This invention reduces reliance on data resources by integrating the knowledge and capabilities of multiple source medical models, effectively improving the model's generalization ability and enhancing the accuracy and reliability of diagnosis. By extracting multimodal features and optimizing their semantic alignment, it avoids problems of information loss and semantic shift.

Attached Figure Description

[0019] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the accompanying drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are merely some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without any creative effort.

[0020] Figure 1 shows a flowchart of a multimodal medical large model application method according to an embodiment of this specification;

[0021] Figure 2 shows a flowchart of a multimodal medical large model fine-tuning method according to an embodiment of this specification;

[0022] Figure 3 shows a flowchart of an embodiment of this specification for optimizing a multimodal semantic alignment model;

[0023] Figure 4 shows a flowchart of a method for constructing a large medical model according to an embodiment of this specification.

[0024] Figure 5 shows a flowchart of a method for fine-tuning an initial integrated augmented medical model according to an embodiment of this specification.

[0025] Figure 6 shows a schematic diagram of the structure of a multimodal medical large model application device according to an embodiment of this specification;

[0026] Figure 7 shows a schematic diagram of the structure of a multimodal medical large model fine-tuning device according to an embodiment of this specification;

[0027] Figure 8 shows a schematic diagram of an embodiment of this specification for constructing an integrated and enhanced medical large model based on a medical source model;

[0028] Figure 9 shows a schematic diagram of an embodiment of this specification for updating the weight matrix of an integrated and enhanced medical large model;

[0029] Figure 10 shows a schematic diagram of a multimodal semantic alignment model based on adaptive fusion according to an embodiment of this specification;

[0030] Figure 11 shows a schematic diagram of the structure of a computer device according to an embodiment of this specification.

[0031] Explanation of symbols in the attached figures: 601, Acquisition unit; 602, Prediction unit; 701, First prediction result acquisition unit; 702, Multimodal semantic alignment model construction unit; 703, Second prediction result acquisition unit; 1102, Computer device; 1104, Processor; 1106, Memory; 1108, Drive mechanism; 1110, Input / output module; 1112, Input device; 1114, Output device; 1116, Presentation device; 1118, Graphical user interface; 1120, Network interface; 1122, Communication link; 1124, Communication bus.

Detailed Implementation

[0032] To enable those skilled in the art to better understand the technical solutions in this specification, the technical solutions in the embodiments of this specification will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this specification, and not all embodiments. Based on the embodiments in this specification, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this specification.

[0033] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, apparatus, product, or device that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or devices.

[0034] This specification provides the operational steps of the methods described in the embodiments or flowcharts, but based on conventional or non-inventive labor, more or fewer operational steps may be included. The order of steps listed in the embodiments is merely one possible execution order among many and does not represent the only possible execution order. In actual system or device products, the methods shown in the embodiments or drawings can be executed sequentially or in parallel.

[0035] It should be noted that the multimodal medical large model application and fine-tuning method described in this specification can be used in the field of natural language processing technology as well as in the field of medical technology. This specification does not limit the application field of the multimodal medical large model application and fine-tuning method.

[0036] Figure 1 shows a flowchart of a multimodal medical large model application method according to an embodiment of this specification, which specifically includes the following steps:

[0037] Step 101: Obtain multimodal feature data.

[0038] In this step, the multimodal feature data is in vector form, and therefore can also be called multimodal feature vectors. This step involves acquiring multimodal medical data and pre-fusing it into multimodal feature vectors to obtain the multimodal feature data.

[0039] Step 102: Input the multimodal feature data into the pre-trained multimodal medical large model to obtain the output result. The multimodal medical large model includes a multimodal semantic alignment model and an integrated enhanced medical large model. The multimodal semantic alignment model is obtained by training an initial multimodal semantic alignment model with multimodal feature sample vectors, representation extraction sample vectors, and semantic labels corresponding to the multimodal feature sample vectors. The integrated enhanced medical large model is obtained by training a prediction distribution matrix obtained by multiple source medical models predicting single-modal medical sample data and the true distribution matrix corresponding to each source medical model.

[0040] In this step, both the multimodal semantic alignment model and the ensemble augmented medical model are pre-trained models. The ensemble augmented medical model is trained and constructed by integrating the knowledge and capabilities of multiple source medical models and employing a fusion strategy over their predictive distributions. This significantly reduces dependence on data resources, ensures that the model learns complex disease characteristics, improves its generalization ability, and enhances the accuracy and reliability of diagnosis. To enable the ensemble augmented medical model to understand multimodal feature data and perform subsequent diagnostic tasks, semantic alignment between the multimodal feature data and the input space of the ensemble augmented medical model is necessary. In this specification, the multimodal semantic alignment model can effectively separate irrelevant and redundant data in the multimodal feature data, thereby achieving accurate alignment between the multimodal feature data and its corresponding semantic labels.

[0041] Figure 2 shows a flowchart of a multimodal medical large model fine-tuning method according to an embodiment of this specification, which specifically includes the following steps:

[0042] Step 201: Input the multimodal feature sample vector and the representation extraction sample vector into the initial multimodal semantic alignment model to obtain the prediction result output by the initial multimodal semantic alignment model, wherein the representation extraction sample vector is used to extract features from the multimodal feature sample vector.

[0043] In this step, a multimodal semantic alignment model is constructed. This alignment model integrates a pre-fusion-driven multimodal semantic alignment mechanism and a semantic alignment method based on adaptive representation computation, aiming to semantically align multimodal feature sample vectors with the input space of a large medical model.

[0044] In this step, multimodal medical sample data is first acquired. This data is then pre-fused into multimodal feature sample vectors, and the corresponding multimodal semantic labels are determined. These vectors are then paired with the semantic labels and used in the pre-training of the initial multimodal semantic alignment model to learn the association between multimodal features and semantic labels. The initial multimodal semantic alignment model employs a Transformer structure, which includes a feature encoder and a label encoder. Specifically, the multimodal feature sample vectors and representation extraction vectors are input into the feature encoder. Through a cross-attention mechanism, they interact to extract and generate more semantically relevant prediction results, which constitute the multimodal feature sample representation. As shown in Figure 10, the multimodal fused feature vectors and representation extraction vectors are input into the feature encoder. These vectors participate in the computation at the cross-attention layer of the feature encoder, and the result is output at the feedforward layer.
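The interaction between the representation extraction vectors and the multimodal feature vectors in the cross-attention layer can be sketched as below. This is a single-head illustration that assumes the representation extraction vectors act as queries and the fused features as keys and values; the specification does not state these roles, and the projection matrices, token counts, and width are hypothetical.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, features, Wq, Wk, Wv):
    """Single-head scaled dot-product cross-attention."""
    Q = queries @ Wq                  # (m, d)
    K = features @ Wk                 # (n, d)
    V = features @ Wv                 # (n, d)
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (m, n), rows sum to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 16
features = rng.normal(size=(50, d))   # 50 fused multimodal feature tokens
queries = rng.normal(size=(8, d))     # 8 representation extraction vectors
Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

out, attn = cross_attention(queries, features, Wq, Wk, Wv)
```

Each output row is a weighted summary of the feature tokens, so a small set of extraction vectors can condense a much longer fused feature sequence before the feedforward layer.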

[0045] In the embodiments of this specification, the multimodal feature sample vector can be a vector that combines data of different modalities, such as text, speech, and video. For example, for a liver surgery video clip containing on-site sound and text descriptions, the corresponding semantic labels can be the key lesion information in the clip, including but not limited to lesion location, lesion shape, color features, and lesion name.

[0046] In this embodiment, the representation extraction sample vector is a randomly initialized sample vector used to extract features from the multimodal feature sample vector. It can be any one or a combination of text features, speech features, or video features from the aforementioned liver surgery video clip example. Furthermore, the representation extraction sample vector is learnable and can be optimized. Specifically, the representation extraction sample vector is optimized using the following formula:

[0047] $v = u^{\top}(X - \mu)$, where $v$ represents the representation extraction sample vector, $X$ represents the multimodal feature sample vector, $\mu$ represents the feature center sample vector, and $u$ represents the principal feature direction sample vector. The feature center sample vector $\mu$ is the mean of the multimodal feature sample vector $X$, that is, $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$ over the $n$ samples $x_i$ in $X$. The principal feature direction sample vector $u$ is the first column of the left singular matrix obtained by performing singular value decomposition on $X - \mu$. It corresponds to the largest singular value, equivalently the eigenvector of the covariance matrix with the largest eigenvalue, and captures the direction of largest variance in the data.
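The formula can be checked numerically with a short NumPy sketch. It assumes that $X$ is stored with one sample per column; the specification does not state the layout, and the function name and toy data are illustrative.

```python
import numpy as np

def representation_extraction(X: np.ndarray) -> np.ndarray:
    """Compute v = u^T (X - mu) for X of shape (d, n), one column per sample."""
    mu = X.mean(axis=1, keepdims=True)        # feature center vector, shape (d, 1)
    centered = X - mu
    # Columns of U are left singular vectors, sorted by decreasing singular value.
    U, _, _ = np.linalg.svd(centered, full_matrices=False)
    u = U[:, 0]                               # principal feature direction, shape (d,)
    return u @ centered                       # v, shape (n,)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 10))
X[0] *= 5.0                                   # let the first feature dominate the variance
v = representation_extraction(X)
```

The sign of `u` returned by an SVD routine is arbitrary, so `v` is determined only up to a global sign. Its Euclidean norm equals the largest singular value of the centered data.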

[0048] In this step, the representation extraction sample vector and the multimodal feature sample vector are input together into the initial multimodal semantic alignment model to effectively separate irrelevant and redundant data in the multimodal feature sample vector.

[0049] Step 202: Calculate the multimodal matching loss function based on the prediction result and the semantic label corresponding to the multimodal feature sample vector, and iteratively update the parameters in the initial multimodal semantic alignment model based on the multimodal matching loss function to construct the multimodal semantic alignment model.

[0050] In this step, as shown in Figure 10, the multimodal semantic labels are used as input to the label encoder to generate a vector that matches the output of the feature encoder. Furthermore, a multimodal matching loss function is introduced as a constraint on the outputs of both the feature encoder and the label encoder to learn a mapping relationship that better matches the multimodal feature sample representation with the semantic labels, enabling the representation extraction vector to effectively capture the features in the multimodal sample vector.

[0051] The multimodal matching loss function combines contrastive loss and cosine distance loss. Specifically, the core of contrastive loss is to maximize the similarity of positive feature vector pairs and minimize the similarity of negative feature pairs, while further enhancing the consistency of feature vectors through cosine distance.

[0052] Specifically, the contrastive loss function is determined by maximizing the similarity of positive sample vector pairs of multimodal features and minimizing the similarity of negative sample vector pairs of multimodal features.

[0053] The cosine distance loss function is determined based on the multimodal feature sample vectors and their corresponding semantic labels;

[0054] Based on the contrastive loss function and the cosine distance loss function, a multimodal matching loss function is constructed, as shown in the following formula:

[0055] $L_{\text{match}} = \alpha \cdot L_{\text{InfoNCE}} + (1 - \alpha) \cdot L_{\cos}$, where $L_{\text{match}}$ represents the multimodal matching loss function, $L_{\text{InfoNCE}}$ represents the InfoNCE contrastive loss function, $L_{\cos}$ represents the cosine distance loss function, and $\alpha$ represents the regularization coefficient.
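A minimal NumPy sketch of this combined loss follows. It assumes that row i of the feature-encoder output and row i of the label-encoder output form the positive pair, with all other rows in the batch serving as negatives. The temperature value, the function names, and the toy batch are assumptions not given in the specification.

```python
import numpy as np

def l2_normalize(Z: np.ndarray) -> np.ndarray:
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

def info_nce(F: np.ndarray, T: np.ndarray, temperature: float = 0.07) -> float:
    """InfoNCE over a batch: (F[i], T[i]) is positive, (F[i], T[j != i]) negative."""
    logits = l2_normalize(F) @ l2_normalize(T).T / temperature    # (B, B)
    logits = logits - logits.max(axis=1, keepdims=True)           # stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def cosine_distance(F: np.ndarray, T: np.ndarray) -> float:
    """Mean (1 - cosine similarity) over matched rows."""
    return float(np.mean(1.0 - np.sum(l2_normalize(F) * l2_normalize(T), axis=1)))

def matching_loss(F, T, alpha: float = 0.5, temperature: float = 0.07) -> float:
    """L_match = alpha * L_InfoNCE + (1 - alpha) * L_cos."""
    return alpha * info_nce(F, T, temperature) + (1 - alpha) * cosine_distance(F, T)

rng = np.random.default_rng(0)
F = rng.normal(size=(8, 16))                  # feature-encoder outputs
T = F + 0.01 * rng.normal(size=F.shape)       # label-encoder outputs, nearly aligned

loss_aligned = matching_loss(F, T)
loss_shuffled = matching_loss(F, T[::-1])     # every pair deliberately mismatched
```

Correctly paired features and labels give a much lower loss than mismatched ones, which is the signal used to update the alignment model's parameters.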

[0056] In this process, a multimodal semantic alignment model is constructed by optimizing the representation extraction vector. The multimodal semantic alignment model can effectively separate irrelevant and redundant data in the multimodal feature sample vector, thereby achieving accurate alignment between the multimodal feature sample vector and the semantic label.

[0057] Step 203: Input the output of the multimodal semantic alignment model into the pre-trained integrated augmented medical model to obtain the prediction result output by the integrated augmented medical model. The integrated augmented medical model is trained using the prediction distribution matrices obtained by multiple source medical models predicting single-modal medical sample data and the true distribution matrix corresponding to each source medical model.

[0058] In this step, the integrated augmented medical model can be used to diagnose multiple diseases. For a detailed description of the construction of the integrated augmented medical model, please refer to Figure 4.

[0059] In this specification, a multimodal semantic alignment model ensures efficient collaboration between multimodal data and large medical models by matching their embedding dimensions, significantly improving the performance of diagnostic tasks. This model integrates a pre-fusion-driven multimodal semantic alignment mechanism with a semantic alignment method based on adaptive representation computation. Unlike traditional semantic techniques that only support single-modal and text alignment, this application first fuses features from multimodal medical data and then performs semantic alignment on the fused features, achieving flexible support for arbitrary modal combinations. Furthermore, it demonstrates higher applicability and robustness in complex multimodal medical scenarios, significantly enhancing the ability of integrated and augmented medical models to understand and integrate multimodal information, thereby improving the accuracy and reliability of disease diagnosis.

[0060] The method described in this specification is adaptable to simultaneous input of multiple modalities, fully utilizing complementary information between modalities to enhance a comprehensive understanding of multimodal data and significantly improve model performance. This multimodal feature fusion strategy effectively captures deep semantic relationships across modalities, making the processing of multimodal data in complex medical scenarios more accurate.

[0061] Figure 3 shows a flowchart of a method for optimizing a multimodal semantic alignment model according to an embodiment of this specification, which specifically includes the following steps:

[0062] Step 301: Set a fully connected linear projection layer between the multimodal semantic alignment model and the integrated augmented medical model to match the output of the multimodal semantic alignment model with the feature dimensions of the input of the integrated augmented medical model.

[0063] In the embodiments of this specification, the output of the multimodal semantic alignment model is the aligned multimodal feature vector. In order to ensure that the aligned multimodal feature vector matches the input feature dimension of the integrated augmented medical model, the multimodal semantic alignment model and the initial integrated augmented medical model are connected through a fully connected linear projection layer. The fully connected linear projection layer is used to project the output of the multimodal semantic alignment model (i.e., the aligned multimodal feature vector) into a vector with the same embedding feature dimension as the integrated augmented medical model, so that the result of the multimodal semantic alignment model is input into the integrated augmented medical model in the form of a high-order embedding vector.
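A minimal sketch of such a fully connected linear projection is given below, assuming the usual form y = W·x + b. The dimensions, the random initialization, and the function names are illustrative assumptions; this specification does not fix them.

```python
import random

def make_projection(in_dim, out_dim, seed=0):
    """Create a weight matrix (out_dim x in_dim) and a zero bias vector.
    The uniform initialization range is an assumption for illustration."""
    rng = random.Random(seed)
    scale = 1.0 / in_dim ** 0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

def project(vector, weight, bias):
    """Map an aligned feature vector to the embedding dimension: y = W x + b."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weight, bias)]
```

For example, an aligned multimodal feature vector of length 4 passed through `make_projection(4, 8)` yields a vector of length 8, matching a hypothetical embedding dimension of 8 on the medical model side.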

[0064] Step 302: Optimize the multimodal semantic alignment model based on the loss function of the masked language model, and determine the output of the multimodal semantic alignment model and the prediction results of the integrated enhanced medical big model.

[0065] In this specification, Masked Language Model (MLM) is a deep learning technique widely used in Natural Language Processing (NLP) tasks, and can be used in training models based on the Transformer architecture. MLM learns a deep representation of language by randomly masking or replacing parts of the input text and training the model to predict these masked or replaced words based on context, thus achieving bidirectional modeling.

[0066] In this step, the initial ensemble augmented medical model is trained by predicting masked words using the loss function of the masked language model, thereby improving the model's ability to understand multimodal data. The performance of the multimodal semantic alignment model is further optimized to enable it to work more effectively with the ensemble augmented medical model, enhancing its understanding and application of multimodal information and significantly improving diagnostic performance. The loss function for masked language modeling is as follows:

[0067] In the formula, $L_{\text{MLM}}$ represents the loss value for masked language modeling, $i$ represents the $i$-th index value, $M$ represents the set of position indices of the masked words, $x_i$ represents the actual target word corresponding to the $i$-th index value, $x_{\text{masked}}$ represents the partially masked input word sequence, and $p(x_i \mid x_{\text{masked}})$ represents the probability that the model predicts the actual word $x_i$ given the input $x_{\text{masked}}$.

[0068] The loss function of a masked language model is calculated as follows: for each masked position, the masked language model calculates the probability of each word at that position and selects the word with the highest probability as the prediction. This loss function helps the model learn language representations during pre-training, thereby improving its performance in downstream tasks.
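The calculation can be sketched as follows, assuming the commonly used form of the masked language modeling loss, $L_{\text{MLM}} = -\sum_{i \in M} \log p(x_i \mid x_{\text{masked}})$, which is consistent with the symbols defined above but is an assumption here. The data layout (one probability list per position) and the function names are hypothetical.

```python
import math

def mlm_loss(predicted_probs, target_ids, masked_positions):
    """Sum of negative log-probabilities of the actual words at masked positions.

    predicted_probs[i][k] is the model's probability of vocabulary word k at
    position i; target_ids[i] is the index of the actual word at position i.
    """
    return -sum(math.log(predicted_probs[i][target_ids[i]])
                for i in masked_positions)

def predict_masked(predicted_probs, masked_positions):
    """For each masked position, pick the word with the highest probability."""
    return {i: max(range(len(predicted_probs[i])),
                   key=lambda k: predicted_probs[i][k])
            for i in masked_positions}
```

Only the masked positions contribute to the loss, so unmasked positions can carry arbitrary predictions without affecting training in this sketch.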

[0069] Figure 4 shows a flowchart of a method for constructing a large medical model according to an embodiment of this specification, which specifically includes the following steps:

[0070] Step 401: Input the single-modal medical sample data into multiple source medical models to obtain the prediction distribution matrix output by each source medical model.

[0071] This specification addresses multi-disease scenarios. Different pre-trained source medical models have their own advantages in specific disease domains and have accumulated rich medical knowledge. Therefore, this step uses multiple pre-trained source medical models from various medical fields, such as disease diagnosis and treatment, medical textbooks, and medical databases, to train and construct an integrated augmented medical model. By leveraging the knowledge learned by these source medical models and further fine-tuning in specific domains (such as major neuropsychiatric diseases), a more accurate integrated augmented medical model based on predictive distribution representations can be constructed. In this way, integrating the existing medical knowledge of multiple source medical models and continuing training and fine-tuning with multimodal fusion feature data enables the diagnosis of multiple diseases.

[0072] This step first uses data from a multi-disease medical corpus as input to multiple source medical models. The data in this corpus consists primarily of unimodal medical data. To construct the integrated augmented medical model, this specification collects sample data from the multi-disease medical corpus, referred to as unimodal medical sample data, and inputs it into multiple open-source source medical models. These models include, but are not limited to, LLaVA-Med, MedPaLM, and BioGPT. After the unimodal medical sample data is input into the source medical models, the prediction distribution matrix of each source medical model is obtained.

[0073] Because these source medical models use different tokenizers, the dimensions of the predicted distribution matrices may differ. To ensure information coherence and completeness, this specification employs a word alignment method based on minimum matching cost to standardize each predicted distribution matrix, unifying their size, format, structure, and semantics, thus providing a consistent input data structure for subsequent knowledge fusion. Specifically, the cosine distance between the predicted distribution matrices is used as the matching cost, and the word pairs with the lowest matching cost are selected for one-to-one alignment to ensure structural and semantic consistency of the distribution matrices. A smaller matching cost indicates greater similarity between the two predicted distribution matrices; conversely, a larger matching cost indicates less similarity.
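The one-to-one alignment by lowest matching cost can be illustrated with the sketch below. It assumes that each word of each tokenizer is represented by a row vector and uses a simple greedy pairing in order of increasing cosine distance; the greedy strategy is a simplification chosen for brevity and is not necessarily the matching procedure used in this specification.

```python
import math

def cosine_distance(u, v):
    """Matching cost: 1 minus the cosine similarity of two row vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def greedy_align(rows_a, rows_b):
    """Return one-to-one index pairs (i, j), taken in order of increasing cost."""
    costs = sorted((cosine_distance(a, b), i, j)
                   for i, a in enumerate(rows_a)
                   for j, b in enumerate(rows_b))
    used_a, used_b, pairs = set(), set(), []
    for _cost, i, j in costs:
        if i not in used_a and j not in used_b:
            pairs.append((i, j))
            used_a.add(i)
            used_b.add(j)
    return sorted(pairs)
```

Once the pairs are known, the rows or columns of one matrix can be reordered so that both matrices share the same word order before fusion.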

[0074] Step 402: Calculate the cross-entropy between the predicted distribution matrix of each source medical model and the true distribution matrix corresponding to each source medical model, and determine the weight of each predicted distribution matrix.

[0075] Based on the standardized predicted distribution matrix, this step employs a fusion strategy based on average cross-entropy to calculate the cross-entropy value between each standardized predicted distribution matrix and its corresponding label, and then determines the weight value based on the cross-entropy value. The label describes the true distribution matrix of each source medical model for the input single-modal medical sample data, and can be obtained simultaneously with the collection of the single-modal medical sample data.

[0076] Step 403: Based on the weights, the prediction distribution matrices output by each source medical model are weighted to obtain a fusion probability distribution matrix. After obtaining the weight of each prediction distribution matrix in step 402, the prediction distribution matrices of the source medical models are fused in a weighted manner to obtain a fusion probability distribution matrix. This improves the accuracy and robustness of the model in disease diagnosis tasks.
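Steps 402 and 403 can be sketched together as follows. This specification does not state how a weight is derived from a cross-entropy value, so the sketch assumes a softmax over the negative average cross-entropy, which gives larger weights to source models whose predictions are closer to the labels. All function names are hypothetical.

```python
import math

def cross_entropy(true_rows, pred_rows, eps=1e-12):
    """Average cross-entropy between true and predicted distribution rows."""
    total = 0.0
    for t_row, p_row in zip(true_rows, pred_rows):
        total -= sum(t * math.log(p + eps) for t, p in zip(t_row, p_row))
    return total / len(true_rows)

def weights_from_cross_entropy(ce_values):
    """Assumed rule: softmax over negative cross-entropy (lower CE, higher weight)."""
    lowest = min(ce_values)
    scores = [math.exp(-(ce - lowest)) for ce in ce_values]
    total = sum(scores)
    return [s / total for s in scores]

def fuse(pred_matrices, weights):
    """Weighted sum of the (already aligned) prediction distribution matrices."""
    n_rows, n_cols = len(pred_matrices[0]), len(pred_matrices[0][0])
    return [[sum(w * m[r][c] for w, m in zip(weights, pred_matrices))
             for c in range(n_cols)]
            for r in range(n_rows)]
```

Because the weights sum to one, each row of the fused matrix remains a probability distribution whenever the input rows are.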

[0077] Step 404: Select one of the multiple source medical models as the initial large medical model, and calculate the loss value between the predicted distribution matrix, the fusion probability distribution matrix, and the true distribution matrix output by the initial large medical model.

[0078] Step 405: Iteratively train the parameters of the initial large medical model based on the loss value to construct the initial ensemble augmented medical model.

[0079] A large model is selected from multiple source medical models as the initial large medical model. The predicted probability distribution of the initial large medical model's output at each training iteration is calculated using a loss function. This allows for the training and adjustment of the initial large medical model's parameters, enabling it to integrate the knowledge from various source models and form a unified large model with fused knowledge. Specifically, based on the predicted distribution matrix output by the initial large medical model, the fused probability distribution matrix, and the true distribution matrix, a fusion loss function is constructed and the loss value is calculated. The parameters of the initial large medical model are iteratively trained based on the loss value until the number of iterations reaches a preset value; or until the loss value converges to a preset threshold, indicating that the initial large medical model has reached the training termination condition, thus constructing the initial ensemble-enhanced large medical model.

[0080] In the embodiments of this specification, the fusion loss function aims to improve the medical semantic understanding capability of the initial large-scale medical model and enhance its performance in multi-disease medical scenarios by optimizing the loss function. Specifically, the fusion loss function is determined based on the difference between the true distribution matrix and the fusion probability distribution matrix, and the difference between the fusion probability distribution matrix and the predicted distribution matrix. The fusion loss function is constructed using the following formula, and the loss value of the initial large-scale medical model training is calculated.

[0081] The formula for the fusion loss function is as follows: $L_{\text{FUSE}} = -\lambda\,\mathbb{E}_{t\sim C}\left[D(P_t, O_t)\right] - \mathbb{E}_{t\sim C}\left[D(P_t, Q_t)\right]$,

[0082] where $D(\cdot,\cdot)$ represents the difference measure between two matrices, implemented using relative entropy; $\mathbb{E}_{t\sim C}$ represents the expectation over text $t$ in the multi-disease medical corpus $C$; $O_t$ represents the true distribution matrix of the initial large medical model for text $t$; $Q_t$ represents the prediction distribution matrix output by the initial large medical model for text $t$; and $P_t$ represents the fusion probability distribution matrix. The fusion loss function integrates the differences between the true distribution matrix and the fusion probability distribution matrix, as well as the differences between the fusion probability distribution matrix and the predicted distribution matrix of the initial large medical model, thereby achieving fitting of the true label distribution and learning of the fusion distribution during the optimization process.
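The two difference terms can be sketched with relative entropy (KL divergence) as below. The sketch makes two assumptions that this specification does not spell out: the argument order of each divergence (true versus fused, and fused versus predicted), and the sign convention, where both terms are treated as non-negative quantities to be minimized. The weighting parameter `lam` and the function names are hypothetical.

```python
import math

def kl_divergence(target, estimate, eps=1e-12):
    """Relative entropy KL(target || estimate) for one distribution row."""
    return sum(t * math.log((t + eps) / (e + eps))
               for t, e in zip(target, estimate) if t > 0)

def fusion_loss(true_rows, fused_rows, pred_rows, lam=0.9):
    """lam * mean KL(true || fused) + mean KL(fused || predicted) (assumed form)."""
    n = len(true_rows)
    true_vs_fused = sum(kl_divergence(o, p)
                        for o, p in zip(true_rows, fused_rows)) / n
    fused_vs_pred = sum(kl_divergence(p, q)
                        for p, q in zip(fused_rows, pred_rows)) / n
    return lam * true_vs_fused + fused_vs_pred
```

In this sketch, the loss is zero when the true, fused, and predicted rows coincide, and it grows as the predicted distribution moves away from the fused distribution.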

[0083] Step 406: Fine-tune the initial integrated augmented medical model to obtain the integrated augmented medical model.

[0084] This specification employs an adaptive rank reduction and quantization fine-tuning training strategy. By performing low-rank decomposition and quantization on the model parameters, the number of parameters that need to be adjusted during fine-tuning of the initial ensemble augmented medical model is reduced, thereby optimizing the ensemble augmented medical model. A detailed description of the fine-tuning of the initial ensemble augmented medical model in this step is shown in Figure 5.

[0085] Figure 5 shows a flowchart of a method for fine-tuning an integrated and enhanced medical large model according to an embodiment of this specification, which specifically includes the following steps:

[0086] Step 501: Freeze the weight matrix of the initial integrated augmented medical model, then apply singular value decomposition to it to obtain the singular scale vector and the singular direction matrix.

[0087] In this step, the weight matrix W0 of the pre-trained ensemble augmented medical model in Figure 4 is first frozen. After freezing, singular value decomposition is used to decompose the weight matrix W0 into a singular scale vector m and a singular direction matrix, as shown in the following equation:

[0088] W0 = U·diag(m)·V, where m represents the singular scale vector, and U and V represent the left and right singular vector matrices obtained from singular value decomposition, respectively. The right singular vector matrix is also called the singular direction matrix (singular orientation matrix). diag(·) denotes the diagonal matrix operator. This step decomposes the weight matrix of the initial ensemble augmented medical model to achieve independent adjustment of scale and direction weights, capture the global characteristics of the weight matrix, reveal its low-rank structure and orthogonality, improve numerical stability, and enhance efficiency and reliability.

[0089] Step 502: Freeze the singular orientation matrix, generate multiple basic low-rank matrices and multiple intermediate matrices, control the number and value of the low-rank matrices, and decompose the singular orientation matrix into at least one low-rank matrix.

[0090] Before the singular direction matrix is frozen in this step, the method also includes quantizing the right singular vector matrix (the singular direction matrix). Specifically, because the singular direction matrix V contains a large number of parameters, parameter quantization is used to reduce its size, as shown in the following equation:

[0091] In the formula, V represents the singular direction matrix before quantization, and the result of the formula is the singular direction matrix after parameter quantization. s0 represents the scale factor, with values ranging from 1 to 10. z0 represents the quantization offset value, which is 7 in 4-bit quantization. `clamp(x,A,B)` restricts the value of x to between A and B. `[·]` takes the integer part in the quantization operation. The formula thus limits the quantized values to between 0 and $2^4 - 1$.
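A sketch of 4-bit parameter quantization is given below. It assumes the common affine form q = clamp([v / s0] + z0, 0, 2^4 − 1), which is consistent with the symbols described above but is an assumption here, as is the use of round-to-nearest for the integer operation. The function names and the dequantization helper are hypothetical.

```python
import math

def quantize_4bit(values, scale, zero_point=7):
    """Quantize each value to an integer code in [0, 2**4 - 1].

    scale plays the role of s0 and zero_point the role of z0 (assumed mapping).
    """
    q_min, q_max = 0, 2 ** 4 - 1
    codes = []
    for v in values:
        rounded = math.floor(v / scale + 0.5)  # round to nearest (assumption)
        codes.append(max(q_min, min(q_max, rounded + zero_point)))
    return codes

def dequantize(codes, scale, zero_point=7):
    """Approximate reconstruction of the original values from the codes."""
    return [(q - zero_point) * scale for q in codes]
```

Values whose shifted code would fall outside the 4-bit range are clipped to the nearest bound, so the reconstruction is exact only for values inside the representable range.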

[0092] After the parameter-quantized singular direction matrix is frozen, low-rank matrix decomposition is further achieved by controlling the number n of low-rank matrices and the rank r of each low-rank matrix. This method first generates two basic low-rank matrices A and B, and then introduces n intermediate matrices $X_i$ (i=1,...,n) on this basis. The decomposed matrices are much smaller than the dimension of the original weight matrix, thus exhibiting the characteristics of low rank.

[0093] Step 503: Adjust the parameters of the low-rank matrices to obtain the adjusted singular direction matrix.

[0094] By adjusting only the parameters of the matrices A, B, and $X_i$, rather than the entire matrix, the number of parameters to be adjusted is reduced, thus lowering the computational cost. The adjusted singular orientation matrix is:

[0095] The parameter values of the decomposed matrices A, B, and $X_i$ are initialized using a Gaussian distribution. This decomposition method allows for flexible control of the degrees of freedom by adjusting the number of intermediate matrices, n, and significantly reduces the storage and computational costs of the weight matrix.

[0096] Step 504: Merge the singular scale vector, the low-rank adjusted singular orientation matrix, and the parameter-quantized singular orientation matrix to generate the updated weight matrix.

[0097] The singular scale vector m, the low-rank adjusted singular direction matrix, and the parameter-quantized singular direction matrix are merged to generate the final updated weight matrix W′. The updated weight matrix not only contains the main information of the original weight matrix but also effectively incorporates parameter optimizations captured through low-rank decomposition. The specific formula for parameter update is shown below:

[0098] In the formula, W′ represents the updated weight matrix, and the remaining term represents the singular direction matrix after low-rank adjustment.

[0099] This specification employs an adaptive down-rank quantization fine-tuning training strategy. By performing low-rank decomposition and quantization on the model parameters, it reduces the number of parameters that need to be adjusted during the fine-tuning of the ensemble augmented medical model, thereby optimizing the model and significantly reducing the consumption of computational resources and storage space, while improving fine-tuning training efficiency. In resource-constrained environments, it can maintain model performance while significantly shortening fine-tuning time and computational costs. This greatly reduces the hardware requirements during the optimization process of the ensemble augmented medical model, improves its application efficiency in dynamic medical environments, and makes large-scale deployment and clinical application more sustainable.

[0100] Figure 6 shows a schematic diagram of a multimodal medical large-scale model application device according to an embodiment of this specification. The figure illustrates the basic structure of the device. The functional units and modules can be implemented in software, or using general-purpose chips or specific chips to implement the multimodal medical large-scale model application. The device specifically includes:

[0101] Acquisition unit 601 is used to acquire multimodal feature data;

[0102] The prediction unit 602 is used to input multimodal feature data into a pre-trained multimodal medical large model and obtain the output result. The multimodal medical large model includes a multimodal semantic alignment model and an integrated enhanced medical large model. The multimodal semantic alignment model is obtained by training an initial multimodal semantic alignment model with multimodal feature sample vectors, representation extraction sample vectors, and semantic labels corresponding to the multimodal feature sample vectors. The integrated enhanced medical large model is obtained by training a prediction distribution matrix obtained by multiple source medical models predicting single-modal medical sample data and the true distribution matrix corresponding to each source medical model.

[0103] Figure 7 shows a schematic diagram of a multimodal medical large-scale model fine-tuning device according to an embodiment of this specification. The figure illustrates the basic structure of the device, where functional units and modules can be implemented in software or using general-purpose or specific chips to construct the multimodal medical large-scale model. The device specifically includes:

[0104] The first prediction result acquisition unit 701 is used to input the multimodal feature sample vector, the representation extraction sample vector and the semantic label corresponding to the multimodal feature sample vector into the initial multimodal semantic alignment model to obtain the prediction result output by the initial multimodal semantic alignment model, wherein the representation extraction sample vector is used to extract features from the multimodal feature sample vector;

[0105] The multimodal semantic alignment model construction unit 702 is used to calculate a multimodal matching loss function based on the prediction result and the semantic label, and iteratively update the parameters in the initial multimodal semantic alignment model based on the multimodal matching loss function to construct a multimodal semantic alignment model.

[0106] The second prediction result acquisition unit 703 is used to input the output of the multimodal semantic alignment model into the pre-trained integrated augmented medical model to obtain the prediction result output by the integrated augmented medical model. The integrated augmented medical model is trained using the prediction distribution matrices obtained by multiple source medical models predicting single-modal medical sample data and the true distribution matrix corresponding to each source medical model.

[0107] Figure 8 illustrates an embodiment of this specification of constructing an initial integrated enhanced medical large-scale model based on a medical source model. In the figure, three pre-trained medical large-scale models—Medical Large-Scale Model A, Medical Large-Scale Model B, and Medical Large-Scale Model C—with rich accumulated medical knowledge from various medical fields such as disease diagnosis and treatment, medical textbooks, and medical databases, are used. Continuous training and fine-tuning are performed using multimodal fusion feature data to achieve the diagnosis of multiple diseases, thereby constructing a more accurate medical diagnostic model.

[0108] Figure 9 illustrates a schematic diagram of updating the weight matrix of an integrated augmented medical model according to an embodiment of this specification. Singular value decomposition (SVD) is used to decompose the initial weight matrix of the integrated augmented medical model into a singular scale vector and a singular orientation matrix. Next, two basic low-rank matrices A and B are generated, and n intermediate matrices are introduced. Low-rank matrix decomposition is achieved by controlling the number of introduced intermediate matrices and the rank of each low-rank matrix. This allows for flexible control of the decomposition degrees of freedom, reducing the storage and computational costs of the weight matrix. Finally, the singular scale vector, singular orientation matrix, and the singular orientation matrix updated with low-rank parameters are merged to generate the final updated weight matrix. The updated weight matrix not only contains the main information of the original weight matrix but also effectively incorporates parameter optimizations captured through low-rank decomposition, thereby enabling fine-tuning of the initial integrated augmented medical model.

[0109] Figure 10 shows a schematic diagram of a multimodal semantic alignment model based on adaptive fusion according to an embodiment of this specification. This model integrates a pre-fusion-driven multimodal semantic alignment mechanism and a semantic alignment method based on adaptive representation computation, enabling semantic alignment of multimodal feature vectors with the input space of an integrated and enhanced medical large model.

[0110] Figure 11 is a schematic diagram of a computer device provided in an embodiment of this specification. The multimodal medical large-scale model application and fine-tuning method described in this application can be applied to the computer device. The computer device 1102 may include one or more processors 1104, such as one or more central processing units (CPUs), each of which can implement one or more hardware threads. The computer device 1102 may also include any memory 1106 for storing any kind of information such as code, settings, data, etc. Non-limitingly, for example, the memory 1106 may include any one or more combinations of the following: any type of RAM, any type of ROM, flash memory device, hard disk, optical disk, etc. More generally, any memory can use any technology to store information. Further, any memory can provide volatile or non-volatile retention of information. Further, any memory can represent a fixed or removable component of the computer device 1102. In one case, when the processor 1104 executes associated instructions stored in any memory or combination of memories, the computer device 1102 can perform any operation of the associated instructions. The computer device 1102 also includes one or more drive mechanisms 1108 for interacting with any memory, such as a hard disk drive mechanism, an optical disk drive mechanism, etc.

[0111] Computer device 1102 may also include an input / output module 1110 (I / O) for receiving various inputs (via input device 1112) and providing various outputs (via output device 1114). A specific output mechanism may include a presentation device 1116 and an associated graphical user interface (GUI) 1118. In other embodiments, the input / output module 1110 (I / O), input device 1112, and output device 1114 may be omitted, and the device may function solely as a computer device within a network. Computer device 1102 may also include one or more network interfaces 1120 for exchanging data with other devices via one or more communication links 1122. One or more communication buses 1124 couple the components described above together.

[0112] Communication link 1122 can be implemented in any way, such as via a local area network, a wide area network (e.g., the Internet), a point-to-point connection, or any combination thereof. Communication link 1122 may include any combination of hardwired links, wireless links, routers, gateway functions, name servers, etc., governed by any protocol or combination of protocols.

[0113] Corresponding to the methods in Figures 1 to 5, embodiments of this specification also provide a computer-readable storage medium storing a computer program that, when executed by a processor, performs the steps of the methods described above.

[0114] This specification also provides computer-readable instructions, wherein when a processor executes the instructions, the program therein causes the processor to perform the methods shown in Figures 1 to 5.

[0115] It should be understood that in the various embodiments of this specification, the sequence number of each process does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this specification.

[0116] It should also be understood that, in the embodiments of this specification, the term "and / or" is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, and B existing alone. Additionally, the character " / " in this specification generally indicates that the preceding and following related objects have an "or" relationship.

[0117] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed in this specification can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of each example have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this specification.

[0118] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.

[0119] In the several embodiments provided in this specification, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, or they may be electrical, mechanical, or other forms of connection.

[0120] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of the embodiments described in this specification, depending on actual needs.

[0121] Furthermore, the functional units in the various embodiments of this specification can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0122] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this specification, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this specification. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0123] This specification uses specific embodiments to illustrate its principles and implementations. The descriptions of the above embodiments are intended only to help in understanding the methods and core ideas of this specification. Those skilled in the art may, based on the ideas of this specification, make changes to the specific implementations and the scope of application. Therefore, the content of this specification should not be construed as limiting this specification.

Claims

1. A method for applying a multimodal medical large model, characterized in that the method comprises:
acquiring multimodal feature data; and
inputting the multimodal feature data into a pre-trained multimodal medical large model to obtain an output result;
wherein the multimodal medical large model comprises a multimodal semantic alignment model and an integrated enhanced medical large model; the multimodal semantic alignment model is obtained by training an initial multimodal semantic alignment model with multimodal feature sample vectors, representation extraction sample vectors, and semantic labels corresponding to the multimodal feature sample vectors; and the integrated enhanced medical large model is obtained by training based on the prediction distribution matrices obtained by multiple source medical models predicting single-modal medical sample data and the true distribution matrix corresponding to each source medical model.

2. A method for fine-tuning a multimodal medical large model, characterized in that the method comprises:
inputting a multimodal feature sample vector and a representation extraction sample vector into an initial multimodal semantic alignment model to obtain a prediction result output by the initial multimodal semantic alignment model, wherein the representation extraction sample vector is used to extract features from the multimodal feature sample vector;
calculating a multimodal matching loss function based on the prediction result and the semantic label corresponding to the multimodal feature sample vector;
iteratively updating the parameters of the initial multimodal semantic alignment model based on the multimodal matching loss function to construct a multimodal semantic alignment model; and
inputting the output of the multimodal semantic alignment model into a pre-trained integrated enhanced medical large model to obtain a prediction result output by the integrated enhanced medical large model, wherein the integrated enhanced medical large model is obtained by training based on the prediction distribution matrices obtained by multiple source medical models predicting single-modal medical sample data and the true distribution matrix corresponding to each source medical model.

3. The method according to claim 2, characterized in that the method further comprises:
providing a fully connected linear projection layer between the multimodal semantic alignment model and the integrated enhanced medical large model, so that the feature dimension of the output of the multimodal semantic alignment model matches the input of the integrated enhanced medical large model; and
optimizing the multimodal semantic alignment model based on a masked language model loss function, and determining the output of the multimodal semantic alignment model and the prediction result of the integrated enhanced medical large model.

4. The method according to claim 2, characterized in that the representation extraction sample vector is determined according to the following formula:
$v = u^{T}(X - \mu)$
where $v$ represents the representation extraction sample vector, $X$ represents the multimodal feature sample vector, $\mu$ represents the feature center vector, and $u$ represents the main feature direction vector.
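
As a non-limiting illustration, the projection in the formula of claim 4 can be sketched as follows. The function name `extract_representation` and the sample values are hypothetical, and the sketch assumes that several main feature direction vectors may be supplied at once, each yielding one component of v.

```python
def extract_representation(x, mu, directions):
    """Return v = u^T (x - mu) for each direction vector u in `directions`.

    x          -- multimodal feature sample vector (X in the formula)
    mu         -- feature center vector
    directions -- list of main feature direction vectors (assumed input)
    """
    # Center the feature vector around the feature center.
    centered = [xi - mi for xi, mi in zip(x, mu)]
    # Project the centered vector onto each main feature direction.
    return [sum(ui * ci for ui, ci in zip(u, centered)) for u in directions]


x = [2.0, 4.0, 6.0]
mu = [1.0, 2.0, 3.0]
directions = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
v = extract_representation(x, mu, directions)
print(v)  # prints [1.0, 3.0]
```

With a single direction vector, the result reduces to the scalar product in the formula.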

5. The method according to claim 2, characterized in that the multimodal matching loss function is constructed as follows:
determining a contrastive loss function by maximizing the similarity of positive sample vector pairs of the multimodal features and minimizing the similarity of negative sample vector pairs of the multimodal features;
determining a cosine distance loss function based on the multimodal feature sample vectors and their corresponding semantic labels; and
constructing the multimodal matching loss function from the contrastive loss function and the cosine distance loss function according to the following formula:
$L_{match} = \alpha \cdot L_{InfoNCE} + (1 - \alpha) \cdot L_{cos}$
where $L_{match}$ represents the multimodal matching loss function, $L_{InfoNCE}$ represents the InfoNCE contrastive loss function, $L_{cos}$ represents the cosine distance loss function, and $\alpha$ represents the regularization coefficient.
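
As a non-limiting illustration, the weighted combination in claim 5 can be sketched for a single anchor vector as follows. The temperature parameter, the use of cosine similarity inside the contrastive term, and the representation of the semantic label as an embedding vector are assumptions of this sketch and are not specified in the claim; all function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: small when the positive pair is the most similar."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


def cosine_distance(feature, label_embedding):
    """Cosine distance between a feature vector and its (assumed) label embedding."""
    return 1.0 - cosine(feature, label_embedding)


def matching_loss(anchor, positive, negatives, label_embedding, alpha=0.5):
    """L_match = alpha * L_InfoNCE + (1 - alpha) * L_cos."""
    contrastive = info_nce(anchor, positive, negatives)
    distance = cosine_distance(anchor, label_embedding)
    return alpha * contrastive + (1.0 - alpha) * distance


# Anchor agrees with its positive and its label: low loss.
aligned = matching_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]], [1.0, 0.0])
# Anchor agrees with the negative instead: high loss.
misaligned = matching_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]], [0.0, 1.0])
print(aligned < misaligned)  # prints True
```

Setting `alpha` to 1 or 0 reduces the sketch to the contrastive term or the cosine distance term alone, which is one way to check each component separately.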

6. The method according to claim 2, characterized in that the integrated enhanced medical large model is constructed as follows:
inputting single-modal medical sample data into multiple source medical models to obtain the prediction distribution matrix output by each source medical model;
calculating the cross-entropy between the prediction distribution matrix of each source medical model and the true distribution matrix corresponding to that source medical model, and determining the weight of each prediction distribution matrix;
weighting the prediction distribution matrices output by the source medical models based on the weights to obtain a fusion probability distribution matrix;
selecting one of the multiple source medical models as an initial medical large model, and calculating a loss value among the prediction distribution matrix output by the initial medical large model, the fusion probability distribution matrix, and the true distribution matrix;
iteratively training the parameters of the initial medical large model based on the loss value to construct an initial integrated enhanced medical large model; and
fine-tuning the initial integrated enhanced medical large model to obtain the integrated enhanced medical large model.
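
As a non-limiting illustration, the weighting and fusion steps of claim 6 can be sketched as follows. The claim does not state how the weights are derived from the cross-entropy values; this sketch assumes a softmax over negative cross-entropy, so that a source model with lower cross-entropy receives a larger weight, and it assumes for simplicity that all source models share one true distribution matrix. All names and sample values are hypothetical.

```python
import math


def cross_entropy(true_dist, pred_dist, eps=1e-12):
    """Mean cross-entropy between row-wise true and predicted distributions."""
    total = 0.0
    for t_row, p_row in zip(true_dist, pred_dist):
        total -= sum(t * math.log(p + eps) for t, p in zip(t_row, p_row))
    return total / len(true_dist)


def fusion_weights(true_dist, predictions):
    """Assumed rule: softmax over negative cross-entropy, one weight per source model."""
    losses = [cross_entropy(true_dist, p) for p in predictions]
    shift = min(losses)  # shift for numerical stability
    scores = [math.exp(-(loss - shift)) for loss in losses]
    normalizer = sum(scores)
    return [s / normalizer for s in scores]


def fuse(predictions, weights):
    """Weighted sum of the prediction distribution matrices."""
    n_rows = len(predictions[0])
    n_cols = len(predictions[0][0])
    return [
        [sum(w * p[i][j] for w, p in zip(weights, predictions)) for j in range(n_cols)]
        for i in range(n_rows)
    ]


true_dist = [[1.0, 0.0], [0.0, 1.0]]
model_a = [[0.9, 0.1], [0.2, 0.8]]  # closer to the true distribution
model_b = [[0.6, 0.4], [0.5, 0.5]]  # further from the true distribution
weights = fusion_weights(true_dist, [model_a, model_b])
fused = fuse([model_a, model_b], weights)
print(weights[0] > weights[1])  # prints True
```

Because the weights sum to one and each input row is a probability distribution, every row of the fused matrix is again a probability distribution.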

7. The method according to claim 6, characterized in that calculating the loss value among the prediction distribution matrix output by the initial medical large model, the fusion probability distribution matrix, and the true distribution matrix comprises:
determining a fusion loss function based on the difference between the true distribution matrix and the fusion probability distribution matrix and the difference between the fusion probability distribution matrix and the prediction distribution matrix.
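
As a non-limiting illustration, the two difference terms of claim 7 can be sketched as follows. The claim does not specify how each difference is measured or how the two terms are combined; this sketch assumes Kullback-Leibler divergence for both terms and a hypothetical mixing coefficient `lam`.

```python
import math


def kl_divergence(p_rows, q_rows, eps=1e-12):
    """Mean row-wise KL divergence KL(p || q); zero-probability entries of p are skipped."""
    total = 0.0
    for p_row, q_row in zip(p_rows, q_rows):
        for p, q in zip(p_row, q_row):
            if p > 0.0:
                total += p * math.log(p / (q + eps))
    return total / len(p_rows)


def fusion_loss(true_dist, fused_dist, pred_dist, lam=0.5):
    """Assumed form: lam * D(true, fused) + (1 - lam) * D(fused, predicted)."""
    true_vs_fused = kl_divergence(true_dist, fused_dist)
    fused_vs_pred = kl_divergence(fused_dist, pred_dist)
    return lam * true_vs_fused + (1.0 - lam) * fused_vs_pred


true_dist = [[1.0, 0.0]]
fused_dist = [[0.8, 0.2]]
loss_close = fusion_loss(true_dist, fused_dist, [[0.8, 0.2]])  # prediction matches fusion
loss_far = fusion_loss(true_dist, fused_dist, [[0.2, 0.8]])    # prediction contradicts fusion
print(loss_close < loss_far)  # prints True
```

In this sketch only the second term depends on the prediction of the model being trained, so it is the term that drives the parameter updates.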

8. The method according to claim 6, characterized in that fine-tuning the initial integrated enhanced medical large model to obtain the integrated enhanced medical large model comprises:
performing singular value decomposition on the frozen preset weight matrix of the initial integrated enhanced medical large model to obtain a singular scale vector and a singular direction matrix;
freezing the singular direction matrix, generating multiple basic low-rank matrices and multiple intermediate matrices, controlling the number and values of the low-rank matrices, and decomposing the singular direction matrix into at least one low-rank matrix;
adjusting the parameters of the low-rank matrix to obtain a low-rank adjusted singular value matrix; and
merging the singular scale vector, the adjusted singular direction matrix, and the parameterized singular direction matrix to generate an updated weight matrix.

9. An apparatus for applying a multimodal medical large model, characterized in that the apparatus comprises:
an acquisition unit, configured to acquire multimodal feature data; and
a prediction unit, configured to input the multimodal feature data into a pre-trained multimodal medical large model to obtain an output result;
wherein the multimodal medical large model comprises a multimodal semantic alignment model and an integrated enhanced medical large model; the multimodal semantic alignment model is obtained by training an initial multimodal semantic alignment model with multimodal feature sample vectors, representation extraction sample vectors, and semantic labels corresponding to the multimodal feature sample vectors; and the integrated enhanced medical large model is obtained by training based on the prediction distribution matrices obtained by multiple source medical models predicting single-modal medical sample data and the true distribution matrix corresponding to each source medical model.

10. An apparatus for fine-tuning a multimodal medical large model, characterized in that the apparatus comprises:
a first prediction result acquisition unit, configured to input a multimodal feature sample vector, a representation extraction sample vector, and the semantic label corresponding to the multimodal feature sample vector into an initial multimodal semantic alignment model to obtain a prediction result output by the initial multimodal semantic alignment model, wherein the representation extraction sample vector is used to extract features from the multimodal feature sample vector;
a multimodal semantic alignment model construction unit, configured to calculate a multimodal matching loss function based on the prediction result and the semantic label, and to iteratively update the parameters of the initial multimodal semantic alignment model based on the multimodal matching loss function to construct a multimodal semantic alignment model; and
a second prediction result acquisition unit, configured to input the output of the multimodal semantic alignment model into a pre-trained integrated enhanced medical large model to obtain a prediction result output by the integrated enhanced medical large model, wherein the integrated enhanced medical large model is obtained by training based on the prediction distribution matrices obtained by multiple source medical models predicting single-modal medical sample data and the true distribution matrix corresponding to each source medical model.

11. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, the method according to any one of claims 1 to 8 is implemented.

12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 8.