A multi-modal classification method and system based on cross-modal alignment
By employing a cross-modal alignment method, learnable semantic tags and semantic prototypes are used to reconstruct image and text features, addressing the difference in semantic granularity between modalities. This yields more efficient multimodal classification and improves the model's accuracy and robustness.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- Shanxi Taihang Laboratory Co., Ltd.
- Filing Date
- 2026-03-16
- Publication Date
- 2026-06-19
AI Technical Summary
In existing technologies, the semantic granularity differences between image and text modalities limit alignment accuracy, making it difficult to build structural consistency across sample categories or abstract semantic layers, thus affecting the accuracy and robustness of multimodal classification.
By employing a cross-modal alignment method, we reconstruct image and text features using learnable semantic tags and semantic prototypes. Furthermore, we optimize the alignment loss using the Sparsemax function and the Sinkhorn-Knopp clustering algorithm to achieve alignment of image and text features in a shared semantic space, thereby improving the interaction and classification performance of cross-modal features.
It improves the accuracy and robustness of multimodal classification, effectively alleviates the granularity gap between modalities, enhances the model's ability to capture important semantic concepts, and improves the accuracy and interpretability of cross-modal alignment.
Figure CN122241572A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of multimodal classification technology, and in particular to a multimodal classification method and system based on cross-modal alignment.
Background Technology
[0002] With the rapid growth of image data and the enrichment of textual information, researchers are no longer satisfied with using deep learning to process single-modal information. Increasingly, research is focusing on combining text data with images, speech, and other data. Multimodal learning aims to integrate heterogeneous information from different modalities, constructing an effective way to perceive the world. By combining valuable information from two or more modalities and mimicking the workings of the human brain, it aims to overcome the limitations of single-modal frameworks in specific tasks. In the rich and diverse multimodal data, images and text dominate, providing information from visual and linguistic perspectives respectively, and they have a complementary relationship. By combining them, a more comprehensive and in-depth understanding of the data can be achieved. In certain contexts, the information in images and text complements each other, compensating for the shortcomings of a single modality.
[0003] Currently, multimodal classification based on images and text has become a highly focused area in computer science, aiming to identify and classify data by analyzing the features of different modalities. Multimodal classification methods improve the accuracy and robustness of classification tasks by integrating this information. The existence of multimodal image and text data provides us with rich information, but also brings new challenges. Since data from different modalities have different semantic meanings and granularities, how to effectively utilize this data for accurate classification is a major problem facing multimodal classification tasks.
[0004] Cross-modal alignment has made significant progress in the field of visual language. As a core means of bridging the semantic gap between images and text, it can provide more accurate cross-modal feature associations for multimodal classification tasks, thereby improving classification performance. Images and text differ fundamentally in how they express information. Images visually present information, while text abstractly expresses it through linguistic symbols. Therefore, directly extracting features from raw multimodal data and achieving cross-modal semantic alignment presents significant challenges. Effectively bridging the modal differences between visual and linguistic information to achieve semantic understanding and accurate correspondence between these two different types of information requires training deep learning models to align features between modalities.
[0005] Disadvantages of existing technology: 1. The inherent differences in semantic granularity between modalities limit alignment accuracy. Because language and visual information have different semantic meanings and granularities, image modalities often possess rich and redundant visual details, while text modalities tend towards highly abstract and compressed information expression. Specifically, images are typically decomposed into a series of local regions or local tokens, which primarily encode low-level visual information such as color and texture; while text is composed of words or local tokens, which carry more high-level semantic concepts. This asymmetry in semantic hierarchy and information granularity leads to ambiguity in the correspondence between text descriptions and image regions, resulting in text descriptions potentially corresponding to similar but different image regions, thus increasing the gap between modalities and affecting model performance. Traditional methods typically align these two modalities directly in the embedding space, ignoring the differences at the granularity level, thereby affecting the accuracy of semantic matching.
[0006] 2. Existing methods mostly model the relationships between modalities on a sample pair basis, which can only capture local modal associations and is difficult to construct structural consistency models at the category, topic, or abstract semantic levels across samples. The lack of explicit high-order semantic structure modeling between modalities leads to alignment results that are often superficially similar and fail to capture the global commonalities between modalities.
[0007] As explained above, existing methods suffer from inaccurate modal alignment and limitations in high-order semantic modeling when applied to real-world tasks. Therefore, it is necessary to improve current deep learning-based cross-modal alignment algorithms to enhance their reliability and accuracy in real-world applications. In conclusion, designing a multimodal classification method and system based on cross-modal alignment is essential.
Summary of the Invention
[0008] To overcome the shortcomings of the prior art, the purpose of this invention is to provide a multimodal classification method and system based on cross-modal alignment.
[0009] To achieve the above objectives, the present invention provides the following solution: This invention provides a multimodal classification method based on cross-modal alignment, comprising: Step 1: Input multimodal image and text data, and perform preprocessing; Step 2: Input the image and text data into the image encoder and text encoder respectively for feature extraction; Step 3: Reconstruct image and text features using learnable semantic tags; Step 4: Calculate the similarity between image and text features and semantic prototypes through a cross-modal prototype alignment mechanism, and optimize the alignment loss; Step 5: Input the image features and text features into the multimodal classifier for classification.
[0010] Preferably, in step 1, the input multimodal image and text data are preprocessed, specifically as follows: Acquire multimodal image and text data, and construct a dataset based on it, where the dataset exists in the form of image-text pairs.
[0011] Preferably, in step 2, the image and text data are input into the image encoder and text encoder respectively for feature extraction, specifically as follows: Multimodal images are input into an image encoder for feature extraction to obtain image embedding features; Text data is input into a text encoder for text feature extraction, generating text embedding features.
[0012] Preferably, in step 3, the image features are reconstructed using learnable semantic tags, specifically as follows: Define learnable semantic tags and use them as a feature space shared by images and text; The extracted image embedding features are mapped to the shared feature space through a projection function, and the inner product similarity between the image embedding features and each semantic tag is calculated. The Sparsemax function is used to normalize the inner product similarity between the image embedding features and each semantic tag to obtain the weight of each tag; The semantic tags are weighted by these weights and summed to obtain the reconstructed image features.
[0013] Preferably, in step 3, the text features are reconstructed using learnable semantic tags, specifically as follows: Define learnable semantic tags and use them as a feature space shared by images and text; The extracted text embedding features are mapped to the shared feature space through a projection function, and the inner product similarity between the text embedding features and each semantic tag is calculated. The Sparsemax function is used to normalize the inner product similarity between the text embedding features and each semantic tag to obtain the weight of each tag; The semantic tags are weighted by these weights and summed to obtain the reconstructed text features.
[0014] Preferably, in step 4, the similarity between image and text features and semantic prototypes is calculated through a cross-modal prototype alignment mechanism, and the alignment loss is optimized, specifically as follows: Define K learnable cross-modal semantic prototypes and initialize these prototypes randomly; The Sinkhorn-Knopp clustering algorithm is used to soft-assign the reconstructed image and text features; Calculate the visual softmax probability and the text softmax probability; Based on the visual softmax probability and the text softmax probability, cross-modal prototype alignment is achieved through text-to-image cross-prediction and image-to-text cross-prediction, each optimized with a cross-entropy loss, where the overall prototype alignment loss is the average of the two prediction losses over all image-text pairs.
[0015] Preferably, in step 5, the image features and text features are input into a multimodal classifier for classification, specifically as follows: Image features and text features are concatenated to obtain fused features; The fused features are input into a multimodal classifier to obtain multimodal prediction soft labels; The cross-entropy loss is calculated between the multimodal predicted soft labels and the true labels; The loss function is continuously optimized until the training iterations are completed.
[0016] This invention also provides a multimodal classification system based on cross-modal alignment, applied to the aforementioned multimodal classification method based on cross-modal alignment, comprising: The data loading and preprocessing module is used to acquire multimodal image and text data and perform preprocessing. The feature extraction module includes an image encoder and a text encoder, which are used to extract image embedding features and text embedding features, respectively; The semantic reconstruction module is used to reconstruct image and text features using learnable semantic tags and map them to a shared semantic space. A cross-modal prototype alignment module is used to calculate the similarity between image and text features and semantic prototypes, and to optimize the alignment loss; The multimodal classification module is used to input image features, text features, and fused features into the classifier for classification and prediction.
[0017] According to specific embodiments provided by the present invention, the present invention discloses the following technical effects: This invention provides a multimodal classification method and system based on cross-modal alignment. The method includes inputting multimodal image and text data, preprocessing them, inputting the image and text data into an image encoder and a text encoder respectively for feature extraction, reconstructing image and text features using learnable semantic tags, calculating the similarity between image and text features and semantic prototypes through a cross-modal prototype alignment mechanism, optimizing the alignment loss, and inputting the image and text features into a multimodal classifier for classification. This invention has the following advantages: 1. This invention uses learnable semantic tags to implicitly model the shared semantic concepts of images and text through cross-modal interaction, serving as a prior constraint for semantic alignment. With the help of learnable semantic tags, the features extracted from the two modalities are placed in a common manifold space, thereby realizing intermodal interaction. 2. The image feature reconstruction and text feature reconstruction proposed in this invention continuously learn and adjust through learnable tags, judge the weight of feature information in images and text, ignore the differences between modalities, find the common semantics between modalities, and map images and text into the same semantic space, so that their granularity is unified, improve the model's ability to capture important semantic concepts, thereby better performing cross-modal alignment and interaction, improving the interpretability of model representation, and effectively alleviating the granularity gap between image patches and text tags; 3. This invention uses the Sparsemax function to normalize similarity. 
The Sparsemax function first calculates a threshold and then sets the weights below the threshold to zero. Compared with the Softmax function, it can explicitly assign zero probability to learnable semantic tags and therefore achieves sparsity more effectively. The Sparsemax function generates a sparse distribution by minimizing the squared Euclidean distance between the output distribution and the input values. 4. The cross-modal alignment method proposed in this invention makes up for the shortcomings of traditional models in semantic granularity alignment and high-order semantic modeling. It also enhances the model's ability to capture class commonalities and inter-class differences through structured semantic prototypes, and performs modal alignment at different levels, providing a more robust semantic foundation for multimodal classification tasks.
Attached Figure Description
[0018] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0019] Figure 1 is a schematic diagram of the method flow of the present invention.
Detailed Implementation
[0020] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0021] The purpose of this invention is to provide a multimodal classification method and system based on cross-modal alignment. By utilizing shared discrete semantic tags and a learnable set of semantic prototypes, it aims to unify the semantic information of images and text into a shared semantic space, align the feature representations of the same semantic concepts across different modalities, improve the model's ability to capture important semantic concepts, thereby better performing cross-modal alignment and interaction, and improving the model's performance in multimodal classification.
[0022] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0023] As shown in Figure 1, this invention provides a multimodal classification method based on cross-modal alignment, comprising: Step 1: Input multimodal image and text data, and perform preprocessing; Step 2: Input the image and text data into the image encoder and text encoder respectively for feature extraction; Step 3: Reconstruct image and text features using learnable semantic tags; Step 4: Calculate the similarity between image and text features and semantic prototypes through a cross-modal prototype alignment mechanism, and optimize the alignment loss; Step 5: Input the image features and text features into the multimodal classifier for classification.
[0024] In step 1, the input multimodal image and text data undergo preprocessing, specifically as follows: Multimodal image and text data form the input dataset $D$, and the image-text pairs $(I_i, T_i)$ in $D$ are used as input, where $I$ represents image data and $T$ represents text data.
[0025] In step 2, the image and text data are input into the image encoder and text encoder respectively for feature extraction, specifically as follows: Image $I_i$ is input to an image encoder for feature extraction to obtain the image embedding $V = [v_{\mathrm{cls}}, v_1, \dots, v_n]$, where $v_i \in \mathbb{R}^d$ is the feature of the i-th image patch, $n$ is the total number of image patches, and $d$ is the dimension of the linear layer output; the feature vector corresponding to the [CLS] token is used as the global image representation. Text $T_i$ is input into a text encoder for text feature extraction, generating the text embedding $T = [t_{\mathrm{cls}}, t_1, \dots, t_k]$, where $t_i \in \mathbb{R}^d$ is the feature vector of the i-th word, $k$ is the number of words in the text sentence, and $d$ is the dimension of the linear layer output; the feature vector corresponding to the [CLS] token is used as the global text representation.
[0026] In step 3, image features and text features are reconstructed using learnable semantic tags, specifically as follows:
Define learnable semantic tags $E = \{e_1, \dots, e_M\}$ as a feature space shared by images and text. The tags are randomly initialized, and $M$ is the number of learnable semantic tags.
The extracted image patch features $V$ are mapped to the shared feature space through a projection function $\phi_V$. The inner product between each projected image patch embedding and each tag is calculated, and the maximum value is selected as the similarity between the image and that tag:
$$s_i^V = \max_{1 \le j \le n} \langle \phi_V(v_j), e_i \rangle \quad (1)$$
where $\langle \cdot, \cdot \rangle$ is the inner product, used to measure the similarity between vectors, and $s_i^V$ is the similarity between the image and the i-th tag.
The similarities between the image and the learnable semantic tags are normalized with the Sparsemax function to generate the final weight of each tag:
$$\alpha_i^V = \mathrm{Sparsemax}(s^V)_i \quad (2)$$
where $\alpha_i^V$ is the weight of the i-th tag with respect to the image.
Based on the calculated weights, a weighted sum over the learnable semantic tags gives the reconstructed image features:
$$f^V = \sum_{i=1}^{M} \alpha_i^V e_i \quad (3)$$
The extracted text word features $T$ are mapped to the shared feature space through a projection function $\phi_T$. The inner product between each projected text embedding and each tag is calculated, and the maximum value is selected as the similarity between the text and that tag:
$$s_i^T = \max_{1 \le j \le k} \langle \phi_T(t_j), e_i \rangle \quad (4)$$
where $s_i^T$ is the similarity between the text and the i-th tag.
The weight of the i-th tag with respect to the text is:
$$\alpha_i^T = \mathrm{Sparsemax}(s^T)_i \quad (5)$$
The reconstructed text features are:
$$f^T = \sum_{i=1}^{M} \alpha_i^T e_i \quad (6)$$
To reduce noise, focus on key semantics, and keep each tag consistent across image and text semantics, the Sparsemax function is used to normalize the similarities, which yields sparser weights with fewer non-zero entries. The Sparsemax function generates a sparse distribution by minimizing the squared Euclidean distance between the output distribution and the input values, and outputs the minimizer:
$$\mathrm{Sparsemax}(S) = \arg\min_{p \in \Delta^{M-1}} \lVert p - S \rVert_2^2 \quad (7)$$
where $S$ is the vector of similarity scores between an image or text and the learnable semantic tags, and $\Delta^{M-1}$ is the set of $M$-dimensional vectors with non-negative entries that sum to 1.
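The tag-based reconstruction in this step can be sketched in plain Python as follows. This is a minimal illustration, not the patented implementation: the function names `sparsemax` and `reconstruct` are hypothetical, the token features are assumed to be already projected into the shared space, and Sparsemax is computed with the standard sort-and-threshold procedure for the Euclidean projection onto the probability simplex.

```python
def sparsemax(scores):
    """Project `scores` onto the probability simplex (sparse alternative to softmax)."""
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, z_k in enumerate(z, start=1):
        cumsum += z_k
        # The k-th largest score stays in the support while 1 + k*z_k > sum of top-k scores.
        if 1.0 + k * z_k > cumsum:
            tau = (cumsum - 1.0) / k  # threshold for the current support size
    # Scores at or below the threshold receive exactly zero weight.
    return [max(s - tau, 0.0) for s in scores]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reconstruct(token_feats, tags):
    """Rebuild one sample (image or text) as a sparse weighted sum of semantic tags.

    token_feats: projected patch or word features (assumed already in the shared space).
    tags: the M learnable semantic tags, each with the same dimension as a token.
    """
    # Similarity to tag i: maximum inner product over all tokens.
    sims = [max(dot(tok, tag) for tok in token_feats) for tag in tags]
    weights = sparsemax(sims)
    dim = len(tags[0])
    recon = [sum(w * tag[j] for w, tag in zip(weights, tags)) for j in range(dim)]
    return weights, recon


if __name__ == "__main__":
    tags = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    tokens = [[2.0, 0.0], [0.5, 1.0]]
    print(reconstruct(tokens, tags))
```

Unlike softmax, the weights returned here can be exactly zero, so a sample is described by a small subset of tags.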
[0027] In step 4, the similarity between image and text features and semantic prototypes is calculated through a cross-modal prototype alignment mechanism, and the alignment loss is optimized, specifically as follows: Define $K$ learnable cross-modal prototypes $C = \{c_1, \dots, c_K\}$. These prototypes are randomly initialized, and each prototype $c_k$ represents a semantic category. The reconstructed image features $f^V$ and text features $f^T$ are taken as input, and the iterative Sinkhorn-Knopp clustering algorithm assigns $f^V$ and $f^T$ to the $K$ clusters, generating soft assignment vectors $q^V$ and $q^T$ for the image and text samples respectively.
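The soft-assignment step can be illustrated with a small Sinkhorn-Knopp sketch. The details below are assumptions, since the text does not specify them: the input `scores` is taken to be a matrix of feature-to-prototype similarities for a batch, the regularization `epsilon` and iteration count `n_iters` are hypothetical defaults, and the equal-mass constraint on prototypes follows the common use of Sinkhorn-Knopp for balanced clustering.

```python
import math


def sinkhorn_knopp(scores, epsilon=0.05, n_iters=3):
    """Turn an N x K similarity matrix into balanced soft assignments.

    Alternately rescales columns (so every prototype receives equal total mass)
    and rows (so every sample carries equal total mass). Each returned row sums to 1.
    """
    n, k = len(scores), len(scores[0])
    top = max(max(row) for row in scores)  # subtract the max for numerical stability
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: each prototype ends up with total mass 1/k.
        col = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col[j] * k) for j in range(k)] for i in range(n)]
        # Row step: each sample ends up with total mass 1/n.
        row = [sum(r) for r in q]
        q = [[q[i][j] / (row[i] * n) for j in range(k)] for i in range(n)]
    # Rescale so that each sample's assignment vector is a probability distribution.
    return [[v * n for v in r] for r in q]


if __name__ == "__main__":
    sims = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.7], [0.1, 0.9]]
    for row in sinkhorn_knopp(sims):
        print([round(v, 4) for v in row])
```

The same routine would be run once on image-side similarities and once on text-side similarities to obtain the two assignment vectors.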
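[0028] The visual softmax probability $p^V$ is obtained by calculating the cosine similarity between $f^V$ and all cross-modal prototypes in $C$, and the text softmax probability $p^T$ is obtained by calculating the cosine similarity between $f^T$ and all cross-modal prototypes in $C$:
$$p_k^V = \frac{\exp\left(\cos(f^V, c_k)/\tau\right)}{\sum_{k'=1}^{K} \exp\left(\cos(f^V, c_{k'})/\tau\right)} \quad (8)$$
$$p_k^T = \frac{\exp\left(\cos(f^T, c_k)/\tau\right)}{\sum_{k'=1}^{K} \exp\left(\cos(f^T, c_{k'})/\tau\right)} \quad (9)$$
where $k$ denotes the k-th element of the vector and $\tau$ is the prototype-level temperature parameter.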
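A minimal sketch of the temperature-scaled softmax over cosine similarities described above. The function names and the default `tau=0.1` are illustrative assumptions; the text does not state a temperature value.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def prototype_probs(feature, prototypes, tau=0.1):
    """Softmax over cosine similarities between one feature and all K prototypes."""
    logits = [cosine(feature, c) / tau for c in prototypes]
    m = max(logits)  # shift by the max so exp() cannot overflow
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    prototypes = [[1.0, 0.0], [0.0, 1.0]]
    print(prototype_probs([1.0, 0.0], prototypes))
```

A smaller `tau` sharpens the distribution toward the nearest prototype, while a larger value flattens it.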
[0029] Cross-modal prototype alignment is achieved through text-to-image cross-prediction and image-to-text cross-prediction, by optimizing the following two cross-entropy losses, where $q^V$ and $q^T$ are the soft assignment vectors of the image and text samples and $p^V$ and $p^T$ are the visual and text softmax probabilities:
$$\mathcal{L}_{t \to v} = -\sum_{k=1}^{K} q_k^T \log p_k^V \quad (10)$$
$$\mathcal{L}_{v \to t} = -\sum_{k=1}^{K} q_k^V \log p_k^T \quad (11)$$
The overall prototype alignment loss is the average of the two prediction losses over all $N$ image-text pairs:
$$\mathcal{L}_{proto} = \frac{1}{2N} \sum_{i=1}^{N} \left( \mathcal{L}_{t \to v}^{(i)} + \mathcal{L}_{v \to t}^{(i)} \right) \quad (12)$$
In step 5, the image features and text features are input into the multimodal classifier for classification, specifically as follows: The raw image and text data are input into an image encoder and a text encoder to extract unimodal representations. Image patch features $V$ and text word features $W$ (denoted $T$ in step 3) are extracted and input into the image classifier $g_V$ and the text classifier $g_W$, respectively, for prediction:
$$\hat{y}_i^V = g_V(V_i) \quad (13)$$
$$\hat{y}_i^W = g_W(W_i) \quad (14)$$
where $V_i$ is the i-th image feature, $\hat{y}_i^V$ is the soft label of the i-th image feature, $W_i$ is the i-th text feature, and $\hat{y}_i^W$ is the soft label of the i-th text feature.
Cross-entropy loss is calculated between the image unimodal predictions and the ground-truth labels, and likewise between the text unimodal predictions and the ground-truth labels:
$$\mathcal{L}_V = -\sum_{i} y_i^V \log \hat{y}_i^V \quad (15)$$
$$\mathcal{L}_W = -\sum_{i} y_i^W \log \hat{y}_i^W \quad (16)$$
where $\mathcal{L}_V$ is the image unimodal prediction loss, $\mathcal{L}_W$ is the text unimodal prediction loss, $y_i^V$ is the true label of the i-th image feature, and $y_i^W$ is the true label of the i-th text feature.
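The cross-entropy computations in this paragraph, covering both the cross-prediction losses and the unimodal losses, can be sketched as below. Several choices are assumptions rather than details from the text: the pairing direction (the text assignment supervises the image probability for text-to-image prediction, and vice versa), the averaging of the unimodal loss over the batch, and all function names.

```python
import math


def cross_entropy(target, pred, eps=1e-12):
    """H(target, pred) = -sum_k target_k * log(pred_k); eps guards against log(0)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))


def prototype_alignment_loss(q_img, q_txt, p_img, p_txt):
    """Average of the two cross-prediction losses over all image-text pairs.

    q_img, q_txt: soft assignment vectors per pair (e.g. from Sinkhorn-Knopp).
    p_img, p_txt: softmax probabilities over prototypes per pair.
    """
    n = len(q_img)
    total = 0.0
    for i in range(n):
        text_to_image = cross_entropy(q_txt[i], p_img[i])  # text target, image prediction
        image_to_text = cross_entropy(q_img[i], p_txt[i])  # image target, text prediction
        total += 0.5 * (text_to_image + image_to_text)
    return total / n


def unimodal_loss(pred_probs, labels, eps=1e-12):
    """Mean cross-entropy between predicted soft labels and integer class labels."""
    return sum(-math.log(p[y] + eps) for p, y in zip(pred_probs, labels)) / len(labels)


if __name__ == "__main__":
    q = [[1.0, 0.0]]
    p = [[0.5, 0.5]]
    print(prototype_alignment_loss(q, q, p, p))  # close to log(2)
    print(unimodal_loss([[0.5, 0.5]], [0]))      # close to log(2)
```

When the two modalities agree on the prototype and predict it confidently, the alignment loss approaches zero.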
[0030] The unimodal representations are simply concatenated, and the fused representation is then passed to the multimodal classifier. The fused features are represented as follows:
$$F_i = [V_i ; W_i] \quad (17)$$
The fused features are then input into the multimodal classifier $g_M$ to obtain multimodal predicted soft labels:
$$\hat{y}_i^M = g_M(F_i) \quad (18)$$
Similarly, cross-entropy loss is calculated between the predictions from the multimodal fused features and the true labels $y_i$:
$$\mathcal{L}_M = -\sum_{i} y_i \log \hat{y}_i^M \quad (19)$$
Step 6: Continuously optimize the loss function until the training iterations are complete, specifically by combining the image unimodal cross-entropy loss $\mathcal{L}_V$, the text unimodal cross-entropy loss $\mathcal{L}_W$, the multimodal cross-entropy loss $\mathcal{L}_M$, and the β-weighted cross-modal prototype alignment loss $\mathcal{L}_{proto}$:
$$\mathcal{L} = \mathcal{L}_V + \mathcal{L}_W + \mathcal{L}_M + \beta \mathcal{L}_{proto} \quad (20)$$
Step 7: Evaluate the model's classification performance using the accuracy metric.
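The fusion, the combined objective, and the accuracy metric described in steps 5 to 7 can be sketched as follows. The helper names are hypothetical, and the default `beta=0.1` is only a placeholder, since the text does not give a value for β.

```python
def concat_features(img_feat, txt_feat):
    """Fuse two unimodal feature vectors by simple concatenation."""
    return list(img_feat) + list(txt_feat)


def total_loss(loss_img, loss_txt, loss_mm, loss_proto, beta=0.1):
    """Sum of the three classification losses plus the beta-weighted alignment loss."""
    return loss_img + loss_txt + loss_mm + beta * loss_proto


def accuracy(pred_probs, labels):
    """Fraction of samples whose highest-probability class equals the true label."""
    correct = 0
    for probs, label in zip(pred_probs, labels):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)


if __name__ == "__main__":
    print(concat_features([1.0, 2.0], [3.0]))
    print(total_loss(0.5, 0.4, 0.3, 1.0, beta=0.2))
    print(accuracy([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]], [0, 1, 1]))
```

Only the alignment term is scaled by β, so that coefficient controls how strongly prototype alignment influences training relative to the classification losses.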
[0031] This invention takes medical imaging disease detection as an example to elaborate on the above method, including: Step 1: Collect multimodal data in the medical field, including lung CT images of the patient to be detected and corresponding electronic medical record texts; perform preprocessing such as denoising on the CT images and text cleaning on the electronic medical record texts; input the multimodal image and text data and perform feature extraction.
[0032] Step 2: Input the preprocessed lung CT image into the image encoder to extract visual features related to lung lesions in the CT image; input the preprocessed electronic medical record text into the text encoder to extract semantic features related to the disease in the text.
[0033] Step 3: Construct learnable medical domain semantic tags. Based on the semantic tags, perform feature reconstruction on the extracted visual features of CT images and semantic features of electronic medical record text, strengthen the feature dimensions related to disease detection, and suppress irrelevant noise features.
[0034] Step 4: Construct a cross-modal semantic prototype in the medical field. Through the cross-modal prototype alignment mechanism, calculate the similarity between the visual features of the current CT image, the semantic features of the electronic medical record text, and the semantic prototype. Optimize the cross-modal alignment loss function to achieve feature alignment between the image and text modalities.
[0035] Step 5: After fusing the visual features of the aligned lung CT image and the semantic features of the electronic medical record text, input the data into a multimodal classifier and output the lung cancer detection classification result for the patient.
[0036] The specific steps are as described above and will not be detailed here.
[0037] The present invention also provides a computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, performs the steps in the multimodal classification method based on cross-modal alignment described above.
[0038] The present invention also provides a multimodal classification system based on cross-modal alignment, comprising: a memory, used to store a software application; and a processor, which executes the software application. The architecture of the software application comprises a data loading and preprocessing module, a feature extraction module, a semantic reconstruction module, a cross-modal prototype alignment module, and a multimodal classification module, each of which performs a specific function within the system. The data loading and preprocessing module acquires and preprocesses multimodal image and text data. The feature extraction module includes an image encoder and a text encoder, used to extract image embedding features and text embedding features, respectively. The semantic reconstruction module reconstructs image and text features using learnable semantic tags and maps them to a shared semantic space. The cross-modal prototype alignment module calculates the similarity between image and text features and semantic prototypes and optimizes the alignment loss. The multimodal classification module inputs image features, text features, and fused features into a classifier for classification prediction.
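One way to picture how the modules listed above fit together at inference time is the skeleton below. It is a structural sketch only: the class name, field names, and the choice to classify the reconstructed features are assumptions, each module is represented by a plain callable, and the prototype alignment module is omitted because it contributes a training loss rather than an inference step.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vector = List[float]


@dataclass
class CrossModalClassificationSystem:
    """Wires the modules together; every field is a stand-in for a trained component."""
    preprocess: Callable[[object, str], Tuple[object, str]]  # data loading and preprocessing
    encode_image: Callable[[object], Vector]                 # feature extraction (image encoder)
    encode_text: Callable[[str], Vector]                     # feature extraction (text encoder)
    reconstruct: Callable[[Vector], Vector]                  # semantic reconstruction
    classify: Callable[[Vector], int]                        # multimodal classification

    def predict(self, image, text) -> int:
        image, text = self.preprocess(image, text)
        f_img = self.reconstruct(self.encode_image(image))
        f_txt = self.reconstruct(self.encode_text(text))
        return self.classify(f_img + f_txt)  # concatenate, then classify


if __name__ == "__main__":
    # Toy stand-ins, purely to show the data flow.
    system = CrossModalClassificationSystem(
        preprocess=lambda img, txt: (img, txt.strip().lower()),
        encode_image=lambda img: [float(sum(img))],
        encode_text=lambda txt: [float(len(txt))],
        reconstruct=lambda feat: feat,
        classify=lambda fused: int(fused[0] > fused[1]),
    )
    print(system.predict([1, 2, 3], "  Hi "))
```

Keeping each module behind a callable makes it straightforward to swap encoders or classifiers without changing the pipeline.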
[0039] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other.
[0040] This document uses specific examples to illustrate the principles and implementation methods of the present invention. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of the present invention. Furthermore, those skilled in the art will recognize that, based on the ideas of the present invention, there will be changes in the specific implementation methods and application scope. Therefore, the content of this specification should not be construed as a limitation of the present invention.
Claims
1. A multi-modal classification method based on cross-modal alignment, characterized in that it comprises: Step 1: Input multimodal image and text data, and perform preprocessing; Step 2: Input the image and text data into the image encoder and text encoder respectively for feature extraction; Step 3: Reconstruct image and text features using learnable semantic tags; Step 4: Calculate the similarity between image and text features and semantic prototypes through a cross-modal prototype alignment mechanism, and optimize the alignment loss; Step 5: Input the image features and text features into the multimodal classifier for classification.
2. The method of claim 1, wherein, In step 1, the input multimodal image and text data undergo preprocessing, specifically as follows: Acquire multimodal image and text data, and construct a dataset based on it, where the dataset exists in the form of image-text pairs.
3. The method according to claim 2, characterized in that, In step 2, the image and text data are input into the image encoder and text encoder respectively for feature extraction, specifically as follows: Multimodal images are input into an image encoder for feature extraction to obtain image embedding features; Text data is input into a text encoder for text feature extraction, generating text embedding features.
4. The method according to claim 3, characterized in that, In step 3, image features are reconstructed using learnable semantic tags, specifically as follows: Define learnable semantic tags and use them as a feature space shared by images and text; The extracted image embedding features are mapped to the shared feature space through a projection function, and the inner product similarity between the image embedding features and each semantic tag is calculated. The Sparsemax function is used to normalize the inner product similarity between the image embedding features and each semantic tag to obtain the weight of each tag; The semantic tags are weighted by these weights and summed to obtain the reconstructed image features.
5. The method according to claim 4, characterized in that, In step 3, text features are reconstructed using learnable semantic tags, specifically as follows: Define learnable semantic tags and use them as a feature space shared by images and text; The extracted text embedding features are mapped to the shared feature space through a projection function, and the inner product similarity between the text embedding features and each semantic tag is calculated. The Sparsemax function is used to normalize the inner product similarity between the text embedding features and each semantic tag to obtain the weight of each tag; The semantic tags are weighted by these weights and summed to obtain the reconstructed text features.
6. The method according to claim 5, characterized in that, In step 4, the similarity between image and text features and semantic prototypes is calculated through a cross-modal prototype alignment mechanism, and the alignment loss is optimized, specifically as follows: Define K learnable cross-modal semantic prototypes and initialize these prototypes randomly; The Sinkhorn-Knopp clustering algorithm is used to soft-assign the reconstructed image and text features; Calculate the visual softmax probability and the text softmax probability; Based on the visual softmax probability and the text softmax probability, cross-modal prototype alignment is achieved through text-to-image cross-prediction and image-to-text cross-prediction, each optimized with a cross-entropy loss, where the overall prototype alignment loss is the average of the two prediction losses over all image-text pairs.
7. The method according to claim 6, characterized in that, In step 5, the image features and text features are input into the multimodal classifier for classification, specifically as follows: Image features and text features are concatenated to obtain fused features; The fused features are input into a multimodal classifier to obtain multimodal prediction soft labels; The cross-entropy loss is calculated between the multimodal predicted soft labels and the true labels; The loss function is continuously optimized until the training iterations are completed.
8. A multimodal classification system based on cross-modal alignment, applied to the multimodal classification method based on cross-modal alignment as described in any one of claims 1-7, characterized in that it comprises: The data loading and preprocessing module is used to acquire multimodal image and text data and perform preprocessing; The feature extraction module includes an image encoder and a text encoder, which are used to extract image embedding features and text embedding features, respectively; The semantic reconstruction module is used to reconstruct image and text features using learnable semantic tags and map them to a shared semantic space; A cross-modal prototype alignment module is used to calculate the similarity between image and text features and semantic prototypes, and to optimize the alignment loss; The multimodal classification module is used to input image features, text features, and fused features into the classifier for classification and prediction.