A semantic consistency-driven multimodal topic modeling method
By introducing Dirichlet prior constraints and a semantic-correction contrastive learning mechanism into a unified topic space, the method addresses insufficient cross-modal semantic consistency in multimodal topic modeling and achieves accurate alignment and stable modeling of text and image topics.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NANJING UNIV OF POSTS & TELECOMM
- Filing Date
- 2026-03-16
- Publication Date
- 2026-06-16
AI Technical Summary
Existing multimodal topic modeling methods suffer from insufficient cross-modal semantic consistency, weak topic-level alignment capabilities, and limited topic interpretability and stability, making it difficult to effectively process multimodal image and text data.
By introducing Dirichlet prior constraints, semantic correction contrastive learning constraints, and multimodal statistical alignment mechanisms into a unified topic space, joint modeling and semantic alignment of text and image modalities are achieved, thereby enhancing the semantic consistency, discriminability, and stability of topic distribution.
It significantly improves the semantic consistency, stability, interpretability, and discriminability of multimodal topics, and achieves accurate alignment and stable modeling of text and image topics.
Figure CN121859997B_ABST
Description
Technical Field
[0001] This application belongs to the interdisciplinary field of natural language processing and computer vision, specifically involving a multimodal topic modeling method driven by semantic consistency.
Background Technology
[0002] As internet information rapidly evolves from a single-modal to a multimodal format, data in the real world is widely presented in the form of image-text pairs, such as images and captions in social media, pictures and text descriptions in news reports, and product images and attribute descriptions on e-commerce platforms. These multimodal data, depicting the same semantic object from different perceptual channels, contain highly relevant and complementary semantic information. Unified modeling of this data can help improve the performance of tasks such as semantic understanding, content retrieval, and intelligent recommendation.
[0003] Topic models, as an important tool for unsupervised semantic modeling, have wide applications in text clustering, semantic representation learning, and knowledge discovery. Traditional topic models, such as Latent Dirichlet Allocation (LDA), are mainly geared towards single text modalities, characterizing latent topic structures through word and document distributions, making it difficult to directly handle multimodal data containing image information. To adapt to multimodal scenarios, existing research has attempted to construct multimodal topic models, typically by extracting features from text and images separately and then fusing or jointly generating them in the latent space to uncover shared topic semantics across modalities.
[0004] However, existing multimodal topic modeling methods still have significant limitations: First, different modalities have fundamental differences in representation space, statistical distribution, and semantic structure. Simple feature concatenation or linear fusion cannot guarantee a strict semantic correspondence between text topics and image topics, which can easily lead to cross-modal topic semantic inconsistencies. Second, most methods focus on sample-level or representation-level alignment and lack explicit modeling of cross-modal statistical correlations at the joint topic distribution level. They fail to simultaneously enhance the consistency of the same topic and the separability of different topics at the topic dimension, resulting in insufficient stability and limited discriminative power of the learned topic space. Third, existing methods generally lack high-level semantic constraint mechanisms, making it difficult to use large-scale semantic models with cross-modal understanding capabilities to perform semantic correction on unsupervised topic structures, thereby affecting the interpretability and semantic consistency of topics.
[0005] Therefore, it is necessary to propose a new multimodal topic modeling method that simultaneously introduces Dirichlet prior constraints, cross-modal statistical alignment mechanisms, and semantic consistency correction constraints based on high-level semantic understanding into a unified topic space. This enables precise alignment of text modalities and image modalities at the topic distribution level, thereby enhancing the semantic consistency, discriminability, and stability of joint topics.
Summary of the Invention
[0006] To address the shortcomings of existing multimodal topic models, such as insufficient cross-modal semantic consistency, weak topic-level alignment capabilities, and limited topic interpretability and stability, this application proposes a semantic consistency-driven multimodal topic modeling method. This method aims to achieve joint topic modeling, semantic alignment, and statistical consistency constraints for text and image modalities within a shared topic space. By jointly modeling text and image modalities in a unified topic space, this method introduces Dirichlet prior constraints, semantic correction contrastive learning constraints, and multimodal statistical alignment constraints. At the topic distribution level, it simultaneously strengthens the semantic consistency of paired text and image samples, the intra-class compactness of samples within the same semantic cluster, and the inter-class separability of samples from different semantic clusters, thereby achieving accurate alignment and stable modeling of multimodal topics.
[0007] To achieve the above objectives, this application employs the following technical solution:
[0008] This application presents a semantic consistency-driven multimodal topic modeling method. Using paired text-image multimodal corpora as input, it achieves joint modeling and semantic alignment of text and images in a shared topic space. The multimodal topic modeling method includes the following steps:
[0009] Step 1: Perform data preprocessing on the paired text-image multimodal corpus data in the cross-modal dataset of the image-text corpus;
[0010] Step 2: Encode the text-image multimodal corpus from Step 1 using a multimodal pre-trained model to obtain vector representations of different modalities in the same vector space;
[0011] Step 3: Create a topic inference network. Input the vector representations of the different modalities obtained in Step 2 into the topic inference network to obtain the document topic distribution, the image topic distribution, and the multimodal topic distribution;
[0012] Step 4: Using the contrastive learning loss on the multimodal topic distribution, the multimodal topic alignment loss, and the Dirichlet prior loss based on energy loss as a joint optimization objective, train the topic inference network constructed in Step 3 to complete cross-modal topic alignment and joint topic modeling.
[0013] A further improvement of this application is that the cross-modal dataset of the text-image corpus in Step 1 is COCO, with the image data taken as the first modality and the text data as the second modality. Data preprocessing is performed on the cross-modal dataset, specifically: removing non-alphanumeric characters from the text data, performing part-of-speech restoration, and performing spell checking; and removing multimodal samples whose image data cannot be read normally.
[0014] A further improvement of this application is that step 2 specifically includes the following steps:
[0015] Step 2.1: Input the i-th text corpus into the multimodal pre-trained model for encoding to obtain its fixed-dimensional document vector representation;
[0016] Step 2.2: Input the i-th image corpus into the multimodal pre-trained model for encoding to obtain its image vector representation of the same dimension.
[0017] A further improvement of this application is that step 3 specifically includes the following steps:
[0018] Step 3.1: Input the document vector representation obtained in Step 2.1 and the image vector representation obtained in Step 2.2 into the topic inference network to obtain the document semantic hidden representation vector and the image semantic hidden representation vector, both of the hidden dimension:
[0019]
[0020]
[0021] Here, the first linear layer of the topic inference network has a weight matrix and a bias vector; the weight matrix is normalized by its spectral norm; the document and image semantic hidden representation vectors share one hidden dimension; the document and image vector representations share one embedding dimension; and the layer output passes through an activation function;
[0022] Step 3.2: Map the document semantic hidden representation vector and the image semantic hidden representation vector obtained in Step 3.1 to the document topic distribution and the image topic distribution, whose dimension equals the number of topics:
[0023]
[0024]
[0025] Here, the second linear layer has a weight matrix and the bias term of a fully connected layer; its output is batch-normalized; and the output function transforms the document and image semantic hidden representation vectors into the document topic distribution and the image topic distribution;
[0026] Step 3.3: Fuse the document topic distribution and the image topic distribution from Step 3.2 element by element, and then normalize the result to obtain the i-th multimodal topic distribution:
[0027]
[0028] Here, the fusion is element-wise multiplication.
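One way to write Steps 3.1 to 3.3 that is consistent with the definitions above is sketched below. All symbols are assumptions introduced for illustration: $x_i^{t}, x_i^{v}$ for the document and image vector representations, $W_1, b_1, W_2, b_2$ for the layer parameters, $\mathrm{SN}$ for spectral normalization, $\sigma$ for the activation function, $\mathrm{BN}$ for batch normalization, and $\odot$ for element-wise multiplication. Sharing the two linear layers across modalities and using an $\ell_1$ normalization in the fusion step are also assumptions.

```latex
\begin{aligned}
h_i^{t} &= \sigma\big(\mathrm{SN}(W_1)\,x_i^{t} + b_1\big), &
h_i^{v} &= \sigma\big(\mathrm{SN}(W_1)\,x_i^{v} + b_1\big),\\
\theta_i^{t} &= \mathrm{softmax}\big(\mathrm{BN}(W_2 h_i^{t} + b_2)\big), &
\theta_i^{v} &= \mathrm{softmax}\big(\mathrm{BN}(W_2 h_i^{v} + b_2)\big),\\
\theta_i &= \frac{\theta_i^{t}\odot\theta_i^{v}}{\lVert \theta_i^{t}\odot\theta_i^{v}\rVert_1}.
\end{aligned}
```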
[0029] A further improvement of this application is that, in Step 4, the Dirichlet prior loss based on energy loss is used to constrain the consistency between the topic distribution and the Dirichlet prior distribution. The specific calculation process is as follows:
[0030] Given a set of multimodal topic distributions and a set of prior distributions sampled from a Dirichlet distribution, the Dirichlet prior loss based on energy loss is:
[0031]
[0032] Here, the two indices are document numbers, with a value range of [value range missing]; the two multimodal topic distributions are the model outputs at those indices; and the two prior topic distributions are the samples at those indices drawn from the Dirichlet distribution.
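A common energy-based two-sample loss that fits this description is the energy distance between the inferred topic distributions and the prior samples. The form below is a sketch under that assumption: the symbols $\theta_i$ (inferred multimodal topic distributions), $\tilde{\theta}_j$ (Dirichlet prior samples), and $N$ (the size of each set) are introduced here for illustration, and the normalization of the original formula may differ.

```latex
\mathcal{L}_{\mathrm{dir}}
= \frac{2}{N^{2}}\sum_{i=1}^{N}\sum_{j=1}^{N}\lVert \theta_i - \tilde{\theta}_j \rVert_2
- \frac{1}{N^{2}}\sum_{i=1}^{N}\sum_{j=1}^{N}\lVert \theta_i - \theta_j \rVert_2
- \frac{1}{N^{2}}\sum_{i=1}^{N}\sum_{j=1}^{N}\lVert \tilde{\theta}_i - \tilde{\theta}_j \rVert_2
```

Under this form the loss is zero when the two sets coincide and grows as the inferred distributions drift away from the prior samples.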
[0033] A further improvement of this application is that, in Step 4, the contrastive learning loss on the joint topic distribution is constructed as follows:
[0034] Step 4.1.1: Perform mean fusion of the document vector representation and the image vector representation to construct a unified cross-modal fusion representation:
[0035]
[0036] Apply an unsupervised clustering algorithm to the unified cross-modal fusion representations of all samples to obtain the initial clustering result, that is, the initial cluster label of each sample and the corresponding clusters;
[0037] Step 4.1.2: Perform cluster-level semantic modeling for each cluster: from the unified cross-modal fusion representations belonging to the cluster, select the top-ranked samples closest to the cluster center, and input their corresponding text-image multimodal corpus pairs into the multimodal pre-trained model. Cluster-level semantic descriptions are constructed using a predefined prompt template of the form: "Given the following groups of highly semantically related images and texts, please synthesize their common semantic content and abstract a semantic description that can summarize the theme of the group of samples to represent the core semantics of the semantic cluster." This yields the cluster-level semantic description corresponding to each cluster;
[0038] Step 4.1.3: For any text-image multimodal corpus pair, input its image corpus and text corpus, together with all cluster-level semantic descriptions, into the multimodal pre-trained model with frozen parameters, which performs semantic adjudication using the following discriminative prompt: "Given an image-text sample and semantic descriptions of multiple semantic clusters, determine which semantic cluster the sample best fits semantically and output the corresponding cluster number." Based on its cross-modal semantic understanding and reasoning capabilities, the multimodal pre-trained model outputs the semantic cluster that the sample matches, and the cluster label of the sample is reassigned according to the maximum semantic matching principle:
[0039]
[0040] Here, the semantic discriminant function is that of the multimodal pre-trained model; its parameters remain frozen, and it is used only to make high-level semantic decisions based on the original image corpus and text corpus. After the above cluster-level semantic description construction and full-sample semantic discrimination and reassignment, the semantically corrected cluster pseudo-labels are obtained;
[0041] Step 4.1.4: Based on the semantically corrected cluster pseudo-labels, construct contrastive learning sample pairs in the joint topic distribution space: pairs of multimodal topic distributions belonging to the same semantic cluster serve as positive sample pairs, and pairs of multimodal topic distributions belonging to different semantic clusters serve as negative sample pairs, where each negative sample is a multimodal topic distribution from a semantic cluster different from that of the anchor. For a given batch size, the contrastive learning loss on the multimodal topic distribution is defined as:
[0042]
[0043] Here, similarity is measured by the cosine similarity function and scaled by a temperature coefficient.
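A standard contrastive objective that matches this description is the InfoNCE form below. It is a sketch, not the recovered formula: $\theta_i$ (anchor), $\theta_i^{+}$ (a distribution from the same semantic cluster), $\theta_j^{-}$ (distributions from other clusters), $s(\cdot,\cdot)$ (cosine similarity), $\tau$ (temperature), and $B$ (batch size) are notation assumed for illustration.

```latex
\mathcal{L}_{\mathrm{cl}}
= -\frac{1}{B}\sum_{i=1}^{B}
\log\frac{\exp\!\big(s(\theta_i,\theta_i^{+})/\tau\big)}
{\exp\!\big(s(\theta_i,\theta_i^{+})/\tau\big)
+\sum_{j}\exp\!\big(s(\theta_i,\theta_j^{-})/\tau\big)}
```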
[0044] A further improvement of this application is that, in Step 4, the cross-modal topic alignment loss is constructed as follows:
[0045] Construct a text topic distribution matrix:
[0046]
[0047] and an image topic distribution matrix:
[0048]
[0049] For a given batch size, the topic correlation matrix constructed from the cross-modal joint second-order statistics is expressed as follows:
[0050]
[0051] Here, the operation is the vector outer product;
[0052] The multimodal topic alignment loss is defined as:
[0053]
[0054] Here, the first term constrains the joint activation of the same topic across different modalities to be maximized in a statistical sense; the second term suppresses the cross-correlation between different topics and is scaled by a cross-topic decorrelation weight; and each matrix element indicates the joint activation intensity of one text topic and one image topic on the same batch of samples.
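The construction can be sketched as follows, with assumed notation: $\theta_i^{t}, \theta_i^{v} \in \mathbb{R}^{K}$ are the text and image topic distributions of sample $i$, $B$ is the batch size, $K$ is the number of topics, $\otimes$ is the outer product, and $\lambda$ is the decorrelation weight. Using the negative trace as the first term is one possible reading of "maximize the joint activation of the same topic"; the original formula may use a different form.

```latex
\begin{aligned}
\Theta^{t} &= [\theta_1^{t},\dots,\theta_B^{t}]^{\top}\in\mathbb{R}^{B\times K},\qquad
\Theta^{v} = [\theta_1^{v},\dots,\theta_B^{v}]^{\top}\in\mathbb{R}^{B\times K},\\
C &= \frac{1}{B}\sum_{i=1}^{B}\theta_i^{t}\otimes\theta_i^{v}
   = \frac{1}{B}\,(\Theta^{t})^{\top}\Theta^{v}\in\mathbb{R}^{K\times K},\\
\mathcal{L}_{\mathrm{align}} &= -\sum_{k=1}^{K} C_{kk} \;+\; \lambda\sum_{k\neq l} C_{kl}^{2}.
\end{aligned}
```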
[0055] A further improvement of this application is that, in step 4, the total training loss function of the topic inference network is defined as:
[0056]
[0057] Here, the terms are the contrastive learning loss on the multimodal topic distribution, the Dirichlet prior loss based on energy loss, and the multimodal topic alignment loss, combined using the weighting coefficients associated with the cross-modal topic alignment loss.
[0058] The beneficial effects of this application are as follows: This application targets multimodal text-image corpora, utilizes a multimodal pre-trained model to obtain a unified semantic embedding representation, and combines a topic inference network incorporating spectral normalization and nonlinear mapping to achieve stable multimodal topic distribution modeling; it ensures the sparsity and interpretability of topic distribution through Dirichlet prior constraints; it strengthens the semantic consistency and discriminativeness of joint topics through a contrastive learning mechanism based on semantic adjudication of a large multimodal model; and it explicitly models the correspondence between text topics and image topics in the topic dimension through cross-modal statistical alignment loss. Thus, a complete multimodal topic modeling process is constructed, from data preprocessing, multimodal encoding, topic inference to joint optimization of semantic consistency constraints and statistical alignment, which can significantly improve the performance of multimodal topics in terms of semantic consistency, stability, and interpretability, and has good application value and prospects for promotion.
Attached Figure Description
[0059] Figure 1 is a model structure diagram of this application.
[0060] Figure 2 is a flowchart of this application.
[0061] Figure 3 is the architecture diagram of the topic inference network of this application.
Detailed Implementation
[0062] The present application will be further explained below with reference to the accompanying drawings and specific implementation methods. It should be understood that the following specific examples are only for illustrating the present application and are not intended to limit the scope of the present application. After reading the present application, any modifications of the present application by those skilled in the art in various equivalent forms fall within the scope defined by the appended claims.
[0063] As shown in Figure 1 and Figure 2, this application provides a multimodal topic modeling method based on semantic consistency, which takes paired text-image multimodal corpora as input to achieve joint modeling and semantic alignment of text and images in a shared topic space. Specifically, it includes the following steps:
[0064] Step 1: The image-text cross-modal dataset is COCO, a parallel image-text multimodal corpus dataset published in the literature. The image data is taken as the first modality and the text data as the second modality. Data preprocessing is performed on the cross-modal dataset, specifically: removing non-alphanumeric characters from the text data, part-of-speech tagging, spell checking, removing excessively short documents, and removing multimodal samples whose image data cannot be read normally.
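The text-side cleaning in Step 1 can be sketched as below. This is a minimal illustration, not the applicant's pipeline: the function names, the `min_tokens` threshold, and the sample file names are hypothetical, and part-of-speech processing, spell checking, and image readability checks are left out because they need external tools.

```python
import re

def clean_caption(text: str) -> str:
    """Replace non-alphanumeric characters with spaces and collapse whitespace."""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return " ".join(text.split())

def preprocess(pairs, min_tokens=3):
    """Keep (image_path, caption) pairs whose cleaned caption has enough tokens.

    min_tokens is an assumed threshold for 'excessively short documents'.
    """
    kept = []
    for image_path, caption in pairs:
        cleaned = clean_caption(caption)
        if len(cleaned.split()) >= min_tokens:
            kept.append((image_path, cleaned))
    return kept

# Hypothetical sample pairs for illustration only.
pairs = [
    ("img_001.jpg", "A dog, running on the beach!"),
    ("img_002.jpg", "Cat."),
]
print(preprocess(pairs))
```

The second pair is dropped because its caption has a single token after cleaning.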
[0065] Step 2: Encode the text-image multimodal corpus from Step 1 using a multimodal pre-trained model to obtain vector representations of the different modalities in the same vector space. Specifically: the i-th text corpus is input into the multimodal pre-trained model for encoding to obtain its document vector representation, and the i-th image corpus is input into the multimodal pre-trained model for encoding to obtain its image vector representation.
[0066] Step 3: Create a topic inference network. Input the vector representations of the different modalities obtained in Step 2 into the topic inference network to obtain the document topic distribution, the image topic distribution, and the multimodal topic distribution.
[0067] As shown in Figure 3, the topic inference network described herein is constructed using an encoder-based neural network. The encoder of the topic inference network includes a first linear layer, a spectral normalization layer, a Softplus activation function layer, a second linear layer, a batch normalization layer, and a Softmax output layer. The first linear layer consists of a fully connected linear layer with an input dimension of embedding_size and an output dimension of hidden_size. The weights of the first linear layer are processed by spectral normalization, and the output of the first linear layer is then mapped by the Softplus activation function to obtain a semantic hidden representation vector. The second linear layer consists of a fully connected linear layer with an input dimension of hidden_size and an output dimension of num_topics. After the operation of the second linear layer, batch normalization and the Softmax output layer are used for processing to obtain a topic distribution vector of the num_topics dimension.
[0068] Step 3 specifically includes the following steps:
[0069] Step 3.1: The document vector representation and the image vector representation obtained in Step 2 are used as input to the topic inference network and are processed by the linear mapping of the first linear layer and the Softplus activation function to obtain the document semantic hidden representation vector and the image semantic hidden representation vector:
[0070]
[0071]
[0072] Here, the first linear layer has a weight matrix and a bias vector; the weight matrix is normalized by its spectral norm; the document and image semantic hidden representation vectors share the hidden dimension; the document and image vector representations share the embedding dimension; and the activation function is used to enhance the nonlinear semantic representation capability;
[0073] Step 3.2: Map the document semantic hidden representation vector and the image semantic hidden representation vector obtained in Step 3.1 to the document topic distribution and the image topic distribution, whose dimension equals the number of topics:
[0074]
[0075]
[0076] Here, the second linear layer has a weight matrix and the bias term of a fully connected layer; its output is batch-normalized; and the Softmax function transforms the document and image semantic hidden representation vectors into the document topic distribution and the image topic distribution;
[0077] Step 3.3: Fuse the document topic distribution and the image topic distribution from Step 3.2 element by element, and then normalize the result to obtain the i-th multimodal topic distribution:
[0078]
[0079] Here, the fusion is element-wise multiplication.
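The forward pass of Steps 3.1 to 3.3 can be sketched in plain Python as follows. The layer order follows the description of the topic inference network; everything else is an assumption made for illustration: one set of weights shared by both modalities, power iteration as the spectral-norm estimate, batch normalization without learned scale or shift, L1 normalization after the element-wise product, and random toy inputs in place of real embeddings.

```python
import math
import random

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def spectral_norm(W, iters=50):
    """Estimate the largest singular value of W by power iteration."""
    v = [1.0] * len(W[0])
    for _ in range(iters):
        u = matvec(W, v)
        wt_u = [sum(W[r][c] * u[r] for r in range(len(W))) for c in range(len(W[0]))]
        norm = math.sqrt(sum(a * a for a in wt_u)) or 1.0
        v = [a / norm for a in wt_u]
    return math.sqrt(sum(a * a for a in matvec(W, v)))

def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return math.log1p(math.exp(-abs(z))) + max(z, 0.0)

def softmax(z):
    m = max(z)
    e = [math.exp(a - m) for a in z]
    s = sum(e)
    return [a / s for a in e]

def batch_norm(rows, eps=1e-5):
    """Standardize each column over the batch (no learned scale or shift)."""
    n = len(rows)
    cols = []
    for col in zip(*rows):
        mean = sum(col) / n
        var = sum((a - mean) ** 2 for a in col) / n
        cols.append([(a - mean) / math.sqrt(var + eps) for a in col])
    return [list(r) for r in zip(*cols)]

def infer_topics(X, W1, b1, W2, b2):
    """Linear -> spectral norm -> Softplus -> Linear -> BatchNorm -> Softmax."""
    sigma = spectral_norm(W1) or 1.0
    W1_sn = [[w / sigma for w in row] for row in W1]
    H = [[softplus(a + b) for a, b in zip(matvec(W1_sn, x), b1)] for x in X]
    Z = [[a + b for a, b in zip(matvec(W2, h), b2)] for h in H]
    return [softmax(z) for z in batch_norm(Z)]

def fuse(theta_t, theta_v):
    """Element-wise product followed by L1 normalization (assumed)."""
    prod = [a * b for a, b in zip(theta_t, theta_v)]
    s = sum(prod)
    return [p / s for p in prod]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
embedding_size, hidden_size, num_topics, batch = 8, 6, 4, 5
W1 = random_matrix(hidden_size, embedding_size, rng)
W2 = random_matrix(num_topics, hidden_size, rng)
b1, b2 = [0.0] * hidden_size, [0.0] * num_topics
X_text = random_matrix(batch, embedding_size, rng)   # stand-in document vectors
X_image = random_matrix(batch, embedding_size, rng)  # stand-in image vectors
theta_text = infer_topics(X_text, W1, b1, W2, b2)
theta_image = infer_topics(X_image, W1, b1, W2, b2)
theta_multi = [fuse(t, v) for t, v in zip(theta_text, theta_image)]
```

Because the fusion multiplies the two distributions, a topic receives high weight in `theta_multi` only when both modalities assign it high probability.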
[0080] Step 4: Using the contrastive learning loss on the multimodal topic distribution, the multimodal topic alignment loss, and the Dirichlet prior loss based on energy loss as a joint optimization objective, train the topic inference network constructed in Step 3. By simultaneously constraining the topic consistency of paired text and image corpora in the joint topic space, the topic consistency between samples of the same semantic cluster, and the topic distinguishability between samples of different semantic clusters, the correspondence between text topics and image topics is enhanced in a statistical sense, thereby achieving semantically consistent alignment of the different modal corpora in the joint topic space and completing cross-modal topic alignment and joint topic modeling.
[0081] In Step 4, the Dirichlet prior loss is minimized to ensure consistency between the joint topic distribution and the Dirichlet prior distribution. The Dirichlet prior loss is calculated based on energy loss as follows: given a set of multimodal topic distributions and a set of prior distributions sampled from a Dirichlet distribution, the Dirichlet prior loss based on energy loss is:
[0082]
[0083] Here, the two indices are document numbers, with a value range of [value range missing]; the two multimodal topic distributions are the model outputs at those indices; and the two prior topic distributions are the samples at those indices drawn from the Dirichlet distribution.
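A minimal sketch of an energy-based loss between inferred topic distributions and prior samples is given below. It assumes the loss is the energy distance with Euclidean pairwise distances, which is one common choice and may differ from the exact formula of this application; the function names and toy inputs are illustrative.

```python
import math

def pairwise_mean_distance(A, B):
    """Mean Euclidean distance over all pairs (a, b) with a in A and b in B."""
    total = sum(math.dist(a, b) for a in A for b in B)
    return total / (len(A) * len(B))

def energy_loss(theta, prior):
    """Energy distance: 2*E|x - y| - E|x - x'| - E|y - y'|."""
    return (2.0 * pairwise_mean_distance(theta, prior)
            - pairwise_mean_distance(theta, theta)
            - pairwise_mean_distance(prior, prior))

# Toy inputs: inferred distributions versus two candidate prior sample sets.
theta = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
prior_near = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
prior_far = [[0.1, 0.1, 0.8], [0.1, 0.2, 0.7]]
loss_near = energy_loss(theta, prior_near)
loss_far = energy_loss(theta, prior_far)
```

The loss vanishes when the two sets coincide and increases as the inferred distributions move away from the prior samples, which is the behavior needed to pull the topic distributions toward the Dirichlet prior.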
[0084] In Step 4, a multimodal pre-trained model with frozen parameters is introduced to perform semantic discrimination and reassignment on the unsupervised clustering results, constructing cluster pseudo-labels with semantic consistency correction, and contrastive learning is performed in the joint topic distribution space based on these pseudo-labels to constrain the intra-class consistency and inter-class separability of the joint topic distribution. Specifically, this includes:
[0085] Perform mean fusion of the document vector representation and the image vector representation to construct a unified cross-modal fusion representation:
[0086]
[0087] Apply an unsupervised clustering algorithm to the unified cross-modal fusion representations of all samples to obtain the initial clustering result, that is, the initial cluster label of each sample and the corresponding clusters.
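The mean fusion and initial clustering can be sketched as follows. The application only specifies "an unsupervised clustering algorithm"; k-means with a deterministic farthest-point initialization is used here purely as an assumed example, and the two-dimensional vectors are toy stand-ins for real embeddings.

```python
import math

def mean_fusion(doc_vec, img_vec):
    """Average the document and image vectors element by element."""
    return [(a + b) / 2.0 for a, b in zip(doc_vec, img_vec)]

def kmeans(points, k, iters=20):
    """Plain k-means with farthest-point initialization (an assumed choice)."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

doc_vecs = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]]
img_vecs = [[0.0, 0.2], [0.0, 0.0], [5.0, 5.2], [5.0, 5.0]]
fused = [mean_fusion(d, v) for d, v in zip(doc_vecs, img_vecs)]
labels, centers = kmeans(fused, k=2)
print(labels)
```

The resulting labels are only the initial assignment; they are later corrected by the semantic adjudication step.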
[0088] Subsequently, a multimodal pre-trained model with frozen parameters is used to perform cluster-level semantic modeling for each cluster. For each cluster, the samples closest to the cluster center are selected from the samples belonging to that cluster, and their text-image multimodal corpus pairs are input into the multimodal pre-trained model. Cluster-level semantic descriptions are constructed using a predefined prompt template of the form: "Given several groups of highly semantically related images and texts, please synthesize their common semantic content and abstract a semantic description that summarizes the theme of this group of samples to represent the core semantics of this semantic cluster." This yields the cluster-level semantic description for each cluster.
[0089] On this basis, for any text-image multimodal corpus pair, its image corpus and text corpus, together with all cluster-level semantic descriptions, are input into the multimodal pre-trained model with frozen parameters, which makes a semantic decision using the following discriminative prompt: "Given an image-text sample and semantic descriptions of multiple semantic clusters, determine which semantic cluster the sample best fits semantically and output the corresponding cluster number."
[0090] Based on its cross-modal semantic understanding and reasoning capabilities, the multimodal pre-trained model outputs the matching result between the sample and each semantic cluster, and the cluster label of the sample is reassigned according to the maximum semantic matching principle:
[0091]
[0092] Here, the semantic discriminant function is that of the multimodal pre-trained model; its parameters are kept frozen, and it is used only to perform high-level semantic decisions based on the original image modality and text modality.
[0093] After the cluster-level semantic descriptions are constructed and all samples are semantically reassigned, the cluster pseudo-labels corrected for semantic consistency by the large model are obtained.
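The selection of representative samples and the label reassignment can be sketched as below. The frozen multimodal model is represented by a `judge` callable, which is a hypothetical interface; `keyword_judge` is a toy stand-in that compares words only, so that the example runs without any model. The prompt string is taken from the text above.

```python
import math

JUDGE_PROMPT = (
    "Given an image-text sample and semantic descriptions of multiple semantic "
    "clusters, determine which semantic cluster the sample best fits "
    "semantically and output the corresponding cluster number."
)

def nearest_to_center(fused, labels, centers, cluster, top_k):
    """Indices of the top_k samples of `cluster` closest to its center."""
    members = [i for i, l in enumerate(labels) if l == cluster]
    members.sort(key=lambda i: math.dist(fused[i], centers[cluster]))
    return members[:top_k]

def reassign_labels(pairs, descriptions, judge):
    """Ask the judge for the best-matching cluster number of every pair."""
    return [int(judge(JUDGE_PROMPT, image, text, descriptions))
            for image, text in pairs]

def keyword_judge(prompt, image, text, descriptions):
    """Toy judge: pick the description sharing the most words with the text."""
    words = set(text.lower().split())
    overlaps = [len(words & set(d.lower().split())) for d in descriptions]
    return overlaps.index(max(overlaps))

# Hypothetical cluster descriptions and samples for illustration only.
descriptions = ["animals playing outdoors", "food served at mealtime on table"]
pairs = [("a.jpg", "dog playing in the park"), ("b.jpg", "pizza on table")]
pseudo_labels = reassign_labels(pairs, descriptions, keyword_judge)
representatives = nearest_to_center(
    [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]], [0, 0, 0], [[0.9, 0.0]], cluster=0, top_k=2
)
```

In a real setup the `judge` callable would wrap a call to the frozen multimodal model and parse the cluster number from its reply.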
[0094] Based on the semantically corrected cluster pseudo-labels, contrastive learning sample pairs are constructed in the joint topic distribution space: pairs of multimodal topic distributions belonging to the same semantic cluster serve as positive sample pairs, and pairs of multimodal topic distributions belonging to different semantic clusters serve as negative sample pairs, where each negative sample is a multimodal topic distribution from a semantic cluster different from that of the anchor. For a given batch size, the contrastive learning loss on the multimodal topic distribution is defined as:
[0095]
[0096] Here, similarity is measured by the cosine similarity function and scaled by a temperature coefficient.
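A pseudo-label-driven contrastive loss of this kind can be sketched as follows. The InfoNCE-style form, the averaging over all positive pairs, and the temperature value are assumptions for illustration; only the use of cosine similarity, a temperature coefficient, and cluster-based positives and negatives comes from the text.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(thetas, labels, tau=0.5):
    """Average InfoNCE loss over all (anchor, positive) pairs in the batch."""
    losses = []
    n = len(thetas)
    for i in range(n):
        # Negatives: distributions whose pseudo-label differs from the anchor's.
        neg = sum(math.exp(cosine(thetas[i], thetas[j]) / tau)
                  for j in range(n) if labels[j] != labels[i])
        for p in range(n):
            if p != i and labels[p] == labels[i]:
                pos = math.exp(cosine(thetas[i], thetas[p]) / tau)
                losses.append(-math.log(pos / (pos + neg)))
    return sum(losses) / len(losses) if losses else 0.0

thetas = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
loss_consistent = contrastive_loss(thetas, [0, 0, 1, 1])
loss_inconsistent = contrastive_loss(thetas, [0, 1, 0, 1])
```

When the pseudo-labels agree with the geometry of the topic distributions, as in `loss_consistent`, the loss is lower than when they contradict it.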
[0097] By constructing contrast constraints using highly consistent clustering pseudo-labels obtained through semantic adjudication of a multimodal pre-trained model, the distribution of multimodal topics maintains high consistency within semantic clusters and significant distinction between different semantic clusters, thereby achieving intra-class consistency and inter-class separability of the multimodal topic distribution.
[0098] To enhance the matching degree between the same topic dimensions and reduce the correlation between different topic dimensions, a loss function for cross-modal topic alignment is constructed, which includes:
[0099] Construct a text topic distribution matrix:
[0100]
[0101] and an image topic distribution matrix:
[0102]
[0103] The topic association matrix constructed based on the cross-modal joint second-order statistic of within-batch samples is represented as follows:
[0104]
[0105] Here, the operation is the vector outer product.
[0106] Cross-modal topic alignment loss is defined as:
[0107]
[0108] Here, the first term constrains the joint activation of the same topic across different modalities to be maximized in a statistical sense; the second term suppresses the cross-correlation between different topics and is scaled by a cross-topic decorrelation weight; and each matrix element indicates the joint activation intensity of one text topic and one image topic on the same batch of samples.
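The topic correlation matrix and the two-term alignment loss can be sketched as below. The batch-averaged outer product follows the description; taking the negative sum of the diagonal as the first term and the squared off-diagonal entries as the second term is an assumed concrete form, and `decor_weight` is a hypothetical name for the decorrelation weight.

```python
def topic_correlation(theta_text, theta_image):
    """C[k][l]: batch mean of theta_text[i][k] * theta_image[i][l]."""
    b, k = len(theta_text), len(theta_text[0])
    return [[sum(theta_text[i][r] * theta_image[i][c] for i in range(b)) / b
             for c in range(k)] for r in range(k)]

def alignment_loss(theta_text, theta_image, decor_weight=1.0):
    """Reward same-topic co-activation, penalize cross-topic correlation."""
    C = topic_correlation(theta_text, theta_image)
    k = len(C)
    on_diag = sum(C[i][i] for i in range(k))
    off_diag = sum(C[i][j] ** 2 for i in range(k) for j in range(k) if i != j)
    return -on_diag + decor_weight * off_diag

text_topics = [[1.0, 0.0], [0.0, 1.0]]
aligned_images = [[1.0, 0.0], [0.0, 1.0]]
swapped_images = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = alignment_loss(text_topics, aligned_images)
loss_swapped = alignment_loss(text_topics, swapped_images)
```

In this toy case the aligned pair places all mass on the diagonal of the correlation matrix, while the swapped pair places it off the diagonal and is penalized.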
[0109] The total training loss function of the topic inference network is defined as:
[0110]
[0111] Here, the terms are the contrastive learning loss on the multimodal topic distribution, the Dirichlet prior loss based on energy loss, and the multimodal topic alignment loss, combined using the weighting coefficients associated with the cross-modal topic alignment loss.
[0112] In this step, training the topic inference network specifically includes the following steps:
[0113] Step 4.1: Construct a topic inference network and use an optimizer. The loss function of the topic inference network is the contrastive learning loss on the multimodal topic distribution, the Dirichlet prior constraint loss based on energy loss, and the multimodal topic alignment loss.
[0114] Step 4.2: Sample multimodal corpora from the dataset and obtain the joint topic distribution through the topic inference network; sample the prior distribution from a Dirichlet distribution with a preset parameter;
[0115] Step 4.3: Use the energy-function-based loss between the prior distribution from Step 4.2 and the inferred joint topic distribution as the Dirichlet prior loss, combine it with the semantic-correction contrastive learning loss and the cross-modal topic alignment loss, and perform stochastic gradient descent to update the parameters of the topic inference network;
[0116] Step 4.4: Repeat Steps 4.2 and 4.3 until the topic inference network converges.
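Two pieces of this training procedure can be sketched without a deep learning framework: drawing the prior batch from a Dirichlet distribution and combining the three losses into one objective. The symmetric concentration value 0.5, the weight names `w_dirichlet` and `w_align`, and their defaults are assumptions; the gradient update itself would be handled by an automatic differentiation library and is not shown.

```python
import random

def sample_dirichlet(alpha, num_topics, rng):
    """One draw from a symmetric Dirichlet(alpha) via normalized Gamma draws."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(g)
    return [x / total for x in g]

def total_loss(l_contrast, l_dirichlet, l_align, w_dirichlet=1.0, w_align=1.0):
    """Weighted sum of the three loss terms (the weighting scheme is assumed)."""
    return l_contrast + w_dirichlet * l_dirichlet + w_align * l_align

rng = random.Random(0)
# Step 4.2: one prior sample per document in the batch.
prior_batch = [sample_dirichlet(0.5, num_topics=5, rng=rng) for _ in range(4)]
# Step 4.3: combine placeholder loss values into the joint objective.
objective = total_loss(1.0, 2.0, 3.0, w_dirichlet=0.5, w_align=2.0)
```

A concentration below 1 makes the prior samples sparse, which is the usual reason for choosing a Dirichlet prior in topic models.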
[0117] To verify this application, experiments were conducted on multimodal image and text data. The relevant experimental data on the COCO dataset are shown in Table 1:
[0118] Table 1. Experimental data of multimodal image and text data on the COCO dataset.
[0119]
[0120] As shown in Table 1, the topic consistency and diversity metrics of this application on the COCO dataset are NPMI = 0.0657, WE = 0.2653, TD = 0.8879, IEC = 0.7382, CID = 0.4529, and ITEC = 0.2446. All metrics are higher than those of the comparative experiments, whose best values are NPMI = -0.0079, WE = 0.2375, TD = 0.8365, IEC = 0.7299, CID = 0.4251, and ITEC = 0.2389. Here, NPMI (Normalized Pointwise Mutual Information) measures the semantic association between words within a text topic; a higher value indicates stronger word relevance and better topic coherence. WE (Topic Word Embedding Similarity) reflects the consistency between text topics; a higher value indicates higher topic consistency. TD (Topic Diversity) characterizes the richness of generated topics; a higher value indicates broader topic coverage and less repetition. IEC (Image Consistency Evaluation Index) quantifies the semantic consistency level within an image topic. CID (Image Topic Distinctiveness) measures the difference between different topics. ITEC (Intermodal Topic Consistency) measures the matching and alignment between text topics and image topics; a higher value indicates better semantic consistency between text and image topics and better intermodal alignment.
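For orientation, two of the text-side metrics can be computed as sketched below. The application does not state its exact formulas, so the definitions here are the commonly used ones and should be read as assumptions: TD as the fraction of unique words among the top words of all topics, and NPMI from document-level co-occurrence of a word pair. The toy topics and documents are made up.

```python
import math

def topic_diversity(topics):
    """Fraction of unique words among the top words of all topics."""
    words = [w for topic in topics for w in topic]
    return len(set(words)) / len(words)

def npmi(word_a, word_b, documents):
    """NPMI of a word pair, with probabilities estimated from document counts."""
    n = len(documents)
    p_a = sum(word_a in d for d in documents) / n
    p_b = sum(word_b in d for d in documents) / n
    p_ab = sum(word_a in d and word_b in d for d in documents) / n
    if p_ab == 0.0:
        return -1.0  # the words never co-occur
    if p_ab == 1.0:
        return 1.0   # the words co-occur in every document
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)

topics = [["dog", "park", "ball"], ["cat", "sofa", "ball"]]
documents = [{"dog", "park"}, {"dog", "park"}, {"cat", "sofa"}, {"cat", "sofa"}]
td = topic_diversity(topics)
npmi_related = npmi("dog", "park", documents)
npmi_unrelated = npmi("dog", "cat", documents)
```

A topic-level NPMI score is usually obtained by averaging the pairwise values over the top words of each topic.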
[0121] The comparative models used in this experiment are Comparative Example 1, Comparative Example 2, and Comparative Example 3. Specifically, Comparative Example 1 uses the Multimodal-ZeroShotTM method; Comparative Example 2 uses the Multimodal-Contrast method; and Comparative Example 3 uses the Multimodal-BERTopic method. Each of these three comparative multimodal methods has its own focus: Multimodal-ZeroShotTM focuses on zero-shot learning, relying on pre-trained feature extraction, cross-modal alignment, and prompt engineering to transfer knowledge, but its accuracy is slightly lower on complex tasks; Multimodal-Contrast is based on cross-modal contrastive learning and uses a contrastive loss function to align the modalities, but it requires a large number of samples; Multimodal-BERTopic integrates BERT with topic modeling and associates multimodal features to mine topics, but its performance is limited by its handling of purely unstructured modalities.
[0122] Experimental results show that the proposed method outperforms the comparative methods on all six metrics of intra-topic consistency, topic diversity, and cross-modal topic consistency (NPMI, IEC, WE, TD, CID, ITEC), verifying the comprehensive advantages of the proposed method in terms of cross-modal topic alignment capability, topic semantic consistency, and topic diversity.
[0123] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit the present invention. Any equivalent modifications or improvements made based on the technical concept of the present invention should fall within the protection scope defined by the claims of the present invention.
Claims
1. A semantic-consistency-driven multimodal topic modeling method, taking paired text-image multimodal corpora as input and realizing joint modeling and semantic alignment of text and images in a shared topic space, characterized in that the multimodal topic modeling method includes the following steps: Step 1: perform data preprocessing on the paired text-image multimodal corpus data in the cross-modal text-image dataset; Step 2: encode the text-image multimodal corpus from Step 1 using a multimodal pre-trained model to obtain vector representations of the different modalities in the same vector space; Step 3: create a topic inference network, and input the vector representations of the different modalities obtained in Step 2 into the topic inference network to obtain the document topic distribution, the image topic distribution, and the multimodal topic distribution; Step 4: taking the contrastive learning loss on the multimodal topic distribution, the multimodal topic alignment loss, and the energy-based Dirichlet prior loss as the joint optimization objective, train the topic inference network constructed in Step 3 to complete cross-modal topic alignment and joint topic modeling. The contrastive learning loss on the multimodal topic distribution is defined in terms of a cosine similarity function, a temperature coefficient, the multimodal topic distribution of a sample, and the multimodal topic distributions of the samples that belong to semantic clusters different from that of the sample; based on the semantically corrected clustering pseudo-labels, multimodal topic distribution pairs belonging to the same semantic cluster serve as positive sample pairs, and multimodal topic distribution pairs belonging to different semantic clusters serve as negative sample pairs. The multimodal topic alignment loss comprises a first term that constrains the joint activation of the same topic across different modalities to be maximized in a statistical sense, a second term that suppresses the cross-correlation between different topics, and a cross-topic decorrelation weight, where each element of the topic correlation matrix indicates the joint activation intensity of one text topic and one image topic on the same batch of samples. The energy-based Dirichlet prior loss is defined over pairs of document indices, each index denoting a document number with a value range of [value range missing]; it compares the multimodal topic distributions output by the model for the two indexed documents with the prior topic distributions sampled from the Dirichlet distribution for the same two indices.
2. The multimodal topic modeling method based on semantic consistency as described in claim 1, characterized in that: the cross-modal text-image dataset in Step 1 is COCO, the image data is set as the first modality, and the text data is set as the second modality; data preprocessing is performed on the cross-modal text-image dataset, specifically: removing non-alphanumeric characters from the text data, lemmatizing the words, and performing spell checking; and removing multimodal samples whose image data cannot be read correctly.
3. The multimodal topic modeling method based on semantic consistency as described in claim 2, characterized in that Step 2 specifically includes the following steps: Step 2.1: input each text corpus into the multimodal pre-trained model for encoding to obtain a document vector representation; Step 2.2: input each image corpus into the multimodal pre-trained model for encoding to obtain an image vector representation of the same dimension as the document vector representation.
4. The multimodal topic modeling method based on semantic consistency as described in claim 3, characterized in that Step 3 specifically includes the following steps: Step 3.1: input the document vector representation obtained in Step 2.1 and the image vector representation into the topic inference network to obtain the document semantic hidden representation vector and the image semantic hidden representation vector, which share the same hidden dimension; the first linear layer of the topic inference network has a weight matrix and a bias vector, the weight matrix is normalized by its spectral norm, the input dimension of the layer is the dimension of the document vector representation and the image vector representation, and the layer output is passed through an activation function; Step 3.2: map the document semantic hidden representation vector and the image semantic hidden representation vector obtained in Step 3.1 to the document topic distribution and the image topic distribution, whose dimension equals the number of topics; the second linear layer has a weight matrix and a fully connected bias term, batch normalization is applied to its output, and a final mapping transforms the document semantic hidden representation vector and the image semantic hidden representation vector into the document topic distribution and the image topic distribution; Step 3.3: fuse the document topic distribution from Step 3.2 with the image topic distribution element by element, and then normalize the result to obtain the multimodal topic distribution of each sample, where the fusion operator denotes element-wise multiplication.
5. The multimodal topic modeling method based on semantic consistency as described in claim 4, characterized in that: in Step 4, the energy-based Dirichlet prior loss is used to constrain the consistency between the topic distribution and the Dirichlet prior distribution, and is calculated as follows: given a set of multimodal topic distributions and a set of prior distributions sampled from the Dirichlet distribution, the energy-based Dirichlet prior loss is computed between these two sets.
6. The multimodal topic modeling method based on semantic consistency as described in claim 5, characterized in that, in Step 4, the contrastive learning loss on the multimodal topic distribution is constructed as follows: Step 4.1.1: perform mean fusion of the document vector representation and the image vector representation to construct a cross-modal unified fusion representation; apply an unsupervised clustering algorithm to all cross-modal unified fusion representations to obtain the initial clustering result, that is, the initial cluster label of each sample and the corresponding clusters; Step 4.1.2: perform cluster-level semantic modeling for each cluster: from the cross-modal unified fusion representations belonging to the cluster, select the samples closest to the cluster center and take their corresponding text-image multimodal corpus pairs; input these text-image pairs into the multimodal pre-trained model and construct a cluster-level semantic description using a predefined prompt template of the form: "Given the following groups of highly semantically related images and texts, please synthesize their common semantic content and abstract a semantic description that can summarize the theme of the samples to represent the core semantics of the semantic cluster."; this yields the cluster-level semantic description corresponding to each cluster; Step 4.1.3: for any text-image multimodal corpus pair, input its image corpus and text corpus, together with all cluster-level semantic descriptions, into the multimodal pre-trained model with frozen parameters, which performs semantic adjudication using the following discriminative prompt: "Given an image-text sample and semantic descriptions of multiple semantic clusters, determine which semantic cluster the sample best fits semantically and output the corresponding cluster number."; based on its cross-modal semantic understanding and reasoning capabilities, the multimodal pre-trained model outputs the semantic cluster that the sample best matches, and the cluster label of the sample is reassigned according to the maximum semantic matching principle; the semantic discriminant function of the multimodal pre-trained model keeps its parameters frozen and is used only to perform high-level semantic adjudication on the original image corpus and text corpus, yielding the cluster pseudo-labels; Step 4.1.4: based on the semantically corrected clustering pseudo-labels, construct contrastive learning sample pairs in the joint topic distribution space: multimodal topic distribution pairs belonging to the same semantic cluster serve as positive sample pairs, and multimodal topic distribution pairs belonging to different semantic clusters serve as negative sample pairs; the contrastive learning loss on the multimodal topic distribution is then computed over each training batch.
7. The multimodal topic modeling method based on semantic consistency as described in claim 6, characterized in that, in Step 4, the multimodal topic alignment loss is constructed by the following steps: construct the text topic distribution matrix and the image topic distribution matrix of a training batch; construct the topic correlation matrix from the cross-modal joint second-order statistics of the batch, using the vector outer product operation; and obtain the multimodal topic alignment loss from the topic correlation matrix.
8. The multimodal topic modeling method based on semantic consistency as described in claim 7, characterized in that: in Step 4, the total training loss function of the topic inference network is defined as a weighted combination of the contrastive learning loss on the multimodal topic distribution, the energy-based Dirichlet prior loss, and the multimodal topic alignment loss, together with the weighting coefficients for the multimodal topic alignment loss.
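The loss terms named in claims 6 to 8 can be illustrated with a minimal sketch. The exact formulas are not reproduced in this text, so every form below is an assumption: a supervised-contrastive-style loss over pseudo-labeled topic distributions, an alignment loss that rewards the diagonal of the batch cross-correlation matrix and penalizes its squared off-diagonal entries, and a plain weighted sum for the total. The function names, the default values of `tau`, `lam`, `w_prior`, and `w_align`, and the toy batch are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def contrastive_loss(thetas, labels, tau=0.5):
    """Assumed form: pairs sharing a pseudo-label are positives, others negatives."""
    n = len(thetas)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        denom = sum(math.exp(cosine(thetas[i], thetas[k]) / tau)
                    for k in range(n) if k != i)
        for j in positives:
            sim = math.exp(cosine(thetas[i], thetas[j]) / tau)
            total -= math.log(sim / denom) / len(positives)
        anchors += 1
    return total / max(anchors, 1)


def alignment_loss(text_topics, image_topics, lam=0.1):
    """Assumed form: maximize diagonal of C, suppress off-diagonal entries."""
    batch, k = len(text_topics), len(text_topics[0])
    # C[i][j]: mean joint activation of text topic i and image topic j,
    # i.e. the batch average of the outer products of the two distributions.
    c = [[sum(t[i] * v[j] for t, v in zip(text_topics, image_topics)) / batch
          for j in range(k)] for i in range(k)]
    on_diag = sum(c[i][i] for i in range(k))
    off_diag = sum(c[i][j] ** 2 for i in range(k) for j in range(k) if i != j)
    return -on_diag + lam * off_diag


def total_loss(l_con, l_prior, l_align, w_prior=1.0, w_align=1.0):
    """Assumed weighted sum of the three loss terms."""
    return l_con + w_prior * l_prior + w_align * l_align


thetas = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1], [0.05, 0.05, 0.9], [0.1, 0.1, 0.8]]
loss_good_labels = contrastive_loss(thetas, [0, 0, 1, 1])
loss_bad_labels = contrastive_loss(thetas, [0, 1, 0, 1])
aligned = alignment_loss(thetas, thetas)
misaligned = alignment_loss(thetas, [list(reversed(t)) for t in thetas])
```

In this toy batch, pseudo-labels that group similar topic distributions give a lower contrastive loss than labels that group dissimilar ones, and image topics permuted relative to the text topics raise the alignment loss, which matches the intent described in the claims.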