Cross-modal data alignment method, device, equipment and storage medium

By using supervised training of cross-modal data representation models and a deep reparameterized variational inference network of Dirichlet process Gaussian mixture models, the challenges of high manpower and material costs and large-scale dataset alignment in cross-modal data alignment are solved, achieving efficient and accurate alignment and clustering of multimodal data.

CN115392366B · Active · Publication Date: 2026-06-16 · INST OF AUTOMATION CHINESE ACAD OF SCI

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: INST OF AUTOMATION CHINESE ACAD OF SCI
Filing Date: 2022-08-19
Publication Date: 2026-06-16

AI Technical Summary

Technical Problem

Existing technologies consume a lot of human and material resources in cross-modal data alignment, making it difficult to efficiently establish semantic relationships between multimodal data, especially on large-scale datasets where unsupervised clustering and accurate alignment are difficult to apply.

Method used

Supervised training is performed using a cross-modal data representation model. The similarity between multimodal data is used as supervision information. Unsupervised clustering is performed using a deep reparameterized variational inference network based on a Dirichlet process Gaussian mixture model to obtain pseudo-labels, establish semantic connections between misaligned samples, and optimize model training through a preset loss function.

Benefits of technology

It achieves accurate alignment on large-scale multimodal datasets, improves the training effect and alignment accuracy of cross-modal data representation models, and enhances the accuracy of unsupervised clustering.

✦ Generated by Eureka AI based on patent content.

Abstract

Embodiments of the present application provide a cross-modal data alignment method, device, equipment and storage medium, the method comprising: obtaining target multi-modal data; inputting the target multi-modal data into a trained cross-modal data representation model to obtain target multi-modal data representation of the target multi-modal data in a common semantic representation space; the cross-modal data representation model is trained based on multi-modal data and similarity between the multi-modal data; the cross-modal data representation model is used for representing the multi-modal data; the similarity between the multi-modal data is used as supervision information for training the cross-modal data representation model; and aligned multi-modal data is obtained according to the similarity between target multi-modal data representations. The method of the embodiments of the present application realizes alignment of multi-modal data.

Description

Technical Field

[0001] This invention relates to the field of data processing technology, and in particular to a cross-modal data alignment method, apparatus, device, and storage medium.

Background Technology

[0002] With the rapid development of the Internet and multimedia, multimodal data such as text, images, videos, and audio are experiencing explosive growth.

[0003] To establish relationships between multimodal data, cross-modal data alignment has become increasingly important. However, the internet and multimedia contain massive amounts of data, and annotating all of this data with semantic information would consume a significant amount of human and material resources. Therefore, automatically mining and aligning cross-modal sample pairs from massive amounts of online multimedia data is of great significance for learning cross-modal representations of multimodal data and establishing semantic connections between multimodal data.

Summary of the Invention

[0004] To address the problems in the prior art, embodiments of the present invention provide a cross-modal data alignment method, apparatus, device, and storage medium.

[0005] Specifically, the embodiments of the present invention provide the following technical solutions:

[0006] In a first aspect, embodiments of the present invention provide a cross-modal data alignment method, comprising:

[0007] Acquire target multimodal data;

[0008] The target multimodal data is input into a trained cross-modal data representation model to obtain the target multimodal data representation in a common semantic representation space. The cross-modal data representation model is trained based on the multimodal data and the similarity between the multimodal data. The cross-modal data representation model is used to represent the multimodal data. The similarity between the multimodal data serves as supervision information for training the cross-modal data representation model.

[0009] Aligned multimodal data is obtained based on the similarity between the target multimodal data representations.

[0010] Furthermore, the cross-modal data representation model is trained in the following manner:

[0011] Obtain the clustering results of the multimodal data of the samples;

[0012] Based on the clustering results of the sample multimodal data, the similarity between the sample multimodal data is obtained;

[0013] Based on the sample multimodal data and the similarity between the sample multimodal data, the initial cross-modal data representation model is trained to obtain the trained cross-modal data representation model.

[0014] Further, the step of training the initial cross-modal data representation model based on the sample multimodal data and the similarity between the sample multimodal data to obtain the trained cross-modal data representation model includes:

[0015] The sample multimodal data is input into the initial cross-modal data representation model to obtain the sample multimodal data representation output by the initial cross-modal data representation model.

[0016] The initial cross-modal data representation model is trained based on the similarity between the multimodal data representations of the samples and the similarity between the multimodal data of the samples, to obtain the trained cross-modal data representation model.

[0017] Furthermore, obtaining the clustering results of the sample multimodal data includes:

[0018] Acquire multimodal data of the samples;

[0019] The multimodal data of the samples are input into the initial cross-modal data representation model to obtain the multimodal data representation of the samples;

[0020] Clustering results of the sample multimodal data are obtained based on the deep reparameterized variational inference network of the Dirichlet process Gaussian mixture model and the multimodal data representation of the sample.

[0021] Furthermore, the step of obtaining the clustering results of the sample multimodal data based on the deep reparameterized variational inference network of the Dirichlet process Gaussian mixture model and the multimodal data representation of the samples includes:

[0022] The multimodal data representation of the samples is input into the deep reparameterized variational inference network of the Dirichlet process Gaussian mixture model, and the distance between the reparameterized variational probability distribution and the probability distribution of the Dirichlet process Gaussian mixture model is optimized to obtain the clustering results of the multimodal data of the samples.

[0023] Furthermore, the cross-modal data representation model is trained using a preset loss function, which is:

[0024]

[0025] where the left-hand side denotes the loss function; $N$ denotes the number of multimodal samples; the supervision term denotes the similarity between the multimodal data of the i-th sample and the multimodal data of the j-th sample; the formula also contains a normalization parameter; $d_{ij}$ denotes the similarity between the multimodal data representation of the i-th sample and the multimodal data representation of the j-th sample; $\lambda$ denotes the weight; and $m$ denotes the margin parameter.

[0026] In a second aspect, embodiments of the present invention also provide a cross-modal data alignment device, comprising:

[0027] The acquisition module is used to acquire target multimodal data;

[0028] The representation module is used to input the target multimodal data into the trained cross-modal data representation model to obtain the target multimodal data representation in a common semantic representation space; the cross-modal data representation model is trained based on multimodal data and the similarity between multimodal data; the cross-modal data representation model is used to represent multimodal data; the similarity between multimodal data serves as supervision information for training the cross-modal data representation model;

[0029] An alignment module is used to obtain aligned multimodal data based on the similarity between the target multimodal data representations.

[0030] In a third aspect, embodiments of the present invention also provide an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the cross-modal data alignment method as described in the first aspect.

[0031] In a fourth aspect, embodiments of the present invention also provide a non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the cross-modal data alignment method as described in the first aspect.

[0032] In a fifth aspect, embodiments of the present invention also provide a computer program product, including a computer program that, when executed by a processor, implements the cross-modal data alignment method as described in the first aspect.

[0033] The cross-modal data alignment method, apparatus, device, and storage medium provided in this invention acquire multimodal data and input it into a trained cross-modal data representation model to obtain multimodal data representations in a common semantic representation space. Then, based on the similarity between these multimodal data representations, multimodal data alignment can be achieved. The cross-modal data representation model is trained using supervised training methods. The training samples are multimodal data, and the supervision information for training the model is the similarity between the multimodal data. That is, by using the similarity between multimodal data as the supervision information for training the cross-modal data representation model, the model can learn from the supervision information and output more accurate multimodal data representations. Therefore, based on the accurate multimodal data representations output by the cross-modal data representation model, accurate multimodal data alignment can be achieved.

Attached Figure Description

[0034] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0035] Figure 1 is the first flowchart of the cross-modal data alignment method provided in an embodiment of the invention;

[0036] Figure 2 is the second flowchart of the cross-modal data alignment method provided in an embodiment of the invention;

[0037] Figure 3 is the third flowchart of the cross-modal data alignment method provided in an embodiment of the invention;

[0038] Figure 4 is a schematic diagram of the cross-modal data alignment device provided in an embodiment of the present invention;

[0039] Figure 5 is a schematic diagram of the structure of the electronic device provided in an embodiment of the present invention.

Detailed Implementation

[0040] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.

[0041] The method described in this invention can be applied to data processing scenarios to achieve cross-modal data alignment.

[0042] In related technologies, cross-modal data alignment is becoming increasingly important for establishing relationships between multimodal data. However, the internet and multimedia contain massive amounts of data, and annotating all data with semantic information would consume a significant amount of human and material resources. Therefore, automatically mining and aligning cross-modal sample pairs from massive amounts of online multimedia data is of great significance for learning cross-modal representations of multimodal data and establishing semantic connections between multimodal data.

[0043] The cross-modal data alignment method of this invention acquires multimodal data and inputs it into a trained cross-modal data representation model to obtain multimodal data representations in a common semantic representation space. Then, based on the similarity between these multimodal data representations, multimodal data alignment can be achieved. The cross-modal data representation model is trained using supervised training methods. The training samples are multimodal data, and the supervision information for training the model is the similarity between the multimodal data. That is, by using the similarity between multimodal data as the supervision information for training the cross-modal data representation model, the model can learn from the supervision information and output more accurate multimodal data representations. Therefore, based on the accurate multimodal data representations output by the cross-modal data representation model, accurate multimodal data alignment can be achieved.

[0044] The technical solution of the present invention is described in detail below with reference to Figures 1-5 and specific embodiments. The following specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments.

[0045] Figure 1 is a flowchart illustrating an embodiment of the cross-modal data alignment method provided by this invention. As shown in Figure 1, the method provided in this embodiment includes:

[0046] Step 101: Obtain target multimodal data.

[0047] Specifically, with the rapid development of the internet and multimedia, multimodal data such as text, images, videos, and audio are exploding. To establish relationships between multimodal data, cross-modal data alignment has become increasingly important. In order to achieve cross-modal data alignment, this embodiment of the invention first needs to acquire the target multimodal data, that is, the multimodal data that needs to be aligned. Optionally, the multimodal data can be text, images, or other modal data.

[0048] Step 102: Input the target multimodal data into the trained cross-modal data representation model to obtain the target multimodal data representation in the common semantic representation space; the cross-modal data representation model is trained based on the multimodal data and the similarity between multimodal data; the cross-modal data representation model is used to represent multimodal data; the similarity between multimodal data is used as supervision information for training the cross-modal data representation model.

[0049] Specifically, to achieve cross-modal data alignment, this embodiment of the invention, after acquiring multimodal data, inputs the multimodal data into a trained cross-modal data representation model to obtain multimodal data representations in a common semantic representation space. This means mapping the features of multimodal data such as images and text to a common semantic representation space to obtain a common representation of the multimodal data. The cross-modal data representation model is trained based on multimodal data and the similarity between multimodal data. The similarity between multimodal data serves as the supervision information for training the cross-modal data representation model. In other words, the cross-modal data representation model is trained using a supervised training method. The training samples are multimodal data, and the supervision information for training the cross-modal data representation model is the similarity between multimodal data. By using the similarity between multimodal data as the supervision information for training the cross-modal data representation model, the model can learn from the supervision information and output more accurate multimodal data representations. Therefore, based on the supervision information, a better training effect for the cross-modal data representation model can be obtained.

[0050] Step 103: Obtain aligned multimodal data based on the similarity between the target multimodal data representations.

[0051] Specifically, multimodal data is input into a trained cross-modal data representation model to obtain multimodal data representations in a common semantic representation space. Then, the multimodal data can be aligned based on the similarity between these representations. Optionally, multimodal data with high similarity between their representations are aligned to obtain aligned multimodal data.
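Steps 101 to 103 can be illustrated with a minimal sketch. It assumes the trained model has already mapped each image and each text to a vector in the common semantic space, and it uses cosine similarity with a cutoff as one possible reading of "high similarity". The function names, the `threshold` value, and the toy vectors are hypothetical and do not come from the patent.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def align(image_reprs, text_reprs, threshold=0.8):
    """Pair each image with its most similar text when similarity >= threshold.

    `threshold` is an assumed cutoff for "high similarity".
    Returns a list of (image_index, text_index, similarity) tuples.
    """
    pairs = []
    for i, img in enumerate(image_reprs):
        sims = [cosine(img, txt) for txt in text_reprs]
        j = max(range(len(sims)), key=sims.__getitem__)
        if sims[j] >= threshold:
            pairs.append((i, j, sims[j]))
    return pairs

# Toy representations in a shared 3-dimensional space (made-up values).
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.1]]
texts = [[0.0, 0.9, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
aligned = align(images, texts)
```

In this toy case the first image pairs with the second text and the second image with the first text; the third text stays unaligned because no image is close to it.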

[0052] The method described in the above embodiments acquires multimodal data and inputs it into a trained cross-modal data representation model to obtain multimodal data representations in a common semantic representation space. Then, based on the similarity between these multimodal data representations, multimodal data alignment can be achieved. The cross-modal data representation model is trained using supervised training methods. The training samples are multimodal data, and the supervision information for training the model is the similarity between the multimodal data. That is, by using the similarity between multimodal data as the supervision information for training the cross-modal data representation model, the model can learn from the supervision information and output more accurate multimodal data representations. Therefore, based on the accurate multimodal data representations output by the cross-modal data representation model, accurate multimodal data alignment can be achieved.

[0053] In one embodiment, the cross-modal data representation model is trained in the following manner:

[0054] Obtain the clustering results of the multimodal data of the samples;

[0055] Based on the clustering results of the multimodal data samples, the similarity between the multimodal data samples is obtained;

[0056] Based on the sample multimodal data and the similarity between the sample multimodal data, the initial cross-modal data representation model is trained to obtain the trained cross-modal data representation model.

[0057] Specifically, related technologies only establish cross-modal semantic connections through aligned training sample pairs (positive samples), ignoring the fact that unaligned training samples (negative samples) may also have semantic similarity, thus limiting their modeling capabilities. In this embodiment of the invention, by obtaining the clustering results of multimodal data samples, the multimodal data is classified. Then, based on the clustering results, the similarity between multimodal data is obtained, which can then be used as pseudo-labels for the multimodal data. These pseudo-labels are then used to better establish semantic connections between unaligned samples (negative samples). This allows the similarity of multimodal data to be used as supervisory information for training the cross-modal data representation model. Training the cross-modal data representation model based on the multimodal data and supervisory information yields better training results, enabling the model to learn from the supervisory information and output more accurate multimodal data representations, thus allowing for more accurate alignment of multimodal data.
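The text does not spell out how the similarity is computed from the clustering results, so the following is only one plausible sketch. It assumes the clustering step returns, for each sample, a probability vector over clusters, and it takes the inner product of two such vectors as the pseudo-label similarity. With hard (one-hot) assignments this reduces to 1 for samples in the same cluster and 0 otherwise. The function name and the example probabilities are hypothetical.

```python
def pseudo_label_similarity(assignments):
    """Pairwise similarity from soft cluster assignments.

    assignments[i][k] is the probability that sample i belongs to cluster k.
    The returned s[i][j] is the inner product of the two assignment vectors,
    i.e. the chance that i and j fall in the same cluster when their cluster
    memberships are drawn independently.
    """
    n = len(assignments)
    return [
        [sum(p * q for p, q in zip(assignments[i], assignments[j]))
         for j in range(n)]
        for i in range(n)
    ]

# Three samples, two clusters (made-up probabilities).
probs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
s = pseudo_label_similarity(probs)
```

Here samples 0 and 1 receive a high similarity even if they were never paired in the training data, which is the kind of false negative pair the pseudo-labels are meant to recover.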

[0058] For example, the prior-art cross-modal data alignment process is shown in Figure 2. Existing technologies only establish semantic connections between aligned sample pairs (positive pairs), ignoring the fact that semantically similar sample pairs (false negative pairs) also exist among the unaligned sample pairs (negative pairs).

[0059] The cross-modal data alignment method of this application is shown in Figure 3. This invention not only utilizes aligned training sample pairs (positive pairs) to establish cross-modal semantic connections, but also obtains the similarity between multimodal data based on the clustering results of multimodal data. This similarity is then used as pseudo-labels and supervision information for training the cross-modal data representation model. The pseudo-labels are used to better establish semantic connections between misaligned samples (negative samples), thus fully exploring the semantic similarity between misaligned training samples (negative samples). In other words, this embodiment of the invention establishes not only semantic connections between aligned sample pairs (positive pairs) but also semantic connections between misaligned sample pairs (negative pairs), reducing the KL distance between misaligned sample pairs (negative pairs). Using these semantic connections as supervision information leads to better training results for the cross-modal data representation model. This allows the model to learn from the supervision information and output more accurate multimodal data representations, thereby enabling more accurate alignment of multimodal data.

[0060] Furthermore, in this embodiment of the invention, for unaligned (negative) cross-modal data, pseudo-labels for the samples are obtained through unsupervised clustering of cross-modal data representations. Based on the similarity between the pseudo-labels, semantic similarity between unaligned samples is established. Then, the similarity relationship between aligned samples (positive samples) and unaligned samples (negative samples) is used as supervised information for cross-modal contrastive learning, performing cross-modal representation learning and modality difference elimination to learn cross-modal representations that eliminate modality specificity. The learned cross-modal representations can then be used to improve the accuracy of unsupervised clustering. Finally, distance-based multimodal data alignment can be performed using the cross-modal representations.

[0061] The method described above obtains the clustering results of multimodal data samples and, based on these results, acquires the similarity between multimodal data. This similarity can then be used as pseudo-labels for the multimodal data, establishing semantic connections between misaligned samples (negative samples). This allows the cross-modal data representation model to learn from the supervised information of the pseudo-labels and achieve better model training results. Consequently, the trained cross-modal data representation model can output more accurate multimodal data representations, leading to more accurate alignment of multimodal data.

[0062] In one embodiment, obtaining the clustering results of sample multimodal data includes:

[0063] Acquire multimodal data of the samples;

[0064] The multimodal data of the samples are input into the initial cross-modal data representation model to obtain the multimodal data representation of the samples;

[0065] Clustering results of the sample multimodal data are obtained based on the deep reparameterized variational inference network of the Dirichlet process Gaussian mixture model and the multimodal data representation of the sample.

[0066] Specifically, current unsupervised clustering methods either require the number of categories to be known in advance or must cluster over all the data at once, making them unsuitable for large-scale datasets with unknown categories. Existing Dirichlet process Gaussian mixture models can only cluster all samples together and do not readily support mini-batch learning or prediction on test samples, so existing methods are difficult to apply to clustering large-scale multimodal datasets.

[0067] This invention proposes a novel deep reparameterized variational inference network based on a Dirichlet process Gaussian mixture model to achieve unsupervised clustering of cross-modal data representations and perform classification prediction. Specifically, to accurately obtain the clustering results of multimodal data, this invention uses a deep reparameterized variational inference network based on a Dirichlet process Gaussian mixture model to acquire the clustering results. Based on these clustering results, the similarity of the multimodal data and the supervision information for training the cross-modal data representation model can be determined.

[0068] Optionally, based on the multimodal data representation of the samples and the deep reparameterized variational inference network of the Dirichlet process Gaussian mixture model, the clustering results of the multimodal data of the samples are obtained, including:

[0069] Specifically, embodiments of the present invention obtain clustering results for the sample multimodal data by inputting the sample multimodal data representation into the deep reparameterized variational inference network of the Dirichlet process Gaussian mixture model and optimizing the distance between the reparameterized variational probability distribution and the probability distribution of the Dirichlet process Gaussian mixture model. The variables of the Dirichlet process Gaussian mixture model are parameterized, and a reparameterization method is designed to enhance the robustness and diversity of the clusters. The multivariate Gaussian distribution of each category is obtained by minimizing the KL distance between the variational probability distribution and the probability distribution of the Dirichlet process Gaussian mixture model, after which category prediction is performed on the samples to cluster the multimodal data.
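The category prediction step can be sketched as follows, assuming the network has already produced mixture weights and one Gaussian per category. For simplicity the sketch uses diagonal covariances and assigns each representation to the component with the highest weighted log density. The names `weights`, `means`, `variances` and the numbers are hypothetical, and the actual network may predict categories differently.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at point x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (a - m) ** 2 / v)
        for a, m, v in zip(x, mean, var)
    )

def predict_category(x, weights, means, variances):
    """Index k maximizing log(weight_k) + log N(x | mean_k, var_k)."""
    scores = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# Three assumed mixture components in a 2-dimensional representation space.
weights = [0.5, 0.3, 0.2]
means = [[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]]
variances = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]

# A mini-batch of representations can be labeled one sample at a time.
batch = [[0.2, -0.1], [4.8, 5.3], [-4.9, 4.7]]
labels = [predict_category(x, weights, means, variances) for x in batch]
```

Because prediction only needs the learned mixture parameters, unseen test samples and mini-batches can be labeled without re-clustering the full dataset.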

[0070] For example, sample multimodal data is acquired and input into the initial cross-modal data representation model to obtain the sample multimodal data representation $h = \{h_1, \ldots, h_N\}$. The prior hyperparameters of the Dirichlet process Gaussian mixture model are $\theta = \{\alpha, \mu_0, c, a, b\}$, and the parameters of the deep reparameterized variational inference network based on the stick-breaking model are $w = \{v, \eta^*, z\}$. Let $q_\gamma(w)$ be the probability distribution defined by the variational parameter $\gamma$. Optimizing the relative entropy (KL distance) between the variational probability distribution $q_\gamma(w)$ and the posterior distribution $p(w|h,\theta)$ overcomes the difficulty that this posterior cannot be computed directly in the Dirichlet process Gaussian mixture model, as follows:

[0071] $\mathrm{KL}(q_\gamma(w) \,\|\, p(w|h,\theta)) = E_q[\log q_\gamma(w)] - E_q[\log p(w,h|\theta)] + \log p(h|\theta)$;

[0072] Here, $q_\gamma$ is written as $q$ when it appears as a subscript. Since $\log p(h|\theta)$ is constant when $h$ is fixed, minimizing the KL distance is equivalent to minimizing the loss:

[0073]
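The identity in [0071] can be checked numerically on a toy discrete model. The sketch below assumes a latent variable w with three values and a fixed observation h; the probabilities are made up. The variable `loss` is simply the first two terms of the identity, which is what remains to be minimized once the constant log p(h) is dropped; the exact form of the patent's loss is assumed here, not quoted.

```python
import math

# Joint probabilities p(w, h) for w = 0, 1, 2 at the observed h (made up).
joint = [0.10, 0.25, 0.05]
# An arbitrary variational distribution q(w).
q = [0.2, 0.5, 0.3]

evidence = sum(joint)                        # p(h)
posterior = [p / evidence for p in joint]    # p(w | h)

# Left-hand side: KL(q || p(w | h)).
kl = sum(qw * math.log(qw / pw) for qw, pw in zip(q, posterior))

# First two terms of the right-hand side: E_q[log q] - E_q[log p(w, h)].
loss = sum(qw * math.log(qw) for qw in q) - sum(
    qw * math.log(pw) for qw, pw in zip(q, joint)
)

# The two sides differ only by the constant log p(h).
gap = kl - (loss + math.log(evidence))
```

Since `loss` can be evaluated without knowing p(h), it can be optimized even when the posterior itself is intractable.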

[0074] Using a truncation technique, the Dirichlet process and T multivariate Gaussian distributions are parameterized. To enhance the diversity and robustness of the Gaussian mixture components, an additional parameter is introduced, and the reparameterization technique is used to sample $\mu_i$:

[0075]
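As a rough illustration of the reparameterization technique named in [0074], the sketch below draws a component mean as a deterministic function of learnable parameters plus standard normal noise, which is the standard Gaussian reparameterization. It does not model the additional parameter mentioned in the text, and the names `mean` and `log_std` are hypothetical.

```python
import math
import random

def sample_mu(mean, log_std, rng):
    """Draw mu = mean + exp(log_std) * eps with eps ~ N(0, 1) per dimension.

    The randomness lives entirely in eps, so mu stays a differentiable
    function of `mean` and `log_std` in an autodiff framework.
    """
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mean, log_std)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# With unit standard deviation the draws scatter around the mean.
draws = [sample_mu([3.0], [0.0], rng)[0] for _ in range(20000)]
avg = sum(draws) / len(draws)

# With a very small standard deviation the draw collapses onto the mean.
tight = sample_mu([1.0, -2.0], [-50.0, -50.0], random.Random(1))
```

Sampling this way injects noise into the component means during training, which is one way a model could encourage more diverse and robust mixture components.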

[0076] The learnable parameters of the deep reparameterized variational inference network for the Dirichlet process Gaussian mixture model are:

[0077] where,

[0078]

[0079]

[0080]
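The truncated Dirichlet process referred to in [0074] is commonly written with the stick-breaking construction, in which each mixture weight takes a fraction of whatever probability mass is left. The sketch below shows that standard construction truncated at T components; it is an assumption that the network uses exactly this form, and the function name and fractions are hypothetical.

```python
def stick_breaking_weights(v):
    """Map stick fractions v_1..v_T (each in [0, 1]) to mixture weights.

    pi_k = v_k * prod_{j<k} (1 - v_j). Fixing the last fraction at 1
    truncates the process so that the T weights sum to 1.
    """
    weights, remaining = [], 1.0
    for vk in v:
        weights.append(vk * remaining)   # take a fraction of what is left
        remaining *= 1.0 - vk            # shrink the remaining stick
    return weights

# T = 4 components; the last fraction is fixed at 1 to truncate.
fractions = [0.5, 0.5, 0.5, 1.0]
pi = stick_breaking_weights(fractions)
```

Components whose weights shrink toward zero are effectively unused, which is how such a model can adapt the number of clusters without fixing it in advance.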

[0081] Existing Dirichlet process Gaussian mixture models can only cluster all samples together and do not readily support mini-batch learning or prediction on test samples, which limits their applicability to large-scale multimodal datasets. Furthermore, existing unsupervised clustering methods either require the number of categories to be known in advance or must cluster over all the data at once, making them unsuitable for large-scale datasets with unknown categories. This invention addresses this by using a deep reparameterized variational inference network based on a Dirichlet process Gaussian mixture model to achieve unsupervised clustering of cross-modal data representations with unknown categories, and to perform classification prediction on cross-modal data, thus achieving accurate clustering of multimodal data.

[0082] The method described in the above embodiments performs unsupervised clustering of cross-modal data using a deep reparameterized variational inference network of a Dirichlet process Gaussian mixture model. Specifically, it sidesteps the difficulty that the posterior distribution of the Dirichlet process Gaussian mixture model cannot be computed directly by optimizing the distance between the reparameterized variational probability distribution and the probability distribution of the Dirichlet process Gaussian mixture model. By minimizing the KL distance between the variational probability distribution and the probability distribution of the Dirichlet process Gaussian mixture model, it obtains the multivariate Gaussian distribution of each category of the multimodal data and performs accurate category prediction on the sample multimodal data. This results in more accurate clustering results for the cross-modal data, which in turn allows for accurate determination of the similarity of the multimodal data and the supervision information for training the cross-modal data representation model. Consequently, better model training performance and more accurate multimodal data representations can be obtained based on the supervision information.

[0083] In one embodiment, training the initial cross-modal data representation model based on the sample multimodal data and the similarity between the sample multimodal data to obtain the trained cross-modal data representation model includes:

[0084] The sample multimodal data is input into the initial cross-modal data representation model to obtain the sample multimodal data representation output by the initial cross-modal data representation model.

[0085] The initial cross-modal data representation model is trained based on the similarity between the multimodal data representations of the samples and the similarity between the multimodal data of the samples, resulting in the trained cross-modal data representation model.

[0086] Specifically, in this embodiment of the invention, the sample multimodal data is input into the initial cross-modal data representation model to obtain the sample multimodal data representation output by that model. The initial model is then trained based on the similarity between the sample multimodal data representations and the similarity between the sample multimodal data, yielding the trained cross-modal data representation model. Optionally, the similarity of the multimodal data is used as supervision information for training the cross-modal data representation model. If the similarity in the supervision information is inconsistent with the similarity between the multimodal data representations output by the model, the model is iteratively trained based on the supervision information. If the two are consistent, the model can accurately output multimodal data representations. In addition, in this embodiment, the similarity between multimodal data is obtained from the clustering results of the multimodal data and is used as pseudo-labels and supervision information for training the cross-modal data representation model. The pseudo-labels help establish semantic connections between misaligned samples (negative samples), so that the semantic similarity between misaligned training samples is fully explored. Training the model on the multimodal data and this supervision information therefore yields better training results: the model learns from the supervision information and outputs more accurate multimodal data representations, which in turn enables more accurate alignment of multimodal data.

[0087] The method described in the above embodiments trains an initial cross-modal data representation model based on the similarity between sample multimodal data representations and the similarity between sample multimodal data, resulting in a trained cross-modal data representation model. The similarity between sample multimodal data is used as pseudo-labels and supervision information for training the cross-modal data representation model. This allows for better establishment of semantic connections between misaligned samples (negative samples) using pseudo-labels, fully exploring the semantic similarity between misaligned training samples (negative samples). This enables the cross-modal data representation model to learn from the supervision information and achieve better model training results, thereby improving the accuracy of the multimodal data representations output by the cross-modal data representation model.

[0088] In one embodiment, the cross-modal data representation model is trained using a preset loss function, which is:

[0089]

[0090] In the formula, $N$ represents the number of multimodal samples; $d_{ij}$ represents the similarity between the multimodal data representation of the i-th sample and the multimodal data representation of the j-th sample; $\lambda$ represents the weight; and $m$ represents the margin parameter. The remaining symbols represent the loss function, the similarity between the multimodal data of the i-th sample and the multimodal data of the j-th sample, and the normalization parameter.

[0091] Specifically, in order to make the cross-modal data representation output by the cross-modal data representation model more accurate, this embodiment of the invention evaluates the output result of the cross-modal data representation model using the aforementioned loss function. The details are as follows:

[0092] In this embodiment of the invention, the similarity of multimodal data is constructed based on the clustering results of multimodal data. Optionally, the clustering results of multimodal data are used as pseudo-labels for the multimodal data. The cluster similarity matrix is as follows:

[0093] $S = Z_1 Z_1^{\top} + Z_2 Z_2^{\top}$

[0094] Optionally, $Z_1$ and $Z_2$ represent the pseudo-label matrices for the images and the texts, respectively. Next, the similarity matrix is normalized:

[0095]

[0096] Where D is the diagonal matrix of S.
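As an illustration of the construction in [0093], the following minimal sketch builds the cluster similarity matrix from pseudo-labels. It assumes hard (one-hot) pseudo-label matrices; the source does not state whether the pseudo-labels are hard or soft. The function names `one_hot` and `cluster_similarity` and the example labels are hypothetical, and the normalization step is not shown.

```python
import numpy as np

def one_hot(labels, num_clusters):
    """Turn integer cluster ids into a one-hot pseudo-label matrix (N x K)."""
    labels = np.asarray(labels)
    z = np.zeros((labels.shape[0], num_clusters))
    z[np.arange(labels.shape[0]), labels] = 1.0
    return z

def cluster_similarity(image_labels, text_labels, num_clusters):
    """Compute S = Z1 Z1^T + Z2 Z2^T from image and text pseudo-labels."""
    z1 = one_hot(image_labels, num_clusters)  # assumption: hard pseudo-labels
    z2 = one_hot(text_labels, num_clusters)
    return z1 @ z1.T + z2 @ z2.T

# Hypothetical pseudo-labels for four image-text samples and three clusters.
S = cluster_similarity([0, 0, 1, 2], [0, 1, 1, 2], num_clusters=3)
```

With hard pseudo-labels, each entry of S counts the modalities (0, 1, or 2) in which samples i and j fall into the same cluster, so larger entries mark sample pairs that the clustering regards as semantically closer.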

[0097] Next, the cluster-guided cross-modal contrastive loss is calculated:

[0098]

[0099]

[0100]

[0101] In these formulas, $N$ represents the number of multimodal samples; $d_{ij}$ represents the similarity between the multimodal data representation of the i-th sample and the multimodal data representation of the j-th sample; $\lambda$ represents the weight; and $m$ represents the margin parameter. The remaining symbols represent the loss function, the similarity between the multimodal data of the i-th sample and the multimodal data of the j-th sample, the normalization parameter, and the cross-modal representations of the i-th image and the j-th text. In this embodiment of the invention, the loss terms are optimized simultaneously, which better eliminates modal differences and yields cross-modal representations that preserve semantic structure. Optionally, if the similarity $d_{ij}$ between the multimodal data representation of the i-th sample and the multimodal data representation of the j-th sample is consistent with the similarity between the multimodal data of the i-th sample and the multimodal data of the j-th sample, that is, the value of the loss function is below a preset threshold, this indicates that the cross-modal data representation model can accurately output multimodal data representations. If the two similarities are inconsistent, that is, the value of the loss function exceeds the preset threshold, the cross-modal data representation model is iteratively trained based on the supervision information, so that it can learn from the supervision information and output more accurate cross-modal data representations.

[0102] The method described above evaluates the output of the cross-modal data representation model using a loss function, and then optimizes and adjusts the cross-modal data representation model based on the evaluation results, thereby obtaining better model training results. This allows the cross-modal data representation model obtained based on the loss function to output a more accurate representation of the cross-modal data.
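To make the roles of the weight and the margin parameter concrete, the following is a hedged sketch of a generic margin-based contrastive loss with soft pseudo-labels. It is not the loss function of this embodiment, whose formula is not reproduced here. The sketch assumes that $d_{ij}$ is treated as a Euclidean distance between the representation of the i-th image and that of the j-th text, and that the pseudo-similarities lie in [0, 1]; the name `margin_contrastive_loss` and the averaging over all pairs are illustrative assumptions.

```python
import numpy as np

def margin_contrastive_loss(img_repr, txt_repr, sim, weight=1.0, margin=1.0):
    """Generic margin-based contrastive loss (illustrative, not the patent's formula).

    sim[i, j] in [0, 1] is the pseudo-similarity between sample i and sample j.
    Similar pairs are pulled together; dissimilar pairs are pushed beyond `margin`.
    """
    img_repr = np.asarray(img_repr, dtype=float)
    txt_repr = np.asarray(txt_repr, dtype=float)
    sim = np.asarray(sim, dtype=float)
    # d[i, j]: Euclidean distance between image i and text j (assumption).
    diff = img_repr[:, None, :] - txt_repr[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    pull = sim * d ** 2
    push = weight * (1.0 - sim) * np.maximum(margin - d, 0.0) ** 2
    return float((pull + push).mean())

# Two well-separated, correctly paired samples give zero loss.
points = [[0.0, 0.0], [3.0, 0.0]]
loss_aligned = margin_contrastive_loss(points, points, np.eye(2))
# Swapping the texts makes the pairing inconsistent with the pseudo-labels.
loss_swapped = margin_contrastive_loss(points, points[::-1], np.eye(2))
```

In this sketch the weight scales the penalty on dissimilar pairs, and the margin sets how far apart such pairs must be before they stop contributing to the loss.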

[0103] As an example, the flow of a cross-modal data alignment method is as follows:

[0104] Step S1: Obtain the representation of the data through a pre-trained deep learning network.

[0105] The image and text are mapped to the feature space of the image and text through feature extraction functions, resulting in the feature representation of the image and the feature representation of the text.

[0106] Step S2: Obtain the common representation of cross-modal data through the cross-modal common representation learning network.

[0107] The image and text feature representations obtained in step S1 are respectively input into the common representation learning network of image and text (initial cross-modal data representation model), which can map the image and text features into a common semantic representation space to obtain the common representation of cross-modal data (sample multimodal data representation).

[0108] Step S3: Perform unsupervised clustering on the common representations of the image and text obtained in step S2 (sample multimodal data representations) using deep reparameterized variational inference for the Dirichlet process Gaussian mixture model.

[0109] By inputting the common representations of images and text (sample multimodal data representations) into a deep reparameterized variational inference network of a Dirichlet process Gaussian mixture model, and optimizing the distance between the reparameterized variational probability distribution and the probability distribution of the Dirichlet process Gaussian mixture model, unsupervised clustering of the common representations is performed to obtain the clustering results of the sample multimodal data. In other words, the deep reparameterized variational inference network of the Dirichlet process Gaussian mixture model performs automatic unsupervised clustering on the representations of unlabeled data, yielding the clustering results of the unlabeled data.
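The structure of the inference network is not detailed in this embodiment. As background only, the Dirichlet process prior over mixture weights is commonly written with a truncated stick-breaking construction, which is what allows the number of effective clusters to be inferred rather than fixed in advance. The sketch below shows that construction; the function names, the concentration value `alpha`, and the truncation level are illustrative assumptions, and no variational inference is performed.

```python
import numpy as np

def stick_breaking_weights(fractions):
    """Mixture weights pi_k = v_k * prod_{l<k} (1 - v_l) for a truncated Dirichlet process."""
    v = np.asarray(fractions, dtype=float)
    # Length of stick remaining before each break.
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

def sample_weights(alpha, truncation, rng):
    """Draw one set of truncated stick-breaking weights with v_k ~ Beta(1, alpha)."""
    v = rng.beta(1.0, alpha, size=truncation)
    v[-1] = 1.0  # close the stick so the truncated weights sum to one
    return stick_breaking_weights(v)

rng = np.random.default_rng(0)
weights = sample_weights(alpha=1.0, truncation=10, rng=rng)
```

A small `alpha` concentrates the weight on a few components, while a larger `alpha` spreads it over more components.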

[0110] Step S4: Using the unsupervised clustering pseudo-labels of the cross-modal common representation (sample multimodal data representation) obtained in step S3, construct the similarity relationship of cross-modal data;

[0111] In other words, it obtains the similarity between sample multimodal data based on the clustering results of sample multimodal data.

[0112] Step S5: Using the similarity relationship of the cross-modal data obtained in step S4 as pseudo-supervisory information for contrastive learning, calculate the cross-modal contrastive loss for training, and obtain the trained cross-modal data representation model.

[0113] In other words, the initial cross-modal data representation model is trained based on the similarity between the sample multimodal data representations and the similarity between the sample multimodal data, resulting in a trained cross-modal data representation model that can produce more accurate multimodal data representations. That is, the clustering results of unlabeled data are used as pseudo-labels to establish the similarity relationships between cross-modal data, and these relationships serve as pseudo-supervisory information for cross-modal contrastive learning, so that the trained cross-modal data representation model is learned more effectively.

[0114] Step S6: Perform minimum distance one-to-one matching based on the distance between cross-modal representations to obtain aligned multimodal data.

[0115] In other words, aligned multimodal data is obtained based on the similarity between multimodal data representations.

[0116] For example, the multimodal data representations output by the trained cross-modal data representation model are used, and one-to-one matching is performed based on the distance between the multimodal data representations. The specific process is as follows:

[0117] Step 1: For an unaligned (negative-sample) image, 2048 unaligned (negative-sample) texts are randomly selected as candidate samples, and the distances between the multimodal data representation of the image and the multimodal data representations of all candidate texts are calculated.

[0118] Step 2: Select the candidate text whose cross-modal representation is closest to that of the image, and match the two to obtain a matched sample pair, thereby achieving alignment of the multimodal data.
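Steps 1 and 2 can be sketched as follows. This is a minimal illustration, assuming Euclidean distance between representations (the distance measure is not specified in this embodiment); the function name `match_unaligned` and the toy data are hypothetical.

```python
import numpy as np

def match_unaligned(img_repr, txt_repr, num_candidates=2048, seed=0):
    """For each image, pick the nearest text among randomly drawn candidate texts."""
    img_repr = np.asarray(img_repr, dtype=float)
    txt_repr = np.asarray(txt_repr, dtype=float)
    rng = np.random.default_rng(seed)
    # Cap the candidate count when fewer texts are available.
    k = min(num_candidates, len(txt_repr))
    matches = []
    for i, x in enumerate(img_repr):
        candidates = rng.choice(len(txt_repr), size=k, replace=False)
        # Assumption: Euclidean distance in the common representation space.
        dists = np.linalg.norm(txt_repr[candidates] - x, axis=1)
        matches.append((i, int(candidates[np.argmin(dists)])))
    return matches

# Toy representations: each image has one clearly closest text.
images = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]
texts = [[10.0, 0.1], [0.1, 0.0], [5.0, 5.1]]
pairs = match_unaligned(images, texts)
```

This sketch matches each image independently, so two images could select the same text; enforcing a strict one-to-one matching would require an additional step, such as removing a text from the pool once it has been matched.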

[0119] For example, to evaluate this invention, we used the Scene-15, Caltech-101, Reuters, and NoisyMNIST datasets. The Scene-15 multimodal dataset contains 4485 samples each of two modalities, with 15 categories; the Caltech-101 multimodal dataset contains 9144 samples each of two modalities, with 101 categories; the Reuters multimodal dataset contains 18758 samples each of two modalities, with 6 categories; and the NoisyMNIST multimodal dataset contains 70000 samples each of two modalities, with 10 categories. Since some benchmark methods cannot be applied to large-scale datasets, we randomly selected 30000 samples for our experiments. In each dataset, we randomly fixed half of the data as aligned data and the remainder as unaligned data. We used this invention for partially aligned cross-modal representation learning and modality matching, and then evaluated the clustering results after clustering the matched sample pairs using the K-means algorithm.

[0120] Tables 1 and 2 compare the proposed method with other methods on Scene-15, Caltech-101, Reuters, and NoisyMNIST data. We use three evaluation metrics: ACC, NMI, and ARI.
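Of the three metrics, ACC denotes clustering accuracy, NMI normalized mutual information, and ARI the adjusted Rand index. As an illustration of how one of them is computed, the following sketch implements the standard ARI formula from pair counts; the function name `adjusted_rand_index` is illustrative, and the evaluation code actually used for the experiments is not given in the source.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same samples."""
    n = len(labels_true)
    total = comb(n, 2)
    if total == 0:
        return 1.0  # fewer than two samples: the labelings trivially agree
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    # Number of sample pairs grouped together in both, in truth, and in prediction.
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())
    expected = sum_true * sum_pred / total
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_pairs - expected) / (max_index - expected)

same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index is invariant to how clusters are numbered, which is why it suits clustering results whose label ids carry no meaning.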

[0121] Table 1

[0122]

[0123]

[0124] Table 2

[0125]

[0126]

[0127] Among them, CCA learns representations by maximizing the correlation between aligned sample pairs. KCCA extends CCA by using kernel functions to compute the relationships between sample pairs. DCCA extends CCA by employing deep neural networks. DCCAE extends CCA by using deep autoencoders. LMSC enhances the accuracy and robustness of cross-modal representations by reconstructing the original multimodal features from the cross-modal representations. MvC-DMF uses deep matrix factorization and optimal partitioning representations for information fusion of multimodal data. SwMC uses rank-constrained Laplacian graphs to construct semantic centers for multimodal data and performs clustering based on these centers. BMVC uses complementary information from multimodal data to encode the data into a compact binary space. AE²-Nets employs nested autoencoders for representation learning that preserves complete semantic information. PVC uses a differentiable surrogate of the Hungarian algorithm and recasts it as a pluggable module. Vanilla CL is a general unsupervised contrastive learning algorithm that minimizes the representational differences between aligned samples and maximizes the representational differences between misaligned samples. MvCLN improves upon contrastive learning algorithms by proposing a distance-based mechanism for detecting false negative samples.

[0128] As can be seen from Tables 1 and 2, this invention significantly outperforms the other methods on all four datasets.

[0129] The cross-modal data alignment device provided by the present invention will be described below. The cross-modal data alignment device described below can be referred to in correspondence with the cross-modal data alignment method described above.

[0130] Figure 4 is a schematic diagram of the cross-modal data alignment device provided by the present invention. The cross-modal data alignment device provided in this embodiment includes:

[0131] Acquisition module 710 is used to acquire target multimodal data;

[0132] The representation module 720 is used to input the target multimodal data into the trained cross-modal data representation model to obtain the target multimodal data representation in the common semantic representation space. The cross-modal data representation model is trained based on the multimodal data and the similarity between the multimodal data. The cross-modal data representation model is used to represent the multimodal data. The similarity between the multimodal data serves as the supervision information for training the cross-modal data representation model.

[0133] Alignment module 730 is used to obtain aligned multimodal data based on the similarity between the target multimodal data representations.

[0134] Optionally, the cross-modal data representation model is trained in the following manner:

[0135] Obtain the clustering results of the multimodal data of the samples;

[0136] Based on the clustering results of the multimodal data samples, the similarity between the multimodal data samples is obtained;

[0137] Based on the sample multimodal data and the similarity between the sample multimodal data, the initial cross-modal data representation model is trained to obtain the trained cross-modal data representation model.

[0138] Optionally, the representation module 720 is specifically used to: input sample multimodal data into an initial cross-modal data representation model to obtain the sample multimodal data representation output by the initial cross-modal data representation model;

[0139] The initial cross-modal data representation model is trained based on the similarity between the multimodal data representations of the samples and the similarity between the multimodal data of the samples, resulting in the trained cross-modal data representation model.

[0140] Optionally, the representation module 720 is specifically used for: acquiring multimodal data of the sample;

[0141] The multimodal data of the samples are input into the initial cross-modal data representation model to obtain the multimodal data representation of the samples;

[0142] Clustering results of the sample multimodal data are obtained based on the deep reparameterized variational inference network of the Dirichlet process Gaussian mixture model and the multimodal data representation of the sample.

[0143] Optionally, the representation module 720 is specifically used to: input the multimodal data of the samples into a deep reparameterized variational inference network of the Dirichlet process Gaussian mixture model, optimize the distance between the reparameterized variational probability distribution and the probability distribution of the Dirichlet process Gaussian mixture model, and obtain the clustering results of the multimodal data of the samples.

[0144] Optionally, the cross-modal data representation model is trained using a preset loss function, which is:

[0145]

[0146] In the formula, $N$ represents the number of multimodal samples; $d_{ij}$ represents the similarity between the multimodal data representation of the i-th sample and the multimodal data representation of the j-th sample; $\lambda$ represents the weight; and $m$ represents the margin parameter. The remaining symbols represent the loss function, the similarity between the multimodal data of the i-th sample and the multimodal data of the j-th sample, and the normalization parameter.

[0147] The apparatus of this invention is used to execute the method in any of the foregoing method embodiments, and its implementation principle and technical effect are similar, so they will not be described again here.

[0148] Figure 5 is a schematic diagram of the physical structure of an electronic device. This electronic device may include a processor 810, a communications interface 820, a memory 830, and a communication bus 840. The processor 810, communications interface 820, and memory 830 communicate with each other via the communication bus 840. The processor 810 can invoke logical instructions in the memory 830 to execute a cross-modal data alignment method. This method includes: acquiring target multimodal data; inputting the target multimodal data into a trained cross-modal data representation model to obtain a target multimodal data representation in a common semantic representation space; the cross-modal data representation model is trained based on the multimodal data and the similarity between multimodal data; the cross-modal data representation model is used to represent the multimodal data; the similarity between multimodal data serves as supervision information for training the cross-modal data representation model; and aligned multimodal data is obtained based on the similarity between the target multimodal data representations.

[0149] Furthermore, the logical instructions in the aforementioned memory 830 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, essentially, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0150] On the other hand, the present invention also provides a computer program product, the computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions, wherein when the program instructions are executed by a computer, the computer is able to execute the cross-modal data alignment method provided by the above methods, the method comprising: acquiring target multimodal data; inputting the target multimodal data into a trained cross-modal data representation model to obtain a target multimodal data representation in a common semantic representation space; the cross-modal data representation model being trained based on multimodal data and the similarity between multimodal data; the cross-modal data representation model being used to represent multimodal data; the similarity between multimodal data being used as supervision information for training the cross-modal data representation model; and obtaining aligned multimodal data based on the similarity between the target multimodal data representations.

[0151] In another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the aforementioned cross-modal data alignment methods. The method includes: acquiring target multimodal data; inputting the target multimodal data into a trained cross-modal data representation model to obtain a target multimodal data representation in a common semantic representation space; the cross-modal data representation model being trained based on multimodal data and the similarity between multimodal data; the cross-modal data representation model being used to represent multimodal data; the similarity between multimodal data serving as supervision information for training the cross-modal data representation model; and obtaining aligned multimodal data based on the similarity between the target multimodal data representations.

[0152] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0153] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0154] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A cross-modal data alignment method, characterized in that it comprises: acquiring target multimodal data; inputting the target multimodal data into a trained cross-modal data representation model to obtain a target multimodal data representation of the target multimodal data in a common semantic representation space, wherein the cross-modal data representation model is trained based on multimodal data and the similarity between the multimodal data, the cross-modal data representation model is used to represent multimodal data, and the similarity between the multimodal data serves as supervision information for training the cross-modal data representation model; and obtaining aligned multimodal data based on the similarity between the target multimodal data representations; wherein the cross-modal data representation model is trained in the following manner: obtaining clustering results of sample multimodal data; obtaining the similarity between the sample multimodal data based on the clustering results of the sample multimodal data; and training an initial cross-modal data representation model based on the sample multimodal data and the similarity between the sample multimodal data to obtain the trained cross-modal data representation model; wherein training the initial cross-modal data representation model based on the sample multimodal data and the similarity between the sample multimodal data to obtain the trained cross-modal data representation model comprises: inputting the sample multimodal data into the initial cross-modal data representation model to obtain the sample multimodal data representations output by the initial cross-modal data representation model; and training the initial cross-modal data representation model based on the similarity between the sample multimodal data representations and the similarity between the sample multimodal data to obtain the trained cross-modal data representation model; wherein obtaining the clustering results of the sample multimodal data comprises: acquiring the sample multimodal data; inputting the sample multimodal data into the initial cross-modal data representation model to obtain the sample multimodal data representations; and obtaining the clustering results of the sample multimodal data based on the sample multimodal data representations and a deep reparameterized variational inference network of a Dirichlet process Gaussian mixture model; and wherein obtaining the clustering results of the sample multimodal data based on the deep reparameterized variational inference network of the Dirichlet process Gaussian mixture model and the sample multimodal data representations comprises: inputting the sample multimodal data into the deep reparameterized variational inference network of the Dirichlet process Gaussian mixture model, and optimizing the distance between the reparameterized variational probability distribution and the probability distribution of the Dirichlet process Gaussian mixture model to obtain the clustering results of the sample multimodal data.

2. The cross-modal data alignment method according to claim 1, characterized in that the cross-modal data representation model is trained using a preset loss function, which is: ; wherein $N$ represents the number of multimodal samples; $d_{ij}$ represents the similarity between the multimodal data representation of the i-th sample and the multimodal data representation of the j-th sample; $\lambda$ represents the weight; $m$ represents the margin parameter; and the remaining symbols represent the loss function, the similarity between the multimodal data of the i-th sample and the multimodal data of the j-th sample, and the normalization parameter.

3. A cross-modal data alignment device, characterized in that it comprises: an acquisition module, used to acquire target multimodal data; a representation module, used to input the target multimodal data into a trained cross-modal data representation model to obtain a target multimodal data representation of the target multimodal data in a common semantic representation space, wherein the cross-modal data representation model is trained based on multimodal data and the similarity between the multimodal data, the cross-modal data representation model is used to represent multimodal data, and the similarity between the multimodal data serves as supervision information for training the cross-modal data representation model; wherein the cross-modal data representation model is trained in the following manner: obtaining clustering results of sample multimodal data; obtaining the similarity between the sample multimodal data based on the clustering results of the sample multimodal data; and training an initial cross-modal data representation model based on the sample multimodal data and the similarity between the sample multimodal data to obtain the trained cross-modal data representation model; wherein obtaining the clustering results of the sample multimodal data comprises: acquiring the sample multimodal data; inputting the sample multimodal data into the initial cross-modal data representation model to obtain the sample multimodal data representations; and obtaining the clustering results of the sample multimodal data based on the sample multimodal data representations and a deep reparameterized variational inference network of a Dirichlet process Gaussian mixture model; wherein obtaining the clustering results of the sample multimodal data based on the deep reparameterized variational inference network of the Dirichlet process Gaussian mixture model and the sample multimodal data representations comprises: inputting the sample multimodal data into the deep reparameterized variational inference network of the Dirichlet process Gaussian mixture model, and optimizing the distance between the reparameterized variational probability distribution and the probability distribution of the Dirichlet process Gaussian mixture model to obtain the clustering results of the sample multimodal data; wherein training the initial cross-modal data representation model based on the sample multimodal data and the similarity between the sample multimodal data to obtain the trained cross-modal data representation model comprises: inputting the sample multimodal data into the initial cross-modal data representation model to obtain the sample multimodal data representations output by the initial cross-modal data representation model; and training the initial cross-modal data representation model based on the similarity between the sample multimodal data representations and the similarity between the sample multimodal data to obtain the trained cross-modal data representation model; and an alignment module, used to obtain aligned multimodal data based on the similarity between the target multimodal data representations.

4. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the cross-modal data alignment method as described in claim 1 or 2.

5. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when executed by a processor, the computer program implements the cross-modal data alignment method as described in claim 1 or 2.

6. A computer program product having executable instructions stored thereon, characterized in that, when executed by a processor, the instructions cause the processor to implement the cross-modal data alignment method as described in claim 1 or 2.