A multi-modal recommendation method based on multi-stage denoising and user preference alignment

By constructing an MD-UPA model and performing multi-level denoising and user preference alignment, the problems of incomplete modal data and semantic inconsistency in multimodal recommendation systems are solved, achieving efficient and accurate multimodal recommendation.

CN122199094A · Pending · Publication Date: 2026-06-12 · GUANGDONG UNIV OF TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: GUANGDONG UNIV OF TECH
Filing Date: 2026-01-28
Publication Date: 2026-06-12

Figure CN122199094A_ABST

Abstract

The application provides a multimodal recommendation method based on multi-level denoising and user preference alignment, and relates to the technical field of multimodal recommendation. An MD-UPA model is constructed and pre-trained; semantic and behavioral relationships are fused through a complementary product-product graph network to preliminarily optimize the feature representation; noise in each modality is identified and filtered by a unimodal sanitization network, and noise in user feedback is removed by a user feedback denoising mechanism, thereby realizing multi-level denoising; a user preference alignment mechanism aligns the multimodal features with user preferences, and a dual-path knowledge-guided learning mechanism is used for learning, so that accurate recommendations are obtained. The application can accurately capture user preferences, achieves high recommendation accuracy, and can be integrated into existing multimodal recommendation systems with strong compatibility.

Description

Technical Field

[0001] This invention relates to the technical field of multimodal recommendation, and more specifically, to a multimodal recommendation method based on multi-level denoising and user preference alignment.

Background Technology

[0002] Recommender systems (RS) proactively recommend products to users based on their historical behavior and personal preferences, playing a crucial role on various online platforms. Online platforms typically provide rich multimodal content information (such as product images and text descriptions). To address the integration of multimodal data such as images and text, multimodal recommender systems (MRS) have emerged. MRS can typically improve the accuracy and effectiveness of recommendations by performing more comprehensive modeling of product information and user preferences.

[0003] However, while multimodal recommender systems can integrate information from multiple sources such as text and images to improve recommendation performance, they still face challenges in real-world scenarios due to incomplete modal data and semantic inconsistencies. When different modalities are noisy or lack effective alignment, the models underlying multimodal recommender systems struggle to accurately model the correlations between modalities. Furthermore, existing fusion mechanisms often rely on simple concatenation or weighting, which fails to fully capture cross-modal interaction features and leads to unbalanced recommendation results that are easily dominated by a single modality. In addition, these models have high computational costs and insufficient interpretability, which limits their practical deployment and user trust.

Summary of the Invention

[0004] To address the problems of existing technologies failing to accurately capture user preferences and resulting in poor recommendation accuracy, this invention proposes a multimodal recommendation method based on multi-level denoising and user preference alignment to achieve accurate user preference recommendations.

[0005] To achieve the above-mentioned technical effects, the technical solution of the present invention is as follows:

[0006] Firstly, this application proposes a multimodal recommendation method based on multi-level denoising and user preference alignment, comprising the following steps: acquiring a multimodal dataset and preprocessing it to obtain a preprocessed multimodal dataset, and obtaining a multimodal representation of the product based on the preprocessed multimodal dataset; constructing and pre-training a complementary product-product graph network, a backbone network, and a first unimodal sanitization network; obtaining semantic relevance relationships and user behavior relevance relationships based on the preprocessed multimodal dataset, constructing product-product graphs using the complementary product-product graph network, and preliminarily extracting multimodal features based on the product-product graphs; using the backbone network to extract first user features and first product features from the multimodal features, so as to obtain a first user representation and a first product representation; sanitizing the multimodal representation of the product using the first unimodal sanitization network to identify unimodal noise that is irrelevant to user preference prediction, and obtaining a sanitized product unimodal embedding; using the backbone network to extract second user features and second product features based on the multimodal features and the sanitized product unimodal embedding, so as to obtain a second user representation and a second product representation; removing user feedback noise from the first user representation and the first product representation based on a user feedback denoising mechanism; aligning the first user representation and the first product representation, as well as the second user representation and the second product representation, with actual user behavior based on a user preference alignment mechanism; and performing mutual learning based on a dual-path knowledge-guided learning mechanism to output recommendation results.

[0007] Preferably, a multimodal dataset is acquired and preprocessed to obtain a preprocessed multimodal dataset, and a multimodal representation of the product is obtained based on the preprocessed multimodal dataset. The process is as follows: obtain user interaction records, a user set U, and a product set I, where the user set includes user IDs, and the product set includes product IDs, product text information, and product visual information. From the user set U, the product set I, and the user interaction records, construct a user-product interaction matrix. Each entry of the matrix is indexed by a pair (u, i), where u denotes the index of user u in the user set U and i denotes the index of product i in the product set I; the entry is 1 when user u has purchased product i, and 0 otherwise. Based on the product set I, the first text embedding is extracted using a pre-trained sentence transformer, and the first visual embedding is extracted using a pre-trained CNN model; also based on the product set I, the second text embedding is extracted using a pre-trained BERT model, and the second visual embedding is extracted using a ViT model. The first text embedding, the first visual embedding, the second text embedding, and the second visual embedding together serve as the multimodal representation of the product.

[0008] Preferably, the complementary product-product graph network includes a second unimodal sanitization network and a graph convolutional network, and the preliminary multimodal feature extraction process is as follows: based on the multimodal representation of the product, a product-product graph in the text modality and a product-product graph in the visual modality are constructed according to semantic relevance; based on the user-product interaction matrix, a product-product graph is constructed according to user behavior relationships; based on the text-modality and visual-modality product-product graphs, the multimodal representation of the product-product graph is obtained; the second unimodal sanitization network sanitizes the multimodal representation of the product-product graph, resulting in a sanitized product text embedding and a sanitized product visual embedding; based on the text-modality, visual-modality, and behavior-based product-product graphs, the graph convolutional network is used to obtain product text features, product visual features, and user behavior features; meanwhile, based on the text-modality and visual-modality product-product graphs, the sanitized product text embedding and the sanitized product visual embedding are learned to obtain the sanitized product text features and the sanitized product visual features, respectively.

[0009] Preferably, the backbone network is used to extract the first user features and the first product features from the multimodal features to obtain the first user representation and the first product representation. The process is as follows: user features are extracted from the user ID, the first text embedding, the first visual embedding, and the user-product interaction matrix to obtain the first user representation; product features are extracted from the first text embedding, the first visual embedding, and the product ID, and are then combined with the product text features, the product visual features, and the user behavior features to obtain the first product representation, which satisfies the expression: [formula not reproduced]

[0010] The terms of this expression include the feature vector obtained by randomly initializing the product ID and an average pooling operation.

[0011] Preferably, the first unimodal sanitizing network and the second unimodal sanitizing network have the same structure, both including text feature branches and visual feature branches; The text feature branch includes a primary text modality feature extraction unit, a convolutional layer, a bidirectional long short-term memory network, a first pooling layer, a fully connected layer, a filter, and a second pooling layer; the visual feature branch includes a primary visual modality feature extraction unit, a convolutional layer, a bidirectional long short-term memory network, a first pooling layer, a fully connected layer, a filter, and a second pooling layer.

[0012] Preferably, the primary text modality feature extraction unit extracts the product text modality features from the second text embedding and projects them to obtain the primary text modality features. The primary visual modality feature extraction unit extracts the product visual modality features from the second visual embedding to obtain the primary visual modality features. The primary text modality features and the primary visual modality features serve as the primary unimodal features. The primary unimodal features are processed by the convolutional layer and the bidirectional long short-term memory network, respectively, to obtain the hidden sequence representation, which satisfies the expression: [formula not reproduced]

[0013] In this expression, T denotes the sequence length, another symbol denotes the hidden size, and m denotes the modality. The first pooling layer applies average pooling to the hidden sequence representation to obtain global information. The fully connected layer processes the global information to obtain fully connected information, which satisfies the expression: [formula not reproduced]

[0014] The time-noise-aware filter evaluates the importance of each token in the processed global information and merges the important tokens into a combined representation. The second pooling layer pools the combined representation to obtain the sanitized product unimodal embedding, which includes the sanitized product text embedding and the sanitized product visual embedding. The above process is optimized with the unimodal network loss function [formula not reproduced], in which three symbols are hyperparameters of the time-noise-aware filter and one symbol denotes the sigmoid activation function.

[0015] Preferably, the backbone network is used to extract the second user features and the second product features to obtain the second user representation and the second product representation. The process is as follows: user features are extracted from the user ID, the sanitized product unimodal embedding, and the user-product interaction matrix to obtain the second user representation; product features are extracted from the product ID and the sanitized product unimodal embedding, and are then combined with the sanitized product text features, the sanitized product visual features, and the user behavior features to obtain the second product representation, which satisfies the expression: [formula not reproduced]

[0016] The terms of this expression include the feature vector obtained by randomly initializing the product ID and an average pooling operation.

[0017] Preferably, user feedback noise is removed from the first user representation and the first product representation based on the user feedback denoising mechanism. The process is as follows: multiple triples (u, i, j) are extracted from the first user representation and the first product representation, where j denotes the index of product j in the product set I, so that user behavior toward products is represented as a ranked list of pairwise relationships. The order in the ranked list is determined by two credibility values, which represent the credibility of the behavior of user u toward product i and toward product j, respectively. All triples (u, i, j) are denoised with the user feedback denoising loss function, which has the expression: [formula not reproduced]

[0018] The terms of this expression include an activation function, the probability of reliability, the probability of noise, and the set of triples.
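Because the denoising loss itself is not reproduced in this text, the following is only a generic stand-in and not the patent's formula: a BPR-style pairwise ranking loss in which each triple (u, i, j) is down-weighted by a reliability probability. The function name, the dot-product scoring, and the way reliability enters the loss are all assumptions made for illustration.

```python
import numpy as np

def reliability_weighted_pairwise_loss(user_vecs, pos_vecs, neg_vecs, reliability):
    """Assumed sketch: mean of reliability * -log(sigmoid(score_ui - score_uj))."""
    pos_scores = np.sum(user_vecs * pos_vecs, axis=1)   # score of (u, i)
    neg_scores = np.sum(user_vecs * neg_vecs, axis=1)   # score of (u, j)
    # -log(sigmoid(x)) written stably as log(1 + exp(-x)).
    per_triple = np.logaddexp(0.0, -(pos_scores - neg_scores))
    return float(np.mean(reliability * per_triple))

# Hypothetical toy triples: two users, each with one preferred and one other product.
user_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
pos_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
neg_vecs = np.array([[0.0, 1.0], [1.0, 0.0]])
reliability = np.array([1.0, 0.5])   # second triple is treated as half as reliable

loss = reliability_weighted_pairwise_loss(user_vecs, pos_vecs, neg_vecs, reliability)
print(loss)
```

In this sketch a triple with reliability 0 contributes nothing to the loss, which is one simple way to keep suspected noisy feedback from driving the ranking objective.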

[0019] Preferably, based on the user preference alignment mechanism, the first user representation and the first product representation, as well as the second user representation and the second product representation, are aligned with actual user behavior, respectively. The process is as follows: by modeling potential interactions, the distribution P of user preferences over products is calculated, satisfying the expression: [formula not reproduced]

[0020] The terms of this expression include the product set, the feature vector obtained by randomly initializing the product ID, the first user representation, and the transpose operation. The first multimodal preference distribution is then calculated, satisfying the expression: [formula not reproduced]

[0021] This expression is defined in terms of the first product representation. The second multimodal preference distribution is calculated, satisfying the expression: [formula not reproduced]

[0022] This expression is defined in terms of the second product representation and the second user representation. The Kullback-Leibler divergence between the user preference distribution P and each of the first and second multimodal preference distributions is minimized to obtain the first multimodal alignment loss and the second multimodal alignment loss, satisfying the expression: [formula not reproduced]

[0023] Here the divergence operator denotes Kullback-Leibler divergence alignment.
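The alignment step can be illustrated with a minimal NumPy sketch. It assumes that each preference distribution is a softmax over dot-product scores between one user vector and the product vectors, and that the loss is KL(P || Q) with the behavior-side distribution P as the reference; the source text does not preserve the exact formulas or the direction of the divergence, so both are assumptions, as are all variable names.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    shifted = scores - scores.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same products."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# Hypothetical toy vectors for one user and five products.
rng = np.random.default_rng(0)
user_repr = rng.normal(size=4)
id_embeddings = rng.normal(size=(5, 4))   # randomly initialized product-ID vectors
product_repr = rng.normal(size=(5, 4))    # multimodal product representations

behavior_dist = softmax(id_embeddings @ user_repr)     # stands in for P
multimodal_dist = softmax(product_repr @ user_repr)    # stands in for a multimodal distribution
alignment_loss = kl_divergence(behavior_dist, multimodal_dist)
print(alignment_loss)
```

The loss is zero when the two distributions coincide and grows as the multimodal scores rank products differently from the behavior-side scores.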

[0024] Preferably, mutual learning is carried out based on the dual-path knowledge-guided learning mechanism. The process is as follows: the first total loss and the second total loss are used for alternating bidirectional guided learning, yielding a bidirectional guidance loss function that satisfies the expression: [formula not reproduced]

[0025] The bidirectional guidance loss uses the L2 norm. The first total loss and the second total loss have the expressions: [formula not reproduced]

[0026] These expressions contain hyperparameters, the loss of the backbone network for extracting the second user features and the second product features, and the loss of the backbone network for extracting the first user features and the first product features. The total loss function is obtained from the bidirectional guidance loss function and the unimodal network loss function, with the expression: [formula not reproduced], which contains a further hyperparameter. Mutual learning and training are carried out based on the total loss function.

[0027] Compared with the prior art, the beneficial effects of the technical solution of the present invention are as follows. This invention proposes a multimodal recommendation method based on multi-level denoising and user preference alignment. It constructs and pre-trains an MD-UPA model, and preliminarily optimizes the feature representation by fusing semantic and behavioral relationships through a complementary product-product graph network. A unimodal sanitization network is used to identify and filter noise in each modality, and a user feedback denoising mechanism is employed to remove noise from user feedback, achieving multi-level denoising. Finally, a user preference alignment mechanism is used to align multimodal features with user preferences, and a dual-path knowledge-guided learning mechanism is used for learning, thereby obtaining accurate recommendations. This invention can accurately capture user preferences, achieves high recommendation accuracy, and can be integrated into existing multimodal recommendation systems with strong compatibility.

Attached Figure Description

[0028] Figure 1 is a flowchart of the multimodal recommendation method based on multi-level denoising and user preference alignment proposed in this embodiment of the invention. Figure 2 is a schematic diagram of the structure of the MD-UPA model proposed in this embodiment of the invention. Figure 3 is a schematic diagram of the structure of the first and second unimodal sanitization networks proposed in this embodiment of the invention.

Detailed Implementation

[0029] The accompanying drawings are for illustrative purposes only and should not be construed as limiting the scope of this patent. To better illustrate this embodiment, some parts of the accompanying drawings may be omitted, enlarged, or reduced, and do not represent the actual dimensions. Those skilled in the art will understand that some well-known details may be omitted from the accompanying drawings.

[0030] The technical solution of the present invention is further described below with reference to the accompanying drawings and embodiments. The positional relationships depicted in the accompanying drawings are for illustrative purposes only and should not be construed as limiting the scope of this patent.

[0031] Example 1. This embodiment provides a multimodal recommendation method based on multi-level denoising and user preference alignment. The flowchart of this method is shown in Figure 1, and the method includes the following steps:
S1: Obtain a multimodal dataset and preprocess it to obtain a preprocessed multimodal dataset; based on the preprocessed multimodal dataset, obtain a multimodal representation of the product;
S2: Construct an MD-UPA model and pre-train it, where the MD-UPA model includes a complementary product-product graph network, a backbone network, and a first unimodal sanitization network;
S3: Based on the preprocessed multimodal dataset, obtain semantic relevance relationships and user behavior relevance relationships, use the complementary product-product graph network to construct product-product graphs, and preliminarily extract multimodal features based on the product-product graphs;
S4: Use the backbone network to extract the first user features and the first product features from the multimodal features, thereby obtaining the first user representation and the first product representation;
S5: Use the first unimodal sanitization network to sanitize the multimodal representation of the product, identify unimodal noise that is irrelevant to user preference prediction, and obtain the sanitized product unimodal embedding;
S6: Based on the multimodal features and the sanitized product unimodal embedding, use the backbone network to extract the second user features and the second product features to obtain the second user representation and the second product representation;
S7: Based on the user feedback denoising mechanism, remove user feedback noise from the first user representation and the first product representation;
S8: Based on the user preference alignment mechanism, align the first user representation and the first product representation, as well as the second user representation and the second product representation, with actual user behavior; then, based on the dual-path knowledge-guided learning mechanism, perform mutual learning and output recommendation results.

[0032] In step S1, a multimodal dataset is acquired, and the multimodal dataset is preprocessed to obtain a preprocessed multimodal dataset. The preprocessed multimodal dataset includes a user set, a product set, and a user-product interaction matrix. Based on the preprocessed multimodal dataset, a multimodal representation of the product is obtained.

[0033] In step S2, the MD-UPA model is constructed and pre-trained. The structure of the MD-UPA model is shown in Figure 2. The model comprises a complementary product-product graph network 1, a backbone network 2, a first unimodal sanitization network 3, a user feedback denoising mechanism process box 4, a user preference alignment mechanism process box 5, and a dual-path knowledge-guided learning mechanism process box 6. The complementary product-product graph network 1 is connected to both the input of the MD-UPA model and the backbone network 2. The backbone network 2 is the main network portion of an existing multimodal recommender system (MRS) and includes a first branch and a second branch. The backbone network 2 is connected to the input of the MD-UPA model through the first branch, and to the first unimodal sanitization network 3 through the second branch; the first unimodal sanitization network 3 is in turn connected to the input of the MD-UPA model.

[0034] In step S3, semantic relevance is obtained based on the product set, and user behavior relevance is obtained based on the user-product interaction matrix. Complementary product-product graph network 1 is used to construct product-product graphs, and multimodal features are initially extracted based on these graphs. Specifically, the semantic relevance is obtained through content feature vectors (such as text descriptions and attribute tags) in the product multimodal data, reflecting the ontological similarity between products; the behavior relevance is obtained through implicit user feedback vectors (such as embedded representations of clicks and purchase sequences), reflecting the implicit association patterns of consumer behavior.

[0035] In step S4, the backbone network 2 is used to extract the first user feature and the first product feature from the multimodal features to obtain the first user representation and the first product representation. In steps S5 and S6, the first unimodal sanitizing network 3 is used to sanitize the product multimodal representation, identifying unimodal noise that is irrelevant to user preference prediction, to obtain the sanitized product unimodal embedding; based on the multimodal features and the product unimodal embedding, the backbone network 2 is used to extract the second user feature and the second product feature to obtain the second user representation and the second product representation.

[0036] In step S7, user feedback noise in the first user representation and the first product representation is removed based on the user feedback denoising mechanism.

[0037] In step S8, based on the user preference alignment mechanism, the first user representation and the first product representation, as well as the second user representation and the second product representation, are aligned with actual user behavior, respectively. Furthermore, based on a dual-path knowledge-guided learning mechanism, they learn from each other to output the product that the user is most likely to purchase.

[0038] Example 2. In this embodiment, a multimodal dataset is acquired and preprocessed to obtain a preprocessed multimodal dataset, and a multimodal representation of the product is obtained based on the preprocessed multimodal dataset. The process is as follows: obtain user interaction records, a user set U, and a product set I, where the user set includes user IDs, and the product set includes product IDs, product text information, and product visual information. From the user set U, the product set I, and the user interaction records, construct a user-product interaction matrix. Each entry of the matrix is indexed by a pair (u, i), where u denotes the index of user u in the user set U and i denotes the index of product i in the product set I; the entry is 1 when user u has purchased product i, and 0 otherwise. Based on the product set I, the first text embedding is extracted using a pre-trained sentence transformer, and the first visual embedding is extracted using a pre-trained CNN model; also based on the product set I, the second text embedding is extracted using a pre-trained BERT model, and the second visual embedding is extracted using a ViT model. The first text embedding, the first visual embedding, the second text embedding, and the second visual embedding together serve as the multimodal representation of the product.
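The construction of the user-product interaction matrix described above can be sketched as follows. The record format (a list of user ID and product ID pairs), the function name, and the toy data are assumptions for illustration; the source only specifies that an entry is 1 when the user purchased the product and 0 otherwise.

```python
import numpy as np

def build_interaction_matrix(records, users, products):
    """Return a |U| x |I| binary matrix: 1 where the user purchased the product."""
    user_index = {u: k for k, u in enumerate(users)}
    product_index = {i: k for k, i in enumerate(products)}
    matrix = np.zeros((len(users), len(products)), dtype=np.int8)
    for user_id, product_id in records:
        # Repeated purchases still yield 1, since the matrix is binary.
        matrix[user_index[user_id], product_index[product_id]] = 1
    return matrix

# Hypothetical toy data, not taken from the patent.
users = ["u1", "u2", "u3"]
products = ["p1", "p2", "p3", "p4"]
records = [("u1", "p1"), ("u1", "p2"), ("u2", "p2"), ("u3", "p4"), ("u1", "p1")]

interaction = build_interaction_matrix(records, users, products)
print(interaction)
```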

[0039] Specifically, based on the product set I, a 384-dimensional first text embedding is extracted using a pre-trained sentence transformer; the text is also processed and truncated using a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model, resulting in a second text embedding of dimension (64, 768). Both the first and second text embeddings contain product descriptions, product categories, and brand information. Also based on the product set I, a 4096-dimensional first visual embedding is extracted using a pre-trained CNN model, and a second visual embedding of dimension (17, 768) is obtained using a pre-trained ViT (Vision Transformer) model with the patch (block) size set to 64.
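The (17, 768) shape of the second visual embedding is consistent with 16 image patches plus one class token, each with hidden size 768. The short check below makes that arithmetic explicit; the 256 x 256 input resolution and the presence of a class token are assumptions, since the source states only the patch size and the output shape.

```python
def vit_token_count(image_size: int, patch_size: int, cls_token: bool = True) -> int:
    """Number of tokens a ViT produces for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + (1 if cls_token else 0)

# Assumed 256 x 256 input with patch size 64: 4 x 4 patches plus one class token.
tokens = vit_token_count(image_size=256, patch_size=64)
print(tokens)  # 17
```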

[0040] Example 3. In this embodiment, the complementary product-product graph network includes a second unimodal sanitization network and a graph convolutional network, and the preliminary multimodal feature extraction process is as follows: based on the multimodal representation of the product, a product-product graph in the text modality and a product-product graph in the visual modality are constructed according to semantic relevance; based on the user-product interaction matrix, a product-product graph is constructed according to user behavior relationships; based on the text-modality and visual-modality product-product graphs, the multimodal representation of the product-product graph is obtained; the second unimodal sanitization network sanitizes the multimodal representation of the product-product graph, resulting in a sanitized product text embedding and a sanitized product visual embedding; based on the text-modality, visual-modality, and behavior-based product-product graphs, the graph convolutional network is used to obtain product text features, product visual features, and user behavior features; meanwhile, based on the text-modality and visual-modality product-product graphs, the sanitized product text embedding and the sanitized product visual embedding are learned to obtain the sanitized product text features and the sanitized product visual features, respectively.

[0041] Specifically, based on the multimodal representation of the product, the product-product graph in the text modality and the product-product graph in the visual modality are constructed according to semantic relevance as follows:
C1: Based on the multimodal representation of the product, calculate the similarity matrix and the average similarity for each modality. The similarity between product i and product j in modality m is computed from the second m-modality embeddings of product i and product j, using the transpose of one embedding [formula not reproduced], where m denotes a modality, either the text modality or the visual modality. The average similarity for modality m is computed using the number of products in the product set [formula not reproduced]. If the similarity between products i and j is below the average similarity, the corresponding entry is set to 0.
C2: Process each similarity matrix with the symmetric normalization method to obtain the product normalization matrix for each modality [formula not reproduced], where the normalization uses a diagonal matrix associated with the similarity matrix.
C3: Apply the k-nearest-neighbor method to the product normalization matrix, retaining for each product only the top-k edges with the highest similarity, to obtain the product-product graph in modality m.
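Steps C1 to C3 can be sketched in NumPy as below. The sketch assumes dot-product similarity, an average taken over all entries of the similarity matrix, and a diagonal degree matrix (row sums) for the symmetric normalization; the original formulas are not preserved in this text, so these choices and all names are illustrative rather than the patent's exact definitions.

```python
import numpy as np

def modality_item_graph(embeddings, k):
    """Build a sparse product-product graph for one modality (steps C1 to C3)."""
    # C1: pairwise similarity, then zero out entries below the average similarity.
    similarity = embeddings @ embeddings.T
    similarity = np.where(similarity < similarity.mean(), 0.0, similarity)

    # C2: symmetric normalization D^{-1/2} S D^{-1/2}, with D the row-sum degree matrix.
    degree = similarity.sum(axis=1)
    inv_sqrt = np.zeros_like(degree)
    positive = degree > 0
    inv_sqrt[positive] = degree[positive] ** -0.5
    normalized = inv_sqrt[:, None] * similarity * inv_sqrt[None, :]

    # C3: keep only the k highest-similarity edges in each row.
    top_k = np.argsort(-normalized, axis=1)[:, :k]
    rows = np.arange(normalized.shape[0])[:, None]
    graph = np.zeros_like(normalized)
    graph[rows, top_k] = normalized[rows, top_k]
    return graph

# Hypothetical non-negative embeddings for six products in one modality.
rng = np.random.default_rng(0)
text_embeddings = rng.random((6, 8))
text_graph = modality_item_graph(text_embeddings, k=3)
print(text_graph.round(3))
```

Note that row-wise top-k selection can leave the graph asymmetric; whether the method symmetrizes it afterwards is not stated in the source.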

[0042] Based on the user-product interaction matrix, the product-product graph is constructed according to user behavior relationships as follows:
D1: Initialize the product-product co-occurrence matrix from the user-product interaction matrix, where each element of the co-occurrence matrix indicates how frequently users interacted with both product i and product j. Set a threshold parameter; if the interaction frequency of an element is lower than the threshold parameter, the element is set to 0.
D2: Process the product-product co-occurrence matrix with the symmetric normalization method to obtain the behavior normalization matrix [formula not reproduced], where the normalization uses a diagonal matrix associated with the co-occurrence matrix.
D3: Apply the k-nearest-neighbor method to the behavior normalization matrix to obtain the product-product graph based on user behavior.
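Steps D1 to D3 follow the same pattern as the modality graphs, starting from co-occurrence counts. The sketch below assumes the co-occurrence matrix is the product of the transposed interaction matrix with itself, that self co-occurrence on the diagonal is ignored, and that the normalization uses row-sum degrees; these details and all names are assumptions, since the original formulas are not preserved.

```python
import numpy as np

def behavior_item_graph(interaction, threshold, k):
    """Build a product-product graph from user behavior (steps D1 to D3)."""
    R = np.asarray(interaction, dtype=float)

    # D1: co[i, j] counts users who interacted with both product i and product j.
    co = R.T @ R
    np.fill_diagonal(co, 0.0)        # assumption: drop self co-occurrence
    co[co < threshold] = 0.0         # entries below the threshold are set to 0

    # D2: symmetric normalization with the row-sum degree matrix.
    degree = co.sum(axis=1)
    inv_sqrt = np.zeros_like(degree)
    positive = degree > 0
    inv_sqrt[positive] = degree[positive] ** -0.5
    normalized = inv_sqrt[:, None] * co * inv_sqrt[None, :]

    # D3: keep only the k strongest edges in each row.
    top_k = np.argsort(-normalized, axis=1)[:, :k]
    rows = np.arange(normalized.shape[0])[:, None]
    graph = np.zeros_like(normalized)
    graph[rows, top_k] = normalized[rows, top_k]
    return graph

# Hypothetical interaction matrix: 4 users (rows) x 4 products (columns).
interaction = np.array([[1, 1, 0, 0],
                        [1, 1, 1, 0],
                        [0, 1, 1, 0],
                        [0, 0, 0, 1]])
behavior_graph = behavior_item_graph(interaction, threshold=2, k=1)
print(behavior_graph.round(3))
```

In this toy case the fourth product never co-occurs with another product, so its row in the resulting graph stays empty.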

[0043] In this embodiment, the graph convolutional network uses LightGCN as the graph convolution kernel. To capture a more comprehensive range of product-product collaborative relationships, user behavior is treated as an additional modality, and the behavior-based product-product graph is added to the set of product-product graphs alongside the text-modality and visual-modality graphs.
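A minimal sketch of LightGCN-style propagation on a product-product graph is shown below. LightGCN omits feature transformation and nonlinearities and averages the embeddings of all layers; the number of layers, the toy adjacency matrix, and the function name are assumptions, and the source does not state how many layers the method uses.

```python
import numpy as np

def lightgcn_propagate(adjacency, embeddings, num_layers=2):
    """Propagate embeddings over a normalized graph and average all layers."""
    layers = [embeddings]
    for _ in range(num_layers):
        # No weight matrix and no activation: plain neighborhood aggregation.
        layers.append(adjacency @ layers[-1])
    return np.mean(layers, axis=0)

# Hypothetical normalized product-product graph over three products.
adjacency = np.array([[0.0, 0.5, 0.5],
                      [0.5, 0.0, 0.5],
                      [0.5, 0.5, 0.0]])
embeddings = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 1.0]])

features = lightgcn_propagate(adjacency, embeddings, num_layers=2)
print(features)
```

Running the same function once per graph (text, visual, behavior) would give one feature matrix per modality under these assumptions.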

[0044] Preferably, the backbone network is used to extract the first user features and the first product features from the multimodal features to obtain the first user representation and the first product representation. The process is as follows: user features are extracted from the user ID, the first text embedding, the first visual embedding, and the user-product interaction matrix to obtain the first user representation; product features are extracted from the first text embedding, the first visual embedding, and the product ID, and are then combined with the product text features, the product visual features, and the user behavior features to obtain the first product representation, which satisfies the expression: [formula not reproduced]

[0045] The terms of this expression include the feature vector obtained by randomly initializing the product ID and an average pooling operation.

[0046] Preferably, the first unimodal sanitization network and the second unimodal sanitization network have the same structure (see Figure 3); each includes a text feature branch and a visual feature branch. The text feature branch includes a primary text modality feature extraction unit, a convolutional layer, a bidirectional long short-term memory network, a first pooling layer, a fully connected layer, a filter, and a second pooling layer; the visual feature branch includes a primary visual modality feature extraction unit, a convolutional layer, a bidirectional long short-term memory network, a first pooling layer, a fully connected layer, a filter, and a second pooling layer.

[0047] Specifically, the primary text modality feature extraction unit includes a text feature extractor BERT and a PW layer, wherein the PW layer consists of two linear layers, and the primary visual modality feature extraction unit includes a visual feature extractor ViT.

[0048] Preferably, the product text modal features embedded in the second text are extracted using the primary text modal feature extraction unit, and then projected to obtain the primary text modal features. The primary visual modality features are extracted from the product visual modality features embedded in the second visual embedding using the primary visual modality feature extraction unit. The primary text modal features and the primary visual modal features are used as primary unimodal features. The primary unimodal features are processed using convolutional layers and bidirectional long short-term memory networks respectively to obtain hidden sequence representations. Satisfies the expression:

[0049] where the two dimensions of the hidden sequence representation are the sequence length and the hidden size, and m denotes the modality. The first pooling layer applies average pooling to the hidden sequence representation to obtain global information. The fully connected layer then processes the global information to obtain fully connected information, satisfying the expression:

[0050] where the network term denotes a fully connected network and T denotes the sequence length. For the processed global information, the time-noise-aware filter evaluates the importance of each token and merges the important tokens into a combined representation. The second pooling layer pools the combined representation to obtain the purified product single-mode embedding, which comprises the purified product text embedding and the purified product visual embedding. The above process is optimized based on the single-mode network loss function. In its expression, three parameters are the hyperparameters of the time-noise-aware filter, and the activation term denotes the sigmoid activation function.

[0051] Specifically, the time-noise-aware filter evaluates the importance of each token in the processed global information and merges the important tokens into a combined representation as follows. The importance of each token in the global information is evaluated, satisfying the expression:

[0052] Reparameterization is performed based on the importance of each token so that discrete properties are preserved, satisfying the expression:

[0053] Based on the importance and the discrete property of each token, the important tokens are obtained, satisfying the expression:

[0054] The important tokens are merged to obtain the combined representation.

[0055] where u ~ U(0,1); the activation term denotes the sigmoid activation function; the threshold parameter controls the threshold of z; and the stop-gradient term denotes the stop-gradient operator. The gating mechanism of the single-mode purification network is based on the Hard Concrete distribution, which, through reparameterization, enables efficient gradient-based optimization while preserving discrete properties.
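As a minimal sketch of a Hard Concrete gate of the kind referred to above, the following uses the commonly published form of that distribution: a uniform sample u is passed through a sigmoid, the result is stretched to an interval slightly wider than [0, 1], and then clipped, so that exact zeros (token dropped) and exact ones (token kept) occur with non-zero probability. The parameter names `log_alpha`, `beta`, `gamma`, and `zeta`, their default values, and the helper `gated_mean_pool` are assumptions for illustration and are not taken from this application; the stop-gradient operator is not modeled because no gradients are computed here.

```python
import math
from typing import List, Sequence


def stable_sigmoid(x: float) -> float:
    """Sigmoid that avoids overflow for large negative inputs."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def hard_concrete_gate(log_alpha: float, u: float, beta: float = 2.0 / 3.0,
                       gamma: float = -0.1, zeta: float = 1.1) -> float:
    """Map a uniform sample u in (0, 1) to a gate value z in [0, 1]."""
    u = min(max(u, 1e-6), 1.0 - 1e-6)          # keep the logarithms finite
    logit = (math.log(u) - math.log(1.0 - u) + log_alpha) / beta
    s = stable_sigmoid(logit)                   # relaxed (continuous) gate
    stretched = s * (zeta - gamma) + gamma      # stretch to (gamma, zeta)
    return min(1.0, max(0.0, stretched))        # clip: exact 0 or 1 possible


def gated_mean_pool(tokens: Sequence[Sequence[float]],
                    gates: Sequence[float]) -> List[float]:
    """Gate-weighted average of token vectors; tokens with gate 0 are dropped."""
    dim = len(tokens[0])
    total = sum(gates)
    if total == 0.0:
        return [0.0] * dim
    return [sum(g * t[k] for g, t in zip(gates, tokens)) / total
            for k in range(dim)]


# A strongly positive log_alpha keeps a token; a strongly negative one drops it.
kept = hard_concrete_gate(10.0, 0.5)
dropped = hard_concrete_gate(-10.0, 0.5)
pooled = gated_mean_pool([[1.0, 1.0], [3.0, 5.0]], [kept, dropped])
```

In a trainable implementation `log_alpha` would be predicted per token by the filter, and u would be resampled at every training step.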

[0056] Preferably, the backbone network extracts the second user features and the second product features to obtain the second user representation and the second product representation, as follows: user features are extracted from the user ID, the purified product single-mode embedding, and the user-product interaction matrix to obtain the second user representation; product features are extracted from the product ID and the purified product single-mode embedding, and the second product representation is then obtained from the purified product text features, the purified product visual features, and the user behavior features, satisfying the expression:

[0057] where the product ID term denotes a feature vector obtained by randomly initializing the product ID, and the pooling operator denotes average pooling.

[0058] Preferably, based on the user feedback denoising mechanism, user feedback noise is removed from the first user representation and the first product representation, as follows: multiple triples are extracted from the first user representation and the first product representation, where j denotes the index of product j in product set I, so that a user's behavior toward products is represented as a ranked list of pairwise relations. In the ranked list, two credibility terms denote the credibility of user u's behavior toward product i and toward product j, respectively. The user feedback denoising loss function is applied to all triples to denoise them, and its expression is:

[0059] where the activation term denotes the activation function, and the remaining terms denote the reliability probability, the noise probability, and the set of triples, respectively.

[0060] Specifically, product i denotes a product that user u has interacted with, and product j denotes a product that user u has not interacted with.
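The triple construction can be sketched as follows, assuming that interactions are stored as a mapping from each user index to the set of product indices the user has interacted with, and that one non-interacted product is drawn uniformly at random for each interacted product. The sampling strategy, the function name `sample_triples`, and the toy data are assumptions for illustration; this application does not specify how product j is chosen.

```python
import random
from typing import Dict, List, Set, Tuple


def sample_triples(interactions: Dict[int, Set[int]], num_products: int,
                   rng: random.Random) -> List[Tuple[int, int, int]]:
    """Build (u, i, j) triples: i was interacted with by u, j was not."""
    triples: List[Tuple[int, int, int]] = []
    for u, interacted in sorted(interactions.items()):
        candidates = [j for j in range(num_products) if j not in interacted]
        if not candidates:          # user interacted with every product
            continue
        for i in sorted(interacted):
            triples.append((u, i, rng.choice(candidates)))
    return triples


# Toy data: user 0 interacted with products 0 and 1, user 1 with product 2.
toy_interactions = {0: {0, 1}, 1: {2}}
toy_triples = sample_triples(toy_interactions, num_products=4,
                             rng=random.Random(0))
```

Each triple expresses one pairwise relation of the ranked list: for user u, product i is placed relative to product j.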

[0061] In the ranked list, when the interaction is treated as reliable, the triple is drawn from a Bernoulli distribution that is parameterized by the reliability probability.

[0062] where the mean term denotes the average estimated preference score of user u for product i across all modalities; the variance term denotes the variance of user u's preference scores for product i across the modalities; two hyperparameters control the modeled reliability; and the final term denotes the model parameters.

[0063] When the credibility of the product that user u has interacted with is lower than that of the product that user u has not interacted with, the triple is drawn from a Bernoulli distribution that is parameterized by the noise probability, where the noise probability is expressed as:

[0064] where the maximum term denotes the maximum estimated preference score of user u for the product over the modalities m; the negative-sample term denotes the preference score of the negative sample; the mean term denotes the average preference score of the negative samples in mini-batch B, where the mini-batch size is a hyperparameter giving the number of items processed simultaneously; the activation term denotes the activation function; a further hyperparameter controls the shape of the noise intensity; and the credibility condition indicates that the credibility of the product that user u has interacted with is lower than the credibility of the product that user u has not interacted with.

[0065] When the user does not like a product, that is, when this credibility condition holds, the noise probability increases with modality-specific attractiveness: it grows with a higher maximum estimated preference score and with a lower preference for the alternatives.
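The qualitative behavior of the two probabilities can be sketched as follows. The functional forms are hypothetical and chosen only to reproduce the stated trends: the reliability probability rises with the mean and falls with the variance of the per-modality preference scores, and the noise probability rises with the maximum per-modality score and falls with the preference for the negative sample, centered by the mini-batch average. The hyperparameter names `a`, `b`, and `shape` and all function names are assumptions; they are not the expressions of this application.

```python
import math
from statistics import fmean, pvariance
from typing import Sequence


def sigmoid(x: float) -> float:
    """Numerically stable sigmoid."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def reliability_probability(modal_scores: Sequence[float],
                            a: float = 1.0, b: float = 1.0) -> float:
    """Assumed form: high mean and low variance across modalities -> reliable."""
    return sigmoid(a * fmean(modal_scores) - b * pvariance(modal_scores))


def noise_probability(modal_scores: Sequence[float], negative_score: float,
                      batch_negative_scores: Sequence[float],
                      shape: float = 1.0) -> float:
    """Assumed form: one very attractive modality plus weak alternatives -> noisy."""
    centered_negative = negative_score - fmean(batch_negative_scores)
    return sigmoid(max(modal_scores) - centered_negative) ** shape


# Same mean score, but the second product's modalities disagree more.
consistent = reliability_probability([2.0, 2.0])
inconsistent = reliability_probability([1.0, 3.0])
```

A Bernoulli draw with either probability can then decide whether a given triple is treated as reliable or noisy when the denoising loss is accumulated.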

[0066] The loss function satisfies the expression:

[0067] Preferably, based on the user preference alignment mechanism, the first user representation and the first product representation, and the second user representation and the second product representation, are each aligned with actual user behavior, as follows: by modeling potential interactions, the distribution P of user preferences over products is calculated, satisfying the expression:

[0068] where the set term denotes the product set; the product ID term denotes a feature vector obtained by randomly initializing the product ID; the user term denotes the first user representation; and the superscript denotes the transpose. The first multimodal preference distribution is calculated, satisfying the expression:

[0069] where the product term denotes the first product representation. The second multimodal preference distribution is calculated, satisfying the expression:

[0070] where the product term denotes the second product representation and the user term denotes the second user representation. The Kullback-Leibler divergence between the user preference distribution P and each of the first multimodal preference distribution and the second multimodal preference distribution is minimized to obtain the first multimodal alignment loss and the second multimodal alignment loss, satisfying the expression:

[0071] where the divergence term denotes the Kullback-Leibler divergence.
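The alignment step can be sketched as follows, assuming that each preference distribution is a softmax over the inner products between a user representation and the rows of a product matrix, and that the alignment loss is the divergence from the behavior distribution P to a multimodal distribution. The softmax normalization, the direction of the divergence, the function names, and the toy vectors are assumptions for illustration.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize scores into a probability distribution (max-shifted)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def preference_distribution(user_repr: Sequence[float],
                            product_reprs: Sequence[Sequence[float]]) -> List[float]:
    """Softmax over user-product inner products (one score per product)."""
    scores = [sum(a * b for a, b in zip(user_repr, p)) for p in product_reprs]
    return softmax(scores)


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(p || q) with a small epsilon to guard against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Toy example: two products, 2-dimensional representations.
user = [1.0, 0.0]
id_embeddings = [[1.0, 0.0], [0.0, 1.0]]        # randomly initialized in practice
multimodal_products = [[2.0, 0.0], [0.0, 1.0]]  # e.g. first product representations

p_behavior = preference_distribution(user, id_embeddings)
p_multimodal = preference_distribution(user, multimodal_products)
alignment_loss = kl_divergence(p_behavior, p_multimodal)
```

The loss is zero only when the multimodal distribution matches P, so minimizing it pulls the multimodal features toward the preferences implied by actual interactions.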

[0072] Specifically, the user preference alignment mechanism aligns multimodal content with actual user behavior, which effectively bridges the semantic gap.

[0073] Preferably, mutual learning is carried out based on the dual-path knowledge-guided learning mechanism, as follows: the first total loss and the second total loss are used to achieve alternating bidirectional guided learning, yielding a bidirectional guidance loss function that satisfies the expression:

[0074] where the norm term denotes the L2 norm, and the first total loss and the second total loss are expressed as follows:

[0075] where two weighting terms are hyperparameters; one backbone term denotes the loss of the backbone network for extracting the second user features and the second product features, and the other denotes the loss of the backbone network for extracting the first user features and the first product features. The total loss function is obtained from the bidirectional guidance loss function and the single-mode network loss function, with a further weighting hyperparameter, and mutual learning and training are carried out based on the total loss function.

[0076] The backbone network, the user feedback denoising mechanism, and the user preference alignment mechanism constitute the first guiding path; the backbone network, the first single-mode purification network, and the user preference alignment mechanism constitute the second guiding path. Based on the dual-path knowledge-guided learning mechanism, cross-modal interaction promotes mutual learning between the features generated by the first and second guiding paths, jointly optimizing the user preference prediction. In the initial training phase, the second guiding path may delete useful information or retain useless information, which degrades the performance of the MD-UPA model, whereas the features generated by the first guiding path are more reliable for user preference prediction. The features generated by the first guiding path are therefore used to optimize the second guiding path, enabling it to identify irrelevant single-mode noise more accurately and to generate a more refined multimodal representation. At the same time, the more refined multimodal representation guides the learning of the first guiding path, establishing a bidirectional knowledge feedback loop and outputting the products that the user is most likely to purchase.

[0077] The same or similar reference labels correspond to the same or similar parts. The terms describing positional relationships in the accompanying drawings are for illustrative purposes only and should not be construed as limiting the invention. The above embodiments are merely examples given to illustrate the present invention clearly and are not intended to limit its implementation. Those skilled in the art can make other variations or modifications based on the above description; it is neither necessary nor possible to list all embodiments exhaustively here. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the claims of the present invention.

Claims

1. A multimodal recommendation method based on multi-level denoising and user preference alignment, characterized in that it includes the following steps: obtaining a multimodal dataset and preprocessing it to obtain a preprocessed multimodal dataset; obtaining a product multimodal representation based on the preprocessed multimodal dataset; constructing and pre-training a complementary product-product graph network, a backbone network, and a first single-mode purification network; obtaining semantic relations and user behavior relations based on the preprocessed multimodal dataset, constructing product-product graphs with the complementary product-product graph network, and preliminarily extracting multimodal features based on the product-product graphs; using the backbone network to extract first user features and first product features from the multimodal features to obtain a first user representation and a first product representation; purifying the product multimodal representation with the first single-mode purification network to identify single-mode noise that is irrelevant to user preference prediction, obtaining a purified product single-mode embedding; based on the multimodal features and the purified product single-mode embedding, using the backbone network to extract second user features and second product features to obtain a second user representation and a second product representation; removing user feedback noise from the first user representation and the first product representation based on a user feedback denoising mechanism; aligning the first user representation and the first product representation, and the second user representation and the second product representation, with actual user behavior, respectively, based on a user preference alignment mechanism; and performing mutual learning based on a dual-path knowledge-guided learning mechanism and outputting recommendation results.

2. The multimodal recommendation method based on multi-level denoising and user preference alignment according to claim 1, characterized in that the multimodal dataset is obtained and preprocessed to obtain the preprocessed multimodal dataset, and the product multimodal representation is obtained based on the preprocessed multimodal dataset, as follows: user interaction records, a user set, and a product set are obtained, where the user set includes user IDs and the product set includes product IDs, product text information, and product visual information; a user-product interaction matrix is constructed from the user set, the product set, and the user interaction records, where each entry of the matrix is indexed by i, the index of product i in product set I, and by u, the index of user u in user set U, and the entry is 1 when user u has purchased product i and 0 otherwise; based on the product set, the first text embedding is extracted using a pre-trained sentence transformer and the first visual embedding is extracted using a pre-trained CNN model; based on the product set, the second text embedding is extracted using a pre-trained BERT model and the second visual embedding is extracted using a ViT model; and the first text embedding, the first visual embedding, the second text embedding, and the second visual embedding are used as the product multimodal representation.

3. The multimodal recommendation method based on multi-level denoising and user preference alignment according to claim 2, characterized in that the complementary product-product graph network includes a second single-mode purification network and a graph convolutional network, and the preliminary multimodal feature extraction proceeds as follows: based on the product multimodal representation, a product-product graph in the text modality and a product-product graph in the visual modality are constructed according to semantic relations; based on the user-product interaction matrix, a product-product graph is constructed according to user behavior relations; based on the text-modality and visual-modality product-product graphs, the product multimodal representation on the product-product graphs is obtained; the second single-mode purification network purifies the product multimodal representation of the product-product graphs, yielding a purified product text embedding and a purified product visual embedding; based on the three product-product graphs, the graph convolutional network is used to obtain the product text features, the product visual features, and the user behavior features; meanwhile, based on the text-modality and visual-modality product-product graphs, features are learned from the purified product text embedding and the purified product visual embedding to obtain the purified product text features and the purified product visual features, respectively.

4. The multimodal recommendation method based on multi-level denoising and user preference alignment according to claim 3, characterized in that the backbone network extracts the first user features and the first product features from the multimodal features to obtain the first user representation and the first product representation, as follows: user features are extracted from the user ID, the first text embedding, the first visual embedding, and the user-product interaction matrix to obtain the first user representation; product features are extracted from the first text embedding, the first visual embedding, and the product ID, and the first product representation is then obtained from the product text features, the product visual features, and the user behavior features, satisfying the expression: where the product ID term denotes a feature vector obtained by randomly initializing the product ID, and the pooling operator denotes average pooling.

5. The multimodal recommendation method based on multi-level denoising and user preference alignment according to claim 2, characterized in that the first single-mode purification network and the second single-mode purification network have the same structure, and both include a text feature branch and a visual feature branch; the text feature branch includes a primary text modality feature extraction unit, a convolutional layer, a bidirectional long short-term memory network, a first pooling layer, a fully connected layer, a filter, and a second pooling layer; the visual feature branch includes a primary visual modality feature extraction unit, a convolutional layer, a bidirectional long short-term memory network, a first pooling layer, a fully connected layer, a filter, and a second pooling layer.

6. The multimodal recommendation method based on multi-level denoising and user preference alignment according to claim 5, characterized in that the primary text modality feature extraction unit extracts the product text modality features from the second text embedding and projects them to obtain the primary text modality features; the primary visual modality feature extraction unit extracts the primary visual modality features from the product visual modality features of the second visual embedding; the primary text modality features and the primary visual modality features serve as the primary single-mode features; the primary single-mode features of each modality are processed by the convolutional layer and the bidirectional long short-term memory network to obtain a hidden sequence representation, satisfying the expression: where the two dimensions of the hidden sequence representation are the sequence length and the hidden size, and m denotes the modality; the first pooling layer applies average pooling to the hidden sequence representation to obtain global information; the fully connected layer processes the global information to obtain fully connected information, satisfying the expression: for the processed global information, the time-noise-aware filter evaluates the importance of each token and merges the important tokens into a combined representation; the second pooling layer pools the combined representation to obtain the purified product single-mode embedding, which comprises the purified product text embedding and the purified product visual embedding; the above process is optimized based on the single-mode network loss function, in whose expression three parameters are the hyperparameters of the time-noise-aware filter and the activation term denotes the sigmoid activation function.

7. The multimodal recommendation method based on multi-level denoising and user preference alignment according to claim 3, characterized in that the backbone network extracts the second user features and the second product features to obtain the second user representation and the second product representation, as follows: user features are extracted from the user ID, the purified product single-mode embedding, and the user-product interaction matrix to obtain the second user representation; product features are extracted from the product ID and the purified product single-mode embedding, and the second product representation is then obtained from the purified product text features, the purified product visual features, and the user behavior features, satisfying the expression: where the product ID term denotes a feature vector obtained by randomly initializing the product ID, and the pooling operator denotes average pooling.

8. The multimodal recommendation method based on multi-level denoising and user preference alignment according to claim 1, characterized in that, based on the user feedback denoising mechanism, user feedback noise is removed from the first user representation and the first product representation, as follows: multiple triples are extracted from the first user representation and the first product representation, where j denotes the index of product j in product set I, so that a user's behavior toward products is represented as a ranked list of pairwise relations; in the ranked list, two credibility terms denote the credibility of user u's behavior toward product i and toward product j, respectively; the user feedback denoising loss function is applied to all triples to denoise them, and its expression is: where the activation term denotes the activation function, and the remaining terms denote the reliability probability, the noise probability, and the set of triples, respectively.

9. The multimodal recommendation method based on multi-level denoising and user preference alignment according to claim 8, characterized in that, based on the user preference alignment mechanism, the first user representation and the first product representation, and the second user representation and the second product representation, are each aligned with actual user behavior, as follows: by modeling potential interactions, the distribution P of user preferences over products is calculated, satisfying the expression: where the set term denotes the product set, the product ID term denotes a feature vector obtained by randomly initializing the product ID, the user term denotes the first user representation, and the superscript denotes the transpose; the first multimodal preference distribution is calculated, satisfying the expression: where the product term denotes the first product representation; the second multimodal preference distribution is calculated, satisfying the expression: where the product term denotes the second product representation and the user term denotes the second user representation; the Kullback-Leibler divergence between the user preference distribution P and each of the first multimodal preference distribution and the second multimodal preference distribution is minimized to obtain the first multimodal alignment loss and the second multimodal alignment loss, satisfying the expression: where the divergence term denotes the Kullback-Leibler divergence.

10. The multimodal recommendation method based on multi-level denoising and user preference alignment according to claim 9, characterized in that mutual learning is carried out based on the dual-path knowledge-guided learning mechanism, as follows: the first total loss and the second total loss are used to achieve alternating bidirectional guided learning, yielding a bidirectional guidance loss function that satisfies the expression: where the norm term denotes the L2 norm; the first total loss and the second total loss are expressed as follows: where two weighting terms are hyperparameters, one backbone term denotes the loss of the backbone network for extracting the second user features and the second product features, and the other denotes the loss of the backbone network for extracting the first user features and the first product features; the total loss function is obtained from the bidirectional guidance loss function and the single-mode network loss function, with a further weighting hyperparameter; and mutual learning and training are carried out based on the total loss function.