A multi-modal recommendation method based on MLLMs and fine-grained attribute semantic enhancement
This method uses a multimodal large language model (MLLM) to generate semantic descriptions of items and users, constructs an item-item graph and performs graph propagation on it, enhances item representations with fine-grained attribute semantics, and applies perturbation-view consistency learning. It addresses the lack of key attribute information in item representations in existing technologies, thereby achieving more accurate recommendation results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHONGQING NORMAL UNIVERSITY
- Filing Date
- 2026-04-01
- Publication Date
- 2026-06-12
Figure CN122196273A_ABST
Description
Technical Field
[0001] This invention relates to the field of computer technology, and in particular to a multimodal recommendation method based on fine-grained attribute semantic enhancement using MLLMs.
Background Technology
[0002] With the continuous growth of internet information volume, recommendation systems, as an information filtering technology, are widely used in e-commerce platforms, content distribution platforms, and multimedia information service systems. In practical applications, items typically contain multimodal data such as image and text information.
[0003] In existing technologies, multimodal recommendation methods typically utilize the visual and textual features of items directly, or extract features through pre-trained models and then fuse them to construct item representations. These methods generally integrate information from different modalities through vector concatenation, weighted fusion, or graph propagation to model the relationship between users and items.
[0004] However, in implementing the technical solution of this application, the inventors found that the above technology has at least the following technical problem: existing multimodal recommendation methods have difficulty explicitly extracting and structurally representing, from image and text information, the fine-grained attribute semantic information that influences user decision-making. As a result, item representations lack key attribute information, which affects the integrity of subsequent representation modeling.
Summary of the Invention
[0005] To overcome the above shortcomings, this invention provides a multimodal recommendation method based on fine-grained attribute semantic enhancement using MLLMs, aiming to address the problem in the prior art that the lack of key attribute information in item representations affects the integrity of subsequent representation modeling.
[0006] This invention provides the following technical solution: a multimodal recommendation method based on fine-grained attribute semantic enhancement using MLLMs, comprising the following steps:
S1. Obtain the user set, the item set, the historical interaction data between users and items, and the image and text information corresponding to each item;
S2. Input the image and text information of each item into a multimodal large language model (MLLMs) and generate the corresponding item semantic descriptions under the constraints of a preset prompt template; input the semantic descriptions of the items in each user's historical interactions into the MLLMs to generate user preference semantic descriptions; encode the item semantic descriptions and the user preference semantic descriptions to obtain the initial semantic vectors of the items and the initial semantic vectors of the users, thus completing multimodal semantic encoding;
S3. Based on the initial semantic vectors of the items and the historical interaction data, construct an item-item graph that integrates semantic similarity relationships and interaction co-occurrence relationships; perform graph propagation on the initial semantic vectors of the items based on the item-item graph to obtain the item backbone representation;
S4. Input the image and text information of each item into the MLLMs and generate the corresponding attribute semantic descriptions under the constraints of a prompt template for decision-oriented attributes; encode the attribute semantic descriptions to obtain attribute semantic vectors, thus completing fine-grained attribute semantic enhancement;
S5. Based on the item-item graph, perform graph propagation on the attribute semantic vectors; after processing the propagated attribute semantic vectors, align them with the item backbone representation and fuse the two by weighting to obtain the final item representation, thus completing graph-aware attribute fusion;
S6. Construct perturbation views based on the user's initial semantic vector and the final item representation, and perform dual-view consistency learning based on the perturbation views; construct training samples based on users and positive and negative sample items, and jointly optimize and train the multimodal recommendation model;
S7. Based on the trained multimodal recommendation model, input the user to be recommended, calculate the matching degree between the user to be recommended and the candidate items, and output the recommendation result according to the matching degree.
[0007] Preferably, in step S2, the step of completing multimodal semantic encoding includes: inputting the image and text information of each item into the MLLMs and generating the corresponding item semantic description under the constraints of the preset prompt template; for each user, collecting the semantic descriptions of the items in the user's historical interactions, inputting them into the MLLMs, and generating the corresponding user preference semantic description under the constraints of the preset prompt template; and using a unified text encoder to encode the item semantic descriptions and the user preference semantic descriptions to obtain initial semantic vectors of items and users with consistent dimensions.
[0008] Preferably, in step S3, the step of constructing an item-item graph that integrates semantic similarity relationships and interaction co-occurrence relationships includes: calculating the semantic similarity relationship between any two items based on their initial semantic vectors; determining the interaction co-occurrence relationship between any two items from the historical interaction data; applying nearest-neighbor filtering and sparsification to the semantic similarity relationships and to the interaction co-occurrence relationships respectively, so as to retain the highly relevant neighbors of each item; and fusing the processed semantic similarity relationships and interaction co-occurrence relationships to obtain the item-item graph.
[0009] Preferably, in step S3, the step of performing graph propagation on the initial semantic vectors of the items based on the item-item graph includes: using the initial semantic vector of each item as the initial representation of the corresponding node in the item-item graph; performing multi-layer neighborhood information aggregation based on the item-item graph to obtain the propagation result of each layer; aggregating the propagation results of all layers to obtain the item backbone representation; and mapping the item backbone representation to obtain the item backbone features, and mapping the user initial semantic vector to obtain the user representation.
[0010] Preferably, in step S4, the step of completing fine-grained attribute semantic enhancement includes: inputting the image and text information of each item into the MLLMs and generating the corresponding attribute semantic descriptions under the constraints of the prompt template for decision-oriented attributes, where the prompt template constrains the MLLMs to generate content around the decision-related attributes of the item, and the decision-related attributes are attribute information that influences user choices; structuring the attribute semantic descriptions so that they share a unified form of semantic expression; and using a text encoder to encode the structured attribute semantic descriptions to obtain the attribute semantic vectors. The attribute semantic vectors are fused with the item backbone representation in the subsequent graph-aware attribute fusion step.
[0011] Preferably, in step S5, the step of performing graph propagation on the attribute semantic vectors based on the item-item graph includes: using the attribute semantic vector of each item as the initial attribute representation of the corresponding node in the item-item graph; based on the adjacency relationships of the item-item graph, aggregating neighborhood information into the initial attribute representation of each node, so that each node integrates the attribute semantic information of its neighboring nodes; updating the attribute representation of each node layer by layer through multi-layer propagation to capture higher-order relationships between items; and aggregating the attribute representations obtained from the multiple layers to obtain the propagated attribute semantic vectors. The propagated attribute semantic vectors characterize the attribute semantics after neighborhood information has been fused, and are used in the subsequent graph-aware attribute fusion step.
[0012] Preferably, in step S5, the step of completing graph-aware attribute fusion includes: applying random deactivation (dropout) to the propagated attribute semantic vectors to reduce the impact of noise in the attribute semantics; mapping the processed attribute semantic vectors into the representation space of the item backbone representation to obtain attribute semantic features; normalizing the attribute semantic features and the item backbone representation respectively; and fusing the normalized attribute semantic features and the normalized item backbone representation by weighting, based on preset fusion weights, to obtain the final item representation.
[0013] Preferably, in step S6, the step of performing dual-view consistency learning based on the perturbation views includes: injecting random perturbations into the user representation and the final item representation respectively to construct a first perturbation view and a second perturbation view; taking the representations of the same user in the first and second perturbation views as a user positive sample pair, and the representations of different users as user negative sample pairs; taking the representations of the same item in the first and second perturbation views as an item positive sample pair, and the representations of different items as item negative sample pairs; and performing consistency learning based on the user positive sample pairs, user negative sample pairs, item positive sample pairs, and item negative sample pairs, so that the representations of the same object stay close under different perturbation views while the representations of different objects remain distinguishable. The perturbation strength applied to the user representation is smaller than the perturbation strength applied to the final item representation.
[0014] Preferably, in step S6, the step of jointly optimizing and training the multimodal recommendation model includes: constructing training samples consisting of users, positive sample items, and negative sample items based on the user representations and the final item representations; based on the training samples, making the user's matching degree for positive sample items higher than that for negative sample items; based on the dual-view consistency learning results, keeping the representation of the same user or the same item consistent across different perturbation views; and weighting and combining the above training objectives to jointly optimize and train the multimodal recommendation model.
[0015] Preferably, in step S7, the step of outputting the recommendation result according to the matching degree includes: obtaining the historical interaction data of the user to be recommended; generating the initial semantic vector of the user to be recommended according to step S2, and generating the user representation of the user to be recommended with the multimodal recommendation model obtained in step S6; calculating the matching degree between the user to be recommended and the candidate items based on the final item representations obtained in step S5; and sorting the candidate items by matching degree and outputting the items within a preset position range of the sorted result as the recommendation result.
[0016] The present invention has the following beneficial effects: 1. This invention uses a multimodal large language model to generate attribute semantic descriptions from the image and text information of items, and structures and encodes these descriptions so that the item representation explicitly includes decision-related attribute information. Fine-grained attribute semantic information is thereby introduced at the representation level, addressing the difficulty existing methods have in characterizing key attributes.
[0017] 2. This invention propagates attribute semantic vectors based on item-item graphs and integrates them with the item backbone representation through normalization and weighting. This allows attribute semantic information to propagate and participate in the fusion within neighborhood relationships, thereby introducing structural constraint information during graph propagation and reducing the impact of noisy edges on representation updates.
[0018] 3. This invention constructs perturbation views from the user representation and the item representation and performs dual-view consistency learning, combined with ranking training based on positive and negative samples for joint optimization. This keeps the representation of the same object consistent under different perturbation conditions and preserves the matching relationship between users and positive sample items, thereby improving the representation stability of the model in the presence of structural noise.
Attached Figure Description
[0019] Figure 1 is a flowchart of the multimodal recommendation method based on fine-grained attribute semantic enhancement using MLLMs proposed in this invention; Figure 2 is a flowchart of the multimodal semantic encoding process of the method; Figure 3 is a flowchart of the attribute semantic enhancement and fusion process of the method.
Detailed Implementation
[0020] The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0021] Referring to Figures 1-3, this invention provides a multimodal recommendation method based on fine-grained attribute semantic enhancement using MLLMs, comprising the following steps:
S1. Obtain the user set, the item set, the historical interaction data between users and items, and the image and text information corresponding to each item;
S2. Input the image and text information of each item into the multimodal large language model (MLLMs) and generate the corresponding item semantic descriptions under the constraints of the preset prompt template; input the semantic descriptions of the items in each user's historical interactions into the MLLMs to generate user preference semantic descriptions; encode the item semantic descriptions and the user preference semantic descriptions to obtain the initial semantic vectors of the items and of the users, thus completing multimodal semantic encoding;
S3. Based on the initial semantic vectors of the items and the historical interaction data, construct an item-item graph that integrates semantic similarity relationships and interaction co-occurrence relationships; perform graph propagation on the initial semantic vectors of the items based on the item-item graph to obtain the item backbone representation;
S4. Input the image and text information of each item into the MLLMs and generate the corresponding attribute semantic descriptions under the constraints of the prompt template for decision-oriented attributes; encode the attribute semantic descriptions to obtain attribute semantic vectors, thus completing fine-grained attribute semantic enhancement;
S5. Based on the item-item graph, perform graph propagation on the attribute semantic vectors; after processing the propagated attribute semantic vectors, align them with the item backbone representation and fuse the two by weighting to obtain the final item representation, thus completing graph-aware attribute fusion;
S6. Construct perturbation views based on the user's initial semantic vector and the final item representation, and perform dual-view consistency learning based on the perturbation views; construct training samples based on users and positive and negative sample items, and jointly optimize and train the multimodal recommendation model;
S7. Based on the trained multimodal recommendation model, input the user to be recommended, calculate the matching degree between the user to be recommended and the candidate items, and output the recommendation result according to the matching degree.
[0022] Specifically, this method is executed in an electronic device, which calls pre-stored program instructions and completes data processing, model building, and recommendation result generation in sequence according to steps S1 to S7.
[0023] In step S1, the user set and the item set are first obtained, and historical interaction data between users and items is established to represent the historical behavior relationship between users and items. At the same time, the image information and text information corresponding to each item are obtained, and the image information and text information are processed in a unified format to meet the input requirements of the subsequent model.
[0024] In step S2, the image and text information of each item are used as input data and fed into a multimodal large language model (MLLMs). Under the constraints of a preset prompt template, the semantic description of the item is generated. Then, for each user, the semantic descriptions of the items corresponding to the user's historical interactions are collected and used as input data again into the MLLMs. Under the constraints of the preset prompt template, the semantic description of the user's preferences is generated. Subsequently, the semantic descriptions of the items and preferences are input into a text encoder for vectorization processing to obtain the initial semantic vectors of the items and the initial semantic vectors of the users, and the two are placed in a unified representation space.
[0025] In step S3, the semantic similarity relationship between any two items is calculated based on the initial semantic vectors of the items, and the interaction co-occurrence relationship between any two items is determined from the historical interaction data. The semantic similarity relationships and the interaction co-occurrence relationships are then filtered separately so that only the highly relevant neighbors of each item are retained, and the filtered relationships are fused to construct the item-item graph. On the item-item graph, the initial semantic vector of each item is used as the initial representation of its node, multi-layer neighborhood information aggregation is performed on each node, and the results of all layers are aggregated to obtain the item backbone representation.
[0026] In step S4, the image and text information of each item are input into the multimodal large language model (MLLMs) again, and attribute semantic descriptions are generated under the constraint of the prompt template for decision-oriented attributes. The prompt template constrains the generated content to revolve around the decision-related attribute information of the item. The attribute semantic descriptions are then organized uniformly so that they share a consistent form of expression, and the organized descriptions are input into the text encoder for vectorization to obtain the attribute semantic vectors.
[0027] In step S5, the attribute semantic vector is used as the initial attribute representation of each node in the item-item graph. Multi-layer neighborhood propagation is performed on the item-item graph to fuse the attribute semantic information of the neighboring items into the attribute representation of each item. The multi-layer propagation results are summarized to obtain the propagated attribute semantic vector. Then, random deactivation processing is performed on the propagated attribute semantic vector to reduce the impact of invalid attribute information. The processed attribute semantic vector is then mapped to a representation space consistent with the item backbone representation, and the attribute semantic vector and the item backbone representation are normalized respectively. Finally, the two are weighted and combined according to the preset fusion weight to obtain the final item representation.
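The fusion part of step S5 can be illustrated with a short sketch. It is not the patented implementation: the projection matrix `weight`, the inverted-dropout rescaling, L2 normalization, and the convex combination controlled by `fusion_weight` are all assumptions chosen for illustration, since the text only specifies random deactivation, mapping, normalization, and weighted fusion.

```python
import math
import random
from typing import List, Sequence


def dropout(vec: Sequence[float], rate: float, rng: random.Random) -> List[float]:
    """Zero each entry with probability `rate`; rescale survivors (inverted dropout, an assumption)."""
    if rate <= 0.0:
        return list(vec)
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vec]


def project(vec: Sequence[float], weight: Sequence[Sequence[float]]) -> List[float]:
    """Linear map into the backbone space; `weight` has one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def l2_normalize(vec: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_item(backbone: Sequence[float], attr: Sequence[float],
              weight: Sequence[Sequence[float]], fusion_weight: float,
              drop_rate: float, rng: random.Random) -> List[float]:
    """Dropout -> projection -> normalization of both sides -> weighted fusion."""
    attr_feat = l2_normalize(project(dropout(attr, drop_rate, rng), weight))
    backbone_n = l2_normalize(backbone)
    # Hypothetical fusion rule: convex combination with a preset weight.
    return [(1.0 - fusion_weight) * b + fusion_weight * a
            for b, a in zip(backbone_n, attr_feat)]


identity = [[1.0, 0.0], [0.0, 1.0]]
final_item = fuse_item([3.0, 4.0], [0.0, 2.0], identity,
                       fusion_weight=0.25, drop_rate=0.0, rng=random.Random(0))
```

With dropout disabled, the example blends the unit vectors (0.6, 0.8) and (0, 1) at a 0.75 / 0.25 ratio; during training a non-zero `drop_rate` would be used.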
[0028] In step S6, based on the user's initial semantic vector and the final item representation, random perturbations are introduced into the user representation and item representation respectively to construct a first perturbation view and a second perturbation view. The representation of the same user in different perturbation views is taken as a positive user sample pair, and the representation between different users is taken as a negative user sample pair. Similarly, the representation of the same item in different perturbation views is taken as a positive item sample pair, and the representation between different items is taken as a negative item sample pair. Consistency learning is performed based on the above positive and negative sample pairs to ensure that the representation of the same object remains consistent under different perturbation conditions, while maintaining the distinction between the representations of different objects. At the same time, training samples are constructed based on users, positive sample items, and negative sample items, so that the user's matching degree for positive sample items is higher than that for negative sample items. The consistency learning objective and the preference ranking objective are jointly trained to obtain a multimodal recommendation model.
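The joint objective of step S6 can be sketched as follows. The text does not name specific loss functions, so this sketch substitutes two common choices as assumptions: an InfoNCE-style contrastive loss for the consistency objective and a BPR-style pairwise loss for the preference ranking objective. The noise model, the temperature, the weight `cl_weight`, and the perturbation strengths are hypothetical; the user-side strength is set below the item-side strength, in line with the preferred embodiment.

```python
import math
import random
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def perturb(vec: Sequence[float], strength: float, rng: random.Random) -> List[float]:
    """Add random noise of fixed length `strength` (hypothetical noise model)."""
    noise = [rng.gauss(0.0, 1.0) for _ in vec]
    norm = math.sqrt(dot(noise, noise)) or 1.0
    return [x + strength * n / norm for x, n in zip(vec, noise)]


def info_nce(view1: Sequence[Sequence[float]], view2: Sequence[Sequence[float]],
             temperature: float = 0.2) -> float:
    """Same object across views is the positive pair; other objects are negatives."""
    total = 0.0
    for i, anchor in enumerate(view1):
        logits = [cosine(anchor, other) / temperature for other in view2]
        total += math.log(sum(math.exp(z) for z in logits)) - logits[i]
    return total / len(view1)


def bpr_loss(user: Sequence[float], pos: Sequence[float], neg: Sequence[float]) -> float:
    """Pairwise ranking loss: small when the positive item scores above the negative."""
    return math.log(1.0 + math.exp(-(dot(user, pos) - dot(user, neg))))


def joint_loss(users: Sequence[Sequence[float]], items: Sequence[Sequence[float]],
               triples: Sequence[Tuple[int, int, int]], user_eps: float = 0.05,
               item_eps: float = 0.1, cl_weight: float = 0.1, seed: int = 0) -> float:
    rng = random.Random(seed)
    u1 = [perturb(u, user_eps, rng) for u in users]
    u2 = [perturb(u, user_eps, rng) for u in users]
    i1 = [perturb(v, item_eps, rng) for v in items]
    i2 = [perturb(v, item_eps, rng) for v in items]
    ranking = sum(bpr_loss(users[u], items[p], items[n]) for u, p, n in triples) / len(triples)
    consistency = info_nce(u1, u2) + info_nce(i1, i2)
    return ranking + cl_weight * consistency  # weighted combination of both objectives


loss = joint_loss(users=[[1.0, 0.0], [0.0, 1.0]],
                  items=[[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]],
                  triples=[(0, 0, 1), (1, 1, 0)])
```

In an actual training loop the loss would be computed on trainable representations with automatic differentiation; plain lists are used here only to keep the sketch self-contained.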
[0029] In step S7, the historical interaction data of the user to be recommended is obtained, and the user representation of the user is generated; the trained multimodal recommendation model is called to calculate the matching degree between the user to be recommended and the candidate items; the candidate items are sorted according to the matching degree, and the items in the sorted results that are within the preset position range are output as the recommendation results.
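The ranking part of step S7 can be sketched in a few lines. The inner product is used as the matching degree here as an assumption, because the text does not specify the scoring function; the function name `recommend` and the example item identifiers are hypothetical.

```python
from typing import Dict, List, Sequence


def recommend(user_vec: Sequence[float],
              item_vecs: Dict[str, Sequence[float]],
              top_n: int) -> List[str]:
    """Score every candidate item, sort by matching degree, and keep the first `top_n`."""
    scores = {
        item: sum(u * v for u, v in zip(user_vec, vec))  # inner product (assumed)
        for item, vec in item_vecs.items()
    }
    # Ties are broken by item identifier so the output is deterministic.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:top_n]


candidates = {"A": [0.9, 0.1], "B": [0.2, 0.8], "C": [0.5, 0.5]}
top_items = recommend([1.0, 0.0], candidates, top_n=2)
```

Here `top_n` plays the role of the preset position range in the sorted result.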
[0030] The above process enables a complete workflow from multimodal data input, semantic modeling, graph structure modeling, attribute semantic enhancement, model training to recommendation result output.
[0031] Furthermore, in step S2, the step of completing multimodal semantic encoding includes: inputting the image and text information of each item into the MLLMs and generating the corresponding item semantic description under the constraints of the preset prompt template; for each user, collecting the semantic descriptions of the items in the user's historical interactions, inputting them into the MLLMs, and generating the corresponding user preference semantic description under the constraints of the preset prompt template; and using a unified text encoder to encode the item semantic descriptions and the user preference semantic descriptions to obtain initial semantic vectors of items and users with consistent dimensions.
[0032] Specifically, when generating the semantic description of an item, for any item $i$, its image information is denoted as $I_i$ and its text information is denoted as $T_i$. The two are input together into the multimodal large language model (MLLMs) and, combined with the preset prompt template, a semantic description $D_i$ of the item is generated, represented as:
$$D_i = \mathrm{MLLM}(I_i, T_i, P_{\text{item}})$$
where $\mathrm{MLLM}(\cdot)$ represents the generation function of the multimodal large language model, $P_{\text{item}}$ represents the prompt template used to generate item semantic descriptions, and $D_i$ represents the output item semantic description text.
[0033] The prompt template is used to limit the structure of the output content, ensuring that the generated semantic description includes the core feature information of the item. The prompt template includes an input description and an output constraint. The input description instructs the model to perform joint analysis of image and text information, while the output constraint limits the expression form of the output text to a unified natural language description, thereby ensuring a consistent expression structure among the semantic descriptions of different items.
[0034] When generating the semantic description of user preferences, for any user $u$, the set of historical interaction items is represented as:
$$\mathcal{I}_u = \{\, i \mid r_{ui} = 1 \,\}$$
where $\mathcal{I}_u$ represents the set of items that user $u$ has interacted with historically, and $r_{ui}$ represents the interaction relationship between user $u$ and item $i$. The semantic descriptions $D_i$ of the items in the set $\mathcal{I}_u$ are concatenated in a preset order and input into the MLLMs. Combined with the user prompt template, a semantic description $D_u$ of user preferences is generated, represented as:
$$D_u = \mathrm{MLLM}(\{D_i\}_{i \in \mathcal{I}_u}, P_{\text{user}})$$
where $P_{\text{user}}$ represents the prompt template used to generate user preference semantic descriptions, and $D_u$ represents the generated user preference semantic description text.
[0035] The user prompt template is used to limit the scope of the user preference descriptions output by the model, ensuring that the generated results summarize the user's historical interaction behavior and are output as semantic descriptions in a uniform format, thereby guaranteeing consistent semantic expression across different users. In the text encoding stage, a unified text encoder is used to encode both the item semantic descriptions and the user preference semantic descriptions. For any item semantic description $D_i$, encoding yields the initial semantic vector $e_i$ of the item, represented as:
$$e_i = \mathrm{Enc}(D_i), \quad e_i \in \mathbb{R}^d$$
where $\mathrm{Enc}(\cdot)$ represents the text encoding function and $d$ represents the vector dimension. For any user preference semantic description $D_u$, encoding yields the initial semantic vector $e_u$ of the user, represented as:
$$e_u = \mathrm{Enc}(D_u)$$
where $e_u \in \mathbb{R}^d$.
[0036] A text encoder is a fixed-parameter encoding model or a trainable encoding model. Its input is semantic descriptive text, and its output is a fixed-length vector representation. During the encoding process, the input text is segmented, embedded, mapped, and feature-aggregated to convert the text information into a continuous vector representation. To ensure that the initial semantic vector of the item and the initial semantic vector of the user are in the same representation space, the text encoder uses the same encoding structure and parameter settings for both types of input, so that the two types of vectors have the same dimension and a unified semantic measurement method.
[0037] Through the above processing, the resulting initial semantic vectors of items are used in the subsequent item-item graph construction and graph propagation process, while the initial semantic vectors of users are used in the subsequent matching calculation and consistency learning process. This step realizes the conversion of image information and text information into a unified semantic representation and establishes a semantic correspondence between users and items, providing an input foundation for subsequent modeling.
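The data flow of step S2 can be sketched as below. Everything concrete in the sketch is a placeholder: the prompt wording, the `fake_mllm` generator, and the hashed bag-of-words `toy_encode` are hypothetical stand-ins for the actual prompt templates, the multimodal large language model, and the text encoder, none of which are disclosed in detail. The sketch only shows that items and users pass through the same encoder and therefore receive vectors of the same dimension.

```python
import zlib
from typing import Callable, List, Optional, Sequence

# Hypothetical prompt wording; the actual templates are not disclosed in the text.
ITEM_PROMPT = (
    "Jointly analyze the item image and the item text, then describe the core "
    "features of the item in one plain natural-language paragraph.\n"
    "Text: {text}"
)
USER_PROMPT = (
    "The user interacted with the items described below. Summarize the user's "
    "preferences in one plain natural-language paragraph.\n{history}"
)

# (prompt, optional image bytes) -> generated text
Mllm = Callable[[str, Optional[bytes]], str]


def describe_item(mllm: Mllm, image: bytes, text: str) -> str:
    return mllm(ITEM_PROMPT.format(text=text), image)


def describe_user(mllm: Mllm, item_descriptions: Sequence[str]) -> str:
    # Item descriptions are concatenated in a fixed order before generation.
    history = "\n".join(f"- {d}" for d in item_descriptions)
    return mllm(USER_PROMPT.format(history=history), None)


def toy_encode(text: str, dim: int = 8) -> List[float]:
    """Stand-in for the shared text encoder: hashed bag-of-words of fixed length."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


def fake_mllm(prompt: str, image: Optional[bytes]) -> str:
    """Placeholder generator so the sketch runs without a real model."""
    return "summary of: " + prompt.splitlines()[-1]


item_desc = describe_item(fake_mllm, b"", "red cotton shirt")
user_desc = describe_user(fake_mllm, [item_desc])
item_vec, user_vec = toy_encode(item_desc), toy_encode(user_desc)
```

Replacing `fake_mllm` and `toy_encode` with real model calls would leave the surrounding flow unchanged.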
[0038] Furthermore, in step S3, the step of constructing an item-item graph that integrates semantic similarity relationships and interaction co-occurrence relationships includes: calculating the semantic similarity relationship between any two items based on their initial semantic vectors; determining the interaction co-occurrence relationship between any two items from the historical interaction data; applying nearest-neighbor filtering and sparsification to the semantic similarity relationships and to the interaction co-occurrence relationships respectively, so as to retain the highly relevant neighbors of each item; and fusing the processed semantic similarity relationships and interaction co-occurrence relationships to obtain the item-item graph.
[0039] In step S3, the step of performing graph propagation on the initial semantic vectors of the items based on the item-item graph includes: using the initial semantic vector of each item as the initial representation of the corresponding node in the item-item graph; performing multi-layer neighborhood information aggregation based on the item-item graph to obtain the propagation result of each layer; aggregating the propagation results of all layers to obtain the item backbone representation; and mapping the item backbone representation to obtain the item backbone features, and mapping the user initial semantic vector to obtain the user representation.
[0040] Specifically, in the semantic similarity calculation, for any two items $i$ and $j$, the semantic similarity is calculated from their initial semantic vectors $e_i$ and $e_j$ and represented as:
$$s_{ij} = e_i^{\top} e_j$$
where $s_{ij}$ represents the semantic similarity between item $i$ and item $j$, $e_i$ and $e_j$ represent the initial semantic vectors of the corresponding items, and $\top$ indicates the transpose operation. This calculation measures the similarity between two items in the same semantic space.
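A minimal sketch of the pairwise similarity computation follows. It assumes that the initial semantic vectors are L2-normalized before the inner product is taken, which makes the score a cosine similarity; this normalization is an assumption for illustration and is not stated in the text.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def semantic_similarity(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return the n x n matrix of inner products between unit-length item vectors."""
    unit = [l2_normalize(v) for v in vectors]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


similarity = semantic_similarity([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
```

In the example, the first two items are orthogonal and the third lies halfway between them, so it is equally similar to both.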
[0041] In the calculation of the co-occurrence relationship, for any two items $i$ and $j$, the set $\mathcal{U}_i$ of users who have interacted with item $i$ and the set $\mathcal{U}_j$ of users who have interacted with item $j$ are obtained from the historical interaction data. The co-occurrence relationship between the two items is expressed as:
$$c_{ij} = |\mathcal{U}_i \cap \mathcal{U}_j|$$
where $c_{ij}$ represents the degree of co-occurrence between item $i$ and item $j$, $|\cdot|$ represents the number of elements in a set, and $\cap$ represents the intersection operation of sets. This calculation measures the degree of association between two items at the user behavior level.
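The co-occurrence statistic can be sketched as follows, assuming the historical interaction data is available as `(user, item)` pairs and that the degree of co-occurrence is the raw number of shared users; any further normalization would be an additional design choice not shown here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def cooccurrence(interactions: Iterable[Tuple[str, str]]) -> Dict[Tuple[str, str], int]:
    """Count, for each unordered item pair, the users who interacted with both items."""
    users_by_item: Dict[str, Set[str]] = defaultdict(set)
    for user, item in interactions:
        users_by_item[item].add(user)

    items = sorted(users_by_item)
    counts: Dict[Tuple[str, str], int] = {}
    for idx, a in enumerate(items):
        for b in items[idx + 1:]:
            shared = len(users_by_item[a] & users_by_item[b])  # size of the intersection
            if shared:
                counts[(a, b)] = shared
    return counts


log = [("u1", "A"), ("u1", "B"), ("u2", "A"), ("u2", "B"), ("u2", "C")]
pair_counts = cooccurrence(log)
```

The nested loop is quadratic in the number of items, which is acceptable for a sketch; a production system would typically iterate over each user's item list instead.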
[0042] During nearest-neighbor selection and sparsification, for each item $i$, its semantic similarity set $\{s_{ij}\}$ and its interaction co-occurrence set $\{c_{ij}\}$ are sorted, and the top $K$ neighboring items are selected as candidate neighbors, where $K$ is a preset neighbor count parameter. At the same time, relationships below a preset threshold are removed to reduce the impact of weak correlations on the graph structure. After filtering, the semantic similarity matrix $\hat{S}$ and the interaction co-occurrence matrix $\hat{C}$ are obtained. By fusing the semantic similarity matrix and the interaction co-occurrence matrix, the adjacency matrix of the item-item graph is obtained, represented as:
$$A = \hat{S} + \hat{C}$$
where $A$ represents the final adjacency matrix, and the element $a_{ij}$ of the matrix represents the connection weight between item $i$ and item $j$. This adjacency matrix describes the overall association relationships between items.
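The filtering and fusion described above can be sketched as follows. The sketch assumes dense square matrices given as nested lists and fuses the two filtered matrices by element-wise addition; the fusion rule and the function names `sparsify_top_k` and `fuse` are assumptions for illustration.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def sparsify_top_k(matrix: Sequence[Sequence[float]], k: int, threshold: float) -> Matrix:
    """Keep, per row, at most the k largest off-diagonal weights that reach the threshold."""
    n = len(matrix)
    out = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(matrix):
        candidates = [(w, j) for j, w in enumerate(row) if j != i and w >= threshold]
        candidates.sort(key=lambda pair: (-pair[0], pair[1]))  # strongest first, stable ties
        for w, j in candidates[:k]:
            out[i][j] = w
    return out


def fuse(semantic: Matrix, cooccur: Matrix) -> Matrix:
    """Hypothetical fusion rule: element-wise sum of the two filtered matrices."""
    return [[s + c for s, c in zip(rs, rc)] for rs, rc in zip(semantic, cooccur)]


sem = [[1.0, 0.9, 0.2], [0.9, 1.0, 0.4], [0.2, 0.4, 1.0]]
co = [[0.0, 2.0, 1.0], [2.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
sem_k = sparsify_top_k(sem, k=1, threshold=0.3)
co_k = sparsify_top_k(co, k=1, threshold=1.0)
adjacency = fuse(sem_k, co_k)
```

Because each row keeps its own top neighbors, the resulting adjacency matrix is not necessarily symmetric; symmetrizing it would be a further design choice.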
[0043] During graph propagation, the initial semantic vector of each item is used as the initial representation of its node, that is, the layer-0 representation is: $\mathbf{h}_i^{(0)} = \mathbf{e}_i$; where $\mathbf{h}_i^{(0)}$ represents the initial node representation of item i. Multi-layer neighborhood information aggregation is then performed based on the adjacency matrix. For the i-th node, the propagation from layer $l$ to layer $l+1$ is represented as: $\mathbf{h}_i^{(l+1)} = \sum_{j \in \mathcal{N}_i} \frac{a_{ij}}{d_i}\,\mathbf{h}_j^{(l)}$; where $\mathbf{h}_j^{(l)}$ represents the node representation at layer $l$, $\mathcal{N}_i$ represents the set of neighbors of item i, $a_{ij}$ represents the corresponding element of the adjacency matrix, and $d_i$ represents the degree of node i. This propagation aggregates information from neighboring nodes into the representation of the current node layer by layer. After the multi-layer propagation is complete, the results of all layers are aggregated to obtain the item backbone representation $\mathbf{h}_i$, where L represents the number of propagation layers.
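A minimal sketch of this layer-by-layer propagation follows. It assumes that the degree of a node is the sum of its adjacency weights and that the per-layer results are combined by a simple mean over layers 0 to L; neither choice is stated in the text, and the function name `propagate` is hypothetical.

```python
def propagate(adjacency, features, num_layers):
    """Degree-normalized neighborhood aggregation followed by a layer-wise mean."""
    n, dim = len(features), len(features[0])
    layers = [features]
    for _ in range(num_layers):
        previous = layers[-1]
        current = []
        for i in range(n):
            degree = sum(adjacency[i])  # assumption: weighted degree = row sum
            row = [0.0] * dim
            if degree > 0:
                for j in range(n):
                    weight = adjacency[i][j] / degree
                    if weight:
                        for t in range(dim):
                            row[t] += weight * previous[j][t]
            current.append(row)
        layers.append(current)
    # Assumption: aggregate the layers by averaging them.
    return [[sum(layer[i][t] for layer in layers) / len(layers)
             for t in range(dim)] for i in range(n)]


# Two mutually connected items swap information at every layer.
adjacency = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0]]
backbone = propagate(adjacency, features, num_layers=2)
```

In this toy case item 0 sees its own vector at layers 0 and 2 and its neighbor's vector at layer 1, so its backbone representation ends up two thirds its own and one third its neighbor's.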
[0044] During the mapping process, a linear transformation and a nonlinear activation are applied to the item backbone representation to obtain the item backbone features; the mapping is defined by weight parameters, bias parameters, and an activation function. The user representation is obtained by applying the same mapping method to the user's initial semantic vector. This ensures that user representations and item representations reside in a unified representation space.
[0045] Through the above processing, an item-item graph that integrates semantic similarity relationships and interaction co-occurrence relationships is constructed. Based on this graph, the item representation is propagated and aggregated over multiple layers, so that the item backbone representation contains both semantic information and structural association information, thus providing a unified representation basis for subsequent attribute semantic fusion and matching calculation.
[0046] Furthermore, in step S4, the step of completing fine-grained attribute semantic enhancement includes: inputting the image and text information of each item into MLLMs, and generating corresponding attribute semantic descriptions under the constraints of a decision-oriented attribute prompt template; the prompt template constrains the MLLMs to generate content centered on the decision-related attributes of the item, the decision-related attributes being attribute information that influences the user's choice; structuring and organizing the attribute semantic descriptions so that they are represented in a unified semantic expression form; encoding the structured attribute semantic descriptions with a text encoder to obtain the attribute semantic vector; the attribute semantic vector is fused with the item backbone representation in the subsequent graph-aware attribute fusion step.
[0047] Specifically, when generating attribute semantic descriptions, for any item i, its image information $x_i^{\mathrm{img}}$ and text information $x_i^{\mathrm{txt}}$ are combined according to a preset input format and passed as multimodal input to the MLLMs together with a preset prompt template $P_{\mathrm{attr}}$ to generate the attribute semantic description $D_i$, represented as: $D_i = \mathcal{M}(x_i^{\mathrm{img}}, x_i^{\mathrm{txt}}, P_{\mathrm{attr}})$; where $\mathcal{M}(\cdot)$ represents the generation function of the multimodal large language model, $P_{\mathrm{attr}}$ represents the prompt template for attribute semantic generation, and $D_i$ represents the output attribute semantic description text.
[0048] The prompt template consists of two parts: input constraints and output constraints. Input constraints require the model to use image and text information together for analysis, while output constraints restrict the output to attribute-related information expressed in a preset format. The preset format consists of combinations of attribute categories and corresponding attribute values, ensuring structural consistency in the attribute semantic descriptions across different items. During the structuring process, field parsing and standardization are performed on the attribute semantic description to convert the raw text into a structured set of attributes, represented as: $\mathcal{B}_i = \{(k_{i,j}, v_{i,j})\}_{j=1}^{m}$; where $\mathcal{B}_i$ represents the set of attributes of item i, $k_{i,j}$ represents the category of the j-th attribute, $v_{i,j}$ represents the corresponding attribute value, and m represents the number of attributes.
[0049] During the structuring process, attribute categories for different items are uniformly mapped, ensuring that attributes with the same semantic meaning use consistent category identifiers. Attribute values are then standardized, including removing redundant descriptions, standardizing unit representations, and eliminating duplicate information, resulting in a unified expression of attribute semantic descriptions. In the encoding stage, the structured attribute set $\mathcal{B}_i$ of item i is converted into a text sequence and input into a text encoder to obtain the attribute semantic vector $\mathbf{p}_i$, represented as: $\mathbf{p}_i = \mathrm{Enc}(\mathcal{B}_i) \in \mathbb{R}^{d}$; where $\mathrm{Enc}(\cdot)$ represents the text encoding function and d represents the vector dimension. During encoding, attribute categories and attribute values are jointly modeled, ensuring that the encoding result reflects both attribute category and value information. The encoded attribute semantic vector has the same dimension as the initial item semantic vector obtained in step S2, which guarantees the feasibility of the subsequent fusion operations.
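The structuring step can be illustrated with a small parser. This is a sketch under assumptions: the `category: value` output format separated by semicolons, the alias table `CATEGORY_ALIASES`, and the rule that the first value seen for a category wins are all hypothetical choices for the example, not details given in the text.

```python
# Hypothetical alias table mapping raw category names to unified identifiers.
CATEGORY_ALIASES = {
    "colour": "color",
    "color": "color",
    "fabric": "material",
    "material": "material",
}


def structure_attributes(description):
    """Parse 'category: value' fields into a sorted list of (category, value) pairs."""
    attributes = {}
    for part in description.replace("\n", ";").split(";"):
        if ":" not in part:
            continue  # drop text that is not an attribute field
        raw_category, raw_value = part.split(":", 1)
        key = raw_category.strip().lower()
        category = CATEGORY_ALIASES.get(key, key)
        value = " ".join(raw_value.strip().lower().split())  # collapse whitespace
        # Assumption: the first value seen for a category is kept, later ones dropped.
        if value and category not in attributes:
            attributes[category] = value
    return sorted(attributes.items())


def to_encoder_text(attribute_set):
    """Serialize the structured attributes into one text sequence for a text encoder."""
    return "; ".join(f"{category}: {value}" for category, value in attribute_set)


raw = "Colour: Navy  Blue; Fabric: cotton; color: navy blue; free text without a field"
attribute_set = structure_attributes(raw)
encoder_text = to_encoder_text(attribute_set)
```

The serialized string is what would be handed to the text encoder; the encoder itself is outside the scope of this sketch.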
[0050] Through the above processing, the obtained attribute semantic vector is used in the subsequent graph propagation and fusion process, so that the item representation contains fine-grained attribute information generated by the multimodal large language model, and participates in subsequent calculations in a unified structural form, thereby providing an input basis for subsequent representation fusion.
[0051] Furthermore, in step S5, the step of performing graph propagation on the attribute semantic vectors based on the item-item graph includes: using the attribute semantic vectors as the initial attribute representations of the nodes in the item-item graph; based on the adjacency relationships of the item-item graph, aggregating neighborhood information into the initial attribute representation of each node, so that each node integrates the attribute semantic information of its neighboring nodes; updating the attribute representation of each node layer by layer through multi-layer propagation to capture the high-order relationships between items; aggregating the attribute representations obtained from the multi-layer propagation to obtain the propagated attribute semantic vectors; the propagated attribute semantic vector characterizes the attribute semantic representation after fusing neighborhood information and is used in the subsequent graph-aware attribute fusion step.
[0052] In step S5, the step of completing graph-aware attribute fusion includes: performing random deactivation on the propagated attribute semantic vector to reduce the impact of noise in the attribute semantics; mapping the processed attribute semantic vector to the representation space corresponding to the item backbone representation to obtain attribute semantic features; normalizing the attribute semantic features and the item backbone representation respectively; performing weighted fusion of the normalized attribute semantic features and the normalized item backbone representation based on a preset fusion weight to obtain the final item representation.
[0053] Specifically, during attribute semantic graph propagation, for any item i, its attribute semantic vector $\mathbf{p}_i$ is used as the initial attribute representation, denoted as: $\mathbf{p}_i^{(0)} = \mathbf{p}_i$; where $\mathbf{p}_i^{(0)}$ represents the layer-0 attribute representation and $\mathbf{p}_i$ is the attribute semantic vector obtained in step S4. Based on the adjacency matrix A of the item-item graph, neighborhood information aggregation is performed on each node. The propagation from layer $l$ to layer $l+1$ is represented as: $\mathbf{p}_i^{(l+1)} = \sum_{j \in \mathcal{N}_i} \frac{a_{ij}}{d_i}\,\mathbf{p}_j^{(l)}$; where $\mathbf{p}_j^{(l)}$ represents the layer-$l$ attribute representation, $\mathcal{N}_i$ represents the set of neighbors of item i, $a_{ij}$ represents the corresponding element of the adjacency matrix, and $d_i$ represents the degree of node i. This process incorporates the attribute semantic information of neighboring nodes into the current node's representation. Through multi-layer propagation, the attribute representation is updated layer by layer, ensuring that each node's attribute representation includes multi-order attribute information within its neighborhood. After the multi-layer propagation is complete, the results of all layers are aggregated to obtain the propagated attribute semantic vector $\bar{\mathbf{p}}_i$, where L represents the number of propagation layers and $\bar{\mathbf{p}}_i$ represents the attribute semantic vector after fusing multi-order neighborhood information.
[0054] During attribute semantic processing, a random deactivation operation is performed on the propagated attribute semantic vector $\bar{\mathbf{p}}_i$, represented as: $\tilde{\mathbf{p}}_i = \mathrm{Dropout}(\bar{\mathbf{p}}_i)$; where $\mathrm{Dropout}(\cdot)$ represents the random deactivation function, which sets some dimensions of a vector to zero and thereby reduces the impact of anomalous attribute information on subsequent calculations, and $\tilde{\mathbf{p}}_i$ represents the processed attribute semantic vector.
[0055] During representation space alignment, the processed attribute semantic vector $\tilde{\mathbf{p}}_i$ is input into a mapping function to obtain the attribute semantic features, represented as: $\mathbf{q}_i = g(\tilde{\mathbf{p}}_i)$; where $g(\cdot)$ represents the mapping function and $\mathbf{q}_i$ represents the mapped attribute semantic features. The mapping function transforms the attribute semantic vector into a representation space consistent with the item backbone representation.
[0056] During normalization, the attribute semantic features $\mathbf{q}_i$ and the item backbone representation $\mathbf{h}_i$ are normalized separately, as follows: $\hat{\mathbf{q}}_i = \mathrm{Norm}(\mathbf{q}_i)$, $\hat{\mathbf{h}}_i = \mathrm{Norm}(\mathbf{h}_i)$; where $\hat{\mathbf{q}}_i$ and $\hat{\mathbf{h}}_i$ represent the normalized attribute semantic features and the normalized item backbone representation, respectively.
[0057] During fusion, the two normalized representations are weighted and combined to obtain the final item representation $\mathbf{x}_i$. The combination uses a preset fusion weight $\alpha$, which adjusts the proportion of the attribute semantic features in the fusion result.
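The four fusion stages (random deactivation, mapping, normalization, weighted combination) can be chained as in the sketch below. Several details are assumptions made for the example rather than taken from the text: inverted-scaling dropout, a single linear layer as the mapping function, L2 normalization, and a convex combination `(1 - alpha) * backbone + alpha * attribute` as the weighted fusion.

```python
import math
import random


def dropout(vector, rate, rng):
    """Zero each dimension with probability `rate` (rate < 1); rescale the survivors."""
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vector]


def linear_map(vector, weight, bias):
    """Assumed mapping function: one linear layer, weight given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weight, bias)]


def l2_normalize(vector):
    """Assumed normalization: divide by the Euclidean norm when it is non-zero."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def fuse(attribute_vector, backbone, weight, bias, alpha, rate, rng):
    """Dropout -> map -> normalize both sides -> weighted combination."""
    dropped = dropout(attribute_vector, rate, rng)
    attribute_feature = l2_normalize(linear_map(dropped, weight, bias))
    backbone_normalized = l2_normalize(backbone)
    # Assumption: convex combination controlled by the fusion weight alpha.
    return [(1.0 - alpha) * h + alpha * q
            for h, q in zip(backbone_normalized, attribute_feature)]


identity = [[1.0, 0.0], [0.0, 1.0]]
final_item = fuse([3.0, 4.0], [0.0, 2.0], identity, [0.0, 0.0],
                  alpha=0.25, rate=0.0, rng=random.Random(0))
masked = dropout([1.0] * 8, rate=0.5, rng=random.Random(1))
```

With the dropout rate set to zero the example is deterministic; during training a non-zero rate would be used and dropout would be disabled at inference time.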
[0058] Through the above processing, the attribute semantic information is propagated in the item-item graph and fused with the backbone representation, so that the final item representation contains both the original semantic information and the attribute semantic information after neighborhood propagation, thus providing a unified representation basis for subsequent consistency learning and matching calculation.
[0059] Furthermore, in step S6, the step of performing dual-view consistency learning based on the perturbation view includes: injecting random perturbations into the user representation and the final item representation respectively to construct a first perturbation view and a second perturbation view; taking the representations of the same user in the first and second perturbation views as user positive sample pairs, and the representations of different users as user negative sample pairs; taking the representations of the same item in the first and second perturbation views as item positive sample pairs, and the representations of different items as item negative sample pairs; performing consistency learning based on the user positive sample pairs, user negative sample pairs, item positive sample pairs, and item negative sample pairs, so that the representations of the same object stay close under different perturbation views and the representations of different objects remain distinguishable; the perturbation intensity applied to the user representation is less than the perturbation intensity applied to the final item representation.
[0060] In step S6, the step of jointly optimizing and training the multimodal recommendation model includes: constructing training samples consisting of users, positive sample items, and negative sample items based on the user representations and the final item representations; based on the training samples, making the matching degree between a user and a positive sample item higher than the matching degree between the user and a negative sample item; based on the dual-view consistency learning results, keeping the representation of the same user or the same item consistent across different perturbation views; weighting and combining the above training objectives to jointly optimize and train the multimodal recommendation model.
[0061] Specifically, in the process of constructing the perturbation view, random noise is injected into the representation vector of any user u and any item i to generate the perturbation view.
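[0062] For the item representation $\mathbf{x}_i$, the representation after perturbation is: $\mathbf{x}_i' = \mathbf{x}_i + \epsilon_{I}\,\boldsymbol{\delta}_i$; for the user representation $\mathbf{x}_u$, the representation after perturbation is: $\mathbf{x}_u' = \mathbf{x}_u + \epsilon_{U}\,\boldsymbol{\delta}_u$; where $\mathbf{x}_i$ represents the final item representation, $\mathbf{x}_u$ represents the user representation, $\boldsymbol{\delta}_i$ and $\boldsymbol{\delta}_u$ represent random noise vectors, and $\epsilon_{I}$ and $\epsilon_{U}$ represent the perturbation intensity parameters on the item side and the user side, respectively, satisfying $\epsilon_{U} < \epsilon_{I}$. The first perturbation view and the second perturbation view are both generated in this way.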
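A small sketch of how two perturbation views might be drawn is given below. It assumes additive noise whose direction is a Gaussian sample rescaled to unit length, so that the intensity parameter directly sets the size of the perturbation; the text only says that random noise is injected, so this noise distribution, the function names, and the intensity values are illustrative.

```python
import math
import random


def perturb(vector, intensity, rng):
    """Add a random unit-length noise vector scaled by the perturbation intensity."""
    noise = [rng.gauss(0.0, 1.0) for _ in vector]
    norm = math.sqrt(sum(n * n for n in noise)) or 1.0
    return [x + intensity * n / norm for x, n in zip(vector, noise)]


def two_views(vector, intensity, rng):
    """Draw two independent perturbations of the same representation."""
    return perturb(vector, intensity, rng), perturb(vector, intensity, rng)


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical intensities; the user side is perturbed less than the item side.
EPS_USER, EPS_ITEM = 0.05, 0.2
rng = random.Random(42)
user = [0.6, 0.8, 0.0]
item = [0.0, 1.0, 0.0]
user_view_1, user_view_2 = two_views(user, EPS_USER, rng)
item_view_1, item_view_2 = two_views(item, EPS_ITEM, rng)
```

Each view lies at a distance equal to its intensity from the unperturbed vector, which makes the constraint that the user side is perturbed less than the item side easy to check.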
[0063] During the construction of positive and negative sample pairs, for the same user u, its representations $\mathbf{x}_u'$ and $\mathbf{x}_u''$ under the two perturbation views form a user positive sample pair; for different users u and v, $\mathbf{x}_u'$ and $\mathbf{x}_v''$ form a user negative sample pair. On the item side, the representations of the same item under different perturbation views form an item positive sample pair, and the representations of different items form an item negative sample pair.
[0064] In the consistency learning process, a similarity-based contrastive learning approach is adopted to ensure that the representation of the same object remains similar under different perturbation views, while the representations of different objects remain distinguishable. For any user u, the consistency objective is represented as: $\mathcal{L}_{U} = -\log \frac{\exp(\mathrm{sim}(\mathbf{x}_u', \mathbf{x}_u'')/\tau)}{\sum_{v} \exp(\mathrm{sim}(\mathbf{x}_u', \mathbf{x}_v'')/\tau)}$; where $\mathcal{L}_{U}$ represents the consistency loss on the user side, $\mathrm{sim}(\cdot,\cdot)$ represents the similarity function, $\tau$ represents the temperature parameter, and $v$ indexes the users contrasted against u, each $v \neq u$ being a negative sample user. For the item side, the consistency loss is constructed in the same way, and the final consistency learning objective is the sum of the consistency losses from the user side and the item side.
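A temperature-scaled contrastive loss of this kind can be sketched as follows. The example assumes cosine similarity as the similarity function, treats every other user in the batch as a negative, and averages over users; these are common InfoNCE conventions chosen for illustration, not details confirmed by the text.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    numerator = sum(x * y for x, y in zip(a, b))
    denominator = (math.sqrt(sum(x * x for x in a))
                   * math.sqrt(sum(y * y for y in b)))
    return numerator / denominator if denominator > 0 else 0.0


def info_nce(view_a, view_b, temperature):
    """Mean contrastive loss: row u of view_a should match row u of view_b."""
    n = len(view_a)
    total = 0.0
    for u in range(n):
        logits = [cosine(view_a[u], view_b[v]) / temperature for v in range(n)]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[u]
    return total / n


users = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(users, users, temperature=0.2)
swapped_loss = info_nce(users, users[::-1], temperature=0.2)
```

The loss is small when the two views of each user agree and large when the views are paired with the wrong user, which is the behavior the consistency objective relies on. The item-side loss would reuse the same function on item representations.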
[0065] During joint optimization training, the matching degree is first calculated from the user representation and the item representation, as follows: $\hat{y}_{ui} = \mathbf{x}_u^{\top}\mathbf{x}_i$; where $\hat{y}_{ui}$ represents the degree of matching between user u and item i. For a triple (u, i, j) in the training samples, where i is a positive sample item and j is a negative sample item, the ranking objective function is constructed as: $\mathcal{L}_{\mathrm{rank}} = -\ln \sigma(\hat{y}_{ui} - \hat{y}_{uj})$; where $\sigma(\cdot)$ represents the Sigmoid function. The overall optimization objective is obtained as a weighted combination of the ranking objective and the consistency learning objective: $\mathcal{L} = \mathcal{L}_{\mathrm{rank}} + \lambda\,\mathcal{L}_{\mathrm{cl}}$; where $\mathcal{L}_{\mathrm{cl}}$ represents the consistency learning loss and $\lambda$ represents the weighting coefficient.
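The pairwise ranking term and its combination with the consistency term can be sketched as below. The inner-product matching degree and the single weighting coefficient are assumptions consistent with the description; the numeric value used for the consistency loss is a placeholder, and the function names are hypothetical.

```python
import math


def dot(a, b):
    """Inner product, used here as the assumed matching degree."""
    return sum(x * y for x, y in zip(a, b))


def ranking_loss(user, positive_item, negative_item):
    """-ln(sigmoid(y_ui - y_uj)), computed in a numerically stable form."""
    margin = dot(user, positive_item) - dot(user, negative_item)
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def joint_loss(rank_loss, consistency_loss, weight):
    """Weighted combination of the ranking and consistency objectives."""
    return rank_loss + weight * consistency_loss


user = [1.0, 0.0]
liked, disliked = [2.0, 0.0], [0.0, 1.0]
good_order = ranking_loss(user, liked, disliked)   # positive item scored higher
bad_order = ranking_loss(user, disliked, liked)    # positive item scored lower
total = joint_loss(good_order, consistency_loss=0.5, weight=0.1)
```

The ranking term shrinks as the positive item's matching degree exceeds the negative item's by a wider margin, while the weight controls how strongly the consistency term influences training.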
[0066] Through the above training process, the model ensures that the matching degree between users and positive sample items is higher than that between users and negative sample items, while keeping the representation of the same object consistent under different perturbation views, thus obtaining a multimodal recommendation model for recommendation tasks.
[0067] Furthermore, in step S7, the step of outputting recommendation results based on the matching degree includes: obtaining the historical interaction data of the user to be recommended; generating the initial semantic vector of the user to be recommended according to step S2, and generating the user representation of the user to be recommended with the multimodal recommendation model obtained in step S6; calculating the matching degree between the user to be recommended and the candidate items based on the final item representations obtained in step S5; sorting the candidate items according to their matching degree; outputting the items within the preset position range of the sorted result as the recommendation result.
[0068] Specifically, in the recommendation phase, for any user u to be recommended, the historical interaction data is first used to obtain the set of items that the user has interacted with in the past: $\mathcal{I}_u = \{\, i \mid r_{ui} = 1 \,\}$; where $r_{ui}$ represents the interaction relationship between user u and item i, and $\mathcal{I}_u$ represents the set of items the user has interacted with historically. Based on this set, the corresponding item semantic descriptions are input into the multimodal large language model (MLLMs) to generate a semantic description of user preferences. The user's initial semantic vector is then obtained through the text encoder, represented as: $\mathbf{e}_u = \mathrm{Enc}(D_u)$; where $D_u$ represents the semantic description of user preferences and $\mathrm{Enc}(\cdot)$ represents the text encoding function. The user's initial semantic vector is input into the trained multimodal recommendation model, and the user representation $\mathbf{x}_u$ is obtained after mapping. This representation lies in the same representation space as the final item representation obtained in step S5.
[0069] The candidate item set is $\mathcal{C}_u \subseteq \mathcal{I}$, where $\mathcal{I}$ represents the entire set of items. For each candidate item i in the set, the matching degree between user u and item i is calculated as: $\hat{y}_{ui} = \mathbf{x}_u^{\top}\mathbf{x}_i$; where $\hat{y}_{ui}$ represents the degree of matching between user u and item i, $\mathbf{x}_u$ represents the user representation, and $\mathbf{x}_i$ represents the final item representation. After the matching degree of all candidate items has been calculated, the candidate item set is sorted by matching degree from highest to lowest to obtain the sorted sequence: $\pi_u = \mathrm{sort}(\{\hat{y}_{ui}\}_{i \in \mathcal{C}_u})$; where $\pi_u$ represents the sorted sequence of items and $\mathrm{sort}(\cdot)$ represents the sorting operation.
[0070] In the output phase, items within a preset position range are selected from the sorted sequence as recommendations. The preset position range is limited by the parameter K, which means selecting the first K items in the sorted sequence, expressed as: $\mathcal{R}_u = \mathrm{Top}K(\pi_u)$; where $\mathcal{R}_u$ represents the final set of recommended items and K represents the preset number of recommendations. In a specific implementation, the candidate item set can be obtained by filtering out items that the user has already interacted with; the sorting is based on a numerical comparison of the matching degrees; and the recommendation results are output in the form of a set of item identifiers.
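The recommendation phase reduces to scoring, sorting, and truncating, as the following sketch shows. It assumes an inner-product matching degree and a dictionary from item identifier to final item representation; the function name `recommend` and the toy data are illustrative only.

```python
def dot(a, b):
    """Inner product, used here as the assumed matching degree."""
    return sum(x * y for x, y in zip(a, b))


def recommend(user_vector, item_vectors, interacted, k):
    """Return the identifiers of the k best-matching items not yet interacted with."""
    candidates = [item for item in item_vectors if item not in interacted]
    ranked = sorted(candidates,
                    key=lambda item: dot(user_vector, item_vectors[item]),
                    reverse=True)
    return ranked[:k]


item_vectors = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
    "d": [0.2, 0.2],
}
recommended = recommend([1.0, 0.5], item_vectors, interacted={"c"}, k=2)
```

Item "c" has the highest matching degree but is excluded because the user has already interacted with it, so the top two remaining items are returned in descending order of score.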
[0071] Through the above steps, a complete calculation process from user input to recommendation result output is realized, so that the recommendation result is calculated from the user representation and item representation in the unified representation space and output based on the ranking rules.
[0072] Finally, it should be noted that the above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some of the technical features. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A multimodal recommendation method based on fine-grained attribute semantic enhancement using MLLMs, characterized in that it includes the following steps: S1. Obtain the user set, the item set, the historical interaction data between users and items, and the image and text information corresponding to each item; S2. Input the image and text information of each item into a multimodal large language model (MLLMs) and generate corresponding semantic descriptions of the items under the constraints of a preset prompt template; input the semantic descriptions of the items in the user's historical interactions into the MLLMs to generate user preference semantic descriptions; encode the semantic descriptions of the items and the user preference semantic descriptions to obtain the initial semantic vectors of the items and the initial semantic vectors of the users, thus completing the multimodal semantic encoding; S3. Based on the initial semantic vector of the item and the historical interaction data, construct an item-item graph that integrates semantic similarity relationships and interaction co-occurrence relationships; and perform graph propagation on the initial semantic vector of the item based on the item-item graph to obtain the item backbone representation; S4. Input the image and text information of each item into MLLMs, and generate corresponding attribute semantic descriptions under the constraints of a decision-oriented attribute prompt template; encode the attribute semantic descriptions to obtain attribute semantic vectors, thus completing fine-grained attribute semantic enhancement; S5. Based on the item-item graph, perform graph propagation on the attribute semantic vector; after processing the propagated attribute semantic vector, align it with the item backbone representation and fuse the two by weighting to obtain the final item representation, thus completing graph-aware attribute fusion; S6. Construct a perturbation view based on the user's initial semantic vector and the final item representation, and perform dual-view consistency learning based on the perturbation view; construct training samples based on users, positive sample items, and negative sample items, and jointly optimize and train the multimodal recommendation model; S7. Based on the trained multimodal recommendation model, input the user to be recommended, calculate the matching degree between the user to be recommended and the candidate items, and output the recommendation result according to the matching degree.
2. The multimodal recommendation method based on fine-grained attribute semantic enhancement according to claim 1, characterized in that, in step S2, the step of completing multimodal semantic encoding includes: inputting the image and text information of each item into MLLMs, and generating the corresponding semantic description of the item under the constraints of the preset prompt template; for each user, summarizing the semantic descriptions of the items in the user's historical interactions, inputting them into MLLMs, and generating the corresponding user preference semantic description under the constraints of a preset prompt template; encoding the semantic descriptions of the items and the user preference semantic descriptions with a unified text encoder to obtain initial semantic vectors of items and users with consistent dimensions.
3. The multimodal recommendation method based on fine-grained attribute semantic enhancement according to claim 1, characterized in that, in step S3, the step of constructing an item-item graph that integrates semantic similarity relationships and interaction co-occurrence relationships includes: calculating the semantic similarity relationship between any two items based on their initial semantic vectors; determining the interaction co-occurrence relationship between any two items by statistics over the historical interaction data; performing nearest neighbor filtering and sparsification on the semantic similarity relationships and the interaction co-occurrence relationships respectively to retain the highly relevant neighbors of each item; fusing the processed semantic similarity relationships and interaction co-occurrence relationships to obtain the item-item graph.
4. The multimodal recommendation method based on fine-grained attribute semantic enhancement according to claim 1, characterized in that, in step S3, the step of performing graph propagation on the initial semantic vector of the item based on the item-item graph includes: using the initial semantic vector of the item as the initial representation of each node in the item-item graph; performing multi-layer neighborhood information aggregation based on the item-item graph to obtain the propagation results at each layer; aggregating the propagation results from each layer to obtain the item backbone representation; mapping the item backbone representation to obtain item backbone features, and mapping the user initial semantic vector to obtain the user representation.
5. The multimodal recommendation method based on fine-grained attribute semantic enhancement according to claim 1, characterized in that, in step S4, the step of completing fine-grained attribute semantic enhancement includes: inputting the image and text information of each item into MLLMs, and generating corresponding attribute semantic descriptions under the constraints of a decision-oriented attribute prompt template; the prompt template constrains the MLLMs to generate content centered on the decision-related attributes of an item, the decision-related attributes being attribute information that influences user choices; structuring and organizing the attribute semantic descriptions so that they are represented in a unified semantic expression form; encoding the structured attribute semantic descriptions with a text encoder to obtain the attribute semantic vector; the attribute semantic vector is fused with the item backbone representation in the subsequent graph-aware attribute fusion step.
6. The multimodal recommendation method based on fine-grained attribute semantic enhancement according to claim 1, characterized in that, in step S5, the step of performing graph propagation on the attribute semantic vector based on the item-item graph includes: using the attribute semantic vector as the initial attribute representation of each node in the item-item graph; based on the adjacency relationships of the item-item graph, aggregating neighborhood information into the initial attribute representation of each node, so that each node integrates the attribute semantic information of its neighboring nodes; updating the attribute representation of each node layer by layer through multi-layer propagation to capture the high-order relationships between items; aggregating the attribute representations obtained from the multi-layer propagation to obtain the propagated attribute semantic vectors; the propagated attribute semantic vector characterizes the attribute semantic representation after fusing neighborhood information and is used in the subsequent graph-aware attribute fusion step.
7. The multimodal recommendation method based on fine-grained attribute semantic enhancement according to claim 1, characterized in that, in step S5, the step of completing graph-aware attribute fusion includes: performing random deactivation on the propagated attribute semantic vector to reduce the impact of noise in the attribute semantics; mapping the processed attribute semantic vector to the representation space corresponding to the item backbone representation to obtain attribute semantic features; normalizing the attribute semantic features and the item backbone representation respectively; performing weighted fusion of the normalized attribute semantic features and the normalized item backbone representation based on a preset fusion weight to obtain the final item representation.
8. The multimodal recommendation method based on fine-grained attribute semantic enhancement according to claim 1, characterized in that, in step S6, the step of performing dual-view consistency learning based on the perturbation view includes: injecting random perturbations into the user representation and the final item representation respectively to construct a first perturbation view and a second perturbation view; taking the representations of the same user in the first and second perturbation views as user positive sample pairs, and the representations of different users as user negative sample pairs; taking the representations of the same item in the first and second perturbation views as item positive sample pairs, and the representations of different items as item negative sample pairs; performing consistency learning based on the user positive sample pairs, user negative sample pairs, item positive sample pairs, and item negative sample pairs, so that the representations of the same object stay close under different perturbation views and the representations of different objects remain distinguishable; the perturbation intensity applied to the user representation is less than the perturbation intensity applied to the final item representation.
9. The multimodal recommendation method based on fine-grained attribute semantic enhancement according to claim 1, characterized in that, in step S6, the step of jointly optimizing and training the multimodal recommendation model includes: constructing training samples consisting of users, positive sample items, and negative sample items based on the user representations and the final item representations; based on the training samples, making the matching degree between a user and a positive sample item higher than the matching degree between the user and a negative sample item; based on the dual-view consistency learning results, keeping the representation of the same user or the same item consistent under different perturbation views; weighting and combining the above training objectives to jointly optimize and train the multimodal recommendation model.
10. The multimodal recommendation method based on fine-grained attribute semantic enhancement according to claim 1, characterized in that, in step S7, the step of outputting recommendation results based on the matching degree includes: obtaining the historical interaction data of the user to be recommended; generating the initial semantic vector of the user to be recommended according to step S2, and generating the user representation of the user to be recommended with the multimodal recommendation model obtained in step S6; calculating the matching degree between the user to be recommended and the candidate items based on the final item representations obtained in step S5; sorting the candidate items according to their matching degree; outputting the items within the preset position range of the sorted result as the recommendation result.