Multi-behavior multi-interest generative recommendation method and system based on multi-source item embedding and hybrid tokenization
By employing a multi-source item embedding and hybrid tokenization approach, user information from multiple sources is mapped to a unified embedding space. Semantically consistent discrete item token sequences are generated by a residual quantization variational autoencoder (RQ-VAE). This approach addresses the shortcomings of existing generative recommendation systems in multimodal understanding and in cold-start and long-tail recommendation, enabling explicit modeling of user interests and effective utilization of behavioral signals, thereby enhancing the diversity and granularity of recommendations.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- MACAO POLYTECHNIC INST
- Filing Date
- 2026-04-17
- Publication Date
- 2026-06-19
AI Technical Summary
Existing generative recommendation systems have shortcomings in multimodal understanding and semantic control. They cannot fully consider users' multi-level interests, struggle to achieve cross-modal semantic understanding and generation, and perform poorly in cold start and long-tail item recommendations.
By employing a multi-source item embedding and hybrid tokenization approach, user information from multiple sources is mapped to a unified embedding space. Semantically consistent discrete item token sequences are generated using a residual quantization variational autoencoder. Combined with multi-interest modeling and behavioral tokenization, heterogeneous token sequences are constructed for generative recommendations.
It significantly improves the diversity, granularity, and interpretability of recommendation results, enhances the representation of long-tail and cold-start items, and realizes explicit modeling of user interests and effective utilization of behavioral signals.
Figure CN122240933A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of recommendation systems and generative modeling technology, and in particular to a multi-behavior, multi-interest generative recommendation method and system based on multi-source item embedding and hybrid tokenization.
Background Technology
[0002] In recent years, generative recommendation has become an important research direction in intelligent recommendation systems. This type of method transforms the recommendation task into a "sequence-to-sequence" generation problem, enabling models to directly generate target item tokens (ID tokens) within a language modeling framework, achieving an end-to-end mapping from user history to recommendation results. Representative works include P5, TIGER, and VIP5, which are based on pre-trained language model (PLM) or quantized autoencoder (VQ-VAE) frameworks and achieve unified modeling of recommendation and language generation tasks. However, existing generative recommendation techniques still have significant shortcomings in multimodal understanding and semantic control:
1. Limited semantic target generation: Most methods use only the "next item" or "item ID" as the sole generation target, making it difficult to explicitly output the user's interest state or express the impact of behavioral differences (such as browsing versus purchasing) on the recommendation intent. This single target prevents the recommendation system from fully considering the user's multi-layered interests, thus limiting the accuracy and personalization of the recommendation results.
2. Semantic fragmentation across multiple sources: Existing generative recommendation technologies typically use vector concatenation or simple fusion as input when processing information from multiple sources, lacking unified, generatable discrete semantic units. This leads to semantic inconsistencies across sources, and features from different sources are not fused tightly enough, making it difficult to achieve true cross-modal semantic understanding and generation.
[0003] 3. Insufficient representation of multiple interests: Traditional recommender systems typically embed user interests in a single user vector. This representation method cannot effectively capture users' diverse interests, especially when user interests change over time and context, making it difficult for the system to respond flexibly. This lack of parallel interest modeling limits the diversity and interpretability of recommender systems, making it difficult to accurately recommend items that satisfy users' potential interests.
[0004] 4. Insufficient Utilization of Multi-Action Signals: Many generative recommendation methods fail to effectively utilize the structural information of multi-action sequences, typically focusing only on a single action (such as clicking or purchasing). However, user actions contain a wealth of potential interest information, such as browsing, adding to cart, and favorites, which can provide richer context for recommendation models. Existing technologies do not incorporate action types into the generation objective, resulting in the inability to reflect the causal structure of "predicting actions first and then items" during the generation process.
[0005] 5. Insufficient Generalization for Long-Tail and Cold-Start Issues: Existing generative recommendation models perform poorly when faced with long-tail items or new merchants and dishes. The main reason is that these models rely on historical interaction frequency or high-frequency item IDs for their recommendations, neglecting semantic content-based generalization. When new items lack sufficient historical interaction data, their representation in the generative space is sparse, resulting in insufficient coverage and diversity of recommendations, making it difficult to effectively alleviate the cold-start and long-tail problems.
[0006] To alleviate the problems of "difficulty in sharing cross-modal information, cold start, and difficulty in generalizing to long tails," existing research has mainly proposed the following types of enhancement methods:
1. Multimodal Quantitative Language (MQL4GRec) Research. Zhai Jianyang et al. proposed the MQL4GRec model in "Multimodal Quantitative Language for Generative Recommendation." Its core idea is to use a quantitative translator to map text and image modalities to a shared discrete dictionary space, enabling items from different modalities to be represented through a unified "quantitative language." Building on this, a pre-training and fine-tuning paradigm is used to achieve cross-modal and cross-domain knowledge transfer, significantly improving the multimodal representation and generalization capabilities of generative recommendation models.
[0007] 2. Multi-Identifier Quantization: Zheng Bowen et al. proposed the MTGRec model in their paper "Pre-training Generative Recommender with Multi-Identifier Item Tokenization". This model constructs a tokenizer based on a residual quantization variational autoencoder (RQ-VAE) to generate multiple discrete identifiers for the same item, describing the structural and semantic features of the item in a multi-token manner. Furthermore, it improves the model's robustness and recognizability in cold-start and long-tail item scenarios through curriculum learning and multi-stage pre-training.
[0008] 3. Combining Collaborative Semantic Alignment with Large Language Models. Zheng Bowen et al. proposed the LC-Rec model in "Adapting Large Language Models by Integrating Collaborative Semantics for Recommendation." This method introduces collaborative semantics specific to the recommendation scenario into the large language model through collaborative semantic alignment and vector quantization mapping, enabling the model to directly generate target items in a unified semantic space, thereby narrowing the semantic gap between language modeling and recommendation tasks. However, the alignment mechanism of LC-Rec relies on additional vector mapping and post-tuning, and the expression gap between collaborative features and linguistic features still exists, failing to completely solve the shortcomings of business semantic modeling.
[0009] In summary, although current generative recommendation systems have made significant progress in unified modeling, cross-modal fusion, and pre-training fine-tuning frameworks, existing methods still have the following main problems: 1. Limited semantic modeling and insufficient business controllability: Most existing methods only use item IDs as the generation target, and the optimization objective is mainly to maximize the generated sequence likelihood (MLE), lacking explicit modeling of specific business semantics (such as product category, price range, user needs, etc.). This results in the model being unable to directly control the semantics of the recommendation results, and usually relies on post-rule adjustments, which can easily lead to semantic drift and inconsistent recommendation results.
[0010] 2. Insufficient multimodal knowledge fusion and a lack of modality alignment mechanisms; although MQL4GRec attempts to map text and image modalities to a shared vocabulary through a quantization translator, its semantic space still relies on a single quantization layer, making it difficult to represent multi-source features (such as geographical location, price range, etc.) in detail. While MTGRec introduces multi-identifier quantization, it lacks sufficient fusion of structured and collaborative information, and its cross-modal representation lacks unified constraints, affecting the effective integration of multimodal information.
[0011] 3. Limited fusion of collaborative semantics and linguistic semantics; LC-Rec introduces collaborative semantic alignment into a large language model, but its alignment mechanism relies on additional vector mapping and post-tuning, and the expression gap between collaborative features and linguistic features still exists. Furthermore, this method does not incorporate business semantics into the generation objective, resulting in a lack of business controllability in the model's recommendation results.
[0012] 4. The cold start and long tail problems have not been systematically addressed; existing generative recommendation models perform poorly in addressing these issues, primarily because generation relies on historical interaction frequency or high-frequency item IDs, rather than generalizing based on the semantic content of the items. For cold-start items, the representation of the generative model is sparse, resulting in insufficient recommendation coverage and diversity.
[0013] 5. The generation process is disconnected from business objectives, and the inference phase relies on rule-based reranking for compensation. Existing methods mostly use maximum likelihood estimation (MLE) for training, and the training objectives are inconsistent with the platform's business objectives such as diversity, delivery timeliness, and compliance. Therefore, the inference phase often relies on rule-based reranking for secondary screening, which not only increases computational costs and latency but also reduces the system's end-to-end consistency and real-time performance.
Summary of the Invention
[0014] This invention provides a multi-behavior, multi-interest generative recommendation method and system based on multi-source item embedding and hybrid tokenization, to solve the following problems in existing generative recommendation: multi-source information is difficult to align in a unified embedding space, multi-behavior signals are difficult to serialize and express uniformly, and user interests are difficult to model explicitly and to involve in the generation process.
[0015] To solve the above-mentioned technical problems, the technical solution adopted by the present invention is as follows: a multi-behavior, multi-interest generative recommendation method based on multi-source item embedding and hybrid tokenization, which includes the following steps:
1) Collect multi-source information on user-related items to construct a multi-source dataset. The multi-source information of an item includes at least text information, image information, structured attribute information, and collaborative interaction information. The collaborative interaction information includes historical interaction data and interaction behavior data; the historical interaction data includes at least the user identifier, interaction time, and item identifier; the interaction behavior data includes at least the behavior type;
2) Preprocess the multi-source dataset to obtain a multi-source input set in a unified format;
3) Perform feature encoding on the text information input, image information input, structured attribute information input, and collaborative interaction information input in the multi-source input set to obtain continuous item embedding vectors for the different sources, project the continuous item embedding vectors onto a unified embedding space, and then perform learnable weighted fusion to obtain a unified item semantic vector;
4) Perform multi-level residual quantization on the unified item semantic vector through a residual quantization variational autoencoder to output an initial item tag sequence composed of multiple discrete tags;
5) Use the initial item tag sequence to construct multiple groups of semantically related item tag sequences through a hybrid tokenization mechanism, and select the best one as the final item tag sequence. Specifically: through the hybrid tokenization mechanism, multiple model checkpoints of the residual quantization variational autoencoder from different training epochs are extracted as semantically related tokenizers to construct multiple groups of semantically related item tag sequences with different representations for the same initial item tag sequence. In the pre-training stage of the generative sequence model, the influence of each group of item tag sequences on the model loss gradient is evaluated, and the sampling probability of each group is dynamically adjusted according to that influence for model training. In the fine-tuning and inference stages, the optimal target tokenizer is selected from the multiple tokenizers and fixed, and it outputs the final item tag sequence;
6) Perform multi-interest modeling on the historical interaction data in the collaborative interaction information to generate interest tags;
7) Perform behavior tokenization on the interaction behavior data in the collaborative interaction information to generate behavior tags;
8) Based on the preset sequence concatenation and ordering rules, starting with the discrete tag corresponding to the user identifier, concatenate the behavior tags, interest tags, and final item tags in order to construct a heterogeneous tag sequence, which is used as the training sample for the generative sequence model;
9) Input the heterogeneous tag sequence into the encoder of the pre-trained generative sequence model to obtain the context sequence feature matrix;
10) Use the user identifier, either alone or concatenated with behavior and interest tags, as a guiding prefix and input it into the decoder of the generative sequence model to obtain the current sequence hidden state. The decoder then calculates the attention weights based on the current sequence hidden state and the context sequence feature matrix and generates candidate sequences through autoregressive beam search. After prefix truncation, the target item tag sequence is obtained; finally, it is decoded and mapped to the recommended items, which are output as the recommendation result.
[0016] Preferably, in step 2, the preprocessing of the multi-source dataset includes:
2.1) The text information is cleaned, segmented into words or subwords, truncated, and standardized for embedding encoding to obtain the processed text information;
2.2) After the image information is resized to a uniform size and normalized, a visual encoding network is used for feature extraction to obtain the processed image information;
2.3) The structured attribute information is discretized or normalized and its missing values are handled to obtain the processed structured attribute information;
2.4) Missing value filling, outlier removal, and sparsification are applied to the collaborative interaction information to obtain the processed collaborative interaction information.
Furthermore, in step 2, the multi-source input set in a unified format is represented as $X = \{x^{(\mathrm{text})}, x^{(\mathrm{img})}, x^{(\mathrm{attr})}, x^{(\mathrm{cf})}\}$, where $x^{(\mathrm{text})}$ is the text information input, $x^{(\mathrm{img})}$ is the image information input, $x^{(\mathrm{attr})}$ is the structured attribute information input, and $x^{(\mathrm{cf})}$ is the collaborative interaction information input.
[0017] Preferably, in step 3, the continuous item embedding vector is computed and then projected and aligned for each source, that is:
$e^{(s)} = f^{(s)}(x^{(s)})$; $\tilde{e}^{(s)} = \mathrm{Proj}^{(s)}(e^{(s)})$;
where $x^{(s)}$ is the input from a given source $s$, $f^{(s)}$ is the encoder for that source, $e^{(s)}$ is the continuous item embedding vector, $\mathrm{Proj}^{(s)}$ is the projection network, and $\tilde{e}^{(s)}$ is the continuous item embedding vector located in the unified embedding space.
In step 3, the continuous item embedding vectors located in the unified embedding space are combined by learnable weighted fusion to obtain the unified item semantic vector, namely:
$\alpha_s = \frac{\exp(q^\top W \tilde{e}^{(s)})}{\sum_{s' \in S} \exp(q^\top W \tilde{e}^{(s')})}$; $h_v = \sum_{s \in S} \alpha_s \tilde{e}^{(s)}$;
where $h_v$ is the unified item semantic vector, $\tilde{e}^{(s)}$ is the continuous item embedding vector located in the unified embedding space, $\alpha_s$ is the source weight, $q$ is the query vector, $W$ is a learnable matrix, and $S$ is the source set.
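The projection and learnable weighted fusion of step 3 can be illustrated with a minimal NumPy sketch. It assumes an attention-style score `q^T W e` normalized by a softmax over sources, and linear projection matrices standing in for the projection networks; the function name `fuse_sources` and all shapes are hypothetical and not taken from the patent.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def fuse_sources(source_embs, projections, q, W):
    """Project per-source embeddings into one space and fuse them.

    source_embs: dict source name -> 1-D embedding (dimensions may differ)
    projections: dict source name -> (d, d_s) matrix, a linear stand-in
                 for the per-source projection network (assumption)
    q: (d,) query vector; W: (d, d) learnable matrix
    Returns the fused vector h_v and the per-source weights alpha_s.
    """
    names = sorted(source_embs)
    # Aligned embeddings in the unified space, one row per source.
    aligned = np.stack([projections[s] @ source_embs[s] for s in names])
    # Score of each source: q^T W e_s (assumed scoring form).
    scores = aligned @ W.T @ q
    alpha = softmax(scores)
    h_v = alpha @ aligned  # weighted sum over sources
    return h_v, dict(zip(names, alpha))
```

In a trained system `q`, `W`, and the projections would be learned jointly with the rest of the model; here they are plain arrays so the data flow is easy to follow.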
[0018] Preferably, the multi-level residual quantization process in step 4 is as follows:
4.1) Initialize the residual vector: $r^{(1)} = h_v$;
4.2) At level $\ell$, select from codebook $E_\ell$ the index of the codebook vector that best matches the current residual to obtain the $\ell$-th level discrete tag $c_{v,\ell}$: $c_{v,\ell} = \arg\min_{j} \lVert r^{(\ell)} - E_\ell[j] \rVert_2^2$;
4.3) Update the residual vector: $r^{(\ell+1)} = r^{(\ell)} - E_\ell[c_{v,\ell}]$;
4.4) Output the final item tag sequence $[c_{v,1}, \dots, c_{v,m}]$.
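The multi-level residual quantization of step 4 can be sketched as follows. This is an illustration only: it assumes squared Euclidean distance as the matching criterion and fixed, already-trained codebooks, and it omits the encoder, decoder, and training losses of a full RQ-VAE. The names `residual_quantize` and `reconstruct` are hypothetical.

```python
import numpy as np

def residual_quantize(h_v, codebooks):
    """Quantize h_v level by level.

    codebooks: list of (codebook_size, d) arrays, one per level.
    Returns the list of code indices and the final residual.
    """
    residual = np.asarray(h_v, dtype=float).copy()  # r^(1) = h_v
    codes = []
    for E in codebooks:
        # Pick the codebook vector closest to the current residual.
        dists = np.sum((E - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        # Subtract the chosen vector; the next level quantizes what is left.
        residual = residual - E[idx]
    return codes, residual

def reconstruct(codes, codebooks):
    """Approximate h_v as the sum of the selected codebook vectors."""
    return sum(E[c] for E, c in zip(codebooks, codes))
```

Because each level only encodes what earlier levels missed, the sum of the selected vectors plus the final residual equals the original vector exactly.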
[0019] Preferably, in step 5, the process of generating the final item tag sequence is as follows:
5.1) Extract multiple model checkpoints of the residual quantization variational autoencoder at different training epochs as a set of semantically related tokenizers $T = \{T_1, \dots, T_M\}$. The set of tokenizers is used to construct, for the same initial item tag sequence, $M$ groups of item tag sequences that differ in representation but are semantically related;
5.2) In the pre-training phase of the generative sequence model, the influence score of each group of item tag sequences on model optimization is evaluated by calculating the inner product of the model's loss gradient on the validation set and its training loss gradient on that group of item tag sequences. The sampling probability of each group in the next round of training is then dynamically adjusted based on the influence score, using a normalization function with a temperature coefficient and a momentum update strategy;
5.3) During the fine-tuning and inference phases, the checkpoint in the tokenizer set $T$ with the highest sampling probability at the end of pre-training, or with the smallest reconstruction error on an independent validation set, is selected and fixed as the best-performing target tokenizer, and the final item tag sequence is output by the target tokenizer.
[0020] Regarding the selection of the optimal target tokenizer in step 5.3: after pre-training, the fine-tuning and inference phases no longer require mixed sampling over multiple groups of sequences, and the system consistently uses the best-performing target tokenizer. In practice, the "optimal target tokenizer" is typically the tokenizer assigned the highest sampling probability $p_m$ at the end of pre-training, or the model checkpoint from the final epochs with the smallest reconstruction error on an independent validation set. Fixing a single tokenizer ensures that item identifiers are strictly unique and accurate when the final recommendation is decoded.
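The gradient-influence-based adjustment of sampling probabilities in step 5.2 and the fixed selection in step 5.3 can be sketched as below. The patent does not give the exact update rule, so this sketch assumes a temperature softmax over the influence scores blended with the previous probabilities by an exponential moving average; `tau`, `momentum`, and both function names are hypothetical.

```python
import numpy as np

def update_sampling_probs(p_old, val_grad, train_grads, tau=1.0, momentum=0.9):
    """Update per-tokenizer sampling probabilities.

    val_grad: flattened validation-loss gradient
    train_grads: list of flattened training-loss gradients, one per
                 group of item token sequences (one per tokenizer)
    """
    # Influence score: inner product of validation and training gradients.
    scores = np.array([float(np.dot(val_grad, g)) for g in train_grads])
    # Temperature-scaled softmax turns scores into a target distribution.
    z = scores / tau
    z = z - z.max()
    target = np.exp(z) / np.exp(z).sum()
    # Momentum update (assumed form): move slowly toward the target.
    p_new = momentum * np.asarray(p_old, dtype=float) + (1.0 - momentum) * target
    return p_new / p_new.sum(), scores

def select_target_tokenizer(p_final):
    """Step 5.3 variant: pick the tokenizer with the highest final probability."""
    return int(np.argmax(p_final))
```

A tokenizer whose training gradient points in the same direction as the validation gradient gets sampled more often in the next round, while the momentum term keeps the schedule from oscillating.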
[0021] Preferably, in step 6, the process of generating the interest tag is as follows:
6.1) Map each interaction representation to a lower-level capsule prediction vector: $u_{i,k} = W_k x_i$; where $u_{i,k}$ is the prediction vector of the lower-level capsule, $W_k$ is a learnable mapping matrix (or a shared matrix), and $x_i$ is the representation vector of the $i$-th interaction in the user's historical interaction data.
[0022] 6.2) Initialize the routing logits and perform dynamic routing iterations, calculating the routing coefficients in each iteration: $w_{i,k} = \frac{\exp(b_{i,k})}{\sum_{k'=1}^{K} \exp(b_{i,k'})}$; where $w_{i,k}$ is the routing coefficient and $b_{i,k}$ is the routing logit, initialized to zero;
6.3) Perform weighted aggregation of the lower-level capsule prediction vectors $u_{i,k}$ to obtain the input of the $k$-th interest slot, i.e. $s_k = \sum_i w_{i,k} u_{i,k}$;
6.4) Apply a nonlinear compression (squash) function to the result of step 6.3 to obtain the interest vector of the $k$-th interest slot, i.e. $o_k = \mathrm{squash}(s_k) = \frac{\lVert s_k \rVert^2}{1 + \lVert s_k \rVert^2} \cdot \frac{s_k}{\lVert s_k \rVert}$;
6.5) Update the routing logits $b_{i,k}$ based on agreement, i.e. $b_{i,k} \leftarrow b_{i,k} + u_{i,k}^\top o_k$;
6.6) Finally, obtain the $K$ interest slot vectors $\{o_k\}_{k=1}^{K}$ and the routing coefficients $\{w_{i,k}\}$;
6.7) Based on the routing coefficients $w_{i,k}$, perform explicit interest assignment for each interaction to obtain the interest slot index $z_i$ corresponding to the $i$-th interaction, i.e. $z_i = \arg\max_k w_{i,k}$;
6.8) Map the interest slot index $z_i$ to a discrete interest tag token $z_t(z_i)$ via a lookup table, where $z_t(\cdot)$ is a mapping table from interest slot index to token ID and $K$ is a preset positive integer.
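Steps 6.1 to 6.7 follow the general pattern of capsule-style dynamic routing, which can be sketched in NumPy as below. This is a sketch under assumptions: one mapping matrix per interest slot, the standard squash nonlinearity, and a fixed number of routing iterations; the function names and tensor shapes are hypothetical.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Shrink each row to length < 1 while keeping its direction."""
    norm_sq = np.sum(s * s, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

def extract_interests(X, W, n_iters=3):
    """Dynamic routing from n interactions to K interest slots.

    X: (n, d_in) interaction representations
    W: (K, d_out, d_in) per-slot mapping matrices
    Returns interest vectors (K, d_out), routing coefficients (n, K),
    and the hard slot assignment of each interaction (n,).
    """
    # Prediction vectors u[i, k] = W[k] @ X[i].
    U = np.einsum("kod,nd->nko", W, X)
    n, K, _ = U.shape
    b = np.zeros((n, K))  # routing logits start at zero
    for _ in range(n_iters):
        # Routing coefficients: softmax of the logits over slots.
        e = np.exp(b - b.max(axis=1, keepdims=True))
        coef = e / e.sum(axis=1, keepdims=True)
        # Weighted aggregation per slot, then squash.
        s = np.einsum("nk,nko->ko", coef, U)
        v = squash(s)
        # Agreement update: raise logits where prediction matches the slot.
        b = b + np.einsum("nko,ko->nk", U, v)
    # Explicit assignment: each interaction goes to its strongest slot.
    z = coef.argmax(axis=1)
    return v, coef, z
```

The returned slot indices can then be turned into interest tokens with any lookup table, for example a list of K reserved token IDs.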
[0023] Preferably, the interaction behavior data includes at least behavior types (including but not limited to browsing, clicking, adding to favorites, adding to cart, and purchasing). In step 7, the behavior tokenization process includes: based on the set of behavior types $B = \{b_1, \dots, b_{|B|}\}$, a unique behavior tag is established for each behavior type, that is: $b_t(b_i) = \mathrm{BehaviorVocab}(b_i)$; where $b_i$ is the behavior type of the $i$-th interaction, $b_t(b_i)$ is the discrete behavior tag corresponding to this behavior type, and BehaviorVocab is a mapping table from behavior type to tag.
[0024] Preferably, the process of generating the heterogeneous tag sequence in step 8 is as follows: define the concatenation and ordering rules for user identifiers, behaviors, interests, and items, and then concatenate the user identifier tag, behavior tags, interest tags, and final item tag sequences in order to construct a heterogeneous tag fragment (i.e., a heterogeneous tag sequence):
$\mathrm{Seq}_u = [\, u_t(u),\ b_t(b_1),\ z_t(z_1),\ i_t(c_{1,1}), \dots, i_t(c_{1,L}),\ \dots,\ b_t(b_n),\ z_t(z_n),\ i_t(c_{n,1}), \dots, i_t(c_{n,L}) \,]$;
where $u_t(u)$ is the discrete tag corresponding to the user identifier $u$; $b_t(b_i)$ is the behavior tag corresponding to the behavior type of the $i$-th interaction; $z_t(z_i)$ is the interest tag corresponding to the interest slot index $z_i$; $c_{i,\ell}$ is the discrete tag obtained from the $\ell$-th level of residual quantization for the item of the $i$-th interaction; $i_t(\cdot)$ is the mapping from discrete item tags to item tokens; $L$ is the length of the item tag sequence; and $n$ is the number of interactions in the user's history.
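The behavior lookup and the concatenation rule of steps 7 and 8 can be sketched in plain Python. The token spellings (`<user_7>`, `<beh_click>`, `<int_0>`, `<c1_3>`), the vocabulary contents, and the function names are all hypothetical; only the ordering (user tag first, then behavior, interest, and item tags per interaction) comes from the text.

```python
# Hypothetical behavior vocabulary: behavior type -> discrete behavior token.
BEHAVIOR_VOCAB = {
    "browse": "<beh_browse>",
    "click": "<beh_click>",
    "favorite": "<beh_fav>",
    "cart": "<beh_cart>",
    "purchase": "<beh_buy>",
}

def user_token(user_id):
    return f"<user_{user_id}>"

def interest_token(slot_index):
    return f"<int_{slot_index}>"

def item_tokens(codes):
    """One token per quantization level, tagged with the level number."""
    return [f"<c{level}_{code}>" for level, code in enumerate(codes, start=1)]

def build_sequence(user_id, interactions):
    """interactions: list of (behavior_type, interest_slot, item_codes)
    in chronological order."""
    seq = [user_token(user_id)]
    for behavior, slot, codes in interactions:
        seq.append(BEHAVIOR_VOCAB[behavior])  # behavior tag
        seq.append(interest_token(slot))      # interest tag
        seq.extend(item_tokens(codes))        # item tags, level 1..L
    return seq
```

For example, `build_sequence(7, [("click", 0, [3, 1])])` yields the user tag followed by one behavior tag, one interest tag, and two item tags.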
[0025] Preferably, in steps 9 and 10, the encoder's processing of the heterogeneous label sequence includes: first, converting the discrete labels in the heterogeneous label sequence into continuous vector representations through an embedding layer; then, through an encoder module composed of a self-attention mechanism and a feedforward neural network, calculating the global attention weight of each label in the heterogeneous label sequence with all other labels, realizing deep feature interaction, and outputting the context sequence feature matrix. Given interest tags and behavior tags, the decoder adopts a conditional guided generation strategy. Specifically, the user identifier is used as the starting point and concatenated with the behavior tag and interest tag to construct a combined sequence as a guide prefix. This guide prefix is used as the first input of the decoder and fed into the bottom layer of the decoder. After processing by the embedding layer and masked self-attention layer of the decoder, the current sequence hidden state is obtained. In scenarios where no interest or behavior labels are given, the decoder adopts a fully autoregressive generation strategy. Specifically, the user identifier is used as a guiding prefix. This guiding prefix is used as the first input of the decoder and fed into the bottom layer of the decoder. After processing by the embedding layer and the masked self-attention layer of the decoder, the hidden state of the current sequence is obtained. 
The context sequence feature matrix is used as the second input to the decoder and introduced into the intermediate layer of the decoder through a cross-attention mechanism; The process of obtaining the target item tag sequence includes: the decoder uses the current sequence hidden state as the query vector and the context sequence feature matrix as the key and value vectors to calculate cross-attention weights; then, through beam search autoregressive method, the top-K item tags with the highest joint probability containing the leading prefix are generated as candidate sequences; subsequently, the candidate sequences are truncated to remove the leading prefix, the target item tag sequence (i.e., the recommendation sequence) is extracted, and then it is decoded and mapped to recommended items as the recommendation result output.
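The generation, prefix truncation, and decoding steps described above can be sketched with a toy beam search. The callable `next_logprobs` is a hypothetical stand-in for one decoder step (it would wrap the Transformer decoder and cross-attention in practice), and `code_to_item` stands in for the identifier mapping dictionary; neither name comes from the patent.

```python
def beam_search(next_logprobs, prefix, steps, beam_width):
    """Autoregressive beam search.

    next_logprobs(seq) -> dict mapping each candidate next token to its
    log-probability given the sequence so far.
    Returns a list of (joint_log_prob, full_sequence), best first.
    """
    beams = [(0.0, list(prefix))]
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for tok, lp in next_logprobs(seq).items():
                candidates.append((score + lp, seq + [tok]))
        # Keep only the beam_width most probable partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams

def recommend(next_logprobs, prefix, item_len, beam_width, code_to_item):
    """Generate item token sequences after the guiding prefix and map
    them back to items."""
    results = []
    for score, seq in beam_search(next_logprobs, prefix, item_len, beam_width):
        codes = tuple(seq[len(prefix):])  # prefix truncation
        item = code_to_item.get(codes)    # unknown code tuples are dropped
        if item is not None:
            results.append((item, score))
    return results
```

Passing only the user tag as `prefix` corresponds to the fully autoregressive mode, while a prefix that also contains behavior and interest tags corresponds to the conditionally guided mode.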
[0026] Preferably, during training, the optimization objectives of the generative sequence model include generation loss, alignment regularization term, and multi-task loss; the comprehensive loss function of the generative sequence model includes a weighted sum of generation loss, alignment loss, and multi-task loss; the weights of each loss term are tuned through methods such as cross-validation to ensure that the loss of each task is effectively optimized during training.
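One way to write the weighted-sum objective described above is given below. The weight symbols $\lambda$ are introduced here for illustration, and the cross-entropy form of the generation loss is an assumption based on standard autoregressive training; the patent text does not spell out the individual terms.

```latex
\mathcal{L} = \lambda_{\mathrm{gen}} \, \mathcal{L}_{\mathrm{gen}}
            + \lambda_{\mathrm{align}} \, \mathcal{L}_{\mathrm{align}}
            + \lambda_{\mathrm{mt}} \, \mathcal{L}_{\mathrm{mt}},
\qquad
\mathcal{L}_{\mathrm{gen}} = - \sum_{t} \log p\left(y_t \mid y_{<t}, \mathrm{context}\right)
```

Here $y_t$ denotes the $t$-th token of the heterogeneous tag sequence, and the three weights are the quantities tuned by cross-validation.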
[0027] This invention also provides a multi-behavior, multi-interest generative recommendation system based on multi-source item embedding and hybrid tokenization, comprising:
The data acquisition module is used to collect multi-source information about user-related items;
The data preprocessing module is used to clean, format, and standardize the collected multi-source dataset to obtain a multi-source input set in a unified format, ensuring that all data inputs conform to a unified standard;
The multi-source embedding construction module is used to encode the features of each input in the multi-source input set to obtain continuous item embedding vectors from different sources;
The projection alignment and fusion module is used to perform projection and learnable weighted fusion on the continuous item embedding vectors to obtain a unified item semantic vector;
The RQ-VAE tokenization module, consisting of a residual quantization variational autoencoder (RQ-VAE), is used to transform the unified item semantic vector into a discrete initial item tag sequence;
The hybrid tokenization control module is used to configure multiple tokenizers for the initial item tag sequence to construct multiple groups of semantically related item tag sequences, and to select the best one as the final item tag sequence;
The multi-interest extraction and interest tagging module is used to perform multi-interest modeling based on capsule networks and dynamic routing on user historical interaction data, extract K user interest slots (interest vectors), and output the corresponding interest tag tokens; it mainly extracts the user's interest representation through multi-interest modeling and converts it into interest tokens;
The behavior tokenization module is used to convert user interaction behavior data (i.e., behavior types such as browsing, clicking, adding to cart, and purchasing) into discrete behavior tokens;
The heterogeneous sequence construction module is used to combine the user's behavior tags, interest tags, and final item tags into a unified heterogeneous tag sequence based on the preset concatenation and ordering rules, starting with the discrete tag corresponding to the user identifier;
The generative sequence model employs a Transformer architecture, in which:
The encoder is used to process the heterogeneous tag input sequence and output a context sequence feature matrix that integrates global historical interaction information and multiple intent logic;
The decoder employs a dual-input, cross-attention interaction mechanism and supports two generation strategies: a conditionally guided generation strategy and a fully autoregressive generation strategy. The decoder is responsible for the following. In scenarios where interest and behavior tags are given, it employs the conditionally guided generation strategy, starting with the user identifier and concatenating it with the behavior and interest tags to construct a combined sequence as the prompt; in scenarios where interest and behavior tags are not given, it employs the fully autoregressive generation strategy, using the user identifier alone as the prompt. It takes the prompt as its first input, feeding it into the decoder's bottom layer, where it is processed by the internal embedding layer and masked self-attention layer to obtain the current sequence hidden state. It takes the context sequence feature matrix as its second input, introducing it into the decoder's intermediate layer through a cross-attention mechanism. It calculates cross-attention weights using the current sequence hidden state as the query vector and the context sequence feature matrix as the key and value vectors. It then generates, by autoregressive beam search, the Top-K item tag sequences with the highest joint probability that contain the prompt as candidate sequences. Finally, it performs prefix truncation on the candidate sequences to remove the prompt, extracting the target item tag sequence.
[0028] The decoding output module is used to decode and map the generated target item tag sequence based on a pre-built identifier mapping dictionary to obtain recommended items and output the recommendation results.
[0029] A computer-readable storage medium storing a computer program that, when executed by a processor, implements the multi-behavior, multi-interest generative recommendation method based on multi-source item embedding and hybrid tokenization.
[0030] By adopting the above technical solution, the beneficial effects achieved by the present invention are as follows: This invention constructs a unified modeling and residual quantization tokenization mechanism for multi-source item embedding, which maps multi-source data such as textual information, image information, structured attribute information, and collaborative interaction information of items to a unified embedding space. Furthermore, it generates semantically consistent discrete token sequences of items through a residual quantization variational autoencoder. Simultaneously, by configuring multiple checkpoint tokenizers for the same item and introducing a hybrid tokenization control strategy, the sampling probabilities of multiple sets of semantically related sequences are dynamically adjusted based on gradient influence evaluation during the pre-training stage, and the optimal tokenizer is fixed during the inference stage. Thus, while ensuring the uniqueness of decoding, the model's representation and generalization capabilities for long-tail items and cold-start items are effectively enhanced.
[0031] This invention explicitly introduces a multi-interest modeling and interest labeling mechanism into a generative recommendation framework, decoupling the multiple interest structures implicit in user history interactions into multiple distinguishable interest labels. Furthermore, it introduces an explicit interest allocation strategy at the per-interaction level, ensuring that each user interaction is associated with a specific interest slot. This solves the problem in existing generative recommendation methods where user interests are compressed into a single vector, making it difficult to express parallel interests. Consequently, it significantly improves the diversity, refinement, and interpretability of recommendation results.
[0032] This invention performs behavior tagging on different user behavior types, unifying heterogeneous behaviors such as browsing, clicking, favorites, adding to cart, and purchasing into discrete behavior tags. Starting with the user identifier, it constructs a unified heterogeneous tag sequence of behavior tags, interest tags, and item tags in a predetermined order. This enables generative sequence models to simultaneously learn the evolution of user interests, behavioral differences, and item semantic structures in a unified sequence space, thereby overcoming the shortcomings of existing methods where multiple behavior signals are only used as weights or feature inputs and fail to participate in the modeling of the generative target.
[0033] This invention employs an autoregressive generative sequence model to perform end-to-end modeling of heterogeneous labeled sequences, constructing a unified generation paradigm of "behavioral tag - interest tag - item tag" starting with user identifiers. During the training phase, the model efficiently learns the sequence distribution in parallel through a teacher forcing strategy. During the inference phase, the model supports not only a fully autoregressive joint generation method, using user identifiers as guiding prefixes to obtain item recommendations, but also a conditionally guided generation method, using user identifiers, behavioral tags, and interest tags as guiding prefixes to obtain item recommendations. This dual-mode mechanism explicitly reflects the causal influence of user behavioral intent and interest state on the recommendation results, avoiding the semantic jump problem of "directly predicting items" in traditional recommendation methods, and endowing the recommendation system with powerful controllable generation capabilities and interpretability.
[0034] This invention introduces a multi-level residual quantization mechanism during the item tag generation stage, which decomposes the item semantics into multiple discrete tags for common expression, avoiding the representation sparsity problem caused by a single item identifier. This provides a more stable and learnable representation basis for low-frequency items in the recommendation generation space, and significantly improves the coverage and exposure fairness of long-tail items in generative recommendations.
[0035] In summary, the core of this invention lies in simultaneously proposing a unified quantized representation method for multi-source item embedding, a hybrid tokenization control mechanism oriented towards multiple checkpoints, a heterogeneous sequence construction method that starts with user identifiers and jointly models behavior tags, interest tags, and item tags, and a generative recommendation modeling framework oriented towards multiple behaviors and interests. Through the above technical solutions, this invention achieves collaborative modeling of heterogeneous user behavior signals, multi-interest structures, and multi-source item semantics under a unified generative framework. It solves the problems of missing behavior semantics, insufficient interest expression, and weak long-tail generalization in traditional generative recommendation. It has good generalization ability, interpretability, and engineering feasibility, and is suitable as a core generative recommendation solution for large-scale recommendation systems.
Description of the Drawings
[0036] Figure 1 is a block diagram of the overall system structure of the present invention.
[0037] Figure 2 is a schematic diagram of the generative recommendation method of the present invention.
[0038] Figure 3 is a schematic diagram of the joint generation of heterogeneous sequences during the inference phase of the method flow of Figure 2.
[0039] Figure 4 is a schematic diagram of generation by the generative sequence model module during the inference phase of the method flow of Figure 2.
[0040] Figure 5 is a schematic diagram of the multi-source item semantic alignment and fusion and the item tokenization (RQ-VAE + hybrid tokenization) of the present invention.
Detailed Description
[0041] To better understand the above-mentioned objectives, features, and advantages of the present invention, the present invention will be further described below in conjunction with the accompanying drawings and embodiments. It should be noted that, unless otherwise specified, the embodiments and features described in these embodiments can be combined with each other.
[0042] Numerous specific details are set forth in the following description in order to provide a full understanding of the invention. However, the invention may also be practiced in other ways than those described herein, and therefore the invention is not limited to the specific embodiments disclosed in the following specification.
Example
[0043] As shown in Figures 1-5, this invention provides a multi-behavior, multi-interest generative recommendation system based on multi-source item embedding and hybrid tokenization, comprising the modules described below.
The data acquisition module is used to collect multi-source information about the items related to users. This multi-source information includes at least textual information, image information, structured attribute information, and collaborative interaction information, corresponding to... The collaborative interaction information includes historical interaction data and interaction behavior data. The historical interaction data includes at least the user identifier, the item identifier, and the interaction time. The interaction behavior data includes at least the behavior type.
[0044] The data preprocessing module is used to clean, format, and standardize the collected multi-source information datasets to obtain a unified format for the multi-source input sets. To ensure all data inputs conform to a unified standard, preprocessing includes at least: word segmentation, cleaning, truncation, and standardized encoding of text information data; size normalization and visual feature extraction of image information data; discretization / normalization and missing value handling of structured attribute information data; and missing value imputation, noise reduction, and interaction feature construction of collaborative interaction information data. The data preprocessing module is used to uniformly organize the preprocessed inputs as follows: .
[0045] The multi-source embedding construction module is used to encode the features of each input in the multi-source input set to obtain continuous item embedding vectors for the different sources. This includes using pre-trained models or custom encoders to convert text, images, structured attributes, and collaborative interactions into continuous embedding vectors, yielding an embedding representation of each item under each source. Specifically, for each source $s$, the source encoder $f^{(s)}$ encodes the corresponding input data to obtain the item's continuous embedding vector for that source.
[0046] The projection alignment and fusion module is used to project the continuous item embedding vectors and fuse them with learnable weights to obtain a unified item semantic vector. Specifically, the embedding from each source $s$ is mapped by its projection network $\mathrm{Proj}^{(s)}$ so that the projected embeddings of all sources lie in a unified embedding space. The projected embeddings from all sources then undergo weighted fusion to yield the unified item semantic vector $h_v$, where $\alpha_s$ is the source weight, $q$ is the query vector, $W$ is a learnable matrix, and $S$ is the source set.
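As an illustration of the weighted fusion described above, the following is a minimal sketch in plain Python. It assumes that the source weight $\alpha_s$ is a softmax over the scores $q^\top W \tilde{e}$ of the projected embeddings; the paragraph names $\alpha_s$, $q$, $W$ and $S$ but does not show the scoring formula, so this attention-style form, the function names, and the toy values are assumptions rather than the claimed implementation.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    """Multiply matrix W (given as a list of rows) by vector x."""
    return [dot(row, x) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_sources(projected, q, W):
    """Fuse projected per-source embeddings into one vector h_v.

    ASSUMPTION: the weight of source s is softmax over s of q . (W e_s).
    """
    sources = sorted(projected)                       # the source set S
    scores = [dot(q, matvec(W, projected[s])) for s in sources]
    alphas = softmax(scores)                          # alpha_s, sums to 1
    dim = len(projected[sources[0]])
    h_v = [sum(a * projected[s][d] for a, s in zip(alphas, sources))
           for d in range(dim)]
    return dict(zip(sources, alphas)), h_v

# Toy embeddings that are assumed to be already projected to a shared space.
projected = {"text": [1.0, 0.0], "image": [0.0, 1.0]}
alphas, h_v = fuse_sources(projected, q=[1.0, 0.0],
                           W=[[1.0, 0.0], [0.0, 1.0]])
```

With these toy values the text source aligns with the query vector, so it receives the larger weight and dominates the fused vector.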
[0047] The RQ-VAE tokenization module, composed of a residual quantization variational autoencoder (RQ-VAE), is used to transform the unified item semantic vector into a discrete initial item token sequence. Specifically, it applies multi-level residual quantization to the item's continuous embedding to generate multiple discrete tokens, so that each item obtains a corresponding initial item token sequence. First, the residual vector is initialized as $r^{(1)} = h_v$. Then, at each level $\ell$, the index of the codebook vector in the $\ell$-th level codebook $E_\ell$ that best matches the current residual $r^{(\ell)}$ is selected, giving the $\ell$-th level discrete token $c_{v,\ell}$. The residual vector is then updated as $r^{(\ell+1)} = r^{(\ell)} - E_\ell[c_{v,\ell}]$. The final output is the initial item token sequence $[c_{v,1}, \ldots, c_{v,m}]$.
The hybrid tokenization control module is used to configure multiple tokenizers for the initial item token sequence, so as to construct multiple groups of semantically related item token sequences, and to select the best one as the final item token sequence. Specifically, it takes multiple model checkpoints of the residual quantization variational autoencoder from different training epochs as a set of semantically related tokenizers $T = \{T_1, \ldots, T_M\}$. This set of tokenizers is used to construct, for the same item, M groups of item token sequences that differ from one another but are semantically related. In the pre-training phase of the generative sequence model, the influence score of each group of item token sequences on model optimization is evaluated by calculating the inner product of the model's loss gradient on the validation set and its training loss gradient on that group. The sampling probability of each group in the next round of training is then dynamically adjusted based on the influence score, using a normalization function with a temperature coefficient and a momentum update strategy. During the fine-tuning and inference phase, the checkpoint in the tokenizer set T that has the highest sampling probability at the end of pre-training, or the smallest reconstruction error on an independent validation set, is fixed as the best-performing target tokenizer, and the final item token sequence is output by this target tokenizer.
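The multi-level residual quantization loop can be sketched in a few lines of plain Python. The sketch assumes that "best matches" means the nearest codebook vector in squared Euclidean distance, which is the usual choice for residual quantization but is not spelled out in the paragraph; the codebooks and the vector `h_v` are toy values, and no encoder, decoder, or training loss of the RQ-VAE is modeled.

```python
def nearest_index(codebook, residual):
    """Index of the codebook vector closest to the residual.

    ASSUMPTION: closeness is squared Euclidean distance.
    """
    def sqdist(vec):
        return sum((a - b) ** 2 for a, b in zip(vec, residual))
    return min(range(len(codebook)), key=lambda j: sqdist(codebook[j]))

def residual_quantize(h_v, codebooks):
    """Return the token sequence [c_1, ..., c_m] and the final residual."""
    residual = list(h_v)                   # r^(1) = h_v
    tokens = []
    for E in codebooks:                    # one codebook E_l per level
        c = nearest_index(E, residual)     # level-l token c_{v,l}
        tokens.append(c)
        # r^(l+1) = r^(l) - E_l[c_{v,l}]
        residual = [r - e for r, e in zip(residual, E[c])]
    return tokens, residual

# Two toy levels with two codebook vectors each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.25, -0.25], [0.0, 0.0]],
]
tokens, residual = residual_quantize([1.25, 0.75], codebooks)
```

Each level quantizes only what the previous levels left unexplained, which is why two items with similar semantics tend to share their leading tokens.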
[0048] This invention constructs a unified modeling and residual quantization tokenization mechanism for multi-source item embedding, which maps multi-source data such as text information, image information, structured attribute information, and collaborative interaction information of items to a unified embedding space. Furthermore, it generates semantically consistent discrete token sequences of items through a residual quantization variational autoencoder. At the same time, by configuring multiple checkpoint tokenizers for the same item and introducing a hybrid tokenization control strategy, the item has multiple sets of semantically related tokenized representations during the training phase, thereby effectively enhancing the model's representation and generalization capabilities for long-tail items and cold-start items.
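The hybrid tokenization control strategy (gradient inner product as influence score, then a temperature-and-momentum update of the sampling probabilities) can be sketched as follows. The exact update formula is not shown in the text, so the form used here, $p^{(t+1)} = \lambda\,p^{(t)} + (1-\lambda)\,\mathrm{softmax}(I/\tau)$, is an assumption that is merely consistent with the stated roles of $\lambda$ and $\tau$. The gradients are toy vectors; in practice they would come from the generative sequence model.

```python
import math

def influence_scores(val_grad, train_grads):
    """Inner product of the validation gradient with each group's gradient."""
    return [sum(g * v for g, v in zip(grad, val_grad)) for grad in train_grads]

def update_sampling_probs(prev_probs, scores, tau=1.0, lam=0.9):
    """ASSUMED update: p_new = lam * p_prev + (1 - lam) * softmax(scores / tau)."""
    if not 0.0 < lam < 1.0 or tau <= 0.0:
        raise ValueError("require 0 < lam < 1 and tau > 0")
    scaled = [s / tau for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    target = [e / total for e in exps]
    return [lam * p + (1.0 - lam) * t for p, t in zip(prev_probs, target)]

def select_target_tokenizer(probs):
    """Index of the tokenizer with the highest sampling probability."""
    return max(range(len(probs)), key=probs.__getitem__)

# Three tokenizer checkpoints (M = 3) with toy gradients.
val_grad = [1.0, 0.0]
train_grads = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
scores = influence_scores(val_grad, train_grads)
probs = update_sampling_probs([1 / 3, 1 / 3, 1 / 3], scores, tau=0.5, lam=0.9)
best = select_target_tokenizer(probs)
```

A group whose training gradient points in the same direction as the validation gradient gets a positive score and is sampled more often in the next round; the momentum term keeps any single round from swinging the distribution sharply.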
[0049] The multi-interest extraction and interest tagging module performs multi-interest modeling based on capsule networks and dynamic routing on the user's historical interaction data, extracts K user interest slots (interest vectors), and outputs the corresponding interest tokens. That is, it extracts user interest representations through multi-interest modeling and converts them into interest tokens. Specifically, it performs multi-interest modeling on the user's historical interaction data to extract K interest representations, obtains an interest slot index for each interaction according to a per-interaction explicit interest allocation strategy, and finally maps the interest slot index to an interest token $z_t(z_i)$. In detail, the representation vector of the i-th interaction in the user's historical interaction data is passed through a capsule mapping to obtain low-level prediction vectors, and the routing coefficients from interactions to interest slots are computed by dynamic routing iteration; the routing coefficients are normalized over the interest slot dimension. Then, based on the routing coefficients, explicit interest assignment is performed for each interaction to obtain the interest slot index $z_i$ of the i-th interaction. Finally, the interest slot index $z_i$ is converted to the discrete interest token $z_t(z_i)$ through a mapping table, where $z_t(\cdot)$ is the mapping table from interest slot indices to token IDs and K is a preset positive integer.
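The routing and per-interaction assignment described above can be illustrated with a small plain-Python sketch. It follows the standard dynamic-routing recipe (softmax over slots, weighted sum, squash, agreement update) and assumes that the explicit allocation picks the slot with the largest routing coefficient; these specifics, the `<z_k>` token format, and the toy prediction vectors are assumptions, since the formulas are not shown in the paragraph.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def squash(s):
    """Capsule squashing: keeps direction, maps the norm into [0, 1)."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0] * len(s)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]

def route_interests(u_hat, iterations=3):
    """u_hat[i][k]: prediction vector of interaction i for interest slot k."""
    n, K, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * K for _ in range(n)]           # routing logits start at zero
    for _ in range(iterations):
        c = [softmax(row) for row in b]         # normalized over the K slots
        v = [squash([sum(c[i][k] * u_hat[i][k][d] for i in range(n))
                     for d in range(dim)]) for k in range(K)]
        for i in range(n):                      # agreement update of logits
            for k in range(K):
                b[i][k] += sum(p * q for p, q in zip(u_hat[i][k], v[k]))
    z = [max(range(K), key=row.__getitem__) for row in c]  # ASSUMED argmax
    return v, c, z

# Hypothetical mapping table from slot index to interest token.
INTEREST_VOCAB = {0: "<z_0>", 1: "<z_1>"}

# Toy prediction vectors for 3 interactions and K = 2 slots.
u_hat = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.9, 0.0], [0.0, 0.1]],
    [[0.0, 0.0], [0.0, 1.0]],
]
v, c, z = route_interests(u_hat)
interest_tokens = [INTEREST_VOCAB[k] for k in z]
```

In this toy case the first two interactions agree with slot 0 and the third with slot 1, so each interaction ends up with exactly one interest token.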
[0050] This invention explicitly introduces a multi-interest modeling and interest labeling mechanism into a generative recommendation framework, decoupling the multiple interest structures implicit in user history interactions into multiple distinguishable interest labels. Furthermore, it introduces an explicit interest allocation strategy at the per-interaction level, ensuring that each user interaction is associated with a specific interest slot. This solves the problem in existing generative recommendation methods where user interests are compressed into a single vector, making it difficult to express parallel interests. Consequently, it significantly improves the diversity, refinement, and interpretability of recommendation results.
[0051] The behavior tokenization module is used to transform user interaction behavior data (i.e., behavior types such as browsing, clicking, adding to cart, and purchasing) into discrete behavior tokens. Specifically, this includes generating a corresponding behavior embedding for each behavior type, quantizing the behavior embedding, and generating the corresponding behavior token. That is, the behavior types are mapped discretely: for the set of behavior categories $B = \{b_1, \ldots, b_{|B|}\}$, a unique behavior token ID is established for each behavior type.
[0052] It should be noted that when constructing heterogeneous sequences, behavior tokens are mixed with interest tokens and item tokens, enabling generative models to directly predict behavior categories and item tokens through autoregression. Furthermore, the embeddings corresponding to behavior tokens can share space with item token embeddings or use independent embeddings. However, the discretization of behavior tokens does not undergo multi-level residual quantization processing, which is different from the RQ-VAE tokenization mechanism of item tokens.
[0053] The heterogeneous sequence construction module, based on predefined concatenation and sorting rules, concatenates the user's behavior tokens, interest tokens, and final item tokens into a unified heterogeneous token sequence that starts with the discrete token corresponding to the user identifier. Specifically, it combines the user identifier token, behavior tokens, interest tokens, and final item tokens into a unified sequence in a fixed order, which serves as a training sample for the generative sequence model. In other words, fixed ordering rules for user identifiers, behaviors, interests, and items are defined in advance, and the user identifier token, behavior tokens, interest tokens, and final item tokens are interleaved accordingly into heterogeneous token fragments.
The generative sequence model employs a Transformer architecture, in which the encoder processes the input heterogeneous token sequence and outputs a context sequence feature matrix that integrates global historical interaction information and multiple intent logics. Specifically, the constructed heterogeneous token sequence is input into the encoder. First, the discrete tokens are transformed into continuous vector representations through an embedding layer. Then, the sequence passes through an encoder module composed of a self-attention mechanism and a feedforward neural network. Under the self-attention mechanism, a global attention weight is calculated between each token in the sequence (such as the current "browsing" behavior token or a specific item token) and all other historical tokens in the sequence. Through this deep feature interaction, the model can capture the evolution of user interests over time and the potential impact of different behavior types on subsequent item selections. Finally, the encoder outputs a continuous vector matrix, namely the context sequence feature matrix. This matrix, as a high-order representation of the user's historical state, is passed to the decoder.
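The construction of the heterogeneous token sequence described above can be sketched as follows. The token spellings (`<u_…>`, `<b_…>`, `<z_…>`, `<i{level}_{code}>`), the dictionary keys, and the choice to order interactions by interaction time are all hypothetical; the text only fixes the per-interaction order of behavior token, interest token, then item tokens, after a leading user identifier token.

```python
# Hypothetical behavior vocabulary: behavior type -> behavior token.
BEHAVIOR_VOCAB = {
    "browse": "<b_browse>",
    "click": "<b_click>",
    "favorite": "<b_favorite>",
    "add_to_cart": "<b_cart>",
    "purchase": "<b_purchase>",
}

def item_tokens(codes):
    """One token per residual-quantization level, tagged with its level."""
    return [f"<i{level}_{code}>" for level, code in enumerate(codes, start=1)]

def build_sequence(user_id, interactions):
    """[user token, then (behavior, interest, item tokens) per interaction]."""
    seq = [f"<u_{user_id}>"]
    # ASSUMPTION: interactions are ordered by their interaction time.
    for inter in sorted(interactions, key=lambda x: x["time"]):
        seq.append(BEHAVIOR_VOCAB[inter["behavior"]])
        seq.append(f"<z_{inter['interest_slot']}>")
        seq.extend(item_tokens(inter["item_codes"]))
    return seq

interactions = [
    {"time": 2, "behavior": "purchase", "interest_slot": 0, "item_codes": [5, 1]},
    {"time": 1, "behavior": "click", "interest_slot": 2, "item_codes": [3, 7]},
]
sequence = build_sequence(42, interactions)
```

Tagging item tokens with their quantization level keeps the vocabularies of different levels disjoint, which is one simple way to let a single sequence model tell them apart.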
[0054] The decoder generates target recommendation sequences autoregressively based on the historical states provided by the encoder. Internally, the decoder employs a dual-input and cross-attention interaction mechanism.
[0055] First input (conditional prefix): During the inference phase, when interest and behavior tags are given, the decoder uses a conditional generation strategy: starting with the user ID, it concatenates the target behavior tag and target interest tag to construct the combined sequence [user ID, target behavior tag, target interest tag] as the guiding prefix. When no interest or behavior tags are given, it uses a fully autoregressive generation strategy with the user ID alone as the guiding prefix. In either case, the guiding prefix serves as the first input to the decoder and is fed into the decoder's bottom layer. After being processed by the decoder's embedding layer and masked self-attention layer, the guiding prefix is transformed into a sequence hidden state containing the current generation intent (i.e., the current sequence hidden state).
[0056] Second input (contextual constraints) and cross-attention interaction: In the intermediate layer of the decoder, a second input is introduced through a cross-attention mechanism. Specifically, the decoder uses the processed guiding prefix (i.e., the current sequence hidden state) as the query vector, and receives the context sequence feature matrix output by the encoder as the key and value vectors for attention weight calculation. A candidate sequence containing the guiding prefix and the subsequent item tags is then generated autoregressively through beam search. The prefix is subsequently truncated from the candidate sequence to extract the target item tag sequence, which is finally decoded and mapped to recommended items to output the recommendation result.
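The prefix-guided beam search and the prefix truncation step can be illustrated with a toy example. Here `toy_next_logprobs` is a hypothetical stand-in for the decoder's next-token distribution (a real decoder would compute it from the prefix hidden state and the encoder's context matrix), and the token names are illustrative only.

```python
import math

def beam_search(prefix, next_logprobs, steps, beam_size):
    """Keep the beam_size highest-scoring continuations of the prefix."""
    beams = [(0.0, list(prefix))]
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for tok, logp in next_logprobs(seq).items():
                candidates.append((score + logp, seq + [tok]))
        candidates.sort(key=lambda cand: cand[0], reverse=True)
        beams = candidates[:beam_size]
    return beams

def strip_prefix(seq, prefix_len):
    """Prefix truncation: keep only the generated item tokens."""
    return seq[prefix_len:]

def toy_next_logprobs(seq):
    """Hypothetical next-token distribution standing in for the decoder."""
    if seq[-1] == "<i1_0>":
        probs = {"<i2_0>": 0.5, "<i2_1>": 0.5}
    elif seq[-1] == "<i1_1>":
        probs = {"<i2_0>": 0.9, "<i2_1>": 0.1}
    else:  # first item level, right after the guiding prefix
        probs = {"<i1_0>": 0.6, "<i1_1>": 0.4}
    return {tok: math.log(p) for tok, p in probs.items()}

# Conditionally guided mode: [user ID, target behavior, target interest].
prefix = ["<u_7>", "<b_click>", "<z_1>"]
beams = beam_search(prefix, toy_next_logprobs, steps=2, beam_size=2)
best_items = strip_prefix(beams[0][1], len(prefix))

# Greedy decoding (beam size 1) commits to the locally best first token.
greedy = beam_search(prefix, toy_next_logprobs, steps=2, beam_size=1)
greedy_items = strip_prefix(greedy[0][1], len(prefix))
```

In this toy distribution the best joint sequence starts with the less likely first token, so a beam of width 2 finds it while greedy decoding does not. Passing only `["<u_7>"]` as the prefix corresponds to the fully autoregressive mode, with `steps` increased so that the behavior and interest positions are generated as well.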
[0057] This invention employs an autoregressive generative sequence model to perform end-to-end modeling of heterogeneous labeled sequences, constructing a unified generation paradigm of "behavior tag - interest tag - item tag" starting with user identifiers. It supports two flexible inference generation modes: fully autoregressive and conditionally guided. In the fully autoregressive generation mode, the model can use user identifiers as guiding prefixes, thus explicitly reflecting the causal influence of user interest states and behavioral intentions on recommendation results during the generation process. This avoids the semantic jump problem of "directly predicting items" in traditional recommendation methods, improving the rationality and interpretability of the generated results. In the conditionally guided generation mode, the model supports using user identifiers and specific target behavior tags and interest tags as guiding prefix inputs, enabling precise intervention in subsequent item generation, thereby greatly improving the controllability and multi-scenario adaptability of the recommendation system.
[0058] The decoding output module is used to decode and map the generated target item tag sequence based on a pre-built identifier mapping dictionary to obtain recommended items and output recommendation results. Specifically, it includes: querying the identifier mapping dictionary to accurately restore the generated target item tag sequence to the actual item ID, and outputting Top-K recommended items as the final recommendation result based on the generation probability; at the same time, depending on the different inference generation modes, it synchronously outputs behavioral tags and interest tags as conditional inputs or generated by model autoregressive prediction, thereby providing explicit multi-intent interpretable information for the current recommendation result.
Example
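The dictionary lookup and Top-K output described above can be sketched as below. The mapping dictionary, the item IDs, and the candidate scores are toy values, and two behaviors are assumptions not stated in the text: candidate sequences with no dictionary entry are skipped, and duplicate items are kept only once.

```python
def decode_recommendations(candidates, token_to_item, top_k):
    """Map generated item-token sequences to item IDs and keep the Top-K.

    candidates: list of (log_probability, item_token_sequence) pairs.
    """
    ranked = sorted(candidates, key=lambda cand: cand[0], reverse=True)
    results, seen = [], set()
    for log_prob, tokens in ranked:
        item_id = token_to_item.get(tuple(tokens))
        # ASSUMPTION: unmapped sequences and repeated items are dropped.
        if item_id is None or item_id in seen:
            continue
        results.append((item_id, log_prob))
        seen.add(item_id)
        if len(results) == top_k:
            break
    return results

# Hypothetical identifier mapping dictionary: token tuple -> item ID.
token_to_item = {
    ("<i1_1>", "<i2_0>"): "item_A",
    ("<i1_0>", "<i2_1>"): "item_B",
}
candidates = [
    (-0.9, ["<i1_0>", "<i2_1>"]),
    (-0.2, ["<i1_1>", "<i2_0>"]),
    (-0.5, ["<i1_9>", "<i2_9>"]),   # no such item in the dictionary
]
recommendations = decode_recommendations(candidates, token_to_item, top_k=2)
```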
[0059] As shown in Figures 1-3, the present invention also provides a multi-behavior, multi-interest generative recommendation method based on multi-source item embedding and hybrid tokenization, including the following steps:
1) Collect multi-source information about the items related to users using the data acquisition module to construct a multi-source dataset. The multi-source information of an item includes at least text information, image information, structured attribute information, and collaborative interaction information. The collaborative interaction information includes historical interaction data and interaction behavior data. The historical interaction data includes at least the user identifier, the interaction time, and the item identifier. The interaction behavior data includes at least the behavior type.
2) Preprocess the multi-source dataset using the data preprocessing module to obtain a multi-source input set in a unified format, wherein the preprocessing of the multi-source dataset includes:
2.1) The text information is cleaned, segmented into words or subwords, truncated, and standardized by embedding encoding to obtain the processed text information;
2.2) After the image information is uniformly resized and normalized, a visual encoding network is used for feature extraction to obtain the processed image information;
2.3) The structured attribute information is discretized or normalized and its missing values are handled to obtain the processed structured attribute information;
2.4) Missing value filling, outlier removal, and sparsification are performed on the collaborative interaction information to obtain the processed collaborative interaction information.
The multi-source input set in a unified format comprises four components: the text information input, the image information input, the structured attribute information input, and the collaborative interaction information input.
3) Use the multi-source embedding construction module to encode the features of the text information input, image information input, structured attribute information input, and collaborative interaction information input in the multi-source input set, respectively, to obtain continuous item embedding vectors under the different sources. Then use the projection alignment and fusion module to project the continuous item embedding vectors onto a unified embedding space, and finally perform learnable weighted fusion to obtain a unified item semantic vector, as detailed below. For each source $s$, the source encoder $f^{(s)}$ encodes the input of that source to give the continuous item embedding vector, and the projection network $\mathrm{Proj}^{(s)}$ maps it to a continuous item embedding vector located in the unified embedding space. Learnable weighted fusion is then performed on the projected embedding vectors to obtain the unified item semantic vector $h_v$, where $\alpha_s$ is the source weight, $q$ is the query vector, $W$ is a learnable matrix, and $S$ is the source set.
4) Perform multi-level residual quantization on the unified item semantic vector using the residual quantization variational autoencoder (RQ-VAE tokenization module), outputting an initial item tag sequence composed of multiple discrete tags. The specific process is as follows:
4.1) Initialize the residual vector: $r^{(1)} = h_v$;
4.2) In the $\ell$-th level codebook $E_\ell$, select the index of the codebook vector that best matches the current residual $r^{(\ell)}$ to obtain the $\ell$-th level discrete tag $c_{v,\ell}$;
4.3) Update the residual vector: $r^{(\ell+1)} = r^{(\ell)} - E_\ell[c_{v,\ell}]$;
4.4) Output the initial item tag sequence $[c_{v,1}, \ldots, c_{v,m}]$.
5) Construct multiple groups of semantically related item tag sequences from the initial item tag sequence using a hybrid tokenization mechanism, and select the best one as the final item tag sequence, specifically including:
5.1) Extract multiple model checkpoints of the residual quantization variational autoencoder from different training epochs as a set of semantically related tokenizers $T = \{T_1, \ldots, T_M\}$. This set of tokenizers is used to construct, for the same item, M groups of item tag sequences that differ from one another but are semantically related;
5.2) In the pre-training phase of the generative sequence model, evaluate the influence score of each group of item tag sequences on model optimization by calculating the inner product of the model's loss gradient on the validation set and its training loss gradient on that group. Then dynamically adjust the sampling probability of each group in the next round of training based on the influence score, using a normalization function with a temperature coefficient and a momentum update strategy.
Further, in step 5.2 the influence score and the sampling probability adjustment are computed as follows. The influence score of the m-th group of item tag sequence training data at the t-th iteration is the inner product of the gradient of the model's loss on the validation set and the gradient of the model's loss on the m-th group of training data, both evaluated at the model parameters $\theta^{(t)}$ of the t-th iteration, where $\nabla_\theta$ denotes the gradient with respect to the model parameters $\theta$ and $\top$ denotes the vector transpose operation. By taking this inner product, the formula quantifies the contribution of each training data group to reducing the validation set error, i.e., it is a first-order gradient approximation.
[0060] Subsequently, an update strategy with temperature coefficient τ and momentum factor λ is used to adjust the sampling probability for the next round. The update yields the probability that the m-th group of item tag sequences is sampled for training in the (t+1)-th iteration, computed from the sampling probability of the previous round and the influence scores, where λ is the preset momentum update coefficient (0 < λ < 1), used to smooth the update amplitude of the sampling probability and prevent training oscillations; τ is the temperature hyperparameter (τ > 0), used to control the smoothness of the probability distribution; M is the total number of tokenizers (i.e., the total number of generated groups of item tag sequences); and exp(·) is the exponential function with the natural constant e as its base.
5.3) During the fine-tuning and inference phase, the checkpoint in the tokenizer set T that has the highest sampling probability at the end of pre-training, or the smallest reconstruction error on an independent validation set, is fixed as the best-performing target tokenizer, and the final item tag sequence is output by this target tokenizer.
6) Using the multi-interest extraction and interest tagging module, perform multi-interest modeling based on capsule networks and dynamic routing on the historical interaction data in the collaborative interaction information to obtain K interest slots and the routing coefficients from interactions to these slots. Then obtain the interest slot index according to the per-interaction explicit interest allocation strategy, and finally generate interest tags from the interest slot index through mapping, specifically including:
6.1) Map each interaction representation to a low-level capsule prediction vector, $u_{i,k} = W_k x_i$, where $u_{i,k}$ is the low-level capsule prediction vector, $W_k$ is a learnable mapping matrix or a shared matrix, and $x_i$ is the representation vector of the i-th interaction in the user's historical interaction data.
[0061] 6.2) Initialize the routing logits and perform dynamic routing iteration, calculating the routing coefficients in each iteration, where $b_{i,k}$ is the routing logit, initialized to zero;
6.3) Perform weighted aggregation of the low-level capsule prediction vectors $u_{i,k}$ to obtain the input of the k-th interest slot;
6.4) Apply a nonlinear squashing function to the result of step 6.3 to obtain the interest vector of the k-th interest slot;
6.5) Update the routing logits $b_{i,k}$ based on agreement;
6.6) Finally, obtain the K interest slot vectors and the routing coefficients;
6.7) Based on the routing coefficients, perform explicit interest allocation for each interaction (i.e., the per-interaction explicit interest allocation strategy) to obtain the interest slot index $z_i$ of the i-th interaction;
6.8) Map the interest slot index $z_i$ to the discrete interest tag $z_t(z_i)$ via a lookup table, where $z_t(\cdot)$ is the mapping table from interest slot index to token ID and K is a preset positive integer.
7) Use the behavior tagging module to tag the interaction behavior data in the collaborative interaction information, generating behavior tags. Specifically, based on the set of behavior types $B = \{b_1, \ldots, b_{|B|}\}$, a unique behavior tag is established for each behavior type, that is, $b_t(b_i) = \mathrm{BehaviorVocab}(b_i)$, where $b_i$ is the behavior type of the i-th interaction, $b_t(b_i)$ is the discrete behavior tag corresponding to this behavior type, and BehaviorVocab is the mapping table from behavior type to tag.
8) Based on predefined fixed ordering rules for behaviors, interests, and items, combine the behavior tags, interest tags, and final item tag sequences in a predetermined order to construct a heterogeneous tag sequence. The specific process is as follows: define the concatenation and ordering rules for user identifiers, behaviors, interests, and items, and then concatenate the user identifier tag, the behavior tags, the interest tags, and the final item tag sequences in order to construct a heterogeneous tag fragment (i.e., a heterogeneous tag sequence), where $u_t(u)$ is the discrete tag corresponding to the user identifier $u$; $b_t(b_i)$ is the behavior tag corresponding to the behavior type of the i-th interaction; $z_t(z_i)$ is the interest tag corresponding to the interest slot index $z_i$; $c_{i,\ell}$ is the discrete tag obtained from the $\ell$-th level of residual quantization for the item of the i-th interaction; $i_t(\cdot)$ is the mapping from discrete item tags to item tags; and L is the length of the item tag sequence.
[0062] 9) Input the heterogeneous tag sequence into the encoder of the pre-trained generative sequence model for processing to obtain the context sequence feature matrix. Further, in step 9, the encoder's processing of the heterogeneous tag sequence includes: first, converting the discrete tags in the heterogeneous tag sequence into continuous vector representations through the embedding layer; then, through the encoder module composed of a self-attention mechanism and a feedforward neural network, calculating the global attention weight between each tag in the heterogeneous tag sequence and all other tags, realizing deep feature interaction, and outputting the context sequence feature matrix.
10) The user identifier, either alone or concatenated with the behavior tag and interest tag, is used as a guiding prefix and input into the decoder of the generative sequence model to obtain the current sequence hidden state. The decoder then calculates the attention weights based on the current sequence hidden state and the context sequence feature matrix, and generates candidate sequences autoregressively through a beam search strategy. After prefix truncation, the target item tag sequence is obtained; finally, it is decoded and mapped to the recommended items, which are output as the recommendation result.
[0063] As a further refinement, in step 10, when interest tags and behavior tags are given, the decoder adopts a conditionally guided generation strategy. Specifically, the user identifier is used as the starting point and concatenated with the behavior tag and interest tag to construct a combined sequence that serves as the guiding prefix. This guiding prefix is the first input of the decoder and is fed into its bottom layer. After processing by the decoder's embedding layer and masked self-attention layer, the current sequence hidden state is obtained. When no interest or behavior tags are given, the decoder adopts a fully autoregressive generation strategy. Specifically, the user identifier alone is used as the guiding prefix, which is fed into the bottom layer of the decoder as its first input and processed by the embedding layer and masked self-attention layer to obtain the current sequence hidden state. The context sequence feature matrix is the second input to the decoder and is introduced into the decoder's intermediate layer through a cross-attention mechanism. The target item tag sequence is obtained as follows: the decoder uses the current sequence hidden state as the query vector and the context sequence feature matrix output by the encoder as the key and value vectors for attention weight calculation, thereby dynamically extracting the most relevant contextual information from the user's historical interaction features as conditional constraints. Subsequently, a beam search strategy (e.g., beam size set to 50) is used to search the generation space directly for the Top-K candidate sequences with the highest joint probability, to promote both the diversity and accuracy of the generated results.
Finally, prefix truncation is performed on the candidate sequences to remove the leading prefix, extracting the target item label sequence, and then decoding and mapping it to recommended items as the recommendation result output.
[0064] In summary, the core of this invention lies in simultaneously proposing a unified quantitative representation method for multi-source item embedding, a hybrid tokenization control mechanism oriented towards multiple checkpoints, a heterogeneous sequence construction method that starts with user identifiers and jointly models behavior tags, interest tags, and item tags, and a generative recommendation modeling framework oriented towards multiple behaviors and interests. Through the above technical solutions, this invention achieves collaborative modeling of heterogeneous user behavior signals, multi-interest structures, and multi-source semantics of items under a unified generative framework. It solves the problems of missing behavior semantics, insufficient interest expression, and weak long-tail generalization ability in traditional generative recommendations. It has good generalization ability, interpretability, and engineering feasibility, and is suitable as a core generative recommendation solution for large-scale recommendation systems.
[0065] This invention also provides a storage medium for storing a computer program, which, when executed, performs at least the method described in Embodiment 2.
[0066] This invention also provides a control device, including a processor and a storage medium for storing a computer program; wherein, when the processor executes the computer program, it at least executes the method described in Embodiment 2.
[0067] This invention also provides a processor that executes a computer program, at least performing the method described in Embodiment 2.
[0068] Based on the system of Embodiment 1 and the method of Embodiment 2, the present invention also proposes a training strategy.
[0069] As shown in Figure 2, the method flow illustrates the execution logic of the training phase: during model training, the parameters of the generative sequence model are learned on the training set. In detail, data is first constructed and partitioned from the multi-source datasets to ensure the effectiveness and accuracy of model training. The specific steps include: 1) Data partitioning: A leave-one-out partitioning strategy is adopted; each user's last interaction is used for the test set, the penultimate interaction for the validation set, and the remaining historical interactions for the training set. This strategy ensures that the interactions held out for validation and testing do not appear in the training set, and it simulates a real online recommendation environment.
[0070] Sequence truncation length: During training and evaluation, the interaction sequence of each user is truncated to 50 to ensure the consistency of sequence length and avoid the burden of excessively long sequences on the training process.
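The leave-one-out partitioning and truncation described above can be sketched as follows. The function name `leave_one_out_split` and the returned dictionary layout are hypothetical. Keeping the most recent 50 interactions when truncating is an assumption of this sketch: the text gives the length (50) but not which end of the sequence is kept.

```python
from typing import Dict, List, Tuple, Union

MAX_LEN = 50  # truncation length given in the text

Split = Dict[str, Union[List[int], Tuple[List[int], int]]]


def leave_one_out_split(interactions: List[int], max_len: int = MAX_LEN) -> Split:
    """Split one user's chronologically ordered interactions (oldest first).

    - test target:  last interaction
    - valid target: penultimate interaction
    - train:        everything before the penultimate interaction
    Histories are truncated to the most recent `max_len` items (assumption).
    """
    if len(interactions) < 3:
        raise ValueError("need at least 3 interactions for leave-one-out")
    train = interactions[:-2][-max_len:]
    valid_history = interactions[:-2][-max_len:]
    test_history = interactions[:-1][-max_len:]
    return {
        "train": train,
        "valid": (valid_history, interactions[-2]),
        "test": (test_history, interactions[-1]),
    }


split = leave_one_out_split(list(range(60)))
```

With 60 interactions numbered 0 to 59, the test target is 59, the validation target is 58, and the training sequence holds the 50 most recent of the remaining interactions.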
[0071] 2) Data preprocessing: During the data preprocessing stage, the information modalities of the multi-source datasets (text, images, structured attributes, collaborative interactions) are processed as follows: The text modality is cleaned, tokenized (word or subword segmentation), truncated, and standardized with embedding encoding to obtain the processed text information.
[0072] After standardizing and normalizing the image information modalities, a visual coding network (such as ResNet) is used for feature extraction to obtain the processed image information.
[0073] The structured attribute information is discretized or normalized, and missing values are handled, to obtain the processed structured attribute information.
[0074] The collaborative interaction information is processed by filling missing values, removing outliers, and sparsifying to obtain the processed collaborative interaction information.
[0075] 3) A unified format for multi-source input sets: After preprocessing, all input data is organized into a unified multi-source input set with four components: the text input, the image input, the structured attribute input, and the collaborative interaction input.
[0076] 4) Hybrid tokenization mechanism and dynamic sampling strategy: In the pre-training phase of the generative sequence model, a hybrid tokenization mechanism is introduced to enhance the model's representation and generalization ability for long-tail items. Specifically, for the multi-checkpoint tokenizer set $T=\{T_1,\dots,T_M\}$, each tokenizer is assigned a uniform sampling probability at training initialization. In each training iteration, the influence score of the data generated by each tokenizer on model optimization is evaluated by calculating the inner product of the model's loss gradient on the validation set and the training loss gradient on each group of item tag sequences. The sampling probability for the next round is then adjusted dynamically using a normalization function with a temperature coefficient and a momentum update strategy, so that high-influence tokenizers receive higher sampling probabilities and are prioritized for training. After pre-training, the checkpoint with the highest sampling probability (or the smallest reconstruction error on the independent validation set) is selected as the best-performing target tokenizer for the subsequent fine-tuning and inference stages.
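One possible reading of the sampling probability update is sketched below with NumPy. The text names the ingredients (gradient inner product, temperature-scaled normalization, momentum update) but not their exact form, so the softmax normalization, the exponential-moving-average form of the momentum step, the default values, and the function name `update_sampling_probs` are all assumptions.

```python
import numpy as np


def update_sampling_probs(probs, val_grad, train_grads, temperature=1.0, momentum=0.9):
    """Return next-round tokenizer sampling probabilities.

    probs:       current sampling probability per tokenizer, shape (M,)
    val_grad:    flattened loss gradient on the validation set, shape (D,)
    train_grads: one flattened training-loss gradient per tokenizer, each (D,)
    """
    probs = np.asarray(probs, dtype=float)
    # Influence score: inner product of validation and training gradients.
    scores = np.array([float(np.dot(val_grad, g)) for g in train_grads])
    # Temperature-scaled softmax (assumed normalization function).
    logits = scores / temperature
    logits = logits - logits.max()  # numerical stability
    target = np.exp(logits) / np.exp(logits).sum()
    # Momentum update (assumed to be an exponential moving average).
    new_probs = momentum * probs + (1.0 - momentum) * target
    return new_probs / new_probs.sum()


M = 3
uniform = np.full(M, 1.0 / M)  # uniform initialization, as in the text
val_grad = np.array([1.0, 0.0])
train_grads = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
next_probs = update_sampling_probs(uniform, val_grad, train_grads)
```

In this toy case the first tokenizer's training gradient points in the same direction as the validation gradient, so its sampling probability rises, while the third tokenizer's opposing gradient lowers its probability.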
[0077] 5) Training sample generation: For each historical user interaction, the system first extracts corresponding discrete interest tags through multi-interest modeling, extracts discrete behavior tags through behavior tokenization, and combines these with the item tag sequence generated by the dynamically sampled tokenizer in the aforementioned hybrid tokenization mechanism. Subsequently, the system strictly follows preset concatenation and sorting rules to sequentially concatenate user tags, behavior tags, interest tags, and item tags, constructing a unified heterogeneous tag sequence. This heterogeneous tag sequence serves as a complete training sample and is input into the generative sequence model for autoregressive joint training.
[0078] 6) Training the generative sequence model: During the training phase, a generative sequence model based on the Transformer architecture is employed, with a teacher forcing strategy used for end-to-end optimization. During training, the encoder receives historical heterogeneous tag sequences from the user for feature extraction; the decoder receives the real "user identifier—behavior tag—interest tag—item tag" sequence as input, using a masking mechanism to ensure that the current position can only focus on tags preceding it, and predicts the next target tag for each position in the sequence in parallel. The model calculates the generation loss (such as cross-entropy loss) between the predicted tag distribution and the real target tags, and updates and optimizes the model parameters through backpropagation.
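The teacher forcing setup above can be illustrated with a small sketch: the decoder input is the ground-truth sequence shifted by one position, and a lower-triangular mask restricts each position to earlier tags. The helper names `teacher_forcing_pairs` and `causal_mask` and the token spellings are hypothetical.

```python
import numpy as np


def teacher_forcing_pairs(sequence):
    """Return (decoder_input, targets): position u predicts tag u+1."""
    decoder_input = list(sequence[:-1])
    targets = list(sequence[1:])
    return decoder_input, targets


def causal_mask(length):
    """Boolean mask where entry [i, j] is True if position i may attend to j."""
    return np.tril(np.ones((length, length), dtype=bool))


sequence = ["<u_42>", "<b_click>", "<int_1>", "<c1_7>"]
decoder_input, targets = teacher_forcing_pairs(sequence)
mask = causal_mask(len(decoder_input))
```

Because every position receives the true preceding tags rather than the model's own predictions, all next-tag predictions in a sequence can be computed in one parallel pass.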
[0079] 7) Optimize the objective and loss function: During training, the optimization objectives of the generative sequence model include generation loss, alignment regularization term, and multi-task loss; the generation loss is used to measure the accuracy of each generated label; the alignment regularization term is used to ensure the consistency of multimodal features in a unified semantic space; and the multi-task loss is used to improve the model's multi-task learning ability.
[0080] The comprehensive loss function of the generative sequence model is a weighted sum of the generation loss, alignment loss, and multi-task loss. The weight of each loss term is tuned through methods such as cross-validation so that each loss is effectively optimized during training. Each part of the loss function has its own optimization objective, to ensure the generation performance, semantic consistency, and multi-task learning ability of the model. Specifically: Generation loss: Given the target sequence $y_t=[y_{t,1},y_{t,2},\dots,y_{t,U_t}]$ of training sample $t$ and the corresponding preceding target tags $y_{t,<u}$, the generation loss is: $L_{gen}=\sum_{u=1}^{U_t}\omega_{t,u}\,\mathrm{CE}(y_{t,u}\mid y_{t,<u},x_t)$; where $L_{gen}$ is the generation loss; $U_t$ is the length of the target sequence; $\mathrm{CE}(y_{t,u}\mid y_{t,<u},x_t)$ is the conditional cross-entropy loss of the u-th tag in the target sequence, measuring the model's prediction error on the target tag $y_{t,u}$; and $\omega_{t,u}$ is the loss weight of the u-th target tag in the t-th training sample. This weight is determined by $w_t$, the sample-level weight representing the importance of training sample $t$, and $\gamma_{sem}$, the amplification factor for semantic tags, which controls the importance of semantic tags in the training loss.
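A minimal NumPy sketch of a weighted generation loss follows. The exact definition of the per-tag weight is not recoverable from the text, so the rule used here (the sample weight multiplied by the amplification factor for semantic tags, and the sample weight alone otherwise) is an assumption, as are the function name `generation_loss` and the choice of which positions count as semantic tags.

```python
import numpy as np


def generation_loss(log_probs, targets, is_semantic, sample_weight=1.0, gamma_sem=2.0):
    """Weighted cross-entropy over one target sequence.

    log_probs:   predicted log-probabilities, shape (U, V)
    targets:     index of the true tag at each position, shape (U,)
    is_semantic: True where the target is a semantic tag, shape (U,)
    """
    log_probs = np.asarray(log_probs, dtype=float)
    targets = np.asarray(targets)
    # Cross-entropy of the true tag at each position.
    ce = -log_probs[np.arange(len(targets)), targets]
    # Assumed weight rule: w_t * gamma_sem for semantic tags, w_t otherwise.
    weights = sample_weight * np.where(np.asarray(is_semantic, dtype=bool), gamma_sem, 1.0)
    return float(np.sum(weights * ce))


log_probs = np.log(np.array([[0.5, 0.5], [0.25, 0.75]]))
loss = generation_loss(log_probs, targets=[0, 1], is_semantic=[False, True])
```

Here the second position is treated as a semantic tag, so its cross-entropy term counts twice as much as the first.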
[0081] Alignment regularization: In multi-source information fusion, avoiding semantic drift is crucial, because text, image, and structured attribute information must have similar representations in the same space. The alignment regularization term is computed from the cosine similarity between the quantized reconstructed semantic vector of each item and its semantic tag vector, so that the loss decreases as that similarity increases. Here, the quantized reconstructed semantic vector of the i-th item is the reconstruction obtained after applying multi-level residual quantization to the unified semantic vector $h_v$; $h(s_i)$ is the encoded vector representation of the semantic tag sequence $s_i$ of the i-th item; and the cosine similarity between these two vectors measures how closely they agree. A multimodal consistency loss is computed over the text and image modality embedding vectors of the i-th item, using the InfoNCE loss from contrastive learning to constrain the semantic consistency between the text and image modalities. The overall alignment loss $L_{align}$ combines the two alignment losses, with the weighting coefficient $\alpha\in[0,1]$ controlling their relative contributions.
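One plausible form of the alignment terms, written out under stated assumptions: the symbols $\hat{h}_i$ (quantized reconstruction), $e_i^{\mathrm{text}}$ and $e_i^{\mathrm{img}}$ (modality embeddings), $\tau$ (InfoNCE temperature), $N$ (number of items in a batch), and the names $L_{\mathrm{cos}}$ and $L_{\mathrm{mm}}$ are introduced here for illustration and do not appear in the source. Only $h(s_i)$, $h_v$, $\alpha$, and $L_{align}$ are taken from the text.

```latex
% Cosine alignment between the quantized reconstruction and the semantic tag vector
L_{\mathrm{cos}} = \frac{1}{N}\sum_{i=1}^{N}\Bigl(1-\cos\bigl(\hat{h}_i,\ h(s_i)\bigr)\Bigr)

% InfoNCE consistency between text and image embeddings of the same item
L_{\mathrm{mm}} = -\frac{1}{N}\sum_{i=1}^{N}\log
\frac{\exp\bigl(\cos(e_i^{\mathrm{text}},\, e_i^{\mathrm{img}})/\tau\bigr)}
     {\sum_{j=1}^{N}\exp\bigl(\cos(e_i^{\mathrm{text}},\, e_j^{\mathrm{img}})/\tau\bigr)}

% Convex combination controlled by alpha in [0, 1]
L_{align} = \alpha\,L_{\mathrm{cos}} + (1-\alpha)\,L_{\mathrm{mm}}
```

Under this reading, $\alpha=1$ keeps only the quantization alignment term and $\alpha=0$ keeps only the cross-modal contrastive term.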
[0082] Multi-task loss includes index-language alignment task loss, mask reconstruction task loss, semantic consistency task loss, instruction reasoning task loss, and preference ranking task loss.
[0083] The total multi-task loss is: $L_{multi}=\beta_1 L_{ILA}+\beta_2 L_{MRT}+\beta_3 L_{SCT}+\beta_4 L_{IRT}+\beta_5 L_{PGT}$; where $L_{multi}$ is the multi-task loss; $L_{ILA}$ is the index-language alignment task loss; $L_{MRT}$ is the mask reconstruction task loss, intended to improve the model's robustness when local information is missing; $L_{SCT}$ is the semantic consistency task loss, which ensures the semantic consistency of the same item across different modalities or contexts; $L_{IRT}$ is the instruction reasoning task loss, used to guide the model to generate recommendation results that conform to business rules; $L_{PGT}$ is the preference ranking task loss, which learns to rank preferences from the user's historical behavior; and $\beta_1,\beta_2,\dots,\beta_5$ are the weight coefficients of the loss terms, used to adjust the contribution of each task to the overall loss.
[0084] Comprehensive loss function: Ultimately, the model's overall loss function is a weighted sum of the generation loss $L_{gen}$, the alignment loss $L_{align}$, and the multi-task loss $L_{multi}$. The weight of each loss term is tuned through methods such as cross-validation so that the loss of each task is effectively optimized during training.
[0085] In addition, the optimizer, hyperparameters, learning rate, etc., are described as follows: Optimizer: The AdamW optimizer is used, which combines weight decay to optimize model parameters.
[0086] Learning rate: During training, a learning rate scheduling method is used. The initial value of the learning rate is set to 0.001 and is dynamically adjusted according to the performance of the validation set during training. The learning rate range is {0.0008, 0.001, 0.0012}.
[0087] Batch size: During training, the batch size is set to 512 to ensure the stability of each training batch.
[0088] Training steps: The total number of training steps is 350,000 to ensure that the model can be fully trained.
[0089] Validation set model selection strategy: During training, NDCG@10 (Normalized Discounted Cumulative Gain) is used as the evaluation metric for the validation set, and the best validation model is selected for testing.
[0090] In summary, during the training phase, this invention uses multi-source data as input to sequentially complete multi-source feature encoding, residual quantization, and construct a hybrid target sequence. An autoregressive model combined with multi-task loss is used for optimization. The loss function includes generation loss, alignment loss, and multi-task loss. The AdamW optimizer is used for training with a learning rate range of {0.0008, 0.001, 0.0012}, a batch size of 512, and 350,000 training steps. NDCG@10 is selected as the validation metric.
[0091] As shown in Figure 2, the method flow also illustrates the execution logic of the inference phase, in which autoregressive decoding is used to generate recommendation results on the validation set, test set, or online data. Details are as follows: Decoding method: The system supports two flexible generation strategies: 1) Conditional generation: When interest tags and behavior tags are given, [user identifier, target behavior tag, target interest tag] is used as the guiding prefix and input to the decoder of the generative sequence model. After processing by the decoder's embedding layer and masked self-attention layer, the current sequence hidden state is obtained.
[0092] 2) Full Autoregressive Generation: In scenarios where no explicit information is given, only the user identifier is input as a guiding prefix to the decoder of the generative sequence model; after processing by the embedding layer and mask self-attention layer of the decoder, the current sequence hidden state is obtained.
[0093] Regardless of the strategy employed, after receiving the current sequence hidden state input, the decoder interacts with the encoder through its internal cross-attention mechanism to extract deep features. Specifically, the decoder uses its currently processed sequence hidden state (i.e., the current sequence hidden state) as the query vector and the context sequence feature matrix output by the encoder as the key and value vectors for attention weight calculation. This dynamically extracts the most relevant contextual information from the user's historical interaction features as conditional constraints. Subsequently, a beam search strategy (e.g., beam size set to 50) is used to directly search the generation space for the Top-K candidate sequences with the highest joint probability, thus ensuring the diversity and accuracy of the generated results.
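The beam search step can be sketched generically as below. The `toy_model` function is a hypothetical stand-in for the decoder: it returns fixed next-tag log-probabilities and ignores the prefix, whereas the real decoder would condition on the prefix and on the encoder output through cross-attention.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Beam = Tuple[List[str], float]


def beam_search(
    prefix: Sequence[str],
    next_log_probs: Callable[[List[str]], Dict[str, float]],
    steps: int,
    beam_size: int,
) -> List[Beam]:
    """Keep the `beam_size` sequences with the highest joint log-probability."""
    beams: List[Beam] = [(list(prefix), 0.0)]
    for _ in range(steps):
        candidates: List[Beam] = []
        for seq, score in beams:
            for token, log_p in next_log_probs(seq).items():
                candidates.append((seq + [token], score + log_p))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams


def toy_model(seq: List[str]) -> Dict[str, float]:
    """Hypothetical decoder stand-in with a fixed two-tag distribution."""
    return {"a": math.log(0.6), "b": math.log(0.4)}


beams = beam_search(["<u_42>"], toy_model, steps=2, beam_size=3)
```

The number of decoding steps would normally equal the item tag sequence length, so each surviving beam is a full candidate item tag sequence following the guiding prefix.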
[0094] Truncating Extraction and Identifier Mapping: After generating candidate sequences, prefix truncation is performed to remove leading prefixes composed of user identifiers, or leading prefixes composed of user identifiers, behavior tags, and interest tags, extracting a clean target item label sequence (i.e., the recommendation sequence). This sequence is then decoded and mapped to recommended items as the recommendation result output. Since generative recommendation directly searches in the discrete label space, the system does not need to score and rank all items. Instead, it directly matches and maps the generated target item label sequence back to the actual item IDs through an identifier mapping dictionary, ultimately outputting a Top-K recommended item list.
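Prefix truncation and identifier mapping can be sketched as follows. The dictionary `code_to_item`, the function names, and the policy of skipping generated sequences that match no known item are assumptions of this sketch; the text states only that the generated item tag sequence is mapped back to item IDs through an identifier mapping dictionary.

```python
from typing import Dict, List, Sequence, Tuple


def truncate_prefix(sequence: Sequence[str], prefix_len: int) -> Tuple[str, ...]:
    """Drop the guiding prefix and keep the item tag sequence."""
    return tuple(sequence[prefix_len:])


def map_to_items(
    candidates: Sequence[Sequence[str]],
    prefix_len: int,
    code_to_item: Dict[Tuple[str, ...], str],
    top_k: int,
) -> List[str]:
    """Map candidates (assumed sorted best first) to at most top_k item IDs."""
    results: List[str] = []
    for seq in candidates:
        item = code_to_item.get(truncate_prefix(seq, prefix_len))
        # Skip tag sequences that match no item, and avoid duplicates.
        if item is not None and item not in results:
            results.append(item)
        if len(results) == top_k:
            break
    return results


code_to_item = {
    ("<c1_7>", "<c2_3>"): "item_A",
    ("<c1_7>", "<c2_5>"): "item_B",
}
prefix = ["<u_42>", "<b_click>", "<int_1>"]
candidates = [
    prefix + ["<c1_7>", "<c2_3>"],
    prefix + ["<c1_9>", "<c2_9>"],  # no matching item in the dictionary
    prefix + ["<c1_7>", "<c2_5>"],
]
recommended = map_to_items(candidates, len(prefix), code_to_item, top_k=2)
```

For fully autoregressive generation the prefix length would be 1 (the user identifier only); for conditional generation it would be 3, as in this example.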
[0095] Evaluation metrics: The recommendation performance of the model is calculated by directly comparing the generated item tag sequences with the actual interacted item tag sequence. Recall@K: Evaluates the recall of the model over the Top-K generated results, that is, the proportion of evaluation cases in which one of the Top-K generated sequences exactly matches the user's actual interacted item tag sequence.
[0096] NDCG@K: Uses the Normalized Discounted Cumulative Gain metric to evaluate the ranking quality of the generated results. NDCG@K considers the position order of the generated sequences in the Beam Search output list, giving higher weight to the more accurate matches that rank higher.
[0097] The evaluation process described above fully leverages the "Direct Retrieval" feature of generative recommendations, eliminating the need for full-ranking of the candidate set in traditional recommendation systems and significantly improving inference efficiency.
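The two metrics can be sketched for the leave-one-out setting, where each evaluation case has exactly one ground-truth item. With a single relevant item, the ideal DCG is 1, so NDCG@K reduces to the reciprocal log-discount of the matching rank. The function names are hypothetical.

```python
import math
from typing import Sequence


def recall_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """1.0 if the target appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """Single-target NDCG@K: 1 / log2(rank + 1) for a hit at 1-based rank."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0


ranked = ["item_A", "item_B", "item_C"]  # beam search output order, best first
hit = recall_at_k(ranked, "item_B", k=2)
gain = ndcg_at_k(ranked, "item_B", k=3)
```

Averaging these per-case values over all evaluated users gives the reported Recall@K and NDCG@K.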
[0098] Compared with traditional technologies, the main advantages of this invention are: This invention constructs a unified modeling and residual quantization tokenization mechanism for multi-source item embedding, which maps multi-source data such as textual information, image information, structured attribute information, and collaborative interaction information of items to a unified embedding space. Furthermore, it generates semantically consistent discrete token sequences of items through a residual quantization variational autoencoder. Simultaneously, by configuring multiple checkpoint tokenizers for the same item and introducing a hybrid tokenization control strategy, the sampling probabilities of multiple sets of semantically related sequences are dynamically adjusted based on gradient influence evaluation during the pre-training stage, and the optimal tokenizer is fixed during the inference stage. Thus, while ensuring the uniqueness of decoding, the model's representation and generalization capabilities for long-tail items and cold-start items are effectively enhanced.
[0099] This invention explicitly introduces a multi-interest modeling and interest labeling mechanism into a generative recommendation framework, decoupling the multiple interest structures implicit in user history interactions into multiple distinguishable interest labels. Furthermore, it introduces an explicit interest allocation strategy at the per-interaction level, ensuring that each user interaction is associated with a specific interest slot. This solves the problem in existing generative recommendation methods where user interests are compressed into a single vector, making it difficult to express parallel interests. Consequently, it significantly improves the diversity, refinement, and interpretability of recommendation results.
[0100] This invention performs behavior tagging on different user behavior types, unifying heterogeneous behaviors such as browsing, clicking, favorites, adding to cart, and purchasing into discrete behavior tags. Starting with the user identifier, it constructs a unified heterogeneous tag sequence of behavior tags, interest tags, and item tags in a predetermined order. This enables generative sequence models to simultaneously learn the evolution of user interests, behavioral differences, and item semantic structures in a unified sequence space, thereby overcoming the shortcomings of existing methods where multiple behavior signals are only used as weights or feature inputs and fail to participate in the modeling of the generative target.
[0101] This invention employs an autoregressive generative sequence model to perform end-to-end modeling of heterogeneous labeled sequences, constructing a unified generation paradigm of "behavioral tag - interest tag - item tag" starting with user identifiers. During the training phase, the model efficiently learns the sequence distribution in parallel through a teacher forcing strategy. During the inference phase, the model supports not only a fully autoregressive joint generation method, using user identifiers as guiding prefixes to obtain item recommendations, but also a conditionally guided generation method, using user identifiers, behavioral tags, and interest tags as guiding prefixes to obtain item recommendations. This dual-mode mechanism explicitly reflects the causal influence of user behavioral intent and interest state on the recommendation results, avoiding the semantic jump problem of "directly predicting items" in traditional recommendation methods, and endowing the recommendation system with powerful controllable generation capabilities and interpretability.
[0102] This invention introduces a multi-level residual quantization mechanism during the item tag generation stage, which decomposes the item semantics into multiple discrete tags for common expression, avoiding the representation sparsity problem caused by a single item identifier. This provides a more stable and learnable representation basis for low-frequency items in the recommendation generation space, and significantly improves the coverage and exposure fairness of long-tail items in generative recommendations.
[0103] In summary, the core of this invention lies in simultaneously proposing a unified quantitative representation method for multi-source item embedding, a hybrid tokenization control mechanism oriented towards multiple checkpoints, a heterogeneous sequence construction method that starts with user identifiers and jointly models behavior tags, interest tags, and item tags, and a generative recommendation modeling framework oriented towards multiple behaviors and interests. Through the above technical solutions, this invention achieves collaborative modeling of heterogeneous user behavior signals, multi-interest structures, and multi-source semantics of items under a unified generative framework. It solves the problems of insufficient interest expression, missing behavioral semantics, and weak long-tail generalization ability in traditional generative recommendations. It has good generalization ability, interpretability, and engineering feasibility, and is suitable as a core generative recommendation solution for large-scale recommendation systems.
[0104] In the description of this specification, the terms "one embodiment," "some embodiments," "specific embodiment," etc., refer to a specific feature, structure, material, or characteristic described in connection with that embodiment or example, which is included in at least one embodiment or example of the present invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.
[0105] Although embodiments of the invention have been shown and described, those skilled in the art will understand that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
Claims
1. A multi-behavior, multi-interest generative recommendation method based on multi-source item embedding and hybrid tokenization, characterized in that it comprises:
1) Collecting multi-source item information for users and constructing a multi-source dataset; the multi-source item information includes at least text information, image information, structured attribute information, and collaborative interaction information; the collaborative interaction information includes historical interaction data and interaction behavior data;
2) Preprocessing the multi-source dataset to obtain a multi-source input set in a unified format;
3) After feature encoding and projection alignment of each information input from the multi-source input set, obtaining a unified item semantic vector through weighted fusion;
4) Performing multi-level residual quantization on the unified item semantic vector to output an initial item tag composed of multiple discrete tags;
5) Constructing multiple groups of semantically related item tags from the initial item tag using a hybrid tokenization mechanism, and selecting the best one as the final item tag output;
6) Performing multi-interest modeling on the historical interaction data to generate interest tags;
7) Performing behavior tokenization on the interaction behavior data to generate behavior tags;
8) Starting with the user identifier, concatenating the behavior tag, interest tag, and final item tag sequentially to obtain a heterogeneous tag sequence;
9) Inputting the heterogeneous tag sequence into the encoder of the pre-trained generative sequence model to obtain the context sequence feature matrix;
10) Using the user identifier, either alone or concatenated with the behavior tag and interest tag, as a guiding prefix and inputting it into the decoder of the generative sequence model to obtain the current sequence hidden state; the decoder then calculates the attention weights based on the current sequence hidden state and the context sequence feature matrix, generates candidate sequences autoregressively with beam search, obtains the target item tag sequence through prefix truncation, and finally decodes and maps it to the recommended items as the recommendation result output.
2. The multi-behavior, multi-interest generative recommendation method based on multi-source item embedding and hybrid tokenization according to claim 1, characterized in that, in step 3, the continuous embedding vector of the item is computed and projection-aligned for each source, that is: $e_v^{(s)}=f^{(s)}(x_v^{(s)})$; $\tilde{e}_v^{(s)}=\mathrm{Proj}^{(s)}(e_v^{(s)})$; where $x_v^{(s)}$ is the input of the item from source $s$, $f^{(s)}$ is the encoder for that source, $e_v^{(s)}$ is the continuous embedding vector of the item, $\mathrm{Proj}^{(s)}$ is the projection network, and $\tilde{e}_v^{(s)}$ is the continuous embedding vector of the item in the unified embedding space; in step 3, the continuous embedding vectors of the item in the unified embedding space are fused with learnable weights to obtain the unified item semantic vector, i.e.: $h_v=\sum_{s\in S}\alpha_s\,\tilde{e}_v^{(s)}$, with the source weight $\alpha_s$ computed from the query vector $q$ and the learnable matrix $W$; where $h_v$ is the unified item semantic vector, $\tilde{e}_v^{(s)}$ is the continuous embedding vector of the item in the unified embedding space, $\alpha_s$ is the source weight, $q$ is the query vector, $W$ is a learnable matrix, and $S$ is the source set.
3. The multi-behavior, multi-interest generative recommendation method based on multi-source item embedding and hybrid tokenization according to claim 1, characterized in that the process of multi-level residual quantization in step 4 is as follows: 4.1) Initialize the residual vector: $r^{(1)}=h_v$; 4.2) In the $\ell$-th level codebook $E_\ell$, select the index of the codebook vector that best matches the current residual to obtain the $\ell$-th level discrete tag $c_{v,\ell}$: $c_{v,\ell}=\arg\min_{k}\lVert r^{(\ell)}-E_\ell[k]\rVert$; 4.3) Update the residual vector: $r^{(\ell+1)}=r^{(\ell)}-E_\ell[c_{v,\ell}]$; 4.4) Final output: the initial item tag sequence $[c_{v,1},\dots,c_{v,m}]$.
4. The multi-behavior, multi-interest generative recommendation method based on multi-source item embedding and hybrid tokenization according to claim 1, characterized in that, in step 5, the final item tag sequence is generated as follows: 5.1) Extract multiple model checkpoints from the residual quantization variational autoencoder at different training epochs as a set of semantically related tokenizers $T=\{T_1,\dots,T_M\}$; the tokenizer set is used to construct, for the same initial item tag sequence, M groups of item tag sequences that differ in representation but are semantically related; 5.2) In the pre-training phase of the generative sequence model, the influence score of each group of item tag sequences on model optimization is evaluated by calculating the inner product of the model's loss gradient on the validation set and the training loss gradient on each group of item tag sequences; the sampling probability of each group of item tag sequences in the next round of training is then adjusted dynamically from the influence score using a normalization function with a temperature coefficient and a momentum update strategy; 5.3) During the fine-tuning and inference phases, the checkpoint in the tokenizer set $T$ with the highest sampling probability at the end of pre-training, or the smallest reconstruction error on the independent validation set, is fixed as the best-performing target tokenizer, and the final item tag is output by the target tokenizer.
5. The multi-behavior, multi-interest generative recommendation method based on multi-source item embedding and hybrid tokenization according to claim 1, characterized in that, in step 6, the process of generating interest tags is as follows: 6.1) Map each interaction representation to a low-level capsule prediction vector: $u_{i,k}=W_k x_i$; where $u_{i,k}$ is the prediction vector of the low-level capsule, $W_k$ is a learnable mapping matrix or shared matrix, and $x_i$ is the representation vector of the i-th interaction in the user's historical interaction data; 6.2) Initialize the routing logits and perform dynamic routing iterations, calculating the routing coefficients in each iteration: $a_{i,k}=\exp(b_{i,k})/\sum_{k'=1}^{K}\exp(b_{i,k'})$; where $a_{i,k}$ is the routing coefficient and $b_{i,k}$ is the routing logit, initialized to zero; 6.3) Aggregate the low-level capsule prediction vectors $u_{i,k}$ with these weights to obtain the input of the k-th interest slot, i.e. $s_k=\sum_i a_{i,k}\,u_{i,k}$; 6.4) Obtain the interest vector of the k-th interest slot by applying a nonlinear compression (squash) function to the result of step 6.3, i.e. $v_k=\mathrm{squash}(s_k)$; 6.5) Update the routing logits $b_{i,k}$ based on agreement, i.e. $b_{i,k}\leftarrow b_{i,k}+u_{i,k}^{\top}v_k$; 6.6) Finally, the K interest slot vectors $\{v_k\}_{k=1}^{K}$ and the routing coefficients $\{a_{i,k}\}$ are obtained; 6.7) Based on the routing coefficients, perform explicit interest assignment for each interaction to obtain the interest slot index $z_i$ of the i-th interaction, i.e. $z_i=\arg\max_k a_{i,k}$; 6.8) Map the interest slot index $z_i$ to a discrete interest tag $z_t(z_i)$ via a lookup table; where $z_t(\cdot)$ is a mapping table from interest slot index to token ID, and $K$ is a preset positive integer.
6. The multi-behavior, multi-interest generative recommendation method based on multi-source item embedding and hybrid tokenization according to claim 1, characterized in that the interaction behavior data includes behavior types, and in step 7 the behavior tokenization process includes: based on the set of behavior types B = {b_1, ..., b_|B|}, establishing a unique behavior tag for each behavior type, that is, b_t(b_i) = BehaviorVocab(b_i), where b_i is the behavior type of the i-th interaction, b_t(b_i) is the discrete behavior tag corresponding to that behavior type, and BehaviorVocab is the mapping table from behavior types to tags.
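The behavior vocabulary described in this claim amounts to a one-to-one lookup table, as in the following small sketch. The behavior names and the `<beh_n>` tag format are hypothetical; the claim only requires that each behavior type receive a unique tag.

```python
def build_behavior_vocab(behavior_types):
    """Assign one unique tag per behavior type, in first-seen order."""
    vocab = {}
    for behavior in behavior_types:
        if behavior not in vocab:
            vocab[behavior] = f"<beh_{len(vocab)}>"
    return vocab

def tokenize_behaviors(interaction_behaviors, vocab):
    """Map the behavior type b_i of each interaction to its tag b_t(b_i)."""
    return [vocab[behavior] for behavior in interaction_behaviors]

behavior_vocab = build_behavior_vocab(["click", "cart", "purchase"])
behavior_tags = tokenize_behaviors(["click", "click", "purchase"], behavior_vocab)
print(behavior_tags)
```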
7. The multi-behavior, multi-interest generative recommendation method based on multi-source item embedding and hybrid tokenization according to claim 1, characterized in that, in step 8, the heterogeneous tag sequence is generated as follows: define the concatenation and ordering rules for user identifiers, behaviors, interests, and items, and then sequentially concatenate the user identifier, the behavior tag, the interest tag, and the final item tag sequence to construct a heterogeneous tag fragment, in which u_t(u) denotes the discrete tag corresponding to the user identifier u; b_t(b_i) is the behavior tag corresponding to the behavior type b_i of the i-th interaction; z_t(z_i) is the interest tag corresponding to the interest slot index z_i; c_{i,ℓ} is the discrete code obtained at the ℓ-th level of residual quantization for the item of the i-th interaction; i_t(·) denotes the mapping from discrete item codes to item tags; and L denotes the length of the item tag sequence.
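One possible reading of the fragment construction is sketched below, with the user tag, behavior tag, and interest tag followed by the L item tags of the interaction. The fragment formula itself is not reproduced in this text, so the exact ordering and the choice to repeat the user tag in every fragment are assumptions. The tag formats and the helper names `item_tag` and `build_fragment` are hypothetical.

```python
def item_tag(level, code):
    """Hypothetical i_t: map the code c_{i,l} at quantization level l to an item tag."""
    return f"<item_{level}_{code}>"

def build_fragment(user_tag, behavior_tag, interest_tag, item_codes):
    """Concatenate [u_t(u), b_t(b_i), z_t(z_i), i_t(c_{i,1}), ..., i_t(c_{i,L})]."""
    item_tags = [item_tag(level, code) for level, code in enumerate(item_codes, start=1)]
    return [user_tag, behavior_tag, interest_tag] + item_tags

def build_sequence(user_tag, interactions):
    """Concatenate one fragment per interaction, in interaction order."""
    sequence = []
    for behavior_tag, interest_tag, item_codes in interactions:
        sequence.extend(build_fragment(user_tag, behavior_tag, interest_tag, item_codes))
    return sequence

fragment = build_fragment("<user_7>", "<beh_0>", "<int_1>", [12, 3, 45])
print(fragment)
```

With L = 3 quantization levels, each fragment has 3 + L = 6 tags, so a history of n interactions yields a sequence of 6n tags under this reading.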
8. The multi-behavior, multi-interest generative recommendation method based on multi-source item embedding and hybrid tokenization according to claim 1, characterized in that, in steps 9 and 10, the encoder's processing of the heterogeneous tag sequence includes: first, converting the discrete tags in the heterogeneous tag sequence into continuous vector representations through an embedding layer; then, through an encoder module composed of a self-attention mechanism and a feedforward neural network, computing the global attention weights between each tag and all other tags in the heterogeneous tag sequence, thereby realizing deep feature interaction and outputting a context sequence feature matrix that integrates global historical interaction information and multiple intent logic. When interest tags and behavior tags are given, the decoder adopts a conditional guided generation strategy: the user identifier is used as the starting point and is concatenated with the behavior tag and the interest tag to form a combined sequence that serves as the guiding prefix; the guiding prefix is fed into the bottom layer of the decoder as its first input and, after processing by the decoder's embedding layer and masked self-attention layer, yields the current sequence hidden state. When no interest or behavior tags are given, the decoder adopts a fully autoregressive generation strategy: the user identifier alone is used as the guiding prefix; the guiding prefix is fed into the bottom layer of the decoder as its first input and, after processing by the decoder's embedding layer and masked self-attention layer, yields the current sequence hidden state.
The context sequence feature matrix is used as the second input to the decoder and is introduced into the intermediate layer of the decoder through a cross-attention mechanism. The process of obtaining the target item tag sequence includes: the decoder uses the current sequence hidden state as the query vector and the context sequence feature matrix as the key and value vectors to compute the cross-attention weights; then, through autoregressive beam search, the top-K item tag sequences that contain the guiding prefix and have the highest joint probability are generated as candidate sequences; subsequently, the guiding prefix is truncated from each candidate sequence to extract the target item tag sequence, which is then decoded and mapped to recommended items that are output as the recommendation result.
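The prefix-guided beam search and prefix truncation described above can be sketched independently of the Transformer itself. In the sketch below, a hand-written table of conditional probabilities (`toy_next_log_probs`) stands in for the decoder with cross-attention; it is purely illustrative and is not part of the claimed method. The tag names, the item dictionary, and an item tag length of 2 are likewise hypothetical.

```python
import math

def beam_search(prefix, next_log_probs, num_steps, beam_width):
    """Extend the guiding prefix token by token, keeping the best beam_width sequences."""
    beams = [(0.0, list(prefix))]
    for _ in range(num_steps):
        candidates = []
        for score, seq in beams:
            for token, log_p in next_log_probs(seq).items():
                candidates.append((score + log_p, seq + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams

def toy_next_log_probs(seq):
    """Stand-in for the decoder: next-tag distribution given the sequence so far."""
    last = seq[-1]
    if last == "a1":
        table = {"b1": 0.9, "b2": 0.1}
    elif last == "a2":
        table = {"b1": 0.4, "b2": 0.6}
    elif "<int_0>" in seq:  # first item tag, conditioned on a given interest tag
        table = {"a1": 0.7, "a2": 0.3}
    else:                   # first item tag, no interest tag in the prefix
        table = {"a1": 0.2, "a2": 0.8}
    return {token: math.log(p) for token, p in table.items()}

# Hypothetical identifier mapping dictionary from item tag sequences to items.
ITEM_DICT = {("a1", "b1"): "item_A", ("a2", "b2"): "item_B",
             ("a2", "b1"): "item_C", ("a1", "b2"): "item_D"}

def recommend(prefix, top_k=2, item_len=2):
    beams = beam_search(prefix, toy_next_log_probs, item_len, top_k)
    # Prefix truncation: keep only the generated item tags, then map them to items.
    return [ITEM_DICT[tuple(seq[len(prefix):])] for _, seq in beams]

conditional = recommend(["<user_7>", "<beh_0>", "<int_0>"])  # conditional guided strategy
autoregressive = recommend(["<user_7>"])                     # fully autoregressive strategy
print(conditional, autoregressive)
```

The two calls differ only in the guiding prefix, which is the point of the dual strategy: the same decoder produces interest-specific candidates when behavior and interest tags are supplied and general candidates when only the user identifier is given.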
9. The multi-behavior, multi-interest generative recommendation method based on multi-source item embedding and hybrid tokenization according to claim 1, characterized in that, during training, the optimization objectives of the generative sequence model include a generation loss, an alignment regularization term, and a multi-task loss; the overall loss function of the generative sequence model comprises a weighted sum of the generation loss, the alignment loss, and the multi-task loss; and the weight of each loss term is tuned by methods such as cross-validation so that the loss of each task is effectively optimized during training.
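As a small illustration of the weighted combination, the sketch below sums three named loss terms under per-term weights. The term names, the example loss values, and the weights are hypothetical; in practice the weights would be chosen by a procedure such as the cross-validation mentioned in the claim.

```python
def combined_loss(losses, weights):
    """Weighted sum of loss terms; every loss term must have a weight."""
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight given for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in losses.items())

example_losses = {"generation": 2.0, "alignment": 0.5, "multi_task": 1.0}
example_weights = {"generation": 1.0, "alignment": 0.1, "multi_task": 0.5}
total = combined_loss(example_losses, example_weights)
print(total)
```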
10. A system employing the multi-behavior, multi-interest generative recommendation method based on multi-source item embedding and hybrid tokenization as described in any one of claims 1-9, characterized in that it includes:
The data acquisition module is used to collect multi-source information about user-related items;
The data preprocessing module is used to clean, format, and standardize the collected multi-source information dataset to obtain a multi-source input set in a unified format, ensuring that all data inputs conform to a unified standard;
The multi-source embedding construction module is used to encode the features of each input in the multi-source input set to obtain continuous item embedding vectors from different sources;
The projection alignment and fusion module is used to apply projection and learnable weighted fusion to the continuous item embedding vectors to obtain a unified item semantic vector;
The RQ-VAE tokenization module, consisting of a residual quantization variational autoencoder, is used to transform the unified item semantic vector into a discrete initial item tag sequence;
The hybrid tokenization control module is used to configure multiple tokenizers for the initial item tag sequence to construct multiple groups of semantically related item tag sequences, and to select the best one as the final item tag sequence;
The multi-interest extraction and interest tagging module is used to perform multi-interest modeling based on a capsule network and dynamic routing on the user's historical interaction data, extract K user interest slots, and output the corresponding interest tag tokens;
The behavior tokenization module is used to transform user interaction behavior data into discrete behavior tags;
The heterogeneous sequence construction module is used to concatenate user identifiers, behavior tags, interest tags, and final item tags into a unified heterogeneous tag sequence according to the set ordering rules;
The generative sequence model employs a Transformer architecture, in which: the encoder is used to process the heterogeneous tag sequence to output a context sequence feature matrix that integrates global historical interaction information and multiple intent logic; the decoder employs a dual-input, cross-attention interaction mechanism and supports two generation strategies, namely a conditional guided generation strategy and a fully autoregressive generation strategy. The decoder is responsible for: when interest and behavior tags are given, employing the conditional guided generation strategy, in which the user identifier is the starting point and is concatenated with the behavior and interest tags to form a combined sequence that serves as the guiding prefix, or, when no interest or behavior tags are given, employing the fully autoregressive generation strategy, in which the user identifier is used as the guiding prefix; using the guiding prefix as the first input to the decoder and feeding it into the decoder's bottom layer, where it is processed by the embedding layer and the masked self-attention layer to obtain the current sequence hidden state; using the context sequence feature matrix as the second input to the decoder and introducing it into the decoder's intermediate layer through a cross-attention mechanism; using the current sequence hidden state as the query vector and the context sequence feature matrix as the key and value vectors to compute the cross-attention weights; then generating, through autoregressive beam search, the top-K item tag sequences that contain the guiding prefix and have the highest joint probability as candidate sequences; and subsequently truncating the guiding prefix from the candidate sequences to extract the target item tag sequence;
The decoding output module is used to decode and map the generated target item tag sequence based on a pre-built identifier mapping dictionary to obtain recommended items and output them as recommendation results.
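The quantization stage of the RQ-VAE tokenization module named in this claim can be illustrated with a minimal residual quantization sketch. It covers only the nearest-codeword lookup at each level and omits the variational autoencoder's encoder, decoder, and training. The codebooks, the input vector, and the function names are hypothetical.

```python
def nearest_code(vector, codebook):
    """Index of the codeword closest to the vector in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda j: sq_dist(vector, codebook[j]))

def residual_quantize(vector, codebooks):
    """Quantize level by level, passing the remaining residual to the next level."""
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        j = nearest_code(residual, codebook)
        codes.append(j)
        residual = [r - c for r, c in zip(residual, codebook[j])]
    return codes, residual

# Two quantization levels, each with a tiny 2-entry codebook over 2-d vectors.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],      # level 1: coarse codewords
    [[0.0, 0.0], [0.25, -0.25]],   # level 2: refinements of the level-1 residual
]
codes, residual = residual_quantize([1.2, 0.8], codebooks)
print(codes, residual)
```

Each level contributes one discrete code, so an item semantic vector becomes a tag sequence whose length equals the number of levels; coarse levels capture shared semantics and later levels refine them.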