A user interest profile prediction method based on cross-modal semantic consistency regulation

By employing a cross-modal semantic consistency control method, adaptive fusion of image and text features is achieved, which solves the problem of insufficient cross-modal semantic alignment capability in existing technologies, improves the accuracy and robustness of user interest prediction, and performs particularly well in fine-grained classification tasks.

CN122241440A · Pending · Publication Date: 2026-06-19 · 闽南科技学院

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Application (China)
Current Assignee / Owner
闽南科技学院
Filing Date
2026-05-18
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies struggle to effectively integrate image and text information and lack cross-modal semantic alignment capabilities, resulting in low accuracy and poor robustness in fine-grained user interest classification. In particular, they struggle to achieve high-precision classification when categories are similar, semantic dependencies are strong, and image quality is unstable.

Method used

By employing a cross-modal semantic consistency control method, feature mapping units and gating functions are used for cross-modal feature alignment. Combined with an adaptive residual structure and a competitive weight allocation mechanism, adaptive fusion and dynamic control of multimodal features are achieved, thereby improving cross-modal semantic alignment capability and model robustness.

Benefits of technology

It improves the accuracy and robustness of user interest prediction in fine-grained classification tasks with 24 or more interest categories, outperforming traditional single-modal and conventional multimodal fusion methods.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention relates to the fields of user profiling and multimodal deep learning, specifically to a user interest profile prediction method based on cross-modal semantic consistency regulation, comprising the following steps: S1: acquiring image data and corresponding text data of the user to be predicted, assigning interest tags based on the image data, and obtaining a text set corresponding to the image data based on the text data; S2: extracting the image visual features of the image data using a pre-trained convolutional neural network, performing global average pooling and linear transformation on the image visual features to obtain visual features, and extracting the semantic features of the text set using text representation methods to obtain text features; S3: performing cross-modal adaptive fusion of the visual features and the text features to obtain fused features; S4: inputting the fused features into a fully connected layer for mapping and outputting the user interest category and its probability, as well as the user interest prediction result, thereby improving the accuracy of user interest prediction.

Description

Technical Field

[0001] This invention relates to the fields of user profiling and multimodal deep learning, specifically to a user interest profile prediction method based on cross-modal semantic consistency regulation.

Background Technology

[0002] With the rapid development of social networking platforms and image-sharing applications, users continuously publish a large number of images and related descriptive information on these platforms. This data contains rich user interest characteristics. User interest profiling prediction technology based on image and text information plays an important role in scenarios such as personalized recommendation, precise advertising, and content distribution. Existing user interest prediction methods mainly rely on single-modal information, extracting visual features through image convolutional neural networks (such as ResNet) or extracting semantic features through text representation models (such as TF-IDF, Word2Vec, or BERT) and then classifying interests.

[0003] However, methods using only a single modality have limited accuracy when dealing with fine-grained interest categories, especially in the following situations. (1) Images have high semantic similarity and are difficult to distinguish by visual means alone: different interest categories may be highly similar visually. For example, the food and beverage categories may both include dining scenes, and the outdoor and travel categories may both include natural landscapes and outdoor activities. It is difficult to distinguish these fine-grained interest categories by relying solely on visual information, which easily leads to classification confusion. (2) Interest identification relies on semantic information, while visual information is insufficient: the determination of some user interests depends on textual semantics, such as the categories "business," "education," and "culture." These categories are difficult to identify accurately from images alone and require semantic supplementation from the keywords, tags, or descriptive text associated with the images. Otherwise, the model will have difficulty capturing high-level semantic information. (3) Complex image content or noise leads to unstable visual features: images in social networks often have complex backgrounds, unclear subjects, severe occlusion, or inconsistent shooting quality, which makes the visual feature expression unstable and thus reduces the accuracy of interest classification.

[0004] While existing research attempts to combine image and text information for user interest prediction, it still suffers from the following shortcomings: the lack of a structured mechanism for processing image-text pairing data makes it difficult to construct high-quality multimodal training samples; multimodal models have complex structures and high computational costs, hindering practical deployment; and cross-modal semantic alignment capabilities are insufficient, with most methods remaining at the feature concatenation level and failing to achieve deep semantic interaction between images and text, which limits model performance. As a result, existing technologies struggle to achieve high-precision, fine-grained user interest classification in complex social network image scenarios, especially when categories are similar, semantic dependence is strong, and image quality is unstable.

[0005] How to effectively integrate image and text information, improve cross-modal semantic alignment capabilities, and achieve high accuracy and robustness in fine-grained interest classification tasks with 24 or more categories has become an urgent technical problem to be solved.

Summary of the Invention

[0006] The purpose of this invention is to provide a user interest profile prediction method based on cross-modal semantic consistency regulation to improve the accuracy of user interest prediction.

[0007] To achieve the above objective, the present invention adopts the following technical solution: a user interest profile prediction method based on cross-modal semantic consistency regulation, including the following steps performed sequentially:
S1: Obtain the image data and corresponding text data of the user to be predicted, assign interest tags based on the image data, and obtain the text set corresponding to the image data based on the text data;
S2: Use a pre-trained convolutional neural network to extract the image visual features of the image data and obtain the visual features, and use text representation methods to extract the semantic features of the text set and obtain the text features;
S3: Perform cross-modal adaptive fusion of the visual feature and the text feature. The specific steps are as follows:
S3-1: The text feature is mapped to the same feature space as the visual feature through the feature mapping unit. A gating function based on cross-modal semantic consistency and feature differences is then used to perform a semantically guided update of the visual feature, so as to achieve cross-modal feature alignment;
S3-2: An adaptive residual structure is introduced to achieve dynamic control of the enhanced features;
S3-3: In the fusion process, a competitive weight allocation mechanism is adopted, which takes visual features, text features and cross-modal interaction features as different semantic expression paths, and performs competitive selection and dynamic weighting of each feature based on the normalized weight generation function, thereby realizing the adaptive fusion of multimodal features at different semantic levels and obtaining fused features;
S4: Input the fused feature into the fully connected layer for mapping, and output the user interest category and its probability, as well as the user interest prediction result, through the activation function.

[0008] Preferably, the text data is obtained through keyword extraction or image description.

[0009] Preferably, the interest category is a fine-grained classification task, with no fewer than 24 categories.

[0010] Preferably, step S1 includes resizing and normalizing the image data.

[0011] Preferably, step S1 includes cleaning processes such as word segmentation, stop word removal, noise character removal, duplicate content filtering, and text normalization of the text data.

[0012] Preferably, the specific steps of step S3 include the following processing flow:
S3-1: Calculate the first difference feature between the visual feature and the text feature, aggregate the first difference feature to obtain the difference measure, calculate the semantic similarity between the visual feature and the text feature, normalize the semantic similarity to obtain the normalized semantic similarity, construct a gating function based on the semantic similarity and the difference measure, and introduce an adaptive adjustment factor γ to calibrate the gating function and obtain the gating coefficient. The gating coefficient is then used to directionally guide the update of the visual feature, so that the visual feature is adaptively adjusted along the semantic direction of the text to obtain the guiding feature;
S3-2: Calculate the second difference feature between the guiding feature and the visual feature, aggregate the second difference feature to obtain the second difference measure, calculate the semantic similarity between the second difference feature and the guiding feature, calculate the residual feature between the guiding feature and the visual feature, construct the residual adjustment coefficient based on the second difference feature and the semantic similarity, modulate the residual feature using the residual adjustment coefficient to obtain the modulated residual feature, and perform residual fusion with the visual feature to obtain the enhanced visual feature;
S3-3: Construct the visual feature, the text feature, and the cross-modal interaction feature between the visual feature and the text feature into a visual semantic branch, a text semantic branch, and an interaction semantic branch respectively, and perform multi-semantic path feature aggregation. Based on each semantic branch, branch competition weights are constructed. The branch weights are normalized using a competitive normalization function to obtain the competition weights corresponding to each semantic branch. Based on each competition weight, the corresponding semantic branches are competitively weighted and fused to obtain the fused features.

[0013] By adopting the aforementioned design scheme, the beneficial effects of the present invention are as follows. The present application improves the cross-modal semantic alignment capability by using a gating function constructed based on cross-modal semantic consistency and feature differences to perform semantic guidance updates on visual features. By introducing an adaptive residual structure, dynamic control of the enhanced features can be achieved, avoiding over-enhancement or information suppression problems and improving the robustness of the model. By employing a competitive weight allocation mechanism during the fusion process, visual features, text features, and cross-modal interaction features are used as different semantic expression paths. Based on a normalized weight generation function, each feature is competitively selected and dynamically weighted, thereby achieving adaptive fusion of multimodal features at different semantic levels. Unlike traditional methods based on fixed fusion strategies or simple weighted summation, this mechanism achieves dynamic adjustment of the contributions of different modalities through multi-branch competition, improving the expressive power and robustness of the fused representation.

Attached Figure Description

[0014] Figure 1 is a flowchart of the prediction method of the present invention; Figure 2 is a flowchart of the cross-modal semantic consistency gating process of the present invention; Figure 3 is a flowchart of the adaptive residual enhancement process of the present invention; Figure 4 is a flowchart of the multi-branch competitive fusion process of the present invention.

Detailed Implementation

[0015] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are merely some embodiments of this invention, and not all embodiments. Based on the embodiments of this invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this invention.

[0016] The terms "first," "second," "third," etc., used in the specification, claims, and accompanying drawings of this invention are used to distinguish different objects, not to describe a specific order. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or apparatus that includes a series of steps or units is not limited to the listed steps or units, but may optionally include steps or units not listed, or may optionally include other steps or units inherent to these processes, methods, products, or apparatuses.

[0017] As shown in Figure 1, a user interest profile prediction method based on cross-modal semantic consistency regulation comprises the following steps, executed sequentially. S1: Obtain the image data and corresponding text data of the user to be predicted, preprocess the image data and assign interest tags, and clean the text data to obtain the text set corresponding to the image data. In this embodiment, the image data of the user to be predicted and the corresponding text data are obtained from a social network platform. The text data includes image keywords, tags and/or user description information, and is obtained through keyword extraction or image description. The image data is preprocessed, including resizing (e.g., to 224×224) and normalization, to meet the input requirements of the visual feature extraction model. The text data cleaning process includes word segmentation, stop word removal, noise character removal, duplicate content filtering, and text normalization. Based on the interest category to which the image belongs, interest tags are assigned to the image and its corresponding text to construct image-text-tag triplet data.
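The text cleaning and triplet construction of S1 can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the patent: the `STOP_WORDS` list, the `clean_text` and `build_triplet` helpers, the file name, and the use of lowercase English whitespace splitting as a stand-in for word segmentation (Chinese text would need a dedicated segmenter).

```python
import re

# Hypothetical stop-word list; a real system would load a language-specific one.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "on", "at", "with"}

def clean_text(raw_items):
    """Clean raw keyword/description strings into a de-duplicated token list."""
    tokens = []
    seen = set()
    for item in raw_items:
        text = item.lower()                        # text normalization
        text = re.sub(r"[^a-z0-9\s]", " ", text)   # noise character removal
        for token in text.split():                 # whitespace word segmentation (assumed)
            if token in STOP_WORDS:                # stop word removal
                continue
            if token in seen:                      # duplicate content filtering
                continue
            seen.add(token)
            tokens.append(token)
    return tokens

def build_triplet(image_path, raw_items, interest_label):
    """Assemble one image-text-tag triplet (hypothetical tuple layout)."""
    return (image_path, clean_text(raw_items), interest_label)

triplet = build_triplet("img_001.jpg", ["Sunset at the beach!!", "#beach #travel"], "travel")
print(triplet)
```

The order of the cleaning operations is a design choice of this sketch; the source lists the operations without fixing their order.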

[0018] S2: Use a pre-trained convolutional neural network ResNet50 to extract the visual features of this image data. During the image visual feature extraction process, the parameters of the pre-trained convolutional neural network ResNet50 are frozen, and only the newly added network layer is trained. The newly added network layer includes a global average pooling layer and at least one fully connected layer, which is used to reduce the dimensionality and perform feature mapping on the feature vectors extracted by ResNet50 to obtain a visual feature representation suitable for subsequent multimodal fusion.
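As a rough illustration of the newly added trainable layers, the sketch below applies global average pooling to a toy feature map and passes the result through one fully connected layer with ReLU. It is written in plain Python rather than a deep learning framework, and the feature map, `weights`, and `bias` values are made-up stand-ins for the frozen ResNet50 output and the trainable parameters.

```python
def global_average_pool(feature_map):
    """Collapse a C x H x W feature map (nested lists) into a C-length vector."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

def linear_relu(vector, weights, bias):
    """Fully connected layer followed by ReLU; weights is out_dim x in_dim."""
    out = []
    for w_row, b in zip(weights, bias):
        z = sum(w * x for w, x in zip(w_row, vector)) + b
        out.append(max(0.0, z))
    return out

# Toy 2-channel, 2x2 feature map standing in for the frozen backbone output.
feature_map = [
    [[1.0, 3.0], [5.0, 7.0]],   # channel 0, mean 4.0
    [[2.0, 2.0], [2.0, 2.0]],   # channel 1, mean 2.0
]
weights = [[0.5, 0.5], [1.0, -3.0]]  # hypothetical trainable parameters
bias = [0.0, 0.0]

channel_vector = global_average_pool(feature_map)
visual_feature = linear_relu(channel_vector, weights, bias)
print(channel_vector, visual_feature)
```

Only `weights` and `bias` would be updated during training in this picture, which mirrors the statement that the backbone parameters stay frozen.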

[0019] Global average pooling is performed on the feature map output by the convolutional layers to compress the spatial dimensions into a channel feature vector. The channel feature vector is input into a fully connected layer for linear mapping and then passed through the ReLU function for nonlinear transformation. A Dropout mechanism (e.g., a dropout rate of 0.5) is introduced during training to prevent overfitting. A channel attention mechanism is also introduced: global average pooling and global max pooling are performed on the feature map, and channel weights are generated through a fully connected layer to weight and enhance the original features, yielding the visual features and improving the expression of key features. The semantic features of the text set are extracted using text representation methods to obtain text features. In this embodiment, the semantic features are extracted as follows: term frequency (TF) and inverse document frequency (IDF) are calculated to obtain a weighted text representation, and the feature word dimension is set to 5000 to control the size of the feature space. The text is thereby converted into a fixed-dimensional vector representation, i.e., the text features.
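The TF-IDF text representation with a capped vocabulary can be sketched as follows. The source does not state which TF or IDF variant is used, so the length-normalized term frequency, the `log(N / df) + 1` form of IDF, and the rule of keeping the terms with the highest document frequency are all assumptions of this sketch; `tfidf_vectors` is a hypothetical helper name.

```python
import math
from collections import Counter

def tfidf_vectors(documents, max_features=5000):
    """Return (vocabulary, vectors) for tokenized documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))                  # count each term once per document
    # Keep the terms with the highest document frequency; ties are broken alphabetically.
    vocab = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))[:max_features]
    n_docs = len(documents)
    idf = {t: math.log(n_docs / doc_freq[t]) + 1.0 for t in vocab}  # assumed IDF variant
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc) or 1                      # avoid division by zero on empty text
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors

docs = [["beach", "travel", "sunset"], ["food", "travel"], ["food", "coffee", "coffee"]]
vocab, vectors = tfidf_vectors(docs, max_features=4)
print(vocab)
print(vectors[0])
```

With `max_features=5000`, as in the embodiment, every text would map to a vector of at most 5000 dimensions regardless of its length.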

[0020] S3: Perform cross-modal adaptive fusion of the visual feature and the text feature. The specific steps are as follows: S3-1: As shown in Figure 2, the text feature is mapped to the same feature space as the visual feature through the feature mapping unit. A gating function based on cross-modal semantic consistency and feature differences is used to perform semantic-guided updates on the visual feature to achieve cross-modal feature alignment. The first difference feature between the visual feature vector and the text feature vector is calculated [formula missing in source].

[0021] The first difference feature is aggregated to obtain the difference measure [formula missing in source], where mean() represents the mean aggregation function.

[0022] The semantic similarity between the visual feature and the text feature is calculated [formula missing in source], where cosine() represents the cosine similarity calculation function.

[0023] The semantic similarity is normalized to obtain the normalized semantic similarity [formula missing in source]. A gating function is constructed based on this semantic similarity and this difference measure [formula missing in source], where exp() represents the exponential mapping function and τ represents the difference decay adjustment parameter.

[0024] An adaptive adjustment factor γ is introduced to calibrate the gating function [formula missing in source], where G represents the calibrated gating coefficient.

[0025] The calibrated gating coefficient G directionally guides the update of the visual feature, so that the visual feature is adaptively adjusted along the semantic direction of the text to obtain the guiding feature [formula missing in source].
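Because the formulas of S3-1 are not available in this text, the following is only one plausible reading of the gating step, not the patent's actual equations. The assumed forms are: an element-wise absolute difference, a similarity normalized as (s + 1) / 2, a gate equal to the normalized similarity multiplied by exp(-d / τ), calibration by multiplying with γ and clipping to [0, 1], and a guided update that moves the visual vector toward the mapped text vector. The names `cosine` and `gated_guidance` are hypothetical, and the text vector is assumed to be already mapped into the visual feature space.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def gated_guidance(v, t, tau=1.0, gamma=1.0):
    """Return (guided feature, calibrated gate) for visual v and mapped text t."""
    diff = [abs(x - y) for x, y in zip(v, t)]    # first difference feature (assumed |v - t|)
    d = sum(diff) / len(diff)                    # difference measure via mean()
    s = cosine(v, t)                             # semantic similarity
    s_norm = (s + 1.0) / 2.0                     # assumed normalization to [0, 1]
    gate = s_norm * math.exp(-d / tau)           # assumed gating form with decay parameter tau
    g_cal = min(1.0, max(0.0, gamma * gate))     # assumed calibration by gamma, then clipping
    guided = [x + g_cal * (y - x) for x, y in zip(v, t)]  # move v toward t by g_cal
    return guided, g_cal

guided, g = gated_guidance([1.0, 0.0], [0.0, 1.0], tau=1.0, gamma=1.0)
print(guided, g)
```

Under these assumptions the gate shrinks when the two modalities disagree (low similarity or large difference), so the visual feature is pulled toward the text only when the text is semantically consistent with it.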

[0026] S3-2: As shown in Figure 3, an adaptive residual structure is introduced to achieve dynamic control of the enhanced features. The second difference feature between the guiding feature and the visual feature is calculated [formula missing in source], and the second difference feature is aggregated to obtain the second difference measure [formula missing in source],

[0027] where mean() represents the mean aggregation function.

[0028] The semantic similarity between the second difference feature and the guiding feature is calculated [formula missing in source], where cosine() represents the cosine similarity calculation function.

[0029] The residual feature R between the guiding feature and the visual feature is calculated [formula missing in source]. Based on this second difference feature and this semantic similarity, an adjustment coefficient α is constructed [formula missing in source], where exp() represents the exponential mapping function.

[0030] The residual feature is modulated using the residual adjustment coefficient α [formula missing in source],

[0031] where R′ represents the modulated residual feature.
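The adaptive residual step of S3-2, together with the residual fusion described in paragraph [0032], might look like the sketch below. The concrete forms are assumptions, since the formulas are absent from this text: the residual is taken as guiding feature minus visual feature, the coefficient α as ((s + 1) / 2) · exp(-d), the modulated residual as α times the residual, and the enhanced feature as the visual feature plus the modulated residual. `adaptive_residual` is a hypothetical name.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def adaptive_residual(v, guided):
    """Return (enhanced visual feature, alpha) from visual v and guiding feature."""
    diff2 = [abs(g - x) for g, x in zip(guided, v)]   # second difference feature (assumed absolute)
    d2 = sum(diff2) / len(diff2)                      # second difference measure via mean()
    s2 = cosine(diff2, guided)                        # similarity of difference and guiding feature
    residual = [g - x for g, x in zip(guided, v)]     # residual feature R (assumed guided - v)
    alpha = ((s2 + 1.0) / 2.0) * math.exp(-d2)        # assumed form of the adjustment coefficient
    modulated = [alpha * r for r in residual]         # modulated residual R'
    enhanced = [x + r for x, r in zip(v, modulated)]  # residual fusion with the visual feature
    return enhanced, alpha

enhanced, alpha = adaptive_residual([1.0, 0.0], [0.5, 0.5])
print(enhanced, alpha)
```

Since α stays in (0, 1] under these assumptions, the enhanced feature lies between the original visual feature and the guiding feature, which matches the stated goal of avoiding both over-enhancement and information suppression.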

[0032] The modulated residual feature is residually fused with the visual feature to obtain the enhanced visual feature [formula missing in source]. S3-3: As shown in Figure 4, a competitive weight allocation mechanism is adopted in the fusion process, using visual features, text features, and cross-modal interaction features as different semantic expression paths. Based on the normalized weight generation function, each feature is competitively selected and dynamically weighted, thereby achieving adaptive fusion of multimodal features at different semantic levels. The visual feature, the text feature, and the cross-modal interaction feature between the visual feature and the text feature are constructed as a visual semantic branch, a text semantic branch, and an interaction semantic branch respectively, and multi-semantic path feature aggregation is performed. Based on each semantic branch, branch competition weights are constructed. The competition weights of the branches are then normalized using the competitive normalization function Softmax() to obtain the competition weight corresponding to each semantic branch [formula missing in source],

[0033] where the two symbols of the formula represent the competition weight corresponding to the i-th semantic branch and the competition score of that semantic branch, respectively.

[0034] Based on the competition weights, the corresponding semantic branches are fused by competitive weighting to obtain the fused feature F [formula missing in source],

[0035] where F_i represents the i-th semantic branch feature and F represents the fused feature.
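The Softmax normalization and weighted sum of S3-3 follow directly from the description, while two ingredients are not specified in this text and are assumed here for illustration: the cross-modal interaction feature is taken as the element-wise product of the two inputs, and each branch's competition score is taken as the mean of its activations. `competitive_fusion` is a hypothetical name.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def competitive_fusion(visual, text):
    """Return (fused feature, competition weights) for three semantic branches."""
    interaction = [x * y for x, y in zip(visual, text)]  # assumed interaction feature
    branches = [visual, text, interaction]               # visual, text, interaction branches
    scores = [sum(b) / len(b) for b in branches]         # assumed competition scores
    weights = softmax(scores)                            # competitive normalization
    fused = [
        sum(w * b[k] for w, b in zip(weights, branches))  # weighted sum over branches
        for k in range(len(visual))
    ]
    return fused, weights

fused, weights = competitive_fusion([1.0, 1.0], [1.0, 1.0])
print(fused, weights)
```

Because the weights sum to one, a branch can only gain influence at the expense of the others, which is the sense in which the branches compete.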

[0036] Furthermore, with the three branch features representing the visual feature branch, the text feature branch, and the cross-modal interaction feature branch respectively, the above formula can be expanded as follows [formula missing in source], where the feature weights are generated through the competitive normalization function.

S4: Input the fused feature F into the fully connected layer for mapping, and output the user interest category and its probability, as well as the user interest prediction result, through the Softmax function. The interest classification is a fine-grained task with no fewer than 24 categories. In this embodiment, during training, sparse categorical cross-entropy is used as the loss function, the training sample labels are used as the supervision signal, the Adam optimizer is used for parameter updates, and an early stopping mechanism is set: when the validation set loss does not decrease for several consecutive rounds, training is terminated, and the optimal model parameters are restored after training is completed.
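Step S4 and the early stopping rule can be sketched in plain Python. The three categories, the weight matrix, the patience value, and the helper names `predict_interest`, `sparse_cross_entropy`, and `early_stopping_epoch` are illustrative assumptions; the embodiment uses no fewer than 24 categories and does not state the patience.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_interest(fused, weights, bias, categories):
    """Fully connected mapping followed by Softmax; returns (category, probabilities)."""
    logits = [sum(w * x for w, x in zip(row, fused)) + b for row, b in zip(weights, bias)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return categories[best], probs

def sparse_cross_entropy(probs, label_index):
    """Loss for one sample whose label is given as an integer class index."""
    return -math.log(probs[label_index])

def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch index whose parameters would be restored."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:   # no improvement for `patience` consecutive epochs
                break
    return best_epoch

categories = ["food", "travel", "education"]          # toy stand-in for 24 or more classes
label, probs = predict_interest([2.0, 0.5], [[1, 0], [0, 1], [0, 0]], [0, 0, 0], categories)
print(label, probs, early_stopping_epoch([0.9, 0.7, 0.8, 0.75, 0.71, 0.6]))
```

In the last call, training stops after three epochs without improvement, so the later loss of 0.6 is never reached and the parameters from epoch 1 are the ones restored.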

[0037] Table 1. Accuracy comparison of the user interest profile prediction method in this application with other models:

[0038] As shown in Table 1, under the same dataset conditions, the performance of various methods in the user interest classification task shows a trend of gradual improvement from unimodal to multimodal.

[0039] First, from the perspective of text modality, traditional text representation methods based on TF-IDF and Word2Vec generally perform similarly across different classifiers. TF-IDF combined with Logistic Regression and SVM models achieves an accuracy of 0.38, while Word2Vec performs slightly lower, with accuracy typically between 0.31 and 0.36. The BERT model achieves an accuracy of 0.38, comparable to TF-IDF. Overall, pure text models are limited by the sparsity of image keyword text and their limited semantic expressive power, making it difficult to accurately characterize users' fine-grained interest features.

[0040] Secondly, from the perspective of image modalities, visual models based on convolutional neural networks can improve classification performance to some extent. Specifically, DenseNet121 achieved an accuracy of 0.41, outperforming ResNet50 (0.39) and VGG16 (0.29), indicating that deep convolutional networks have stronger expressive power in visual feature extraction. However, relying solely on visual information is still insufficient to fully distinguish semantically similar interest categories, resulting in limited overall performance improvement.

[0041] With multimodal fusion methods, model performance is further improved by combining image and text information, indicating that multimodal information can effectively compensate for the shortcomings of a single modality and improve the accuracy of interest classification.

[0042] Compared to the baseline models, the fusion method proposed in this invention achieves a stable improvement in all indicators, reaching an accuracy of 0.44 with the other indicators also improved. It also demonstrates good stability and consistency across multiple experiments, outperforming traditional single-modal methods and conventional multimodal fusion methods.

[0043] Experimental results show that the method of this application can effectively improve the consistency and discriminativeness of cross-modal feature representation in fine-grained user interest classification tasks, and has stronger discriminative and generalization abilities.

[0044] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the above description is only a specific embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A user interest profile prediction method based on cross-modal semantic consistency regulation, characterized in that the following steps are performed sequentially:
S1: Obtain the image data and corresponding text data of the user to be predicted, assign interest tags based on the image data, and obtain the text set corresponding to the image data based on the text data;
S2: Use a pre-trained convolutional neural network to extract the image visual features of the image data and obtain the visual features, and use text representation methods to extract the semantic features of the text set and obtain the text features;
S3: Perform cross-modal adaptive fusion of the visual feature and the text feature. The specific steps are as follows:
S3-1: The text feature is mapped to the same feature space as the visual feature through the feature mapping unit. A gating function based on cross-modal semantic consistency and feature differences is then used to perform a semantically guided update of the visual feature, so as to achieve cross-modal feature alignment;
S3-2: An adaptive residual structure is introduced to achieve dynamic control of the enhanced features;
S3-3: In the fusion process, a competitive weight allocation mechanism is adopted, which takes visual features, text features and cross-modal interaction features as different semantic expression paths, and performs competitive selection and dynamic weighting of each feature based on the normalized weight generation function, thereby realizing the adaptive fusion of multimodal features at different semantic levels and obtaining fused features;
S4: Input the fused feature into the fully connected layer for mapping, and output the user interest category and its probability, as well as the user interest prediction result, through the activation function.

2. The user interest profile prediction method based on cross-modal semantic consistency regulation according to claim 1, characterized in that: the text data is obtained through keyword extraction or image description.

3. The user interest profile prediction method based on cross-modal semantic consistency regulation according to claim 2, characterized in that: the interest classification is a fine-grained classification task with no fewer than 24 categories.

4. The user interest profile prediction method based on cross-modal semantic consistency regulation according to claim 3, characterized in that: step S1 includes resizing and normalizing the image data.

5. The user interest profile prediction method based on cross-modal semantic consistency regulation according to claim 4, characterized in that: step S1 includes cleaning the text data by segmenting it into words, removing stop words, removing noise characters, filtering duplicate content, and normalizing the text.

6. The user interest profile prediction method based on cross-modal semantic consistency regulation according to claim 5, characterized in that the specific steps in step S3 include the following processing flow:
S3-1: Calculate the first difference feature between the visual feature and the text feature, aggregate the first difference feature to obtain the difference measure, calculate the semantic similarity between the visual feature and the text feature, normalize the semantic similarity to obtain the normalized semantic similarity, construct a gating function based on the semantic similarity and the difference measure, introduce an adaptive adjustment factor γ to calibrate the gating function and obtain the gating coefficient, and use the gating coefficient to directionally guide the update of the visual feature, so that the visual feature is adaptively adjusted along the semantic direction of the text to obtain the guiding feature;
S3-2: Calculate the second difference feature between the guiding feature and the visual feature, aggregate the second difference feature to obtain the second difference measure, calculate the semantic similarity between the second difference feature and the guiding feature, calculate the residual feature between the guiding feature and the visual feature, construct the residual adjustment coefficient based on the second difference feature and the semantic similarity, modulate the residual feature using the residual adjustment coefficient to obtain the modulated residual feature, and perform residual fusion with the visual feature to obtain the enhanced visual feature;
S3-3: Construct the visual feature, the text feature, and the cross-modal interaction feature between the visual feature and the text feature into a visual semantic branch, a text semantic branch, and an interaction semantic branch respectively, and perform multi-semantic path feature aggregation. Based on each semantic branch, branch competition weights are constructed. The branch weights are normalized using a competitive normalization function to obtain the competition weights corresponding to each semantic branch. Based on each competition weight, the corresponding semantic branches are competitively weighted and fused to obtain the fused features.