Content recommendation method and apparatus, computer device, storage medium, and product

By generating aligned positive and negative samples to train the content feature extraction model, the problem of inaccurate content recommendation in existing technologies is solved, and higher recommendation accuracy and understanding of modal content relevance are achieved.

CN117056537B · Active · Publication Date: 2026-06-19 · TENCENT TECH (BEIJING) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
TENCENT TECH (BEIJING) CO LTD
Filing Date
2022-08-11
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Current technologies lack accuracy in content recommendation, relying primarily on the experience of operations personnel or popular content lists, leading to inaccurate recommendations.

Method used

The method acquires a sample set, extracts multimodal sub-content from each content sample, generates aligned positive and aligned negative samples, trains the content feature extraction model on them, and uses the trained model for multimodal feature extraction and recommendation.

Benefits of technology

It improves the accuracy of content recommendation by enabling the content feature extraction model to better learn the semantic information of, and the relevance between, different content modalities, thereby enhancing recommendation performance.


Abstract

This application discloses a content recommendation method, apparatus, computer device, storage medium, and product. The method acquires a sample set and extracts multimodal sub-content from each content sample in the sample set to obtain multiple sub-contents of different modalities for that content sample; generates aligned positive samples based on a first sub-content and a second sub-content of different modalities in each content sample; generates aligned negative samples based on the first sub-content of each content sample and a dissimilar second sub-content taken from a dissimilar content sample in the sample set, wherein the dissimilar second sub-content has the same modality as the second sub-content; trains a content feature extraction model based on the aligned positive samples and aligned negative samples to obtain a trained content feature extraction model; and performs content recommendation on the content to be distributed based on the trained content feature extraction model. This improves the feature extraction capability of the content feature extraction model and thereby the accuracy of content recommendation.

Description

Technical Field

[0001] This application relates to the field of communication technology, specifically to a content recommendation method, apparatus, computer equipment, storage medium, and product, wherein the storage medium is a computer-readable storage medium, and the product is a computer program product.

Background Technology

[0002] In this era of rapid internet development, content is constantly being generated, but the quality of that content varies greatly. If we cannot recommend high-quality content that users are interested in, users will not be able to access information in a timely manner. High-quality content is defined as content that receives many clicks and plays in a short period of time. Timely prediction of high-quality content can improve the accuracy of content recommendations.

[0003] Currently, high-quality content is identified mainly through the work experience of operations personnel or by checking the popular content lists of similar content platforms on the Internet. Determining and pushing high-quality content based on the popular content lists of other content platforms leads to inaccurate content recommendations.

Summary of the Invention

[0004] This application provides a content recommendation method, apparatus, computer device, storage medium, and product, which can improve the accuracy of content recommendation.

[0005] This application provides a content recommendation method, including:

[0006] Obtain a sample set, and extract multimodal sub-contents for each content sample in the sample set to obtain multiple sub-contents with different modalities from the content sample;

[0007] Generate aligned positive samples based on the first and second sub-contents with different modalities in each content sample;

[0008] Generate aligned negative samples based on the first sub-content of each content sample and the distinct second sub-content of content samples in the sample set that are distinct from that content sample, wherein the distinct second sub-content has the same modality as the second sub-content;

[0009] The content feature extraction model is trained using the aligned positive samples and the aligned negative samples to obtain a trained content feature extraction model. Based on the multimodal content feature information obtained by multimodal feature extraction of the content to be distributed using the trained content feature extraction model, content recommendation is performed.

[0010] Accordingly, this application also provides a content recommendation device, comprising:

[0011] An acquisition unit is used to acquire a sample set and extract multimodal sub-contents from each content sample in the sample set to obtain multiple sub-contents with different modalities from the content sample.

[0012] The positive sample generation unit is used to generate aligned positive samples based on the first and second sub-contents with different modalities in each content sample;

[0013] A negative sample generation unit is configured to generate aligned negative samples based on the first sub-content of each content sample and a dissimilar second sub-content of a dissimilar content sample in the sample set, wherein the dissimilar second sub-content has the same modality as the second sub-content.

[0014] The training unit is used to train the content feature extraction model based on the aligned positive samples and the aligned negative samples to obtain the trained content feature extraction model, and to recommend content based on the multimodal content feature information obtained by multimodal feature extraction of the content to be distributed through the trained content feature extraction model.

[0015] In one embodiment, the positive sample generation unit includes:

[0016] The first positive sample generation subunit is used to generate the first aligned positive sample based on the video content and audio content in each content sample;

[0017] The second positive sample generation subunit is used to generate a second aligned positive sample based on the video content and text content in each content sample;

[0018] The third positive sample generation subunit is used to generate the aligned positive sample based on the first aligned positive sample and the second aligned positive sample corresponding to each content sample.

[0019] In one embodiment, the training unit includes:

[0020] The masking subunit is used to perform masking processing on the aligned positive sample and the aligned negative sample respectively to obtain the masked positive sample and the masked negative sample.

[0021] The first model training subunit is used to train the content feature extraction model based on the masked positive samples and the masked negative samples.

[0022] In one embodiment, the training unit includes:

[0023] A similarity calculation subunit is used to calculate the similarity between the content sample and the dissimilar content sample based on the target modal sub-content of the content sample and the target modal sub-content of the dissimilar content sample.

[0024] The weight calculation subunit is used to calculate the negative sample weight corresponding to the aligned negative sample based on the similarity, wherein the similarity is negatively correlated with the sample weight;

[0025] The second model training subunit is used to train the content feature extraction model based on the aligned positive samples, the aligned negative samples, and the weights of the negative samples.

[0026] In one embodiment, the content recommendation device further includes:

[0027] The data acquisition unit is used to acquire the content to be distributed, as well as the content interaction data and content publishing object of the content to be distributed;

[0028] A fitting unit is used to perform trend fitting on the content interaction data to obtain the interaction trend feature information of the content to be distributed.

[0029] The content feature extraction unit is used to extract content features from the multimodal content in the content to be distributed, and obtain the multimodal content feature information of the content to be distributed;

[0030] An object feature extraction unit is used to extract object features from the content publishing object based on the object data of the content publishing object, and obtain the object feature information of the content publishing object.

[0031] The fusion unit is used to perform feature fusion processing on the interaction trend feature information, the multimodal content feature information and the object feature information to obtain the fused content feature information of the content to be distributed;

[0032] The recommendation unit is used to recommend content to be distributed based on the fused content feature information.

[0033] In one embodiment, the content feature extraction unit includes:

[0034] The content extraction subunit is used to extract content from the content to be distributed, and obtain sub-content of different modalities contained in the content to be distributed;

[0035] The feature extraction subunit is used to extract content features from sub-contents of different modalities to obtain content feature information for each sub-content.

[0036] The information determination subunit is used to obtain the multimodal content feature information of the content to be distributed based on the content feature information of each sub-content.

[0037] In one embodiment, the sub-content includes video sub-content, and the feature extraction sub-unit includes:

[0038] The frame extraction module is used to perform frame extraction processing on the video sub-content to obtain multiple video frames.

[0039] The image feature extraction module is used to extract image features from each video frame to obtain the image feature information corresponding to each video frame.

[0040] The feature aggregation module is used to aggregate the image feature information corresponding to each video frame to obtain the content feature information of the video sub-content.

[0041] In one embodiment, the sub-content includes audio sub-content, and the feature extraction sub-unit includes:

[0042] The content acquisition module is used to acquire the audio sub-content of the content to be distributed;

[0043] The preprocessing module is used to perform audio preprocessing on the audio sub-content to obtain the audio spectrum information of the audio sub-content;

[0044] The audio feature extraction module is used to extract audio features from the audio spectrum information to obtain the content feature information of the audio sub-content.

[0045] In one embodiment, the sub-content includes text sub-content, video sub-content, and audio sub-content, and the feature extraction sub-unit includes:

[0046] The information acquisition module is used to acquire content-related information of the content to be distributed;

[0047] The text recognition module is used to perform text recognition on the video sub-content to obtain the video text sub-content contained in the video sub-content frame;

[0048] A speech recognition module is used to perform speech recognition on the audio sub-content to obtain the audio text sub-content contained in the audio sub-content;

[0049] The content determination module is used to use the video text sub-content, the audio text sub-content, and the content-related information as the text sub-content;

[0050] The text feature extraction module is used to extract text features from the text sub-contents to obtain the content feature information of the text sub-contents.

[0051] In one embodiment, the recommendation unit includes:

[0052] The prediction subunit is used to predict the recommendation level of the content to be distributed based on the fused content feature information.

[0053] The object acquisition subunit is used to acquire the target object corresponding to the content to be distributed if the recommendation level meets the preset conditions.

[0054] The content recommendation subunit is used to recommend the content to be distributed to the target object.

[0055] In one embodiment, the fitting unit includes:

[0056] The data acquisition subunit is used to acquire content interaction data of the content to be distributed in each time period within a historical time period.

[0057] A sequence generation subunit is used to generate an interaction data sequence of the content to be distributed based on the number of content interactions in each time period.

[0058] The trend fitting subunit is used to perform trend fitting based on the interaction data sequence to obtain the interaction trend feature information of the content to be distributed.

[0059] Accordingly, this application also provides a computer device including a memory and a processor; the memory stores a computer program, and the processor is used to run the computer program in the memory to execute any of the content recommendation methods provided in this application.

[0060] Accordingly, embodiments of this application also provide a computer-readable storage medium for storing a computer program, which is loaded by a processor to execute any of the content recommendation methods provided in embodiments of this application.

[0061] Accordingly, this application also provides a computer program product, including a computer program, which, when executed by a processor, implements any of the content recommendation methods provided in this application.

[0062] The embodiments of this application obtain a sample set and extract multimodal sub-contents from each content sample in the sample set to obtain multiple sub-contents of different modalities for that content sample; generate aligned positive samples based on the first and second sub-contents of different modalities in each content sample; generate aligned negative samples based on the first sub-content of each content sample and the distinct second sub-contents of distinct content samples in the sample set, wherein the distinct second sub-contents have the same modality as the second sub-contents; train the content feature extraction model based on the aligned positive and aligned negative samples to obtain a trained content feature extraction model; and recommend content based on the multimodal content feature information obtained by multimodal feature extraction of the content to be distributed through the trained content feature extraction model.

[0063] The embodiments of this application construct aligned positive and aligned negative samples by analyzing the multimodal content contained in the content samples in the sample set. This eliminates the need for manual labeling of the content samples, and the sample pairs composed of sub-contents of different modalities enable the content feature extraction model to better learn the semantic information of different modal content and the correlation between different modal content, thereby improving the feature extraction capability of the content feature extraction model and thus improving the accuracy of content recommendation.

Attached Figure Description

[0064] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0065] Figure 1 is a flowchart of the content recommendation method provided in the embodiments of this application;

[0066] Figure 2 is a sub-flowchart of the content recommendation method provided in the embodiments of this application;

[0067] Figure 3 is another flowchart of the content recommendation method provided in the embodiments of this application;

[0068] Figure 4 is a schematic diagram of the model structure provided in the embodiments of this application;

[0069] Figure 5 is a schematic diagram of the content recommendation system provided in the embodiments of this application;

[0070] Figure 6 is a schematic diagram of the content recommendation device provided in the embodiments of this application;

[0071] Figure 7 is a schematic diagram of another content recommendation device provided in an embodiment of this application;

[0072] Figure 8 is a schematic diagram of the structure of the computer device provided in the embodiments of this application.

Detailed Implementation

[0073] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0074] This application provides a content recommendation method, apparatus, computer device, and computer-readable storage medium. The content recommendation apparatus can be integrated into a computer device, which may be a server or a terminal, etc.

[0075] The terminal may include mobile phones, wearable smart devices, tablets, laptops, personal computers (PCs), and in-vehicle computers, etc.

[0076] The server can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.

[0077] The following sections provide detailed descriptions of each example. It should be noted that the order in which the embodiments are described is not intended to limit the preferred order of the embodiments.

[0078] This embodiment will be described from the perspective of a content recommendation device, which can be integrated into a computer device, such as a server or a terminal.

[0079] This application provides a content recommendation method. As shown in Figure 1, the specific process of this content recommendation method can be summarized as follows:

[0080] 101. Obtain the sample set, and extract multimodal sub-contents for each content sample in the sample set to obtain multiple sub-contents with different modalities.

[0081] The sample set can include multiple content samples. The content samples in the sample set can be selected as samples based on historical content records, such as popular content (e.g., content with more than a preset number of views, more than a preset threshold of favorites, and more than a preset threshold of likes).

[0082] The content sample can be content containing sub-contents of multiple modalities or content with a single modality. For example, the content sample can be a video (from which audio and text can be extracted), an article containing other modalities, or audio.

[0083] For example, when a content sample contains multiple sub-contents of different modalities, the sub-contents of different modalities can be extracted from the content sample based on the data format. Optionally, in addition to extracting sub-contents by data format, a sub-content of another modality can be derived from an already extracted sub-content. When a content sample contains content of a single modality, the sub-contents of different modalities can be extracted from the content of that single modality.

[0084] Taking video as an example, audio can be extracted from the video to obtain audio sub-content, and the video frame sequence can be used as video sub-content. In addition, text sub-content can be obtained by performing text recognition on the video frame sequence using Optical Character Recognition (OCR) technology, or by performing speech recognition on the audio sub-content using Automatic Speech Recognition (ASR) technology.

[0085] 102. Generate aligned positive samples based on the first and second sub-contents with different modalities in each content sample.

[0086] Specifically, the first and second sub-contents with different modalities in each content sample can be combined into sample pairs to obtain aligned positive samples. For example, if the content sample contains video content, audio content and text content, the video content and audio content can be combined into a sample pair, the video content and text content can be combined into a sample pair, and the audio content and text content can be combined into a sample pair to generate aligned positive samples.

[0087] If the content sample is a single-modality video content, content recommendation is mainly based on the video content. Therefore, aligned positive samples can be constructed based on the video content to reduce the types of samples, lower the learning difficulty of the content feature extraction model, and improve the training efficiency of the content feature extraction model. That is, in one embodiment, the step "generating aligned positive samples based on the first sub-content and second sub-content with different modalities in each content sample" can specifically include:

[0088] Generate a first aligned positive sample based on the video and audio content in each content sample;

[0089] A second aligned positive sample is generated based on the video content and text content in each content sample;

[0090] Based on the first and second aligned positive samples corresponding to each content sample, an aligned positive sample is generated.

[0091] For example, a first aligned positive sample can be constructed from video content and audio content, and a second aligned positive sample can be constructed from video content and text content. Aligned positive samples can then be generated based on the first and second aligned positive samples of each content sample.
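The pairing described in the steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `ContentSample` container, the string stand-ins for video, audio and text, and the task names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ContentSample:
    """Hypothetical container for the sub-contents of one content sample."""
    sample_id: str
    video: Optional[str] = None  # stand-in for a video frame sequence
    audio: Optional[str] = None  # stand-in for an audio track
    text: Optional[str] = None   # stand-in for OCR/ASR/title text


# A pair is (first sub-content, second sub-content, task name, label).
Pair = Tuple[str, str, str, int]


def build_aligned_positives(samples: List[ContentSample]) -> List[Pair]:
    """Pair each sample's video with its own audio and its own text."""
    pairs: List[Pair] = []
    for s in samples:
        if s.video is None:
            continue  # video is the first sub-content in this sketch
        if s.audio is not None:
            # First aligned positive sample: video + audio of the same sample.
            pairs.append((s.video, s.audio, "video-audio", 1))
        if s.text is not None:
            # Second aligned positive sample: video + text of the same sample.
            pairs.append((s.video, s.text, "video-text", 1))
    return pairs
```

Because both halves of every pair come from the same content sample, the label 1 is obtained without any manual annotation.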

[0092] Optionally, the generated sample pairs may serve two tasks: a video-text (VTM) task and a video-audio (VTA) task. The VTA task constructs video content and audio content pairs (including aligned positive and aligned negative samples), while the VTM task constructs text content and video content pairs (including aligned positive and aligned negative samples).

[0093] 103. Generate aligned negative samples based on the first sub-content of each content sample and the distinct second sub-content of content samples in the sample set that are distinct from that content sample, where the distinct second sub-content has the same modality as the second sub-content.

[0094] Here, a distinct content sample is any sample in the sample set other than the content sample itself.

[0095] For example, let sample A be a content sample and sample B a content sample distinct from it. Sample A and sample B both have sub-contents of different modalities: video content, audio content, and text content. One negative sample pair can be constructed by combining the video content of sample A with the audio content of sample B, and another by combining the video content of sample A with the text content of sample B, thus obtaining aligned negative samples.
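The sample A / sample B construction can be sketched as below. The dictionary layout and the field names (`id`, `video`, `audio`, `text`) are assumptions made for illustration; the source does not prescribe a data structure.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: {"id": ..., "video": ..., "audio": ..., "text": ...}
Sample = Dict[str, str]
# A pair is (first sub-content, second sub-content, task name, label).
Pair = Tuple[str, str, str, int]


def build_aligned_negatives(samples: List[Sample]) -> List[Pair]:
    """Pair each sample's video with the audio/text of every other sample."""
    pairs: List[Pair] = []
    for anchor in samples:
        if "video" not in anchor:
            continue
        for other in samples:
            if other["id"] == anchor["id"]:
                continue  # same sample would be a positive, not a negative
            for modality in ("audio", "text"):
                if modality in other:
                    # Same modality as the positive's second sub-content,
                    # but taken from a distinct content sample.
                    pairs.append(
                        (anchor["video"], other[modality], f"video-{modality}", 0)
                    )
    return pairs
```

With two samples that each carry video, audio and text, this yields four negative pairs, two anchored on each sample's video.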

[0096] Optionally, the dissimilar content samples can be samples that are different from the content sample and whose similarity to the content sample is less than a preset threshold. The calculation of the similarity between the content sample and a dissimilar content sample is described under step 104 and is not repeated here.

[0097] 104. Train the content feature extraction model based on aligned positive and aligned negative samples to obtain a trained content feature extraction model. Then, based on the multimodal content feature information obtained by multimodal feature extraction of the content to be distributed through the trained content feature extraction model, perform content recommendation.

[0098] For example, a sufficient number of aligned positive and negative samples can be used to train the content feature extraction model. The content feature extraction model can then predict whether the training samples are positive or negative. This allows the content feature extraction model to learn the content features of different modalities and the correlation between different modalities, until the preset conditions are met, resulting in the trained content feature extraction model.

[0099] To further improve the training efficiency of the model, the aligned positive samples and aligned negative samples can be masked to obtain masked positive samples and masked negative samples. In one embodiment, the step "training the content feature extraction model based on the aligned positive samples and aligned negative samples" includes:

[0100] The aligned positive samples and aligned negative samples are masked separately to obtain the masked positive samples and the masked negative samples.

[0101] The content feature extraction model is trained based on the positive and negative samples after masking.

[0102] For example, the aligned positive and negative samples can be masked to obtain masked positive and negative samples. The masked parts in the masked positive and negative samples can then be restored using a content feature extraction model to train the content feature extraction model and improve its content feature extraction capability.

[0103] Specifically, masking can include text masking (MLM) and video masking (MFM). MLM masks the text content in aligned positive and negative samples, where 15% of the text tokens in the aligned positive and negative samples may be masked, of which 80% will be replaced by a "mask" token, 10% will be replaced by a random token, and 10% will remain unchanged. MFM masks the video content in aligned positive and negative samples.
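The 15% / 80% / 10% / 10% text-masking rule can be sketched as follows. This is an illustrative sketch only: the `[MASK]` spelling, the toy vocabulary, and the function name `mask_tokens` are assumptions, and the video masking (MFM) side is not shown.

```python
import random
from typing import List, Optional, Sequence, Tuple

MASK_TOKEN = "[MASK]"  # assumed spelling of the "mask" token


def mask_tokens(
    tokens: Sequence[str],
    vocab: Sequence[str],
    rng: random.Random,
    select_prob: float = 0.15,
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (masked tokens, restoration targets).

    Each token is selected with probability `select_prob`. A selected token
    is replaced by MASK_TOKEN 80% of the time, by a random vocabulary token
    10% of the time, and left unchanged 10% of the time. The target is the
    original token at selected positions and None elsewhere.
    """
    masked = list(tokens)
    targets: List[Optional[str]] = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # token not selected for masking
        targets[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)
        # otherwise the original token is kept as-is
    return masked, targets
```

During training, the model would be asked to restore the original token at every position whose target is not None.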

[0104] Optionally, tasks based on MFM, MLM, VTM, and VTA can be performed simultaneously or in a preset order.

[0105] The higher the similarity between two different content samples, the greater the correlation between sub-contents of different modalities drawn from those two samples. If a pair built from two highly similar samples is used as a negative sample to train the content feature extraction model, the model will struggle to learn the commonalities between trending content. Therefore, negative samples can be selected based on similarity, and the weights of the samples can be adjusted accordingly to help the content feature extraction model better learn the semantic information of different modalities. That is, in one embodiment, the step "training the content feature extraction model based on aligned positive and aligned negative samples" can specifically include:

[0106] Calculate the similarity between content samples and dissimilar content samples based on the target modal sub-content of content samples and the target modal sub-content of dissimilar content samples;

[0107] Calculate the negative sample weight corresponding to each aligned negative sample based on the similarity, where the similarity and the negative sample weight are negatively correlated;

[0108] Train the content feature extraction model based on the aligned positive samples, the aligned negative samples, and the negative sample weights.

[0109] For example, the similarity between target modal sub-contents in different content samples can be used to determine the similarity between different content samples. The target modal sub-content can be determined according to the type of content sample. For example, when the content sample contains single-modality content—video content, the target modal sub-content can be determined as video content. The similarity between the content sample and the dissimilar content sample can be calculated, and the negative sample weight of the aligned negative sample can be calculated based on the similarity. The relationship between the similarity and the negative sample weight can be: W = -kS + b, where W is the negative sample weight of the aligned negative sample, S is the similarity between the content sample and the dissimilar content sample, and k and b can be adjusted according to the training effect of the content feature extraction model.

[0110] Optionally, k and b can be preset to 1.
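A small sketch of the weighting rule W = -kS + b follows. The linear rule and the default k = b = 1 come from the text; using cosine similarity over feature vectors to obtain S is an assumption of this sketch, since the source does not fix a similarity measure.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def negative_sample_weight(similarity: float, k: float = 1.0, b: float = 1.0) -> float:
    """W = -k * S + b: the more similar the two samples, the lower the weight."""
    return -k * similarity + b
```

With k = b = 1 and S in [0, 1], the weight also falls in [0, 1]: a negative pair built from two near-identical samples contributes almost nothing to training, while a pair built from unrelated samples contributes fully.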

[0111] The content feature extraction model is trained using the aligned positive samples, the aligned negative samples, and the negative sample weights assigned to the aligned negative samples, thus obtaining the trained content feature extraction model.

[0112] After obtaining the trained content feature extraction model, multimodal features are extracted from the content to be distributed based on the trained content feature extraction model. The obtained multimodal content feature information is then used to predict whether the content to be distributed will become popular. If so, content recommendation is performed on the content to be distributed.

[0113] In addition to prediction based on the multimodal content feature information of the content to be distributed, prediction can also be made based on the historical content interaction data and the content publishing object of the content to be distributed. That is, in one embodiment, as shown in Figure 2, the step "recommending content based on the multimodal content feature information obtained by multimodal feature extraction of the content to be distributed through the trained content feature extraction model" can specifically include:

[0114] 1051. Obtain the content to be distributed, as well as the content interaction data and content publishing object of the content to be distributed;

[0115] 1052. Perform trend fitting on the content interaction data to obtain the interaction trend feature information of the content to be distributed;

[0116] 1053. By using the trained content feature extraction model, content features are extracted from the multimodal content in the content to be distributed, and the multimodal content feature information of the content to be distributed is obtained.

[0117] 1054. Based on the object data of the content publishing object, extract the object features of the content publishing object to obtain the object feature information of the content publishing object;

[0118] 1055. Perform feature fusion processing on interaction trend feature information, multimodal content feature information and object feature information to obtain the fused content feature information of the content to be distributed;

[0119] 1056. Based on the fused content feature information, recommend the content to be distributed.
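The fusion and recommendation steps 1055 and 1056 can be sketched as follows. Fusing by simple concatenation, the pluggable `score_fn` predictor, and the threshold value are assumptions of this sketch; the source only states that the three kinds of feature information are fused and that the content is recommended when the predicted recommendation level meets preset conditions.

```python
from typing import Callable, List, Sequence


def fuse_features(
    trend: Sequence[float],
    content: Sequence[float],
    publisher: Sequence[float],
) -> List[float]:
    """Fuse the three feature vectors by concatenation (assumed fusion rule)."""
    return [*trend, *content, *publisher]


def should_recommend(
    fused: Sequence[float],
    score_fn: Callable[[Sequence[float]], float],
    threshold: float = 0.5,
) -> bool:
    """Recommend when the predicted level reaches the preset threshold."""
    return score_fn(fused) >= threshold
```

Here `score_fn` stands in for whatever predictor maps fused features to a recommendation level; any trained model exposing that call shape could be plugged in.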

[0120] The specific explanation of the above steps is as follows:

[0121] 1051. Obtain the content to be distributed, as well as the content interaction data and content publishing objects of the content to be distributed.

[0122] The content to be distributed may include videos, articles, or audio, or it may be a combination of videos, articles, or audio.

[0123] Content interaction data can include data generated by users' interactive behaviors toward the content to be distributed. Interactive behaviors can include liking, favoriting, commenting on, subscribing to, and blocking content. Content interaction data can include the number of likes, favorites, and comments.

[0124] The content publishing object can include the party that publishes the content to be distributed on the content platform. For example, the content publishing object can be a personal account or a group account.

[0125] For example, this could involve retrieving the content to be distributed from a database or the internet, identifying the content publishing object of the content to be distributed, and obtaining the content interaction data of the content to be distributed over a past period.

[0126] Obtaining content interaction data for content to be distributed can be done by periodically (e.g., every minute or every hour) acquiring data on user interactions with the content to be distributed, thus obtaining time-related content interaction data.

[0127] 1052. Perform trend fitting on the content interaction data to obtain the interaction trend feature information of the content to be distributed.

[0128] The interaction trend feature information can characterize the trend of change in the past content interaction data of the content to be distributed.

[0129] For example, by performing trend fitting on content interaction data, an interaction trend curve can be obtained, which shows how the interactions with the content to be distributed change over time. The interaction trend curve can reflect the changes in the content interaction data of the content to be distributed.

[0130] Since interaction trends are time-related, time series can be generated based on content interaction data at different times. Then, trend fitting is performed on the time series. In one embodiment, the step "trend fitting of content interaction data to obtain interaction trend feature information of the content to be distributed" can specifically include:

[0131] Obtain content interaction data for the content to be distributed in each time period within a historical timeframe;

[0132] Generate an interaction data sequence of the content to be distributed based on the number of interactions in each time period;

[0133] By performing trend fitting based on the interactive data sequence, the interactive trend feature information of the content to be distributed is obtained.

[0134] The historical time period can be a specified time period, such as the past hour or the past 24 hours.

[0135] The time period can include the data statistics period, for example, collecting content interaction data every 5 minutes.

[0136] The interactive data sequence can include multiple content interaction data.

[0137] For example, this could involve statistically analyzing user interactions with content to be distributed at each time period to obtain content interaction data for that period, acquiring content interaction data from each time period within a historical timeframe, and generating an interaction data sequence. Trend fitting could then be performed on the interaction data sequence to obtain interaction trend characteristic information.
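As an illustration only, generating the interaction data sequence can be sketched as follows. The function name `build_interaction_sequence`, the use of integer timestamps in seconds, and the 5-minute default period are assumptions made for this sketch and are not prescribed by this application.

```python
from collections import Counter


def build_interaction_sequence(event_times, start, end, period=300):
    """Count interaction events per fixed-length time period in [start, end).

    event_times: integer timestamps (seconds) of likes, comments, and so on.
    period: length of one statistics period in seconds (assumed 5 minutes).
    """
    n_periods = (end - start) // period
    window_end = start + n_periods * period
    counts = Counter(
        (t - start) // period for t in event_times if start <= t < window_end
    )
    # Periods without any interaction are kept as zeros so that the
    # sequence stays evenly spaced in time.
    return [counts.get(i, 0) for i in range(n_periods)]


# Hypothetical interaction timestamps within the past hour.
events = [10, 20, 310, 900, 905, 3599]
sequence = build_interaction_sequence(events, start=0, end=3600)
```

The resulting list has one entry per time period and can then be passed to whichever trend fitting model is used.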

[0138] Optionally, a Long Short-Term Memory (LSTM) network can be used to fit the trend of content interaction data. Specifically, the content interaction data can be a time series, which contains statistics on user interaction behavior with the content to be distributed at each time unit over a period of time. The content interaction data is input into the LSTM network, and the LSTM network extracts features from the content interaction data to fit the trend curve of the content interaction data, thereby obtaining the interaction trend feature information of the content to be distributed.

[0139] Optionally, trend fitting can also be performed using a transformer model. Specifically, the interaction data sequence can be input into the transformer model, which captures the dependencies between the interaction data sequences to fit the curve of the interaction trend and obtain the interaction trend feature information.

[0140] 1053. Extract content features from the multimodal content in the content to be distributed to obtain the multimodal content feature information of the content to be distributed.

[0141] Multimodal content can include content of different modalities, such as video, text, and audio data.

[0142] The multimodal content feature information may include information that characterizes the plot, theme, and concept of different modalities in the content to be distributed. The content feature information may include information in the form of feature values, feature vectors, or feature tensors.

[0143] For example, the specific process could involve extracting content features from different modalities of the content to be distributed, obtaining content feature information for each modality, and then performing feature fusion processing on the content feature information of the different modalities to obtain multimodal content feature information for the content to be distributed. Feature fusion processing could involve concatenating the content feature information of the different modalities or adding the feature information together, for example, weighting and then adding the content feature information of the different modalities together.
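The two fusion options mentioned above, concatenation and weighted addition, can be sketched as follows. This is a minimal sketch under the assumption that each modality has already been reduced to a plain list of floats; the function names and the example weights are hypothetical.

```python
def concat_fusion(features):
    """Fuse per-modality feature vectors by concatenating them."""
    return [value for vector in features for value in vector]


def weighted_sum_fusion(features, weights):
    """Fuse per-modality feature vectors by weighting and adding them.

    All vectors must share the same dimension for element-wise addition.
    """
    if len(features) != len(weights):
        raise ValueError("one weight is required per modality")
    dim = len(features[0])
    if any(len(vector) != dim for vector in features):
        raise ValueError("all feature vectors must have the same dimension")
    return [
        sum(weight * vector[i] for vector, weight in zip(features, weights))
        for i in range(dim)
    ]


# Hypothetical 2-dimensional features for video, audio, and text.
video, audio, text = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
fused_concat = concat_fusion([video, audio, text])
fused_sum = weighted_sum_fusion([video, audio, text], [0.5, 0.25, 0.25])
```

Concatenation preserves every modality's information at the cost of a larger dimension, while weighted addition keeps the dimension fixed but requires the modality features to share one feature space.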

[0144] Optionally, the content to be distributed may contain multiple modalities or only one modality. Based on this content, different modalities can be extracted. For example, text content can be extracted from audio content using speech recognition technology. Therefore, in one embodiment, the step "extracting content features from the multimodal content in the content to be distributed to obtain multimodal content feature information of the content to be distributed" may specifically include:

[0145] Extract content from the content to be distributed to obtain sub-content of different modalities contained in the content to be distributed;

[0146] Content features are extracted from sub-content of different modalities to obtain content feature information for each sub-content;

[0147] The multimodal content feature information of the content to be distributed is obtained based on the content feature information of each sub-content.

[0148] For example, when the content to be distributed contains multiple sub-contents of different modalities, these sub-contents can be separated from the content to be distributed based on their data formats. Optionally, in addition to separation by data format, sub-content of a further modality can be derived from the separated sub-contents, for example text derived from audio. When the content to be distributed contains content of only a single modality, sub-contents of other modalities can be derived from the content of that single modality.

[0149] Taking video as an example, audio can be extracted from the video to obtain audio sub-content, and the video frame sequence can be used as video sub-content. Alternatively, text sub-content can be obtained by performing text recognition on the video frame sequence using Optical Character Recognition (OCR) technology. Alternatively, text sub-content can be obtained by performing speech recognition on the audio sub-content using Automatic Speech Recognition (ASR) technology.

[0150] Content features are extracted from sub-contents of different modalities to obtain content feature information for each sub-content; the multiple pieces of content feature information are then either fused to obtain the multimodal content feature information or used directly as the multimodal content feature information.

[0151] When the sub-content includes video sub-content, video content frames can be obtained from the video sub-content, and feature information of the video sub-content can be extracted from the video content frames. That is, in one embodiment, the sub-content includes video sub-content, and the step "extracting content features from sub-content of different modalities to obtain content feature information of each sub-content" can specifically include:

[0152] Frame extraction is performed on the sub-contents of the video to obtain multiple video content frames;

[0153] Image features are extracted from each video frame to obtain the image feature information corresponding to each video frame.

[0154] The image feature information corresponding to each video frame is aggregated to obtain the content feature information of the video sub-content.

[0155] The video content frames can be video frames within the video sub-content.

[0156] The image feature information may include feature information representing the video content frames.

[0157] For example, a specific approach could be to extract a portion of the video frames from the video sub-content. For instance, a preset number of frames can be extracted from the video sub-content at regular intervals; these extracted frames are the video content frames. A convolutional neural network (CNN) is then used to extract image features from these video content frames, obtaining the image feature information corresponding to each frame. The image feature information from the multiple video content frames is then aggregated to convert frame-level feature information into video-level feature information.
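The frame extraction and aggregation described above can be sketched as follows. This is a simplified illustration: the per-frame image features are assumed to be given as lists of floats, and mean pooling is used here only as a stand-in for the aggregation step, which this application does not restrict to any one method. The function names are hypothetical.

```python
def sample_frame_indices(total_frames, num_samples):
    """Pick a preset number of frame indices at regular intervals."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(i * step) for i in range(num_samples)]


def mean_pool(frame_features):
    """Aggregate frame-level feature vectors into one video-level vector."""
    count = len(frame_features)
    return [sum(column) / count for column in zip(*frame_features)]


# Hypothetical video of 100 frames, from which 4 frames are sampled.
indices = sample_frame_indices(100, 4)
# Hypothetical 2-dimensional image features for two sampled frames.
video_level = mean_pool([[1.0, 2.0], [3.0, 4.0]])
```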

[0158] The content feature information of video sub-content can be used to characterize the content features of the video sub-content, and can also be used to measure similarity. The distance between two pieces of content feature information can represent the similarity between the two corresponding video contents.

[0159] Optionally, the Swin-Transformer model can be used to extract features from video sub-contents. The Swin-Transformer is a novel visual transformer.

[0160] In one embodiment, the sub-content includes audio sub-content, and the step of "extracting content features from the sub-content of different modalities to obtain content feature information for each sub-content" may specifically include:

[0161] Retrieve the audio sub-content of the content to be distributed;

[0162] Audio preprocessing is performed on the audio sub-content to obtain the audio spectrum information of the audio sub-content;

[0163] Audio feature extraction is performed on the audio spectrum information to obtain the content feature information of the audio sub-content.

[0164] Preprocessing can include pre-emphasis, framing, and windowing.

[0165] For example, the process could involve extracting the audio sub-content from the content to be distributed, dividing it into frames with a certain frame shift, and applying a window function to each frame. A Short-Time Fourier Transform (STFT) is then performed to obtain a spectrogram, which is mapped onto a 64-band Mel filter bank to calculate the Mel spectrum, i.e., the audio spectrum information. By extracting audio features from the audio spectrum information, the content feature information of the audio sub-content can be obtained.

[0166] Optionally, audio features can be extracted using the TensorFlow-based VGG model for audio (also known as the VGGish model). VGGish resamples the audio sub-content to 16 kHz mono audio, performs a short-time Fourier transform using a 25 ms Hann window and a 10 ms frame shift to obtain the spectrogram, and calculates the Mel spectrum by mapping the spectrogram to a 64-band Mel filter bank. The Mel spectrum is then split into non-overlapping frames of 0.96 s each; each frame covers 64 Mel frequency bands and 96 steps of 10 ms. Based on the Mel frequency bands, the sub-audio feature information of each frame can be obtained.

[0167] The NextVlad network is used to reduce the dimensionality of the sub-audio feature information obtained from the VGGish model to video-level audio feature information, thus obtaining audio content feature information. VGGish has a strong ability to express scene-based sound events. Using VGGish to extract audio features from the audio sub-content significantly improves the accuracy of popularity prediction for content such as action movies and music.
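The framing figures given above can be checked with simple arithmetic. The sketch below only computes sizes and does not run the VGGish model; it assumes no padding at the end of the signal and that a 0.96 s frame spans 96 steps of 10 ms (0.96 / 0.010). The constant and function names are chosen for illustration.

```python
SAMPLE_RATE = 16_000                        # 16 kHz mono audio
WINDOW_SAMPLES = SAMPLE_RATE * 25 // 1000   # 25 ms window -> 400 samples
HOP_SAMPLES = SAMPLE_RATE * 10 // 1000      # 10 ms frame shift -> 160 samples
STEPS_PER_FRAME = 96                        # 0.96 s frame / 10 ms step
NUM_MEL_BANDS = 64


def stft_step_count(num_samples):
    """Number of full STFT windows that fit into the signal (no padding)."""
    if num_samples < WINDOW_SAMPLES:
        return 0
    return 1 + (num_samples - WINDOW_SAMPLES) // HOP_SAMPLES


def mel_input_shape(num_samples):
    """Shape (frames, steps per frame, Mel bands) of the Mel spectrum input."""
    frames = stft_step_count(num_samples) // STEPS_PER_FRAME
    return (frames, STEPS_PER_FRAME, NUM_MEL_BANDS)


# Hypothetical 10-second audio sub-content.
shape = mel_input_shape(10 * SAMPLE_RATE)
```

Under these assumptions, a 10-second clip yields 998 STFT steps, which fill 10 complete non-overlapping frames; the remaining steps do not form a full frame and are dropped in this sketch.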

[0168] In one embodiment, the sub-content also includes text sub-content. In addition to separating the text-formatted content from the content to be distributed, the text sub-content can also be extracted from the video and audio sub-content. That is, the sub-content includes text sub-content, video sub-content, and audio sub-content. The step "extracting content features from the sub-content of different modalities to obtain content feature information for each sub-content" can specifically include:

[0169] Obtain content-related information for the content to be distributed;

[0170] Perform text recognition on video sub-contents to obtain the video text content contained in the video sub-content frames;

[0171] Speech recognition is performed on the audio sub-content to obtain the audio text content contained in the video sub-content frames;

[0172] The video text content, audio text content, and content-related information are treated as text sub-content;

[0173] Text features are extracted from the sub-contents of the text to obtain the content feature information of the sub-contents.

[0174] The content-related information may include the title, description, author, and personal information of the content to be distributed.

[0175] The video text content may include text content obtained from the video sub-content, such as dialogue and monologues between characters in the video; the audio text content may include text content obtained from the audio sub-content.

[0176] For example, it can be based on OCR technology to perform text recognition on the video frame sequence in the video sub-content to obtain the video text content, and use Automatic Speech Recognition (ASR) technology to perform speech recognition on the audio sub-content to obtain the audio text content. The video text content, audio text content, and content-related information are used as text sub-content; text features are extracted from the text sub-content to obtain the content feature information of the text sub-content.

[0177] Optionally, before the step "perform speech recognition on the audio sub-content to obtain the audio text content contained in the video sub-content frame", the subtitles of the video sub-content can be detected. If there are no subtitles, speech recognition is performed on the audio sub-content to obtain the audio text content; if there are subtitles, speech recognition is not performed. By extracting the audio text content from the audio sub-content, the lack of text information can be made up for when the video sub-content lacks subtitles.

[0178] Optionally, after the step "perform text recognition on video sub-content to obtain the video text content contained in the video sub-content frame", text denoising processing can be performed on the video text content. For example, text recognition is performed by selecting the text to be recognized in the video content frame to obtain a text recognition box, and then recognizing the text in the text recognition box to obtain the text content. Text denoising processing can filter text content consisting of single characters, pure numbers, and pure letters, filter text content where the position offset of the bounding box (bbox) between two adjacent video content frames is small and the text repetition rate is high, and filter text content where the bbox is at the bottom of the screen and has a small height, etc. After denoising processing, the video text content is obtained.
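The text denoising rules above can be sketched as follows. This is an illustrative sketch only: the `OcrBox` structure, the normalized coordinates, all threshold values, and the use of exact text equality as a proxy for a "high repetition rate" are assumptions, not details given in this application.

```python
import re
from dataclasses import dataclass


@dataclass
class OcrBox:
    text: str
    x: float       # left edge, normalized to [0, 1]
    y: float       # top edge, normalized to [0, 1]
    height: float  # box height, normalized to [0, 1]


def is_noise_text(text):
    """Single characters, pure numbers, and pure letters count as noise."""
    stripped = text.strip()
    return (
        len(stripped) <= 1
        or stripped.isdigit()
        or re.fullmatch(r"[A-Za-z]+", stripped) is not None
    )


def denoise(frames, max_offset=0.02, bottom_y=0.9, min_height=0.03):
    """Filter OCR boxes over consecutive video content frames."""
    kept, previous = [], []
    for boxes in frames:
        for box in boxes:
            if is_noise_text(box.text):
                continue
            # Small boxes at the bottom of the screen are dropped.
            if box.y >= bottom_y and box.height < min_height:
                continue
            # Same text at nearly the same position as in the previous frame.
            repeated = any(
                p.text == box.text
                and abs(p.x - box.x) <= max_offset
                and abs(p.y - box.y) <= max_offset
                for p in previous
            )
            if not repeated:
                kept.append(box.text)
        previous = boxes
    return kept


frames = [
    [OcrBox("Breaking news", 0.1, 0.10, 0.05), OcrBox("7", 0.5, 0.5, 0.05),
     OcrBox("abc", 0.2, 0.3, 0.05)],
    [OcrBox("Breaking news", 0.1, 0.11, 0.05),
     OcrBox("tiny footer text", 0.1, 0.95, 0.02),
     OcrBox("Hello world 2", 0.3, 0.4, 0.05)],
]
video_text = denoise(frames)
```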

[0179] 1054. Based on the object data of the content publishing object, extract the object features of the content publishing object to obtain the object feature information of the content publishing object.

[0180] The object data can include attribute information of the content publishing object, such as number of followers, number of likes, like rate, number of favorites, favorite rate, level, style, ranking on lists, and the field it belongs to.

[0181] The object characteristic information may include information that characterizes the features of the content publishing object.

[0182] For example, object data can be mapped in a preset order to obtain object feature information, which includes feature information representing the object data, such as feature values or feature vectors. Alternatively, object features can be extracted from the object data using a feature embedding network. For example, a Transformer network can be used as the embedding network, and the object data can be input into the Transformer network. The Transformer network can then extract object features based on the object data to obtain object feature information.

[0183] 1055. Perform feature fusion processing on the interaction trend feature information, multimodal content feature information and object feature information to obtain the fused content feature information of the content to be distributed.

[0184] The fused content feature information may include information that characterizes the overall features of the content to be distributed, such as changes in the interaction data of the content to be distributed, the feature information of its content publishing objects, and its own feature information.

[0185] For example, interactive trend features, multimodal content features, and object features can be concatenated or added together to perform feature fusion processing, resulting in fused content feature information of the content to be distributed.

[0186] 1056. Based on the fused content feature information, perform content recommendation on the content to be distributed.

[0187] For example, whether to recommend the content to be distributed can be predicted based on the fused content feature information. If so, the content is recommended to users; otherwise, it is not. Optionally, a recommendation score of the content to be distributed can be predicted based on the fused content feature information. The recommendation score represents the popularity of the content: a higher recommendation score means more users are expected to like the content, and a lower recommendation score means fewer. If the recommendation score is greater than a preset threshold, the content to be distributed is recommended to users.

[0188] Optionally, a classifier can be used to classify the content to be distributed based on the fused content feature information, determine the recommendation level of the content to be distributed based on the classification result, and recommend content based on the recommendation level. That is, in one embodiment, the step "recommend content based on the fused content feature information" may specifically include:

[0189] Based on the fused content feature information, predict the recommendation level of the content to be distributed;

[0190] If the recommendation level meets the preset conditions, then obtain the target object corresponding to the content to be distributed;

[0191] Recommend content to be distributed to the target audience.

[0192] A high recommendation level indicates that the content to be distributed is popular, while a low recommendation level indicates that the content to be distributed is unpopular.

[0193] The target audience can include the intended recipients of the content to be distributed, as well as users who are interested in the content.

[0194] For example, the fused content feature information could be input into a classifier, which then classifies the content to be distributed. The recommendation level of the content is determined based on the classification result. If the recommendation level of the content to be distributed meets the preset condition, the corresponding target audience is obtained. For instance, the target audience could be determined based on the multimodal content feature information of the content to be distributed and the users' interests, or the target audience of content similar to the content to be distributed could be used as its target audience. The content to be distributed is then recommended to that target audience.
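The decision flow described above can be sketched as follows. The classifier itself is not modeled; its per-level scores are taken as given. The function names, the use of cosine similarity between content features and user interest vectors, and both threshold values are assumptions made for illustration.

```python
import math


def predict_level(class_scores):
    """Take the index of the highest classifier score as the level."""
    return max(range(len(class_scores)), key=lambda i: class_scores[i])


def cosine_similarity(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / norm


def recommend(class_scores, content_vec, user_interests,
              min_level=1, min_similarity=0.5):
    """Return the users to whom the content is recommended, if any."""
    if predict_level(class_scores) < min_level:
        # Unpopular content is filtered before any audience is computed.
        return []
    return [
        user_id
        for user_id, interest_vec in user_interests.items()
        if cosine_similarity(content_vec, interest_vec) >= min_similarity
    ]


# Hypothetical interest vectors for two users.
users = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
popular = recommend([0.1, 0.2, 0.7], [1.0, 0.0], users)
unpopular = recommend([0.9, 0.05, 0.05], [1.0, 0.0], users)
```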

[0195] As can be seen from the above, the embodiments of this application obtain a sample set and extract multimodal sub-contents for each content sample in the sample set to obtain multiple sub-contents with different modalities of the content sample; generate aligned positive samples based on the first and second sub-contents with different modalities in each content sample; generate aligned negative samples based on the first sub-content of each content sample and the different second sub-contents of the different content samples in the sample set, wherein the different second sub-contents have the same modality as the second sub-contents; train the content feature extraction model based on the aligned positive samples and aligned negative samples to obtain the trained content feature extraction model, and recommend content based on the multimodal content feature information obtained by multimodal feature extraction of the content to be distributed through the trained content feature extraction model.

[0196] This application embodiment constructs aligned positive and aligned negative samples by analyzing the multimodal content contained in the content samples in the sample set. This eliminates the need for manual labeling of the content samples, and the sample pairs composed of sub-contents of different modalities enable the content feature extraction model to better learn the semantic information of different modal content and the correlation between different modal content, thereby improving the feature extraction capability of the content feature extraction model and thus improving the accuracy of content recommendation.

[0197] Based on the above embodiments, the following examples will provide further detailed explanations.

[0198] This embodiment will take video content as an example, with the content sample being video content and the content to be distributed being video content, and will describe it from the perspective of a content recommendation device. Specifically, the content recommendation device can be integrated into a computer device, which can be a server or a terminal or other device.

[0199] This application provides a content recommendation method. As shown in Figure 3, the specific process of this content recommendation method can be summarized as follows:

[0200] Training phase:

[0201] 2011. The server obtains a sample set and performs multimodal sub-content extraction on each content sample in the sample set to obtain multiple sub-contents with different modalities of the content sample.

[0202] For example, taking video as the content sample, the server can extract audio from the video to obtain audio sub-content, and use the video frame sequence as video sub-content; optionally, it can also use Optical Character Recognition (OCR) to perform text recognition on the video frame sequence to obtain text sub-content, etc.; optionally, it can also use Automatic Speech Recognition (ASR) to perform speech recognition on the audio sub-content to obtain text sub-content.

[0203] 2012. The server generates aligned positive samples based on the first and second sub-contents with different modalities in each content sample.

[0204] For example, the server can construct a first aligned positive sample from video content and audio content, construct a second aligned positive sample from video content and text content, and generate the aligned positive samples based on the first and second aligned positive samples of each content sample.

[0205] 2013. The server generates aligned negative samples based on the first sub-content of each content sample and the second sub-content of the dissimilar content samples in the sample set that meet the preset conditions.

[0206] For example, the server can cluster the content samples in the sample set to obtain multiple clusters, calculate the similarity between clusters based on the distance between their cluster centers, and, for each content sample, select dissimilar content samples from clusters whose similarity to the cluster containing that content sample is less than a threshold.

[0207] Based on the first sub-content of each content sample and the second sub-content of its corresponding dissimilar content sample, an aligned negative sample is generated.
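Steps 2012 and 2013 can be sketched as follows. The clustering itself is assumed to have been done already; each sample simply carries a cluster label. The mapping from center distance to similarity, `1 / (1 + distance)`, the threshold value, and all names are assumptions for this sketch, since this application does not fix them.

```python
import math


def build_positive_pairs(sample):
    """Pair sub-contents of different modalities from the same sample."""
    return [
        (sample["video"], sample["audio"]),
        (sample["video"], sample["text"]),
    ]


def cluster_similarity(center_a, center_b):
    """Assumed mapping from cluster-center distance to similarity."""
    return 1.0 / (1.0 + math.dist(center_a, center_b))


def build_negative_pairs(samples, centers, threshold=0.5):
    """Pair each sample's video with audio from dissimilar clusters."""
    pairs = []
    for anchor in samples:
        for other in samples:
            if other["cluster"] == anchor["cluster"]:
                continue
            similarity = cluster_similarity(
                centers[anchor["cluster"]], centers[other["cluster"]]
            )
            if similarity < threshold:
                pairs.append((anchor["video"], other["audio"]))
    return pairs


# Hypothetical samples; strings stand in for the actual sub-contents.
samples = [
    {"cluster": 0, "video": "v_a", "audio": "a_a", "text": "t_a"},
    {"cluster": 1, "video": "v_b", "audio": "a_b", "text": "t_b"},
    {"cluster": 2, "video": "v_c", "audio": "a_c", "text": "t_c"},
]
centers = {0: (0.0, 0.0), 1: (0.5, 0.0), 2: (3.0, 4.0)}
positives = build_positive_pairs(samples[0])
negatives = build_negative_pairs(samples, centers)
```

In this example, clusters 0 and 1 lie close together, so their samples are not paired as negatives, whereas both are paired with the sample from the distant cluster 2. No manual labeling is involved in either step.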

[0208] 2014. The server calculates the negative sample weights corresponding to the aligned negative samples based on the similarity between the content samples and the dissimilar content samples.

[0209] For example, the server can calculate the similarity between content samples and dissimilar content samples, and calculate the negative sample weights of aligned negative samples based on the similarity. The relationship between similarity and negative sample weights can be: W = -S + 1, where W is the negative sample weight of aligned negative samples and S is the similarity between content samples and dissimilar content samples.
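The relationship W = -S + 1 can be sketched as follows. How the similarity S is computed is not specified here; cosine similarity between two feature vectors, clamped to the range [0, 1], is used purely as an assumption, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / norm


def negative_weight(similarity):
    """W = -S + 1, with S clamped to [0, 1] so that W stays in [0, 1]."""
    clamped = min(1.0, max(0.0, similarity))
    return 1.0 - clamped


# Hypothetical feature vectors of a content sample and a dissimilar sample.
weight_orthogonal = negative_weight(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
weight_identical = negative_weight(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
```

The less similar the dissimilar content sample is to the content sample, the larger the weight of the corresponding aligned negative sample, so clearly unrelated pairs contribute more to training than borderline ones.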

[0210] 2015. The server trains the content feature extraction model based on the aligned positive samples, the aligned negative samples, and the negative sample weights to obtain the trained content feature extraction model.

[0211] That is, the aligned positive samples, the aligned negative samples, and the negative sample weights assigned to the aligned negative samples are used together to train the content feature extraction model.

[0212] Application phase:

[0213] 2021. The server obtains the content to be distributed, as well as the content interaction data and content publishing objects of the content to be distributed.

[0214] For example, the server could retrieve the content to be distributed from a database or the Internet, the content publishing target of the content to be distributed, and the content interaction data of the content to be distributed over a past period.

[0215] The server can obtain content interaction data for the content to be distributed by statistically analyzing interactive behaviors such as "reading", "forwarding", "favoriting", "liking" and "commenting" within a 5-minute time period. The interaction data sequence is then obtained based on the statistical data. The length of the interaction data sequence can be no less than 12, meaning at least 1 hour of content interaction data must be obtained. The interaction data sequence is denoted as {v1, v2, ..., vt}.

[0216] 2022. The server performs trend fitting on the content interaction data to obtain the interaction trend feature information of the content to be distributed.

[0217] For example, as shown in Figure 4, the server inputs the interaction data sequence {v1, v2, ..., vt} into the Transformer model. The Transformer model is used to fit the long-term growth trend of the content interaction data (e.g., total views) of the content to be distributed over time. The advantage of the Transformer model is that its memory unit retains historical information and it is good at capturing the dependencies within a time series (e.g., the interaction data sequence). Therefore, it does not require specific assumptions about the functional form of historical trends.

[0218] Due to various factors, the curve of the content interaction data (e.g., total readership) exhibits both rising and falling phases, as shown in Figure 4. These short-term fluctuations in the content interaction data can be captured using a convolutional neural network (CNN).

[0219] The server uses the long-term growth trend fitted by the Transformer model and the short-term fluctuation trend captured by the CNN as the interaction trend feature information.
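To illustrate only the idea of combining a long-term trend with short-term fluctuations, the sketch below substitutes much simpler techniques for the models named above: a least-squares straight line stands in for the long-term fit, and first differences between adjacent periods stand in for the short-term fluctuations. It is not the method of this embodiment, and all names are hypothetical.

```python
def linear_trend(sequence):
    """Least-squares slope and intercept over period indices 0..n-1."""
    n = len(sequence)
    if n < 2:
        raise ValueError("at least two periods are required")
    mean_x = (n - 1) / 2
    mean_y = sum(sequence) / n
    covariance = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(sequence))
    variance = sum((i - mean_x) ** 2 for i in range(n))
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def short_term_fluctuations(sequence):
    """Change between adjacent periods (rises are positive, falls negative)."""
    return [later - earlier for earlier, later in zip(sequence, sequence[1:])]


def trend_features(sequence):
    """Concatenate the long-term and short-term descriptors."""
    slope, intercept = linear_trend(sequence)
    return [slope, intercept] + short_term_fluctuations(sequence)


# Hypothetical interaction data sequence {v1, ..., v4}.
features = trend_features([1, 3, 5, 7])
```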

[0220] 2023. The server extracts content from the content to be distributed, obtaining sub-content of different modalities contained in the content to be distributed.

[0221] Taking video as an example of content to be distributed, the server can extract audio data from the video to obtain audio sub-content, and use the video frame sequence as video sub-content; it can also use Optical Character Recognition (OCR) to perform text recognition on the video frame sequence to obtain video text content, and use Automatic Speech Recognition (ASR) to perform speech recognition on the audio sub-content to obtain audio text content, as well as obtain content-related information such as the title, description, author, and people information of the content to be distributed; and use the video text content, audio text content, and content-related information as text sub-content.

[0222] 2024. The server extracts content features from the sub-content of different modalities to obtain multimodal content feature information of the content to be distributed.

[0223] For example, the server could extract a portion of the video frames from the video sub-content. Specifically, it could extract a preset number of frames from the video sub-content at regular intervals; these extracted frames are the video content frames. As shown in Figure 4, the server uses the Swin-Transformer (also known as SwinT) network as the feature embedding layer to extract image features from the video content frames, obtaining the image feature information corresponding to each video content frame. The NextVlad network is then used to aggregate the image feature information of the multiple video content frames, converting the frame-level feature information into video-level feature information.

[0224] As shown in Figure 4, the server extracts audio features from the audio sub-content using the VGGish network, which yields frame-level sub-audio feature information. The NextVlad network is then used to reduce the dimensionality of the sub-audio feature information obtained from the VGGish model to video-level audio feature information, thus obtaining the audio content feature information.

[0225] As shown in Figure 4, the server uses OCR technology to perform text recognition on the video frame sequence in the video sub-content to obtain the video text content, and uses ASR technology to perform speech recognition on the audio sub-content to obtain the audio text content. Then, it extracts text features from the video text content, audio text content, and content-related information to obtain the content feature information of the text sub-content.

[0226] 2025. Based on the object data of the content publishing object, the server extracts object features from the content publishing object to obtain object feature information of the content publishing object.

[0227] For example, the server can input object data into the Transformer network, and the Transformer network can extract object features based on the object data to obtain object feature information.

[0228] 2026. The server performs feature fusion processing on interaction trend feature information, multimodal content feature information and object feature information to obtain the fused content feature information of the content to be distributed.

[0229] For example, the server can concatenate or add interactive trend features, multimodal content features, and object features to perform feature fusion processing and obtain the fused content feature information of the content to be distributed.

[0230] 2027. The server predicts the recommendation level of the content to be distributed based on the fused content feature information.

[0231] For example, the server can input the fused content feature information into a classifier, which then classifies the content to be distributed and determines the recommendation level of the content based on the classification results.

[0232] 2028. If the recommendation level meets the preset conditions, the server recommends the content to be distributed.

[0233] If the recommendation level of the content to be distributed meets the preset conditions, the content is expected to become popular, and the server obtains the corresponding target object. For example, the server can determine the target object based on the multimodal content feature information of the content to be distributed and the interests of users, or it can take the target object of content similar to the content to be distributed as the target object. The server then distributes the content to be distributed to the target object.

[0234] If the content to be distributed is predicted to be unpopular, it is filtered out, so that the server neither computes a target object for it nor recommends it, thus saving the related resources.
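Steps 2027 and 2028 can be sketched as a classification followed by a gate. This is a minimal illustration, not the classifier of the embodiment: the label set and its order in `LEVELS`, the softmax over raw classifier scores, and the rule that only "popular" content passes the gate are all assumptions.

```python
import math

# Hypothetical label set and order; the description gives popular,
# unpopular, and normal only as example recommendation levels.
LEVELS = ("popular", "normal", "unpopular")


def softmax(logits: list[float]) -> list[float]:
    """Convert raw classifier scores into probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def predict_level(logits: list[float]) -> str:
    """Return the level with the highest predicted probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LEVELS[best]


def should_distribute(logits: list[float], accepted=("popular",)) -> bool:
    """Gate: only content whose level is in `accepted` is recommended."""
    return predict_level(logits) in accepted


print(predict_level([2.0, 0.5, -1.0]))
print(should_distribute([0.1, 0.2, 3.0]))
```

Content that fails the gate is dropped before any target object is computed, which is where the resource saving described above would come from.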

[0235] As can be seen from the above, in this embodiment of the application, the server obtains a sample set and performs multimodal sub-content extraction on each content sample in the sample set to obtain multiple sub-contents with different modalities of the content sample; aligned positive samples are generated based on the first and second sub-contents with different modalities in each content sample; aligned negative samples are generated based on the first sub-content of each content sample and the second sub-contents of dissimilar content samples in the sample set that meet preset conditions; the negative sample weights corresponding to the aligned negative samples are calculated according to the similarity between the content samples and the dissimilar content samples; and the content feature extraction model is trained according to the aligned positive samples, aligned negative samples, and negative sample weights to obtain the trained content feature extraction model.

[0236] The server acquires the content to be distributed, along with its content interaction data and the content publishing object; it performs trend fitting on the content interaction data to obtain the interaction trend feature information of the content to be distributed; it extracts content from the content to be distributed, obtaining sub-content of different modalities contained within the content; it extracts content features from the sub-content of different modalities to obtain the multimodal content feature information of the content to be distributed; based on the object data of the content publishing object, it extracts object features from the content publishing object to obtain the object feature information of the content publishing object; it performs feature fusion processing on the interaction trend feature information, multimodal content feature information, and object feature information to obtain the fused content feature information of the content to be distributed; based on the fused content feature information, it predicts the recommendation degree of the content to be distributed; if the recommendation degree meets the preset conditions, the server recommends the content to be distributed.

[0237] This application's embodiments construct aligned positive and negative samples from the multimodal content contained in the sample set. This eliminates the need for manual labeling of the content samples, and the sample pairs composed of sub-content from different modalities allow the content feature extraction model to better learn the semantic information of different modalities and the correlations between them, thus improving the model's feature extraction capabilities. Furthermore, it integrates the interaction trend features of the content to be distributed, multimodal content features, and the object features of the content publisher to predict the recommendation level of the content to be distributed. This not only captures the interaction trends of the content interaction data over time but also utilizes the multimodal content features and the object features of the content publisher to capture the commonalities between the content to be distributed, the content publisher, and the content that users are interested in, making content recommendations more accurate.

[0238] In one embodiment, as shown in Figure 5, the embodiments of this application also provide a content recommendation system, as detailed below:

[0239] I. Content Production and Content Consumption

[0240] (1) Content producers, such as professionally generated content (PGC), user-generated content (UGC), multi-channel network (MCN), and professional user-generated content (PUGC) producers, provide content through mobile devices or backends. These are the main sources of the content to be distributed.

[0241] (2) The content production end communicates with the uplink and downlink content interface server to first obtain the interface address of the upload server, and then uploads the video content.

[0242] (3) The content consumption end communicates with the uplink and downlink content interface server to obtain the index information of the video to be accessed, and then communicates with the content database to obtain the corresponding content.

[0243] (4) The content consumption terminal will report the user's interactive behavior data such as reading, clicking, swiping, sharing, collecting, and forwarding during the browsing process to the server.

[0244] (5) Content consumers can browse content through Feeds. If there is content in Feeds that meets the recommendation criteria, the content can be pinned to the top, or it can be pushed to more users through active PUSH.

[0245] II. Uplink and Downlink Content Interface Server

[0246] (1) Communicate directly with the content production end and store content-related information such as title, publisher, summary, cover image and release time of the content submitted from the front end into the video content database.

[0247] (2) Write the content's metadata, such as file size, cover image link, title, publication time, and author, into the content database.

[0248] (3) Submit the uploaded data to the dispatch center server for subsequent content processing and transfer.

[0249] III. Content Database

[0250] (1) The metadata of the content published by the content producer is stored in this database, which may also include the classification of the content during the review process (including first, second and third level video classification and tag information).

[0251] (2) The review process will read data from the content database, and the review results and status will also be sent back to the content database.

[0252] IV. Dispatch Center Server

[0253] (1) Responsible for the entire scheduling process of content flow, receiving content through the uplink and downlink content interface server and storing it in the content database, and then obtaining the video metadata from the database.

[0254] (2) Schedule the review system and the machine processing system, and control the order and priority of scheduling.

[0255] (5) Provide content to content consumers through recommendation and distribution services.

[0256] VI. Video Content Popularity Prediction Server

[0257] (1) The video content popularity prediction model is engineered to provide actual online service capabilities.

[0258] (2) Accept scheduling from the scheduling service center and predict the popularity of the published and activated content and mark the results.

[0259] VII. Statistical Reporting Interface Server

[0260] (1) Receive user interaction data such as reading, clicking, swiping, sharing, collecting, and forwarding during the content distribution process.

[0261] (2) Provide necessary data support for interactive behavior analysis services, so as to build the short-term and long-term trend statistical analysis data and time-based sequences required for subsequent modeling.

[0262] VIII. Interactive Behavior Analysis and Statistical Services

[0263] (1) Accept the data written by the statistical reporting interface server, and at the same time provide the necessary data for the content recommendation degree prediction model server to model the distribution process.

[0264] (2) The interactive behaviors such as "reading", "forwarding", "favoriting", "liking" and "commenting" are statistically analyzed every 5 minutes to obtain the interactive data sequence {v1, v2, ..., vt}. The Transformer network is used to model the long-term growth trend of video distribution, while CNN is used to capture short-term fluctuations. The Transformer network is used to fit the reading volume growth curve, and a 1D-CNN (1D is 1 day, or 24 hours) network is used to capture the explosive growth of reading volume.
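The construction of the interaction data sequence {v1, v2, ..., vt} from raw interaction events can be sketched as below. The function name, the use of integer timestamps in seconds, and the half-open range `[start, end)` are assumptions for illustration; only the 5-minute statistics window comes from the description.

```python
def build_interaction_sequence(
    event_times: list[int], start: int, end: int, window: int = 300
) -> list[int]:
    """Count interaction events per `window`-second bucket over [start, end).

    `event_times` are timestamps in seconds (hypothetical input format).
    The default window of 300 seconds matches the 5-minute statistics.
    """
    if end <= start or window <= 0:
        return []
    num_buckets = -(-(end - start) // window)  # ceiling division
    counts = [0] * num_buckets
    for t in event_times:
        if start <= t < end:
            counts[int((t - start) // window)] += 1
    return counts


# Six events over a 15-minute span; the event at t=900 falls outside the range.
events = [0, 10, 299, 300, 650, 900]
print(build_interaction_sequence(events, start=0, end=900))
```

One such sequence could be built per interaction type (reading, forwarding, and so on) and then passed to the trend-fitting networks.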

[0265] IX. Content and Account Feature Modeling Server

[0266] (1) As mentioned above, video content feature modeling uses the SwinT + NeXtVLAD, VGGish, and BERT networks to process the video frames, the audio modality, and the text modality of the video content separately.

[0267] (2) Account feature modeling of content producers (content production objects). Account data includes the account's user click-through rate, user like rate, user comment rate, user forwarding rate, historical content activation rate, number of active followers, and number of times the account ranks on external new lists. Account performance has a certain cumulative effect over time, so the performance of the content published by the account in the past 30 days is used to accumulate these features for the account. Account metadata includes the account category, account level (e.g., it can include four levels: authoritative, high-quality, and potential), account registration time, and follower level (e.g., individual, tens, hundreds, thousands, tens of thousands, hundreds of thousands, millions, tens of millions, and hundreds of millions).
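The follower-level metadata listed above can be derived from a raw follower count with a simple order-of-magnitude bucketing. This is a hedged sketch: mapping by the number of decimal digits, the function name `follower_level`, and capping very large counts at the top level are assumptions, not details given in the description.

```python
# Level names taken from the example list in the description.
FOLLOWER_LEVELS = [
    "individual",
    "tens",
    "hundreds",
    "thousands",
    "tens of thousands",
    "hundreds of thousands",
    "millions",
    "tens of millions",
    "hundreds of millions",
]


def follower_level(count: int) -> str:
    """Map a follower count to an order-of-magnitude level.

    Assumption: the level is decided by the number of decimal digits,
    and counts beyond the last level are capped at the last level.
    """
    if count < 0:
        raise ValueError("follower count must be non-negative")
    digits = len(str(count))  # 0..9 -> 1 digit, 10..99 -> 2 digits, ...
    index = min(digits, len(FOLLOWER_LEVELS)) - 1
    return FOLLOWER_LEVELS[index]


for n in (7, 42, 1234, 5_000_000):
    print(n, follower_level(n))
```

The resulting level is categorical, so it would typically be encoded (for example as an index or one-hot vector) before being fed to the object feature network.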

[0268] X. Content Recommendation Prediction Model Server

[0269] (1) According to the model structure and modal processing method described in the above embodiments, the interactive trend feature information, multimodal content feature information and object feature information of the content to be distributed are modeled and fused to output the recommendation degree with the highest prediction probability (for example, it may include popular, unpopular and normal).

[0270] (2) The corresponding category entries can be recorded in the content dimension for further recommendation and operation.

[0271] This application integrates the interaction trend characteristics of the content to be distributed, multimodal content characteristics, and object characteristics of the content publisher to predict the recommendation level of the content to be distributed. It can not only capture the interaction trend of the content interaction data of the content to be distributed over time, but also use multimodal content characteristics and object characteristics of the content publisher to capture the commonalities between the content to be distributed, the content publisher, and the content that users are interested in, making the content recommendation more accurate.

[0272] To facilitate better implementation of the content recommendation method provided in the embodiments of this application, a content recommendation apparatus is also provided in one embodiment. The meanings of the terms used are the same as in the content recommendation method described above, and specific implementation details can be found in the description of the method embodiments.

[0273] This content recommendation device can be integrated into a computer device. As shown in Figure 6, the content recommendation device may include: an acquisition unit 301, a positive sample generation unit 302, a negative sample generation unit 303, and a training unit 304, as detailed below:

[0274] (1) Acquisition unit 301: used to acquire a sample set and extract multimodal sub-contents for each content sample in the sample set to obtain multiple sub-contents with different modalities of the content sample.

[0275] (2) Positive sample generation unit 302: used to generate aligned positive samples based on the first and second sub-contents with different modalities in each content sample.

[0276] In one embodiment, the positive sample generation unit 302 may include a first positive sample generation subunit, a second positive sample generation subunit, and a third positive sample generation subunit, specifically:

[0277] First Positive Sample Generation Subunit: Used to generate the first aligned positive sample based on the video and audio content in each content sample;

[0278] Second Positive Sample Generation Subunit: Used to generate a second aligned positive sample based on the text content and audio content in each content sample;

[0279] The third positive sample generation subunit is used to generate aligned positive samples based on the first and second aligned positive samples corresponding to each content sample.

[0280] (3) Negative sample generation unit 303: used to generate aligned negative samples based on the first sub-content of each content sample and the dissimilar second sub-content of the dissimilar content samples of that content sample in the sample set, wherein the dissimilar second sub-content has the same modality as the second sub-content.

[0281] (4) Training unit 304: used to train the content feature extraction model based on aligned positive samples and aligned negative samples, so as to obtain the trained content feature extraction model, and to recommend content based on the multimodal content feature information obtained by multimodal feature extraction of the content to be distributed through the trained content feature extraction model.
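The positive and negative sample generation units can be sketched as a pairing routine over the sample set. This is an illustration under assumptions: the `ContentSample` fields, the string placeholders for sub-contents, the 1/0 labels, and the choice to pair each sample with every other sample (rather than a selected subset of dissimilar samples) are hypothetical. The pairing pattern (video with audio and text with audio as positives; a sample's video with another sample's audio or text as negatives) follows the description and claim 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentSample:
    """Hypothetical container for the sub-contents of one content sample."""
    sample_id: str
    video: str
    audio: str
    text: str


def build_aligned_pairs(samples: list[ContentSample]):
    """Return (positives, negatives) as lists of (first, second, label) tuples."""
    positives, negatives = [], []
    for sample in samples:
        # Positives: sub-contents of different modalities from the SAME sample.
        positives.append((sample.video, sample.audio, 1))
        positives.append((sample.text, sample.audio, 1))
        for other in samples:
            if other.sample_id == sample.sample_id:
                continue
            # Negatives: this sample's video with ANOTHER sample's audio or text.
            negatives.append((sample.video, other.audio, 0))
            negatives.append((sample.video, other.text, 0))
    return positives, negatives


samples = [
    ContentSample("a", "video_a", "audio_a", "text_a"),
    ContentSample("b", "video_b", "audio_b", "text_b"),
]
positives, negatives = build_aligned_pairs(samples)
print(len(positives), len(negatives))
```

Because the labels follow directly from whether the two sub-contents share a source sample, no manual annotation is needed to build the training pairs.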

[0282] In one embodiment, the training unit 304 may include a mask subunit and a first model training subunit, specifically:

[0283] Masking sub-unit: used to mask the aligned positive sample and the aligned negative sample respectively, to obtain the masked positive sample and the masked negative sample;

[0284] The first model training subunit is used to train the content feature extraction model based on the masked positive and negative samples.
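The masking subunit can be illustrated with a short sketch. It assumes each aligned sample is represented as a sequence of tokens or feature positions; the `[MASK]` placeholder, the 15% default ratio, and the uniform random choice of positions are assumptions, since the description does not specify the masking scheme here.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a masked position


def mask_tokens(tokens: list[str], ratio: float = 0.15, seed=None):
    """Replace a random subset of positions with MASK.

    Returns the masked copy and the sorted list of masked positions,
    so that a model can later be trained to recover the originals.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    num_to_mask = min(len(tokens), max(1, round(len(tokens) * ratio)))
    positions = sorted(rng.sample(range(len(tokens)), num_to_mask))
    masked = list(tokens)
    for pos in positions:
        masked[pos] = MASK
    return masked, positions


tokens = list("abcdefghij")
masked, positions = mask_tokens(tokens, ratio=0.3, seed=0)
print(masked, positions)
```

The same routine would be applied to both aligned positive and aligned negative samples to obtain the masked positive and masked negative samples.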

[0285] In one embodiment, the training unit 304 may include a similarity calculation subunit, a weight calculation subunit, and a second model training subunit, specifically:

[0286] Similarity calculation subunit: used to calculate the similarity between content samples and dissimilar content samples based on the target modal sub-content of the content sample and the target modal sub-content of dissimilar content samples;

[0287] Weight calculation subunit: used to calculate the weight of the negative sample corresponding to the aligned negative sample based on similarity. Similarity and sample weight are negatively correlated.

[0288] The second model training subunit is used to train the content feature extraction model based on the aligned positive samples, the aligned negative samples, and the negative sample weights.
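The similarity and weight calculation subunits can be sketched as follows. The description only requires that the negative sample weight be negatively correlated with the similarity; the use of cosine similarity over target-modality feature vectors and the specific mapping `weight = 1 - similarity` are assumptions chosen for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def negative_sample_weight(sample_vec: list[float], other_vec: list[float]) -> float:
    """Weight for an aligned negative sample; higher similarity gives lower weight.

    Assumed mapping: clamp the similarity to [0, 1] and return 1 - similarity.
    """
    sim = cosine_similarity(sample_vec, other_vec)
    sim = max(0.0, min(1.0, sim))
    return 1.0 - sim


print(negative_sample_weight([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(negative_sample_weight([1.0, 0.0], [1.0, 1.0]))  # partially similar vectors
```

Down-weighting negatives that closely resemble the anchor sample reduces the penalty for pairs that may not be true negatives, for example two samples that cover the same topic.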

[0289] In one embodiment, as shown in Figure 7, the content recommendation device also includes a data acquisition unit 401, a fitting unit 402, a content feature extraction unit 403, an object feature extraction unit 404, a fusion unit 405, and a recommendation unit 406, specifically:

[0290] Data acquisition unit 401: used to acquire the content to be distributed, as well as the content interaction data and content publishing objects of the content to be distributed.

[0291] Fitting unit 402: Used to perform trend fitting on content interaction data to obtain interaction trend feature information of the content to be distributed.

[0292] In one embodiment, the fitting unit 402 may include a data acquisition subunit, a sequence generation subunit, and a trend fitting subunit, specifically:

[0293] Data Acquisition Subunit: Used to acquire content interaction data of the content to be distributed in each time period within a historical time period;

[0294] Sequence generation subunit: used to generate an interaction data sequence of the content to be distributed based on the number of content interactions in each time period;

[0295] Trend Fitting Subunit: Used to fit trends based on interactive data sequences to obtain interactive trend feature information of the content to be distributed.

[0296] Content feature extraction unit 403: used to extract content features from the multimodal content in the content to be distributed, and obtain the multimodal content feature information of the content to be distributed.

[0297] In one embodiment, the content feature extraction unit 403 may include a content extraction subunit, a feature extraction subunit, and an information determination subunit, specifically:

[0298] Content extraction subunit: used to extract content from the content to be distributed, and obtain sub-content of different modalities contained in the content to be distributed;

[0299] Feature extraction subunit: used to extract content features from sub-content of different modalities to obtain content feature information for each sub-content;

[0300] Information Determination Subunit: Used to obtain multimodal content feature information of the content to be distributed based on the content feature information of each sub-content.

[0301] In one embodiment, the sub-content includes video sub-content, and the feature extraction sub-unit may include a frame extraction module, an image feature extraction module, and a feature aggregation module, specifically:

[0302] Frame extraction module: Used to extract frames from video sub-content to obtain multiple video sub-content frames;

[0303] Image feature extraction module: used to extract image features from each video frame to obtain the image feature information corresponding to each video frame;

[0304] Feature aggregation module: This module aggregates the image feature information corresponding to each video frame to obtain the content feature information of the video sub-content.
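The role of the feature aggregation module, turning per-frame feature vectors into one video-level vector, can be illustrated with mean pooling. Note that mean pooling is a deliberately simple stand-in used here for illustration; the embodiment described earlier aggregates with the NeXtVLAD network, which this sketch does not implement.

```python
def mean_pool(frame_features: list[list[float]]) -> list[float]:
    """Aggregate frame-level vectors into one video-level vector by averaging.

    Stand-in for a learned aggregator; all frame vectors must share
    the same dimension.
    """
    if not frame_features:
        return []
    dim = len(frame_features[0])
    if any(len(vec) != dim for vec in frame_features):
        raise ValueError("all frame feature vectors must have the same dimension")
    count = len(frame_features)
    return [sum(vec[i] for vec in frame_features) / count for i in range(dim)]


# Two toy frame-level vectors standing in for image feature information.
print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```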

[0305] In one embodiment, the sub-content includes audio sub-content, and the feature extraction sub-unit may include a content acquisition module, a preprocessing module, and an audio feature extraction module, specifically:

[0306] Content Acquisition Module: Used to acquire the audio sub-content of the content to be distributed;

[0307] Preprocessing module: used to perform audio preprocessing on audio sub-contents to obtain the audio spectrum information of the audio sub-contents;

[0308] Audio feature extraction module: used to extract audio features from audio spectrum information to obtain content feature information of audio sub-content.
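The preprocessing step that turns audio samples into spectrum information can be sketched with a framed discrete Fourier transform. This is a minimal illustration under assumptions: the frame length, hop size, and the plain magnitude spectrum are hypothetical choices, and the naive DFT is used only to keep the example dependency-free. The description does not state which spectral representation is computed.

```python
import cmath
import math


def frame_signal(samples: list[float], frame_len: int, hop: int) -> list[list[float]]:
    """Split a sample sequence into overlapping fixed-length frames."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]


def magnitude_spectrum(frame: list[float]) -> list[float]:
    """Magnitude of the DFT for bins 0..N/2 of one frame (naive O(N^2) DFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(
            frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)
        )
        spectrum.append(abs(acc))
    return spectrum


# A pure tone that completes 4 cycles in a 64-sample frame.
tone = [math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]
spec = magnitude_spectrum(tone)
peak_bin = max(range(len(spec)), key=spec.__getitem__)
print(peak_bin)  # 4
```

Applying `magnitude_spectrum` to each frame returned by `frame_signal` yields a time-frequency representation that an audio feature extractor can consume.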

[0309] In one embodiment, the sub-content includes text sub-content, video sub-content, and audio sub-content. The feature extraction sub-unit may include an information acquisition module, a text recognition module, a speech recognition module, a content determination module, and a text feature extraction module. Specifically:

[0310] Information Acquisition Module: Used to acquire content-related information of the content to be distributed;

[0311] Text recognition module: used to perform text recognition on video sub-contents to obtain the video text sub-contents contained in the video sub-content frames;

[0312] Speech recognition module: used to perform speech recognition on audio sub-content to obtain the audio text sub-content contained in the audio sub-content;

[0313] Content determination module: used to treat video text sub-content, audio text sub-content, and content-related information as text sub-content;

[0314] Text feature extraction module: Used to extract text features from text sub-contents to obtain content feature information of the text sub-contents.

[0315] Object feature extraction unit 404: Used to extract object features from the content publishing object based on the object data of the content publishing object, and obtain the object feature information of the content publishing object.

[0316] Fusion unit 405: Used to perform feature fusion processing on interactive trend feature information, multimodal content feature information and object feature information to obtain fused content feature information of the content to be distributed.

[0317] Recommendation Unit 406: Used to recommend content to be distributed based on the fused content feature information.

[0318] In one embodiment, the recommendation unit 406 may include a prediction subunit, an object acquisition subunit, and a content recommendation subunit, specifically:

[0319] Prediction subunit: used to predict the recommendation level of the content to be distributed based on the fused content feature information;

[0320] Object retrieval subunit: Used to retrieve the target object corresponding to the content to be distributed if the recommendation level meets the preset conditions;

[0321] Content recommendation sub-unit: Used to recommend content to be distributed to the target object.

[0322] As can be seen from the above, the content recommendation device in this embodiment acquires a sample set through the acquisition unit 301, and performs multimodal sub-content extraction on each content sample in the sample set to obtain multiple sub-contents with different modalities of the content sample; the positive sample generation unit 302 generates aligned positive samples based on the first and second sub-contents with different modalities in each content sample; the negative sample generation unit 303 generates aligned negative samples based on the first sub-content of each content sample and the different second sub-contents of the different content samples in the sample set, wherein the different second sub-contents have the same modality as the second sub-contents; finally, the training unit 304 trains the content feature extraction model based on the aligned positive samples and the aligned negative samples to obtain the trained content feature extraction model, and performs content recommendation based on the multimodal content feature information obtained by multimodal feature extraction of the content to be distributed through the trained content feature extraction model.

[0323] This application embodiment constructs aligned positive and aligned negative samples by analyzing the multimodal content contained in the content samples in the sample set. This eliminates the need for manual labeling of the content samples, and the sample pairs composed of sub-contents of different modalities enable the content feature extraction model to better learn the semantic information of different modal content and the correlation between different modal content, thereby improving the feature extraction capability of the content feature extraction model and thus improving the accuracy of content recommendation.

[0324] This application also provides a computer device, which can be a terminal or a server. Figure 8 shows a structural schematic diagram of the computer device involved in the embodiments of this application, specifically:

[0325] The computer device may include components such as a processor 1001 with one or more processing cores, a memory 1002 with one or more computer-readable storage media, a power supply 1003, and an input unit 1004. Those skilled in the art will understand that the computer device structure shown in Figure 8 does not constitute a limitation on the computer device, which may include more or fewer components than shown, combine certain components, or have different component arrangements. Wherein:

[0326] The processor 1001 is the control center of the computer device. It connects various parts of the computer device via various interfaces and lines. By running or executing software programs and / or modules stored in the memory 1002, and by calling data stored in the memory 1002, it performs various functions of the computer device and processes data, thereby providing overall monitoring of the computer device. Optionally, the processor 1001 may include one or more processing cores; preferably, the processor 1001 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interface, and computer programs, and the modem processor mainly handles wireless communication. It is understood that the modem processor may not be integrated into the processor 1001.

[0327] The memory 1002 can be used to store software programs and modules. The processor 1001 executes various functional applications and data processing by running the software programs and modules stored in the memory 1002. The memory 1002 may mainly include a program storage area and a data storage area. The program storage area may store the operating system, computer programs required for at least one function (such as sound playback function, image playback function, etc.), etc.; the data storage area may store data created according to the use of the computer device, etc. In addition, the memory 1002 may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 1002 may also include a memory controller to provide the processor 1001 with access to the memory 1002.

[0328] The computer equipment also includes a power supply 1003 that supplies power to the various components. Preferably, the power supply 1003 can be logically connected to the processor 1001 through a power management system, thereby enabling functions such as charging, discharging, and power consumption management through the power management system. The power supply 1003 may also include one or more DC or AC power supplies, recharging systems, power fault detection circuits, power converters or inverters, power status indicators, and other arbitrary components.

[0329] The computer device may also include an input unit 1004, which can be used to receive input digital or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.

[0330] Although not shown, the computer device may also include a display unit, etc., which will not be described in detail here. Specifically, in this embodiment, the processor 1001 in the computer device loads the executable files corresponding to the processes of one or more computer programs into the memory 1002 according to the following instructions, and the processor 1001 runs the computer programs stored in the memory 1002 to realize various functions, as follows:

[0331] Obtain a sample set, and extract multimodal sub-contents for each content sample in the sample set to obtain multiple sub-contents with different modalities from the content sample;

[0332] Generate aligned positive samples based on the first and second sub-contents with different modalities in each content sample;

[0333] Based on the first sub-content of each content sample, and the distinct second sub-content of the distinct content samples of the content samples in the sample set, an aligned negative sample is generated, wherein the distinct second sub-content has the same modality as the second sub-content.

[0334] The content feature extraction model is trained using aligned positive and negative samples to obtain a trained content feature extraction model. Based on the multimodal content feature information obtained by multimodal feature extraction of the content to be distributed using the trained content feature extraction model, content recommendation is performed.

[0335] For details on the implementation of each of the above operations, please refer to the previous examples, which will not be repeated here.

[0336] As can be seen from the above, the computer device in this application embodiment can obtain a sample set and extract multimodal sub-contents for each content sample in the sample set to obtain multiple sub-contents with different modalities of the content sample; generate aligned positive samples based on the first and second sub-contents with different modalities in each content sample; generate aligned negative samples based on the first sub-content of each content sample and the different second sub-contents of the different content samples in the sample set, wherein the different second sub-contents have the same modality as the second sub-contents; train the content feature extraction model according to the aligned positive samples and aligned negative samples to obtain the trained content feature extraction model, and recommend content based on the multimodal content feature information obtained by multimodal feature extraction of the content to be distributed through the trained content feature extraction model.

[0337] This application embodiment constructs aligned positive and aligned negative samples by analyzing the multimodal content contained in the content samples in the sample set. This eliminates the need for manual labeling of the content samples, and the sample pairs composed of sub-contents of different modalities enable the content feature extraction model to better learn the semantic information of different modal content and the correlation between different modal content, thereby improving the feature extraction capability of the content feature extraction model and thus improving the accuracy of content recommendation.

[0338] According to one aspect of this application, a computer program product is provided, comprising a computer program containing computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform the methods provided in the various optional implementations of the above embodiments.

[0339] Those skilled in the art will understand that all or part of the steps in the various methods of the above embodiments can be performed by a computer program, or by a computer program controlling related hardware. The computer program can be stored in a computer-readable storage medium and loaded and executed by a processor.

[0340] Therefore, embodiments of this application provide a computer-readable storage medium storing a computer program that can be loaded by a processor to execute any of the content recommendation methods provided in embodiments of this application.

[0341] For details on the implementation of each of the above operations, please refer to the previous examples, which will not be repeated here.

[0342] The computer-readable storage medium may include: read-only memory (ROM), random access memory (RAM), disk or optical disk, etc.

[0343] Since the computer program stored in the computer-readable storage medium can execute any of the content recommendation methods provided in the embodiments of this application, it can achieve the beneficial effects that any of the content recommendation methods provided in the embodiments of this application can achieve, as detailed in the preceding embodiments, and will not be repeated here.

[0344] The foregoing has provided a detailed description of a content recommendation method, apparatus, computer device, and computer-readable storage medium provided in the embodiments of this application. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of this application. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this application. Therefore, the content of this specification should not be construed as a limitation of this application.

Claims

1. A content recommendation method, characterized by comprising:
obtaining a sample set, and performing multimodal sub-content extraction on each content sample in the sample set to obtain a plurality of sub-contents of different modalities of the content sample;
generating a first aligned positive sample based on the video content and the audio content in each content sample, generating a second aligned positive sample based on the text content and the audio content in each content sample, and generating aligned positive samples based on the first aligned positive sample and the second aligned positive sample corresponding to each content sample;
constructing aligned negative samples by combining the video content of the content sample with the audio content of a dissimilar content sample, and constructing aligned negative samples by combining the video content of the content sample with the text content of the dissimilar content sample;
masking the aligned positive samples and the aligned negative samples respectively to obtain masked positive samples and masked negative samples;
calculating a similarity between the content sample and the dissimilar content sample based on the target-modality sub-content of the content sample and the target-modality sub-content of the dissimilar content sample;
calculating a negative sample weight corresponding to the aligned negative sample according to the similarity, wherein the similarity is negatively correlated with the negative sample weight;
training a content feature extraction model based on the masked positive samples, the masked negative samples, the aligned positive samples, the aligned negative samples, and the negative sample weight to obtain a trained content feature extraction model;
obtaining content to be distributed, together with content interaction data and a content publishing object of the content to be distributed;
obtaining the content interaction data of the content to be distributed in each time period within a historical time range, generating an interaction data sequence of the content to be distributed based on the number of content interactions in each time period, and performing trend fitting based on the interaction data sequence to obtain interaction trend feature information of the content to be distributed;
extracting content features from the multimodal content in the content to be distributed using the trained content feature extraction model to obtain multimodal content feature information of the content to be distributed;
performing object feature extraction on the content publishing object based on object data of the content publishing object to obtain object feature information of the content publishing object;
fusing the interaction trend feature information, the multimodal content feature information, and the object feature information to obtain fused content feature information of the content to be distributed; and
performing content recommendation on the content to be distributed based on the fused content feature information.
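
By way of non-limiting illustration, the sample construction and negative sample weighting recited in claim 1 may be sketched as follows. The claim only requires that the similarity be negatively correlated with the negative sample weight; the specific form used here (one minus a clamped cosine similarity), the choice of text as the target modality, the treatment of every other sample as a dissimilar content sample, and the names `ContentSample`, `build_positive_pairs`, `negative_weight`, and `build_weighted_negatives` are all assumptions made for the sketch. Masking and model training are omitted.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Sequence, Tuple


@dataclass
class ContentSample:
    """Hypothetical container holding one embedding per modality."""
    sample_id: str
    video: Sequence[float]
    audio: Sequence[float]
    text: Sequence[float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_positive_pairs(sample: ContentSample) -> List[Tuple]:
    """First (video, audio) and second (text, audio) aligned positive samples."""
    return [
        ("video-audio", sample.video, sample.audio),
        ("text-audio", sample.text, sample.audio),
    ]


def negative_weight(similarity: float) -> float:
    """Assumed weighting: 1 - similarity, with similarity clamped to [0, 1]."""
    return 1.0 - min(1.0, max(0.0, similarity))


def build_weighted_negatives(
    anchor: ContentSample,
    others: Sequence[ContentSample],
    target_modality: str = "text",  # assumed target modality
) -> List[Tuple]:
    """Pair the anchor's video with the audio and text of every other sample."""
    negatives = []
    for other in others:
        if other.sample_id == anchor.sample_id:
            continue  # a sample is never its own negative
        sim = cosine_similarity(
            getattr(anchor, target_modality), getattr(other, target_modality)
        )
        weight = negative_weight(sim)
        negatives.append(("video-audio", anchor.video, other.audio, weight))
        negatives.append(("video-text", anchor.video, other.text, weight))
    return negatives
```

Under this assumed weighting, a negative drawn from a sample that closely resembles the anchor in the target modality contributes less to the training loss, which limits the penalty for pairs that may not be true mismatches.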

2. The method according to claim 1, characterized in that extracting content features from the multimodal content in the content to be distributed using the trained content feature extraction model to obtain the multimodal content feature information of the content to be distributed comprises:
performing multimodal content extraction on the content to be distributed to obtain sub-contents of different modalities contained in the content to be distributed;
extracting content features from the sub-contents of different modalities using the trained content feature extraction model to obtain content feature information of each sub-content; and
obtaining the multimodal content feature information of the content to be distributed based on the content feature information of each sub-content.
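
As a minimal sketch of the per-modality extraction in claim 2, the following assumes that the trained content feature extraction model can be represented as one encoder callable per modality and that the per-modality features are combined by concatenation in a fixed order. Neither assumption is stated in the claim, and the names `MODALITY_ORDER` and `extract_multimodal_features` are hypothetical.

```python
from typing import Any, Callable, Dict, List

# Assumed fixed order so that the combined vector has a stable layout.
MODALITY_ORDER = ("video", "audio", "text")


def extract_multimodal_features(
    sub_contents: Dict[str, Any],
    encoders: Dict[str, Callable[[Any], List[float]]],
) -> List[float]:
    """Encode each sub-content with its modality encoder, then concatenate."""
    combined: List[float] = []
    for modality in MODALITY_ORDER:
        if modality not in sub_contents:
            raise KeyError(f"missing sub-content for modality: {modality}")
        # Content feature information of one sub-content.
        combined.extend(encoders[modality](sub_contents[modality]))
    return combined
```

Other combination schemes, such as pooling or attention over the per-modality features, would fit the claim wording equally well; concatenation is used here only because it is the simplest to show.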

3. The method according to any one of claims 1 to 2, characterized in that performing content recommendation on the content to be distributed based on the fused content feature information comprises:
predicting a recommendation level of the content to be distributed based on the fused content feature information;
when the recommendation level satisfies a preset condition, obtaining a target object corresponding to the content to be distributed; and
recommending the content to be distributed to the target object.
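
The gating logic of claim 3 may be illustrated, under stated assumptions, as follows. The claim does not specify how the recommendation level is predicted or what the preset condition is; this sketch assumes a linear scorer followed by a sigmoid and a simple threshold, and it takes the candidate target objects as an input. The names `predict_recommendation_level` and `recommend`, and the default threshold of 0.5, are hypothetical.

```python
from math import exp
from typing import List, Sequence, Tuple


def predict_recommendation_level(
    fused_features: Sequence[float],
    weights: Sequence[float],
    bias: float = 0.0,
) -> float:
    """Assumed predictor: sigmoid of a weighted sum, giving a level in (0, 1)."""
    z = sum(f * w for f, w in zip(fused_features, weights)) + bias
    return 1.0 / (1.0 + exp(-z))


def recommend(
    content_id: str,
    fused_features: Sequence[float],
    weights: Sequence[float],
    candidate_objects: Sequence[str],
    threshold: float = 0.5,  # assumed preset condition: level >= threshold
) -> List[Tuple[str, str]]:
    """Return (target object, content) pairs only when the level passes the gate."""
    level = predict_recommendation_level(fused_features, weights)
    if level < threshold:
        return []  # preset condition not met: nothing is recommended
    return [(obj, content_id) for obj in candidate_objects]
```

Target objects are obtained only after the gate passes, which mirrors the order of steps in the claim.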

4. A content recommendation apparatus, characterized by comprising:
an acquisition unit, configured to obtain a sample set and perform multimodal sub-content extraction on each content sample in the sample set to obtain a plurality of sub-contents of different modalities of the content sample;
a positive sample generation unit, comprising: a first positive sample generation subunit, configured to generate a first aligned positive sample based on the video content and the audio content in each content sample; a second positive sample generation subunit, configured to generate a second aligned positive sample based on the text content and the audio content in each content sample; and a third positive sample generation subunit, configured to generate aligned positive samples according to the first aligned positive sample and the second aligned positive sample corresponding to each content sample;
a negative sample generation unit, configured to construct aligned negative samples by combining the video content of the content sample with the audio content of a dissimilar content sample, and to construct aligned negative samples by combining the video content of the content sample with the text content of the dissimilar content sample;
a training unit, comprising: a masking subunit, configured to mask the aligned positive samples and the aligned negative samples respectively to obtain masked positive samples and masked negative samples; a similarity calculation subunit, configured to calculate a similarity between the content sample and the dissimilar content sample based on the target-modality sub-content of the content sample and the target-modality sub-content of the dissimilar content sample; a weight calculation subunit, configured to calculate a negative sample weight corresponding to the aligned negative sample according to the similarity, wherein the similarity is negatively correlated with the negative sample weight; and a second model training subunit, configured to train a content feature extraction model according to the masked positive samples, the masked negative samples, the aligned positive samples, the aligned negative samples, and the negative sample weight to obtain a trained content feature extraction model;
a data acquisition unit, configured to obtain content to be distributed, together with content interaction data and a content publishing object of the content to be distributed;
a fitting unit, configured to obtain the content interaction data of the content to be distributed in each time period within a historical time range, generate an interaction data sequence of the content to be distributed based on the number of content interactions in each time period, and perform trend fitting based on the interaction data sequence to obtain interaction trend feature information of the content to be distributed;
a content feature extraction unit, configured to extract content features from the multimodal content in the content to be distributed using the trained content feature extraction model to obtain multimodal content feature information of the content to be distributed;
an object feature extraction unit, configured to perform object feature extraction on the content publishing object based on object data of the content publishing object to obtain object feature information of the content publishing object;
a fusion unit, configured to fuse the interaction trend feature information, the multimodal content feature information, and the object feature information to obtain fused content feature information of the content to be distributed; and
a recommendation unit, configured to perform content recommendation on the content to be distributed based on the fused content feature information.
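
The work of the fitting unit and the fusion unit may be sketched as follows, as one possible reading only. The claim says only "trend fitting" and "fuse"; this sketch assumes an ordinary least-squares line fitted to the per-period interaction counts, with the slope and intercept serving as the interaction trend feature information, and assumes fusion by concatenation. The names `fit_linear_trend` and `fuse_features` are hypothetical.

```python
from typing import List, Sequence, Tuple


def fit_linear_trend(counts: Sequence[float]) -> Tuple[float, float]:
    """Least-squares line through (period index, interaction count) points.

    Returns (slope, intercept). A positive slope indicates rising interaction.
    """
    n = len(counts)
    if n < 2:
        raise ValueError("at least two time periods are needed to fit a trend")
    mean_t = (n - 1) / 2.0
    mean_y = sum(counts) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(counts))
    var = sum((t - mean_t) ** 2 for t in range(n))
    slope = cov / var
    intercept = mean_y - slope * mean_t
    return slope, intercept


def fuse_features(
    trend: Sequence[float],
    content: Sequence[float],
    publisher: Sequence[float],
) -> List[float]:
    """Assumed fusion: concatenate trend, content, and publisher features."""
    return list(trend) + list(content) + list(publisher)
```

For example, per-period counts of 10, 20, 30, and 40 interactions give a slope of 10 and an intercept of 10, and these two values would then be concatenated with the content and publisher features.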

5. The content recommendation apparatus according to claim 4, characterized in that the content feature extraction unit comprises:
a content extraction subunit, configured to perform multimodal content extraction on the content to be distributed to obtain sub-contents of different modalities contained in the content to be distributed;
a feature extraction subunit, configured to extract content features from the sub-contents of different modalities to obtain content feature information of each sub-content; and
an information determination subunit, configured to obtain the multimodal content feature information of the content to be distributed based on the content feature information of each sub-content.

6. The content recommendation apparatus according to any one of claims 4 to 5, characterized in that the recommendation unit comprises:
a prediction subunit, configured to predict a recommendation level of the content to be distributed based on the fused content feature information;
an object acquisition subunit, configured to obtain a target object corresponding to the content to be distributed when the recommendation level satisfies a preset condition; and
a content recommendation subunit, configured to recommend the content to be distributed to the target object.

7. A computer device, characterized by comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to run the computer program in the memory to perform the content recommendation method according to any one of claims 1 to 3.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, and the computer program is loaded by a processor to perform the content recommendation method according to any one of claims 1 to 3.

9. A computer program product, comprising a computer program, characterized in that the computer program, when executed by a processor, implements the content recommendation method according to any one of claims 1 to 3.