A short video advertisement click rate scoring and ranking method, system, device and medium

By performing singular value decomposition and self-attention mechanism aggregation on the image and audio features of short video ads, combined with difference scoring and neural network training, the problem of fine-grained differentiation and consistency in the click-through rate ranking of short video ads is solved, and more accurate offline evaluation is achieved.

CN122199061A · Pending · Publication Date: 2026-06-12 · GUIZHOU POWER GRID CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
GUIZHOU POWER GRID CO LTD
Filing Date
2025-12-26
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

In existing technologies, the ranking of click-through rates for short video ads relies on user-side evaluation information and therefore cannot be applied in offline scenarios where user data is unavailable. Traditional solutions struggle to achieve fine-grained differentiation and ranking consistency, resulting in significant discrepancies between the ranking results and the actual campaign performance.

Method used

The method extracts features from short video ads to obtain image and audio features, performs singular value decomposition and self-attention aggregation, constructs popular and unpopular features, combines a difference score computed by a multilayer perceptron with a matching score, and uses a neural network model for training and ranking.

Benefits of technology

It enables fine-grained differentiation and consistent sorting of short video ads in offline scenarios, improving the accuracy of scoring and the precision of sorting, and adapting to the evaluation needs before ad delivery.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a short video advertisement click rate scoring and sorting method, system, device and medium, and belongs to the technical field of short video scoring and sorting. The method comprises the following steps: performing feature extraction and preprocessing on a short video advertisement to obtain image features and audio features; processing the image features and the audio features to obtain first aggregation features and second aggregation features, and processing the first aggregation features and the second aggregation features to obtain popular features and unpopular features; constructing a difference score according to the popular features and the unpopular features; calculating a matching degree score of the first aggregation features and the second aggregation features, and combining it with the difference score to construct a predicted click rate score of the short video advertisement; and, under the constraint of a loss function, using a trained neural network model to perform offline click rate sorting on the short video advertisement. The application solves the problems that the prior art relies on user evaluation information and that the scoring approach of traditional video quality evaluation is outdated.

Description

Technical Field

[0001] This invention relates to the field of short video rating and ranking technology, specifically to a method, system, device, and medium for rating and ranking short video ad click-through rates.

Background Technology

[0002] At present, short video ads have become the core carrier of digital marketing, and their offline click-through rate ranking is a key step in creative selection and material quality inspection before ad placement.

[0003] In existing technologies, most ranking schemes rely on user click data and behavioral preferences, which are unavailable in offline scenarios. At the same time, traditional schemes mostly adopt coarse-grained classification or tiered scoring methods, which make it difficult to distinguish between similar advertising materials in a fine-grained manner, resulting in significant shortcomings in practicality.

[0004] Furthermore, traditional video quality evaluation standards have a weak correlation with actual user click behavior, failing to effectively capture the core factors affecting ad attractiveness. This results in a significant discrepancy between the ranking results and the actual campaign performance, making it difficult to meet advertisers' marketing needs.

Summary of the Invention

[0005] In view of the above-mentioned problems, the present invention is proposed.

[0006] Therefore, the technical problem solved by this invention is: how to establish a ranking mechanism for short video ads and rank them according to their click-through rate (CTR) scores through a short video ad click-through rate (CTR) scoring and ranking method, while simultaneously eliminating reliance on user-side evaluation information, improving the fine-grained differentiation of scores and the consistency of ranking, and adapting to offline scenario requirements.

[0007] To address the aforementioned technical problems, this invention provides the following technical solution: a method for ranking click-through rates (CTR) of short video advertisements, comprising the following steps: extracting and preprocessing features from the short video advertisements to obtain image features and audio features; processing the image features to obtain overall image features and image detail features, and aggregating the overall image features and image detail features to obtain a first aggregated feature; processing the audio features to obtain overall audio features and audio detail features, and aggregating the overall audio features and audio detail features to obtain a second aggregated feature; processing the first aggregated feature and the second aggregated feature to obtain popular features and unpopular features; constructing a difference score based on the popular features and unpopular features; calculating the matching score between the first aggregated feature and the second aggregated feature, and constructing a predicted CTR score for the short video advertisement based on the difference score; training a neural network model using the predicted CTR score under the constraint of a loss function, and using the trained neural network model to perform offline CTR ranking of the short video advertisements.

[0008] As a preferred embodiment of the short video ad click-through rate ranking method of the present invention, the steps of obtaining the first aggregated feature and the second aggregated feature include: wherein, the step of obtaining the first aggregated feature includes performing singular value decomposition on the image features; separating the overall image features and image detail features according to a preset ratio, and performing temporal aggregation on the overall image features and image detail features respectively using a self-attention mechanism to obtain the first aggregated feature; wherein, the step of obtaining the second aggregated feature includes performing the singular value decomposition on the audio features; separating the overall audio features and audio detail features according to the preset ratio, and performing temporal aggregation on the overall audio features and audio detail features respectively using the self-attention mechanism to obtain the second aggregated feature.

[0009] As a preferred embodiment of the short video ad click-through rate (CTR) ranking method described in this invention, the step of obtaining popular and unpopular features includes: extracting overall image aggregation features and image detail aggregation features from the first aggregation features, and extracting overall audio aggregation features from the second aggregation features; inputting the overall image aggregation features, image detail aggregation features, and overall audio aggregation features into a preset linear transformation structure for processing, and outputting the popular and unpopular features of the overall image aggregation features, image detail aggregation features, and overall audio aggregation features. The beneficial effects of this preferred embodiment are that it avoids the noise introduced by audio detail features; by processing various aggregation features through a preset linear transformation structure, it outputs the popular and unpopular features corresponding to each feature, directly linking the feature dimension with the CTR influencing factor, quantifying the positive and negative contribution of aggregation features to the CTR, and ensuring the accuracy of the predicted CTR score and the consistency of the ranking.

[0010] As a preferred embodiment of the short video ad click-through rate (CTR) ranking method described in this invention, the step of constructing a difference score includes: inputting the popular and unpopular features into a multilayer perceptron; scoring the popular and unpopular features using the multilayer perceptron to obtain popular feature scores and unpopular feature scores; and calculating the difference between the popular feature scores and unpopular feature scores corresponding to the overall image aggregation feature, image detail aggregation feature, and overall audio aggregation feature to obtain the difference score. The beneficial effect of this preferred embodiment is that by scoring the popular and unpopular features corresponding to the three types of aggregation features separately using a multilayer perceptron, and then calculating the score difference of each type of feature to obtain the difference score, the positive and negative contribution of each type of feature to the CTR is quantified. This allows the scoring results to be correlated with actual click behavior while maintaining the consistency between the scoring logic and the preceding feature processing steps, thus improving the accuracy of short video ad ranking in offline scenarios.

[0011] As a preferred embodiment of the short video ad click-through rate (CTR) ranking method described in this invention, the step of constructing the predicted CTR score for the short video ad includes: calculating the inner product of the overall image aggregation feature and the overall audio aggregation feature to obtain an overall matching score; calculating the inner product of the image detail aggregation feature and the audio detail aggregation feature to obtain a detail matching score; and summing the difference score, the overall matching score, and the detail matching score to obtain the predicted CTR score for the short video ad. The beneficial effects of this preferred embodiment are that by calculating the inner product of the image and audio at the overall and detail levels to obtain two types of matching scores, and combining the difference score to construct the predicted CTR score, feature contributions and matching information are integrated; the scoring logic aligns with the factors influencing short video ad clicks, making the prediction results more consistent with real-world scenarios and providing a quantitative basis for offline ranking.
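The summation described in this preferred embodiment can be sketched as follows. This is a minimal illustration, assuming each aggregated feature has already been reduced to a single vector; the function name `predicted_ctr_score` and the toy input values are hypothetical and do not come from the application.

```python
import numpy as np

def predicted_ctr_score(diff_score, img_overall, aud_overall, img_detail, aud_detail):
    """Sum the difference score with the two cross-modal matching scores."""
    # Overall matching score: inner product of overall image and overall audio features.
    overall_match = float(np.dot(img_overall, aud_overall))
    # Detail matching score: inner product of image detail and audio detail features.
    detail_match = float(np.dot(img_detail, aud_detail))
    return diff_score + overall_match + detail_match

# Toy vectors chosen only to make the arithmetic easy to follow.
score = predicted_ctr_score(
    diff_score=0.5,
    img_overall=np.array([1.0, 0.0]),
    aud_overall=np.array([2.0, 3.0]),
    img_detail=np.array([0.5, 0.5]),
    aud_detail=np.array([1.0, 1.0]),
)
```

Here the overall match is 2.0 and the detail match is 1.0, so the three terms add up to 3.5.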

[0012] As a preferred embodiment of the short video ad click-through rate (CTR) ranking method of the present invention, the step of offline ranking of the short video ads by CTR includes: training the neural network model using labeled short video ad samples under the constraint of the loss function; inputting the short video ads to be ranked into the trained neural network model to obtain the predicted CTR score corresponding to each short video ad to be ranked; arranging all the short video ads to be ranked in descending order of the predicted CTR scores, and outputting the offline ranking result.
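The final ranking step amounts to a descending sort on the predicted scores. A minimal sketch, assuming the trained model has already produced one score per ad; the ad identifiers and score values below are made up for illustration.

```python
def rank_ads(scores):
    """Return ad identifiers ordered from highest to lowest predicted CTR score."""
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical predicted CTR scores keyed by ad identifier.
predicted = {"ad_a": 0.12, "ad_b": 0.87, "ad_c": 0.45}
ranking = rank_ads(predicted)
```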

[0013] In a preferred embodiment of the short video ad click-through rate (CTR) ranking method described in this invention, the loss function includes a primary loss function and an auxiliary loss function. The primary loss function extracts CTR samples from the training set of the neural network model based on the offline ranking results to form a video pair set. The auxiliary loss function selects the top q samples with the highest and lowest CTRs from the labeled short video ad samples to form a video pair set. The loss function is constructed by weighted fusion of the primary and auxiliary loss functions.

[0014] This invention provides a short video ad click-through rate scoring and ranking system.

[0015] To address the aforementioned technical problems, the present invention further provides the following technical solution: a short video ad click-through rate scoring and ranking system, comprising: a feature acquisition module, which extracts and preprocesses features from the short video ad to acquire image features and audio features; a first aggregated feature construction module, which processes the image features to acquire overall image features and image detail features, and aggregates the overall image features and image detail features to acquire a first aggregated feature; a second aggregated feature construction module, which processes the audio features to acquire overall audio features and audio detail features, and aggregates the overall audio features and audio detail features to acquire a second aggregated feature; a popular and unpopular feature construction module, which processes the first and second aggregated features to obtain popular and unpopular features; a difference score construction module, which constructs a difference score based on the popular and unpopular features; a predicted click-through rate (CTR) score calculation module, which calculates the matching score between the first and second aggregated features and combines it with the difference score to construct the predicted CTR score for the short video ad; and a CTR offline ranking module, which trains a neural network model using the predicted CTR score under the constraint of a loss function, and then uses the trained neural network model to perform offline CTR ranking for the short video ad.

[0016] The present invention provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the short video advertisement click-through rate scoring and ranking method.

[0017] The present invention provides a computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the aforementioned short video advertisement click-through rate scoring and ranking method.

[0018] The beneficial effects of this invention are as follows: Click-through rate (CTR) scoring and ranking are achieved through the image and audio features of short video ads themselves, broadening the scope of technology application and solving the problem of difficulty in evaluation due to lack of user data in offline scenarios; Singular value decomposition decouples the overall and detailed information of features, and difference scoring quantifies the positive and negative contributions of popular and unpopular features to CTR, overcoming the limitations of traditional techniques that use coarse-grained tiered scoring, and achieving differentiation between similar types of ads; A neural network model is jointly trained based on the main loss of video pairs and the auxiliary loss of the most difficult sample, and by integrating cross-modal matching scores, the contribution of single-modal features and inter-modal synergistic effects are analyzed, making the prediction results more closely match real click behavior; An end-to-end training framework is adopted, with a simple and efficient calculation process, which can be directly used for offline evaluation before ad placement.

Attached Figure Description

[0019] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0020] Figure 1 is a flowchart of a short video ad click-through rate ranking method provided in one embodiment of the present invention.

[0021] Figure 2 is a structure diagram of the neural network used to predict video click-through rate scores in the short video ad click-through rate ranking method provided in one embodiment of the present invention.

[0022] Figure 3 illustrates the main loss, based on CTR-ranked sample pairs, of the short video ad click-through rate ranking method provided in one embodiment of the present invention.

Detailed Implementation

[0023] To make the above-mentioned objects, features, and advantages of the present invention more apparent and understandable, specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the protection scope of the present invention.

[0024] Example 1: referring to Figure 1, the first embodiment of the present invention provides a method for ranking short video ad click-through rates, including:

[0025] S100: Extract and preprocess features from short video ads to obtain image and audio features.

[0026] S200: Process the image features to obtain the overall image features and image detail features of the image features, and aggregate the overall image features and image detail features of the image features to obtain the first aggregated feature.

[0027] S300: Process the audio features to obtain the overall audio features and audio detail features of the audio features, and aggregate the overall audio features and audio detail features of the audio features to obtain the second aggregated features.

[0028] S400: Process the first aggregated feature and the second aggregated feature to obtain popular features and unpopular features.

[0029] S500: Construct a difference score based on popular and unpopular features.

[0030] S600: Calculate the matching score between the first aggregated feature and the second aggregated feature, and combine the difference score to construct the predicted click-through rate score of the short video ad.

[0031] S700: Under the constraint of the loss function, the neural network model is trained using the predicted click-through rate score, and the trained neural network model is used to rank short video ads offline by click-through rate.

[0032] It should be noted that the existing offline ranking of click-through rates for short video ads relies on user-side evaluation information, which makes it unsuitable for offline scenarios due to a lack of user data; traditional video quality evaluation standards are not sufficiently correlated with actual click behavior and cannot accurately reflect the actual situation of ads attracting user clicks; the existing technology uses coarse-grained classification and tiered scoring methods, which cannot distinguish samples in a fine-grained manner that meets the requirements.

[0033] Therefore, to address the aforementioned problems that existing technologies rely on user evaluation information and that traditional video quality evaluation methods are outdated, this technical solution constructs a short video ad click-through rate ranking method through steps S100-S700. First, features are extracted and preprocessed from the short video ads to obtain image and audio features. Then, the image features are processed to obtain overall image features and image detail features, which are aggregated to obtain the first aggregated feature. Likewise, the audio features are processed to obtain overall audio features and audio detail features, which are aggregated to obtain the second aggregated feature. Next, the first and second aggregated features are processed to obtain popular and unpopular features, and a difference score is constructed from them. The matching score between the first and second aggregated features is then calculated and combined with the difference score to construct a predicted click-through rate (CTR) score for short video ads. Finally, under the constraint of a loss function, the predicted CTR score is used to train a neural network model, and the trained neural network model is used to perform offline CTR ranking of short video ads.

[0034] Example 2: referring to Figures 1 to 3, the second embodiment of the present invention provides a method for ranking and scoring short video ad click-through rates.

[0035] In this embodiment of the invention, in step S100, features of the short video advertisement are extracted and preprocessed to obtain image features and audio features.

[0036] Specifically, for each short video ad in the short video ad set, a pre-trained image feature extraction network is used to encode the video frames to obtain an image feature matrix; a pre-trained audio feature extraction network is used to encode the video audio to obtain an audio feature matrix.

[0037] Furthermore, the image and audio features of all videos are padded to the length of the longest video in the dataset. A linear layer is then used to unify the dimensions of the image and audio features, resulting in the preprocessed image and audio features.
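The padding and dimension-unifying step can be illustrated with a short NumPy sketch. It assumes zero padding along the time axis and a single shared linear layer; the helper names `pad_to_length` and `preprocess`, the feature sizes, and the random stand-ins for encoder outputs are hypothetical.

```python
import numpy as np

def pad_to_length(feats, target_len):
    """Zero-pad a (T, d) feature matrix along time to shape (target_len, d)."""
    t, d = feats.shape
    padded = np.zeros((target_len, d), dtype=feats.dtype)
    padded[:t] = feats
    return padded

def preprocess(feature_list, weight, bias):
    """Pad every video to the longest one, then apply one linear layer."""
    max_len = max(f.shape[0] for f in feature_list)
    batch = np.stack([pad_to_length(f, max_len) for f in feature_list])  # (N, T_max, d_in)
    return batch @ weight + bias  # (N, T_max, d_out)

rng = np.random.default_rng(0)
# Stand-ins for per-frame encoder outputs of three videos with 3, 5 and 4 frames.
image_feats = [rng.normal(size=(t, 8)) for t in (3, 5, 4)]
W = rng.normal(size=(8, 4))  # assumed projection from 8 to 4 dimensions
b = np.zeros(4)
out = preprocess(image_feats, W, b)
```

The same routine would be applied to the audio features with its own linear layer so that both modalities end up with the same length and width.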

[0038] In this embodiment of the invention, step S200 processes the image features to obtain the overall image features and image detail features, and aggregates the overall image features and image detail features to obtain the first aggregated feature, including the following steps A1~A3:

[0039] A1: Perform singular value decomposition on image features.

[0040] Specifically, the singular value decomposition takes the following form:

[0041] $X = U \Sigma V^{H}$;

[0042] In the formula, $X$ denotes the feature matrix of order $m \times n$, $U$ denotes the left singular vector matrix of order $m \times m$, $\Sigma$ denotes the singular value diagonal matrix of order $m \times n$, and $V^{H}$ denotes the conjugate transpose of the right singular vector matrix $V$ of order $n \times n$.

[0043] Furthermore, $\Sigma$ takes the following form:

[0044] $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r)$;

[0045] In the formula, $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r$ are the singular values sorted in descending order, and $r$ denotes the total number of singular values.

[0046] Furthermore, image features are reconstructed based on singular value decomposition theory. The specific form of the reconstruction is as follows:

[0047] $X_g = U \Sigma_g V^{H}$;

[0048] $X_l = U \Sigma_l V^{H}$;

[0049] In the formula, $\Sigma_g$ contains the first $k$ (high) singular values, $\Sigma_l$ contains the remaining $r - k$ (low) singular values, $X_g$ represents the overall information of the feature matrix, $X_l$ represents the local information of the feature matrix, and $V^{H}$ denotes the conjugate transpose of the right singular vector matrix.

[0050] In one possible implementation, image feature extraction can also be replaced by a convolutional neural network. The convolutional neural network first samples the video frames one by one, and after inputting them into the convolutional neural network, it extracts local features through multiple sets of convolutional layers. The gradient vanishing problem in deep networks is solved by residual connections. Then, the image feature matrix with fixed dimensions is output through a global average pooling layer. Subsequently, the image features are finally obtained by using the same padding length and linear layer with the same dimension as in this technical solution.

[0051] In another possible implementation, image feature extraction can also be replaced by a visual self-attention mechanism. The visual self-attention mechanism divides video frames into fixed-size image blocks, transforms them into embedding vectors through a linear layer and adds positional encoding, inputs them into the encoder layer, captures the global correlation between image blocks through a multi-head self-attention mechanism, and then outputs the image feature matrix through a fully connected layer, achieving the same image features as the original scheme.

[0052] A2: Separate the overall features of the image from the detailed features of the image according to a preset ratio.

[0053] Specifically, the separation takes the following form:

[0054] $k = \lfloor \rho \cdot r \rfloor$;

[0055] In the formula, $k$ is the number of leading (high) singular values assigned to the overall features, $\rho$ represents the split ratio hyperparameter, $r$ represents the total number of singular values, and $\lfloor \cdot \rfloor$ denotes the floor function.
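Steps A1 and A2 can be sketched together in NumPy: decompose the feature matrix, keep the leading singular values for the overall part, and assign the rest to the detail part. This is an illustrative sketch that uses the reduced SVD, so the total number of singular values is the smaller matrix dimension; the function name `svd_split` and the ratio value 0.5 are assumptions.

```python
import numpy as np

def svd_split(x, ratio):
    """Split x into an overall part (top-k singular values) and a detail part."""
    # np.linalg.svd returns singular values already sorted in descending order.
    u, s, vh = np.linalg.svd(x, full_matrices=False)
    r = s.shape[0]
    k = int(np.floor(ratio * r))  # number of high singular values kept as "overall"
    s_overall = np.where(np.arange(r) < k, s, 0.0)
    s_detail = s - s_overall
    x_overall = (u * s_overall) @ vh  # same as u @ diag(s_overall) @ vh
    x_detail = (u * s_detail) @ vh
    return x_overall, x_detail, k

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))  # stand-in for a (time, dimension) feature matrix
x_overall, x_detail, k = svd_split(x, ratio=0.5)
```

Because the two singular value sets are disjoint and together cover all singular values, the overall and detail parts sum back to the original matrix.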

[0056] A3: A self-attention mechanism is used to perform temporal aggregation of overall image features and image detail features respectively.

[0057] Specifically, temporal aggregation using the self-attention mechanism takes the following form:

[0058] $Z = \mathrm{softmax}\left(\dfrac{Q K^{\top}}{\sqrt{d_k}}\right) V$, with $Q = X W_Q$, $K = X W_K$, $V = X W_V$;

[0059] In the formula, $Z$ represents the feature matrix output after aggregation via the self-attention mechanism; $Q$, $K$, and $V$ are the query, key, and value matrices, respectively; $W_Q$, $W_K$, and $W_V$ are the learnable weight matrices for the query, key, and value matrices, respectively; $X$ represents the input feature matrix; and $\sqrt{d_k}$, where $d_k$ is the key dimension, represents the factor that scales the attention scores.
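A minimal single-head version of step A3 is sketched below, assuming standard scaled dot-product attention over the time axis. The final mean pooling that turns the attended sequence into one vector is an assumption added for illustration; the text does not state how the sequence is reduced to a fixed-size feature.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over the rows (time steps) of x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # scale by sqrt of the key dimension
    weights = softmax(scores, axis=-1)       # (T, T) attention weights
    return weights @ v, weights

rng = np.random.default_rng(0)
T, d = 5, 4
x = rng.normal(size=(T, d))  # stand-in for overall or detail features over time
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
z, attn = self_attention(x, w_q, w_k, w_v)
pooled = z.mean(axis=0)  # assumed reduction to a single aggregated vector
```

The same function would be called separately on the overall and the detail features, for both image and audio.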

[0060] In one possible implementation, the self-attention mechanism aggregation can also be replaced by a gated recurrent unit (GRU). The decoupled overall and detail features of the image and audio are fed into the GRU one time step at a time; the reset gate controls how much historical information is forgotten, the update gate determines the fusion ratio of the current feature and the historical state, temporal dependencies are captured through the iterative update of the hidden state, and aggregated features of fixed dimensions are finally output.

[0061] In another possible implementation, the self-attention mechanism aggregation can also be replaced by a convolutional neural network. The convolutional neural network treats the decoupled temporal features as a temporal sequence, uses convolutional kernels to slide along the time dimension, extracts the feature associations within the local time window, processes them through normalization and activation functions, and outputs aggregated features of fixed dimensions through a global max pooling layer.
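The convolutional alternative can be sketched as a one-dimensional convolution along time followed by normalization, an activation, and global max pooling. The kernel width, the per-channel standardization used as the normalization, and the ReLU activation are assumptions chosen for illustration; the text does not fix these choices.

```python
import numpy as np

def temporal_conv_aggregate(x, kernel, bias):
    """x: (T, d_in); kernel: (width, d_in, d_out). Returns a (d_out,) vector."""
    width = kernel.shape[0]
    t = x.shape[0]
    # Collect every local time window of the given width: (T - width + 1, width, d_in).
    windows = np.stack([x[i:i + width] for i in range(t - width + 1)])
    # Slide the kernel along time: sum over window position and input channel.
    conv = np.einsum("twi,wio->to", windows, kernel) + bias
    # Assumed normalization: standardize each output channel over time.
    normed = (conv - conv.mean(axis=0)) / (conv.std(axis=0) + 1e-5)
    activated = np.maximum(normed, 0.0)  # ReLU
    return activated.max(axis=0)         # global max pooling over time

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
kernel = rng.normal(size=(3, 4, 2))
agg = temporal_conv_aggregate(x, kernel, np.zeros(2))
```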

[0062] In this embodiment of the invention, in step S300, the audio features are processed to obtain the overall audio features and audio detail features of the audio features, and the overall audio features and audio detail features of the audio features are aggregated to obtain a second aggregated feature.

[0063] It should be noted that the steps for obtaining the second aggregation feature are the same as those for obtaining the first aggregation feature, and will not be repeated here.

[0064] In an embodiment of the present invention, referring to Figure 2, in step S400 the first aggregated feature and the second aggregated feature are processed to obtain popular and unpopular features, including the following steps B1~B2:

[0065] B1: Extract overall image aggregation features and image detail aggregation features from the first aggregation features, and extract overall audio aggregation features from the second aggregation features.

[0066] It should be noted that the first aggregated feature is directly aggregated from the overall image feature and the image detail feature, so the overall image aggregated feature and the image detail aggregated feature can be directly separated from the first aggregated feature.

[0067] Furthermore, since audio detail features are prone to noise, only the overall audio aggregation features are extracted from the second aggregation features.

[0068] B2: Input the overall image aggregation features, image detail aggregation features, and overall audio aggregation features into a preset linear transformation structure for processing, and output the popular and unpopular features of the overall image aggregation features, image detail aggregation features, and overall audio aggregation features.

[0069] Specifically, the processing within the linear transformation structure takes the following form:

[0070] $p = W_p f + b_p$;

[0071] $u = W_u f + b_u$;

[0072] In the formula, $p$ and $u$ represent the popular and unpopular features, respectively; $W_p$ and $W_u$ represent the learnable weight matrices; $b_p$ and $b_u$ represent the corresponding bias vectors; and $f$ is the input aggregated feature.

[0073] In one implementation, the linear transformation structure can also be replaced by a dynamic affine transformation layer. The dynamic affine transformation layer scales the input aggregated features by introducing learnable scalar parameters, enhances the nonlinear discriminative power of the features through activation functions, and then performs affine transformation using learnable weight vector γ and bias vector β. After inputting the overall image aggregated features, image detail aggregated features, and overall audio aggregated features, the two sets of dynamic affine transformation layers output the corresponding feature representations respectively.

[0074] In another possible implementation, the linear transformation structure can be replaced by a one-dimensional convolutional layer. The one-dimensional convolutional layer uses a one-dimensional convolutional kernel with a kernel size of 1, based on the overall image aggregation features, image detail aggregation features, and overall audio aggregation features. When the input feature matrix passes through the convolutional layer, the features at each temporal position are multiplied element-wise with the convolutional kernel weights and then summed. A bias term is then added to output the popular and unpopular features corresponding to the overall image aggregation features, image detail aggregation features, and overall audio aggregation features.

[0075] In this embodiment of the invention, step S500, which constructs a difference score based on popular and unpopular features, includes the following steps C1 to C3:

[0076] C1: Input popular and unpopular features into the multilayer perceptron.

[0077] Specifically, the popular and unpopular features corresponding to the overall image aggregation features, image detail aggregation features, and overall audio aggregation features output in step B2 are respectively input into a preset multilayer perceptron.

[0078] C2: Use a multilayer perceptron to score popular and unpopular features, and obtain scores for popular and unpopular features.

[0079] Specifically, the multilayer perceptron performs scoring operations on the input features through nonlinear transformations. For each type of aggregated feature, it outputs a corresponding quantitative score for the popular feature p and for the unpopular feature u.

[0080] C3: Calculate the difference between the popular feature score and the unpopular feature score corresponding to the overall image aggregation feature, the image detail aggregation feature, and the overall audio aggregation feature, and obtain the difference score.

[0081] Specifically, the specific form of obtaining the difference score is as follows:

[0082] $S_{\text{diff}} = \mathrm{MLP}(p) - \mathrm{MLP}(u)$;

[0083] In the formula, $S_{\text{diff}}$ denotes the difference score, $\mathrm{MLP}(p)$ denotes the quantitative score output by the multilayer perceptron for the popular feature $p$ corresponding to a given type of aggregated feature, and $\mathrm{MLP}(u)$ denotes the quantitative score output by the multilayer perceptron for the unpopular feature $u$ corresponding to that type of aggregated feature.

[0084] It should be noted that when $p$ and $u$ are derived from $v_o$, $v_d$, $a_o$, or $a_d$, the difference scores are denoted $S_{\text{diff}}^{v_o}$, $S_{\text{diff}}^{v_d}$, $S_{\text{diff}}^{a_o}$, and $S_{\text{diff}}^{a_d}$, respectively. For audio information, focusing only on the overall information is sufficient; overemphasizing its complex details can hinder the overall performance of the model because it introduces more noise. Therefore, $S_{\text{diff}}^{a_d}$ is not used.

[0085] Here, $v_o$ represents overall image information, $v_d$ represents image detail information, $a_o$ represents overall audio information, and $a_d$ represents audio detail information.
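Steps C1 to C3 can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the layer sizes 128, 256, and 1 follow the scoring calculation module described later in this specification, while the randomly initialised weights, the toy feature vectors, and the names `mlp_score` and `difference_score` are hypothetical. A trained model would learn the weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 128, 256   # input and hidden dimensions of the multilayer perceptron

# Hypothetical, randomly initialised weights (stand-ins for learned parameters).
W1, b1 = rng.normal(scale=0.05, size=(d, h)), np.zeros(h)
W2, b2 = rng.normal(scale=0.05, size=(h, 1)), np.zeros(1)


def mlp_score(x):
    """Score one feature vector: linear layer, ReLU, linear layer to a scalar."""
    hidden = np.maximum(x @ W1 + b1, 0.0)
    return float((hidden @ W2 + b2)[0])


def difference_score(p, u):
    """Popular-feature score minus unpopular-feature score."""
    return mlp_score(p) - mlp_score(u)


# Only the three feature types used for difference scoring; audio detail is omitted.
sources = ["overall_image", "image_detail", "overall_audio"]
features = {name: (rng.normal(size=d), rng.normal(size=d)) for name in sources}
diff_scores = {name: difference_score(p, u) for name, (p, u) in features.items()}
```

The resulting dictionary holds one difference score per feature type. Swapping the popular and unpopular inputs flips the sign of the score, which is what makes it usable as a signed contribution to the predicted click-through rate score.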

[0086] In one possible implementation, the difference score can also be replaced by a confirmatory regression equation, which performs scoring by constructing an unconstrained equation. By using click-through rate as the dependent variable and popular feature p and unpopular feature u as independent variables, a linear regression equation is established. By testing whether the equation meets the expected pattern, the result of the association between the feature and the click-through rate is used as the difference score.

[0087] In another possible implementation, the difference score can be replaced by relative entropy, which uses a distribution difference measure to perform the score. By mapping the popular feature p and the unpopular feature u to probability distributions respectively, we obtain the distribution P corresponding to p and the distribution Q corresponding to u; calculate the relative entropy divergence and use the KL divergence value as the difference score.
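The relative-entropy alternative can be illustrated with a short sketch. The specification does not state how the features are mapped to probability distributions, so the softmax mapping used here is an assumption, as are the toy vectors and the function names.

```python
import math


def softmax(values):
    """Map a real-valued feature vector to a probability distribution (assumed mapping)."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(P, Q):
    """Relative entropy KL(P || Q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(P, Q) if pi > 0)


p = [0.9, 0.1, -0.3, 1.2]    # toy popular feature
u = [0.2, 0.4, 0.1, -0.5]    # toy unpopular feature
P, Q = softmax(p), softmax(u)
kl_score = kl_divergence(P, Q)   # used in place of the difference score
```

Note that the KL divergence is non-negative and asymmetric, so unlike the subtraction-based difference score it does not change sign when the popular and unpopular features are swapped.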

[0088] In this embodiment of the invention, step S600 calculates the matching score between the first aggregated feature and the second aggregated feature, and combines the difference score to construct the predicted click-through rate score of the short video advertisement, including the following steps D1~D3:

[0089] D1: Calculate the inner product of the overall image aggregate features and the overall audio aggregate features to obtain the overall matching score.

[0090] Specifically, the overall image aggregation feature after self-attention aggregation in the first aggregation feature is selected as the image feature, and the overall audio aggregation feature after self-attention aggregation in the second aggregation feature is selected as the audio feature.

[0091] Furthermore, the specific form of inner product calculation is as follows:

[0092] $m = \langle x_v, x_a \rangle = x_v^{\top} x_a$;

[0093] In the formula, $m$ denotes the inner product of the overall image aggregation feature and the overall audio aggregation feature, and $x_v$ and $x_a$ denote the image feature and the audio feature, respectively.

[0094] D2: Calculate the inner product of the image detail aggregation features and the audio detail aggregation features to obtain the detail matching score.

[0095] It should be noted that the image detail aggregation feature after self-attention aggregation in the first aggregation feature is selected as the image feature, and the audio detail aggregation feature after self-attention aggregation in the second aggregation feature is selected as the audio feature.

[0096] Furthermore, after calculation using the same formula as in step D1, the inner product of the image detail aggregation feature and the audio detail aggregation feature is obtained.

[0097] D3: Sum the difference score, overall match score, and detail match score to obtain the predicted click-through rate score for the short video ad.

[0098] Specifically, the predicted click-through rate score is represented as follows:

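[0099] $\hat{y} = \sum_{i} s_i$;

[0100] In the formula, $\hat{y}$ denotes the predicted click-through rate score finally output for the short video ad, and $s_i$ denotes a single score included in the summation, that is, one of the difference scores, the overall matching score, or the detail matching score.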
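Steps D1 to D3 amount to two inner products and a sum. The sketch below uses toy three-dimensional vectors and toy difference scores; all numeric values and names are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def inner_product(x, y):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, y))


# Toy aggregated features (assumed values).
image_overall, audio_overall = [0.5, 1.0, -0.5], [1.0, 0.5, 0.0]
image_detail, audio_detail = [0.2, -0.4, 0.1], [0.5, 0.5, 2.0]

# Toy difference scores for overall image, image detail, and overall audio.
diff_scores = [0.30, -0.10, 0.05]

overall_match = inner_product(image_overall, audio_overall)   # step D1
detail_match = inner_product(image_detail, audio_detail)      # step D2

# Step D3: five scores are summed (three difference scores, two matching scores).
predicted_ctr_score = sum(diff_scores) + overall_match + detail_match
```

The predicted score is an unbounded real number rather than a probability; it is only meaningful for comparing ads with one another.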

[0101] In an embodiment of the present invention, referring to Figure 3, step S700, in which the neural network model is trained using the predicted click-through rate score under the constraint of the loss function and the trained neural network model is then used to perform offline ranking of short video ads based on their click-through rates, includes the following steps E1 to E3:

[0102] E1: Under the constraint of the loss function, the neural network model is trained using labeled short video ad samples.

[0103] It should be noted that the loss function is mainly composed of the main loss function constructed from short video ads and the auxiliary loss function of labeled short video ad samples.

[0104] The main loss function is specifically represented as follows:

[0105] $L_{\text{main}} = -\dfrac{1}{|D|} \sum_{(i,j) \in D} \log \sigma\left(\hat{y}_i - \hat{y}_j\right)$;

[0106] In the formula, $L_{\text{main}}$ denotes the main loss, $(i, j)$ denotes a video pair in the training set, $D$ denotes the set of video pairs consisting of high / low click-through rate samples, $|D|$ denotes the number of samples in the set $D$ of video pairs, $\hat{y}_i$ denotes the predicted click-through rate score of the positive sample $i$, $\hat{y}_j$ denotes the predicted click-through rate score of the negative sample $j$, and $\sigma$ denotes the Sigmoid activation function.

[0107] It should be noted that the video pairs are selected based on the offline sorting results, each pair consisting of one short video ad from the beginning of the sorting order and one from the end. The position numbers of the two short video ads in each video pair correspond to each other. For example, the first ad in the forward order is paired with the first ad in the reverse order, and the second ad in the forward order is paired with the second ad in the reverse order.
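The pairing rule described above can be sketched as follows. The function name `build_video_pairs`, the ad identifiers, and the argument `num_pairs` are hypothetical; the only behaviour taken from the text is that the k-th ad from the top of the ranking is paired with the k-th ad from the bottom.

```python
def build_video_pairs(ranked_ids, num_pairs):
    """Pair the k-th ad in forward order with the k-th ad in reverse order.

    ranked_ids: ad identifiers sorted from highest to lowest click-through rate.
    Returns a list of (positive_sample, negative_sample) tuples.
    """
    if 2 * num_pairs > len(ranked_ids):
        raise ValueError("not enough ads to form non-overlapping pairs")
    head = ranked_ids[:num_pairs]           # highest click-through rates
    tail = ranked_ids[::-1][:num_pairs]     # lowest click-through rates
    return list(zip(head, tail))


ranked = ["ad_a", "ad_b", "ad_c", "ad_d", "ad_e", "ad_f"]
pairs = build_video_pairs(ranked, 2)
```

With the six toy ads above, the first pair joins the top ad with the bottom ad, and the second pair joins the second ad with the second-to-last ad.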

[0108] The auxiliary loss function is expressed as follows:

[0109] $L_{\text{aux}} = -\sum_{c \in C} \sum_{k=1}^{q} \log \sigma\left(s^{c}_{i_k} - s^{c}_{j_k}\right), \quad (i_k, j_k) \in D_q$;

[0110] In the formula, $L_{\text{aux}}$ denotes the auxiliary loss, $C$ denotes the set of all scoring components, $c$ denotes a single scoring component, $D_q$ denotes the set of video pairs selected from the regions with the highest and lowest click-through rates, $i_k$ and $j_k$ denote the positive and negative samples in the $k$-th video pair, $s^{c}_{i_k}$ and $s^{c}_{j_k}$ denote the scores of the positive and negative samples under scoring component $c$, respectively, and $\sigma$ denotes the Sigmoid activation function.

[0111] It should be noted that in this implementation scheme, q is set to 30.

[0112] Furthermore, the main loss function and the auxiliary loss function are fused together, specifically in the following form:

[0113] $L = L_{\text{main}} + \lambda L_{\text{aux}}$;

[0114] In the formula, $L$ denotes the loss function, $L_{\text{main}}$ denotes the main loss function, $L_{\text{aux}}$ denotes the auxiliary loss function, and $\lambda$ denotes the auxiliary loss weight.
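The weighted fusion of the two losses can be sketched as below. The specification names the ingredients (video pairs, positive and negative scores, the Sigmoid function, and an auxiliary weight), but the exact functional form used here, a negative log-sigmoid of the score gap, is an assumption in the style of common pairwise ranking losses. The weight value 0.1, the component names, and all scores are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def main_loss(pairs):
    """pairs: list of (positive_score, negative_score) predicted CTR scores."""
    return -sum(log_sigmoid(sp - sn) for sp, sn in pairs) / len(pairs)


def auxiliary_loss(component_pairs):
    """component_pairs: scoring component name -> list of (positive, negative) scores."""
    return -sum(
        log_sigmoid(sp - sn)
        for pairs in component_pairs.values()
        for sp, sn in pairs
    )


def joint_loss(pairs, component_pairs, aux_weight):
    """Main loss plus the weighted auxiliary loss."""
    return main_loss(pairs) + aux_weight * auxiliary_loss(component_pairs)


main_pairs = [(2.0, 0.5), (1.2, -0.3)]
component_pairs = {
    "overall_image_diff": [(0.8, 0.1)],
    "overall_match": [(1.5, 0.2)],
}
loss = joint_loss(main_pairs, component_pairs, aux_weight=0.1)
```

Under this form, each term shrinks as the positive sample's score exceeds the negative sample's score by a wider margin, so minimising the loss pushes high click-through rate ads above low click-through rate ads, both in the total score and in each scoring component.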

[0115] E2: Input the short video ads to be sorted into the trained neural network model to obtain the predicted click-through rate score for each short video ad to be sorted.

[0116] Specifically, for the set of short video ads to be sorted, the predicted click-through rate (CTR) score is calculated using the predicted CTR score calculation formula in step D3, following steps S100-S600.

[0117] E3: Sort all short video ads to be sorted in descending order of predicted click-through rate score, and output the offline sorting results.
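Step E3 is a plain descending sort over the predicted scores. A minimal sketch, with hypothetical ad identifiers and scores:

```python
def rank_ads(scores):
    """Return ad identifiers sorted by predicted click-through rate score, highest first.

    scores: dict mapping ad identifier -> predicted click-through rate score.
    """
    return sorted(scores, key=scores.get, reverse=True)


predicted = {"ad_a": 0.40, "ad_b": 1.35, "ad_c": -0.20}
offline_ranking = rank_ads(predicted)
```

Python's `sorted` is stable, so ads with identical scores keep their input order; any other tie-breaking rule would need to be added explicitly.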

[0118] It should be noted that the neural network model in this technical solution includes a feature preprocessing module, an aggregated feature construction module, a scoring calculation module, and a model training optimization module.

[0119] The feature preprocessing module acquires multimodal features of a uniform specification. The input short video advertisement is sampled once every 10 frames, up to a maximum of 60 frames. Within the module, image and audio features of different lengths are padded to the maximum length, and a linear layer then unifies the dimension of both to 128, forming a feature standardization submodule. The output is an N×L×d image information matrix and an audio information matrix. In this technical solution, the uniform dimension of 128 reduces the computational complexity of cross-modal processing.
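The padding and dimension unification can be sketched in NumPy as follows. The maximum length of 60 and the target dimension of 128 come from the text; zero padding, the raw feature widths (1280 for images, 128 for audio), the random projection weights, and the function name are assumptions made for illustration.

```python
import numpy as np

MAX_FRAMES, TARGET_DIM = 60, 128


def pad_to_max_length(features, max_len=MAX_FRAMES):
    """Zero-pad (or truncate) a (T, d_in) feature matrix to (max_len, d_in)."""
    out = np.zeros((max_len, features.shape[1]), dtype=features.dtype)
    t = min(len(features), max_len)
    out[:t] = features[:t]
    return out


rng = np.random.default_rng(0)
image_raw = rng.normal(size=(42, 1280))   # 42 sampled frames, assumed raw width
audio_raw = rng.normal(size=(17, 128))    # 17 audio segments, assumed raw width

# Hypothetical linear layers (bias omitted) mapping both modalities to 128 dimensions.
W_image = rng.normal(scale=0.01, size=(1280, TARGET_DIM))
W_audio = rng.normal(scale=0.01, size=(128, TARGET_DIM))

image = pad_to_max_length(image_raw) @ W_image   # shape (60, 128)
audio = pad_to_max_length(audio_raw) @ W_audio   # shape (60, 128)
```

Stacking N such matrices per modality gives the N×L×d tensors described above, with L = 60 and d = 128.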

[0120] The aggregation feature construction module is designed to achieve intramodal feature decoupling and temporal aggregation. It uses singular value decomposition (SVD) to process the image and audio feature matrices, with the segmentation ratio hyperparameter set to a 0.5 ratio for overall image information and a 0.7 ratio for overall audio information. Within the module, SVD decomposes image features into overall image features and image detail features, and audio features into overall audio features and audio detail features. Then, a self-attention mechanism is used to perform temporal aggregation on the four types of decoupled features, forming a feature decoupling aggregation submodule. This submodule outputs overall image aggregation features, image detail aggregation features, overall audio aggregation features, and audio detail aggregation features. The combination of SVD and self-attention mechanism in this technical solution can strengthen the feature correlation in the temporal dimension.
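The SVD-based decoupling can be sketched as follows. The text gives the ratios (0.5 for images, 0.7 for audio) but not how the ratio is applied, so this sketch assumes the ratio is the fraction of singular values, largest first, assigned to the overall part; an energy-based split would be an equally plausible reading. The function name and random inputs are hypothetical.

```python
import numpy as np


def svd_split(X, ratio):
    """Split X into an 'overall' part and a 'detail' part by singular values.

    The largest `ratio` fraction of singular values reconstructs the overall
    part; the remaining ones reconstruct the detail part (assumed convention).
    """
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    k = max(1, int(round(ratio * len(S))))
    overall = (U[:, :k] * S[:k]) @ Vt[:k]
    detail = (U[:, k:] * S[k:]) @ Vt[k:]
    return overall, detail


rng = np.random.default_rng(0)
image_features = rng.normal(size=(60, 128))   # one ad: 60 time steps, 128 dimensions
audio_features = rng.normal(size=(60, 128))

image_overall, image_detail = svd_split(image_features, 0.5)
audio_overall, audio_detail = svd_split(audio_features, 0.7)
```

By construction the two parts sum back to the original matrix, so the split redistributes information between the overall and detail branches without discarding any of it.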

[0121] The scoring calculation module generates the predicted click-through rate (CTR) score for short video ads. It processes the overall image, image detail, overall audio, and audio detail aggregation features, using linear transformations and a multilayer perceptron to construct difference scores and inner product calculations to construct matching scores. Within the module, two independent linear transformation layers output the popular and unpopular features corresponding to each type of aggregation feature. With an input layer dimension of 128, a hidden layer dimension of 256, and an output layer dimension of 1, the multilayer perceptron uses the ReLU activation function to score the two types of features and calculate the difference, obtaining the difference scores for the overall image, image detail, and overall audio. The inner products of the aggregation features are then used to calculate the overall matching score and the detail matching score. Finally, a summation layer adds the five scores together to form a comprehensive scoring submodule, which outputs the predicted CTR score.

[0122] The model training and optimization module is responsible for model training and offline ranking. Within this module, the main loss calculation layer performs comparative learning training on video pairs extracted from the training set, and the auxiliary loss calculation layer performs key sample optimization on video pairs formed from the 30 samples with the highest click-through rates and the 30 samples with the lowest click-through rates. The joint loss function guides the iterative training of the model. Training uses the Adam optimizer with a learning rate of 0.001 and 1000 iterations. This constitutes the model training and ranking submodule. After the short video ads to be ranked are input into the trained model, the ranking results are output in descending order of predicted click-through rate score. The joint loss function in this technical solution can be adapted to the evaluation needs of offline scenarios without user information.

[0123] In summary, this invention utilizes the image and audio features of short video ads to rank click-through rates (CTR), broadening the application scope of the technology and solving the problem of insufficient user data for evaluation in offline scenarios. It decouples the overall and detailed information of features through singular value decomposition and quantifies the positive and negative contributions of popular and unpopular features to CTR through difference scoring, overcoming the limitations of traditional coarse-grained scoring and achieving differentiation between similar ads. It employs a neural network model jointly trained based on the primary loss of video pairs and the auxiliary loss of the most difficult sample. By integrating cross-modal matching scores, it analyzes the contribution of single-modal features and inter-modal synergistic effects, making the prediction results more closely reflect real click behavior. The end-to-end training framework is simple and efficient, allowing for direct offline evaluation before ad placement.

[0124] Example 3 is the third embodiment of the present invention, which provides a short video ad click-through rate scoring and ranking system, including...

[0125] The feature acquisition module extracts and preprocesses features from short video ads to obtain image and audio features;

[0126] The first aggregated feature construction module processes the image features to obtain the overall image features and image detail features of the image features, and aggregates the overall image features and image detail features of the image features to obtain the first aggregated feature;

[0127] The second aggregation feature construction module processes the audio features to obtain the overall audio features and audio detail features of the audio features, and aggregates the overall audio features and audio detail features of the audio features to obtain the second aggregation feature.

[0128] The popular and unpopular feature construction module processes the first aggregated feature and the second aggregated feature to obtain the popular and unpopular features;

[0129] The difference score building module constructs difference scores based on popular and unpopular features;

[0130] The predicted click-through rate score calculation module calculates the matching score between the first aggregated feature and the second aggregated feature, and combines the difference score to construct the predicted click-through rate score of the short video ad.

[0131] The offline click-through rate (CTR) ranking module trains a neural network model using predicted CTR scores under the constraint of a loss function, and then uses the trained neural network model to rank short video ads offline based on their CTR.

[0132] Example 4, the fourth embodiment of the present invention, differs from the previous three embodiments in that: if the function is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0133] The logic and / or steps represented in the flowchart or otherwise described herein, for example, can be considered as a sequenced list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a processor-including system, or other system that can fetch and execute instructions from, an instruction execution system, apparatus, or device). For the purposes of this specification, "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transmit programs for use by, or in conjunction with, an instruction execution system, apparatus, or device.

[0134] More specific examples of computer-readable media (a non-exhaustive list) include: electrical connections (electronic devices) having one or more wires, portable computer disk drives (magnetic devices), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CD-ROM). Furthermore, a computer-readable medium can even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example, by optically scanning the paper or other medium, followed by editing, interpreting, or otherwise processing as necessary, and then stored in computer memory.

[0135] It should be understood that various parts of the present invention can be implemented in hardware, software, firmware, or a combination of all three. In the above embodiments, multiple steps or methods can be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, it can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.

[0136] Example 5, the fifth embodiment of the present invention, provides a method for ranking and scoring short video ad click-through rates. To verify the beneficial effects of the present invention, scientific demonstration is carried out through experiments.

[0137] 1. Dataset Setup:

[0138] • Dataset: Short video platform dataset (TikTok)

[0139] • Video sampling: Sampled once every 10 frames.

[0140] • Data partitioning: Training set: Test set = 8:2

[0141] • Maximum number of frames: 60

[0142] 2. Feature extraction parameters:

[0143] Image features: Feature extraction using pre-trained MobileNet-V2

[0144] Audio features: Features extracted using pre-trained VGGish

[0145] Unified Dimension: (Using linear layers to unify image and audio features to the same dimension)

[0146] Padding: Videos of different lengths are uniformly padded to the maximum length.

[0147] 3. Feature processing parameters:

[0148] SVD decomposition: the singular value partition ratio is controlled by a hyperparameter

[0149] Optimal segmentation ratio:

[0150] Overall audio information ratio:

[0151] Overall information proportion of the image:

[0152] MLP parameters: an input layer whose dimension equals the feature dimension, a hidden layer, and an output layer of dimension 1, each with learnable parameters; the ReLU activation function is used.

[0153] Self-attention parameters: three learnable projection matrices (query, key, and value), a scaling factor based on the key / query dimension, and an output dimension equal to the input dimension.
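For reference, the self-attention configuration listed above corresponds to standard scaled dot-product attention, which can be sketched in NumPy as follows. The random projection matrices, the input sizes, and the final mean pooling over time are assumptions; the text does not state how the attended sequence is reduced to a single aggregated feature.

```python
import numpy as np


def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over the temporal axis of X (L, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])          # scale by sqrt of key dimension
    scores -= scores.max(axis=1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over time steps
    return weights @ V, weights


def aggregate_over_time(X, Wq, Wk, Wv):
    """Attend across time, then mean-pool to one vector (pooling is an assumption)."""
    attended, _ = self_attention(X, Wq, Wk, Wv)
    return attended.mean(axis=0)


rng = np.random.default_rng(0)
L, d = 60, 128
X = rng.normal(size=(L, d))                          # one decoupled feature matrix
Wq, Wk, Wv = (rng.normal(scale=0.05, size=(d, d)) for _ in range(3))

attended, weights = self_attention(X, Wq, Wk, Wv)
aggregated = aggregate_over_time(X, Wq, Wk, Wv)
```

The attended output keeps the input dimension of 128, matching the statement that the output dimension equals the input dimension.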

[0154] 4. Training parameters:

[0155] Optimizer: Adam Optimizer

[0156] Learning rate: 0.001

[0157] Training iterations: 1000

[0158] Auxiliary loss weights:

[0159] Sample selection: samples are selected from both the high and the low click-through rate regions.

[0160] 5. Performance Indicators:

[0161] Normalized Discounted Cumulative Gain (NDCG): 0.9719. NDCG is a commonly used metric for evaluating the quality of ranking results; a higher value is better. Compared with benchmark methods (HIDRO-VQA...), this invention improves the NDCG index by 8.19%.
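For readers unfamiliar with the metric, NDCG can be computed as in the sketch below. It uses the linear-gain variant with a base-2 logarithmic discount; the specification does not say which variant or cutoff was used in the experiment, and the toy click-through rates are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: relevance at position i is divided by log2(i + 2)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))


def ndcg(relevances_in_predicted_order):
    """DCG of the predicted ordering divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances_in_predicted_order, reverse=True))
    return dcg(relevances_in_predicted_order) / ideal if ideal > 0 else 0.0


# Toy ground-truth click-through rates and a predicted ranking of the same ads.
true_ctr = {"ad_a": 0.09, "ad_b": 0.05, "ad_c": 0.01}
predicted_order = ["ad_a", "ad_c", "ad_b"]
score = ndcg([true_ctr[ad] for ad in predicted_order])
```

A perfect ranking gives an NDCG of 1, and errors near the top of the list cost more than errors near the bottom because of the logarithmic discount.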

[0162] It should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, and all such modifications or substitutions should be covered within the scope of the claims of the present invention.

Claims

1. A method for ranking and scoring short video ad click-through rates, characterized in that: include, The short video advertisement is subjected to feature extraction and preprocessing to obtain image features and audio features; The image features are processed to obtain the overall image features and image detail features of the image features, and the overall image features and image detail features of the image features are aggregated to obtain the first aggregated feature; The audio features are processed to obtain the overall audio features and audio detail features of the audio features, and the overall audio features and audio detail features of the audio features are aggregated to obtain a second aggregated feature; The first aggregated feature and the second aggregated feature are processed to obtain popular features and unpopular features; A difference score is constructed based on the popular and unpopular features; Calculate the matching score between the first aggregated feature and the second aggregated feature, and combine the difference score to construct the predicted click-through rate score of the short video ad; Under the constraint of the loss function, the predicted click-through rate score is used to train the neural network model, and the trained neural network model is used to sort the click-through rates of the short video ads offline.

2. The short video ad click-through rate scoring and ranking method as described in claim 1, characterized in that, The steps for obtaining the first aggregated feature and the second aggregated feature include: The step of obtaining the first aggregated feature includes performing singular value decomposition on the image feature; The overall image features and image detail features are separated according to a preset ratio, and the overall image features and image detail features are temporally aggregated using a self-attention mechanism to obtain the first aggregated feature. The step of obtaining the second aggregated feature includes performing the singular value decomposition on the audio feature; The overall audio features and detailed audio features are separated according to the preset ratio, and the overall audio features and detailed audio features are temporally aggregated using the self-attention mechanism to obtain the second aggregated feature.

3. The short video ad click-through rate scoring and ranking method as described in claim 2, characterized in that, The steps to obtain the popular features and unpopular features include: Extract overall image aggregation features and image detail aggregation features from the first aggregation features, and extract overall audio aggregation features from the second aggregation features; The overall image aggregation features, image detail aggregation features, and overall audio aggregation features are input into a preset linear transformation structure for processing, and the popular and unpopular features of the overall image aggregation features, image detail aggregation features, and overall audio aggregation features are output.

4. The short video ad click-through rate scoring and ranking method as described in claim 3, characterized in that, The steps to construct a difference score include: The popular and unpopular features are input into a multilayer perceptron; The popular features and unpopular features are scored using the multilayer perceptron to obtain popular feature scores and unpopular feature scores. Calculate the difference between the popular feature score and the unpopular feature score corresponding to the overall image aggregation feature, the image detail aggregation feature, and the overall audio aggregation feature, and obtain the difference score.

5. The short video ad click-through rate scoring and ranking method as described in claim 4, characterized in that, The steps for constructing the predicted click-through rate score of the short video ad include: Calculate the inner product of the overall image aggregation features and the overall audio aggregation features to obtain the overall matching score; Calculate the inner product of the image detail aggregation features and the audio detail aggregation features to obtain the detail matching score; The predicted click-through rate score of the short video ad is obtained by summing the difference score, the overall matching score, and the detailed matching score.

6. The short video ad click-through rate scoring and ranking method as described in claim 5, characterized in that, The step of offline ranking of the click-through rates of the short video ads includes: Under the constraints of the loss function, the neural network model is trained using labeled short video ad samples; The short video ads to be sorted are input into the trained neural network model to obtain the predicted click-through rate score for each short video ad to be sorted. Arrange all short video ads to be sorted in descending order of predicted click-through rate scores, and output the offline sorting results.

7. The short video ad click-through rate scoring and ranking method as described in claim 6, characterized in that, The loss function includes: Main loss function and auxiliary loss function; The main loss function extracts click rate samples from the training set of the neural network model to form a video pair set based on the offline sorting results; The auxiliary loss function selects the top q samples with the highest and lowest click-through rates from the labeled short video ad samples to form a video pair set. The loss function is constructed by weighted fusion of the main loss function and the auxiliary loss function.

8. A short video ad click-through rate (CTR) scoring and ranking system, employing the short video ad CTR scoring and ranking method as described in any one of claims 1 to 7, characterized in that it includes: The feature acquisition module extracts and preprocesses features from the short video advertisement to obtain image and audio features; The first aggregated feature construction module processes the image features to obtain the overall image features and image detail features of the image features, and aggregates the overall image features and image detail features of the image features to obtain the first aggregated feature; The second aggregated feature construction module processes the audio features to obtain the overall audio features and audio detail features of the audio features, and aggregates the overall audio features and audio detail features of the audio features to obtain the second aggregated feature; The popular and unpopular feature construction module processes the first aggregated feature and the second aggregated feature to obtain the popular and unpopular features; The difference score construction module constructs a difference score based on the popular and unpopular features; The predicted click-through rate score calculation module calculates the matching score between the first aggregated feature and the second aggregated feature, and constructs the predicted click-through rate score of the short video ad by combining the difference score; The offline click-through rate (CTR) ranking module trains a neural network model using the predicted CTR score under the constraint of a loss function, and then uses the trained neural network model to perform offline CTR ranking of the short video ads.

9. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, When the processor executes the computer program, it implements the steps of the short video advertisement click-through rate scoring and ranking method according to any one of claims 1 to 7.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by the processor, it implements the steps of the short video advertisement click-through rate scoring and ranking method according to any one of claims 1 to 7.