Method and system for generating personalized ad creative based on multi-modal content understanding
By constructing a multimodal association matrix and a target audience attention field distribution map, the problems of time-consuming and laborious manual planning and inaccurate advertising creatives are solved, enabling the rapid generation and efficient delivery of personalized advertising creatives.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Application (China)
- Current Assignee / Owner
- CHENGDU CHOPPING BOARD TECHNOLOGY CO LTD
- Filing Date
- 2026-03-25
- Publication Date
- 2026-06-19
AI Technical Summary
Existing methods for generating advertising creative rely on manual planning, which is time-consuming and labor-intensive. It is difficult to fully and accurately grasp the multi-dimensional information of the product and the interests and preferences of the target audience, resulting in poor advertising effectiveness.
By receiving product attribute description text, real-object photos, and functional demonstration videos, a functional semantic, appearance visual, and scene temporal correlation matrix is constructed. A three-dimensional correlation aligner is used to achieve cross-modal correlation resonance, and personalized advertising creatives are generated by combining the target audience's attention field distribution map.
It enables the rapid and efficient generation of high-quality advertising creatives that meet the personalized needs of the target audience, thereby improving the accuracy and effectiveness of advertising.
Figure CN122243578A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of digital marketing technology, and more specifically, to a method and system for generating personalized advertising creatives based on multimodal content understanding.
Background Technology
[0002] In today's digital marketing era, personalized advertising has become a key means of attracting consumer attention and improving advertising effectiveness. Traditional advertising creative generation primarily relies on manual planning and creation. Advertisers or advertising planners need to deeply understand product characteristics and target audience needs, then rely on experience and creativity to conceive advertising content. However, this approach has many limitations. On the one hand, the manual creation process is time-consuming, labor-intensive, and inefficient, making it difficult to quickly respond to market changes and the demand for large-scale advertising. On the other hand, due to the limitations of human subjective cognition and experience, it is difficult to comprehensively and accurately grasp the relationship between multi-dimensional product information and the complex interests and preferences of the target audience, resulting in advertising creatives that may not accurately reach the target audience, leading to unsatisfactory advertising results.
[0003] With the development of artificial intelligence technology, some advertising creative generation methods based on single-modal data (such as text or images) have gradually emerged. However, these methods use only part of the product information and cannot fully exploit the rich information contained in multimodal data such as product attribute descriptions, real-object images, and functional demonstration videos. They also fail to adequately consider the attention preferences reflected in the target audience's historical interaction behavior on advertising platforms. Consequently, they struggle to generate advertising creatives that truly meet the personalized needs of the target audience, and they fail to satisfy the high demands of modern digital marketing for advertising precision and personalization.
Summary of the Invention
[0004] In view of this, the purpose of this application is to provide a method and system for generating personalized advertising creatives based on multimodal content understanding.
[0005] In conjunction with the first aspect of this application, a method for generating personalized advertising creatives based on multimodal content understanding is provided, applied to a system for generating personalized advertising creatives based on multimodal content understanding, the method comprising:
[0006] Receive the original product information package sent by the advertiser, the original product information package includes a product attribute description text stream, a product real object image stream, and a product function demonstration video stream;
[0007] Extract core product function keywords from the product attribute description text stream, and construct a functional semantic association matrix based on the core product function keywords; extract product appearance visual elements from the product real object image stream, and construct an appearance visual association matrix based on the product appearance visual elements; extract product usage scenario time sequence segments from the product function demonstration video stream, and construct a scene temporal association matrix based on the product usage scenario time sequence segments.
[0008] The functional semantic association matrix, the appearance visual association matrix, and the scene temporal association matrix are input into a preset 3D association aligner for cross-modal association resonance processing to generate a set of resonance feature units with resonance intensity exceeding a preset resonance threshold.
[0009] Obtain the historical interaction behavior trajectory of the target audience on the advertising platform, extract the audience attention migration path from the historical interaction behavior trajectory, and construct the audience attention field distribution map based on the audience attention migration path;
[0010] The set of resonant feature units is projected onto the audience attention field distribution map, the field strength coupling coefficient of each resonant feature unit in the audience attention field distribution map is calculated, the set of resonant feature units is filtered according to the field strength coupling coefficient, and a field coupling feature unit sequence matching the audience attention field distribution map is generated.
[0011] The field coupling feature unit sequence is input into a preset creative generation decoder, which outputs a text creative fragment sequence, an image creative fragment sequence, and a video creative fragment sequence. The text creative fragment sequence, the image creative fragment sequence, and the video creative fragment sequence are interwoven and spliced according to the arrangement order of the field coupling feature unit sequence to generate a personalized advertising creative unit.
[0012] In conjunction with the second aspect of this application, a personalized advertising creative generation system based on multimodal content understanding is provided. The personalized advertising creative generation system based on multimodal content understanding includes a machine-readable storage medium and a processor. The machine-readable storage medium stores machine-executable instructions. When the processor executes the machine-executable instructions, the personalized advertising creative generation system based on multimodal content understanding implements the aforementioned personalized advertising creative generation method based on multimodal content understanding.
[0013] In conjunction with a third aspect of this application, a computer-readable storage medium is provided, wherein computer-executable instructions are stored therein, and when the computer-executable instructions are executed, the aforementioned method for generating personalized advertising creatives based on multimodal content understanding is implemented.
[0014] In any of the above aspects, a raw product information package containing a product attribute description text stream, a product real object image stream, and a product function demonstration video stream is received, key information is extracted from each modality, and the corresponding association matrices are constructed, which allows the intrinsic connections between different aspects of the product information to be explored in depth. A 3D association aligner performs cross-modal association resonance processing to generate a set of resonance feature units, integrating the multimodal information and giving a more comprehensive understanding of product features. The historical interaction behavior trajectory of the target audience on the advertising platform is obtained and an audience attention field distribution map is constructed, capturing the target audience's attention preferences. The set of resonance feature units is projected onto the audience attention field distribution map and filtered to generate a field coupling feature unit sequence, matching product features to the target audience's attention preferences. Finally, the creative generation decoder outputs creative fragment sequences in different modalities, which are interwoven and spliced to generate personalized advertising creative units. This enables the rapid and efficient generation of high-quality advertising creatives that meet the personalized needs of the target audience, improving the accuracy and attractiveness of advertising and enhancing advertising effectiveness and marketing efficiency.
Attached Figure Description
[0015] Figure 1 is a flowchart of the personalized advertising creative generation method based on multimodal content understanding provided by this application.
Detailed Implementation
[0016] Figure 1 shows a flowchart of the personalized advertising creative generation method based on multimodal content understanding provided in an embodiment of this application. The method is described in detail below:
[0017] First, step S110 is executed to receive the original product information package sent by the advertiser. The original product information package includes a product attribute description text stream, a product real object image stream, and a product function demonstration video stream.
[0018] Specifically, in this embodiment, the advertiser is a smart home appliance manufacturer, which uploads an original product information package through the advertiser's terminal system. This original product information package is a structured data collection containing three core data streams. The first is a product attribute description text stream, which is a time series composed of multiple text files. Each text file records one or more attributes of the smart refrigerator in detail. For example, the first text file describes parameters such as the refrigerator's total volume, freezer volume, and energy efficiency rating. The second text file may detail the inverter compressor technology used and the principle of its sterilization and deodorization system. Subsequent text files may also contain information such as the product's external dimensions, installation requirements, and warranty policies. These text files form a streaming input according to the order in which the product information is organized. The second is a product image stream, containing high-resolution photos of the smart refrigerator taken from different angles and under different lighting conditions. For example, the first frame is a standard view of the refrigerator's front, the second frame shows the internal compartment layout with the door open, and the third frame shows its ultra-thin design from the side. Subsequent images may include close-ups of details such as the control panel display, the texture of the door handles, and the internal LED lighting. These images are arranged in the order they were captured or in a preset display order. The third is a product function demonstration video stream, composed of multiple short video clips showcasing the product's core usage scenarios. 
For example, the first video clip demonstrates a user querying food information via the touchscreen on the refrigerator door or a voice assistant; the second video clip demonstrates the refrigerator's internal camera automatically recognizing and recording the type and expiration date of the food placed inside; and the third video clip demonstrates a user remotely adjusting the refrigerator temperature or checking food inventory via a mobile app. These video clips are stitched together in the logical order of the function demonstrations. The advertiser's system packages and sends the above data streams to the ad creative generation server executing this method via the network.
[0019] Next, the process proceeds to the step of deep semantic and feature extraction and association structure construction of the aforementioned multimodal data, namely step S120. This step aims to extract structural association information that can characterize the core characteristics of the product from the received data streams of the three different modalities. Specifically, step S120 further includes the following parallel sub-steps S121, S122, and S123.
[0020] Step S121: Extract core product function keywords from the product attribute description text stream, and construct a functional semantic association matrix based on the core product function keywords.
[0021] In this embodiment, a series of natural language processing operations is first performed on the product attribute description text stream. Step S1211 is executed to filter stop words: stop words that carry no actual semantic meaning are removed from the product attribute description text stream, generating a filtered product attribute description text stream. The server reads the received text stream and loads the content of each text file as a long string. The system has a built-in stop word list containing common words with no business meaning, such as auxiliary words, prepositions, and conjunctions, for example the romanized Chinese particles "de", "di", "de", "le", "shi", "zai" and words such as "and" and "a". The content of each text file is traversed and every word appearing in the stop word list is removed, leaving a text sequence that retains only content words. Next, step S1212 is executed to segment the filtered product attribute description text stream into multiple independent text vocabulary units. Using a word segmentation tool based on a hidden Markov model or a conditional random field, the continuous text string remaining after stop word filtering is cut into individual words according to the word formation rules of Chinese or English. For example, for the text "This refrigerator uses variable frequency compressors and frost-free air-cooling technology", filtering and word segmentation yield vocabulary units such as "refrigerator", "uses", "variable frequency", "compressor", "frost-free", "air-cooling", and "technology".
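The filtering and segmentation of steps S1211 and S1212 can be illustrated with a minimal Python sketch. It is only a stand-in under stated assumptions: the stop word list `STOP_WORDS` and the function name `content_tokens` are hypothetical, the text is assumed to be English, and a regular expression replaces the hidden Markov model or conditional random field segmenter described above. For simplicity the sketch tokenizes first and filters afterwards, which yields the same content words for whitespace-delimited text.

```python
import re

# Hypothetical stop word list; the embodiment's built-in list is not given in full.
STOP_WORDS = {"this", "the", "a", "an", "and", "of", "in", "is"}

def content_tokens(text, stop_words=STOP_WORDS):
    """Split English text into lowercase tokens and drop stop words.

    Hyphenated terms such as "frost-free" are kept as single tokens.
    A regex tokenizer is an assumption; it does not merge multi-word
    terms the way a trained segmenter might.
    """
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [t for t in tokens if t not in stop_words]
```

Unlike the example in the text, this simple tokenizer keeps "variable" and "frequency" as separate units and does not reduce "compressors" to its singular form.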
[0022] Then, step S1213 is executed, where each text vocabulary unit is input into a pre-trained part-of-speech tagger, which outputs the part-of-speech tag for each text vocabulary unit. Text vocabulary units with part-of-speech tags of nouns or verbs are selected from all text vocabulary units as candidate functional keywords. This part-of-speech tagger is a deep learning model based on a bidirectional long short-term memory network and a conditional random field. Its input is the word segmentation sequence generated in step S1212, and its output is the part-of-speech tag assigned to each vocabulary unit, such as noun (NN), verb (VV), adjective (JJ), etc. All vocabulary units with tags of NN or VV are selected to form a set of candidate functional keywords. Next, step S1214 is executed, where each candidate functional keyword is input into a pre-trained semantic role analyzer, which identifies the semantic role of each candidate functional keyword in the product attribute description text stream. Candidate functional keywords with semantic roles as core arguments are selected from the candidate functional keywords as core product functional keywords. This semantic role analyzer is based on a deep bidirectional Transformer encoder model. It analyzes the semantic role of each candidate keyword in its sentence, such as "agent (A0)," "patient (A1)," "tool (INSTR)," and "location (LOC)." Only words that function as core arguments (such as A0, A1, INSTR) are retained because these words are more likely to represent the core function or target of the product. For example, in the sentence "compressor (A0) drives refrigerant (A1) circulation," both "compressor" and "refrigerant" are retained as core product function keywords, while words such as "circulation" may be excluded as predicates.
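The two-stage selection in steps S1213 and S1214 can be sketched as a pair of filters. The sketch assumes the part-of-speech tagger and semantic role analyzer have already run, so each vocabulary unit arrives as a hypothetical `(word, pos_tag, role)` tuple; the function name `core_keywords` and the role label `PRED` for predicates are illustrative and not taken from the source.

```python
def core_keywords(tagged_units,
                  pos_keep=("NN", "VV"),
                  roles_keep=("A0", "A1", "INSTR")):
    """Select core product function keywords from pre-tagged vocabulary units.

    tagged_units: iterable of (word, pos_tag, semantic_role) tuples,
                  assumed to come from the pre-trained models.
    Step 1 keeps nouns and verbs as candidates (S1213).
    Step 2 keeps candidates whose role is a core argument (S1214).
    Duplicates are dropped while preserving first-seen order.
    """
    candidates = [(w, r) for w, pos, r in tagged_units if pos in pos_keep]
    selected = []
    for word, role in candidates:
        if role in roles_keep and word not in selected:
            selected.append(word)
    return selected
```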
[0023] After the core keywords are identified, step S1215 is executed to calculate the co-occurrence distance between any two core product function keywords in the product attribute description text stream. A core product function keyword co-occurrence matrix is constructed from these distances: the row and column indices of the matrix are both core product function keywords, and each element value is the sum of the reciprocals of the corresponding co-occurrence distances. A sliding window of W sentences is set and the entire text stream is traversed. For any two core product function keywords Ki and Kj, one could simply count how many times they co-occur within the same sliding window; this embodiment instead uses the more refined co-occurrence distance. If Ki and Kj appear in the same sentence, their distance is defined as 1; if they appear in adjacent sentences, the distance is 2; if they are separated by one sentence, the distance is 3, and so on. The distance value D_ij_n of every co-occurrence event is recorded, where n denotes the n-th co-occurrence event. In the co-occurrence matrix M_text, the element M_text_ij in the i-th row and j-th column is obtained by summing over all co-occurrence events: M_text_ij = Σ_n (1 / D_ij_n). If two words never co-occur within the window, then M_text_ij = 0. This matrix is symmetric. The diagonal elements can be set to 0 or 1; they are typically set to 0, so that the relevance of a word to itself is not counted.
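The construction of M_text in step S1215 can be sketched in a few lines of Python. This is a minimal illustration rather than the embodiment's implementation: the function name `cooccurrence_matrix` is hypothetical, sentences are assumed to be already tokenized, every pair of keyword occurrences is treated as one co-occurrence event, and a pair counts as co-occurring when its sentence distance does not exceed the window size W.

```python
from itertools import combinations

def cooccurrence_matrix(sentences, keywords, window=3):
    """Build M_text with M[i][j] = sum over co-occurrence events of 1 / distance.

    sentences: list of token lists, one list per sentence.
    keywords:  ordered list of core product function keywords.
    window:    W, the largest sentence distance still counted (assumed meaning).
    """
    index = {k: i for i, k in enumerate(keywords)}
    n = len(keywords)
    m = [[0.0] * n for _ in range(n)]
    # Record (sentence number, keyword index) for every keyword occurrence.
    occurrences = [
        (s, index[tok])
        for s, sent in enumerate(sentences)
        for tok in sent
        if tok in index
    ]
    for (s1, i), (s2, j) in combinations(occurrences, 2):
        if i == j:
            continue  # diagonal elements stay 0
        distance = abs(s1 - s2) + 1  # same sentence -> 1, adjacent -> 2, ...
        if distance <= window:
            m[i][j] += 1.0 / distance
            m[j][i] += 1.0 / distance  # keep the matrix symmetric
    return m
```

With three sentences in which "compressor" and "refrigerant" share the first sentence and "refrigerant" recurs in the second, a window of 2 gives M_text = 1 + 1/2 = 1.5 for that pair.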
[0024] Next, step S1216 is executed, performing matrix spectral decomposition on the core product function keyword co-occurrence matrix to extract its eigenvalues and corresponding eigenvectors. The leading eigenvectors, ranked by eigenvalue magnitude, are selected as basis vectors. Eigenvalue decomposition is performed on the constructed M_text matrix to obtain a series of eigenvalues λ_i and corresponding eigenvectors ν_i. The eigenvalues are sorted from largest to smallest, and the eigenvectors ν_1, ν_2, ..., ν_K corresponding to the K largest eigenvalues are selected as basis vectors. These basis vectors span a new semantic space. The value of K can be determined based on the cumulative variance contribution rate; for example, K is chosen such that the sum of the top K eigenvalues accounts for more than a preset threshold (e.g., 85%). Finally, step S1217 is executed, using the projection coefficients of each core product function keyword onto the basis vectors as the association dimension coordinates of that keyword. The functional semantic association matrix is constructed from the association dimension coordinates of all core product function keywords, with each row of the matrix corresponding to the association dimension coordinate vector of one core product function keyword. For each word Wi in the core product function keyword list, the original representation could be a one-hot vector, but here the word is projected onto the space spanned by ν_1 to ν_K. Specifically, the projection coefficients of the co-occurrence vector of word Wi (i.e., the i-th row of the M_text matrix) onto the aforementioned basis vectors are calculated.
Let the original row vector of word Wi be Row_i (dimension N, where N is the total number of core words). Its projection coefficient on the k-th basis vector ν_k is then p_ik = Row_i · ν_k (dot product). Thus, word Wi is represented as a K-dimensional real-valued vector P_i = [p_i1, p_i2, ..., p_iK]. Stacking the P_i vectors of all N core words row by row forms a matrix of shape [N, K], which is the functional semantic association matrix F_semantic. Each row of this matrix represents the coordinates of one core product function keyword on the mined latent semantic association dimensions.
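Steps S1216 and S1217 can be sketched with NumPy as follows. The function name `functional_semantic_matrix` is hypothetical. One point is an assumption of this sketch and not stated in the source: when the diagonal of M_text is 0, the eigenvalues sum to zero, so the cumulative contribution ratio here is computed over the positive eigenvalues only.

```python
import numpy as np

def functional_semantic_matrix(m_text, energy=0.85):
    """Project each row of the symmetric matrix M_text onto its top-K eigenvectors.

    Returns (F_semantic of shape [N, K], the K selected eigenvalues).
    K is the smallest count whose positive eigenvalues reach the `energy`
    share of all positive eigenvalues (assumed rule, see lead-in).
    """
    m = np.asarray(m_text, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(m)      # real eigenpairs of a symmetric matrix
    order = np.argsort(eigvals)[::-1]         # sort from largest to smallest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    positive = np.clip(eigvals, 0.0, None)
    total = positive.sum()
    if total == 0.0:
        k = 1
    else:
        ratio = np.cumsum(positive) / total
        k = min(int(np.searchsorted(ratio, energy)) + 1, len(eigvals))
    basis = eigvecs[:, :k]                    # columns are the basis vectors
    return m @ basis, eigvals[:k]             # p_ik = Row_i . v_k
```

Because each basis vector is an eigenvector of M_text, the k-th column of the result equals λ_k times ν_k, which gives a quick consistency check on the output.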
[0025] Step S122: Extract product appearance visual elements from the product real object image stream, and construct an appearance visual association matrix based on the product appearance visual elements.
[0026] This step runs in parallel with step S121 and processes the image stream. First, step S1221 is executed, performing image color space conversion on each frame of the product image stream, transforming each frame from its original color space to a preset target color space to generate a product image in the target color space. Assuming the original image is in RGB color space, it is converted to the CIE L*a*b* color space. The conversion process includes: first, converting the RGB values to XYZ tristimulus values through a nonlinear transformation, and then converting the XYZ values to the L* (lightness), a* (green-red axis), and b* (blue-yellow axis) components. The L*a*b* space is designed to be closer to human visual perception, and color differences computed in it are more perceptually uniform, which benefits subsequent feature extraction. Next, step S1222 is executed, performing image frequency domain transformation on each frame of the product image in the target color space, transforming each frame from the spatial domain to the frequency domain to generate a frequency domain spectrum of each frame. A two-dimensional discrete Fourier transform is applied to the L*, a*, and b* channels of each frame. For an image channel f(x, y) of size M×N, its discrete Fourier transform F(u, v) is calculated with the formula F(u, v) = Σ_{x=0}^{M-1} Σ_{y=0}^{N-1} f(x, y) · exp(-j · 2π · (u·x/M + v·y/N)). The transform result F(u, v) is a complex matrix representing the amplitude and phase of the image at the frequency component (u, v). The frequency domain spectrum of the image is obtained by combining or averaging the amplitude spectra or power spectra (squared amplitudes) of all channels.
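The color space conversion of step S1221 can be illustrated for a single pixel. The text only says "RGB", so the following sketch makes explicit assumptions: the input is sRGB with components in [0, 1], the reference white is D65, and the function name `srgb_to_lab` is hypothetical. The gamma decoding step is the nonlinear part of the RGB-to-XYZ stage; the matrix multiplication that follows is linear.

```python
def srgb_to_lab(r, g, b):
    """Convert one sRGB pixel (components in [0, 1]) to CIE L*a*b* (D65 white)."""
    def linearize(c):
        # Undo the sRGB gamma encoding.
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = linearize(r), linearize(g), linearize(b)
    # Linear RGB -> XYZ tristimulus values (sRGB primaries, D65).
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    def f(t):
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    # Normalize by the D65 reference white before the nonlinear mapping.
    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)
```

Under these assumptions pure white maps to approximately L* = 100 with a* and b* near 0, and pure black maps to L* = 0.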
[0027] Then, step S1223 is executed, extracting the concentrated spectral energy region from the frequency domain spectrum of each frame of the product image, and identifying the frequency components corresponding to the concentrated spectral energy region as the dominant frequency components of that frame of the product image. In the frequency spectrum, energy is usually concentrated in the low-frequency region, corresponding to the smooth areas and main contours of the image. By setting an energy threshold, for example, finding a minimum frequency radius R, such that the proportion of spectral energy contained within this radius to the total energy exceeds a preset threshold (e.g., 90%), the low-frequency region defined by this radius R is the concentrated spectral energy region, and the frequency components corresponding to this region (i.e., frequencies from 0 to R) are identified as the dominant frequency components of the image. These dominant frequency components determine the main texture roughness and structural scale of the image. Next, step S1224 is executed, reconstructing the image of each frame of the product image based on the dominant frequency components, filtering out secondary frequency components other than the dominant frequency components, and generating a visual dominant frequency reconstructed image of each frame of the product image. This is achieved by constructing an ideal low-pass filter. The filter's transfer function H(u, v) is 1 when the magnitude of the frequency (u, v) is less than or equal to R, and 0 otherwise. The frequency domain representation F(u, v) of the original image is multiplied by H(u, v) to obtain the filtered frequency domain representation G(u, v) = F(u, v) * H(u, v). 
Then, a two-dimensional inverse discrete Fourier transform is performed on G(u, v) to obtain the reconstructed spatial domain image g(x, y), which preserves the main structure and contour of the product while filtering out high-frequency details and noise.
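Steps S1222 to S1224 can be sketched for one image channel with NumPy's FFT routines. The function name `dominant_frequency_reconstruction` is hypothetical, the radius is measured in frequency bins from the zero-frequency component after centering the spectrum, and the channel is assumed to contain at least one nonzero value so that the total energy is not zero.

```python
import numpy as np

def dominant_frequency_reconstruction(channel, energy=0.90):
    """Low-pass one image channel at the smallest radius R holding `energy` of the power.

    Returns (reconstructed spatial-domain channel, radius R).
    """
    channel = np.asarray(channel, dtype=float)
    spectrum = np.fft.fftshift(np.fft.fft2(channel))   # zero frequency moved to the center
    power = np.abs(spectrum) ** 2
    rows, cols = channel.shape
    v, u = np.indices((rows, cols))
    radius = np.hypot(v - rows // 2, u - cols // 2)    # distance from zero frequency

    # Accumulate power in order of increasing radius to find R.
    order = np.argsort(radius, axis=None)
    cumulative = np.cumsum(power.ravel()[order]) / power.sum()
    idx = min(int(np.searchsorted(cumulative, energy)), order.size - 1)
    r = radius.ravel()[order][idx]

    mask = radius <= r                                 # ideal low-pass filter H(u, v)
    filtered = spectrum * mask                         # G(u, v) = F(u, v) * H(u, v)
    reconstructed = np.fft.ifft2(np.fft.ifftshift(filtered)).real
    return reconstructed, r
```

For a constant image all energy sits at zero frequency, so R is 0 and the reconstruction equals the input. An ideal low-pass filter can introduce ringing near sharp edges; the sketch follows the filter described in the text and does not try to mitigate that.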
[0028] Next, step S1225 is executed, where edge contour extraction is performed on the visual dominant frequency reconstructed image of each frame of the product image. Pixels with abrupt changes in grayscale gradient in the visual dominant frequency reconstructed image are identified, and these pixels are connected to form a closed contour line, which serves as the product appearance contour line for that frame. The Canny edge detection algorithm is used. First, the reconstructed image g(x, y) is smoothed using Gaussian filtering, and then the gradient magnitude and direction of each pixel are calculated. The gradient magnitude can be calculated using operators such as the Sobel operator. Next, non-maximum suppression is performed, that is, only local maxima along the gradient direction are retained as candidate edge points. Finally, double thresholding is applied: a high threshold identifies strong edges, and a low threshold admits weaker edge pixels that connect to them, so that strong edge points link into continuous contours. The result is a series of closed or open curves composed of pixel coordinates. Among these, the most likely closed product appearance contour line is selected through morphological operations and prior knowledge (e.g., the product is usually located in the center of the image and the area enclosed by its contour line is large). Then, step S1226 is executed, segmenting the main product region image from the corresponding visual dominant frequency reconstructed image based on the area enclosed by the product appearance contour line. The color histogram distribution features, texture periodicity distribution features, and shape invariant moment features of the main product region image are extracted as the product appearance visual elements of that frame. Once the contour line is obtained, a mask is generated, marking pixels inside the contour as 1 and pixels outside as 0.
The mask is multiplied with the reconstructed image to segment the main product region. For this region, its color histogram distribution features are calculated; that is, in the L*a*b* space, the L*, a*, and b* channels are quantized into several intervals (e.g., 8 intervals), and the number of pixels in each interval is counted to obtain a three-dimensional histogram, which is then flattened into a vector. The texture periodicity distribution features can be extracted by performing a local Fourier transform on the main product region again or by using a Gabor filter bank to obtain the texture energy response at different scales and directions, forming a texture feature vector. Shape-invariant moment features, or Hu moments, are seven moments constructed from second- and third-order central moments based on a region. These moments remain invariant under image translation, rotation, and scaling transformations, and are used to describe the overall shape of the product. These three feature vectors are concatenated to form a comprehensive visual feature vector, which serves as the visual element representing the product's appearance in that frame of the image.
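Two of the three feature families in step S1226 can be sketched with NumPy. The sketch is partial by design: it computes the masked three-dimensional color histogram and only the first two of the seven Hu moments, and it omits the texture features. The function names are hypothetical, and the value ranges assumed for the L*, a*, and b* channels (0 to 100 and -128 to 127) are assumptions of the sketch.

```python
import numpy as np

def masked_color_histogram(lab, mask, bins=8):
    """Flattened, normalized 3D histogram of the L*a*b* pixels inside the mask.

    lab:  array of shape [H, W, 3]; mask: array of shape [H, W] with 1 inside the contour.
    """
    pixels = lab[np.asarray(mask).astype(bool)]        # shape [P, 3]
    ranges = [(0, 100), (-128, 127), (-128, 127)]      # assumed channel ranges
    hist, _ = np.histogramdd(pixels, bins=bins, range=ranges)
    return hist.ravel() / max(pixels.shape[0], 1)

def first_two_hu_moments(mask):
    """First two Hu invariants of a binary region (mask must be non-empty)."""
    ys, xs = np.nonzero(mask)
    m00 = float(len(xs))
    dx, dy = xs - xs.mean(), ys - ys.mean()            # coordinates relative to the centroid

    def eta(p, q):
        # Normalized central moment: mu_pq / mu_00^(1 + (p + q) / 2)
        return (dx ** p * dy ** q).sum() / m00 ** (1 + (p + q) / 2)

    phi1 = eta(2, 0) + eta(0, 2)
    phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return phi1, phi2
```

Concatenating the histogram vector with the moment values (and, in a fuller version, a texture vector) would give the per-frame visual feature vector described above.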
[0029] Finally, step S1227 is executed, whereby the visual elements of the product appearance corresponding to all frames of actual product images are vectorized to generate a visual element vector corresponding to each visual element. All visual element vectors are arranged in frame order to construct the appearance visual association matrix. Each row of the appearance visual association matrix corresponds to a visual element vector of one frame of actual product images. Assuming there are M frames in total, and the dimension of the visual feature vector extracted from each frame is D (e.g., color histogram feature dimension + texture feature dimension + shape moment feature dimension), these feature vectors are stacked in the original order of the images in the stream to form a matrix V_appearance of shape [M, D]. This matrix is the appearance visual association matrix, where each row represents the quantified association representation of the product appearance in that frame across multiple visual dimensions.
[0030] Step S123: Extract the product usage scenario time sequence segments from the product function demonstration video stream, and construct a scenario time sequence association matrix based on the product usage scenario time sequence segments.
[0031] This step runs in parallel with steps S121 and S122 and processes the video stream. First, step S1231 is executed to detect scene switches in the product function demonstration video stream. The inter-frame difference between adjacent video frames is calculated, and adjacent frames whose inter-frame difference exceeds a preset scene switching threshold are marked as scene switching points. The video stream is decoded into a continuous sequence of image frames. For two adjacent frames, the difference between their color histograms is calculated. One approach is to calculate the chi-square distance or the Bhattacharyya distance between the normalized RGB histograms of the two frames. Another, more robust method is to count the matching local feature points (such as SIFT or ORB features) between the two frames; the fewer the matching points, the greater the difference. A scene switching threshold T_scene is set. When the calculated inter-frame difference is greater than T_scene, a scene switch is determined to have occurred between the adjacent frames, and the index of the later frame is marked as a scene switching point. Next, step S1232 is executed, dividing the product function demonstration video stream into multiple initial video scene segments based on the scene switching points. Each initial video scene segment contains a coherent sequence of video frames. Based on all scene switching points marked in step S1231, the video stream is divided into continuous segments. For example, if switching points lie between frames A and A+1 and between frames B and B+1, the video stream is divided into a first segment from frame 1 to frame A, a second segment from frame A+1 to frame B, and so on. Within each segment, the video frame content is highly coherent and collectively describes a relatively independent demonstration scene.
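Steps S1231 and S1232 can be sketched in plain Python using the chi-square option. The function names are hypothetical, each frame is assumed to be already reduced to a normalized histogram, and segments are returned as 0-based half-open index ranges rather than the 1-based frame numbers used in the example above.

```python
def chi_square_distance(h1, h2):
    """Symmetric chi-square distance between two normalized histograms."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)

def split_scenes(histograms, threshold):
    """Split a frame sequence into scene segments.

    histograms: one normalized histogram per frame, in playback order.
    threshold:  T_scene; a difference above it marks a scene switch.
    Returns a list of (start, end) frame index ranges, end exclusive.
    """
    if not histograms:
        return []
    # The later frame of each differing pair is the scene switching point.
    cuts = [
        i + 1
        for i in range(len(histograms) - 1)
        if chi_square_distance(histograms[i], histograms[i + 1]) > threshold
    ]
    bounds = [0] + cuts + [len(histograms)]
    return [(bounds[k], bounds[k + 1]) for k in range(len(bounds) - 1)]
```

A suitable value for the threshold depends on the histogram resolution and the footage, and would normally be tuned on sample videos.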
[0032] Then, step S1233 is executed to extract keyframes within each initial video scene segment: the intra-frame information entropy of each video frame in the segment is calculated, and the video frames with the highest intra-frame information entropy are selected as the key representative frames of that segment. The information entropy is H = -Σ_{i=0}^{255} p(i) · log2(p(i)), where p(i) is the probability of gray level i in the image gray-level histogram. The higher the information entropy, the greater the amount of information contained in the image and the richer its texture. All frames in the segment are sorted from high to low information entropy, and either the frames in the top P% (e.g., the top 20%) of entropy values or the K frames with the highest entropy values are selected as the key representative frames of that scene segment. These frames best represent the visual content of the scene. Next, step S1234 is executed, arranging all key representative frames of each initial video scene segment in chronological order to generate a key representative frame sequence for each segment. This yields a more compact keyframe sequence for each original scene segment while preserving its chronological order.
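The entropy ranking of steps S1233 and S1234 can be sketched with the standard library alone. The function names are hypothetical, each frame is assumed to be a flat, non-empty list of gray levels, and the top-K variant of the selection rule is used.

```python
import math
from collections import Counter

def frame_entropy(gray_pixels):
    """Shannon entropy (in bits) of the gray-level distribution of one frame."""
    counts = Counter(gray_pixels)
    total = len(gray_pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def key_frames(segment, k):
    """Indices of the k highest-entropy frames of a segment, in chronological order.

    segment: list of frames, each a flat list of gray levels (0..255).
    """
    ranked = sorted(range(len(segment)),
                    key=lambda i: frame_entropy(segment[i]),
                    reverse=True)
    return sorted(ranked[:k])  # restore time order for the key frame sequence
```

A frame with a single gray level has entropy 0, while a frame whose pixels are evenly split between two gray levels has entropy 1 bit.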
[0033] Next, step S1235 is executed, where each key representative frame sequence is input into a pre-trained scene semantic encoder. The scene semantic encoder's 3D convolutional layers jointly extract spatiotemporal features from the key representative frame sequences, generating a scene spatiotemporal feature tensor for each initial video scene segment. This scene semantic encoder is a model based on a 3D convolutional neural network (3DCNN), such as a variant of C3D or I3D. Its input is a tensor of shape [T, H, W, C], where T is the number of key representative frames, H and W are the frame height and width, and C is the number of channels (e.g., RGB three-channel). The 3D convolutional layers capture motion information and temporal changes in the video segment by performing convolution operations simultaneously in the temporal and spatial dimensions. For example, the first 3D convolutional layer uses a 3×3×3 kernel with a stride of 1 to convolve the input tensor, outputting a set of spatiotemporal feature maps that contain both spatial visual patterns and the changes of these patterns over time. The model may consist of multiple stacked 3D convolutional layers, pooling layers, and non-linear activation layers. Finally, after the last 3D convolutional layer or global spatiotemporal pooling layer, a tensor with highly abstract semantics is output. Its shape can be [T', H', W', C'] or directly a higher-dimensional compact tensor, denoted as Tensor_scene. Then, step S1236 is executed to perform tensor dimensionality reduction on the scene spatiotemporal feature tensor of each initial video scene segment, compressing the scene spatiotemporal feature tensor into a one-dimensional scene feature vector, which serves as the product usage scene time sequence segment corresponding to the initial video scene segment. 
For a tensor with shape [T', H', W', C'], a global average pooling or global max pooling operation can be applied first over the spatial dimensions H' and W' to obtain a matrix with shape [T', C']. Then, this matrix is either flattened along the temporal dimension T', giving a one-dimensional feature vector of length L=T'*C', or pooled again over T', giving a vector of length L=C'. This one-dimensional vector Vector_scene encapsulates the spatiotemporal information of the entire video scene segment, representing a temporal segment of a product usage scenario.
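The dimensionality reduction of step S1236 can be illustrated with plain nested lists. The sketch below assumes average pooling and a tensor laid out as [T'][H'][W'][C']; a practical system would use a tensor library instead. The function names and the pool_time flag are hypothetical.

```python
def spatial_average_pool(tensor):
    """Average over H' and W': [T'][H'][W'][C'] -> [T'][C']."""
    pooled = []
    for frame in tensor:
        height, width, channels = len(frame), len(frame[0]), len(frame[0][0])
        sums = [0.0] * channels
        for row in frame:
            for pixel in row:
                for ch, value in enumerate(pixel):
                    sums[ch] += value
        pooled.append([s / (height * width) for s in sums])
    return pooled


def to_scene_vector(tensor, pool_time=False):
    """Reduce a scene tensor to a 1-D vector of length T'*C' or C'."""
    pooled = spatial_average_pool(tensor)
    if pool_time:
        steps = len(pooled)
        return [sum(column) / steps for column in zip(*pooled)]
    return [value for row in pooled for value in row]
```

With pool_time left at False the temporal order of the features is preserved in the vector; with pool_time set to True the vector is shorter but no longer distinguishes early from late frames.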
[0034] Finally, step S1237 is executed, arranging all the product usage scenario time-series segments corresponding to the initial video scene segments according to the order in which the scenes appear in the video stream, and constructing the scene time-series association matrix. Each row of the scene time-series association matrix corresponds to a scene feature vector of a product usage scenario time-series segment. Assuming the video stream is divided into S initial video scene segments, and each segment generates a scene feature vector of length L, these vectors are stacked according to the playback order of the scenes in the original video to form a matrix T_scene of shape [S, L]. This is the scene time-series association matrix, where each row corresponds to a feature representation of a temporally continuous and semantically complete product usage scenario.
[0035] Thus far, through steps S121, S122, and S123, we have constructed the functional semantic association matrix F_semantic, the appearance visual association matrix V_appearance, and the scene temporal association matrix T_scene from the three modalities of text, image, and video, respectively.
[0036] Next, step S130 is executed, in which the functional semantic association matrix, the appearance visual association matrix and the scene temporal association matrix are input into a preset three-dimensional association aligner for cross-modal association resonance processing, generating a set of resonance feature units with resonance intensity exceeding a preset resonance threshold.
[0037] This step aims to uncover the deep, mutually reinforcing relationships among the three modalities. Specifically, step S130 includes the following sub-steps. First, step S131 is executed, where each row vector in the functional semantic association matrix is treated as a text modality association source, each row vector in the appearance visual association matrix is treated as an image modality association source, and each row vector in the scene temporal association matrix is treated as a video modality association source. That is, each row of F_semantic (representing the semantic coordinates of a core functional keyword) is considered a text source point, each row of V_appearance (representing the product visual features of a frame of image) is considered an image source point, and each row of T_scene (representing the spatiotemporal features of a usage scene) is considered a video source point.
[0038] Next, step S132 is executed to calculate the first cross-modal association distance between each text modality association source and each image modality association source, and to construct a text-image association distance matrix based on the first cross-modal association distance. For any text source point Txt_i (dimension K) and any image source point Img_j (dimension D), the Euclidean distance cannot be calculated directly because the two may have different dimensions. Therefore, they are first mapped to a common association space through a learnable or fixed linear transformation. Two mapping matrices W_t2c (shape [K, C]) and W_i2c (shape [D, C]) can be pre-trained to map both text and image features to a C-dimensional common space. Then, in the common space, their cosine distance or Euclidean distance is calculated as the association distance d_ti_ij. For example, treating Txt_i and Img_j as row vectors, the Euclidean distance of the mapped vectors is calculated as: d_ti_ij = ||Txt_i*W_t2c - Img_j*W_i2c||_2. This calculation is performed on all I text source points and J image source points to obtain a matrix D_ti of shape [I, J], whose elements are the corresponding first cross-modal association distances.
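A minimal sketch of the distance matrix construction of step S132 follows. It assumes the mapping matrices are already available (their training is not covered here), treats source points as row vectors, and uses the Euclidean distance in the common space. The function names are hypothetical; the same function could serve steps S133 and S134 with the corresponding matrices.

```python
import math


def project(vec, w):
    """Row vector (length n) times matrix w (shape [n][c]) -> vector of length c."""
    return [
        sum(v * w[row][col] for row, v in enumerate(vec))
        for col in range(len(w[0]))
    ]


def association_distance_matrix(sources_a, w_a, sources_b, w_b):
    """Pairwise Euclidean distances between two modalities in the common space."""
    mapped_a = [project(a, w_a) for a in sources_a]
    mapped_b = [project(b, w_b) for b in sources_b]
    return [[math.dist(x, y) for y in mapped_b] for x in mapped_a]
```

For I text source points and J image source points the result has shape [I, J], matching D_ti.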
[0039] Simultaneously, step S133 is executed to calculate the second cross-modal association distance between each text modality association source and each video modality association source, and to construct a text-video association distance matrix based on the second cross-modal association distance. Similarly, using another mapping matrix W_t2c' (which may or may not be shared with the above W_t2c) and W_v2c (of shape [L, C]), the text source point Txt_i and the video source point Vid_k (of length L) are mapped to a common space, the distance d_tv_ik is calculated, and a text-video association distance matrix D_tv of shape [I, K] is constructed, where K here denotes the number of video source points (scene segments), not the text feature dimension used in step S132. Similarly, step S134 is executed to calculate the third cross-modal association distance between each image modality association source and each video modality association source, and to construct an image-video association distance matrix D_iv of shape [J, K] based on the third cross-modal association distance.
[0040] Then, step S135 is executed, whereby the text-image association distance matrix, the text-video association distance matrix, and the image-video association distance matrix are input into the resonant kernel calculation unit of the 3D association aligner. The resonant kernel calculation unit performs 3D tensor superposition on the three association distance matrices to generate a 3D association resonant tensor. The resonant kernel calculation unit first expands the three distance matrices into 3D tensors. For example, D_ti([I, J]) is expanded into a tensor of shape [I, J, 1], and then repeated K times along a new dimension to obtain a tensor T_ti of shape [I, J, K], where each slice at depth k is the same as D_ti. Similarly, D_tv([I, K]) is expanded into [I, 1, K] and repeated J times to obtain a tensor T_tv of shape [I, J, K]. D_iv([J, K]) is expanded into [1, J, K] and repeated I times to obtain a tensor T_iv of shape [I, J, K]. Then, these three tensors are superimposed, for example, through weighted summation or direct averaging. If weighted summation is used, with weights α, β, and γ, the three-dimensional correlation resonance tensor R = α*T_ti + β*T_tv + γ*T_iv. Each element R_ijk in R comprehensively reflects the overall cross-modal correlation distance between the text source point i, the image source point j, and the video source point k. The smaller the distance, the stronger the resonance among the three.
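The expand-repeat-superimpose operation of step S135 is equivalent to computing R_ijk = α*D_ti[i][j] + β*D_tv[i][k] + γ*D_iv[j][k] for every index triple, which the following sketch does directly. The equal default weights are an assumption chosen for illustration, and the function name is hypothetical.

```python
def resonance_tensor(d_ti, d_tv, d_iv, alpha=1 / 3, beta=1 / 3, gamma=1 / 3):
    """Weighted superposition of the three distance matrices into R[i][j][k].

    d_ti has shape [I][J], d_tv has shape [I][K], d_iv has shape [J][K].
    """
    n_i, n_j, n_k = len(d_ti), len(d_iv), len(d_iv[0])
    return [
        [
            [
                alpha * d_ti[i][j] + beta * d_tv[i][k] + gamma * d_iv[j][k]
                for k in range(n_k)
            ]
            for j in range(n_j)
        ]
        for i in range(n_i)
    ]
```

Computing the sum per element avoids materializing the three repeated tensors T_ti, T_tv, and T_iv while producing the same result.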
[0041] Next, step S136 is executed to perform tensor decomposition on the three-dimensional correlation resonance tensor, extract the principal component factors of the three-dimensional correlation resonance tensor, and calculate the comprehensive resonance intensity between each text modality correlation source, each image modality correlation source, and each video modality correlation source based on the principal component factors. The three-dimensional tensor R is decomposed into R≈Σ_{r=1}^{R}λ_r*(a_r∘b_r∘c_r) using CANDECOMP/PARAFAC (CP) decomposition, where the upper limit R of the summation denotes the rank of the decomposition (the number of components) rather than the tensor itself, ∘ represents the vector outer product, a_r (length I), b_r (length J), and c_r (length K) are the factor vectors obtained from the decomposition, and λ_r is the weight. These factor vectors can be considered as "resonance modes" mined from the three modalities. For a specific triple (i, j, k), its overall resonance intensity can be obtained by reconstructing its corresponding contribution, for example, the intensity S_ijk=Σ_{r=1}^{R}λ_r*a_r(i)*b_r(j)*c_r(k). The larger this S_ijk value, the greater the contribution of the triple to the main resonance mode, that is, the stronger the resonance.
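The reconstruction of S_ijk from CP factors can be sketched as follows. The sketch assumes the weights and factor vectors have already been produced by some CP solver, which is not shown. Note also that R as built in step S135 stores distances; whether it is converted to a similarity before decomposition is not stated here, and the sketch does not address that point. The function name is hypothetical.

```python
def resonance_intensity(weights, a, b, c, i, j, k):
    """S_ijk = sum over components r of lambda_r * a_r[i] * b_r[j] * c_r[k].

    weights: one lambda_r per component.
    a, b, c: lists of factor vectors, one vector per component,
             with lengths I, J, and K respectively.
    """
    return sum(
        lam * a_r[i] * b_r[j] * c_r[k]
        for lam, a_r, b_r, c_r in zip(weights, a, b, c)
    )
```

For example, with two components the intensity of a triple is the sum of two products, each weighted by its λ_r.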
[0042] Then, step S137 is executed, where each text modal association source, each image modal association source, and each video modal association source are paired to generate multiple cross-modal association source pairs, each corresponding to a comprehensive resonance intensity. Here, triples containing all three modal source points need to be generated. In fact, S_ijk in step S136 implicitly assigns a comprehensive resonance intensity to each possible triple (Txt_i, Img_j, Vid_k). Next, step S138 is executed, filtering out cross-modal association source pairs whose comprehensive resonance intensity exceeds a preset resonance threshold. The text modal association source, image modal association source, and video modal association source in each filtered cross-modal association source pair are triplet-bound to generate a resonance feature unit. All resonance feature units constitute the resonance feature unit set. A resonance threshold θ_resonance is set. All possible (i, j, k) combinations are traversed, and those combinations where S_ijk > θ_resonance are selected. For each selected combination (i*, j*, k*), the corresponding text modality correlation source (i*th row vector of the F_semantic matrix), image modality correlation source (j*th row vector of the V_appearance matrix), and video modality correlation source (k*th row vector of the T_scene matrix) are bound together to form a resonant feature unit U=(F_i*, V_j*, T_k*). These units together constitute the resonant feature unit set U_set. Each unit in this set represents a fragment of product information that is highly correlated and mutually corroborative in terms of textual function, visual appearance, and scene application.
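The traversal and triplet binding of steps S137 and S138 can be illustrated as below. The sketch assumes the intensities are held in a nested list s[i][j][k] and that each unit is represented as a dictionary; both the representation and the function name are hypothetical.

```python
def build_resonance_units(s, f_semantic, v_appearance, t_scene, theta):
    """Bind the row vectors of every triple whose intensity exceeds theta."""
    units = []
    for i, plane in enumerate(s):
        for j, row in enumerate(plane):
            for k, intensity in enumerate(row):
                if intensity > theta:
                    units.append(
                        {
                            "index": (i, j, k),
                            "intensity": intensity,
                            "unit": (f_semantic[i], v_appearance[j], t_scene[k]),
                        }
                    )
    return units
```

Keeping the index triple alongside the bound vectors allows later steps to look up the original time positions of each modality source.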
[0043] Simultaneously, to personalize the final creative and target the audience, it is necessary to analyze and model the target audience. Step S140 involves obtaining the historical interaction behavior trajectory of the target audience on the advertising platform, extracting the audience attention migration path from the historical interaction behavior trajectory, and constructing an audience attention field distribution map based on the audience attention migration path. This step includes a series of sub-steps.
[0044] Step S140 first executes step S141, retrieving historical interaction behavior logs of the target audience within a preset historical time period from the advertising platform's backend database. In this embodiment, the target audience is defined as a set of users who have viewed or clicked on smart home and home appliance advertisements within the past three months. All interaction behavior logs of these users within the specified time period are extracted from the database. Each log record includes: a unique audience identifier (UID) for the user, a timestamp accurate to the second recording the behavior, a unique identifier (AID) for the advertisement the user interacted with, and the interaction behavior type (Type), such as "Impression," "Click," "Stay" (for more than X seconds), or "Conversion" (e.g., placing an order). Next, step S142 executes, grouping the interaction behavior records in the historical interaction behavior logs according to the audience identifier, grouping interaction behavior records with the same audience identifier into the same audience's historical interaction behavior subset. Using the MapReduce approach, with the UID as the key, all log records are aggregated to obtain a list of individual behavior records for each user. Then, step S143 is executed, whereby the interaction records in each audience's historical interaction behavior subset are sorted in ascending order according to the timestamp of the behavior, generating a time-sorted interaction behavior list for each audience. For each user's list, a quicksort algorithm is used to sort the records using the timestamp as the key, ensuring that the behavior records are arranged in chronological order.
[0045] Next, step S144 is executed, traversing the time-sorted interaction behavior list for each audience member, and sequentially extracting the interacted ad identifier from each interaction behavior record to form the audience's initial ad interaction trajectory sequence. For example, for a user, the AIDs corresponding to their sorted interaction records are Ad_101, Ad_205, Ad_101, Ad_307, ..., then their initial trajectory is a sequence of ad IDs [Ad_101, Ad_205, Ad_101, Ad_307, ...]. Then, step S145 is executed, parsing the ad content type for each interacted ad identifier in the initial ad interaction trajectory sequence, determining the ad content type corresponding to each interacted ad identifier, and converting the initial ad interaction trajectory sequence into an audience attention type migration path arranged chronologically by ad content type identifiers. In the backend of the advertising platform, each ad ID is associated with rich metadata, including its content category tags, such as "refrigerator," "washing machine," "smart speaker," "kitchen appliance," "promotional offers," etc. By querying this metadata, each AID in the trajectory is replaced with its corresponding content type tag. Thus, the user's trajectory becomes ["refrigerator", "washing machine", "refrigerator", "smart speaker", ...]. This sequence of content type tags intuitively reflects the user's attention migration across different product categories or advertising themes.
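Steps S142 to S145 can be sketched in a few lines for a log small enough to fit in memory. The record field names ("uid", "ts", "aid") and the ad_types lookup table are assumptions made for illustration; the sketch relies on the language's built-in sort rather than a hand-written quicksort, and a production system would typically run the grouping as a distributed job.

```python
from collections import defaultdict


def attention_type_paths(logs, ad_types):
    """Group logs by user, sort by time, and map ad IDs to content type tags.

    logs: iterable of dicts with keys "uid", "ts" (seconds), and "aid".
    ad_types: dict mapping an ad ID to its content type tag.
    Returns {uid: [(ts, type_tag), ...]} in chronological order.
    """
    by_user = defaultdict(list)
    for record in logs:
        by_user[record["uid"]].append(record)

    paths = {}
    for uid, records in by_user.items():
        records.sort(key=lambda r: r["ts"])
        paths[uid] = [(r["ts"], ad_types[r["aid"]]) for r in records]
    return paths
```

Timestamps are kept next to each type tag because the subsequent smoothing step needs the duration of each attention span.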
[0046] Next, step S146 is executed to smooth the audience attention type migration path, removing noise type points in the path whose duration is less than a preset duration threshold, and generating a smoothed audience attention migration path. In a user's interaction sequence, there may be some accidental, brief clicks, such as a user accidentally clicking on an ad they are not interested in. To eliminate this noise, smoothing is performed. A time window T_window (e.g., 5 minutes) and a minimum duration threshold T_min (e.g., 30 seconds) are set. The user's time-sorted sequence is scanned. If the total duration of consecutive occurrences of a type tag (the time span from its first appearance to jumping to the next type) is less than T_min, the record is considered a noise point and removed from the sequence, or it is merged with the preceding and following mainstream types. For example, if a user is watching a refrigerator ad sequence and suddenly clicks on a washing machine ad that lasts only 10 seconds, and then immediately returns to the refrigerator ad, this "washing machine" point can be removed. The smoothed sequence more accurately reflects the user's continuously focused interest migration path.
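One possible reading of the smoothing of step S146 is sketched below. It assumes a path is a chronological list of (timestamp in seconds, type tag) pairs, removes any run of a type whose span until the next type is shorter than T_min, and merges the neighbouring runs when they share the same type. The last run is always kept, since its end time is unknown. The function name is hypothetical.

```python
def smooth_path(path, t_min):
    """Remove short-lived type runs from a chronological (ts, tag) path.

    Returns a list of (tag, start_ts) runs after noise removal and merging.
    """
    # Collapse consecutive identical tags into runs of [tag, start_ts].
    runs = []
    for ts, tag in path:
        if not runs or runs[-1][0] != tag:
            runs.append([tag, ts])

    kept = []
    for n, (tag, start) in enumerate(runs):
        is_last = n == len(runs) - 1
        if is_last or runs[n + 1][1] - start >= t_min:
            # Merge with the previous kept run when the tag is the same.
            if not kept or kept[-1][0] != tag:
                kept.append((tag, start))
    return kept
```

In the refrigerator example above, the 10-second "washing machine" run is dropped and the two surrounding "refrigerator" runs merge into one.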
[0047] After smoothing the trajectories of all audiences, step S147 is executed. The smoothed audience attention migration paths of all audiences are spatially superimposed, and the number of audiences appearing at each time point along each audience attention migration path is counted. The attention field strength value at each time point is calculated based on the number of audiences. First, the paths of all individuals need to be aligned to a unified time axis. This time axis can be absolute time (e.g., 24 hours in a day) or relative time (e.g., the time calculated from when a user first encounters a certain type of advertisement). To construct a universal attention field, an absolute time axis is more commonly used, but the differences in active times among different users need to be considered. One approach is to divide a 24-hour day into continuous time slices of length T_unit (e.g., 15 minutes). For each user's smoothed path, each time point along the path is traversed to determine which time slice that time point falls within. The number of users whose attention falls on a specific time slice t is counted as Count_t. Then, the attention field strength value Field_t at time slice t is calculated as Field_t = Count_t / total number of users, which is the normalized density of the advertisements that this group pays attention to within that time slice. Finally, step S148 is executed to construct the audience attention field distribution map with time as the horizontal axis and attention field strength value as the vertical axis. Each time point in the audience attention field distribution map corresponds to an attention field strength value. This distribution map is essentially a function or discrete sequence with time as the independent variable and field strength as the dependent variable. 
For example, it can be represented as {(t1, Field_1), (t2, Field_2), ..., (tN, Field_N)}, where t_i is the representative time point of the i-th time slice (such as the midpoint of the time slice), and Field_i is the calculated field strength value. This distribution map reveals the concentration and dispersion patterns of the target audience's attention in the time dimension; for example, a peak in attention may occur at a certain time during a weekday evening.
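The construction of the attention field in steps S147 and S148 can be sketched as follows, assuming an absolute 24-hour axis, timestamps given in seconds, and each user counted at most once per time slice. The input layout and the function name are hypothetical.

```python
def attention_field(user_times, t_unit=900, day=86400):
    """Build [(slice midpoint in seconds, Field_t), ...] over one day.

    user_times: {uid: [timestamp_seconds, ...]} taken from the smoothed paths.
    Field_t = (number of users active in slice t) / (total number of users).
    """
    n_slices = day // t_unit
    counts = [0] * n_slices
    for times in user_times.values():
        # A set ensures each user is counted once per slice.
        for s in {int(t % day) // t_unit for t in times}:
            counts[s] += 1
    total = len(user_times)
    return [((s + 0.5) * t_unit, counts[s] / total) for s in range(n_slices)]
```

With 15-minute slices the map has 96 points; a slice in which two of three users are active has a field strength of about 0.67.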
[0048] Next, step S150 is executed, projecting the set of resonant feature units onto the audience attention field distribution map, calculating the field strength coupling coefficient of each resonant feature unit in the audience attention field distribution map, and filtering the set of resonant feature units based on the field strength coupling coefficient to generate a field coupling feature unit sequence that matches the audience attention field distribution map. This step is a crucial link connecting the product's intrinsic characteristics with the audience's external attention distribution.
[0049] Step S150 first executes step S151, parsing the text modality association source, image modality association source, and video modality association source contained in each resonant feature unit, and extracting the original product information timestamp corresponding to each resonant feature unit. Recalling the composition of the resonant feature unit U, it is bound together by a row vector F_i from F_semantic, a row vector V_j from V_appearance, and a row vector T_k from T_scene. These three vectors originate from a text block in the original text stream, a frame in the original image stream, and a scene segment in the original video stream, respectively. The time index of the original data has been preserved when constructing these matrices. Therefore, the time position T_txt_i of the text segment corresponding to F_i in the text stream, the time position T_img_j of the image frame corresponding to V_j in the image stream, and the time position T_vid_k of the video scene corresponding to T_k in the video stream can be obtained. For a unit U, its original product information timestamp can be defined as an aggregation of these three time positions, such as taking the average of the three (T_txt_i+T_img_j+T_vid_k) / 3, or taking the median. This timestamp represents the approximate position of the product information described by the resonant feature unit on the overall timeline of the product introduction.
[0050] Next, step S152 is executed, mapping the original product information timestamp corresponding to each resonant feature unit to the time axis of the audience attention field distribution map, thus determining the projection time point of each resonant feature unit in the audience attention field distribution map. The time axis of the attention field distribution map is defined in absolute time (such as a moment in a day), whereas the original product information timestamp is a relative position within the product information package, so the two use different time references and a mapping is necessary. One mapping method is based on the advertising placement strategy, that is, on the time of day during which the advertisement is planned to be placed. Suppose the placement strategy places advertisements mainly during the prime-time period of 7 pm to 11 pm. Then, the internal time axis of the product information package can be linearly mapped to this placement period. If the total duration of the product information package is T_total, an internal timestamp t is mapped to t_proj = placement start time + (t / T_total) * placement period length. Thus, each resonant feature unit U obtains a projection time point t_proj on the audience attention field distribution map.
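The timestamp aggregation of step S151 and the linear mapping of step S152 reduce to two short formulas, sketched below with all times in seconds. The averaging variant is used for the aggregation, and the function names are hypothetical.

```python
def unit_timestamp(t_txt, t_img, t_vid):
    """Original product information timestamp as the mean of the three positions."""
    return (t_txt + t_img + t_vid) / 3


def project_time(t, t_total, placement_start, placement_length):
    """Linearly map an internal timestamp t in [0, t_total] to the placement period."""
    return placement_start + (t / t_total) * placement_length
```

For a 120-second product information package and a placement period from 7 pm to 11 pm, an internal timestamp of 30 seconds falls a quarter of the way into the period, that is, at 8 pm.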
[0051] Then, step S153 is executed, reading the attention field strength value corresponding to each projection time point from the audience attention field distribution map, and using this attention field strength value as the initial field strength coupling coefficient of the resonant feature unit. Based on t_proj, the time slice to which it belongs is found, and the field strength value Field(t_proj) of that time slice is read. This value Field_initial_coupling=Field(t_proj) is the initial field strength coupling coefficient of the unit, which measures the degree of fit between the unit and the audience's attention concentration if the unit is placed at this time point. Next, step S154 is executed, calculating the internal resonance consistency score between the text modality association source, image modality association source, and video modality association source in each resonant feature unit, and multiplying the internal resonance consistency score by the initial field strength coupling coefficient to generate the final field strength coupling coefficient of the resonant feature unit. The internal resonance consistency score measures the degree to which the three modal information within a unit corroborates and coordinates with each other. This score can be based on the comprehensive resonance intensity S_ijk of the unit calculated in step S136. Since S_ijk is already a numerical value reflecting the strength of the triplet association, it can be directly used as the internal resonance consistency score. Alternatively, the cosine similarity between each pair of F_i, V_j, and T_k can be calculated and then averaged. Here, S_ijk is used as the internal consistency score, Score_internal. The final field strength coupling coefficient C_final = Score_internal * Field_initial_coupling. This product ensures that the ultimately selected unit is both internally modally closely correlated and highly overlaps with the audience's peak attention periods.
[0052] Then, step S155 is executed, sorting all resonant feature units in descending order according to the final field strength coupling coefficient to generate a sorted list of resonant feature units. All units in U_set are sorted from largest to smallest according to their C_final values. Next, step S156 is executed, selecting resonant feature units from the sorted list whose final field strength coupling coefficient ranks within a predetermined range as candidate field coupling feature units. A number of units to be selected, N_candidate, is set, for example, N_candidate=50. The top 50 units in the sorted list are selected to form a candidate set U_candidate, representing the most valuable product information fragments for generating creative ideas. Finally, step S157 is executed, rearranging the candidate field coupling feature units according to their projection time point t_proj to generate a field coupling feature unit sequence matching the audience attention field distribution map. The units in U_candidate are sorted according to the order of their projection time points t_proj calculated in step S152, forming an ordered sequence U_seq=[U_seq_1, U_seq_2, ..., U_seq_N]. The order of this sequence indicates that in the final advertising creative, these information fragments will be presented sequentially according to the natural flow of audience attention timeline.
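Steps S153 to S157 can be combined into one short routine, sketched here under the assumption that each unit is a dictionary carrying its projection time "t_proj" and its internal score "score_internal", and that the field is a list of strengths indexed by time slice. These names are hypothetical.

```python
def field_coupled_sequence(units, field, t_unit, n_candidate):
    """Score units by C_final, keep the top N, and order them by t_proj.

    units: list of dicts with keys "t_proj" and "score_internal".
    field: list of attention field strengths, one per time slice of length t_unit.
    """
    scored = []
    for unit in units:
        field_strength = field[int(unit["t_proj"] // t_unit)]
        scored.append({**unit, "c_final": unit["score_internal"] * field_strength})

    top = sorted(scored, key=lambda u: u["c_final"], reverse=True)[:n_candidate]
    return sorted(top, key=lambda u: u["t_proj"])
```

Sorting twice keeps the two concerns separate: the first sort decides which units are selected, and the second decides the order in which they are presented.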
[0053] At this point, we have obtained an ordered field-coupled feature unit sequence U_seq that highly matches the audience's attention field. Next, we execute step S160, inputting the field-coupled feature unit sequence into a preset creative generation decoder. The creative generation decoder outputs a text creative fragment sequence, an image creative fragment sequence, and a video creative fragment sequence. We then interweave and splice the text creative fragment sequence, the image creative fragment sequence, and the video creative fragment sequence according to the arrangement order of the field-coupled feature unit sequence to generate personalized advertising creative units.
[0054] This step first requires using a creative generation decoder to decode the abstract resonant feature units back into specific, presentable creative fragments. Step S160 further includes sub-steps S161 to S164, which describe the decoding process and the final splicing generation, respectively.
[0055] In step S161, each field coupling feature unit in the field coupling feature unit sequence is sequentially input into the text decoding branch of the creative generation decoder. The text decoding branch outputs the text creative fragment corresponding to each field coupling feature unit. All text creative fragments are arranged in the order of the field coupling feature unit sequence to form a text creative fragment sequence.
[0056] Specifically, the creative generation decoder is a multimodal, multi-branch generative model. Its text decoding branch is a Transformer-based decoder structure. The input field-coupled feature unit U_seq_n contains a text feature vector F_i, an image feature vector V_j, and a video feature vector T_k. The input to the text decoding branch includes not only U_seq_n itself but typically also a start symbol. This branch consists of multiple cascaded text decoding layers. In each text decoding layer, a self-attention mechanism first encodes the partially generated text sequence, followed by a cross-attention mechanism in which the partially generated text serves as the query and the joint representation of U_seq_n serves as the key and value, so that relevant information is extracted from the input features. After passing through the multiple decoding layers, the last decoding layer outputs a probability distribution covering all words in the vocabulary. A greedy search or beam search algorithm is used to select the next word from this probability distribution, for example the word with the highest probability. This process is repeated until an end symbol is generated. The final generated sequence of words constitutes the corresponding text creative fragment Txt_seg_n. This operation is performed on all N units U_seq_n sequentially to obtain the text creative fragment sequence [Txt_seg_1, Txt_seg_2, ..., Txt_seg_N].
[0057] Simultaneously, step S162 is executed, in which each field coupling feature unit in the field coupling feature unit sequence is sequentially input into the image decoding branch of the creative generation decoder, and the image creative fragment corresponding to each field coupling feature unit is output through the image decoding branch. All image creative fragments are arranged in the order of the field coupling feature unit sequence to form an image creative fragment sequence.
[0058] The image decoding branch can employ an image generator based on a conditional generative adversarial network (GAN) or a diffusion model. For the input U_seq_n, it is first mapped to a latent space vector through a fully connected network, which serves as the conditional input to the generator. If a diffusion model is used, this conditional vector guides the inverse denoising process. The generator starts with random noise and, guided by the conditions, progressively denoises, ultimately generating an image that is semantically consistent with the input feature V_j and incorporates the functional and scene information implied by F_i and T_k. The generated image Img_seg_n is the image creative fragment corresponding to this unit. This process is repeated for all U_seq_n to obtain the image creative fragment sequence [Img_seg_1, Img_seg_2, ..., Img_seg_N].
[0059] Simultaneously, step S163 is executed, in which each field coupling feature unit in the field coupling feature unit sequence is sequentially input into the video decoding branch of the creative generation decoder, and the video creative segment corresponding to each field coupling feature unit is output through the video decoding branch. All video creative segments are arranged in the order of the field coupling feature unit sequence to form a video creative segment sequence.
[0060] The video decoding branch can employ a generative model based on 3D convolution and temporal attention mechanisms, such as VideoTransformer. For the input U_seq_n, its feature vector T_k is itself a spatiotemporal feature of a scene segment. The video decoding branch uses this feature as an initial state or condition to generate video content frame by frame. It may first generate keyframes, then generate intermediate frames through a frame interpolation network, and finally generate a short video segment Video_seg_n, which not only reproduces the core actions of the original demonstration scene, but may also incorporate selling information from the text description. By operating on all U_seq_n in sequence, a sequence of creative video segments [Video_seg_1, Video_seg_2, ..., Video_seg_N] is obtained.
[0061] Then, step S164 is executed, where the text creative fragment sequence, the image creative fragment sequence, and the video creative fragment sequence are interwoven and spliced according to the arrangement order of the field coupling feature unit sequence to generate a personalized advertising creative unit. This step merges the fragments of the three modalities into a coherent, multi-sensory advertisement. More detailed operations are involved in this process.
[0062] First, step S1641 is executed to obtain the projection time point corresponding to each field coupling feature unit in the field coupling feature unit sequence, and the expected presentation duration of each field coupling feature unit in the final advertising creative is calculated based on the projection time point. Referring back to step S152, each unit U_seq_n has a projection time point t_proj_n in the audience attention field, and the entire sequence is sorted by t_proj. The projection time difference Δt_n = t_proj_{n+1} - t_proj_n between two adjacent units can be calculated. This time difference can be considered as the time it takes for the audience's attention to shift from the information in unit n to the information in unit n+1. Therefore, Δt_n can be used as the expected presentation duration of the creative segment of unit U_seq_n in the final advertisement. For the last unit, a default duration can be set or the previous difference can be referenced.
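The duration calculation of step S1641 can be sketched as follows. The sketch assumes the projection time points are already sorted and, for the last unit, either uses a supplied default or reuses the previous difference. The function name and the default_last parameter are hypothetical.

```python
def presentation_durations(t_proj, default_last=None):
    """Expected presentation duration of each unit from adjacent t_proj gaps.

    t_proj: projection time points in ascending order.
    default_last: duration for the last unit; if None, the previous gap is
                  reused (or 0.0 when there is only one unit).
    """
    if not t_proj:
        return []
    deltas = [later - earlier for earlier, later in zip(t_proj, t_proj[1:])]
    if default_last is None:
        default_last = deltas[-1] if deltas else 0.0
    return deltas + [default_last]
```

For projection time points 0, 3, and 8, the first two units receive durations 3 and 5, and the last unit reuses 5 unless a default is supplied.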
[0063] Then, step S1642 is executed to perform semantic unit segmentation on the text creative fragment corresponding to each field coupling feature unit, generating a text semantic unit sequence. Based on the expected presentation duration, a target presentation duration is assigned to each text semantic unit, forming a text creative sub-fragment sequence. For the text fragment Txt_seg_n, natural language processing techniques, such as dependency parsing or a pre-trained language model, can be used to segment it into multiple semantically complete short sentences or phrases, such as "ultra-large capacity," "frequency conversion energy saving," and "precise temperature control." Assume that M text semantic units [TxtUnit_1, ..., TxtUnit_M] are segmented. Based on the expected presentation duration Δt_n, a sub-presentation duration is assigned to each text semantic unit. For example, if evenly distributed, each sub-duration = Δt_n / M. The aforementioned text unit sequence with duration annotations is the text creative sub-fragment sequence.
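The duration assignment in step S1642 can be sketched as follows. The even split corresponds to the example given above; the optional `weights` argument (for instance, character counts of each unit) is an assumption added for illustration and is not specified in the description.

```python
from typing import List, Optional, Sequence, Tuple


def allocate_durations(units: Sequence[str],
                       delta_t: float,
                       weights: Optional[Sequence[float]] = None
                       ) -> List[Tuple[str, float]]:
    """Split the expected presentation duration delta_t across semantic units.

    With no weights, every unit receives delta_t / M (the even split).
    With weights, each unit receives a share proportional to its weight.
    """
    if not units:
        return []
    if weights is None:
        weights = [1.0] * len(units)
    if len(weights) != len(units):
        raise ValueError("weights must have one entry per unit")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [(unit, delta_t * w / total) for unit, w in zip(units, weights)]
```

The same helper could, in principle, be reused for the image visual units of step S1643, since that step assigns sub-presentation durations in the same way.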
[0064] Next, step S1643 is executed, whereby the image creative fragment corresponding to each field-coupled feature unit is divided into visual units to generate an image visual unit sequence. Based on the expected presentation duration, a target presentation duration is assigned to each image visual unit, forming an image creative sub-fragment sequence. The image creative fragment Img_seg_n itself may be a static image, but it can be decomposed into multiple visual focal points. For example, through a visual attention model, salient regions in the image can be identified, such as the front of the refrigerator, the open door, the internal vegetable compartment, the control panel, etc. These salient regions can be cropped out separately, or focused sequentially through animation effects (such as zooming in or panning), forming multiple visual units [ImgUnit_1, ..., ImgUnit_P]. Similarly, based on Δt_n, a sub-presentation duration is assigned to each visual unit, forming an image creative sub-fragment sequence.
[0065] Then, step S1644 is executed, whereby the video creative segment corresponding to each field-coupled feature unit is divided into a sequence of video creative sub-segments according to the expected presentation duration. Each video creative sub-segment corresponds to video content with a unit presentation duration. The video segment Video_seg_n itself is already temporally sequential. It can be regarded as a whole, or it can be divided into multiple shorter shot sequences [VideoUnit_1, ..., VideoUnit_Q] according to shot transitions or action completion. The above shots naturally correspond to different sub-presentation durations.
[0066] Next, step S1645 is executed to perform intermodal alignment of the text creative sub-fragment sequence, image creative sub-fragment sequence, and video creative sub-fragment sequence corresponding to each field-coupled feature unit, so that the text creative sub-fragments, image creative sub-fragments, and video creative sub-fragments within the same time period are semantically complementary. For example, when a video sub-fragment shows the internal space of a refrigerator, the aligned text sub-fragment can display "extra-large storage space," and the image sub-fragment can highlight the internal volume diagram. The above alignment can be achieved by using a dynamic time warping algorithm to adjust the start and end times of each modal sub-fragment on the time axis, so that their combined effect is optimal and semantically consistent.
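The description names dynamic time warping but does not specify the features or the cost function. The sketch below is therefore only one possible reading: it assumes each sub-fragment is represented by a numeric feature vector (a hypothetical embedding) and uses Euclidean distance as the local cost. It returns the warping path that pairs sub-fragments of two modalities; the paired indices could then be used to adjust start and end times.

```python
import math
from typing import List, Sequence, Tuple


def dtw_path(a: Sequence[Sequence[float]],
             b: Sequence[Sequence[float]]
             ) -> Tuple[float, List[Tuple[int, int]]]:
    """Classic dynamic time warping between two feature-vector sequences.

    Returns the total alignment cost and the list of (index_in_a, index_in_b)
    pairs along the optimal warping path.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = math.inf
    # cost[i][j] = minimal cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean local cost (assumed)
            cost[i][j] = d + min(cost[i - 1][j],      # a advances alone
                                 cost[i][j - 1],      # b advances alone
                                 cost[i - 1][j - 1])  # both advance
    # Backtrack from the end to recover the path.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [(cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1)]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    path.reverse()
    return cost[n][m], path
```

For example, aligning three text sub-fragments against two video sub-fragments yields a path in which two text sub-fragments share one video sub-fragment, which indicates that their time spans should fall within that video sub-fragment's span.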
[0067] Then, step S1646 is executed: according to the arrangement order of the field coupling feature unit sequence, the aligned creative sub-segments corresponding to all field coupling feature units are concatenated end-to-end to generate a multimodal interwoven initial creative long sequence. The text, image, and video sub-segments of the first unit U_seq_1 are arranged according to the aligned timeline, and then the corresponding sub-segments of the second unit U_seq_2 are concatenated, and so on. In this way, a fully interwoven initial creative long sequence containing a text caption sequence, an image / animation sequence, and a background / main video sequence is generated.
[0068] Next, step S1647 is executed to perform sequence smoothing on the initial creative long sequence. Semantic transition points between adjacent creative sub-segments in the initial creative long sequence are detected, and semantic transition sub-segments are inserted at these points to generate a smoothed creative long sequence. Since the product selling points focused on by different units may differ significantly (e.g., jumping from "energy saving" to "intelligent control"), direct splicing may appear abrupt. The semantic similarity of the text segments of adjacent units is analyzed (e.g., by calculating the cosine similarity of their word vectors); if the similarity is below a threshold, the boundary is determined to be a semantic transition point. At the transition point, transitional content can be dynamically generated or selected from a material library, such as a fade-in/fade-out transition animation accompanied by transitional text such as "Not only that, it can also...", making the entire creative narrative smoother and more natural.
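The transition-point detection in step S1647 can be sketched as below. The sketch assumes each unit's text segment has already been mapped to a vector by some embedding model (not specified in the description); the function names and the threshold value of 0.5 are illustrative assumptions.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_transition_points(embeddings: Sequence[Sequence[float]],
                           threshold: float = 0.5) -> List[int]:
    """Return indices n where the boundary between unit n and unit n+1
    has similarity below the threshold, i.e. a semantic transition point."""
    return [n for n in range(len(embeddings) - 1)
            if cosine_similarity(embeddings[n], embeddings[n + 1]) < threshold]
```

A transitional sub-segment would then be inserted after each returned index.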
[0069] Finally, step S1648 is executed to encapsulate the smoothed creative long sequence into a data format conforming to the advertising platform's delivery interface specifications, generating the personalized advertising creative unit. Video streams, image resources, timeline caption files, etc., are encapsulated according to the format required by the target advertising platform (such as a feed advertising platform or short video platform), for example, an MP4 container, and the corresponding metadata and click-through links are embedded. Ultimately, a complete personalized advertising creative unit that can be directly used for delivery is generated.
[0070] Optionally, to further improve advertising effectiveness and creative quality, this method may include a creative optimization and iteration step S170, which is performed after the creative unit is generated in step S160.
[0071] Step S170 involves pushing the personalized ad creative unit to multiple test traffic channels for limited traffic distribution, obtaining feedback data for effect evaluation and correction. This specifically includes the following sub-steps.
[0072] First, step S171 is executed, pushing the personalized ad creative unit to multiple test traffic channels for limited traffic distribution. Click-through rate (CTR), conversion rate (CVR), and user dwell time (Stay_Duration) data for the personalized ad creative unit during the limited traffic distribution period are obtained. For example, the generated creative unit is simultaneously distributed to three small traffic pools (A, B, and C), with a small budget allocated to each pool, and the distribution period is 24 hours. After the distribution ends, the total number of impressions and total number of clicks for the creative are obtained from the data reports of each channel, and the CTR is calculated as CTR = total clicks / total impressions. Conversion data, such as the number of orders or registrations, is obtained, and the CVR is calculated as CVR = total conversions / total clicks (or total impressions). Simultaneously, the average dwell time (Stay_Duration) of all users on the creative is obtained.
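The metric calculation in step S171 can be illustrated with the following sketch, which aggregates the reports of several traffic pools. The `PoolReport` structure and its field names are hypothetical. CVR is computed here over clicks (the description also permits impressions as the denominator), and dwell time is averaged per impression, which is an assumption because the description does not define the averaging base.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class PoolReport:
    """Data report of one test traffic pool (hypothetical structure)."""
    impressions: int
    clicks: int
    conversions: int
    total_dwell_seconds: float


def aggregate_metrics(reports: Iterable[PoolReport]) -> Dict[str, float]:
    """Compute CTR, CVR, and average dwell time across all pools."""
    reports = list(reports)
    impressions = sum(r.impressions for r in reports)
    clicks = sum(r.clicks for r in reports)
    conversions = sum(r.conversions for r in reports)
    dwell = sum(r.total_dwell_seconds for r in reports)
    return {
        # CTR = total clicks / total impressions
        "ctr": clicks / impressions if impressions else 0.0,
        # CVR = total conversions / total clicks (click-based variant)
        "cvr": conversions / clicks if clicks else 0.0,
        # Average dwell time per impression (assumed averaging base)
        "stay_duration": dwell / impressions if impressions else 0.0,
    }
```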
[0073] Next, step S172 is executed, whereby the click-through rate data, conversion rate data, and user dwell time data are input into a preset creative performance evaluation model. The model then outputs a comprehensive performance score for the personalized advertising creative unit. The creative performance evaluation model can be a simple weighted summation model or a complex machine learning model. For example, the model can be defined as: Score = w1*(CTR - μ_CTR) / σ_CTR + w2*(CVR - μ_CVR) / σ_CVR + w3*(Stay_Duration - μ_Stay) / σ_Stay, where the weights w1, w2, and w3 are preset according to business objectives, and μ and σ are, respectively, the mean and standard deviation of the corresponding indicator, taken from industry or historical data and used for standardization. The Score output by the model is the comprehensive performance score of the creative.
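The weighted summation model given as an example in step S172 can be written as the short sketch below. The dictionary keys and the function name are illustrative; the baseline values (μ, σ) and the weights would be supplied from business data and are not specified in the description.

```python
from typing import Dict, Tuple


def performance_score(metrics: Dict[str, float],
                      baselines: Dict[str, Tuple[float, float]],
                      weights: Dict[str, float]) -> float:
    """Score = sum over indicators of w * (value - mu) / sigma.

    baselines maps each indicator name to its (mu, sigma) pair.
    """
    score = 0.0
    for name, w in weights.items():
        mu, sigma = baselines[name]
        if sigma <= 0:
            raise ValueError(f"sigma for {name!r} must be positive")
        score += w * (metrics[name] - mu) / sigma
    return score
```

With CTR one standard deviation above its baseline, CVR at its baseline, Stay_Duration half a standard deviation above, and weights of 0.5, 0.3, and 0.2, the score is 0.5*1 + 0.3*0 + 0.2*0.5 = 0.6.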
[0074] Then, step S173 is executed, comparing the overall performance score with a preset performance achievement threshold. A threshold T_performance is set, for example, 0.75 (if the score is normalized to between 0 and 1). It is determined whether the score is greater than T_performance. When the overall performance score is greater than the performance achievement threshold, step S174 is executed, marking the personalized advertising creative unit as a target creative unit and storing it in the target creative library. If the creative performance is excellent, it is stored in a dedicated high-quality creative library for subsequent large-scale deployment or as a reference for other creatives.
[0075] When the overall performance score is not greater than the performance achievement threshold, step S175 is executed to trace the creative defects of the personalized advertising creative unit, identifying the creative defect modality and location that lead to the low overall performance score. This can be achieved by A/B testing each component of the creative unit or by using interpretability analysis techniques. For example, the text in the original creative can be replaced with another version of the text, the image with another image, and the video with another segment. Small-scale tests are then conducted to observe which replacement leads to a significant improvement in performance, thereby locating the defective modality (text, image, or video) and its approximate location on the creative timeline, that is, the field-coupled feature unit whose generated segment is defective. The defect location is marked as Pos_flaw.
[0076] Next, step S176 is executed: based on the creative defect modality and the creative defect location, a target reference creative unit matching the creative defect modality is extracted from the target creative library, and a reference creative segment corresponding to the location is extracted from the target reference creative unit. From the target creative library established in step S174, successful creative cases similar to the current product and targeting a similar audience are searched. A reference creative unit is extracted from these. Then, based on the defect modality and location, a creative segment of the same modality at the time point corresponding to the defect location is extracted from this reference creative unit. For example, if it is found that the video portion of the third unit has poor performance, then the video segment Ref_Video_seg_3 of the third unit is extracted from the reference creative.
[0077] Finally, step S177 is executed, replacing the defective creative fragment at the corresponding position in the personalized ad creative unit with the reference creative fragment to generate a corrected personalized ad creative unit. This corrected personalized ad creative unit is then re-pushed to the test traffic channel for further limited traffic delivery. The Video_seg_3 corresponding to the original creative unit U_seq_3 is replaced with Ref_Video_seg_3. The corrected creative unit is repackaged and generated, and then tested again according to step S171 to evaluate its effectiveness until the desired effect is achieved or the number of iterations is exhausted. This closed-loop optimization process continuously improves the quality and effectiveness of ad creatives.
[0078] In some embodiments, the personalized advertising creative generation system based on multimodal content understanding for performing the above-described method can be any electronic device with data computing, processing, and storage capabilities. This personalized advertising creative generation system based on multimodal content understanding can be used to implement the personalized advertising creative generation method based on multimodal content understanding provided in the above embodiments.
[0079] Typically, a personalized advertising creative generation system based on multimodal content understanding includes a processor and memory. The processor may include one or more processing cores, such as a quad-core processor or an octa-core processor. The processor can be implemented using at least one hardware form of DSP (Digital Signal Processor), FPGA (Field Programmable Gate Array), or PLA (Programmable Logic Array). The processor may also include a main processor and coprocessors. The main processor, also known as a CPU (Central Processing Unit), is used to process data in the wake-up state; the coprocessor is a low-power processor used to process data in the standby state. In some embodiments, the processor may integrate a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the screen. In some embodiments, the processor may also include an AI (Artificial Intelligence) processor, which handles computational operations related to machine learning.
[0080] The memory may include one or more computer-readable storage media, which may be non-transitory. The memory may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash memory devices. In some embodiments, the non-transitory computer-readable storage media in the memory is used to store a computer program configured to be executed by one or more processors to implement the above-described method for generating personalized advertising creatives based on multimodal content understanding.
[0081] In an illustrative embodiment, a computer-readable storage medium is also provided, wherein a computer program is stored in the storage medium, and the computer program, when executed by a processor of a computer device, implements the aforementioned personalized advertising creative generation method based on multimodal content understanding. Optionally, the aforementioned computer-readable storage medium may be ROM (Read-Only Memory), RAM (Random Access Memory), CD-ROM (Compact Disc Read-Only Memory), magnetic tape, floppy disk, or an optical data storage device, etc.
[0082] This application provides a computer program product, which includes computer-executable instructions or a computer program. When the computer-executable instructions or the computer program are executed by a processor, the processor will execute the personalized advertising creative generation method based on multimodal content understanding provided in this application.
[0083] This application provides a computer-readable storage medium storing computer-executable instructions or computer programs. When the computer-executable instructions or computer programs are executed by a processor, the processor will execute the personalized advertising creative generation method based on multimodal content understanding provided in this application.
[0084] In some embodiments, the computer-readable storage medium may be a read-only memory (ROM), random access memory (RAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory, magnetic surface memory, optical disk, or CD-ROM, etc.; or it may be a device that includes one or any combination of the above-mentioned memories.
[0085] In some embodiments, computer-executable instructions may take the form of programs, software, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
[0086] As an example, computer-executable instructions may, but do not necessarily, correspond to files in a file system. They may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple co-located files (e.g., files that store one or more modules, subroutines, or code sections).
[0087] As an example, computer-executable instructions can be deployed to execute on a single electronic device, or on multiple electronic devices located at one location, or on multiple electronic devices distributed across multiple locations and interconnected via a communication network.
[0088] Finally, it should be noted that the above-disclosed embodiments are merely preferred embodiments of the present invention and are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A method for generating personalized advertising creatives based on multimodal content understanding, characterized in that the method includes: receiving an original product information package sent by an advertiser, the original product information package including a product attribute description text stream, a product real-object image stream, and a product function demonstration video stream; extracting core product function keywords from the product attribute description text stream and constructing a functional semantic association matrix based on the core product function keywords; extracting product appearance visual elements from the product real-object image stream and constructing an appearance visual association matrix based on the product appearance visual elements; extracting product usage scenario time-sequence segments from the product function demonstration video stream and constructing a scene temporal association matrix based on the product usage scenario time-sequence segments; inputting the functional semantic association matrix, the appearance visual association matrix, and the scene temporal association matrix into a preset three-dimensional association aligner for cross-modal association resonance processing to generate a set of resonance feature units whose resonance intensity exceeds a preset resonance threshold; 
obtaining the historical interaction behavior trajectory of the target audience on the advertising platform, extracting an audience attention migration path from the historical interaction behavior trajectory, and constructing an audience attention field distribution map based on the audience attention migration path; projecting the set of resonance feature units onto the audience attention field distribution map, calculating the field strength coupling coefficient of each resonance feature unit in the audience attention field distribution map, filtering the set of resonance feature units according to the field strength coupling coefficient, and generating a field coupling feature unit sequence matching the audience attention field distribution map; inputting the field coupling feature unit sequence into a preset creative generation decoder, which outputs a text creative fragment sequence, an image creative fragment sequence, and a video creative fragment sequence; and interweaving and splicing the text creative fragment sequence, the image creative fragment sequence, and the video creative fragment sequence according to the arrangement order of the field coupling feature unit sequence to generate a personalized advertising creative unit.
2. The personalized advertising creative generation method based on multimodal content understanding according to claim 1, characterized in that, The step of extracting core product function keywords from the product attribute description text stream and constructing a functional semantic association matrix based on the core product function keywords includes: The product attribute description text stream is filtered for text stream stop words to remove stop words that do not have actual semantic meaning, and a filtered product attribute description text stream is generated. The filtered product attribute description text stream is segmented into multiple independent text word units. Each text vocabulary unit is input into a pre-trained part-of-speech tagger, which outputs the part-of-speech tag for each text vocabulary unit. Text vocabulary units with part-of-speech tags of nouns or verbs are selected from all text vocabulary units as candidate functional keywords. Each candidate functional keyword is input into a pre-trained semantic role analyzer to identify the semantic role that each candidate functional keyword plays in the product attribute description text stream. Candidate functional keywords with semantic roles as core arguments are selected from the candidate functional keywords as core product functional keywords. Calculate the co-occurrence distance between every two core product function keywords in the product attribute description text stream, and construct a core product function keyword co-occurrence matrix based on the co-occurrence distance. The row index and column index of the core product function keyword co-occurrence matrix are both core product function keywords, and the matrix element values are the reciprocals of the corresponding co-occurrence distances. 
Matrix spectral decomposition is performed on the co-occurrence matrix of the core product function keywords to extract the eigenvalues and corresponding eigenvectors of the co-occurrence matrix of the core product function keywords. The first few eigenvectors are selected as basis vectors based on the magnitude of the eigenvalues. The projection coefficient of each core product function keyword onto the basis vector is used as the association dimension coordinate of the core product function keyword. The functional semantic association matrix is constructed based on the association dimension coordinates of all core product function keywords. Each row of the functional semantic association matrix corresponds to the association dimension coordinate vector of a core product function keyword.
3. The personalized advertising creative generation method based on multimodal content understanding according to claim 1, characterized in that, The step of extracting product appearance visual elements from the product's actual photographic image stream and constructing an appearance visual association matrix based on these elements includes: Each frame of the product image in the product image stream is converted to a different color space, transforming each frame from the original color space to a preset target color space to generate a product image in the target color space. Perform image frequency domain transformation on each frame of the product photographed in the target color space, converting each frame of the product photographed from the spatial domain to the frequency domain, and generating the frequency domain spectrum of each frame of the product photographed. Extract the concentrated spectral energy region from the frequency domain spectrum of each frame of the product image, and identify the frequency component corresponding to the concentrated spectral energy region as the main frequency component of that frame of the product image. Based on the main frequency components, each frame of the product photograph is reconstructed, and secondary frequency components other than the main frequency components are filtered out to generate a visual main frequency reconstructed image of each frame of the product photograph. Edge contours are extracted from the visual main frequency reconstructed image of each frame of the product real-object image. Pixels with abrupt changes in grayscale gradient in the visual main frequency reconstructed image are identified and connected to form a closed contour line, which serves as the product appearance contour line of that frame of the product real-object image. 
Based on the area enclosed by the product's outline, the main product region image is segmented from the corresponding visual main frequency reconstructed image. The color histogram distribution features, texture periodic distribution features, and shape invariant moment features of the main product region image are extracted as the product appearance visual elements of the actual product photograph image of that frame. The visual elements of the product appearance corresponding to all frames of real product images are vectorized to generate a visual element vector corresponding to each product appearance visual element. All visual element vectors are arranged in frame order to construct the appearance visual association matrix. Each row of the appearance visual association matrix corresponds to a visual element vector of a frame of real product images.
4. The personalized advertising creative generation method based on multimodal content understanding according to claim 1, characterized in that, The step of extracting time-series segments of product usage scenarios from the product function demonstration video stream and constructing a scene time-series correlation matrix based on the time-series segments of product usage scenarios includes: The video stream scene switching detection is performed on the product function demonstration video stream. The inter-frame difference between adjacent video frames in the product function demonstration video stream is calculated. The positions of adjacent frames whose inter-frame difference exceeds the preset scene switching threshold are marked as scene switching points. Based on the scene switching point, the product function demonstration video stream is divided into multiple initial video scene segments, each of which contains a set of video frame sequences with coherent scenes. For each initial video scene segment, keyframes within the scene are extracted. The intra-frame information entropy of each video frame in each initial video scene segment is calculated. Several video frames with the highest intra-frame information entropy are selected from each initial video scene segment as key representative frames of that initial video scene segment. Arrange all key representative frames of each initial video scene segment in chronological order to generate a key representative frame sequence for each initial video scene segment. Each key representative frame sequence is input into a pre-trained scene semantic encoder. The key representative frame sequence is then subjected to joint spatiotemporal feature extraction through a 3D convolutional layer in the scene semantic encoder to generate a scene spatiotemporal feature tensor for each initial video scene segment. 
Tensor dimensionality reduction is performed on the scene spatiotemporal feature tensor of each initial video scene segment, and the scene spatiotemporal feature tensor is compressed into a one-dimensional scene feature vector, which serves as the product usage scene time sequence segment corresponding to the initial video scene segment. Arrange all the product usage scenario time-series segments corresponding to the initial video scene segments in the order in which the scenes appear in the video stream, and construct the scene time-series association matrix. Each row of the scene time-series association matrix corresponds to a scene feature vector of a product usage scenario time-series segment.
5. The personalized advertising creative generation method based on multimodal content understanding according to claim 1, characterized in that, The step of inputting the functional semantic association matrix, the appearance visual association matrix, and the scene temporal association matrix into a preset 3D association aligner for cross-modal association resonance processing to generate a set of resonance feature units with resonance intensity exceeding a preset resonance threshold includes: Each row vector in the functional semantic association matrix is used as a text modality association source, each row vector in the appearance visual association matrix is used as an image modality association source, and each row vector in the scene temporal association matrix is used as a video modality association source. Calculate the first cross-modal association distance between each text modality association source and each image modality association source, and construct a text-image association distance matrix based on the first cross-modal association distance; Calculate the second cross-modal association distance between each text modal association source and each video modal association source, and construct a text-video association distance matrix based on the second cross-modal association distance; Calculate the third cross-modal association distance between each image modal association source and each video modal association source, and construct an image-video association distance matrix based on the third cross-modal association distance; The text-image association distance matrix, the text-video association distance matrix, and the image-video association distance matrix are input into the resonance kernel calculation unit of the three-dimensional association aligner. 
The three-dimensional tensor superposition of the three association distance matrices is performed by the resonance kernel calculation unit to generate a three-dimensional association resonance tensor. Tensor decomposition is performed on the three-dimensional correlation resonance tensor to extract the principal component factors of the three-dimensional correlation resonance tensor. Based on the principal component factors, the comprehensive resonance intensity between each text modality correlation source, each image modality correlation source, and each video modality correlation source is calculated. Each text modal correlation source, each image modal correlation source, and each video modal correlation source are paired to generate multiple cross-modal correlation source pairs, and each cross-modal correlation source pair corresponds to a comprehensive resonance intensity; Cross-modal correlation source pairs with a comprehensive resonance intensity exceeding a preset resonance threshold are selected. The text modal correlation source, image modal correlation source, and video modal correlation source in each selected cross-modal correlation source pair are bound into triplet pairs to generate a resonance feature unit. All resonance feature units constitute the resonance feature unit set.
6. The personalized advertising creative generation method based on multimodal content understanding according to claim 1, characterized in that the step of obtaining the historical interaction behavior trajectory of the target audience on the advertising platform, extracting the audience attention migration path from the historical interaction behavior trajectory, and constructing an audience attention field distribution map based on the audience attention migration path includes:
The historical interaction behavior log of the target audience within a preset historical time period is obtained from the backend database of the advertising platform. Each interaction behavior record in the log includes the audience identifier, behavior timestamp, interacted advertisement identifier, and interaction behavior type.
The interaction behavior records in the log are grouped by audience identifier, so that the records with the same audience identifier form the historical interaction behavior subset of that audience.
Within each audience's historical interaction behavior subset, the records are sorted in ascending order of behavior timestamp to generate a time-sorted interaction behavior list for that audience.
Each audience's time-sorted interaction behavior list is traversed, and the interacted advertisement identifier is extracted from each record to form the initial advertisement interaction trajectory sequence of that audience.
The advertisement content type corresponding to each interacted advertisement identifier in the initial advertisement interaction trajectory sequence is determined, and the sequence is converted into an audience attention type migration path consisting of advertisement content type identifiers in chronological order.
The audience attention type migration path is smoothed by removing the noise type points whose duration is less than a preset duration threshold, generating a smoothed audience attention migration path.
The smoothed audience attention migration paths of all audiences are spatially superimposed, and the number of audiences appearing at each time position is counted.
The attention field strength value at each time position is calculated from the number of audiences.
The audience attention field distribution map is constructed with time as the horizontal axis and attention field strength value as the vertical axis, so that each time point in the map corresponds to one attention field strength value.
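As a rough illustration, the grouping, sorting, smoothing, and counting steps above could be written as the following Python sketch. The record layout `(audience_id, timestamp, ad_id, behavior_type)`, the `ad_types` lookup table, the definition of a point's duration as the gap to the audience's next record, the fixed-width time buckets, and the normalization of the field strength by the total number of audiences are all assumptions made for the example. The function name `attention_field` is hypothetical.

```python
from collections import defaultdict


def attention_field(logs, ad_types, min_duration, bucket):
    """Build {time bucket: attention field strength} from raw interaction logs."""
    # Group records by audience identifier.
    by_audience = defaultdict(list)
    for audience_id, timestamp, ad_id, _behavior in logs:
        by_audience[audience_id].append((timestamp, ad_id))

    audiences_per_slot = defaultdict(set)
    for audience_id, records in by_audience.items():
        records.sort()  # ascending by timestamp
        # Convert the ad trajectory into a content-type migration path.
        path = [(timestamp, ad_types[ad_id]) for timestamp, ad_id in records]
        # Smooth: drop points whose dwell time is below the threshold.
        # Assumption: dwell time is the gap to the next point; the last point is kept.
        smoothed = []
        for index, (timestamp, kind) in enumerate(path):
            is_last = index == len(path) - 1
            if is_last or path[index + 1][0] - timestamp >= min_duration:
                smoothed.append((timestamp, kind))
        # Superimpose: record which audiences appear at each time position.
        for timestamp, _kind in smoothed:
            audiences_per_slot[timestamp // bucket].add(audience_id)

    total = len(by_audience)
    # Assumed strength: share of all audiences present in the bucket.
    return {slot: len(ids) / total for slot, ids in sorted(audiences_per_slot.items())}


logs = [
    ("u1", 30, "c", "click"),
    ("u1", 0, "a", "view"),
    ("u1", 2, "b", "view"),
    ("u2", 5, "a", "view"),
    ("u2", 40, "c", "click"),
]
ad_types = {"a": "video", "b": "banner", "c": "video"}
field = attention_field(logs, ad_types, min_duration=5, bucket=10)
# field == {0: 1.0, 3: 0.5, 4: 0.5}
```

In this example the first point of `u1` lasts only 2 time units and is removed as noise, so bucket 0 is still reached by both audiences through their remaining points.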
7. The personalized advertising creative generation method based on multimodal content understanding according to claim 6, characterized in that the step of projecting the resonance feature unit set onto the audience attention field distribution map, calculating the field strength coupling coefficient of each resonance feature unit in the audience attention field distribution map, filtering the resonance feature unit set based on the field strength coupling coefficient, and generating a field coupling feature unit sequence that matches the audience attention field distribution map includes:
The text modal association source, image modal association source, and video modal association source contained in each resonance feature unit are analyzed, and the original product information timestamp corresponding to each resonance feature unit is extracted. The original product information timestamp represents the time position, within the original product information package, of the product attribute description text stream, product physical shooting image stream, or product function demonstration video stream from which the resonance feature unit originates.
The original product information timestamp corresponding to each resonance feature unit is mapped onto the time axis of the audience attention field distribution map to determine the projection time point of each resonance feature unit in the map.
The attention field strength value corresponding to each projection time point is read from the audience attention field distribution map and used as the initial field strength coupling coefficient of the resonance feature unit.
The internal resonance consistency score among the text modal association source, image modal association source, and video modal association source in each resonance feature unit is calculated and multiplied by the initial field strength coupling coefficient to generate the final field strength coupling coefficient of the resonance feature unit.
All resonance feature units are sorted in descending order of final field strength coupling coefficient to generate a sorted list of resonance feature units.
The resonance feature units whose final field strength coupling coefficient ranks within a preset top range are selected from the sorted list as candidate field coupling feature units.
The candidate field coupling feature units are rearranged in chronological order of their projection time points to generate the field coupling feature unit sequence that matches the audience attention field distribution map.
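The projection, coupling, ranking, and reordering steps above can be illustrated with a short Python sketch. It assumes that each resonance feature unit is a dictionary with hypothetical keys `id`, `timestamp`, and `consistency` (the internal resonance consistency score, taken as already computed), that source timestamps are mapped linearly onto the map's time axis and snapped to the nearest available time point, and that the "preset top range" is a simple top-k cut. None of these choices is fixed by the claim.

```python
def couple_units(units, field, source_span, top_k):
    """Return the field coupling feature unit sequence as a list of dicts."""
    slots = sorted(field)
    first, last = slots[0], slots[-1]

    scored = []
    for unit in units:
        # Assumed projection: linear rescaling of the source timestamp onto the
        # map's time axis, then snapping to the nearest time point on the map.
        position = first + (unit["timestamp"] / source_span) * (last - first)
        slot = min(slots, key=lambda s: abs(s - position))
        initial = field[slot]                     # initial coupling coefficient
        final = initial * unit["consistency"]     # final coupling coefficient
        scored.append({**unit, "slot": slot, "coefficient": final})

    # Rank by final coefficient, keep the top-k, then restore time order.
    ranked = sorted(scored, key=lambda u: u["coefficient"], reverse=True)
    return sorted(ranked[:top_k], key=lambda u: u["slot"])


field = {0: 1.0, 3: 0.5, 4: 0.5}
units = [
    {"id": "A", "timestamp": 0.0, "consistency": 0.4},
    {"id": "B", "timestamp": 6.0, "consistency": 0.9},
    {"id": "C", "timestamp": 8.0, "consistency": 0.4},
]
sequence = couple_units(units, field, source_span=8.0, top_k=2)
# [u["id"] for u in sequence] == ["A", "B"]
```

Unit B has the highest coefficient (0.45) but is projected later than unit A (0.4), so the final sequence lists A before B.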
8. The personalized advertising creative generation method based on multimodal content understanding according to claim 7, characterized in that the step of inputting the field coupling feature unit sequence into a preset creative generation decoder, and outputting a text creative fragment sequence, an image creative fragment sequence, and a video creative fragment sequence through the creative generation decoder includes:
Each field coupling feature unit in the field coupling feature unit sequence is sequentially input into the text decoding branch of the creative generation decoder. The text decoding branch contains multiple cascaded text decoding layers. Each text decoding layer performs text semantic reconstruction on its input and outputs the intermediate text reconstruction representation of the current layer. The intermediate text reconstruction representation output by the last text decoding layer serves as the text creative fragment corresponding to the field coupling feature unit.
Each field coupling feature unit in the field coupling feature unit sequence is sequentially input into the image decoding branch of the creative generation decoder. The image decoding branch contains multiple cascaded image decoding layers. Each image decoding layer performs image visual reconstruction on its input and outputs the intermediate image reconstruction representation of the current layer. The intermediate image reconstruction representation output by the last image decoding layer serves as the image creative fragment corresponding to the field coupling feature unit.
Each field coupling feature unit in the field coupling feature unit sequence is sequentially input into the video decoding branch of the creative generation decoder. The video decoding branch contains multiple cascaded video decoding layers. Each video decoding layer performs video temporal reconstruction on its input and outputs the intermediate video reconstruction representation of the current layer. The intermediate video reconstruction representation output by the last video decoding layer serves as the video creative fragment corresponding to the field coupling feature unit.
The text creative fragments corresponding to all field coupling feature units are combined into a text creative fragment sequence in the arrangement order of the field coupling feature unit sequence.
The image creative fragments corresponding to all field coupling feature units are combined into an image creative fragment sequence in the arrangement order of the field coupling feature unit sequence.
The video creative fragments corresponding to all field coupling feature units are combined into a video creative fragment sequence in the arrangement order of the field coupling feature unit sequence.
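The control flow of the three decoding branches can be shown with a small Python sketch. It illustrates only the cascading and ordering logic: the decoding layers here are placeholder string functions, not neural decoding layers, and the names `run_branch`, `decode_sequence`, and `tag` are hypothetical.

```python
from functools import reduce


def run_branch(layers, unit):
    """Pass one unit through cascaded layers.

    Each layer consumes the previous layer's intermediate representation;
    the output of the last layer is treated as the creative fragment.
    """
    return reduce(lambda representation, layer: layer(representation), layers, unit)


def decode_sequence(units, branches):
    """Decode every unit with every branch, preserving the unit order."""
    return {
        name: [run_branch(layers, unit) for unit in units]
        for name, layers in branches.items()
    }


def tag(label):
    """Placeholder layer: appends a label so the cascade order stays visible."""
    return lambda representation: f"{representation}>{label}"


branches = {
    "text": [tag("t1"), tag("t2")],
    "image": [tag("i1"), tag("i2")],
    "video": [tag("v1"), tag("v2"), tag("v3")],
}
fragments = decode_sequence(["unit0", "unit1"], branches)
# fragments["text"] == ["unit0>t1>t2", "unit1>t1>t2"]
```

Because each branch iterates over the same unit list, the three fragment sequences share the arrangement order of the field coupling feature unit sequence.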
9. The personalized advertising creative generation method based on multimodal content understanding according to claim 8, characterized in that the step of interweaving and splicing the text creative fragment sequence, the image creative fragment sequence, and the video creative fragment sequence according to the arrangement order of the field coupling feature unit sequence to generate a personalized advertising creative unit includes:
The projection time point corresponding to each field coupling feature unit in the field coupling feature unit sequence is obtained, and the expected presentation duration of each field coupling feature unit in the final advertising creative is calculated based on the projection time point.
The text creative fragment corresponding to each field coupling feature unit is divided into semantic units to generate a text semantic unit sequence, and a target presentation duration is assigned to each text semantic unit based on the expected presentation duration to form a text creative sub-fragment sequence.
The image creative fragment corresponding to each field coupling feature unit is divided into visual units to generate an image visual unit sequence, and a target presentation duration is assigned to each image visual unit based on the expected presentation duration to form an image creative sub-fragment sequence.
Based on the expected presentation duration, the video creative fragment corresponding to each field coupling feature unit is divided into a video creative sub-fragment sequence, in which each video creative sub-fragment corresponds to video content of one unit presentation duration.
The text creative sub-fragment sequence, image creative sub-fragment sequence, and video creative sub-fragment sequence corresponding to each field coupling feature unit are aligned across modalities, so that the text, image, and video creative sub-fragments within the same time period correspond to each other semantically.
In the arrangement order of the field coupling feature unit sequence, the aligned creative sub-fragments corresponding to all field coupling feature units are spliced end to end to generate a multimodal interwoven initial creative long sequence.
Sequence smoothing is applied to the initial creative long sequence: semantic transition points between adjacent creative sub-fragments are detected, and semantic transition sub-fragments are inserted at those points to generate a smoothed creative long sequence.
The smoothed creative long sequence is encapsulated in a data format that conforms to the delivery interface specification of the advertising platform to generate the personalized advertising creative unit.
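The duration allocation, splicing, smoothing, and encapsulation steps above could look roughly like the following Python sketch. It is a simplification under stated assumptions: the expected presentation duration of each unit is supplied as input rather than derived from the projection time point, each unit's duration is split evenly among its sub-fragments, every boundary between two units is treated as a semantic transition point, and the JSON layout stands in for the platform's delivery format, which the claim does not define. The names `time_slots` and `build_creative` are hypothetical.

```python
import json


def time_slots(parts, start, duration):
    """Give each sub-fragment an equal share of the unit's duration (assumed rule)."""
    share = duration / len(parts)
    return [
        {"start": start + n * share, "end": start + (n + 1) * share, "content": part}
        for n, part in enumerate(parts)
    ]


def build_creative(units, transition=0.5):
    """Splice per-unit sub-fragments into one timeline and serialize it as JSON."""
    tracks = {"text": [], "image": [], "video": [], "transition": []}
    clock = 0.0
    for index, unit in enumerate(units):
        if index > 0:
            # Assumption: every unit boundary is a semantic transition point.
            tracks["transition"].append({"start": clock, "end": clock + transition})
            clock += transition
        # All three modalities of a unit share the same time window, which keeps
        # the sub-fragments of one unit aligned with each other.
        for name in ("text", "image", "video"):
            tracks[name].extend(time_slots(unit[name], clock, unit["duration"]))
        clock += unit["duration"]
    return json.dumps({"duration": clock, "tracks": tracks})


units = [
    {"duration": 4.0, "text": ["a", "b"], "image": ["i1"], "video": ["v1", "v2"]},
    {"duration": 2.0, "text": ["c"], "image": ["i2"], "video": ["v3"]},
]
payload = json.loads(build_creative(units))
# payload["duration"] == 6.5
```

In this example the second unit starts at 4.5 seconds, after a 0.5 second transition, so the total duration is 6.5 seconds.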
10. A personalized advertising creative generation system based on multimodal content understanding, characterized in that the system includes a processor and a computer-readable storage medium storing machine-executable instructions which, when executed by the processor, implement the personalized advertising creative generation method based on multimodal content understanding according to any one of claims 1-9.