International video-oriented cultural element cross-modal positioning method
By preprocessing and extracting features from international videos using a multimodal large model, and combining neural network mapping similarity and filtering methods, accurate cross-modal localization of cultural elements in international videos is achieved, solving the problem of inaccurate localization caused by single-modal data, and supporting automated detection and labeling of cultural elements.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HUNAN NORMAL UNIVERSITY
- Filing Date
- 2026-03-28
- Publication Date
- 2026-06-19
AI Technical Summary
Current technologies rely on single-modal data to locate international video cultural elements, resulting in inaccurate positioning and failing to meet the needs of cross-cultural communication research and analysis.
A multimodal large model is used to preprocess the video data to generate textual and visual feature vectors of cultural elements. Similarity is calculated through neural network mapping and post-processing is performed using an exponentially weighted moving average filtering method to select keyframes containing the target cultural elements.
It has achieved accurate cross-modal localization of cultural elements in international videos, solved the problem of accurate localization of target cultural elements in complex video backgrounds, and provided a complete technical solution for the automated detection, labeling and dissemination effect analysis of cultural elements.
Figure CN122241007A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence technology, and further to the field of cross-cultural communication technology based on artificial intelligence, and particularly to a method for cross-modal localization of cultural elements in international videos.
Background Technology
[0002] With the acceleration of global digitalization, international videos have become an important carrier of cross-cultural communication. Videos from different countries contain their own rich cultural elements. Accurately locating and identifying these cultural elements is a key prerequisite for conducting research and analysis related to cross-cultural communication. However, the current methods for locating cultural elements from multiple countries in international videos suffer from problems such as reliance on single-modal data and inaccurate positioning, which make it difficult to meet the needs of cross-cultural communication research and analysis for accurate positioning and identification of cultural elements.
Summary of the Invention
[0003] In view of this, the present invention discloses a cross-modal localization method for cultural elements in international videos, aiming to solve the problem of inaccurate localization of cultural elements in international videos caused by relying on single-modal data in the prior art.
[0004] According to a first aspect of the present invention, a method for cross-modal localization of cultural elements in international videos is disclosed, the method comprising:
[0005] Receive video-related multimodal input data and preprocess the data to obtain a set of core keywords;
[0006] The core keyword set is input into a pre-trained multimodal large model. The multimodal large model is used to extract textual and visual deep features from the current video, generate textual feature vectors and visual feature vectors of cultural elements, and calculate the neural network mapping similarity between the visual feature vector and the textual feature vector of cultural elements in each frame to generate a similarity scoring sequence.
[0007] An exponentially weighted moving average filtering method is used to perform targeted post-processing on the similarity scoring sequence to obtain a smooth similarity scoring sequence;
[0008] Based on the smooth similarity scoring sequence, frame images with a smooth similarity higher than a preset threshold are selected to form a set of keyframes containing the target cultural elements, thus completing the cross-modal localization of video cultural elements.
[0009] In some possible implementations of the first aspect, the data is preprocessed to obtain a set of core keywords, including:
[0010] Multimodal input data is converted into text to obtain three types of initial text: text, speech and image. Based on the attention mechanism and combined with modal reliability rules, fused text without redundancy and with consistent semantics is generated.
[0011] The fused text is segmented into words, and a stop word list is used to filter out meaningless words to obtain a set of effective words.
[0012] The effective word set is encoded, and the cosine similarity between each word vector and the document vector is calculated. High-scoring words are selected based on the cosine similarity and formed into a set, resulting in the core keyword set.
[0013] In some possible implementations of the first aspect, a set of core keywords is input into a pre-trained multimodal large model. The multimodal large model is then used to extract textual and visual deep features from the current video, generating textual and visual feature vectors of cultural elements. The similarity between the neural network mapping of the visual feature vector and the textual feature vector of cultural elements for each frame is calculated, generating a similarity scoring sequence, including:
[0014] The multimodal large model includes a text encoder based on BERT (Bidirectional Encoder Representations from Transformers) and an image encoder based on ViT (Vision Transformer).
[0015] The core keyword set is input into a text encoder based on BERT, and the semantic associations of cultural element keywords are captured through a dynamic attention mechanism to generate a high-dimensional text feature vector.
[0016] The video frame images are input into an image encoder based on ViT, and the visual features of local cultural elements in the frame images are accurately captured through regional feature encoding to generate high-dimensional visual feature vectors.
[0017] Within the feature space, based on the retrieval enhancement method and employing a contrastive learning strategy and a generative alignment loss strategy, the similarity between the visual feature vector of each video frame and the text feature vector of cultural elements is calculated using a neural network mapping, generating a similarity scoring sequence that reflects the degree of semantic association between each frame and the cultural elements.
[0018] In some possible implementations of the first aspect, based on the smooth similarity scoring sequence, frame images with a smooth similarity higher than a preset threshold are selected to form a set of keyframes containing the target cultural elements, thereby completing the cross-modal localization of video cultural elements, including:
[0019] A preset similarity threshold for dynamic or static cultural elements is established. Based on the preset threshold, frames with a smooth similarity higher than the preset threshold are selected to form a set of keyframes containing the target cultural elements, thus completing the cross-modal localization of video cultural elements.
[0020] In some possible implementations of the first aspect, the method further includes:
[0021] Popular international videos are acquired and analyzed frame by frame to construct a dataset of video cultural elements.
[0022] In some possible implementations of the first aspect, popular international videos are acquired and analyzed frame by frame to construct a dataset of video cultural elements, including:
[0023] Popular international videos are acquired, FFmpeg (Fast Forward Moving Picture Experts Group) frame-segmentation technology is used to break the videos down into independent frame images, and the cultural elements in each frame are annotated.
[0024] Store the cultural elements and corresponding temporal information for each frame to generate a video cultural element case dataset.
[0025] In some possible implementations of the first aspect, the method further includes:
[0026] Visualize and highlight the target cultural elements in each keyframe;
[0027] The marked keyframes are reassembled according to the time sequence and frame rate of the original video, and video coding technology is used to generate a continuous target video containing complete cultural element markings.
[0028] The original audiovisual synchronization characteristics are preserved in the target video, which is output as a video file with a specified format and resolution as needed, completing the entire process from locating video cultural elements to visual presentation.
[0029] In some possible implementations of the first aspect, the target cultural elements of each keyframe are visually marked and highlighted, including:
[0030] Multi-level feature extraction is performed on keyframes to generate high-dimensional feature maps and amplify the feature weights related to cultural elements.
[0031] Retrieve reference information from the video cultural element case dataset, and locate the target area of cultural elements in the complex background of the video based on feature weights and reference information;
[0032] Calculate the attention weight of the target region and filter out the region with the highest attention weight;
[0033] The contour segmentation method is adopted to predict the segmentation mask for the region with the highest attention weight in each keyframe, obtain the segmentation mask, and binarize the segmentation mask; extract the contour point set from the binarized segmentation mask, and the contour point set contains the contour point coordinates.
[0034] Coordinate mapping is performed based on the contour point coordinates to obtain the visual contour markers of the target cultural elements in each keyframe, and then they are highlighted.
[0035] In some possible implementations of the first aspect, the tagged keyframes are reassembled according to the temporal sequence and frame rate of the original video, and video coding technology is used to generate a continuous target video containing complete cultural element tags, including:
[0036] For the marked keyframes, the frame rate parameters are extracted from the original video metadata; the frame rate parameters include timing characteristics and playback duration.
[0037] Construct a timestamp sequence and reassemble the frame sequence based on the frame rate parameter;
[0038] By using video coding technology, the recombined frame sequence is compressed to generate a continuous target video containing complete cultural element markers.
[0039] In some possible implementations of the first aspect, the timestamp sequence is determined by the following calculation formula: $t_i = i / f$;
[0040] where $t_i$ is the timestamp, $i$ is the frame number, and $f$ is the frame rate parameter.
[0041] In this invention, key cultural element features are obtained through multimodal feature extraction; a pre-trained multimodal large model is used to extract text feature vectors and visual feature vectors for each frame of image, and a frame-level scoring sequence is constructed by similarity calculation through neural network mapping; the scoring sequence is smoothed by exponential weighted moving average filtering to select candidate keyframes containing the target cultural elements, thereby forming a keyframe set and completing the cross-modal localization of video cultural elements; through cross-modal semantic alignment, the accurate localization of target cultural elements in complex video backgrounds is effectively solved.
[0042] It should be understood that the description in the Summary of the Invention is not intended to limit the key or essential features of the embodiments of the present invention, nor is it intended to restrict the scope of the invention. Other features of the invention will become readily apparent from the following description.
Attached Figure Description
[0043] The above and other features, advantages, and aspects of the various embodiments of the present invention will become more apparent from the accompanying drawings and the following detailed description. The drawings are provided for a better understanding of the invention and are not intended to limit the invention. In the drawings, the same or similar reference numerals denote the same or similar elements, wherein:
[0044] Figure 1 is a flowchart of a cross-modal localization method for cultural elements in international videos provided by an embodiment of the present invention;
[0045] Figure 2 is a schematic diagram of a cross-modal localization method for cultural elements in international videos provided by an embodiment of the present invention;
[0046] Figure 3 is a flowchart of the module, provided by an embodiment of the present invention, for preprocessing multimodal input and for fusing and filtering keyword sequences from different sources based on an attention mechanism;
[0047] Figure 4 is a flowchart of the module, provided by an embodiment of the present invention, for segmenting popular game video data with FFmpeg and building an annotated sample dataset based on cultural element analysis;
[0048] Figure 5 is a flowchart of the module, provided by an embodiment of the present invention, for obtaining neural network mapping similarity scoring sequences through a BLIP large model using retrieval enhancement technology;
[0049] Figure 6 is a flowchart of the module, provided by an embodiment of the present invention, for smoothing the scoring sequence with the exponentially weighted moving average filtering algorithm and screening the candidate frame set for cultural elements;
[0050] Figure 7 is a flowchart of the module, provided by an embodiment of the present invention, for accurately locating cultural element targets in the candidate frame set using a pixel-level contour segmentation scheme;
[0051] Figure 8 is a diagram of the application environment of the cross-modal localization method for cultural elements in international videos provided by an embodiment of the present invention.
Detailed Implementation
[0052] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0053] Furthermore, the term "and / or" in this article is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, or B existing alone. Additionally, the character " / " in this article generally indicates that the preceding and following related objects have an "or" relationship.
[0054] In response to the problems mentioned in the background art, this invention discloses a cross-modal localization method for cultural elements in international videos.
[0055] Specifically, the system receives multimodal input data related to the video and preprocesses the data to obtain a set of core keywords. This set of core keywords is then input into a pre-trained multimodal large-scale model, which extracts textual and visual depth features from the current video, generating textual and visual feature vectors for cultural elements. The similarity between the neural network mapping of the visual feature vector and the textual feature vector of cultural elements in each frame is calculated, generating a similarity scoring sequence. An exponentially weighted moving average filtering method is used to perform targeted post-processing on the similarity scoring sequence to obtain a smoothed similarity scoring sequence. Based on the smoothed similarity scoring sequence, frames with a smoothed similarity higher than a preset threshold are selected to form a set of keyframes containing the target cultural elements, thus completing the cross-modal localization of video cultural elements.
[0056] This approach can solve the problem of inaccurate positioning of international video cultural elements caused by relying on single-modal data in existing technologies.
[0057] The cross-modal positioning method for cultural elements in international videos provided by the present invention will be described in more detail below with reference to the accompanying drawings and specific embodiments.
[0058] Figure 1 is a flowchart of a cross-modal localization method for cultural elements in international videos provided by an embodiment of the present invention. As shown in Figure 1, the cross-modal localization method 100 for cultural elements in international videos may include:
[0059] S110: Receive video-related multimodal input data and preprocess the data to obtain a set of core keywords.
[0060] Specifically, preprocessing the data to obtain the core keyword set includes: converting the multimodal input data into text to obtain three types of initial text (from text, speech, and image input); generating a non-redundant and semantically consistent fused text based on an attention mechanism and modality reliability rules; segmenting the fused text into words and filtering out meaningless words using a stop word list to obtain a set of effective words; encoding the set of effective words and calculating the cosine similarity between each word vector and the document vector; and selecting high-scoring words based on the cosine similarity to form the core keyword set.
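The keyword-selection part of S110 can be sketched as follows. This is a minimal illustration only: it assumes toy word vectors supplied as a dictionary and a document vector taken as the mean of the effective word vectors, neither of which is specified by the invention. The names `STOP_WORDS`, `select_core_keywords`, and `top_k` are hypothetical.

```python
import math

STOP_WORDS = {"the", "a", "of", "and", "in"}  # hypothetical stop word list


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_core_keywords(words, embeddings, top_k=2):
    """Filter stop words, then keep the top_k words closest to the document vector."""
    # dict.fromkeys removes duplicates while preserving word order.
    effective = [w for w in dict.fromkeys(words)
                 if w not in STOP_WORDS and w in embeddings]
    if not effective:
        return []
    dim = len(embeddings[effective[0]])
    # Assumption: the document vector is the mean of the effective word vectors.
    doc = [sum(embeddings[w][i] for w in effective) / len(effective)
           for i in range(dim)]
    ranked = sorted(effective, key=lambda w: cosine(embeddings[w], doc), reverse=True)
    return ranked[:top_k]
```

In practice the word vectors would come from the encoder described in the embodiment rather than from a hand-written dictionary.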
[0061] S120: Input the core keyword set into the pre-trained multimodal large model, use the multimodal large model to extract textual and visual deep features of the current video, generate textual feature vectors and visual feature vectors of cultural elements, calculate the neural network mapping similarity between the visual feature vector and the textual feature vector of cultural elements of each frame, and generate a similarity scoring sequence.
[0062] Specifically, the multimodal large model can include a text encoder based on BERT and an image encoder based on ViT. The core keyword set is input into the text encoder based on BERT, and the semantic association of cultural element keywords is captured through a dynamic attention mechanism to generate a high-dimensional text feature vector. The video frame images are input into the image encoder based on ViT, and the visual features of local cultural elements in the frame images are accurately captured through region-level feature encoding to generate a high-dimensional visual feature vector. In the feature space, based on retrieval enhancement, contrastive learning, and generative alignment loss strategies, the neural network mapping similarity between the visual feature vector of each video frame and the text feature vector of cultural elements is calculated to generate a similarity scoring sequence that reflects the degree of semantic association between each frame and the cultural elements.
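The frame-level scoring in S120 can be illustrated with a short sketch. Plain cosine similarity is used here as a stand-in for the neural network mapping similarity, whose exact form the description does not give; the vectors are assumed to be already projected into a shared feature space, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_sequence(frame_vectors, text_vector):
    """Return one similarity score per frame, in frame order."""
    return [cosine(v, text_vector) for v in frame_vectors]
```

The resulting list has one entry per frame and is the input to the smoothing step S130.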
[0063] S130: Use an exponentially weighted moving average filtering method to perform targeted post-processing on the similarity scoring sequence to obtain a smoothed similarity scoring sequence.
[0064] Specifically, an exponentially weighted moving average filtering method is used to suppress the abrupt changes in similarity scores caused by noise unique to the video, such as camera shake, flashing special effects, and sudden changes in lighting, so as to obtain a smooth similarity score sequence.
[0065] S140: Based on the smooth similarity scoring sequence, select the frame images with a smooth similarity higher than the preset threshold to form a set of keyframes containing the target cultural elements, thus completing the cross-modal localization of video cultural elements.
[0066] Specifically, a similarity threshold for the location of dynamic or static cultural elements is preset; based on the preset threshold, frame images with a smooth similarity higher than the preset threshold are selected to form a set of keyframes containing the target cultural elements, thus completing the cross-modal location of video cultural elements.
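Steps S130 and S140 can be sketched together as below. The smoothing factor `alpha`, the initialization with the first score, and the threshold value are illustrative assumptions; the description does not fix them.

```python
def ewma(scores, alpha=0.3):
    """Exponentially weighted moving average: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    smoothed = []
    for x in scores:
        prev = smoothed[-1] if smoothed else x  # initialize with the first score
        smoothed.append(alpha * x + (1 - alpha) * prev)
    return smoothed


def select_keyframes(scores, threshold=0.5, alpha=0.3):
    """Return indices of frames whose smoothed similarity exceeds the threshold."""
    return [i for i, s in enumerate(ewma(scores, alpha)) if s > threshold]
```

With `scores = [0.2, 0.9, 0.2, 0.8, 0.8, 0.8]`, `alpha=0.5`, and `threshold=0.6`, the isolated spike at index 1 is suppressed and only the sustained run at the end (indices 4 and 5) is selected, which is the behavior the filter is meant to provide against camera shake or flashing effects.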
[0067] In some embodiments, method 100 further includes:
[0068] Popular international videos are acquired and analyzed frame by frame to construct a dataset of video cultural elements.
[0069] Specifically, popular international videos are acquired, FFmpeg frame segmentation technology is used to decompose the videos into independent frame images, and the cultural elements of each frame are annotated; the cultural elements of each frame and their corresponding temporal information are stored to generate a video cultural element case dataset.
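A minimal sketch of the frame-splitting step follows, assuming the `ffmpeg` executable is installed and available on the system path. The helper names and the `frame_%06d.png` output pattern are hypothetical choices, not part of the described method.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_split_command(video_path, out_dir, fps=None):
    """Build an FFmpeg command that writes decoded frames as numbered PNG files."""
    cmd = ["ffmpeg", "-i", str(video_path)]
    if fps is not None:
        cmd += ["-vf", f"fps={fps}"]  # optional: resample to a fixed frame rate
    cmd.append(str(Path(out_dir) / "frame_%06d.png"))
    return cmd


def split_video(video_path, out_dir, fps=None):
    """Run FFmpeg; raises CalledProcessError if the command fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_split_command(video_path, out_dir, fps), check=True)
```

Keeping command construction separate from execution makes the command easy to inspect or log before any video is processed.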
[0070] In some embodiments, method 100 further includes:
[0071] Visualize and highlight the target cultural elements in each keyframe;
[0072] The marked keyframes are reassembled according to the time sequence and frame rate of the original video, and video coding technology is used to generate a continuous target video containing complete cultural element markings.
[0073] The original audiovisual synchronization characteristics are preserved in the target video, which is output as a video file with a specified format and resolution as needed, completing the entire process from locating video cultural elements to visual presentation.
[0074] Furthermore, the target cultural elements in each keyframe are visually marked and highlighted, including:
[0075] Multi-level feature extraction is performed on keyframes to generate high-dimensional feature maps, amplifying the feature weights related to cultural elements. Reference information is retrieved from the video cultural element case dataset, and the target region of cultural elements is located in the complex background of the video based on the feature weights and reference information. The attention weight of the target region is calculated, and the region with the highest attention weight is selected. Contour segmentation is used to predict the segmentation mask of the region with the highest attention weight in each keyframe, and the segmentation mask is binarized. Contour point sets containing contour point coordinates are extracted from the binarized segmentation mask. Coordinate mapping is performed based on the contour point coordinates to obtain the visual contour marker of the target cultural element in each keyframe, which is then highlighted.
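The mask binarization, contour point extraction, and coordinate mapping described above can be sketched in plain Python. This is a simplified stand-in: the mask is assumed to be a 2D list of probabilities, a contour point is taken to be any foreground pixel with a background 4-neighbor or on the image border, and the scale factors model a mask predicted at lower resolution than the frame. All function names are hypothetical.

```python
def binarize(mask, threshold=0.5):
    """Turn a 2D list of mask probabilities into 0/1 values."""
    return [[1 if v >= threshold else 0 for v in row] for row in mask]


def contour_points(binary_mask):
    """Return (x, y) foreground pixels that touch the background or the image border."""
    height, width = len(binary_mask), len(binary_mask[0])
    points = []
    for y in range(height):
        for x in range(width):
            if not binary_mask[y][x]:
                continue
            neighbors = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
            on_edge = any(
                nx < 0 or ny < 0 or nx >= width or ny >= height
                or not binary_mask[ny][nx]
                for nx, ny in neighbors
            )
            if on_edge:
                points.append((x, y))
    return points


def map_to_frame(points, scale_x, scale_y):
    """Map mask coordinates to frame coordinates."""
    return [(round(x * scale_x), round(y * scale_y)) for x, y in points]
```

The mapped points can then be drawn onto the keyframe as the highlighted outline of the cultural element.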
[0076] Furthermore, the tagged keyframes are reassembled according to the temporal sequence and frame rate of the original video, and video coding techniques are used to generate a continuous target video containing complete cultural element tags, including:
[0077] For the marked keyframes, the frame rate parameters are extracted from the original video metadata; the frame rate parameters include temporal characteristics and playback duration; a timestamp sequence is constructed, and the frame sequence is reassembled according to the frame rate parameters; video coding technology is used to compress the reassembled frame sequence to generate a continuous target video containing complete cultural element tags.
[0078] Furthermore, the timestamp sequence can be determined using the following formula: $t_i = i / f$;
[0079] where $t_i$ is the timestamp, $i$ is the frame number, and $f$ is the frame rate parameter.
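The timestamp construction can be illustrated as below, assuming a constant frame rate, that the timestamp of frame $i$ is $i$ divided by the frame rate, and that frame numbering starts at 0. The function name is hypothetical.

```python
def timestamp_sequence(frame_count, frame_rate):
    """Timestamps in seconds for frames 0..frame_count-1 at a constant frame rate."""
    if frame_rate <= 0:
        raise ValueError("frame_rate must be positive")
    return [i / frame_rate for i in range(frame_count)]
```

For example, four frames at 2 frames per second yield the timestamps 0.0, 0.5, 1.0, and 1.5 seconds, which is the order in which marked frames would be reassembled.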
[0080] According to embodiments of the present invention, the present invention parses the input text through a text preprocessing module, extracts key cultural element features, and forms a structured query target; it decomposes the input video into a temporal image sequence using video frame segmentation technology, constructing a frame-by-frame image data dictionary; it uses a pre-trained multimodal large model to extract text feature vectors and visual feature vectors for each frame image, and constructs a frame-level semantic association scoring sequence through neural network mapping similarity calculation; it uses exponentially weighted moving average filtering to smooth the similarity sequence, sets a dynamic threshold based on statistical distribution characteristics, and filters out key frames containing target cultural elements; furthermore, it performs multi-level feature extraction on the filtered key frames, automatically focuses on the visual region most relevant to the text description through attention weight calculation, and generates pixel-level coordinate information; and drives the semantic segmentation module to accurately segment the target region. Contour segmentation enables the visualization, marking, and highlighting of cultural elements. All marked frames are reassembled according to their original temporal sequence and frame rate, and a target video retaining the original audio-visual synchronization characteristics is generated through video coding technology. The final result with a specified format and resolution is output, completing the fully automated process from semantic understanding of cultural elements to video visualization. This achieves cross-modal semantic alignment between text and video. By combining dynamic thresholding and attention localization, the problem of accurate identification and segmentation of cultural elements in complex video scenes is solved. 
This provides a complete technical solution for the automated detection, marking, and dissemination effect analysis of cultural elements, effectively solving technical challenges such as cross-modal semantic alignment, accurate target localization in complex backgrounds, and automated video annotation. It realizes end-to-end intelligent processing from text description to video element localization and marking.
[0081] The above are method embodiments. The following description, with reference to the accompanying drawings, will further illustrate the above method embodiments using international game videos as an example.
[0082] Figure 2 is a schematic diagram of a cross-modal localization method for cultural elements in international videos provided by an embodiment of the present invention; Figure 3 is a flowchart of the module, provided by an embodiment of the present invention, for preprocessing multimodal input and for fusing and filtering keyword sequences from different sources based on an attention mechanism.
[0083] 1. As shown in Figure 2, multimodal input data (text, speech, and images) related to game videos is received and preprocessed in a targeted manner. As shown in Figure 3, all multimodal input data is converted into text, resulting in three types of initial text. These three texts are then fused using an attention mechanism. Combined with the modality reliability rule "plain text > speech > image text," a non-redundant and semantically consistent fused text related to game video cultural elements is generated. After word segmentation, meaningless words are filtered out using a game-domain stop word list to obtain a set of effective words. After the effective word set is encoded, the cosine similarity between each word vector and the document vector is calculated. High-scoring words are selected to form the core keyword set for game video cultural elements, providing core semantic support for the subsequent accurate localization of cultural elements.
[0084] Specifically:
[0085] 1.1) For speech input, the Whisper model, an Automatic Speech Recognition (ASR) system, is used to convert speech into text. The model architecture is a Transformer encoder-decoder, and its training objective is to maximize the conditional log-likelihood. The corresponding cross-entropy loss function is as follows:
$$\mathcal{L}_{\text{ASR}} = -\sum_{i=1}^{N} \log P\left(y_i \mid y_{<i}, A, c\right);$$
[0086] where $c$ is the special token related to the task, $N$ is the length of the output text sequence, $y_i$ is the $i$-th token, $y_{<i}$ is the prefix sequence before the $i$-th token, $A \in \mathbb{R}^{T_a \times F}$ is the audio feature input ($T_a$ is the number of time steps and $F$ is the Mel spectrum dimension), and $P$ is the predicted probability of the $i$-th token given the prefix sequence, the audio input, and the task token;
[0087] The result of speech-to-text conversion is denoted $T_{\text{speech}}$;
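[0088] For image input, feature extraction is performed first. VGG16 (the 16-layer network from the Visual Geometry Group) is used to extract global and local features of the image. The formula is as follows:
[0089] $$M = \mathrm{VGG16}(I; \theta);$$
[0090] where $I \in \mathbb{R}^{H \times W \times 3}$ is the input image ($H$ and $W$ are its height and width, respectively), $\theta$ denotes the model parameters, $M \in \mathbb{R}^{H' \times W' \times C}$ is the output feature map, $H' \times W'$ are its spatial dimensions, and $C$ is the number of channels;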
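[0091] Then, a Recurrent Neural Network (RNN) / Transformer model generates a text description from the feature map $M$ using beam search. The candidate sequence set $B_t$ at step $t$ of the beam search satisfies:
[0092] $$B_t = \underset{y_{1:t}}{\operatorname{top-}k}\; P\left(y_{1:t} \mid M\right);$$
[0093] where $k$ is the beam width, the candidates $y_{1:t}$ extend the sequences in $B_{t-1}$ by one token, and $P(y_{1:t} \mid M)$ is the joint probability of the sequence.
[0094] The final image-to-text output is denoted $T_{\text{image}}$.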
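[0095] 1.2) The speech-to-text result $T_{\text{speech}}$, the image-to-text result $T_{\text{image}}$, and the plain-text input $T_{\text{text}}$ contain redundant or conflicting information. To address this, a multimodal text fusion strategy based on the Transformer attention mechanism is employed so that the three modalities can "interact" semantically. First, the three texts are segmented into words to obtain word sequences of lengths $L_s$, $L_v$, and $L_t$:
[0096] $W_{\text{speech}} = [w^{s}_{1}, w^{s}_{2}, \dots, w^{s}_{L_s}]$;
[0097] $W_{\text{image}} = [w^{v}_{1}, w^{v}_{2}, \dots, w^{v}_{L_v}]$;
[0098] $W_{\text{text}} = [w^{t}_{1}, w^{t}_{2}, \dots, w^{t}_{L_t}]$;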
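[0099] A modality source tag is added to each word: speech words are tagged <speech>, image words are tagged <image>, and plain-text words are tagged <text>. The tagged sequences are then concatenated into a total sequence $W$, with <sep> markers separating the different modalities; the final total length is $L = L_s + L_v + L_t + 2$, where $L_s$, $L_v$, and $L_t$ are the numbers of speech, image, and plain-text words and the 2 accounts for the two separators.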
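The tagging and concatenation step can be sketched as follows. Representing each token as a `(tag, word)` pair and using an `<image>` tag for image words are assumptions made for illustration; the function name is hypothetical.

```python
SEP = "<sep>"


def tag_and_concatenate(speech_words, image_words, text_words):
    """Attach a modality tag to every word and join the three sequences with <sep>."""
    tagged = [
        [("<speech>", w) for w in speech_words],
        [("<image>", w) for w in image_words],
        [("<text>", w) for w in text_words],
    ]
    total = []
    for index, sequence in enumerate(tagged):
        if index > 0:
            total.append((SEP, SEP))  # separator between modalities
        total.extend(sequence)
    return total
```

For inputs with 2, 1, and 2 words, the total sequence has 2 + 1 + 2 + 2 = 7 entries, matching the total-length rule with two separators.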
[0100] 1.3) The word sequence is converted into high-dimensional semantic vectors, with modality source and position information encoded at the same time, to provide the input for the attention calculation. First, a pre-trained word vector model maps each word $w_k$ to a semantic vector: $e_k = \mathrm{Embed}(w_k) \in \mathbb{R}^{d}$;
[0101] where the embedding dimension $d$ is set to 768; <sep> uses the dedicated embedding of the pre-trained model; $w_k$ is the $k$-th word (token) in the input word sequence; $\mathrm{Embed}(\cdot)$ is the embedding function of the pre-trained word vector model; $e_k$ is the high-dimensional semantic vector (word embedding) corresponding to word $w_k$; and $\mathbb{R}^{d}$ is the $d$-dimensional real space in which the word vectors reside.
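[0102] Second, to distinguish the modality origins of words, three learnable modality vectors are defined: $m_{\text{speech}}, m_{\text{image}}, m_{\text{text}} \in \mathbb{R}^{d}$. The modality embedding of each word $w_k$ is therefore:
[0103] $$m_k = \begin{cases} m_{\text{speech}}, & \text{if } w_k \text{ is a speech word} \\ m_{\text{image}}, & \text{if } w_k \text{ is an image word} \\ m_{\text{text}}, & \text{if } w_k \text{ is a plain-text word} \end{cases};$$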
[0104] To preserve the order information of words, sine and cosine positional encoding is used so that the model can understand the "sequence":
[0105] $$PE(pos, 2m) = \sin\left(\frac{pos}{10000^{2m/d}}\right), \quad PE(pos, 2m+1) = \cos\left(\frac{pos}{10000^{2m/d}}\right);$$
[0106] where $PE(pos, 2m)$, the even-indexed component of the positional encoding vector, takes the sine function value; $PE(pos, 2m+1)$, the odd-indexed component, takes the cosine function value; $pos$ is the absolute position of the word in the sequence (counting from 0 or 1); and $m$ is the dimension index of the positional encoding ($0 \le m < d/2$).
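The sine and cosine positional encoding can be computed as in the sketch below, which follows the standard sinusoidal form with base 10000 and the embedding dimension of 768 stated above. The function name is hypothetical.

```python
import math


def positional_encoding(pos, d=768):
    """Sinusoidal encoding: sine at even indices 2m, cosine at odd indices 2m+1."""
    encoding = [0.0] * d
    for m in range(d // 2):
        angle = pos / (10000 ** (2 * m / d))
        encoding[2 * m] = math.sin(angle)
        encoding[2 * m + 1] = math.cos(angle)
    return encoding
```

At position 0 every even component is 0 and every odd component is 1, so distinct positions are distinguished only by the phase of each sinusoid.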
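[0107] Finally, the final input vector of each word is the element-wise sum of the word embedding $e_k$, the modality embedding $m_k$, and the position embedding $p_k$: $x_k = e_k + m_k + p_k$.
[0108] The total input matrix is $X = [x_1; x_2; \dots; x_L] \in \mathbb{R}^{L \times d}$.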
[0109] 1.4) A multi-head self-attention mechanism is used to calculate the semantic association strength between words (including cross-modal words). The input matrix $X$ is mapped to the query (Q), key (K), and value (V) matrices through three linear layers: $Q = XW_Q$, $K = XW_K$, $V = XW_V$;
[0110] where $W_Q, W_K, W_V \in \mathbb{R}^{d \times d}$ are learnable parameters. Q, K, and V are split into $h$ heads ($h = 12$), and each head processes a subspace of dimension $d_k = d/h$, giving $Q^{(i)}$, $K^{(i)}$, and $V^{(i)}$ for the $i$-th head;
[0111] No. Attention weight calculation for size (quantification words) With words The similarities are as follows:
[0112] ;
[0113] in, To prevent gradient vanishing in scaling factors, Indicates the first In terms of size, words Word pair Attention weights; The dot product (inner product) of the query vector of word k and the key vector of word j is used to quantify the semantic similarity between the two words.
[0114] The outputs of all heads are concatenated and projected back to d dimensions through a linear layer:
[0115] $\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W_O$;
[0116] where $\mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)$ is the intermediate feature matrix of dimension L×d obtained by concatenating the weighted sums of the h attention heads along the feature dimension, and $W_O$ is the output projection matrix.
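The attention computation of a single head can be sketched as below, under the usual scaled dot-product formulation. The function names and the toy 2×2 matrices are illustrative assumptions; a full multi-head layer would repeat this for each of the h heads on its own slice of Q, K, and V, concatenate the results, and apply the output projection.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(Q, K, V):
    """Scaled dot-product attention for one head.

    Q, K, V are lists of row vectors; returns (weights, outputs).
    """
    d_k = len(Q[0])
    weights = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[c] for w, v in zip(w_row, V)) for c in range(len(V[0]))]
        for w_row in weights
    ]
    return weights, outputs

# Toy example: two words, head dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
weights, outputs = attention_head(Q, K, V)
```

Each row of `weights` sums to 1, and each word attends most strongly to the word whose key matches its query.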
[0117] The result is fed into a Transformer encoder consisting of a multi-head attention network and a feedforward network, each followed by a residual connection and layer normalization.
[0118] ;
[0119] ;
[0120] where $x$ is the feature vector of a single word; $W_1$ and $b_1$ are the parameters of the first linear layer (which widens the feature dimension and increases model capacity); $W_2$ and $b_2$ are the parameters of the second linear layer (which restores the feature dimension); and $H$ is the word feature matrix after cross-modal semantic enhancement.
[0121] The cross-modal association strength is extracted from the multi-head attention weights: for each word, sum its attention weights on the words that come from other modalities ("non-self-modality words"):
[0122] ;
[0123] Where h is the number of attention heads in the multi-head attention mechanism, with a value of 12.
[0124] Calculate feature confidence, correct cross-modal correlation strength, and utilize of Semantic quality of quantified words:
[0125] ;
[0126] in, yes The OK; The closer the value is to 1, the clearer the semantic features of the word; for The Line; compare the feature confidence with the original Weighted fusion yields the corrected cross-modal correlation strength:
[0127] ;
[0128] in, =0.7; Cross-modal attention weights for each word ; , where is the feature confidence level.
[0129] 1.5) The preset base confidence levels for each modality are as follows: Get each word The retention weights are a weighted sum of cross-modal attention and modal reliability:
[0130] ;
[0131] in, ; For words The reliability coefficient of the modality, and the retention weight of each word. ;
[0132] Keep The text filters out low-weight words and sorts them according to their temporal order in each modal text to generate a coherent and complete fused text.
[0133] ;
[0134] in, The total set of words representing all modalities In the process, words that meet the weighting criteria are selected. .
[0135] 1.6) The fused text is then segmented with a deep-learning-based word segmentation algorithm, using a BiLSTM-CRF model (a bidirectional long short-term memory network combined with a conditional random field). The BiLSTM captures contextual semantics; the CRF handles sequence labeling and further refines the BiLSTM output by learning the dependencies between elements of the sequence, which improves segmentation accuracy. Word boundaries are learned from training data, and the conditional probability is:
[0136] $P(y \mid x) = \frac{1}{Z(x)} \exp\left(\sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t)\right)$;
[0137] where $f_k$ are the feature functions (such as part-of-speech tags and context), $\lambda_k$ are the trained model parameters, $T$ is the text length (total number of characters), $x_t$ is the t-th character of the input text, $y_t$ is the segmentation tag of the t-th character (e.g., B for word beginning, I for word middle to word end, O for other), and $Z(x)$ is the normalization factor.
[0138] Through the above model and training process, the input text is segmented into words or phrases, ultimately yielding a set of segmented words:
[0139] ;
[0140] in, Indicates the first Each word, this set of words, provides the basic semantic units for subsequent processing;
[0141] 1.7) Select the game-related stop word list, remove stop words from the word segmentation results, and the filtered text will be the word segmentation results excluding stop words:
[0142] ;
[0143] The word set is obtained:
[0144] .
[0145] 1.8) Extract keywords related to cultural elements from the game videos. First, the filtered input text is encoded: the BERT model converts it into hidden-layer embeddings:
[0146] $H = \mathrm{BERT}([\mathrm{CLS}], w_1, \ldots, w_n, [\mathrm{SEP}])$;
[0147] where $H \in \mathbb{R}^{n \times d}$ is the word-level embedding matrix (n is the number of words and d is the hidden-layer dimension), [CLS] is the classification tag (located at the beginning of the text and used to aggregate global semantics), and [SEP] is the separator tag (used to distinguish different sentences, such as in question-and-answer scenarios).
[0148] Cultural element keywords are selected using a word importance score, defined as the cosine similarity between the word vector and the document vector:
[0149] $\mathrm{score}(w_i) = \frac{h_i \cdot h_{doc}}{\lVert h_i \rVert\, \lVert h_{doc} \rVert}$;
[0150] where $h_i$ is the embedding vector of word $w_i$ and $h_{doc}$ is the document vector representing the semantics of the full text;
[0151] The words are sorted by score, and the highest-scoring words are selected to form the set of cultural element keywords.
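The keyword selection in step 1.8 can be sketched as follows. The example assumes the word and document embeddings are already available (in the method they come from BERT); the function names, the 2-dimensional toy vectors, and the cut-off `k` are hypothetical and chosen only for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_keywords(word_vecs, doc_vec, k):
    """Rank words by cosine similarity to the document vector, keep the top k."""
    ranked = sorted(word_vecs, key=lambda w: cosine(word_vecs[w], doc_vec),
                    reverse=True)
    return ranked[:k]

# Toy embeddings (hypothetical values, not real BERT outputs).
toy_vectors = {"hanfu": [0.9, 0.1], "menu": [0.1, 0.9], "pagoda": [0.8, 0.3]}
keywords = top_keywords(toy_vectors, doc_vec=[1.0, 0.0], k=2)
```

Words whose embeddings point in nearly the same direction as the document vector rank first, so `menu` is dropped in this toy case.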
[0152] Figure 4 is a flowchart of the module of the present invention that uses FFmpeg to split popular game video data into frames and builds an annotated sample dataset for cultural element analysis.
[0153] 2. As shown in Figure 2, videos of popular international games are crawled in a targeted manner and processed frame by frame. Specifically, as shown in Figure 4, FFmpeg frame segmentation is used to decompose the game videos into independent frame images, constructing a game video frame image data dictionary that contains the frame images and the corresponding temporal information; at the same time, a game video cultural element case dataset is constructed to provide dedicated data support for the retrieval enhancement technique and to strengthen the data foundation for accurate positioning; specifically:
[0154] 2.1) Targeted crawling of game videos; using web crawling technology to crawl popular game videos with high viewership from major international platforms, ensuring format compatibility, accurate positioning and processing of subsequent cultural elements, and building a game video data dictionary;
[0155] The video data dictionary is organized by game as $D_V = \{V_{i,j}\}$, where $V_{i,j}$ denotes the j-th crawled video of the i-th game category, $i = 1, \ldots, N$, and $N$ is the total number of video categories.
[0156] 2.2) Game video frame segmentation processing; using FFmpeg frame segmentation, parameters such as frame rate and resolution are set to suit the needs of precise positioning, ensuring that the quality of the frame images meets the requirements of cultural element feature extraction. The game video is split into frame images at a rate of one frame every 2 seconds, and a frame image data dictionary $D_F = \{F_{i,j,k}\}$ is constructed, where $F_{i,j,k}$ denotes the k-th frame image of video $V_{i,j}$.
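One way to realize the "one frame every 2 seconds" sampling is FFmpeg's `fps` video filter. The sketch below only builds the command line; the file names and the helper names are hypothetical, and resolution or format options that the embodiment may set are omitted.

```python
import subprocess

def build_ffmpeg_command(video_path, out_pattern, seconds_per_frame=2):
    """Build an FFmpeg command that keeps one frame every `seconds_per_frame` s.

    `fps=1/2` tells the fps filter to output 0.5 frames per second.
    """
    return [
        "ffmpeg",
        "-i", video_path,
        "-vf", f"fps=1/{seconds_per_frame}",
        out_pattern,
    ]

def extract_frames(video_path, out_pattern, seconds_per_frame=2):
    """Run the command; requires the ffmpeg binary to be on PATH."""
    subprocess.run(
        build_ffmpeg_command(video_path, out_pattern, seconds_per_frame),
        check=True,
    )

# Hypothetical paths; frames would be written as frame_000001.png, ...
cmd = build_ffmpeg_command("game.mp4", "frames/frame_%06d.png")
```

Keeping the command construction separate from execution makes it easy to log or inspect the exact arguments before running them on a large crawl.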
[0157] 2.3) Construct a fine-grained, multilingual, cross-modal dataset of game cultural elements to support the training of a precise game cultural element recognition model based on Pangea-7B (Pangu-7B), and realize three-level granularity recognition and semantic interpretation of cultural elements from various countries in games.
[0158] Data was collected in layers based on game region (East Asia, Europe and America, Southeast Asia, Middle East, etc.), game type (open world, multiplayer competitive games, role-playing games, folk puzzle games, etc.), and cultural carriers (visual elements: clothing / buildings / props; auditory elements: soundtrack / dialogue; behavioral elements: folk rituals / social etiquette). The data covers games with diverse cultural representations, such as Genshin Impact, World of Warcraft, The Legend of Zelda, and Black Myth: Wukong. At least 500 labeled samples were collected for each cultural category. The multilingual video text alignment capability of Pangea-7B was used to transcribe and culturally annotate the multilingual subtitles and character dialogues of overseas games, solving the difficulty of interpreting cultural elements in games with less common languages.
[0159] 2.4) Construct a three-level annotation system: Level 1 is the cultural sphere (e.g., Chinese culture, Japanese Yamato culture, Nordic Viking culture); Level 2 is the cultural category (e.g., clothing, architecture, mythological symbols, folk activities); Level 3 is the fine-grained element (e.g., Chinese culture - clothing - Hanfu - Ruqun - Qixiong Ruqun; Nordic culture - mythological symbols - Thor's hammer - Mjolnir pattern details). The multimodal instruction fine-tuning capability of Pangea-7B is used to train an annotator-assistance model that traces the semantics of ambiguous cultural elements (e.g., props whose design blends several cultures) to ensure annotation accuracy;
[0160] The LabelStudio open-source multimodal data annotation platform was used to annotate video frames, text semantics, and audio waveforms. First, Pangea-7B was used to screen the video materials and mark suspected cultural elements in the areas / text segments. Then, manual annotation was carried out according to a three-level system. Finally, the annotation was reviewed and corrected to form the final annotation sample.
[0161] Figure 5 is a flowchart of the module, provided by an embodiment of the present invention, that obtains the neural network mapping similarity score sequence through the BLIP large model with retrieval enhancement.
[0162] 3. As shown in Figure 2, a pre-trained multimodal large model extracts deep textual and visual features tailored to the characteristics of game videos. Specifically, as shown in Figure 5, the core keyword set of cultural elements extracted in step 1 is input into an improved BERT-based text encoder, where a dynamic attention mechanism captures the semantic associations among the cultural element keywords and generates a high-dimensional text feature vector. The game video frame images are input into an improved ViT-based image encoder, where regional feature encoding captures the visual features of local cultural elements in the frame images and generates a high-dimensional visual feature vector. Within the feature space, retrieval enhancement is combined with contrastive learning and generative alignment loss strategies to calculate the neural network mapping similarity between the visual feature vector of each game video frame image and the text feature vector of the cultural elements. This produces a similarity score sequence that reflects the degree of semantic association between each frame and the cultural elements, providing the feature matching basis for accurate localization. Specifically:
[0163] 3.1) Resize each frame image to the input size of the BLIP (Bootstrapping Language-Image Pre-training) model to eliminate the effect of size differences on feature extraction. Then convert the images from RGBA format to RGB format to meet the model input requirements. Finally, standardize the pixel values (mean = 0, standard deviation = 1) using the following formula:
[0164] $x' = (x - \mu) / \sigma$;
[0165] where $x$ is the pixel value, $\mu$ is the mean, and $\sigma$ is the standard deviation. This ensures that the data distribution matches the statistics used during model training.
[0166] 3.2) A multimodal large model performs multimodal feature extraction on the frame images and the input text. Multimodal feature extraction is the basis for linking video frame images with the input text; its aim is to map images and text into the same semantic space.
[0167] The improved BERT-based text encoder of the multimodal large model BLIP encodes the core keyword set of game video cultural elements, obtained from the preprocessed input text, into text feature vectors;
[0168] Specifically, for each word $w_i$ in the keyword set of the preprocessed input text, the text feature vector is:
[0169] $t_i = f_{\mathrm{text}}(w_i) \in \mathbb{R}^{d}$;
[0170] where $d$ is the dimension of the feature vector (768), L is the total length of the input keyword sequence (total number of tokens), $f_{\mathrm{text}}$ denotes the BLIP text encoder, which maps text into the d-dimensional vector space, and $t_i$ is the text feature vector of the i-th word $w_i$;
[0171] L2 normalization is applied to the text feature vectors to facilitate the subsequent neural network mapping similarity calculation:
[0172] $\hat{t}_i = t_i / \lVert t_i \rVert_2$;
[0173] 3.3) The improved ViT-based image encoder of the multimodal large model BLIP encodes every frame image into the corresponding image feature vector. The feature vector of the j-th frame in the video is:
[0174] $v_j = f_{\mathrm{img}}(I_j)$;
[0175] $v_j \in \mathbb{R}^{d}$;
[0176] where $I_j$ is the j-th frame image and $f_{\mathrm{img}}$ denotes the BLIP image encoder, which maps images into a semantic space of the same dimension as the text feature vectors;
[0177] Similarly, L2 normalization is applied to the image features, so the actual output is:
[0178] $\hat{v}_j = v_j / \lVert v_j \rVert_2$;
[0179] That is, the image feature vector is normalized to unit length, which facilitates the similarity calculation for accurate positioning of cultural elements.
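The reason for normalizing both feature vectors to unit length is that their dot product then equals their cosine similarity. A minimal sketch, with hypothetical 2-dimensional toy vectors standing in for the 768-dimensional encoder outputs:

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product; equals cosine similarity when both inputs have unit norm."""
    return sum(x * y for x, y in zip(a, b))

text_feature = l2_normalize([3.0, 4.0])   # toy text feature
frame_feature = l2_normalize([4.0, 3.0])  # toy frame feature
similarity = dot(text_feature, frame_feature)
```

With the toy values above the similarity is 24/25 = 0.96, which is the cosine of the angle between the two original vectors.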
[0180] 3.4) Introduce retrieval enhancement technology. Based on the game video cultural element case dataset constructed in step 2, obtain the retrieval matching score through the process of "feature retrieval - two-dimensional similarity calculation - score normalization", which provides auxiliary basis for the accurate positioning of game video cultural elements.
[0181] 3.4.1) Categorize samples by "game name + cultural element type" and bind their visual and semantic features (e.g., "Assassin's Creed - Arabic patterns" "Black Myth: Wukong - Chinese ancient architecture") to improve retrieval efficiency and meet the rapid matching needs of game video scenarios.
[0182] 3.4.2) Retrieve from the index library all sample frames under the game type to which the current frame image belongs, and calculate the cosine similarity between the visual feature to be matched, $v_q$, and the visual feature of each sample, $v_s$, which measures how similar the images are in visual detail (shape, color, texture):
[0183] $S_{\mathrm{vis}} = \frac{v_q \cdot v_s}{\lVert v_q \rVert\, \lVert v_s \rVert}$;
[0184] Similarly, calculate the cosine similarity between the semantic feature to be matched, $t_q$, and the semantic feature of the sample's annotation text, $t_s$, to verify the consistency of cultural attributes and avoid positioning bias caused by elements that look alike but belong to different cultural types:
[0185] $S_{\mathrm{sem}} = \frac{t_q \cdot t_s}{\lVert t_q \rVert\, \lVert t_s \rVert}$;
[0186] 3.4.3) Because visual recognizability takes priority when retrieving cultural elements of game videos from the dataset, a visual similarity weight and a semantic similarity weight are set, and the initial matching score is obtained as their weighted combination:
[0187] ;
[0188] Renormalized mapping:
[0189] ;
[0190] 3.5) A fusion strategy of "basic concatenation + gating mechanism" is adopted to ensure that the semantic information of the text and visual features is fully combined. First, the projected text feature is concatenated along the feature dimension with the visual feature of the j-th frame to obtain the fused feature $f_j$:
[0191] $f_j = [t; v_j]$;
[0192] where [;] denotes concatenation along the feature dimension (the first d dimensions are the text features and the following d dimensions are the visual features), $t$ is the text feature, and $v_j$ is the visual feature of the j-th frame;
[0193] Then, gating is used to enhance the fusion, allowing the model to adjust the contributions of the text and visual features automatically. First, the gating vector is calculated:
[0194] $g_j = \sigma(W_g f_j + b_g)$;
[0195] where $W_g$ is the gate weight matrix, $b_g$ is the gate bias, and $\sigma$ is the Sigmoid activation function, $\sigma(x) = 1 / (1 + e^{-x})$;
[0196] The gating vector $g_j$ then weights the basic fused feature $f_j$ to give the gated fused feature $\tilde{f}_j = g_j \odot f_j$, where $\odot$ denotes element-wise multiplication.
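The "concatenation + gating" fusion can be sketched as below. The symbol names (`W_g`, `b_g`), the toy dimensions, and the all-zero gate parameters are assumptions for illustration; in the method the gate parameters are learned.

```python
import math

def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def gated_fusion(text_feat, frame_feat, W_g, b_g):
    """Concatenate text and visual features, then weight them by a sigmoid gate.

    W_g has one row per fused dimension; b_g has one bias per fused dimension.
    """
    fused = list(text_feat) + list(frame_feat)
    gate = [
        sigmoid(sum(w * x for w, x in zip(row, fused)) + b)
        for row, b in zip(W_g, b_g)
    ]
    return [g * f for g, f in zip(gate, fused)]

# Toy case: zero weights and biases give a gate of 0.5 everywhere.
W_g = [[0.0] * 4 for _ in range(4)]
b_g = [0.0] * 4
gated = gated_fusion([1.0, 2.0], [3.0, 4.0], W_g, b_g)
```

Because each gate value lies in (0, 1), the gate can only attenuate a fused dimension, which is how the model down-weights whichever modality is less informative for a given frame.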
[0197] 3.6) Construct a small fully connected neural network (MLP, Multilayer Perceptron) to map the fused features into a one-dimensional correlation score. Assume the MLP network structure has a fixed 3 layers, and the weight matrices and bias terms for each layer are as follows: Layer 1 (input layer to hidden layer 1): ; Layer 2 (Hidden Layer 1 to Hidden Layer 2): Layer 3 (Hidden Layer 2 to Output Layer): ;
[0198] With gated fusion features For input, MLP forward propagation:
[0199] ;
[0200] ;
[0201] ;
[0202] in, This is the output of the first hidden layer. This is the output of the second hidden layer. For the first The final association score of the frame.
[0203] 3.7) Deep alignment of text-image features is achieved through "contrastive learning loss + generative alignment loss + similarity supervision loss + retrieval matching loss" to ensure semantic consistency.
[0204] Contrastive learning loss is:
[0205] ;
[0206] where $\tau$ is the temperature coefficient of contrastive learning; T is the total number of image samples; $t$ is the text feature vector; $v_j$ is the visual feature of the j-th frame; and the normalization runs over the feature vectors of all images;
[0207] The generative alignment loss is:
[0208] ;
[0209] ;
[0210] in, Based on text features Predicted image feature values generated based on conditions;
[0211] The similarity supervision loss uses the mean squared error (MSE) to tie the output score of the multilayer perceptron (MLP) to the manual annotation:
[0212] ;
[0213] The retrieval matching loss also uses the MSE loss:
[0214] ;
[0215] The four losses are weighted and fused into a total loss, which is minimized by backpropagation to update the relevant parameters and improve the feature matching accuracy for precise positioning.
[0216] Figure 6 is a flowchart of the module, provided by an embodiment of the present invention, that uses an exponentially weighted moving average filtering algorithm to smooth the score sequence and select the candidate frame set for cultural elements.
[0217] 4. As shown in Figure 2, the similarity score sequence generated in step 3 undergoes targeted post-processing. Specifically, as shown in Figure 6, an exponentially weighted moving average filter suppresses the abrupt, noise-induced changes in similarity score caused by camera shake, special-effect flashes, and sudden lighting changes typical of game videos, yielding a smooth similarity score sequence. Taking into account the characteristics of game video content, the needs of cultural element positioning, and the retrieval-enhanced matching results, a similarity threshold for precise positioning of dynamic or static cultural elements is set, preferably determined from statistics of the overall score distribution. Frames whose similarity scores are higher than this threshold are selected to form a candidate frame set containing the target cultural element, which completes the precise positioning and extraction of keyframes for cultural elements in the game video.
[0218] After the similarity scores between each video frame and the text description have been obtained in step 3, abrupt changes in the similarity scores are filtered out to improve the accuracy and stability of the localization results. At the same time, frames that may contain the target cultural elements are selected by setting a reasonable similarity score threshold.
[0219] 4.1) Similarity scores may change abruptly because of momentary noise in the video (such as camera shake or lighting changes), leading to misjudgment, so the score sequence is smoothed with a filtering algorithm.
[0220] The exponentially weighted moving average (EWMA) filter smooths the sequence by weighting historical data and giving more weight to recent data, which effectively suppresses abrupt changes.
[0221] Let the original similarity score sequence be $\{s_1, s_2, \ldots, s_T\}$ and the sequence after EWMA filtering be $\{\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_T\}$. The EWMA filter is calculated as follows:
[0222] $\hat{s}_1 = s_1$;
[0223] $\hat{s}_t = \alpha s_t + (1 - \alpha)\, \hat{s}_{t-1}$, $t \ge 2$;
[0224] where $\alpha$ is the smoothing factor, with a value range of (0,1). Because the choice of $\alpha$ has a significant impact on the filtering effect, its value can be determined from the original similarity score sequence: when the scores change drastically, a larger $\alpha$ (e.g., 0.3-0.5) tracks changes in the data quickly; when the scores change relatively smoothly, a smaller $\alpha$ (e.g., 0.1-0.3) gives a smoother filtering result.
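A minimal sketch of the EWMA smoothing step follows. It assumes the common recursive form in which the smoothing factor weights the newest score, and it initializes the filter with the first raw score; both the initialization and the toy score sequence are assumptions made for illustration.

```python
def ewma(scores, alpha):
    """Exponentially weighted moving average of a score sequence.

    alpha in (0, 1) is the weight of the newest score; the first smoothed
    value is taken to be the first raw score (an assumed initialization).
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie in the open interval (0, 1)")
    smoothed = []
    for score in scores:
        if not smoothed:
            smoothed.append(score)
        else:
            smoothed.append(alpha * score + (1 - alpha) * smoothed[-1])
    return smoothed

# A single-frame dip (e.g. a special-effect flash) in otherwise stable scores.
raw_scores = [0.8, 0.8, 0.2, 0.8, 0.8]
smooth_scores = ewma(raw_scores, alpha=0.3)
```

The one-frame drop to 0.2 is damped to 0.62 after filtering, so a momentary flash no longer pushes the frame far below its neighbours.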
[0225] 4.2) International game videos often involve cultural elements from multiple countries (such as clothing, architecture, and symbols), and different cultures may attach different meanings to the same visual element. For example, red symbolizes good fortune in Chinese culture but may be associated with danger in Western culture. Setting a high threshold ensures a strict match between visual features and text descriptions and avoids positioning errors caused by cultural semantic misalignment, which meets the accuracy requirements of cross-cultural, cross-modal scenarios. Therefore, the similarity score threshold between the visual feature vector of each frame image and the feature vector of the target text is defined manually. In this embodiment, we select... .
[0226] The threshold is adjusted according to the retrieval match score: when the retrieval match score is high, the threshold is lowered to 0.75 to balance precision and recall; when the retrieval match score is <0.3, the threshold is raised to 0.85 to enhance the purity of precise positioning.
[0227] A frame is retained when its smoothed similarity score is higher than the threshold. This selection yields the set of candidate frames that precisely locate the cultural elements.
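The threshold adjustment and frame selection can be sketched as follows. The values 0.75, 0.85, and the low cut-off 0.3 come from the description; the default threshold (`base`) and the cut-off above which a retrieval score counts as high (`high_cut`) are not recoverable from this text, so the values used here are purely hypothetical placeholders.

```python
def frame_threshold(retrieval_score, base=0.8, high_cut=0.7, low_cut=0.3):
    """Pick the similarity threshold for one frame.

    base and high_cut are hypothetical placeholders; 0.75, 0.85 and the
    low cut-off 0.3 are the values given in the description.
    """
    if retrieval_score < low_cut:
        return 0.85  # weak retrieval support: be stricter
    if retrieval_score > high_cut:
        return 0.75  # strong retrieval support: relax the threshold
    return base

def select_candidate_frames(smoothed_scores, retrieval_scores):
    """Return indices of frames whose smoothed score exceeds their threshold."""
    return [
        index
        for index, (score, retrieval) in enumerate(
            zip(smoothed_scores, retrieval_scores)
        )
        if score > frame_threshold(retrieval)
    ]

candidates = select_candidate_frames([0.80, 0.80, 0.90], [0.9, 0.1, 0.1])
```

In the toy call, the first frame passes because strong retrieval support lowers its threshold, the second fails the stricter 0.85 threshold, and the third passes it.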
[0228] Figure 7 is a flowchart of the module, provided in an embodiment of the present invention, that accurately locates cultural element targets using a pixel-level contour segmentation scheme based on the candidate frame set for precise positioning of cultural elements.
[0229] 5. As shown in Figure 2, multi-level feature extraction is performed on the precisely located keyframes to generate high-dimensional feature maps, and the feature weights related to cultural elements are automatically learned and amplified. Specifically, as shown in Figure 7, the retrieval reference information from the game video cultural element case dataset is combined to locate the target region of cultural elements accurately in the complex background of game videos; pixel-level coordinate information is generated from the region with the highest attention weight, driving the graphics processing module to visualize and highlight the cultural element targets in the game videos through precise semantic segmentation; specifically:
[0230] 5.1) Perform multi-level feature extraction on the frame images, using a convolutional neural network as the backbone network. The input is the frame images selected in step 4. Each frame image has size $H \times W \times 3$, where $H$ is the image height, $W$ is the image width, and 3 represents the three RGB channels. First, a linear convolution is applied, as shown in the formula below:
[0231] $Z^{(l)}_{i,j,c} = \sum_{c'=1}^{C_{in}} \sum_{u=0}^{k_h-1} \sum_{v=0}^{k_w-1} W^{(l)}_{u,v,c',c}\, X^{(l)}_{i \cdot s + u - p,\ j \cdot s + v - p,\ c'} + b^{(l)}_{c}$
[0232] where $Z^{(l)}$ is the feature map output by the l-th layer; $X^{(l)}$ is the input feature map of the l-th layer (for the first layer, $X^{(1)}$ is the original input image); $W^{(l)}$ is the convolution kernel weight of the l-th layer, of size $k_h \times k_w \times C_{in} \times C_{out}$, with $k_h$ the height of the convolution kernel, $k_w$ its width, $C_{in}$ the number of input channels, and $C_{out}$ the number of output channels; $b^{(l)}$ is the bias term of the l-th layer; $s$ is the stride of the convolution operation; $p$ is the padding size; $(i, j)$ are the spatial coordinates of the output feature map; $c$ is the channel index of the output feature map; $c'$ is the channel index of the input feature map; and $(u, v)$ is the spatial offset index within the convolution kernel;
[0233] Then a nonlinear activation function $\phi$ is applied:
[0234] $A^{(l)} = \phi\left(Z^{(l)}\right)$;
[0235] where $A^{(l)}$ is the feature map after the activation function;
[0236] Finally, max pooling is performed, as shown in the following formula:
[0237] $P^{(l)}_{i,j,c} = \max_{0 \le u < p_h,\ 0 \le v < p_w} A^{(l)}_{i \cdot s_p + u,\ j \cdot s_p + v,\ c}$;
[0238] where $P^{(l)}$ is the feature map after pooling, $p_h$ is the pooling window height, $p_w$ is the pooling window width, $s_p$ is the stride of the pooling operation, and $(u, v)$ is the spatial offset index within the pooling window.
[0239] The resulting output feature map is $F \in \mathbb{R}^{H' \times W' \times C}$, where $H'$ is its height, $W'$ is its width, and $C$ is the number of output channels;
[0240] 5.2) Calculation of attention weights: First, generate the Query, Key, and Value. The calculation formula is as follows:
[0241] ;
[0242] in, To make feature maps The reshaped matrix, , It represents the total number of spatial locations; For a learnable projection matrix, ; These are respectively a query matrix, a key matrix, and a value matrix. ; The projected dimension is usually (here) (For the number of attention heads).
[0243] Then the attention score matrix is calculated from the query matrix and the key matrix using the following formula:
[0244] ;
[0245] Then scale:
[0246] ;
[0247] in, It is an attention score matrix, each element Indicates the first The position is the first Attention scores for each position;
[0248] Then, after Softmax normalization, the calculation formula is as follows:
[0249] ;
[0250] ;
[0251] ;
[0252] in, This is the attention weight matrix. The projected dimension. To generate the location The new representation of position The level of attention;
[0253] Finally, weighted summation is performed to generate enhanced features. The Value is weighted and summed using attention weights to obtain the self-attention features, as shown in the following formula:
[0254] ;
[0255] It is the feature matrix after self-attention. To be Reconstructed into an enhanced feature vector map with spatial dimensions. ;
[0256] 5.3) A pixel-level contour segmentation scheme is adopted, which can more accurately locate text targets. First, segmentation mask prediction is performed, and the calculation formula is as follows:
[0257] ;
[0258] ;
[0259] in, The raw output of the segmentation mask (unnormalized log odds); For decoder networks, upsampling, skip links, and convolutional operations are typically included; To define the height and width of the segmentation mask;
[0260] Then, Sigmoid activation is performed, using the following formula:
[0261] ;
[0262] ;
[0263] where $P$ is the probability map after activation; $P(i, j)$ is the probability that the pixel at position $(i, j)$ belongs to a text target; and $\sigma$ is the Sigmoid function, $\sigma(x) = 1 / (1 + e^{-x})$;
[0264] Next, binarize the segmentation mask using the following formula:
[0265] ;
[0266] ;
[0267] where $B$ is the binary segmentation mask; $B(i, j)$ is the label of the pixel at position $(i, j)$ (1 indicates text target, 0 indicates background); and the binarization threshold is set to 0.5;
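The Sigmoid activation and binarization of the mask can be sketched as follows. The toy logits are hypothetical, and treating a probability exactly equal to the 0.5 threshold as foreground is an assumption, since the comparison operator is not recoverable from the text.

```python
import math

def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def binarize_mask(logits, threshold=0.5):
    """Turn raw mask logits into a 0/1 mask.

    A pixel is labelled 1 when its probability is >= threshold (the tie
    rule is an assumption).
    """
    return [
        [1 if sigmoid(value) >= threshold else 0 for value in row]
        for row in logits
    ]

# Toy 2x2 logit map standing in for the decoder output.
mask = binarize_mask([[2.0, -1.0], [0.0, -3.0]])
```

Since the Sigmoid of 0 is exactly 0.5, thresholding the probability at 0.5 is equivalent to checking the sign of the raw logit.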
[0268] 5.4) Finally, the binary segmentation mask is used to mark and highlight the text targets in the selected frame images. A contour point set $\{(x_m, y_m)\}$ is extracted from the mask, where $(x_m, y_m)$ are the coordinates of a point on the contour.
[0269] The coordinate mapping formula is as follows:
[0270] $x_o = x_m \cdot \frac{W}{W_m}$, $y_o = y_m \cdot \frac{H}{H_m}$;
[0271] where $(x_o, y_o)$ are the coordinates of the contour point in the original image coordinate system; $(x_m, y_m)$ are the coordinates of the contour point in the segmentation mask coordinate system; $H_m$ and $W_m$ are the height and width of the segmentation mask; and $H$ and $W$ are the height and width of the original image.
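Mapping contour points from mask coordinates back to the original image can be sketched as a proportional rescaling. This assumes the mask covers the whole frame and differs from it only in resolution; the function name and the toy sizes are hypothetical.

```python
def map_contour_to_image(points, mask_size, image_size):
    """Rescale (x, y) contour points from mask space to image space.

    mask_size and image_size are (height, width) pairs. Assumes a plain
    proportional mapping with no cropping or padding offsets.
    """
    mask_h, mask_w = mask_size
    image_h, image_w = image_size
    scale_x = image_w / mask_w
    scale_y = image_h / mask_h
    return [(x * scale_x, y * scale_y) for x, y in points]

# Toy case: a 128x128 mask predicted for a 1080x1920 (height x width) frame.
mapped = map_contour_to_image([(10, 20), (64, 64)], (128, 128), (1080, 1920))
```

The centre of the mask, (64, 64), lands on the centre of the frame, (960, 540), which is a quick sanity check on the axis order.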
[0272] This resulted in precise visual outline markings of the target cultural elements in each frame.
[0273] Figure 8 illustrates an application of the cross-modal localization method for cultural elements in international videos provided by an embodiment of the present invention.
[0274] 6. As shown in Figure 2, based on the positioning results obtained in step 5, all marked frame images are reassembled according to the time sequence and frame rate of the original video. Video coding is used to generate a continuous target video containing the complete cultural element markers. This target video retains the original audiovisual synchronization and can be output as a video file of a specified format and resolution as needed. This completes the entire process from precise positioning of cultural elements in the game video to visual presentation. The final application is shown in Figure 8; specifically:
[0275] 6.1) The marked frame images output in step 5 are sorted and reassembled according to their absolute chronological order in the original video. This process strictly follows the frame rate (frames per second, FPS) extracted from the original video metadata, denoted $f$, which ensures that the output video has exactly the same temporal characteristics and playback duration as the original video. The frame sequence is reconstructed by building a timestamp sequence, in which the presentation timestamp $t_j$ of the j-th frame image is given by:
[0276] $t_j = j / f$;
[0277] where $j$ is the frame number (counting from 0);
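A minimal sketch of the timestamp sequence, assuming the presentation timestamp of frame j is simply j divided by the frame rate. Exact fractions are used so that non-integer rates such as 30000/1001 do not accumulate rounding error; the function name is hypothetical.

```python
from fractions import Fraction

def presentation_timestamps(frame_count, fps):
    """Presentation timestamps in seconds for frames 0..frame_count-1.

    fps may be an int or a Fraction (e.g. Fraction(30000, 1001)).
    """
    return [Fraction(j) / fps for j in range(frame_count)]

timestamps = presentation_timestamps(4, 25)
```

For 25 frames per second the timestamps are 0, 1/25, 2/25, and 3/25 seconds, so consecutive frames are spaced exactly 40 ms apart.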
[0278] 6.2) The recombined frame sequence is compressed by a video encoder; preferably, industry-standard video encoding protocols are used, including but not limited to HEVC / H.265 encoding standards, which can achieve high compression efficiency while ensuring accurate and clear marking of cultural elements.
[0279] The encoding process calls the libx265 encoder in the FFmpeg multimedia processing framework. Key technical indicators in the encoding parameter configuration include:
[0280] (1) Bitrate control: A variable bitrate (VBR) control strategy is adopted, and the target bitrate R_target is dynamically set according to the output video resolution;
[0281] (2) Quality assessment: the fidelity of the encoded video is evaluated objectively using the peak signal-to-noise ratio (PSNR), which is calculated as follows:
[0282] $\mathrm{PSNR} = 10 \log_{10}\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right)$;
[0283] $\mathrm{MSE} = \frac{1}{W H} \sum_{x=0}^{W-1} \sum_{y=0}^{H-1} \left[I(x, y) - \hat{I}(x, y)\right]^2$;
[0284] where MAX is the maximum possible pixel value of the image (255 for an 8-bit depth image); MSE is the mean squared error between the original frame and the encoded, reconstructed frame; $W$ and $H$ are the width and height of the image; $(x, y)$ are the coordinates of a pixel in the image, with $0 \le x < W$ and $0 \le y < H$; $I(x, y)$ is the pixel brightness value of the original frame at $(x, y)$; and $\hat{I}(x, y)$ is the pixel brightness value of the reconstructed frame at the same coordinates;
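The PSNR computation can be sketched for a single-channel frame as follows. Frames are represented as nested lists of brightness values; the function name and the toy 2×2 frames are hypothetical.

```python
import math

def psnr(original, reconstructed, max_value=255):
    """Peak signal-to-noise ratio in dB between two equally sized frames.

    Returns infinity when the frames are identical (MSE of zero).
    """
    height = len(original)
    width = len(original[0])
    squared_error = sum(
        (o - r) ** 2
        for row_o, row_r in zip(original, reconstructed)
        for o, r in zip(row_o, row_r)
    )
    mse = squared_error / (width * height)
    if mse == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / mse)

frame = [[100, 100], [100, 100]]
decoded = [[101, 99], [100, 100]]
quality_db = psnr(frame, decoded)
```

Two pixels off by one brightness level in a four-pixel frame give an MSE of 0.5 and a PSNR of a little over 51 dB.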
[0285] 6.3) The encoded video stream will be encapsulated into a specified multimedia container format to form the final video file. Supported standard container formats include MP4, AVI and MOV.
[0286] During encapsulation, the audio stream of the original video and the newly generated marked video stream are multiplexed with timestamp synchronization, ensuring that the audio and video of the output remain fully in sync.
[0287] Through the above technical process, the entire process from locating cultural elements to generating high-quality visual videos is completed. The output target video can be directly used in scenarios such as cultural display, educational dissemination, or commercial applications.
[0288] In this invention, popular international game video content is acquired through a video input interface, and FFmpeg frame segmentation is used to decompose the video into consecutive frame images at a fixed interval of one frame every 2 seconds, constructing a frame image data dictionary. A multimodal large model processes the text input and the video frames separately: the preprocessed text description is input into a text encoder to generate a high-dimensional text feature vector, while the frame images are input into an image encoder to generate visual feature vectors. A frame-level semantic association scoring sequence is constructed by calculating the neural network mapping similarity between the text feature vector and each frame's visual feature vector. The similarity sequence is then post-processed with an exponentially weighted moving average filter to eliminate abrupt changes in the scores, and a dynamic threshold set from the statistical characteristics of the overall score distribution is used to select the set of keyframes containing the target cultural elements. The located keyframes then undergo multi-layer processing. Secondary feature extraction automatically learns and amplifies the text-related feature weights through an attention mechanism, accurately locating target regions in complex game scenes. Pixel-level coordinate information is generated from the attention weights and drives the semantic segmentation module to segment and visually label the cultural elements. All segmented and labeled frames are reassembled according to the temporal sequence and frame rate of the original video, and H.265 / HEVC video encoding is used to generate a target video that retains the original audio-visual synchronization, output in a specified format and resolution. This achieves an organic combination of cross-modal semantic alignment and dynamic threshold filtering. Through attention-driven pixel-level positioning, it solves the problem of accurately identifying and segmenting multi-national cultural elements in complex game scenes, providing a complete technical solution for the automated detection and visualization of cultural elements.
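The smoothing and screening step described above can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the smoothing factor `alpha`, the multiplier `k`, and the specific threshold rule (mean plus `k` standard deviations of the smoothed scores) are assumptions chosen for illustration, since the text only states that the threshold is derived from the statistics of the overall score distribution.

```python
from statistics import mean, pstdev


def ewma(scores, alpha=0.3):
    """Exponentially weighted moving average.

    s_0 = x_0 and s_t = alpha * x_t + (1 - alpha) * s_{t-1}.
    alpha is a hypothetical smoothing factor, not a value from the source.
    """
    smoothed = []
    for x in scores:
        prev = smoothed[-1] if smoothed else x
        smoothed.append(alpha * x + (1 - alpha) * prev)
    return smoothed


def dynamic_threshold(smoothed, k=0.5):
    """Assumed rule: mean + k * population standard deviation."""
    return mean(smoothed) + k * pstdev(smoothed)


def select_keyframes(scores, alpha=0.3, k=0.5):
    """Return indices of frames whose smoothed score exceeds the threshold."""
    smoothed = ewma(scores, alpha)
    threshold = dynamic_threshold(smoothed, k)
    return [i for i, s in enumerate(smoothed) if s > threshold]


if __name__ == "__main__":
    # Frame 2 is an isolated spike (e.g. a flash effect); frames 5 to 8
    # form a sustained run of high similarity.
    raw = [0.1, 0.1, 0.9, 0.1, 0.1, 0.8, 0.85, 0.9, 0.8, 0.1]
    print(select_keyframes(raw))
```

In this toy sequence the isolated spike at frame 2 is damped below the threshold, while the sustained run is retained. Note that the filter introduces a lag of a frame or so at the boundaries of the run, so the choice of `alpha` trades responsiveness against smoothness.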
[0289] According to embodiments of the present invention, the following technical effects are achieved:
[0290] (1) By integrating multimodal large models with video framing technology, a complete cross-modal semantic alignment pipeline is constructed. A dynamic threshold screening mechanism provides stable detection in complex video scenes, realizing the systematic identification and positioning of multicultural elements in international videos and providing a reliable foundation for cross-cultural research.
[0291] (2) In view of the dynamic and ever-changing characteristics of international video scenes, an exponentially weighted moving average filtering algorithm is used to process the similarity sequence, which effectively eliminates false detections caused by scene switching, flashing special effects, and the like. Combined with the dynamic threshold set from the statistical distribution, this significantly improves the precision and recall of cultural element positioning and establishes a precise positioning system adapted to the characteristics of international videos.
[0292] (3) By introducing an attention mechanism and a deep convolutional neural network, the visual features related to cultural elements are automatically learned and amplified, and the target area is accurately locked in the complex game background. Semantic segmentation based on DeepLabv3+ provides contour-level marking and pixel-level positioning and segmentation of cultural elements, which gives a solid foundation for subsequent quantitative analysis.
[0293] (4) By combining time-series frame reassembly technology with modern video coding standards (H.265 / HEVC), a seamless connection from cultural element recognition to marked video generation is achieved, supporting multiple output formats and resolution configurations, ensuring the practicality and adaptability of the technical solution.
[0294] (5) Through end-to-end automated processing, video content that originally required manual screening is transformed into standardized work that can be processed in batches, which greatly improves the efficiency of cultural element positioning and analysis and provides technical support for large-scale cross-cultural research.
[0295] (6) It is applicable to the positioning of cultural elements in various types of international videos, with good scalability and universality, and provides a unified technical framework and methodological support for the analysis of cultural elements in multiple fields and scenarios.
[0296] (7) By organically integrating multimodal understanding, temporal analysis, attention mechanisms, and semantic segmentation, a complete and efficient video cultural element analysis system is constructed, filling the technological gap in the precise cross-modal positioning of cultural elements.
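As a rough illustration of the frame-splitting and re-encoding stages referred to above, the following Python sketch only builds FFmpeg command lines; it does not run them. The file names, the helper names `frame_extraction_cmd` and `hevc_encode_cmd`, and the exact option set are assumptions for illustration and are not taken from the invention. The encoder option `libx265` also assumes an FFmpeg build with x265 support, and the encode command assumes the labeled frame sequence is complete at the stated frame rate.

```python
def frame_extraction_cmd(video_path, out_pattern, interval_s=2.0):
    """Build an FFmpeg command that samples one frame every interval_s seconds.

    The fps filter takes an output rate, so a 2-second interval is fps=0.5.
    """
    rate = 1.0 / interval_s
    return ["ffmpeg", "-i", video_path, "-vf", f"fps={rate:g}", out_pattern]


def hevc_encode_cmd(frame_pattern, source_video, out_path, frame_rate):
    """Build an FFmpeg command that encodes labeled frames as H.265 / HEVC.

    Video comes from the image sequence (input 0); audio, if present, is
    copied from the original video (input 1) so playback stays in sync.
    """
    return [
        "ffmpeg",
        "-framerate", f"{frame_rate:g}", "-i", frame_pattern,
        "-i", source_video,
        "-map", "0:v:0", "-map", "1:a?",   # "?" makes the audio stream optional
        "-c:v", "libx265", "-pix_fmt", "yuv420p",
        "-c:a", "copy",
        "-shortest", out_path,
    ]


if __name__ == "__main__":
    # Hypothetical paths, shown only to illustrate the resulting commands.
    print(" ".join(frame_extraction_cmd("input.mp4", "frames/%06d.png")))
    print(" ".join(hevc_encode_cmd("labeled/%06d.png", "input.mp4",
                                   "target.mp4", 30)))
```

The returned lists could be passed to `subprocess.run` in an environment where FFmpeg is installed. Keeping command construction separate from execution makes the parameters (sampling interval, frame rate, output path) easy to inspect and adjust.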
[0297] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that the present invention is not limited to the described order of actions, because according to the present invention, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily essential to the present invention.
Claims
1. A cross-modal localization method for cultural elements in international videos, characterized in that the method comprises: receiving video-related multimodal input data and preprocessing the data to obtain a core keyword set; inputting the core keyword set into a pre-trained multimodal large model, using the multimodal large model to extract deep textual and visual features from the current video, generating text feature vectors and visual feature vectors of cultural elements, and calculating, for each frame, the neural network mapping similarity between the visual feature vector and the text feature vector of cultural elements, to generate a similarity scoring sequence; performing targeted post-processing on the similarity scoring sequence using an exponentially weighted moving average filtering method to obtain a smooth similarity scoring sequence; and, based on the smooth similarity scoring sequence, selecting frame images with a smooth similarity higher than a preset threshold to form a set of keyframes containing the target cultural elements, thereby completing the cross-modal localization of video cultural elements.
2. The method according to claim 1, characterized in that preprocessing the data to obtain a core keyword set comprises: converting the multimodal input data into text to obtain three types of initial text, derived from text, speech, and image inputs respectively; generating, based on an attention mechanism combined with modal reliability rules, a fused text that is free of redundancy and semantically consistent; segmenting the fused text into words and filtering out meaningless words with a stop word list to obtain an effective word set; and encoding the effective word set, calculating the cosine similarity between each word vector and the document vector, and selecting high-scoring words based on the cosine similarity to form the core keyword set.
3. The method according to claim 1, characterized in that inputting the core keyword set into a pre-trained multimodal large model, using the multimodal large model to extract deep textual and visual features from the current video, generating text feature vectors and visual feature vectors of cultural elements, and calculating the neural network mapping similarity between the visual feature vector and the text feature vector of cultural elements for each frame to generate a similarity scoring sequence comprises: the multimodal large model including a text encoder based on BERT and an image encoder based on ViT; inputting the core keyword set into the BERT-based text encoder and capturing the semantic associations of cultural element keywords through a dynamic attention mechanism to generate a high-dimensional text feature vector; inputting the video frame images into the ViT-based image encoder and capturing the visual features of local cultural elements in the frame images through regional feature encoding to generate high-dimensional visual feature vectors; and, within the feature space, based on a retrieval enhancement method and employing a contrastive learning strategy and a generative alignment loss strategy, calculating the similarity between the visual feature vector of each video frame and the text feature vector of cultural elements via a neural network mapping, generating a similarity scoring sequence that reflects the degree of semantic association between each frame and the cultural elements.
4. The method according to claim 1, characterized in that selecting frame images with a smooth similarity higher than a preset threshold based on the smooth similarity scoring sequence to form a set of keyframes containing the target cultural elements, thereby completing the cross-modal localization of video cultural elements, comprises: establishing a preset similarity threshold for dynamic or static cultural elements; and, based on the preset threshold, selecting frames with a smooth similarity higher than the preset threshold to form the set of keyframes containing the target cultural elements, thereby completing the cross-modal localization of video cultural elements.
5. The method according to claim 1, characterized in that the method further comprises: acquiring popular international videos and analyzing them frame by frame to construct a video cultural element case dataset.
6. The method according to claim 5, characterized in that acquiring popular international videos and analyzing them frame by frame to construct a video cultural element case dataset comprises: acquiring popular international videos, using FFmpeg frame segmentation to break the videos down into independent frame images, and annotating the cultural elements in each frame; and storing the cultural elements and corresponding temporal information for each frame to generate the video cultural element case dataset.
7. The method according to claim 1, characterized in that the method further comprises: visualizing and highlighting the target cultural elements in each keyframe; reassembling the marked keyframes according to the temporal sequence and frame rate of the original video, and using video coding technology to generate a continuous target video containing complete cultural element markings; and preserving the original audio-visual synchronization characteristics in the target video and outputting it as a video file with a specified format and resolution as needed, completing the entire process from locating video cultural elements to visual presentation.
8. The method according to claim 7, characterized in that visualizing and highlighting the target cultural elements in each keyframe comprises: performing multi-level feature extraction on the keyframes to generate high-dimensional feature maps and amplify the feature weights related to cultural elements; retrieving reference information from the video cultural element case dataset, and locating the target area of cultural elements in the complex background of the video based on the feature weights and the reference information; calculating the attention weight of the target region and selecting the region with the highest attention weight; using a contour segmentation method to predict a segmentation mask for the region with the highest attention weight in each keyframe, and then binarizing the segmentation mask; extracting a contour point set from the binarized segmentation mask, the contour point set containing the contour point coordinates; and performing coordinate mapping based on the contour point coordinates to obtain the visual contour markers of the target cultural elements in each keyframe, and highlighting them.
9. The method according to claim 7, characterized in that reassembling the marked keyframes according to the temporal sequence and frame rate of the original video, and using video coding technology to generate a continuous target video containing complete cultural element markings, comprises: for the marked keyframes, extracting the frame rate parameters from the original video metadata, the frame rate parameters including temporal characteristics and playback duration; constructing a timestamp sequence and reassembling the frame sequence according to the frame rate parameter; and compressing the reassembled frame sequence using video coding technology to generate a continuous target video containing complete cultural element markings.
10. The method according to claim 9, characterized in that the timestamp sequence is determined by the following formula: $t_i = i / f$, where $t_i$ is the timestamp, $i$ is the frame number, and $f$ is the frame rate parameter.
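The timestamp construction in claims 9 and 10 can be sketched in Python as follows. This is an illustrative sketch only: it assumes that the timestamp of a frame equals its frame number divided by the frame rate, that frame numbers start at 0, and that timestamps are expressed in seconds. The function name `timestamp_sequence` is hypothetical.

```python
from fractions import Fraction


def timestamp_sequence(num_frames, frame_rate):
    """Return timestamps in seconds for frames 0 .. num_frames - 1.

    Assumes timestamp = frame number / frame rate. frame_rate may be an
    int, a float, or a rational string such as "30000/1001"; Fraction
    keeps the division exact until the final conversion to float.
    """
    fps = Fraction(frame_rate)
    if fps <= 0:
        raise ValueError("frame_rate must be positive")
    return [float(Fraction(i) / fps) for i in range(num_frames)]


if __name__ == "__main__":
    print(timestamp_sequence(4, 25))
    print(timestamp_sequence(3, "30000/1001"))
```

Computing each timestamp directly from the frame number, rather than accumulating a per-frame increment, avoids rounding drift over long videos, which matters when the reassembled frames must stay aligned with the original audio track.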