Adaptive video key frame screening method and system based on wavelet decoupling
By extracting robust semantic boundaries with wavelet decoupling and adaptively allocating the frame budget across segments, the method addresses inaccurate keyframe selection and wasted frame budget in long video understanding, thereby improving the video understanding performance and computational efficiency of large vision-language models.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- XIAMEN UNIV
- Filing Date
- 2026-02-06
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies suffer from semantic structure fragmentation, sensitivity to relevance scoring noise, lack of adaptive frame budget scheduling and diversity-driven approaches in long video understanding, leading to inaccurate keyframe selection and resource waste, which affects the video understanding performance of large visual-language models (LVLMs).
An adaptive video keyframe selection method based on wavelet decoupling is adopted. A one-dimensional temporal signal of video-text relevance is constructed; robust semantic boundaries are extracted from it through the discrete wavelet transform, dividing the video into temporally continuous semantic segments; the number of frames is adaptively allocated to each segment based on a segment importance function; and keyframes are selected within each segment using a diversity-driven strategy.
It improves the accuracy and stability of keyframe selection, preserves the video narrative structure, reduces noise interference, optimizes frame resource allocation, and enhances the long video understanding performance and computational efficiency of LVLM.
Figure CN122244747A_ABST
Description
Technical Field
[0001] This invention belongs to the field of computer vision and multimedia information processing, and specifically relates to an adaptive video keyframe selection method and system based on wavelet decoupling for long video understanding. It is particularly suitable for efficient compressed representation and keyframe selection of long videos ranging from minutes to hours when the large vision-language model (LVLM) has a limited context window.
Background Technology
[0002] With the rapid development of large-scale visual-language models (LVLMs), the overall capabilities of multimodal large models in tasks such as video understanding, video question answering, retrieval, and summarization are constantly improving. However, most LVLMs are designed with fixed-length context windows, which limit the number of image frames they can receive. At the same time, video decoding and feature extraction themselves also bring high computational and memory overhead.
[0003] Against this backdrop, "long video compression representation" and "keyframe selection" have gradually become crucial bridges connecting real-world long videos and LVLM. Existing keyframe selection technologies can be broadly categorized into two types: one is training-based keyframe selection methods, which learn "frame selection strategies" or "information scoring models" by constructing specialized neural networks or reinforcement learning models, generally requiring a large amount of labeled data or a complex training process; the other is training-free keyframe selection methods, which do not require additional training of new networks, but directly utilize the output of pre-trained visual / multimodal models for selection, such as video text relevance scores and attention weights.
[0004] Although existing technologies have addressed the frame reduction problem to some extent, the following significant drawbacks remain: (1) Fragmented semantic structure and disjointed narrative: Many training-free methods based on relevance scores tend to retain only a small subset of “isolated frames” with the highest scores. These frames are often scattered and fragmented along the timeline, which disrupts the overall narrative structure of the video. In tasks that require understanding the causal relationships of events, operational procedures, and changes in character behavior, key intermediate transition frames are often missed, so that the LVLM sees only a few “static slices” and struggles to reconstruct the complete semantic chain.
[0005] (2) Sensitivity to noise in relevance scoring and unstable boundary recognition: The video-text relevance score sequence usually has obvious non-stationary characteristics and contains a large amount of high-frequency noise. This noise comes from the uncertainty of the scoring model's own predictions and from slight visual disturbances such as viewpoint jitter, lighting changes, and background motion. Existing technologies often apply threshold judgment or simple first-order-difference peak detection directly to the original score sequence, so that true semantic boundaries (such as real scene transitions or topic change points) are easily submerged by noise or falsely triggered by it. The result is offset boundary positions, too many or too few detected boundaries, and misselected keyframes.
[0006] (3) Lack of adaptive frame budget scheduling based on content complexity: Traditional uniform sampling or fixed-step sampling schemes cannot differentiate based on the content complexity, information density, and relevance to the query of different time segments. For segments with extremely dense information, a fixed sampling density is often insufficient to capture all key transitions and subtle changes; for segments that are static for a long time or have unimportant background, valuable frame budgets are wasted. Even if some methods introduce simple weighting or threshold adjustment, they still lack an importance assessment mechanism that comprehensively models the importance from four dimensions: segment duration + average relevance + peak relevance + richness of internal variation.
[0007] (4) Lack of a systematic “diversity-driven” local frame selection strategy: Some sorting or Top-K based schemes tend to select a large number of visually similar frames within a local segment, resulting in high redundancy of the final key frame set and limited information increment, making it difficult to cover more complementary details with a limited number of frames.
[0008] The aforementioned problems are further amplified in the long video input scenario of LVLMs: even if the size of the selected keyframe set meets the context window limit, the overall reasoning and understanding performance of the model still decreases significantly if the semantic structure is destroyed, important segments are undersampled, or there is serious inter-frame redundancy.
Summary of the Invention
[0009] The main objective of this invention is to provide an adaptive video keyframe filtering method and system based on wavelet decoupling, which solves the problems existing in the prior art. The entire process does not require additional training of new neural networks and can be used for general keyframe filtering in the front end of LVLM.
[0010] To achieve the above objectives, one solution of the present invention is: An adaptive video keyframe selection method based on wavelet decoupling, comprising: Step 1: Construct a one-dimensional temporal signal of video-text correlation; Step 2: Use the discrete wavelet transform to decompose the temporal signal into multiple levels, retain only the coarsest-scale detail coefficients and reconstruct the signal from them using the inverse transform, and extract robust semantic boundaries from the reconstructed signal; on this basis, divide the video into a series of temporally continuous and semantically coherent segments; Step 3: Construct a segment importance function based on segment duration, average relevance, peak relevance, and score variance, and adaptively allocate the global frame budget to each segment using softmax; Step 4: Within each segment, select frames using a diversity-driven maximal marginal relevance strategy to obtain the set of keyframes.
[0011] Step 1 specifically involves: First, uniform sparse sampling is performed on the input video $V$ to obtain a base frame sequence. Using a pre-trained vision-language model, the image-text matching score between each frame of the base frame sequence and the user query is calculated, forming a one-dimensional temporal correlation signal sequence $S = \{s_t\}_{t=1}^{T}$, where $s_t$ denotes the correlation signal of the $t$-th sampled frame and $T$ denotes the number of sampled frames.
[0012] Preferably, step 2 specifically involves: First, the Daubechies-4 wavelet basis is used to perform an $L$-level discrete wavelet transform on the correlation signal sequence $S$, as shown in the following formula: $\{A_L, D_L, D_{L-1}, \dots, D_1\} = \mathrm{DWT}(S)$; where $\mathrm{DWT}(\cdot)$ denotes the discrete wavelet transform operation; $A_L$ denotes the approximation coefficients; $D_l$ denotes the detail coefficients of the $l$-th level. Then, all detail coefficients except the coarsest-scale detail coefficients $D_L$ are set to zero, and the signal is reconstructed using the inverse wavelet transform, as shown in the following formula: $\tilde{S} = \mathrm{IDWT}(D_L)$; where $\tilde{S}$ denotes the reconstructed correlation signal sequence; $\mathrm{IDWT}(\cdot)$ denotes the inverse wavelet transform operation. Finally, the semantic change intensity signal $E = |\tilde{S}|$ is obtained by taking the absolute value of $\tilde{S}$; a peak detection algorithm is applied to identify its local maxima, and these local maxima correspond to the semantic transition moments of the input video $V$ and constitute the semantic boundary set $\mathcal{B}$. Then, based on the semantic boundaries $\mathcal{B}$, the input video $V$ is divided into $M+1$ temporally continuous semantic segments: $V = \{C_1, C_2, \dots, C_{M+1}\}$; where $C_k$ denotes a semantic segment; $M$ denotes the number of semantic boundaries.
[0013] Preferably, step 3 specifically involves: First, the importance score $I_k$ of each semantic segment $C_k$ is calculated, as shown in the following formula: $I_k = w_1 \tau_k + w_2 \mu_k + w_3 \rho_k + w_4 \nu_k$; where $\tau_k$ denotes the time span factor, i.e. the ratio of the length of the current semantic segment $C_k$ to the length of the input video $V$; $w_1$ denotes the weighting parameter of the time span factor; $\mu_k$ denotes the average relevance factor, i.e. the average image-text matching score within the segment; $w_2$ denotes the weighting parameter of the average relevance factor; $\rho_k$ denotes the peak relevance factor, i.e. the highest image-text matching score within the segment; $w_3$ denotes the weighting parameter of the peak relevance factor; $\nu_k$ denotes the diversity factor, i.e. the ratio of the score variance within the segment to the global score variance; $w_4$ denotes the weighting parameter of the diversity factor. Then, the importance scores of all semantic segments are normalized with the softmax function, and the input frame budget $B$ of the input video $V$ is distributed proportionally to each semantic segment to obtain the frame sampling quota $n_k$ of each semantic segment $C_k$, as shown in the following formula: $n_k = B \cdot \frac{\exp(I_k)}{\sum_{C_j \in \mathcal{C}} \exp(I_j)}$; where $\exp(\cdot)$ denotes the exponential function; $j$ denotes the index (traversal variable) over the semantic segments in the set $\mathcal{C}$; $\mathcal{C}$ denotes the set of all semantic segments; $C_j$ denotes the $j$-th semantic segment in the set $\mathcal{C}$.
[0014] Preferably, step 4 specifically involves: First, all frames in the semantic segment $C_k$ are sorted by their image-text relevance scores, and the frame with the highest score is added to the keyframe set $K_k$ of that semantic segment as the anchor frame. For the remaining frames, $n_k - 1$ keyframes are selected iteratively, with the next frame chosen at each step according to the maximal marginal relevance criterion, as shown in the following formula: $f^{*} = \arg\max_{i \notin K_k} \left[ \lambda s_i - (1-\lambda) \max_{j \in K_k} \cos(v_i, v_j) \right]$; where $f^{*}$ denotes the keyframe to be selected next; $\lambda$ denotes the balance coefficient that controls the trade-off between relevance and diversity; $s_i$ denotes the image-text relevance score of the $i$-th frame; $j$ denotes an index not equal to $i$; $\cos(\cdot,\cdot)$ denotes the cosine similarity; $v_i$ and $v_j$ denote the visual feature vectors of the $i$-th and $j$-th frames, respectively. Then, the keyframe sets of all semantic segments are aggregated to obtain the global keyframe set $K = \bigcup_k K_k$. Finally, the global keyframe set $K$ is input to the LVLM together with the user's query text, and downstream tasks are executed.
[0015] The second solution of the present invention is: An adaptive video keyframe selection system based on wavelet decoupling is characterized by comprising a temporal correlation signal construction module, a semantic boundary detection module, an adaptive sampling quota allocation module, and a diversity-aware frame selection module, and is configured to perform steps 1 to 4 of the adaptive video keyframe selection method based on wavelet decoupling.
[0016] After adopting the above technical solution, the present invention has the following technical effects: (1) This method exhibits extremely high stability when processing low-quality videos with drastic lighting changes or containing a large number of meaningless transition shots.
[0017] (2) Under a fixed frame count budget, the present invention can better preserve the narrative structure and key turning points of the video; in long video understanding benchmark tests such as VideoMME and MLVU, the present invention has higher accuracy than the representative baseline method under the same frame count budget; when dealing with long-term span reasoning and event causal analysis problems, it significantly alleviates the large model “forgetting” and misjudgment caused by long video input.
[0018] (3) In terms of computational efficiency and applicability, this invention adopts a training-free architecture, which is entirely based on mathematical transformations and statistical rules, without the need for expensive model training processes. The computational complexity of wavelet transform and allocation algorithms is extremely low (negligible compared to video decoding and large model inference), supporting real-time processing of hour-long videos on a single consumer-grade graphics card. The system is not sensitive to specific LVLM categories and can be "plug-and-play" adapted to various mainstream multimodal large models such as Qwen-VL and InternVL, significantly reducing the development and deployment threshold of long video analysis systems.
Attached Figure Description
[0019] Figure 1 is a flowchart of a specific embodiment of the present invention.
Detailed Implementation
[0020] To further explain the technical solution of the present invention, the present invention will be described in detail below through specific embodiments.
[0021] The overall idea of this invention is to transform the keyframe selection problem into a multi-scale analysis and singularity detection problem of a one-dimensional non-stationary signal. Specifically: First, a one-dimensional temporal signal of video-text correlation is constructed. Second, the discrete wavelet transform (DWT) is used to decompose the temporal signal into multiple layers, retaining only the coarsest-scale detail coefficients and reconstructing them using inverse transform, thereby extracting robust semantic boundaries. Based on this, the video is divided into a series of temporally continuous and semantically coherent segments.
[0022] Then, a segment importance function is constructed based on segment duration, average relevance, peak relevance, and score variance, and the global frame budget is adaptively allocated to each segment using softmax.
[0023] Finally, a maximum marginal relevance (MMR) diversity-driven strategy is used to select frames within each segment, resulting in a set of keyframes that retains the overall narrative structure while having high semantic coverage and diversity.
[0024] On this basis, as shown in Figure 1, this invention discloses an adaptive video keyframe selection method based on wavelet decoupling, comprising: Step 1: This step acts as the signal source and is responsible for converting the video frame sequence into an analyzable temporal correlation signal. Specifically, uniform sparse sampling (e.g., 1 FPS) is first performed on the input video $V$ to obtain a base frame sequence. Using a pre-trained vision-language model (such as the ITM head of BLIP-2), the image-text matching score between each frame of the base frame sequence and the user query is calculated, forming a one-dimensional temporal correlation signal sequence $S = \{s_t\}_{t=1}^{T}$, where $s_t$ denotes the correlation signal of the $t$-th sampled frame and $T$ denotes the number of sampled frames.
[0025] In the system, this step is performed by the time-series correlation signal construction module, which lays the foundation for subsequent signal processing.
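Step 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: `sample_indices`, `relevance_signal`, and `toy_score` are hypothetical names, and `toy_score` is a string-matching stand-in for a real image-text matching head (frames are represented by captions so the example runs without any model).

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def sample_indices(num_frames: int, native_fps: float, target_fps: float = 1.0) -> List[int]:
    """Return the indices of frames kept by uniform sparse sampling."""
    step = max(1, round(native_fps / target_fps))
    return list(range(0, num_frames, step))


def relevance_signal(
    frames: Sequence[Frame],
    query: str,
    score_fn: Callable[[Frame, str], float],
) -> List[float]:
    """Score every sampled frame against the query, in temporal order."""
    return [float(score_fn(frame, query)) for frame in frames]


def toy_score(frame: str, query: str) -> float:
    """Hypothetical scorer: 1.0 if the query word appears in the caption."""
    return 1.0 if query in frame else 0.0


indices = sample_indices(num_frames=90, native_fps=30.0)  # one frame per second
captions = ["a dog runs", "a dog sleeps", "a cat sits"]
signal = relevance_signal(captions, "dog", toy_score)
```

In a real pipeline `score_fn` would wrap the chosen vision-language model; the rest of the method only needs the resulting list of scores.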
[0026] Step 2: This step is the core of semantic analysis and is responsible for extracting clear semantic transition features from the heavily noisy raw signal. Semantic transitions in a video (such as scene switches and topic changes) appear in the correlation signal as smooth changes in the mid-to-low frequency range, while noise (model uncertainty, visual jitter) appears as high-frequency perturbations. This invention therefore treats the correlation signal sequence $S$ as a one-dimensional non-stationary signal and uses the multi-scale analysis capability of the discrete wavelet transform to decouple its frequency components, thereby suppressing high-frequency noise and separating out large-scale semantic change features. Specifically, the Daubechies-4 (db4) wavelet basis is first used to perform an $L$-level discrete wavelet transform (DWT) on the correlation signal sequence $S$, as shown in the following formula: $\{A_L, D_L, D_{L-1}, \dots, D_1\} = \mathrm{DWT}(S)$; where $\mathrm{DWT}(\cdot)$ denotes the discrete wavelet transform operation; $A_L$ denotes the approximation coefficients (low-frequency trend); $D_l$ denotes the detail coefficients of the $l$-th level (variation characteristics at different scales); and the value of $L$ (the number of decomposition levels) is set adaptively according to the length of the input video $V$.
[0027] Then, all detail coefficients except the coarsest-scale detail coefficients $D_L$, which correspond to semantic transitions at larger time scales, are set to zero, and the signal is reconstructed using the inverse discrete wavelet transform (IDWT), as shown in the following formula: $\tilde{S} = \mathrm{IDWT}(D_L)$; where $\tilde{S}$ denotes the reconstructed correlation signal sequence and $\mathrm{IDWT}(\cdot)$ denotes the inverse wavelet transform operation. The reconstructed signal retains semantic-level variation features while naturally suppressing high-frequency noise and local random jitter.
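The decompose, zero, and reconstruct idea can be illustrated with a dependency-free sketch. Two assumptions are made here and should be read as such: the Haar wavelet is swapped in for db4 so that the transform fits in a few lines of plain Python (a wavelet library would be used for db4 in practice), and the approximation band is zeroed together with the finer detail bands, following the statement that only the coarsest-scale detail coefficients are retained. The function names are hypothetical.

```python
import math
from typing import List, Tuple


def haar_step(x: List[float]) -> Tuple[List[float], List[float]]:
    """One analysis level: scaled pairwise sums (approximation) and differences (detail)."""
    r = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / r for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / r for i in range(0, len(x), 2)]
    return approx, detail


def haar_inverse_step(approx: List[float], detail: List[float]) -> List[float]:
    """One synthesis level: exact inverse of haar_step."""
    r = math.sqrt(2.0)
    out: List[float] = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / r, (a - d) / r])
    return out


def coarsest_detail_reconstruction(signal: List[float], levels: int) -> List[float]:
    """Decompose `levels` times, keep only the deepest detail band, reconstruct."""
    block = 2 ** levels
    padded_len = -(-len(signal) // block) * block  # round up to a multiple of block
    x = list(signal) + [signal[-1]] * (padded_len - len(signal))  # edge padding
    approx = x
    detail: List[float] = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
    # Assumption: the approximation band is zeroed along with the finer details.
    rec = haar_inverse_step([0.0] * len(detail), detail)
    for _ in range(levels - 1):
        rec = haar_inverse_step(rec, [0.0] * len(rec))
    return rec[: len(signal)]


signal = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
reconstructed = coarsest_detail_reconstruction(signal, levels=2)
intensity = [abs(v) for v in reconstructed]
```

For this step-shaped input the reconstructed signal is non-zero only around the jump, while a constant input reconstructs to all zeros, which is the behaviour the paragraph above relies on. Haar output is blocky; a smoother basis such as db4 gives better-localized peaks.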
[0028] Finally, the semantic change intensity signal $E = |\tilde{S}|$ is obtained by taking the absolute value of $\tilde{S}$; a peak detection algorithm is applied to identify its local maxima, and these local maxima correspond to the semantic transition moments of the input video $V$ and constitute the semantic boundary set $\mathcal{B}$. Then, based on the semantic boundaries $\mathcal{B}$, the input video $V$ is divided into $M+1$ temporally continuous semantic segments: $V = \{C_1, C_2, \dots, C_{M+1}\}$; where $C_k$ denotes a semantic segment; $M$ denotes the number of semantic boundaries.
[0029] In the system, this step is performed by the semantic boundary detection module.
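The boundary extraction and segmentation at the end of Step 2 might look like the sketch below. The source does not name a specific peak detection algorithm, so a plain local-maximum test is assumed; `min_height` is a hypothetical threshold added to ignore negligible peaks, and segments are represented as half-open index ranges over the sampled frames.

```python
from typing import List, Tuple


def local_maxima(intensity: List[float], min_height: float = 0.0) -> List[int]:
    """Indices above min_height that exceed the left neighbour and are not below the right one."""
    peaks: List[int] = []
    for t in range(1, len(intensity) - 1):
        if intensity[t] > min_height and intensity[t - 1] < intensity[t] >= intensity[t + 1]:
            peaks.append(t)
    return peaks


def split_segments(num_frames: int, boundaries: List[int]) -> List[Tuple[int, int]]:
    """Turn M boundary indices into M + 1 half-open [start, end) segments."""
    cuts = [0] + sorted(b for b in set(boundaries) if 0 < b < num_frames) + [num_frames]
    return [(cuts[i], cuts[i + 1]) for i in range(len(cuts) - 1)]


# Toy reconstructed signal; its absolute value is the change intensity.
reconstructed = [0.0, 0.1, -0.6, 0.1, 0.0, 0.2, 0.9, 0.3, 0.0, 0.0]
intensity = [abs(v) for v in reconstructed]
boundaries = local_maxima(intensity, min_height=0.5)
segments = split_segments(len(reconstructed), boundaries)
```

Here two boundaries yield three contiguous segments that together cover every sampled frame, matching the $M$ boundaries to $M+1$ segments relationship in the text.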
[0030] Step 3: This step acts as the resource scheduling center and is responsible for dynamically allocating keyframe selection quotas based on the importance of each semantic segment. Specifically, the importance score $I_k$ of each semantic segment $C_k$ is first calculated, as shown in the following formula: $I_k = w_1 \tau_k + w_2 \mu_k + w_3 \rho_k + w_4 \nu_k$; where $\tau_k$ denotes the time span factor, i.e. the ratio of the length of the current semantic segment $C_k$ to the length of the input video $V$ (the longer the segment, the more potential information it contains); $w_1$ denotes the weighting parameter of the time span factor; $\mu_k$ denotes the average relevance factor, i.e. the average image-text matching score within the segment, which measures the overall fit between the segment and the query; $w_2$ denotes the weighting parameter of the average relevance factor; $\rho_k$ denotes the peak relevance factor, i.e. the highest image-text matching score within the segment, which reflects the importance of the most critical frame in the segment; $w_3$ denotes the weighting parameter of the peak relevance factor; $\nu_k$ denotes the diversity factor, i.e. the ratio of the intra-segment score variance to the global score variance, which reflects the richness of the semantic content within the segment; $w_4$ denotes the weighting parameter of the diversity factor.
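A small sketch of the four factors follows. It assumes the importance score is a weighted sum of the factors, uses population variance for both the segment and the global variance, and defaults all weights to 1.0; none of these choices is confirmed by the source, and `importance_scores` is a hypothetical name.

```python
from statistics import fmean, pvariance
from typing import List, Sequence, Tuple


def importance_scores(
    signal: Sequence[float],
    segments: Sequence[Tuple[int, int]],
    weights: Tuple[float, float, float, float] = (1.0, 1.0, 1.0, 1.0),
) -> List[float]:
    """Weighted sum of span, mean relevance, peak relevance, and variance ratio per segment."""
    w_span, w_mean, w_peak, w_div = weights
    global_var = pvariance(signal)
    scores: List[float] = []
    for start, end in segments:
        seg = signal[start:end]
        span = len(seg) / len(signal)          # time span factor
        mean_rel = fmean(seg)                  # average relevance factor
        peak_rel = max(seg)                    # peak relevance factor
        # Diversity factor: segment variance relative to global variance.
        diversity = pvariance(seg) / global_var if global_var > 0 else 0.0
        scores.append(w_span * span + w_mean * mean_rel + w_peak * peak_rel + w_div * diversity)
    return scores


signal = [0.0, 0.0, 1.0, 1.0]
scores = importance_scores(signal, [(0, 2), (2, 4)])
```

With these inputs both segments have the same span and no internal variance, so the second segment scores higher purely through its mean and peak relevance.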
[0031] Then, the importance scores of all semantic segments are normalized with the softmax function, and the input frame budget $B$ of the input video $V$ is distributed proportionally to each semantic segment to obtain the frame sampling quota $n_k$ of each semantic segment $C_k$, as shown in the following formula: $n_k = B \cdot \frac{\exp(I_k)}{\sum_{C_j \in \mathcal{C}} \exp(I_j)}$; where $\exp(\cdot)$ denotes the exponential function; $j$ denotes the index (traversal variable) over the semantic segments in the set $\mathcal{C}$; $\mathcal{C}$ denotes the set of all semantic segments; $C_j$ denotes the $j$-th semantic segment in the set $\mathcal{C}$.
[0032] In the system, this step is performed by the adaptive sampling quota allocation module. Through this module, while ensuring that all semantic segments have a certain number of frames covered, more sampling resources are automatically allocated to segments with high information density, high relevance to the query, and rich internal variations.
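The quota allocation can be sketched as follows. The source states only that softmax-normalized importance is used and that every segment keeps some coverage; the one-frame minimum per segment and the largest-remainder rounding used to obtain integer quotas are assumptions of this sketch, and `allocate_budget` is a hypothetical name.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def allocate_budget(scores: List[float], budget: int, min_per_segment: int = 1) -> List[int]:
    """Split `budget` frames across segments in proportion to softmax(scores)."""
    n = len(scores)
    if budget < n * min_per_segment:
        raise ValueError("budget too small to give every segment its minimum")
    spare = budget - n * min_per_segment
    shares = [p * spare for p in softmax(scores)]
    quotas = [min_per_segment + math.floor(s) for s in shares]
    # Largest-remainder rounding: leftover frames go to the biggest fractional parts.
    leftover = max(0, budget - sum(quotas))
    order = sorted(range(n), key=lambda i: shares[i] - math.floor(shares[i]), reverse=True)
    for i in order[:leftover]:
        quotas[i] += 1
    return quotas


quotas = allocate_budget([0.5, 2.5], budget=8)
```

The quotas always sum to the budget, and the higher-scoring segment receives most of the frames while the other still keeps its guaranteed share.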
[0033] Step 4: This step acts as the fine-grained selector and is responsible for balancing relevance and diversity within each semantic segment and selecting the most representative keyframes. After the frame sampling quota $n_k$ of each semantic segment $C_k$ is obtained, fine-grained frame selection is performed inside each semantic segment $C_k$ to balance relevance and diversity and avoid redundant frames. Specifically, all frames in the semantic segment $C_k$ are sorted by their image-text relevance scores, and the frame with the highest score is added to the keyframe set $K_k$ of that semantic segment as the anchor frame. For the remaining frames, $n_k - 1$ keyframes are selected iteratively, with the next frame chosen at each step according to the maximal marginal relevance criterion, as shown in the following formula: $f^{*} = \arg\max_{i \notin K_k} \left[ \lambda s_i - (1-\lambda) \max_{j \in K_k} \cos(v_i, v_j) \right]$; where $f^{*}$ denotes the keyframe to be selected next; $\lambda$ denotes the balance coefficient that controls the trade-off between relevance and diversity; $s_i$ denotes the image-text relevance score of the $i$-th frame; $j$ denotes an index not equal to $i$; $\cos(\cdot,\cdot)$ denotes the cosine similarity; $v_i$ and $v_j$ denote the visual feature vectors of the $i$-th and $j$-th frames, respectively. By penalizing the similarity term, the repeated selection of highly similar frames is effectively suppressed, thereby improving the semantic coverage and diversity of the keyframe set within the segment.
[0034] Then, the keyframe sets of all semantic segments are aggregated to obtain the global keyframe set $K = \bigcup_k K_k$. Finally, the global keyframe set $K$ is input to the LVLM together with the user's query text to perform downstream tasks such as video question answering, event analysis, or summary generation.
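The per-segment selection in Step 4 can be sketched with a standard maximal marginal relevance loop. The function names are hypothetical, the default balance coefficient of 0.5 is an assumed value, and returning the indices in temporal order is an assumption made so that the selected frames can be passed to the model in sequence.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def mmr_select(
    scores: Sequence[float],
    features: Sequence[Sequence[float]],
    quota: int,
    lam: float = 0.5,
) -> List[int]:
    """Pick up to `quota` frame indices: top-scoring anchor first, then MMR picks."""
    if quota <= 0 or not scores:
        return []
    candidates = list(range(len(scores)))
    anchor = max(candidates, key=lambda i: scores[i])
    selected = [anchor]
    candidates.remove(anchor)
    while candidates and len(selected) < quota:
        def mmr(i: int) -> float:
            # Redundancy is the highest similarity to any already selected frame.
            redundancy = max(cosine(features[i], features[j]) for j in selected)
            return lam * scores[i] - (1.0 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)  # assumption: hand frames on in temporal order


scores = [0.9, 0.85, 0.4]
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
picked = mmr_select(scores, features, quota=2, lam=0.5)
```

Frame 1 scores almost as high as the anchor but is visually identical to it, so the similarity penalty makes the sketch prefer frame 2; with `lam=1.0` the penalty vanishes and the selection reduces to plain top-K.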
[0035] The method of the present invention can be implemented by software, hardware or a combination of software and hardware, and can be deployed on servers, cloud platforms or local terminal devices as a video preprocessing module of LVLM system, providing a unified long video keyframe selection capability for different multimodal large models in a plug-and-play manner.
[0036] The main innovations of this invention are concentrated in the following aspects: (1) A wavelet domain semantic boundary detection method for one-dimensional temporal signals of video text correlation: This invention is the first to use discrete wavelet transform to process one-dimensional non-stationary signals composed of video text correlation scores, decomposing the video text correlation signal into sub-bands of different frequencies. By retaining only the coarsest-grained detail coefficients and reconstructing them using inverse wavelet transform, the method successfully separates low-frequency singularities representing semantic abrupt changes from high-frequency noise representing model uncertainty. This method solves the industry problem of distinguishing between "real plot twists" and "random jitter noise" in visual scoring signals from the perspective of signal processing.
[0037] (2) A two-stage hierarchical sampling strategy of "macro-segmentation, micro-optimization": This invention abandons the existing global discrete sampling mode and proposes a hierarchical strategy that preserves both order and quality. At the macro level, the video is divided into continuous segments with independent narrative functions using the aforementioned semantic boundaries, ensuring the causal logic and contextual integrity in the temporal dimension. At the micro level, the maximal marginal relevance (MMR) algorithm is applied within each segment to force the selection of differentiated frames by penalizing visual similarity. This "determine the structure first, then fill in the details" strategy effectively avoids the "fragmented" frame set generated by traditional methods and greatly improves the LVLM's ability to understand the temporal logic of the video.
[0038] (3) Multi-dimensional feature-driven frame budget allocation mechanism: To address the resource waste caused by traditional uniform sampling, this invention constructs a dynamic quota allocation model. This model no longer allocates frames evenly, but instead uses a comprehensive importance function that combines the duration weight, average relevance, peak relevance, and score variance of each semantic segment. Based on softmax normalization, it automatically tilts more sampling quota towards segments with high information density and complex semantics while cutting the quota of low-value redundant segments, thereby making the most of the information captured under a fixed frame budget.
[0039] (4) Training-free long video keyframe screening framework for LVLM: The entire process of this invention relies only on the relevance score, wavelet transform and statistical rules of the pre-trained visual language model. No additional network or information model needs to be trained. It can be used as a unified preprocessing module for LVLM, which significantly reduces the system training and deployment costs and has good cross-model generalization ability.
[0040] By adopting the above technical solution, this system demonstrates significant technical advantages in three dimensions: noise resistance, understanding accuracy, and computational efficiency. (I) In terms of noise resistance and robustness, thanks to the natural suppression of high-frequency disturbances by wavelet decoupling, this method exhibits extremely high stability when processing low-quality videos with drastic lighting changes or containing a large number of meaningless transition shots. Experimental data show that compared with traditional boundary detection algorithms based on gradients or fixed thresholds, this scheme reduces the semantic boundary localization error in complex scenes by about 40%, effectively avoiding erroneous segmentation caused by signal jitter and ensuring the accuracy of keyframe selection.
[0041] (II) Regarding the accuracy of long video understanding, thanks to the hierarchical frame selection strategy of "macro segmentation-micro optimization", this invention can better preserve the narrative structure and key transitions of the video under a fixed frame budget. In long video understanding benchmark tests such as VideoMME and MLVU, this invention improves the overall accuracy by an average of about 4%-9% compared with representative baseline methods under the same frame budget. In particular, in long-span reasoning and event causal analysis problems, the recall rate of key information can exceed 90%, which significantly alleviates the "forgetting" and misjudgment of large models caused by long video input.
[0042] (III) In terms of computational efficiency and applicability, this invention adopts a training-free architecture, based entirely on mathematical transformations and statistical rules, eliminating the need for expensive model training processes. The computational complexity of wavelet transform and allocation algorithms is extremely low (negligible compared to video decoding and large model inference), supporting real-time processing of hour-long videos on a single consumer-grade graphics card. The system is not sensitive to specific LVLM categories and can be "plug-and-play" adapted to various mainstream multimodal large models such as Qwen-VL and InternVL, significantly reducing the development and deployment threshold of long video analysis systems.
[0043] The present invention also discloses an adaptive video keyframe selection system based on wavelet decoupling, including a temporal correlation signal construction module, a semantic boundary detection module, an adaptive sampling quota allocation module, and a diversity-aware frame selection module, and is configured to perform steps 1 to 4 of the above method.
[0044] The above embodiments and figures are not intended to limit the product form and style of the present invention. Any appropriate changes or modifications made by those skilled in the art should be considered as not departing from the patent scope of the present invention.
Claims
1. An adaptive video keyframe selection method based on wavelet decoupling, characterized in that it comprises: Step 1: Construct a one-dimensional temporal signal of video-text correlation; Step 2: Use the discrete wavelet transform to decompose the temporal signal into multiple levels, retain only the coarsest-scale detail coefficients and reconstruct the signal from them using the inverse transform, and extract robust semantic boundaries from the reconstructed signal; on this basis, divide the video into a series of temporally continuous and semantically coherent segments; Step 3: Construct a segment importance function based on segment duration, average relevance, peak relevance, and score variance, and adaptively allocate the global frame budget to each segment using softmax; Step 4: Within each segment, select frames using a diversity-driven maximal marginal relevance strategy to obtain the set of keyframes.
2. The method of claim 1, wherein Step 1 specifically involves: First, uniform sparse sampling is performed on the input video $V$ to obtain a base frame sequence. Using a pre-trained vision-language model, the image-text matching score between each frame of the base frame sequence and the user query is calculated, forming a one-dimensional temporal correlation signal sequence $S = \{s_t\}_{t=1}^{T}$, where $s_t$ denotes the correlation signal of the $t$-th sampled frame and $T$ denotes the number of sampled frames.
3. The method of claim 2, wherein Step 2 specifically involves: First, the Daubechies-4 wavelet basis is used to perform an $L$-level discrete wavelet transform on the correlation signal sequence $S$, as shown in the following formula: $\{A_L, D_L, D_{L-1}, \dots, D_1\} = \mathrm{DWT}(S)$; wherein $\mathrm{DWT}(\cdot)$ denotes the discrete wavelet transform operation; $A_L$ denotes the approximation coefficients; $D_l$ denotes the detail coefficients of the $l$-th level; Then, all detail coefficients except the coarsest-scale detail coefficients $D_L$ are set to zero, and the signal is reconstructed by the inverse wavelet transform, as shown in the following formula: $\tilde{S} = \mathrm{IDWT}(D_L)$; wherein $\tilde{S}$ denotes the reconstructed correlation signal sequence; $\mathrm{IDWT}(\cdot)$ denotes the inverse wavelet transform operation; Finally, the semantic change intensity signal $E = |\tilde{S}|$ is obtained by taking the absolute value of $\tilde{S}$, a peak detection algorithm is applied to identify its local maxima, and these local maxima correspond to the semantic transition moments of the input video $V$ and constitute the semantic boundary set $\mathcal{B}$; then, based on the semantic boundaries $\mathcal{B}$, the input video $V$ is divided into $M+1$ temporally continuous semantic segments: $V = \{C_1, C_2, \dots, C_{M+1}\}$; wherein $C_k$ denotes a semantic segment; $M$ denotes the number of semantic boundaries.
4. The method of claim 3, wherein Step 3 specifically involves: First, the importance score $I_k$ of each semantic segment $C_k$ is calculated, as shown in the following formula: $I_k = w_1 \tau_k + w_2 \mu_k + w_3 \rho_k + w_4 \nu_k$; wherein $\tau_k$ denotes the time span factor, i.e. the ratio of the length of the current semantic segment $C_k$ to the length of the input video $V$; $w_1$ denotes the weighting parameter of the time span factor; $\mu_k$ denotes the average relevance factor, i.e. the average image-text matching score within the segment; $w_2$ denotes the weighting parameter of the average relevance factor; $\rho_k$ denotes the peak relevance factor, i.e. the highest image-text matching score within the segment; $w_3$ denotes the weighting parameter of the peak relevance factor; $\nu_k$ denotes the diversity factor, i.e. the ratio of the score variance within the segment to the global score variance; $w_4$ denotes the weighting parameter of the diversity factor; Then, the importance scores of all semantic segments are normalized with the softmax function, and the input frame budget $B$ of the input video $V$ is distributed proportionally to each semantic segment to obtain the frame sampling quota $n_k$ of each semantic segment $C_k$, as shown in the following formula: $n_k = B \cdot \frac{\exp(I_k)}{\sum_{C_j \in \mathcal{C}} \exp(I_j)}$; wherein $\exp(\cdot)$ denotes the exponential function; $j$ denotes the index (traversal variable) over the semantic segments in the set $\mathcal{C}$; $\mathcal{C}$ denotes the set of all semantic segments; $C_j$ denotes the $j$-th semantic segment in the set $\mathcal{C}$.
5. The adaptive video keyframe selection method based on wavelet decoupling as described in claim 4, characterized in that Step 4 specifically involves: First, all frames in the semantic segment $C_k$ are sorted by their image-text relevance scores, and the frame with the highest score is added to the keyframe set $K_k$ of that semantic segment as the anchor frame; for the remaining frames, $n_k - 1$ keyframes are selected iteratively, with the next frame chosen at each step according to the maximal marginal relevance criterion, as shown in the following formula: $f^{*} = \arg\max_{i \notin K_k} \left[ \lambda s_i - (1-\lambda) \max_{j \in K_k} \cos(v_i, v_j) \right]$; wherein $f^{*}$ denotes the keyframe to be selected next; $\lambda$ denotes the balance coefficient that controls the trade-off between relevance and diversity; $s_i$ denotes the image-text relevance score of the $i$-th frame; $j$ denotes an index not equal to $i$; $\cos(\cdot,\cdot)$ denotes the cosine similarity; $v_i$ and $v_j$ denote the visual feature vectors of the $i$-th and $j$-th frames, respectively; Then, the keyframe sets of all semantic segments are aggregated to obtain the global keyframe set $K = \bigcup_k K_k$; finally, the global keyframe set $K$ is input to the LVLM together with the user's query text, and downstream tasks are executed.
6. An adaptive video keyframe filtering system based on wavelet decoupling, characterized in that, It includes a temporal correlation signal construction module, a semantic boundary detection module, an adaptive sampling quota allocation module, and a diversity-aware frame selection module, and is configured to perform steps 1 to 4 of the wavelet decoupling-based adaptive video keyframe selection method as described in any one of claims 1 to 5.