Video processing method and related apparatus

By identifying keyframes and removing redundant feature segments in a large video language model, the problem of low efficiency in long video processing is solved, and efficient long video processing is achieved.

WO2026123773A1 · PCT designated stage · Publication Date: 2026-06-18 · HUAWEI TECH CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: HUAWEI TECH CO LTD
Filing Date: 2025-08-22
Publication Date: 2026-06-18

AI Technical Summary

Technical Problem

Existing video language models require processing a large number of feature segments when handling long videos, resulting in low processing efficiency and an inability to effectively handle videos longer than several minutes.

Method used

Keyframes are identified in the video, and some feature segments are removed based on the keyframes while the feature segments corresponding to the keyframes are retained. The removal is refined with techniques such as time windowing and attention scores to reduce redundant content.

Benefits of technology

The method significantly improves the processing efficiency of long videos and reduces memory usage while maintaining video processing accuracy, so that long videos can be processed effectively.


Figure CN2025116429_18062026_PF_FP_ABST

Abstract

A video processing method is provided, which is used in the technical field of artificial intelligence (AI). In the video processing method, a key frame is first determined from among a plurality of image frames included in a target video. After the target video is processed by means of a target model and a group of feature segments corresponding to each image frame is obtained, some feature segments are selected for removal on the basis of the key frame, such that the feature segments corresponding to the key frame are retained, thus achieving feature reduction. In this way, the feature segments corresponding to a key frame, which contains key content of the video, are retained and some other feature segments are removed, such that redundant content in the video that needs to be processed can be significantly reduced while the video processing precision is ensured, thereby improving the video processing efficiency and ensuring that a long video can be effectively processed.

Description

A video processing method and related apparatus

[0001] This application claims priority to Chinese Patent Application No. 202411807724.7, filed with the State Intellectual Property Office of China on December 9, 2024, entitled “A Video Processing Method and Related Apparatus”, the entire contents of which are incorporated herein by reference.

Technical Field

[0002] This application relates to the field of artificial intelligence (AI) technology, and in particular to a video processing method and related apparatus.

Background Technology

[0003] Large Language Models (LLMs) are deep learning models trained on massive amounts of text data. They can not only generate natural language text, but also deeply understand the meaning of text and handle various natural language tasks, such as text summarization, intelligent question answering, and text translation.

[0004] With the development of artificial intelligence technology, video large language models (VLMs) have made significant progress in video understanding. A VLM performs multimodal tasks by fusing a visual encoder with a large language model. However, current VLMs suffer from a significant technical bottleneck: they require a large number of feature segments to represent a single image. A VLM therefore often needs to process a very large number of feature segments when processing a video, which makes processing long videos of more than a few minutes highly challenging. Specifically, due to accuracy limitations, a VLM can typically process only short video clips containing around two hundred images (approximately four minutes) and cannot effectively handle long videos.

[0005] Therefore, there is an urgent need for an optimization scheme for video large language models so that they can effectively process long videos.

Summary of the Invention

[0006] This application provides a video processing method and related apparatus, which can improve video processing efficiency.

[0007] In a first aspect, a video processing method is provided, comprising: an execution device first acquires a target video to be processed, the target video including multiple image frames; the execution device then determines keyframes among the multiple image frames, wherein a keyframe is an image frame containing content of interest to the user. Generally, a keyframe differs significantly from its adjacent image frames, that is, the object changes considerably from the adjacent image frame to the keyframe; keyframes can therefore effectively represent scenes with significant changes in the target video. The execution device may determine one or more keyframes.

[0008] Secondly, the execution device processes the target video through the target model to obtain multiple sets of feature segments corresponding to multiple image frames, and one set of feature segments in the multiple sets of feature segments corresponds to one image frame.

[0009] Finally, the execution device removes some feature segments from multiple sets of feature segments based on the keyframes, obtaining the retained feature segments. The retained feature segments include the target feature segments corresponding to the keyframes in the multiple sets of feature segments, and these retained feature segments are used as features of the target video to perform video processing tasks. Where the execution device determines that there are multiple keyframes, the target feature segments can be feature segments corresponding to all keyframes, or feature segments corresponding to only some keyframes.

[0010] In this scheme, keyframes are first identified among the multiple image frames included in the target video. Then, after the target video is processed using the target model to obtain a set of feature segments corresponding to each image frame, some feature segments are selectively removed based on the keyframes such that the feature segments corresponding to the keyframes are retained, thus achieving feature reduction. In this way, by retaining the feature segments corresponding to keyframes, which contain the key content of the video, and removing some of the other feature segments, the scheme can significantly reduce redundant content in the video while maintaining processing accuracy, improving processing efficiency and ensuring effective processing of long videos.
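The overall flow of this scheme can be illustrated with a minimal sketch. The function name `reduce_features`, the dictionary layout, and the rule used for non-keyframes (keeping the first few segments) are assumptions made for illustration only; the application leaves the removal criterion to the implementations described later.

```python
from typing import Dict, List, Set

Segment = List[float]  # one feature segment, e.g. the embedding of one image block


def reduce_features(
    frame_features: Dict[int, List[Segment]],
    keyframes: Set[int],
    keep_per_other_frame: int,
) -> Dict[int, List[Segment]]:
    """Keep every segment of a keyframe; keep only a few segments of other frames."""
    retained: Dict[int, List[Segment]] = {}
    for frame_idx, segments in frame_features.items():
        if frame_idx in keyframes:
            # Target feature segments: always retained.
            retained[frame_idx] = list(segments)
        else:
            # Placeholder rule (assumption): keep the first few segments only.
            retained[frame_idx] = list(segments[:keep_per_other_frame])
    return retained


# Three frames with three feature segments each; frame 1 is treated as the keyframe.
features = {
    0: [[0.1], [0.2], [0.3]],
    1: [[0.4], [0.5], [0.6]],
    2: [[0.7], [0.8], [0.9]],
}
retained = reduce_features(features, keyframes={1}, keep_per_other_frame=1)
total_before = sum(len(s) for s in features.values())
total_after = sum(len(s) for s in retained.values())
print(total_before, total_after)
```

In this toy input the keyframe keeps all three of its segments while each other frame keeps one, so the model would see five segments instead of nine.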

[0011] In one possible implementation, the execution device removes some feature segments from the candidate feature segments to obtain the retained feature segments; wherein the candidate feature segments include the feature segments, among the multiple sets of feature segments, other than the target feature segments.

[0012] For example, based on the keyframes, the execution device first determines candidate feature segments from the multiple sets of feature segments. These candidate feature segments do not include the target feature segments; that is, all or some of the feature segments corresponding to the keyframes are excluded from the candidate feature segments. Then, the execution device removes some feature segments from the candidate feature segments to obtain the retained feature segments.

[0013] In this scheme, candidate feature segments that do not include target feature segments are determined based on key frames, and some feature segments are selected from the candidate feature segments to be removed. This ensures that the target feature segments corresponding to the key frames are necessarily retained, thereby retaining feature segments containing key information and removing feature segments containing redundant information. This effectively reduces the number of feature segments in the video while ensuring the accuracy of video processing.
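The separation into target and candidate feature segments can be sketched as follows. Identifying a segment by a `(frame index, image-block index)` pair and the helper name `split_segments` are assumptions for illustration; here every segment of a keyframe is treated as a target segment.

```python
from typing import List, Set, Tuple

SegmentId = Tuple[int, int]  # (frame index, image-block index), hypothetical identifier


def split_segments(
    num_frames: int, segments_per_frame: int, keyframes: Set[int]
) -> Tuple[List[SegmentId], List[SegmentId]]:
    """Return (target segments, candidate segments).

    Target segments belong to keyframes and are never offered for removal;
    only candidate segments may be removed later.
    """
    target: List[SegmentId] = []
    candidates: List[SegmentId] = []
    for frame in range(num_frames):
        for block in range(segments_per_frame):
            if frame in keyframes:
                target.append((frame, block))
            else:
                candidates.append((frame, block))
    return target, candidates


target, candidates = split_segments(num_frames=3, segments_per_frame=2, keyframes={1})
print(target)      # segments of the keyframe
print(candidates)  # segments that may be removed
```

Because removal only ever operates on `candidates`, the target segments are retained by construction, whatever removal rule is applied afterwards.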

[0014] In one possible implementation, when performing feature segment removal, the execution device first divides the candidate feature segments according to time windows to obtain multiple batches of candidate feature segments. That is, the target video can be divided into multiple time windows, and the feature segments corresponding to all image frames within a time window constitute a batch of candidate feature segments.

[0015] In this way, the execution device selects a portion of the feature segments from each batch of feature segments to remove, resulting in the retained feature segments. That is, for each batch of feature segments corresponding to each time window, the execution device will perform feature segment removal within that batch, so that each batch of feature segments corresponding to different time windows will have a portion of feature segments removed and a portion of feature segments retained.

[0016] In this scheme, feature segments are divided into batches based on the time windows corresponding to the feature segments, and some feature segments are selected and removed in each batch. This ensures that the feature segments corresponding to the image frames in each time period of the video are retained, resulting in a wider distribution of the final retained feature segments over time, avoiding the omission of important information, and helping to ensure the accuracy of video processing.
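A minimal sketch of this batch-wise removal is given below. The per-segment `scores`, the window size `frames_per_window`, and the `keep_ratio` parameter are hypothetical; the point of the sketch is only that selection happens inside each time window rather than globally.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

SegmentId = Tuple[int, int]  # (frame index, image-block index), hypothetical identifier


def prune_per_window(
    candidates: List[SegmentId],
    scores: Dict[SegmentId, float],
    frames_per_window: int,
    keep_ratio: float,
) -> List[SegmentId]:
    """Group candidates into time-window batches and prune inside each batch."""
    batches: Dict[int, List[SegmentId]] = defaultdict(list)
    for seg in candidates:
        batches[seg[0] // frames_per_window].append(seg)

    kept: List[SegmentId] = []
    for window in sorted(batches):
        batch = batches[window]
        n_keep = max(1, math.ceil(len(batch) * keep_ratio))
        best = sorted(batch, key=lambda s: scores[s], reverse=True)[:n_keep]
        kept.extend(sorted(best))  # restore temporal order within the window
    return kept


# Frames 0-1 form window 0, frames 2-3 form window 1; window 0 has the higher scores.
scores = {
    (0, 0): 0.9, (0, 1): 0.8, (1, 0): 0.7, (1, 1): 0.6,
    (2, 0): 0.4, (2, 1): 0.3, (3, 0): 0.2, (3, 1): 0.1,
}
kept = prune_per_window(list(scores), scores, frames_per_window=2, keep_ratio=0.5)
print(kept)
```

A global top-4 selection on the same scores would keep only segments from frames 0 and 1; pruning per window keeps two segments from each window, so the later part of the video stays represented.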

[0017] In one possible implementation, when processing the target video using the target model, the execution device can divide the target video into multiple sub-videos, each of which includes a portion of image frames from multiple image frames. That is, the multiple sub-videos are divided along a time dimension, each sub-video includes multiple adjacent image frames, and different sub-videos do not contain duplicate image frames.

[0018] Then, the execution device processes multiple sub-videos through the target model to obtain multiple sets of feature segments corresponding to each sub-video.

[0019] In this way, when removing feature segments, the execution device works on a per-sub-video basis: for each sub-video, it removes a portion of the feature segments from the multiple sets of feature segments corresponding to that sub-video.

[0020] In other words, the execution device actually cuts the target video into multiple sub-videos, extracting and removing feature segments from only one sub-video at a time. Furthermore, the remaining feature segments for each sub-video are saved. Thus, after processing all the sub-videos, the execution device can obtain the retained feature segments for the target video by combining the remaining feature segments from all the sub-videos.

[0021] In this solution, by processing long target videos in blocks independently, the peak memory usage of the execution device when running the target model to process the target video can be effectively reduced, ensuring that the memory on the execution device can effectively handle the processing of the target video and other tasks, and avoiding affecting the normal operation of the execution device.
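The block-wise processing described above can be sketched as a simple loop. The `encode` and `prune` callables stand in for the target model and the removal step; they, and the name `process_in_sub_videos`, are placeholders introduced for illustration.

```python
from typing import Callable, List, Sequence


def process_in_sub_videos(
    frames: Sequence,
    frames_per_sub_video: int,
    encode: Callable[[Sequence], List],
    prune: Callable[[List], List],
) -> List:
    """Encode and prune one sub-video at a time, accumulating only what is kept."""
    retained: List = []
    for start in range(0, len(frames), frames_per_sub_video):
        sub_video = frames[start:start + frames_per_sub_video]
        features = encode(sub_video)       # features of this sub-video only
        retained.extend(prune(features))   # save the remaining segments
    return retained


# Hypothetical stand-ins: 4 feature segments per frame, keep every second segment.
def encode(sub_video: Sequence) -> List:
    return [(frame, block) for frame in sub_video for block in range(4)]


def prune(features: List) -> List:
    return features[::2]


frames = list(range(10))
retained = process_in_sub_videos(frames, 3, encode, prune)
print(len(retained))
```

With these stand-ins, at most 12 unpruned feature segments (3 frames of 4 segments) exist at any moment, instead of the 40 that encoding all ten frames at once would produce.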

[0022] In one possible implementation, each set of feature segments in the multiple sets of feature segments includes multiple feature segments, and one feature segment corresponds to an image block in an image frame.

[0023] In the process of removing some feature segments from the candidate feature segments, the execution device first obtains the attention score of each feature segment in the candidate feature segments; and then, based on the attention score of each feature segment, the execution device removes some feature segments from the candidate feature segments.

[0024] In this scheme, feature segment removal is performed based on the attention score of the feature segment. This enables the removal of feature segments with low importance based on the attention score originally calculated by the model, ensuring that feature segments containing redundant information are removed and reducing the number of features that the model needs to process.
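One way to turn model attention into a per-segment score is sketched below. Averaging the attention weight each segment receives over all queries is an assumption made for this sketch; the application states only that the attention scores computed by the model are used. The function names are hypothetical.

```python
from typing import List


def attention_received(attn: List[List[float]]) -> List[float]:
    """Mean attention weight each key position receives over all queries.

    attn[q][k] is the attention weight that query q assigns to segment k.
    """
    n_queries = len(attn)
    n_keys = len(attn[0])
    return [sum(row[k] for row in attn) / n_queries for k in range(n_keys)]


def keep_top_segments(scores: List[float], n_keep: int) -> List[int]:
    """Indices of the n_keep highest-scoring segments, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Three queries attending over three candidate feature segments (rows sum to 1).
attn = [
    [0.7, 0.1, 0.2],
    [0.6, 0.1, 0.3],
    [0.5, 0.2, 0.3],
]
scores = attention_received(attn)
kept = keep_top_segments(scores, n_keep=2)
print(kept)
```

Segment 1 receives the least attention in this toy matrix, so it is the one removed.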

[0025] In one possible implementation, when the execution device determines keyframes among multiple image frames, it can first divide the multiple image frames into multiple time windows according to chronological order. Each time window includes at least two adjacent image frames. Furthermore, different time windows include non-repeating image frames. Then, the execution device determines the keyframes based on the similarity between the image frames in each time window and their adjacent image frames.

[0026] In this scheme, by dividing the target video into time windows and determining the final keyframes based on the multiple image frames included in each time window, it is possible to identify the image frames with the greatest local differences in the target video as keyframes. This makes the selected keyframes more consistent with the characteristics of human video perception and ensures the accuracy of keyframe selection.

[0027] In one possible implementation, the execution device specifically selects at least one image frame as a candidate keyframe from the image frames included in each time window based on the similarity between each image frame and its neighboring image frames, thereby obtaining multiple candidate keyframes; and the execution device determines the keyframe based on the multiple candidate keyframes.

[0028] In this scheme, the target video is divided into time windows, and candidate keyframes are selected from multiple image frames included in each time window to determine the final keyframes. This enables the identification of image frames with the greatest local differences in the target video as keyframes, making the selected keyframes more consistent with the characteristics of human video perception (i.e., detecting motion changes by tracking local peak stimuli). Furthermore, the final selected keyframes are distributed across different time periods as much as possible, avoiding the omission of key information and ensuring the accuracy of keyframe selection.

[0029] In one possible implementation, the execution device selects a subset of candidate keyframes as keyframes based on the similarity between each candidate keyframe and its neighboring image frames.
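The two-stage keyframe selection described in the preceding implementations can be sketched as follows. Representing each image frame by an embedding vector, measuring similarity with cosine similarity, and averaging over the available neighbours are assumptions of this sketch; the function names and parameters are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def neighbour_similarity(frames: List[Vector], i: int) -> float:
    """Mean similarity between frame i and its adjacent frames."""
    sims = []
    if i > 0:
        sims.append(cosine(frames[i], frames[i - 1]))
    if i < len(frames) - 1:
        sims.append(cosine(frames[i], frames[i + 1]))
    return sum(sims) / len(sims) if sims else 1.0


def candidate_keyframes(frames: List[Vector], window: int) -> List[int]:
    """Per time window, pick the frame least similar to its neighbours."""
    sims = [neighbour_similarity(frames, i) for i in range(len(frames))]
    candidates = []
    for start in range(0, len(frames), window):
        idxs = range(start, min(start + window, len(frames)))
        candidates.append(min(idxs, key=lambda i: sims[i]))
    return candidates


def select_keyframes(frames: List[Vector], window: int, num_keyframes: int) -> List[int]:
    """Keep the num_keyframes candidates with the lowest neighbour similarity."""
    sims = [neighbour_similarity(frames, i) for i in range(len(frames))]
    candidates = candidate_keyframes(frames, window)
    chosen = sorted(candidates, key=lambda i: sims[i])[:num_keyframes]
    return sorted(chosen)


# Six toy frame embeddings with a scene change between frames 2 and 3.
frames = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
print(candidate_keyframes(frames, window=3))
print(select_keyframes(frames, window=3, num_keyframes=2))
```

Selecting one candidate per window first, and only then ranking the candidates, keeps the chosen keyframes spread over different time periods.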

[0030] In one possible implementation, the target model is a video large language model.

[0031] In one possible implementation, after obtaining the retained feature segments, the execution device processes the retained feature segments through the target model to obtain the video processing result.

[0032] In a second aspect, a video processing apparatus is provided, comprising: an acquisition module for acquiring a target video, the target video including multiple image frames; a processing module for determining keyframes among the multiple image frames; the processing module further for processing the target video through a target model to obtain multiple sets of feature segments corresponding to the multiple image frames, one set of feature segments in the multiple sets of feature segments corresponding to one image frame; and the processing module further for removing some feature segments from the multiple sets of feature segments based on at least one keyframe to obtain retained feature segments, the retained feature segments including target feature segments corresponding to the keyframes in the multiple sets of feature segments, and the retained feature segments being used as features of the target video to perform video processing tasks.

[0033] In one possible implementation, the processing module is specifically used to: remove some feature segments from the candidate feature segments to obtain the retained feature segments; wherein, the candidate feature segments include feature segments other than the target feature segment from multiple sets of feature segments.

[0034] In one possible implementation, the processing module is specifically used to: divide the candidate feature segments according to a time window to obtain multiple batches of candidate feature segments; select some feature segments in each batch of multiple batches of feature segments for removal to obtain the retained feature segments.

[0035] In one possible implementation, the processing module is specifically used to: divide the target video into multiple sub-videos, each of the multiple sub-videos including a portion of image frames from multiple image frames; process the multiple sub-videos separately using the target model to obtain multiple sets of feature segments corresponding to each sub-video; and remove some feature segments from the multiple sets of feature segments corresponding to the sub-videos, on a per-sub-video basis.

[0036] In one possible implementation, each set of feature segments in the multiple sets of feature segments includes multiple feature segments, and one feature segment corresponds to an image block in an image frame. The acquisition module is also used to acquire the attention score of each feature segment in the candidate feature segments; the processing module is also used to remove some feature segments in the candidate feature segments according to the attention score of each feature segment.

[0037] In one possible implementation, the processing module is specifically used to: divide multiple image frames in chronological order to obtain multiple time windows, each of the multiple time windows including at least two adjacent image frames; and determine keyframes based on the similarity between the image frames in each time window and the adjacent image frames.

[0038] In one possible implementation, the processing module is specifically used to: select at least one image frame as a candidate keyframe from the image frames included in each time window based on the similarity between each image frame and its neighboring image frames, thereby obtaining multiple candidate keyframes; and determine the keyframe based on the multiple candidate keyframes.

[0039] In one possible implementation, the processing module is specifically used to: select a subset of candidate keyframes as keyframes from among the multiple candidate keyframes based on the similarity between each candidate keyframe and its neighboring image frames.

[0040] In one possible implementation, the target model is a video large language model.

[0041] In one possible implementation, the processing module is further configured to: perform processing on the retained feature segments through the target model to obtain the video processing result.

[0042] In a third aspect, a video processing apparatus is provided, comprising: a processor and a memory; the memory is used to store computer instructions which, when executed by the processor, cause the video processing apparatus to perform any of the methods described above.

[0043] In a fourth aspect, a computer-readable storage medium is provided that stores instructions which, when executed on a computer, cause the computer to perform the methods described in any of the above aspects.

[0044] In a fifth aspect, a computer program product containing instructions is provided, which, when executed on a computer, enable the computer to perform the methods described above.

[0045] In a sixth aspect, a chip system is provided, the chip system including a processor and a communication interface for communicating with a module outside the chip system, the processor being used to run computer programs or instructions such that an apparatus on which the chip system is mounted can perform the methods of any of the above aspects.

[0046] In a seventh aspect, a computing device is provided, the computing device including a video processing apparatus of the third aspect or a chip system of the sixth aspect, wherein the video processing apparatus or the chip system in the computing device is used to implement the operational steps of the method of any of the above aspects.

[0047] In an eighth aspect, a computing device cluster is provided, comprising at least one computing device, wherein any one computing device is used to run a computer program or instructions, such that the computing device cluster can perform the methods of any of the above aspects. Alternatively, some or all of the computing devices are used together to run a computer program or instructions, such that the computing device cluster can perform the methods of any of the above aspects.

[0048] The implementations provided in the above aspects can be further combined in this application to provide more implementations.

Attached Figure Description

[0049] Figure 1 is a schematic diagram of a system architecture provided in this application;

[0050] Figure 2 is a flowchart illustrating a video processing method provided in this application;

[0051] Figure 3 is a schematic diagram of a feature segment removal method provided in this application;

[0052] Figure 4 is a schematic diagram of a feature segment removal method based on attention score provided in this application;

[0053] Figure 5 is a schematic diagram of a batch processing flow for feature segments provided in this application;

[0054] Figure 6 is a schematic diagram of a block-based processing method for a target video provided in this application;

[0055] Figure 7 is a schematic diagram of a method provided in this application for dividing a target video into time windows and determining key frames;

[0056] Figure 8 is a schematic diagram of a method for selecting keyframes based on the vector distance between image frames provided in this application;

[0057] Figure 9 is a schematic diagram of the structure of a video processing device provided in this application;

[0058] Figure 10 is a schematic diagram of the structure of a computing device provided in this application;

[0059] Figure 11 is a schematic diagram of the structure of a computing device cluster provided in this application;

[0060] Figure 12 is a schematic diagram of another computing device cluster provided in this application;

[0061] Figure 13 is a schematic diagram of the structure of a computer-readable storage medium provided in this application.

Detailed Implementation

[0062] To make the objectives, technical solutions, and advantages of this application clearer, the embodiments of this application are described below with reference to the accompanying drawings. Obviously, the described embodiments are merely some, and not all, of the embodiments of this application. Those skilled in the art will recognize that, with the emergence of new application scenarios, the technical solutions provided by this application are also applicable to similar technical problems.

[0063] The terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such descriptions can be used interchangeably where appropriate to allow embodiments to be implemented in a sequence other than that illustrated or described in this application. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or modules is not necessarily limited to those explicitly listed, but may include other steps or modules not explicitly listed or inherent to such processes, methods, products, or devices. The naming or numbering of steps appearing in this application does not imply that the steps in the method flow must be performed in the chronological or logical order indicated by the naming or numbering. The execution order of named or numbered process steps can be changed according to the desired technical purpose, as long as the same or similar technical effect is achieved. The division of units in this application is a logical division. In practical applications, there may be other division methods. For example, multiple units may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the shown or discussed mutual coupling, direct coupling, or communication connection may be through some interface, and the indirect coupling or communication connection between units may be electrical or other similar forms, none of which are limited in this application. Furthermore, the units or sub-units described as separate components may or may not be physically separated, may or may not be physical units, or may be distributed among multiple circuit units. Some or all of the units can be selected to achieve the purpose of the solution in this application according to actual needs.

[0064] To facilitate understanding, some technical terms used in this application will be introduced below.

[0065] (1) Large Language Model

[0066] Large language models are deep learning models trained on massive amounts of text data that can generate natural language text or understand the meaning of language text. Large language models can handle various natural language tasks, such as text classification, question answering, and dialogue, and are an important pathway to artificial intelligence.

[0067] Specifically, large language models are a technology that has emerged in recent years. Because large language models undergo meticulous data engineering and training processes, their parameters have learned a wealth of existing natural language processing knowledge. This knowledge can now replace humans in many language-related tasks, such as having large language models write code or perform text summarization.

[0068] (2) Transformer network

[0069] Transformer networks are powerful sequence models, but the computation time and memory they require increase quadratically with sequence length, significantly increasing the demands on hardware storage and computing power. Essentially, Transformer networks employ a self-attention mechanism. Self-attention is a mechanism that associates different positions within a single sequence to compute a representation of the same sequence, and it plays a crucial role in machine reading, abstractive summarization, and image description generation.
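The quadratic growth mentioned above can be made concrete with the standard cost of a direct (non-optimized) self-attention computation. The symbols n (sequence length) and d (feature dimension) are introduced here for illustration and do not appear in the application.

```latex
% Q, K \in \mathbb{R}^{n \times d}: queries and keys for a sequence of n tokens
S = Q K^{\top} \in \mathbb{R}^{n \times n}
\quad\Longrightarrow\quad
\text{time} = O(n^{2} d), \qquad \text{memory for } S = O(n^{2})

% Effect of halving the sequence length:
\left(\tfrac{n}{2}\right)^{2} = \tfrac{n^{2}}{4}
```

Under this cost model, halving the number of feature segments reduces the size of the score matrix to about a quarter, which is why removing redundant feature segments is attractive for long videos.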

[0070] Taking a Transformer network applied to natural language processing as an example, the Transformer network processes input data of arbitrary length and generates new feature representations of the input data, which are then converted into target words. The self-attention layer in the Transformer network uses an attention mechanism to capture the relationships between each word and all other words, thereby generating a new feature representation for each word. The advantage of the Transformer network is that the attention mechanism can directly capture the relationships between all words in a sentence regardless of the distance between their positions.

[0071] (3) Video Large Language Model

[0072] A video large language model (VLM) is a model, built on a large language model (LLM), that can process video and text simultaneously. Building upon the LLM, the VLM maps each frame of a video to an image embedding vector using a video encoder and aligns the dimensions of the image embedding vectors with those of the text embedding vectors. This allows the VLM to recognize and process the image embedding vectors, thus completing the video processing.

[0073] (4) Attention Score

[0074] An attention score is a score calculated through an attention mechanism and is used to measure the relevance and importance between different elements. In the Transformer network, the attention mechanism generates attention scores for each word by calculating the dot product of that word's query vector with the key vectors. These scores are then normalized to become attention weights, which are used to weight the value vectors, that is, the relevant information of each word.
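In the standard scaled dot-product formulation, the score is the dot product of a query vector with each key vector, and the normalized scores weight the value vectors. The sketch below shows this for a single query; the scaling by the square root of the dimension and the softmax normalization follow the common Transformer convention and are not spelled out in the application.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(xs: List[float]) -> List[float]:
    peak = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Tuple[List[float], List[float], Vector]:
    """Return (raw attention scores, attention weights, weighted sum of values)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return scores, weights, output


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
scores, weights, output = attend(query, keys, values)
print(scores, weights, output)
```

The first key matches the query, so it receives the larger score and weight, and the output leans towards the first value vector.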

[0075] (5) Key-Value Cache (KV Cache)

[0076] When using Transformer networks for natural language processing tasks, they typically employ a self-attention mechanism to process the input sequence. In this mechanism, the Transformer network generates a corresponding key (K) vector, value (V) vector, and query (Q) vector for each token in the input sequence. Further, the Transformer network calculates the degree of matching between each query vector and all key vectors, usually achieved through a dot product. Then, using the matching degree between query and key vectors as weights, it calculates a weighted sum of all value vectors to obtain the final result.

[0077] When the Transformer network processes the input sequence, the key vector generated by the Transformer network for each word in the input sequence can be stored in a single matrix, and the value vector generated by the Transformer network for each word in the input sequence can be stored in another matrix. Therefore, KV Cache refers to the key matrix and value matrix generated by the Transformer network for the words in the input sequence, and these key matrices and value matrices are cached.
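A KV cache can be pictured as two growing lists, one of key vectors and one of value vectors, that are reused each time a new query arrives. The class below is a minimal sketch with hypothetical names; real implementations store one such pair of matrices per layer and attention head.

```python
import math
from typing import List

Vector = List[float]


class KVCache:
    """Stores the key and value vectors already computed for earlier tokens."""

    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def append(self, key: Vector, value: Vector) -> None:
        self.keys.append(key)
        self.values.append(value)

    def __len__(self) -> int:
        return len(self.keys)

    def attend(self, query: Vector) -> Vector:
        """Attend over the cached entries without recomputing keys or values."""
        scale = math.sqrt(len(query))
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in self.keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        dim = len(self.values[0])
        return [sum(w * v[j] for w, v in zip(weights, self.values)) for j in range(dim)]


cache = KVCache()
cache.append([1.0, 0.0], [1.0, 2.0])  # key and value of the first token
cache.append([0.0, 1.0], [3.0, 4.0])  # key and value of the second token
output = cache.attend([1.0, 0.0])     # a new query reuses both cached entries
print(len(cache), output)
```

The cache grows by one key and one value per token, so its size is proportional to the number of tokens processed so far.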

[0078] (6) Patch Embedding

[0079] Patch embedding is the process of segmenting an image into multiple small image patches and mapping each patch to a high-dimensional vector space. Patch embedding is commonly used in Transformer networks to convert two-dimensional image data into sequential data, facilitating image data processing using the Transformer network.
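The two steps of patch embedding, cutting the image into patches and projecting each flattened patch to a vector, can be sketched as follows. The single-channel image, the non-overlapping square patches, and the plain linear projection `weight` are simplifying assumptions; the function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def split_into_patches(image: Matrix, patch: int) -> List[List[float]]:
    """Cut a single-channel image into non-overlapping patch x patch blocks,
    each flattened row by row."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            patches.append(
                [image[top + dy][left + dx] for dy in range(patch) for dx in range(patch)]
            )
    return patches


def embed(patches: List[List[float]], weight: Matrix) -> Matrix:
    """Linearly project each flattened patch; weight has patch*patch rows."""
    dim = len(weight[0])
    return [
        [sum(p[i] * weight[i][j] for i in range(len(p))) for j in range(dim)]
        for p in patches
    ]


# A 4x4 image with pixel values 0..15, split into four 2x2 patches.
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
patches = split_into_patches(image, patch=2)
weight = [[1.0, 0.0] for _ in range(4)]  # toy projection to 2 dimensions
embeddings = embed(patches, weight)
print(len(patches), patches[0], embeddings[0])
```

Each resulting vector plays the role of one element of the sequence fed to the Transformer network, which is how a two-dimensional image becomes sequential data.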

[0080] The applicant's research revealed a significant technical bottleneck in current video large language models: these models require a large number of feature segments to represent a single image. Processing a video therefore often involves handling a vast number of feature segments, which makes processing long videos of more than a few minutes highly challenging. Specifically, the number of feature segments that a video large language model can effectively process is typically limited. If the video is too long, it generates an excessive number of feature segments, leading to a rapid decrease in the accuracy of the video large language model. Consequently, current video large language models can generally process only short video segments containing around two hundred images (approximately four minutes) and cannot effectively handle long videos.

[0081] In view of this, this application provides a video processing method. First, keyframes are determined among the multiple image frames included in the target video. Then, after the target video is processed using a target model to obtain a set of feature segments corresponding to each image frame, some feature segments are selectively removed based on the keyframes such that the feature segments corresponding to the keyframes are retained, thus achieving feature reduction. In this way, by retaining the feature segments corresponding to keyframes, which contain the key content of the video, and removing some of the other feature segments, the method can significantly reduce redundant content in the video while ensuring video processing accuracy, improving video processing efficiency and ensuring effective processing of long videos.

[0082] Please refer to Figure 1, which is a schematic diagram of a system architecture provided in this application. As shown in Figure 1, in this system architecture, the execution device 10 can be implemented by a single physical host (computing device) or multiple physical hosts (computing device cluster). The execution device 10 includes an accelerator 101 and a processor 102. The accelerator 101 is used to run the target model (such as a video large language model) to process the inference task transmitted by the processor 102. The processor 102 is used to obtain the client's task request (such as a request to analyze a video) and schedule the accelerator 101 to process the specified inference task based on the task request from the client.

[0083] Optionally, the execution device 10 can be used in conjunction with other computing devices, such as data storage devices, load balancers, etc.; the execution device 10 can be deployed on a single physical site or distributed across multiple physical sites.

[0084] In addition, the system architecture also includes a data storage system 11, which is used to store data such as video data, KV cache, or program code.

[0085] Optionally, for persistent data storage, the data storage system 11 can be located external to the execution device 10 and exchange data with the execution device 10 via a network. Alternatively, if the execution device 10 is a physical host, the data storage system 11 can also be located internally to the execution device 10, such as by exchanging data with the processor via a bus. In this case, the data storage system 11 functions as a hard disk. With the data storage system 11, the execution device 10 can use the data in the data storage system 11 (such as a KV cache) or call the program code in the data storage system to implement the video processing method provided in this application.

[0086] Optionally, users can interact with execution device 10 using their respective local devices. For example, a client 121 is deployed on local device 12, and users interact with the execution device through client 121 on local device 12. Local device 12 can represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, laptop, or smart car.

[0087] Local device 12 can interact with execution device 10 through a communication network of any communication mechanism / standard. The communication network can be a wide area network, a local area network, a point-to-point connection, or any combination thereof.

[0088] Optionally, during the execution of the video processing method by the execution device 10, the local device 12 can provide the execution device 10 with a task request or a video to be processed, so that the execution device 10 can process the video through the target model to achieve task processing. Furthermore, after the execution device 10 executes the video processing method and obtains the output result, it can feed the output result back to the local device 12.

[0089] Please refer to Figure 2, which is a flowchart illustrating a video processing method provided in this application. As shown in Figure 2, the video processing method includes the following steps 201-204.

[0090] Step 201: Obtain the target video, which includes multiple image frames.

[0091] In this application, the target video is a video that needs to be processed by a neural network model, and the target video includes multiple ordered image frames. For example, the target video can be a long video with a duration of 5 minutes, 10 minutes, or 30 minutes, and the number of image frames included in the target video can be, for example, 300, 600, or 1800. This application does not limit the duration of the target video or the number of image frames included.

[0092] Furthermore, for the execution device, it can acquire the target video by obtaining a user's task request. For example, the user sends a task request to the execution device through a local device, carrying the target video and requesting processing of it. The target video can also be pre-stored in a database, in which case the execution device can retrieve it from the database. In general, this application does not limit the specific method by which the execution device acquires the target video.

[0093] Step 202: Identify keyframes among multiple image frames.

[0094] It can be understood that, for a series of ordered image frames in a target video, different frames typically record the form of an object at different times. Therefore, the target video actually records the motion (changes) of an object over a period of time. In most scenarios, the form of an object does not change rapidly over time, so adjacent image frames in the target video are often very similar, or even identical. That is, there will be some redundant content in the target video. Especially when the target video is long, there will be a lot of temporal redundancy, meaning that consecutive image frames often contain a lot of similar information.

[0095] Based on this, in this application, the execution device may determine a portion of image frames as keyframes from among multiple image frames included in the target video, thereby obtaining keyframes. Here, a keyframe can be understood as an image frame in the target video that contains key content (i.e., content of interest to the user), and the content included in non-keyframes in the target video is often derived from variations of the content included in the keyframes. Specifically, the execution device can determine keyframes in various ways, such as based on the cosine similarity between image frames, or by analyzing the motion information of objects in the target video. This application does not limit the specific method by which the execution device determines keyframes.

[0096] Step 203: Process the target video using the target model to obtain multiple sets of feature segments corresponding to multiple image frames, where one set of feature segments corresponds to one image frame.

[0097] After obtaining the target video, the execution device can further process the target video using the target model to extract multiple sets of feature segments. These multiple sets of feature segments can correspond one-to-one with multiple image frames in the target video, meaning each set of feature segments uniquely corresponds to a specific image frame in the target video.

[0098] Specifically, when the target model is a video large language model, the multiple sets of feature segments may include a key-value cache. When processing the target video using the target model, the execution device employs patch embedding to segment each image frame in the target video into multiple small image patches, and maps each image patch to an image embedding vector, thus obtaining a sequence of data where each image frame is mapped to multiple image embedding vectors. In this way, by inputting the sequence data corresponding to each image frame into the target model, a set of feature segments extracted by the target model for each image frame's sequence data can be obtained. This set of feature segments may, for example, include multiple feature matrices, with each feature matrix corresponding to an image patch within the image frame.
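
The patch embedding step described above can be illustrated with a minimal sketch. It assumes a grayscale frame stored as a nested list and a fixed projection matrix passed in as `weights`; both are assumptions made for illustration, and a real target model would use learned weights and multi-channel frames. The function names are hypothetical.

```python
from typing import List

Frame = List[List[float]]  # H x W grayscale frame (an assumption for illustration)


def split_into_patches(frame: Frame, patch: int) -> List[List[float]]:
    """Cut a frame into non-overlapping patch x patch blocks, each flattened row-major."""
    height, width = len(frame), len(frame[0])
    if height % patch or width % patch:
        raise ValueError("frame size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([frame[top + r][left + c]
                            for r in range(patch) for c in range(patch)])
    return patches


def embed(patches: List[List[float]], weights: List[List[float]]) -> List[List[float]]:
    """Map each flattened patch to an embedding vector via a linear projection.

    `weights` holds one column per output dimension; each column has
    patch * patch entries.
    """
    return [[sum(p * w for p, w in zip(patch_vec, column)) for column in weights]
            for patch_vec in patches]
```

For a 4 x 4 frame and a patch size of 2, `split_into_patches` yields four flattened patches, and `embed` turns each of them into one embedding vector, giving the per-frame sequence data that is fed to the model.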

[0099] For example, when the target model is a video large language model, a set of feature segments includes multiple feature matrices, and each feature matrix can include a key matrix, a value matrix, and a query matrix.

[0100] Step 204: Based on the keyframe, remove some feature segments from multiple sets of feature segments to obtain the retained feature segments. The retained feature segments include the target feature segments corresponding to the keyframe in the multiple sets of feature segments, and the retained feature segments are used as features of the target video to perform video processing tasks.

[0101] In this application, the multiple sets of feature segments obtained from processing the target video in step 203 are actually all of the feature segments extracted by the target model for the target video, which can easily lead to information redundancy. Therefore, the execution device can determine, based on the keyframes, the target feature segments that must be retained from the multiple sets of feature segments, and then remove some feature segments from the multiple sets of feature segments to obtain the retained feature segments. Since keyframes are image frames in the target video that contain key content, the target feature segments corresponding to the keyframes need to be retained; that is, the retained feature segments necessarily include the target feature segments corresponding to the keyframes in the multiple sets of feature segments. In other words, among the multiple sets of feature segments, retaining feature segments other than the target feature segments is optional.

[0102] In cases where the execution device determines that there are multiple keyframes, the target feature segments may include the feature segments corresponding to all keyframes, or the feature segments corresponding to only some keyframes.

[0103] That is, the execution device actually removes some feature segments from the candidate feature segments to obtain the retained feature segments, where the candidate feature segments are the feature segments in the multiple sets of feature segments other than the target feature segments.

[0104] For example, the execution device can first determine candidate feature segments from multiple sets of feature segments based on the keyframe, where the candidate feature segments do not include the target feature segment. That is, the target feature segment corresponding to the keyframe is the feature segment that must be retained, while other feature segments in the multiple sets of feature segments besides the target feature segment can be regarded as candidate feature segments that can be retained optionally.

[0105] Then, the execution device selects and removes a portion of the candidate feature segments, thereby obtaining the retained feature segments. That is, the retained feature segments include the target feature segments mentioned above as well as the candidate feature segments that were not removed.

[0106] In this scheme, candidate feature segments that do not include target feature segments are determined based on key frames, and some feature segments are selected from the candidate feature segments to be removed. This ensures that the target feature segments corresponding to the key frames are necessarily retained, thereby retaining feature segments containing key information and removing feature segments containing redundant information. This effectively reduces the number of feature segments in the video while ensuring the accuracy of video processing.

[0107] Specifically, please refer to Figure 3, which is a schematic diagram of feature segment removal provided in this application. As shown in Figure 3, assume that the target video includes N image frames, namely image frame 1 to image frame N. Furthermore, image frame 3 and image frame N are keyframes among the N image frames. After processing the target video using the target model, N sets of feature segments are obtained, namely the first set of feature segments to the Nth set of feature segments. Each set of feature segments corresponds to one image frame. For example, the first set of feature segments corresponds to image frame 1, the second set of feature segments corresponds to image frame 2, and so on, up to the Nth set of feature segments corresponding to image frame N. For the obtained N sets of feature segments, the execution device can remove some feature segments based on the keyframes. That is, the execution device retains the feature segments corresponding to the keyframes and selects some feature segments from the other feature segments for deletion. For example, in Figure 3, the third group of feature segments corresponding to image frame 3 and the Nth group of feature segments corresponding to image frame N were not removed, while some feature segments were deleted from other groups of feature segments (such as the first group of feature segments, the second group of feature segments, ... the fourth group of feature segments).

[0108] Of course, in some possible embodiments, when the number of keyframes is large, the execution device may also remove all candidate feature segments, thereby retaining only the target feature segments corresponding to the keyframes.

[0109] Specifically, in practical applications, the execution device can eliminate feature segments according to a pre-set compression ratio, ensuring that the number of retained feature segments meets the compression ratio requirement. For example, with a compression ratio of 50%, the number of feature segments eliminated by the execution device accounts for 50% of the total number of feature segments across multiple sets. Generally, the more feature segments eliminated by the execution device, the higher the efficiency of the target model in processing the video; however, the accuracy of the target model in processing the video may also decrease. Therefore, in practical applications, the compression ratio of feature segments can be defined according to the actual needs of the video processing task, and this application does not impose specific limitations on it.
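
As a small worked example of the compression ratio, the following sketch computes how many candidate feature segments would be removed. The function name, the rounding down, and the cap at the number of candidates (so that target feature segments are never touched) are assumptions made for illustration.

```python
def removal_budget(total_segments: int, target_segments: int,
                   compression_ratio: float) -> int:
    """Number of candidate segments to remove for a given compression ratio.

    The ratio is applied to the total number of segments, rounded down.
    The result is capped at the number of candidates, because target
    segments (those of keyframes) must be retained (an assumption about
    how the two requirements interact).
    """
    if not 0.0 <= compression_ratio <= 1.0:
        raise ValueError("compression_ratio must lie in [0, 1]")
    candidates = total_segments - target_segments
    wanted = int(total_segments * compression_ratio)
    return min(wanted, candidates)
```

With 1000 feature segments in total, 200 of which belong to keyframes, a compression ratio of 50% removes 500 of the 800 candidates. If 700 segments belonged to keyframes, at most the 300 candidates could be removed.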

[0110] Furthermore, in this application, the retained feature segments are used as features of the target video to perform video processing tasks. That is, when a video processing task is performed on the target video, the processing is no longer based on all of the multiple sets of feature segments corresponding to the target video, but on the retained feature segments. Video processing tasks can be, for example, video analysis tasks, video question answering tasks, video enhancement tasks, and video generation tasks. Among these, a video question answering task can be a task that combines the target video with a user-provided question text. That is, the user can simultaneously provide the target video and question text, requesting the execution device to answer the question indicated by the question text using the target video. In this case, the execution device can simultaneously process the target video and question text using a video large language model to obtain the answer text.

[0111] For example, in a possible application scenario where the target video comes from a security or traffic scenario, the video processing task for that target video is, for instance, a video analysis task. Specifically, the video analysis task involves analyzing the target video to determine whether any specific behavior (such as a violation by a user) occurs within it.

[0112] Of course, video processing tasks can also be other types of tasks. The video processing method provided in this application can be applied to various video processing scenarios. This application does not limit the specific scenario in which the video processing method is applied.

[0113] For example, after obtaining the retained feature segments, the execution device can continue to process the retained feature segments using the target model to obtain a video processing result. This video processing result can be, for example, a video analysis result, a video question-answering result, a video enhancement result, or a generated video. That is, after extracting all feature segments corresponding to the target video, the target model does not perform further processing on all feature segments. Instead, it first removes some feature segments from all feature segments and then uses the target model to continue processing the remaining feature segments.

[0114] For ease of understanding, the following explains in detail how feature segments are selected and removed from the candidate feature segments.

[0115] Optionally, each of the above multiple sets of feature segments includes multiple feature segments, and one feature segment corresponds to an image block in an image frame.

[0116] After obtaining the candidate feature segments, the execution device can acquire the attention score of each feature segment. The attention score measures the importance of a feature segment among all feature segments. Generally, the higher the importance of a feature segment, the higher its attention score. In Transformer networks, attention scores are often used to measure the relevance and importance of different elements, allowing the network to focus on more important elements.

[0117] Then, based on the attention score of each feature segment, a subset of feature segments is selected from the candidate feature segments for removal. For example, the feature segments with the lowest attention scores are selected from the candidate feature segments for removal, thereby retaining as many feature segments with higher attention scores (i.e., higher importance) as possible. Furthermore, when the candidate feature segments include feature segments corresponding to keyframes, after calculating the attention scores of the feature segments corresponding to the keyframes, the execution device can multiply each calculated attention score by a higher weight and then compare it with the attention scores of feature segments corresponding to non-keyframes. This places more emphasis on the feature segments corresponding to keyframes, preserving as many of them as possible.
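
The keyframe weighting described above can be sketched as follows. The segment identifier `(frame index, patch index)`, the function name, and the default weight of 2.0 are hypothetical; the source only states that a higher weight is applied.

```python
from typing import Dict, List, Set, Tuple

SegmentId = Tuple[int, int]  # (frame index, patch index), hypothetical identifier


def removal_order(scores: Dict[SegmentId, float],
                  keyframes: Set[int],
                  keyframe_weight: float = 2.0) -> List[SegmentId]:
    """Order candidate segments from first-to-remove to last-to-remove.

    Scores of segments that belong to keyframes are multiplied by
    `keyframe_weight` before comparison, so they are removed later.
    Ties are broken by segment id to keep the result deterministic.
    """
    def effective(segment: SegmentId) -> float:
        boost = keyframe_weight if segment[0] in keyframes else 1.0
        return scores[segment] * boost

    return sorted(scores, key=lambda segment: (effective(segment), segment))
```

In this sketch a keyframe segment with a raw score of 0.2 outranks a non-keyframe segment with a raw score of 0.3 once the weight is applied, so the non-keyframe segment is removed first.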

[0118] It should be noted that when the target model is a video large language model, it can calculate the attention score for each feature segment during the extraction of multiple feature segments. Specifically, a feature segment can include a Query matrix, a Key matrix, and a Value matrix. The target model can generate the attention score for the feature segment by calculating the dot product of the Query, Key, and Value matrices. Alternatively, the target model can also obtain the attention score for the feature segment through other methods, which are not specifically limited here.
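
One common way to derive a per-segment score from attention is shown below as a hedged sketch. It is an assumption, not necessarily the method of this application: it uses scaled dot-product attention weights computed from Query and Key vectors only, and scores each position by the average attention weight it receives from all queries. The function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(values: Vector) -> Vector:
    """Numerically stable softmax over a list of logits."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_received(queries: List[Vector], keys: List[Vector]) -> Vector:
    """Average attention weight that each key position receives from all queries."""
    scale = math.sqrt(len(keys[0]))
    received = [0.0] * len(keys)
    for query in queries:
        logits = [sum(a * b for a, b in zip(query, key)) / scale for key in keys]
        for position, weight in enumerate(softmax(logits)):
            received[position] += weight / len(queries)
    return received
```

Because each softmax row sums to 1, the returned scores also sum to 1, which makes them directly comparable across the feature segments of one attention computation.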

[0119] In this scheme, feature segment removal is performed based on the attention score of the feature segment. This enables the removal of feature segments with low importance based on the attention score originally calculated by the model, ensuring that feature segments containing redundant information are removed and reducing the number of features that the model needs to process.

[0120] Specifically, for a single image frame in a video, the high-level semantic information contained within that frame often exhibits redundancy. Therefore, this solution uses the feature segment corresponding to a single image block within an image frame as the smallest unit of removal, eliminating feature segments with low attention scores. This process removes redundant semantic information from the image frame, thereby accurately deleting feature segments that do not affect the accuracy of video processing and ensuring that the removal of feature segments does not impact the overall processing precision.

[0121] For example, please refer to Figure 4, which is a schematic diagram of feature segment removal based on attention score according to this application. As shown in Figure 4, for the N groups of feature segments extracted by the target model, the candidate feature segments other than the target feature segments are sorted according to the attention score of each feature segment, for example, sorted in descending order of attention score. Then, based on the sorting result of the feature segments, the feature segments with the lowest attention scores are removed, thereby obtaining the retained feature segments. As shown in Figure 4, except for the third group of feature segments and the Nth group of feature segments (i.e., the target feature segments), feature segments with low attention scores are removed in each other group of feature segments.
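
The sorting and removal shown in Figure 4 can be sketched as follows, assuming the attention scores are already available in a dictionary. The identifier format and the function name are hypothetical.

```python
from typing import Dict, List, Set, Tuple

SegmentId = Tuple[int, int]  # (frame index, patch index), hypothetical identifier


def prune_by_attention(scores: Dict[SegmentId, float],
                       keyframes: Set[int],
                       remove_count: int) -> List[SegmentId]:
    """Return the retained segment ids after dropping the lowest-scoring candidates.

    Segments of keyframes are target segments and are never removed.
    """
    candidates = [sid for sid in scores if sid[0] not in keyframes]
    # Sort candidates in descending order of attention score; ties are
    # broken by segment id so the result is deterministic.
    ranked = sorted(candidates, key=lambda sid: (-scores[sid], sid))
    keep_n = max(len(ranked) - remove_count, 0)
    kept = set(ranked[:keep_n])
    return [sid for sid in sorted(scores) if sid[0] in keyframes or sid in kept]
```

The retained ids are returned in frame and patch order, so the temporal order of the remaining feature segments is preserved for later processing.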

[0122] It can be understood that, since a set of feature segments corresponds to an image frame, and a feature segment corresponds to an image patch within an image frame, the removed feature segments can be understood as corresponding to image patches that are not important in the image frame (such as image patches containing the background). In this case, removing the feature segments corresponding to some image patches in an image frame often does not affect the accuracy of video processing, and can also eliminate a large amount of redundant information, improving the efficiency of the model in processing video.

[0123] Of course, besides selecting feature segments for elimination based on their attention scores, the execution device can also use other methods to select the feature segments to be eliminated. For example, the execution device can randomly select a portion of the candidate feature segments for elimination. This application does not limit how the feature segments to be eliminated are selected.

[0124] Optionally, during the process of selecting and eliminating some feature segments from the candidate feature segments, the execution device can first divide the candidate feature segments based on the position of the image frames corresponding to the feature segments in the target video, thus obtaining multiple batches of candidate feature segments. For example, the execution device can first divide the candidate feature segments according to time windows, thus obtaining multiple batches of candidate feature segments. The target video can be divided into multiple time windows, and the feature segments corresponding to all image frames within a time window constitute a batch of candidate feature segments.

[0125] Specifically, candidate feature segments are sorted according to the position of the corresponding image frames in the target video. The earlier the image frame of a feature segment is in the target video, the earlier the feature segment appears in the sorted result, thus being assigned to an earlier time window. Conversely, the later the image frame appears in the target video, the later the feature segment appears in the sorted result, thus being assigned to a later time window. Therefore, after dividing the candidate feature segments, feature segments corresponding to adjacent image frames within a certain time period (i.e., image frames within the same time window) in the target video often belong to the same batch of feature segments. For example, suppose the target video consists of 100 image frames, and the first, 50th, and 100th image frames are keyframes. If the candidate feature segments are divided into four batches, the second to the 25th image frames belong to the first batch of feature segments, the 26th to the 49th image frames belong to the second batch of feature segments, the 51st to the 75th image frames belong to the third batch of feature segments, and the 76th to the 99th image frames belong to the fourth batch of feature segments.
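
The 100-frame example above can be reproduced with a short sketch. It assumes equal-length time windows of 25 frames and 1-based frame numbers; the function name is hypothetical.

```python
from typing import List, Set


def batch_candidates(num_frames: int, keyframes: Set[int],
                     window_size: int) -> List[List[int]]:
    """Group non-keyframe frame numbers (1-based) by time window.

    Each returned list holds the frames whose feature segments form one
    batch of candidate feature segments. The last window may be shorter
    if num_frames is not a multiple of window_size.
    """
    batches = []
    for start in range(1, num_frames + 1, window_size):
        end = min(start + window_size - 1, num_frames)
        batches.append([f for f in range(start, end + 1) if f not in keyframes])
    return batches
```

Calling `batch_candidates(100, {1, 50, 100}, 25)` gives four batches covering frames 2 to 25, 26 to 49, 51 to 75, and 76 to 99, matching the division described above.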

[0126] Then, the execution device selects a portion of feature segments from each batch of feature segments to remove, resulting in the retained feature segments. That is, each batch of feature segments needs to select a portion of feature segments to remove and retain a portion of feature segments to avoid all feature segments in the same batch being removed.

[0127] In this scheme, feature segments are divided into batches based on the position of the image frames corresponding to the feature segments in the video. In each batch, some feature segments are selected for removal. This ensures that the feature segments corresponding to the image frames in different time periods of the video are retained, resulting in a wider distribution of the final retained feature segments over time. This avoids missing important information and helps to ensure the accuracy of video processing.
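
A minimal sketch of per-batch removal follows. It assumes that each batch is a list of `(segment id, attention score)` pairs with unique ids, that the same removal ratio is applied to every batch, and that at least one segment is kept per non-empty batch. These choices and the names are assumptions for illustration.

```python
from typing import List, Tuple

Scored = Tuple[str, float]  # (segment id, attention score); ids are hypothetical


def prune_batches(batches: List[List[Scored]], ratio: float) -> List[List[Scored]]:
    """Remove the lowest-scoring share of every batch, never emptying a batch."""
    kept_batches = []
    for batch in batches:
        # Cap the removal so that one segment always survives in a non-empty batch.
        remove_n = min(int(len(batch) * ratio), max(len(batch) - 1, 0))
        ranked = sorted(batch, key=lambda item: item[1], reverse=True)
        kept_ids = {seg_id for seg_id, _ in ranked[:len(batch) - remove_n]}
        # Preserve the original (temporal) order inside the batch.
        kept_batches.append([item for item in batch if item[0] in kept_ids])
    return kept_batches
```

Because the removal is decided inside each batch, a time window with uniformly low scores still contributes some feature segments, which is what spreads the retained feature segments across the video.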

[0128] For example, please refer to Figure 5, which is a schematic flowchart of a batch processing method for feature segments provided in this application. As shown in Figure 5, for feature segments from group 1 to group N, the execution device can divide the candidate feature segments into M batches of feature segments, i.e., batch 1 to batch M. For example, batch 1 includes group 1 and group 2, batch 2 includes group 4 and group 5, and so on, batch M includes group N-2 and group N-1. Then, the execution device can remove some feature segments from each batch, thereby completing the removal of feature segments.

[0129] In some scenarios, if the target video is a long video, the execution device often consumes a large amount of memory when processing the target video through the target model to generate corresponding feature segments, which can easily affect the normal operation of the execution device (for example, the execution device cannot perform other tasks). Based on this, this application proposes to divide the target video into blocks, and then process the sub-videos of the blocks separately, so as to reduce the peak memory consumption and ensure that the execution device can operate normally.

[0130] Optionally, when processing the target video using the target model, the execution device divides the target video into multiple sub-videos. Each sub-video includes a portion of the multiple image frames. Furthermore, different sub-videos do not contain duplicate image frames, and the image frames included in the multiple sub-videos together constitute all image frames in the target video. Simply put, the execution device splits the target video to obtain multiple sub-videos with shorter durations. The durations of the multiple sub-videos can be the same (e.g., all 30 seconds or all 60 seconds) or different; this application does not impose specific limitations on this.

[0131] Then, the execution device processes multiple sub-videos using the target model, obtaining multiple sets of feature segments corresponding to each sub-video. That is, the execution device processes only one sub-video at a time using the target model, thereby extracting multiple sets of feature segments corresponding to that sub-video. The multiple sets of feature segments corresponding to each of the multiple sub-videos then constitute the multiple sets of feature segments corresponding to the aforementioned target video.

[0132] Furthermore, when performing feature segment removal, the execution device can remove a portion of the feature segments from multiple sets of feature segments corresponding to each sub-video, based on keyframes and on a sub-video basis. That is, the execution device removes a portion of the feature segments from each set of feature segments corresponding to each sub-video. After the feature segment removal is completed for each set of feature segments corresponding to each sub-video, the execution device saves the remaining feature segments so that they can be combined with the remaining feature segments corresponding to other sub-videos to form the retained feature segments corresponding to the target video.

[0133] In other words, the execution device actually cuts the target video into multiple sub-videos, extracting and removing feature segments from only one sub-video at a time. Furthermore, the remaining feature segments for each sub-video are saved. Thus, after processing all the sub-videos, the execution device can obtain the retained feature segments for the target video by combining the remaining feature segments from all the sub-videos.

[0134] In this solution, by processing long target videos in blocks independently, the peak memory usage of the execution device when running the target model to process the target video can be effectively reduced, ensuring that the memory on the execution device can effectively handle the processing of the target video and other tasks, and avoiding affecting the normal operation of the execution device.

[0135] Understandably, when the target video is segmented, the execution device may no longer need to process the extracted feature segments in batches. Instead, it can treat all feature segments corresponding to a sub-video as the same batch of feature segments and then remove some feature segments from the same batch.

[0136] For example, please refer to Figure 6, which is a schematic diagram of segmenting a target video according to this application. As shown in Figure 6, assume that the target video includes 1000 image frames, namely image frame 1 to image frame 1000. After performing video segmentation on the target video, 10 sub-videos can be obtained, namely sub-video 1 to sub-video 10. Each sub-video includes 100 consecutive image frames, for example, sub-video 1 includes image frames 1 to 100, sub-video 2 includes image frames 101 to 200, and so on, sub-video 10 includes image frames 901 to 1000. For each of the 10 sub-videos, the execution device independently performs feature segment extraction and feature segment removal processes on each of the 10 sub-videos, thereby obtaining the retained feature segments corresponding to each sub-video, which then constitute the retained feature segments corresponding to the target video.

[0137] Specifically, for sub-video 1, the execution device extracts 100 sets of feature segments (i.e., the first set of feature segments to the 100th set of feature segments) corresponding to sub-video 1 through the target model. Then, based on the keyframes of the target video, the execution device removes some feature segments from the first to the 100th sets of feature segments, thereby obtaining the retained feature segments corresponding to sub-video 1.

[0138] Similarly, for sub-video 10, the execution device extracts 100 sets of feature segments corresponding to sub-video 10 (i.e., the 901st set of feature segments to the 1000th set of feature segments) through the target model. Then, based on the keyframes of the target video, the execution device removes some feature segments from the 901st to the 1000th sets of feature segments, thereby obtaining the retained feature segments corresponding to sub-video 10.
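
The block-wise flow of Figure 6 can be sketched as follows. `extract_stub` and `prune_stub` are hypothetical placeholders for the target model and for the removal rule; only the loop structure, in which one sub-video is extracted and pruned at a time, reflects the text above.

```python
from typing import Callable, Iterator, List, Set, Tuple

Segment = Tuple[int, int]  # (frame number, patch index), hypothetical identifier


def split_video(num_frames: int, chunk_size: int) -> Iterator[range]:
    """Yield consecutive, non-overlapping 1-based frame ranges covering the video."""
    for start in range(1, num_frames + 1, chunk_size):
        yield range(start, min(start + chunk_size, num_frames + 1))


def extract_stub(frames: range, patches_per_frame: int = 2) -> List[Segment]:
    """Placeholder for the target model: one segment per patch of each frame."""
    return [(f, p) for f in frames for p in range(patches_per_frame)]


def prune_stub(segments: List[Segment], keyframes: Set[int]) -> List[Segment]:
    """Placeholder removal rule: keep all keyframe segments, otherwise only patch 0."""
    return [s for s in segments if s[0] in keyframes or s[1] == 0]


def process_in_chunks(
    num_frames: int,
    chunk_size: int,
    keyframes: Set[int],
    extract: Callable[[range], List[Segment]] = extract_stub,
    prune: Callable[[List[Segment], Set[int]], List[Segment]] = prune_stub,
) -> List[Segment]:
    """Extract and prune one sub-video at a time, then combine what remains."""
    retained: List[Segment] = []
    for frames in split_video(num_frames, chunk_size):
        segments = extract(frames)  # only this sub-video's segments are held here
        retained.extend(prune(segments, keyframes))
    return retained
```

For 1000 frames and a chunk size of 100, `split_video` yields 10 sub-videos (frames 1 to 100 through 901 to 1000). Only the full feature segments of one sub-video exist at any time, while the accumulated `retained` list holds the already reduced segments.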

[0139] The above describes the process by which the execution device removes feature segments based on keyframes. The following will describe how the execution device determines the keyframes in the target video.

[0140] In one possible implementation, the execution device first divides multiple image frames into multiple time windows according to chronological order. Each time window includes at least two adjacent image frames. Different time windows will not contain duplicate image frames; that is, an image frame can only appear in one time window. The number of image frames included in each time window can be a preset value, such as 3 or 5, etc., without specific limitation. Furthermore, the number of image frames included in each window can be the same or different.

[0141] Then, the execution device determines keyframes based on the similarity between each image frame and its neighboring image frames. For example, the execution device selects a subset of image frames as candidate keyframes from the image frames included in each time window, resulting in multiple candidate keyframes. That is, a subset of image frames in each time window will be selected as candidate keyframes. For example, assuming a time window includes 5 image frames, then 1 image frame can be selected as a candidate keyframe in one time window. The number of candidate keyframes selected in each time window can be preset and is related to the total number of image frames included in each time window; this application does not impose specific limitations on this.

[0142] Furthermore, when selecting candidate keyframes, the similarity between each image frame and its preceding image frame (or its following image frame, or both) can be calculated first. This yields the similarity between each image frame and its adjacent image frames (i.e., the similarity corresponding to each image frame). Then, based on the similarity corresponding to each image frame, the image frame with the lowest similarity can be selected as the candidate keyframe. In other words, the smaller the similarity between an image frame and its adjacent image frames, the greater the change in the image frame (i.e., the greater the motion amplitude of the object in the image frame). Such an image frame is likely to contain more critical content, and therefore image frames with greater change can be selected as candidate keyframes.
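
Candidate keyframe selection by time window can be sketched as follows, assuming cosine similarity to the preceding frame and one candidate per window. Each frame is represented by a hypothetical non-zero feature vector (for example, flattened pixel values); the function names and the 0-based indexing are assumptions for illustration.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def candidate_keyframes(frames: Sequence[Sequence[float]], window: int) -> List[int]:
    """Return, per time window, the 0-based index of the frame that is least
    similar to its preceding frame."""
    # The first frame has no predecessor, so it is given the lowest possible
    # similarity and becomes the candidate of its window by default.
    sims = [float("-inf")] + [cosine_similarity(frames[i - 1], frames[i])
                              for i in range(1, len(frames))]
    picks = []
    for start in range(0, len(frames), window):
        indices = range(start, min(start + window, len(frames)))
        picks.append(min(indices, key=lambda i: sims[i]))
    return picks
```

Note that the similarity of the first frame in a window is computed against the last frame of the previous window, so a scene change that falls exactly on a window boundary is still detected.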

[0143] Finally, based on multiple candidate keyframes, the execution device determines the keyframes. For example, the execution device selects a subset of candidate keyframes (e.g., those with the lowest similarity to adjacent image frames) from among the multiple candidate keyframes based on the similarity between each candidate keyframe and its neighboring image frames. This approach aims to select keyframes containing the most critical content and improve the accuracy of keyframe selection. Alternatively, the execution device may determine all multiple candidate keyframes as keyframes.
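
Selecting the final keyframes from the candidates can be sketched as follows. The mapping from frame number to similarity and the fixed `count` are assumptions made for illustration; the source leaves the size of the subset open.

```python
from typing import Dict, List


def select_keyframes(candidate_similarity: Dict[int, float], count: int) -> List[int]:
    """Keep the `count` candidates least similar to their neighbours,
    returned in temporal order.

    Ties are broken by frame number so the result is deterministic.
    """
    ranked = sorted(candidate_similarity,
                    key=lambda frame: (candidate_similarity[frame], frame))
    return sorted(ranked[:count])
```

Setting `count` to the number of candidates corresponds to the alternative in which all candidate keyframes are determined as keyframes.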

[0144] In this scheme, the target video is divided into time windows, and candidate keyframes are selected from multiple image frames included in each time window to determine the final keyframes. This enables the identification of image frames with the greatest local differences in the target video as keyframes, making the selected keyframes more consistent with the characteristics of human video perception (i.e., detecting motion changes by tracking local peak stimuli). Furthermore, the final selected keyframes are distributed across different time periods as much as possible, avoiding the omission of key information and ensuring the accuracy of keyframe selection.

[0145] For example, please refer to Figure 7, which is a schematic diagram of determining keyframes by performing time windowing on a target video according to this application. As shown in Figure 7, assume that the target video includes 50 image frames, namely image frame 1 to image frame 50. By performing time windowing on the target video, the target video can be divided into 10 time windows, and each time window includes 5 image frames. For example, time window 1 includes image frames 1 to 5, time window 2 includes image frames 6 to 10, and so on, time window 10 includes image frames 46 to 50. Then, in each of the divided time windows, the execution device can calculate the similarity between each image frame and the previous image frame, so as to determine the keyframe based on the calculated similarity. Specifically, after calculating the similarity between each image frame and the previous image frame, the execution device can take the image frame with the lowest similarity to the previous image frame in the same time window as the candidate keyframe, and continue to select a portion of the candidate keyframes as the final keyframes. For example, in Figure 7, image frame 1 in time window 1 and image frame 47 in time window 10 were both eventually determined as keyframes.

[0146] Furthermore, since the similarity between the first image frame in the target video and the previous image frame cannot be calculated, the first image frame can be considered as a candidate keyframe by default. For example, in time window 1 shown in Figure 7, the execution device can directly determine image frame 1 as the candidate keyframe. Of course, the execution device can also determine the first image frame as a candidate keyframe, and then perform time window division on other image frames in the target video located after the first image frame and determine the candidate keyframes in each time window.
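The time windowing of Figure 7, together with the default handling of the first image frame, can be sketched as follows. The function names and the similarity values are assumptions made for illustration; the sketch selects one candidate per window and resolves ties in favour of the earliest frame, neither of which is mandated by this application.

```python
def split_into_windows(num_frames, window_size):
    """Return lists of 1-based frame numbers, one list per time window."""
    frames = list(range(1, num_frames + 1))
    return [frames[i:i + window_size]
            for i in range(0, num_frames, window_size)]


def candidate_keyframes(similarity_to_previous, num_frames, window_size):
    """Pick one candidate per window: the frame least similar to its predecessor.

    similarity_to_previous[n] is the similarity between frame n and frame n - 1.
    Frame 1 has no predecessor, so it is the default candidate of its window.
    """
    candidates = []
    for window in split_into_windows(num_frames, window_size):
        if window[0] == 1:
            candidates.append(1)
            continue
        candidates.append(min(window, key=lambda n: similarity_to_previous[n]))
    return candidates


# 50 frames, 10 windows of 5 frames each, as in Figure 7 (scores are made up).
similarity = {n: 0.9 for n in range(2, 51)}
similarity[8] = 0.3
similarity[47] = 0.1
print(candidate_keyframes(similarity, num_frames=50, window_size=5))
# [1, 8, 11, 16, 21, 26, 31, 36, 41, 47]
```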

[0147] When calculating the similarity between image frames, the execution device can measure the similarity between image frames by calculating the distance between the image embedding vectors corresponding to the image frames. That is, the execution device can first determine the image embedding vectors corresponding to two adjacent image frames, and obtain the similarity between the two image frames by calculating the distance between the image embedding vectors corresponding to these two image frames.
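As a rough sketch of this step, the snippet below measures the distance between two embedding vectors and maps it to a similarity. This application does not name a specific distance metric or mapping; Euclidean distance and the `1 / (1 + d)` mapping are assumptions chosen only because they make a larger distance yield a lower similarity.

```python
import math


def embedding_distance(vec_a, vec_b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.dist(vec_a, vec_b)


def similarity_from_distance(distance):
    """Map a distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + distance)


prev_frame = [0.0, 0.0]  # toy 2-dimensional embeddings
curr_frame = [3.0, 4.0]
d = embedding_distance(prev_frame, curr_frame)
print(d, similarity_from_distance(d))
```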

[0148] For example, please refer to Figure 8, which is a schematic diagram of keyframe selection based on the vector distance between image frames provided in this application. As shown in Figure 8, the horizontal axis represents the image frame number, and the vertical axis represents the distance between the image embedding vector of an image frame and the image embedding vector of the previous image frame. Generally, the larger the distance between the image embedding vectors corresponding to two image frames, the lower the similarity between the two image frames. Therefore, in Figure 8, after the video is divided into time windows, the image frame whose image embedding vector is farthest from that of its adjacent image frame can be selected within each time window. This identifies the image frame with the lowest similarity to its adjacent image frame and thus facilitates the selection of keyframes.

[0149] Furthermore, in the case of a video large language model, each image frame is divided into multiple image blocks and each image block is converted into a corresponding image embedding vector. Therefore, when calculating the distance between the image embedding vectors corresponding to two adjacent image frames, the execution device can calculate the distance between the image embedding vectors of the image blocks at the same position in the two image frames, and take the average of these distances over all image blocks as the distance between the image embedding vectors corresponding to the two image frames.
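The block-wise averaging described above can be sketched as follows. The function name `frame_distance` is hypothetical, and Euclidean distance is again an assumption, since this application does not fix the metric.

```python
import math


def frame_distance(blocks_a, blocks_b):
    """Average distance between co-located image-block embeddings.

    blocks_a[i] and blocks_b[i] are the embedding vectors of the image
    blocks at the same position i in two adjacent image frames.
    """
    if not blocks_a or len(blocks_a) != len(blocks_b):
        raise ValueError("both frames need the same, non-zero number of blocks")
    total = sum(math.dist(a, b) for a, b in zip(blocks_a, blocks_b))
    return total / len(blocks_a)


frame_1 = [[0.0, 0.0], [1.0, 1.0]]  # two blocks, toy 2-dimensional embeddings
frame_2 = [[3.0, 4.0], [1.0, 1.0]]  # first block changed, second unchanged
print(frame_distance(frame_1, frame_2))  # 2.5
```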

[0150] The method provided in this application has been described in detail above. Next, the device provided in this application for performing the above method will be described.

[0151] Please refer to Figure 9, which is a schematic diagram of the structure of a video processing apparatus provided in this application. As shown in Figure 9, the video processing apparatus includes: an acquisition module 901, used to acquire a target video, the target video including multiple image frames; a processing module 902, used to determine key frames in the multiple image frames; the processing module 902 is further used to process the target video through a target model to obtain multiple sets of feature segments corresponding to the multiple image frames, one set of feature segments in the multiple sets of feature segments corresponding to one image frame; the processing module 902 is further used to remove some feature segments from the multiple sets of feature segments based on at least one key frame to obtain retained feature segments, the retained feature segments including target feature segments corresponding to the key frames in the multiple sets of feature segments, and the retained feature segments are used as features of the target video to perform video processing tasks.

[0152] In one possible implementation, the processing module 902 is specifically used to: remove some feature segments from the candidate feature segments to obtain the retained feature segments; wherein, the candidate feature segments include feature segments other than the target feature segment from multiple sets of feature segments.
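A minimal sketch of this split is shown below, assuming the feature segments are held in a dictionary keyed by frame number. The data layout and the name `split_target_and_candidates` are illustrative assumptions. The target feature segments of the keyframes are always retained, and only the remaining candidate feature segments are eligible for removal.

```python
def split_target_and_candidates(segments_per_frame, keyframes):
    """Separate keyframe (target) segments from removable candidate segments.

    segments_per_frame maps a frame number to its list of feature segments.
    Returns two lists of (frame_number, position, segment) tuples.
    """
    keyframe_set = set(keyframes)
    target, candidates = [], []
    for frame in sorted(segments_per_frame):
        for position, segment in enumerate(segments_per_frame[frame]):
            if frame in keyframe_set:
                target.append((frame, position, segment))
            else:
                candidates.append((frame, position, segment))
    return target, candidates


segments = {1: ["a0", "a1"], 2: ["b0", "b1"], 3: ["c0", "c1"]}
target, candidates = split_target_and_candidates(segments, keyframes=[2])
print(target)      # [(2, 0, 'b0'), (2, 1, 'b1')]
print(candidates)  # [(1, 0, 'a0'), (1, 1, 'a1'), (3, 0, 'c0'), (3, 1, 'c1')]
```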

[0153] In one possible implementation, the processing module 902 is specifically used to: divide the candidate feature segments according to time windows to obtain multiple batches of candidate feature segments; and, in each of the multiple batches, select some feature segments for removal to obtain the retained feature segments.
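The batching can be sketched as follows. This is an illustration under stated assumptions: each candidate is represented as a `(frame_index, score)` pair with 0-based frame indices, and the lowest-scoring segments in each batch are the ones removed. This application only states that some segments in each batch are selected for removal, so the selection rule here is hypothetical.

```python
from collections import defaultdict


def batch_by_time_window(candidates, frames_per_window):
    """Group (frame_index, score) candidates into one batch per time window."""
    batches = defaultdict(list)
    for frame, score in candidates:
        batches[frame // frames_per_window].append((frame, score))
    return [batches[w] for w in sorted(batches)]


def prune_each_batch(batches, keep_per_batch):
    """Within every batch, keep the keep_per_batch highest-scoring segments."""
    retained = []
    for batch in batches:
        best = sorted(batch, key=lambda item: item[1], reverse=True)
        retained.extend(sorted(best[:keep_per_batch]))  # restore temporal order
    return retained


candidates = [(0, 0.9), (1, 0.2), (2, 0.5), (3, 0.1), (4, 0.8), (5, 0.3)]
batches = batch_by_time_window(candidates, frames_per_window=3)
print(prune_each_batch(batches, keep_per_batch=2))
# [(0, 0.9), (2, 0.5), (4, 0.8), (5, 0.3)]
```

Pruning per batch rather than globally keeps some segments from every time window, so no period of the video is dropped entirely.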

[0154] In one possible implementation, the processing module 902 is specifically used to: divide the target video into multiple sub-videos, each of the multiple sub-videos including a portion of image frames from multiple image frames; process the multiple sub-videos separately using the target model to obtain multiple sets of feature segments corresponding to each sub-video; and remove some feature segments from the multiple sets of feature segments corresponding to the sub-videos, taking the sub-videos as units.
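A rough sketch of the sub-video flow is given below. The `encode` and `prune` callables are placeholders standing in for the target model and the removal step, and the function names are hypothetical. The point illustrated is only the control flow: each sub-video is encoded and pruned before the next one is processed, so the full, unpruned feature set of the whole video never has to be held at once.

```python
def split_into_sub_videos(frames, frames_per_sub_video):
    """Split a list of frames into consecutive sub-videos."""
    return [frames[i:i + frames_per_sub_video]
            for i in range(0, len(frames), frames_per_sub_video)]


def process_by_sub_video(frames, frames_per_sub_video, encode, prune):
    """Encode and prune one sub-video at a time, then concatenate the results.

    encode(frame) returns the set (list) of feature segments of one frame.
    prune(sets) takes the per-frame sets of one sub-video and returns the
    retained feature segments as a flat list.
    """
    retained = []
    for sub_video in split_into_sub_videos(frames, frames_per_sub_video):
        sets = [encode(frame) for frame in sub_video]
        retained.extend(prune(sets))
    return retained


# Toy stand-ins: two segments per frame; keep only the first frame's segments.
result = process_by_sub_video(
    ["a", "b", "c"], 2,
    encode=lambda frame: [frame + "0", frame + "1"],
    prune=lambda sets: sets[0],
)
print(result)  # ['a0', 'a1', 'c0', 'c1']
```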

[0155] In one possible implementation, each set of feature segments in the multiple sets of feature segments includes multiple feature segments, and one feature segment corresponds to an image block in an image frame. The acquisition module 901 is further used to acquire the attention score of each feature segment in the candidate feature segments; the processing module 902 is further used to remove some feature segments in the candidate feature segments according to the attention score of each feature segment.
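Removal by attention score can be sketched as follows, assuming that the segments with the lowest attention scores are the ones removed. This application says only that removal is performed according to the attention score, so the ranking direction, the `keep_count` parameter, and the function name are assumptions.

```python
def prune_by_attention(segments, attention_scores, keep_count):
    """Keep the keep_count segments with the highest attention scores.

    The retained segments are returned in their original order.
    """
    if len(segments) != len(attention_scores):
        raise ValueError("one attention score is needed per segment")
    ranked = sorted(range(len(segments)),
                    key=lambda i: attention_scores[i], reverse=True)
    kept = sorted(ranked[:keep_count])
    return [segments[i] for i in kept]


segments = ["a", "b", "c", "d"]   # one entry per image block
scores = [0.1, 0.7, 0.3, 0.9]     # made-up attention scores
print(prune_by_attention(segments, scores, keep_count=2))  # ['b', 'd']
```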

[0156] In one possible implementation, the processing module 902 is specifically used to: divide multiple image frames in chronological order to obtain multiple time windows, each of the multiple time windows including at least two adjacent image frames; and determine keyframes based on the similarity between the image frames in each time window and the adjacent image frames.

[0157] In one possible implementation, the processing module 902 is specifically used to: select at least one image frame as a candidate keyframe from the image frames included in each time window based on the similarity between each image frame and its neighboring image frames, thereby obtaining multiple candidate keyframes; and determine the keyframe based on the multiple candidate keyframes.

[0158] In one possible implementation, the processing module 902 is specifically used to: select a portion of the candidate keyframes as keyframes based on the similarity between each candidate keyframe and its neighboring image frames.

[0159] In one possible implementation, the target model is a video large language model.

[0160] In one possible implementation, the processing module 902 is further configured to: perform processing on the retained feature segments through the target model to obtain the video processing result.

[0161] Both the acquisition module 901 and the processing module 902 can be implemented in software or in hardware. For example, the implementation of the processing module 902 will be described below. Similarly, the implementation of the acquisition module 901 can be referenced from the implementation of the processing module 902.

[0162] As an example of a software functional unit, processing module 902 may include code running on a computing instance. The computing instance may include at least one of a physical host (computing device), a virtual machine, or a container. Further, the aforementioned computing instance may be one or more. For example, processing module 902 may include code running on multiple hosts / virtual machines / containers. It should be noted that the multiple hosts / virtual machines / containers used to run the code may be distributed within the same region or in different regions. Further, the multiple hosts / virtual machines / containers used to run the code may be distributed within the same availability zone (AZ) or in different AZs, each AZ including one or more geographically proximate data centers. Typically, a region may include multiple AZs.

[0163] Similarly, multiple hosts / virtual machines / containers used to run this code can be distributed within the same Virtual Private Cloud (VPC) or across multiple VPCs. Typically, a VPC is set up within a region. Communication between two VPCs within the same region, as well as between VPCs in different regions, requires a communication gateway to be set up within each VPC to enable interconnection between VPCs.

[0164] As an example of a hardware functional unit, the processing module 902 may include at least one computing device, such as a server. Alternatively, the processing module 902 may be implemented using a central processing unit (CPU), an application-specific integrated circuit (ASIC), or a programmable logic device (PLD). The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), a data processing unit (DPU), a neural network processing unit (NPU), a system-on-chip (SoC), an offload card, an accelerator card, or any combination thereof.

[0165] The multiple computing devices included in the processing module 902 can be distributed within the same region or in different regions, within the same availability zone (AZ) or in different AZs, and within the same virtual private cloud (VPC) or in multiple VPCs. These multiple computing devices can be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, GALs, DPUs, NPUs, SoCs, offload cards, and accelerator cards.

[0166] Please refer to Figure 10, which is a schematic diagram of the structure of a computing device provided in this application. The computing device 1000 shown in Figure 10 can be used to execute the video processing method provided in this embodiment. As shown in Figure 10, the computing device 1000 includes: a bus 1002, a processor 1004, a memory 1006, and a communication interface 1008. The processor 1004, the memory 1006, and the communication interface 1008 communicate with each other via the bus 1002. The computing device 1000 can be a server or a terminal device. It should be understood that this application does not limit the number of processors and memories in the computing device 1000.

[0167] Bus 1002 can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of illustration, only one line is used in Figure 10, but this does not imply that there is only one bus or one type of bus. Bus 1002 can include pathways for transmitting information between various components of computing device 1000 (e.g., memory 1006, processor 1004, communication interface 1008).

[0168] The processor 1004 may include any one or more processors such as a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).

[0169] The memory 1006 may include volatile memory, such as random access memory (RAM). The memory 1006 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).

[0170] The memory 1006 stores executable program code, and the processor 1004 executes this executable program code to implement the functions of the aforementioned acquisition module and processing module, thereby realizing the video processing method described above. That is, the memory 1006 stores instructions for executing the video processing method.

[0171] The communication interface 1008 uses transceiver modules such as, but not limited to, network interface cards and transceivers to enable communication between the computing device 1000 and other devices or communication networks.

[0172] This application also provides a computing device cluster. The computing device cluster includes at least one computing device. The computing device can be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device can also be a terminal device such as a desktop computer, a laptop computer, or a smartphone.

[0173] Please refer to Figure 11, which is a schematic diagram of a computing device cluster provided in this application. As shown in Figure 11, the computing device cluster includes at least one computing device 1000. The memory 1006 of one or more computing devices 1000 in the computing device cluster may store the same instructions for executing video processing methods.

[0174] In some possible implementations, the memory 1006 of one or more computing devices 1000 in the computing device cluster may also store partial instructions for executing the video processing method. In other words, a combination of one or more computing devices 1000 can jointly execute the instructions for executing the video processing method.

[0175] It should be noted that the memory 1006 in different computing devices 1000 within the computing device cluster can store different instructions, each used to execute a portion of the functions of the video processing device. That is, the instructions stored in the memory 1006 of different computing devices 1000 can implement the functions of one or more of the aforementioned acquisition and processing modules.

[0176] In some possible implementations, one or more computing devices in a computing device cluster can be connected via a network. This network can be a wide area network (WAN) or a local area network (LAN), etc. Figure 12 illustrates one possible implementation. Figure 12 is a schematic diagram of another computing device cluster structure provided in this application. As shown in Figure 12, in computing device cluster 1200, two computing devices 1000A and 1000B are connected via a network. Specifically, they are connected to the network through communication interfaces in each computing device. In this type of possible implementation, the memory 1006 in computing device 1000A stores instructions for executing the functions of the acquisition module. Simultaneously, the memory 1006 in computing device 1000B stores instructions for executing the functions of the processing module.

[0177] It should be understood that the function of computing device 1000A shown in Figure 12 can also be performed by multiple computing devices 1000. Similarly, the function of computing device 1000B can also be performed by multiple computing devices 1000.

[0178] This application also provides a chip comprising a processing unit and a communication unit. The processing unit may be, for example, a processor, and the communication unit may be, for example, an input / output interface, pins, or circuits. The processing unit can execute computer-executable instructions stored in a storage unit to cause the chip within the electronic device to perform the methods described in the above embodiments. Optionally, the storage unit may be an in-chip storage unit, such as a register or cache. Alternatively, the storage unit may be an external storage unit located within a wireless access device, such as a read-only memory (ROM), another type of static storage device capable of storing static information and instructions, or a random access memory (RAM).

[0179] Please refer to Figure 13, which is a schematic diagram of the structure of a computer-readable storage medium provided in this application. This application also provides a computer-readable storage medium. In some embodiments, the method disclosed in Figure 2 can be implemented as computer program instructions encoded in a machine-readable format on a computer-readable storage medium or on other non-transitory media or articles of manufacture.

[0180] Figure 13 schematically illustrates a conceptual partial view of an example computer-readable storage medium arranged according to at least some of the embodiments shown herein, the example computer-readable storage medium including a computer program for executing computer processes on a computing device.

[0181] In one embodiment, the computer-readable storage medium 1300 is provided using a signal-bearing medium 1301. The signal-bearing medium 1301 may include one or more program instructions 1302, which, when executed by one or more processors, can provide the functions or parts thereof described above with reference to Figure 2.

[0182] In some examples, the signal-bearing medium 1301 may include a computer-readable medium 1303, such as, but not limited to, a hard disk drive, a compact disc (CD), a digital video disc (DVD), a digital magnetic tape, a memory, ROM, or RAM, etc.

[0183] In some embodiments, the signal-bearing medium 1301 may comprise a computer-recordable medium 1304, such as, but not limited to, a memory, a read / write (R / W) CD, a R / W DVD, and so on. In some embodiments, the signal-bearing medium 1301 may comprise a communication medium 1305, such as, but not limited to, digital and / or analog communication media (e.g., fiber optic cables, waveguides, wired communication links, wireless communication links, and so on). Therefore, for example, the signal-bearing medium 1301 may be conveyed by a wireless form of the communication medium 1305 (e.g., a wireless communication medium conforming to the IEEE 802.X standard or other transmission protocols).

[0184] One or more program instructions 1302 may be, for example, computer-executable instructions or logical implementation instructions. In some examples, the computing device may be configured to provide various operations, functions, or actions in response to one or more program instructions 1302 conveyed to the computing device via a computer-readable medium 1303, a computer-recordable medium 1304, and / or a communication medium 1305.

[0185] It should also be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. In addition, in the accompanying drawings of the device embodiments provided in this application, the connection relationship between modules indicates that they have a communication connection, which can be implemented as one or more communication buses or signal lines.

[0186] Through the above description of the embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus necessary general-purpose hardware, or it can be implemented by special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memory, special-purpose components, etc. Generally, any function performed by a computer program can be easily implemented by corresponding hardware, and the specific hardware structure used to implement the same function can also be diverse, such as analog circuits, digital circuits, or special-purpose circuits. However, for this application, software program implementation is more often the preferred implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, mobile hard disk, ROM, RAM, magnetic disk, or optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, training equipment, or network device, etc.) to execute the methods of the various embodiments of this application.

[0187] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product.

[0188] A computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions according to this application are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transferred from one computer-readable storage medium to another. For example, computer instructions can be transferred from one website, computer, training device, or data center to another website, computer, training device, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a training device or data center, that integrates one or more available media. The available media can be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid-state drives (SSDs)).

Claims

1. A video processing method, characterized in that the method comprises: acquiring a target video, the target video including multiple image frames; determining keyframes among the multiple image frames; processing the target video through a target model to obtain multiple sets of feature segments corresponding to the multiple image frames, wherein one set of feature segments in the multiple sets of feature segments corresponds to one image frame; and removing, based on the keyframes, some feature segments from the multiple sets of feature segments to obtain retained feature segments, wherein the retained feature segments include the target feature segments corresponding to the keyframes in the multiple sets of feature segments, and the retained feature segments are used as features of the target video to perform video processing tasks.

2. The method according to claim 1, characterized in that, The process of removing certain feature segments from the multiple sets of feature segments includes: Some feature segments are removed from the candidate feature segments to obtain the retained feature segments; wherein, the candidate feature segments include feature segments other than the target feature segment from the multiple sets of feature segments.

3. The method according to claim 2, characterized in that the process of removing certain feature segments from the candidate feature segments includes: dividing the candidate feature segments according to time windows to obtain multiple batches of candidate feature segments; and, in each of the multiple batches of candidate feature segments, selecting some feature segments for removal to obtain the retained feature segments.

4. The method according to claim 2, characterized in that, The process of processing the target video using the target model includes: The target video is divided into multiple sub-videos, and each sub-video includes a portion of the image frames from the multiple image frames. The target model is used to process the multiple sub-videos respectively to obtain multiple sets of feature segments corresponding to each sub-video; The process of removing certain feature segments from the multiple sets of feature segments includes: On a sub-video basis, some feature segments are removed from the multiple sets of feature segments corresponding to the sub-video.

5. The method according to any one of claims 2-4, characterized in that, Each set of feature segments includes multiple feature segments, and one feature segment corresponds to an image block in an image frame. The step of removing some feature segments from the candidate feature segments includes: Obtain the attention score for each feature segment in the candidate feature segments; Based on the attention score of each feature segment, some feature segments in the candidate feature segments are eliminated.

6. The method according to any one of claims 1-5, characterized in that, Determining the keyframes among the plurality of image frames includes: The multiple image frames are divided into multiple time windows according to time sequence, and each time window includes at least two adjacent image frames. The keyframes are determined based on the similarity between the image frames in each time window and their adjacent image frames.

7. The method according to claim 6, characterized in that, Determining the keyframe based on the similarity between image frames in each time window and adjacent image frames includes: Based on the similarity between each image frame and its neighboring image frames, at least one image frame is selected as a candidate key frame from the image frames included in each time window, resulting in multiple candidate key frames. The keyframe is determined based on the plurality of candidate keyframes.

8. The method according to claim 7, characterized in that determining the keyframe based on the plurality of candidate keyframes includes: based on the similarity between each candidate keyframe and its adjacent image frames, selecting a subset of the candidate keyframes as the keyframes.

9. The method according to any one of claims 1-8, characterized in that, The target model is a video large language model.

10. The method according to any one of claims 1-9, characterized in that, The method further includes: The retained feature segments are processed using the target model to obtain the video processing result.

11. A video processing apparatus, characterized in that the apparatus comprises: an acquisition module, used to acquire a target video, the target video including multiple image frames; and a processing module, used to determine keyframes among the multiple image frames; wherein the processing module is further configured to process the target video through a target model to obtain multiple sets of feature segments corresponding to the multiple image frames, one set of feature segments in the multiple sets of feature segments corresponding to one image frame; and the processing module is further configured to remove, based on the keyframes, some feature segments from the multiple sets of feature segments to obtain retained feature segments, wherein the retained feature segments include the target feature segments corresponding to the keyframes in the multiple sets of feature segments, and the retained feature segments are used as features of the target video to perform video processing tasks.

12. The apparatus according to claim 11, characterized in that, The processing module is specifically used for: Some feature segments are removed from the candidate feature segments to obtain the retained feature segments; wherein, the candidate feature segments include feature segments other than the target feature segment from the multiple sets of feature segments.

13. The apparatus according to claim 12, characterized in that the processing module is specifically used to: divide the candidate feature segments according to time windows to obtain multiple batches of candidate feature segments; and, in each of the multiple batches of candidate feature segments, select some feature segments for removal to obtain the retained feature segments.

14. The apparatus according to claim 12, characterized in that, The processing module is specifically used for: The target video is divided into multiple sub-videos, and each sub-video includes a portion of the image frames from the multiple image frames. The target model is used to process the multiple sub-videos respectively to obtain multiple sets of feature segments corresponding to each sub-video; On a sub-video basis, some feature segments are removed from the multiple sets of feature segments corresponding to the sub-video.

15. The apparatus according to any one of claims 12-14, characterized in that, Each of the multiple sets of feature segments includes multiple feature segments, and one feature segment corresponds to an image block in an image frame; The acquisition module is also used to acquire the attention score of each feature segment in the candidate feature segments; The processing module is further configured to remove some feature segments from the candidate feature segments based on the attention score of each feature segment.

16. The apparatus according to any one of claims 11-15, characterized in that, The processing module is specifically used for: The multiple image frames are divided into multiple time windows according to time sequence, and each time window includes at least two adjacent image frames. The keyframes are determined based on the similarity between the image frames in each time window and their adjacent image frames.

17. The apparatus according to claim 16, characterized in that, The processing module is specifically used for: Based on the similarity between each image frame and its neighboring image frames, at least one image frame is selected as a candidate key frame from the image frames included in each time window, resulting in multiple candidate key frames. The keyframe is determined based on the plurality of candidate keyframes.

18. The apparatus according to claim 17, characterized in that, The processing module is specifically used for: Based on the similarity between each candidate keyframe and its adjacent image frames, a subset of candidate keyframes are selected as the keyframes.

19. The apparatus according to any one of claims 11-18, characterized in that, The target model is a video large language model.

20. The apparatus according to any one of claims 11-19, characterized in that, The processing module is further configured to: The retained feature segments are processed using the target model to obtain the video processing result.

21. A computing device, characterized in that, The device includes a memory and a processor; the memory stores code, and the processor is configured to execute the code, wherein when the code is executed, the computing device performs the method as described in any one of claims 1 to 10.

22. A computing device cluster, characterized in that, It includes at least one computing device, each computing device including a processor and memory; The processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device to cause the cluster of computing devices to perform the operational steps of the method as described in any one of claims 1 to 10.

23. A computer storage medium, characterized in that, The computer storage medium stores instructions that, when executed by the computer, cause the computer to perform the method according to any one of claims 1 to 10.

24. A computer program product, characterized in that, The computer program product stores instructions that, when executed by a computer, cause the computer to perform the method described in any one of claims 1 to 10.