Video key frame extraction method and system, medium and electronic device
By using graph-based structured information and computer vision models, the feature similarity between the main node and child node of video keyframes is calculated, which solves the problem of low accuracy in video keyframe extraction in existing technologies and achieves more efficient keyframe extraction and video content retrieval.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: SHANGHAI MIDU INFORMATION TECH CO LTD
- Filing Date: 2026-03-02
- Publication Date: 2026-06-12
AI Technical Summary
The accuracy of keyframe extraction in existing technologies is low, which affects the processing time for video content retrieval.
By using graph-based structured information, the features of the main nodes and child nodes of the video to be processed are obtained. Using computer vision models and pre-trained models such as 3D convolutional models, VIT models and BERT models, the similarity and distance between the main nodes and child nodes are calculated to determine whether the frame image is a keyframe.
It improves the accuracy of keyframe extraction, reduces computational load, and increases the efficiency of video content retrieval.
Figure CN122200480A_ABST
Description
Technical Field
[0001] This application relates to the field of keyframe extraction technology, and in particular to a video keyframe extraction method and system, medium and electronic device.
Background Technology
[0002] Keyframe extraction is of great significance in the field of computer vision. It refers to extracting frames with significant features from a sequence of images with a large amount of information. Keyframe extraction can effectively reduce the amount of computation. Therefore, the accuracy of keyframe extraction directly affects the accuracy of vision tasks.
[0003] In the current field of video surveillance or other video-related fields, it is often necessary to analyze the acquired video to extract keyframes. Keyframes represent the most salient features of each shot in the video, so accurately extracting video keyframes can effectively reduce the processing time for video content retrieval.
[0004] Traditional techniques typically input video data into algorithms that extract different categories of keyframes to obtain the corresponding keyframes. However, traditional keyframe extraction methods have low accuracy.
Summary of the Invention
[0005] The purpose of this application is to provide a method, system, medium, and electronic device for extracting keyframes from video files, in order to solve the technical problem of low accuracy in extracting keyframes from video files.
[0006] In a first aspect, this application provides a method for extracting keyframes from a video. The method includes: acquiring a video to be processed; acquiring a graph-based master node based on the video to be processed; splitting the video to be processed into several frame images, and acquiring graph-based child nodes based on the several frame images; wherein each frame image corresponds to one child node; and determining whether a frame image is a keyframe based on the master node and the child nodes.
[0007] In one implementation of the first aspect, obtaining graph-based master nodes based on the video to be processed includes: extracting video features from the video to be processed; and preprocessing the video features to obtain graph-based master nodes.
[0008] In one implementation of the first aspect, obtaining graph-based child nodes based on a plurality of the frame images includes: obtaining frame features of the frame images based on a computer vision model; and performing regularization processing on the frame features to obtain graph-based child nodes.
[0009] In one implementation of the first aspect, determining whether the frame image is a keyframe based on the master node and the child node includes: obtaining master node features and child node features based on the master node and the child node; and determining whether the frame image is a keyframe based on the master node features and the child node features.
[0010] In one implementation of the first aspect, obtaining the main node features and child node features based on the main node and the child nodes includes: calculating the similarity between the main node and several child nodes to obtain a relationship vector between the main node and several child nodes; fusing the main node, the child nodes and the relationship vector through a pre-trained model to obtain an output matrix; taking the vector in the output matrix corresponding to the index of the main node as the main node feature, and taking the vector in the output matrix corresponding to the index of the child node as the child node feature.
[0011] In one implementation of the first aspect, before taking the vector in the output matrix corresponding to the index of the master node as the master node feature, the vector in the output matrix corresponding to the index of the master node needs to be preprocessed by an activation function; before taking the vector in the output matrix corresponding to the index of the child node as the child node feature, the vector in the output matrix corresponding to the index of the child node needs to be preprocessed by an activation function.
[0012] In one implementation of the first aspect, obtaining a similarity vector based on the predicted feature matrix and the target text features includes: obtaining a frame vector based on the predicted feature matrix; and obtaining a similarity vector based on the frame vector and the target text features.
[0013] In one implementation of the first aspect, determining whether the frame image is a keyframe based on the main node features and the child node features includes: calculating the distance between the main node features and each of the child node features; determining whether each distance is less than a preset threshold; if the distance is less than the preset threshold, then the frame image corresponding to the distance is a keyframe; otherwise, the frame image corresponding to the distance is a non-keyframe.
[0014] In a second aspect, this application provides a video keyframe extraction system. The video keyframe extraction system includes: an acquisition module for acquiring a video to be processed; a first processing module for acquiring a graph-based master node based on the video to be processed; a second processing module for splitting the video to be processed into several frame images and acquiring graph-based child nodes based on the several frame images, wherein each frame image corresponds to one child node; and a keyframe extraction module for determining whether a frame image is a keyframe based on the master node and the child nodes.
[0015] In a third aspect, this application provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by an electronic device, implements the video keyframe extraction method described in any implementation of the first aspect of this application.
[0016] In a fourth aspect, this application provides an electronic device, including: a processor and a memory; the memory is used to store a computer program; the processor is used to execute the computer program stored in the memory, so that the electronic device performs the video keyframe extraction method according to any implementation of the first aspect of this application.
[0017] The video keyframe extraction method, system, medium, and electronic device according to this application utilize the structured information of a graph to extract video keyframes, thereby improving the accuracy of keyframe extraction.
Description of the Drawings
[0018] Figure 1 is a schematic diagram of an application scenario of the electronic device of this application in one embodiment.
[0019] Figure 2 is a flowchart of one embodiment of the video keyframe extraction method described in this application.
[0020] Figure 3 is a flowchart of an embodiment of the method for obtaining the master node described in this application.
[0021] Figure 4 is a flowchart of an embodiment of the method for obtaining the child nodes described in this application.
[0022] Figure 5 is a flowchart of an embodiment of the keyframe extraction method described in this application.
[0023] Figure 6 is a flowchart of an embodiment of the method for obtaining the master node features and the child node features described in this application.
[0024] Figure 7 is a flowchart of an embodiment of the keyframe extraction method described in this application.
[0025] Figure 8 is a flowchart of another embodiment of the video keyframe extraction method described in this application.
[0026] Figure 9 is a structural schematic of the video keyframe extraction system described in this application in one embodiment.
[0027] Figure 10 is a structural schematic of the electronic device described in this application in one embodiment.
[0028] Explanation of component designations
[0029]
- 11: Mobile phone
- 12: Tablet computer
- 13: Laptop computer
- 900: Video keyframe extraction system
- 910: Acquisition module
- 920: First processing module
- 930: Second processing module
- 940: Keyframe extraction module
- 101: Processing unit
- 102: Memory
- 1021: Random access memory
- 1022: Cache memory
- 1023: Storage system
- 1024: Program / utility
- 1025: Program module
- 103: Bus
- 104: Input / output interface
- 105: Network adapter
- S1~S4: Steps
- S21~S22: Steps
- S31~S32: Steps
- S41~S42: Steps
- S411~S413: Steps
- S421~S423: Steps
Detailed Implementation
[0030] The following specific examples illustrate the implementation of this application. Those skilled in the art can easily understand other advantages and effects of this application from the content disclosed in this specification. This application can also be implemented or applied through other different specific embodiments, and various details in this specification can also be modified or changed based on different viewpoints and applications without departing from the spirit of this application. It should be noted that, unless otherwise specified, the following embodiments and features in the embodiments can be combined with each other.
[0031] It should be noted that the illustrations provided in the following embodiments are only schematic representations of the basic concept of this application. Therefore, the drawings only show the components related to this application and are not drawn according to the actual number, shape and size of the components in the actual implementation. In the actual implementation, the form, quantity and proportion of each component can be arbitrarily changed, and the layout of the components may also be more complex.
[0032] Furthermore, the use of terms such as "first" and "second" in this application is for descriptive purposes only and should not be construed as indicating or implying their relative importance or implicitly specifying the number of technical features indicated. Therefore, a feature defined as "first" or "second" may explicitly or implicitly include at least one of those features. Additionally, the technical solutions of the various embodiments can be combined with each other, but only on the basis of being achievable by those skilled in the art. If the combination of technical solutions is contradictory or impossible to implement, such a combination of technical solutions should be considered non-existent and not within the scope of protection claimed in this application.
[0033] The following embodiments of this application provide a video keyframe extraction method and system, medium, and electronic device. Based on the structured information of a graph, the main node features and child node features of the video to be processed are obtained. By calculating the distance between the main node features and each child node feature, the keyframes are further determined. The embodiments of this application utilize the structured information of a graph to extract video keyframes, which can improve the accuracy of keyframe extraction.
[0034] It's important to note that in computer science and data science, a graph typically refers to a graph in graph theory, a data structure consisting of nodes (or vertices) and edges connecting these nodes. Specifically, a graph comprises multiple nodes and edges connecting them. Nodes represent entities, while edges represent relationships or connections between entities.
[0035] In this embodiment, the video features acquired from the video to be processed are used as the main node, and the frame features corresponding to each split frame image are used as child nodes. One frame image corresponds to one child node.
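The structure described above can be pictured as a star-shaped graph: one main node connected to every child node. The following is a minimal sketch under that reading; the class name `VideoGraph`, its fields, and the toy feature values are hypothetical and chosen only for illustration, since the application does not prescribe a data layout.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vector = List[float]


@dataclass
class VideoGraph:
    """Star-shaped graph: one main node plus one child node per frame image.

    Hypothetical container for illustration only.
    """
    master: Vector                                         # video-level feature
    children: List[Vector] = field(default_factory=list)   # one feature per frame

    def add_frame(self, frame_feature: Vector) -> int:
        """Append a child node and return the index of its frame image."""
        self.children.append(frame_feature)
        return len(self.children) - 1

    def edges(self) -> List[Tuple[str, int]]:
        """Each edge links the main node to one child node (by frame index)."""
        return [("master", i) for i in range(len(self.children))]


graph = VideoGraph(master=[0.2, 0.8])
graph.add_frame([0.1, 0.9])
graph.add_frame([0.7, 0.3])
print(graph.edges())  # [('master', 0), ('master', 1)]
```

Keeping the child list in frame order means a child node's position doubles as the index of the frame image it represents.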
[0036] The video keyframe extraction method of this application can be applied to, for example, the electronic devices shown in Figure 1. These electronic devices may include mobile phones 11 with wireless charging capabilities, tablet computers 12, laptop computers 13, wearable devices, in-vehicle devices, augmented reality (AR) / virtual reality (VR) devices, ultra-mobile personal computers (UMPCs), netbooks, personal digital assistants (PDAs), etc. The specific types of electronic devices are not limited in the embodiments of this application.
[0037] For example, the electronic device may be a station (STATION, ST) in a WLAN with wireless charging capability, a cellular phone, cordless phone, Session Initiation Protocol (SIP) phone, Wireless Local Loop (WLL) station, Personal Digital Assistant (PDA) device, handheld device with wireless charging capability, computing device or other processing device, computer, laptop computer, handheld communication device, handheld computing device, and / or other devices for communication over a wireless system, as well as next-generation communication systems, such as mobile terminals in 5G networks, mobile terminals in future evolved Public Land Mobile Networks (PLMNs), or mobile terminals in future evolved Non-terrestrial Networks (NTNs).
[0038] For example, the electronic device can communicate with networks and other devices wirelessly. The wireless communication can use any communication standard or protocol, including but not limited to Global System for Mobile Communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), BT, GNSS, WLAN, NFC, FM, and / or IR technologies. The GNSS can include Global Positioning System (GPS), Global Navigation Satellite System (GLONASS), BeiDou Navigation Satellite System (BDS), Quasi-Zenith Satellite System (QZSS), and / or Satellite Based Augmentation Systems (SBAS).
[0039] The following will describe in detail the principles and implementation methods of the video keyframe extraction method and system, medium and electronic equipment described in the embodiments of this application, with reference to the accompanying drawings, so that those skilled in the art can understand the video keyframe extraction method and system, medium and electronic equipment of this embodiment without creative effort.
[0040] Figure 2 shows a flowchart of the video keyframe extraction method provided in this embodiment. As shown in Figure 2, the video keyframe extraction method includes the following steps S1 to S4.
[0041] Step S1: Obtain the video to be processed.
[0042] Specifically, the video to be processed, which requires keyframe extraction, is obtained, including but not limited to surveillance videos.
[0043] Step S2: Obtain the graph-based master node based on the video to be processed.
[0044] Specifically, the video features of the video to be processed are obtained, and the video features are used as the master nodes of the graph.
[0045] In some implementations, obtaining graph-based master nodes based on the video to be processed includes: extracting video features from the video to be processed; and preprocessing the video features to obtain graph-based master nodes.
[0046] As shown in Figure 3, obtaining the graph-based master node includes the following steps S21 to S22.
[0047] Step S21: Extract the video features of the video to be processed.
[0048] In one embodiment, video features of the video to be processed are extracted using a 3D convolutional model.
[0049] Specifically, the video to be processed is preprocessed to ensure consistency between the video to be processed and the input data of the 3D convolutional model; the preprocessed video to be processed is then input into an existing 3D convolutional model, or a custom 3D convolutional network model is created as needed, to obtain the video features of the video to be processed.
[0050] The preprocessing of the video to be processed includes: adjusting the frame rate of the video to be processed, adjusting the video to the size required by the model, and normalizing the video to be processed.
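The three preprocessing operations listed above can be sketched in plain Python. This is an illustrative sketch only: frames are modelled as grayscale nested lists with pixel values in [0, 255], frame-rate adjustment is done by nearest-frame sampling, and resizing uses nearest-neighbour lookup. All function names and these specific choices are assumptions, not details taken from the application.

```python
from typing import List

Frame = List[List[float]]  # grayscale frame: rows of pixel values in [0, 255]


def resample_frames(frames: List[Frame], src_fps: float, dst_fps: float) -> List[Frame]:
    """Adjust the frame rate by picking the nearest earlier source frame
    for each target timestamp (assumed strategy)."""
    if not frames:
        return []
    duration = len(frames) / src_fps
    n_out = max(1, round(duration * dst_fps))
    last = len(frames) - 1
    return [frames[min(last, int(i * src_fps / dst_fps))] for i in range(n_out)]


def resize_nearest(frame: Frame, height: int, width: int) -> Frame:
    """Resize a frame to the size the model expects (nearest-neighbour)."""
    src_h, src_w = len(frame), len(frame[0])
    return [[frame[r * src_h // height][c * src_w // width] for c in range(width)]
            for r in range(height)]


def normalize(frame: Frame) -> Frame:
    """Scale pixel values from [0, 255] to [0, 1]."""
    return [[p / 255.0 for p in row] for row in frame]


def preprocess(frames: List[Frame], src_fps: float, dst_fps: float,
               height: int, width: int) -> List[Frame]:
    """Frame-rate adjustment, then resizing, then normalization."""
    sampled = resample_frames(frames, src_fps, dst_fps)
    return [normalize(resize_nearest(f, height, width)) for f in sampled]


# Four 4x4 frames at 4 fps, reduced to 2 fps and resized to 2x2.
frames = [[[i * 60.0] * 4 for _ in range(4)] for i in range(4)]
out = preprocess(frames, src_fps=4, dst_fps=2, height=2, width=2)
print(len(out), len(out[0]), len(out[0][0]))  # 2 2 2
```

A production pipeline would normally rely on a video library for decoding and interpolation; the sketch only shows the order of the operations.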
[0051] It should be noted that the method for extracting video features from the video to be processed can be implemented not only through 3D convolutional models, but also through CNN extended networks, dual-channel CNNs, long short-term memory networks, and other methods.
[0052] Step S22: Preprocess the video features to obtain graph-based master nodes.
[0053] Specifically, the preprocessing of the video features includes: performing global max pooling, MLP and L2 regularization on the video features to enhance the expressiveness of the video features from multiple perspectives, and using the preprocessed video features as master nodes based on a graph.
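A minimal sketch of this step follows, with several assumptions stated up front: the video feature is modelled as a list of channels, each flattened over time and space; the MLP is a single hidden layer with ReLU and hand-picked weights; and "L2 regularization" is interpreted as L2 normalization of the feature vector to unit length. None of these specifics, nor the function names, come from the application.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def global_max_pool(feature_map: Matrix) -> Vector:
    """Collapse each channel (flattened over time and space) to its maximum."""
    return [max(channel) for channel in feature_map]


def linear(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Affine map: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def mlp(x: Vector, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    """Two-layer perceptron with a ReLU in between (assumed architecture)."""
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


def l2_normalize(x: Vector, eps: float = 1e-12) -> Vector:
    """Scale the vector to unit Euclidean length (assumed reading of 'L2 regularization')."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / max(norm, eps) for v in x]


# Toy video feature: 2 channels, 3 spatio-temporal positions each.
feature_map = [[0.1, 0.9, 0.3], [0.5, 0.2, 0.4]]
pooled = global_max_pool(feature_map)            # [0.9, 0.5]
w1, b1 = [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]   # illustrative weights
w2, b2 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
master_node = l2_normalize(mlp(pooled, w1, b1, w2, b2))
print(master_node)
```

In practice the MLP weights would be learned rather than fixed; the point of the sketch is the order pooling, MLP, normalization.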
[0054] Step S3: Divide the video to be processed into several frames, and obtain graph-based child nodes based on the several frames.
[0055] Specifically, the video to be processed is split into several frames, and feature extraction is performed on each frame to obtain the frame features corresponding to the frame, and the frame features are used as child nodes based on a graph.
[0056] It should be noted that each of the aforementioned frame images corresponds to one of the aforementioned child nodes.
[0057] In some implementations, obtaining graph-based child nodes based on several frame images includes: obtaining frame features of the frame images based on a computer vision model; and performing regularization processing on the frame features to obtain graph-based child nodes.
[0058] As shown in Figure 4, obtaining graph-based child nodes includes the following steps S31 to S32.
[0059] Step S31: Obtain the frame features of the frame image based on a computer vision model.
[0060] In one embodiment, the frame image is processed using a VIT model to obtain the frame features of the frame image.
[0061] Specifically, the frame image is adjusted to the size and format required by the VIT model. Typically, the VIT model accepts a fixed-size input, for example an image of ... pixels. The frame image is divided into multiple smaller image blocks, and the frame features of the frame image are then obtained by extracting a feature vector and encoding the position of each image block.
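The block-splitting step can be illustrated with a short sketch. It covers only the division of a frame into non-overlapping blocks and the assignment of a position index to each block; in a Vision Transformer the flattened blocks are then linearly projected and combined with position embeddings, which is omitted here. The function name and the 4×4 toy image are assumptions for illustration.

```python
from typing import List, Tuple

Image = List[List[float]]


def split_into_patches(image: Image, patch: int) -> List[Tuple[int, List[float]]]:
    """Split an image into non-overlapping patch x patch blocks.

    Returns (position_index, flattened_block) pairs in row-major order.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    position = 0
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append((position, flat))
            position += 1
    return patches


# 4x4 toy image whose pixel value encodes its own location (row * 4 + column).
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
patches = split_into_patches(image, patch=2)
print(patches[0])  # (0, [0.0, 1.0, 4.0, 5.0])
print(patches[3])  # (3, [10.0, 11.0, 14.0, 15.0])
```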
[0062] Step S32: Perform regularization processing on the frame features to obtain graph-based child nodes.
[0063] Specifically, the frame features are regularized to enhance their expressiveness, and the regularized frame features are used as the child nodes.
[0064] Step S4: Determine whether the frame image is a keyframe based on the main node and the child nodes.
[0065] Specifically, keyframes of the video to be processed are extracted by judging the similarity between the main node features corresponding to the main node and the child node features corresponding to the child nodes.
[0066] In some implementations, determining whether a frame image is a keyframe based on the master node and the child nodes includes: obtaining master node features and child node features based on the master node and the child nodes; and determining whether a frame image is a keyframe based on the master node features and the child node features.
[0067] As shown in Figure 5, determining whether the frame image is a keyframe includes the following steps S41 to S42.
[0068] Step S41: Obtain the characteristics of the main node and the characteristics of the child nodes based on the main node and the child nodes.
[0069] In some implementations, obtaining the main node features and child node features based on the main node and the child nodes includes: calculating the similarity between the main node and several child nodes to obtain the relationship vector between the main node and several child nodes; fusing the main node, the child nodes and the relationship vector through a pre-trained model to obtain an output matrix; taking the vector in the output matrix corresponding to the index of the main node as the main node feature, and taking the vector in the output matrix corresponding to the index of the child node as the child node feature.
[0070] As shown in Figure 6, obtaining the main node features and the child node features includes the following steps S411 to S413.
[0071] Step S411: Calculate the similarity between the main node and several child nodes to obtain the relationship vector between the main node and several child nodes.
[0072] Specifically, the cosine similarity between the main node and each of the child nodes is calculated to obtain a vector. The number of elements in the vector equals the number of child nodes, and each element corresponds to one child node. This vector is the relationship vector between the main node and the child nodes.
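As a minimal sketch of this step, the relationship vector can be computed with one cosine similarity per child node. The function names and the toy vectors are illustrative assumptions; returning 0.0 for a zero-length vector is also a choice made here, not something the application specifies.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def relation_vector(master: Vector, children: List[Vector]) -> Vector:
    """One element per child node: its cosine similarity to the main node."""
    return [cosine_similarity(master, child) for child in children]


master = [1.0, 0.0]
children = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
relation = relation_vector(master, children)
print(relation)  # [1.0, 0.0, -1.0]
```

The i-th element of `relation` belongs to the i-th child node, so the vector length always matches the number of frame images.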
[0073] Step S412: The master node, the child node and the relation vector are fused using a pre-trained model to obtain an output matrix.
[0074] Specifically, the relationship vector is input into a multilayer perceptron for further learning, so as to capture the relationship information between the master node and each of the child nodes more accurately.
[0075] The relationship vector learned by the multilayer perceptron, together with the master node and the child nodes, is input into the Bidirectional Encoder Representations from Transformers (BERT) model. The master node, the child nodes, and the relationship vector are fused by the BERT model, and the fusion result is output in matrix form as the output matrix.
[0076] It should be noted that the Bidirectional Encoder Representations from Transformers (BERT) model is a pre-trained deep learning model based on Transformer, which can efficiently process sequence data.
[0077] Step S413: Take the vector in the output matrix corresponding to the index of the master node as the master node feature, and take the vector in the output matrix corresponding to the index of the child node as the child node feature.
[0078] Specifically, before taking the vector in the output matrix corresponding to the index of the master node as the master node feature, the vector in the output matrix corresponding to the index of the master node needs to be preprocessed by an activation function; before taking the vector in the output matrix corresponding to the index of the child node as the child node feature, the vector in the output matrix corresponding to the index of the child node needs to be preprocessed by an activation function.
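The index lookup and activation can be sketched as follows. The layout of the output matrix (master node in row 0, child nodes in the following rows) and the choice of `tanh` as the activation function are assumptions made for illustration; the application names neither.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
Matrix = List[Vector]

MASTER_INDEX = 0  # assumed: the master node occupies the first row


def split_output(output_matrix: Matrix,
                 activation: Callable[[float], float] = math.tanh
                 ) -> Tuple[Vector, List[Vector]]:
    """Apply the activation element-wise, then pick rows by node index.

    Returns (master_feature, child_features), with child_features in frame order.
    """
    activated = [[activation(v) for v in row] for row in output_matrix]
    master_feature = activated[MASTER_INDEX]
    child_features = activated[MASTER_INDEX + 1:]
    return master_feature, child_features


# Toy output matrix: 1 master row followed by 2 child rows.
output_matrix = [[0.0, 1.0], [0.5, -0.5], [2.0, 0.0]]
master_feature, child_features = split_output(output_matrix)
print(master_feature)
print(child_features)
```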
[0079] Step S42: Determine whether the frame image is a keyframe based on the main node features and the child node features.
[0080] Specifically, the similarity between the main node features and the features of each child node is used to determine whether the frame image corresponding to the child node feature is a key frame. The higher the similarity between the main node features and the child node features, the closer the frame image corresponding to the child node is to the key frame of the video to be processed.
[0081] In some implementations, determining whether a frame image is a keyframe based on the main node features and the child node features includes: calculating the distance between the main node features and each of the child node features; determining whether each distance is less than a preset threshold; if the distance is less than the preset threshold, then the frame image corresponding to the distance is a keyframe; otherwise, the frame image corresponding to the distance is a non-keyframe.
[0082] As shown in Figure 7, determining whether the frame image is a keyframe includes the following steps S421 to S423.
[0083] Step S421: Calculate the distance between the main node feature and each of the child node features.
[0084] Specifically, the higher the similarity between the main node feature and the child node feature, the closer the vector corresponding to the main node feature is to the vector corresponding to the child node feature, that is, the smaller the distance between them.
[0085] Specifically, the distance between the main node feature and each of the child node features is calculated using Euclidean distance.
[0086] Step S422: Determine whether each distance is less than a preset threshold.
[0087] In one embodiment, the preset threshold is set to 0.1, then it is determined whether the distance between the main node feature and each of the child node features is less than 0.1.
[0088] It should be noted that the preset threshold is not a unique value and can be adjusted according to actual needs.
[0089] Step S423: If the distance is less than a preset threshold, then the frame image corresponding to the distance is a key frame; otherwise, the frame image corresponding to the distance is a non-key frame.
[0090] Specifically, if the distance between the main node feature and a child node feature is less than 0.1, then the frame image corresponding to the child node feature is a keyframe.
[0091] If the distance between the main node feature and a child node feature is not less than 0.1, then the frame image corresponding to the child node feature is a non-key frame.
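Steps S421 to S423 can be sketched as a distance computation followed by a threshold test. The function names and the toy feature values are illustrative assumptions; the threshold of 0.1 is the example value given in this embodiment.

```python
import math
from typing import List

Vector = List[float]


def euclidean_distance(a: Vector, b: Vector) -> float:
    """Straight-line distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_keyframes(master_feature: Vector, child_features: List[Vector],
                     threshold: float = 0.1) -> List[int]:
    """Return indices of frame images whose child node feature lies closer
    to the main node feature than the preset threshold."""
    return [index for index, child in enumerate(child_features)
            if euclidean_distance(master_feature, child) < threshold]


master_feature = [0.5, 0.5]
child_features = [[0.5, 0.55], [0.9, 0.1], [0.45, 0.5], [0.5, 0.7]]
keyframes = select_keyframes(master_feature, child_features)
print(keyframes)  # [0, 2]
```

Because the comparison is strict, a frame whose distance equals the threshold is treated as a non-keyframe, matching the "not less than" wording above.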
[0092] Figure 8 is a flowchart of another embodiment of the video keyframe extraction method described in this application.
[0093] As shown in Figure 8, the video keyframe extraction method includes:
[0094] Acquire the video to be processed and preprocess the video to be processed: split the video to be processed into frames to obtain several frames of frame images;
[0095] Obtaining the master node: Extracting video features from the video to be processed using a 3D convolutional model, and then performing global max pooling, MLP, and L2 regularization on the video features to improve their expressiveness, and using the processed video features as the master node;
[0096] Obtaining child nodes: Input each frame image into the VIT model to obtain the frame features corresponding to the frame images, perform regularization processing on the frame features to improve the expressiveness of the frame features, and use the regularized frame features as the child nodes;
[0097] Obtaining main node features and child node features: Cosine similarity is calculated for the main node and each child node. The obtained vector is used as the relationship vector between the main node and each child node. This relationship vector is then used for deep learning via a multilayer perceptron to more accurately obtain the relationship information between the main node and each child node. The main node, child nodes, and relationship information are input into a BERT model for fusion to obtain an output matrix. The vectors in the output matrix corresponding to the main node's index are preprocessed using an activation function to obtain the main node features. Similarly, the vectors in the output matrix corresponding to the child node's index are preprocessed using an activation function to obtain the child node features.
[0098] Determine whether the frame image is a keyframe: Calculate the Euclidean distance between the main node features and each of the child node features. If the Euclidean distance between the main node features and a child node feature is less than 0.1, then the frame image corresponding to that child node is a keyframe; if the Euclidean distance is not less than 0.1, then the frame image corresponding to that child node is a non-keyframe.
[0099] It should be noted that the scope of protection of the video keyframe extraction method described in this application is not limited to the execution order of the steps listed in this embodiment. Any solution implemented by adding, subtracting, or replacing steps in the prior art based on the principles of this application is included within the scope of protection of this application.
[0100] Figure 9 is a structural schematic of the video keyframe extraction system described in the embodiments of this application.
[0101] As shown in Figure 9, the video keyframe extraction system 900 includes: an acquisition module 910, a first processing module 920, a second processing module 930, and a keyframe extraction module 940.
[0102] The acquisition module 910 is used to acquire the video to be detected and the target text prompt.
[0103] The first processing module 920 is used to obtain a graph-based master node based on the video to be processed.
[0104] The second processing module 930 is used to split the video to be processed into several frames and obtain graph-based child nodes based on the several frames.
[0105] Specifically, each of the frame images corresponds to one of the child nodes.
[0106] The keyframe extraction module 940 is used to determine whether the frame image is a keyframe based on the main node and the child node.
[0107] It should be noted that the above division of modules is merely a logical functional division. In actual implementation, they can be fully or partially integrated into a single physical entity, or they can be physically separated. Furthermore, these modules can be implemented entirely in software via processing element calls; they can be fully implemented in hardware; or some modules can be implemented by processing element calls to software, while others are implemented in hardware. For example, module x can be a separate processing element, or it can be integrated into a chip in the aforementioned device. Alternatively, it can be stored as program code in the memory of the aforementioned device, and its function can be called and executed by a processing element of the device. The implementation of other modules is similar. Moreover, these modules can be fully or partially integrated together, or they can be implemented independently. The processing element mentioned here can be an integrated circuit with signal processing capabilities. In the implementation process, each step of the above method or each of the above modules can be completed through integrated logic circuits in the hardware of the processor element or through software instructions.
[0108] For example, these modules can be one or more integrated circuits configured to implement the above methods, such as one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). As another example, when a module is implemented using processing element scheduler code, the processing element can be a general-purpose processor, such as a Central Processing Unit (CPU) or other processor capable of calling program code. Furthermore, these modules can be integrated together to form a system-on-a-chip (SOC).
[0109] This application also provides a computer-readable storage medium. Those skilled in the art will understand that all or part of the steps in the methods of the above embodiments can be implemented by a program instructing a processor. The program can be stored in a computer-readable storage medium, which is a non-transitory medium, such as random access memory, read-only memory, flash memory, hard disk, solid-state drive, magnetic tape, floppy disk, optical disk, and any combination thereof. The storage medium can be any available medium accessible to a computer or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., digital video disc (DVD)), or a semiconductor medium (e.g., solid-state drive (SSD)).
[0110] This application also provides an electronic device, including a processor and a memory.
[0111] Specifically, the memory is used to store a computer program; the memory includes various media that can store program code, such as ROM, RAM, magnetic disks, USB flash drives, memory cards, or optical discs.
[0112] The processor executes the computer program stored in the memory to enable the electronic device to perform the video keyframe extraction method described above.
[0113] As shown in Figure 10, the electronic device of the present invention is embodied in the form of a general-purpose computing device. The components of the electronic device may include, but are not limited to: one or more processors or processing units 101, a memory 102, and a bus 103 connecting different system components (including the memory 102 and the processing unit 101).
[0114] Bus 103 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
[0115] Electronic devices typically include a variety of computer-readable media. These media can be any available media that can be accessed by the electronic device, including volatile and non-volatile media, and removable and non-removable media.
[0116] Memory 102 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 1021 and/or cache memory 1022. The electronic device may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 1023 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in Figure 10; usually referred to as a "hard drive"). Although not shown in Figure 10, a disk drive for reading from and writing to a removable non-volatile disk (e.g., a "floppy disk") and an optical disc drive for reading from and writing to a removable non-volatile optical disc (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 103 via one or more data media interfaces. Memory 102 may include at least one program product having a set (e.g., at least one) of program modules configured to perform the functions of the embodiments of the present invention.
[0117] A program/utility 1024 having a set (at least one) of program modules 1025 may be stored, for example, in memory 102. Such program modules 1025 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data. Each or some combination of these examples may include an implementation of a network environment. Program modules 1025 typically perform the functions and/or methods described in the embodiments of the present invention.
[0118] The electronic device can also communicate with one or more external devices (e.g., keyboard, pointing device, display, etc.), with one or more devices that enable a user to interact with the electronic device, and/or with any device (e.g., network card, modem, etc.) that enables the electronic device to communicate with one or more other computing devices. This communication can be performed through input/output (I/O) interface 104. Furthermore, the electronic device can also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through network adapter 105. As shown in Figure 10, network adapter 105 communicates with other modules of the electronic device via bus 103. It should be understood that, although not shown in the figure, other hardware and/or software modules can be used in conjunction with the electronic device, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
[0119] The descriptions of the processes or structures corresponding to the above figures each have their own emphasis. For parts of a process or structure that are not described in detail, please refer to the relevant descriptions of other processes or structures.
[0120] In summary, the video keyframe extraction method, system, medium, and electronic device of this application obtain the main node features and child node features of the video to be processed based on the structured information of a graph. Keyframes are then determined by calculating the distance between the main node features and each child node feature. Because the embodiments of this application use the structured information of a graph to extract video keyframes, they can improve the accuracy of keyframe extraction. Therefore, this application effectively overcomes the various shortcomings of the prior art and has high industrial application value.
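The distance-based decision summarized above can be illustrated with a minimal sketch. It assumes the main node feature and the child node features are already available as plain numeric vectors, and it uses Euclidean distance purely as an assumption, since this application does not fix a particular distance metric. The names `euclidean_distance` and `select_keyframes`, the example vectors, and the threshold value are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors.

    Assumption: the application does not name a metric; Euclidean distance
    is used here only as an example.
    """
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_keyframes(
    main_feature: Sequence[float],
    child_features: Sequence[Sequence[float]],
    threshold: float,
) -> List[int]:
    """Return the indices of frames whose child node feature lies strictly
    closer to the main node feature than the preset threshold."""
    return [
        index
        for index, child in enumerate(child_features)
        if euclidean_distance(main_feature, child) < threshold
    ]


# Hypothetical features: one main node and three child nodes (one per frame).
main_feature = [1.0, 0.0]
child_features = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.2]]
print(select_keyframes(main_feature, child_features, threshold=0.5))  # [0, 2]
```

In this sketch, frames 0 and 2 are reported as keyframes because their features lie within the threshold of the main node feature, while frame 1 is treated as a non-keyframe.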
[0121] The above embodiments are merely illustrative of the principles and effects of this application and are not intended to limit this application. Any person skilled in the art can modify or alter the above embodiments without departing from the spirit and scope of this application. Therefore, all equivalent modifications or alterations made by those skilled in the art without departing from the spirit and technical concept disclosed in this application should still be covered by the claims of this application.
Claims
1. A video keyframe extraction method, characterized in that the method includes: acquiring a video to be processed; obtaining a graph-based main node based on the video to be processed; dividing the video to be processed into several frame images, and obtaining graph-based child nodes based on the several frame images, wherein each frame image corresponds to one child node; and determining whether each frame image is a keyframe based on the main node and the child nodes.
2. The video keyframe extraction method according to claim 1, characterized in that obtaining the graph-based main node based on the video to be processed includes: extracting video features of the video to be processed; and preprocessing the video features to obtain the graph-based main node.
3. The video keyframe extraction method according to claim 1, characterized in that obtaining the graph-based child nodes based on the several frame images includes: obtaining frame features of each frame image based on a computer vision model; and regularizing the frame features to obtain the graph-based child nodes.
4. The video keyframe extraction method according to claim 1, characterized in that determining whether the frame image is a keyframe based on the main node and the child nodes includes: obtaining main node features and child node features based on the main node and the child nodes; and determining whether the frame image is a keyframe based on the main node features and the child node features.
5. The video keyframe extraction method according to claim 4, characterized in that obtaining the main node features and the child node features based on the main node and the child nodes includes: calculating the similarity between the main node and the several child nodes to obtain a relation vector between the main node and the several child nodes; fusing the main node, the child nodes, and the relation vector using a pre-trained model to obtain an output matrix; and taking the vector in the output matrix corresponding to the index of the main node as the main node feature, and taking the vector in the output matrix corresponding to the index of each child node as the child node feature.
6. The video keyframe extraction method according to claim 5, characterized in that, before the vector in the output matrix corresponding to the index of the main node is taken as the main node feature, that vector is preprocessed by an activation function; and before the vector in the output matrix corresponding to the index of the child node is taken as the child node feature, that vector is preprocessed by an activation function.
7. The video keyframe extraction method according to claim 4, characterized in that determining whether the frame image is a keyframe based on the main node features and the child node features includes: calculating the distance between the main node feature and each of the child node features; determining whether each distance is less than a preset threshold; and, if the distance is less than the preset threshold, determining that the frame image corresponding to the distance is a keyframe; otherwise, determining that the frame image corresponding to the distance is a non-keyframe.
8. A video keyframe extraction system, characterized in that the system includes: an acquisition module, used to acquire a video to be processed; a first processing module, used to obtain a graph-based main node based on the video to be processed; a second processing module, used to divide the video to be processed into several frame images and obtain graph-based child nodes based on the several frame images, wherein each frame image corresponds to one child node; and a keyframe extraction module, used to determine whether each frame image is a keyframe based on the main node and the child nodes.
9. A computer-readable storage medium having a computer program stored thereon, characterized in that, when executed by an electronic device, the program implements the video keyframe extraction method according to any one of claims 1 to 7.
10. An electronic device, characterized in that the electronic device includes a processor and a memory; the memory is used to store a computer program; and the processor is used to execute the computer program stored in the memory to cause the electronic device to perform the video keyframe extraction method according to any one of claims 1 to 7.
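As an illustrative note on the similarity step recited in claim 5, the following minimal sketch shows one way a relation vector between the main node and the child nodes might be computed. It assumes cosine similarity, which is not specified in the claims, and it omits the fusion by a pre-trained model entirely. The function names `cosine_similarity` and `relation_vector` and the example vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros).

    Assumption: the claims only say "similarity"; cosine similarity is one
    common choice and is used here as an example.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def relation_vector(
    main_node: Sequence[float],
    child_nodes: Sequence[Sequence[float]],
) -> List[float]:
    """One similarity score per child node, in frame order."""
    return [cosine_similarity(main_node, child) for child in child_nodes]


# Hypothetical node vectors: the first child is parallel to the main node,
# the second is orthogonal to it.
relations = relation_vector([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0]])
print(relations)  # [1.0, 0.0]
```

Under these assumptions, each entry of the relation vector scores how closely one frame's child node aligns with the main node; in the claimed method, the relation vector would then be passed, together with the nodes, to the pre-trained model for fusion.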