Video clip retrieval model generation method and device, equipment and storage medium
By segmenting and feature encoding video clips, constructing a graph network structure and training a model, the problems of low reliability and efficiency in video clip retrieval are solved, and efficient and accurate video clip retrieval is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- PING AN TECH (SHENZHEN) CO LTD
- Filing Date
- 2024-09-23
- Publication Date
- 2026-06-16
AI Technical Summary
Existing technologies for retrieving complete videos from video clips have poor reliability and low efficiency, limiting their application scenarios.
By segmenting video clips, video blocks and their accompanying text information are obtained. A graph network structure is constructed, and a video clip retrieval model is trained using feature encoding and loss functions to capture the relationships between video blocks and between video and text.
It improves the reliability and efficiency of video clip retrieval, expands the flexibility of application scenarios, and can accurately retrieve complete videos.
Figure CN119202312B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of artificial intelligence, and in particular to a method for generating a video clip retrieval model, an apparatus for generating a video clip retrieval model, a computer device, and a computer-readable storage medium.
Background Technology
[0002] Currently, video clip retrieval tasks generally refer to retrieving and querying, from large-scale video databases, videos related to an existing video clip. The applications are widespread. For example, insurance companies can use video clips with accompanying text to retrieve complete videos, thereby accelerating the insurance claims review process: by inputting the relevant video clips of a claim, the system can automatically retrieve the complete video to verify the authenticity and legality of the claim. However, in related technologies, the complete videos retrieved from video clips have relatively poor reliability, retrieval efficiency is low, and application scenarios are limited.
Summary of the Invention
[0003] This application provides a method for generating a video segment retrieval model, a device for generating a video segment retrieval model, a computer device, and a computer-readable storage medium, aiming to solve the problem in related technologies that it is impossible to accurately retrieve a complete video from a video segment.
[0004] To achieve the above objectives, this application provides a method for generating a video segment retrieval model, the method comprising:
[0005] Acquire video segments and segment the video segments to obtain multiple video blocks and their corresponding video text information and prior boxes;
[0006] The plurality of video blocks and their corresponding video text information are subjected to feature encoding processing to generate video encoding information and text encoding information corresponding to the plurality of video blocks;
[0007] The prior bounding boxes are used as first graph network nodes, the video text information is used as second graph network nodes, and a first graph network structure is constructed based on the first graph network nodes and the second graph network nodes.
[0008] Based on the video encoding information and the text encoding information, the nodes in the first graph network structure are updated to obtain the second graph network structure;
[0009] Based on the labels corresponding to the prior bounding boxes, determine the loss function of the video segment retrieval model;
[0010] Based on the second graph network structure and the loss function, the video segment retrieval model is trained to obtain the target video segment retrieval model.
[0011] To achieve the above objectives, this application provides a device for generating a video segment retrieval model, the device comprising:
[0012] The video segmentation module is used to acquire video segments and segment the video segments to obtain multiple video blocks and their corresponding video text information and prior boxes;
[0013] The feature encoding module is used to perform feature encoding processing on the multiple video blocks and their corresponding video text information to generate video encoding information and text encoding information corresponding to the multiple video blocks;
[0014] The network structure construction module is used to use the prior bounding boxes as first graph network nodes, the video text information as second graph network nodes, and to construct a first graph network structure based on the first graph network nodes and the second graph network nodes.
[0015] The network structure update module is used to update the nodes in the first graph network structure according to the video encoding information and the text encoding information to obtain the second graph network structure.
[0016] The loss function determination module is used to determine the loss function of the video segment retrieval model based on the labels corresponding to the prior boxes;
[0017] The model training module is used to train the video segment retrieval model based on the second graph network structure and the loss function to obtain the target video segment retrieval model.
[0018] To achieve the above objectives, this application provides a computer device, which includes a memory and a processor; the memory is used to store a computer program; the processor is used to execute the computer program and, when executing the computer program, implement the video segment retrieval model generation method as described above.
[0019] To achieve the above objectives, this application provides a computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to implement the video segment retrieval model generation method described above.
[0020] The video segment retrieval model generation method, device, computer equipment, and computer-readable storage medium disclosed in this application embodiment acquire video segments and segment them to obtain multiple video blocks and their corresponding video text information and prior boxes; perform feature encoding processing on the multiple video blocks and their corresponding video text information to generate video encoding information and text encoding information corresponding to the multiple video blocks; use the prior boxes as first graph network nodes and the video text information as second graph network nodes, and construct a first graph network structure based on the first graph network nodes and the second graph network nodes; update the nodes in the first graph network structure according to the video encoding information and text encoding information to obtain a second graph network structure; determine the loss function of the video segment retrieval model based on the labels corresponding to the prior boxes; and train the video segment retrieval model based on the second graph network structure and the loss function to obtain a target video segment retrieval model. This can effectively capture the relationships between video blocks and the interaction between video and text, thereby enabling accurate retrieval of complete videos through video segments, improving the reliability and efficiency of video retrieval, and offering flexible application scenarios.
Attached Figure Description
[0021] To more clearly illustrate the technical solutions of the embodiments of this application, the drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0022] Figure 1 is a schematic diagram illustrating a method for generating a video segment retrieval model according to an embodiment of this application;
[0023] Figure 2 is a flowchart illustrating a method for generating a video segment retrieval model according to an embodiment of this application;
[0024] Figure 3 is a schematic diagram of a scenario for segmenting video clips provided in an embodiment of this application;
[0025] Figure 4 is a schematic diagram of a first graph network structure provided in an embodiment of this application;
[0026] Figure 5 is a schematic block diagram of a video clip retrieval model generation device provided in one embodiment of this application;
[0027] Figure 6 is a schematic block diagram of a computer device provided in one embodiment of this application.
Detailed Implementation
[0028] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0029] The flowcharts shown in the accompanying drawings are merely illustrative and do not necessarily include all content and operations / steps, nor do they necessarily need to be performed in the described order. For example, some operations / steps can be broken down, combined, or partially merged, so the actual execution order may change depending on the actual situation. Furthermore, although functional modules are divided in the device diagram, in some cases, a different module division may be used.
[0030] The term “and / or” as used in this application specification and the appended claims means any combination of one or more of the associated listed items, as well as all possible combinations, and includes such combinations.
[0031] This application proposes a method for generating a video segment retrieval model, a device for generating a video segment retrieval model, a computer device, and a computer-readable storage medium. This method can effectively capture the relationships between video blocks and the interaction between video and text, thereby enabling accurate retrieval of complete videos through video segments, improving the reliability and efficiency of video retrieval, and offering flexible application scenarios.
[0032] This method can be applied to servers or terminal devices to generate the corresponding video clip retrieval models. Terminal devices can include mobile phones, tablets, and personal digital assistants (PDAs). Servers can be, for example, standalone servers or server clusters. For ease of understanding, the following embodiments describe in detail a method for generating a video clip retrieval model as applied to a server.
[0033] The following detailed description of some embodiments of this application is provided in conjunction with the accompanying drawings. Unless otherwise specified, the following embodiments and features can be combined with each other.
[0034] As shown in Figure 1, the video segment retrieval model generation method provided in this application embodiment can be applied to, for example, the application environment shown in Figure 1. The application environment includes a terminal device 110 and a server 120, wherein the terminal device 110 can communicate with the server 120 via a network. Specifically, the server 120 acquires video segments and segments them to obtain multiple video blocks and their corresponding video text information and prior boxes; it uses the prior boxes as first graph network nodes and the video text information as second graph network nodes, and constructs a first graph network structure based on the first and second graph network nodes; it performs feature encoding processing on the multiple video blocks and their corresponding video text information to generate video encoding information and text encoding information corresponding to the multiple video blocks; it updates the nodes in the first graph network structure based on the video encoding information and the text encoding information to obtain a second graph network structure; it determines the loss function of the video segment retrieval model based on the labels corresponding to the prior boxes; it trains the video segment retrieval model based on the second graph network structure and the loss function to obtain a target video segment retrieval model, and sends the target video segment retrieval model to the terminal device 110. The server 120 can be a standalone server or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDNs), and big data and artificial intelligence platforms.
The terminal device 110 can be a smartphone, tablet, laptop, desktop computer, smart speaker, smartwatch, etc., but is not limited to these. The terminal and server can be directly or indirectly connected via wired or wireless communication; this application does not impose any restrictions on this connection.
[0035] Referring to Figure 2, Figure 2 is a schematic flowchart illustrating a method for generating a video clip retrieval model according to an embodiment of this application. This method effectively captures the relationships between video blocks and the interaction between video and text, thereby enabling accurate retrieval of complete videos from video clips, improving the reliability and efficiency of video retrieval, and offering flexible application scenarios.
[0036] As shown in Figure 2, the method for generating the video clip retrieval model includes steps S101 to S106.
[0037] S101. Obtain video segments and segment the video segments to obtain multiple video blocks and their corresponding video text information and prior boxes.
[0038] The video clips can be the video segments used for retrieval, for example clips showing the person or animal that is the subject of a compensation claim, or video clips from other scenes. Video blocks can be one or more frames of video images within a video clip. Video text information can be the text information contained within each video block. Prior bounding boxes can be anchor boxes, which are boxes of several sizes and aspect ratios created around a center point.
[0039] The embodiments of this application can acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) refers to the theories, methods, technologies, and application systems that use digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.
[0040] Foundational technologies for artificial intelligence generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating / interactive systems, and mechatronics. AI software technologies mainly encompass computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning / deep learning.
[0041] In some embodiments, the video segment is segmented according to the number of frames to obtain multiple video blocks; text extraction is performed on the multiple video blocks to obtain video text information corresponding to the multiple video blocks; and clustering is performed on the multiple video blocks to obtain prior bounding boxes corresponding to the multiple video blocks. This allows for the accurate extraction of multiple video blocks and their corresponding video text information and prior bounding boxes from the video segment.
[0042] The video clips can be at 30 or 60 frames per second. The video block can be one or more frames of video image from the video clip.
[0043] Specifically, the video segment can be segmented according to the number of frames to obtain the video image of each frame, and the video image can be used as a video block; text extraction processing can be performed on multiple video blocks to extract the text information contained in the multiple video blocks; then the K-means clustering algorithm can be used to cluster the multiple video blocks to obtain the prior boxes corresponding to the multiple video blocks.
[0044] As shown in Figure 3, for example, an N-frame video V can be divided at equal intervals into $N_b$ video blocks. Based on all ground truth (GT) annotations in the dataset, the K-means clustering algorithm is used to cluster $N_a$ anchor boxes of different lengths. For each video block, $N_a$ anchor boxes are predefined, centered on the midpoint of the video block, giving $N_b \times N_a$ anchor boxes in total.
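The segmentation and anchor generation described above can be sketched as follows. This is a minimal illustration and not the implementation of this application: the function names, the quantile-based initialization of K-means, and the sample ground-truth lengths are all assumptions made for the example.

```python
def split_into_blocks(num_frames, num_blocks):
    """Split frame indices into equal-interval blocks as (start, end) pairs, end exclusive."""
    bounds = [round(i * num_frames / num_blocks) for i in range(num_blocks + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(num_blocks)]


def kmeans_1d(values, k, iters=50):
    """Plain 1-D K-means over segment lengths; returns the sorted cluster centers.

    Centers are initialized at evenly spaced quantiles (an assumption chosen
    so that the example is deterministic).
    """
    ordered = sorted(values)
    centers = [ordered[int((i + 0.5) * len(ordered) / k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        new_centers = [sum(g) / len(g) if g else centers[j] for j, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)


def build_anchors(num_frames, num_blocks, anchor_lengths):
    """For every block, one (midpoint, length) anchor per clustered length."""
    anchors = []
    for start, end in split_into_blocks(num_frames, num_blocks):
        mid = (start + end) / 2
        anchors.append([(mid, length) for length in anchor_lengths])
    return anchors


# Hypothetical ground-truth segment lengths (in frames) from a dataset.
gt_lengths = [10, 12, 11, 48, 52, 50, 98, 102, 100]
anchor_lengths = kmeans_1d(gt_lengths, k=3)          # N_a = 3
anchors = build_anchors(120, 4, anchor_lengths)      # N_b = 4
```

With these inputs the result is a 4 × 3 arrangement of anchors, one row of three lengths per video block.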
[0045] S102. Perform feature encoding processing on multiple video blocks and their corresponding video text information to generate video encoding information and text encoding information corresponding to multiple video blocks.
[0046] The video encoding information can be the video representation information corresponding to the anchor box, i.e., the node representation information corresponding to the first graph network nodes. The text encoding information can be the text representation information corresponding to the video text information, i.e., the node representation information corresponding to the second graph network nodes.
[0047] In some embodiments, the video block is vector-transformed to obtain the embedding vector corresponding to the video block; the embedding vector is encoded to obtain the video encoding information corresponding to the video block; the video text information is input into a pre-trained encoder to obtain text encoding information, and the feature dimensions of the video encoding information and the text encoding information are the same.
[0048] The embedding vector corresponding to a video block can be the embedding representation of the video block.
[0049] Specifically, each frame of video V can be fed into a pre-trained ViT model to obtain the embedding vector corresponding to the video block. These vectors are then concatenated to obtain the representation of the entire video segment. This representation is then further input into a multi-layer convolutional neural network and a Transformer encoder to obtain the video representation of each of the $N_b \times N_a$ anchor boxes, i.e., the video encoding information.
[0050] The ViT model segments the image into a series of patches and then linearly embeds these patches into a high-dimensional space, generating a series of sequence embeddings. These sequence embeddings, along with position embeddings (which provide spatial location information for each patch), are then fed into a standard Transformer encoder for processing. Finally, a classification head (fully connected layer) decodes the output of the Transformer encoder to obtain the video encoding information.
[0051] Specifically, given a text Q containing multiple words, it is input into the Sentence-BERT pre-trained sentence encoder (PSE) to obtain the text representation $X_Q \in \mathbb{R}^d$ of the video text information, i.e., the text encoding information.
[0052] It should be noted that the feature dimensions of the video representation and the text representation can be kept the same during the encoding process, so that the dot product operation can be performed later.
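The reason for matching feature dimensions can be illustrated with a small sketch. The projection matrices `W_video` and `W_text`, the feature values, and the helper names below are hypothetical; the source only states that the two representations share the same dimension so that a dot product is possible.

```python
def linear(x, weight):
    """Project vector x with an (out_dim x in_dim) weight matrix given as nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def dot(a, b):
    """Dot product that refuses vectors of different dimensions."""
    if len(a) != len(b):
        raise ValueError("feature dimensions differ")
    return sum(x * y for x, y in zip(a, b))


# Hypothetical raw features with different dimensions.
video_feat = [1.0, 0.0, 2.0, 1.0]   # 4-dim anchor feature
text_feat = [0.5, 1.0, -1.0]        # 3-dim sentence feature

# Hypothetical projections into a shared d = 2 space.
W_video = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0]]
W_text = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]

v = linear(video_feat, W_video)
q = linear(text_feat, W_text)
score = dot(v, q)   # defined only because both vectors now have dimension 2
```

Calling `dot(video_feat, text_feat)` directly would raise an error, which is the situation the shared dimension avoids.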
[0053] S103. Use the prior bounding boxes as nodes of the first graph network, use the video text information as nodes of the second graph network, and construct the first graph network structure based on the first graph network nodes and the second graph network nodes.
[0054] The $N_b \times N_a$ generated anchor boxes can serve as the first graph network nodes, while the second graph network node can be a global node. These first and second graph network nodes are used to construct the first graph network structure. The first graph network structure is a graph network structure that has not yet been updated.
[0055] In some embodiments, node information of first graph network nodes and second graph network nodes is obtained; based on the node information of the first graph network nodes, positional association information of each first graph network node is determined; based on the node information of the second graph network nodes, interaction association information between the first graph network nodes and the second graph network nodes is determined; based on the positional association information and interaction association information, network construction is performed on the first graph network nodes and the second graph network nodes to generate a first graph network structure. Thus, a first graph network structure can be created using the first graph network nodes and the second graph network nodes.
[0056] The node information for the first graph network nodes can include the node representation and length of the prior bounding box. The node information for the second graph network nodes can include the node representation of the video text information. Position association information can be used to represent the positional relationships between the various first graph network nodes, such as adjacency, same-row, or same-column relationships. Interaction association information can be used to represent the information transmission relationships between the first graph network nodes and the second graph network nodes.
[0057] Specifically, the node representations and length information of the first graph network nodes are obtained. Based on these representations and length information, the same-row nodes, same-column nodes, and adjacent nodes of each first graph network node are determined, thereby enabling the determination of the positional association information of each first graph network node. This allows for the adaptive selection of first graph network nodes corresponding to anchor boxes of different lengths, thus better adapting to the characteristics of different video blocks.
[0058] As shown in Figure 4, for example, the graph network structure is a regular $N_a \times N_b$ rectangle, in which the nodes of each column are the $N_a$ anchor boxes of the same video block, sorted by length, and the nodes of each row are the $N_b$ anchor boxes of the same length defined for each video block. Next, the network edges E need to be defined, so the length relationships of the anchor boxes and their positional relationships at different times within the video must be considered. For each first graph network node, the four first graph network nodes above, below, left, and right of that node can be considered its adjacent nodes. That is, a given anchor box exchanges messages with other anchor boxes in the same video block and with anchor boxes of the same length in the preceding and following video blocks.
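The four-neighbor edge definition above can be sketched as follows. The `(row, column)` indexing, where the row is the length rank and the column is the video block index, and the function names are assumptions made for illustration.

```python
def grid_edges(num_lengths, num_blocks):
    """Undirected edges of the N_a x N_b grid of anchor nodes.

    Node (r, c) is the anchor with length rank r in video block c.
    """
    edges = set()
    for r in range(num_lengths):
        for c in range(num_blocks):
            if r + 1 < num_lengths:
                edges.add(((r, c), (r + 1, c)))   # same block, next length
            if c + 1 < num_blocks:
                edges.add(((r, c), (r, c + 1)))   # same length, next block
    return edges


def neighbors(node, num_lengths, num_blocks):
    """Up, down, left, and right neighbors of a node that lie inside the grid."""
    r, c = node
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(i, j) for i, j in candidates
            if 0 <= i < num_lengths and 0 <= j < num_blocks]


edges = grid_edges(num_lengths=3, num_blocks=4)
inner = neighbors((1, 1), 3, 4)    # interior node: four neighbors
corner = neighbors((0, 0), 3, 4)   # corner node: two neighbors
```

Nodes on the border of the rectangle simply have fewer neighbors; how this application treats border nodes is not stated, so that behavior is also an assumption.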
[0059] Specifically, based on the node information of the second graph network nodes, the interaction and association information between the first and second graph network nodes is determined. Based on the location association information and the interaction association information, a network is constructed between the first and second graph network nodes to generate the first graph network structure. At this point, the second graph network nodes can serve as global nodes, effectively capturing the interaction between the two modalities. This allows each anchor box to integrate video text information, helping to solve the problem of information transmission between distant nodes and improving the performance of video segment retrieval.
[0060] Furthermore, the second graph network nodes can facilitate the multiple transmission of node information. In video clip retrieval tasks, there may be situations where the information depends on nodes that are far apart. For example, if the text contains a word like "the second time," information from the preceding "first time" is needed. The second graph network nodes, i.e., global nodes, can act as bridges for transmitting information between distant nodes, thereby learning richer and more complex graph representations. This allows for the generation of interactive association information between the first and second graph network nodes, facilitating the construction of the first graph network structure.
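The bridging role of the global node can be illustrated by counting hops between two distant anchor nodes with and without it. This is only a graph-theoretic sketch: the node label `"O"`, the grid size, and the function names are assumptions, and the sketch says nothing about how messages are actually weighted.

```python
from collections import deque


def build_adjacency(num_lengths, num_blocks, with_global=True):
    """Adjacency lists for the anchor grid, optionally with a global node "O"."""
    nodes = [(r, c) for r in range(num_lengths) for c in range(num_blocks)]
    adj = {}
    for r, c in nodes:
        candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
        adj[(r, c)] = [(i, j) for i, j in candidates
                       if 0 <= i < num_lengths and 0 <= j < num_blocks]
    if with_global:
        adj["O"] = list(nodes)          # the global node connects to every anchor node
        for node in nodes:
            adj[node].append("O")
    return adj


def hops(adj, src, dst):
    """Breadth-first search distance between two nodes, or None if unreachable."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        if cur == dst:
            return dist[cur]
        for nxt in adj[cur]:
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return None


grid_only = hops(build_adjacency(3, 8, with_global=False), (0, 0), (2, 7))
with_bridge = hops(build_adjacency(3, 8, with_global=True), (0, 0), (2, 7))
```

In a 3 × 8 grid the two opposite corners are nine hops apart through grid edges alone, while any pair of anchor nodes is at most two hops apart once the global node is present.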
[0061] S104. Based on the video encoding information and text encoding information, update the nodes in the first graph network structure to obtain the second graph network structure.
[0062] The second graph network structure is an updated graph network structure derived from the first graph network structure.
[0063] In some embodiments, the weight ratios of the first graph network nodes are determined based on video encoding information; the first graph network nodes are updated according to their weight ratios; the updated first graph network nodes and second graph network nodes are subjected to a dot product operation based on text encoding information to obtain updated second graph network nodes; and a second graph network structure is generated based on the updated first graph network nodes and the updated second graph network nodes. Thus, a second graph network structure can be generated by updating the first graph network nodes and the second graph network nodes.
[0064] Specifically, message passing in the network uses a graph attention network (GAT), which introduces an attention mechanism that allows each node to dynamically allocate weights based on the importance of its neighbors. This enables the model to capture the complex relationships between different nodes, and thus the weight ratios of the first graph network nodes can be determined from the video encoding information.
[0065] For example, GAT can be used to update nodes:
[0066]
[0067] where the node representation is initialized with $X_V$, the video representation of the anchor box corresponding to node v in the first graph network; then, for k = 1, 2, ..., K,
[0068]
[0069] where $W^{(k)}$ is the weight parameter, and $\alpha^{(k)}$ is generated by the attention mechanism $A^{(k)}$ and normalized so that the weights over all neighbors of each node sum to 1:
[0070]
[0071] It should be noted that the information transmission with adjacent nodes here only occurs in the Anchor box node set V, i.e., the network nodes of the first graph, and does not include the network nodes of the second graph, i.e., the global node O.
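One attention-weighted update over the anchor nodes can be sketched as below. Because the formulas of this application are not reproduced in the text, the sketch assumes the commonly used GAT scoring form (a LeakyReLU over a learned vector applied to the concatenated projected features, followed by a softmax over neighbors); `attn_vec`, the toy graph, and all values are hypothetical, and the output nonlinearity is omitted.

```python
import math


def matvec(weight, x):
    """Multiply a nested-list matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def softmax(scores):
    """Numerically stable softmax; the outputs sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def gat_layer(features, adjacency, weight, attn_vec):
    """One update over anchor nodes only (the global node O is not included).

    Returns the updated node features and the attention weights per node.
    """
    projected = {v: matvec(weight, x) for v, x in features.items()}
    updated, alphas = {}, {}
    for v, nbrs in adjacency.items():
        scores = [
            leaky_relu(sum(a * z for a, z in zip(attn_vec, projected[v] + projected[u])))
            for u in nbrs
        ]
        alpha = softmax(scores)            # weights over the neighbors of v sum to 1
        alphas[v] = dict(zip(nbrs, alpha))
        dim = len(projected[v])
        updated[v] = [sum(al * projected[u][i] for al, u in zip(alpha, nbrs))
                      for i in range(dim)]
    return updated, alphas


# Hypothetical three-node path graph a - b - c with 2-dim features.
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
weight = [[1.0, 0.0], [0.0, 1.0]]
attn_vec = [0.1, 0.2, 0.3, 0.4]
updated, alphas = gat_layer(features, adjacency, weight, attn_vec)
```

A node with a single neighbor receives attention weight 1 for that neighbor, which makes the normalization property easy to check by hand.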
[0072] Specifically, in order to interact with the text, each node representation also undergoes a dot product operation with the representation of node O in the second graph network, which updates the node representation again:
[0073]
[0074] Here, β is a hyperparameter used to control the impact of the second graph network nodes on the updates of each first graph network node.
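The role of β can be illustrated with one plausible form of the text interaction. The exact update formula is not reproduced in the text, so the rule below (adding to each node the global vector scaled by β and by the node's dot product with the global node) is purely an assumption for illustration, as are the function names and values.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def text_interaction(node_feats, global_feat, beta):
    """Assumed update: x_v <- x_v + beta * (x_v . x_O) * x_O.

    beta controls how strongly the global (text) node influences each anchor node.
    """
    out = {}
    for v, x in node_feats.items():
        relevance = dot(x, global_feat)   # scalar video-text similarity
        out[v] = [xi + beta * relevance * gi for xi, gi in zip(x, global_feat)]
    return out


node_feats = {"v1": [1.0, 0.0], "v2": [0.0, 0.0]}
global_feat = [1.0, 1.0]
no_text = text_interaction(node_feats, global_feat, beta=0.0)
some_text = text_interaction(node_feats, global_feat, beta=0.5)
```

With β = 0 the anchor nodes are left unchanged, and larger values of β pull each node further toward the text representation under this assumed rule.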
[0075] The representation update of node O in the second graph network is also based on the GAT attention mechanism. Since it connects to all other nodes in the network, its neighboring nodes are all v∈V, as shown in the following formula:
[0076]
[0077] where $X_Q$, the text representation of the video text information, serves as the initial representation of node O in the second graph network; then, for k = 1, 2, ..., K,
[0078]
[0079] where $W^{(k)}$ is the weight parameter and $\alpha^{(k)}$ is generated by the attention mechanism $A^{(k)}$; for details, refer to the above embodiments.
[0080] Using the above formula, the weight ratio of the first graph network nodes can be allocated based on the video encoding information to update the first graph network nodes; then, based on the text encoding information, the updated first graph network nodes and the second graph network nodes are processed by dot product to obtain the updated second graph network nodes; finally, the updated first graph network nodes and the updated second graph network nodes are used to generate the second graph network structure.
[0081] S105. Determine the loss function of the video segment retrieval model based on the labels corresponding to the prior bounding boxes.
[0082] The labels corresponding to the prior bounding boxes can include the confidence label, midpoint position label, and length label. The video segment retrieval model is an untrained model, and its performance may not be optimal. The loss function of the video segment retrieval model can be used to train the model to obtain the best-performing target video segment retrieval model.
[0083] In some embodiments, before determining the loss function of the video segment retrieval model based on the labels corresponding to the prior boxes, the node representation information, midpoint position information, and length information corresponding to each prior box are obtained; confidence labels are constructed based on the node representation information corresponding to each prior box; midpoint position labels are constructed based on the midpoint position information corresponding to each prior box; and length labels are constructed based on the length information corresponding to each prior box. This allows for the accurate construction of confidence labels, midpoint position labels, and length labels.
[0084] Among them, the node representation information can be the video representation corresponding to the prior bounding box, the midpoint position information can be the midpoint position coordinates corresponding to the prior bounding box, and the length information can be the border length corresponding to the prior bounding box.
[0085] Specifically, the last-layer representations of the first graph network nodes v∈V are concatenated to yield the final multimodal representation X, which is then passed through a convolutional layer to generate the confidence score C′:
[0086] C′=σ(Conv(X))∈(0,1)
[0087] Where σ is the sigmoid function.
[0088] Specifically, the predicted midpoint position and length are offsets, i.e., the difference between the ground truth and the anchor box, rather than the ground truth position itself. This makes subsequent regression training easier. The offsets are predicted by another convolutional layer:
[0089] $(t'_M, t'_L) = \mathrm{Conv}(X)$
[0090] Therefore, for each of the $N_b \times N_a$ anchor boxes, a confidence label C can be constructed from the node representation information, a midpoint position label $t_M$ from the midpoint position information, and a length label $t_L$ from the length information.
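The offset labels and the sigmoid used for the confidence score can be sketched as follows. The source only says that the offsets are the difference between the ground truth and the anchor box; using a plain, unnormalized difference is an assumption, as are the function names and sample values.

```python
import math


def sigmoid(z):
    """Maps any real score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def offset_labels(anchor, gt):
    """Return (t_M, t_L) for an anchor and a ground-truth segment.

    Both arguments are (midpoint, length) pairs; plain differences are assumed.
    """
    return gt[0] - anchor[0], gt[1] - anchor[1]


def decode(anchor, offsets):
    """Recover a (midpoint, length) segment from an anchor and its offsets."""
    return anchor[0] + offsets[0], anchor[1] + offsets[1]


anchor = (45.0, 50.0)        # hypothetical anchor: midpoint 45, length 50
ground_truth = (40.0, 60.0)  # hypothetical ground-truth segment
t_m, t_l = offset_labels(anchor, ground_truth)
recovered = decode(anchor, (t_m, t_l))
```

Regressing small offsets around a nearby anchor, rather than absolute positions, is the reason the text gives for easier regression training.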
[0091] In some embodiments, the categories corresponding to the confidence label, midpoint position label, and length label are determined; based on the confidence label, midpoint position label, and length label and their corresponding categories, the loss function of the video segment retrieval model is determined. This allows for the accurate determination of the loss function of the video segment retrieval model.
[0092] Among them, the confidence label, midpoint position label and length label can all be classified into positive example labels and negative example labels.
[0093] Specifically, the confidence label C can be determined as a positive or negative label based on the intersection over union (IoU) between the anchor box and the ground truth (GT). If the IoU is greater than a preset threshold, the confidence label C is considered a positive label (P=1); if the IoU is not greater than the preset threshold, the confidence label C is considered a negative label (P=0). For the midpoint position label $t_M$ and the length label $t_L$, a difference calculation can likewise determine whether the corresponding label is a positive or negative label.
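The positive and negative label assignment can be sketched for one-dimensional (temporal) boxes as follows. The `(midpoint, length)` representation, the function names, and the threshold value of 0.5 are assumptions; the source only refers to a preset threshold.

```python
def temporal_iou(anchor, gt):
    """Intersection over union of two 1-D segments given as (midpoint, length)."""
    a_start, a_end = anchor[0] - anchor[1] / 2, anchor[0] + anchor[1] / 2
    g_start, g_end = gt[0] - gt[1] / 2, gt[0] + gt[1] / 2
    inter = max(0.0, min(a_end, g_end) - max(a_start, g_start))
    union = (a_end - a_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0


def confidence_label(anchor, gt, threshold=0.5):
    """Return 1 (positive) if IoU exceeds the threshold, otherwise 0 (negative)."""
    return 1 if temporal_iou(anchor, gt) > threshold else 0


gt = (55.0, 20.0)                 # ground truth covers frames 45 to 65
near_anchor = (50.0, 20.0)        # covers frames 40 to 60
far_anchor = (10.0, 10.0)         # covers frames 5 to 15
near_iou = temporal_iou(near_anchor, gt)
labels = [confidence_label(near_anchor, gt), confidence_label(far_anchor, gt)]
```

Here the nearby anchor overlaps the ground truth for 15 of the 25 frames in their union, so it becomes a positive example, while the distant anchor has no overlap and becomes a negative example.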
[0094] Specifically, the confidence score uses cross-entropy as the loss function, while t_M and t_L use the mean squared error (MSE) as the loss function. The loss function of the algorithm is as follows:
[0095]
[0096] Where λ_box, λ_obj, and λ_noobj are weight constants. The positive-example indicator outputs 1 for a positive example label and 0 for a negative example label; the negative-example indicator outputs 0 for a positive example label and 1 for a negative example label.
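Since the formula itself is not reproduced in this text, the following is only a sketch of one loss that is consistent with the description: squared error on t_M and t_L for positive anchors weighted by λ_box, and binary cross-entropy on the confidence score weighted by λ_obj for positive anchors and λ_noobj for negative anchors. The exact combination, the default weight values, the summation instead of averaging, and the dictionary keys are all assumptions.

```python
import math

def detection_loss(anchors, lambda_box=1.0, lambda_obj=1.0, lambda_noobj=0.5, eps=1e-7):
    """Assumed form of the combined loss over a list of anchors.

    Each anchor is a dict with:
      positive  : 1 for a positive example label, 0 for a negative one
      conf_pred : predicted confidence C' in (0, 1)
      t_m, t_l, t_m_pred, t_l_pred : offset labels and predictions
                                     (only read for positive anchors)
    """
    box_term = obj_term = noobj_term = 0.0
    for a in anchors:
        p = min(max(a["conf_pred"], eps), 1.0 - eps)  # clip to avoid log(0)
        if a["positive"] == 1:
            # Squared error on midpoint and length offsets (summed, not averaged).
            box_term += (a["t_m"] - a["t_m_pred"]) ** 2 + (a["t_l"] - a["t_l_pred"]) ** 2
            # Cross-entropy against target confidence 1.
            obj_term += -math.log(p)
        else:
            # Cross-entropy against target confidence 0.
            noobj_term += -math.log(1.0 - p)
    return lambda_box * box_term + lambda_obj * obj_term + lambda_noobj * noobj_term

example = [
    {"positive": 1, "conf_pred": 0.5, "t_m": 0.2, "t_l": 0.1, "t_m_pred": 0.2, "t_l_pred": 0.1},
    {"positive": 0, "conf_pred": 0.5},
]
example_loss = detection_loss(example)
```

In this sketch the two indicator terms are realized by the `if`/`else` branch: a positive anchor contributes to the box and object terms only, and a negative anchor contributes to the no-object term only.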
[0097] S106. Based on the second graph network structure and loss function, the video segment retrieval model is trained to obtain the target video segment retrieval model.
[0098] The target video segment retrieval model is the video segment retrieval model obtained after training is complete, and it has the best performance in video segment retrieval.
[0099] Specifically, the video segment retrieval model can be trained and automated machine learning can be performed using the second graph network structure and loss function to obtain the target video segment retrieval model.
[0100] The video segment retrieval model generation method disclosed in this application involves acquiring video segments and segmenting them to obtain multiple video blocks and their corresponding video text information and prior boxes; performing feature encoding on the multiple video blocks and their corresponding video text information to generate video encoding information and text encoding information corresponding to the multiple video blocks; using the prior boxes as nodes in a first graph network and the video text information as nodes in a second graph network, and constructing a first graph network structure based on the first and second graph network nodes; updating the nodes in the first graph network structure based on the video encoding information and text encoding information to obtain a second graph network structure; determining the loss function of the video segment retrieval model based on the labels corresponding to the prior boxes; and training the video segment retrieval model based on the second graph network structure and the loss function to obtain a target video segment retrieval model. This effectively captures the relationships between video blocks and the interaction between video and text, enabling accurate retrieval of complete videos through video segments, improving the reliability and efficiency of video retrieval, and offering flexible application scenarios.
[0101] Please refer to Figure 5. Figure 5 is a schematic block diagram of a video clip retrieval model generation device provided in one embodiment of this application. The video clip retrieval model generation device can be configured in a server to execute the aforementioned video clip retrieval model generation method.
[0102] As shown in Figure 5, the generation device 200 for the video segment retrieval model includes: a video segmentation module 201, a feature encoding module 202, a network structure construction module 203, a network structure update module 204, a loss function determination module 205, and a model training module 206.
[0103] The video segmentation module 201 is used to acquire video segments and segment the video segments to obtain multiple video blocks and their corresponding video text information and prior boxes;
[0104] Feature encoding module 202 is used to perform feature encoding processing on the plurality of video blocks and their corresponding video text information to generate video encoding information and text encoding information corresponding to the plurality of video blocks;
[0105] The network structure construction module 203 is used to use the prior box as a first graph network node, the video text information as a second graph network node, and construct a first graph network structure based on the first graph network node and the second graph network node.
[0106] The network structure update module 204 is used to update the nodes in the first graph network structure according to the video encoding information and the text encoding information to obtain the second graph network structure.
[0107] The loss function determination module 205 is used to determine the loss function of the video segment retrieval model based on the labels corresponding to the prior boxes;
[0108] The model training module 206 is used to train the video segment retrieval model based on the second graph network structure and the loss function to obtain the target video segment retrieval model.
[0109] In some embodiments, the video segmentation module 201 is further configured to segment the video segment according to the number of frames of the video segment to obtain multiple video blocks; perform text extraction processing on the multiple video blocks to obtain video text information corresponding to the multiple video blocks; and perform clustering processing on the multiple video blocks to obtain prior boxes corresponding to the multiple video blocks.
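The segmentation and clustering steps can be sketched as follows. The sketch assumes a fixed number of frames per block and assumes that the prior boxes are obtained by one-dimensional k-means over segment lengths, in the style of anchor clustering in object detection. The application does not specify the clustering algorithm or its input, so `kmeans_1d`, `split_into_blocks`, and the sample lengths are hypothetical.

```python
def split_into_blocks(num_frames, frames_per_block):
    """Split frame indices [0, num_frames) into consecutive (start, end) blocks."""
    return [
        (start, min(start + frames_per_block, num_frames))
        for start in range(0, num_frames, frames_per_block)
    ]

def kmeans_1d(values, k, iterations=20):
    """Plain 1-D k-means with a deterministic, evenly spaced initialization."""
    ordered = sorted(values)
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center when a cluster receives no points.
        centers = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
    return sorted(centers)

blocks = split_into_blocks(num_frames=10, frames_per_block=4)

# Hypothetical segment lengths (in blocks) used to derive two prior box lengths.
prior_lengths = kmeans_1d([1, 2, 3, 10, 11, 12], k=2)
```

Here the last block is shorter than the others when the frame count is not a multiple of the block size; padding or dropping it would be an equally reasonable choice.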
[0110] In some embodiments, the feature encoding module 202 is further configured to perform vector transformation processing on the video block to obtain the embedding vector corresponding to the video block; encode the embedding vector to obtain the video encoding information corresponding to the video block; and input the video text information into a pre-trained encoder to obtain text encoding information, wherein the feature dimensions of the video encoding information and the text encoding information are the same.
[0111] In some embodiments, the network structure construction module 203 is further configured to obtain node information of the first graph network nodes and the second graph network nodes; determine the position association information of each of the first graph network nodes based on the node information of the first graph network nodes; determine the interaction association information between the first graph network nodes and the second graph network nodes based on the node information of the second graph network nodes; and construct a network for the first graph network nodes and the second graph network nodes based on the position association information and the interaction association information to generate a first graph network structure.
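One possible reading of the graph construction is sketched below. It assumes that the position association links prior boxes whose temporal intervals overlap and that the interaction association links every prior box to every text node. Both rules, as well as the name `build_graph` and the edge-list representation, are assumptions made for illustration.

```python
def build_graph(anchors, num_text_nodes):
    """Build edge lists for a graph over prior boxes and text nodes.

    anchors:        list of (midpoint, length) prior boxes (first graph network nodes)
    num_text_nodes: number of text nodes (second graph network nodes)
    """
    def overlaps(a, b):
        (mid_a, len_a), (mid_b, len_b) = a, b
        return min(mid_a + len_a / 2, mid_b + len_b / 2) > max(mid_a - len_a / 2, mid_b - len_b / 2)

    # Position association: edges between prior boxes that overlap in time.
    position_edges = [
        (i, j)
        for i in range(len(anchors))
        for j in range(i + 1, len(anchors))
        if overlaps(anchors[i], anchors[j])
    ]
    # Interaction association: every prior box is connected to every text node.
    interaction_edges = [
        (i, t) for i in range(len(anchors)) for t in range(num_text_nodes)
    ]
    return {"position": position_edges, "interaction": interaction_edges}

# Prior boxes covering [0, 4], [1, 5], and [9, 11]; two text nodes.
graph = build_graph([(2.0, 4.0), (3.0, 4.0), (10.0, 2.0)], num_text_nodes=2)
```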
[0112] In some embodiments, the network structure update module 204 is further configured to: determine the weight ratio of the first graph network node according to the video encoding information; update the first graph network node according to the weight ratio of the first graph network node; perform dot product processing on the updated first graph network node and the second graph network node according to the text encoding information to obtain the updated second graph network node; and generate the second graph network structure according to the updated first graph network node and the updated second graph network node.
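The update step can be sketched in an attention-like form. The sketch assumes that the weight ratio of the first graph network nodes is a softmax over dot-product similarities of the video encodings, and that the dot product processing for the second graph network nodes compares each text encoding with the updated first graph network nodes. These choices and the function names are assumptions, not the update rule defined in this application.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

def update_video_nodes(video_enc):
    """Update each first graph network node as a weighted mix of all video encodings."""
    updated = []
    for query in video_enc:
        weights = softmax([dot(query, key) for key in video_enc])  # assumed weight ratio
        updated.append(weighted_sum(weights, video_enc))
    return updated

def update_text_nodes(text_enc, updated_video):
    """Update each second graph network node via dot products with the updated video nodes."""
    updated = []
    for query in text_enc:
        weights = softmax([dot(query, key) for key in updated_video])
        updated.append(weighted_sum(weights, updated_video))
    return updated

video_enc = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical video encoding information
text_enc = [[2.0, 0.0]]               # hypothetical text encoding information
new_video = update_video_nodes(video_enc)
new_text = update_text_nodes(text_enc, new_video)
```

The dot products require the video and text encodings to share the same feature dimension, which is the condition stated for the feature encoding module.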
[0113] In some embodiments, the video segment retrieval model generation device 200 further includes a label determination module 207, which is used to obtain node representation information, midpoint position information and length information corresponding to each of the prior boxes; construct confidence labels based on the node representation information corresponding to each of the prior boxes; construct midpoint position labels based on the midpoint position information corresponding to each of the prior boxes; and construct length labels based on the length information corresponding to each of the prior boxes.
[0114] In some embodiments, the loss function determination module 205 is further configured to determine the categories corresponding to the confidence label, midpoint position label, and length label; and determine the loss function of the video segment retrieval model based on the confidence label, midpoint position label, and length label and their corresponding categories.
[0115] It should be noted that those skilled in the art will understand that, for the sake of convenience and brevity, the specific working processes of the devices, modules, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.
[0116] The methods and apparatus of this application can be used in a wide variety of general-purpose or special-purpose computing system environments or configurations. For example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer terminal devices, network PCs, minicomputers, mainframe computers, distributed computing environments including any of the above systems or devices, etc.
[0117] For example, the above-described method and apparatus can be implemented as a computer program, which can run on the computer device shown in Figure 6.
[0118] Please refer to Figure 6. Figure 6 is a schematic diagram of a computer device provided in an embodiment of this application. The computer device may be a server.
[0119] As shown in Figure 6, the computer device includes a processor, memory, and network interface connected via a system bus, wherein the memory may include volatile storage media, non-volatile storage media, and internal memory.
[0120] Non-volatile storage media can store operating systems and computer programs. These computer programs include program instructions that, when executed, cause the processor to perform any of the video segment retrieval model generation methods.
[0121] The processor provides computing and control capabilities, supporting the operation of the entire computer device.
[0122] The internal memory provides an environment for running the computer programs stored in the non-volatile storage media. When these computer programs are executed by the processor, the processor can perform any of the video segment retrieval model generation methods.
[0123] This network interface is used for network communication, such as sending assigned tasks. Those skilled in the art will understand that the structure of this computer device is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than shown in the figure, or combine certain components, or have different component arrangements.
[0124] It should be understood that the processor can be a Central Processing Unit (CPU), but it can also be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. Among these, a general-purpose processor can be a microprocessor or any conventional processor.
[0125] In some embodiments, the processor is used to run a computer program stored in a memory to perform the following steps: acquiring a video segment and segmenting the video segment to obtain multiple video blocks and their corresponding video text information and prior boxes; performing feature encoding processing on the multiple video blocks and their corresponding video text information to generate multiple video encoding information and text encoding information corresponding to the video blocks; using the prior boxes as nodes in a first graph network and the video text information as nodes in a second graph network, and constructing a first graph network structure based on the first graph network nodes and the second graph network nodes; updating the nodes in the first graph network structure based on the video encoding information and the text encoding information to obtain a second graph network structure; determining the loss function of the video segment retrieval model based on the labels corresponding to the prior boxes; and training the video segment retrieval model based on the second graph network structure and the loss function to obtain a target video segment retrieval model.
[0126] In some embodiments, the processor is further configured to segment the video segment according to the number of frames of the video segment to obtain multiple video blocks; perform text extraction processing on the multiple video blocks to obtain video text information corresponding to the multiple video blocks; and perform clustering processing on the multiple video blocks to obtain prior boxes corresponding to the multiple video blocks.
[0127] In some embodiments, the processor is further configured to perform vector transformation processing on the video block to obtain an embedding vector corresponding to the video block; encode the embedding vector to obtain video encoding information corresponding to the video block; and input the video text information into a pre-trained encoder to obtain text encoding information, wherein the feature dimensions of the video encoding information and the text encoding information are the same.
[0128] In some embodiments, the processor is further configured to acquire node information of the first graph network nodes and the second graph network nodes; determine position association information of each of the first graph network nodes based on the node information of the first graph network nodes; determine interaction association information between the first graph network nodes and the second graph network nodes based on the node information of the second graph network nodes; and construct a network for the first graph network nodes and the second graph network nodes based on the position association information and the interaction association information to generate a first graph network structure.
[0129] In some embodiments, the processor is further configured to: determine the weight ratio of the first graph network node based on the video encoding information; update the first graph network node based on the weight ratio of the first graph network node; perform dot product processing on the updated first graph network node and the second graph network node based on the text encoding information to obtain the updated second graph network node; and generate the second graph network structure based on the updated first graph network node and the updated second graph network node.
[0130] In some embodiments, the processor is further configured to acquire node representation information, midpoint position information, and length information corresponding to each of the prior boxes; construct confidence labels based on the node representation information corresponding to each of the prior boxes; construct midpoint position labels based on the midpoint position information corresponding to each of the prior boxes; and construct length labels based on the length information corresponding to each of the prior boxes.
[0131] In some implementations, the processor is further configured to determine the categories corresponding to the confidence labels, midpoint position labels, and length labels; and to determine the loss function of the video segment retrieval model based on the confidence labels, midpoint position labels, length labels, and their corresponding categories.
[0132] This application also provides a computer-readable storage medium storing a computer program, the computer program including program instructions which, when executed, implement any of the video segment retrieval model generation methods provided in this application.
[0133] The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as the hard disk or memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, SmartMedia Card (SMC), Secure Digital (SD) card, or Flash Card equipped on the computer device.
[0134] Furthermore, the computer-readable storage medium may primarily include a program storage area and a data storage area, wherein the program storage area may store the operating system, at least one application required for a function, etc.; and the data storage area may store data created based on the use of blockchain nodes, etc.
[0135] The blockchain referred to in this application is a novel application model of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked using cryptographic methods. Each data block contains information about a batch of network transactions, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain can include an underlying blockchain platform, a platform product service layer, and an application service layer.
[0136] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any person skilled in the art can easily conceive of equivalent modifications or substitutions within the technical scope disclosed in this application, and such modifications or substitutions should all be covered within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A method for generating a video segment retrieval model, characterized in that the method includes: acquiring video segments and segmenting the video segments to obtain multiple video blocks and their corresponding video text information and prior boxes; performing feature encoding processing on the multiple video blocks and their corresponding video text information to generate video encoding information and text encoding information corresponding to the multiple video blocks; using the prior boxes as first graph network nodes and the video text information as second graph network nodes, and constructing a first graph network structure based on the first graph network nodes and the second graph network nodes; updating the nodes in the first graph network structure based on the video encoding information and the text encoding information to obtain a second graph network structure; determining the loss function of the video segment retrieval model based on the labels corresponding to the prior boxes; and training the video segment retrieval model based on the second graph network structure and the loss function to obtain a target video segment retrieval model; wherein updating the nodes in the first graph network structure according to the video encoding information and the text encoding information to obtain the second graph network structure includes: determining the weight ratio of the first graph network nodes based on the video encoding information; updating the first graph network nodes according to their weight ratios; performing dot product processing on the updated first graph network nodes and the second graph network nodes based on the text encoding information to obtain updated second graph network nodes; and generating the second graph network structure based on the updated first graph network nodes and the updated second graph network nodes; and wherein, before determining the loss function of the video segment retrieval model based on the labels corresponding to the prior boxes, the method further includes: obtaining the node representation information, midpoint position information, and length information corresponding to each of the prior boxes; constructing confidence labels based on the node representation information corresponding to each of the prior boxes; constructing midpoint position labels based on the midpoint position information corresponding to each of the prior boxes; and constructing length labels based on the length information corresponding to each of the prior boxes.
2. The method according to claim 1, characterized in that segmenting the video segments to obtain multiple video blocks and their corresponding video text information and prior boxes includes: segmenting the video segment based on the number of frames in the video segment to obtain multiple video blocks; performing text extraction processing on the multiple video blocks to obtain the video text information corresponding to the multiple video blocks; and performing clustering processing on the multiple video blocks to obtain the prior boxes corresponding to the multiple video blocks.
3. The method according to claim 1, characterized in that performing feature encoding processing on the multiple video blocks and their corresponding video text information to generate video encoding information and text encoding information corresponding to the multiple video blocks includes: performing vector transformation processing on the video block to obtain the embedding vector corresponding to the video block; encoding the embedding vector to obtain the video encoding information corresponding to the video block; and inputting the video text information into a pre-trained encoder to obtain text encoding information, wherein the feature dimensions of the video encoding information and the text encoding information are the same.
4. The method according to claim 1, characterized in that constructing the first graph network structure based on the first graph network nodes and the second graph network nodes includes: obtaining node information of the first graph network nodes and the second graph network nodes; determining the position association information of each first graph network node based on the node information of the first graph network nodes; determining the interaction association information between the first graph network nodes and the second graph network nodes based on the node information of the second graph network nodes; and performing network construction on the first graph network nodes and the second graph network nodes based on the position association information and the interaction association information to generate the first graph network structure.
5. The method according to claim 1, characterized in that the labels include confidence labels, midpoint position labels, and length labels, and determining the loss function of the video segment retrieval model based on the labels corresponding to the prior boxes includes: determining the categories corresponding to the confidence label, midpoint position label, and length label; and determining the loss function of the video segment retrieval model based on the confidence label, midpoint position label, length label, and their corresponding categories.
6. A device for generating a video segment retrieval model, characterized in that the device for generating the video segment retrieval model includes: a video segmentation module, used to acquire video segments and segment the video segments to obtain multiple video blocks and their corresponding video text information and prior boxes; a feature encoding module, used to perform feature encoding processing on the multiple video blocks and their corresponding video text information to generate video encoding information and text encoding information corresponding to the multiple video blocks; a network structure construction module, used to use the prior boxes as first graph network nodes and the video text information as second graph network nodes, and to construct a first graph network structure based on the first graph network nodes and the second graph network nodes; a network structure update module, used to update the nodes in the first graph network structure according to the video encoding information and the text encoding information to obtain a second graph network structure; a loss function determination module, used to determine the loss function of the video segment retrieval model based on the labels corresponding to the prior boxes; and a model training module, used to train the video segment retrieval model based on the second graph network structure and the loss function to obtain a target video segment retrieval model; wherein updating the nodes in the first graph network structure according to the video encoding information and the text encoding information to obtain the second graph network structure includes: determining the weight ratio of the first graph network nodes based on the video encoding information; updating the first graph network nodes according to their weight ratios; performing dot product processing on the updated first graph network nodes and the second graph network nodes based on the text encoding information to obtain updated second graph network nodes; and generating the second graph network structure based on the updated first graph network nodes and the updated second graph network nodes; and wherein, before determining the loss function of the video segment retrieval model based on the labels corresponding to the prior boxes, the method further includes: obtaining the node representation information, midpoint position information, and length information corresponding to each of the prior boxes; constructing confidence labels based on the node representation information corresponding to each of the prior boxes; constructing midpoint position labels based on the midpoint position information corresponding to each of the prior boxes; and constructing length labels based on the length information corresponding to each of the prior boxes.
7. A computer device, characterized in that the computer device includes a memory and a processor; the memory is used to store a computer program; and the processor is configured to execute the computer program and, in executing the computer program, implement the method for generating a video segment retrieval model as described in any one of claims 1-5.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to implement the method for generating a video segment retrieval model as described in any one of claims 1-5.