Video retrieval method, medium, apparatus, and computing device
By performing segment-level and video-level feature extraction on the video sequences to be retrieved, and combining a 3D feature extraction network and an aggregation model, the problems of slow speed and poor performance in existing video retrieval are solved, achieving efficient and accurate video retrieval.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HANGZHOU NETEASE ZHIQI TECH CO LTD
- Filing Date
- 2023-06-12
- Publication Date
- 2026-06-19
AI Technical Summary
Existing video retrieval methods are slow and ineffective, failing to distinguish duplicate videos, resulting in insufficient retrieval efficiency and accuracy.
By extracting features from the video sequence to be retrieved, segment-level and video-level feature information are obtained and combined for retrieval, replacing the method of extracting frame-level features frame by frame. The 3D feature extraction network and aggregation model are used to improve retrieval speed and accuracy.
It significantly improves the speed and accuracy of video retrieval, avoids the inefficiency and information loss of frame-level feature retrieval, and ensures the efficiency and accuracy of retrieval.
Figure CN116680442B_ABST
Description
Technical Field
[0001] The embodiments of this disclosure relate to the field of Internet technology, and more specifically to a video retrieval method, medium, apparatus, and computing device.
Background Technology
[0002] This section is intended to provide background or context for the embodiments of this disclosure as set forth in the claims. The description herein is not admitted to be related art merely because it is included in this section.
[0003] In related technologies, video content is experiencing rapid development, with a proliferation of video sharing platforms, short video platforms, and video-based live streaming platforms, leading to a vast increase in video resources. Therefore, video retrieval is becoming increasingly valuable in various scenarios. These include finding recommended videos from a large pool of videos, identifying duplicate videos, and conducting security audits of video content. For instance, in the scenario of finding duplicate videos, the number of duplicate and similar videos on the internet is increasing, requiring video retrieval methods to quickly identify these duplicates. However, the encodings (such as MD5 hashes) used to distinguish these videos are usually different; therefore, direct file deduplication methods cannot differentiate them, necessitating content-based judgment using video retrieval methods.
[0004] Existing video retrieval methods typically extract spatial feature parameters from each frame of the video to form frame-level features, and then compare the frame-level features of different videos to achieve retrieval. This method suffers from problems such as slow extraction speed, loss of a large amount of video information during the extraction process, and poor retrieval results.
Summary of the Invention
[0005] This disclosure provides a video retrieval method, medium, apparatus, and computing device to solve the problems of slow video retrieval speed and poor retrieval results in related technologies.
[0006] In a first aspect of this disclosure, a video retrieval method is provided, comprising:
[0007] Feature extraction is performed on sequence segments of the video sequence to be retrieved to obtain segment-level feature information corresponding to the sequence segments;
[0008] The segment-level feature information is aggregated to obtain the video-level feature information corresponding to the video sequence to be retrieved;
[0009] The video library is searched based on segment-level and video-level feature information to obtain the corresponding search results.
[0010] In a second aspect of this disclosure, a computer-readable storage medium is provided, comprising:
[0011] The computer-readable storage medium stores computer-executable instructions that, when executed by a processor, are used to implement the video retrieval method as described in the first aspect of this disclosure.
[0012] In a third aspect of this disclosure, a video retrieval device is provided, comprising:
[0013] The extraction module is used to extract features from sequence segments of the video sequence to be retrieved, so as to obtain segment-level feature information corresponding to the sequence segments;
[0014] The processing module is used to aggregate segment-level feature information to obtain video-level feature information corresponding to the video sequence to be retrieved;
[0015] The retrieval module is used to search the video library based on segment-level and video-level feature information to obtain the corresponding search results.
[0016] In a fourth aspect of this disclosure, a computing device is provided, comprising: at least one processor;
[0017] and memory that is communicatively connected to at least one processor;
[0018] The memory stores instructions that can be executed by at least one processor to cause the computing device to perform the video retrieval method as described in the first aspect of this disclosure.
[0019] According to the video retrieval method, medium, apparatus, and computing device of this disclosure, feature extraction is performed on sequence segments of the video sequence to be retrieved to obtain segment-level feature information corresponding to the sequence segments; then, the segment-level feature information is aggregated to obtain video-level feature information corresponding to the video sequence to be retrieved; and finally, a retrieval is performed in a video database based on the segment-level feature information and the video-level feature information to obtain the corresponding retrieval results. Therefore, by using segment-level feature information and video-level feature information for retrieval, instead of using frame-level features obtained by extracting video features frame by frame in existing methods, the retrieval speed is significantly improved. At the same time, by combining segment-level feature information and video-level feature information, the poor retrieval results that may occur with retrieval based solely on video-level features are avoided, while ensuring the accuracy and efficiency of the retrieval.
Brief Description of the Drawings
[0020] The above and other objects, features, and advantages of this disclosure will become readily apparent from the following detailed description of exemplary embodiments, taken in conjunction with the accompanying drawings. Several embodiments of this disclosure are illustrated in the drawings by way of example and not limitation, in which:
[0021] Figure 1 schematically shows an application scenario according to an embodiment of the present disclosure;
[0022] Figure 2 schematically shows a flowchart of a video retrieval method according to another embodiment of the present disclosure;
[0023] Figure 3a schematically shows a flowchart of a video retrieval method according to another embodiment of the present disclosure;
[0024] Figure 3b schematically shows the structure of the three-dimensional feature extraction network provided in the embodiment shown in Figure 3a;
[0025] Figure 3c schematically shows the multi-head attention network provided in the embodiment shown in Figure 3a;
[0026] Figure 3d schematically shows a flowchart of the method for extracting segment-level feature information provided in the embodiment shown in Figure 3a;
[0027] Figure 3e schematically shows a flowchart of the method for obtaining original video feature information through an aggregation model provided in the embodiment shown in Figure 3a;
[0028] Figure 4a schematically shows a flowchart of a video retrieval method according to another embodiment of the present disclosure;
[0029] Figure 4b schematically shows a flowchart of the method for preprocessing the video sequence to be retrieved provided in the embodiment shown in Figure 4a;
[0030] Figure 5 schematically shows the structure of a storage medium according to another embodiment of the present disclosure;
[0031] Figure 6 schematically shows the structure of a video retrieval device according to another embodiment of the present disclosure;
[0032] Figure 7 schematically shows the structure of a computing device according to another embodiment of the present disclosure.
[0033] In the accompanying drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
[0034] The principles and spirit of this disclosure will now be described with reference to several exemplary embodiments. It should be understood that these embodiments are given merely to enable those skilled in the art to better understand and implement this disclosure, and are not intended to limit the scope of this disclosure in any way. Rather, these embodiments are provided to make this disclosure more thorough and complete, and to fully convey the scope of this disclosure to those skilled in the art.
[0035] Those skilled in the art will recognize that embodiments of this disclosure can be implemented as a system, apparatus, device, method, or computer program product. Therefore, this disclosure can be specifically implemented in the following forms: entirely hardware, entirely software (including firmware, resident software, microcode, etc.), or a combination of hardware and software.
[0036] According to embodiments of this disclosure, a video retrieval method, medium, apparatus, and computing device are proposed.
[0037] The following describes the terms used in this disclosure:
[0038] Video retrieval: For purposes such as deduplication, content review, and recommendation, the videos or segments to be retrieved are determined in advance (these may be predefined videos used to find identical videos, content segments to be identified and excluded during content review, or videos with which users have interacted). The features of the videos or segments to be retrieved are then extracted and compared with those of the stored videos in the corresponding video retrieval database or video library to find the target videos that meet the retrieval requirements.
[0039] In this document, it should be understood that the terminology used is for convenience of understanding only and does not imply any limitation on its meaning. Furthermore, any number of elements in the accompanying drawings is for illustrative purposes only and not for limitation, and any naming is for distinction only and has no limiting meaning.
[0040] In addition, the data involved in this disclosure may be data authorized by the user or fully authorized by all parties, and the collection, dissemination, and use of the data shall comply with the requirements of relevant national laws and regulations. The embodiments of this disclosure may be combined with each other.
Invention Overview
[0042] The inventors have found that, in related technologies, video content is developing rapidly, with a proliferation of video sharing platforms, short video platforms, and video-based live streaming platforms, resulting in a growing volume of video resources. Video retrieval is therefore becoming increasingly valuable in various scenarios. For example, video retrieval methods can be used to find recommended videos based on video browsing history, to identify duplicate videos among a large number of videos, and to perform security audits on videos based on predetermined content to be filtered. In the duplicate-video scenario, the number of duplicate and similar videos on the internet keeps increasing; to reduce the storage burden of duplicate videos, improve retrieval efficiency and accuracy, and prevent infringement, these duplicates must be found quickly using video retrieval methods. However, the encodings (such as MD5 codes) used to distinguish these videos are usually different. Methods that check for duplicates directly from the encoding therefore cannot identify them; it is necessary to search and check for duplicates based on the actual content of the videos.
[0043] Existing video retrieval methods typically extract spatial feature parameters from each frame of a video to form frame-level features. These frame-level features from different videos are then compared (e.g., comparing the similarity of individual frame-level features) to identify target videos (e.g., duplicate videos, videos requiring review and exclusion, or videos to be recommended). However, this method requires extracting a large number of frame-level features for each video comparison (the more video frames, the more frame-level features need to be extracted), resulting in slow extraction speed. Furthermore, the extraction process only involves spatial features, leading to the loss of temporal feature information in the video and poor retrieval performance. Alternatively, video-level features can be obtained by aggregating frame-level features and then used for retrieval to reduce redundancy (since each video has only one video-level feature) and improve retrieval efficiency. However, directly using video-level features may result in insufficient information, and directly obtaining them from frame-level features that do not contain temporal features will also lead to information loss, further reducing retrieval accuracy.
[0044] In this scheme, video frames are extracted from the video to be retrieved to obtain sequence segments, and then segment-level features and video-level features are extracted from the sequence segments. The two are then combined for retrieval, which can fully reflect the spatial and temporal features of the video to be retrieved, ensuring the accuracy of the retrieval, while avoiding the inefficiency of frame-level feature retrieval and improving retrieval efficiency.
[0045] After introducing the basic principles of this disclosure, various non-limiting embodiments of this disclosure will be described in detail below.
[0046] Application Scenarios Overview
[0047] Referring first to Figure 1, during the video retrieval process, the server 100 receives the video sequence 110 to be retrieved (sent by a client or selected by a processor in the server), extracts the segment-level feature information and video-level feature information of the video sequence 110 to be retrieved, and compares them with those of the stock videos 120 used for retrieval comparison to determine the corresponding retrieval results, thereby completing the video retrieval process.
[0048] It should be noted that the scenario shown in Figure 1 illustrates only one server, one video sequence to be retrieved, one video library, and one stock video as an example. However, this disclosure is not limited to this; the number of servers, video sequences to be retrieved, video libraries, and stock videos can be arbitrary.
[0049] Exemplary methods
[0050] With reference to the application scenario of Figure 1, a video retrieval method according to exemplary embodiments of the present disclosure is described below with reference to Figures 2 to 4b. It should be noted that the above application scenario is shown only to facilitate understanding of the spirit and principles of the present disclosure, and the embodiments of the present disclosure are not limited in this respect. Rather, the embodiments of the present disclosure can be applied to any applicable scenario.
[0051] Figure 2 is a flowchart of a video retrieval method provided in one embodiment of this disclosure. As shown in Figure 2, the video retrieval method provided in this embodiment includes the following steps:
[0052] Step S201: Extract features from the sequence segments of the video sequence to be retrieved to obtain segment-level feature information corresponding to the sequence segments.
[0053] Specifically, the video to be retrieved is one or more videos used to retrieve corresponding videos from a video library, video retrieval library, or video collection used for retrieval (hereinafter collectively referred to as the video library for ease of description). The video sequence to be retrieved is the collection of video frames obtained by decoding the video to be retrieved and extracting frames from it (e.g., one frame per second).
[0054] The video to be searched can be determined by directly uploading the corresponding video data, or by selecting one or more existing video data in the video library (in this case, the purpose of video retrieval is to find duplicate videos that have already been entered into the video library, such as by determining the video to be searched by its encoding number, or by selecting one or more videos from a specific video form).
[0055] A sequence segment is one or more video segments obtained by dividing a video sequence to be retrieved. The specific segmentation method can be equal-length segmentation or segmentation according to a set length. The longer the video sequence to be retrieved (the longer the duration of the video to be retrieved), the more sequence segments there are; conversely, the shorter the video sequence to be retrieved, the fewer the number of sequence segments. When the video sequence to be retrieved is extremely short (e.g., less than 10 seconds in duration), there may only be one corresponding sequence segment; in this case, the sequence segment is the video sequence to be retrieved itself.
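As a non-limiting illustration of the frame extraction and segmentation described above, the following Python sketch samples one frame per second and divides the sampled sequence into segments of a set length. The function names, the 25 fps example video, and the segment length of 16 are assumptions chosen for illustration and are not taken from this disclosure.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_frame_indices(num_frames: int, fps: float, every_seconds: float = 1.0) -> List[int]:
    """Indices of the decoded frames kept when sampling one frame per `every_seconds`."""
    if num_frames <= 0 or fps <= 0 or every_seconds <= 0:
        return []
    step = max(1, round(fps * every_seconds))
    return list(range(0, num_frames, step))


def split_into_segments(sequence: Sequence[T], segment_len: int) -> List[List[T]]:
    """Split a sampled frame sequence into consecutive segments of a set length.

    The last segment may be shorter. A sequence no longer than `segment_len`
    yields a single segment equal to the sequence itself.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [list(sequence[i:i + segment_len]) for i in range(0, len(sequence), segment_len)]


# Hypothetical input: a 25 fps video with 1000 decoded frames (40 seconds).
indices = sample_frame_indices(num_frames=1000, fps=25.0)
segments = split_into_segments(indices, segment_len=16)
```

A longer video yields more segments under this scheme, and a very short one yields a single segment, consistent with the behavior described for short video sequences.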
[0056] Feature extraction uses a feature extraction algorithm to extract the corresponding feature information from the sequence segments so that retrieval can be performed based on that feature information.
[0057] Segment-level feature information is feature information extracted from the sequence segments. It mainly reflects the temporal and spatial features of the video frames in the sequence segments. Because a video sequence to be retrieved usually corresponds to many sequence segments, the total amount of segment-level feature information is large relative to the video-level feature (one video sequence to be retrieved corresponds to only one video-level feature), so it can match the features of the video to be retrieved against those of the videos in the video library more accurately. Moreover, unlike the frame-level features in related technologies, it captures the temporal features within the sequence segments. It therefore avoids more of the information loss that occurs during feature extraction, thereby ensuring the accuracy of retrieval.
[0058] Step S202: Aggregate the segment-level feature information to obtain video-level feature information corresponding to the video sequence to be retrieved.
[0059] Specifically, when there are many sequence segments (for example, when the video to be retrieved is long or its scenes change in complex ways), the workload of retrieval based on segment-level feature information approaches that of retrieval based on frame-level features, resulting in low retrieval efficiency. Therefore, video-level features are obtained by aggregating the segment-level feature information, so that candidates are first filtered by video-level features and then compared by segment-level features, improving retrieval efficiency.
[0060] Each video to be retrieved typically has only one video-level feature, which contains less information than the segment-level features. By comparing video-level features, the videos that may belong to the search results (the corresponding target videos) can be selected quickly, thereby reducing the number of videos that need to be compared using segment-level features and improving overall retrieval efficiency.
[0061] Step S203: Search the video library based on segment-level feature information and video-level feature information to obtain the corresponding search results.
[0062] Specifically, the video library stores pre-extracted segment-level and video-level feature information for each stock video. By comparing the segment-level and video-level feature information of the video sequence to be retrieved with the segment-level and video-level feature information of the stock videos in the video library, stock videos with high similarity can be retrieved.
[0063] Video-level feature information is first used to quickly select the videos in the video library that may be search results (corresponding to the target video), and segment-level feature information is then used to retrieve among those candidates. The video-level feature information improves retrieval efficiency and the segment-level feature information significantly improves retrieval accuracy, so that both are improved at the same time.
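The two-stage retrieval described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this disclosure: mean pooling stands in for the aggregation model, cosine similarity stands in for the comparison, the segment score (average best match) is one possible choice, and all names and the toy library are hypothetical. In practice the video-level features of the stock videos would be extracted in advance rather than recomputed per query.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_pool(segment_feats: List[Vector]) -> List[float]:
    """Stand-in aggregation (assumption): average the segment-level vectors."""
    n = len(segment_feats)
    return [sum(col) / n for col in zip(*segment_feats)]


def segment_score(query_segs: List[Vector], cand_segs: List[Vector]) -> float:
    """Average, over query segments, of the best-matching candidate segment."""
    return sum(max(cosine(q, c) for c in cand_segs) for q in query_segs) / len(query_segs)


def retrieve(query_segs: List[Vector], library: Dict[str, List[Vector]],
             top_k: int = 2) -> List[Tuple[str, float]]:
    query_video = mean_pool(query_segs)
    # Stage 1: coarse filter using one video-level feature per stock video.
    coarse = sorted(library,
                    key=lambda vid: cosine(query_video, mean_pool(library[vid])),
                    reverse=True)[:top_k]
    # Stage 2: fine comparison using segment-level features of the candidates only.
    ranked = [(vid, segment_score(query_segs, library[vid])) for vid in coarse]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical toy library: segment-level features per stock video.
library = {
    "dup": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "partial": [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    "other": [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
}
query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
results = retrieve(query, library, top_k=2)
```

Only the `top_k` candidates that survive the video-level filter are compared segment by segment, which is where the efficiency gain of the coarse stage comes from.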
[0064] Based on the retrieval of segment-level and video-level feature information, the corresponding retrieval results can be obtained, namely, target videos that are similar to or the same as the video sequence to be retrieved, and based on the retrieval...
[0065] According to the video retrieval method of this disclosure, feature extraction is performed on sequence segments of the video sequence to be retrieved to obtain segment-level feature information corresponding to the sequence segments; then, the segment-level feature information is aggregated to obtain video-level feature information corresponding to the video sequence to be retrieved; and finally, a search is performed in the video database based on the segment-level feature information and the video-level feature information to obtain the corresponding search results. Therefore, by using segment-level feature information and video-level feature information for retrieval, instead of using frame-level features obtained by extracting video features frame by frame in existing methods, the retrieval speed is significantly improved. At the same time, by combining segment-level feature information and video-level feature information, the poor retrieval effect that may occur with retrieval based solely on video-level features is avoided, while ensuring the accuracy and efficiency of the retrieval.
[0066] Figure 3a is a flowchart of a video retrieval method provided in one embodiment of this disclosure. As shown in Figure 3a, the video retrieval method provided in this embodiment includes the following steps:
[0067] Step S301: Input the sequence segments into the three-dimensional feature extraction network to extract segment-level feature information.
[0068] Specifically, the segment-level feature information is a three-dimensional tensor feature corresponding to the video frames in the sequence segment; the three-dimensional tensor feature includes the height feature, width feature, and temporal feature of the video frames.
[0069] Three-dimensional tensor features are feature information extracted sequentially from each video frame in a sequence segment. Through the three dimensions of height, width, and time, they reflect both the information within each video frame and the information between different video frames. Specifically, the height and width features correspond to the spatial features of the sequence segment, while the temporal features of the video frames correspond to its temporal features. Spatial features reflect the characteristics of each video frame itself (e.g., the image content of each frame), while temporal features reflect the characteristics between video frames (e.g., how the image changes from one frame to the next). Combining the two reflects the characteristics of the video to be retrieved more accurately and comprehensively. Compared with related technologies that use only spatial features for retrieval, this method ensures that a retrieved video is similar to the video to be retrieved not only at the level of individual video frames, but also at the level of changes across multiple consecutive video frames. This guarantees a high similarity between the retrieved video and the video to be retrieved, and avoids retrieving as a duplicate a video that merely resembles the video to be retrieved but has different content. One example is an animated music video (AMV) or a re-edited clip that uses the original video as material: many of its scenes repeat those of the original, so its spatial features are similar, but its temporal features, its content, and the information it expresses differ significantly, and it can be regarded as a separate original work. This significantly improves the accuracy of video retrieval.
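The effect of temporal features on re-edited material can be illustrated with a toy example. The two hand-built descriptors below are assumptions chosen only to make the point and are not the network of this disclosure: an order-insensitive mean of per-frame features stands in for spatial-only retrieval, and the mean absolute change between consecutive frames stands in for a temporal feature.

```python
from typing import List

Frame = List[float]


def spatial_descriptor(frames: List[Frame]) -> List[float]:
    """Order-insensitive descriptor: the mean of the per-frame features."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def temporal_descriptor(frames: List[Frame]) -> List[float]:
    """Order-sensitive descriptor: mean absolute change between consecutive frames."""
    diffs = [[abs(b - a) for a, b in zip(prev, cur)]
             for prev, cur in zip(frames, frames[1:])]
    return [sum(col) / len(diffs) for col in zip(*diffs)]


# Hypothetical per-frame features of an original clip and a re-cut of the same frames.
original = [[0.0, 0.0], [1.0, 0.0], [2.0, 1.0], [3.0, 1.0]]
recut = [original[0], original[2], original[1], original[3]]

same_spatial = spatial_descriptor(original) == spatial_descriptor(recut)
same_temporal = temporal_descriptor(original) == temporal_descriptor(recut)
```

Here `same_spatial` is true because both clips contain the same frames, while `same_temporal` is false because the frames change in a different order, which is the kind of difference a purely spatial comparison cannot see.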
[0070] The 3D feature extraction network is a pre-trained neural network used to extract 3D tensor features from video frames.
[0071] The three-dimensional feature extraction network used in this embodiment is an attention network model built on the Vision Transformer (ViT) network. It can process both the spatial and temporal features of the input sequence segments, ensuring that the extracted features are comprehensive.
[0072] In one embodiment of this disclosure, Figure 3b is a schematic diagram of the structure of the three-dimensional feature extraction network. The three-dimensional feature extraction network 300 specifically includes:
[0073] Video frame splitter 310 is used to split an input video frame into a set number of tiles and output them.
[0074] The video frame tagger 320 is used to receive a tile, add a position information tag to the tile according to its position in the corresponding video frame, add a classification learning tag to the video frame corresponding to the tile, and output the position information tag and the classification learning tag.
[0075] The feature extractor 330 includes a multi-head attention network 331 and a multilayer perceptron 332, which have residual connections. The feature extractor receives the position information tags, the classification learning tags, and the video frame tiles, and outputs the corresponding feature vectors. The multi-head attention network processes the spatial and temporal information of the sequence segment to which the video frames belong.
[0076] The multilayer perceptron head (MLP head) 340 is used to receive feature vectors and output corresponding classification vectors.
[0077] Classifier 350 is used to receive the input classification vector and output fragment-level feature information.
[0078] The video frame splitter 310 sequentially splits each video frame in the input sequence segment into several tiles (such as 9, 16, or another set number) and transmits them to the video frame tagger 320. The video frame tagger 320 then tags the tiles in sequence (i.e., adds position information tags, such as different numbers for different positions) and adds one additional classification learning tag for the video frame to which these tiles belong. As the feature extractor 330 extracts the features of each tile, it synchronously updates the value information of the classification learning tag based on the position information tag of that tile. Once the feature extractor 330 has extracted the features of all tiles, the classification learning tag holds a vector composed of the corresponding value information, which is the output feature vector.
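The splitting and tagging performed by the video frame splitter 310 and the video frame tagger 320 can be sketched as follows. This is an illustrative sketch only: the function names, the row-major numbering starting at 1, the reserved tag 0, and the `"CLS"` placeholder for the classification learning tag are assumptions and are not specified by this disclosure.

```python
from typing import List, Tuple

Image = List[List[int]]


def split_into_tiles(frame: Image, grid: int) -> List[Tuple[int, Image]]:
    """Split a frame into grid x grid tiles, each paired with a position tag.

    Position tags start at 1 and follow row-major order; tag 0 is reserved for
    the extra classification learning tag added once per frame.
    """
    height, width = len(frame), len(frame[0])
    if height % grid or width % grid:
        raise ValueError("frame size must be divisible by the grid size")
    th, tw = height // grid, width // grid
    tiles = []
    for row in range(grid):
        for col in range(grid):
            tile = [r[col * tw:(col + 1) * tw] for r in frame[row * th:(row + 1) * th]]
            tiles.append((row * grid + col + 1, tile))
    return tiles


def tag_frame(frame: Image, grid: int) -> List[Tuple[int, object]]:
    """Prepend the classification learning tag (position 0) to the tagged tiles."""
    return [(0, "CLS")] + split_into_tiles(frame, grid)


# Hypothetical 6x6 single-channel frame split into 9 tiles of 2x2 pixels.
frame = [[r * 6 + c for c in range(6)] for r in range(6)]
tokens = tag_frame(frame, grid=3)
```

The resulting list holds one classification entry followed by nine tagged tiles, which is the input a feature extractor of this kind would consume for a single frame.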
[0079] The multilayer perceptron head (MLP head) is a module that performs classification based on the output feature vector. It determines the category to which the output feature vector belongs and outputs the corresponding classification vector. The classifier 350 maps the received classification vector to the corresponding classification result according to the preset relationship between categories and classification vectors. This classification result (usually a parameter or vector) is the segment-level feature information of the corresponding video frame.
[0080] In one embodiment of this disclosure, the multi-head attention network 331 in the feature extractor 330 is available in several preset sizes (the parameters that determine the size include the number of layers, the overall volume, etc., and the larger the size, the larger the values of these parameters). The larger the size, the higher the accuracy of the determined segment-level feature information and the slower the calculation speed. The smaller the size, the lower the accuracy of the determined segment-level feature information and the faster the calculation speed.
[0081] Figure 3c is a schematic diagram of the structure of the multi-head attention network. The specific types of multi-head attention network include:
[0082] A joint temporal-spatial attention network, used to process temporal and spatial dimension information together;
[0083] Or a separate temporal-spatial attention network, comprising a temporal attention network and a spatial attention network with a residual connection between them, where the temporal attention network processes temporal dimension information and the spatial attention network processes spatial dimension information;
[0084] Or a sparse local-global attention network, comprising a local attention network and a global attention network with a residual connection between them, where both the local and global attention networks process temporal and spatial dimension information;
[0085] Or an axial attention network, comprising a temporal attention network, a width attention network, and a height attention network connected in sequence by residual connections, where the temporal attention network processes temporal dimension information and the width and height attention networks process spatial dimension information.
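To make the difference between these attention types concrete, the sketch below lists, for one query tile at position (time, row, column), the set of tiles each sub-network would attend to, and counts the comparisons per query. How each variant restricts attention is an assumption based on the usual meaning of these terms, not a detail stated in this disclosure; the sparse local-global variant is omitted because its local and global neighborhoods are not specified here. All names are hypothetical.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

Pos = Tuple[int, int, int]  # (time, row, column) of a tile in the sequence segment


def attention_sets(query: Pos, T: int, H: int, W: int) -> Dict[str, List[Set[Pos]]]:
    """Tiles attended to by each sub-network, per attention type (assumed layout)."""
    t, h, w = query
    joint = [set(product(range(T), range(H), range(W)))]  # all tiles of all frames
    separate = [
        {(tt, h, w) for tt in range(T)},                      # temporal attention
        {(t, hh, ww) for hh in range(H) for ww in range(W)},  # spatial attention
    ]
    axial = [
        {(tt, h, w) for tt in range(T)},  # time axis
        {(t, h, ww) for ww in range(W)},  # width axis
        {(t, hh, w) for hh in range(H)},  # height axis
    ]
    return {"joint": joint, "separate": separate, "axial": axial}


def comparisons_per_query(sets: List[Set[Pos]]) -> int:
    """Total number of query-key comparisons summed over the sub-networks."""
    return sum(len(s) for s in sets)


# Hypothetical segment: 8 frames, each split into a 3 x 3 grid of tiles.
sets = attention_sets((0, 0, 0), T=8, H=3, W=3)
costs = {name: comparisons_per_query(s) for name, s in sets.items()}
```

Under these assumptions the joint type compares each tile against every tile in the segment, while the separate and axial types trade some of that coverage for fewer comparisons, which is one reason a practitioner might choose between them.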
[0086] Specifically, the different types of multi-head attention network process sequence segments through different attention networks or combinations of attention networks. This includes processing the temporal dimension information contained in corresponding tiles across multiple consecutive input video frames, and processing the spatial dimension information contained in each tile within each video frame. In this way, the three-dimensional tensor features corresponding to the video frames are extracted and the temporal feature information in the sequence segments is also processed. Those skilled in the art can select the appropriate type of multi-head attention network based on the actual situation; each type can effectively extract the segment-level feature information corresponding to the sequence segments.
[0087] Furthermore, Figure 3d is a flowchart of a method for extracting segment-level feature information, which specifically includes the following steps:
[0088] Step S3011: Input the sequence segment into the three-dimensional feature extraction network and output the original segment feature information corresponding to the sequence segment.
[0089] Specifically, the original segment feature information is the feature information directly output by the three-dimensional feature extraction network. For different video sequences to be retrieved, the original segment feature information may vary in size. If it is a vector, the difference in information content may appear as a different number of vector dimensions; for example, the longer each shot in the video to be retrieved and the larger the video size, the more information the original segment feature information contains and the larger it is. This makes it inconvenient to compare the original segment feature information of different videos to be retrieved, and it fails to meet the comparison and recognition requirements of retrieval. Therefore, the original segment feature information needs further processing to obtain segment-level feature information that can be used for retrieval comparison.
[0090] Step S3012: Simplify the original segment feature information to obtain segment-level feature information.
[0091] Specifically, the methods for processing the original segment feature information are collectively referred to as simplification processing. The simplification may reduce the dimensionality of the original segment feature information based on the principal component analysis algorithm and then normalize the result; or it may whiten the original segment feature information based on the principal component analysis algorithm and then normalize the result.
[0092] Dimensionality reduction reduces the number of dimensions of the original segment feature information, decreasing the amount of information and improving subsequent retrieval efficiency. Whitening, on the other hand, has a relatively smaller impact on the amount of information in the original segment feature information, ensuring that the resulting segment-level feature information contains sufficient information and thereby guaranteeing retrieval accuracy (but usually at the cost of lower retrieval efficiency). Therefore, in practical applications, the appropriate method can be selected according to the relative priority of efficiency and accuracy.
[0093] Normalization involves organizing the original segment feature information after dimensionality reduction or whitening to ensure that the segment-level feature information corresponding to different videos to be retrieved can meet the retrieval requirements.
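The dimensionality-reduction branch of the simplification processing can be sketched as follows. This is a minimal sketch under stated assumptions: the principal directions are estimated by power iteration with deflation (one of several ways to compute PCA), the normalization is taken to be L2 normalization (the disclosure does not name a specific norm), and the function names are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def top_components(centered, k, iters=200):
    """Estimate the top-k principal directions by power iteration with deflation."""
    dim = len(centered[0])
    # Scatter matrix; its scale does not change the principal directions.
    cov = [[sum(row[i] * row[j] for row in centered) for j in range(dim)]
           for i in range(dim)]
    components = []
    for _ in range(k):
        v = l2_normalize([1.0 / (i + 1) for i in range(dim)])  # deterministic start
        for _step in range(iters):
            v = l2_normalize([sum(cov[i][j] * v[j] for j in range(dim))
                              for i in range(dim)])
        eigval = sum(v[i] * sum(cov[i][j] * v[j] for j in range(dim))
                     for i in range(dim))
        components.append(v)
        # Remove the found direction so the next pass finds the following one.
        cov = [[cov[i][j] - eigval * v[i] * v[j] for j in range(dim)]
               for i in range(dim)]
    return components

def simplify(features, k):
    """Center, project onto k principal directions, then L2-normalize each row."""
    dim = len(features[0])
    mean = [sum(f[i] for f in features) / len(features) for i in range(dim)]
    centered = [[f[i] - mean[i] for i in range(dim)] for f in features]
    comps = top_components(centered, k)
    reduced = [[sum(c[i] * row[i] for i in range(dim)) for c in comps]
               for row in centered]
    return [l2_normalize(r) for r in reduced]
```

The whitening branch would additionally divide each projected coordinate by the square root of its eigenvalue before normalization. That step is omitted here for brevity.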
[0094] Step S302: Input the original segment feature information into the aggregation model and output the aggregated original video feature information.
[0095] Specifically, since segment-level feature information is obtained by simplifying the original segment feature information, it suffers from information loss compared to the original segment feature information. To reduce the information loss when obtaining video-level feature information, the original segment feature information is used directly as the input to the aggregation model (instead of the simplified segment-level feature information).
[0096] Furthermore, the aggregation model includes interconnected feature extractor models and pooling processors. The feature extractor model is the combination of the aforementioned multi-head attention network and multilayer perceptron.
[0097] Figure 3e shows a flowchart of a method for obtaining original video feature information through an aggregation model, which specifically includes the following steps:
[0098] Step S3021: Input the original fragment feature information into the feature extractor model and output the aggregated features corresponding to the original fragment feature information.
[0099] Specifically, by inputting the feature information of all the original segments of a video to be retrieved into the feature extractor model, the feature information of these original segments can be extracted, i.e., aggregated features.
[0100] The multi-head attention network used in the feature extractor model can be a different multi-head attention network than that used in step S301, or it can be the same multi-head attention network; there is no limitation here.
[0101] Step S3022: Input the aggregated features into the pooling processor and output the original video feature information.
[0102] Specifically, the pooling processor is used to perform pooling processing on aggregated features. The specific pooling method can be selected according to the settings, such as average pooling, max pooling, min pooling, etc., and there are no restrictions here. By performing pooling processing on aggregated features, the feature dimensionality in the aggregated features can be further reduced, preventing overfitting.
[0103] After pooling the aggregated features, the corresponding original video feature information can be obtained.
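The pooling stage of step S3022 can be sketched as below. The sketch assumes the aggregated features arrive as a list of equal-length vectors, one per segment. The function name `pool` and the `mode` parameter are hypothetical.

```python
def pool(aggregated, mode="mean"):
    """Pool a list of equal-length feature vectors into a single vector."""
    if not aggregated:
        raise ValueError("need at least one aggregated feature vector")
    columns = list(zip(*aggregated))  # one tuple per feature dimension
    if mode == "mean":
        return [sum(c) / len(c) for c in columns]
    if mode == "max":
        return [max(c) for c in columns]
    if mode == "min":
        return [min(c) for c in columns]
    raise ValueError(f"unknown pooling mode: {mode}")
```

Because pooling runs across segments, the output length equals the feature dimension regardless of how many segments the video produced.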
[0104] Step S303: Simplify the original video feature information to obtain video-level feature information.
[0105] Specifically, similar to the original segment feature information, the original video feature information also suffers from varying information content depending on the video (e.g., more corresponding shots result in more original segment feature information, leading to a larger amount of information in the resulting original video feature information), which in turn affects retrieval performance. Therefore, it is necessary to simplify the original video feature information to obtain video-level feature information suitable for retrieval.
[0106] Specific simplification methods can be based on principal component analysis (PCA) to reduce the dimensionality of the original video feature information and then normalize the processed original video feature information; or, based on PCA, to whiten the original video feature information and then normalize the processed original video feature information.
[0107] The principle is similar to that of processing the feature information of the original fragment, and will not be elaborated here.
[0108] Step S304: Search the video library based on segment-level feature information and video-level feature information to obtain the corresponding search results.
[0109] Specifically, this step is the same as step S203 in the embodiment shown in Figure 2 and will not be repeated here.
[0110] The video retrieval method according to this disclosure involves inputting sequence segments into a three-dimensional feature extraction network to extract segment-level feature information. The original segment feature information is then input into an aggregation model, which outputs aggregated original video feature information. This original video feature information is further simplified to obtain video-level feature information. Finally, a search is performed in a video database based on both the segment-level and video-level feature information to obtain the corresponding search results. This method accurately extracts segment-level and video-level feature information reflecting the temporal and spatial characteristics of the video sequence to be retrieved. While ensuring information accuracy during processing, it maximizes retrieval efficiency and accuracy, thereby improving the retrieval effect.
[0111] Figure 4a is a flowchart illustrating a video retrieval method provided in one embodiment of this disclosure. As shown in Figure 4a, the video retrieval method provided in this embodiment includes the following steps:
[0112] Step S401: In response to the received video entry request, determine that the video sequence to be entered into the database corresponding to the video entry request is the video sequence to be retrieved.
[0113] Specifically, one application scenario for video retrieval is when new videos need to be added to the video library. In this case, the new videos are those awaiting inclusion. By comparing these videos with the existing video library, it is determined whether they are duplicates. Only the non-duplicate videos are then added to the library, which avoids invalid entries (storing duplicate videos wastes server storage space, reduces search performance, and degrades the user experience).
[0114] The video sequence to be added to the database is the set of video frames obtained by decoding and extracting all these new videos, which is also the video sequence to be retrieved for subsequent retrieval.
[0115] Step S402: Preprocess the video to be retrieved to obtain sequence segments of the video sequence to be retrieved.
[0116] Specifically, since the actual content being searched is a sequence segment of the video sequence to be searched, rather than the video itself, after the video to be searched is determined, it will be preprocessed to obtain the sequence segment corresponding to the video sequence to be searched.
[0117] Furthermore, Figure 4b shows a flowchart of a method for preprocessing the video to be retrieved to obtain sequence segments of the video sequence to be retrieved. The specific steps include:
[0118] Step S4021: Decode the video to be retrieved to obtain the decoded video to be retrieved.
[0119] Specifically, since different videos to be retrieved have different encoding formats, all videos to be retrieved first need to be decoded in order to facilitate further processing.
[0120] Specific decoding methods can be achieved using existing video decoding methods and tools (such as FFmpeg or OpenCV video encoding and decoding tools), which will not be elaborated here.
[0121] Step S4022: Extract video frames of the decoded video to be retrieved at equal intervals to obtain the video sequence to be retrieved.
[0122] Specifically, after decoding the video to be retrieved, video frames need to be extracted from the decoded video to obtain the corresponding video sequence to be retrieved.
[0123] The commonly used method for extracting video frames is to extract frames at equal intervals, such as extracting one frame every set number of frames (e.g., every 12 frames or every 72 frames), or to extract frames at intervals of a set duration (e.g., every 0.2 seconds), in order to reduce the number of video frames used for subsequent analysis.
[0124] The set of video frames corresponding to each video to be retrieved after frame extraction is used as the video sequence to be retrieved.
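Equal-interval frame extraction as described in step S4022 can be sketched as follows, assuming the decoded frames are already held in a list. The function names are hypothetical, and `fps` is an assumed input that would come from the decoder.

```python
def sample_every_n(frames, step):
    """Keep one frame out of every `step` frames, starting with the first."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return frames[::step]

def sample_by_duration(frames, fps, interval_s):
    """Keep one frame per `interval_s` seconds, given the frame rate `fps`."""
    step = max(1, round(fps * interval_s))  # convert the duration to a frame count
    return frames[::step]
```

For example, at 30 frames per second an interval of 0.2 seconds corresponds to keeping every sixth frame.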
[0125] Step S4023: Divide the video sequence to be retrieved into at least one segment of the same length to obtain the sequence segments of the video sequence to be retrieved.
[0126] Specifically, the lengths of the video sequences to be retrieved are different for videos of different lengths. In order to facilitate feature comparison between different videos to be retrieved, the video sequences to be retrieved need to be divided into multiple segments (at this time, the segments corresponding to different videos to be retrieved usually have relatively small length differences, and the accuracy of feature comparison is higher).
[0127] The specific segmentation method for the video sequence to be retrieved can be as follows: the length of each segment can be preset and the segment can be divided according to the preset length (e.g., one segment every 30 frames). If the last segment is not long enough, it can be discarded or padded (i.e., repeat the last video frame until the length of the last segment reaches the set length). Alternatively, the video sequence to be retrieved can be divided into several segments (usually more than two, but in special cases, there can be only one, such as a video with still images) based on a preset length range (where the preset length includes multiple values, such as one segment every 28 to 32 frames).
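The fixed-length segmentation with the two options for a short final segment (discard or pad by repeating the last frame) can be sketched as follows. The function name and the `pad_last` flag are hypothetical, and frames are assumed to be held in a list.

```python
def split_into_segments(frames, length, pad_last=True):
    """Split frames into segments of `length`; pad or drop a short last segment."""
    segments = [frames[i:i + length] for i in range(0, len(frames), length)]
    if segments and len(segments[-1]) < length:
        if pad_last:
            last = segments[-1]
            # Repeat the final frame until the segment reaches the set length.
            segments[-1] = last + [last[-1]] * (length - len(last))
        else:
            segments.pop()
    return segments
```

Padding keeps the tail content of the video available for comparison, while discarding avoids segments dominated by a repeated frame. Which to prefer depends on the application.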
[0128] Step S403: Extract features from the sequence segments of the video sequence to be retrieved to obtain segment-level feature information corresponding to the sequence segments.
[0129] Step S404: Aggregate the segment-level feature information to obtain video-level feature information corresponding to the video sequence to be retrieved.
[0130] Specifically, steps S403 and S404 are the same as steps S201 to S202 in the embodiment shown in Figure 2 and will not be repeated here.
[0131] Step S405: Determine the first feature similarity between the video-level feature information of the video sequence to be retrieved and the video-level feature information of the stored videos in the video library.
[0132] Specifically, after obtaining video-level and segment-level feature information, these two types of information can be compared with the corresponding feature information of the stored videos in the video library to find the video with the highest similarity to the video sequence to be retrieved, and then determine whether the video to be retrieved can be added to the database.
[0133] The first thing to compare is the video-level feature information of the video sequence to be retrieved and the stored videos. Because video-level feature information has relatively little content, the comparison speed is faster, which can quickly complete the first round of screening, reduce the number of stored videos when comparing them using segment-level feature information, and thus improve the overall comparison and screening efficiency.
[0134] The first feature similarity is the similarity obtained by comparing video-level feature information. If the video-level feature information is a vector, then the first feature similarity is vector similarity, and its specific calculation method can be any vector similarity calculation method. If the video-level feature information is a numerical value, then the first feature similarity is numerical similarity (such as the ratio of the difference between two video-level feature information to the video-level feature information of the video to be retrieved).
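The two cases above can be sketched as follows. Cosine similarity is used here as one common choice of vector similarity; the disclosure permits any vector similarity method, so this choice is an assumption. The scalar case follows the ratio described in the text. Both function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def relative_difference(query_value, stored_value):
    """Ratio of the difference between two scalars to the query's own value."""
    if query_value == 0:
        raise ValueError("query value must be non-zero")
    return abs(query_value - stored_value) / abs(query_value)
```

Note that the two measures run in opposite directions: a larger cosine value means more similar, whereas a smaller relative difference means more similar.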
[0135] Step S406: Based on the first feature similarity ranking, determine at least one inventory video with the highest similarity to the video-level feature information of the video sequence to be retrieved, as a candidate video.
[0136] Specifically, by sequentially calculating and sorting the first feature similarity between the video sequence to be retrieved and all the inventory videos, one or more inventory videos with the highest video-level feature information similarity to the video sequence to be retrieved can be identified (usually, each video to be retrieved needs at least two corresponding candidate videos for further processing), i.e., candidate videos, and further screening can be carried out based on these candidate videos.
[0137] Step S407: Determine the second feature similarity between the segment-level feature information of the video sequence to be retrieved and the segment-level feature information of the candidate video.
[0138] Specifically, after identifying candidate videos for further filtering, the segment-level feature information of the video to be retrieved can be compared sequentially with that of the corresponding candidate videos to achieve further filtering. Since the comparison of segment-level feature information is relatively slow, reducing the number of videos requiring comparison (from the inventory of videos to the candidate videos) can effectively improve comparison efficiency, thereby improving retrieval efficiency.
[0139] The second feature similarity is the similarity obtained by comparing fragment-level feature information. The calculation principle of the second feature similarity is the same as that of the first feature similarity. Both determine the algorithm based on the specific type of fragment-level feature information, which will not be elaborated here.
[0140] Step S408: Based on the second feature similarity ranking, at least one candidate video with the highest segment-level feature similarity to the video sequence to be retrieved is identified as the target video to obtain the retrieval results.
[0141] Specifically, the target video is at least one of the inventory videos selected from the candidate videos that has the highest similarity to the video sequence to be retrieved (including the highest or one of the highest segment-level feature similarity and video-level feature similarity). By comparing the target video with the set similarity index or threshold, it can be determined whether the video to be retrieved and the corresponding target video constitute a duplicate video (or, in the application scenario of video recommendation, it can also be determined whether the target video can be used as a recommended video with a style and content similar to the video to be retrieved).
[0142] By combining first feature similarity ranking and second feature similarity ranking, multi-granularity re-ranking is achieved. Compared with ranking using only a single granularity or index, it combines the high efficiency of first feature similarity ranking with the accuracy of second feature similarity ranking, thereby significantly improving the retrieval ranking effect.
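The coarse-to-fine ranking of steps S405 to S408 can be sketched as follows. This is a sketch under stated assumptions: cosine similarity is used for both stages, the segment-level score is taken as the mean, over query segments, of the best match among the candidate's segments (the disclosure does not fix this rule), and the dictionary layout with `"video"` and `"segments"` keys is hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def segment_similarity(query_segments, candidate_segments):
    """Assumed rule: average each query segment's best match in the candidate."""
    best = [max(cosine(q, c) for c in candidate_segments) for q in query_segments]
    return sum(best) / len(best)

def retrieve(query, library, top_k=2):
    """Rank by video-level similarity, keep top_k, then re-rank by segments."""
    coarse = sorted(
        library,
        key=lambda vid: cosine(query["video"], library[vid]["video"]),
        reverse=True,
    )
    candidates = coarse[:top_k]  # first-round screening (step S406)
    return sorted(
        candidates,
        key=lambda vid: segment_similarity(query["segments"],
                                           library[vid]["segments"]),
        reverse=True,
    )
```

Only the `top_k` candidates reach the slower segment-level comparison, which is where the efficiency gain of the two-stage design comes from.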
[0143] Step S409: If the similarity between the target video and the video sequence to be retrieved is less than a set threshold, the retrieval result is determined to be that the video sequence to be retrieved meets the conditions for inclusion in the database.
[0144] Specifically, if the first feature similarity and the second feature similarity between the target video and the corresponding video to be retrieved are both less than the corresponding set threshold, it means that the similarity between the target video and the corresponding video to be retrieved has not reached the level that can be identified as a duplicate video, even if they are very similar (such as the same performance on different dates in the same scene; even if the performance content and scene are the same, resulting in a very high similarity, there will still be some subtle differences, which will prevent them from being identified as duplicate videos).
[0145] At this point, it can be determined that the video to be retrieved meets the inclusion criteria, and the video to be retrieved can be added to the database.
[0146] Step S410: Store the segment-level feature information and video-level feature information corresponding to the video sequence to be added to the database for retrieval in the video database.
[0147] Specifically, while adding the video sequences to be retrieved into the database, it is also necessary to store the segment-level feature information and video-level feature information corresponding to these videos. This way, when the videos are retrieved as stored videos later, the already calculated segment-level feature information and video-level feature information can be directly called without recalculation, thereby improving processing efficiency.
[0148] Step S411: If the similarity between the target video and the video sequence to be retrieved is greater than or equal to the set threshold, the retrieval result is determined to be that the video sequence to be retrieved does not meet the conditions for inclusion in the database.
[0149] Specifically, if the similarity between the target video and the video to be retrieved (including segment-level feature similarity and video-level feature similarity) is greater than or equal to the set threshold, it means that there is at least one inventory video (i.e., the target video) that is duplicated with the video to be retrieved (or some of the videos to be retrieved). Therefore, it is not necessary to add the video to be retrieved to the database to avoid having multiple duplicate videos in the video database and reducing the quality of the video database.
[0150] By using video retrieval based on segment-level and video-level feature information during video input, the features of the video to be input can be accurately extracted, and the existence of duplicate videos similar to or identical to the video to be input can be accurately determined. This effectively identifies substantially identical videos with a high degree of similarity to the video to be input (e.g., adding an intro and outro to the original video does not actually affect most of the segment-level feature information, nor does it significantly affect the video-level feature information. Similarity can be determined by comparing video-level feature information, and substantial duplication can be identified by comparing segment-level feature information, thus classifying them as substantially identical or duplicate videos). This ensures the validity of the videos in the video library (duplicate videos are actually of little value to the video library and can be considered invalid input). Step S411 is an optional step parallel to steps S409 to S410, and those skilled in the art can select the corresponding step to execute according to the actual situation.
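The entry decision of steps S409 to S411 can be sketched as follows. The threshold values of 0.9 are placeholders, not values from the disclosure. The text does not specify the case where one similarity is below its threshold and the other is not; this sketch assumes such a video is not admitted. The function name and the dictionary-based store are hypothetical.

```python
def decide_entry(video_id, features, first_similarity, second_similarity,
                 store, first_threshold=0.9, second_threshold=0.9):
    """Admit the video only if both similarities are below their thresholds.

    On admission the precomputed features are stored (step S410) so that later
    retrievals can reuse them without recalculation.
    """
    admitted = (first_similarity < first_threshold
                and second_similarity < second_threshold)
    if admitted:
        store[video_id] = features
    return admitted
```

A usage example: calling `decide_entry` with similarities of 0.4 and 0.5 stores the features and returns `True`, while similarities of 0.95 and 0.97 leave the store unchanged and return `False`.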
[0151] In one embodiment of this disclosure, the video retrieval method provided can be applied not only in the scenario of adding videos to a database, but also in the scenario of recommending videos. In this case, based on the user's historical video browsing records or collection records, at least one video that the user has recently viewed or interacted with (including collection) can be used as the video to be retrieved. Then, the corresponding target video is found in the database video and recommended to the user as a recommended video. This can effectively ensure the similarity between the recommended video and the video that the user prefers (or has interacted with), thus ensuring the user experience.
[0152] In one embodiment of this disclosure, in addition to the scenario of adding videos to the database, it can also be applied in the scenario of background deduplication of the video database. By periodically selecting a portion of the inventory videos (or sequentially traversing all inventory videos) as the videos to be searched in the background of the server, it can be determined whether there are duplicate videos, which can effectively improve the quality of the video database, minimize the number of duplicate videos, improve the user's search and viewing experience, facilitate management, and at the same time not affect the user's real-time viewing.
[0153] According to the video retrieval method of this disclosure, upon receiving a video entry request, the method determines that the video sequence to be entered into the database corresponding to the video entry request is the video sequence to be retrieved. The video sequence to be retrieved is preprocessed to obtain sequence segments of the video sequence to be retrieved. Then, feature extraction is performed to obtain segment-level feature information and video-level feature information corresponding to the video sequence to be retrieved. Then, the corresponding target video of the video sequence to be retrieved is retrieved. Finally, based on the similarity between the target video and the video sequence to be retrieved, it is determined whether to enter the video to be retrieved into the database. Therefore, by extracting sequence segments from the video sequence to be retrieved and obtaining corresponding segment-level and video-level feature information for retrieval, and by first using video-level feature information for retrieval and sorting, retrieval efficiency can be significantly improved. By further re-sorting segment-level feature information based on the video-level feature information retrieval and sorting results, a multi-granularity re-sorting effect can be achieved, ensuring retrieval accuracy and accurately identifying the corresponding target video. Based on the retrieval purpose and target video, the retrieval results can be determined, improving the video retrieval effect. By storing the segment-level and video-level feature information corresponding to the video, subsequent retrieval can be facilitated. At the same time, compared with storing video frame features, the storage space occupied by feature information can be significantly reduced, improving storage efficiency.
[0154] Exemplary media
[0155] Having introduced the methods of exemplary embodiments of this disclosure, the storage medium of the exemplary embodiments of this disclosure is described next with reference to Figure 5.
[0156] Referring to Figure 5, a program product 50 for implementing the above-described method according to an embodiment of the present disclosure is described. This product may employ a portable compact disc read-only memory (CD-ROM) and include program code, and may run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not limited thereto.
[0157] The program product may employ any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of readable storage media (a non-exhaustive list) include: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.
[0158] A readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying readable program code. This propagated data signal may take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A readable signal medium may also be any readable medium other than a readable storage medium.
[0159] Program code for performing the operations disclosed herein can be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as C or similar languages. The program code can execute entirely on the user's computing device, partially on the user's computing device, partially on a remote computing device, or entirely on a remote computing device or server. In cases involving remote computing devices, the remote computing devices can be connected to the user's computing device via any type of network, including a local area network (LAN) or a wide area network (WAN).
[0160] Exemplary device
[0161] Having introduced the medium of exemplary embodiments of this disclosure, the video retrieval apparatus of the exemplary embodiments of this disclosure is described next with reference to Figure 6. The apparatus is used to implement the video retrieval method in any of the above method embodiments. Its implementation principle and technical effect are similar to those of the corresponding methods described above, and will not be repeated here.
[0162] The video retrieval device 600 provided in this disclosure includes:
[0163] The extraction module 610 is used to extract features from sequence segments of the video sequence to be retrieved, so as to obtain segment-level feature information corresponding to the sequence segments;
[0164] The processing module 620 is used to aggregate the segment-level feature information to obtain video-level feature information corresponding to the video sequence to be retrieved;
[0165] The retrieval module 630 is used to perform retrieval in the video library based on fragment-level feature information and video-level feature information to obtain corresponding retrieval results.
[0166] In one exemplary embodiment of this disclosure, the extraction module 610 is specifically used to: input sequence fragments into a three-dimensional feature extraction network to extract fragment-level feature information.
[0167] In one exemplary embodiment of this disclosure, in the extraction module 610, the segment-level feature information includes three-dimensional tensor features, which correspond to the video frames in the sequence segments; the three-dimensional tensor features include the height features, width features, and time features of the video frames.
[0168] In one exemplary embodiment of this disclosure, the extraction module 610 specifically includes: a three-dimensional feature extraction network structure comprising: a video frame splitter, used to split the input video frame into a set number of tiles and output them; a video frame tagger, used to receive tiles, add position information tags to the tiles according to their positions in the corresponding video frames, and add classification learning tags to the video frames corresponding to the tiles, and output the position information tags and classification learning tags; a feature extractor, including a multi-head attention network and a multilayer perceptron, with residual connections between the multi-head attention network and the multilayer perceptron, the feature extractor being used to receive the position information tags, classification learning tags, and the tiles of the video frames, and output the corresponding feature vectors, the multi-head attention network being used to process the spatial and temporal dimensions of the sequence segments corresponding to the video frames; a head multilayer perceptron, used to receive the feature vectors and output the corresponding classification vectors; and a classifier, used to receive the input classification vectors and output segment-level feature information.
[0169] In one exemplary embodiment of this disclosure, the extraction module 610 specifically includes: a multi-head attention network having a set size, the multi-head attention network including: a joint time-space attention network for processing time dimension information and spatial dimension information; or, a separate time-space attention network including a time attention network and a spatial attention network, with residual connections between the time attention network and the spatial attention network, the time attention network being used to process time dimension information and the spatial attention network being used to process spatial dimension information; or, a sparse local-global attention network including a local attention network and a global attention network, with residual connections between the local attention network and the global attention network, both the local attention network and the global attention network being used to process time dimension information and spatial dimension information; or, an axial attention network including a time attention network, a width attention network, and a height attention network, with residual connections sequentially between the time attention network, the width attention network, and the height attention network, the time attention network being used to process time dimension information, and the width attention network and the height attention network being used to process spatial dimension information.
[0170] In one exemplary embodiment of this disclosure, the extraction module 610 is specifically used to: input the sequence fragment into a three-dimensional feature extraction network, output the original fragment feature information corresponding to the sequence fragment, and simplify the original fragment feature information to obtain fragment-level feature information.
[0171] In one exemplary embodiment of this disclosure, the processing module 620 is specifically used to: input the original segment feature information into the aggregation model, output the aggregated original video feature information; and simplify the original video feature information to obtain video-level feature information.
[0172] In one exemplary embodiment of this disclosure, the processing module 620 is specifically configured to: if the aggregation model includes an interconnected feature extractor model and a pooling processor, input the original segment feature information into the feature extractor model and output the aggregated features corresponding to the original segment feature information; input the aggregated features into the pooling processor and output the original video feature information.
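The aggregation step described above can be sketched as follows. This is a minimal sketch under stated assumptions: the disclosure does not fix the feature extractor model or the pooling operation in this passage, so the identity `extractor` and the `mean_pool` function below are placeholders chosen for illustration, and all names are hypothetical.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors element by element (assumed pooling)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def aggregate(segment_features, extractor=lambda feats: feats, pool=mean_pool):
    """Turn per-segment features into one video-level feature.

    extractor: stands in for the feature extractor model (identity placeholder).
    pool:      stands in for the pooling processor (mean pooling assumed).
    """
    aggregated = extractor(segment_features)  # aggregated features per segment
    return pool(aggregated)                   # single original video feature


# Two segments with 2-dimensional features collapse into one video-level vector.
video_feature = aggregate([[1.0, 2.0], [3.0, 4.0]])
```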
[0173] In one exemplary embodiment of this disclosure, in the processing module 620, the simplification processing includes: performing dimensionality reduction on the original segment feature information based on the principal component analysis algorithm, and normalizing the processed original segment feature information; or, performing dimensionality reduction on the original video feature information based on the principal component analysis algorithm, and normalizing the processed original video feature information.
[0174] In one exemplary embodiment of this disclosure, in the processing module 620, the simplification processing includes: whitening the original segment feature information based on the principal component analysis algorithm, and normalizing the processed original segment feature information; or, whitening the original video feature information based on the principal component analysis algorithm, and normalizing the processed original video feature information.
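The two simplification variants above (PCA dimensionality reduction or PCA whitening, each followed by normalization) can be sketched with NumPy as follows. The sketch assumes that "normalizing" means L2 normalization of each feature vector and that the PCA parameters are fitted on a set of reference features. Neither assumption is stated in the disclosure, and the function names `fit_pca` and `simplify` are hypothetical.

```python
import numpy as np


def fit_pca(features, n_components):
    """Fit PCA on reference features; return mean, components and variances."""
    X = np.asarray(features, dtype=float)
    mean = X.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal directions.
    _, singular_values, vt = np.linalg.svd(X - mean, full_matrices=False)
    components = vt[:n_components]
    variances = singular_values[:n_components] ** 2 / (X.shape[0] - 1)
    return mean, components, variances


def simplify(features, mean, components, variances, whiten=False, eps=1e-12):
    """Project features onto the principal components, optionally whiten, then L2-normalize."""
    X = np.asarray(features, dtype=float)
    Z = (X - mean) @ components.T          # dimensionality reduction
    if whiten:
        Z = Z / np.sqrt(variances + eps)   # unit variance per component
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.maximum(norms, eps)      # assumed L2 normalization


rng = np.random.default_rng(0)
raw = rng.normal(size=(50, 8))             # 50 hypothetical 8-dimensional features
params = fit_pca(raw, n_components=3)
reduced = simplify(raw, *params)
whitened = simplify(raw, *params, whiten=True)
```

With unit-length outputs, a dot product between two simplified features equals their cosine similarity, which keeps the later similarity ranking cheap.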
[0175] In one exemplary embodiment of this disclosure, the extraction module 610 is further configured to: before performing feature extraction on the sequence segments of the video sequence to be retrieved to obtain segment-level feature information corresponding to the sequence segments, preprocess the video to be retrieved to obtain the sequence segments of the video sequence to be retrieved.
[0176] In one exemplary embodiment of this disclosure, the extraction module 610 is specifically used for: decoding the video to be retrieved to obtain the decoded video to be retrieved; extracting video frames of the decoded video to be retrieved at equal intervals to obtain the video sequence to be retrieved; and dividing the video sequence to be retrieved into at least one segment of the same length to obtain a sequence segment of the video sequence to be retrieved.
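The preprocessing steps above (equal-interval frame extraction followed by division into segments of the same length) can be sketched as follows. Decoding is left out: `frames` is assumed to be an already decoded list of frames. The disclosure does not say how a short final segment is handled, so repeating the last frame as padding is an assumption of this sketch, as are the function names.

```python
def sample_frames(frames, interval):
    """Keep every `interval`-th decoded frame (equal-interval extraction)."""
    if interval < 1:
        raise ValueError("interval must be at least 1")
    return frames[::interval]


def split_into_segments(sequence, segment_length):
    """Divide a frame sequence into segments of the same length.

    Assumption: a short final segment is padded by repeating its last frame.
    """
    if segment_length < 1:
        raise ValueError("segment_length must be at least 1")
    segments = []
    for start in range(0, len(sequence), segment_length):
        segment = list(sequence[start:start + segment_length])
        segment += [segment[-1]] * (segment_length - len(segment))
        segments.append(segment)
    return segments


decoded = list(range(20))                 # frame indices stand in for decoded frames
sequence = sample_frames(decoded, 3)      # [0, 3, 6, 9, 12, 15, 18]
segments = split_into_segments(sequence, 4)
```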
[0177] In one exemplary embodiment of this disclosure, the retrieval module 630 is specifically configured to: determine a first feature similarity between the video-level feature information of the video sequence to be retrieved and the video-level feature information of stored videos in the video library; based on the first feature similarity ranking, determine at least one stored video with the highest similarity to the video-level feature information of the video sequence to be retrieved as a candidate video; determine a second feature similarity between the segment-level feature information of the video sequence to be retrieved and the segment-level feature information of the candidate videos; based on the second feature similarity ranking, determine at least one candidate video with the highest segment-level feature similarity to the video sequence to be retrieved as the target video, thereby obtaining a retrieval result.
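The two-stage retrieval above can be sketched as a coarse ranking on video-level features followed by a re-ranking of the candidates on segment-level features. The sketch makes several assumptions that the disclosure leaves open: cosine similarity is used for both stages, the segment-level similarity between two videos is the average of each query segment's best match, and the library is a list of dictionaries with hypothetical keys `id`, `video`, and `segments`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment_similarity(query_segments, stored_segments):
    """Assumed score: average over query segments of the best-matching stored segment."""
    best = [max(cosine(q, s) for s in stored_segments) for q in query_segments]
    return sum(best) / len(best)


def retrieve(query, library, top_k=2, top_n=1):
    """Return up to top_n (similarity, video id) pairs for the query."""
    # Stage 1: rank stored videos by video-level similarity, keep top_k candidates.
    candidates = sorted(library,
                        key=lambda v: cosine(query["video"], v["video"]),
                        reverse=True)[:top_k]
    # Stage 2: re-rank the candidates by segment-level similarity.
    scored = [(segment_similarity(query["segments"], v["segments"]), v["id"])
              for v in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


library = [
    {"id": "a", "video": [1.0, 0.0], "segments": [[1.0, 0.0], [0.0, 1.0]]},
    {"id": "b", "video": [0.9, 0.1], "segments": [[1.0, 0.0], [1.0, 0.0]]},
    {"id": "c", "video": [0.0, 1.0], "segments": [[0.0, 1.0]]},
]
query = {"video": [1.0, 0.0], "segments": [[1.0, 0.0], [0.0, 1.0]]}
result = retrieve(query, library)
```

Only the `top_k` candidates reach the more expensive segment-level comparison, which is where the speed benefit of the coarse first stage comes from.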
[0178] In an exemplary embodiment of this disclosure, the extraction module 610 is further configured to: before performing feature extraction on the sequence segments of the video sequence to be retrieved to obtain segment-level feature information corresponding to the sequence segments, in response to the received video entry request, determine that the video sequence to be entered into the database corresponding to the video entry request is the video sequence to be retrieved; the retrieval module 630 is further configured to: after sorting based on the second feature similarity and determining at least one candidate video with the highest segment-level feature similarity to the video sequence to be retrieved as the target video to obtain the retrieval result, if the similarity between the target video and the video sequence to be retrieved is less than a set threshold, determine that the video sequence to be retrieved meets the entry conditions; or, if the similarity between the target video and the video sequence to be retrieved is greater than or equal to the set threshold, determine that the video sequence to be retrieved does not meet the entry conditions.
[0179] In one exemplary embodiment of this disclosure, the retrieval module 630 is further configured to: if the similarity between the target video and the video sequence to be retrieved is less than a set threshold, and after determining that the retrieval result is that the video sequence to be retrieved meets the entry conditions, store the segment-level feature information and video-level feature information corresponding to the video sequence to be entered into the database for retrieval in the video database.
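The entry decision described in the two paragraphs above reduces to a threshold comparison followed by a conditional store. In the sketch below, the database is a plain dictionary and the function name `try_enter` and the record layout are hypothetical. The threshold value used in the example is arbitrary and not taken from the disclosure.

```python
def try_enter(video_id, video_feature, segment_features,
              best_similarity, threshold, database):
    """Store the features only if the closest stored video is not too similar.

    best_similarity: similarity between the target video found by retrieval
                     and the video sequence requesting entry.
    Returns True if the entry conditions are met and the features were stored.
    """
    if best_similarity >= threshold:
        # Greater than or equal to the threshold: treated as a duplicate.
        return False
    # Below the threshold: keep both feature levels for later retrieval.
    database[video_id] = {"video": video_feature, "segments": segment_features}
    return True


database = {}
accepted = try_enter("v1", [1.0, 0.0], [[1.0, 0.0]], 0.42, 0.9, database)
rejected = try_enter("v2", [1.0, 0.0], [[1.0, 0.0]], 0.90, 0.9, database)
```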
[0180] Exemplary computing device
[0181] Having described the methods, media, and apparatus of exemplary embodiments of this disclosure, a computing device according to an exemplary embodiment of the present disclosure is described next with reference to Figure 7.
[0182] The computing device 700 shown in Figure 7 is merely an example and should not be construed as limiting the functionality and scope of use of the embodiments disclosed herein.
[0183] As shown in Figure 7, the computing device 700 is presented in the form of a general-purpose computing device. The components of the computing device 700 may include, but are not limited to: at least one processing unit 701, at least one storage unit 702, and a bus 703 connecting different system components (including the processing unit 701 and the storage unit 702).
[0184] The bus 703 includes a data bus, a control bus, and an address bus.
[0185] Storage unit 702 may include readable media in the form of volatile memory, such as random access memory (RAM) 7021 and / or cache memory 7022, and may further include readable media in the form of non-volatile memory, such as read-only memory (ROM) 7023.
[0186] Storage unit 702 may also include a program / utility 7025 having a set (at least one) program module 7024, such program module 7024 including but not limited to: operating system, one or more application programs, other program modules and program data, each or some combination of these examples may include an implementation of a network environment.
[0187] The computing device 700 can also communicate with one or more external devices 704 (e.g., keyboard, pointing device, etc.). This communication can be performed via the input/output (I/O) interface 705. Furthermore, the computing device 700 can also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via a network adapter 707. As shown in Figure 7, the network adapter 707 communicates with other modules of the computing device 700 via the bus 703. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the computing device 700, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
[0188] It should be noted that although several units/modules or sub-units/modules of the video retrieval apparatus are mentioned in the detailed description above, this division is merely exemplary and not mandatory. In fact, according to embodiments of this disclosure, the features and functions of two or more units/modules described above can be embodied in one unit/module. Conversely, the features and functions of one unit/module described above can be further divided and embodied by multiple units/modules.
[0189] Furthermore, although the operations of the methods disclosed herein are described in a specific order in the accompanying drawings, this does not require or imply that these operations must be performed in that specific order, or that all of the operations shown must be performed to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and / or one step may be broken down into multiple steps.
[0190] While the spirit and principles of this disclosure have been described with reference to several specific embodiments, it should be understood that this disclosure is not limited to the disclosed specific embodiments, and the division of aspects does not imply that features in these aspects cannot be combined for benefit; such division is merely for convenience of expression. This disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
Claims
1. A video retrieval method, characterized in that the method includes:
performing feature extraction on sequence segments of the video sequence to be retrieved to obtain segment-level feature information corresponding to the sequence segments, wherein the sequence segments are one or more video segments obtained by dividing the video sequence to be retrieved according to duration, the segment-level feature information is a three-dimensional tensor feature corresponding to the corresponding video frame in the sequence segment, and the three-dimensional tensor feature includes the height feature, width feature, and time feature of the video frame;
aggregating the segment-level feature information to obtain video-level feature information corresponding to the video sequence to be retrieved; and
performing retrieval in the video library based on the segment-level feature information and the video-level feature information to obtain the corresponding retrieval results;
wherein performing feature extraction on the sequence segments of the video sequence to be retrieved to obtain the segment-level feature information corresponding to the sequence segments includes: inputting the sequence segments into a three-dimensional feature extraction network to extract the segment-level feature information;
wherein the structure of the three-dimensional feature extraction network includes:
a video frame splitter, used to split an input video frame into a set number of image patches and output them;
a video frame tagger, used to receive the image patches, add position information tags to the image patches according to the positions of the image patches in the corresponding video frame, add classification learning tags to the video frame corresponding to the image patches, and output the position information tags and the classification learning tags;
a feature extractor, including a multi-head attention network and a multilayer perceptron with residual connections between the multi-head attention network and the multilayer perceptron, the feature extractor being used to receive the position information tags, the classification learning tags, and the image patches of the video frame, and output the corresponding feature vector, and the multi-head attention network being used to process the spatial and temporal information of the sequence segment corresponding to the video frame;
a multilayer perceptron head, used to receive the feature vector and output the corresponding classification vector; and
a classifier, used to receive the classification vector and output the segment-level feature information.
2. The video retrieval method according to claim 1, characterized in that, The multi-head attention network has a set size, and the multi-head attention network includes: A joint time-space attention network for processing time- and spatial-dimensional information; Alternatively, a separate temporal-spatial attention network can be constructed, comprising a temporal attention network and a spatial attention network, with residual connections between the two networks. The temporal attention network is used to process temporal dimension information, and the spatial attention network is used to process spatial dimension information. Alternatively, a sparse local-global attention network, comprising a local attention network and a global attention network, with residual connections between the local attention network and the global attention network, wherein both the local attention network and the global attention network are used to process temporal and spatial dimension information; Alternatively, an axial attention network may be used, comprising a temporal attention network, a width attention network, and a height attention network, wherein the temporal attention network, the width attention network, and the height attention network are sequentially residually connected. The temporal attention network is used to process temporal dimension information, and the width attention network and the height attention network are used to process spatial dimension information.
3. The video retrieval method of claim 1, wherein the step of inputting the sequence segments into a three-dimensional feature extraction network to extract the segment-level feature information includes: inputting the sequence segment into the three-dimensional feature extraction network, which outputs the original segment feature information corresponding to the sequence segment; and simplifying the original segment feature information to obtain the segment-level feature information.
4. The video retrieval method according to claim 3, wherein aggregating the segment-level feature information to obtain video-level feature information corresponding to the video sequence to be retrieved includes: inputting the original segment feature information into the aggregation model, which outputs the aggregated original video feature information; and simplifying the original video feature information to obtain the video-level feature information.
5. The video retrieval method of claim 4, wherein the aggregation model includes an interconnected feature extractor model and pooling processor, and the step of inputting the original segment feature information into the aggregation model and outputting the aggregated original video feature information includes: inputting the original segment feature information into the feature extractor model, which outputs the aggregated features corresponding to the original segment feature information; and inputting the aggregated features into the pooling processor, which outputs the original video feature information.
6. The video retrieval method of claim 4, wherein the simplification process includes: performing dimensionality reduction on the original segment feature information based on the principal component analysis algorithm, and normalizing the processed original segment feature information; or, performing dimensionality reduction on the original video feature information based on the principal component analysis algorithm, and normalizing the processed original video feature information.
7. The video retrieval method of claim 4, wherein the simplification process includes: whitening the original segment feature information based on the principal component analysis algorithm, and normalizing the processed original segment feature information; or, whitening the original video feature information based on the principal component analysis algorithm, and normalizing the processed original video feature information.
8. The video retrieval method of any one of claims 1 to 7, characterized in that, Before performing feature extraction on sequence segments of the video sequence to be retrieved to obtain segment-level feature information corresponding to the sequence segments, the method further includes: The video to be retrieved is preprocessed to obtain sequence segments of the video sequence to be retrieved.
9. The video retrieval method of claim 8, wherein preprocessing the video to be retrieved to obtain sequence segments of the video sequence to be retrieved includes: decoding the video to be retrieved to obtain the decoded video to be retrieved; extracting video frames of the decoded video to be retrieved at equal intervals to obtain the video sequence to be retrieved; and dividing the video sequence to be retrieved into at least one segment of the same length to obtain the sequence segments of the video sequence to be retrieved.
10. The video retrieval method of any one of claims 1-7, wherein, The process of searching the video library based on the segment-level feature information and the video-level feature information to obtain corresponding search results includes: Determine the first feature similarity between the video-level feature information of the video sequence to be retrieved and the video-level feature information of the videos in the video library; Based on the first feature similarity ranking, at least one inventory video with the highest similarity to the video-level feature information of the video sequence to be retrieved is determined as a candidate video; Determine the second feature similarity between the segment-level feature information of the video sequence to be retrieved and the segment-level feature information of the candidate video; Based on the second feature similarity ranking, at least one candidate video with the highest segment feature similarity to the video sequence to be retrieved is determined as the target video to obtain the retrieval result.
11. The video retrieval method of claim 10, wherein, before performing feature extraction on sequence segments of the video sequence to be retrieved to obtain segment-level feature information corresponding to the sequence segments, the method further includes: in response to a received video entry request, determining the video sequence to be entered into the database corresponding to the video entry request as the video sequence to be retrieved; and, after determining at least one candidate video with the highest segment-level feature similarity to the video sequence to be retrieved as the target video based on the second feature similarity ranking to obtain the retrieval result, the method further includes: if the similarity between the target video and the video sequence to be retrieved is less than a set threshold, determining that the retrieval result is that the video sequence to be retrieved meets the entry conditions; or, if the similarity between the target video and the video sequence to be retrieved is greater than or equal to the set threshold, determining that the retrieval result is that the video sequence to be retrieved does not meet the entry conditions.
12. The video retrieval method of claim 11, wherein, If the similarity between the target video and the video sequence to be retrieved is less than a set threshold, and the retrieval result indicates that the video sequence to be retrieved meets the inclusion criteria, the method further includes: The segment-level feature information and video-level feature information corresponding to the video sequence to be added to the database are stored for retrieval in the video database.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions which, when executed by a processor, implement the video retrieval method according to any one of claims 1 to 12.
14. A video retrieval apparatus, characterized in that the apparatus includes:
an extraction module, used to perform feature extraction on sequence segments of the video sequence to be retrieved to obtain segment-level feature information corresponding to the sequence segments, wherein the sequence segments are one or more video segments obtained by dividing the video sequence to be retrieved according to duration;
a processing module, used to aggregate the segment-level feature information to obtain video-level feature information corresponding to the video sequence to be retrieved; and
a retrieval module, used to perform retrieval in the video library based on the segment-level feature information and the video-level feature information to obtain the corresponding retrieval results;
wherein the extraction module is specifically used to input the sequence segments into a three-dimensional feature extraction network to extract the segment-level feature information;
wherein the segment-level feature information is a three-dimensional tensor feature corresponding to the corresponding video frame in the sequence segment, and the three-dimensional tensor feature includes the height feature, width feature, and time feature of the video frame;
wherein the structure of the three-dimensional feature extraction network includes:
a video frame splitter, used to split an input video frame into a set number of image patches and output them;
a video frame tagger, used to receive the image patches, add position information tags to the image patches according to the positions of the image patches in the corresponding video frame, add classification learning tags to the video frame corresponding to the image patches, and output the position information tags and the classification learning tags;
a feature extractor, including a multi-head attention network and a multilayer perceptron with residual connections between the multi-head attention network and the multilayer perceptron, the feature extractor being used to receive the position information tags, the classification learning tags, and the image patches of the video frame, and output the corresponding feature vector, and the multi-head attention network being used to process the spatial and temporal information of the sequence segment corresponding to the video frame;
a multilayer perceptron head, used to receive the feature vector and output the corresponding classification vector; and
a classifier, used to receive the classification vector and output the segment-level feature information.
15. The video retrieval apparatus according to claim 14, wherein The extraction module specifically includes: The multi-head attention network has a set size, and the multi-head attention network includes: A joint time-space attention network for processing time- and spatial-dimensional information; Alternatively, a separate temporal-spatial attention network can be constructed, comprising a temporal attention network and a spatial attention network, with residual connections between the two networks. The temporal attention network is used to process temporal dimension information, and the spatial attention network is used to process spatial dimension information. Alternatively, a sparse local-global attention network, comprising a local attention network and a global attention network, with residual connections between the local attention network and the global attention network, wherein both the local attention network and the global attention network are used to process temporal and spatial dimension information; Alternatively, an axial attention network may be used, comprising a temporal attention network, a width attention network, and a height attention network, wherein the temporal attention network, the width attention network, and the height attention network are sequentially residually connected. The temporal attention network is used to process temporal dimension information, and the width attention network and the height attention network are used to process spatial dimension information.
16. The video retrieval apparatus according to claim 14, wherein the extraction module is specifically used to: input the sequence segment into the three-dimensional feature extraction network, which outputs the original segment feature information corresponding to the sequence segment; and simplify the original segment feature information to obtain the segment-level feature information.
17. The video retrieval apparatus according to claim 16, wherein the processing module is specifically used to: input the original segment feature information into the aggregation model, which outputs the aggregated original video feature information; and simplify the original video feature information to obtain the video-level feature information.
18. The video retrieval apparatus according to claim 17, wherein the processing module is specifically used to: if the aggregation model includes an interconnected feature extractor model and pooling processor, input the original segment feature information into the feature extractor model, which outputs the aggregated features corresponding to the original segment feature information; and input the aggregated features into the pooling processor, which outputs the original video feature information.
19. The video retrieval apparatus according to claim 17, wherein the simplification process performed by the processing module includes: performing dimensionality reduction on the original segment feature information based on the principal component analysis algorithm, and normalizing the processed original segment feature information; or, performing dimensionality reduction on the original video feature information based on the principal component analysis algorithm, and normalizing the processed original video feature information.
20. The video retrieval apparatus of claim 17, wherein the simplification process performed by the processing module includes: whitening the original segment feature information based on the principal component analysis algorithm, and normalizing the processed original segment feature information; or, whitening the original video feature information based on the principal component analysis algorithm, and normalizing the processed original video feature information.
21. The video retrieval apparatus according to any one of claims 14 to 20, characterized in that, The extraction module is also used for: Before extracting features from the sequence segments of the video sequence to be retrieved to obtain segment-level feature information corresponding to the sequence segments, the video to be retrieved is preprocessed to obtain the sequence segments of the video sequence to be retrieved.
22. The video retrieval apparatus according to claim 21, wherein the extraction module is specifically used to: decode the video to be retrieved to obtain the decoded video to be retrieved; extract video frames of the decoded video to be retrieved at equal intervals to obtain the video sequence to be retrieved; and divide the video sequence to be retrieved into at least one segment of the same length to obtain the sequence segments of the video sequence to be retrieved.
23. The video retrieval apparatus according to any one of claims 14 to 20, characterized in that, The retrieval module is specifically used for: Determine the first feature similarity between the video-level feature information of the video sequence to be retrieved and the video-level feature information of the videos in the video library; Based on the first feature similarity ranking, at least one inventory video with the highest similarity to the video-level feature information of the video sequence to be retrieved is determined as a candidate video; Determine the second feature similarity between the segment-level feature information of the video sequence to be retrieved and the segment-level feature information of the candidate video; Based on the second feature similarity ranking, at least one candidate video with the highest segment feature similarity to the video sequence to be retrieved is determined as the target video to obtain the retrieval result.
24. The video retrieval apparatus according to claim 23, wherein the extraction module is also used to: before performing feature extraction on the sequence segments of the video sequence to be retrieved to obtain segment-level feature information corresponding to the sequence segments, in response to a received video entry request, determine the video sequence to be entered into the database corresponding to the video entry request as the video sequence to be retrieved; and the retrieval module is also used to: after determining at least one candidate video with the highest segment-level feature similarity to the video sequence to be retrieved as the target video based on the second feature similarity ranking to obtain the retrieval result, if the similarity between the target video and the video sequence to be retrieved is less than a set threshold, determine that the retrieval result is that the video sequence to be retrieved meets the entry conditions; or, if the similarity between the target video and the video sequence to be retrieved is greater than or equal to the set threshold, determine that the retrieval result is that the video sequence to be retrieved does not meet the entry conditions.
25. The video retrieval apparatus of claim 24, wherein, The retrieval module is also used for: If the similarity between the target video and the video sequence to be retrieved is less than a set threshold, and the retrieval result indicates that the video sequence to be retrieved meets the entry conditions, then the segment-level feature information and video-level feature information corresponding to the video sequence to be entered into the database are stored for retrieval in the video database.
26. A computing device comprising: At least one processor; and memory that is communicatively connected to at least one processor; The memory stores instructions that can be executed by at least one processor, which, when executed by at least one processor, cause the computing device to perform the video retrieval method as described in any one of claims 1 to 12.