Methods and apparatus for low-quality video detection and for training video understanding models
A video understanding model built from a self-attention mechanism network and a classification network addresses the low efficiency and insufficient accuracy of video quality review, enabling efficient and accurate detection of low-quality videos, including in complex scenarios.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BAIDU COM TIMES TECH (BEIJING) CO LTD
- Filing Date
- 2022-11-29
- Publication Date
- 2026-06-30
AI Technical Summary
Existing technologies are inefficient in video quality review, with insufficient recall and detection accuracy, and cannot effectively cover multiple review dimensions. Furthermore, deep learning detection methods suffer from high overhead, low operating efficiency, and poor robustness.
A video understanding model is constructed using a self-attention mechanism network and a classification network. The video to be detected is divided into multiple segments. The self-attention mechanism network is used to extract features from the video segments, and the classification network is used to predict the low-quality category confidence of each segment. The overall quality review result of the video is determined.
It improves video review efficiency, enhances the recall rate and detection accuracy of low-quality videos, covers more review dimensions, and has good robustness and low operating overhead.
Figure CN115761595B_ABST
Description
Technical Field
[0001] This disclosure relates to the field of artificial intelligence technology, specifically to the fields of deep learning, video recognition, and video processing, and can be applied to scenarios such as video review, video search, or recommendation. In particular, it relates to a low-quality video detection method and a video understanding model training method, and to corresponding apparatuses.
Background Technology
[0002] With the development of network information technology and the popularization of electronic devices, video, as an important information carrier, is playing an increasingly significant role in people's work, entertainment, and study. For example, people can upload short videos to service platforms for others to enjoy. Due to limitations in personal skills or shooting equipment, the quality of videos uploaded to service platforms often varies. By reviewing the quality of videos before recommending them to users and filtering out low-quality videos, service platforms can significantly optimize the user experience.
[0003] Currently, the service platform can review videos using deep learning-based detection methods. Specifically, a neural network model can be trained for each review dimension. Then, for the video to be reviewed, the neural network model corresponding to each review dimension is used to analyze the video, obtaining the detection result for each review dimension. By comprehensively analyzing the detection results for each review dimension, the review result for the video can be obtained.
Summary of the Invention
[0004] This disclosure provides a low-quality video detection method, a video understanding model training method, and corresponding apparatuses, which can effectively improve the efficiency of video review as well as the recall rate and detection accuracy for low-quality videos.
[0005] According to a first aspect of this disclosure, a method for detecting low-quality video is provided, the method comprising:
[0006] The video to be detected is divided into at least two video segments. Each video segment is predicted using a pre-defined video understanding model to obtain the prediction confidence of each video segment belonging to each low-quality category. The low-quality categories include at least two types. The video understanding model is trained using a video understanding network, which includes a self-attention mechanism network and a classification network. The self-attention mechanism network is used to extract features of the video segment in each dimension corresponding to each low-quality category. The classification network is used to predict the prediction confidence of each video segment belonging to each low-quality category based on the features extracted by the self-attention mechanism network. The quality review result of the video to be detected is determined based on the proportion of the target video segment in the at least two video segments. The target video segment is the video segment in the at least two video segments in which the prediction confidence of at least one low-quality category meets the first pre-defined condition. The quality review result includes: the video to be detected is a low-quality video or a non-low-quality video.
[0007] According to a second aspect of this disclosure, a method for training a video understanding model is provided, the method comprising:
[0008] Obtain at least two sample video segments obtained from the segmentation of the sample video, and the low-quality category label corresponding to each sample video segment; construct a video understanding network based on a self-attention mechanism network and a classification network; train the video understanding network using at least two sample video segments and the low-quality category label corresponding to each sample video segment to obtain a video understanding model. The video understanding model is used to detect low quality in the video to be detected. The self-attention mechanism network is used to extract features of the video segments in the dimension corresponding to each low-quality category. The classification network is used to predict the prediction confidence of each video segment belonging to each low-quality category based on the features extracted by the self-attention mechanism network.
[0009] According to a third aspect of this disclosure, a low-quality video detection apparatus is provided, the apparatus comprising:
[0010] The video processing unit is used to divide the video to be detected into at least two video segments; the video understanding unit is used to predict each video segment using a preset video understanding model to obtain the prediction confidence of each video segment belonging to each low-quality category. The low-quality categories include at least two. The video understanding model is trained using a video understanding network, which includes a self-attention mechanism network and a classification network. The self-attention mechanism network is used to extract features of the video segment in the dimension corresponding to each low-quality category, and the classification network is used to predict the prediction confidence of each video segment belonging to each low-quality category based on the features extracted by the self-attention mechanism network; the low-quality identification unit is used to determine the quality review result of the video to be detected based on the proportion of the target video segment in the at least two video segments. The target video segment is a video segment in the at least two video segments in which the prediction confidence of at least one low-quality category meets the first preset condition. The quality review result includes: the video to be detected is a low-quality video or a non-low-quality video.
[0011] According to a fourth aspect of this disclosure, a video understanding model training apparatus is provided, the apparatus comprising:
[0012] The acquisition unit is used to acquire at least two sample video segments obtained from the segmentation of the sample video, and the low-quality category label corresponding to each sample video segment; the construction unit is used to construct a video understanding network based on the self-attention mechanism network and the classification network; the training unit is used to train the video understanding network using at least two sample video segments and the low-quality category label corresponding to each sample video segment to obtain a video understanding model. The video understanding model is used to perform low-quality detection on the video to be detected. The self-attention mechanism network is used to extract features of the video segments in the dimension corresponding to each low-quality category. The classification network is used to predict the prediction confidence of each video segment belonging to each low-quality category based on the features extracted by the self-attention mechanism network.
[0013] According to a fifth aspect of this disclosure, an electronic device is provided, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in the first or second aspect.
[0014] According to a sixth aspect of this disclosure, a non-transitory computer-readable storage medium is provided storing computer instructions for causing a computer to perform the method described according to the first or second aspect.
[0015] According to a seventh aspect of this disclosure, a computer program product is provided, comprising a computer program that, when executed by a processor, implements the method according to the first or second aspect.
[0016] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.
Attached Figure Description
[0017] The accompanying drawings are provided to better understand this solution and do not constitute a limitation of this disclosure. Wherein:
[0018] Figure 1 is a schematic flowchart of the low-quality video detection method provided in an embodiment of this disclosure;
[0019] Figure 2 is a schematic diagram illustrating the principle of the video understanding network provided in an embodiment of this disclosure;
[0020] Figure 3 is another schematic flowchart of the low-quality video detection method provided in an embodiment of this disclosure;
[0021] Figure 4 is another schematic diagram illustrating the principle of the video understanding network provided in an embodiment of this disclosure;
[0022] Figure 5 is a schematic diagram of the processing flow of the video understanding model provided in an embodiment of this disclosure;
[0023] Figure 6 is another schematic flowchart of the low-quality video detection method provided in an embodiment of this disclosure;
[0024] Figure 7 is another schematic flowchart of the low-quality video detection method provided in an embodiment of this disclosure;
[0025] Figure 8 is a schematic flowchart of the video understanding model training method provided in an embodiment of this disclosure;
[0026] Figure 9 is a schematic diagram of the composition of the low-quality video detection apparatus provided in an embodiment of this disclosure;
[0027] Figure 10 is a schematic diagram of the composition of the video understanding model training apparatus provided in an embodiment of this disclosure;
[0028] Figure 11 is a schematic block diagram of an example electronic device 1100 that can be used to implement embodiments of this disclosure.
Detailed Implementation
[0029] The exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding, and should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.
[0030] It should be understood that in the embodiments of this disclosure, the character " / " generally indicates that the preceding and following objects are in an "or" relationship. The terms "first," "second," etc., are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated.
[0031] With the development of network information technology and the popularization of electronic devices, video, as an important information carrier, is playing an increasingly significant role in people's work, entertainment, and study, and the mature application of streaming media has made this trend even more obvious. For example, people can upload short videos to service platforms for others to enjoy. These service platforms can be short video service platforms or other service platforms that support video functionality.
[0032] Due to limitations in personal skills or filming equipment, the quality of videos uploaded to service platforms often varies. If low-quality videos are recommended to a user's homepage, it can lead to a poor browsing experience. By reviewing videos before recommending them to users and filtering out low-quality ones, service platforms can significantly improve the user experience.
[0033] For example, video quality review rules can be preset. Such rules may define videos that exhibit prolonged PowerPoint (PPT) template switching, prolonged static images, poor clarity, blurry images, black screens, or image jitter as poor-quality videos, also called low-quality videos.
[0034] Currently, the main methods used by service platforms to review video quality include manual review or review based on deep learning detection methods.
[0035] Manual review requires significant manpower and is inefficient.
[0036] The deep learning-based review process typically involves training a neural network model for each review dimension, such as prolonged PPT template switching, prolonged static images, poor clarity, blurry images, black screens, and image jitter. Then, for the video to be reviewed, the neural network model corresponding to each review dimension is used to analyze the video, obtaining the detection result for each dimension. By comprehensively analyzing the detection results for each review dimension, the review result for the video can be obtained, such as: low-quality video or non-low-quality video.
[0037] However, the aforementioned deep learning-based detection methods suffer from several drawbacks. When there are many review dimensions, a large number of neural network models are required, leading to excessive overhead, low efficiency, and high failure rates. Conversely, if the number of neural network models is too small, they cannot cover all the review dimensions, resulting in poor robustness and an inability to handle complex issues in real-world scenarios. For example, a video may contain multiple quality issues or defects, but if the number of neural network models is too small, the deep learning detection method cannot accurately detect these issues.
[0038] In other words, current methods for video quality review are inadequate to meet the requirements of video review. Furthermore, current methods have relatively low recall and detection accuracy for low-quality videos.
[0039] This disclosure provides a low-quality video detection method that utilizes a self-attention mechanism network for end-to-end discrimination of the video to be detected, thereby determining whether the video is low-quality. This method can effectively improve video review efficiency, as well as the recall and detection accuracy of low-quality videos. Furthermore, this method can cover more review dimensions when detecting videos, better addressing various complex problems in real-world scenarios, exhibiting good robustness, low overhead, high operating efficiency, and a low failure rate.
[0040] The execution subject of this method can be a computer or server, or other devices with data processing capabilities. No restrictions are placed on the execution subject of this method.
[0041] In some embodiments, the server can be a single server, or it can be a server cluster consisting of multiple servers. In some embodiments, the server cluster can also be a distributed cluster. This disclosure does not limit the specific implementation of the server.
[0042] For example, the server can be the backend server or data server of the service platform. The service platform can be a short video service platform, other multimedia resource service platforms, or information service platforms, etc., and there is no limitation on the specific type of service platform.
[0043] The following is an exemplary description of this low-quality video detection method.
[0044] Figure 1 is a schematic flowchart of the low-quality video detection method provided in an embodiment of this disclosure. As shown in Figure 1, the low-quality video detection method may include:
[0045] S101. Divide the video to be detected into at least two video segments.
[0046] For example, the video to be detected can be a short video or other videos shot or produced by the user; there is no limitation on the type of video to be detected.
[0047] For example, a 160-second video to be detected can be divided into 10 video segments of 16 seconds each: 0 seconds to 16 seconds is the first video segment, 16 seconds to 32 seconds is the second video segment, and so on, with 144 seconds to 160 seconds being the tenth video segment.
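The segmentation in S101 can be sketched as follows. This is a minimal illustration only: the function name `split_into_segments`, the fixed segment length, and the use of half-open time intervals (so that adjacent segments share a boundary without overlapping) are assumptions for this example and are not specified by the disclosure.

```python
def split_into_segments(duration_s: float, segment_s: float) -> list[tuple[float, float]]:
    """Return (start, end) times in seconds for consecutive segments.

    Hypothetical helper: each segment covers [start, end); the last
    segment may be shorter if duration_s is not a multiple of segment_s.
    """
    if duration_s <= 0 or segment_s <= 0:
        raise ValueError("duration_s and segment_s must be positive")
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        segments.append((start, end))
        start = end
    return segments


# A 160-second video split into 16-second segments.
segments = split_into_segments(160, 16)
```

Because the method requires at least two video segments, a caller would presumably choose a segment length shorter than the video duration.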
[0048] Optionally, before S101, the method may further include: acquiring the video to be detected. For example, taking a short video service platform as an example, the short video service platform can receive short videos uploaded by users. After acquiring the short video uploaded by the user, the short video can be reviewed according to the method provided in the embodiments of this disclosure. After the review is passed (e.g., it is determined that the short video is not a low-quality video), it can be recommended to other users on the short video service platform.
[0049] S102. Predict each video segment using a preset video understanding model to obtain the prediction confidence of each video segment belonging to each low-quality category.
[0050] The low-quality categories include at least two types. For example, the low-quality categories may include prolonged PowerPoint (PPT) template switching, prolonged static images, poor clarity, blurry images, black screens, and image jitter, among other types. In this embodiment of the disclosure, a video that falls into a low-quality category can be referred to as a low-quality video.
[0051] The video understanding model is trained using a video understanding network, which includes a self-attention mechanism network and a classification network. The self-attention mechanism network is used to extract features of video segments in each low-quality category, and the classification network is used to predict the confidence level of each video segment belonging to each low-quality category based on the features extracted by the self-attention mechanism network.
[0052] For example, suppose the video to be detected is divided into N video segments and the low-quality categories include M categories, where N and M are both integers greater than 1. Figure 2 is a schematic diagram illustrating the principle of a video understanding network provided in an embodiment of this disclosure. As shown in Figure 2, the video understanding network can include a self-attention mechanism network and a classification network. In step S102, the N video segments, video segment 1 to video segment N, can be input into the self-attention mechanism network. For each video segment, the self-attention mechanism network can extract features in the dimensions corresponding to the M low-quality categories, that is, the features corresponding to low-quality category 1 through low-quality category M. Based on these features, the classification network can predict the prediction confidence of each video segment belonging to each of the M low-quality categories. For example, based on the features of video segment 1 for low-quality category 1 through low-quality category M, the classification network can predict the prediction confidence of video segment 1 belonging to each of low-quality category 1 through low-quality category M. All the results output by the classification network can form a video segment classification set, which includes the prediction confidence of each video segment belonging to each low-quality category.
[0053] For example, if the low-quality categories are blurry images, black screens, and image jitter, the prediction confidence of a given video segment can be 50% for the blurry image category, 70% for the black screen category, and 80% for the image jitter category.
[0054] It should be noted that the low-quality category can be set according to the video quality review rules, which can be preset or manually defined. The low-quality categories mentioned in the embodiments of this disclosure are all illustrative, and this disclosure does not limit the specific type and number of low-quality categories.
[0055] S103. Determine the quality review result of the video to be detected based on the proportion of the target video segment in at least two video segments. The target video segment is a video segment in at least two video segments in which the prediction confidence of at least one low-quality category meets the first preset condition.
[0056] The quality review result includes: the video to be detected is a low-quality video or a non-low-quality video.
[0057] Optionally, before S103, the method may further include: determining, based on the prediction confidence of each video segment belonging to each low-quality category, each video segment for which the prediction confidence of at least one low-quality category satisfies the first preset condition as a target video segment.
[0058] For example, with video segments 1 to N and low-quality categories 1 to M, for each of video segments 1 to N, when at least one of the video segment's prediction confidences for low-quality categories 1 to M meets the first preset condition, the video segment can be determined to be a target video segment.
[0059] Understandably, the target video segment may include one or more video segments from video segment 1 to video segment N.
[0060] In some implementations, the first preset condition may include: the prediction confidence level is greater than (or equal to) a preset confidence threshold. For example, the preset confidence threshold may be 80%, 90%, etc., and the preset confidence threshold may be a preset value. Here, there is no restriction on the size of the preset confidence threshold.
[0061] Taking a preset confidence threshold of 90% as an example, if at least one of a video segment's prediction confidences for low-quality categories 1 to M is greater than 90%, the video segment can be used as a target video segment.
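The threshold form of the first preset condition can be sketched as follows. This is an illustrative sketch only: the function name `find_target_segments`, the nested-list layout of the confidences, and the sample values are assumptions, and the strict "greater than" comparison is one of the two variants the text allows.

```python
def find_target_segments(confidences: list[list[float]], threshold: float = 0.9) -> list[int]:
    """Return indices of segments with at least one confidence above threshold.

    confidences[i][j] is the prediction confidence that segment i belongs
    to low-quality category j (hypothetical layout for this example).
    """
    return [
        i for i, row in enumerate(confidences)
        if any(c > threshold for c in row)
    ]


# Three segments, three low-quality categories (made-up values).
confidences = [
    [0.50, 0.70, 0.80],  # no category above 90%
    [0.95, 0.10, 0.20],  # category 0 above 90%
    [0.30, 0.92, 0.91],  # categories 1 and 2 above 90%
]
targets = find_target_segments(confidences, threshold=0.9)
```

A segment counts once as a target segment regardless of how many of its categories exceed the threshold.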
[0062] In other implementations, the first preset condition may also include: the prediction confidence ranks within the top first proportion of all prediction confidences. For example, the first proportion can be 30%, 20%, etc., and there is no limitation here.
[0063] Taking a first proportion of 20% as an example, if at least one of a video segment's prediction confidences for low-quality categories 1 to M ranks within the top 20% of all prediction confidences, the video segment can be used as a target video segment.
[0064] Optionally, "all prediction confidences" may refer either to the prediction confidences of the video segment currently being evaluated for low-quality categories 1 to M, or to the prediction confidences of every video segment for low-quality categories 1 to M.
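The proportion-based form of the first preset condition can be sketched as follows. This sketch makes several assumptions that the disclosure does not fix: "all prediction confidences" is taken to mean the pooled confidences of every segment for every category, the number of top-ranked entries is rounded down (with a minimum of one), and ties at the cutoff value are included. The function name and sample values are hypothetical.

```python
import math


def find_top_proportion_targets(
    confidences: list[list[float]], first_proportion: float = 0.2
) -> list[int]:
    """Return indices of segments holding at least one top-ranked confidence.

    confidences[i][j] is the confidence that segment i belongs to category j.
    All confidences are pooled and ranked; a segment is a target if any of
    its confidences falls within the top `first_proportion` of the pool.
    """
    pooled = sorted((c for row in confidences for c in row), reverse=True)
    if not pooled:
        return []
    # Assumption: round down, but always keep at least one entry.
    k = max(1, math.floor(len(pooled) * first_proportion))
    cutoff = pooled[k - 1]
    return [i for i, row in enumerate(confidences) if any(c >= cutoff for c in row)]


# Five segments, two categories: ten confidences, so the top 20% is two values.
confidences = [
    [0.10, 0.20],
    [0.85, 0.30],
    [0.40, 0.15],
    [0.25, 0.95],
    [0.35, 0.05],
]
targets = find_top_proportion_targets(confidences, first_proportion=0.2)
```

Unlike a fixed threshold, this variant always selects at least one target segment, so it is presumably combined with the preset ratio in S103 with that behavior in mind.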
[0065] After the target video segments are identified as described above, the quality review result of the video to be detected can be determined based on the proportion of target video segments among the at least two video segments, for example, among video segments 1 to N mentioned above.
[0066] For example, determining the quality review result of the video to be detected based on the proportion of target video segments among the at least two video segments may include: when the proportion is greater than a preset ratio, determining that the video to be detected is a low-quality video; or, when the proportion is less than the preset ratio, determining that the video to be detected is a non-low-quality video.
[0067] The preset ratio can be 50%, 30%, 20%, etc. The size of the preset ratio can be set according to business needs or video review rules. This disclosure does not limit the size of the preset ratio.
[0068] Optionally, when the proportion of target video segments among the at least two video segments is equal to the preset ratio, the video to be detected can be determined to be either a low-quality video or a non-low-quality video; no restriction is imposed here.
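The decision in S103 can be sketched as follows. The function name `review_video`, the string return values, and the `low_quality_on_tie` flag are assumptions introduced for illustration; the flag simply makes explicit that the text leaves the equal-to-ratio case open.

```python
def review_video(
    num_target_segments: int,
    num_segments: int,
    preset_ratio: float = 0.5,
    low_quality_on_tie: bool = True,
) -> str:
    """Classify a video from the share of its segments that are target segments."""
    if num_segments < 2:
        raise ValueError("the video must be divided into at least two segments")
    if not 0 <= num_target_segments <= num_segments:
        raise ValueError("num_target_segments must be between 0 and num_segments")
    proportion = num_target_segments / num_segments
    if proportion > preset_ratio:
        return "low-quality"
    if proportion < preset_ratio:
        return "non-low-quality"
    # Proportion equals the preset ratio: either outcome is permitted.
    return "low-quality" if low_quality_on_tie else "non-low-quality"


result_many = review_video(6, 10, preset_ratio=0.5)  # 60% of segments flagged
result_few = review_video(2, 10, preset_ratio=0.5)   # 20% of segments flagged
```

Lowering `preset_ratio` makes the review stricter, since fewer flagged segments are then needed to mark the whole video as low-quality.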
[0069] This embodiment of the disclosure divides the video to be detected into at least two video segments. A preset video understanding model is used to predict the quality of each video segment and determine its predicted confidence level for each low-quality category. The quality review result of the video to be detected is determined based on the proportion of video segments whose predicted confidence level for at least one low-quality category meets a first preset condition among the aforementioned at least two video segments. This method utilizes the self-attention mechanism network in the video understanding model to perform end-to-end discrimination of the video to be detected, thereby determining whether the video to be detected is a low-quality video. By dividing the video to be detected into video segments and performing end-to-end discrimination to determine whether the video to be detected is a low-quality video, it can not only effectively improve video review efficiency and reduce manual review costs, but also improve the recall rate and detection accuracy of low-quality videos.
[0070] In addition, this method uses a self-attention mechanism network to perform end-to-end discrimination of the video to be detected. It can also extract features of multiple low-quality categories from the same video segment at the same time when reviewing or detecting the video to be detected, covering more review dimensions and better dealing with various complex problems in real-world scenarios, with good robustness.
[0071] Compared to current methods of video quality review, the low-quality video detection method provided in this disclosure has lower overhead, higher operating efficiency, and lower failure rate.
[0072] For example, the low-quality video detection method provided in this disclosure can effectively recall short videos with low-quality problems from a massive amount of short videos.
[0073] Optionally, in embodiments of this disclosure, the method may further include the step of training a video understanding model.
[0074] For example, Figure 3 is another schematic flowchart of the low-quality video detection method provided in an embodiment of this disclosure. As shown in Figure 3, the low-quality video detection method may further include:
[0075] S301. Obtain at least two sample video segments obtained from the segmentation of the sample video, and the low-quality category label corresponding to each sample video segment.
[0076] For example, sample videos can be collected from real-world business scenarios. In general, the more sample videos there are, the better the video understanding model tends to perform.
[0077] The method for dividing the sample video into sample video segments can refer to the aforementioned method of dividing the video to be detected into at least two video segments, and will not be repeated here.
[0078] The low-quality category labels corresponding to the sample video segments can be annotated manually. The low-quality category labels correspond to the low-quality categories mentioned in the previous embodiments. For example, if the low-quality categories include prolonged PPT template switching, prolonged static images, poor clarity, blurry images, black screens, and image jitter, the low-quality category labels can correspond one-to-one with these categories.
[0079] Optionally, each sample video segment may have one or more low-quality category labels.
[0080] In this embodiment of the disclosure, annotating video segments with low-quality category labels has a low annotation cost and is relatively easy to implement.
[0081] S302. Construct a video understanding network based on the self-attention mechanism network and the classification network.
[0082] For example, the video understanding network can be as shown in Figure 2 described above, and the details are not repeated here.
[0083] S303. Using at least two sample video segments and the low-quality category labels corresponding to each sample video segment, train the video understanding network to obtain the video understanding model.
[0084] For example, for each sample video segment, the sample video segment can be used as the input to the video understanding network, and the low-quality category label corresponding to the sample video segment can be used as the expected output of the video understanding network. The video understanding network is trained until it converges (e.g., the loss is less than a preset loss value). The converged video understanding network is the video understanding model.
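The train-until-convergence loop in S303 can be illustrated with a deliberately small stand-in. The sketch below is not the self-attention network of this disclosure: it trains a single linear layer with sigmoid outputs on hypothetical, precomputed per-segment feature vectors, using binary cross-entropy as an assumed multi-label loss, and stops once the loss falls below a preset value. All names, the learning rate, and the toy data are assumptions.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict(weights, biases, x):
    """Per-category confidences for one feature vector."""
    return [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def train_until_converged(features, labels, preset_loss=0.1, lr=0.5, max_epochs=5000):
    """Gradient descent on mean binary cross-entropy until loss < preset_loss."""
    n_feat, n_cat = len(features[0]), len(labels[0])
    weights = [[0.0] * n_feat for _ in range(n_cat)]
    biases = [0.0] * n_cat
    loss = float("inf")
    for _ in range(max_epochs):
        grad_w = [[0.0] * n_feat for _ in range(n_cat)]
        grad_b = [0.0] * n_cat
        total = 0.0
        for x, y in zip(features, labels):
            probs = predict(weights, biases, x)
            for j, (p, target) in enumerate(zip(probs, y)):
                p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
                total -= target * math.log(p) + (1 - target) * math.log(1.0 - p)
                err = p - target  # gradient of the loss w.r.t. the logit
                grad_b[j] += err
                for k, v in enumerate(x):
                    grad_w[j][k] += err * v
        loss = total / (len(features) * n_cat)
        if loss < preset_loss:  # convergence criterion: loss below preset value
            break
        for j in range(n_cat):
            biases[j] -= lr * grad_b[j] / len(features)
            for k in range(n_feat):
                weights[j][k] -= lr * grad_w[j][k] / len(features)
    return weights, biases, loss


# Toy data: 2 features per segment, 2 low-quality categories (multi-label).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
weights, biases, final_loss = train_until_converged(features, labels)
```

Each sample may carry several labels at once, which is why the sketch scores every category independently with a sigmoid instead of using a single softmax over categories.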
[0085] This embodiment can train the video understanding model described in the foregoing embodiments, which can accurately predict the prediction confidence of each video segment belonging to each low-quality category.
[0086] In some embodiments, the video understanding network also includes a feature selection network. For example, taking the case where the video to be detected is divided into N video segments, and the low-quality categories include M categories, Figure 4 This is another schematic diagram illustrating the principle of the video understanding network provided in this disclosure. For example... Figure 4 As shown above, in the above Figure 2 Based on the illustrated embodiment, the video understanding network also includes a feature selection network.
[0087] In this embodiment, after the self-attention mechanism network extracts features of the video segments in the dimensions corresponding to each of the M low-quality categories (from low-quality category 1 to low-quality category M), the feature filtering network can first filter the features of the video segments in the dimensions corresponding to each of the M low-quality categories to obtain a filtered feature map. The classification network can then predict the prediction confidence of each video segment belonging to each of the M low-quality categories (from low-quality category 1 to low-quality category M) based on the filtered feature map.
[0088] The processing flow of a video understanding model that includes a feature selection network is described below with reference to Figure 5. For example, Figure 5 is a schematic diagram illustrating the processing flow of the video understanding model provided in an embodiment of this disclosure. As shown in Figure 5, the step of predicting each video segment using a preset video understanding model to obtain the prediction confidence of each video segment belonging to each low-quality category may include:
[0089] S501. Each video segment is encoded and mapped into a first embedding vector, a second embedding vector, and a third embedding vector.
[0090] For example, for each video clip, the video clip can be input into the video understanding model. The video understanding model can encode and map the input video clip into three intermediate embedding vectors: Q, K, and V, where Q is the first embedding vector, K is the second embedding vector, and V is the third embedding vector.
[0091] S502. Features are extracted from the first embedding vector, the second embedding vector, and the third embedding vector corresponding to each video segment through the self-attention mechanism network to obtain the feature maps corresponding to the first embedding vector, the second embedding vector, and the third embedding vector, respectively.
[0092] For example, a multi-path attention mechanism network can be constructed using the attention calculation formula to obtain a self-attention mechanism network. The self-attention mechanism network can learn the feature extraction capability of each dimension corresponding to each low-quality category, and perform feature extraction based on the first embedding vector, the second embedding vector, and the third embedding vector corresponding to the video segment to obtain the feature maps corresponding to the first embedding vector, the second embedding vector, and the third embedding vector, respectively.
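The disclosure refers to "the attention calculation formula" without writing it out. Assuming it means the widely used scaled dot-product attention, the computation on the embeddings $Q$, $K$, and $V$ from S501 would take the following form, where $d_k$ (the dimension of the key vectors) is a symbol introduced here for illustration only:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```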
[0093] S503. The feature maps corresponding to the first embedding vector, the second embedding vector, and the third embedding vector of each video segment are concatenated to obtain the fused feature map corresponding to each video segment.
[0094] For each video segment, after obtaining the feature maps corresponding to the first embedding vector, the second embedding vector, and the third embedding vector, the feature maps corresponding to the first embedding vector, the second embedding vector, and the third embedding vector of the video segment can be concatenated to obtain the fused feature map of the video segment, and the fused feature map is sent to the feature selection network.
[0095] S504. Obtain the fourth, fifth, and sixth embedding vectors corresponding to the fused feature map of each video segment through the feature selection network, and filter the fused feature map according to the fourth, fifth, and sixth embedding vectors to obtain the filtered feature map corresponding to each video segment.
[0096] After the fused feature map is fed into the feature selection network, the network can again calculate three intermediate embedding vectors: Q, K, and V. Here, Q can be called the fourth embedding vector, K the fifth, and V the sixth. After obtaining these vectors, the network can use the attention formula to calculate a feature relation matrix. Then, based on the weight coefficients of the feature relation matrix, the fused feature map is filtered to obtain a filtered feature map. For example, features with weight coefficients greater than a preset weight threshold can be retained. The size of the weight threshold is not limited here.
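One possible reading of this filtering step is sketched below. The disclosure does not say how a weight coefficient is assigned to each feature, so several assumptions are made: the fused feature map is treated as a list of feature vectors, identity projections are used so that Q = K = V, the relation matrix is the softmax of the scaled dot products, and the weight of a feature is the average attention it receives. The function names and toy values are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def relation_matrix(q, k):
    """softmax(Q K^T / sqrt(d)), computed row by row."""
    scale = math.sqrt(len(q[0]))
    return [
        softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        for qi in q
    ]

def filter_features(fused, weight_threshold):
    """Keep the feature vectors whose average attention weight exceeds the threshold."""
    # Assumption: identity projections, so Q = K = V = the fused feature map.
    q = k = v = fused
    rel = relation_matrix(q, k)
    n = len(fused)
    # Assumed weight of feature j: mean attention it receives from all features.
    weights = [sum(rel[i][j] for i in range(n)) / n for j in range(n)]
    kept = [v[j] for j in range(n) if weights[j] > weight_threshold]
    return kept, weights

fused_feature_map = [[2.0, 0.0], [2.0, 0.1], [0.0, 0.2]]  # toy values
filtered, weights = filter_features(fused_feature_map, weight_threshold=0.3)
```

In a trained network the three projections would be learned rather than fixed to the identity, and the retained features would then be passed to the classification network in S505.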
[0097] S505. The filtered feature map corresponding to each video segment is predicted by the classification network to obtain the prediction confidence of each video segment belonging to each low-quality category.
[0098] For S505, refer to the explanation of S102 above; the details are not repeated here.
[0099] The video understanding model in this embodiment can effectively identify features of multiple dimensions in the video to be detected, thereby improving the recognition speed and accuracy of low-quality videos. Furthermore, by adding a feature selection network to the video understanding network, this embodiment can select more effective features for the classification network to predict from, further improving video review efficiency, recall rate, and detection accuracy of low-quality videos.
[0100] Figure 6 is another flowchart illustrating the low-quality video detection method provided in this embodiment of the disclosure.
[0101] As shown in Figure 6, in some embodiments, the low-quality video detection method may further include:
[0102] S601. Based on the predicted confidence of each video segment belonging to each low-quality category, determine the weighted confidence of each low-quality category.
[0103] For example, taking video segments including video segments 1 to N and low-quality categories including low-quality categories 1 to M as an example, for each low-quality category from low-quality category 1 to low-quality category M, the average of the predicted confidence scores of video segments 1 to N belonging to that low-quality category can be calculated as the weighted confidence score of that low-quality category.
[0104] For example, suppose there are video clips 1, 2, and 3. The prediction confidence of video clip 1 belonging to low-quality category 1 is 70%, that of video clip 2 belonging to low-quality category 1 is 80%, and that of video clip 3 belonging to low-quality category 1 is 90%. Then the weighted confidence of low-quality category 1 is 80%. Similarly, the weighted confidence of each low-quality category can be obtained.
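The averaging in S601 can be sketched as follows. This is a minimal illustration that uses the plain arithmetic mean described above; the function name `weighted_confidences` and the nested-list layout are assumptions made for the example.

```python
def weighted_confidences(segment_confidences):
    """Average each category's confidence over all segments.

    segment_confidences[n][m] is the predicted confidence that segment n
    belongs to low-quality category m.
    """
    num_segments = len(segment_confidences)
    num_categories = len(segment_confidences[0])
    return [
        sum(seg[m] for seg in segment_confidences) / num_segments
        for m in range(num_categories)
    ]

# Low-quality category 1 confidences from the example above: 70%, 80%, 90%.
category_1_only = [[0.70], [0.80], [0.90]]
print(round(weighted_confidences(category_1_only)[0], 2))  # 0.8
```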
[0105] S602. Based on the weighted confidence of each low-quality category, determine the low-quality category whose weighted confidence satisfies the second preset condition as the low-quality category present in the video to be detected.
[0106] In some implementations, the second preset condition may include: the weighted confidence is greater than (or equal to) a preset weighted confidence threshold. For example, the preset weighted confidence threshold may be 80%, 90%, etc. There is no restriction on the size of the preset weighted confidence threshold here.
[0107] Taking a preset weighted confidence threshold of 90% as an example, when the weighted confidence of a certain low-quality category is greater than 90%, it can be determined that the low-quality category exists in the video to be detected.
[0108] In other implementations, the second preset condition may also include: the weighted confidence is among the largest second proportion of all weighted confidences. For example, the second proportion could be 30%, 20%, etc., and there is no limitation here.
[0109] Taking the second proportion of 20% as an example, when the weighted confidence of a certain low-quality category is among the largest 20% of all weighted confidences, then this low-quality category can be regarded as the low-quality category of the video to be detected.
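The two variants of the second preset condition can be sketched side by side. The function names are hypothetical, categories are numbered from 1 as in the text, and rounding the number of retained categories up (so that at least one is kept) is an assumption, since the disclosure does not say how a fractional count is handled.

```python
import math

def categories_above_threshold(weighted, threshold):
    """Variant 1: weighted confidence greater than a preset threshold."""
    return [m for m, w in enumerate(weighted, start=1) if w > threshold]

def categories_in_top_proportion(weighted, proportion):
    """Variant 2: weighted confidence among the largest `proportion` of all values."""
    # Assumption: round the count up so that at least one category is kept.
    keep = max(1, math.ceil(len(weighted) * proportion))
    ranked = sorted(range(1, len(weighted) + 1),
                    key=lambda m: weighted[m - 1], reverse=True)
    return sorted(ranked[:keep])

# Illustrative weighted confidences for low-quality categories 1 to 5.
weighted = [0.95, 0.40, 0.91, 0.10, 0.30]
print(categories_above_threshold(weighted, 0.90))    # [1, 3]
print(categories_in_top_proportion(weighted, 0.20))  # [1]
```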
[0110] Understandably, one or more low-quality categories may be present in the video to be detected.
[0111] This embodiment determines the weighted confidence level of each low-quality category based on the predicted confidence level of each video segment belonging to each low-quality category. Based on the weighted confidence level of each low-quality category, the low-quality category whose weighted confidence level satisfies the second preset condition is determined as the low-quality category of the video to be detected. This can comprehensively consider each video segment to determine the low-quality category of the video to be detected, and can accurately detect the low-quality category of the video to be detected.
[0112] Figure 7 is another flowchart illustrating the low-quality video detection method provided in this embodiment of the disclosure.
[0113] As shown in Figure 7, in some other embodiments, the low-quality video detection method may further include:
[0114] S701. Based on the predicted confidence of each video segment belonging to each low-quality category, determine the low-quality category whose predicted confidence is greater than the first threshold.
[0115] For example, the first threshold can be 50%, 60%, etc., and the size of the first threshold can be set according to business needs. Here, there is no limitation on the size of the first threshold.
[0116] Taking a first threshold of 50% as an example, for each video segment, based on the predicted confidence level of the video segment belonging to each low-quality category, low-quality categories with a predicted confidence level less than or equal to 50% can be excluded. The remaining low-quality categories are those with a predicted confidence level greater than 50% for that video segment. By filtering each video segment, all low-quality categories with a predicted confidence level greater than the first threshold can be identified.
[0117] S702. Among the low-quality categories with a prediction confidence greater than the first threshold, the low-quality categories whose number of corresponding video segments is greater than the second threshold are identified as the low-quality categories present in the video to be detected.
[0118] After the low-quality categories with a prediction confidence greater than the first threshold are obtained, the number of video segments corresponding to each of these low-quality categories can be counted, and the low-quality categories whose number of video segments is greater than the second threshold can be selected as the low-quality categories present in the video to be detected.
[0119] The second threshold can be 2, 5, 8, etc., and there is no restriction on the size of the second threshold.
[0120] For example, suppose there are video clips 1, 2, 3, 4, and 5, and the low-quality categories include low-quality category 1, low-quality category 2, and low-quality category 3. The first threshold is 85%, and the second threshold is 2.
[0121] Among them, the prediction confidence of video clip 1 belonging to low quality category 1 is 90%, the prediction confidence of video clip 1 belonging to low quality category 2 is 40%, and the prediction confidence of video clip 1 belonging to low quality category 3 is 90%.
[0122] The prediction confidence level for video clip 2 belonging to low-quality category 1 is 40%, the prediction confidence level for video clip 2 belonging to low-quality category 2 is 50%, and the prediction confidence level for video clip 2 belonging to low-quality category 3 is 88%.
[0123] The prediction confidence level for video clip 3 belonging to low-quality category 1 is 30%, the prediction confidence level for video clip 3 belonging to low-quality category 2 is 90%, and the prediction confidence level for video clip 3 belonging to low-quality category 3 is 90%.
[0124] The prediction confidence for video clip 4 belonging to low-quality category 1 is 45%, the prediction confidence for video clip 4 belonging to low-quality category 2 is 70%, and the prediction confidence for video clip 4 belonging to low-quality category 3 is 95%.
[0125] The prediction confidence level for video clip 5 belonging to low-quality category 1 is 60%, the prediction confidence level for video clip 5 belonging to low-quality category 2 is 65%, and the prediction confidence level for video clip 5 belonging to low-quality category 3 is 90%.
[0126] Therefore, in S701, based on the predicted confidence level of video segment 1 belonging to each of the low-quality categories 1 to 3, low-quality category 2 with a predicted confidence level less than or equal to 85% is excluded, leaving low-quality categories 1 and 3. Based on the predicted confidence level of video segment 2 belonging to each of the low-quality categories 1 to 3, low-quality categories 1 and 2 with a predicted confidence level less than or equal to 85% are excluded, leaving low-quality category 3. Based on the predicted confidence level of video segment 3 belonging to each of the low-quality categories 1 to 3, low-quality category 1 with a predicted confidence level less than or equal to 85% is excluded, leaving low-quality categories 2 and 3. Based on the predicted confidence level of video segment 4 belonging to each of the low-quality categories 1 to 3, low-quality categories 1 and 2 with a predicted confidence level less than or equal to 85% are excluded, leaving low-quality category 3. Based on the predicted confidence of video clip 5 belonging to each of the low-quality categories 1 to 3, low-quality categories 1 and 2 with a predicted confidence of less than or equal to 85% are excluded, leaving low-quality category 3.
[0127] In S701, the low-quality categories with a prediction confidence greater than 85% include: low-quality category 1 and low-quality category 3 corresponding to video clip 1, low-quality category 3 corresponding to video clip 2, low-quality category 2 and low-quality category 3 corresponding to video clip 3, low-quality category 3 corresponding to video clip 4, and low-quality category 3 corresponding to video clip 5.
[0128] Based on the execution result of S701, it can be seen that among the low-quality categories with a prediction confidence greater than 85%, the video segments corresponding to low-quality category 1 include video segment 1, with a quantity of 1; the video segments corresponding to low-quality category 2 include video segment 3, with a quantity of 1; and the video segments corresponding to low-quality category 3 include video segment 1, video segment 2, video segment 3, video segment 4, and video segment 5, with a quantity of 5. Therefore, in S702, it can be determined that low-quality category 3, whose number of video segments (5) is greater than the second threshold of 2, is the low-quality category present in the video to be detected.
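The worked example for S701 and S702 can be reproduced with a short sketch. The function name `detect_categories` and the row-per-clip matrix layout are assumptions for illustration; the confidence values and both thresholds are taken from the example above.

```python
from collections import Counter

def detect_categories(segment_confidences, first_threshold, second_threshold):
    """S701 + S702: count, per category, the segments above first_threshold,
    then keep the categories whose count exceeds second_threshold."""
    counts = Counter()
    for confidences in segment_confidences:
        for category, confidence in enumerate(confidences, start=1):
            if confidence > first_threshold:  # S701
                counts[category] += 1
    detected = sorted(c for c, n in counts.items() if n > second_threshold)  # S702
    return detected, dict(sorted(counts.items()))

# Rows: video clips 1 to 5. Columns: low-quality categories 1 to 3.
confidences = [
    [0.90, 0.40, 0.90],
    [0.40, 0.50, 0.88],
    [0.30, 0.90, 0.90],
    [0.45, 0.70, 0.95],
    [0.60, 0.65, 0.90],
]
detected, counts = detect_categories(confidences, first_threshold=0.85,
                                     second_threshold=2)
print(counts)    # {1: 1, 2: 1, 3: 5}
print(detected)  # [3]
```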
[0129] This embodiment determines low-quality categories with a predicted confidence level greater than a first threshold based on the predicted confidence level of each video segment belonging to each low-quality category. Among the low-quality categories with a predicted confidence level greater than the first threshold, the low-quality categories with a corresponding number of video segments greater than a second threshold are identified as the low-quality categories present in the video to be detected. This can also more accurately identify the low-quality categories present in the video to be detected and output richer and more accurate low-quality video detection results.
[0130] In some embodiments, S101 may include: dividing the video to be detected into at least two video segments according to a target duration, wherein the target duration is related to the duration of the video to be detected.
[0131] The target duration can be 10 seconds, 16 seconds, 18 seconds, etc. The size of the target duration can be set according to business needs (such as detection speed requirements, detection accuracy requirements, etc.). There is no limit to the size of the target duration here.
[0132] For example, if the target duration is 16 seconds, and the video to be detected is 160 seconds long, the video to be detected can be divided into 10 video segments.
[0133] Optionally, when dividing the video to be detected into video segments according to the target duration, if the duration of the video to be detected is not an integer multiple of the target duration, the leftover portion can be discarded during the segmentation process. For example, with a target duration of 16 seconds, a video to be detected that is 165 seconds long can be divided into 10 video segments, and the first or last 5 seconds of the video can be discarded.
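A minimal sketch of duration-based segmentation, covering both examples above, is shown below. The function name `split_by_duration` and the `drop` parameter are hypothetical; the sketch only computes segment boundaries in seconds and does not decode any video.

```python
def split_by_duration(video_duration, target_duration, drop="tail"):
    """Return (start, end) times in seconds for each full-length segment.

    Any remainder shorter than target_duration is discarded, either from the
    end of the video (drop="tail") or from its beginning (drop="head").
    """
    if target_duration <= 0:
        raise ValueError("target_duration must be positive")
    count = int(video_duration // target_duration)
    remainder = video_duration - count * target_duration
    offset = remainder if drop == "head" else 0
    return [
        (offset + i * target_duration, offset + (i + 1) * target_duration)
        for i in range(count)
    ]

print(len(split_by_duration(160, 16)))             # 10
print(split_by_duration(165, 16, drop="head")[0])  # (5, 21)
```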
[0134] The target duration can be related to the duration of the video to be detected. When the duration of the video to be detected is long, the target duration can be adaptively adjusted to a larger value; when the duration of the video to be detected is short, the target duration can be adaptively adjusted to a smaller value. Specific adjustment rules are not limited herein.
[0135] This embodiment divides the video to be detected into at least two video segments according to a target duration. The target duration is related to the duration of the video to be detected. This allows for dynamic adjustment of the number of video segments or the duration of video segments based on the video to be detected, thereby further improving the accuracy and speed of low-quality video detection.
[0136] Optionally, in this embodiment of the present disclosure, S101 may also include: dividing the video to be detected into at least two video segments according to the target frame number, wherein the target frame number is related to the duration of the video to be detected.
[0137] For example, the target number of frames can be 10 frames, 16 frames, etc., and the size of the target number of frames can also be set according to business requirements (such as detection speed requirements, detection accuracy requirements, etc.). There is no limit to the size of the target number of frames here.
[0138] For example, taking a target frame count of 16 frames as an example, for a video to be detected with 80 frames of images, each 16 frames can be taken as a video segment, resulting in 5 video segments.
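Frame-count segmentation can be sketched in the same way. The function name is hypothetical, and discarding leftover frames when the total is not a multiple of the target frame count is an assumption that mirrors the duration-based case; the disclosure does not address it for frames.

```python
def split_by_frame_count(frames, target_frames):
    """Group consecutive frames into segments of exactly target_frames frames."""
    if target_frames <= 0:
        raise ValueError("target_frames must be positive")
    # Assumption: leftover frames that do not fill a whole segment are dropped.
    full = len(frames) - len(frames) % target_frames
    return [frames[i:i + target_frames] for i in range(0, full, target_frames)]

# 80 frames (represented here by their indices) with 16 frames per segment.
segments = split_by_frame_count(list(range(80)), 16)
print(len(segments))  # 5
```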
[0139] This disclosure does not impose any restrictions on the specific method of dividing the video to be detected into at least two video segments.
[0140] Optionally, the video to be detected is obtained by sampling frames from the target video to be detected according to the target frame sampling frequency.
[0141] For example, the target frame sampling frequency can be 2 seconds / frame, 3 seconds / frame, etc. (or the target frame sampling frequency can also be 2 frames / second, 3 frames / second, etc.). The size of the target frame sampling frequency can also be set according to business requirements (such as detection speed requirements, detection accuracy requirements, etc.). Here, there is no limitation on the size of the target frame sampling frequency.
[0142] Taking a target frame sampling frequency of 2 seconds / frame as an example, when the target video to be detected is 160 seconds long, one frame can be sampled every 2 seconds to obtain 80 frames of the video to be detected.
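The sampling arithmetic in this example can be checked with a small sketch. The function name `sample_timestamps` is hypothetical, and taking the first sample at t = 0 is an assumption; the sketch returns timestamps only and leaves frame decoding to whatever video library is in use.

```python
def sample_timestamps(video_duration, seconds_per_frame):
    """Timestamps (in seconds) at which one frame is taken."""
    if seconds_per_frame <= 0:
        raise ValueError("seconds_per_frame must be positive")
    count = int(video_duration // seconds_per_frame)
    # Assumption: the first sample is taken at t = 0.
    return [i * seconds_per_frame for i in range(count)]

timestamps = sample_timestamps(160, 2)
print(len(timestamps))  # 80
```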
[0143] In some embodiments, before predicting each video segment using a preset video understanding model, the low-quality video detection method may further include: filtering at least two video segments according to a preset segment extraction number or a preset ratio.
[0144] For example, the preset number of segments to be extracted can be 10, 15, 20, etc. The preset ratio can be 50%, 60%, etc. This disclosure does not limit the size of the preset number of segments to be extracted or the preset ratio.
[0145] In some implementations, at least two video segments can be selected according to a preset number of segments to be extracted.
[0146] For example, taking 20 video clips obtained in S101 and a preset clip extraction quantity of 10, before predicting each video clip using the preset video understanding model, 10 video clips can be selected from the 20 video clips. Then, the 10 selected video clips are fed into the video understanding model for prediction.
[0147] Optionally, the method of filtering at least two video segments according to the preset number of segments to be extracted may include random filtering, sequential filtering, or filtering by interval, etc., and there is no limitation here.
[0148] In other implementations, at least two video clips can be selected according to a preset ratio.
[0149] For example, taking 16 video clips obtained in S101 with a preset ratio of 50%, before predicting each video clip using the preset video understanding model, 8 video clips can be selected from the 16 video clips according to the 50% ratio. Then, the 8 selected video clips are respectively fed into the video understanding model for prediction.
[0150] Similarly, video clips of a preset proportion can be filtered by random selection, sequential selection, or interval selection.
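The selection strategies named above (random, sequential, and by interval) can be sketched in one helper that accepts either a preset count or a preset ratio. The function name, the `mode` and `seed` parameters, and rounding the ratio-derived count down are all assumptions for illustration.

```python
import math
import random

def select_segments(segments, count=None, ratio=None, mode="interval", seed=None):
    """Pick a subset of segments by preset count or preset ratio.

    mode: "random", "sequential" (first k segments) or "interval" (evenly spaced).
    """
    if (count is None) == (ratio is None):
        raise ValueError("give exactly one of count or ratio")
    if ratio is not None:
        count = math.floor(len(segments) * ratio)  # assumption: round down
    count = min(count, len(segments))
    if mode == "sequential":
        return segments[:count]
    if mode == "random":
        # Sample indices without replacement, then restore temporal order.
        picked = sorted(random.Random(seed).sample(range(len(segments)), count))
        return [segments[i] for i in picked]
    if mode == "interval":
        if count == 0:
            return []
        step = len(segments) / count
        return [segments[int(i * step)] for i in range(count)]
    raise ValueError(f"unknown mode: {mode}")

# 20 segments with a preset extraction number of 10, taken at even intervals.
print(select_segments(list(range(20)), count=10))
# [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```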
[0151] Before predicting each video segment using a preset video understanding model, this embodiment filters at least two video segments according to a preset number of segments or a preset ratio. This allows for dynamic setting of the number of video segments to be predicted, further improving the efficiency (or speed) of low-quality video detection.
[0152] In some embodiments, before predicting each video segment using a preset video understanding model, the low-quality video detection method may further include: scaling each video segment to a preset size and adjusting it to a preset format.
[0153] The preset size is the image size that matches the input size of the video understanding network.
[0154] For example, the preset size can be 224*224, and the preset format can be RGB format. For instance, each frame of a video clip can be adjusted to an RGB image with a size of 224*224.
[0155] Alternatively, the video to be detected can be scaled to a preset size and adjusted to a preset format before segmenting the video.
[0156] In this embodiment, before predicting each video segment using a preset video understanding model, each video segment is scaled to a preset size and adjusted to a preset format. This can improve the processing speed of the video understanding model, thereby improving the efficiency of low-quality video detection.
[0157] By way of example, this disclosure also provides a method for training a video understanding model. Figure 8 is a flowchart illustrating the video understanding model training method provided in this embodiment of the disclosure. As shown in Figure 8, the video understanding model training method may include:
[0158] S801. Obtain at least two sample video segments obtained from the segmentation of the sample video, and the low-quality category label corresponding to each sample video segment.
[0159] S802. Construct a video understanding network based on a self-attention mechanism network and a classification network.
[0160] S803. Using at least two sample video segments and the low-quality category labels corresponding to each sample video segment, train the video understanding network to obtain the video understanding model.
[0161] S801-S803 can refer to the training process of the video understanding model described in the foregoing embodiments, and will not be repeated here.
[0162] Among them, the video understanding model is used to detect low quality in the video to be detected, the self-attention mechanism network is used to extract the features of the video segment in each dimension corresponding to each low quality category, and the classification network is used to predict the prediction confidence of each video segment belonging to each low quality category based on the features extracted by the self-attention mechanism network.
[0163] Optionally, the video understanding network also includes a feature selection network; the feature selection network is used to select the features extracted by the self-attention mechanism network before the classification network predicts the prediction confidence of each video segment belonging to each low-quality category based on the features extracted by the self-attention mechanism network.
[0164] In an exemplary embodiment, this disclosure also provides a low-quality video detection apparatus, which can be used to implement the low-quality video detection method as described in the foregoing embodiments. Figure 9 is a schematic diagram illustrating the composition of a low-quality video detection device provided in an embodiment of this disclosure. As shown in Figure 9, the low-quality video detection device may include: a video processing unit 901, a video understanding unit 902, and a low-quality recognition unit 903.
[0165] The video processing unit 901 is used to divide the video to be detected into at least two video segments.
[0166] The video understanding unit 902 is used to predict each video segment using a preset video understanding model, and obtain the prediction confidence of each video segment belonging to each low-quality category. The low-quality categories include at least two. The video understanding model is trained using a video understanding network, which includes a self-attention mechanism network and a classification network. The self-attention mechanism network is used to extract features of the video segment in the dimension corresponding to each low-quality category, and the classification network is used to predict the prediction confidence of each video segment belonging to each low-quality category based on the features extracted by the self-attention mechanism network.
[0167] The low-quality identification unit 903 is used to determine the quality review result of the video to be detected based on the proportion of the target video segment in at least two video segments. The target video segment is a video segment in at least two video segments whose predicted confidence of at least one low-quality category meets a first preset condition. The quality review result includes: the video to be detected is a low-quality video or a non-low-quality video.
[0168] Optionally, the video understanding unit 902 is further configured to: acquire at least two sample video segments obtained by dividing the sample video, and low-quality category labels corresponding to each sample video segment; construct a video understanding network based on a self-attention mechanism network and a classification network; and train the video understanding network using at least two sample video segments and low-quality category labels corresponding to each sample video segment to obtain a video understanding model.
[0169] Optionally, the video understanding network further includes a feature selection network; the video understanding unit 902 is specifically used for: encoding and mapping each video segment into a first embedding vector, a second embedding vector, and a third embedding vector; extracting features from the first embedding vector, the second embedding vector, and the third embedding vector corresponding to each video segment through a self-attention mechanism network to obtain feature maps corresponding to the first embedding vector, the second embedding vector, and the third embedding vector respectively; concatenating the feature maps corresponding to the first embedding vector, the second embedding vector, and the third embedding vector corresponding to each video segment to obtain a fused feature map corresponding to each video segment; obtaining the fourth embedding vector, the fifth embedding vector, and the sixth embedding vector corresponding to the fused feature map corresponding to each video segment through the feature selection network, and selecting the fused feature map according to the fourth embedding vector, the fifth embedding vector, and the sixth embedding vector to obtain a selected feature map corresponding to each video segment; and predicting the selected feature map corresponding to each video segment through a classification network to obtain the prediction confidence of each video segment belonging to each low-quality category.
[0170] Optionally, the low-quality identification unit 903 is further configured to: determine the weighted confidence of each low-quality category based on the predicted confidence of each video segment belonging to each low-quality category; and determine the low-quality category whose weighted confidence satisfies a second preset condition as the low-quality category in the video to be detected based on the weighted confidence of each low-quality category.
[0171] Optionally, the low-quality identification unit 903 is further configured to: determine the low-quality category whose predicted confidence is greater than a first threshold based on the predicted confidence of each video segment belonging to each low-quality category; and identify the low-quality category in which the number of corresponding video segments is greater than a second threshold among the low-quality categories whose predicted confidence is greater than the first threshold as the low-quality category in which the video to be detected exists.
[0172] Optionally, the video processing unit 901 is specifically used to: divide the video to be detected into at least two video segments according to a target duration, wherein the target duration is related to the duration of the video to be detected.
[0173] Optionally, the video processing unit 901 is further configured to filter at least two video segments according to a preset segment extraction number or a preset ratio before the video understanding unit 902 makes a prediction on each video segment using a preset video understanding model.
[0174] Optionally, the video processing unit 901 is further configured to scale each video segment to a preset size and adjust it to a preset format before the video understanding unit 902 makes a prediction on each video segment using a preset video understanding model.
[0175] In an exemplary embodiment, this disclosure also provides a video understanding model training apparatus, which can be used to implement the video understanding model training method as described in the foregoing embodiments. Figure 10 is a schematic diagram illustrating the composition of a video understanding model training device provided in an embodiment of this disclosure. As shown in Figure 10, the video understanding model training device may include: an acquisition unit 1001, a construction unit 1002, and a training unit 1003.
[0176] The acquisition unit 1001 is used to acquire at least two sample video segments obtained by dividing the sample video, and the low-quality category label corresponding to each sample video segment.
[0177] The construction unit 1002 is used to construct a video understanding network based on a self-attention mechanism network and a classification network.
[0178] The training unit 1003 is used to train the video understanding network using at least two sample video segments and the low-quality category label corresponding to each sample video segment to obtain a video understanding model. The video understanding model is used to detect low quality in the video to be detected. The self-attention mechanism network is used to extract the features of the video segments in each dimension corresponding to each low-quality category. The classification network is used to predict the prediction confidence of each video segment belonging to each low-quality category based on the features extracted by the self-attention mechanism network.
[0179] Optionally, the video understanding network also includes a feature selection network; the feature selection network is used to select the features extracted by the self-attention mechanism network before the classification network predicts the prediction confidence of each video segment belonging to each low-quality category based on the features extracted by the self-attention mechanism network.
[0180] The acquisition, storage, and application of user personal information involved in the technical solution disclosed herein comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
[0181] According to embodiments of this disclosure, this disclosure also provides an electronic device, a readable storage medium, and a computer program product.
[0182] In an exemplary embodiment, an electronic device includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method described in the above embodiments. The electronic device may be the computer or server described above.
[0183] In an exemplary embodiment, the readable storage medium may be a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method described in the above embodiments.
[0184] In an exemplary embodiment, the computer program product includes a computer program that, when executed by a processor, implements the method described in the above embodiments.
[0185] Figure 11 shows a schematic block diagram of an example electronic device 1100 that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and / or claimed herein.
[0186] As shown in Figure 11, the electronic device 1100 includes a computing unit 1101, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1102 or a computer program loaded from a storage unit 1108 into a random access memory (RAM) 1103. The RAM 1103 may also store various programs and data required for the operation of the electronic device 1100. The computing unit 1101, ROM 1102, and RAM 1103 are interconnected via a bus 1104. An input / output (I / O) interface 1105 is also connected to the bus 1104.
[0187] Multiple components in electronic device 1100 are connected to I / O interface 1105, including: input unit 1106, such as keyboard, mouse, etc.; output unit 1107, such as various types of displays, speakers, etc.; storage unit 1108, such as disk, optical disk, etc.; and communication unit 1109, such as network card, modem, wireless transceiver, etc. Communication unit 1109 allows electronic device 1100 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.
[0188] The computing unit 1101 can be various general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 1101 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1101 performs the various methods and processes described above, such as low-quality video detection methods or video understanding model training methods. For example, in some embodiments, the low-quality video detection method or video understanding model training method can be implemented as a computer software program tangibly contained in a machine-readable medium, such as storage unit 1108. In some embodiments, part or all of the computer program can be loaded and / or installed on the electronic device 1100 via ROM 1102 and / or communication unit 1109. When the computer program is loaded into RAM 1103 and executed by the computing unit 1101, one or more steps of the low-quality video detection method or video understanding model training method described above can be performed. Alternatively, in other embodiments, computing unit 1101 may be configured by any other suitable means (e.g., by means of firmware) to perform low-quality video detection methods or video understanding model training methods.
[0189] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.
[0190] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.
[0191] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
[0192] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).
[0193] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with embodiments of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.
[0194] Computer systems can include clients and servers. Clients and servers are generally located far apart and typically interact via communication networks. Client-server relationships are created by computer programs running on the respective computers and having a client-server relationship with each other. Servers can be cloud servers, servers in distributed systems, or servers incorporating blockchain technology.
[0195] It should be understood that steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in this disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the disclosed technical solution is achieved; no limitation is imposed herein.
[0196] The specific embodiments described above do not constitute a limitation on the scope of protection of this disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.
Claims
1. A method for detecting low-quality video, the method comprising: dividing a video to be detected into at least two video segments; predicting each video segment using a preset video understanding model to obtain the prediction confidence of each video segment belonging to each low-quality category, wherein there are at least two low-quality categories, and the video understanding model is obtained by training a video understanding network that includes a self-attention mechanism network, a feature selection network, and a classification network; the self-attention mechanism network is used to extract features of the video segment in each dimension corresponding to each low-quality category; the feature selection network is used to select among the features extracted by the self-attention mechanism network; and the classification network is used to predict the prediction confidence of each video segment belonging to each low-quality category based on the features selected by the feature selection network; and determining the quality review result of the video to be detected based on the proportion of target video segments among the at least two video segments, wherein a target video segment is a video segment, among the at least two video segments, for which the prediction confidence of at least one of the low-quality categories meets a first preset condition, and the quality review result is either that the video to be detected is a low-quality video or that it is a non-low-quality video.
2. The method according to claim 1, further comprising: Obtain at least two sample video segments obtained from the segmentation of the sample video, and the low-quality category label corresponding to each sample video segment; A video understanding network is constructed based on the self-attention mechanism network and the classification network; The video understanding network is trained using the at least two sample video segments and the low-quality category labels corresponding to each sample video segment to obtain the video understanding model.
3. The method according to claim 1 or 2, wherein predicting each video segment using a preset video understanding model to obtain the prediction confidence of each video segment belonging to each low-quality category includes: encoding and mapping each video segment into a first embedding vector, a second embedding vector, and a third embedding vector; using the self-attention mechanism network to extract features from the first embedding vector, the second embedding vector, and the third embedding vector corresponding to each video segment, so as to obtain feature maps corresponding to the first embedding vector, the second embedding vector, and the third embedding vector respectively; concatenating the feature maps corresponding to the first embedding vector, the second embedding vector, and the third embedding vector of each video segment to obtain the fused feature map corresponding to each video segment; using the feature selection network to obtain a fourth embedding vector, a fifth embedding vector, and a sixth embedding vector corresponding to the fused feature map of each video segment, and to filter the fused feature map according to the fourth embedding vector, the fifth embedding vector, and the sixth embedding vector to obtain the filtered feature map corresponding to each video segment; and using the classification network to make a prediction on the filtered feature map corresponding to each video segment, thereby obtaining the prediction confidence of each video segment belonging to each of the low-quality categories.
4. The method according to any one of claims 1-3, further comprising: determining, based on the prediction confidence of each video segment belonging to each of the low-quality categories, a weighted confidence for each of the low-quality categories; and determining the low-quality categories whose weighted confidence satisfies a second preset condition as the low-quality categories present in the video to be detected.
5. The method according to any one of claims 1-3, further comprising: determining, based on the prediction confidence of each video segment belonging to each of the low-quality categories, the low-quality categories whose prediction confidence is greater than a first threshold; and identifying, among the low-quality categories with a prediction confidence greater than the first threshold, those whose corresponding number of video segments is greater than a second threshold as the low-quality categories present in the video to be detected.
6. The method according to any one of claims 1-5, wherein dividing the video to be detected into at least two video segments comprises: The video to be detected is divided into at least two video segments according to the target duration, wherein the target duration is related to the duration of the video to be detected.
7. The method according to any one of claims 1-6, wherein before predicting each video segment using a preset video understanding model, the method further comprises: The at least two video segments are filtered according to a preset number of segments to be extracted or a preset ratio.
8. The method according to any one of claims 1-7, wherein before predicting each video segment using a preset video understanding model, the method further comprises: Each video segment is scaled to a preset size and adjusted to a preset format.
9. A method for training a video understanding model, the method comprising: obtaining at least two sample video segments produced by segmenting a sample video, and the low-quality category label corresponding to each sample video segment; constructing a video understanding network based on a self-attention mechanism network, a feature selection network, and a classification network; and training the video understanding network using the at least two sample video segments and the low-quality category label corresponding to each sample video segment to obtain a video understanding model; wherein the video understanding model is used to perform low-quality detection on a video to be detected; the self-attention mechanism network is used to extract features of the video segments in each dimension corresponding to each low-quality category; the feature selection network is used to select among the features extracted by the self-attention mechanism network; and the classification network is used to predict the prediction confidence of each video segment belonging to each low-quality category based on the features selected by the feature selection network.
10. The method of claim 9, wherein the feature selection network is used to select among the features extracted by the self-attention mechanism network before the classification network predicts the prediction confidence of each video segment belonging to each low-quality category based on the features extracted by the self-attention mechanism network.
11. A low-quality video detection device, the device comprising: A video processing unit is used to divide the video to be detected into at least two video segments; A video understanding unit is used to predict each video segment using a preset video understanding model, and to obtain the prediction confidence of each video segment belonging to each low-quality category. The low-quality categories include at least two types. The video understanding model is trained using a video understanding network, which includes a self-attention mechanism network, a feature selection network, and a classification network. The self-attention mechanism network is used to extract features of the video segment in each dimension corresponding to each low-quality category. The feature selection network is used to select the features extracted by the self-attention mechanism network. The classification network is used to predict the prediction confidence of each video segment belonging to each low-quality category based on the features selected by the feature selection network. A low-quality identification unit is used to determine the quality review result of the video to be detected based on the proportion of the target video segment in the at least two video segments. The target video segment is a video segment in the at least two video segments in which the predicted confidence of at least one of the low-quality categories meets a first preset condition. The quality review result includes: the video to be detected is a low-quality video or a non-low-quality video.
12. The apparatus according to claim 11, wherein the video understanding unit is further configured to: Obtain at least two sample video segments obtained from the segmentation of the sample video, and the low-quality category label corresponding to each sample video segment; A video understanding network is constructed based on the self-attention mechanism network and the classification network; The video understanding network is trained using the at least two sample video segments and the low-quality category labels corresponding to each sample video segment to obtain the video understanding model.
13. The apparatus according to claim 11 or 12, wherein the video understanding unit is specifically used for: encoding and mapping each video segment into a first embedding vector, a second embedding vector, and a third embedding vector; using the self-attention mechanism network to extract features from the first embedding vector, the second embedding vector, and the third embedding vector corresponding to each video segment, so as to obtain feature maps corresponding to the first embedding vector, the second embedding vector, and the third embedding vector respectively; concatenating the feature maps corresponding to the first embedding vector, the second embedding vector, and the third embedding vector of each video segment to obtain the fused feature map corresponding to each video segment; using the feature selection network to obtain a fourth embedding vector, a fifth embedding vector, and a sixth embedding vector corresponding to the fused feature map of each video segment, and to filter the fused feature map according to the fourth embedding vector, the fifth embedding vector, and the sixth embedding vector to obtain the filtered feature map corresponding to each video segment; and using the classification network to make a prediction on the filtered feature map corresponding to each video segment, thereby obtaining the prediction confidence of each video segment belonging to each of the low-quality categories.
14. The apparatus according to any one of claims 11-13, wherein the low-quality identification unit is further configured to: determine, based on the prediction confidence of each video segment belonging to each of the low-quality categories, a weighted confidence for each of the low-quality categories; and determine the low-quality categories whose weighted confidence satisfies a second preset condition as the low-quality categories present in the video to be detected.
15. The apparatus according to any one of claims 11-13, wherein the low-quality identification unit is further configured to: determine, based on the prediction confidence of each video segment belonging to each of the low-quality categories, the low-quality categories whose prediction confidence is greater than a first threshold; and identify, among the low-quality categories with a prediction confidence greater than the first threshold, those whose corresponding number of video segments is greater than a second threshold as the low-quality categories present in the video to be detected.
16. The apparatus according to any one of claims 11-15, wherein the video processing unit is specifically used for: The video to be detected is divided into at least two video segments according to the target duration, wherein the target duration is related to the duration of the video to be detected.
17. The apparatus according to any one of claims 11-16, wherein the video processing unit is further configured to filter the at least two video segments according to a preset segment extraction number or a preset ratio before the video understanding unit predicts each video segment using a preset video understanding model.
18. The apparatus according to any one of claims 11-17, wherein the video processing unit is further configured to scale each video segment to a preset size and adjust it to a preset format before the video understanding unit predicts each video segment using a preset video understanding model.
19. A video understanding model training device, the device comprising: an acquisition unit used to acquire at least two sample video segments produced by segmenting a sample video, and the low-quality category label corresponding to each sample video segment; a building unit used to construct a video understanding network based on a self-attention mechanism network, a feature selection network, and a classification network; and a training unit used to train the video understanding network using the at least two sample video segments and the low-quality category label corresponding to each sample video segment to obtain a video understanding model; wherein the video understanding model is used to perform low-quality detection on a video to be detected; the self-attention mechanism network is used to extract features of the video segments in each dimension corresponding to each low-quality category; the feature selection network is used to select among the features extracted by the self-attention mechanism network; and the classification network is used to predict the prediction confidence of each video segment belonging to each low-quality category based on the features selected by the feature selection network.
20. The apparatus of claim 19, wherein the feature selection network is configured to select among the features extracted by the self-attention mechanism network before the classification network predicts the prediction confidence of each video segment belonging to each low-quality category based on the features extracted by the self-attention mechanism network.
21. An electronic device, comprising: At least one processor; and a memory communicatively connected to the at least one processor; The memory stores instructions executable by the at least one processor, which, when executed by the at least one processor, enable the at least one processor to perform the method according to any one of claims 1-8, or the method according to claim 9 or 10.
22. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method according to any one of claims 1-8, or the method according to claim 9 or 10.
23. A computer program product comprising a computer program that, when executed by a processor, implements the method according to any one of claims 1-8, or the method according to claim 9 or 10.
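The decision logic recited in claims 1, 4, and 5 can be illustrated with a minimal sketch. The claims leave the preset conditions open, so the sketch makes the following assumptions: the first preset condition is "confidence at or above `conf_threshold`", a video is treated as low quality when the proportion of target segments is at or above `ratio_threshold`, the weighted confidence is a weighted average over segments, and the second preset condition is "at or above `threshold`". All function names, parameter names, and default values are hypothetical.

```python
from typing import List, Sequence, Tuple

# One row per video segment, one column per low-quality category.
Confidences = Sequence[Sequence[float]]


def review_video(segment_conf: Confidences, conf_threshold: float = 0.5,
                 ratio_threshold: float = 0.3) -> Tuple[bool, float]:
    """Claim 1 sketch: decide from the proportion of target segments."""
    if not segment_conf:
        raise ValueError("at least one segment is required")
    # A target segment has at least one category meeting the assumed condition.
    targets = sum(1 for row in segment_conf if any(p >= conf_threshold for p in row))
    ratio = targets / len(segment_conf)
    return ratio >= ratio_threshold, ratio


def categories_by_weighted_confidence(segment_conf: Confidences,
                                      weights: Sequence[float],
                                      threshold: float) -> List[int]:
    """Claim 4 sketch: weighted average confidence per category (assumed form)."""
    total = sum(weights)
    result = []
    for c in range(len(segment_conf[0])):
        weighted = sum(w * row[c] for w, row in zip(weights, segment_conf)) / total
        if weighted >= threshold:
            result.append(c)
    return result


def categories_by_segment_count(segment_conf: Confidences, first_threshold: float,
                                second_threshold: int) -> List[int]:
    """Claim 5 sketch: count segments above the first threshold per category."""
    return [
        c for c in range(len(segment_conf[0]))
        if sum(1 for row in segment_conf if row[c] > first_threshold) > second_threshold
    ]
```

Each function takes the per-segment confidences produced by the video understanding model and returns either the overall review result with the observed proportion, or the indices of the low-quality categories attributed to the video.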