Illegal video identification method and apparatus, and electronic device

A recognition method in the field of video recognition, addressing the problems that existing approaches can neither identify new illegal videos well nor continue to identify the original illegal videos well.

Pending Publication Date: 2020-06-23
BEIJING KINGSOFT CLOUD NETWORK TECH CO LTD

AI-Extracted Technical Summary

Problems solved by technology

[0004] However, the above-mentioned pre-trained convolutional neural network is often unable to identify new violation videos that did not appear in the sample data; and if new violation videos are added to update the original sample data, and the updated sample data is then used to update the original convolutional neural network, then after a long period of update and evolution, the convolutional neural network trained with updated s...

Abstract

The embodiment of the invention provides a violation video recognition method and apparatus, and an electronic device. The method comprises the steps of: recognizing a to-be-recognized video based on a preset recognition model to obtain a recognition result, wherein the recognition model comprises a first recognition model and a second recognition model, and the recognition result comprises a first recognition result obtained based on the first recognition model and a second recognition result obtained based on the second recognition model; and if at least one of the first recognition result and the second recognition result is a violation, determining that the video is a violation. The first recognition model is obtained by training an initial convolutional neural network model in advance using initial sample data, and the second recognition model is obtained by training the first recognition model in advance using updated data of the initial sample data. Through this scheme, violation videos can be recognized stably and reliably without being affected by long-term update and evolution, and the probability of missed detection is reduced.

Application Domain

Character and pattern recognition; Neural architectures

Technology Topic

Engineering; Video recognition


Examples

  • Experimental program (1)

Example Embodiment

[0026] In order to enable those skilled in the art to better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely below in conjunction with the accompanying drawings of the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
[0027] The following first introduces a method for identifying illegal videos provided in an embodiment of the present invention.
[0028] It should be noted that the method for identifying illegal videos provided by the embodiments of the present invention can be applied to electronic devices capable of data processing, including desktop computers, portable computers, Internet TVs, smart mobile terminals, wearable smart terminals, servers, and the like; no limitation is imposed here, and any electronic device that can implement the embodiments of the present invention falls within the protection scope of the embodiments of the present invention.
[0029] As shown in Figure 1, the process of the method for identifying illegal videos in an embodiment of the present invention may include:
[0030] S101: Recognizing a video to be recognized based on a preset recognition model to obtain a recognition result. The preset recognition model includes a first recognition model and a second recognition model; the recognition result includes: a first recognition result obtained based on the first recognition model, and a second recognition result obtained based on the second recognition model.
[0031] The first recognition model is obtained by pre-training the initial convolutional neural network model using initial sample data, and the second recognition model is obtained by pre-training the first recognition model using updated data of the initial sample data.
[0032] Specifically, the initial sample data is a set of sample data, collected before training the first recognition model, that includes both illegal videos and legal videos. The updated data of the initial sample data may be a sample data set that updates the initial sample data once by including a new offending video when it first appears; it may also be a sample data set that updates the initial sample data multiple times, including each new offending video as it appears. The first recognition model and the second recognition model may be trained with an existing method for training a convolutional neural network, for example, a batch stochastic gradient descent algorithm.
[0033] In order to obtain the first recognition result and the second recognition result based on the preset recognition model, the foregoing step of recognizing the video to be recognized based on the preset recognition model to obtain the recognition result may specifically include the following steps A1 to A2:
[0034] A1: Recognize the video to be recognized based on the first recognition model to obtain the first recognition result;
[0035] A2: Recognize the video to be recognized based on the second recognition model to obtain the second recognition result.
[0036] This embodiment does not limit the execution order of steps A1 and A2: they may be executed simultaneously, A1 may be executed before A2, or A2 may be executed before A1.
[0037] It is understandable that, since the first recognition model is obtained by training the initial convolutional neural network model with the initial sample data in advance, and the second recognition model is obtained by training the first recognition model with the updated data of the initial sample data in advance, the first recognition result indicates whether the video to be recognized is a violating video of the kind included in the initial sample data, and the second recognition result indicates whether the video to be recognized is a violating video of the kind included in the updated data of the initial sample data.
[0038] That is to say, the first recognition model determines whether the video to be recognized matches the feature information of the offending videos contained in the initial sample data, i.e., whether its violation type is one of the types of offending videos included in the initial sample data; the second recognition model determines whether the video to be recognized matches the feature information of the offending videos contained in the updated data of the initial sample data, i.e., whether its violation type is one of the types of offending videos included in the updated data.
[0039] In addition, the first recognition result and the second recognition result may specifically be a violation confidence characterizing the probability that the video to be recognized is a violation video, or an identifier indicating that the video to be recognized is in violation or legal, such as 1 or 0.
[0040] Of course, the first recognition model and the second recognition model may recognize the input video using existing standard image recognition technology applied frame by frame to the video to be recognized, or applied to video clips of the video to be recognized.
[0041] S102: If at least one of the first recognition result and the second recognition result is in violation, determine that the video is in violation.
[0042] If exactly one of the first recognition result and the second recognition result is a violation, then the video to be recognized has been recognized as a violation video by the first recognition model or by the second recognition model.
[0043] Exemplarily, suppose the feature of the offending videos included in the initial sample data is that body parts are exposed, and the feature of the offending videos included in the updated data of the initial sample data is that the action of the person in the video violates the regulations. If the video to be recognized is a video with naked body parts, the first recognition result is a violation and the second recognition result is a non-violation; if the video to be recognized is a video in which the actions of people violate the rules, the second recognition result is a violation and the first recognition result is a non-violation.
[0044] If the first recognition result and the second recognition result are both violations, then the video to be recognized is recognized as a violation video both by the first recognition model and by the second recognition model.
[0045] Exemplarily, again suppose the feature of the offending videos in the initial sample data is exposed body parts and the feature of the offending videos in the updated data is violating actions. If the video to be recognized both shows naked body parts and contains a violating action, the first recognition result and the second recognition result are both violations.
[0046] It can be seen that if at least one of the first recognition result and the second recognition result is a violation, the video to be recognized has been recognized as a violation by at least one of the first recognition model and the second recognition model, and it can therefore be determined that the video is in violation.
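To make the decision rule of steps S101 and S102 concrete, the following is a minimal Python sketch, assuming each model is a callable that maps the sampled frames to a violation confidence in [0, 1]; the function name and threshold are illustrative, not part of the patent.

```python
def is_violation(frames, first_model, second_model, threshold=0.5):
    """Dual-model OR rule: the video is a violation if at least one of the
    two recognition results is a violation (S102)."""
    p1 = first_model(frames)   # first recognition result (violation confidence)
    p2 = second_model(frames)  # second recognition result (violation confidence)
    return (p1 >= threshold) or (p2 >= threshold)
```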
[0047] According to the method for identifying illegal videos provided by the embodiment of the present invention, since the preset recognition model includes a first recognition model and a second recognition model, the first recognition model can memorize the initial sample data while the second recognition model, trained on the updated data of the initial sample data, identifies the new offending videos corresponding to that updated data. Therefore, new offending videos can be identified while preventing the original offending videos corresponding to the initial sample data from being forgotten. It can be seen that this solution realizes stable and reliable identification of illegal videos, unaffected by long-term update and evolution, and reduces the probability of missed detection.
[0048] Optionally, as shown in Figure 2, the process of the training method for the preset recognition model in the Figure 1 embodiment of the present invention may include:
[0049] S201: Input a plurality of collected sample images into an initial convolutional neural network model for training, and obtain a predicted violation confidence of the video segment composed of the plurality of sample images.
[0050] The predicted violation confidence is the probability, produced by the initial convolutional neural network model after processing the input sample images, that the video segment composed of the multiple sample images belongs to a violation video; it is the initial convolutional neural network model's detection result for the sample images.
[0051] S202: According to the obtained predicted violation confidence and the pre-marked category information indicating whether each sample image is a violation, use a preset error function to determine whether the convolutional neural network model at the current training stage converges. If it converges, perform step S203; if it does not converge, perform steps S204 to S205.
[0052] S203: Determine the convolutional neural network model in the current training stage as a preset recognition model.
[0053] Using the preset error function to determine whether the model at the current training stage has converged may specifically be done by minimizing the preset error function: when the minimum value of the error function is reached, the model at the current training stage has converged; when the minimum value has not been reached, it has not converged.
[0054] The preset error function calculates the difference between the pre-labeled category information (violation or not) of each sample image and the detection result of the convolutional neural network model at the current training stage for that image; the smaller the difference, the more accurate the detection result. Therefore, when the preset error function reaches its minimum, the detection results of the model at the current training stage best match the pre-labeled category information, and the converged model can be determined as the preset recognition model.
[0055] S204: Using a preset gradient function, apply a stochastic gradient descent algorithm to adjust the model parameters of the convolutional neural network model at the current training stage.
[0056] S205: Input the collected multiple sample images into the adjusted convolutional neural network model, and repeat the steps of training and adjusting model parameters until the adjusted convolutional neural network converges.
[0057] The stochastic gradient descent algorithm adjusts the model parameters of the convolutional neural network model at the current training stage so that, after the adjustment, the model's detection results improve and their difference from the pre-labeled category information decreases, thereby achieving convergence.
[0058] Correspondingly, before the model at the current training stage converges, the steps of training and adjusting the model parameters are repeated; of course, each round of training is performed on the convolutional neural network model with the most recently adjusted parameters.
[0059] At the same time, it is understandable that both the first recognition model and the second recognition model can be obtained by training in the manner of the Figure 2 embodiment. The difference is: when training the first recognition model, the model in step S201 is the initial convolutional neural network and the multiple sample images in step S201 are the initial sample data; when training the second recognition model, the model in step S201 is the first recognition model and the multiple sample images in step S201 are the updated data of the initial sample data.
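The two-stage training just described can be sketched as follows in Python/PyTorch. This is a hedged illustration, not the patent's exact procedure: the error function (here BCELoss), learning rate, and convergence test are assumptions, and `initial_cnn`, `initial_loader`, and `updated_loader` are hypothetical placeholders.

```python
import copy
import torch
import torch.nn as nn

def train_until_converged(model, loader, lr=0.01, tol=1e-4, max_epochs=100):
    """Batch SGD loop in the spirit of S201-S205: train, test the error
    function for convergence (S202), otherwise adjust parameters (S204)
    and repeat (S205). The model is assumed to end in a sigmoid so that
    its output is a violation confidence in [0, 1]."""
    criterion = nn.BCELoss()                       # assumed preset error function
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    prev_loss = float("inf")
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for images, labels in loader:              # labels: 1 = violation, 0 = legal
            optimizer.zero_grad()
            pred = model(images).squeeze(1)        # predicted violation confidence
            loss = criterion(pred, labels.float())
            loss.backward()
            optimizer.step()                       # stochastic gradient descent step
            epoch_loss += loss.item()
        if abs(prev_loss - epoch_loss) < tol:      # crude convergence test
            break
        prev_loss = epoch_loss
    return model

# First model: the initial CNN trained on the initial sample data, e.g.
#   first_model = train_until_converged(initial_cnn, initial_loader)
# Second model: a copy of the first model, further trained on the updated data:
#   second_model = train_until_converged(copy.deepcopy(first_model), updated_loader)
```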
[0060] In addition, in specific applications, if there are vulgar actions in a video, the video is also a violation video. However, an action feature is reflected in the overall features of a video segment composed of multiple image frames. If the above-mentioned convolutional neural network for single-image-frame recognition is used, only certain single image frames that make up a vulgar action can be recognized; the overall characteristics of a video clip composed of multiple image frames cannot be recognized, so vulgar actions are difficult to recognize, leading to missed detection of illegal videos.
[0061] To this end, optionally, at least one of steps A1 and A2 in the Figure 1 embodiment of the present invention may specifically include the following steps A11 to A14:
[0062] A11. Obtain multiple image frames from the video to be recognized.
[0063] Obtaining multiple image frames may specifically be collecting image frames from the video to be identified according to a preset period, so as to obtain multiple image frames at equal intervals. Since an action is composed of continuous image frames, and the differences between adjacent frames may be small, equally spaced image frames can retain the characteristics of the action as much as possible compared with consecutive frames, while avoiding the slow data processing that the huge volume of consecutive, un-spaced frames would cause.
[0064] For example, in the video to be recognized, among all the image frames that constitute a person's drinking action, consecutive frames 1 to 5 may show the person's hand touching the cup, consecutive frames 6 to 15 may show the person picking up the cup, and consecutive frames 16 to 25 may show the person drinking. When image frames are collected with a preset period of 5, the 5th frame A (hand touching the cup), the 10th frame B and the 15th frame C (picking up the cup), and the 20th frame D and the 25th frame E (drinking) are obtained, so that relatively few image frames reflect the action characteristics of the person drinking water.
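A minimal sketch of step A11, assuming OpenCV is used for decoding; the period of 5 frames matches the example above but is otherwise arbitrary.

```python
import cv2

def sample_frames(video_path, period=5):
    """Collect every `period`-th frame (A11), i.e. frames 0, period, 2*period, ...,
    giving equally spaced frames that keep the action's key poses while
    limiting the data volume."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % period == 0:
            frames.append(frame)      # BGR image frame, shape (H, W, 3)
        index += 1
    cap.release()
    return frames
```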
[0065] A12: Perform feature extraction on multiple image frames respectively to obtain the image frame feature matrix of each image frame.
[0066] There are many ways to perform feature extraction on an image frame. Exemplarily, feature extraction may be performed on each image frame using a preset convolutional neural network that has been pre-trained on multiple sample images, where the sample images may come from sample videos containing offending videos. Feature extraction may also use the HOG (Histogram of Oriented Gradients) feature algorithm, the LBP (Local Binary Pattern) algorithm, or other feature extraction algorithms applied to each image frame.
[0067] Any feature extraction algorithm that can be used to extract the offending features and non-offending features of an image can be used in the present invention, which is not limited in this embodiment.
[0068] For example, feature extraction is performed on image frame A, image frame B, image frame C, image frame D, and image frame E to obtain the image feature matrix a of image frame A, the image feature matrix b of image frame B, the image feature matrix c of image frame C, the image feature matrix d of image frame D, and the image feature matrix e of image frame E.
[0069] Optionally, in specific applications, the above-mentioned feature extraction algorithm for extracting the offending and non-offending features of an image can be used as a sub-network of the preset recognition model in the Figure 1 embodiment of the present invention, in which case the above step A12 may specifically include:
[0070] The feature extraction sub-network based on the preset recognition model performs feature extraction on multiple image frames to obtain the image frame feature matrix of each image frame.
[0071] For example, input image frame A and image frame B into the first recognition model F1 and the second recognition model F2 respectively, to obtain the image frame feature matrices a1 and a2 of image frame A and the image frame feature matrices b1 and b2 of image frame B.
[0072] A13: Splice multiple image frame feature matrices to obtain a video segment feature matrix.
[0073] It is understandable that multiple image frames can form a video segment, and the characteristics of the video segment need to reflect how each constituent image frame changes in the time dimension. Therefore, multiple image frame feature matrices can be spliced to obtain a video segment feature matrix that reflects the features of a video segment composed of multiple image frames.
[0074] Exemplarily, the image frames obtained during identification of the illegal video are three-channel color images, and correspondingly each image frame feature matrix is a three-dimensional feature matrix. Splicing the image frame feature matrices may therefore specifically be splicing multiple image frame feature matrices into a four-dimensional feature matrix, for example splicing the image frame feature matrices (c, h, w) of M image frames into the feature matrix (M, c, h, w) of the video segment composed of the M image frames, where h is the height of the matrix, w is the width of the matrix, and c is the number of channels of the matrix.
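Step A13 then amounts to stacking the M three-dimensional frame feature matrices along a new leading axis; a NumPy sketch with placeholder shapes:

```python
import numpy as np

# Placeholder per-frame features: M = 5 frames, each with feature matrix (c, h, w).
frame_features = [np.random.rand(64, 7, 7) for _ in range(5)]

# A13: stack along a new leading axis to get the (M, c, h, w) segment matrix.
segment_features = np.stack(frame_features, axis=0)
assert segment_features.shape == (5, 64, 7, 7)
```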
[0075] A14. Identify the feature matrix of the video segment to determine whether the video violates the rules.
[0076] Identifying the feature matrix of the video segment to determine whether the video violates the rules means identifying whether the video features reflected by the video segment feature matrix are in violation. Therefore, the video segment feature matrix may be recognized with any classification algorithm, so as to obtain a recognition result that the video is illegal or legal.
[0077] Exemplarily, the classification algorithm may be a classifier model, such as a Softmax classifier or a binary classifier model. Of course, the classification algorithm is pre-trained with multiple sample images containing violation and non-violation content. Any classification algorithm that can distinguish violation features from non-violation features can be used in the present invention, which is not limited in this embodiment.
[0078] Since the video segment feature matrix is obtained by splicing the image frame feature matrices of multiple image frames, it can reflect the overall features of the video segment composed of those frames, and thus the action features in the segment. Therefore, compared with identifying only single image frames, identifying with the video segment feature matrix can recognize not only naked pictures in single frames but also vulgar actions in the video segment, thereby reducing the probability of missed detection of illegal videos caused by unrecognizable vulgar actions.
[0079] It should be noted that recognizing multiple image frames is less efficient than recognizing a single image frame. Therefore, when the efficiency requirement for recognizing illegal videos takes precedence over reducing the probability of missed detection, one of the first recognition model and the second recognition model can use the above steps A11 to A14 to identify video segments of the video to be identified, while the other uses an existing method of identifying a single image frame. Of course, for a recognition model that recognizes a single image frame, one of the multiple image frames can be selected and input into the model to obtain that model's recognition result for the frame.
[0080] When the need to reduce the probability of missed detection takes precedence over the efficiency demand for recognizing illegal videos, both the first recognition model and the second recognition model can use the steps A11 to A14 described above to recognize the video segments of the video to be recognized.
[0081] Optionally, in specific applications, any of the above-mentioned classification algorithms for identifying video segment feature matrices can be used as a sub-network of the preset recognition model in the Figure 1 embodiment of the present invention, in which case the above step A14 may specifically include:
[0082] The classifier sub-network based on the preset recognition model recognizes the feature matrix of the video segment to determine whether the video violates the rules.
[0083] Specifically, the classifier sub-network of the preset recognition model recognizing the video segment feature matrix may mean inputting the video segment feature matrix into the classifier sub-network of the preset recognition model to obtain a confidence that the video segment feature matrix corresponds to a violation, or to obtain an identifier indicating that it corresponds to a violation.
[0084] Optionally, the above-mentioned classifier sub-network based on the preset recognition model recognizes the feature matrix of the video segment to determine whether the video violates the rules, which may include:
[0085] Input the feature matrix of the video segment into the classifier sub-network of the preset recognition model to obtain the violation confidence of the video;
[0086] If the violation confidence level meets the preset violation conditions, the video is determined to be in violation.
[0087] The classifier sub-network is used to obtain the confidence that the video segment corresponding to the input video feature matrix belongs to a violation type. The preset violation condition may specifically be that the violation confidence falls within a preset confidence interval, or that the violation confidence is not less than a preset confidence threshold. The preset confidence interval and the preset confidence threshold are determined when the classification algorithm is trained.
[0088] For example, M image frames are collected from the video to be identified according to a preset period, where M > 1, and the M frames are all three-channel RGB image frames of width W and height H. The M image frames are input into the feature extraction sub-network of the preset recognition model, which extracts the feature matrix (c, h, w) of each of the M image frames; these are then spliced into the video segment feature matrix f1 = (M, c, h, w) of the video segment composed of the M image frames, where h is the height of the matrix, w is the width of the matrix, and c is the number of channels. The feature matrix f1 = (M, c, h, w) is input into the classifier sub-network of the preset recognition model, which computes the violation confidence of the video segment corresponding to f1; this is taken as the violation confidence of the video.
[0089] Optionally, the step of inputting the feature matrix of the video segment into the classifier sub-network of the preset recognition model to obtain the violation confidence of the video may specifically include:
[0090] Transpose the feature matrix of the video segment to obtain the feature matrix of the transposed video segment;
[0091] The output obtained after the transposed video segment feature matrix is input into the preset first fully connected function is used as the input of the logistic regression loss function to obtain the violation confidence of the video.
[0092] The purpose of transposing the video segment feature matrix is to convert the matrix representing the video segment features into a form convenient to input into the first fully connected function. For example, the feature matrix f1 = (M, c, h, w) is transposed to obtain the transposed video segment feature matrix f2 = (c, M, h, w). The logistic regression loss function may specifically be a sigmoid activation function.
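The transposition and classification just described might look as follows in PyTorch; flattening f2 before the first fully connected function is an assumption about its input layout, which the text leaves unspecified.

```python
import torch
import torch.nn as nn

M, c, h, w = 5, 64, 7, 7
f1 = torch.rand(M, c, h, w)          # video segment feature matrix f1 = (M, c, h, w)
f2 = f1.permute(1, 0, 2, 3)          # transposed feature matrix f2 = (c, M, h, w)

# Preset first fully connected function; flattening f2 into a single vector is
# an assumed input layout, not specified by the patent text.
fc1 = nn.Linear(c * M * h * w, 1)
logit = fc1(f2.reshape(1, -1))       # shape (1, 1)
confidence = torch.sigmoid(logit)    # sigmoid as the logistic regression function
```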
[0093] It is understandable that, in the transposed video feature matrix used to obtain the violation confidence, the features represented by different elements are correlated with violation features to different degrees. For example, if feature 1 represented by element 1 is a wall feature and feature 2 represented by element 2 is a human body feature, then the feature represented by element 1 has a low correlation with violation features, while the feature represented by element 2 has a high correlation with violation features.
[0094] Therefore, in order to better identify elements highly correlated with offending features, optionally, after the above step of transposing the video segment feature matrix to obtain the transposed video segment feature matrix, the method for identifying illegal videos of the embodiment of the present invention may also include the following steps B1 to B3:
[0095] B1: Input the transposed video segment feature matrix into the attention mechanism sub-network of the preset recognition model to obtain the spatio-temporal response weight matrix.
[0096] The attention mechanism sub-network may specifically be a function that extracts, from the transposed video segment feature matrix, the degree of correlation between each element and the offending features, so that the subsequent step B2 can use the spatio-temporal response weight matrix to weight the transposed video segment feature matrix.
[0097] B2: Use the spatio-temporal response weight matrix to weight the transposed video segment feature matrix to obtain the video feature vector.
[0098] Exemplarily, the spatio-temporal response weight matrix P1 = (M, h, w) is used to weight the transposed video segment feature matrix f2 = (c, M, h, w) to obtain the video feature vector v, whose i-th component may be written as v_i = Σ_{j=1..M} Σ_{(k,l)} P1(j, k, l) · f2(i, j, k, l), where j denotes the j-th image frame in the video segment composed of the M image frames, (k, l) denotes the rectangular area with coordinates (k, l) in the image frame, and i denotes the i-th dimension of the c-dimensional video feature vector.
[0099] Correspondingly, the step of using the output obtained after inputting the transposed video segment feature matrix into the preset first fully connected function as the input of the logistic regression loss function to obtain the violation confidence of the video then includes:
[0100] B3: Use the output obtained after inputting the video feature vector into the preset first fully connected function as the input of the logistic regression loss function to obtain the violation confidence of the video.
[0101] Exemplarily, the video feature vector v is input into the preset first fully connected function, and the resulting output is input into the preset activation function, such as the sigmoid activation function, to obtain the violation confidence of the video corresponding to the video segment feature matrix f1 = (M, c, h, w).
[0102] Optionally, considering that recognizing a video as violating or legal is a binary classification, and that the video feature matrix used to recognize action features may not need to attend to color features, the attention mechanism sub-network may compute a spatio-temporal response weight matrix with values in the range [0, 1] and dimension M·h·w. In this case, the above step B1 may include the following steps B11 to B13:
[0103] B11: Transpose and reduce the dimensionality of the feature matrix of the transposed video segment to obtain the feature matrix of the reduced-dimensional video segment;
[0104] B12: Input the reduced-dimension video segment feature matrix into the preset second fully connected function and the preset activation function to obtain the response weight matrix;
[0105] B13: Deform and restore the response weight matrix to obtain the spatio-temporal response weight matrix.
[0106] Exemplarily, the above steps B11 to B13 may include: performing transposition and dimensionality-reduction transformation on the transposed video segment feature matrix f2 = (c, M, h, w) to obtain the reduced-dimension video segment feature matrix of dimension M·h·w × c; inputting the reduced-dimension feature matrix into the preset second fully connected function and using its output as the input of a preset activation function, such as the sigmoid activation function, to obtain a response weight matrix with values in [0, 1] and dimension M·h·w; and deforming and restoring the response weight matrix to obtain the spatio-temporal response weight matrix P1 = (M, h, w).
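Steps B11 to B13, together with the weighting of step B2 and the classification of step B3, can be sketched in PyTorch as follows; the layer shapes are assumptions consistent with the dimensions given above.

```python
import torch
import torch.nn as nn

M, c, h, w = 5, 64, 7, 7
f2 = torch.rand(c, M, h, w)                 # transposed video segment feature matrix

# B11: transpose so channels come last, then flatten to (M*h*w, c)
x = f2.permute(1, 2, 3, 0).reshape(M * h * w, c)

# B12: preset second fully connected function + sigmoid -> weights in [0, 1]
fc2 = nn.Linear(c, 1)                       # assumed layout of the second FC function
weights = torch.sigmoid(fc2(x))             # shape (M*h*w, 1)

# B13: deform and restore to the spatio-temporal response weight matrix (M, h, w)
p1 = weights.reshape(M, h, w)

# B2: weight f2 by p1 and sum over j, k, l to obtain the c-dim video feature vector
v = (f2 * p1.unsqueeze(0)).sum(dim=(1, 2, 3))   # shape (c,)

# B3: first fully connected function + sigmoid -> violation confidence of the video
fc1 = nn.Linear(c, 1)
confidence = torch.sigmoid(fc1(v))          # shape (1,)
```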
[0107] Of course, the preset first fully connected function and the preset second fully connected function may each be a fully connected function with one hidden layer, and both serve to condense the elements of the video segment feature matrix down to the elements reflecting whether the video segment is violating or legal, so as to prevent overfitting. The difference between the two is that the first fully connected function operates on the transposed video segment feature matrix (or the video feature vector derived from it), while the second fully connected function operates on the reduced-dimension video segment feature matrix.
[0108] In specific applications, in order to reduce missed detections when using visual technology to identify illegal videos, supervisors generally conduct a second review of the violation videos identified by the recognition model. However, when the violating area of a violation video is very small, for example a small region in a corner of the screen, and the supervisor has to watch multiple violation videos at the same time, a video with a small violating area may be misjudged as legitimate during review, still causing missed detection.
[0109] To this end, optionally, after the video is determined to be in violation using the spatio-temporal response weight matrix obtained in steps B11 to B13, the embodiment of the present invention can also identify the areas of high violation confidence in the violation video and output the coordinates of the violating area, so that supervisors can conveniently confirm it, misjudgments caused by hard-to-notice violating areas are avoided, and the probability of missed detection is reduced. Specifically, the following steps C1 to C3 can be used to mark the areas of high violation confidence in the video:
[0110] C1: Normalize the spatio-temporal response weight matrix to obtain, for each image frame corresponding to the spatio-temporal response weight matrix, the violation response value of each preset rectangular area constituting the image frame;
[0111] C2: For each violation response value, determine whether the violation response value is greater than the preset violation threshold;
[0112] C3: If the violation response value is greater than the preset violation threshold, output the coordinate information of the preset rectangular area corresponding to that violation response value.
[0113] Exemplarily, steps C1 to C3 may specifically be: divide the spatio-temporal response weight matrix p1 = (M, h, w) point by point by the sum of all its elements to realize normalization, p1_re = p1 / sum(p1), thereby obtaining the violation response value p1_re(j, k, l) of the rectangular area with coordinates (k, l) in the j-th image frame. Judge whether p1_re(j, k, l) is greater than the preset violation threshold; when it is, output the coordinate information (k, l) of the corresponding rectangular area.
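A NumPy sketch of steps C1 to C3 under the notation above; the threshold value is a placeholder.

```python
import numpy as np

M, h, w = 5, 7, 7
p1 = np.random.rand(M, h, w)        # spatio-temporal response weight matrix
p1_re = p1 / p1.sum()               # C1: point-wise division by the sum of all elements

threshold = 0.01                    # preset violation threshold (placeholder value)
for j, k, l in zip(*np.where(p1_re > threshold)):
    # C3: output the coordinates of the rectangular area (k, l) in image frame j
    print(f"frame {j}: violating area at ({k}, {l}), response {p1_re[j, k, l]:.4f}")
```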
[0114] In specific applications, training a convolutional neural network adjusts the parameters of its filters according to the network's filtering results on the input sample data; therefore, different sample data yield convolutional neural networks with different parameters. However, if a single neural network is expected to recognize the violation image features of different violation samples as comprehensively as possible, that single network may become overly complicated or fail to converge due to overfitting. To this end, multiple preset recognition models can be used; together they can recognize as many different violation image features as possible, reducing missed detection of violation videos while avoiding overfitting.
[0115] For this, as shown in Figure 3, in the method for identifying illegal videos in another embodiment of the present invention, the number of preset recognition models is multiple, and the method may include:
[0116] S301: Acquire multiple image frames from a video to be recognized.
[0117] S301 is the same step as A11 in the optional embodiment of Figure 1 of the present invention and will not be repeated here; for details, please refer to the description of that optional embodiment.
[0118] S302: Input multiple image frames into each preset recognition model for feature extraction to obtain multiple image frame feature matrices of each image frame.
[0119] For example, input image frame A and image frame B into the preset recognition models F1, F2, ..., Fn respectively, obtaining the image frame feature matrices a1, a2, ..., an of image frame A and the image frame feature matrices b1, b2, ..., bn of image frame B, where n is the number of preset recognition models.
[0120] S303, splicing the image frame feature matrices extracted from the same preset recognition model among the obtained multiple image frame feature matrices to obtain a video segment feature matrix corresponding to the same preset recognition model.
[0121] For example, the image frame feature matrix a1 and the image frame feature matrix b1 extracted by the preset recognition model F1 are spliced to obtain the video segment feature matrix a1b1 of the video segment AB composed of image frame A and image frame B; the image frame feature matrix a2 and the image frame feature matrix b2 extracted by the preset recognition model F2 are spliced to obtain the video segment feature matrix a2b2 of the same video segment AB; and so on, yielding multiple video segment feature matrices for the video segment composed of the multiple image frames.
[0122] S304: Input the obtained multiple video segment feature matrices into a classifier sub-network of a preset recognition model corresponding to the video segment feature matrix, respectively, to obtain multiple violation confidence levels of the video.
[0123] For example, the obtained video segment feature matrices a1b1, a2b2, ..., anbn are input into the classifier sub-networks of the preset recognition models F1, F2, ..., Fn respectively, obtaining the confidences P1, P2, ..., Pn that the video to be recognized belongs to an offending video.
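Steps S302 to S304 can be sketched as a loop over the n preset recognition models; `extract` and `classify` are hypothetical methods standing in for each model's feature extraction and classifier sub-networks.

```python
import numpy as np

def multi_model_confidences(frames, models):
    """Sketch of S302-S304 for n preset recognition models. `models` is a
    hypothetical list of objects whose `extract` and `classify` methods stand
    in for each model's feature extraction and classifier sub-networks."""
    confidences = []
    for model in models:
        frame_feats = [model.extract(f) for f in frames]   # S302: one (c, h, w) per frame
        segment_feat = np.stack(frame_feats, axis=0)       # S303: splice into (M, c, h, w)
        confidences.append(model.classify(segment_feat))   # S304: violation confidence
    return confidences
```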
[0124] S305: Use preset fusion rules to fuse multiple confidences to obtain target confidences.
[0125] Optionally, S305 may specifically include:
[0126] Input multiple confidence levels into a preset weighted average algorithm to obtain the target confidence level.
[0127] Among them, the preset weighted average algorithm may be a linear weighted average algorithm or a nonlinear weighted average algorithm.
[0128] For example, in the linear weighted average algorithm, the confidence obtained by each preset recognition model has weight 1, and the target confidence can be computed directly as the average of the multiple confidences.
[0129] In the nonlinear weighted average algorithm, different weights can be set for the confidences obtained by the preset recognition models according to each model's importance or accuracy; for example, the weight of confidence P1 is 0.6, the weight of confidence P2 is 0.2, ..., and the weight of confidence Pn is 0.1. Each confidence is weighted according to its set weight, and the average of the weighted confidences is then computed to obtain the target confidence.
[0130] Or, S305 may specifically include:
[0131] Count how many times each confidence value occurs among the multiple confidences.
[0132] Determine the confidence value that occurs most often as the target confidence.
[0133] It is understandable that the detection results of the preset recognition models tolerate a certain degree of error, and preset recognition models with different model parameters may produce different recognition results for the same video feature matrix; the more models that produce the same recognition result, the closer that result is to the true situation of the corresponding video. Therefore, the confidence value that occurs most often can be determined as the target confidence.
[0134] For example, among 10 obtained confidences, the confidence 0.4 occurs 2 times, the confidence 0.6 occurs 3 times, and the confidence 0.8 occurs 5 times; the target confidence is therefore determined to be 0.8.
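Both fusion rules of S305 are simple to state in Python; normalizing by the sum of the weights in the weighted variant is an assumption about how the average is taken.

```python
from collections import Counter

def fuse_weighted(confidences, weights=None):
    """First S305 variant: (possibly non-linear) weighted average; with no
    weights given, every confidence has weight 1 and this is a plain mean."""
    if weights is None:
        weights = [1.0] * len(confidences)
    return sum(w * c for w, c in zip(weights, confidences)) / sum(weights)

def fuse_majority(confidences):
    """Second S305 variant: take the confidence value that occurs most often."""
    return Counter(confidences).most_common(1)[0][0]

# e.g. fuse_majority([0.4, 0.4, 0.6, 0.6, 0.6, 0.8, 0.8, 0.8, 0.8, 0.8]) -> 0.8
```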
[0135] S306: If the target confidence level meets the preset recognition condition, determine that the video is a violation video.
[0136] The preset recognition condition may specifically be that the target confidence falls within a preset confidence interval, or that the target confidence is not less than a preset confidence threshold. The preset confidence interval and the preset confidence threshold are determined when the preset classification algorithm is trained.
[0137] Corresponding to the above method embodiment, an embodiment of the present invention also provides a device for identifying illegal videos.
[0138] As shown in Figure 4, the structure of the device for identifying illegal videos provided by an embodiment of the present invention may include:
[0139] The recognition module 401 is configured to recognize a video to be recognized based on a preset recognition model to obtain a recognition result; wherein the recognition model includes a first recognition model and a second recognition model; the recognition result includes: based on the first recognition model A first recognition result obtained by a recognition model, and a second recognition result obtained based on the second recognition model;
[0140] The determining module 402 is configured to determine that the video is in violation when at least one of the first recognition result and the second recognition result is in violation;
[0141] Wherein, the first recognition model is obtained by training an initial convolutional neural network model in advance using initial sample data, and the second recognition model is obtained by training the first recognition model in advance using updated data of the initial sample data.
[0142] In the device for identifying illegal videos provided by the embodiment of the present invention, since the preset recognition model includes a first recognition model and a second recognition model, the first recognition model can memorize the initial sample data while the second recognition model, trained on the updated data of the initial sample data, identifies the new offending videos corresponding to that updated data. Therefore, new offending videos can be identified while preventing the original offending videos corresponding to the initial sample data from being forgotten. It can be seen that this solution realizes stable and reliable identification of illegal videos, unaffected by long-term update and evolution, and reduces the probability of missed detection.
[0143] Optionally, the identification module 401 in the Figure 4 embodiment of the present invention may include:
[0144] The image acquisition sub-module is used to acquire multiple image frames from the video to be recognized;
[0145] The feature extraction sub-module is configured to perform feature extraction on the multiple image frames to obtain the image frame feature matrix of each image frame;
[0146] The splicing submodule is used to splice a plurality of the image frame feature matrices to obtain a video segment feature matrix;
[0147] The recognition sub-module is used to recognize the feature matrix of the video segment to determine whether the video violates the rules.
[0148] Optionally, the feature extraction submodule in the foregoing embodiment can be specifically used for:
[0149] The feature extraction sub-network based on the preset recognition model respectively performs feature extraction on the multiple image frames to obtain the image frame feature matrix of each image frame.
[0150] Optionally, the identification submodule in the foregoing embodiment may be specifically used for:
[0151] A classifier sub-network based on a preset recognition model recognizes the feature matrix of the video segment to determine whether the video violates regulations.
[0152] Optionally, the identification submodule in the foregoing embodiment may include:
[0153] Confidence degree acquisition sub-module, configured to input the video segment feature matrix into the classifier sub-network to obtain the violation confidence of the video; if the violation confidence meets the preset violation conditions, determine the video Violation.
[0154] Optionally, the confidence acquisition submodule in the above embodiment can be specifically used for:
[0155] Performing transposition processing on the video segment feature matrix to obtain a transposed video segment feature matrix;
[0156] The output obtained after the transposed video segment feature matrix is input into the preset first fully connected function is used as the input of the logistic regression loss function to obtain the violation confidence of the video.
[0157] Optionally, the device for identifying illegal videos provided in the embodiment of the present invention may further include: a weight matrix obtaining submodule and a feature vector obtaining submodule;
[0158] The weight matrix obtaining submodule is configured to, after the confidence obtaining submodule transposes the video segment feature matrix to obtain the transposed video segment feature matrix, input the transposed video segment feature matrix into the attention mechanism sub-network of the preset recognition model to obtain the spatio-temporal response weight matrix;
[0159] The feature vector obtaining submodule is configured to use the spatio-temporal response weight matrix to perform weighting processing on the transposed video segment feature matrix to obtain a video feature vector;
[0160] Correspondingly, the confidence acquisition submodule can be specifically used for:
[0161] The output obtained after inputting the video feature vector into the preset first fully connected function is used as the input of the logistic regression loss function to obtain the violation confidence of the video.
[0162] Optionally, the weight matrix obtaining submodule in the foregoing embodiment is specifically used for:
[0163] Performing transposition and dimensionality reduction transformation on the transposed video segment feature matrix to obtain a dimensionality reduced video segment feature matrix;
[0164] Inputting the feature matrix of the reduced-dimensional video segment into a preset second fully connected function and a preset activation function to obtain a response weight matrix;
[0165] The response weight matrix is deformed and restored to obtain the spatio-temporal response weight matrix.
[0166] Optionally, the device for identifying a violation video provided in the embodiment of the present invention may further include: a violation area marking module, configured to use the following steps to mark an area with high violation confidence in the video:
[0167] After the recognition sub-module determines that the video is in violation, normalize the spatio-temporal response weight matrix to obtain, for each image frame corresponding to the spatio-temporal response weight matrix, the violation response value of each preset rectangular area constituting the image frame;
[0168] For each violation response value, determine whether the violation response value is greater than the preset violation threshold;
[0169] If the violation response value is greater than the preset violation threshold, output the coordinate information of the preset rectangular area corresponding to that violation response value.
[0170] Optionally, the number of the aforementioned preset recognition models is multiple;
[0171] The feature extraction sub-module is specifically used for:
[0172] Respectively input the multiple image frames into each preset recognition model for feature extraction, and obtain multiple image frame feature matrices of each image frame;
[0173] The splicing sub-module is specifically used for:
[0174] Splicing the image frame feature matrices extracted from the same preset recognition model among the obtained multiple image frame feature matrices to obtain a video segment feature matrix corresponding to the same preset recognition model;
[0175] The confidence obtaining submodule is specifically used for:
[0176] Respectively inputting the obtained multiple video segment feature matrices into the classifier sub-network of the preset recognition model corresponding to the video segment feature matrix to obtain multiple violation confidence levels of the video;
[0177] Use preset fusion rules to fuse the multiple violation confidence levels to obtain the target confidence level;
[0178] If the target confidence level meets a preset recognition condition, it is determined that the video violates the rules.
[0179] Optionally, the above-mentioned confidence acquisition submodule is specifically used for:
[0180] Input the confidence levels of the multiple violations into a preset weighted average algorithm to obtain the target confidence levels.
[0181] Optionally, the above-mentioned confidence acquisition submodule is specifically used for:
[0182] Count the number of the same violation confidence levels among the multiple violation confidence levels;
[0183] Determine the same violation confidence that occurs most often as the target confidence.
[0184] The embodiment of the present invention also provides an electronic device, as shown in Figure 5, including a processor 501, a communication interface 502, a memory 503, and a communication bus 504, where the processor 501, the communication interface 502, and the memory 503 communicate with each other through the communication bus 504;
[0185] The memory 503 is used to store computer programs;
[0186] The processor 501 is configured to implement all the steps of the above method for identifying illegal videos when executing the computer program stored in the memory 503.
[0187] In the electronic device provided by the embodiment of the present invention, since the preset recognition model includes a first recognition model and a second recognition model, the first recognition model can memorize the initial sample data while the second recognition model, trained on the updated data of the initial sample data, identifies the new illegal videos corresponding to that updated data. Therefore, new illegal videos can be identified while preventing the original illegal videos corresponding to the initial sample data from being forgotten. It can be seen that this solution realizes stable and reliable identification of illegal videos, unaffected by long-term update and evolution, and reduces the probability of missed detection.
[0188] The communication bus mentioned in the above electronic device may be a Peripheral Component Interconnect (PCI for short) bus or an Extended Industry Standard Architecture (EISA for short) bus. The communication bus can be divided into address bus, data bus, control bus and so on. For ease of representation, only one thick line is used in the figure, but it does not mean that there is only one bus or one type of bus.
[0189] The communication interface is used for communication between the aforementioned electronic device and other devices.
[0190] The memory may include random access memory (Random Access Memory, RAM for short), or non-volatile memory (Non-Volatile Memory, NVM for short), such as at least one disk storage. Optionally, the memory may also be at least one storage device located far away from the foregoing processor.
[0191] The aforementioned processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
[0192] An embodiment of the present invention also provides a computer-readable storage medium having a computer program stored in the computer-readable storage medium, and when the computer program is executed by a processor, all steps of the above-mentioned method for identifying illegal videos are realized.
[0193] An embodiment of the present invention provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, since the preset recognition model includes a first recognition model and a second recognition model, the first recognition model memorizes the initial sample data while the second recognition model, trained on the updated data of the initial sample data, identifies the new offending videos corresponding to that updated data; therefore, new offending videos can be identified while the original offending videos corresponding to the initial sample data are not forgotten. It can be seen that this solution realizes stable and reliable identification of illegal videos, unaffected by long-term update and evolution, and reduces the probability of missed detection.
[0194] The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wired means (such as coaxial cable, optical fiber, or Digital Subscriber Line (DSL)) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium may be any available medium that the computer can access, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a Digital Versatile Disc (DVD)), or a semiconductor medium (for example, a Solid State Disk (SSD)), etc.
[0195] It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further restrictions, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
[0196] The various embodiments in this specification are described in a related manner, and the same or similar parts between the various embodiments can be referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, for the device and electronic device embodiments, since they are basically similar to the method embodiments, the description is relatively simple, and for related parts, please refer to the part of the description of the method embodiments.
[0197] The above descriptions are only preferred embodiments of the present invention, and are not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are all included in the protection scope of the present invention.

