A method and terminal for video image content review
By breaking down video image content review into multiple sub-tasks and employing a neural network model that combines object detection and multi-label classification for multi-task learning, the problems of slow computation speed and high resource consumption in existing technologies are solved, achieving efficient and accurate recognition for video image content review.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- FUJIAN IMPERIAL VISION INFORMATION TECH CO LTD
- Filing Date
- 2022-12-23
- Publication Date
- 2026-06-30
AI Technical Summary
In existing technologies, video image content review methods built from multiple single-task recognition models are slow and consume substantial computing resources, while methods based on a single single-task model struggle with the complexity of video content review and, in particular, recognize small and medium-sized targets poorly.
The video image content review is decomposed into multiple sub-tasks. A neural network model that combines object detection and multi-label classification is adopted, including an object detection module, a multi-label classification module, and a shared backbone network. The model is trained through multi-task learning to extract video image features and make violation judgments.
It achieves efficient computation for video and image content review, with high speed, low resource consumption, and high accuracy. The network model has better generalization ability and can better identify illegal content in videos.
Figure CN116109969B_ABST
Description
Technical Field
[0001] This invention relates to the field of content moderation technology, and in particular to a method and terminal for video image content moderation.
Background Technology
[0002] In existing technologies, there are two common methods for image content moderation:
[0003] The first is content moderation based on multiple single-task recognition models: several single-task models are combined, each responsible for identifying one type of inappropriate content. For example, when an image frame is reviewed, multiple single-task models such as content recognition and sensitive text recognition are run, and their results are then combined to obtain the final review result for the input image.
[0004] The second is a content moderation algorithm based on a single-task model: this method trains only one deep learning model, which directly outputs whether an image violates regulations or directly determines what kind of violation the image belongs to.
[0005] The first method requires running multiple review models, which is slow and consumes substantial computing resources. The second method struggles with the complexity of video content review and, in particular, identifies small and medium-sized targets poorly.
Summary of the Invention
[0006] The technical problem to be solved by the present invention is to provide a video image content review method and terminal, which has good recognition effect and low computing resource consumption.
[0007] To solve the above-mentioned technical problems, the technical solution adopted by the present invention is as follows:
[0008] A method for video image content moderation includes the following steps:
[0009] S1. Divide video image content review into multiple sub-tasks, each identifying a specific type of violation. Based on the characteristics of each sub-task, set its recognition algorithm to either object detection or multi-label classification.
[0010] S2. Construct a neural network model that combines object detection and multi-label classification, comprising an object detection module, a multi-label classification module, and a shared backbone network, and train it by multi-task learning according to the set sub-tasks.
[0011] S3. Input the video image into the backbone network to extract image features of the video image at various dimensions.
[0012] S4. Input the image features into the object detection module and the multi-label classification module to determine whether the video image violates regulations and, if so, what kind of violation it is.
[0013] To solve the above-mentioned technical problems, another technical solution adopted by the present invention is as follows:
[0014] A video image content review terminal includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it performs the following steps:
[0015] S1. Divide video image content review into multiple sub-tasks, each identifying a specific type of violation. Based on the characteristics of each sub-task, set its recognition algorithm to either object detection or multi-label classification.
[0016] S2. Construct a neural network model that combines object detection and multi-label classification, comprising an object detection module, a multi-label classification module, and a shared backbone network, and train it by multi-task learning according to the set sub-tasks.
[0017] S3. Input the video image into the backbone network to extract image features of the video image at various dimensions.
[0018] S4. Input the image features into the object detection module and the multi-label classification module to determine whether the video image violates regulations and, if so, what kind of violation it is.
[0019] The beneficial effects of this invention are as follows: the video image content review method and terminal decompose video content review into multiple sub-tasks, select a multi-label classification algorithm or an object detection algorithm according to the characteristics of each sub-task, and integrate the two algorithms into one artificial neural network that shares a backbone network for feature extraction. The approach is fast, requires fewer computing resources, and has high accuracy, and the network model generalizes better.
Attached Figure Description
[0020] Figure 1 is a flowchart illustrating a video image content review method according to an embodiment of the present invention;
[0021] Figure 2 is a flowchart of a video image content review method according to an embodiment of the present invention;
[0022] Figure 3 is a schematic diagram of the structure of a video image content review terminal according to an embodiment of the present invention;
[0023] Label Explanation:
[0024] 1. A video image content review terminal; 2. Processor; 3. Memory.
Detailed Implementation
[0025] To explain in detail the technical content, objectives, and effects of the present invention, the following description is provided in conjunction with the embodiments and accompanying drawings.
[0026] Please refer to Figures 1 and 2. A method for reviewing video image content comprises the following steps:
[0027] S1. Divide video image content review into multiple sub-tasks, each identifying a specific type of violation. Based on the characteristics of each sub-task, set its recognition algorithm to either object detection or multi-label classification.
[0028] S2. Construct a neural network model that combines object detection and multi-label classification, comprising an object detection module, a multi-label classification module, and a shared backbone network, and train it by multi-task learning according to the set sub-tasks.
[0029] S3. Input the video image into the backbone network to extract image features of the video image at various dimensions.
[0030] S4. Input the image features into the object detection module and the multi-label classification module to determine whether the video image violates the rules and, if so, what kind of violation it is.
[0031] As can be seen from the above description, the beneficial effects of the present invention are as follows: the video image content review method and terminal decompose video content review into multiple sub-tasks and select a multi-label classification network or an object detection network according to the characteristics of each sub-task. The approach is fast, requires fewer computing resources, and has high accuracy, and the network model generalizes better.
[0032] Furthermore, the backbone network is specifically a TRexNet network, the object detection module is specifically a Transformer-based object detection head, and the multi-label classification module is specifically an ML-Decoder-based multi-label classification head.
[0033] The above gives the specific composition of each network. The TRexNet network can run in real time on chips with low computing power. The Transformer-based object detection head enables end-to-end object detection. The ML-Decoder-based multi-label classification head has a complexity of only O(n), which is better than the O(n^2) complexity of similar algorithms. The algorithms of the three main modules can be replaced according to the computing power and recognition-performance requirements of the application scenario. The selection in this example balances computing speed and recognition performance.
[0034] Furthermore, sub-tasks are divided by type into object detection tasks and multi-label classification tasks, specifically:
[0035] Recognition targets that are clearly defined objects and very small are assigned to the object detection sub-task, and recognition targets that are abstract behaviors or atmospheres are assigned to the multi-label classification sub-task.
[0036] As can be seen from the above description, the criteria for dividing the object detection subtask and the multi-label classification subtask have been given.
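The division criterion above can be sketched as a small routing function. This is only an illustration: the `SubTask` fields and the `choose_algorithm` name are hypothetical, and sending every sub-task that fails the detection criterion to multi-label classification is an assumption, not something the description states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubTask:
    """A recognition sub-task; the field names are illustrative only."""
    name: str
    clearly_defined_object: bool  # target is a concrete, well-defined object
    very_small: bool              # target usually occupies a small image area


def choose_algorithm(task: SubTask) -> str:
    """Route a sub-task to one of the two recognition algorithms."""
    if task.clearly_defined_object and task.very_small:
        return "object_detection"
    # Assumption: abstract behaviors or atmospheres, and anything else that
    # does not meet the detection criterion, go to multi-label classification.
    return "multi_label_classification"


if __name__ == "__main__":
    for task in (
        SubTask("flag", clearly_defined_object=True, very_small=True),
        SubTask("scene atmosphere", clearly_defined_object=False, very_small=False),
    ):
        print(task.name, "->", choose_algorithm(task))
```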
[0037] Furthermore, step S1 further divides the features into primary features and secondary features according to their importance, with the secondary features providing a basis for the identification of the primary features.
[0038] As described above, multi-task joint training lets the model focus on important features, because other tasks can provide evidence for less important ones. In addition, features that are difficult to learn on one task can be informed by other related tasks, so the model learns more general features, which leads to better final performance and robustness.
[0039] Furthermore, during training, the following multi-task learning steps are performed:
[0040] A1. Initialize the parameters of the backbone network with the parameters of a model pretrained on large-scale data;
[0041] A2. Fix the parameters of the backbone network and train the multi-label classification module;
[0042] A3. Fix the parameters of the backbone network and the multi-label classification module, and train the object detection module;
[0043] A4. Reduce the learning rates of the backbone network, the multi-label classification module, and the object detection module. Based on the numerical magnitudes of the loss functions of the object detection module and the multi-label classification module, set a weight for each of the two loss functions so that the two sub-networks learn at more consistent speeds. Then jointly train the entire fused neural network composed of the backbone network, the multi-label classification module, and the object detection module.
[0044] As can be seen from the above description, the entire training process of the network is presented.
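The staged schedule A1 to A4 can be written down as a table of which modules are updated in each stage. The sketch below is a minimal illustration under stated assumptions: the module names, the base learning rate, and the reduction factor are hypothetical values, since the description only says that the learning rates are reduced in A4.

```python
from typing import Dict, FrozenSet

MODULES = ("backbone", "classification_head", "detection_head")

# Modules that receive parameter updates in each stage; all others stay unchanged.
TRAINABLE: Dict[str, FrozenSet[str]] = {
    "A1": frozenset(),                         # only load pretrained backbone weights
    "A2": frozenset({"classification_head"}),  # backbone fixed
    "A3": frozenset({"detection_head"}),       # backbone and classification head fixed
    "A4": frozenset(MODULES),                  # joint fine-tuning of the whole network
}


def not_updated(stage: str) -> FrozenSet[str]:
    """Return the modules whose parameters are left unchanged in a stage."""
    return frozenset(MODULES) - TRAINABLE[stage]


def learning_rate(stage: str, base_lr: float = 1e-3, reduction: float = 0.1) -> float:
    """Return a reduced rate in A4; base_lr and reduction are assumed values."""
    return base_lr * reduction if stage == "A4" else base_lr


if __name__ == "__main__":
    for stage in ("A1", "A2", "A3", "A4"):
        print(stage, sorted(TRAINABLE[stage]), learning_rate(stage))
```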
[0045] A video image content review terminal includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it performs the following steps:
[0046] S1. Divide video image content review into multiple sub-tasks, each identifying a specific type of violation. Based on the characteristics of each sub-task, set its recognition algorithm to either object detection or multi-label classification.
[0047] S2. Construct a neural network model that combines object detection and multi-label classification, comprising an object detection module, a multi-label classification module, and a shared backbone network, and train it by multi-task learning according to the set sub-tasks.
[0048] S3. Input the video image into the backbone network to extract image features of the video image at various dimensions.
[0049] S4. Input the image features into the object detection module and the multi-label classification module to determine whether the video image violates the rules and, if so, what kind of violation it is.
[0050] As can be seen from the above description, the beneficial effects of the present invention are as follows: the video image content review method and terminal decompose video content review into multiple sub-tasks and select a multi-label classification network or an object detection network according to the characteristics of each sub-task. The approach is fast, requires fewer computing resources, and has high accuracy, and the network model generalizes better.
[0051] Furthermore, the backbone network of the feature extraction module is specifically a TRexNet network, the object detection module is specifically a Transformer-based object detection head, and the multi-label classification module is specifically an ML-Decoder-based multi-label classification head.
[0052] The above gives the specific composition of each network. The TRexNet network can run in real time on chips with low computing power. The Transformer-based object detection head enables end-to-end object detection. The ML-Decoder-based multi-label classification head has a complexity of only O(n), which is better than the O(n^2) complexity of similar algorithms. The algorithms of the three main modules can be replaced according to the computing power and recognition-performance requirements of the application scenario. The selection in this example balances computing speed and recognition performance.
[0053] Furthermore, based on the sub-task type, features are categorized into object detection task features and multi-label classification task features, specifically:
[0054] Recognition targets that are clearly defined objects and very small are assigned to the object detection sub-task, and recognition targets that are abstract behaviors or atmospheres are assigned to the multi-label classification sub-task.
[0055] As can be seen from the above description, the criteria for dividing the object detection subtask and the multi-label classification subtask have been given.
[0056] Furthermore, step S1 further divides the features into primary features and secondary features according to their importance, with the secondary features providing a basis for the identification of the primary features.
[0057] As described above, multi-task joint training lets the model focus on important features, because other tasks can provide evidence for less important ones. In addition, features that are difficult to learn on one task can be informed by other related tasks, so the model learns more general features, which leads to better final performance and robustness.
[0058] Furthermore, during training, the following multi-task learning steps are performed:
[0059] A1. Initialize the parameters of the backbone network with the parameters of a model pretrained on large-scale data;
[0060] A2. Fix the parameters of the backbone network and train the multi-label classification module;
[0061] A3. Fix the parameters of the backbone network and the multi-label classification module, and train the object detection module;
[0062] A4. Reduce the learning rates of the backbone network, the multi-label classification module, and the object detection module. Based on the numerical magnitudes of the loss functions of the object detection module and the multi-label classification module, set a weight for each of the two loss functions so that the two sub-networks learn at more consistent speeds. Then jointly train the entire fused neural network composed of the backbone network, the multi-label classification module, and the object detection module.
[0063] As can be seen from the above description, the entire training process of the network is presented.
[0064] This invention is used to review images and videos to determine whether they contain violations and, if so, what kind of violation.
[0065] Please refer to Figures 1 and 2. Embodiment 1 of the present invention is as follows:
[0066] In this embodiment, a fusion neural network based on multi-task learning is designed. This model integrates two deep learning algorithms: multi-label classification and object detection, supporting multiple tasks in video content review simultaneously within a single model. The model consists of three main parts: a CNN backbone network for image feature extraction, a Transformer network for object detection, and an ML-Decoder network for multi-label classification. The object detection network and the multi-label classification network share the same backbone network.
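The shared-backbone layout described in this paragraph can be illustrated with a toy structure in which one feature extraction pass feeds both heads. This is a structural sketch only: `FusionModel` and the three `toy_*` functions are hypothetical stand-ins written in plain Python, not the CNN backbone, Transformer detection network, or ML-Decoder network of the embodiment.

```python
from typing import Callable, Dict, List, Sequence

Features = List[float]


class FusionModel:
    """Two task heads that consume the output of a single shared backbone."""

    def __init__(
        self,
        backbone: Callable[[Sequence[float]], Features],
        detection_head: Callable[[Features], List[Dict]],
        classification_head: Callable[[Features], Dict[str, float]],
    ) -> None:
        self.backbone = backbone
        self.detection_head = detection_head
        self.classification_head = classification_head
        self.backbone_calls = 0  # counts feature extraction passes

    def forward(self, image: Sequence[float]) -> Dict[str, object]:
        self.backbone_calls += 1
        features = self.backbone(image)  # computed once, reused by both heads
        return {
            "detections": self.detection_head(features),
            "labels": self.classification_head(features),
        }


# Toy stand-ins so the sketch runs; real modules would be neural networks.
def toy_backbone(image: Sequence[float]) -> Features:
    return [sum(image) / len(image), max(image)]


def toy_detection_head(features: Features) -> List[Dict]:
    return [{"label": "flag", "score": features[1]}]


def toy_classification_head(features: Features) -> Dict[str, float]:
    return {"violation_scene": features[0]}


if __name__ == "__main__":
    model = FusionModel(toy_backbone, toy_detection_head, toy_classification_head)
    print(model.forward([0.2, 0.8]), "backbone passes:", model.backbone_calls)
```

Because both heads read the same `features` value, adding a head does not add a backbone pass, which is the source of the resource saving the description claims over running separate single-task models.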
[0067] A method for video image content review includes the following steps:
[0068] S1. Divide video image content review into multiple sub-tasks, each identifying a specific type of violation. Based on the characteristics of each sub-task, set its recognition algorithm to either object detection or multi-label classification.
[0069] S2. Construct a neural network model that combines object detection and multi-label classification, comprising an object detection module, a multi-label classification module, and a shared backbone network, and train it by multi-task learning according to the set sub-tasks.
[0070] S3. Input the video image into the backbone network to extract image features of the video image at various dimensions.
[0071] S4. Input the image features into the object detection module and the multi-label classification module to determine whether the video image violates the rules and, if so, what kind of violation it is.
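Step S4 ends in a per-frame verdict. One way the outputs of the two modules might be merged is sketched below; the `review_frame` function, the score thresholds, and the assumption that every predicted label corresponds to a violation type are all hypothetical, since the description does not specify how the two outputs are combined.

```python
from typing import Dict, List


def review_frame(
    detections: List[Dict],
    label_scores: Dict[str, float],
    det_threshold: float = 0.5,   # assumed confidence cut-off for detections
    cls_threshold: float = 0.5,   # assumed confidence cut-off for class labels
) -> Dict[str, object]:
    """Merge detection and multi-label outputs into one review verdict."""
    detected = {d["label"] for d in detections if d["score"] >= det_threshold}
    classified = {name for name, s in label_scores.items() if s >= cls_threshold}
    kinds = sorted(detected | classified)
    # A frame is flagged if either module reports at least one violation type.
    return {"violation": bool(kinds), "kinds": kinds}


if __name__ == "__main__":
    print(review_frame([{"label": "flag", "score": 0.9}], {"violation_scene": 0.2}))
    print(review_frame([], {"violation_scene": 0.1}))
```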
[0072] Specifically, video content review is broken down into multiple sub-tasks, and a multi-label classification algorithm or an object detection algorithm is selected based on the characteristics of each sub-task. Video content review generally includes identification of various types of violation content and identification of sensitive text. In this embodiment, it is further subdivided into more specific sub-tasks: different degrees of violation, different body parts, various flags, various symbols, faces, text, and so on. These sub-tasks are then fused into a single neural network for identification using a model fusion and multi-task learning strategy. For specific, clearly defined targets such as flags, object detection algorithms are used. For abstract targets, such as region image features of video frames (e.g., violation scene areas) and sentiment classification features of video frames, multi-label classification algorithms are used. Training data is collected and labeled; the same image may carry multiple category labels and bounding boxes of multiple categories.
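Since one training image may carry several category labels and several bounding boxes at once, an annotation record needs to hold both kinds of target. The sketch below shows one possible layout; the class names, field names, coordinate convention, and file name are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BoundingBox:
    category: str  # detection category, e.g. a flag type
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass
class AnnotatedFrame:
    image_path: str
    class_labels: List[str] = field(default_factory=list)  # multi-label targets
    boxes: List[BoundingBox] = field(default_factory=list)  # detection targets

    def multi_hot(self, vocabulary: List[str]) -> List[int]:
        """Encode the class labels as a multi-hot vector over a label vocabulary."""
        present = set(self.class_labels)
        return [1 if name in present else 0 for name in vocabulary]


if __name__ == "__main__":
    frame = AnnotatedFrame(
        "frame_0001.jpg",
        class_labels=["violation_scene"],
        boxes=[BoundingBox("flag", 10, 20, 50, 60)],
    )
    print(frame.multi_hot(["sentiment_negative", "violation_scene"]), len(frame.boxes))
```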
[0073] The video content review process is broken down into several sub-category recognition tasks. Object detection algorithms excel at detecting small objects, while multi-label classification algorithms are better at recognizing image subjects and abstract targets. On this basis, the most suitable algorithm type is selected for each sub-category, which makes full use of the strengths of each algorithm, improves the accuracy of each category, and ultimately improves the overall recognition accuracy.
[0074] A multi-task model achieves the effect of multiple single-task models. Only one model is run, avoiding repeated calculation of backbone network features, greatly reducing computing resource usage and significantly improving inference speed.
[0075] Multi-task joint training allows the model to focus on important features. Content moderation is a high-dimensional task with limited and difficult-to-collect data, and a single-task model struggles to distinguish correlations between features in such cases. In multi-task learning, however, features learned from multiple tasks coexist, which gives greater feature diversity. The interaction between features across tasks can provide evidence for less important features. Furthermore, features that are difficult to learn on one task can be informed by features learned from other related tasks, similar to how humans draw on knowledge from related tasks to generalize. For example, a classification algorithm can be used for violation areas, while a detection algorithm can be used for special clothing. Since both violation areas and special clothing appear near the human body, the model will focus on these areas. The two tasks can thus provide each other with additional useful information, which leads to more general features, prevents overfitting to a single task, and gives better model performance and robustness.
[0076] When training the above-mentioned fusion neural network based on multi-task learning, the following steps are performed:
[0077] A1. Initialize the parameters of the backbone network with the parameters of a model pretrained on large-scale data;
[0078] Specifically, the backbone parameters of a classification model trained on OpenImage are used to initialize the backbone network.
[0079] A2. Fix the parameters of the backbone network and train the multi-label classification module;
[0080] A3. Fix the parameters of the backbone network and the multi-label classification module, and train the object detection module;
[0081] A4. Reduce the learning rates of the backbone network, the multi-label classification module, and the object detection module. Based on the numerical magnitudes of the loss functions of the object detection module and the multi-label classification module, set a weight for each of the two loss functions so that the two sub-networks learn at more consistent speeds. Then jointly train the entire fused neural network composed of the backbone network, the multi-label classification module, and the object detection module.
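Step A4 sets loss weights from the numerical magnitudes of the two losses but gives no formula. The sketch below uses inverse-magnitude weighting, normalized to sum to one, as one plausible reading; this rule, the function names, and the `eps` constant are assumptions rather than the patent's stated method.

```python
from typing import Tuple


def balance_weights(det_loss: float, cls_loss: float, eps: float = 1e-8) -> Tuple[float, float]:
    """Return (w_det, w_cls), each inversely proportional to its loss magnitude."""
    inv_det = 1.0 / (det_loss + eps)  # eps guards against division by zero
    inv_cls = 1.0 / (cls_loss + eps)
    total = inv_det + inv_cls
    return inv_det / total, inv_cls / total


def combined_loss(det_loss: float, cls_loss: float, w_det: float, w_cls: float) -> float:
    """Weighted sum used as the joint training objective."""
    return w_det * det_loss + w_cls * cls_loss


if __name__ == "__main__":
    # Example magnitudes only; real values would be measured during training.
    w_det, w_cls = balance_weights(det_loss=4.0, cls_loss=1.0)
    print(w_det, w_cls, combined_loss(4.0, 1.0, w_det, w_cls))
```

With this choice the two weighted terms contribute roughly equally to the objective, so neither sub-network dominates the gradient simply because its loss has a larger numerical scale.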
[0082] Please refer to Figure 3. Embodiment 2 of the present invention is as follows:
[0083] A video image content review terminal 1 includes a memory 3, a processor 2, and a computer program stored on the memory 3 and executable on the processor 2. When the processor 2 executes the computer program, it implements the steps of the above embodiment 1.
[0084] In summary, the video image content review method and terminal provided by this invention decompose video content review into multiple sub-tasks and select a multi-label classification network or an object detection network based on the characteristics of each sub-task. The approach is fast, requires fewer computing resources, and has high accuracy, and the network model generalizes better.
[0085] The above description is merely an embodiment of the present invention and does not limit the patent scope of the present invention. Any equivalent modifications made based on the content of the present invention specification and drawings, or direct or indirect applications in related technical fields, are similarly included within the patent protection scope of the present invention.
Claims
1. A method of video image content review, comprising the following steps: S1. The video image content review is divided into multiple sub-tasks for identifying specific violation types. Based on the characteristics of the sub-tasks, the identification algorithm used for the sub-tasks is set to be either object detection or multi-label classification. Specifically, the step of setting the identification algorithm used for the sub-tasks to be either object detection or multi-label classification based on the characteristics of the sub-tasks includes: classifying the identification targets that are clearly defined objects and are very small into object detection sub-tasks, and classifying the identification targets that are abstract behaviors or atmospheres into multi-label classification sub-tasks. S2. Construct a neural network model that combines object detection and multi-label classification. The neural network model includes an object detection module, a multi-label classification module, and a shared backbone network. Train the neural network model for multi-task learning according to the defined sub-tasks. The multi-task learning training specifically includes the following steps: A1. Initialize the parameters of the backbone network using the parameters of a model trained on big data; A2. Fix the parameters of the backbone network and train the multi-label classification module; A3. Fix the parameters of the backbone network and the multi-label classification module, and train the object detection module; A4. Reduce the learning rates of the backbone network, the multi-label classification module, and the object detection module. Based on the magnitudes of the loss functions of the object detection module and the multi-label classification module, set weights for the loss functions of the multi-label classification module and the object detection module to make the learning speeds of the two sub-networks more consistent. At the same time, train the entire fusion neural network composed of the backbone network, the multi-label classification module, and the object detection module. S3. Input the video image into the backbone network to extract image features of various dimensions of the video image; S4. Input the image features into the object detection module and the multi-label classification module to determine whether the video image violates the rules and what kind of violation it is.
2. The method of claim 1, wherein the backbone network is specifically a TRexNet network, the object detection module is specifically a Transformer-based object detection head, and the multi-label classification module is specifically an ML-Decoder-based multi-label classification head.
3. The method of claim 1, wherein step S1 further divides the features into primary features and secondary features according to their importance, with the secondary features providing a basis for the identification of the primary features.
4. A video image content review terminal comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that, when the processor executes the computer program, it performs the following steps: S1. The video image content review is divided into multiple sub-tasks for identifying specific violation types. Based on the characteristics of the sub-tasks, the identification algorithm used for the sub-tasks is set to be either object detection or multi-label classification. Specifically, the step of setting the identification algorithm used for the sub-tasks to be either object detection or multi-label classification based on the characteristics of the sub-tasks includes: classifying the identification targets that are clearly defined objects and are very small into object detection sub-tasks, and classifying the identification targets that are abstract behaviors or atmospheres into multi-label classification sub-tasks. S2. Construct a neural network model that combines object detection and multi-label classification. The neural network model includes an object detection module, a multi-label classification module, and a shared backbone network. Train the neural network model for multi-task learning according to the defined sub-tasks. The multi-task learning training specifically includes the following steps: A1. Initialize the parameters of the backbone network using the parameters of a model trained on big data; A2. Fix the parameters of the backbone network and train the multi-label classification module; A3. Fix the parameters of the backbone network and the multi-label classification module, and train the object detection module; A4. Reduce the learning rates of the backbone network, the multi-label classification module, and the object detection module. Based on the magnitudes of the loss functions of the object detection module and the multi-label classification module, set weights for the loss functions of the multi-label classification module and the object detection module to make the learning speeds of the two sub-networks more consistent. At the same time, train the entire fusion neural network composed of the backbone network, the multi-label classification module, and the object detection module. S3. Input the video image into the backbone network to extract image features of various dimensions of the video image; S4. Input the image features into the object detection module and the multi-label classification module to determine whether the video image violates the rules and what kind of violation it is.
5. The video image content review terminal of claim 4, wherein the backbone network is specifically a TRexNet network, the object detection module is specifically a Transformer-based object detection head, and the multi-label classification module is specifically an ML-Decoder-based multi-label classification head.
6. The video image content review terminal according to claim 4, wherein step S1 further divides the features into primary features and secondary features according to their importance, with the secondary features providing a basis for the identification of the primary features.