Abnormal video recognition model training method, abnormal video recognition method and device
By training an abnormal video recognition model and combining multimodal feature representation and multi-task learning, the problem of time-consuming and labor-intensive manual review was solved, and efficient risk video recognition was achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING DAJIA INTERNET INFORMATION TECH CO LTD
- Filing Date
- 2022-10-14
- Publication Date
- 2026-06-12
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of Internet technology, and in particular to a training method for an abnormal video recognition model, an abnormal video recognition method, an apparatus, an electronic device, a storage medium, and a program product.
Background Technology
[0002] With the rapid development of the internet, an increasing number of video resources have emerged on online platforms, such as short video platforms and various self-media platforms. All platforms need to review this video data to ensure that risky videos are not leaked, reduce the impact on public opinion, and ensure the healthy development of the platform's ecosystem.
[0003] Currently, most methods for identifying risky videos rely on manual review. However, the number of videos uploaded to various platforms every day is enormous, and relying solely on manual review would consume a significant amount of time and manpower.
Summary of the Invention
[0004] This disclosure provides a training method for an abnormal video recognition model, an abnormal video recognition method, an apparatus, an electronic device, a storage medium, and a program product, to at least solve the problem that the review of risky videos in related technologies requires a large amount of time and manpower. The technical solution of this disclosure is as follows:
[0005] According to a first aspect of the present disclosure, a method for training an abnormal video recognition model is provided, comprising:
[0006] Acquiring each video frame, video text information, local images, a risk label, a content label, and supervisory text information of a sample video, wherein the risk label is obtained based on the supervisory text information, and the content label is obtained based on the video text information;
[0007] The abnormal video recognition model to be trained encodes each video frame, the video text information, and the local image of the sample video to obtain the video features of the sample video; the video features are then recognized to obtain the predicted risk information and predicted content information of the sample video.
[0008] Based on the first difference information between the predicted risk information and the risk label, the second difference information between the predicted content information and the content label, and the third difference information between the video features and the supervisory text features of the supervisory text information, the abnormal video recognition model to be trained is trained to obtain the trained abnormal video recognition model.
[0009] In one exemplary embodiment, the abnormal video recognition model to be trained includes a spatiotemporal visual encoder, a region-level visual encoder, a text encoder, and a multimodal encoder;
[0010] Encoding each video frame, the video text information, and the local image of the sample video using the abnormal video recognition model to be trained, to obtain the video features of the sample video, includes:
[0011] The spatiotemporal visual encoder is used to encode each video frame of the sample video to obtain the initial video features of the sample video.
[0012] The local image is encoded using the region-level visual encoder to obtain the local image features of the sample video.
[0013] The text encoder is used to encode the video text information to obtain the text features of the sample video.
[0014] The multimodal encoder fuses the initial video features, the local image features, and the text features to obtain the video features of the sample video.
[0015] In one exemplary embodiment, the method further includes:
[0016] The local detection model is used to detect each video frame of the sample video to obtain the detection box corresponding to the local object in each video frame.
[0017] Based on the detection box, each video frame is cropped to obtain a local image of the sample video.
[0018] In one exemplary embodiment, the method further includes a training process for the local detection model, the training process of the local detection model including:
[0019] Acquire a sample image; the sample image is labeled with category labels and detection box labels, the category labels including object labels of multiple local objects;
[0020] The sample image is processed by an initial local detection model to obtain the image features of the sample image; and the image features are then classified to obtain the predicted category and predicted detection box of the sample image.
[0021] Based on the difference information between the predicted category and the category label, and the difference information between the predicted detection box and the detection box label, the initial local detection model is trained to obtain the trained local detection model.
[0022] In an exemplary embodiment, training the abnormal video recognition model to be trained based on the first difference information between the predicted risk information and the risk label, the second difference information between the predicted content information and the content label, and the third difference information between the video features and the supervisory text features of the supervisory text information, to obtain the trained abnormal video recognition model, includes:
[0023] A first loss value is obtained based on the first difference information, a second loss value is obtained based on the second difference information, and a third loss value is obtained based on the third difference information;
[0024] The total loss is obtained based on the first loss value, the second loss value, and the third loss value;
[0025] Based on the total loss, the abnormal video recognition model to be trained is trained to obtain the trained abnormal video recognition model.
[0026] In one exemplary embodiment, the sample video includes multiple videos, and the third difference information includes positive sample difference information and negative sample difference information; the method further includes:
[0027] The method acquires positive sample difference information between the video features of the target sample video and the supervised text features of the supervised text information of the target sample video, and negative sample difference information between the video features of the target sample video and the supervised text features of the supervised text information of other sample videos; the target sample video is any one of the plurality of videos, and the other sample videos are videos other than the target sample video;
[0028] With the objectives of reducing the positive sample difference information, increasing the negative sample difference information, and reducing the total loss, the abnormal video recognition model to be trained is trained to obtain the trained abnormal video recognition model.
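The positive-sample and negative-sample objective in [0027] and [0028] resembles a contrastive objective. The following is a minimal sketch in Python, not the disclosed implementation. It assumes cosine similarity as the inverse of the "difference information" and an InfoNCE-style softmax loss with a temperature; neither choice is specified in this disclosure, and the names `cosine` and `contrastive_loss` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(video_feats, text_feats, temperature=0.1):
    """Average InfoNCE-style loss over a batch of sample videos.

    video_feats[i] and text_feats[i] come from the same sample video
    (positive pair); text_feats[j] with j != i serve as negatives.
    Assumption: higher similarity stands for lower "difference".
    """
    losses = []
    for i, v in enumerate(video_feats):
        logits = [cosine(v, t) / temperature for t in text_feats]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Matching pairs on the diagonal give a small loss; swapped pairs a large one.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing this value pulls each video feature toward its own supervisory text feature and pushes it away from the supervisory text features of the other sample videos in the batch.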
[0029] In one exemplary embodiment, the method further includes a method for determining content tags and risk tags in the sample data, the method for determining content tags and risk tags in the sample data including:
[0030] Collect sample videos and their supervised text information from the video platform;
[0031] The supervisory text information of the sample video is identified and processed to obtain the risk label of the sample video;
[0032] The video text information of the sample video is obtained, and the video text information of the sample video is recognized and processed to obtain the content tag of the sample video.
[0033] According to a second aspect of the present disclosure, an abnormal video recognition method is provided, comprising:
[0034] Acquire individual video frames, video text information, and local images of the video to be identified;
[0035] An abnormal video recognition model encodes each video frame, the video text information, and the local image to obtain video features of the video to be identified. These video features are then analyzed to obtain predicted risk information and predicted content information for the video to be identified. The abnormal video recognition model is trained using each video frame, video text information, and local image of a sample video as input, and the risk label, content label, and supervisory text information of the sample video as supervisory information. The risk label is obtained based on the supervisory text information, and the content label is obtained based on the video text information.
[0036] Based on the predicted risk information and predicted content information, the anomaly identification result for the video to be identified is determined.
[0037] In an exemplary embodiment, determining the anomaly identification result for the video to be identified based on the predicted risk information and the predicted content information includes:
[0038] The predicted content information is matched with preset risk content information to obtain the matching result;
[0039] Based on the matching results and the predicted risk information, an anomaly identification result is determined for the video to be identified.
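The matching step in [0038] and [0039] can be illustrated with a short sketch. This disclosure does not state how the matching result and the predicted risk information are combined, so the rule below (both a risk score above a threshold and at least one matched content tag) is an assumption, as are the function name `identify_anomaly`, the threshold value, and the tag `restricted_topic`.

```python
def identify_anomaly(predicted_risk, predicted_content, risk_content, risk_threshold=0.5):
    """Combine content matching with the predicted risk score.

    predicted_risk: float in [0, 1], the model's risk score (assumed form).
    predicted_content: iterable of predicted content tags.
    risk_content: preset collection of content tags regarded as risky.
    Returns the matched tags and a final abnormal/normal decision.
    """
    matched = sorted(set(predicted_content) & set(risk_content))
    # Assumed rule: flag only when both signals agree.
    is_abnormal = predicted_risk >= risk_threshold and bool(matched)
    return {"matched_tags": matched, "abnormal": is_abnormal}

preset = {"restricted_topic"}
flagged = identify_anomaly(0.9, ["pets", "restricted_topic"], preset)
no_match = identify_anomaly(0.9, ["pets"], preset)
low_risk = identify_anomaly(0.2, ["restricted_topic"], preset)
```

A platform that prefers recall over precision could replace the `and` with `or`; the choice depends on review policy rather than on anything stated here.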
[0040] According to a third aspect of the present disclosure, a training apparatus for an abnormal video recognition model is provided, comprising:
[0041] The acquisition unit is configured to acquire each video frame, video text information, local images, risk tags, content tags, and supervision text information of the sample video, wherein the risk tags are obtained based on the supervision text information, and the content tags are obtained based on the video text information;
[0042] The prediction unit is configured to perform encoding processing on each video frame, the video text information, and the local image of the sample video using an abnormal video recognition model to be trained, to obtain the video features of the sample video; and to perform recognition processing on the video features to obtain the predicted risk information and predicted content information of the sample video.
[0043] The training unit is configured to train the abnormal video recognition model to be trained based on a first difference information between the predicted risk information and the risk label, a second difference information between the predicted content information and the content label, and a third difference information between the video features and the supervised text features of the supervised text information, so as to obtain a trained abnormal video recognition model.
[0044] In one exemplary embodiment, the abnormal video recognition model to be trained includes a spatiotemporal visual encoder, a region-level visual encoder, a text encoder, and a multimodal encoder;
[0045] The prediction unit is further configured to perform the following operations: encoding each video frame of the sample video using the spatiotemporal visual encoder to obtain initial video features of the sample video; encoding the local image using the region-level visual encoder to obtain local image features of the sample video; encoding the video text information using the text encoder to obtain text features of the sample video; and fusing the initial video features, the local image features, and the text features using the multimodal encoder to obtain video features of the sample video.
[0046] In one exemplary embodiment, the apparatus further includes a local image determination unit configured to perform detection processing on each video frame of the sample video using a local detection model to obtain detection boxes corresponding to local objects in each video frame; and to perform cropping processing on each video frame based on the detection boxes to obtain a local image of the sample video.
[0047] In one exemplary embodiment, the apparatus further includes a detection model training unit configured to acquire a sample image; the sample image is labeled with category labels and detection box labels, the category labels including object labels of multiple local objects; the sample image is processed by an initial local detection model to obtain image features of the sample image; the image features are classified to obtain predicted categories and predicted detection boxes of the sample image; the initial local detection model is trained based on the difference information between the predicted categories and the category labels, and the difference information between the predicted detection boxes and the detection box labels, to obtain a trained local detection model.
[0048] In an exemplary embodiment, the training unit is further configured to perform the following operations: obtaining a first loss value based on the first difference information, obtaining a second loss value based on the second difference information, and obtaining a third loss value based on the third difference information; obtaining a total loss based on the first loss value, the second loss value, and the third loss value; and training the abnormal video recognition model to be trained based on the total loss to obtain a trained abnormal video recognition model.
[0049] In an exemplary embodiment, the sample video includes multiple videos, and the third difference information includes positive sample difference information and negative sample difference information; the training unit is further configured to acquire positive sample difference information between the video features of the target sample video and the supervised text features of the supervised text information of the target sample video, and to acquire negative sample difference information between the video features of the target sample video and the supervised text features of the supervised text information of other sample videos; the target sample video is any one of the multiple videos, and the other sample videos are videos other than the target sample video; the abnormal video recognition model to be trained is trained with the objectives of reducing the positive sample difference information, increasing the negative sample difference information, and reducing the total loss, to obtain a trained abnormal video recognition model.
[0050] In an exemplary embodiment, the acquisition unit is further configured to perform the following actions: acquiring sample videos and supervisory text information of the sample videos from a video platform; performing identification processing on the supervisory text information of the sample videos to obtain risk tags for the sample videos; acquiring video text information of the sample videos; and performing identification processing on the video text information of the sample videos to obtain content tags for the sample videos.
[0051] According to a fourth aspect of the present disclosure, an abnormal video recognition device is provided, comprising:
[0052] The acquisition unit is configured to acquire individual video frames, video text information, and local images of the video to be identified;
[0053] The prediction unit is configured to perform encoding processing on each video frame, the video text information, and the local image using an abnormal video recognition model to obtain video features of the video to be identified; analyze the video features to obtain predicted risk information and predicted content information of the video to be identified; the abnormal video recognition model is trained by taking each video frame, video text information, and local image of a sample video as input, and using the risk label, content label, and supervision text information of the sample video as supervision information; the risk label is obtained based on the supervision text information, and the content label is obtained based on the video text information;
[0054] The identification unit is configured to determine an anomaly identification result for the video to be identified based on the predicted risk information and the predicted content information.
[0055] In an exemplary embodiment, the identification unit is further configured to perform matching of the predicted content information with preset risk content information to obtain a matching result; and to determine an anomaly identification result for the video to be identified based on the matching result and the predicted risk information.
[0056] According to a fifth aspect of the present disclosure, an electronic device is provided, comprising:
[0057] a processor;
[0058] a memory for storing instructions executable by the processor;
[0059] wherein the processor is configured to execute the instructions to implement any of the methods described above.
[0060] According to a sixth aspect of the present disclosure, a computer-readable storage medium is provided. When instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the method described in any of the preceding claims.
[0061] According to a seventh aspect of the present disclosure, a computer program product is provided, the computer program product including instructions that, when executed by a processor of an electronic device, enable the electronic device to perform the method as described in any of the preceding claims.
[0062] The technical solutions provided by the embodiments of this disclosure bring at least the following beneficial effects:
[0063] Training an abnormal video recognition model and then using it to identify subsequent videos can improve recognition efficiency and reduce manpower consumption. The model uses the video frames, video text information, and local images as the representation of the sample video. This multi-scale, multi-modal feature representation improves the accuracy of the sample video representation, thereby enhancing the accuracy of the abnormal video recognition model trained on multi-modal features. The model is trained based on the first difference between the predicted risk information and the risk labels, the second difference between the predicted content information and the content labels, and the third difference between the video features and the supervisory text features. In this multi-task learning approach the tasks complement each other, which further enhances the model's overall ability to understand the abnormal content of abnormal videos.
[0064] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this disclosure.
Attached Figure Description
[0065] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this disclosure and, together with the description, serve to explain the principles of this disclosure, and are not intended to unduly limit this disclosure.
[0066] Figure 1 is a flowchart illustrating a training method for an abnormal video recognition model according to an exemplary embodiment.
[0067] Figure 2 is a schematic diagram of the model structure of an abnormal video recognition model according to an exemplary embodiment.
[0068] Figure 3 is a flowchart illustrating the training process of a local detection model according to an exemplary embodiment.
[0069] Figure 4 is a schematic diagram of the model structure of a local detection model according to an exemplary embodiment.
[0070] Figure 5 is a schematic diagram illustrating the relationship between risk labels and content labels according to an exemplary embodiment.
[0071] Figure 6 is a flowchart illustrating an abnormal video recognition method according to an exemplary embodiment.
[0072] Figure 7 is a schematic diagram illustrating video recognition using an abnormal video recognition model according to an exemplary embodiment.
[0073] Figure 8 is a structural block diagram illustrating a training apparatus for an abnormal video recognition model according to an exemplary embodiment.
[0074] Figure 9 is a structural block diagram of an abnormal video recognition device according to an exemplary embodiment.
[0075] Figure 10 is a block diagram illustrating an electronic device according to an exemplary embodiment.
Detailed Implementation
[0076] To enable those skilled in the art to better understand the technical solutions of this disclosure, the technical solutions in the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings.
[0077] It should be noted that the embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this disclosure as detailed in the appended claims.
[0078] It should be noted that the terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this disclosure described herein can be implemented in orders other than those illustrated or described herein. It should also be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for display, data used for analysis, etc.) involved in this disclosure are all information and data authorized by the user or fully authorized by all parties.
[0079] In one exemplary embodiment, as shown in Figure 1, a training method for an abnormal video recognition model is provided. This embodiment is described with the method applied to a terminal. It is understood that the method can also be applied to a server, or to a system including both a terminal and a server, where it is implemented through interaction between the terminal and the server. The terminal can be, but is not limited to, various personal computers, laptops, smartphones, tablets, IoT devices, and portable wearable devices. IoT devices can include smart speakers, smart TVs, smart air conditioners, smart in-vehicle devices, etc. Portable wearable devices can include smartwatches, smart bracelets, head-mounted devices, etc. The server can be implemented as a standalone server or as a server cluster consisting of multiple servers. In this embodiment, the method includes the following steps:
[0080] In step S110, each video frame, video text information, local image, risk label, content label, and supervision text information of the sample video are obtained. The risk label is obtained based on the supervision text information, and the content label is obtained based on the video text information.
[0081] The video text information may include the title, topic, and text description of the sample video, as well as text obtained by applying OCR (Optical Character Recognition) to the sample video. Together, these express the content of the video.
[0082] A local image refers to an image of a local area in a video frame. A local image can be extracted from a video frame, or multiple local images may be extracted, or no local image may be extracted.
[0083] The risk label indicates whether the video content is risky. It can be a binary label, such as "risky" or "no risk". The risk label can also be expressed as a degree of risk, for example, a degree of risk of 80.
[0084] Content tags refer to tags representing the content of the video itself. For example, content tags could be pets, cooking tutorials, etc.
[0085] The supervisory text information describes the problems present in the pre-collected sample videos.
[0086] In practice, historical videos and their supervisory text information can be collected from a video platform, and the video text information of the historical videos can be obtained. Risk tags for the historical videos can be obtained by mining the supervisory text information, and content tags can be obtained by mining the video text information. Local images are obtained by performing local region detection on each video frame of the historical videos. The historical videos, their supervisory text information, each video frame, the video text information, the local images, the risk tags, and the content tags are then combined to form the sample data.
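As an illustration of how one item of sample data might be assembled, the sketch below builds a record from a historical video. This disclosure does not say how the supervisory text or the video text is mined, so simple keyword and vocabulary matching is used as a stand-in; `RISK_KEYWORDS`, `SampleRecord`, and the function names are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical keyword list; a stand-in for the unspecified mining step.
RISK_KEYWORDS = ("violation", "misleading", "inappropriate")

@dataclass
class SampleRecord:
    frames: list
    video_text: str
    local_images: list
    supervisory_text: str
    risk_label: int = 0
    content_labels: list = field(default_factory=list)

def mine_risk_label(supervisory_text):
    """Return 1 if the supervisory text mentions any risk keyword, else 0."""
    text = supervisory_text.lower()
    return int(any(keyword in text for keyword in RISK_KEYWORDS))

def mine_content_labels(video_text, vocabulary):
    """Return the vocabulary tags that literally occur in the video text."""
    text = video_text.lower()
    return [tag for tag in vocabulary if tag in text]

def build_sample(frames, video_text, local_images, supervisory_text, vocabulary):
    """Combine inputs and mined labels into one training record."""
    return SampleRecord(
        frames=frames,
        video_text=video_text,
        local_images=local_images,
        supervisory_text=supervisory_text,
        risk_label=mine_risk_label(supervisory_text),
        content_labels=mine_content_labels(video_text, vocabulary),
    )

sample = build_sample(
    frames=["frame_0"],
    video_text="Cooking tutorials for pets",
    local_images=[],
    supervisory_text="Reported: misleading title",
    vocabulary=["pets", "cooking tutorials"],
)
```

In a real pipeline the two mining functions would likely be replaced by trained text classifiers; the record layout is the part this sketch is meant to show.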
[0087] In step S120, the abnormal video recognition model to be trained encodes each video frame, video text information, and local image of the sample video to obtain the video features of the sample video; the video features are then recognized to obtain the predicted risk information and predicted content information of the sample video.
[0088] In practice, a single-stream mode can be used to encode each video frame, video text information, and local images of the sample video to obtain the video features of the sample video. Alternatively, a multi-stream mode can be used to encode each video frame, video text information, and local images of the sample video to obtain the video features of the sample video.
[0089] More specifically, in single-stream mode, the individual video frames, video text information, and local images of the sample video can be fused first, and the fusion result can be input into an encoder to obtain the video features of the sample video.
[0090] In multi-stream mode, three encoders can encode each video frame, video text information, and local image of the sample video respectively. Then, a fusion module can fuse the outputs of each encoder to obtain the video features of the sample video.
[0091] Therefore, by using the single-stream or multi-stream mode described above, video features that integrate multi-scale / multi-modal information of the sample video can be obtained, which can improve the ability to capture local risks. Furthermore, by performing recognition processing on the video features, the predicted risk information and predicted content information of the sample video can be obtained.
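The difference between the two modes is only the order of fusion and encoding. The sketch below shows that data flow with a toy stand-in encoder (mean pooling over token vectors). Real encoders would be neural networks; the pooling, the fusion by concatenation, and the function names are assumptions made for illustration.

```python
def toy_encoder(tokens, dim):
    """Stand-in encoder: mean-pools token vectors into one feature vector.

    Returns a zero vector when a modality is empty, for example when
    no local image was extracted from the video.
    """
    if not tokens:
        return [0.0] * dim
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]

def single_stream(frame_tokens, text_tokens, local_tokens, dim):
    """Fuse first (join the token sequences), then encode once."""
    fused_input = frame_tokens + text_tokens + local_tokens
    return toy_encoder(fused_input, dim)

def multi_stream(frame_tokens, text_tokens, local_tokens, dim):
    """Encode each modality separately, then fuse the encoder outputs."""
    outputs = [
        toy_encoder(frame_tokens, dim),
        toy_encoder(text_tokens, dim),
        toy_encoder(local_tokens, dim),
    ]
    # Fusion by concatenation is an assumption; a learned fusion module
    # would normally take its place.
    return [value for output in outputs for value in output]

frames = [[1.0, 3.0], [3.0, 5.0]]
text = [[0.0, 2.0]]
local = []  # no local image extracted for this video

single = single_stream(frames, text, local, dim=2)
multi = multi_stream(frames, text, local, dim=2)
```

In this toy setup the single-stream result has the encoder's output size, while the multi-stream result keeps one slice per modality, so a missing modality stays visible as a zero slice.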
[0092] In step S130, the abnormal video recognition model to be trained is trained based on the first difference information between the predicted risk information and the risk label, the second difference information between the predicted content information and the content label, and the third difference information between the video features and the supervision text features of the supervision text information, so as to obtain the trained abnormal video recognition model.
[0093] In practice, after the abnormal video recognition model outputs the predicted risk information and predicted content information of the sample video, the output results can be compared with the supervision information to obtain the difference information between the prediction results and the supervision information. Based on the difference information, the model parameters of the abnormal video recognition model to be trained can be adjusted.
[0094] More specifically, the predicted risk information in the output results can be compared with the risk labels to obtain the first difference information, the predicted content information in the output results can be compared with the content labels to obtain the second difference information, and the video features of the sample video can be compared with the supervised text features of the supervised text information to obtain the third difference information. Based on the first difference information, the second difference information, and the third difference information, the total loss is obtained. The abnormal video recognition model to be trained is trained based on the total loss until the preset number of iterations or loss accuracy is reached, and the trained abnormal video recognition model is obtained.
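One way to turn the three pieces of difference information into a total loss is a weighted sum. The sketch below assumes binary cross-entropy for the risk task, averaged per-tag binary cross-entropy for the content task, and an externally computed alignment loss for the third term. The loss functions, the equal default weights, and the name `total_loss` are assumptions; this disclosure only states that the total loss is obtained from the three loss values.

```python
import math

def binary_cross_entropy(prob, label, eps=1e-7):
    """Binary cross-entropy for one probability and one 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def total_loss(risk_prob, risk_label, content_probs, content_labels,
               alignment_loss, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three loss values.

    first:  predicted risk information vs. risk label
    second: predicted content information vs. content labels (multi-label)
    third:  video features vs. supervisory text features (precomputed)
    """
    first = binary_cross_entropy(risk_prob, risk_label)
    second = sum(
        binary_cross_entropy(p, y) for p, y in zip(content_probs, content_labels)
    ) / len(content_probs)
    third = alignment_loss
    w1, w2, w3 = weights
    return w1 * first + w2 * second + w3 * third

uncertain = total_loss(0.5, 1, [0.5, 0.5], [1, 0], alignment_loss=0.0)
confident = total_loss(0.9, 1, [0.9, 0.1], [1, 0], alignment_loss=0.0)
```

Training would repeat a parameter update on this value until the preset number of iterations or the target loss is reached.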
[0095] In the training method described above, training an abnormal video recognition model and using it for subsequent video recognition can improve recognition efficiency and reduce manpower consumption. Representing the sample video by its video frames, video text information, and local images gives a multi-scale, multi-modal feature representation that improves the accuracy of the sample video representation, and thereby the accuracy of the abnormal video recognition model trained on those multi-modal features. The model is trained based on the first difference between the predicted risk information and the risk labels, the second difference between the predicted content information and the content labels, and the third difference between the video features and the supervisory text features. In this multi-task learning approach the tasks complement each other, which further enhances the model's overall ability to understand the abnormal content of abnormal videos.
[0096] In one exemplary embodiment, the abnormal video recognition model to be trained includes a spatiotemporal visual encoder, a region-level visual encoder, a text encoder, and a multimodal encoder;
[0097] The above step S120 can be implemented through the following steps:
[0098] Step S1201: Each video frame of the sample video is encoded using the spatiotemporal visual encoder to obtain the initial video features of the sample video;
[0099] Step S1202: The local image is encoded using the region-level visual encoder to obtain the local image features of the sample video;
[0100] Step S1203: The video text information is encoded using the text encoder to obtain the text features of the sample video;
[0101] Step S1204: The initial video features, local image features, and text features are fused using the multimodal encoder to obtain the video features of the sample video.
[0102] The vision branch (Vision Encoder) in the spatiotemporal visual encoder and the region-level visual encoder can be a base model from the ResNet (deep residual network) or EfficientNet series.
[0103] The text encoder can be a model from the BERT family.
[0104] The multimodal encoder can be a Transformer (a model with self-attention as its main component), an MFH (Multi-Faceted Hierarchical, a multi-view hierarchical multi-task learning model), or another model.
[0105] Figure 2 is a schematic diagram of the model structure of an abnormal video recognition model according to an exemplary embodiment, corresponding to the multi-stream mode of determining the video features of the sample video. As shown in Figure 2, in multi-stream mode, each video frame of the sample video is input to the spatiotemporal visual encoder to obtain initial video features. Local images of the sample video are input to the region-level visual encoder to obtain local image features. The video text information of the sample video is input to the text encoder to obtain text features. The initial video features, local image features, and text features of the sample video are then input to the multimodal encoder, which performs fusion processing to obtain the video features of the sample video. The multimodal encoder can then be connected to a multi-classification module for multi-label prediction, i.e., prediction of risk information and content information.
[0106] In this embodiment, multiple encoders process the information of each modality of the sample video separately, and a multimodal encoder then fuses the features of each modality, making the video features of the sample video more accurate. This multimodal framework using multi-scale features can also improve the ability of the abnormal video recognition model to capture local risks.
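The multi-label prediction step on top of the fused video features can be sketched as two linear heads followed by a sigmoid. The head structure, the 0.5 threshold, the weight values, and the names `linear` and `predict` are assumptions made for illustration; the disclosure only states that a multi-classification module predicts risk information and content information.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def linear(features, weights, bias):
    """One output per weight row: dot(row, features) + bias."""
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]

def predict(video_features, risk_head, content_head, content_tags, threshold=0.5):
    """Return a risk score and the content tags whose score passes the threshold."""
    risk_logit = linear(video_features, *risk_head)[0]
    content_logits = linear(video_features, *content_head)
    risk_score = sigmoid(risk_logit)
    tags = [
        tag for tag, logit in zip(content_tags, content_logits)
        if sigmoid(logit) >= threshold
    ]
    return risk_score, tags

# Toy weights chosen by hand; a trained model would learn these.
risk_head = ([[2.0, 0.0]], [0.0])
content_head = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
risk_score, tags = predict([1.0, -1.0], risk_head, content_head,
                           ["pets", "cooking tutorials"])
```

Because each content tag gets its own sigmoid, a video can receive several tags at once or none, which is what distinguishes multi-label prediction from single-class softmax classification.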
[0107] In an exemplary embodiment, the local images of the sample video are obtained in the following manner:
[0108] The local detection model is used to detect each video frame of the sample video and obtain the detection box corresponding to the local object in each video frame.
[0109] Based on the detection bounding box, each video frame is cropped to obtain a local image of the sample video.
[0110] The detection box corresponds to the area covered by a local object in a video frame.
[0111] Specifically, referring to Figure 2, the local detection model can be used to detect and process each video frame of the sample video to obtain the detection boxes corresponding to the local objects in each video frame. According to the position and size of the detection boxes in each video frame, the corresponding video frame is cropped to obtain the local images in that video frame, which together form the local images of the sample video.
[0112] In one exemplary embodiment, as shown in Figure 3, the training process of the local detection model includes:
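The cropping step can be sketched as follows, purely for illustration. The frame is assumed to be a two-dimensional list of pixel values and each detection box an `(x1, y1, x2, y2)` tuple in pixel coordinates with exclusive right and bottom edges; the function name `crop_local_images` and these conventions are assumptions, not details given in the disclosure.

```python
def crop_local_images(frame, boxes):
    """Crop one local image per detection box from a single video frame.

    frame: 2D list of pixels (rows, then columns).
    boxes: iterable of (x1, y1, x2, y2) integer pixel coordinates, where
           x2 and y2 are treated as exclusive (an assumed convention).
    Boxes are clamped to the frame; boxes with no area left are skipped.
    """
    height = len(frame)
    width = len(frame[0]) if height else 0
    crops = []
    for x1, y1, x2, y2 in boxes:
        # Clamp the box to the frame boundaries.
        x1, x2 = max(0, x1), min(width, x2)
        y1, y2 = max(0, y1), min(height, y2)
        if x2 <= x1 or y2 <= y1:
            continue  # Box lies outside the frame or is degenerate.
        crops.append([row[x1:x2] for row in frame[y1:y2]])
    return crops
```

Applying the function to every video frame and collecting the results would give the local images of the whole sample video.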
[0113] Step S310: Obtain a sample image; the sample image is labeled with category labels and detection box labels, and the category labels include object labels for multiple local objects;
[0114] Step S320: The sample image is processed by the initial local detection model to obtain the image features of the sample image; and the image features are classified to obtain the predicted category and predicted detection box of the sample image.
[0115] Step S330: Based on the difference information between the predicted category and the category label, and the difference information between the predicted detection box and the detection box label, the initial local detection model is trained to obtain the trained local detection model.
[0116] In practice, sample images can be obtained from open-source datasets such as ImageNet, Open Images, and COCO. These images are labeled with category labels and detection box labels. The category labels include object labels for multiple local objects, such as legs, arms, and heads. After the sample images are obtained, they can be input into the initial local detection model to obtain image features. The classification layer in the initial local detection model then classifies these features to obtain the predicted category and predicted detection box. The prediction results are then compared with the supervisory information: the predicted category is compared with the category label to obtain one piece of difference information, and the predicted detection box is compared with the detection box label to obtain another. A loss value is obtained from these two pieces of difference information, the model parameters of the initial local detection model are adjusted using this loss value, and training continues until a trained local detection model is obtained.
[0117] The format of the open-source dataset can be: "**.jpg, leg, x1, y1, x2, y2", "**.jpg, arm, x1, y1, x2, y2", "**.jpg, head, x1, y1, x2, y2", etc., where x1, y1, x2, y2 represent the coordinates of the detection box corresponding to the local object in the image.
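For illustration, one record in the format described above could be parsed as sketched below. The record type `BoxAnnotation`, the function name `parse_annotation`, and the choice to store coordinates as floats are assumptions; the file name in the usage note is a made-up example.

```python
from typing import NamedTuple


class BoxAnnotation(NamedTuple):
    """One labeled detection box (hypothetical record type)."""
    image: str
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def parse_annotation(line):
    """Parse a record of the form 'name.jpg, label, x1, y1, x2, y2'."""
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != 6:
        raise ValueError(f"expected 6 comma-separated fields, got {len(parts)}")
    image, label = parts[0], parts[1]
    x1, y1, x2, y2 = (float(value) for value in parts[2:])
    return BoxAnnotation(image, label, x1, y1, x2, y2)
```

For example, `parse_annotation("0001.jpg, leg, 10, 20, 110, 220")` yields a record whose `label` is `"leg"` and whose box spans from (10, 20) to (110, 220).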
[0118] Figure 4 is a schematic diagram of the model structure of a local detection model according to an exemplary embodiment. The local detection model can consist of multiple CNN (Convolutional Neural Network) layers and several post-processing parsing layers, and uses YOLOv5 (an object detection algorithm) for local object detection. The input data includes sample images, category labels, and detection box labels. For a batch of sample data, image features are obtained through the multiple convolutional layers. These image features are then parsed to obtain predicted categories and predicted detection boxes. The predicted categories and predicted detection boxes are compared with the category labels and detection box labels to determine how accurate the local detection model's predictions are. The corresponding difference information is calculated and used as the loss of the local detection model, which guides the updating of the local detection model until its loss converges, completing the training process.
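The combination of the two pieces of difference information into one detection loss can be sketched as follows. This is a simplified stand-in and not the loss actually used by YOLOv5: cross-entropy is assumed for the category difference, a mean absolute error for the detection box difference, and `box_weight` is a hypothetical balancing factor.

```python
import math


def cross_entropy(logits, target_index):
    """Cross-entropy between softmax(logits) and a one-hot target."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target_index]


def l1_box_loss(pred_box, true_box):
    """Mean absolute error over the box coordinates (x1, y1, x2, y2)."""
    return sum(abs(p - t) for p, t in zip(pred_box, true_box)) / len(true_box)


def detection_loss(logits, target_index, pred_box, true_box, box_weight=1.0):
    """Category difference plus weighted detection box difference."""
    category_term = cross_entropy(logits, target_index)
    box_term = l1_box_loss(pred_box, true_box)
    return category_term + box_weight * box_term
```

With a perfectly predicted box the box term vanishes and only the category term remains; a worse box increases the loss, which is the behavior the training procedure relies on.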
[0119] The above embodiments, by training a local detection model to detect local objects in sample images, can extract local images from the images, laying the foundation for multi-scale feature extraction for subsequent abnormal video recognition models, and improving the ability of abnormal video recognition models to identify risks.
[0120] In an exemplary embodiment, step S130 described above can be implemented through the following steps:
[0121] Step S1301: Obtain a first loss value based on the first difference information, obtain a second loss value based on the second difference information, and obtain a third loss value based on the third difference information;
[0122] Step S1302: Obtain the total loss based on the first loss value, the second loss value, and the third loss value;
[0123] Step S1303: Based on the total loss, train the abnormal video recognition model to be trained to obtain the trained abnormal video recognition model.
[0124] The loss value measures the difference between the model's prediction and the ground truth. The smaller the loss value, the better the model's prediction. Training the model is the process of optimizing the loss value.
[0125] In this step, the first loss value, the second loss value, and the third loss value can be regarded as the loss values of three tasks. As shown in Figure 2, Task 1 is the comparison on the risk side (predicted risk information against the risk label), Task 2 is the comparison on the content side (predicted content information against the content label), and Task 3 compares the video features with the supervised text features of the supervised text information. The supervised text features are obtained by inputting the supervised text information into the text encoder. Joint training on these three complementary tasks can improve the feature extraction capability of the abnormal video recognition model. It should be noted that, in Figure 2, the text encoder that encodes the supervised text information and the text encoder that encodes the video text information share the same parameters, to facilitate subsequent comparison.
[0126] In practice, the first difference information, the second difference information, and the third difference information can be used as the first loss value, the second loss value, and the third loss value, respectively. Furthermore, the total loss can be obtained from the first loss value, the second loss value, and the third loss value. The training of the abnormal video recognition model is then updated by the total loss until the total loss converges and becomes stable, thus completing the training.
[0127] More specifically, the total loss can be obtained by summing the first, second, and third loss values. Alternatively, a weight can be determined in advance for each of the three loss values, and the first, second, and third loss values can then be weighted and summed to obtain the total loss.
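Both options can be written as one small function, sketched here for illustration. The function name `total_loss` and the numeric values in the usage note are hypothetical; the disclosure does not state how the weights are chosen.

```python
def total_loss(losses, weights=None):
    """Combine per-task loss values into a total loss.

    losses:  the first, second, and third loss values.
    weights: optional per-task weights; when omitted, the plain sum is used.
    """
    if weights is None:
        weights = [1.0] * len(losses)
    if len(weights) != len(losses):
        raise ValueError("weights and losses must have the same length")
    return sum(w * loss for w, loss in zip(weights, losses))
```

For instance, `total_loss([0.5, 0.25, 0.25])` is the plain sum, while `total_loss([1.0, 2.0, 3.0], [0.5, 0.25, 0.0])` applies predetermined weights and ignores the third task entirely.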
[0128] In this embodiment, by using the loss values from multiple tasks to jointly train the abnormal video recognition model, the feature extraction capability of the abnormal video recognition model can be improved, thereby enhancing the model's ability to understand risky content.
[0129] In an exemplary embodiment, the sample video includes multiple videos, and the third difference information includes positive sample difference information and negative sample difference information; the method further includes: acquiring positive sample difference information between the video features of the target sample video and the supervised text features of the supervised text information of the target sample video, and acquiring negative sample difference information between the video features of the target sample video and the supervised text features of the supervised text information of other sample videos; the target sample video is any one of the multiple videos, and the other sample videos are videos other than the target sample video; the abnormal video recognition model to be trained is trained with the goal of reducing positive sample difference information, increasing negative sample difference information, and reducing the total loss, to obtain a trained abnormal video recognition model.
[0130] Specifically, Task 3 in Figure 2 is contrastive learning between the fused video features and the supervised text features of the supervised text information. Its main idea is to pull positive pairs closer together and push negative pairs further apart. In this embodiment, the video features and supervised text information of the same sample video form a positive sample, while the video features of sample A and the supervised text information of sample B form a negative sample. Therefore, the training method for Task 3 can be as follows: obtain the difference information between the video features of the target sample video and the supervised text features of the supervised text information of the target sample video as positive sample difference information; obtain the difference information between the video features of the target sample video and the supervised text features of the supervised text information of other sample videos as negative sample difference information. With the goal of reducing the positive sample difference information, increasing the negative sample difference information, and reducing the total loss, the abnormal video recognition model to be trained is trained to obtain the trained abnormal video recognition model.
[0131] In this embodiment, contrastive learning between the video features and the supervised text information aligns the text modality with the video modality. This strengthens the interaction between the two modalities and their representation capabilities, improves the risk identification effect, reduces the dependence on training data, and improves the generalization of the model.
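The disclosure states only the training goal for Task 3 and not a specific loss formula. One common way to realize that goal is an InfoNCE-style loss, sketched below under that assumption: cosine similarity stands in for the (inverse) difference information, and `temperature` is a hypothetical hyperparameter. Feature vectors are assumed to be non-zero.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(video_feats, text_feats, temperature=0.1):
    """InfoNCE-style loss over a batch of sample videos.

    video_feats[i] and text_feats[i] come from the same sample video
    (positive pair); text_feats[j] with j != i are negatives for video i.
    Minimizing the loss raises positive similarity and lowers negative
    similarity, i.e. reduces positive difference and increases negative
    difference.
    """
    total = 0.0
    for i, video in enumerate(video_feats):
        logits = [cosine(video, text) / temperature for text in text_feats]
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_sum - logits[i]  # Cross-entropy with target index i.
    return total / len(video_feats)
```

When each video feature is closest to its own supervised text feature the loss is near zero; when the pairing is swapped the loss is large, so gradient descent on it moves the features toward the stated goal.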
[0132] In an exemplary embodiment, the content tags and risk tags in the sample data are determined in the following manner:
[0133] Step S1101: Collect sample videos and their supervisory text information from the video platform;
[0134] Step S1102: Recognize and process the supervisory text information of the sample video to obtain the risk label of the sample video;
[0135] Step S1103: Obtain the video text information of the sample video, and perform recognition processing on the video text information of the sample video to obtain the content tags of the sample video.
[0136] In practice, various types of supervisory text information uploaded by users in the past can be collected from video platforms, and the corresponding videos can be obtained as sample videos. Considering the high cost (large data volume and complex rules) of manually labeling the sample data, this embodiment adopts a method of mining supervisory text information and video content information to obtain risk tags and content tags.
[0137] More specifically, the supervisory text information is the user's expression of general risk content in the video, such as "minor misconduct 1" or "minor misconduct 2". The supervisory text information can be processed with keyword extraction and noise reduction to uncover fine-grained risk tags. Keyword extraction can employ the TF-IDF method (Term Frequency–Inverse Document Frequency, a commonly used weighting technique for information retrieval and data mining).
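As a minimal sketch of TF-IDF keyword extraction, the function below ranks the terms of each piece of supervisory text by term frequency times inverse document frequency. Tokenization (word segmentation) and noise reduction are assumed to have been done already; the function name `tfidf_keywords`, the unsmoothed `log(N / df)` weighting, and the alphabetical tie-break are illustrative choices.

```python
import math
from collections import Counter


def tfidf_keywords(documents, top_k=2):
    """Return the top_k TF-IDF terms for each tokenized document.

    documents: list of token lists, one per piece of supervisory text.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    results = []
    for tokens in documents:
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:top_k])
    return results
```

Terms that appear in every piece of supervisory text receive a score of zero, so generic words drop out and the more distinctive terms remain as candidate fine-grained risk tags.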
[0138] The video text information, such as topics, titles, and descriptions, constitutes the content expression of the video. By relevance cleaning (word segmentation, part-of-speech tagging, noise reduction, etc.) and posterior probability verification, the content side tags of the video are mined.
[0139] Figure 5 is a schematic diagram of the relationship between risk tags and content tags. Since the content tags of a video express the video itself, determining risk tags from the video's content tags can serve as a prior probability, while verifying content tags against the video's risk tags can serve as a posterior probability. The candidate content tags are then cleaned using the posterior probability.
[0140] In this embodiment, the sample video is determined by the supervisory text information of "collective intelligence", which basically eliminates the need for manual annotation, reduces the workload of manual labor, saves human and material resources, increases the scale of training data, and can further improve the performance of the abnormal video recognition model.
[0141] The training method for the abnormal video recognition model in the above embodiments provides a multimodal framework with multi-task joint learning and multi-scale feature perception. It uses "collective intelligence" data such as user reports or review lines as sample data, and introduces contrastive learning between supervised text information and video features. This can reduce the dependence on training data while improving the model's generalization ability. The multi-scale features improve the model's ability to capture and perceive risks and significantly improve the recognition of risky videos.
[0142] In one exemplary embodiment, as shown in Figure 6, an abnormal video recognition method is provided. This embodiment illustrates the method as applied to a terminal. It is understood that this method can also be applied to a server, or to a system including both a terminal and a server, and implemented through interaction between the terminal and the server. In this embodiment, the method includes the following steps:
[0143] Step S610: Obtain each video frame, video text information, and local images of the video to be identified;
[0144] Step S620: Encode each video frame, video text information, and local images using an abnormal video recognition model to obtain video features of the video to be identified; analyze and process the video features to obtain predicted risk information and predicted content information of the video to be identified; the abnormal video recognition model is trained by taking each video frame, video text information, and local images of the sample video as input, and using the risk label, content label, and supervision text information of the sample video as supervision information; the risk label is obtained based on the supervision text information, and the content label is obtained based on the video text information;
[0145] Step S630: Based on the predicted risk information and predicted content information, determine the anomaly identification result for the video to be identified.
[0146] Figure 7 is a schematic diagram of video recognition using the abnormal video recognition model. After the video to be recognized is acquired, a local detection model first processes each video frame to obtain the local images of the video. Then, each video frame is input into the spatiotemporal visual encoder, the local images into the region-level visual encoder, and the video text information into the text encoder. The encoding results from each encoder are input into the multimodal encoder, which fuses them; the fused result serves as the video feature of the video to be recognized. A multi-classification layer then analyzes the video features to obtain predicted risk information and predicted content information. Finally, based on the predicted risk information and predicted content information, the anomaly recognition result for the video to be recognized is determined.
[0147] Furthermore, in an exemplary embodiment, step S630 specifically includes the following steps:
[0148] Step S6301: Match the predicted content information with the preset risk content information to obtain the matching result;
[0149] Step S6302: Based on the matching results and predicted risk information, determine the anomaly identification results for the video to be identified.
[0150] Specifically, risk content information can be preset. After the predicted content information of the video to be identified is obtained, it is matched against the preset risk content information to obtain the matching result. The matching result is either a hit (the predicted content matches risk content) or a miss (it does not). Based on the matching result and the predicted risk information, the anomaly identification result for the video to be identified is determined.
[0151] More specifically, the video can be determined to be abnormal if the matching result is a hit on risk content and the predicted risk information also indicates a risk. Alternatively, the video can be determined to be abnormal if the matching result is a hit on risk content or the predicted risk information indicates a risk. The specific strategy can be set according to requirements, and this application does not impose any restrictions on it.
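The two strategies can be sketched as a single decision function. The function name `is_abnormal`, the representation of content information as collections of tag strings, the boolean `predicted_risk`, and the `strategy` parameter are assumptions made for illustration.

```python
def is_abnormal(predicted_content, risky_content, predicted_risk, strategy="and"):
    """Decide whether a video is abnormal.

    predicted_content: content tags predicted for the video.
    risky_content:     preset risk content tags.
    predicted_risk:    True if the predicted risk information indicates a risk.
    strategy:          "and" requires both signals; "or" requires either one.
    """
    # Matching result: a hit if any predicted tag is in the preset risk content.
    content_hit = bool(set(predicted_content) & set(risky_content))
    if strategy == "and":
        return content_hit and predicted_risk
    if strategy == "or":
        return content_hit or predicted_risk
    raise ValueError(f"unknown strategy: {strategy!r}")
```

The "and" strategy favors precision, since both signals must agree, while the "or" strategy favors recall; which one suits a platform depends on its review requirements.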
[0152] The abnormal video recognition method provided in the above embodiments uses each video frame, video text information, and local images of the video to be recognized as a representation of the video to be recognized. By representing the video to be recognized in this multi-scale and multi-modal way, the accuracy of the representation of the video to be recognized can be improved, thereby improving the accuracy of the predicted risk information and predicted content information. Finally, based on the predicted risk information and predicted content information, the abnormal recognition result for the video to be recognized is determined from multiple dimensions, which can ensure the credibility of the abnormal recognition result.
[0153] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.
[0154] It is understood that the same / similar parts between the various embodiments of the methods described above in this specification can be referred to each other. Each embodiment focuses on the differences from other embodiments, and relevant parts can be referred to the description of other method embodiments.
[0155] Based on the same inventive concept, this disclosure also provides a training apparatus for an abnormal video recognition model for implementing the training method of the abnormal video recognition model mentioned above, and an abnormal video recognition apparatus for implementing the abnormal video recognition method mentioned above.
[0156] Figure 8 is a structural block diagram illustrating a training apparatus for an abnormal video recognition model according to an exemplary embodiment. Referring to Figure 8, the apparatus includes:
[0157] The acquisition unit 810 is configured to acquire each video frame, video text information, local image, risk label, content label and supervision text information of the sample video. The risk label is obtained based on the supervision text information and the content label is obtained based on the video text information.
[0158] The prediction unit 820 is configured to perform encoding processing on each video frame, video text information and local image of the sample video through the abnormal video recognition model to be trained, to obtain the video features of the sample video; and to perform recognition processing on the video features to obtain the predicted risk information and predicted content information of the sample video.
[0159] Training unit 830 is configured to train the abnormal video recognition model to be trained based on the first difference information between predicted risk information and risk label, the second difference information between predicted content information and content label, and the third difference information between video features and supervised text features of supervised text information, to obtain the trained abnormal video recognition model.
[0160] In one exemplary embodiment, the abnormal video recognition model to be trained includes a spatiotemporal visual encoder, a region-level visual encoder, a text encoder, and a multimodal encoder;
[0161] The prediction unit 820 is also configured to perform the following operations: encode each video frame of the sample video using a spatiotemporal visual encoder to obtain the initial video features of the sample video; encode local images using a region-level visual encoder to obtain local image features of the sample video; encode video text information using a text encoder to obtain text features of the sample video; and fuse the initial video features, local image features, and text features using a multimodal encoder to obtain the video features of the sample video.
[0162] In an exemplary embodiment, the apparatus further includes a local image determination unit, configured to perform detection processing on each video frame of the sample video using a local detection model to obtain detection boxes corresponding to local objects in each video frame; and to perform cropping processing on each video frame based on the detection boxes to obtain a local image of the sample video.
[0163] In one exemplary embodiment, the apparatus further includes a detection model training unit configured to perform the following operations: acquiring a sample image; the sample image is labeled with category labels and detection box labels, the category labels including object labels of multiple local objects; performing detection processing on the sample image using an initial local detection model to obtain image features of the sample image; performing classification processing on the image features to obtain predicted categories and predicted detection boxes of the sample image; and training the initial local detection model based on the difference information between the predicted categories and category labels, and the difference information between the predicted detection boxes and detection box labels, to obtain a trained local detection model.
[0164] In an exemplary embodiment, the training unit 830 is further configured to perform the following operations: obtaining a first loss value based on first difference information, obtaining a second loss value based on second difference information, and obtaining a third loss value based on third difference information; obtaining a total loss based on the first loss value, the second loss value, and the third loss value; and training the abnormal video recognition model to be trained based on the total loss to obtain a trained abnormal video recognition model.
[0165] In an exemplary embodiment, the sample video includes multiple videos, and the third difference information includes positive sample difference information and negative sample difference information; the training unit 830 is further configured to acquire positive sample difference information between the video features of the target sample video and the supervised text features of the supervised text information of the target sample video, and to acquire negative sample difference information between the video features of the target sample video and the supervised text features of the supervised text information of other sample videos; the target sample video is any one of the multiple videos, and the other sample videos are videos other than the target sample video; the abnormal video recognition model to be trained is trained with the goal of reducing positive sample difference information, increasing negative sample difference information, and reducing the total loss, to obtain a trained abnormal video recognition model.
[0166] In an exemplary embodiment, the acquisition unit 810 is further configured to perform the following actions: acquiring sample videos and supervisory text information of sample videos from a video platform; performing identification processing on the supervisory text information of sample videos to obtain risk labels for sample videos; acquiring video text information of sample videos; and performing identification processing on the video text information of sample videos to obtain content labels for sample videos.
[0167] Figure 9 is a structural block diagram illustrating an abnormal video recognition apparatus according to an exemplary embodiment. Referring to Figure 9, the apparatus includes:
[0168] The acquisition unit 910 is configured to acquire each video frame, video text information, and local images of the video to be identified;
[0169] The prediction unit 920 is configured to perform encoding processing on each video frame, video text information, and local images through an abnormal video recognition model to obtain video features of the video to be identified; analyze and process the video features to obtain predicted risk information and predicted content information of the video to be identified; the abnormal video recognition model is trained by taking each video frame, video text information, and local images of the sample video as input, and using the risk label, content label, and supervision text information of the sample video as supervision information; the risk label is obtained based on the supervision text information, and the content label is obtained based on the video text information;
[0170] The identification unit 930 is configured to determine the anomaly identification result for the video to be identified based on the predicted risk information and the predicted content information.
[0171] In an exemplary embodiment, the identification unit 930 is further configured to perform matching of predicted content information with preset risk content information to obtain a matching result; and based on the matching result and the predicted risk information, to determine an anomaly identification result for the video to be identified.
[0172] Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operation has been described in detail in the embodiments related to the method, and will not be elaborated upon here.
[0173] Figure 10 is a block diagram illustrating an electronic device 1000 for implementing a training method for an abnormal video recognition model, according to an exemplary embodiment. For example, the electronic device 1000 may be a mobile phone, computer, digital broadcasting terminal, messaging device, game console, tablet device, medical device, fitness equipment, personal digital assistant, etc.
[0174] Referring to Figure 10, the electronic device 1000 may include one or more of the following components: processing component 1002, memory 1004, power supply component 1006, multimedia component 1008, audio component 1010, input/output (I/O) interface 1012, sensor component 1014, and communication component 1016.
[0175] Processing component 1002 typically controls the overall operation of electronic device 1000, such as operations associated with display, telephone calls, data communication, camera operation, and recording operations. Processing component 1002 may include one or more processors 1020 to execute instructions to perform all or part of the steps of the methods described above. Furthermore, processing component 1002 may include one or more modules to facilitate interaction between processing component 1002 and other components. For example, processing component 1002 may include a multimedia module to facilitate interaction between multimedia component 1008 and processing component 1002.
[0176] Memory 1004 is configured to store various types of data to support the operation of electronic device 1000. Examples of such data include instructions for any application or method operating on electronic device 1000, contact data, phonebook data, messages, pictures, videos, etc. Memory 1004 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk, optical disk, or graphene storage.
[0177] Power supply component 1006 provides power to various components of electronic device 1000. Power supply component 1006 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to electronic device 1000.
[0178] Multimedia component 1008 includes a screen that provides an output interface between the electronic device 1000 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touchscreen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundaries of the touch or swipe action but also the duration and pressure associated with the touch or swipe operation. In some embodiments, multimedia component 1008 includes a front-facing camera and / or a rear-facing camera. When the electronic device 1000 is in an operating mode, such as a shooting mode or a video mode, the front-facing camera and / or the rear-facing camera may receive external multimedia data. Each front-facing camera and rear-facing camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
[0179] Audio component 1010 is configured to output and / or input audio signals. For example, audio component 1010 includes a microphone (MIC) configured to receive external audio signals when electronic device 1000 is in an operating mode, such as call mode, recording mode, and voice recognition mode. The received audio signals may be further stored in memory 1004 or transmitted via communication component 1016. In some embodiments, audio component 1010 also includes a speaker for outputting audio signals.
[0180] I / O interface 1012 provides an interface between processing component 1002 and peripheral interface modules, such as keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to, home buttons, volume buttons, power buttons, and lock buttons.
[0181] Sensor assembly 1014 includes one or more sensors for providing state assessments of various aspects of electronic device 1000. For example, sensor assembly 1014 can detect the on / off state of electronic device 1000, the relative positioning of components such as the display and keypad of electronic device 1000, changes in position of electronic device 1000 or its components, the presence or absence of user contact with electronic device 1000, orientation or acceleration / deceleration of device 1000, and temperature changes of electronic device 1000. Sensor assembly 1014 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. Sensor assembly 1014 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, sensor assembly 1014 may also include an accelerometer, gyroscope, magnetometer, pressure sensor, or temperature sensor.
[0182] Communication component 1016 is configured to facilitate wired or wireless communication between electronic device 1000 and other devices. Electronic device 1000 can access wireless networks based on communication standards, such as WiFi, carrier networks (such as 2G, 3G, 4G, or 5G), or combinations thereof. In one exemplary embodiment, communication component 1016 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, communication component 1016 also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
[0183] In an exemplary embodiment, the electronic device 1000 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components to perform the methods described above.
[0184] In one exemplary embodiment, a computer-readable storage medium including instructions is also provided, such as a memory 1004 including instructions, which can be executed by a processor 1020 of an electronic device 1000 to perform the above-described method. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, or an optical data storage device.
[0185] In one exemplary embodiment, a computer program product is also provided, which includes instructions that can be executed by a processor 1020 of an electronic device 1000 to perform the above-described method.
[0186] It should be noted that the above-mentioned apparatus, electronic device, computer-readable storage medium, computer program product, etc., may also include other implementations in accordance with the description of the method embodiments. For the specific implementations, refer to the description of the relevant method embodiments; they are not repeated here.
[0187] Other embodiments of this disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of this disclosure are indicated by the claims.
[0188] It should be understood that this disclosure is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this disclosure is limited only by the appended claims.
Claims
1. A training method for an abnormal video recognition model, characterized in that it comprises:
acquiring multimodal features of a sample video, together with a risk label, a content label, and supervisory text information of the sample video, wherein the multimodal features include each video frame, video text information, and a local image of the sample video; the risk label is obtained by mining the supervisory text information and characterizes the risk of the video content; the content label is obtained by mining the video text information and verifying it based on the risk label; and the sample video and the supervisory text information are collected from a video platform;
encoding, by the abnormal video recognition model to be trained, each video frame, the video text information, and the local image of the sample video to obtain video features of the sample video, and performing recognition on the video features to obtain predicted risk information and predicted content information of the sample video; and
training the abnormal video recognition model to be trained based on first difference information between the predicted risk information and the risk label, second difference information between the predicted content information and the content label, and third difference information between the video features and supervisory text features of the supervisory text information, to obtain a trained abnormal video recognition model.
2. The method according to claim 1, characterized in that the abnormal video recognition model to be trained includes a spatiotemporal visual encoder, a region-level visual encoder, a text encoder, and a multimodal encoder; and
encoding, by the abnormal video recognition model to be trained, each video frame, the video text information, and the local image of the sample video to obtain the video features of the sample video comprises:
encoding each video frame of the sample video with the spatiotemporal visual encoder to obtain initial video features of the sample video;
encoding the local image with the region-level visual encoder to obtain local image features of the sample video;
encoding the video text information with the text encoder to obtain text features of the sample video; and
fusing the initial video features, the local image features, and the text features with the multimodal encoder to obtain the video features of the sample video.
3. The method according to claim 1, characterized in that the method further comprises:
detecting each video frame of the sample video with a local detection model to obtain a detection box corresponding to a local object in each video frame; and
cropping each video frame based on the detection box to obtain the local image of the sample video.
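As a non-limiting illustration of the cropping step recited in claim 3, the following Python sketch crops one local image per detection box. The frame representation (a list of pixel rows), the `(x1, y1, x2, y2)` box format with exclusive upper bounds, and the function name `crop_local_images` are assumptions made for this example only; the claim does not fix any of them, and the local detection model that produces the boxes is not shown.

```python
def crop_local_images(frame, boxes):
    """Crop one local image per detection box.

    frame: list of pixel rows (hypothetical representation of a video frame).
    boxes: iterable of (x1, y1, x2, y2); x2 and y2 are exclusive (assumed format).
    """
    height = len(frame)
    width = len(frame[0]) if height else 0
    crops = []
    for x1, y1, x2, y2 in boxes:
        # Clamp the box to the frame so an oversized box cannot index out of range.
        x1, x2 = max(0, x1), min(width, x2)
        y1, y2 = max(0, y1), min(height, y2)
        # Skip boxes that are empty after clamping.
        if x1 < x2 and y1 < y2:
            crops.append([row[x1:x2] for row in frame[y1:y2]])
    return crops
```

Boxes that fall entirely outside the frame yield no crop in this sketch; an implementation could equally choose to report them as errors.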
4. The method according to claim 3, characterized in that the method further comprises a training process for the local detection model, the training process comprising:
acquiring a sample image, wherein the sample image is labeled with a category label and a detection box label, the category label including object labels of multiple local objects;
processing the sample image with an initial local detection model to obtain image features of the sample image, and classifying the image features to obtain a predicted category and a predicted detection box of the sample image; and
training the initial local detection model based on difference information between the predicted category and the category label and difference information between the predicted detection box and the detection box label, to obtain the trained local detection model.
5. The method according to claim 1, characterized in that training the abnormal video recognition model to be trained based on the first difference information between the predicted risk information and the risk label, the second difference information between the predicted content information and the content label, and the third difference information between the video features and the supervisory text features of the supervisory text information, to obtain the trained abnormal video recognition model, comprises:
obtaining a first loss value based on the first difference information, a second loss value based on the second difference information, and a third loss value based on the third difference information;
obtaining a total loss based on the first loss value, the second loss value, and the third loss value; and
training the abnormal video recognition model to be trained based on the total loss, to obtain the trained abnormal video recognition model.
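As a non-limiting illustration of how the total loss in claim 5 might be formed, the following Python sketch combines the three loss values as a weighted sum. The weighted-sum form, the default weights, the use of cross-entropy for a single labelled class, and the function names are all assumptions made for this example; the claim only requires that the total loss be obtained from the three loss values.

```python
import math


def cross_entropy(probabilities, label_index):
    """Negative log-likelihood of the labelled class (assumed loss form)."""
    return -math.log(probabilities[label_index])


def total_loss(first_loss, second_loss, third_loss, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the risk, content, and video-text loss values.

    The weights are hypothetical hyperparameters, not values from the disclosure.
    """
    w1, w2, w3 = weights
    return w1 * first_loss + w2 * second_loss + w3 * third_loss
```

With equal weights the three tasks contribute equally; unequal weights would let one task, for example risk prediction, dominate the update.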
6. The method according to claim 5, characterized in that the sample video includes a plurality of videos, and the third difference information includes positive sample difference information and negative sample difference information; the method further comprises:
acquiring the positive sample difference information between the video features of a target sample video and the supervisory text features of the supervisory text information of the target sample video, and the negative sample difference information between the video features of the target sample video and the supervisory text features of the supervisory text information of other sample videos, wherein the target sample video is any one of the plurality of videos, and the other sample videos are the videos other than the target sample video; and
training the abnormal video recognition model to be trained with the objectives of reducing the positive sample difference information, increasing the negative sample difference information, and reducing the total loss, to obtain the trained abnormal video recognition model.
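As a non-limiting illustration of the objective in claim 6, the following Python sketch computes a contrastive loss over a batch of video features and supervisory text features. It assumes an InfoNCE-style formulation with cosine similarity and a temperature of 0.1; the claim names none of these, and the function names are hypothetical. Minimizing this loss pulls each video toward its own supervisory text (smaller positive sample difference) and pushes it away from the texts of the other videos (larger negative sample difference).

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_loss(video_feats, text_feats, temperature=0.1):
    """Mean video-to-text contrastive loss; text_feats[i] is the positive for video_feats[i]."""
    losses = []
    for i, video in enumerate(video_feats):
        logits = [cosine(video, text) / temperature for text in text_feats]
        # Numerically stable log-sum-exp over positive and negative pairs.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability assigned to the matching text.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

A batch whose video and text features are correctly paired yields a much smaller loss than the same batch with the texts swapped, which is the behavior the training objective relies on.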
7. An abnormal video recognition method, characterized in that it comprises:
acquiring each video frame, video text information, and a local image of a video to be identified;
encoding, by an abnormal video recognition model, each video frame, the video text information, and the local image to obtain video features of the video to be identified, and analyzing and processing the video features to obtain predicted risk information and predicted content information of the video to be identified, wherein the abnormal video recognition model is trained by taking each video frame, video text information, and a local image of a sample video as input and using a risk label, a content label, and supervisory text information of the sample video as supervision information; the risk label is obtained by mining the supervisory text information and characterizes the risk of the video content; the content label is obtained by mining the video text information and verifying it based on the risk label; and the sample video and the supervisory text information are collected from a video platform; and
determining an anomaly identification result for the video to be identified based on the predicted risk information and the predicted content information.
8. The method according to claim 7, characterized in that determining the anomaly identification result for the video to be identified based on the predicted risk information and the predicted content information comprises:
matching the predicted content information against preset risk content information to obtain a matching result; and
determining the anomaly identification result for the video to be identified based on the matching result and the predicted risk information.
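As a non-limiting illustration of the matching step in claim 8, the following Python sketch matches predicted content labels against a preset set of risk content labels and combines the match with the predicted risk score. The data shapes (a risk score in [0, 1], a mapping from content label to score), the two thresholds, and the rule that a video is flagged only when both signals agree are assumptions made for this example; the claim does not specify how the matching result and the predicted risk information are combined.

```python
def identify_anomaly(predicted_risk, predicted_content, preset_risk_content,
                     risk_threshold=0.5, content_threshold=0.5):
    """Return the matched risk content labels and an abnormal/normal decision.

    predicted_risk: risk score in [0, 1] (assumed form of the predicted risk information).
    predicted_content: dict mapping content label to score (assumed form).
    preset_risk_content: collection of content labels treated as risky.
    """
    # Keep only content labels the model is reasonably confident about.
    confident = {label for label, score in predicted_content.items()
                 if score >= content_threshold}
    matched = sorted(confident & set(preset_risk_content))
    # Assumed decision rule: high risk score and at least one matched label.
    is_abnormal = predicted_risk >= risk_threshold and len(matched) > 0
    return {"matched_content": matched, "is_abnormal": is_abnormal}
```

A stricter or looser rule, such as flagging on either signal alone, fits the claim language equally well and would change only the final line of the decision.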
9. A training device for an abnormal video recognition model, characterized in that it comprises:
an acquisition unit configured to acquire multimodal features of a sample video, together with a risk label, a content label, and supervisory text information of the sample video, wherein the multimodal features include each video frame, video text information, and a local image of the sample video; the risk label is obtained by mining the supervisory text information and characterizes the risk of the video content; the content label is obtained by mining the video text information and verifying it based on the risk label; and the sample video and the supervisory text information are collected from a video platform;
a prediction unit configured to encode, by the abnormal video recognition model to be trained, each video frame, the video text information, and the local image of the sample video to obtain video features of the sample video, and to perform recognition on the video features to obtain predicted risk information and predicted content information of the sample video; and
a training unit configured to train the abnormal video recognition model to be trained based on first difference information between the predicted risk information and the risk label, second difference information between the predicted content information and the content label, and third difference information between the video features and supervisory text features of the supervisory text information, to obtain a trained abnormal video recognition model.
10. An abnormal video recognition device, characterized in that it comprises:
an acquisition unit configured to acquire each video frame, video text information, and a local image of a video to be identified;
a prediction unit configured to encode, by an abnormal video recognition model, each video frame, the video text information, and the local image to obtain video features of the video to be identified, and to analyze and process the video features to obtain predicted risk information and predicted content information of the video to be identified, wherein the abnormal video recognition model is trained by taking each video frame, video text information, and a local image of a sample video as input and using a risk label, a content label, and supervisory text information of the sample video as supervision information; the risk label is obtained by mining the supervisory text information and characterizes the risk of the video content; the content label is obtained by mining the video text information and verifying it based on the risk label; and the sample video and the supervisory text information are collected from a video platform; and
an identification unit configured to determine an anomaly identification result for the video to be identified based on the predicted risk information and the predicted content information.
11. An electronic device, characterized in that it comprises:
a processor; and
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the method according to any one of claims 1 to 8.
12. A computer-readable storage medium, characterized in that, when instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the method according to any one of claims 1 to 8.