Abnormal multimedia resource identification method and device, electronic equipment and storage medium

Text and image features are fused, and the fused feature is matched against anomaly clusters. This addresses the low accuracy of single-modal multimedia resource identification and allows abnormal multimedia resources to be identified quickly and accurately.

CN116563679B · Active · Publication Date: 2026-06-12 · BEIJING DAJIA INTERNET INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: BEIJING DAJIA INTERNET INFORMATION TECH CO LTD
Filing Date: 2023-04-25
Publication Date: 2026-06-12

IPC: G06V10/80; G06V30/148; G06V10/764; G06F16/55; G06V10/774; G06V10/82; G06V10/74

CPC: G06V10/806; G06V30/153; G06V10/764; G06F16/55; G06V10/774; G06V10/82; G06V10/761; Y02D10/00

AI Tagging

Application Domain

Character and pattern recognition; Still image data clustering/classification

AI Technical Summary

⚠Technical Problem

In existing technologies, the accuracy of anomaly identification in multimedia resources is low, mainly due to the poor representation ability of single-modal features, which leads to large errors.

⚗Method used

The method fuses text features and image features. The target text feature and the target image feature are extracted with a text feature extraction model and an image feature extraction model respectively, then fused into a target fusion feature. The identification result for the multimedia resource is determined by matching the target fusion feature against each anomaly cluster in a pre-built anomaly cluster set.

🎯Benefits of technology

It improves the accuracy and efficiency of multimedia resource identification, enabling quick and accurate determination of whether multimedia resources are abnormal.

Abstract

The present disclosure relates to a method and apparatus for identifying abnormal multimedia resources, an electronic device, and a storage medium. The method comprises: determining a target text and a target image in a multimedia resource to be identified; inputting the target text into a text feature extraction model for text feature extraction processing to obtain a target text feature; inputting the target image into an image feature extraction model for image feature extraction processing to obtain a target image feature; performing fusion processing on the target text feature and the target image feature to obtain a target fusion feature; obtaining an abnormal cluster set, the abnormal cluster set being obtained by clustering preset fusion features of a plurality of abnormal multimedia resources according to the respective abnormal categories of those resources; and determining an identification result of the multimedia resource to be identified based on a matching result between the target fusion feature and each abnormal cluster in the abnormal cluster set. The present disclosure improves the identification accuracy of abnormal multimedia resources.

Description

Technical Field

[0001] This disclosure relates to the field of computer technology, and in particular to a method, apparatus, electronic device, and storage medium for identifying abnormal multimedia resources.

Background Technology

[0002] Application platforms typically contain a large number of multimedia resources, including positive information and negative or abnormal multimedia resources. Abnormal multimedia resources can mislead users; therefore, it is necessary to filter out abnormal multimedia resources from a large number of multimedia resources.

[0003] In related technologies, when screening abnormal multimedia resources from multimedia resources, single-modal features (text features) are usually used for classification and judgment. However, single-modal features have poor representation ability and large error, resulting in a low accuracy rate in identifying abnormal multimedia resources.

Summary of the Invention

[0004] This disclosure provides a method, apparatus, electronic device, and storage medium for identifying abnormal multimedia resources, to at least solve the problem of low accuracy in identifying abnormal multimedia resources in related technologies. The technical solution of this disclosure is as follows:

[0005] According to a first aspect of the present disclosure, a method for identifying abnormal multimedia resources is provided, comprising:

[0006] Determine the target text and target image in the multimedia resource to be identified;

[0007] The target text is input into a text feature extraction model for text feature extraction processing to obtain the target text features;

[0008] The target image is input into an image feature extraction model for image feature extraction processing to obtain the target image features;

[0009] The target text features and the target image features are fused together to obtain target fused features;

[0010] Obtain an anomaly cluster set; the anomaly cluster set is obtained by clustering the preset fusion features corresponding to the multiple abnormal multimedia resources according to their respective anomaly categories; the anomaly cluster set includes at least one anomaly cluster; the feature set corresponding to each anomaly cluster includes at least one preset fusion feature; each anomaly cluster corresponds to one anomaly category;

[0011] Based on the matching results between the target fusion features and each abnormal cluster in the abnormal cluster set, the identification result of the multimedia resource to be identified is determined; the identification result indicates whether the multimedia resource to be identified is abnormal.
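
The steps of the first aspect can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: fusion is assumed to be concatenation of L2-normalized features, matching is assumed to be cosine similarity against each cluster's mean feature, and `MATCH_THRESHOLD`, `fuse`, `center`, and `identify` are hypothetical names. The feature vectors stand in for the outputs of the text and image feature extraction models.

```python
import math

MATCH_THRESHOLD = 0.9  # assumed value; the disclosure does not fix a threshold


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(text_feature, image_feature):
    """Assumed fusion: normalize each modality, then concatenate."""
    return l2_normalize(text_feature) + l2_normalize(image_feature)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def center(features):
    """Mean of the fusion features in one cluster's feature set."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def identify(target_fused, cluster_set):
    """Return (is_abnormal, matched_category) for one target fusion feature."""
    best_category, best_sim = None, -1.0
    for category, features in cluster_set.items():
        sim = cosine(target_fused, center(features))
        if sim > best_sim:
            best_category, best_sim = category, sim
    if best_sim >= MATCH_THRESHOLD:
        return True, best_category
    return False, None


# Each anomaly cluster corresponds to one anomaly category.
cluster_set = {
    "category_a": [fuse([1.0, 0.0], [0.0, 1.0]), fuse([0.9, 0.1], [0.1, 0.9])],
    "category_b": [fuse([0.0, 1.0], [1.0, 0.0])],
}
result_match = identify(fuse([1.0, 0.05], [0.05, 1.0]), cluster_set)
result_clean = identify(fuse([1.0, 0.0], [1.0, 0.0]), cluster_set)
```

Here `result_match` reports an abnormal resource matched to `category_a`, while `result_clean` matches no cluster because its best similarity stays below the assumed threshold.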

[0012] In one exemplary embodiment, the training method for the text feature extraction model includes:

[0013] The training text is input into a first preset model, which includes multiple neural networks; text features are extracted from the training text based on a first number of neural networks to obtain first text features; text features are extracted from the training text based on a second number of neural networks to obtain second text features;

[0014] Based on the first and second text features corresponding to the same training text, construct positive text feature pairs;

[0015] Based on the training text features corresponding to any two training texts, construct negative text feature pairs; the training text features are either the first text feature or the second text feature.

[0016] Determine the first similarity between two text features in the positive text feature pair and the second similarity between two text features in the negative text feature pair;

[0017] Based on the difference between the first similarity and the second similarity, the first loss information is determined;

[0018] Based on the first loss information, adjust the model parameters of the first preset model until the training termination condition is met, and determine the first preset model at the end of training as the initial text feature extraction model.

[0019] Based on the initial text feature extraction model, the text feature extraction model is determined.
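
The disclosure states only that the first loss information is determined from the difference between the first similarity and the second similarity. One way to read that is a hinge loss on the similarity gap, sketched below. The hinge form, the `margin` value, the use of cosine similarity, and the name `first_loss` are assumptions made for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def first_loss(positive_pairs, negative_pairs, margin=0.5):
    """Hinge loss on the gap between mean positive and mean negative similarity."""
    first_similarity = sum(cosine(a, b) for a, b in positive_pairs) / len(positive_pairs)
    second_similarity = sum(cosine(a, b) for a, b in negative_pairs) / len(negative_pairs)
    # The loss is zero once positives beat negatives by at least `margin`.
    return max(0.0, margin - (first_similarity - second_similarity))


# Positive pair: first and second text features of the same training text.
# Negative pair: features taken from two different training texts.
well_separated = first_loss([([1.0, 0.0], [1.0, 0.0])], [([1.0, 0.0], [0.0, 1.0])])
poorly_separated = first_loss([([1.0, 0.0], [0.0, 1.0])], [([1.0, 0.0], [1.0, 0.0])])
```

A training loop would adjust the parameters of the first preset model to reduce this value until the termination condition is met.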

[0020] In one exemplary implementation, determining the text feature extraction model based on the initial text feature extraction model includes:

[0021] Based on the sample text set, sample text pairs are constructed; the sample text pairs include two types of text pairs, namely positive sample text pairs and negative sample text pairs; the similarity between the two texts in the positive sample text pair is greater than a first threshold, and the similarity between the two texts in the negative sample text pair is less than a second threshold; the first threshold is greater than the second threshold.

[0022] Each sample text in the sample text pair is input into the initial text feature extraction model to extract text features, thereby obtaining the sample text features corresponding to each sample text.

[0023] Determine the similarity between the features of the two sample texts corresponding to the positive sample text pair to obtain the positive sample text similarity;

[0024] Determine the similarity between the features of the two sample texts corresponding to the negative sample text pair to obtain the negative sample text similarity;

[0025] Based on the difference between the similarity of the positive sample text and the similarity of the negative sample text, the initial text feature extraction model is trained to obtain the text feature extraction model.
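
The threshold-based construction of sample text pairs can be illustrated as below. The disclosure does not say how similarity between two texts is measured when building pairs; token-level Jaccard similarity is used here purely as a stand-in, and the threshold values and the name `build_sample_text_pairs` are hypothetical.

```python
from itertools import combinations


def jaccard(text_a, text_b):
    tokens_a, tokens_b = set(text_a.split()), set(text_b.split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def build_sample_text_pairs(sample_texts, first_threshold=0.6, second_threshold=0.2):
    """Split all text pairs into positive (> first) and negative (< second) pairs."""
    if first_threshold <= second_threshold:
        raise ValueError("first threshold must be greater than second threshold")
    positive_pairs, negative_pairs = [], []
    for text_a, text_b in combinations(sample_texts, 2):
        similarity = jaccard(text_a, text_b)
        if similarity > first_threshold:
            positive_pairs.append((text_a, text_b))
        elif similarity < second_threshold:
            negative_pairs.append((text_a, text_b))
        # Pairs between the two thresholds are ambiguous and are skipped.
    return positive_pairs, negative_pairs


texts = ["win a free phone now", "win a free phone today", "weather report for monday"]
positive_pairs, negative_pairs = build_sample_text_pairs(texts)
```

Keeping the first threshold above the second leaves a gap of ambiguous pairs that are used neither as positives nor as negatives.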

[0026] In one exemplary embodiment, the training method for the image feature extraction model includes:

[0027] Based on the sample image set, sample image pairs are constructed; the sample image pairs include two types of image pairs, namely positive sample image pairs and negative sample image pairs. The similarity between the two images in the positive sample image pair is greater than a third threshold, and the similarity between the two images in the negative sample image pair is less than a fourth threshold; the third threshold is greater than the fourth threshold.

[0028] Each sample image in the sample image pair is input into the second preset model for image feature extraction processing to obtain the sample image features corresponding to each sample image.

[0029] Determine the similarity between the features of the two corresponding positive sample images to obtain the positive sample image similarity;

[0030] The similarity between the features of the two corresponding negative sample images is determined to obtain the negative sample image similarity.

[0031] Based on the difference between the similarity of the positive sample images and the similarity of the negative sample images, the second preset model is trained to obtain the image feature extraction model.

[0032] In one exemplary implementation, after obtaining the abnormal cluster set, the method further includes:

[0033] Obtain the newly added fusion feature corresponding to a newly added multimedia resource, the newly added multimedia resource being an abnormal multimedia resource.

[0034] Based on the abnormal fusion features corresponding to each abnormal cluster in the abnormal cluster set, determine the central feature corresponding to each abnormal cluster;

[0035] Based on the similarity between the newly added fusion feature and the central feature corresponding to each abnormal cluster, the abnormal cluster that matches the newly added fusion feature is determined, and the matching cluster is obtained.

[0036] The newly added fusion feature is added to the feature set corresponding to the matching cluster.
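
The incremental update described above might look like the following sketch. It assumes the central feature is the mean of a cluster's feature set and that the matching cluster is simply the one with the highest cosine similarity; the disclosure specifies neither, and `center_feature` and `add_new_feature` are hypothetical names.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def center_feature(feature_set):
    """Assumed central feature: element-wise mean of the cluster's features."""
    n = len(feature_set)
    return [sum(col) / n for col in zip(*feature_set)]


def add_new_feature(cluster_set, new_feature):
    """Append new_feature to the cluster whose central feature is most similar."""
    matching = max(
        cluster_set,
        key=lambda name: cosine(new_feature, center_feature(cluster_set[name])),
    )
    cluster_set[matching].append(new_feature)
    return matching


clusters = {"a": [[1.0, 0.0], [0.8, 0.2]], "b": [[0.0, 1.0]]}
matched = add_new_feature(clusters, [0.9, 0.1])
```

Because only one feature set grows, the existing clusters do not need to be rebuilt when a new abnormal resource arrives.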

[0037] In one exemplary implementation, determining the identification result of the multimedia resource to be identified based on the matching results of the target fusion features and each abnormal cluster in the abnormal cluster set includes:

[0038] When an abnormal cluster that matches the target fusion feature exists in the abnormal cluster set, the multimedia resource to be identified is determined to be abnormal.

[0039] The method further includes:

[0040] Determine the target processing strategy corresponding to the matched anomaly clusters;

[0041] The multimedia resources to be identified are processed based on the target processing strategy.

[0042] In an exemplary embodiment, before determining the target processing strategy corresponding to the target anomaly cluster, the method further includes:

[0043] The sorting parameters for each anomaly cluster are determined based on the amount of information and the degree of attention received by the anomaly multimedia resources corresponding to each anomaly cluster.

[0044] Based on the sorting parameters corresponding to each abnormal cluster in the abnormal cluster set, sort each abnormal cluster in the abnormal cluster set to obtain the sorting result;

[0045] Based on the sorting results, determine the anomaly level parameter corresponding to each anomaly cluster in the anomaly cluster set;

[0046] Based on the anomaly level parameter corresponding to each anomaly cluster in the anomaly cluster set, a first anomaly handling strategy is constructed; the first anomaly handling strategy includes the correspondence between the anomaly level parameter and the first handling strategy.

[0047] The step of determining the target processing strategy corresponding to the target anomaly cluster includes:

[0048] Based on the first anomaly handling strategy, the target processing strategy corresponding to the target anomaly cluster is determined.
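
A small sketch of the level-based strategy follows. The way the sorting parameter combines information amount and attention (a product here), the rule that maps rank to level, and the strategy names in `LEVEL_TO_STRATEGY` are all assumptions; the disclosure only says that such a correspondence exists.

```python
# Hypothetical correspondence between anomaly level and handling strategy.
LEVEL_TO_STRATEGY = {0: "remove", 1: "limit_distribution", 2: "manual_review"}


def anomaly_levels(cluster_stats):
    """cluster_stats maps cluster name -> (information_amount, attention)."""
    # Assumed sorting parameter: information amount multiplied by attention.
    sorting_params = {
        name: info * attention for name, (info, attention) in cluster_stats.items()
    }
    ranked = sorted(sorting_params, key=sorting_params.get, reverse=True)
    max_level = max(LEVEL_TO_STRATEGY)
    # Rank 0 is the most severe; ranks beyond the last level share that level.
    return {name: min(rank, max_level) for rank, name in enumerate(ranked)}


def target_strategy(cluster_name, levels):
    return LEVEL_TO_STRATEGY[levels[cluster_name]]


stats = {
    "spam": (120, 0.9),
    "rumor": (40, 0.5),
    "clickbait": (300, 0.1),
    "other": (5, 0.2),
}
levels = anomaly_levels(stats)
strategy = target_strategy("clickbait", levels)
```

Looking up the strategy by level rather than by cluster means a cluster's handling changes automatically when its rank changes.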

[0049] In an exemplary embodiment, before determining the target processing strategy corresponding to the target anomaly cluster, the method further includes:

[0050] Obtain the abnormal multimedia resources corresponding to each abnormal cluster in the abnormal cluster set;

[0051] Based on the preset text in the abnormal multimedia resources corresponding to each abnormal cluster, determine the scene category corresponding to each abnormal cluster;

[0052] Based on the scene category corresponding to each anomaly cluster in the anomaly cluster set, a second anomaly handling strategy is constructed; the second anomaly handling strategy includes the correspondence between scene categories and the second handling strategy;

[0053] The step of determining the target processing strategy corresponding to the target anomaly cluster includes:

[0054] Based on the second anomaly handling strategy, the target processing strategy corresponding to the target anomaly cluster is determined.
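
The scene-based strategy can be sketched in the same spirit. Keyword voting over the preset text is only one possible way to assign a scene category; the keyword lists, scene names, and strategies below are invented for illustration and do not come from the disclosure.

```python
from collections import Counter

# Hypothetical keyword lists and strategies; the disclosure does not enumerate them.
SCENE_KEYWORDS = {
    "finance": {"loan", "interest", "invest"},
    "health": {"cure", "medicine", "diet"},
}
SCENE_TO_STRATEGY = {"finance": "block_and_warn", "health": "attach_notice"}
DEFAULT_STRATEGY = "manual_review"


def scene_category(preset_texts):
    """Pick the scene whose keywords occur most often in a cluster's preset text."""
    votes = Counter()
    for text in preset_texts:
        tokens = set(text.lower().split())
        for scene, keywords in SCENE_KEYWORDS.items():
            votes[scene] += len(tokens & keywords)
    if not votes:
        return "unknown"
    scene, hits = votes.most_common(1)[0]
    return scene if hits > 0 else "unknown"


def target_strategy(cluster_name, cluster_texts):
    scene = scene_category(cluster_texts[cluster_name])
    return SCENE_TO_STRATEGY.get(scene, DEFAULT_STRATEGY)


cluster_texts = {
    "cluster_1": ["guaranteed loan with low interest", "fast loan approval"],
    "cluster_2": ["hello world"],
}
strategy_1 = target_strategy("cluster_1", cluster_texts)
strategy_2 = target_strategy("cluster_2", cluster_texts)
```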

[0055] According to a second aspect of the present disclosure, an abnormal multimedia resource identification device is provided, comprising:

[0056] The information acquisition module is configured to determine the target text and target image in the multimedia resource to be identified;

[0057] The text feature extraction module is configured to input the target text into the text feature extraction model for text feature extraction processing to obtain the target text features;

[0058] The image feature extraction module is configured to input the target image into the image feature extraction model for image feature extraction processing to obtain the target image features;

[0059] The feature fusion module is configured to perform fusion processing on the target text features and the target image features to obtain target fused features;

[0060] The cluster set acquisition module is configured to acquire an abnormal cluster set; the abnormal cluster set is obtained by clustering the preset fusion features corresponding to the multiple abnormal multimedia resources according to their respective abnormal categories; the abnormal cluster set includes at least one abnormal cluster; the feature set corresponding to each abnormal cluster includes at least one preset fusion feature; each abnormal cluster corresponds to one abnormal category.

[0061] The anomaly identification module is configured to determine the identification result of the multimedia resource to be identified based on the matching results between the target fusion features and each anomaly cluster in the anomaly cluster set; the identification result indicates whether the multimedia resource to be identified is abnormal.

[0062] In one exemplary embodiment, the apparatus further includes:

[0063] The text input module is configured to input training text into a first preset model, the first preset model including multiple neural networks; extract text features from the training text based on a first number of neural networks to obtain first text features; and extract text features from the training text based on a second number of neural networks to obtain second text features.

[0064] The first feature pair construction module is configured to construct positive text feature pairs based on the first text feature and the second text feature corresponding to the same training text.

[0065] The second feature pair construction module is configured to construct negative text feature pairs based on the training text features corresponding to any two training texts; the training text features are either the first text feature or the second text feature.

[0066] The text similarity determination module is configured to determine the first similarity between two text features in the positive text feature pair and the second similarity between two text features in the negative text feature pair.

[0067] The text model training module is configured to train the first preset model based on the first similarity and the second similarity to obtain the text feature extraction model.

[0068] In one exemplary embodiment, the text model training module includes:

[0069] The first loss determination submodule is configured to determine first loss information based on the difference between the first similarity and the second similarity.

[0070] The initial model determination submodule is configured to adjust the model parameters of the first preset model according to the first loss information until the training termination condition is met, and determine the first preset model at the end of training as the initial text feature extraction model.

[0071] The text model training submodule is configured to determine the text feature extraction model based on the initial text feature extraction model.

[0072] In one exemplary implementation, the text model training submodule includes:

[0073] The sample text pair construction unit is configured to construct sample text pairs based on a sample text set; the sample text pairs include two types of text pairs, namely positive sample text pairs and negative sample text pairs; the similarity between the two texts in the positive sample text pair is greater than a first threshold, and the similarity between the two texts in the negative sample text pair is less than a second threshold; the first threshold is greater than the second threshold.

[0074] The text feature extraction unit is configured to perform text feature extraction by inputting each sample text in the sample text pair into the initial text feature extraction model to obtain the sample text features corresponding to each sample text.

[0075] The positive text similarity determination unit is configured to determine the similarity between the features of the two sample texts corresponding to the positive sample text pair, and obtain the positive sample text similarity.

[0076] The negative text similarity determination unit is configured to determine the similarity between the features of the two sample texts corresponding to the negative sample text pair, and obtain the negative sample text similarity.

[0077] The text model training unit is configured to train the initial text feature extraction model based on the positive sample text similarity and the negative sample text similarity to obtain the text feature extraction model.

[0078] In one exemplary embodiment, the apparatus further includes:

[0079] The sample image pair construction module is configured to construct sample image pairs based on a sample image set. The sample image pairs include two types of image pairs: positive sample image pairs and negative sample image pairs. The similarity between the two images in a positive sample image pair is greater than a third threshold, and the similarity between the two images in a negative sample image pair is less than a fourth threshold. The third threshold is greater than the fourth threshold.

[0080] The sample image feature extraction module is configured to perform image feature extraction processing by inputting each sample image in the sample image pair into a second preset model to obtain the sample image features corresponding to each sample image.

[0081] The first similarity determination module is configured to determine the similarity between the features of the two sample images corresponding to the positive sample image pair, and obtain the positive sample image similarity.

[0082] The second similarity determination module is configured to determine the similarity between the features of the two sample images corresponding to the negative sample image pair, and obtain the negative sample image similarity.

[0083] The image model training module is configured to train the second preset model based on the similarity of the positive sample images and the similarity of the negative sample images to obtain the image feature extraction model.

[0084] In one exemplary embodiment, the apparatus further includes:

[0085] The newly added fusion feature acquisition module is configured to acquire the newly added fusion feature corresponding to a newly added multimedia resource, the newly added multimedia resource being an abnormal multimedia resource.

[0086] The central feature determination module is configured to determine the central feature corresponding to each anomaly cluster based on the anomaly fusion feature corresponding to each anomaly cluster in the anomaly cluster set.

[0087] The matching cluster determination module is configured to determine the abnormal clusters that match the newly added fusion feature based on the similarity between the newly added fusion feature and the central feature corresponding to each abnormal cluster, thereby obtaining the matching clusters;

[0088] The feature set update module is configured to add the newly added fused features to the feature set corresponding to the matching cluster.

[0089] In one exemplary embodiment, the second preset model includes a first encoder, a second encoder, and a nonlinear layer; the sample image pair includes a first image and a second image; and the sample image feature extraction module includes:

[0090] The first encoding submodule is configured to input the first image into the first encoder for encoding processing to obtain the first encoded feature;

[0091] The second encoding submodule is configured to input the second image into the second encoder for encoding processing to obtain the second encoded feature;

[0092] The image feature determination submodule is configured to perform nonlinear mapping processing by inputting the first encoded feature and the second encoded feature into the nonlinear layer respectively, so as to obtain the sample image features corresponding to each sample image.

[0093] In one exemplary embodiment, the image model training module includes:

[0094] The second loss information determination submodule is configured to determine the second loss information based on the difference between the similarity of the positive sample images and the similarity of the negative sample images.

[0095] The current parameter determination submodule is configured to adjust the parameters of the first encoder and the nonlinear layer based on the second loss information to obtain the current parameters of the first encoder.

[0096] The model training submodule is configured to update the parameters of the second encoder based on the current parameters when the current training iteration meets the preset conditions, until the training termination condition is met, and then determine the first encoder, the second encoder, and the nonlinear layer at the end of training as the image feature extraction model.
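
The disclosure says the second encoder is updated from the first encoder's current parameters when the training iteration meets a preset condition, without giving the update rule. The sketch below assumes an exponential-moving-average update applied every few iterations, in the style of momentum-encoder contrastive training; `every`, `momentum`, and `update_second_encoder` are hypothetical, and parameters are flattened to plain lists for brevity.

```python
def update_second_encoder(second_params, current_params, iteration, every=4, momentum=0.99):
    """Refresh the second encoder from the first encoder's current parameters."""
    # Assumed preset condition: update once every `every` iterations.
    if iteration % every != 0:
        return list(second_params)
    # Assumed rule: exponential moving average toward the first encoder.
    return [
        momentum * s + (1.0 - momentum) * c
        for s, c in zip(second_params, current_params)
    ]


second = [0.0, 1.0]
current = [1.0, 1.0]
skipped = update_second_encoder(second, current, iteration=3, momentum=0.9)
updated = update_second_encoder(second, current, iteration=4, momentum=0.9)
```

Under this reading only the first encoder and the nonlinear layer receive gradient updates, while the second encoder follows the first more slowly.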

[0097] In one exemplary implementation, the sample image pair construction module includes:

[0098] The sample image set acquisition submodule is configured to acquire a sample image set, which includes a first sample image set and a second sample image set.

[0099] The image enhancement submodule is configured to perform data enhancement processing on the first sample image in the first sample image set based on two data enhancement methods to obtain the first enhanced image and the second enhanced image corresponding to the first sample image.

[0100] The first construction submodule is configured to construct the positive sample image pair based on the first enhanced image and the second enhanced image corresponding to the same first sample image;

[0101] The second construction submodule is configured to construct the negative sample image pair based on the enhanced image corresponding to any first sample image and any second sample image in the second sample image set; the enhanced image is either the first enhanced image or the second enhanced image.

[0102] The sample image pair construction submodule is configured to construct the sample image pair based on the positive sample image pair and the negative sample image pair.
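
The pair construction performed by these submodules can be sketched as follows. Images are represented as nested lists of pixel values, and the two data enhancement methods (horizontal flip and a brightness shift) are arbitrary choices for illustration; the disclosure does not name specific enhancements, and `build_sample_image_pairs` is a hypothetical helper.

```python
import random


def horizontal_flip(image):
    return [row[::-1] for row in image]


def brighten(image, delta=10):
    return [[min(255, pixel + delta) for pixel in row] for row in image]


def build_sample_image_pairs(first_set, second_set, seed=0):
    """Return (positive_pairs, negative_pairs) built from two sample image sets."""
    rng = random.Random(seed)
    positive_pairs, negative_pairs = [], []
    for image in first_set:
        first_enhanced = horizontal_flip(image)
        second_enhanced = brighten(image)
        # Positive pair: two enhanced views of the same first sample image.
        positive_pairs.append((first_enhanced, second_enhanced))
        # Negative pair: either enhanced view with any second sample image.
        enhanced = rng.choice([first_enhanced, second_enhanced])
        negative_pairs.append((enhanced, rng.choice(second_set)))
    return positive_pairs, negative_pairs


first_set = [[[1, 2], [3, 4]]]
second_set = [[[9, 9], [9, 9]], [[7, 7], [7, 7]]]
positive_pairs, negative_pairs = build_sample_image_pairs(first_set, second_set)
```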

[0103] In one exemplary embodiment, the apparatus further includes:

[0104] The abnormal fusion feature acquisition module is configured to acquire the abnormal fusion features corresponding to each of the multiple abnormal multimedia resources.

[0105] The clustering module is configured to perform clustering of multiple abnormal fusion features based on the abnormal categories corresponding to the multiple abnormal multimedia resources, to obtain an abnormal cluster set; the abnormal cluster set includes at least one abnormal cluster; the feature set corresponding to each abnormal cluster includes at least one abnormal fusion feature;

[0106] In one exemplary embodiment, the anomaly identification module includes:

[0107] The abnormal multimedia resource determination submodule is configured to determine that the multimedia resource to be identified is abnormal when there is an abnormal cluster in the abnormal cluster set that matches the target fusion feature.

[0108] In an exemplary embodiment, when an anomaly cluster matching the target fusion feature exists within the anomaly cluster set, the apparatus further includes:

[0109] The target processing strategy determination module is configured to determine the target processing strategy corresponding to the matched anomaly cluster;

[0110] The strategy execution module is configured to process the multimedia resource to be identified based on the target processing strategy.

[0111] In one exemplary embodiment, the apparatus further includes:

[0112] The sorting parameter determination module is configured to determine the sorting parameters for each anomaly cluster based on the amount of information and the degree of attention of the anomaly multimedia resources corresponding to each anomaly cluster.

[0113] The sorting module is configured to sort each abnormal cluster in the abnormal cluster set according to the sorting parameters corresponding to each abnormal cluster in the abnormal cluster set, and obtain the sorting result;

[0114] The anomaly level parameter determination module is configured to determine the anomaly level parameter corresponding to each anomaly cluster in the anomaly cluster set based on the sorting result.

[0115] The first strategy construction module is configured to construct a first anomaly handling strategy based on the anomaly level parameters corresponding to each anomaly cluster in the anomaly cluster set; the first anomaly handling strategy includes the correspondence between the anomaly level parameters and the first handling strategy.

[0116] In one exemplary embodiment, the target processing strategy determination module includes:

[0117] The first strategy determination module is configured to determine the target processing strategy corresponding to the target anomaly cluster based on the first anomaly handling strategy.

[0118] In one exemplary embodiment, the apparatus further includes:

[0119] The information quantity determination module is configured to determine the information quantity of the abnormal multimedia resources corresponding to each abnormal cluster based on the number of abnormal fusion features corresponding to each abnormal cluster in the abnormal cluster set.

[0120] The attention level acquisition module is configured to acquire the attention level of the abnormal multimedia resources corresponding to each abnormal cluster.

[0121] In one exemplary embodiment, the apparatus further includes:

[0122] The abnormal multimedia resource acquisition module is configured to acquire the abnormal multimedia resources corresponding to each abnormal cluster in the abnormal cluster set;

[0123] The scene category determination module is configured to determine the scene category corresponding to each anomaly cluster based on preset text in the abnormal multimedia resources corresponding to each anomaly cluster.

[0124] The second strategy construction module is configured to construct a second anomaly handling strategy based on the scene category corresponding to each anomaly cluster in the anomaly cluster set; the second anomaly handling strategy includes the correspondence between scene categories and the second handling strategy.

[0125] In one exemplary embodiment, the target processing strategy determination module includes:

[0126] The second strategy determination module is configured to determine the target processing strategy corresponding to the target anomaly cluster based on the second anomaly handling strategy.

[0127] According to a third aspect of the present disclosure, an electronic device is provided, comprising:

[0128] A processor;

[0129] A memory for storing instructions executable by the processor;

[0130] Wherein the processor is configured to execute the instructions to implement the abnormal multimedia resource identification method described above.

[0131] According to a fourth aspect of the present disclosure, a computer-readable storage medium is provided; when the instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the abnormal multimedia resource identification method described above.

[0132] According to a fifth aspect of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the abnormal multimedia resource identification method as described above.

[0133] This disclosure determines the target text and target image in a multimedia resource to be identified; inputs the target text into a text feature extraction model for text feature extraction processing to obtain target text features; inputs the target image into an image feature extraction model for image feature extraction processing to obtain target image features; fuses the target text features and target image features to obtain target fused features; obtains an anomaly cluster set, which is obtained by clustering the preset fused features corresponding to multiple abnormal multimedia resources according to their respective anomaly categories and includes at least one anomaly cluster, where the feature set corresponding to each anomaly cluster includes at least one preset fused feature and each anomaly cluster corresponds to one anomaly category; and determines the identification result of the multimedia resource to be identified based on the matching results between the target fused features and each anomaly cluster in the anomaly cluster set. Because the target fused features combine features from two modalities, they can represent the multimedia resource to be identified accurately. By obtaining the anomaly cluster set and matching the target fused features against each anomaly cluster in it, the identification result can be determined quickly and accurately, improving both the accuracy and the efficiency of identifying abnormal multimedia resources.

[0134] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this disclosure.

Brief Description of the Drawings

[0135] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this disclosure and, together with the description, serve to explain the principles of this disclosure, and are not intended to unduly limit this disclosure.

[0136] Figure 1 is an application environment diagram illustrating an abnormal multimedia resource identification method according to an exemplary embodiment.

[0137] Figure 2 is a flowchart illustrating an abnormal multimedia resource identification method according to an exemplary embodiment.

[0138] Figure 3 is a flowchart illustrating a training method for a text feature extraction model according to an exemplary embodiment.

[0139] Figure 4 is a flowchart illustrating a method for determining a text feature extraction model according to an exemplary embodiment.

[0140] Figure 5 is a schematic diagram illustrating the structure of a text feature extraction model according to an exemplary embodiment.

[0141] Figure 6 is a flowchart illustrating a method for determining a text feature extraction model based on an initial text feature extraction model, according to an exemplary embodiment.

[0142] Figure 7 is a schematic diagram of the structure of a first preset model according to an exemplary embodiment.

[0143] Figure 8 is a flowchart illustrating a training method for an image feature extraction model according to an exemplary embodiment.

[0144] Figure 9 is a flowchart illustrating a method for determining sample image features corresponding to each sample image according to an exemplary embodiment.

[0145] Figure 10 is a flowchart illustrating a method for training a second preset model to obtain an image feature extraction model according to an exemplary embodiment.

[0146] Figure 11 is a schematic diagram of the structure of a second preset model according to an exemplary embodiment.

[0147] Figure 12 is a flowchart illustrating a method for determining target fused features through a model, according to an exemplary embodiment.

[0148] Figure 13 is a flowchart illustrating an update method for an anomaly cluster set according to an exemplary embodiment.

[0149] Figure 14 is a flowchart illustrating a method for constructing a first exception handling strategy according to an exemplary embodiment.

[0150] Figure 15 is a block diagram illustrating an abnormal multimedia resource identification device according to an exemplary embodiment.

[0151] Figure 16 is a block diagram illustrating an electronic device for identifying abnormal multimedia resources according to an exemplary embodiment.

Detailed Description

[0152] To enable those skilled in the art to better understand the technical solutions of this disclosure, the technical solutions in the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings.

[0153] It should be noted that the terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this disclosure described herein can be implemented in orders other than those illustrated or described herein. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this disclosure as detailed in the appended claims.

[0154] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for display, data used for analysis, etc.) involved in this disclosure are all information and data authorized by the user or fully authorized by all parties.

[0155] With the rapid development of the internet, a vast number of video resources are available. For example, short video platforms and various self-media platforms contain a wealth of video data. People's daily lives are increasingly influenced by short video platforms, and trending topics and events are disseminated and discussed online in real time.

[0156] Therefore, monitoring public opinion on short video platforms is becoming increasingly important. Through monitoring, sudden events and potential incidents that could trigger public outcry can be detected earlier, allowing for timely intervention and mitigation of damage. For example, short video platforms can be monitored for vulgar or inappropriate multimedia content involving minors.

[0157] On all short video platforms, it is crucial to closely monitor public opinion trends. Any negative signs should be addressed promptly, for example by closing topics, restricting inappropriate comments, and issuing penalties or educational measures to video creators. This is essential to mitigate the impact of public opinion and ensure the healthy development of the platform's ecosystem. However, relying solely on manual methods to identify abnormal multimedia resources within the platform is impractical, inefficient, and prone to delays.

[0158] To improve the accuracy of identifying abnormal multimedia resources, this disclosure provides a method, apparatus, electronic device, and storage medium for identifying abnormal multimedia resources.

[0159] Referring to Figure 1, which illustrates an application environment for an abnormal multimedia resource identification method according to an exemplary embodiment, the application environment may include a server 01 and a client 02.

[0160] Specifically, in the embodiments of this specification, server 01 may include a standalone server, a distributed server, or a server cluster composed of multiple servers. It may also be a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms. Server 01 may include a network communication unit, a processor, and a memory, etc. Specifically, server 01 can be used to construct a text feature extraction model and an image feature extraction model, and to extract text features and image features of multimedia resources through these two models respectively. Based on the text features and image features, a fused feature is obtained, which can accurately determine whether the multimedia resource is an abnormal multimedia resource, obtain the identification result, and send the identification result to client 02.

[0161] Specifically, in the embodiments of this specification, the client 02 may include physical devices such as smartphones, desktop computers, tablets, laptops, digital assistants, smart wearable devices, and in-vehicle terminals, and may also include software running on the physical device, such as web pages provided to users by some service providers, or applications provided to users by these service providers. Specifically, the client 02 can be used to display the recognition results of multimedia resources.

[0162] Figure 2 is a flowchart illustrating an abnormal multimedia resource identification method according to an exemplary embodiment. As shown in Figure 2, this method can be applied to the server 01 shown in Figure 1 and includes the following steps.

[0163] In step S201, the target text and target image in the multimedia resource to be identified are determined.

[0164] In this embodiment of the disclosure, the multimedia resource to be identified may include, but is not limited to, videos, images carrying text, etc.; the target text may include, but is not limited to, the text title, text content, number of likes, number of shares, number of comments, number of positive comments, number of negative comments, etc. in the multimedia resource to be identified. When the multimedia resource to be identified is an image carrying text, the target image is the image after removing the text information; when the multimedia resource to be identified is a video, the target image may include the video cover image and video frames; there may be one or more video frames.

[0165] In step S203, the target text is input into the text feature extraction model for text feature extraction processing to obtain the target text features.

[0166] In this embodiment of the disclosure, text features of the target text can be extracted using a text feature extraction model to obtain target text features, which can be vector features. There can be one or more target texts, and each target text corresponds to one or more target text features.

[0167] In some embodiments, as shown in Figure 3, the training method for the above text feature extraction model includes:

[0168] In step S301, the training text is input into a first preset model, which includes multiple neural networks; text features are extracted from the training text based on a first number of neural networks to obtain first text features; text features are extracted from the training text based on a second number of neural networks to obtain second text features.

[0169] In this embodiment of the disclosure, the original training text can be crawled from platforms such as websites, microblogs, and forums, or the original multimedia resources can be obtained through a multimedia resource publishing platform. The original training text can be extracted from the multimedia resources and then cleaned to obtain the training text. Data cleaning may include operations such as removing text that is too long or too short, and removing special characters from the text.
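
The cleaning operations mentioned above (removing special characters, then discarding texts that are too short or too long) could be sketched as follows. The length bounds of 2 and 200 characters, the definition of "special character" as anything that is neither a word character nor whitespace, and the function name are assumptions for illustration only.

```python
import re

def clean_texts(raw_texts, min_len=2, max_len=200):
    """Return cleaned training texts; bounds are hypothetical defaults."""
    cleaned = []
    for text in raw_texts:
        # Drop characters that are neither word characters nor whitespace.
        # In Python 3, \w is Unicode-aware, so Chinese characters are kept.
        text = re.sub(r"[^\w\s]", "", text)
        # Collapse runs of whitespace left behind by the removal.
        text = re.sub(r"\s+", " ", text).strip()
        # Discard texts that are too short or too long after cleaning.
        if min_len <= len(text) <= max_len:
            cleaned.append(text)
    return cleaned
```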

[0170] In some embodiments, data augmentation processing can be performed on the training text to expand the number of training texts. Data augmentation methods for training texts may include, but are not limited to, synonym replacement, random replacement of adjacent characters, replacement of non-Chinese characters with equivalent Chinese characters, translation conversion, and inverted sentence structure replacement. The purpose is to generate texts with the same semantics but different expressions, thereby enhancing the expressive power of the model.

[0171] In this embodiment of the disclosure, the first preset model can be a text encoder, which may include, but is not limited to, a Transformer-based network or a pre-trained language representation model such as BERT (Bidirectional Encoder Representations from Transformers). The number of training texts input into the model at one time can be a preset number (a batch), that is, the number of samples used in one iteration; the model parameters are updated after each batch is processed.

[0172] In this embodiment, the first number of neural networks and the second number of neural networks refer to some of the neural networks in the first preset model. The first number and the second number can be the same or different. When the first number and the second number are the same, the neural networks corresponding to them are different networks, that is, the neural networks corresponding to the first text feature and the second text feature are not completely the same. Based on the first number of neural networks, text features are extracted from the above training text to obtain the first text feature. Based on the second number of neural networks, text features are extracted from the above training text to obtain the second text feature. That is, the same training text is processed through two forward calculations and two different random masks (dropout masks) to obtain two text features (embedding vectors). The random mask scheme is a training strategy that can set random masks to make the output units ineffective, thereby improving the accuracy of feature extraction by each neural network in the first preset model.
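
The idea of passing the same training text through the model twice with two different random masks can be sketched with a toy encoder. The single linear layer, the application of the mask to the input units, the inverted-dropout scaling, and all names are assumptions for illustration; the first preset model of the disclosure is a full text encoder.

```python
import random

def encode_with_dropout(token_vector, weights, drop_prob, rng):
    """Toy encoder: random dropout mask on the inputs, then one linear layer."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep = 1.0 - drop_prob
    # Each unit is kept with probability `keep` and rescaled, else zeroed.
    masked = [x / keep if rng.random() < keep else 0.0 for x in token_vector]
    return [sum(w * x for w, x in zip(row, masked)) for row in weights]

def two_views(token_vector, weights, drop_prob=0.1, seed=0):
    """Two forward passes over the same input with independent masks."""
    rng = random.Random(seed)
    first = encode_with_dropout(token_vector, weights, drop_prob, rng)
    second = encode_with_dropout(token_vector, weights, drop_prob, rng)
    return first, second
```

Because the two masks are drawn independently, the two returned features generally differ slightly even though the input text is the same, which is what allows them to serve as a positive pair.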

[0173] In step S303, a positive text feature pair is constructed based on the first text feature and the second text feature corresponding to the same training text;

[0174] In this embodiment of the disclosure, positive text feature pairs can be constructed based on the first text feature and the second text feature corresponding to the same training text.

[0175] In step S305, a negative text feature pair is constructed based on the training text features corresponding to any two training texts; the training text features are either the first text feature or the second text feature.

[0176] In this embodiment of the disclosure, negative text feature pairs can be constructed based on the first text feature or the second text feature corresponding to different training texts.

[0177] In some embodiments, any two training texts may include a first training text and a second training text. The above-mentioned construction of negative text feature pairs based on the training text features corresponding to each of the two training texts includes:

[0178] Based on the first text features corresponding to the first training text and the first text features corresponding to the second training text, construct negative sample text pairs; or

[0179] Based on the first text features corresponding to the first training text and the second text features corresponding to the second training text, construct negative sample text pairs; or

[0180] Based on the second text features corresponding to the first training text and the first text features corresponding to the second training text, construct negative sample text pairs; or

[0181] Based on the second text features corresponding to the first training text and the second text features corresponding to the second training text, negative sample text pairs are constructed.

[0182] In this embodiment of the disclosure, negative sample text pairs can be constructed based on text features of the same type or of different types for any two training texts. Two features are of the same type when both are first text features or both are second text features; otherwise they are of different types.
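
The pairing rules in steps S303 and S305 can be enumerated as in the following sketch, which assumes the first and second text features of a batch are held in two parallel lists. The function name and the exhaustive enumeration of all four combinations are illustrative assumptions; an implementation may use only some of the combinations.

```python
from itertools import combinations, product

def build_feature_pairs(first_feats, second_feats):
    """first_feats[i] and second_feats[i] belong to the same training text i."""
    if len(first_feats) != len(second_feats):
        raise ValueError("both lists must cover the same training texts")
    views = (first_feats, second_feats)
    # Positive pair: first and second feature of the same training text.
    positives = list(zip(first_feats, second_feats))
    negatives = []
    # Negative pairs: any feature of text i with any feature of text j (i != j),
    # covering the four first/second combinations listed above.
    for i, j in combinations(range(len(first_feats)), 2):
        for view_i, view_j in product(views, repeat=2):
            negatives.append((view_i[i], view_j[j]))
    return positives, negatives
```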

[0183] In step S307, the first similarity of the two text features in the positive text feature pair and the second similarity of the two text features in the negative text feature pair are determined.

[0184] In step S309, the first preset model is trained based on the difference between the first similarity and the second similarity to obtain the text feature extraction model.

[0185] In this embodiment of the disclosure, a first loss information can be determined based on the difference between the first similarity and the second similarity; the model parameters of the first preset model are adjusted according to the first loss information until the training termination condition is met, so that the first similarity becomes larger and larger, while the second similarity becomes smaller and smaller; thus obtaining a text feature extraction model.
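
One common way to turn the contrast between positive-pair and negative-pair similarity into a quantity to minimize is an InfoNCE-style loss. The sketch below is offered under that assumption: the disclosure states only that the first loss information depends on the difference between the two similarities, and the cosine similarity, the temperature of 0.05, and the use of only second text features as in-batch negatives are illustrative choices.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def info_nce_loss(first_feats, second_feats, temperature=0.05):
    """Mean InfoNCE-style loss over a batch (assumed form, not the patent's).

    For text i, (first_feats[i], second_feats[i]) is the positive pair and
    (first_feats[i], second_feats[j]) with j != i are negative pairs.
    """
    n = len(first_feats)
    total = 0.0
    for i in range(n):
        logits = [cosine(first_feats[i], second_feats[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_denominator - logits[i]
    return total / n
```

Minimizing this value pushes the first similarity up and the second similarity down at the same time, which matches the training goal described above.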

[0186] In this embodiment, a text feature extraction model can be trained using an unsupervised learning strategy. Compared to supervised training that requires labeling, the unsupervised training method in this embodiment reduces the training difficulty.

[0187] In the embodiments disclosed herein, as shown in Figure 4, training the first preset model based on the difference between the first similarity and the second similarity to obtain the text feature extraction model includes:

[0188] In step S3091, first loss information is determined based on the difference between the first similarity and the second similarity.

[0189] In this embodiment of the disclosure, the difference between the first similarity and the second similarity can be calculated to obtain the first loss information; alternatively, weights can be set for the first similarity and the second similarity, and the first loss information can be determined based on the difference between the two weighted similarities.

[0190] In step S3093, based on the first loss information, the model parameters of the first preset model are adjusted until the training termination condition is met, and the first preset model at the end of training is determined as the initial text feature extraction model.

[0191] In this embodiment of the disclosure, the training termination condition can be determined by the value of the first loss information or the number of iterations; for example, the training termination condition can be that the first loss information is greater than a preset value or the number of iterations reaches a preset number; thereby obtaining the initial text feature extraction model.

[0192] In step S3095, the text feature extraction model is determined based on the initial text feature extraction model.

[0193] In this embodiment of the disclosure, the initial text feature extraction model can be directly determined as the text feature extraction model, or the initial text feature extraction model can be used as a pre-trained model to further determine the above-mentioned text feature extraction model.

[0194] In this embodiment of the disclosure, first loss information can be determined based on the difference between the first similarity and the second similarity, and the model parameters can be adjusted based on the first loss information, thereby training the text feature extraction model through contrastive learning, which can improve the accuracy with which the text feature extraction model extracts text features.

[0195] In some embodiments, the initial text feature extraction model can be directly determined as the text feature extraction model. Figure 5 is a schematic diagram of the structure of such a text feature extraction model, in which the first preset model is an encoder and multiple training texts are input to the encoder. In the figure, ① and ② denote the same training text encoded twice using different random masks; the two text features obtained form a positive sample text pair. In Figure 5, solid circles represent positive sample text pairs, and dashed circles represent negative sample text pairs.

[0196] In the embodiments disclosed herein, as shown in Figure 6, determining the above text feature extraction model based on the above initial text feature extraction model includes:

[0197] In step S30951, sample text pairs are constructed based on the sample text set; the sample text pairs include two types of text pairs, namely positive sample text pairs and negative sample text pairs; the similarity between the two texts in the positive sample text pair is greater than a first threshold, and the similarity between the two texts in the negative sample text pair is less than a second threshold; the first threshold is greater than the second threshold.

[0198] In this embodiment of the disclosure, data augmentation processing can be performed on sample texts in the sample text set to obtain augmented texts. Positive sample text pairs can be constructed based on a sample text and its corresponding augmented text, or based on two augmented texts corresponding to the same sample text. Negative sample text pairs can be constructed based on two different sample texts in the sample text set, or based on the augmented text corresponding to one sample text and a different sample text. The first threshold and the second threshold are both values less than 1.
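
The two-threshold rule of step S30951 can be illustrated by labeling candidate text pairs according to a similarity score. The character-level Jaccard similarity, the threshold values 0.8 and 0.3, and the function names are assumptions for illustration; the disclosure does not specify how text similarity is measured at this step.

```python
def jaccard(a, b):
    """Character-set Jaccard similarity (an assumed stand-in measure)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

def label_pairs(candidate_pairs, first_threshold=0.8, second_threshold=0.3,
                similarity=jaccard):
    """Split candidate pairs into positive and negative sample text pairs."""
    if not first_threshold > second_threshold:
        raise ValueError("first threshold must exceed second threshold")
    positives, negatives = [], []
    for a, b in candidate_pairs:
        score = similarity(a, b)
        if score > first_threshold:
            positives.append((a, b))
        elif score < second_threshold:
            negatives.append((a, b))
        # Pairs falling between the two thresholds are left unused.
    return positives, negatives
```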

[0199] In step S30953, each sample text in the above sample text pair is input into the above initial text feature extraction model to extract text features, thereby obtaining the sample text features corresponding to each sample text;

[0200] In this embodiment of the disclosure, fine-tuning training can be performed on the basis of the initial text feature extraction model to further improve the accuracy of the model; by inputting sample text pairs to train the initial text feature extraction model, the obtained sample text features can be embedded vectors.

[0201] In step S30955, the similarity between the features of the two sample texts corresponding to the above positive sample text pair is determined to obtain the positive sample text similarity.

[0202] In this embodiment of the disclosure, after obtaining the features of the two sample texts corresponding to the positive sample text pair, the similarity between the two text features can be calculated to obtain the positive sample text similarity; during the training process, the positive sample text similarity is continuously increased.

[0203] In step S30957, the similarity between the features of the two sample texts corresponding to the above negative sample text pair is determined to obtain the negative sample text similarity.

[0204] In this embodiment of the disclosure, after obtaining the two sample text features corresponding to the negative sample text pair, the similarity between the two text features can be calculated to obtain the negative sample text similarity; during the training process, the negative sample text similarity is continuously reduced.

[0205] In step S30959, the initial text feature extraction model is trained based on the difference between the positive sample text similarity and the negative sample text similarity to obtain the text feature extraction model.

[0206] In this embodiment of the disclosure, a contrastive loss function can be constructed based on the difference between the similarity of positive sample text and the similarity of negative sample text to train the initial text feature extraction model, so that the similarity of positive sample text increases while the similarity of negative sample text decreases, thereby obtaining the text feature extraction model.
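
A hinge-style margin loss is one simple form such a contrastive loss function could take. The sketch below assumes that form and a margin of 0.5; neither is stated in the disclosure, which says only that the loss is built from the difference between the two similarities.

```python
def margin_contrastive_loss(positive_similarity, negative_similarity, margin=0.5):
    """Assumed hinge form: zero once the positive similarity exceeds the
    negative similarity by at least `margin`, positive otherwise."""
    return max(0.0, margin - (positive_similarity - negative_similarity))
```

With this form, the loss stops contributing for pairs that are already well separated, so fine-tuning concentrates on pairs where the positive sample text similarity is not yet sufficiently larger than the negative sample text similarity.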

[0207] In this embodiment of the disclosure, by constructing sample text pairs and further fine-tuning the initial text feature extraction model, a text feature extraction model is obtained, which can further improve the accuracy of feature extraction by the text feature extraction model.

[0208] In some embodiments, Figure 7 is a schematic diagram of the structure of a first preset model. The first preset model includes two encoders. Positive sample text 1 and positive sample text 2 form a positive sample text pair, negative sample text 1 and negative sample text 2 form a negative sample text pair, and negative sample text x and negative sample text y form another negative sample text pair. After the positive sample text pairs and negative sample text pairs are constructed, one sample text from each pair is input into the first encoder, and the other sample text is input into the second encoder. Text features are extracted separately and used to calculate the positive sample text similarity and the negative sample text similarity. In Figure 7, solid circles represent positive sample text feature pairs, while dashed circles represent negative sample text feature pairs.

[0209] In some embodiments, the initial text feature extraction model is trained based on the difference between the positive sample text similarity and the negative sample text similarity to obtain the text feature extraction model, including:

[0210] Based on the difference between the positive sample text similarity and the negative sample text similarity, the loss information is determined;

[0211] In this embodiment of the disclosure, the difference between the positive sample text similarity and the negative sample text similarity can be calculated to obtain the loss information; alternatively, weights can be set for the positive sample text similarity and the negative sample text similarity, and the loss information can be determined based on the difference between the two weighted similarities.

[0212] In this embodiment of the disclosure, based on the aforementioned loss information, the model parameters of the initial text feature extraction model are adjusted until the training termination condition is met, and the initial text feature extraction model at the end of training is determined as the text feature extraction model.

[0213] In this embodiment of the disclosure, the training termination condition can be determined by the value of the loss information or the number of iterations; for example, the training termination condition can be that the loss information is greater than a preset value or the number of iterations reaches a preset number; thereby obtaining a text feature extraction model.

[0214] In step S205, the target image is input into the image feature extraction model for image feature extraction processing to obtain the target image features.

[0215] In this embodiment of the disclosure, image features of a target image can be extracted using an image feature extraction model to obtain target image features, which can be vector features. There can be one or more target images, and each target image can correspond to one or more target image features. When the multimedia resource to be identified is a video, the target image can include a cover image and video frame images. The cover image and video frame images can be input into the image feature extraction model separately for image feature extraction processing to obtain the image features corresponding to each of the two types of images. Then, the obtained multiple image features are fused to obtain the target image features.
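
For a video, the fusion of the cover image feature with the features of several video frames could be sketched as a weighted average. Averaging the frame features first, the cover weight of 0.5, and the function name are assumptions for illustration; the disclosure does not specify the fusion operation at this step.

```python
def fuse_image_features(cover_feat, frame_feats, cover_weight=0.5):
    """Combine one cover feature with zero or more frame features."""
    if not frame_feats:
        return list(cover_feat)
    n = len(frame_feats)
    # Element-wise mean over all video frame features.
    frame_mean = [sum(col) / n for col in zip(*frame_feats)]
    # Weighted combination of the cover feature and the mean frame feature.
    return [cover_weight * c + (1.0 - cover_weight) * f
            for c, f in zip(cover_feat, frame_mean)]
```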

[0216] In some embodiments, as shown in Figure 8, the training method for the above image feature extraction model includes:

[0217] In step S2051, sample image pairs are constructed based on the sample image set. The sample image pairs include two types of image pairs, namely positive sample image pairs and negative sample image pairs. The similarity between the two images in the positive sample image pair is greater than a third threshold, and the similarity between the two images in the negative sample image pair is less than a fourth threshold. The third threshold is greater than the fourth threshold.

[0218] In this embodiment, original sample images can be crawled from platforms such as websites, microblogs, and forums, or original multimedia resources can be obtained from multimedia resource publishing platforms. Frames are then extracted from the multimedia resources using a preset frame extraction strategy to extract the original sample images. The preset frame extraction strategy may include uniform frame extraction, fixed-number frame extraction, or frame extraction based on the difference between frames. Data cleaning is then performed on the original sample images to obtain sample images. Data cleaning can be used to remove images with low clarity or poor quality. The third threshold may be the same as or different from the first threshold, and the fourth threshold may be the same as or different from the second threshold. Both the third and fourth thresholds are values less than 1.
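
The three frame extraction strategies named above can be sketched as functions that select frame indices. Frames are modeled as non-empty, equal-length lists of pixel intensities, and the mean absolute pixel difference used for the difference-based strategy is an assumed measure; the step, count, and threshold parameters are hypothetical.

```python
def uniform_frame_indices(total_frames, step):
    """Uniform extraction: every `step`-th frame."""
    return list(range(0, total_frames, step))

def fixed_number_frame_indices(total_frames, count):
    """Fixed-number extraction: `count` frames spread evenly over the video."""
    if total_frames <= 0 or count <= 0:
        return []
    count = min(count, total_frames)
    return [(i * total_frames) // count for i in range(count)]

def difference_frame_indices(frames, min_diff):
    """Difference-based extraction: keep a frame when it differs enough
    from the most recently kept frame."""
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        ref = frames[kept[-1]]
        diff = sum(abs(a - b) for a, b in zip(frames[i], ref)) / len(ref)
        if diff >= min_diff:
            kept.append(i)
    return kept
```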

[0219] In some embodiments, data augmentation processing can be performed on the sample images to expand the sample images and increase the number of sample images. The data augmentation methods for the sample images may include, but are not limited to, rotation, flipping, scaling transformation, translation transformation, scale transformation, noise perturbation, color transformation, occlusion and other operations, thereby enhancing the expressive power of the model.

[0220] In some embodiments, constructing sample image pairs based on a sample image set includes:

[0221] Obtain a sample image set, which includes a first sample image set and a second sample image set;

[0222] In this embodiment, images from the first sample image set can be used to construct positive sample image pairs, and images from the second sample image set can be used to construct negative sample image pairs. Since the second sample images in the second sample image set differ significantly from the first sample images in the first sample image set, they can be used as comparison images for the first sample images to construct negative sample image pairs. During model training, the number of images in the second sample image set can remain fixed, while the images it contains are continuously updated as augmented images are generated.

[0223] Based on two data augmentation methods, the first sample image in the first sample image set above is subjected to data augmentation processing to obtain the first augmented image and the second augmented image corresponding to the first sample image above.

[0224] In this embodiment of the disclosure, the data augmentation method for the sample image may include, but is not limited to, rotation, flipping, scaling transformation, translation transformation, scale transformation, noise perturbation, color transformation, occlusion, etc.; the two data augmentation methods can be any two of them. Through data augmentation processing, two augmented images (also referred to below as enhanced images) corresponding to the same sample image are obtained, namely the first enhanced image and the second enhanced image.
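
As a small illustration of applying two different augmentation methods to the same first sample image, the sketch below models an image as a list of pixel rows and uses horizontal flipping and a 90-degree clockwise rotation. The choice of these two methods and all function names are assumptions; any two of the listed methods could be used.

```python
def horizontal_flip(image):
    """Mirror each row of the image left to right."""
    return [row[::-1] for row in image]

def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def two_augmented_views(image, aug_a=horizontal_flip, aug_b=rotate_90_clockwise):
    """Return two augmented images of the same sample image.

    Together they can serve as a positive sample image pair.
    """
    return aug_a(image), aug_b(image)
```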

[0225] Based on the first enhanced image and the second enhanced image corresponding to the same first sample image, the above positive sample image pair is constructed;

[0226] In this embodiment of the disclosure, a positive sample image pair can be constructed based on the first enhanced image and the second enhanced image corresponding to each first sample image; alternatively, a positive sample image pair can be constructed based on the first sample image and any one of its corresponding enhanced images.

[0227] Based on the enhanced image corresponding to any first sample image and any second sample image in the aforementioned second sample image set, the aforementioned negative sample image pair is constructed; the aforementioned enhanced image is either the aforementioned first enhanced image or the aforementioned second enhanced image.

[0228] In this embodiment of the disclosure, a negative sample image pair can be constructed based on the enhanced image corresponding to any first sample image and any second sample image in the second sample image set. In some embodiments, after generating the first enhanced image and the second enhanced image of the first sample image, any enhanced image can be stored in the second sample image set, and an original image can be deleted from the second sample image set to ensure that the total number of images in the image set remains unchanged. The image to be deleted can be determined based on the image's enqueue time, that is, the time when the image was added to the second sample image set. The image with the earliest enqueue time can be used as the image to be deleted and deleted from the second sample image set.
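
The fixed-size second sample image set with deletion by earliest enqueue time behaves like a first-in, first-out queue. A minimal sketch using the standard library follows; the class and method names are hypothetical, and images are represented by arbitrary Python objects.

```python
from collections import deque

class NegativeImageQueue:
    """Fixed-capacity store for the second sample image set."""

    def __init__(self, initial_images, capacity):
        # A deque with maxlen discards items from the opposite end when full.
        self._queue = deque(initial_images, maxlen=capacity)

    def enqueue(self, image):
        # Appending to a full deque removes the image with the earliest
        # enqueue time, so the total number of images stays unchanged.
        self._queue.append(image)

    def images(self):
        return list(self._queue)
```

For example, adding an enhanced image to a full queue of three images removes the oldest one and keeps the size at three.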

[0229] Based on the above positive sample image pairs and the above negative sample image pairs, construct sample image pairs.

[0230] In this embodiment of the disclosure, both positive sample image pairs and negative sample image pairs can be defined as sample image pairs.

[0231] In this embodiment of the disclosure, a first enhanced image and a second enhanced image can be constructed through data augmentation, thereby constructing a positive sample image pair with high similarity based on the first enhanced image and the second enhanced image, and constructing a negative sample image pair based on the enhanced image corresponding to the first sample image and the second sample image, thereby improving the accuracy of the sample image pair.

[0232] In step S2053, each sample image in the above sample image pair is input into the second preset model for image feature extraction processing to obtain the sample image features corresponding to each sample image;

[0233] In this embodiment of the disclosure, the number of sample image pairs input to the model at one time can be a preset number (batch), that is, the number of samples used in one iteration. The parameters of the model need to be updated after each batch is completed. The second preset model may include, but is not limited to, Convolutional Neural Networks (CNN), Residual Networks (ResNet), etc.

[0234] In this embodiment of the disclosure, as shown in Figure 9, the second preset model includes a first encoder, a second encoder, and a nonlinear layer. The sample image pair includes a first image and a second image. Inputting each sample image in the sample image pair into the second preset model for image feature extraction processing to obtain the sample image features corresponding to each sample image includes:

[0235] In step S20531, the first image is input into the first encoder for encoding processing to obtain the first encoded feature;

[0236] In this embodiment of the disclosure, the first encoder may include, but is not limited to, convolutional neural networks (CNN), residual networks (ResNet), etc.

[0237] In step S20533, the second image is input into the second encoder for encoding processing to obtain the second encoded feature;

[0238] In this embodiment of the disclosure, the first encoder and the second encoder can be the same type of neural network or different types of neural networks; the second encoder can include, but is not limited to, convolutional neural networks (CNN), residual networks (ResNet), etc.

[0239] In step S20535, the first encoded feature and the second encoded feature are respectively input into the nonlinear layer for nonlinear mapping processing to obtain the sample image features corresponding to each sample image.

[0240] In this embodiment of the disclosure, the first encoded feature and the second encoded feature can be input into a nonlinear layer for nonlinear mapping processing to obtain the sample image features corresponding to each sample image; the nonlinear layer may include, but is not limited to, a fully connected layer with a nonlinear activation function; adding a nonlinear layer to the model can improve the model accuracy.
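As an illustration of a nonlinear layer of the kind mentioned above, the following sketch applies a fully connected layer followed by an activation function to an encoded feature. ReLU is chosen here only as an example activation; the disclosure does not name one, and the function name and argument layout are assumptions.

```python
def nonlinear_layer(encoded_feature, weights, biases):
    """Fully connected layer followed by a ReLU activation.

    weights holds one row per output unit; biases holds one value per output unit.
    """
    outputs = []
    for row, bias in zip(weights, biases):
        pre_activation = sum(w * x for w, x in zip(row, encoded_feature)) + bias
        outputs.append(max(0.0, pre_activation))  # ReLU: negative values become 0
    return outputs
```

The same layer would be applied to both the first encoded feature and the second encoded feature to obtain the two sample image features.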

[0241] In this embodiment of the present disclosure, two encoders are set in the second preset model, feature extraction is performed by the two encoders respectively, and contrastive training is then performed, which can improve the training efficiency of the model; setting a nonlinear layer in the second preset model can improve the accuracy of feature extraction, thereby improving the accuracy of the model.

[0242] In step S2055, the similarity between the features of the two sample images corresponding to the above positive sample image pair is determined to obtain the positive sample image similarity.

[0243] In this embodiment of the disclosure, after obtaining the features of the two sample images corresponding to the positive sample image pair, the similarity between the two image features can be calculated to obtain the positive sample image similarity; during the training process, the positive sample image similarity is continuously increased.

[0244] In step S2057, the similarity between the features of the two sample images corresponding to the above negative sample image pair is determined to obtain the negative sample image similarity.

[0245] In this embodiment of the disclosure, after obtaining the features of the two sample images corresponding to the negative sample image pair, the similarity between the two image features can be calculated to obtain the negative sample image similarity; during the training process, the negative sample image similarity is continuously reduced.

[0246] In step S2059, the second preset model is trained based on the difference between the similarity of the positive sample images and the similarity of the negative sample images to obtain the image feature extraction model.

[0247] In this embodiment of the disclosure, a contrastive loss function can be constructed based on the difference between the similarity of positive sample images and the similarity of negative sample images, and the second preset model can be trained to make the similarity of positive sample images increase while the similarity of negative sample images decrease, thereby obtaining an image feature extraction model.

[0248] In this embodiment, an image feature extraction model can be trained using an unsupervised learning strategy. Compared to supervised training that requires labeling, the unsupervised training method in this embodiment reduces the training difficulty.

[0249] In this embodiment of the disclosure, as shown in Figure 10, training the second preset model based on the difference between the similarity of the positive sample images and the similarity of the negative sample images to obtain the image feature extraction model includes:

[0250] In step S20591, the second loss information is determined based on the difference between the similarity of the positive sample images and the similarity of the negative sample images.

[0251] In this embodiment of the disclosure, the difference between the similarity of positive sample images and the similarity of negative sample images can be calculated to obtain the second loss information; alternatively, weights can be set for the similarity of positive sample images and the similarity of negative sample images, and the second loss information can be determined based on the difference between the two weighted similarities.
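One possible reading of the weighted difference is sketched below. The disclosure does not fix the similarity measure or the sign convention, so cosine similarity, the parameter names `w_pos` and `w_neg`, and the choice to make the loss smaller when positive similarity rises and negative similarity falls are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two feature vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def second_loss(pos_similarity, neg_similarity, w_pos=1.0, w_neg=1.0):
    """Weighted difference between negative and positive similarity.

    Minimizing this value pushes positive similarity up and negative similarity down.
    """
    return w_neg * neg_similarity - w_pos * pos_similarity
```

Under this convention, a pair set with positive similarity 0.9 and negative similarity 0.1 yields a lower loss than one where both similarities are 0.5.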

[0252] In step S20593, the parameters of the first encoder and the nonlinear layer are adjusted according to the second loss information to obtain the current parameters of the first encoder.

[0253] In this embodiment of the disclosure, during the training process, backpropagation is performed only on the first encoder, and a gradient truncation (stop-gradient) strategy is adopted for the second encoder; that is, only the parameters of the first encoder and the nonlinear layer are adjusted in real time through the second loss information to obtain the current parameters of the first encoder; the parameters of the second encoder are not updated by backpropagation during a training iteration.

[0254] In step S20595, when the current training iterations meet the preset conditions, the parameters of the second encoder are updated based on the current parameters until the training termination conditions are met, and the first encoder, the second encoder, and the nonlinear layer at the end of training are determined as the image feature extraction model.

[0255] In this embodiment, when the number of training iterations meets a preset condition, the parameters of the second encoder can be adjusted based on the current parameters of the first encoder, and training can continue. The preset condition can be a specified set of training iterations, such as the 3rd, 5th, or 7th iteration. That is, at the 3rd, 5th, or 7th iteration, the current parameters of the first encoder are obtained, the parameters of the second encoder are adjusted based on those current parameters, and training continues until the training termination condition is met. The training termination condition can be determined by the value of the second loss information or by the number of iterations; for example, the training termination condition can be that the second loss information is greater than a preset value or that the number of iterations reaches a preset number, thereby obtaining the image feature extraction model.
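The update schedule can be sketched as follows. This is a simplified illustration in which parameters are plain lists of floats and `step_fn` is a hypothetical stand-in for one backpropagation step on the first encoder. The disclosure only says the second encoder is updated "based on" the first encoder's current parameters; the `momentum` blend is an assumption, with `momentum=0.0` corresponding to a direct copy.

```python
def update_second_encoder(second_params, first_params, momentum=0.0):
    """Blend the first encoder's current parameters into the second encoder.

    momentum=0.0 copies the first encoder's parameters outright.
    """
    return [momentum * s + (1.0 - momentum) * f
            for s, f in zip(second_params, first_params)]


def train(first_params, second_params, num_iterations, update_iterations, step_fn):
    """Run the training loop with periodic second-encoder updates."""
    for iteration in range(1, num_iterations + 1):
        # Gradient truncation: step_fn changes only the first encoder's
        # parameters; the second encoder is read but never written here.
        first_params = step_fn(first_params, second_params)
        if iteration in update_iterations:  # e.g. the 3rd, 5th, or 7th iteration
            second_params = update_second_encoder(second_params, first_params)
    return first_params, second_params
```

With `update_iterations={3, 5, 7}`, the second encoder stays fixed between those iterations and is refreshed only at them.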

[0256] In this embodiment, second loss information can be determined based on the difference between the similarity of positive sample images and the similarity of negative sample images, and the parameters of the first encoder and the nonlinear layer can be adjusted based on the second loss information. When the current training iterations meet the preset conditions, the parameters of the second encoder are updated based on the current parameters, thereby reducing the model training complexity and improving the model training efficiency. Through contrastive learning, the accuracy of the image feature extraction model in extracting image features can be improved.

[0257] In some embodiments, as shown in Figure 11, which is a schematic diagram of the second preset model, the second preset model includes a first encoder, a second encoder, and a nonlinear layer. The training process involves: performing data augmentation on a first sample image to obtain a first enhanced image and a second enhanced image; constructing a positive sample image pair based on the first and second enhanced images; inputting the first enhanced image into the first encoder to obtain a first encoded feature; inputting the second enhanced image into the second encoder to obtain a second encoded feature; inputting the first and second encoded features into the nonlinear layer for nonlinear mapping to obtain a first image feature and a second image feature; calculating the similarity between the first and second image features to obtain the positive sample image similarity; calculating the negative sample image similarity of the negative sample image pairs in a similar way; determining the second loss information based on the difference between the positive and negative sample image similarities; backpropagating the second loss information to the first encoder to adjust its parameters; performing gradient truncation on the second encoder; and adjusting the parameters of the second encoder based on the current parameters of the first encoder when a preset number of training iterations is reached, until training ends, thus obtaining the image feature extraction model.

[0258] In step S207, the target text features and the target image features are fused to obtain target fused features.

[0259] In this embodiment of the disclosure, the target text features and the target image features can be spliced together to obtain target fusion features; or the target text features and the target image features can be fused using a feature fusion network such as a multi-head attention network to obtain target fusion features; the target fusion features can accurately characterize the multimedia resources to be identified.

[0260] In some embodiments, as shown in Figure 12, which is a flowchart of a method for determining target fusion features through a model: first, the target text, target cover image, and target video frame are extracted from the multimedia resource to be identified. Then, text features of the target text are extracted using the text feature extraction model to obtain the target text features. Next, image features of the target cover image and the target video frame are extracted using the image feature extraction model to obtain the first image feature (for the target cover image) and the second image feature (for the target video frame). The first image feature and the second image feature are subjected to image fusion processing to obtain fused image features. Finally, the target text features and the fused image features are fused to obtain the target fusion features.
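The two-stage fusion can be sketched as below, using concatenation (splicing) for the text and image fusion, which is one of the options named in the disclosure. The element-wise mean used for the image fusion step is an assumption, since the disclosure does not specify how the cover-image and video-frame features are combined; the function names are likewise illustrative.

```python
def fuse_image_features(first_image_feature, second_image_feature):
    """Fuse cover-image and video-frame features (assumed: element-wise mean)."""
    return [(a + b) / 2.0
            for a, b in zip(first_image_feature, second_image_feature)]


def fuse_text_and_image(text_feature, fused_image_feature):
    """Splice the text feature and the fused image feature into one vector."""
    return list(text_feature) + list(fused_image_feature)


def target_fusion_feature(text_feature, cover_feature, frame_feature):
    """Full pipeline: image fusion first, then text and image fusion."""
    return fuse_text_and_image(text_feature,
                               fuse_image_features(cover_feature, frame_feature))
```

A multi-head attention network, the other option mentioned, would replace `fuse_text_and_image` with a learned fusion module.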

[0261] In step S209, an abnormal cluster set is obtained; the abnormal cluster set is obtained by clustering the preset fusion features corresponding to the multiple abnormal multimedia resources according to their respective abnormal categories; the abnormal cluster set includes at least one abnormal cluster; the feature set corresponding to each abnormal cluster includes at least one preset fusion feature; each abnormal cluster corresponds to one abnormal category.

[0262] In this embodiment of the disclosure, the above-mentioned acquisition of anomaly clusters includes:

[0263] Obtain the anomaly fusion characteristics corresponding to each of the multiple abnormal multimedia resources;

[0264] In some embodiments, abnormal multimedia resources can be obtained by acquiring public feedback information through a multimedia resource publishing platform. Then, text features are extracted using a text feature extraction model, and image features are extracted using an image feature extraction model. The extracted text features and image features are then fused to obtain abnormal fused features.

[0265] Based on the anomaly categories corresponding to the aforementioned multiple abnormal multimedia resources, the aforementioned multiple abnormal fusion features are clustered to obtain an anomaly cluster set; the aforementioned anomaly cluster set includes at least one anomaly cluster; the feature set corresponding to each anomaly cluster includes at least one abnormal fusion feature.

[0266] In this embodiment of the disclosure, when initially constructing anomaly clusters, each anomaly multimedia resource can be labeled with an anomaly category. The anomaly category can represent the domain, information type, etc., corresponding to the multimedia resource. The anomaly fusion features corresponding to the anomaly multimedia resources of the same anomaly category are obtained, thereby constructing anomaly clusters. An anomaly cluster set is constructed based on at least one anomaly category, wherein each anomaly cluster corresponds to one or more anomaly fusion features.

[0267] In some embodiments, after obtaining the abnormal cluster set, the method further includes:

[0268] Obtain the newly added fusion features corresponding to the newly added abnormal multimedia resources;

[0269] In this embodiment of the disclosure, the newly added abnormal multimedia resources are abnormal multimedia resources acquired after the abnormal cluster set has been initially constructed. For these subsequently acquired resources, the corresponding newly added fusion features can be extracted, and the newly added abnormal multimedia resources can be classified according to the newly added fusion features and the abnormal clusters, thereby expanding the multimedia resources corresponding to the abnormal clusters and improving the accuracy of clustering.

[0270] Based on the abnormal fusion features corresponding to each abnormal cluster in the above abnormal cluster set, determine the central feature corresponding to each abnormal cluster;

[0271] In this embodiment of the disclosure, each anomaly cluster corresponds to an anomaly category. For any anomaly cluster, all anomaly fusion features corresponding to that cluster can be obtained, and the average value of all anomaly fusion features can be calculated to obtain a central feature. Alternatively, the average value of some anomaly fusion features corresponding to the anomaly cluster can be calculated to obtain a central feature. This central feature can be used to characterize the anomaly cluster and for feature matching to determine whether a newly added abnormal multimedia resource belongs to that anomaly cluster.
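A minimal sketch of the averaging described above, with feature vectors represented as plain lists of floats (the function name is illustrative):

```python
def central_feature(fusion_features):
    """Element-wise mean of the fusion features belonging to one cluster."""
    if not fusion_features:
        raise ValueError("a cluster must contain at least one fusion feature")
    count = len(fusion_features)
    dimension = len(fusion_features[0])
    return [sum(feature[i] for feature in fusion_features) / count
            for i in range(dimension)]
```

Passing only a subset of the cluster's features gives the alternative in which the central feature is computed from some, rather than all, of the fusion features.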

[0272] Based on the similarity between the newly added fusion feature and the central feature corresponding to each abnormal cluster, the abnormal clusters that match the newly added fusion feature are determined, and the matching clusters are obtained.

[0273] In this embodiment of the disclosure, the similarity between the newly added fusion feature and the central feature corresponding to each abnormal cluster can be calculated to determine the abnormal cluster that matches the newly added fusion feature and obtain the matching cluster.

[0274] Add the newly added fusion features to the feature set corresponding to the matching clusters.

[0275] In this embodiment of the disclosure, after determining the matching cluster corresponding to the newly added fusion feature, the feature can be stored in the matching cluster, and the central feature of the matching cluster can be updated according to the newly added fusion feature, so as to facilitate the next matching; at the same time, the correspondence between the newly added abnormal multimedia resources and the matching cluster can also be stored, so as to facilitate the search for the abnormal multimedia resources corresponding to each abnormal cluster.
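If the central feature is the mean of the cluster's features, it can be updated incrementally when a feature is added, without revisiting the stored features. The sketch below assumes that mean-based definition; the function name and the `(center, count)` representation are illustrative.

```python
def add_to_cluster(center, count, new_feature):
    """Return the updated (center, count) after adding one fusion feature.

    Uses the running-mean identity: new_mean = mean + (x - mean) / (n + 1).
    """
    new_count = count + 1
    new_center = [c + (x - c) / new_count for c, x in zip(center, new_feature)]
    return new_center, new_count
```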

[0276] In this embodiment of the disclosure, for newly added abnormal multimedia resources, their corresponding fusion features can be extracted, and the fusion features can be matched with the central features corresponding to each abnormal cluster to determine the abnormal cluster to which the newly added abnormal multimedia resources belong. The newly added abnormal fusion features are added to the corresponding abnormal cluster, thereby continuously updating the feature set corresponding to the abnormal cluster, which facilitates the updating of the central features corresponding to the abnormal cluster, improves the accuracy of the central features, and thus improves the recognition accuracy of abnormal multimedia resources.

[0277] In some embodiments, the above method further includes:

[0278] If no abnormal cluster matches the newly added fusion feature, construct a new cluster corresponding to the newly added fusion feature.

[0279] Based on the newly added clusters, update the above set of abnormal clusters.

[0280] In this embodiment of the disclosure, when there is no abnormal cluster in the abnormal cluster set that matches the newly added fusion feature, the newly added abnormal multimedia resource does not match any of the existing abnormal clusters. In this case, a new cluster corresponding to the newly added fusion feature is constructed, and the new cluster is added to the abnormal cluster set to update the abnormal cluster set.

[0281] In some embodiments, as shown in Figure 13, which is a flowchart of a method for updating the abnormal cluster set, the method includes:

[0282] In step S1301, newly added abnormal multimedia resources are obtained and their corresponding newly added fusion features are extracted; the similarity between the newly added fusion features and the current abnormal cluster in the abnormal cluster set is calculated.

[0283] In step S1303, it is determined whether the similarity between the newly added fusion feature and the current abnormal cluster is greater than a preset threshold;

[0284] In step S1305, when the similarity is greater than a preset threshold, the newly added fusion feature is added to the current abnormal cluster;

[0285] In step S1307, when the similarity is less than or equal to a preset threshold, it is determined whether there are any unmatched remaining clusters in the abnormal cluster set;

[0286] In step S1309, if there are remaining clusters, the remaining clusters are used as the current clusters again, and the process jumps to the step of calculating the similarity between the newly added fusion feature and the current abnormal cluster in the abnormal cluster set;

[0287] In step S13011, if there are no remaining clusters, a new cluster is created for the newly added fusion feature, and the abnormal cluster set is updated.
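The update flow of steps S1301 to S13011 can be sketched as a single loop over the clusters. This is an illustrative sketch: clusters are represented as lists of feature vectors, the similarity to a cluster is taken as cosine similarity to the mean of its features, and all function names are assumptions rather than details given in the disclosure.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def mean_feature(features):
    return [sum(f[i] for f in features) / len(features)
            for i in range(len(features[0]))]


def update_cluster_set(clusters, new_feature, threshold):
    """Add new_feature to the first matching cluster, or open a new cluster.

    Returns the index of the cluster that received the feature.
    """
    for index, features in enumerate(clusters):  # S1309: next remaining cluster
        similarity = cosine_similarity(new_feature, mean_feature(features))
        if similarity > threshold:               # S1303: compare with threshold
            features.append(new_feature)         # S1305: add to current cluster
            return index
    clusters.append([new_feature])               # S13011: no match, new cluster
    return len(clusters) - 1
```

The loop stops at the first cluster whose similarity exceeds the threshold, mirroring the flowchart; picking the most similar cluster among all candidates would be an alternative design.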

[0288] In step S2011, based on the matching results of the target fusion features and each abnormal cluster in the abnormal cluster set, the identification result of the multimedia resource to be identified is determined; the identification result indicates whether the multimedia resource to be identified is abnormal.

[0289] In this embodiment of the disclosure, the identification result of the above-mentioned multimedia resource to be identified can be determined by target fusion features; the identification result indicates whether the above-mentioned multimedia resource to be identified is an abnormal multimedia resource; thereby, abnormal multimedia resources can be identified from multiple multimedia resources to be identified, and the abnormal type and corresponding processing strategy corresponding to the multimedia resource to be identified can be further determined, thereby reducing the erroneous public opinion guidance caused by abnormal multimedia resources.

[0290] In some embodiments, the identification result of the multimedia resource to be identified is determined based on the matching results of the target fusion features and each abnormal cluster in the abnormal cluster set, including:

[0291] When an abnormal cluster that matches the target fusion features exists in the abnormal cluster set, the multimedia resource to be identified is determined to be abnormal.

[0292] In this embodiment of the disclosure, when there is an abnormal cluster in the abnormal cluster set that matches the above-mentioned target fusion feature, it indicates that the multimedia resource to be identified is an abnormal multimedia resource corresponding to an abnormal cluster in the abnormal cluster set.

[0293] In this embodiment of the disclosure, an abnormal cluster set can be constructed by clustering, and the multimedia resource to be identified can be matched with each abnormal cluster in the abnormal cluster set, thereby achieving rapid and accurate screening of abnormal multimedia resources.
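A minimal sketch of the matching step, assuming each abnormal cluster is summarized by a central feature and that "matching" means the cosine similarity exceeds a preset threshold (the threshold test is borrowed from the cluster update flow; the disclosure does not spell it out for this step). The category names in the usage note are hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def identify(target_feature, centers, threshold):
    """Return (is_abnormal, matched_category) for one resource.

    centers maps each abnormal category to its cluster's central feature.
    """
    best_category, best_similarity = None, -1.0
    for category, center in centers.items():
        similarity = cosine_similarity(target_feature, center)
        if similarity > best_similarity:
            best_category, best_similarity = category, similarity
    if best_similarity > threshold:
        return True, best_category
    return False, None
```

For instance, with centers `{"fraud": [1.0, 0.0], "spam": [0.0, 1.0]}` and a threshold of 0.9, a target feature of `[1.0, 0.1]` matches the `"fraud"` cluster, while `[1.0, 1.0]` matches neither.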

[0294] In some embodiments, when there are anomalous clusters in the anomalous cluster set that match the target fusion feature, the above method further includes:

[0295] Determine the target processing strategy corresponding to the matched abnormal cluster (the target abnormal cluster);

[0296] In this embodiment of the disclosure, a processing strategy corresponding to each anomaly cluster in the anomaly cluster set can be pre-constructed. Different anomaly clusters correspond to different processing strategies, thereby allowing the search for a strategy that matches the target anomaly cluster and obtaining the corresponding target processing strategy.

[0297] The aforementioned multimedia resources to be identified are processed based on the target processing strategy.

[0298] In this embodiment of the disclosure, the target processing strategy may include, but is not limited to, performing operations such as reviewing or deleting the multimedia resource to be identified, or sending warning information to the publishing object of the multimedia resource to be identified.

[0299] In this embodiment of the disclosure, different processing strategies can be set for different types of abnormal clusters, so as to realize timely processing of the identified abnormal multimedia resources and avoid the abnormal multimedia resources from causing significant negative impact.

[0300] In some embodiments, as shown in Figure 14, before determining the target processing strategy corresponding to the target abnormal cluster, the method further includes:

[0301] In step S1401, the sorting parameter of each abnormal cluster is determined based on the number of abnormal multimedia resources corresponding to each abnormal cluster and the attention level of those abnormal multimedia resources; the attention level characterizes how much attention the abnormal multimedia resources receive.

[0302] In this embodiment, the attention level of abnormal multimedia resources can be determined based on factors such as the number of searches, likes, and views on the application platform. The attention level can be determined directly from one of these factors, or the weighted sum of these factors can be used as the attention level. The average or weighted average of the number of abnormal multimedia resources and the attention level corresponding to each abnormal cluster can be calculated to obtain the sorting parameter of each abnormal cluster. The abnormal clusters can then be sorted according to these sorting parameters.

[0303] In some embodiments, before determining the sorting parameter of each abnormal cluster based on the number of abnormal multimedia resources corresponding to each abnormal cluster and the attention level of those abnormal multimedia resources, the method further includes:

[0304] Based on the number of abnormal fusion features corresponding to each abnormal cluster in the above abnormal cluster set, the number of abnormal multimedia resources corresponding to each abnormal cluster is determined.

[0305] In this embodiment of the disclosure, after constructing the abnormal cluster set, the number of abnormal fusion features corresponding to each abnormal cluster in the abnormal cluster set can be determined; because each abnormal fusion feature corresponds to one abnormal multimedia resource, this number is also the number of abnormal multimedia resources corresponding to the abnormal cluster.

[0306] Get the level of attention of the abnormal multimedia resources corresponding to each abnormal cluster.

[0307] In this embodiment, the level of attention of abnormal multimedia resources can be determined based on factors such as the number of searches, likes, and views on the application platform. The level of attention can be determined directly based on these factors, or by calculating the weighted sum of these factors as the level of attention. When there are multiple abnormal multimedia resources corresponding to an abnormal cluster, the sum of the level of attention values for each abnormal multimedia resource can be calculated and used as the level of attention value for the abnormal cluster.
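The weighted-sum option can be sketched as follows. The weight values, the dictionary keys `searches`, `likes`, and `views`, and the function names are assumptions chosen for illustration.

```python
def attention_level(searches, likes, views, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the interaction counts of one abnormal multimedia resource."""
    w_searches, w_likes, w_views = weights
    return w_searches * searches + w_likes * likes + w_views * views


def cluster_attention_level(resources, weights=(1.0, 1.0, 1.0)):
    """Sum of the attention levels of all resources in one abnormal cluster."""
    return sum(
        attention_level(r["searches"], r["likes"], r["views"], weights)
        for r in resources
    )
```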

[0308] In this embodiment of the disclosure, the number of abnormal multimedia resources corresponding to each abnormal cluster can be determined by the number of abnormal fusion features in each abnormal cluster, and the attention level of the abnormal multimedia resources in each abnormal cluster can be obtained from the application platform. Thus, the abnormality handling strategy for each abnormal cluster can be determined based on the number of abnormal multimedia resources and the attention level corresponding to each abnormal cluster.

[0309] In step S1403, each abnormal cluster in the abnormal cluster set is sorted according to the sorting parameter corresponding to each abnormal cluster in the abnormal cluster set to obtain the sorting result;

[0310] In this embodiment of the disclosure, the abnormal clusters in the abnormal cluster set can be sorted from largest to smallest according to the sorting parameter corresponding to each abnormal cluster to obtain a first sorting result; or sorted from smallest to largest to obtain a second sorting result.

[0311] In step S1405, based on the above sorting results, the anomaly level parameter corresponding to each anomaly cluster in the above anomaly cluster set is determined;

[0312] In this embodiment of the disclosure, based on the first sorting result, the abnormal clusters ranked higher are determined as clusters with higher abnormality levels and are set with larger abnormality level parameters; the abnormal clusters ranked lower are determined as clusters with lower abnormality levels and are set with smaller abnormality level parameters; for the second sorting result, the corresponding abnormality level parameters are set using the opposite method; the abnormality level parameters can characterize the importance of the abnormal clusters.
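Steps S1401 to S1405 can be sketched together as below, using the first sorting result (largest to smallest). The weighted average with equal weights and the use of the reversed rank position as the abnormality level parameter are assumptions; the disclosure only requires that higher-ranked clusters receive larger parameters.

```python
def sorting_parameter(resource_count, attention, w_count=0.5, w_attention=0.5):
    """Weighted average of resource count and attention level for one cluster."""
    return w_count * resource_count + w_attention * attention


def abnormality_levels(sorting_params):
    """Map each cluster id to an abnormality level parameter.

    Clusters are sorted from largest to smallest sorting parameter; the
    top-ranked cluster gets the largest level (equal to the cluster count).
    """
    ranked = sorted(sorting_params, key=sorting_params.get, reverse=True)
    total = len(ranked)
    return {cluster_id: total - position
            for position, cluster_id in enumerate(ranked)}
```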

[0313] In step S1407, a first exception handling strategy is constructed based on the exception level parameter corresponding to each exception cluster in the above exception cluster set; the first exception handling strategy includes the correspondence between the exception level parameter and the first handling strategy.

[0314] In this embodiment, multiple exception parameter ranges can be defined based on exception level parameters, and the same exception handling strategy can be set for exception clusters within each exception parameter range. Exception clusters with larger exception level parameters can be given priority and assigned higher-priority exception handling strategies to ensure priority execution of the strategies. The constructed first exception handling strategy includes the correspondence between exception level parameters and the first handling strategy, which can be a one-to-one or one-to-many relationship. The first handling strategy may include, but is not limited to, pushing or deleting abnormal multimedia resources, or sending warning messages to the publishing objects of abnormal multimedia resources.
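The correspondence between abnormality level parameter ranges and handling strategies can be represented as a simple lookup table. The range boundaries and strategy labels below are hypothetical placeholders, not values from the disclosure.

```python
def strategy_for_level(level, level_ranges):
    """Return the handling strategy whose inclusive range contains the level.

    level_ranges is a list of (low, high, strategy) tuples; the first
    matching range wins, and None is returned when no range matches.
    """
    for low, high, strategy in level_ranges:
        if low <= level <= high:
            return strategy
    return None
```

For example, `[(3, 3, "delete"), (1, 2, "warn")]` assigns the same strategy to every cluster whose level falls in the range 1 to 2.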

[0315] In some embodiments, determining the target processing strategy corresponding to the above-mentioned target anomaly clusters includes:

[0316] Based on the first exception handling strategy described above, the target handling strategy corresponding to the target exception cluster is determined.

[0317] In this embodiment of the disclosure, a processing strategy matching the target exception cluster can be found according to a pre-set first exception handling strategy to obtain the target processing strategy.

[0318] In this embodiment of the disclosure, the abnormal clusters in the abnormal cluster set can be sorted according to the number and the attention level of the abnormal multimedia resources corresponding to each abnormal cluster, and a first abnormality handling strategy can be constructed according to the sorting result; thereby, the target handling strategy corresponding to the target abnormal cluster can be quickly determined according to the first abnormality handling strategy.

[0319] In some embodiments, before determining the target processing strategy corresponding to the target anomaly cluster, the method further includes:

[0320] Obtain the abnormal multimedia resources corresponding to each abnormal cluster in the above abnormal cluster set;

[0321] In this embodiment of the disclosure, abnormal multimedia resources corresponding to each abnormal cluster in the abnormal cluster set can be obtained, and the scene category corresponding to the abnormal cluster can be determined by further extracting text features.

[0322] Based on the preset text in the abnormal multimedia resources corresponding to each abnormal cluster, determine the scene category corresponding to each abnormal cluster.

[0323] In this embodiment of the disclosure, the preset text may include, but is not limited to, keywords, titles, and other text in the abnormal multimedia resources; the scene category corresponding to each abnormal cluster can be determined through the preset text in the abnormal multimedia resources; the scene category can characterize the risk level of the abnormal cluster. For example, when the scene category characterizes the abnormal multimedia resource as a multimedia resource targeting minors, the risk level of the corresponding abnormal cluster can be determined to be high risk, requiring close attention.

[0324] Based on the scenario category corresponding to each anomaly cluster in the above anomaly cluster set, a second anomaly handling strategy is constructed; the second anomaly handling strategy includes the correspondence between scenario categories and the second handling strategy.

[0325] In this embodiment, corresponding exception handling strategies can be set according to the scenario category corresponding to the exception cluster. For scenarios with higher risk levels, higher priority exception handling strategies are set to ensure the priority execution of the strategies. Different scenario categories correspond to different exception handling strategies. The constructed second exception handling strategy includes the correspondence between scenario categories and the second handling strategy, which can be a one-to-one or one-to-many relationship. The second handling strategy may include, but is not limited to, pushing or deleting abnormal multimedia resources, or sending warning information to the publishing objects of abnormal multimedia resources.
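A minimal sketch of the scene-category route: the category is derived from the preset text and then looked up in the second handling strategy. Keyword matching is only one conceivable way to derive the category and is an assumption here, as are the keyword, category, and strategy labels.

```python
def scene_category(preset_texts, keyword_to_category, default="general"):
    """Return the category of the first keyword found in the preset texts."""
    for text in preset_texts:
        lowered = text.lower()
        for keyword, category in keyword_to_category.items():
            if keyword in lowered:
                return category
    return default


def strategy_for_scene(category, second_strategy, default_strategy="review"):
    """Look up the handling strategy for a scene category."""
    return second_strategy.get(category, default_strategy)
```

For example, a mapping of `{"kids": "minors"}` together with `{"minors": "delete"}` would route a cluster whose titles mention "kids" to the higher-priority strategy.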

[0326] In some embodiments, determining the target processing strategy corresponding to the above-mentioned target anomaly clusters includes:

[0327] Based on the second anomaly handling strategy described above, the target handling strategy corresponding to the target anomaly cluster is determined.

[0328] In this embodiment of the disclosure, the handling strategy matching the target anomaly cluster can be looked up in the pre-set second anomaly handling strategy to obtain the target handling strategy.

[0329] In this embodiment of the disclosure, a second anomaly handling strategy can be constructed based on the scenario category corresponding to each anomaly cluster; thereby, the target handling strategy corresponding to the target anomaly cluster can be quickly determined based on the second anomaly handling strategy.
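
The correspondence between scenario categories and second handling strategies described above can be sketched as a simple lookup. The following is a minimal illustration only: the keyword lists, the category names, the `HandlingStrategy` type, and the priority values are hypothetical and are not specified by the disclosure.

```python
from dataclasses import dataclass

# Hypothetical keyword lists per scenario category (not from the disclosure).
SCENE_KEYWORDS = {
    "minors": ["student", "kid", "teen"],
    "gambling": ["bet", "casino"],
}


@dataclass(frozen=True)
class HandlingStrategy:
    actions: tuple   # e.g. ("delete", "warn_publisher")
    priority: int    # larger value = executed first (assumed convention)


# Second anomaly handling strategy: scenario category -> handling strategy.
SECOND_STRATEGY = {
    "minors": HandlingStrategy(("delete", "warn_publisher"), priority=2),
    "gambling": HandlingStrategy(("delete",), priority=1),
    "other": HandlingStrategy(("manual_review",), priority=0),
}


def scene_category(preset_texts):
    """Pick the category whose keywords occur most often in the cluster's preset text."""
    counts = {scene: 0 for scene in SCENE_KEYWORDS}
    for text in preset_texts:
        lowered = text.lower()
        for scene, words in SCENE_KEYWORDS.items():
            counts[scene] += sum(lowered.count(word) for word in words)
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "other"


def target_strategy(cluster_texts):
    """Look up the handling strategy for a cluster from its preset text."""
    return SECOND_STRATEGY[scene_category(cluster_texts)]
```

Keyword counting stands in here for the text feature extraction mentioned in paragraph [0321]; any classifier that maps preset text to a scenario category could take its place.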

[0330] This disclosure employs low-cost (unsupervised and weakly supervised) text and image representation models to extract information from multiple modalities, introduces training strategies such as contrastive learning to further improve the quality of the modal representations, and clusters the fused multimodal features to capture abnormal multimedia resources. Combined with the handling strategies described above, this significantly improves the ability to identify abnormal multimedia resources.

[0331] This disclosure identifies target text and target image in a multimedia resource to be identified; inputs the target text into a text feature extraction model for text feature extraction processing to obtain target text features; inputs the target image into an image feature extraction model for image feature extraction processing to obtain target image features; fuses the target text features and target image features to obtain target fused features; obtains an anomaly cluster set; the anomaly cluster set is obtained by clustering the preset fused features corresponding to multiple anomaly multimedia resources according to their respective anomaly categories, and the anomaly cluster set includes at least one anomaly cluster; the feature set corresponding to each anomaly cluster includes at least one preset fused feature; each anomaly cluster corresponds to one anomaly category; based on the matching results between the target fused features and each anomaly cluster in the anomaly cluster set, the identification result of the multimedia resource to be identified is determined. The target fused features obtained by this disclosure fuse features from two modalities, which can accurately represent the multimedia resource to be identified; by obtaining the anomaly cluster set and matching the target fused features with each anomaly cluster in the anomaly cluster set, the identification result of the multimedia resource to be identified can be determined quickly and accurately, improving the identification accuracy and efficiency of anomaly multimedia resources.
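
The fusion and matching steps summarized above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the disclosed implementation: fusion is assumed to be concatenation of L2-normalized features, matching is assumed to be a cosine-similarity threshold against every preset fused feature, and the names `fuse`, `identify`, and `threshold` are hypothetical.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(text_feat, image_feat):
    """Assumed fusion: concatenate the normalized text and image features."""
    return l2_normalize(l2_normalize(text_feat) + l2_normalize(image_feat))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def identify(target_fused, anomaly_clusters, threshold=0.9):
    """Return (is_abnormal, category).

    anomaly_clusters maps an anomaly category to its list of preset fused features.
    """
    best_category, best_sim = None, -1.0
    for category, features in anomaly_clusters.items():
        for feat in features:
            sim = cosine(target_fused, feat)
            if sim > best_sim:
                best_category, best_sim = category, sim
    if best_sim >= threshold:
        return True, best_category
    return False, None
```

A resource whose fused feature is close to any feature of an anomaly cluster is reported as abnormal together with that cluster's category; otherwise it is reported as not abnormal.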

[0332] Figure 15 is a block diagram illustrating an abnormal multimedia resource identification device according to an exemplary embodiment. Referring to Figure 15, the device includes:

[0333] Information acquisition module 1510 is configured to determine target text and target image in a multimedia resource to be identified;

[0334] The text feature extraction module 1520 is configured to input the above-mentioned target text into the text feature extraction model for text feature extraction processing to obtain the target text features;

[0335] The image feature extraction module 1530 is configured to input the above-mentioned target image into the image feature extraction model for image feature extraction processing to obtain the target image features;

[0336] The feature fusion module 1540 is configured to perform fusion processing on the above-mentioned target text features and the above-mentioned target image features to obtain target fused features;

[0337] The cluster set acquisition module 1550 is configured to acquire an abnormal cluster set; the abnormal cluster set is obtained by clustering the preset fusion features corresponding to the multiple abnormal multimedia resources according to their respective abnormal categories; the abnormal cluster set includes at least one abnormal cluster; the feature set corresponding to each abnormal cluster includes at least one preset fusion feature; each abnormal cluster corresponds to one abnormal category.

[0338] The anomaly identification module 1560 is configured to determine the identification result of the multimedia resource to be identified based on the matching results between the target fusion features and each anomaly cluster in the anomaly cluster set; the identification result indicates whether the multimedia resource to be identified is abnormal.

[0339] In one exemplary embodiment, the above-described apparatus further includes:

[0340] The text input module is configured to input training text into a first preset model, the first preset model including multiple neural networks; extract text features from the training text based on a first number of neural networks to obtain first text features; and extract text features from the training text based on a second number of neural networks to obtain second text features.

[0341] The first feature pair construction module is configured to construct positive text feature pairs based on the first text feature and the second text feature corresponding to the same training text.

[0342] The second feature pair construction module is configured to construct negative text feature pairs based on the training text features corresponding to any two training texts; the training text features are either the first text feature or the second text feature.

[0343] The text similarity determination module is configured to determine the first similarity between two text features in the above positive text feature pair and the second similarity between two text features in the above negative text feature pair;

[0344] The text model training module is configured to train the first preset model based on the difference between the first similarity and the second similarity to obtain the text feature extraction model.
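
The training signal described by the modules above can be sketched as a loss computed from the difference between the positive-pair similarity and the negative-pair similarity. The hinge form and the `margin` parameter below are assumptions chosen for illustration; the disclosure only states that training is based on the difference between the two similarities.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def contrastive_loss(first_feats, second_feats, margin=0.5):
    """first_feats[i] and second_feats[i] are two encodings of training text i.

    Positive pair: (first_feats[i], second_feats[i]), same training text.
    Negative pair: (first_feats[i], second_feats[j]) with j != i, different texts.
    """
    n = len(first_feats)
    total, count = 0.0, 0
    for i in range(n):
        pos_sim = cosine(first_feats[i], second_feats[i])      # first similarity
        for j in range(n):
            if j == i:
                continue
            neg_sim = cosine(first_feats[i], second_feats[j])  # second similarity
            # Penalize pairs where the positive is not ahead by at least `margin`.
            total += max(0.0, margin - (pos_sim - neg_sim))
            count += 1
    return total / count if count else 0.0
```

The loss is zero once every positive pair is more similar than every negative pair by at least the margin, which is one way to realize "training based on the difference between the first similarity and the second similarity".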

[0345] In one exemplary embodiment, the text model training module includes:

[0346] The first loss determination submodule is configured to determine the first loss information based on the difference between the first similarity and the second similarity.

[0347] The initial model determination submodule is configured to adjust the model parameters of the first preset model according to the first loss information until the training termination condition is met, and determine the first preset model at the end of training as the initial text feature extraction model.

[0348] The text model training submodule is configured to determine the text feature extraction model based on the initial text feature extraction model described above.

[0349] In one exemplary embodiment, the text model training submodule described above includes:

[0350] The sample text pair construction unit is configured to construct sample text pairs based on the sample text set; the sample text pairs include two types of text pairs, namely positive sample text pairs and negative sample text pairs; the similarity between the two texts in the positive sample text pair is greater than a first threshold, and the similarity between the two texts in the negative sample text pair is less than a second threshold; the first threshold is greater than the second threshold.

[0351] The text feature extraction unit is configured to perform text feature extraction by inputting each sample text in the above sample text pair into the above initial text feature extraction model, and obtain the sample text features corresponding to each sample text.

[0352] The positive text similarity determination unit is configured to determine the similarity between the features of the two sample texts corresponding to the above positive sample text pair, and obtain the positive sample text similarity;

[0353] The negative text similarity determination unit is configured to determine the similarity between the features of the two sample texts corresponding to the negative sample text pair, and obtain the negative sample text similarity.

[0354] The text model training unit is configured to train the initial text feature extraction model based on the difference between the positive sample text similarity and the negative sample text similarity, thereby obtaining the text feature extraction model.
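
The threshold-based construction of sample text pairs can be illustrated as follows. The token-level Jaccard similarity and the default threshold values are assumptions made for this sketch; the disclosure does not state which similarity measure or thresholds are used.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-set overlap, used here only as a stand-in similarity measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def build_text_pairs(sample_texts, first_threshold=0.6, second_threshold=0.2,
                     similarity=jaccard):
    """Split all text pairs into positive and negative sample text pairs."""
    if first_threshold <= second_threshold:
        raise ValueError("the first threshold must be greater than the second")
    positives, negatives = [], []
    for a, b in combinations(sample_texts, 2):
        score = similarity(a, b)
        if score > first_threshold:
            positives.append((a, b))
        elif score < second_threshold:
            negatives.append((a, b))
        # Pairs between the two thresholds are ambiguous and are left out.
    return positives, negatives
```

Because the first threshold is greater than the second, pairs of intermediate similarity fall into neither set, which keeps weak labels out of the fine-tuning stage.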

[0355] In one exemplary embodiment, the above-described apparatus further includes:

[0356] The sample image pair construction module is configured to construct sample image pairs based on the sample image set. The sample image pairs include two types of image pairs, namely positive sample image pairs and negative sample image pairs. The similarity between the two images in the positive sample image pair is greater than a third threshold, and the similarity between the two images in the negative sample image pair is less than a fourth threshold. The third threshold is greater than the fourth threshold.

[0357] The sample image feature extraction module is configured to input each sample image in the above sample image pair into the second preset model for image feature extraction processing to obtain the sample image features corresponding to each sample image.

[0358] The first similarity determination module is configured to determine the similarity between the features of the two sample images corresponding to the above positive sample image pair, and obtain the positive sample image similarity;

[0359] The second similarity determination module is configured to determine the similarity between the features of the two sample images corresponding to the negative sample image pair, and obtain the negative sample image similarity.

[0360] The image model training module is configured to train the second preset model based on the similarity of the positive sample images and the similarity of the negative sample images to obtain the image feature extraction model.

[0361] In one exemplary embodiment, the above-described apparatus further includes:

[0362] The newly added fusion feature acquisition module is configured to acquire the newly added fusion features corresponding to newly added abnormal multimedia resources; the newly added abnormal multimedia resources are abnormal multimedia resources.

[0363] The central feature determination module is configured to determine the central feature corresponding to each anomaly cluster based on the anomaly fusion feature corresponding to each anomaly cluster in the above anomaly cluster set.

[0364] The matching cluster determination module is configured to determine the abnormal clusters that match the newly added fusion features based on the similarity between the newly added fusion features and the central features corresponding to each of the above abnormal clusters, thereby obtaining the matching clusters;

[0365] The feature set update module is configured to add the newly added fused features to the feature set corresponding to the matching clusters.
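
The cluster update performed by these modules can be sketched as a nearest-centroid assignment. In this sketch the central feature is assumed to be the mean of a cluster's features and matching is assumed to be a cosine-similarity threshold; the name `assign_new_feature` and the `min_similarity` parameter are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def centroid(features):
    """Central feature, assumed here to be the element-wise mean."""
    dim = len(features[0])
    return [sum(f[k] for f in features) / len(features) for k in range(dim)]


def assign_new_feature(new_feat, clusters, min_similarity=0.8):
    """Add new_feat to the best-matching cluster in place.

    clusters maps an anomaly category to its list of fused features.
    Returns the matching category, or None if no cluster is similar enough.
    """
    best, best_sim = None, -1.0
    for category, feats in clusters.items():
        sim = cosine(new_feat, centroid(feats))
        if sim > best_sim:
            best, best_sim = category, sim
    if best is not None and best_sim >= min_similarity:
        clusters[best].append(list(new_feat))
        return best
    return None
```

Comparing against one central feature per cluster, rather than against every stored feature, keeps the cost of absorbing a new abnormal resource proportional to the number of clusters.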

[0366] In one exemplary embodiment, the second preset model includes a first encoder, a second encoder, and a nonlinear layer; the sample image pair includes a first image and a second image; and the sample image feature extraction module includes:

[0367] The first encoding submodule is configured to perform encoding processing by inputting the first image into the first encoder to obtain the first encoded feature;

[0368] The second encoding submodule is configured to perform encoding processing by inputting the second image into the second encoder to obtain the second encoded feature;

[0369] The image feature determination submodule is configured to perform nonlinear mapping processing by inputting the first encoded feature and the second encoded feature into the nonlinear layer respectively to obtain the sample image features corresponding to each sample image.

[0370] In one exemplary embodiment, the image model training module includes:

[0371] The second loss information determination submodule is configured to determine the second loss information based on the difference between the similarity of the positive sample images and the similarity of the negative sample images.

[0372] The current parameter determination submodule is configured to adjust the parameters of the first encoder and the nonlinear layer based on the second loss information to obtain the current parameters of the first encoder.

[0373] The model training submodule is configured to update the parameters of the second encoder based on the current parameters when the current training iteration meets the preset conditions, until the training termination condition is met, and then determine the first encoder, the second encoder, and the nonlinear layer at the end of the training as the image feature extraction model.
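
The parameter update schedule described above can be sketched with a toy loop. The disclosure states only that the second encoder is updated from the first encoder's current parameters when the training iteration meets a preset condition; the periodic check (`update_every`) and the momentum-weighted blend below are assumptions, and the nonlinear layer is omitted for brevity.

```python
def momentum_update(second_params, first_params, momentum=0.99):
    """Blend the first encoder's parameters into the second encoder (assumed rule)."""
    return [momentum * s + (1.0 - momentum) * f
            for s, f in zip(second_params, first_params)]


def train(first_params, second_params, gradients_per_step,
          lr=0.1, update_every=2, momentum=0.99):
    """Toy training loop over flat parameter lists.

    The first encoder takes a gradient step on every iteration; the second
    encoder is refreshed only when the iteration count meets the condition.
    """
    first, second = list(first_params), list(second_params)
    for step, grads in enumerate(gradients_per_step, start=1):
        first = [p - lr * g for p, g in zip(first, grads)]
        if step % update_every == 0:  # the "preset condition" (assumed)
            second = momentum_update(second, first, momentum)
    return first, second
```

Keeping the second encoder on a slower update schedule than the first gives the contrastive objective a more stable target between refreshes.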

[0374] In one exemplary embodiment, the above-mentioned sample image pair construction module includes:

[0375] The sample image set acquisition submodule is configured to acquire a sample image set, which includes a first sample image set and a second sample image set.

[0376] The image enhancement submodule is configured to perform data enhancement processing on the first sample image in the first sample image set based on two data enhancement methods to obtain the first enhanced image and the second enhanced image corresponding to the first sample image.

[0377] The first construction submodule is configured to construct the above-mentioned positive sample image pair based on the first enhanced image and the second enhanced image corresponding to the same first sample image;

[0378] The second construction submodule is configured to construct the negative sample image pair based on the enhanced image corresponding to any first sample image and any second sample image in the second sample image set; the enhanced image is either the first enhanced image or the second enhanced image.

[0379] The sample image pair construction submodule is configured to construct the sample image pairs based on the positive sample image pairs and the negative sample image pairs.
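
The pair construction performed by these submodules can be sketched as follows. Images are represented as nested lists of pixel values, and the two data enhancement methods (horizontal flip and small additive noise) are assumptions chosen for illustration; the disclosure does not name specific enhancement methods.

```python
import random


def flip_horizontal(image):
    """First assumed enhancement: mirror each row."""
    return [row[::-1] for row in image]


def add_noise(image, rng, scale=0.05):
    """Second assumed enhancement: small uniform noise on each pixel."""
    return [[px + rng.uniform(-scale, scale) for px in row] for row in image]


def build_image_pairs(first_set, second_set, seed=0):
    """Build positive and negative sample image pairs.

    Positive pair: two enhanced views of the same first sample image.
    Negative pair: one enhanced view of a first sample image together with
    an image drawn from the second sample image set.
    """
    rng = random.Random(seed)
    positives, negatives = [], []
    for image in first_set:
        view_a = flip_horizontal(image)
        view_b = add_noise(image, rng)
        positives.append((view_a, view_b))
        if second_set:
            other = rng.choice(second_set)
            negatives.append((rng.choice([view_a, view_b]), other))
    return positives, negatives
```

No manual labels are needed for the positive pairs, since both views come from the same source image.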

[0380] In one exemplary embodiment, the above-described apparatus further includes:

[0381] The abnormal fusion feature acquisition module is configured to acquire the abnormal fusion features corresponding to each of the multiple abnormal multimedia resources.

[0382] The clustering module is configured to perform clustering of multiple abnormal fusion features based on the abnormal categories corresponding to the multiple abnormal multimedia resources, thereby obtaining an abnormal cluster set; the abnormal cluster set includes at least one abnormal cluster; and the feature set corresponding to each abnormal cluster includes at least one abnormal fusion feature.
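
One simple reading of clustering "based on the abnormal categories" is to group the abnormal fusion features by their category, so that each cluster corresponds to exactly one category. The sketch below follows that reading as an assumption; the function name and the input layout are hypothetical.

```python
from collections import defaultdict


def build_cluster_set(abnormal_resources):
    """Group fused features by anomaly category.

    abnormal_resources is an iterable of (anomaly_category, fused_feature).
    Returns a dict mapping each category to its feature set.
    """
    clusters = defaultdict(list)
    for category, fused_feature in abnormal_resources:
        clusters[category].append(list(fused_feature))
    return dict(clusters)
```

Every cluster produced this way contains at least one abnormal fusion feature, matching the requirement stated for the abnormal cluster set.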

[0383] In one exemplary embodiment, the above-mentioned anomaly identification module includes:

[0384] The anomaly determination submodule is configured to determine that the multimedia resource to be identified is abnormal when there is an anomaly cluster in the above anomaly cluster set that matches the above target fusion feature.

[0385] In an exemplary embodiment, when an anomaly cluster matching the target fusion feature exists within the aforementioned anomaly cluster set, the apparatus further includes:

[0386] The target processing strategy determination module is configured to determine the target processing strategy corresponding to the above-mentioned matched anomaly cluster;

[0387] The strategy execution module is configured to process the multimedia resources to be identified based on the aforementioned target processing strategy.

[0388] In one exemplary embodiment, the above-described apparatus further includes:

[0389] The sorting parameter determination module is configured to determine the sorting parameters for each anomaly cluster based on the amount of information and the degree of attention of the anomaly multimedia resources corresponding to each anomaly cluster.

[0390] The sorting module is configured to sort each exception cluster in the above exception cluster set according to the sorting parameters corresponding to each exception cluster in the above exception cluster set, and obtain the sorting result;

[0391] The exception level parameter determination module is configured to determine the exception level parameter corresponding to each exception cluster in the above exception cluster set based on the above sorting results.

[0392] The first strategy construction module is configured to construct a first exception handling strategy based on the exception level parameters corresponding to each exception cluster in the above exception cluster set; the first exception handling strategy includes the correspondence between the exception level parameters and the first handling strategy.

[0393] In one exemplary embodiment, the target processing strategy determination module includes:

[0394] The first strategy determination module is configured to determine, based on the first exception handling strategy described above, the target handling strategy corresponding to the target exception cluster described above.

[0395] In one exemplary embodiment, the above-described apparatus further includes:

[0396] The information quantity determination module is configured to determine the information quantity of the abnormal multimedia resources corresponding to each abnormal cluster based on the number of abnormal fusion features corresponding to each abnormal cluster in the above abnormal cluster set.

[0397] The attention level acquisition module is configured to acquire the attention level of the abnormal multimedia resources corresponding to each abnormal cluster.
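
The sorting and level assignment described by the modules above can be sketched as follows. The information quantity is taken as the number of fused features in a cluster, as stated above; the weighted sum used as the sorting parameter, the `weight` value, and the example strategies are assumptions made for illustration.

```python
def rank_clusters(clusters, attention, weight=0.5):
    """Return a dict mapping each category to its level (0 = highest).

    clusters maps a category to its list of fused features.
    attention maps a category to an attention score (for example a view
    count; the exact measure is not specified by the disclosure).
    """
    max_info = max(len(feats) for feats in clusters.values()) or 1
    max_att = max(attention[c] for c in clusters) or 1.0
    params = {
        c: weight * len(feats) / max_info + (1.0 - weight) * attention[c] / max_att
        for c, feats in clusters.items()
    }
    ranked = sorted(params, key=params.get, reverse=True)
    return {category: rank for rank, category in enumerate(ranked)}


def first_strategy_for(category, levels, level_to_strategy):
    """Look up the handling strategy for a cluster's level.

    Levels beyond the last configured one fall back to the lowest level.
    """
    lowest = max(level_to_strategy)
    return level_to_strategy[min(levels[category], lowest)]
```

For example, a small cluster that draws heavy attention can outrank a large cluster that few users see, depending on the chosen weight.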

[0398] In one exemplary embodiment, the above-described apparatus further includes:

[0399] The abnormal multimedia resource acquisition module is configured to acquire the abnormal multimedia resources corresponding to each abnormal cluster in the above-mentioned abnormal cluster set;

[0400] The scene category determination module is configured to determine the scene category corresponding to each exception cluster based on the preset text in the exception multimedia resources corresponding to each exception cluster.

[0401] The second strategy construction module is configured to construct a second exception handling strategy based on the scenario category corresponding to each exception cluster in the above exception cluster set; the second exception handling strategy includes the correspondence between scenario categories and the second handling strategy.

[0402] In one exemplary embodiment, the target processing strategy determination module includes:

[0403] The second strategy determination module is configured to determine the target handling strategy corresponding to the target exception cluster based on the second exception handling strategy described above.

[0404] Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operation has been described in detail in the embodiments related to the method, and will not be elaborated upon here.

[0405] In one exemplary embodiment, an electronic device is also provided, including a processor; a memory for storing processor-executable instructions; wherein, when the processor is configured to execute the instructions stored in the memory, it implements the abnormal multimedia resource identification method provided in any of the above embodiments.

[0406] The electronic device can be a terminal, a server, or a similar computing device. Taking a server as an example, Figure 16 is a block diagram illustrating an electronic device according to an exemplary embodiment. As shown in Figure 16, the server 1600 can vary significantly depending on configuration or performance. It may include one or more central processing units (CPUs) 1610 (CPUs 1610 may include, but are not limited to, microprocessors such as MCUs or programmable logic devices such as FPGAs), a memory 1630 for storing data, and one or more storage media 1620 (e.g., one or more mass storage devices) for storing application programs 1623 or data 1622. The memory 1630 and storage media 1620 may be temporary or persistent storage. The program stored in the storage media 1620 may include one or more modules, and each module may include a series of instruction operations on the server. Furthermore, the CPU 1610 may be configured to communicate with the storage media 1620 and execute the series of instruction operations in the storage media 1620 on the server 1600. Server 1600 may also include one or more power supplies 1660, one or more wired or wireless network interfaces 1650, one or more input/output interfaces 1640, and/or one or more operating systems 1621, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.

[0407] The input / output interface 1640 can be used to receive or send data via a network. Specific examples of the network described above may include a wireless network provided by the communication provider of server 1600. In one example, input / output interface 1640 includes a network interface controller (NIC), which can connect to other network devices via a base station to communicate with the Internet. In one example, input / output interface 1640 can be a radio frequency (RF) module for wireless communication with the Internet.

[0408] Those skilled in the art will understand that the structure shown in Figure 16 is for illustrative purposes only and does not limit the structure of the aforementioned electronic device. For example, server 1600 may include more or fewer components than shown in Figure 16, or have a configuration different from that shown in Figure 16.

[0409] In one exemplary embodiment, a computer-readable storage medium including instructions is also provided, such as a memory 1630 including instructions, which can be executed by a processor 1610 of the device 1600 to perform the above-described method. Optionally, the computer-readable storage medium may be a ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, or the like.

[0410] In one exemplary embodiment, a computer program product is also provided, including a computer program that, when executed by a processor, implements the abnormal multimedia resource identification method provided in any of the above embodiments.

[0411] This disclosure identifies target text and target image in a multimedia resource to be identified; inputs the target text into a text feature extraction model for text feature extraction processing to obtain target text features; inputs the target image into an image feature extraction model for image feature extraction processing to obtain target image features; fuses the target text features and target image features to obtain target fused features; obtains an anomaly cluster set; the anomaly cluster set is obtained by clustering the preset fused features corresponding to multiple anomaly multimedia resources according to their respective anomaly categories, and the anomaly cluster set includes at least one anomaly cluster; the feature set corresponding to each anomaly cluster includes at least one preset fused feature; each anomaly cluster corresponds to one anomaly category; based on the matching results between the target fused features and each anomaly cluster in the anomaly cluster set, the identification result of the multimedia resource to be identified is determined. The target fused features obtained by this disclosure fuse features from two modalities, which can accurately represent the multimedia resource to be identified; by obtaining the anomaly cluster set and matching the target fused features with each anomaly cluster in the anomaly cluster set, the identification result of the multimedia resource to be identified can be determined quickly and accurately, improving the identification accuracy and efficiency of anomaly multimedia resources.

[0412] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. This computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

[0413] Other embodiments of this disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of this disclosure are indicated by the following claims.

[0414] It should be understood that this disclosure is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this disclosure is limited only by the appended claims.

Claims

1. A method for identifying abnormal multimedia resources, characterized in that the method includes: Identify the target text and target image in the multimedia resource to be identified; The target text is input into a text feature extraction model for text feature extraction processing to obtain target text features; the text feature extraction model is obtained by performing data augmentation processing on sample texts in the sample text set, constructing positive sample text pairs based on the sample texts and their corresponding augmented texts, and training an initial text feature extraction model; the initial text feature extraction model is obtained by using two text features obtained by encoding the same training text twice with different random masks as positive text feature pairs, and training a first preset model; The target image is input into an image feature extraction model for image feature extraction processing to obtain the target image features; The target text features and the target image features are fused together to obtain target fused features; Obtain an anomaly cluster set; the anomaly cluster set is obtained by clustering the preset fusion features corresponding to the multiple abnormal multimedia resources according to their respective anomaly categories, and the anomaly cluster set includes at least one anomaly cluster; the feature set corresponding to each anomaly cluster includes at least one preset fusion feature; each anomaly cluster corresponds to one anomaly category; Based on the matching results between the target fusion features and each anomaly cluster in the anomaly cluster set, the identification result of the multimedia resource to be identified is determined; the identification result indicates whether the multimedia resource to be identified is abnormal.

2. The method according to claim 1, characterized in that, The training method for the text feature extraction model includes: The training text is input into a first preset model, which includes multiple neural networks; text features are extracted from the training text based on a first number of neural networks to obtain first text features; text features are extracted from the training text based on a second number of neural networks to obtain second text features; Based on the first and second text features corresponding to the same training text, construct positive text feature pairs; Based on the training text features corresponding to any two training texts, construct negative text feature pairs; the training text features are either the first text feature or the second text feature. Determine the first similarity between two text features in the positive text feature pair and the second similarity between two text features in the negative text feature pair; Based on the difference between the first similarity and the second similarity, the first loss information is determined; Based on the first loss information, adjust the model parameters of the first preset model until the training termination condition is met, and determine the first preset model at the end of training as the initial text feature extraction model. Based on the initial text feature extraction model, the text feature extraction model is determined.

3. The method according to claim 2, characterized in that, The step of determining the text feature extraction model based on the initial text feature extraction model includes: Based on the sample text set, sample text pairs are constructed; the sample text pairs include two types of text pairs, namely positive sample text pairs and negative sample text pairs; the similarity between the two texts in the positive sample text pair is greater than a first threshold, and the similarity between the two texts in the negative sample text pair is less than a second threshold; the first threshold is greater than the second threshold. Each sample text in the sample text pair is input into the initial text feature extraction model to extract text features, thereby obtaining the sample text features corresponding to each sample text. Determine the similarity between the features of the two sample texts corresponding to the positive sample text pair to obtain the positive sample text similarity; Determine the similarity between the features of the two sample texts corresponding to the negative sample text pair to obtain the negative sample text similarity; Based on the difference between the similarity of the positive sample text and the similarity of the negative sample text, the initial text feature extraction model is trained to obtain the text feature extraction model.

4. The method according to claim 1, characterized in that, The training method for the image feature extraction model includes: Based on the sample image set, sample image pairs are constructed; the sample image pairs include two types of image pairs, namely positive sample image pairs and negative sample image pairs. The similarity between the two images in the positive sample image pair is greater than a third threshold, and the similarity between the two images in the negative sample image pair is less than a fourth threshold; the third threshold is greater than the fourth threshold. Each sample image in the sample image pair is input into the second preset model for image feature extraction processing to obtain the sample image features corresponding to each sample image. Determine the similarity between the features of the two corresponding positive sample images to obtain the positive sample image similarity; The similarity between the features of the two corresponding negative sample images is determined to obtain the negative sample image similarity. Based on the difference between the similarity of the positive sample images and the similarity of the negative sample images, the second preset model is trained to obtain the image feature extraction model.

5. The method according to claim 1, characterized in that, after obtaining the abnormal cluster set, the method further includes: obtaining the newly added fusion feature corresponding to a newly added abnormal multimedia resource; determining the central feature corresponding to each abnormal cluster based on the abnormal fusion features corresponding to each abnormal cluster in the abnormal cluster set; determining, based on the similarity between the newly added fusion feature and the central feature corresponding to each abnormal cluster, the abnormal cluster that matches the newly added fusion feature to obtain a matching cluster; and adding the newly added fusion feature to the feature set corresponding to the matching cluster.
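
The incremental update in claim 5 might look like the sketch below. The claim does not say how the central feature is computed or how a match is chosen; here the center is assumed to be the element-wise mean of a cluster's features, and the match is assumed to be the cluster with the highest cosine similarity. All names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero feature vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def central_feature(features):
    """Element-wise mean of a cluster's fusion features (assumed definition)."""
    count = len(features)
    return [sum(column) / count for column in zip(*features)]


def add_to_matching_cluster(clusters, new_feature):
    """Append new_feature to the most similar cluster and return its id.

    clusters maps a cluster id to the list of fusion features in that cluster
    and is modified in place.
    """
    matching_id = max(
        clusters,
        key=lambda cid: cosine_similarity(new_feature, central_feature(clusters[cid])),
    )
    clusters[matching_id].append(new_feature)
    return matching_id
```

Comparing against one center per cluster, rather than against every stored feature, keeps the cost of adding a new abnormal resource proportional to the number of clusters.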

6. The method according to claim 1, characterized in that the step of determining the identification result of the multimedia resource to be identified based on the matching results between the target fusion features and each abnormal cluster in the abnormal cluster set includes: when an abnormal cluster that matches the target fusion features exists in the abnormal cluster set, determining that the multimedia resource to be identified is abnormal; and the method further includes: determining the target handling strategy corresponding to the matched abnormal cluster; and handling the multimedia resource to be identified based on the target handling strategy.
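
A hedged sketch of the matching and handling flow in claim 6 follows. It assumes that "matching" means the cosine similarity between the target fusion feature and a cluster's central feature reaches a threshold; the threshold value, the strategy names, and the `"allow"` fallback are illustrative assumptions, not part of the claims.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero feature vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def identify(target_feature, cluster_centers, match_threshold=0.8):
    """Return (is_abnormal, matched_cluster_id) for one resource.

    cluster_centers maps a cluster id to its central feature vector.
    """
    best_id, best_sim = None, -1.0
    for cluster_id, center in cluster_centers.items():
        sim = cosine_similarity(target_feature, center)
        if sim > best_sim:
            best_id, best_sim = cluster_id, sim
    if best_sim >= match_threshold:
        return True, best_id
    return False, None


def handle(target_feature, cluster_centers, strategies, match_threshold=0.8):
    """Look up the handling strategy of the matched cluster, if any."""
    is_abnormal, cluster_id = identify(target_feature, cluster_centers, match_threshold)
    if not is_abnormal:
        return "allow"  # hypothetical default for resources with no match
    return strategies[cluster_id]
```

For example, with centers `{"spam": [1.0, 0.0], "fraud": [0.0, 1.0]}`, a feature close to the first axis matches the `"spam"` cluster and receives that cluster's strategy, while a feature halfway between the two centers matches neither at a threshold of 0.8.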

7. The method according to claim 6, characterized in that, before determining the target handling strategy corresponding to the matched abnormal cluster, the method further includes: determining the sorting parameter of each abnormal cluster based on the number of information items of the abnormal multimedia resources corresponding to that abnormal cluster and the degree of attention paid to those abnormal multimedia resources; sorting the abnormal clusters in the abnormal cluster set based on their sorting parameters to obtain a sorting result; determining, based on the sorting result, the anomaly level parameter corresponding to each abnormal cluster in the abnormal cluster set; and constructing a first anomaly handling strategy based on the anomaly level parameter corresponding to each abnormal cluster, the first anomaly handling strategy including the correspondence between anomaly level parameters and first handling strategies; and determining the target handling strategy corresponding to the matched abnormal cluster includes: determining, based on the first anomaly handling strategy, the target handling strategy corresponding to the matched abnormal cluster.
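
One way to read claim 7 is sketched below. The claim does not specify how the item count and the degree of attention are combined, nor how ranks map to levels; this sketch assumes a weighted sum for the sorting parameter and an even split of the ranking into as many levels as there are handling actions. The weights, the action names, and the function names are hypothetical.

```python
def assign_anomaly_levels(cluster_stats, n_levels, count_weight=1.0, attention_weight=1.0):
    """Rank clusters and map each one to an anomaly level (0 = most severe).

    cluster_stats maps a cluster id to (item_count, attention), where
    attention is any numeric measure of user attention, such as views.
    """
    def sorting_parameter(cluster_id):
        item_count, attention = cluster_stats[cluster_id]
        return count_weight * item_count + attention_weight * attention

    ranked = sorted(cluster_stats, key=sorting_parameter, reverse=True)
    # Spread the ranked clusters evenly over the available levels.
    return {cid: rank * n_levels // len(ranked) for rank, cid in enumerate(ranked)}


def target_strategy(cluster_id, levels, level_actions):
    """Look up the handling action for a matched cluster via its level."""
    return level_actions[levels[cluster_id]]
```

Going through a level, rather than mapping clusters to actions directly, means that the correspondence between levels and actions can stay fixed while clusters move between levels as their counts and attention change.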

8. The method according to claim 6, characterized in that, before determining the target handling strategy corresponding to the matched abnormal cluster, the method further includes: obtaining the abnormal multimedia resources corresponding to each abnormal cluster in the abnormal cluster set; determining the scene category corresponding to each abnormal cluster based on the preset text in the abnormal multimedia resources corresponding to that abnormal cluster; and constructing a second anomaly handling strategy based on the scene category corresponding to each abnormal cluster in the abnormal cluster set, the second anomaly handling strategy including the correspondence between scene categories and second handling strategies; and determining the target handling strategy corresponding to the matched abnormal cluster includes: determining, based on the second anomaly handling strategy, the target handling strategy corresponding to the matched abnormal cluster.
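
The scene-based strategy in claim 8 could be sketched as follows. The claim only says the scene category is determined from preset text in a cluster's resources; the keyword lists, the majority vote, the `"other"` fallback, and the strategy names used here are assumptions made for illustration.

```python
from collections import Counter


def scene_category_for_cluster(resource_texts, scene_keywords, default="other"):
    """Pick the scene whose preset keywords occur in the most resource texts.

    scene_keywords maps a scene category to a list of preset keywords.
    Each text contributes at most one vote per scene.
    """
    votes = Counter()
    for text in resource_texts:
        for scene, keywords in scene_keywords.items():
            if any(keyword in text for keyword in keywords):
                votes[scene] += 1
    return votes.most_common(1)[0][0] if votes else default


def build_scene_strategy(cluster_texts, scene_keywords, scene_actions):
    """Map each cluster id to the handling action of its scene category."""
    return {
        cluster_id: scene_actions[scene_category_for_cluster(texts, scene_keywords)]
        for cluster_id, texts in cluster_texts.items()
    }
```

With this layout the target handling strategy for a matched cluster is a single dictionary lookup on the result of `build_scene_strategy`.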

9. An abnormal multimedia resource identification device, characterized in that it includes: an information acquisition module configured to determine the target text and the target image in the multimedia resource to be identified; a text feature extraction module configured to input the target text into the text feature extraction model for text feature extraction to obtain target text features, wherein the text feature extraction model is obtained by performing data augmentation on sample texts in the sample text set, constructing positive sample text pairs from the sample texts and their corresponding augmented texts, and training the initial text feature extraction model, and the initial text feature extraction model is obtained by training the first preset model using, as positive text feature pairs, the two text features obtained by encoding the same training text twice with different random masks; an image feature extraction module configured to input the target image into the image feature extraction model for image feature extraction to obtain target image features; a feature fusion module configured to fuse the target text features and the target image features to obtain target fusion features; a cluster set acquisition module configured to acquire an abnormal cluster set, wherein the abnormal cluster set is obtained by clustering a plurality of abnormal multimedia resources according to their respective abnormal categories and the preset fusion features corresponding to each of the plurality of abnormal multimedia resources, the abnormal cluster set includes at least one abnormal cluster, the feature set corresponding to each abnormal cluster includes at least one preset fusion feature, and each abnormal cluster corresponds to one abnormal category; and an anomaly identification module configured to determine the identification result of the multimedia resource to be identified based on the matching results between the target fusion features and each abnormal cluster in the abnormal cluster set, the identification result indicating whether the multimedia resource to be identified is abnormal.
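
The feature fusion module in claim 9 is not tied to a particular fusion operation. As one simple possibility, the sketch below L2-normalizes each modality and concatenates the results, so that neither modality dominates purely because of its scale. This choice and the function names are assumptions, not the patented fusion method.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def fuse_features(text_feature, image_feature):
    """Concatenate the normalized text and image features (assumed fusion)."""
    return l2_normalize(text_feature) + l2_normalize(image_feature)
```

The fused vector has as many dimensions as the text and image features combined, and it is the vector that would be compared against the abnormal clusters.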

10. An electronic device, characterized in that it includes: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the abnormal multimedia resource identification method as described in any one of claims 1-8.

11. A computer-readable storage medium, characterized in that, when the instructions in the computer-readable storage medium are executed by the processor of an electronic device, the electronic device is enabled to perform the abnormal multimedia resource identification method as described in any one of claims 1-8.