Content review, training method of content review model, and related devices
By using a multimodal content moderation model that combines image and text features and trains the adapter in stages, the problems of high adjustment difficulty and low scalability of visual content moderation systems are solved, achieving rapid response and efficient content moderation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- GUANGZHOU NETSTAR INFORMATION TECH CO LTD
- Filing Date
- 2022-09-19
- Publication Date
- 2026-06-26
AI Technical Summary
Existing visual content moderation systems are difficult to adjust, have low scalability, require a large amount of image data for retraining, and have long iteration cycles.
A multimodal content moderation model using image encoders and text encoders is adopted. Image features are extracted by the image encoder and text features are extracted by the text encoder. The model is trained to adapt to content moderation under adversarial and classification modes. The image adapter and text adapter are trained in stages to improve generalization and adaptability.
It reduces the required sample size, enables rapid response to changes in review rules and the addition of new violation categories, and improves the scalability and responsiveness of content review.
Figure CN115565038B_ABST
Description
Technical Field
[0001] This application relates to the technical field of content moderation, and in particular to a content moderation method, a training method for a content moderation model, and related apparatus.
Background Technology
[0002] With the development of internet technology, visual content is used in a wide range of internet products, such as live streaming and short videos. Reviewing whether visual content complies with regulations is part of operating internet products and is conducive to building a good product ecosystem. Visual content review systems can reduce the manpower cost of content review and reduce the frequency of exposure of illegal content.
[0003] Currently, visual content moderation is mainly based on image-based unimodal neural networks. For specific categories, a large amount of violation image data (positive samples) and non-violation image data (negative samples) is collected to train the neural network for classification.
[0004] However, image-based unimodal neural networks are difficult to adjust. If the review rules change, the training image data needs to be cleaned again, and the iteration cycle is long. Furthermore, image-based unimodal neural networks have low scalability. Adding new categories requires collecting a large amount of non-compliant image data and retraining the neural network, which is costly.
Summary of the Invention
[0005] This application provides a content moderation method, a training method for a content moderation model, and related devices to address the problems of high difficulty in adjusting visual content moderation and low scalability.
[0006] According to one aspect of this application, a content moderation method is provided, comprising:
[0007] Load a preset content moderation model, which includes an image encoder, an image adapter, a text encoder, and a text adapter;
[0008] The image data to be reviewed is input into the image encoder to extract the first image feature;
[0009] The first image feature is input into the image adapter and mapped to the target space to obtain the second image feature;
[0010] The text information representing the category in content review is input into the text encoder to extract the first text feature;
[0011] The first text feature is input into the text adapter and mapped to the target space to obtain the second text feature;
[0012] The second image feature is compared with the second text feature to generate an audit result for the image data.
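The steps in [0007] to [0012] can be illustrated with a minimal sketch. It assumes the first image feature and the first text features have already been extracted by the encoders, models each adapter as a single linear map, and uses cosine similarity as the comparison. The source only says the features are "compared", so the cosine metric, the function names, and the toy numbers are all assumptions made for illustration.

```python
import math


def linear(weights, vec):
    """Apply a weight matrix (a list of rows) to a feature vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def cosine(a, b):
    """Cosine similarity, used here as an assumed comparison metric."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def review(first_image_feature, first_text_features, image_adapter, text_adapter):
    """Map both modalities to the target space and score each category."""
    second_image = linear(image_adapter, first_image_feature)
    scores = {}
    for category, first_text in first_text_features.items():
        second_text = linear(text_adapter, first_text)
        scores[category] = cosine(second_image, second_text)
    best = max(scores, key=scores.get)
    return best, scores


# Toy adapters mapping 2-dimensional first features to a 3-dimensional target space.
image_adapter = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_adapter = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical encoder outputs (stand-ins for the image and text encoders).
first_image_feature = [0.9, 0.1]
first_text_features = {"cigarettes": [1.0, 0.0], "poker": [0.0, 1.0]}

best, scores = review(first_image_feature, first_text_features,
                      image_adapter, text_adapter)
```

In this sketch the review result is simply the category whose second text feature is closest to the second image feature; a deployed system would presumably also apply a decision threshold.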
[0013] According to another aspect of this application, a method for training a content moderation model is provided, comprising:
[0014] A content moderation model is defined, comprising an image encoder, an image adapter, a text encoder, and a text adapter. The image encoder is used to extract a first image feature from image data, and the image adapter is used to map the first image feature to a target space to obtain a second image feature. The text encoder is used to extract a first text feature from text information, and the text adapter is used to map the first text feature to the target space to obtain a second text feature.
[0015] The image encoder is trained to be adapted for content moderation using adversarial and classification methods;
[0016] Once the image encoder has been trained, the image adapter and the text adapter are trained, with the image encoder and the text encoder fixed, to be suitable for content review in a classification manner.
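The staged training described in [0015] and [0016] can be summarized as a small schedule of which components are trainable in each stage. This is a sketch only: the stage names and the dictionary layout are hypothetical, and treating the adapters as untouched during the first stage is an assumption rather than something the text states explicitly.

```python
# Hypothetical schedule: True means the component's parameters are updated.
STAGES = {
    # Stage 1: only the image encoder is trained (adversarial + classification).
    "stage_1_encoder": {
        "image_encoder": True,
        "text_encoder": False,
        "image_adapter": False,  # assumption: adapters are not touched yet
        "text_adapter": False,
    },
    # Stage 2: both encoders are fixed, both adapters are trained.
    "stage_2_adapters": {
        "image_encoder": False,
        "text_encoder": False,
        "image_adapter": True,
        "text_adapter": True,
    },
}


def trainable_components(stage):
    """Return the sorted names of the components updated in the given stage."""
    return sorted(name for name, trainable in STAGES[stage].items() if trainable)


stage_1 = trainable_components("stage_1_encoder")
stage_2 = trainable_components("stage_2_adapters")
```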
[0017] According to another aspect of this application, a content moderation device is provided, comprising:
[0018] A content moderation model loading module is used to load a preset content moderation model, which includes an image encoder, an image adapter, a text encoder, and a text adapter.
[0019] The first image feature extraction module is used to extract the first image features from the image data to be reviewed by inputting it into the image encoder.
[0020] The second image feature mapping module is used to input the first image feature into the image adapter and map it to the target space to obtain the second image feature;
[0021] The first text feature extraction module is used to input text information representing the category in content review into the text encoder to extract the first text feature;
[0022] The second text feature mapping module is used to input the first text feature into the text adapter and map it to the target space to obtain the second text feature;
[0023] The review result generation module is used to compare the second image feature with the second text feature to generate a review result for the image data.
[0024] According to another aspect of this application, a training apparatus for a content moderation model is provided, comprising:
[0025] A content moderation model determination module is used to determine a content moderation model. The content moderation model includes an image encoder, an image adapter, a text encoder, and a text adapter. The image encoder is used to extract a first image feature from image data. The image adapter is used to map the first image feature to a target space to obtain a second image feature. The text encoder is used to extract a first text feature from text information. The text adapter is used to map the first text feature to the target space to obtain a second text feature.
[0026] An encoder training module is used to train the image encoder to be adapted for content moderation in an adversarial and classification manner.
[0027] The adapter training module is used to train the image adapter and the text adapter to be suitable for content review in a classification manner, provided that the image encoder and the text encoder are fixed, after the image encoder has been trained.
[0028] According to another aspect of this application, an electronic device is provided, the electronic device comprising:
[0029] At least one processor; and
[0030] A memory communicatively connected to the at least one processor; wherein,
[0031] The memory stores a computer program that can be executed by the at least one processor, the computer program being executed by the at least one processor to enable the at least one processor to execute the content moderation method or the training method for the content moderation model described in any embodiment of this application.
[0032] According to another aspect of this application, a computer-readable storage medium is provided, the computer-readable storage medium storing a computer program, the computer program being configured to cause a processor to execute and implement the content moderation method or the training method for a content moderation model described in any embodiment of this application.
[0033] According to another aspect of this application, a computer program product is provided, the computer program product including a computer program, which, when executed by a processor, implements the content moderation method or the training method of the content moderation model described in any embodiment of this application.
[0034] In this embodiment, a preset content moderation model is loaded. This model includes an image encoder, an image adapter, a text encoder, and a text adapter. The image data to be moderated is input into the image encoder to extract a first image feature. The first image feature is then input into the image adapter and mapped to a target space to obtain a second image feature. Text information representing a category in content moderation is input into the text encoder to extract a first text feature. The first text feature is then input into the text adapter and mapped to the target space to obtain a second text feature. The second image feature and the second text feature are compared to generate a moderation result for the image data. This embodiment splits the content moderation model into two parts: the first part consists of the image encoder and the text encoder, which have higher generalization requirements, and the second part consists of the image adapter and the text adapter, which have higher adaptability requirements. Achieving the generalization and adaptability of the content moderation model in stages ensures its performance. Furthermore, this embodiment performs multimodal content moderation using image data and text information, which fully leverages the inherent connections between images and language and results in stronger generalization and reasoning capabilities. While the accuracy of content moderation is ensured, the number of samples needed to iteratively update the content moderation model is reduced from hundreds of thousands or millions to tens or hundreds, which makes it easier to collect sufficient samples. When moderation rules change or new violation categories are added, samples can be quickly collected to iteratively update the content moderation model, giving the approach strong scalability and improving the response speed of content moderation.
It should be understood that the content described in this section is not intended to identify key or important features of the embodiments of this application, nor is it intended to limit the scope of this application. Other features of this application will become readily apparent from the following description.
Attached Figure Description
[0035] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0036] Figure 1 is a flowchart of a training method for a content moderation model according to Embodiment 1 of this application;
[0037] Figure 2 is an architecture diagram for training the image encoder according to Embodiment 1 of this application;
[0038] Figure 3 is an architecture diagram for training the image adapter and the text adapter according to Embodiment 1 of this application;
[0039] Figure 4 is a flowchart of a content moderation method provided according to Embodiment 2 of this application;
[0040] Figure 5 is an architecture diagram of a content moderation model provided according to Embodiment 2 of this application;
[0041] Figure 6 is a schematic diagram of the structure of a training device for a content moderation model according to Embodiment 3 of this application;
[0042] Figure 7 is a schematic diagram of a content moderation device according to Embodiment 4 of this application;
[0043] Figure 8 is a schematic diagram of the structure of an electronic device provided in Embodiment 5 of this application.
Detailed Implementation
[0044] To enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present application, and not all embodiments. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort should fall within the scope of protection of the present application.
[0045] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0046] Embodiment 1
[0047] Figure 1 is a flowchart illustrating a training method for a content moderation model provided in Embodiment 1 of this application. This embodiment is applicable to training a multimodal content moderation model for text and images. The method can be executed by a content moderation model training device, which can be implemented in hardware and/or software. This training device can be configured in an electronic device, such as a server, blade server, mainframe computer, or other suitable computer. As shown in Figure 1, the method includes:
[0048] Step 101: Determine the content moderation model.
[0049] In this embodiment, a multimodal content moderation model based on images and text can be built in advance. That is, for visual content, the content moderation model can detect whether the visual content conforms to the specifications.
[0050] In practical applications, the business system has already implemented an automated review process, accumulating a certain amount of business-related violation image data. The business system has human reviewers maintaining the review rules, and each review rule has a certain number (e.g., dozens to hundreds) of violation categories (names), legends, and legend descriptions. This data can be used to train the content review model.
[0051] Furthermore, as shown in Figures 2 and 3, the content moderation model includes an image encoder G, an image adapter H, a text encoder T, and a text adapter S.
[0052] The structures of the image encoder G, image adapter H, text encoder T, and text adapter S are independent of each other. The structures of the image encoder G and the image adapter H are not the same, and the structures of the text encoder T and the text adapter S are not the same.
[0053] The image encoder G provides the ability to generalize image features, while the image adapter H provides the ability to map image features. That is, the image encoder G is used to extract the first image feature from the image data x, and the image adapter H is used to map the first image feature to a unified target space to obtain the second image feature. The first image feature is a low-dimensional feature, mostly a 512 or 1024-dimensional vector, while the second image feature is a high-dimensional feature. Therefore, the structure of the image encoder G is usually larger than that of the image adapter H.
[0054] For example, the image encoder G can be a convolutional neural network (CNN), such as ResNet (residual network), DenseNet (densely connected convolutional network), or a structure based on Transformer (a deep learning model based on self-attention mechanism), such as ViT (Vision Transformer), and the image adapter H can be a multilayer perceptron (MLP).
[0055] The text encoder T provides the ability to generalize text features, and the text adapter S provides the ability to map text features. That is, the text encoder T is used to extract the first text feature from the text information, and the text adapter S is used to map the first text feature to a unified target space to obtain the second text feature. The first text feature is a low-dimensional feature, and the second text feature is a high-dimensional feature. Therefore, the structure of the text encoder T is usually larger than that of the text adapter S.
[0056] For example, the text encoder T can be a pre-trained model of the text class, such as BERT based on Transformer, and the text adapter S can be a multilayer perceptron.
[0057] Of course, the structure of the content moderation model, including the image encoder G, image adapter H, text encoder T, and text adapter S, is not limited to manually designed neural networks. It can also be a neural network optimized by model quantization methods, a neural network searched for the characteristics of content moderation by NAS (Neural Architecture Search) methods, and so on. This embodiment does not impose any restrictions on this.
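As an illustration of the lightweight adapters described above, the following sketch builds the image adapter H and the text adapter S as two-layer multilayer perceptrons that map a first feature into a shared target space. The layer sizes, the random initialization, the ReLU activation, and the class name `MLPAdapter` are assumptions chosen for the example; the source only states that the adapters can be multilayer perceptrons.

```python
import random


def relu(vec):
    """Element-wise ReLU activation."""
    return [max(0.0, x) for x in vec]


def dense(vec, weights, bias):
    """Fully connected layer: weights is a list of rows, one per output unit."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def make_layer(n_in, n_out, rng):
    """Create small random weights and zero biases for one layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


class MLPAdapter:
    """Two-layer perceptron mapping a first feature to the target space."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.layer1 = make_layer(n_in, n_hidden, rng)
        self.layer2 = make_layer(n_hidden, n_out, rng)

    def __call__(self, feature):
        hidden = relu(dense(feature, *self.layer1))
        return dense(hidden, *self.layer2)


# Both adapters map into the same 16-dimensional target space (toy sizes).
image_adapter_h = MLPAdapter(n_in=4, n_hidden=8, n_out=16, seed=0)
text_adapter_s = MLPAdapter(n_in=4, n_hidden=8, n_out=16, seed=1)

second_image_feature = image_adapter_h([0.1, 0.2, 0.3, 0.4])
second_text_feature = text_adapter_s([0.4, 0.3, 0.2, 0.1])
```

Because both adapters end in the same output dimension, their outputs can be compared directly, which is what the unified target space is for.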
[0058] Step 102: Train the image encoder to be adapted for content moderation using adversarial and classification methods.
[0059] In this embodiment, the training is divided into two stages. Considering that the structure of the image encoder G is relatively large, in order to improve the generalization ability of the image encoder, in the first stage of training, both adversarial and classification methods can be used to train the image encoder G separately, so that the image encoder G can be adapted to visual content review.
[0060] Here, adversarial training means that the image encoder G plays a game against a third-party network structure to improve its ability to extract image features, and classification training means that the image encoder G is applied to the classification task of visual content moderation to improve its ability to extract image features that can be classified.
[0061] Since there is relatively little business-related violation image data in the open-source dataset, this embodiment can train the image encoder G using the business-related violation image data accumulated by the business system.
[0062] Furthermore, in content moderation, the semantics of text information related to violations are relatively narrow. If the text encoder T is trained and adapted for content moderation, its generalization ability may be reduced. Therefore, in this embodiment, a pre-trained model of the text class can be directly applied as the text encoder T, and the text encoder T is not trained.
[0063] In one embodiment of this application, step 102 may include the following steps:
[0064] Step 1021: Determine the classification branch and the adversarial branch.
[0065] In this embodiment, a third-party network structure can be used to assist in training the image encoder G. These network structures and the image encoder G respectively form a classification branch and an adversarial branch. That is, the adversarial branch includes the image encoder G and the third-party network structure, and the classification branch includes the image encoder G and the third-party network structure.
[0066] The adversarial branch trains the image encoder G in an adversarial manner, so that the first image feature is discriminated as suitable for content moderation.
[0067] Furthermore, the classification branch trains the image encoder G in a classification manner, so that the first image features are applicable to classifying the various categories in content moderation.
[0068] Step 1022: In each iteration of training, update the classification branch and the adversarial branch in sequence to train the image encoder to be suitable for content review.
[0069] Generally, the classification branch and the adversarial branch are trained over multiple rounds of iteration, and within each iteration the two branches can be executed synchronously or asynchronously. The classification branch and the adversarial branch are updated sequentially, that is, the classification branch is updated first, and then the adversarial branch is updated. Because the image encoder G is updated only once per iteration, the image encoder G is updated in the classification branch, and then the third-party network structure in the adversarial branch is updated, which improves the training effect and trains the image encoder to be suitable for content review.
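The update order within one iteration can be sketched as a plain loop. The two callbacks below are hypothetical placeholders for the real branch updates; the sketch only records the order in which they are called.

```python
def train_image_encoder(num_iterations, update_classification_branch,
                        update_adversarial_branch):
    """Run the two branch updates in the order described for each iteration."""
    for step in range(num_iterations):
        # First: the classification branch updates the image encoder G
        # (and the auxiliary network that belongs to this branch).
        update_classification_branch(step)
        # Then: the adversarial branch updates only its third-party network,
        # leaving G as it was after the classification update.
        update_adversarial_branch(step)


# Placeholder callbacks that only log which update ran.
log = []
train_image_encoder(
    num_iterations=2,
    update_classification_branch=lambda step: log.append((step, "classification")),
    update_adversarial_branch=lambda step: log.append((step, "adversarial")),
)
```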
[0070] In one embodiment of this application, as shown in Figure 2, the adversarial branch has an image encoder G, a general encoder P that is not adapted to any specific operation, and a discriminator D, while the classification branch has an image encoder G and a feedforward network F.
[0071] Among them, the general encoder P, discriminator D, and feedforward network F are all third-party network structures.
[0072] The general encoder P is an encoder that encodes image data. It can be trained using a large amount of general image data, so it is not adapted to any specific operation. A specific operation here refers to an operation that is specifically adapted to a certain domain, such as content moderation, face recognition, human pose recognition, or autonomous driving.
[0073] Discriminator D provides the ability to determine whether image data is suitable for content review, and its output dimension is 2-dimensional.
[0074] For example, the discriminator D can be a multilayer perceptron.
[0075] The feedforward network F provides the ability to map image features. That is, the feedforward network F is used to map the first image features to the classification space to obtain the fourth image features. The first image features are low-dimensional features, and the fourth image features are high-dimensional features. Therefore, the structure of the image encoder G is usually larger than that of the feedforward network F.
[0076] For example, the feedforward network F can be a multilayer perceptron.
[0077] In this embodiment, step 1022 may further include the following steps:
[0078] Step 10221: Input the image data used as samples into the image encoder to extract the first image features, and input them into the general encoder to extract the third image features.
[0079] In this embodiment, image data can be collected from publicly available datasets, or business-related violation image data can be accumulated through human or machine review (e.g., review based on a single image modality) in business systems, and this image data is used as samples to train the image encoder G.
[0080] According to the review rules, the samples can be divided into normal image data and non-compliant image data.
[0081] During each round of training iterations, as shown in Figure 2, in both the adversarial branch and the classification branch, the image data x, which is used as a sample, is input into the image encoder G. The image encoder G processes the image data x according to its structure, outputs the first image feature extracted from the image data x, and performs forward propagation on the first image feature.
[0082] In the adversarial branch, the image data x, which is used as a sample, is input into the general encoder P. The general encoder P processes the image data x according to its structure and outputs the third image feature extracted from the image data x.
[0083] Step 10222: In the discriminator, use the first image feature to generate a first adversarial label for whether the image data is suitable for content review, and use the third image feature to generate a second adversarial label for whether the image data is suitable for content review.
[0084] During each round of training iterations, as shown in Figure 2, in the adversarial branch, on one hand, the first image features generated by the image encoder G are input into the discriminator D. The discriminator D processes the first image features according to its structure and outputs the first adversarial label. On the other hand, the third image features generated by the general encoder P are input into the discriminator D. The discriminator D processes the third image features according to its structure and outputs the second adversarial label.
[0085] The adversarial labels (i.e., the first adversarial label and the second adversarial label) indicate whether image data x is suitable for content review. An adversarial label of 1 means that image data x is suitable for content review, that is, the content of the image data violates the rules. An adversarial label of 0 means that image data x is not suitable for content review, that is, the content of the image data is normal and does not violate the rules.
[0086] The general encoder P has strong generalization ability and strong ability to extract image features. The general encoder P and the image encoder G compete against each other at the discriminator D, thereby helping to improve the image encoder G's ability to extract image features, so that the distribution of G(x) converges to P(x), where x represents image data, G(x) represents the first image feature generated by the image encoder G, and P(x) represents the third image feature generated by the general encoder P.
[0087] Step 10223: Map the first image features to the classification space in the feedforward network to obtain the fourth image features.
[0088] During each round of training iterations, as shown in Figure 2, in the classification branch, the first image features generated by the image encoder G are input into the feedforward network F. The feedforward network F processes the first image features according to its structure, maps the first image features to the classification space, and obtains the fourth image features.
[0089] Step 10224: Map the fourth image feature to the probability that the image data belongs to each of the corresponding content review categories.
[0090] In this embodiment, multiple categories of content to be reviewed can be set in advance according to the review rules, such as hookah, cigarettes, dice, poker, etc.
[0091] During each round of training iterations, as shown in Figure 2, in the classification branch, the fourth image feature can be input into an activation function such as softmax or sigmoid, thereby mapping the fourth image feature to the probability y' of image data x belonging to each category of the content review.
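The mapping from the fourth image feature to per-category probabilities can be sketched as follows. The example assumes that the fourth image feature has one component per category, and the feature values are made up; the category names are the ones listed in [0090].

```python
import math


def softmax(logits):
    """Numerically stable softmax: probabilities over mutually exclusive categories."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Logistic function: an independent probability for a single category."""
    return 1.0 / (1.0 + math.exp(-z))


categories = ["hookah", "cigarettes", "dice", "poker"]
fourth_image_feature = [2.0, 0.5, -1.0, 0.0]  # hypothetical output of F(G(x))

# Softmax yields one distribution over all categories.
probabilities = softmax(fourth_image_feature)
# Sigmoid yields a separate score per category (multi-label style).
independent_scores = [sigmoid(z) for z in fourth_image_feature]

predicted = categories[probabilities.index(max(probabilities))]
```

Softmax fits the case where an image belongs to exactly one category, while sigmoid allows several categories to apply to the same image.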
[0092] Step 10225: Update the parameters of the image encoder and the parameters of the feedforward network according to the first adversarial label and the probability.
[0093] In this embodiment, the first adversarial label and probability corresponding to the image encoder G can be considered comprehensively. The loss of the image encoder G when extracting the first image features can be calculated, and the parameters of the image encoder G and the feedforward network F can be updated based on the loss.
[0094] In one way of calculating the loss, as shown in Figure 2, for a given category, a classification label y can be generated based on the probability y' of that category. The classification label y indicates whether the image data x belongs to that category. When the classification label y is 1, it means that the image data x belongs to that category. When the classification label y is 0, it means that the image data x does not belong to that category.
[0095] The first candidate value is obtained by squaring the difference between 1 and the first adversarial label corresponding to the image encoder G. The second candidate value is obtained by summing the products of each classification label y and the logarithm of the corresponding probability y'. The first candidate value and the negative of the second candidate value are linearly fused to form the first loss value $L_G$, which characterizes the loss when the image encoder G extracts the first image features, and the parameters of the image encoder G and the feedforward network F are updated according to the first loss value $L_G$.
[0096] For example, the first loss value $L_G$ is expressed as follows:
[0097]
[0098] $y' = \mathrm{softmax}(F(G(x)))$
[0099] Where $c$ is the number of categories, $i$ indexes the $c$ categories, $y$ is the classification label, $y'$ is the probability, $y_i'$ is the i-th component of $y'$, $y_i$ is the i-th component of $y$, G is the image encoder, F is the feedforward network, D is the discriminator, x is the image data, G(x) is the first image feature, F(G(x)) is the fourth image feature, D(G(x)) is the first adversarial label corresponding to the image encoder G, and λ is a hyperparameter.
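The formula referenced at [0097] is not present in this text. One reading that is consistent with the symbol definitions above can be written as follows, under two assumptions that the source does not confirm: the summed term is a cross-entropy between the classification label and the probability, and the hyperparameter λ weights the adversarial term.

```latex
L_G = -\sum_{i=1}^{c} y_i \log y_i' + \lambda \bigl(1 - D(G(x))\bigr)^2,
\qquad y' = \mathrm{softmax}\bigl(F(G(x))\bigr)
```

Under this reading, the first term pushes the image encoder G toward features that classify correctly, and the second term pushes D(G(x)) toward 1.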
[0100] In this method, backpropagation is performed on the classification branch: the first loss value $L_G$ is substituted into an optimization algorithm such as SGD (stochastic gradient descent) or Adam (adaptive moment estimation), gradients are calculated for the parameters of the image encoder G and the feedforward network F, respectively, and the parameters of the image encoder G and the feedforward network F are updated according to these gradients.
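A numeric sketch of the first loss value is given below. It assumes a cross-entropy classification term and a hyperparameter `lam` that weights the squared adversarial term; both are interpretations of the description rather than a confirmed formula, and the function and argument names are hypothetical.

```python
import math


def first_loss(y, y_prob, d_of_g, lam=1.0, eps=1e-12):
    """Assumed form: cross-entropy plus lam * (1 - D(G(x)))**2.

    y       -- classification label per category (0 or 1)
    y_prob  -- probability y' per category
    d_of_g  -- first adversarial label D(G(x)), treated as a scalar here
    lam     -- hyperparameter weighting the adversarial term (assumed placement)
    """
    # eps guards against log(0) when a probability is exactly zero.
    cross_entropy = -sum(yi * math.log(max(pi, eps)) for yi, pi in zip(y, y_prob))
    adversarial = (1.0 - d_of_g) ** 2
    return cross_entropy + lam * adversarial


# The sample belongs to the first of three categories; D(G(x)) is 0.5.
loss_value = first_loss(y=[1, 0, 0], y_prob=[0.5, 0.25, 0.25], d_of_g=0.5, lam=2.0)
```

In this form the loss falls both when the correct category receives a higher probability and when the discriminator output for G(x) moves toward 1.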
[0101] Step 10226: While keeping the parameters of the image encoder unchanged, update the parameters of the discriminator according to the first adversarial label and the second adversarial label.
[0102] In each round of training iterations, as shown in Figure 2, once the classification branch has been updated, the adversarial branch can be updated. In the adversarial branch, the parameters of the image encoder G are kept unchanged in this round of training iterations. Under this condition, the first adversarial label corresponding to the image encoder G and the second adversarial label corresponding to the general encoder P can be considered together. The loss of the discriminator D in judging whether the image data x is suitable for content review is calculated, and the parameters of the discriminator D are updated according to the loss.
[0103] In one way of calculating the loss, as shown in Figure 2, the third candidate value is obtained by squaring the first adversarial label corresponding to the image encoder G; the fourth candidate value is obtained by squaring the difference between 1 and the second adversarial label corresponding to the general encoder P; and the third and fourth candidate values are linearly fused to obtain the second loss value $L_D$, which characterizes the loss of the discriminator D in determining whether image data x is suitable for content review. With the parameters of the image encoder G kept unchanged, the parameters of the discriminator D are updated according to the second loss value $L_D$.
[0104] For example, the second loss value $L_D$ is expressed as follows:
[0105] $L_D = (1 - D(P(x)))^2 + D(G(x))^2$
[0106] Where G is the image encoder, P is the general encoder, D is the discriminator, x is the image data, G(x) is the first image feature, P(x) is the third image feature, D(G(x)) is the first adversarial label corresponding to the image encoder G, and D(P(x)) is the second adversarial label corresponding to the general encoder P.
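The second loss value can be computed directly from the two adversarial labels. The sketch below treats each discriminator output as a single scalar, which is a simplifying assumption, and the function name is hypothetical.

```python
def second_loss(d_of_p, d_of_g):
    """L_D = (1 - D(P(x)))**2 + D(G(x))**2.

    d_of_p -- second adversarial label D(P(x)) for the general encoder's feature
    d_of_g -- first adversarial label D(G(x)) for the image encoder's feature
    """
    return (1.0 - d_of_p) ** 2 + d_of_g ** 2


# The loss is zero when D outputs 1 for P(x) and 0 for G(x).
ideal = second_loss(d_of_p=1.0, d_of_g=0.0)
# An undecided discriminator (0.5 for both inputs) pays a penalty on each term.
undecided = second_loss(d_of_p=0.5, d_of_g=0.5)
```

This squared-error form resembles a least-squares adversarial objective: the discriminator D is pushed to output 1 for P(x) and 0 for G(x).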
[0107] In this method, backpropagation is performed on the adversarial branch: the second loss value $L_D$ is substituted into an optimization algorithm such as SGD or Adam, the gradient of the parameters of the discriminator D is calculated, and the parameters of the discriminator D are updated according to the gradient. Backpropagation stops before the image encoder G and does not update its parameters, so that in the adversarial branch the image encoder G keeps the parameters it learned during this iteration of training.
[0108] Furthermore, in some architectures, it is possible to combine the calculation of probabilities and the second loss value. In some architectures, probabilities are calculated in the feedforward network F, and the second loss value is calculated independently of the feedforward network F. This embodiment does not impose any restrictions on this.
[0109] Step 10227: Determine whether the preset first training condition is met; if yes, proceed to step 10228; if no, return to steps 10221-10227.
[0110] Step 10228: Determine that the image encoder has completed training.
[0111] In this embodiment, a first training condition can be set in advance as the condition for stopping the training of the image encoder G. For example: the number of iterations reaches a certain threshold; the change amplitude of the first loss value and the change amplitude of the second loss value are each less than a certain threshold; the first loss value and the second loss value are each less than a certain threshold; and so on. In each round of iterative training, it is determined whether the data recorded during the current round meets the first training condition.
[0112] If the first training condition is met, the image encoder G can be considered to have completed training. At this point, the parameters in the image encoder G are output and persisted to configuration files such as config.
[0113] If the first training condition is not met, the next round of iterative training can be entered, and steps 10221-10227 can be executed again. This iterative training continues until the image encoder G completes training.
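The example stopping rules for the first training condition can be sketched as follows. The function name, the argument layout, and all default thresholds are hypothetical; this application only states that such thresholds exist.

```python
def first_condition_met(iteration, loss1_history, loss2_history,
                        max_iterations=10000, delta_eps=1e-4, loss_eps=1e-3):
    """Return True if any of the example stopping rules holds.

    loss1_history / loss2_history: recorded first and second loss values,
    oldest first. All thresholds are illustrative assumptions.
    """
    # Rule 1: the number of iterations reaches a threshold.
    if iteration >= max_iterations:
        return True
    # Rule 2: both loss values changed by less than a threshold.
    if len(loss1_history) >= 2 and len(loss2_history) >= 2:
        change1 = abs(loss1_history[-1] - loss1_history[-2])
        change2 = abs(loss2_history[-1] - loss2_history[-2])
        if change1 < delta_eps and change2 < delta_eps:
            return True
    # Rule 3: both loss values are below a threshold.
    if loss1_history and loss2_history:
        if loss1_history[-1] < loss_eps and loss2_history[-1] < loss_eps:
            return True
    return False
```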
[0114] Step 103: If the image encoder training is completed, under the condition of fixing the image encoder and text encoder, train the image adapter and text adapter to be suitable for content review in a classification manner.
[0115] In this embodiment, considering that the image adapter H and text adapter S are mostly lightweight structures, in order to improve the ability of the image adapter H and text adapter S to adapt to content review, in the second stage of training, the image adapter H and text adapter S can be trained together by classification, so that the image adapter H and text adapter S can adapt to visual content review.
[0116] The classification method involves applying the image encoder G, image adapter H, text encoder T, and text adapter S to the content moderation of visual categories.
[0117] Furthermore, the image encoder G obtained from the first stage of training already has a certain classification ability for image data for content review, but it may still lack the ability to adjust to changes in review rules or newly added categories of violations.
[0118] Considering that language is a symbol of human thought and has stronger generalization and reasoning capabilities than an image-only single modality, this embodiment performs multimodal content review using image data and text information, which can make full use of the inherent connections between language and text.
[0119] Image-based single-modal neural networks require the collection of hundreds of thousands or millions of samples (image data) for iterative updates. In this embodiment, the content moderation model requires only dozens or hundreds of samples (image data and its text information) for iterative updates to adapt to changes in moderation rules and newly added categories of violations.
[0120] In one embodiment of this application, step 103 may include the following steps:
[0121] Step 1031: Input the image data used as samples into the image encoder to extract the first image features.
[0122] In this embodiment, image data can be collected through publicly available datasets, or through human or machine review (such as image-based single-modal neural networks) accumulating violation image data related to the business, and used as samples for training image adapter H and text adapter S.
[0123] According to the review rules, the samples can be divided into normal image data and non-compliant image data.
[0124] During each round of training iterations, as shown in Figure 3, the image data x used as a sample is input into the image encoder G. The image encoder G processes the image data x according to its structure and outputs the first image feature extracted from the image data x.
[0125] Step 1032: Input the first image features into the image adapter and map them to the target space to obtain the second image features.
[0126] During each round of training iterations, as shown in Figure 3, the first image feature is input into the image adapter H. The image adapter H processes the first image feature according to its structure, mapping the first image feature to the target space to obtain the second image feature.
[0127] Step 1033: Input the text information used as a sample into the text encoder to extract the first text feature.
[0128] In this embodiment, as shown in Figure 3, text information t can be collected from public datasets, or by accumulating business-related data (category (name), legend, legend description) in business systems through human review or machine review (such as image-based single-modal neural networks), and used as samples to train the image adapter H and the text adapter S.
[0129] At this point, the image data x (such as a legend) and the text information t (such as category and legend description) are paired data.
[0130] If the image data x (such as a legend) is determined to be in violation, then according to the review rules, the text information t related to the image data x (such as category and legend description) can be classified as positive samples $t_p$, and text information unrelated to the image data x can be classified as negative samples $t_n$.
[0131] Furthermore, as shown in Figure 3, to improve the efficiency of collecting negative samples, negative sampling is performed for the current image data x. That is, text information t is randomly drawn from the text information t related to other image data x and used as text information t unrelated to the current image data x. Other image data x are image data x other than the current image data x.
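The negative sampling described above can be sketched in Python as follows. The dictionary layout `texts_by_image`, the function name, and the extra filter that drops texts also related to the current image are assumptions made for illustration.

```python
import random


def sample_negatives(texts_by_image, current_id, k, rng=None):
    """Draw up to k negative texts for one image.

    texts_by_image: hypothetical mapping image id -> list of related texts.
    Negatives are drawn at random from texts related to *other* images.
    """
    rng = rng or random.Random()
    positives = set(texts_by_image[current_id])
    # Assumption: a text that is also related to the current image is
    # excluded, so that a positive is never reused as a negative.
    pool = [text
            for image_id, texts in texts_by_image.items()
            if image_id != current_id
            for text in texts
            if text not in positives]
    return rng.sample(pool, min(k, len(pool)))
```

For example, with `{"a": ["hookah"], "b": ["dice", "poker"]}` and current image `"a"`, the negatives are drawn from `"dice"` and `"poker"`.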
[0132] During each round of training iterations, as shown in Figure 3, the text information t used as a sample (including positive samples $t_p$ and negative samples $t_n$) is input into the text encoder T. The text encoder T processes the text information t according to its structure and outputs the first text feature extracted from the text information t.
[0133] Step 1034: Input the first text feature into the text adapter and map it to the target space to obtain the second text feature.
[0134] During each round of training iterations, as shown in Figure 3, the first text feature is input into the text adapter S. The text adapter S processes the first text feature according to its structure, maps the first text feature to the target space, and obtains the second text feature.
[0135] Step 1035: While keeping the parameters of the image encoder and the text encoder unchanged, compare the second image feature with the second text feature to update the parameters of the image adapter and the text adapter.
[0136] In each round of training iteration, the parameters of the image encoder G and the text encoder T are kept unchanged. Under this condition, the second image features and the second text features are compared in the same vector space. Based on the difference between them, the loss of the image adapter H and the text adapter S in adapting to content review when mapping features is calculated, and the parameters of the image adapter H and the text adapter S are updated based on this loss.
[0137] In one way of calculating the loss, as shown in Figure 3, the text information used as samples includes positive samples $t_p$ related to the image data and negative samples $t_n$ unrelated to the image data.
[0138] On the one hand, for each image data x, the first similarity between the second image feature and the second text feature corresponding to the positive sample $t_p$ is calculated; the difference obtained by subtracting the first similarity from 1 is squared, and this squared value is used as the fifth candidate value.
[0139] For example, for each image data, the product of the second image feature and the second text feature corresponding to the positive sample $t_p$ is calculated as the first dot product; the product of the length of the second image feature and the length of the second text feature corresponding to the positive sample $t_p$ is calculated as the first modulus-length product; the ratio between the first dot product and the first modulus-length product is calculated as the first similarity. That is, the first similarity is the cosine of the angle between the second image feature and the second text feature corresponding to the positive sample; its value range is [-1, 1], where -1 is completely dissimilar and 1 is completely similar.
[0140] On the other hand, for each image data, the second similarity between the second image feature and the second text feature corresponding to the negative sample $t_n$ is calculated; the second similarity is squared, and this squared value is used as the sixth candidate value.
[0141] For example, for each image data, the product of the second image feature and the second text feature corresponding to the negative sample $t_n$ is calculated as the second dot product; the product of the length of the second image feature and the length of the second text feature corresponding to the negative sample $t_n$ is calculated as the second modulus-length product; the ratio between the second dot product and the second modulus-length product is calculated as the second similarity. That is, the second similarity is the cosine of the angle between the second image feature and the second text feature corresponding to the negative sample; its value range is [-1, 1], where -1 is completely dissimilar and 1 is completely similar.
[0142] For all image data, the sum of all fifth candidate values and the sum of all sixth candidate values are linearly fused into a third loss value L to characterize the loss of content moderation when the image adapter H and the text adapter S map features. Thus, while keeping the parameters of the image encoder G and the text encoder T unchanged, the parameters of the image adapter H and the text adapter S are updated according to the third loss value L.
[0143] For example, the third loss value L is represented as follows:
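[0144] $L = \sum_{i=1}^{n} \left(1 - \mathrm{sim}\left(H(G(x)), S(T(tp_i))\right)\right)^2 + \sum_{i=1}^{n} \mathrm{sim}\left(H(G(x)), S(T(tn_i))\right)^2$
[0145] $\mathrm{sim}\left(H(G(x)), S(T(tp_i))\right) = \dfrac{H(G(x)) \cdot S(T(tp_i))}{|H(G(x))| \, |S(T(tp_i))|}$
[0146] $\mathrm{sim}\left(H(G(x)), S(T(tn_i))\right) = \dfrac{H(G(x)) \cdot S(T(tn_i))}{|H(G(x))| \, |S(T(tn_i))|}$
[0147] Where G is the image encoder, H is the image adapter, T is the text encoder, S is the text adapter, x is the image data, G(x) is the first image feature, H(G(x)) is the second image feature, n is the number of pieces of text information corresponding to each frame of image data, $i \in \{1, \dots, n\}$, $tp_i$ is a positive sample, $tn_i$ is a negative sample, $T(tp_i)$ is the first text feature corresponding to the positive sample, $S(T(tp_i))$ is the second text feature corresponding to the positive sample, $T(tn_i)$ is the first text feature corresponding to the negative sample, $S(T(tn_i))$ is the second text feature corresponding to the negative sample, sim represents the similarity, and $|\cdot|$ represents the length.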
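The third loss value described in paragraphs [0138] to [0142] can be sketched in Python for a single image. The sketch assumes the second image feature and the second text features are already computed and given as non-zero lists of floats, and it assumes unit weights in the linear fusion; the function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Ratio of the dot product to the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    length_product = (math.sqrt(sum(x * x for x in a))
                      * math.sqrt(sum(y * y for y in b)))
    return dot / length_product  # assumes neither vector is all zeros


def third_loss(image_feature, positive_features, negative_features):
    """Sum of fifth candidate values plus sum of sixth candidate values.

    image_feature: second image feature H(G(x)).
    positive_features / negative_features: second text features S(T(t))
    for the positive and negative samples of this image.
    """
    # Fifth candidate values: (1 - sim)^2 for each positive sample.
    fifth = sum((1.0 - cosine_similarity(image_feature, p)) ** 2
                for p in positive_features)
    # Sixth candidate values: sim^2 for each negative sample.
    sixth = sum(cosine_similarity(image_feature, q) ** 2
                for q in negative_features)
    # Assumption: the linear fusion uses weights of 1 for both sums.
    return fifth + sixth
```

The loss reaches zero when every positive text feature points in the same direction as the image feature and every negative text feature is orthogonal to it.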
[0148] In this approach, backpropagation is performed on the content moderation model. The third loss value L is substituted into an optimization algorithm such as SGD or Adam to calculate the gradient of the parameters of the image adapter H and the text adapter S. The parameters of the image adapter H and the text adapter S are then updated according to the gradient. Backpropagation stops before the image encoder G and the text encoder T, whose parameters are not updated. The parameters of the image encoder G learned in the first stage of training are maintained, as are the original parameters of the text encoder T (such as the parameters learned in pre-training).
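The effect of stopping backpropagation before the encoders can be illustrated with a toy update step. This is a simplified sketch using plain SGD on lists of numbers, not the training code of this application; the parameter-group names and the `trainable` set are hypothetical.

```python
def sgd_step(params, grads, trainable, lr=0.01):
    """Apply one plain SGD update to the trainable parameter groups only.

    params / grads: mapping group name -> list of floats.
    trainable: names of the groups to update (here the adapters H and S);
    every other group (here the encoders G and T) is left unchanged.
    """
    updated = {}
    for name, values in params.items():
        if name in trainable:
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
        else:
            updated[name] = list(values)  # frozen group: copied as is
    return updated
```

Calling `sgd_step(params, grads, trainable={"H", "S"})` changes only the adapter groups, which mirrors the second training stage where the image encoder G and the text encoder T stay fixed.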
[0149] Step 1036: Determine whether the preset second training condition is met; if yes, proceed to step 1037; otherwise, return to steps 1031-1036.
[0150] Step 1037: Determine that the image adapter and text adapter have completed training.
[0151] In this embodiment, a second training condition can be set in advance as the condition for stopping the training of the image adapter H and the text adapter S. For example: the number of iterations reaches a certain threshold; the change of the third loss value is less than a certain threshold for several consecutive times; the third loss value is less than a certain threshold; and so on. In each round of iterative training, it is determined whether the data recorded during the current round meets the second training condition.
[0152] If the second training condition is met, the image adapter H and the text adapter S can be considered to have completed training. At this time, the parameters of the image adapter H and the parameters of the text adapter S are output and persisted to configuration files such as config.
[0153] If the second training condition is not met, the next round of iterative training can be entered, and steps 1031-1036 can be executed again. This iterative training continues until the image adapter H and the text adapter S complete training.
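The rule "the change of the third loss value is less than a certain threshold for several consecutive times" can be sketched as a patience check. The function name, the `patience` parameter, and all default thresholds are assumptions for illustration.

```python
def second_condition_met(iteration, loss_history, max_iterations=1000,
                         delta_eps=1e-4, patience=3, loss_eps=1e-3):
    """Return True if any example stopping rule for stage two holds.

    loss_history: recorded third loss values, oldest first.
    patience: how many consecutive small changes are required (assumed).
    """
    if iteration >= max_iterations:
        return True
    if loss_history and loss_history[-1] < loss_eps:
        return True
    if len(loss_history) > patience:
        # The last `patience` changes need `patience + 1` recorded values.
        recent = loss_history[-(patience + 1):]
        changes = [abs(b - a) for a, b in zip(recent, recent[1:])]
        if all(change < delta_eps for change in changes):
            return True
    return False
```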
[0154] Once the image adapter H and text adapter S have completed training, the entire content moderation network is ready for testing and application to internet products.
[0155] In this embodiment, a content moderation model is determined. The content moderation model includes an image encoder, an image adapter, a text encoder, and a text adapter. The image encoder is used to extract a first image feature from image data, and the image adapter is used to map the first image feature to a target space to obtain a second image feature. The text encoder is used to extract a first text feature from text information, and the text adapter is used to map the first text feature to a target space to obtain a second text feature. The image encoder is trained to adapt to content moderation in an adversarial and classification manner. If the image encoder training is completed, the image adapter and text adapter are trained to adapt to content moderation in a classification manner under the condition of fixing the image encoder and text encoder. This embodiment uses multimodal content moderation with image data and text information, which can fully utilize the inherent connections between language and text, resulting in stronger generalization and reasoning capabilities. While ensuring the accuracy of content moderation, the sample size for iteratively updating the content moderation model is reduced from hundreds of thousands or millions to tens or hundreds, significantly reducing the sample size and facilitating the collection of sufficient samples. When the moderation rules change or new violation categories are added, samples can be quickly collected to iteratively update the content moderation model, resulting in strong scalability and improved response speed for content moderation. In addition, this embodiment uses a two-stage training process. The first stage trains the image encoder, which has higher generalization requirements, separately, while the second stage jointly trains the lightweight image adapter and text adapter. 
This phased approach achieves the generalization and adaptability of the content moderation network, ensuring its performance and improving training speed.
[0156] Example 2
[0157] Figure 4 is a flowchart of a content moderation method provided in Embodiment 2 of this application. This embodiment is applicable to content moderation based on a multimodal (text and image) content moderation model. The method can be executed by a content moderation device, which can be implemented in hardware and/or software. This content moderation device can be configured in electronic devices, such as servers, blade servers, mainframe computers, and other suitable computers. As shown in Figure 4, the method includes:
[0158] Step 401: Load the preset content moderation model.
[0159] In this embodiment, the content moderation model can be pre-trained. As shown in Figure 5, the content moderation model includes an image encoder G, an image adapter H, a text encoder T, and a text adapter S.
[0160] The training method is as follows:
[0161] A content moderation model is defined, which includes an image encoder, an image adapter, a text encoder, and a text adapter. The image encoder is used to extract a first image feature from image data, and the image adapter is used to map the first image feature to a target space to obtain a second image feature. The text encoder is used to extract a first text feature from text information, and the text adapter is used to map the first text feature to a target space to obtain a second text feature.
[0162] The image encoder is trained to be suitable for content moderation using adversarial and classification methods.
[0163] If the image encoder training is completed, then under the condition of fixing the image encoder and text encoder, the image adapter and text adapter are trained to be suitable for content review in a classification manner.
[0164] In this embodiment, since the method for training the content review model is basically the same as that described in Embodiment 1, it is described only briefly here. For relevant details, please refer to the description in Embodiment 1.
[0165] When an internet product is launched, a pre-set content moderation model can be loaded into memory to run, providing content moderation services for the internet product.
[0166] Step 402: Input the image data to be reviewed into the image encoder to extract the first image features.
[0167] In practical applications, users upload a file to internet products (such as live streaming applications, short video applications, etc.) through a client. The file contains one or more frames of image data. The file format varies depending on the business, such as a user cover (in some cases reused as a user avatar), a custom emoticon, a short video, live streaming data, etc. The intention is to publish the file in the internet product so that other users can read and view it.
[0168] Content review rules can be formulated based on business, legal, and other factors. Before a file is published, its content is reviewed in accordance with these review rules, so that files that do not comply with the review rules are filtered out and files that do comply are published.
[0169] Furthermore, if the file is video data such as short videos or live streaming data, content review can be performed on each frame of image data, or multiple frames of image data can be extracted from the video data in a frame-skipping manner for content review. For example, one frame of image data can be extracted at intervals, or image data can be randomly extracted, etc., to reduce resource consumption. This embodiment does not impose any restrictions on this.
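The two frame-skipping options mentioned above (fixed interval and random extraction) can be sketched as index selection. The function names and parameters are hypothetical, and decoding the video into frames is outside the scope of this sketch.

```python
import random


def frames_at_interval(num_frames, stride):
    """Indices of one frame every `stride` frames, starting at frame 0."""
    return list(range(0, num_frames, stride))


def frames_at_random(num_frames, count, rng=None):
    """Indices of up to `count` distinct frames chosen at random, in order."""
    rng = rng or random.Random()
    return sorted(rng.sample(range(num_frames), min(count, num_frames)))
```

Only the frames at the returned indices would then be passed to the image encoder G, which reduces resource consumption compared with reviewing every frame.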
[0170] During content moderation, as shown in Figure 5, the image data x to be reviewed is input into the image encoder G. The image encoder G processes the image data x according to its structure and outputs the first image feature extracted from the image data x.
[0171] Step 403: Input the first image features into the image adapter and map them to the target space to obtain the second image features.
[0172] As shown in Figure 5, the first image features are input into the image adapter H. The image adapter H processes the first image features according to its structure, maps the first image features to the target space, and obtains the second image features.
[0173] Step 404: Input the text information representing the category in content review into the text encoder to extract the first text feature.
[0174] In this embodiment, multiple categories of content to be reviewed can be set in advance according to the review rules, such as hookah, cigarettes, dice, poker, etc.
[0175] As shown in Figure 5, these categories are represented in the form of text information t. During content review, the text information t that represents a category in the content review (i.e., the category text itself, such as hookah, cigarette, dice, poker, etc.) is input into the text encoder T. The text encoder T processes the text information t according to its structure and outputs the first text feature extracted from the text information.
[0176] Step 405: Input the first text feature into the text adapter and map it to the target space to obtain the second text feature.
[0177] As shown in Figure 5, the first text feature is input into the text adapter S. The text adapter S processes the first text feature according to its structure, maps the first text feature to the target space, and obtains the second text feature.
[0178] Step 406: Compare the second image features with the second text features to generate a review result for the image data.
[0179] The second image features and the second text features are compared in the same vector space, and the review result is generated for the image data based on the comparison result.
[0180] In specific implementations, as shown in Figure 5, the similarity sim between the second image features and the second text features is calculated.
[0181] For example, the product between the second image feature and the second text feature is calculated as the dot product; the product between the length of the second image feature and the length of the second text feature is calculated as the modulus-length product; the ratio between the dot product and the modulus-length product is calculated as the similarity, that is, the similarity is the cosine of the angle between the second image feature and the second text feature, and its value ranges from [-1, 1], where -1 is completely dissimilar and 1 is completely similar.
[0182] The similarity is compared with a preset threshold.
[0183] If the similarity is greater than the preset threshold, the review result of the image data is determined to be that the content of the image data belongs to the category, the content of the image data violates the rules, and it belongs to high-risk image data, which can be further reviewed by technical personnel.
[0184] If the similarity is less than or equal to the preset threshold, the review result of the image data is determined to be that the content of the image data does not belong to any category. If the image data does not belong to any category, it can be determined that the content of the image data is not in violation and belongs to low-risk image data.
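The threshold comparison in step 406 can be sketched as follows, assuming the second image feature and one second text feature per category are already computed as non-zero lists of floats. The function names, the dictionary layout, and the default threshold of 0.5 are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Ratio of the dot product to the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    length_product = (math.sqrt(sum(x * x for x in a))
                      * math.sqrt(sum(y * y for y in b)))
    return dot / length_product  # assumes neither vector is all zeros


def review(image_feature, category_features, threshold=0.5):
    """Return (is_violation, matched_categories) for one image.

    category_features: hypothetical mapping category name -> second text
    feature. A category matches when its similarity exceeds the threshold.
    """
    matched = [name for name, feature in category_features.items()
               if cosine_similarity(image_feature, feature) > threshold]
    # No matched category means the image is treated as low risk.
    return (len(matched) > 0, matched)
```

An image whose feature matches at least one category would be flagged as high risk and could be passed on for further review, while an image that matches no category is treated as not in violation.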
[0185] In this embodiment, a preset content moderation model is loaded. This model includes an image encoder, an image adapter, a text encoder, and a text adapter. The image data to be moderated is input into the image encoder to extract a first image feature. The first image feature is then input into the image adapter and mapped to a target space to obtain a second image feature. Text information representing the category in the content moderation is input into the text encoder to extract a first text feature. The first text feature is then input into the text adapter and mapped to a target space to obtain a second text feature. The second image feature and the second text feature are compared to generate a moderation result for the image data. This embodiment splits the content moderation model into two parts: the first part consists of an image encoder and a text encoder with higher generalization requirements, and the second part consists of an image adapter and a text adapter with higher adaptability requirements. This phased approach to achieving generalization and adaptability of the content moderation network ensures its performance. Furthermore, this embodiment uses multimodal content moderation with image data and text information, which can fully utilize the inherent connections between language and text, resulting in stronger generalization and reasoning capabilities. While ensuring the accuracy of content moderation, the sample size for iteratively updating the content moderation model is reduced from hundreds of thousands or millions to tens or hundreds, significantly reducing the sample size and facilitating the collection of sufficient samples. When the moderation rules change or new violation categories are added, samples can be quickly collected to iteratively update the content moderation model, resulting in strong scalability and improved response speed for content moderation.
[0186] Example 3
[0187] Figure 6 is a schematic diagram of the structure of a training device for a content moderation model provided in Embodiment 3 of this application. As shown in Figure 6, the device includes:
[0188] The content moderation model determination module 601 is used to determine a content moderation model, which includes an image encoder, an image adapter, a text encoder, and a text adapter. The image encoder is used to extract a first image feature from image data, and the image adapter is used to map the first image feature to a target space to obtain a second image feature. The text encoder is used to extract a first text feature from text information, and the text adapter is used to map the first text feature to the target space to obtain a second text feature.
[0189] Encoder training module 602 is used to train the image encoder to be adapted for content moderation in an adversarial and classification manner.
[0190] The adapter training module 603 is used to train the image adapter and the text adapter to be suitable for content review in a classification manner, under the condition that the image encoder and the text encoder are fixed, once the image encoder has been trained.
[0191] In one embodiment of this application, the encoder training module 602 includes:
[0192] The branch determination module is used to determine the classification branch and the adversarial branch. The classification branch is used to train the image encoder so that the first image features are applicable to classifying various categories in content review. The adversarial branch is used to train the image encoder in an adversarial manner so that the first image features are applicable to content review.
[0193] The branch update module is used to update the classification branch and the adversarial branch sequentially in each training iteration to train the image encoder to be suitable for content review.
[0194] In one embodiment of this application, the adversarial branch includes the image encoder, a general encoder adapted to non-specific operations, and a discriminator; the classification branch includes the image encoder and a feedforward network.
[0195] The branch update module includes:
[0196] The encoding module is used to input the image data used as samples into the image encoder to extract the first image feature and into the general encoder to extract the third image feature.
[0197] The discrimination module is used in the discriminator to generate a first adversarial label for whether the image data is suitable for content review using the first image feature, and to generate a second adversarial label for whether the image data is suitable for content review using the third image feature;
[0198] The feedforward module is used to map the first image features to the classification space in the feedforward network to obtain the fourth image features;
[0199] The probability calculation module is used to map the fourth image feature to the probability that the image data belongs to each category of the content review.
[0200] The classification branch update module is used to update the parameters of the image encoder and the parameters of the feedforward network according to the first adversarial label and the probability.
[0201] An adversarial branch update module is used to update the parameters of the discriminator based on the first adversarial label and the second adversarial label while keeping the parameters of the image encoder unchanged.
[0202] The first training condition judgment module is used to determine whether the preset first training condition is met; if yes, the first completion determination module is called; if no, the encoding module is called back.
[0203] The first completion determination module is used to determine that the image encoder has completed training.
[0204] In one embodiment of this application, the classification branch update module includes:
[0205] The first loss value calculation module is used to generate a classification label based on the probability, wherein the classification label indicates whether the image data belongs to the category; to square the difference obtained by subtracting the first adversarial label from 1 to obtain a first candidate value; to sum the products between the derivatives of each classification label and the probability to obtain a second candidate value; and to linearly fuse the negative values of the first candidate value and the second candidate value to obtain a first loss value.
[0206] The first loss value update module is used to update the parameters of the image encoder and the parameters of the feedforward network according to the first loss value.
[0207] In one embodiment of this application, the adversarial branch update module includes:
[0208] The second loss value calculation module is used to square the first adversarial label to obtain a third candidate value; square the difference between 1 and the second adversarial label to obtain a fourth candidate value; and linearly fuse the third candidate value and the fourth candidate value into a second loss value.
[0209] The second loss value update module is used to update the parameters of the feedforward network according to the second loss value while keeping the parameters of the image encoder unchanged.
[0210] In one embodiment of this application, the adapter training module 603 includes:
[0211] The first image feature extraction module is used to input the image data used as samples into the image encoder to extract the first image features.
[0212] The first image feature mapping module is used to input the first image features into the image adapter and map them to the target space to obtain the second image features;
[0213] The first text feature extraction module is used to input the text information used as samples into the text encoder to extract the first text features.
[0214] The second text feature mapping module is used to input the first text feature into the text adapter and map it to the target space to obtain the second text feature;
[0215] An adapter update module is used to compare the second image feature with the second text feature while keeping the parameters of the image encoder and the text encoder unchanged, so as to update the parameters of the image adapter and the text adapter.
[0216] The second training condition judgment module is used to determine whether the preset second training condition is met; if yes, the second completion determination module is called; if no, the first image feature extraction module is called back.
[0217] The second completion determination module is used to determine that the image adapter and the text adapter have completed training.
[0218] In one embodiment of this application, the text information used as samples includes positive samples related to the image data and negative samples unrelated to the image data;
[0219] The adapter update module includes:
[0220] The third loss value calculation module is used to calculate, for each of the image data, a first similarity between the second image feature and the second text feature corresponding to the positive sample; square the difference obtained by subtracting the first similarity from 1 as a fifth candidate value; calculate, for each of the image data, a second similarity between the second image feature and the second text feature corresponding to the negative sample; square the second similarity as a sixth candidate value; and linearly fuse the sum of all the fifth candidate values and the sum of all the sixth candidate values to obtain a third loss value.
[0221] The third loss value update module is used to update the parameters of the image adapter and the text adapter according to the third loss value while keeping the parameters of the image encoder and the text encoder unchanged.
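The third loss value described above can be sketched in a few lines of Python. The function name and the fusion weights `w_pos` and `w_neg` are assumptions for illustration; this section states only that the two sums are linearly fused and does not give the coefficients.

```python
def third_loss(pos_sims, neg_sims, w_pos=1.0, w_neg=1.0):
    """Linearly fuse the positive-sample and negative-sample penalty sums.

    pos_sims: first similarities (image feature vs. positive text feature)
    neg_sims: second similarities (image feature vs. negative text feature)
    w_pos, w_neg: hypothetical fusion weights, not specified in the source
    """
    fifth_candidates = [(1.0 - s) ** 2 for s in pos_sims]  # push positives toward 1
    sixth_candidates = [s ** 2 for s in neg_sims]          # push negatives toward 0
    return w_pos * sum(fifth_candidates) + w_neg * sum(sixth_candidates)
```

The loss is zero only when every positive similarity equals 1 and every negative similarity equals 0, which is the behavior the adapters are being trained toward.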
[0222] In one embodiment of this application, the third loss value calculation module is further configured to:
[0223] For each of the image data, the product between the second image feature and the second text feature corresponding to the positive sample is calculated as the first dot product;
[0224] Calculate the product between the length of the second image feature and the length of the second text feature corresponding to the positive sample, and use it as the first modulus product;
[0225] The ratio between the first dot product and the first modulus product is calculated as the first similarity.
[0226] In one embodiment of this application, the third loss value calculation module is further configured to:
[0227] For each of the image data, the product between the second image feature and the second text feature corresponding to the negative sample is calculated as the second dot product;
[0228] Calculate the product between the length of the second image feature and the length of the second text feature corresponding to the negative sample, and use it as the second modulus product;
[0229] The ratio between the second dot product and the second modulus product is calculated as the second similarity.
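Written as formulas, with $v_i$ denoting the second image feature of the $i$-th image data and $t_i^{+}$, $t_i^{-}$ denoting the second text features of its positive and negative samples (symbols introduced here for illustration only), the two similarities are:

```latex
s_i^{+} = \frac{v_i \cdot t_i^{+}}{\lVert v_i \rVert \, \lVert t_i^{+} \rVert},
\qquad
s_i^{-} = \frac{v_i \cdot t_i^{-}}{\lVert v_i \rVert \, \lVert t_i^{-} \rVert}
```

Each ratio is the cosine of the angle between the two vectors, so it lies in the interval [-1, 1] and does not depend on the magnitudes of the features.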
[0230] The training apparatus for the content moderation model provided in this application embodiment can execute the training method for the content moderation model provided in any embodiment of this application, and has the corresponding functional modules and beneficial effects for executing the training method for the content moderation model.
[0231] Embodiment 4
[0232] Figure 7 is a schematic structural diagram of a content moderation device provided in Embodiment 4 of this application. As shown in Figure 7, the device includes:
[0233] The content moderation model loading module 701 is used to load a preset content moderation model, which includes an image encoder, an image adapter, a text encoder, and a text adapter.
[0234] The first image feature extraction module 702 is used to extract first image features from the image data to be reviewed by inputting it into the image encoder.
[0235] The second image feature mapping module 703 is used to input the first image features into the image adapter and map them to the target space to obtain the second image features;
[0236] The first text feature extraction module 704 is used to input text information representing the category in content review into the text encoder to extract the first text feature;
[0237] The second text feature mapping module 705 is used to input the first text feature into the text adapter and map it to the target space to obtain the second text feature;
[0238] The review result generation module 706 is used to compare the second image feature with the second text feature to generate a review result for the image data.
[0239] The training method for the content moderation model is as follows:
[0240] A content moderation model is defined, comprising an image encoder, an image adapter, a text encoder, and a text adapter. The image encoder is used to extract a first image feature from image data, and the image adapter is used to map the first image feature to a target space to obtain a second image feature. The text encoder is used to extract a first text feature from text information, and the text adapter is used to map the first text feature to the target space to obtain a second text feature.
[0241] The image encoder is trained to be adapted for content moderation using adversarial and classification methods;
[0242] Once the image encoder has completed training, the image adapter and the text adapter are trained, with the image encoder and the text encoder fixed, to be suitable for content review in a classification manner.
[0243] In one embodiment of this application, the review result generation module 706 includes:
[0244] The similarity calculation module is used to calculate the similarity between the second image feature and the second text feature;
[0245] The first result generation module is used to determine that the content of the image data belongs to the category if the similarity is greater than a preset threshold.
[0246] The second result generation module is used to determine that the content of the image data does not belong to the category if the similarity is less than or equal to a preset threshold.
[0247] In one embodiment of this application, the similarity calculation module includes:
[0248] The dot product calculation module is used to calculate the product between the second image feature and the second text feature, as the dot product.
[0249] The modulus product calculation module is used to calculate the product between the length of the second image feature and the length of the second text feature, as the modulus product;
[0250] The ratio calculation module is used to calculate the ratio between the dot product and the modulus product, which serves as the similarity.
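The similarity calculation and the threshold decision described above can be sketched as follows. This is a minimal illustration that assumes the second image feature and the second text features have already been produced by the adapters; the function names, category names, and threshold value are hypothetical.

```python
import math


def cosine_similarity(image_feature, text_feature):
    """Ratio of the dot product to the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(image_feature, text_feature))
    modulus_product = (math.sqrt(sum(a * a for a in image_feature))
                       * math.sqrt(sum(b * b for b in text_feature)))
    return dot / modulus_product


def review(image_feature, category_features, threshold):
    """Return, per category, whether the image content is judged to belong to it.

    A category matches only if the similarity is strictly greater than the
    threshold; a similarity less than or equal to the threshold means it does not.
    """
    return {
        name: cosine_similarity(image_feature, feature) > threshold
        for name, feature in category_features.items()
    }
```

For example, `review([1.0, 0.0], {"category_a": [2.0, 0.0], "category_b": [0.0, 3.0]}, 0.5)` marks only `category_a` as matched. Because categories enter only as text features, a new category can be added to `category_features` without changing the image side of the comparison.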
[0251] The content moderation device provided in this application embodiment can execute the content moderation method provided in any embodiment of this application, and has the corresponding functional modules and beneficial effects for executing the content moderation method.
[0252] Embodiment 5
[0253] Figure 8 shows a schematic structural diagram of an electronic device 10 that can be used to implement embodiments of this application. The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementation of this application described and/or claimed herein.
[0254] As shown in Figure 8, the electronic device 10 includes at least one processor 11 and a memory, such as a read-only memory (ROM) 12 or a random access memory (RAM) 13, communicatively connected to the at least one processor 11. The memory stores computer programs executable by the at least one processor. The processor 11 can perform various appropriate actions and processes based on the computer program stored in the ROM 12 or loaded into the RAM 13 from storage unit 18. The RAM 13 may also store various programs and data required for the operation of the electronic device 10. The processor 11, ROM 12, and RAM 13 are interconnected via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
[0255] Multiple components in electronic device 10 are connected to I/O interface 15, including: input unit 16, such as keyboard, mouse, etc.; output unit 17, such as various types of displays, speakers, etc.; storage unit 18, such as disk, optical disk, etc.; and communication unit 19, such as network card, modem, wireless transceiver, etc. Communication unit 19 allows electronic device 10 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
[0256] Processor 11 can be a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. Processor 11 performs the various methods and processes described above, such as content moderation methods or methods for training content moderation models.
[0257] In some embodiments, the content moderation method or the training method for a content moderation model may be implemented as a computer program tangibly contained in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed on electronic device 10 via ROM 12 and/or communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the content moderation method or the training method for a content moderation model described above may be performed. Alternatively, in other embodiments, processor 11 may be configured to perform the content moderation method or the training method for a content moderation model by any other suitable means (e.g., by means of firmware).
[0258] In the context of this application, a computer-readable storage medium can be a tangible medium that may contain or store a computer program for use by or in conjunction with an instruction execution system, apparatus, or device. A computer-readable storage medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. Alternatively, a computer-readable storage medium can be a machine-readable signal medium. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
[0259] Embodiment 6
[0260] This application also provides a computer program product, which includes a computer program that, when executed by a processor, implements the content moderation method or the training method for the content moderation model provided in any embodiment of this application.
[0261] In the implementation of the computer program product, computer program code for performing the operations of this application can be written in one or more programming languages or a combination thereof. Programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as C or similar languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network—including a local area network (LAN) or a wide area network (WAN)—or can be connected to an external computer (e.g., via the Internet using an Internet service provider).
[0262] It should be understood that steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in this application can be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution of this application can be achieved; this is not limited herein.
[0263] The specific embodiments described above do not constitute a limitation on the scope of protection of this application. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this application should be included within the scope of protection of this application.
Claims
1. A content moderation method, characterized in that the method includes: Load a preset content moderation model, which includes an image encoder, an image adapter, a text encoder, and a text adapter; The image data to be reviewed is input into the image encoder to extract the first image feature; The first image feature is input into the image adapter and mapped to the target space to obtain the second image feature; The text information representing the category in content review is input into the text encoder to extract the first text feature; The first text feature is input into the text adapter and mapped to the target space to obtain the second text feature; The second image feature is compared with the second text feature to generate a review result for the image data; The training method for the content moderation model is as follows: A content moderation model is defined, comprising an image encoder, an image adapter, a text encoder, and a text adapter. The image encoder is used to extract a first image feature from image data, and the image adapter is used to map the first image feature to a target space to obtain a second image feature. The text encoder is used to extract a first text feature from text information, and the text adapter is used to map the first text feature to the target space to obtain a second text feature. The image encoder is trained to be adapted for content moderation using adversarial and classification methods; If the image encoder is trained, then, with the image encoder and the text encoder fixed, the image adapter and the text adapter are trained to be suitable for content review in a classification manner. The step of training the image encoder to adapt to content moderation using adversarial and classification methods includes: A classification branch and an adversarial branch are determined. 
The classification branch is used to train the image encoder so that the first image features are applicable to classifying various categories in content review. The adversarial branch is used to train the image encoder in an adversarial manner so that the first image features are applicable to content review. In each iteration of training, the classification branch and the adversarial branch are updated sequentially to train the image encoder to be suitable for content review. The adversarial branch includes the image encoder, a general encoder adapted to non-specific operations, and a discriminator; the classification branch includes the image encoder and a feedforward network. The step of sequentially updating the classification branch and the adversarial branch in each training iteration to train the image encoder to be suitable for content review includes: The image data used as samples are respectively input into the image encoder to extract the first image feature, and input into the general encoder to extract the third image feature; The discriminator uses the first image feature to generate a first adversarial label on the image data to determine whether it is suitable for content review, and uses the third image feature to generate a second adversarial label on the image data to determine whether it is suitable for content review. In the feedforward network, the first image feature is mapped to the classification space to obtain the fourth image feature; The fourth image feature is mapped to the probability that the image data belongs to each category of the content review process. The parameters of the image encoder and the parameters of the feedforward network are updated based on the first adversarial label and the probability. 
While keeping the parameters of the image encoder unchanged, the parameters of the discriminator are updated according to the first adversarial label and the second adversarial label; Determine whether the preset first training condition is met; if yes, then determine that the image encoder has completed training; otherwise, return to the step of inputting the image data used as samples into the image encoder to extract the first image features and inputting them into a general encoder adapted to non-specific operations to extract the third image features.
2. The method according to claim 1, characterized in that, The step of comparing the second image feature with the second text feature to generate a review result for the image data includes: Calculate the similarity between the second image feature and the second text feature; If the similarity is greater than a preset threshold, then the review result of the image data is determined to be that the content of the image data belongs to the category; If the similarity is less than or equal to a preset threshold, the review result of the image data is determined to be that the content of the image data does not belong to the category.
3. A training method for a content moderation model, characterized in that the method includes: A content moderation model is defined, comprising an image encoder, an image adapter, a text encoder, and a text adapter. The image encoder is used to extract a first image feature from image data, and the image adapter is used to map the first image feature to a target space to obtain a second image feature. The text encoder is used to extract a first text feature from text information, and the text adapter is used to map the first text feature to the target space to obtain a second text feature. The image encoder is trained to be adapted for content moderation using adversarial and classification methods; If the image encoder is trained, then, with the image encoder and the text encoder fixed, the image adapter and the text adapter are trained to be suitable for content review in a classification manner. The step of training the image encoder to adapt to content moderation using adversarial and classification methods includes: A classification branch and an adversarial branch are determined. The classification branch is used to train the image encoder so that the first image features are applicable to classifying various categories in content review. The adversarial branch is used to train the image encoder in an adversarial manner so that the first image features are applicable to content review. In each iteration of training, the classification branch and the adversarial branch are updated sequentially to train the image encoder to be suitable for content review. The adversarial branch includes the image encoder, a general encoder adapted to non-specific operations, and a discriminator; the classification branch includes the image encoder and a feedforward network. 
The step of sequentially updating the classification branch and the adversarial branch in each training iteration to train the image encoder to be suitable for content review includes: The image data used as samples are respectively input into the image encoder to extract the first image feature, and input into the general encoder to extract the third image feature; The discriminator uses the first image feature to generate a first adversarial label on the image data to determine whether it is suitable for content review, and uses the third image feature to generate a second adversarial label on the image data to determine whether it is suitable for content review. In the feedforward network, the first image feature is mapped to the classification space to obtain the fourth image feature; The fourth image feature is mapped to the probability that the image data belongs to each category of the content review process. The parameters of the image encoder and the parameters of the feedforward network are updated based on the first adversarial label and the probability. While keeping the parameters of the image encoder unchanged, the parameters of the discriminator are updated according to the first adversarial label and the second adversarial label; Determine whether the preset first training condition is met; if yes, then determine that the image encoder has completed training; if no, return to the step of inputting the image data used as samples into the image encoder to extract the first image features and inputting them into a general encoder adapted to non-specific operations to extract the third image features.
4. The method according to claim 3, characterized in that, The step of updating the parameters of the image encoder and the parameters of the feedforward network based on the first adversarial label and the probability includes: A classification label is generated based on the probability, and the classification label indicates whether the image data belongs to the category; The square of the difference between 1 and the first adversarial label is used to obtain the first candidate value; The second candidate value is obtained by summing the products of each classification label and the derivative of the probability. The negative values of the first candidate value and the second candidate value are linearly merged to form the first loss value; The parameters of the image encoder and the parameters of the feedforward network are updated according to the first loss value.
5. The method according to claim 3, characterized in that, The step of updating the parameters of the discriminator based on the first adversarial label and the second adversarial label while keeping the parameters of the image encoder unchanged includes: Square the first adversarial label to obtain the third candidate value; The square of the difference between 1 and the second adversarial label is used to obtain the fourth candidate value; The third candidate value and the fourth candidate value are linearly fused to form a second loss value; While keeping the parameters of the image encoder unchanged, the parameters of the discriminator are updated according to the second loss value.
6. The method according to any one of claims 3-5, characterized in that, The step of training the image adapter and the text adapter to be adapted for content review in a classification manner, under the condition of fixing the image encoder and the text encoder, includes: The image data used as samples is input into the image encoder to extract the first image features; The first image feature is input into the image adapter and mapped to the target space to obtain the second image feature; The text information used as a sample is input into the text encoder to extract the first text feature; The first text feature is input into the text adapter and mapped to the target space to obtain the second text feature; While keeping the parameters of the image encoder and the text encoder unchanged, the second image feature and the second text feature are compared to update the parameters of the image adapter and the text adapter. Determine whether the preset second training condition is met; if yes, determine that the image adapter and the text adapter have completed training; otherwise, return to the step of inputting the image data used as samples into the image encoder to extract the first image features.
7. The method according to claim 6, characterized in that, The text information used as samples includes positive samples related to the image data and negative samples unrelated to the image data; The step of comparing the second image feature with the second text feature to update the parameters of the image adapter and the text adapter while keeping the parameters of the image encoder and the text encoder unchanged includes: For each of the image data, calculate the first similarity between the second image feature and the second text feature corresponding to the positive sample; The square of the difference between 1 and the first similarity is used as the fifth candidate value. For each of the image data, calculate the second similarity between the second image feature and the second text feature corresponding to the negative sample; The square of the second similarity is used as the sixth candidate value; The sum of all the fifth candidate values and the sum of all the sixth candidate values are linearly merged to form a third loss value; While keeping the parameters of the image encoder and the text encoder unchanged, update the parameters of the image adapter and the text adapter according to the third loss value.
8. The method according to claim 7, characterized in that, The step of calculating the first similarity between the second image feature and the second text feature corresponding to the positive sample for each of the image data includes: For each of the image data, the product between the second image feature and the second text feature corresponding to the positive sample is calculated as the first dot product; Calculate the product between the length of the second image feature and the length of the second text feature corresponding to the positive sample, and use it as the first modulus product; Calculate the ratio between the first dot product and the first modulus product, and use it as the first similarity. The step of calculating the second similarity between the second image feature and the second text feature corresponding to the negative sample for each of the image data includes: For each of the image data, the product between the second image feature and the second text feature corresponding to the negative sample is calculated as the second dot product; Calculate the product between the length of the second image feature and the length of the second text feature corresponding to the negative sample, and use it as the second modulus product; The ratio between the second dot product and the second modulus product is calculated as the second similarity.
9. A content moderation device, characterized in that the device includes: A content moderation model loading module is used to load a preset content moderation model, which includes an image encoder, an image adapter, a text encoder, and a text adapter. The first image feature extraction module is used to extract the first image features from the image data to be reviewed by inputting it into the image encoder. The second image feature mapping module is used to input the first image feature into the image adapter and map it to the target space to obtain the second image feature; The first text feature extraction module is used to input text information representing the category in content review into the text encoder to extract the first text feature; The second text feature mapping module is used to input the first text feature into the text adapter and map it to the target space to obtain the second text feature; The review result generation module is used to compare the second image feature with the second text feature to generate a review result for the image data; The training method for the content moderation model is as follows: A content moderation model is defined, comprising an image encoder, an image adapter, a text encoder, and a text adapter. The image encoder is used to extract a first image feature from image data, and the image adapter is used to map the first image feature to a target space to obtain a second image feature. The text encoder is used to extract a first text feature from text information, and the text adapter is used to map the first text feature to the target space to obtain a second text feature. The image encoder is trained to be adapted for content moderation using adversarial and classification methods; If the image encoder is trained, then, with the image encoder and the text encoder fixed, the image adapter and the text adapter are trained to be suitable for content review in a classification manner. 
The step of training the image encoder to adapt to content moderation using adversarial and classification methods includes: A classification branch and an adversarial branch are determined. The classification branch is used to train the image encoder so that the first image features are applicable to classifying various categories in content review. The adversarial branch is used to train the image encoder in an adversarial manner so that the first image features are applicable to content review. In each iteration of training, the classification branch and the adversarial branch are updated sequentially to train the image encoder to be suitable for content review. The adversarial branch includes the image encoder, a general encoder adapted to non-specific operations, and a discriminator; the classification branch includes the image encoder and a feedforward network. The step of sequentially updating the classification branch and the adversarial branch in each training iteration to train the image encoder to be suitable for content review includes: The image data used as samples are respectively input into the image encoder to extract the first image feature, and input into the general encoder to extract the third image feature; The discriminator uses the first image feature to generate a first adversarial label on the image data to determine whether it is suitable for content review, and uses the third image feature to generate a second adversarial label on the image data to determine whether it is suitable for content review. In the feedforward network, the first image feature is mapped to the classification space to obtain the fourth image feature; The fourth image feature is mapped to the probability that the image data belongs to each category of the content review process. The parameters of the image encoder and the parameters of the feedforward network are updated based on the first adversarial label and the probability. 
While keeping the parameters of the image encoder unchanged, the parameters of the discriminator are updated according to the first adversarial label and the second adversarial label; Determine whether the preset first training condition is met; if yes, then determine that the image encoder has completed training; otherwise, return to the step of inputting the image data used as samples into the image encoder to extract the first image features and inputting them into a general encoder adapted to non-specific operations to extract the third image features.
10. A training device for a content moderation model, characterized in that the device includes: A content moderation model determination module is used to determine a content moderation model. The content moderation model includes an image encoder, an image adapter, a text encoder, and a text adapter. The image encoder is used to extract a first image feature from image data. The image adapter is used to map the first image feature to a target space to obtain a second image feature. The text encoder is used to extract a first text feature from text information. The text adapter is used to map the first text feature to the target space to obtain a second text feature. An encoder training module is used to train the image encoder to be adapted for content moderation in an adversarial and classification manner. An adapter training module is used to train the image adapter and the text adapter to be suitable for content review in a classification manner, provided that the image encoder and the text encoder are fixed, after the image encoder has been trained. The encoder training module includes: The branch determination module is used to determine the classification branch and the adversarial branch. The classification branch is used to train the image encoder so that the first image features are applicable to classifying various categories in content review. The adversarial branch is used to train the image encoder in an adversarial manner so that the first image features are applicable to content review. The branch update module is used to update the classification branch and the adversarial branch sequentially in each training iteration to train the image encoder to be suitable for content review. The adversarial branch includes the image encoder, a general encoder adapted to non-specific operations, and a discriminator; the classification branch includes the image encoder and a feedforward network. 
The branch update module includes: The encoding module is used to input the image data used as samples into the image encoder to extract the first image feature and into the general encoder to extract the third image feature. The discrimination module is used in the discriminator to generate a first adversarial label for whether the image data is suitable for content review using the first image feature, and to generate a second adversarial label for whether the image data is suitable for content review using the third image feature; The feedforward module is used to map the first image features to the classification space in the feedforward network to obtain the fourth image features; The probability calculation module is used to map the fourth image feature to the probability that the image data belongs to each category of the content review. The classification branch update module is used to update the parameters of the image encoder and the parameters of the feedforward network according to the first adversarial label and the probability. An adversarial branch update module is used to update the parameters of the discriminator based on the first adversarial label and the second adversarial label while keeping the parameters of the image encoder unchanged. The first training condition judgment module is used to determine whether the preset first training condition is met; if yes, the first completion determination module is called; if no, the encoding module is called back. The first completion determination module is used to determine that the image encoder has completed training.
11. An electronic device, characterized in that, The electronic device includes: At least one processor; and A memory communicatively connected to the at least one processor; wherein, The memory stores a computer program that can be executed by the at least one processor, the computer program being executed by the at least one processor to enable the at least one processor to perform the content moderation method of any one of claims 1-2 or the training method of the content moderation model of any one of claims 3-8.
12. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program that, when executed by a processor, implements the content moderation method of any one of claims 1-2 or the training method of the content moderation model of any one of claims 3-8.
13. A computer program product, characterized in that, The computer program product includes a computer program that, when executed by a processor, implements the content moderation method of any one of claims 1-2 or the training method of the content moderation model of any one of claims 3-8.