A video content forgetting method, system, and medium based on two-level optimization

A two-level optimized video content forgetting method is constructed, and the video generation model is optimized with a LoRA low-rank adapter and multiple loss functions. This addresses the accuracy and coherence problems of forgetting specific concepts in video generation and achieves efficient, accurate forgetting.

CN121174016B · Active · Publication Date: 2026-06-26 · GUANGDONG SOUTH SMART MEDIA TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
GUANGDONG SOUTH SMART MEDIA TECH CO LTD
Filing Date
2025-08-15
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing technologies struggle to accurately forget specific concepts during video generation, leading to abrupt changes in content or logical inconsistencies in the video after forgetting. This can also affect the generation of irrelevant concepts and makes it difficult to handle synonyms or misspelled words.

Method used

A two-level optimization approach is adopted. By constructing training sets of target forgetting concepts and irrelevant concepts, the lower and upper layers are optimized respectively. The LoRA low-rank adapter is inserted into the cross-attention layer of the video generation model and the model parameters are frozen. The model is optimized to achieve accurate forgetting by combining global forgetting loss, local attention loss, semantic self-comparison loss and irrelevant concept retention loss.

Benefits of technology

It achieves accurate forgetting of specific concepts in video generation, maintains the temporal coherence of the video and the ability to generate irrelevant concepts, improves the stability and forgetting efficiency of the model, and can effectively handle synonyms and misspelled words.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a video content forgetting method and system based on double-layer optimization, and a medium, which comprises the following steps: inputting a target forgetting concept training set into a preset first video generation model, determining a first joint loss function based on obtained first model output results, and performing lower-layer optimization based on the first joint loss function to obtain a second video generation model; inputting an irrelevant concept training set into the second video generation model, determining a second joint loss function based on obtained second model output results, and performing upper-layer optimization based on the second joint loss function until the first joint loss function and the second joint loss function both satisfy preset conditions, and determining a trained target video generation model; and inputting obtained text descriptions containing target forgetting concepts into the target video generation model to generate videos not containing target forgetting concepts. Therefore, by implementing the application, precise forgetting of specific concepts in video generation can be realized.

Description

Technical Field

[0001] This invention relates to the field of forgetting learning, and more particularly to a video content forgetting method, system, and medium based on two-level optimization.

Background Technology

[0002] In today's digital age, the rapid development of video generation technology has brought tremendous convenience to content creation, but it has also raised numerous copyright, privacy, and compliance issues. For example, video generation models may unintentionally generate footage containing trademarks, public figures, copyrighted characters, or other sensitive content, potentially leading to disputes or violations. To avoid these risks, it is crucial to enable video generation models to learn to forget specific concepts. By allowing the model to "forget" certain sensitive concepts, it is possible to effectively prevent the generation of videos containing such content, thereby ensuring the legality and security of the generated content. Furthermore, this forgetting ability can also be applied to scenarios such as privacy protection and brand replacement, providing broader compliance safeguards for the creation and dissemination of video content.

[0003] In existing technologies, forgetting learning primarily focuses on the image domain, employing methods such as concept shifting and contrastive learning in attention layers or loss functions to forget specific concepts. However, these methods face numerous challenges when applied to video generation. First, video generation requires not only concept forgetting within a single frame but also temporal coherence and consistency across the video. Existing methods often fail to meet both requirements simultaneously, leading to abrupt content changes or logical inconsistencies in the video after forgetting. Second, when forgetting the target concept, existing methods may degrade the model's ability to generate other, unrelated concepts, reducing the overall performance and practicality of the model. Furthermore, existing technologies handle synonyms and misspellings poorly: they fail to recognize and forget these similar concepts, resulting in incomplete forgetting. Therefore, achieving accurate, efficient forgetting of concepts in video generation without affecting other concepts remains a pressing technical challenge.

Summary of the Invention

[0004] This invention provides a video content forgetting method, system, and medium based on two-level optimization to achieve accurate forgetting of specific concepts in video generation.

[0005] An embodiment of the present invention provides a video content forgetting method based on two-level optimization, comprising:

[0006] Construct a training set for the target forgetting concept and a training set for irrelevant concepts;

[0007] The target forgotten concept training set is input into a preset first video generation model. Based on the output of the first model, a first joint loss function is determined, and lower-level optimization is performed based on the first joint loss function to obtain a second video generation model. The irrelevant concept training set is input into the second video generation model. Based on the output of the second model, a second joint loss function is determined, and upper-level optimization is performed based on the second joint loss function until both the first joint loss function and the second joint loss function satisfy preset conditions. The trained target video generation model is then determined. The first video generation model is obtained by inserting a LoRA low-rank adapter into the cross-attention layer of the pre-trained initial video generation model and freezing the model parameters of the initial video generation model. In each iteration, the LoRA weights are updated according to the first joint loss function and the second joint loss function.

[0008] The obtained text description containing the target forgetting concept is input into the target video generation model to generate a video that does not contain the target forgetting concept.

[0009] This application's embodiments construct a target forgotten concept training set and an irrelevant concept training set to facilitate subsequent model learning how to identify and suppress the target forgotten concept while retaining the generation of other irrelevant concepts. By inputting the target forgotten concept training set, the model can learn the features and context of the target concept. By optimizing the first joint loss function, the model can gradually suppress the generation ability of the target concept while retaining the generation ability of other concepts. By inputting the irrelevant concept training set into the second video generation model, the model can learn the features and context of irrelevant concepts. By optimizing the second joint loss function, the model can retain the generation ability of irrelevant concepts while further optimizing the forgetting effect of the target concept. By focusing on forgetting the target concept in the lower layer optimization and focusing on retaining irrelevant concepts in the upper layer optimization, the model can maintain the generation ability of other concepts while forgetting the target concept through alternating optimization of these two levels. By inputting text descriptions containing the target forgotten concept into the target video generation model, the target concept can be intelligently filtered out. Even if the input text description contains the target concept, the model can still generate a video without the target concept. Compared with the prior art, this application can achieve accurate forgetting of specific concepts in video generation.
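The alternation between the lower and upper levels described above can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the training procedure of this application: the LoRA weights are a plain list of floats, the two quadratic losses and their gradients are hypothetical stand-ins for the first and second joint loss functions, and the learning rate, threshold, and iteration cap are arbitrary.

```python
from typing import Callable, List

Vector = List[float]


def sgd_step(weights: Vector, grad: Vector, lr: float) -> Vector:
    """One plain gradient-descent step on the (toy) LoRA weights."""
    return [w - lr * g for w, g in zip(weights, grad)]


def alternate_two_level(
    lora: Vector,
    lower_loss: Callable[[Vector], float],
    lower_grad: Callable[[Vector], Vector],
    upper_loss: Callable[[Vector], float],
    upper_grad: Callable[[Vector], Vector],
    lr: float = 0.1,
    threshold: float = 1e-3,
    max_iters: int = 1000,
) -> Vector:
    """Alternate a forgetting step and a retention step until both losses are small."""
    for _ in range(max_iters):
        # Lower level: update LoRA weights with the first joint loss (forgetting).
        lora = sgd_step(lora, lower_grad(lora), lr)
        # Upper level: update the same LoRA weights with the second joint loss (retention).
        lora = sgd_step(lora, upper_grad(lora), lr)
        if lower_loss(lora) < threshold and upper_loss(lora) < threshold:
            break
    return lora


# Hypothetical toy losses; a real setup would evaluate the video model instead.
def toy_lower_loss(w: Vector) -> float:
    return (w[0] - 1.0) ** 2


def toy_lower_grad(w: Vector) -> Vector:
    return [2.0 * (w[0] - 1.0), 0.0]


def toy_upper_loss(w: Vector) -> float:
    return (w[1] + 2.0) ** 2


def toy_upper_grad(w: Vector) -> Vector:
    return [0.0, 2.0 * (w[1] + 2.0)]


trained = alternate_two_level(
    [0.0, 0.0], toy_lower_loss, toy_lower_grad, toy_upper_loss, toy_upper_grad
)
```

Only the adapter weights change in the loop, which mirrors the requirement that the base model stays frozen while both loss functions drive the same LoRA parameters.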

[0010] Furthermore, the target forgotten concept training set is generated by the SkyReels-A2 model and includes a first video with the target forgotten concept and a second video without the target forgotten concept paired under the same background; the irrelevant concept training set is generated by the initial video generation model and includes a third video with irrelevant concepts.

[0011] By providing videos containing target concepts, the model can learn how to identify and suppress these concepts. At the same time, providing a training set of irrelevant concepts can help the model distinguish between target concepts and irrelevant concepts, avoiding the accidental rejection of other concepts during the forgetting process.

[0012] Further, the step of inputting the target forgotten concept training set into a preset first video generation model, determining a first joint loss function based on the output of the first model, and performing lower-level optimization based on the first joint loss function to obtain a second video generation model, specifically involves:

[0013] The target forgetting concept training set is input into a preset first video generation model to obtain the output result of the first model;

[0014] The global forgetting loss is calculated based on the output of the first model, wherein the global forgetting loss is obtained by comparing the output difference of the first video generation model for the first video containing the target forgetting concept and the second video not containing the target forgetting concept;

[0015] Based on the attention matrix output by the cross-attention layer of the first video generation model, the local attention loss is calculated, wherein the local attention loss is the sum of the columns corresponding to the target concept in the attention matrix in the spatial and temporal dimensions;

[0016] The first joint loss function is formed by combining the global forgetting loss and the local attention loss, and the LoRA weights are updated according to the first joint loss function to obtain the second video generation model.

[0017] By inputting videos containing the target concept, the model can learn the features and context of the target concept. By optimizing the first joint loss function, the model can gradually suppress the ability to generate the target concept while retaining the ability to generate other concepts.
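The local attention loss described in the steps above, the sum of the target-concept columns of the cross-attention matrix over the spatial and temporal dimensions, can be sketched as follows. The tensor layout `(frames, spatial_positions, text_tokens)` and the function name are assumptions made for illustration; the actual layout in the model may differ.

```python
import numpy as np


def local_attention_loss(attn: np.ndarray, target_token_ids: list) -> float:
    """Sum the attention paid to the target-concept tokens over all frames and positions.

    attn is assumed to have shape (frames, spatial_positions, text_tokens).
    """
    if attn.ndim != 3:
        raise ValueError("expected attention of shape (frames, positions, tokens)")
    # Select only the columns that belong to the target concept's text tokens.
    target_columns = attn[:, :, target_token_ids]
    # Summing over axis 0 (time) and axis 1 (space) gives a single scalar penalty.
    return float(target_columns.sum())


# Toy example: uniform attention over 4 tokens, tokens 1 and 2 are the concept.
attn = np.full((2, 3, 4), 0.25)
loss = local_attention_loss(attn, [1, 2])
```

Minimizing this scalar pushes attention mass away from the target tokens in every frame, which is why it complements the global loss computed on the model output.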

[0018] Furthermore, the formula for calculating the global forgetting loss is as follows:

[0019]

[0020] In the formula, $L_{esd}$ is the global forgetting loss; $\mathbb{E}$ denotes the expectation; $c_{un}$ is the text representing the concept to be forgotten; the empty-text symbol (missing from this text) denotes empty text; $x_t$ and $x_t'$ are the paired video data in the dataset that respectively include and do not include the concept to be forgotten; $\theta_o$ denotes the original model; $\Delta\theta$ denotes the LoRA module; $v$ denotes the velocity target predicted by the diffusion model; $t$ is the time step of the diffusion model; and $\eta$ is a hyperparameter controlling the degree of forgetting.
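The formula itself is not reproduced in this text, so the following is only a hedged illustration of how a loss built from these symbols might look. It assumes an ESD-style objective in which the LoRA-adapted prediction for the concept prompt is regressed onto an anchor prediction pushed away from the concept by a factor `eta`. The argument names, the exact combination of terms, and how the paired sample without the concept enters the anchor are assumptions, not the claimed formula.

```python
import numpy as np


def esd_style_forgetting_loss(
    v_lora_cond: np.ndarray,
    v_orig_cond: np.ndarray,
    v_orig_anchor: np.ndarray,
    eta: float = 1.0,
) -> float:
    """Mean squared error between the adapted prediction and a concept-negated target.

    v_lora_cond:   prediction of the LoRA-adapted model given the concept text
    v_orig_cond:   prediction of the frozen original model given the concept text
    v_orig_anchor: prediction of the frozen original model for the anchor input
                   (assumed here: empty text, possibly on the paired video
                   without the concept)
    eta:           degree-of-forgetting hyperparameter
    """
    # Move the anchor away from the direction that encodes the concept.
    target = v_orig_anchor - eta * (v_orig_cond - v_orig_anchor)
    return float(np.mean((v_lora_cond - target) ** 2))


# Toy check: an unchanged model is penalized, a model matching the target is not.
v_anchor = np.zeros((2, 3))
v_cond = np.ones((2, 3))
loss_unchanged = esd_style_forgetting_loss(v_cond, v_cond, v_anchor, eta=1.0)
loss_at_target = esd_style_forgetting_loss(-v_cond, v_cond, v_anchor, eta=1.0)
```

A larger `eta` moves the target further from the concept direction, which matches its description as the degree-of-forgetting hyperparameter.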

[0021] Further, the step of inputting the irrelevant concept training set into the second video generation model, determining the second joint loss function based on the output of the obtained second model, and performing upper-level optimization based on the second joint loss function until both the first joint loss function and the second joint loss function satisfy preset conditions, and determining the trained target video generation model, specifically involves:

[0022] The training set of irrelevant concepts is input into the second video generation model to obtain the output result of the second model;

[0023] The irrelevant concept retention loss is calculated based on the output of the second model, wherein the irrelevant concept retention loss is obtained by comparing the difference between the irrelevant concept video and the model prediction output;

[0024] The semantic self-comparison loss is calculated based on the cross-attention matrix output by the second model, wherein the semantic self-comparison loss is obtained by comparing the similarity of the target concept attention vector, the irrelevant concept attention vector, and the synonym attention vector;

[0025] The irrelevant concept retention loss and the semantic self-comparison loss are combined into a second joint loss function, and the LoRA weights are updated based on the second joint loss function until the values of the first joint loss function and the second joint loss function are both lower than a preset threshold and the target forgetting concept is not detected in the results generated from random prompts over a preset number of trials. The trained target video generation model is then determined.

[0026] By inputting videos containing irrelevant concepts, the model can learn the features and context of those irrelevant concepts. By optimizing the second joint loss function, the model can retain its ability to generate irrelevant concepts while further optimizing the forgetting effect on the target concepts.
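The stopping rule in the steps above combines two conditions: both joint losses fall below a preset threshold, and the target concept is not detected in a preset number of random-prompt generations. A minimal sketch follows. The function name is hypothetical, and `concept_detected` is assumed to come from some external detector that this text does not specify.

```python
from typing import Sequence


def training_finished(
    first_loss: float,
    second_loss: float,
    threshold: float,
    concept_detected: Sequence[bool],
) -> bool:
    """Return True only if both losses are below the threshold and no trial detected the concept.

    concept_detected holds one flag per random-prompt generation trial.
    """
    losses_ok = first_loss < threshold and second_loss < threshold
    return losses_ok and not any(concept_detected)


# Five hypothetical trials with random prompts, none showing the concept.
done = training_finished(0.01, 0.02, 0.05, [False] * 5)
```

Requiring the detection check in addition to the loss thresholds guards against the case where the losses look converged but the concept still appears in sampled videos.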

[0027] Furthermore, the expression for the semantic self-comparison loss is specifically as follows:

[0028]

[0029] In the formula, $L_{sc}$ is the semantic self-comparison loss; $F_{un}$ is the attention vector of the target concept; $F$ (its subscript is missing from this text) is the attention vector of the irrelevant concepts; and $F_{syn}$ is the attention vector of the synonym.
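The expression itself is not reproduced in this text. As a hedged illustration only, one common way to compare the similarity of three such vectors is a contrastive (InfoNCE-style) loss that pulls the synonym vector toward the target concept vector and pushes the irrelevant concept vector away. The cosine similarity, the temperature `tau`, and the name `f_ir` for the irrelevant concept vector are assumptions, not the claimed formula.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def semantic_self_contrast_loss(
    f_un: Sequence[float],
    f_syn: Sequence[float],
    f_ir: Sequence[float],
    tau: float = 1.0,
) -> float:
    """Contrastive loss: synonym is the positive, irrelevant concept is the negative."""
    pos = math.exp(cosine(f_un, f_syn) / tau)
    neg = math.exp(cosine(f_un, f_ir) / tau)
    return -math.log(pos / (pos + neg))


# Synonym aligned with the target and irrelevant concept orthogonal: low loss.
aligned = semantic_self_contrast_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
# Roles swapped: the loss is higher.
swapped = semantic_self_contrast_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

Under this reading, synonyms and misspellings end up in the same attention neighborhood as the target concept and are therefore forgotten together with it, while irrelevant concepts stay separated.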

[0030] Furthermore, the rank parameter of the LoRA low-rank adapter is set to less than or equal to 4, and it is only inserted into the forward propagation path of the query mapping matrix and key mapping matrix of the cross-attention layer in the initial video generation model. The remaining parameters of the initial video generation model remain frozen until the end of training.

[0031] This design not only reduces computational complexity and memory requirements, but also ensures that the model can maintain its ability to generate other concepts and the stability of overall performance while forgetting the target concept.

[0032] Furthermore, the video content forgetting method based on two-level optimization also includes:

[0033] When multiple target concepts need to be forgotten simultaneously, the LoRA weights corresponding to each target concept are combined by linear normalization to achieve parallel deletion of multiple concepts.

[0034] By combining the LoRA weights corresponding to multiple target concepts through linear normalization to achieve parallel deletion of multiple concepts, the range of forgetting ability is expanded and the forgetting efficiency is improved. This not only maintains the stability and consistency of the model, enhances the accuracy of forgetting, but also improves the model's versatility and adaptability.
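The text does not define "linear normalization" further. A minimal sketch, assuming it means a weighted sum of the per-concept LoRA updates with coefficients normalized to sum to one, is shown below. The function name, the representation of each update as a full delta matrix (for example the product of the two low-rank factors), and the default equal weighting are assumptions.

```python
import numpy as np


def merge_lora_deltas(deltas: list, weights: list = None) -> np.ndarray:
    """Combine per-concept LoRA weight updates with coefficients that sum to 1."""
    if not deltas:
        raise ValueError("at least one LoRA delta is required")
    if weights is None:
        weights = [1.0] * len(deltas)
    coeffs = np.asarray(weights, dtype=float)
    if len(coeffs) != len(deltas) or coeffs.sum() <= 0:
        raise ValueError("weights must match deltas and have a positive sum")
    coeffs = coeffs / coeffs.sum()  # linear normalization of the mixing weights
    return sum(c * d for c, d in zip(coeffs, deltas))


# Two hypothetical concept-specific updates for the same projection matrix.
delta_a = np.ones((2, 2))
delta_b = 3.0 * np.ones((2, 2))
merged = merge_lora_deltas([delta_a, delta_b])
```

Normalizing the coefficients keeps the magnitude of the merged update comparable to a single adapter, which is one plausible reason such a merge would preserve model stability.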

[0035] Another embodiment of the present invention provides a video content forgetting system based on two-level optimization, comprising: an acquisition module, a training module, and a generation module;

[0036] The acquisition module is used to construct a training set of target forgotten concepts and a training set of irrelevant concepts;

[0037] The training module is used to input the target forgotten concept training set into a preset first video generation model, determine a first joint loss function based on the output of the first model, and perform lower-level optimization based on the first joint loss function to obtain a second video generation model. The irrelevant concept training set is then input into the second video generation model, a second joint loss function is determined based on the output of the second model, and upper-level optimization is performed based on the second joint loss function until both the first joint loss function and the second joint loss function satisfy preset conditions, thereby determining the trained target video generation model. The first video generation model is obtained by inserting a LoRA low-rank adapter into the cross-attention layer of a pre-trained initial video generation model and freezing the model parameters of the initial video generation model. In each iteration, the LoRA weights are updated according to the first joint loss function and the second joint loss function.

[0038] The generation module is used to input the acquired text description containing the target forgetting concept into the target video generation model to generate a video that does not contain the target forgetting concept.

[0039] Another embodiment of the present invention provides a computer-readable storage medium including a stored computer program which, when run, controls the device on which the computer-readable storage medium is located to perform the steps of the video content forgetting method based on two-level optimization of the present invention.

Attached Figure Description

[0040] To more clearly illustrate the technical solution of this application, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0041] Figure 1 is a flowchart illustrating an embodiment of the video content forgetting method based on two-level optimization provided in this application;

[0042] Figure 2 is a schematic diagram illustrating the construction process of the target forgetting concept training set provided in this application;

[0043] Figure 3 is a flowchart illustrating steps S301 to S304 provided in this application;

[0044] Figure 4 is a flowchart illustrating steps S401 to S404 provided in this application;

[0045] Figure 5 is a schematic diagram of the training process of the target video generation model provided in this application;

[0046] Figure 6 is a schematic diagram of an embodiment of the video content forgetting system based on two-level optimization provided in this application.

Detailed Implementation

[0047] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions of this application will be clearly and completely described below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0048] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains; the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the application; the terms “comprising” and “having”, and any variations thereof, in the specification, claims, and foregoing description of the drawings are intended to cover non-exclusive inclusion.

[0049] In the description of the embodiments of this application, technical terms such as "first" and "second" are used only to distinguish different objects and should not be construed as indicating or implying relative importance or implicitly specifying the number, specific order, or primary and secondary relationship of the indicated technical features. In the description of the embodiments of this application, "multiple" means two or more, unless otherwise explicitly defined.

[0050] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.

[0051] In the description of the embodiments in this application, the term "and / or" is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, and B existing alone. Additionally, the character " / " in this document generally indicates that the preceding and following related objects have an "or" relationship.

[0052] In the description of the embodiments of this application, the term "multiple" refers to two or more (including two), similarly, "multiple sets" refers to two or more (including two sets), and "multiple pieces" refers to two or more (including two pieces).

[0053] In the description of the embodiments of this application, unless otherwise expressly specified and limited, technical terms such as "installation," "connection," "joining," and "fixing" should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral part; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; they can refer to the internal communication of two components or the interaction between two components. For those skilled in the art, the specific meaning of the above terms in the embodiments of this application can be understood according to the specific circumstances.

[0054] The following is an explanation of the terms used in this application:

[0055] Forgetting learning is a class of methods designed for AI security. Researchers typically apply concept shifting, contrastive learning, and similar methods at the attention layer or in the final loss function to make a model forget knowledge of a specific category. It is widely used in traditional classification, detection, and segmentation, as well as in the currently popular large language models (LLMs) and generative models. With careful design, these methods can be used for forgetting in video generation. After forgetting, the generative model can no longer generate videos containing specific concepts, thus avoiding legal and infringement issues arising from generating trademark elements, public figures, film and television characters, or sensitive objects. It is worth noting that forgetting in video generation models means causing the model to lose its ability to generate specific concepts, rather than deleting specific concepts from a video through video editing.

[0056] Wan2.1 is a fully open-source large-scale video generation model that uses the standard Diffusion Transformer architecture, currently the most mainstream paradigm. It has also achieved good results on both VBench-1.0 and VBench-2.0. Therefore, this invention implements the forgetting task on this model, which improves the versatility and reliability of the method.

[0057] The background of this application will be explained below:

[0058] In the digital age, while video generation technology facilitates creation, it also brings risks related to copyright, privacy, and compliance, such as generating footage containing sensitive content like trademarks and public figures. To mitigate these risks, models need to "forget" sensitive concepts to ensure the legality and security of generated content, while also being applicable to scenarios involving privacy protection and brand replacement. Existing forgetting learning methods are mostly focused on the image domain, but their application in video generation faces significant challenges: first, balancing single-frame concept forgetting with temporal coherence is crucial; second, it may affect the generation of irrelevant concepts; and third, it struggles to handle synonyms or misspellings. Therefore, accurate, efficient video generation forgetting that does not affect other concepts is a pressing problem that needs to be solved.

[0059] Referring to Figure 1, to achieve accurate forgetting of specific concepts in video generation, an embodiment of the present invention provides a video content forgetting method based on two-level optimization, including steps S101 to S103.

[0060] Step S101: Construct a training set of target forgotten concepts and an irrelevant concept training set;

[0061] In some embodiments, the target forgotten concept training set is generated by the SkyReels-A2 model and includes a first video containing the target forgotten concept and a second video without the target forgotten concept paired in the same background; the irrelevant concept training set is generated by the initial video generation model and includes a third video containing irrelevant concepts.

[0062] In some embodiments, the target forgetting concept training set is constructed as follows. First, the target forgetting concept (e.g., "Pikachu"), a corresponding background image (e.g., "beach"), and a text description are selected. Second, the SkyReels-A2 model is used to generate a first video containing the target forgetting concept (e.g., a 4-6 second video of "a Pikachu walking on a golden beach") based on the target forgetting concept, the background image, and the text description, and to generate a second video without the target forgetting concept (e.g., a video of "a golden beach") based on the background image. The first and second videos must be consistent in background and scene and differ only in the target forgetting concept; together they form the target forgetting concept training set. The construction process is shown in Figure 2. Taking the copyrighted character Pikachu as an example, inputting an image of Pikachu, a background image of a beach, and the text "A Pikachu is walking on a golden beach" yields videos containing Pikachu, while inputting the background image of a beach and the text "A golden beach" yields videos that do not contain Pikachu, giving paired video data. By varying the background images, other images, and text descriptions, paired video datasets targeting specific copyrighted content are obtained.

[0063] In some embodiments, the irrelevant concept training set is constructed as follows. First, irrelevant concepts (such as "beach") are selected. Then, the initial video generation model (such as the Wan2.1 model) is used to generate a third video containing the irrelevant concepts (such as a video of "a golden sand beach"). The videos in the irrelevant concept training set must be unrelated to the target forgetting concept in both content and semantics. This yields the irrelevant concept training set.

[0064] By providing videos containing target concepts, the model can learn how to identify and suppress these concepts. At the same time, providing a training set of irrelevant concepts can help the model distinguish between target concepts and irrelevant concepts, avoiding the accidental rejection of other concepts during the forgetting process.
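The pairing logic of step S101 can be sketched as a small data structure that records, for each background, the prompt with the target concept and the prompt without it. This sketch only builds the prompt pairs; it does not call SkyReels-A2 or Wan2.1, and the class name, field names, and prompt template are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PairedPrompt:
    """Prompts for one paired sample: same background, with and without the concept."""
    concept: str
    background: str
    prompt_with_concept: str
    prompt_without_concept: str


def build_paired_prompts(concept: str, backgrounds: List[str]) -> List[PairedPrompt]:
    """Create one prompt pair per background; only the concept differs within a pair."""
    pairs = []
    for background in backgrounds:
        pairs.append(
            PairedPrompt(
                concept=concept,
                background=background,
                prompt_with_concept=f"A {concept} is walking on {background}",
                prompt_without_concept=background,
            )
        )
    return pairs


pairs = build_paired_prompts("Pikachu", ["a golden beach", "a snowy street"])
```

Each pair would then be rendered into a first video and a second video by the generation model, so that the two videos share background and scene and differ only in the target concept.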

[0065] Step S102: Input the target forgotten concept training set into a preset first video generation model, determine a first joint loss function based on the output of the first model, and perform lower-level optimization based on the first joint loss function to obtain a second video generation model. Input the irrelevant concept training set into the second video generation model, determine a second joint loss function based on the output of the second model, and perform upper-level optimization based on the second joint loss function until both the first joint loss function and the second joint loss function meet preset conditions, and determine the trained target video generation model. The first video generation model is obtained by inserting a LoRA low-rank adapter into the cross-attention layer of the pre-trained initial video generation model and freezing the model parameters of the initial video generation model. In each iteration, the LoRA weights are updated according to the first joint loss function and the second joint loss function.

[0066] In some embodiments, the rank parameter of the LoRA low-rank adapter is set to less than or equal to 4, and it is only inserted into the forward propagation path of the query mapping matrix and key mapping matrix of the cross-attention layer in the initial video generation model. The remaining parameters of the initial video generation model remain frozen until training ends. Specifically, firstly, in the cross-attention layer of the initial video generation model (i.e., the Wan2.1 model), the query mapping matrix (q-mapping layer) and key mapping matrix (k-mapping layer) are located; then, a LoRA low-rank adapter is inserted into the forward propagation path of the query mapping matrix and key mapping matrix, and the rank parameter r of the LoRA low-rank adapter is set to less than or equal to 4; then, all parameters in the initial video generation model except for the LoRA low-rank adapter are frozen to ensure that only the weights of the LoRA low-rank adapter are updated during training; and in each iteration, the weights of the LoRA low-rank adapter are updated according to the first joint loss function and the second joint loss function until a preset condition is met; after training ends, the weights of the LoRA low-rank adapter are kept frozen to form the final target video generation model.
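The adapter placement described above can be sketched with a generic LoRA wrapper around a frozen projection matrix. This is a minimal NumPy illustration, not the Wan2.1 implementation: the class name, the scaling factor `alpha`, the initialization scheme, and the matrix sizes are assumptions; only the rank limit of 4 and the idea of adapting the query and key projections while the base weight stays frozen come from the text.

```python
import numpy as np


class LoRALinear:
    """Frozen linear projection plus a trainable low-rank update (rank <= 4)."""

    def __init__(self, frozen_weight: np.ndarray, rank: int = 4,
                 alpha: float = 4.0, seed: int = 0):
        if not 1 <= rank <= 4:
            raise ValueError("the text limits the LoRA rank to at most 4")
        out_features, in_features = frozen_weight.shape
        rng = np.random.default_rng(seed)
        self.frozen_weight = frozen_weight  # never updated during training
        self.lora_a = rng.normal(0.0, 0.01, size=(rank, in_features))  # trainable
        self.lora_b = np.zeros((out_features, rank))  # trainable, zero at start
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        base = x @ self.frozen_weight.T
        update = (x @ self.lora_a.T) @ self.lora_b.T
        return base + self.scale * update

    def trainable_parameter_count(self) -> int:
        return self.lora_a.size + self.lora_b.size


# Hypothetical query projection of a cross-attention layer (16 -> 8 features).
w_q = np.random.default_rng(1).normal(size=(8, 16))
q_proj = LoRALinear(w_q, rank=4)
x = np.ones((2, 16))
```

Because `lora_b` starts at zero, the wrapped projection initially reproduces the frozen model exactly, and only the small number of adapter parameters would receive gradient updates.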

[0067] It should be noted that the LoRA adapter only applies to the first and last Transformer layers of the spacetime dual-stream block to reduce memory and computational load.

[0068] This design not only reduces computational complexity and memory requirements, but also ensures that the model can maintain its ability to generate other concepts and the stability of overall performance while forgetting the target concept.

[0069] It's important to note that past methods primarily focused on image processing, resulting in forgetting only spatially. This only preserves the concept forgetting within a single frame. However, video presents two crucial differences: a) after forgetting a concept, it's necessary to maintain temporal consistency among unrelated concepts, such as the background; b) after forgetting a concept, the overall video needs to maintain temporal continuity. This necessitates considering spatial attention layers while simultaneously addressing the temporal changes in attention to the concept to be forgotten, accurately locating the concept, and ultimately designing a forgetting mechanism that targets only specific concept regions. This approach ensures temporal consistency without affecting knowledge of unrelated concepts.

[0070] Please refer to Figure 3. In some embodiments, the step of inputting the target forgetting concept training set into a preset first video generation model, determining a first joint loss function based on the obtained first model output result, and performing lower-level optimization based on the first joint loss function to obtain a second video generation model includes steps S301 to S304.

[0071] Step S301: Input the target forgetting concept training set into a preset first video generation model to obtain the first model output result;

[0072] In some embodiments, a first video containing the target forgetting concept and a second video not containing the target forgetting concept are input into a preset first video generation model (the Wan2.1 model with a LoRA low-rank adapter inserted and all other parameters frozen). The model then generates the corresponding video output based on the input text and outputs the attention map of the cross-attention layer; together these constitute the first model output result.

[0073] For example, if the target forgetting concept is "Pikachu", then the first video is a video of "a Pikachu walking on the beach" and the second video is a video of "the beach".

[0074] Step S302: Calculate the global forgetting loss based on the output of the first model, wherein the global forgetting loss is obtained by comparing the output difference of the first video generation model for the first video containing the target forgetting concept and the second video without the target forgetting concept;

[0075] In some embodiments, after the first model output is obtained (i.e., the model's generation results for the first video containing the target forgetting concept and for the second video not containing the target forgetting concept), the two outputs are compared and the global forgetting loss L_esd is calculated from them. This loss reflects the model's ability to generate the target forgetting concept.

[0076] In some embodiments, the formula for calculating the global forgetting loss is as follows:

[0077]

[0078] In the formula, L_esd is the global forgetting loss; E denotes the expectation; c_un is the text representing the concept to be forgotten; ∅ indicates empty text; x_t and x_t′ represent the paired video data in the dataset that respectively include and do not include the concept to be forgotten; θ_o represents the original model; Δθ represents the LoRA module; v represents the predicted target velocity of the diffusion model; t represents the time step of the diffusion model; and η is a hyperparameter controlling the degree of forgetting.

[0079] It should be noted that the model output is the output of the model currently being trained (θ_o + Δθ) for video x_t, which contains the concept to be forgotten, and text c_un, which describes that concept. The target output is built from the original model θ_o: its output for video x_t′, which does not contain the concept to be forgotten, under the empty text ∅, minus its output for the same video x_t′ under the concept text c_un. The subtraction further suppresses the concept to be forgotten. The corresponding coefficients (1+η) and η adjust the degree of suppression in the target output, thereby controlling how strongly the final model forgets the target concept: the larger η is, the greater the degree of forgetting. Using the difference between the current model output and the target output as the loss function guides the model to suppress generation of the target concept, thereby achieving the forgetting effect. For example, when forgetting the concept "dog", c_un is text describing a video "containing a puppy"; ∅ is the empty text ""; x_t is a video that includes a puppy; and x_t′ is a video in which the puppy is removed but the same background is retained.
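Read together, the symbol definitions and the description of the coefficients (1+η) and η suggest a loss of the following shape. This is a reconstruction from the surrounding text rather than the original formula: the squared L2 norm and the variables under the expectation are assumptions, as is the use of ∅ (`\varnothing`) for the empty text.

```latex
\mathcal{L}_{esd}
  = \mathbb{E}_{x_t,\,x_t',\,t}
    \left[
      \left\|
        v_{\theta_o+\Delta\theta}(x_t, c_{un}, t)
        - \Big[ (1+\eta)\, v_{\theta_o}(x_t', \varnothing, t)
                - \eta\, v_{\theta_o}(x_t', c_{un}, t) \Big]
      \right\|_2^2
    \right]
```

With η = 0 the target reduces to the original model's unconditional prediction on the concept-free video; larger η pushes the target further away from the concept-conditioned prediction.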

[0080] It's important to note that by comparing videos containing and without the target concept, the video generation model can maintain consistency in other concepts and the coherence of the video while forgetting the target concept. By minimizing this loss, the model will gradually lose its ability to generate the target concept.
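A small numeric sketch can make the role of η concrete. The function names `esd_target`, `mse` and `global_forgetting_loss`, the two-element toy vectors, and the use of mean squared error as the distance are assumptions for illustration; real predictions would be full video latents.

```python
def esd_target(v_empty, v_concept, eta):
    """Target output: (1 + eta) * v(x', empty text) - eta * v(x', concept text)."""
    return [(1.0 + eta) * e - eta * c for e, c in zip(v_empty, v_concept)]


def mse(pred, target):
    """Mean squared error between two equally sized vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def global_forgetting_loss(v_current, v_empty, v_concept, eta):
    """Distance between the trained model's output and the suppressed target."""
    return mse(v_current, esd_target(v_empty, v_concept, eta))


# Hypothetical frozen-model predictions on the concept-free video x_t'.
v_empty = [0.0, 1.0]     # conditioned on the empty text
v_concept = [2.0, 1.0]   # conditioned on the concept text c_un

target_eta0 = esd_target(v_empty, v_concept, eta=0.0)   # no extra suppression
target_eta1 = esd_target(v_empty, v_concept, eta=1.0)   # pushed away from the concept

# A model that still behaves like the concept-conditioned original is penalized.
loss_unforgotten = global_forgetting_loss(v_concept, v_empty, v_concept, eta=1.0)
# A model that matches the target exactly has zero loss.
loss_forgotten = global_forgetting_loss(target_eta1, v_empty, v_concept, eta=1.0)
```

Only the first component differs between the two frozen predictions, so only that component is moved by η; the shared component (standing in for the background) is left where it was.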

[0081] It should be noted that while global forgetting loss can achieve global forgetting of target concepts, this forgetting is often incomplete and may affect the accuracy of irrelevant concepts. Therefore, we supplement it with a local attention loss function.

[0082] Step S303: Based on the attention matrix output by the cross-attention layer of the first video generation model, calculate the local attention loss, wherein the local attention loss is the sum of the columns corresponding to the target concept in the attention matrix in the spatial and temporal dimensions;

[0083] In some embodiments, first, the attention matrix output by the cross-attention layer of the first video generation model is extracted; second, the token ID corresponding to the target concept in the attention map is determined through the cross-attention between text and image; and then, the local attention loss is calculated based on the attention matrix.

[0084] It should be noted that the token ID corresponding to the concept to be forgotten in the attention map is obtained through the cross-attention between text and image. By setting an activation threshold, the tokens that respond to the target concept can be found.

[0085] In some embodiments, the formula for the local attention loss L_attn is as follows:

[0086]

[0087] In the formula, F represents the attention matrix output by the cross-attention layer of the video generation model, and idx refers to the positions (column indices) in the attention matrix that correspond to the target concept to be forgotten.

[0088] It should be noted that by suppressing the activation values at these locations through the loss function, the model is explicitly told which regions should be forgotten, thus achieving local forgetting. Through the combination of global and local forgetting, a preliminary accurate forgetting of the target concept within a single frame is achieved.
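The column-sum idea can be sketched on a toy attention tensor laid out as frames x spatial positions x text tokens. The layout, the function names, and the rule of thresholding the mean attention per token are assumptions for illustration; the text only states that a threshold is used and that the target columns are summed over space and time.

```python
def column_mass(attn, token):
    """Sum one text-token column over all frames and spatial positions."""
    return sum(row[token] for frame in attn for row in frame)


def responsive_tokens(attn, threshold):
    """Token indices whose mean attention over frames and positions exceeds threshold."""
    n_tokens = len(attn[0][0])
    n_cells = sum(len(frame) for frame in attn)
    return [k for k in range(n_tokens) if column_mass(attn, k) / n_cells > threshold]


def local_attention_loss(attn, idx):
    """Sum of the target-concept columns across the spatial and temporal dimensions."""
    return sum(column_mass(attn, k) for k in idx)


# Toy tensor: 2 frames x 2 spatial positions x 3 text tokens (hypothetical values).
attn = [
    [[0.1, 0.8, 0.1], [0.2, 0.7, 0.1]],
    [[0.1, 0.6, 0.3], [0.3, 0.5, 0.2]],
]
idx = responsive_tokens(attn, threshold=0.5)   # token 1 dominates every position
loss = local_attention_loss(attn, idx)         # mass to be driven toward zero
```

Minimizing this quantity lowers the attention that every frame and position pays to the target token, which is what ties the spatial suppression to the temporal dimension.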

[0089] It should be noted that, in terms of the forgetting target concept, a loss function that combines global and local attention is adopted. First, we use the concept shifting method commonly used in forgetting learning to design a shifting target for a concept that needs to be forgotten (for example, when we need to forget "dog", we shift it to an empty set, i.e., the concept of empty text). Then, the two branches work together to perform the forward process of the diffusion model.

[0090] Step S304: Based on the combination of the global forgetting loss and the local attention loss, a first joint loss function is formed, and the LoRA weights are updated according to the first joint loss function to obtain the second video generation model.

[0091] In some embodiments, the global forgetting loss L_esd and the local attention loss L_attn are combined into the first joint loss function. The LoRA weights are then updated based on the first joint loss function, and the updated LoRA weights yield the second video generation model.

[0092] It should be noted that by using global ESD loss and local attention suppression terms for the forgotten concepts, the attention activation corresponding to the forgotten concepts in all frames is weakened, thereby suppressing the model's ability to express forgotten concepts.

[0093] By inputting videos containing the target concept, the model can learn the features and context of the target concept. By optimizing the first joint loss function, the model can gradually suppress the ability to generate the target concept while retaining the ability to generate other concepts.

[0094] Please refer to Figure 4. In some embodiments, the step of inputting the irrelevant concept training set into the second video generation model, determining the second joint loss function based on the obtained second model output result, and performing upper-level optimization based on the second joint loss function until both the first joint loss function and the second joint loss function meet the preset conditions, and determining the trained target video generation model, includes steps S401 to S404.

[0095] Step S401: Input the irrelevant concept training set into the second video generation model to obtain the output result of the second model;

[0096] In some embodiments, videos from an irrelevant concept training set are input into a second video generation model. The model then generates video content based on the input, resulting in the output of the second model.

[0097] Step S402: Calculate the irrelevant concept retention loss based on the output of the second model, wherein the irrelevant concept retention loss is obtained by comparing the difference between the irrelevant concept video and the model prediction output;

[0098] In some embodiments, a loss function (such as mean squared error, cross-entropy loss, etc.) is used to calculate the difference between the output of the second model (the generated video) and the real videos in the irrelevant concept training set, so as to obtain the irrelevant concept retention loss, in order to maintain the model's ability to generate videos with irrelevant concepts, that is, the generated video is consistent with the real video in terms of irrelevant concepts.

[0099] In some embodiments, the formula for the irrelevant concept retention loss L_ir is as follows:

[0100]

[0101] In the formula, u_t is the video generated from the irrelevant concepts, v is the predicted output of the flow matching model, and c is the text condition. In this way, knowledge of the other concepts is retained even when one concept is forgotten.
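One common way to realize such a retention term for a flow matching model is a mean squared error between the predicted and reference velocities. The sketch below assumes a linear interpolation path between the clean sample and noise, with the reference velocity taken as noise minus data; this convention, the function names, and the toy predictors are assumptions and are not stated in the text.

```python
def interpolate(data, noise, t):
    """Noised sample u_t on a straight path from data (t = 0) to noise (t = 1)."""
    return [(1.0 - t) * d + t * n for d, n in zip(data, noise)]


def target_velocity(data, noise):
    """Constant velocity along the straight path (assumed convention)."""
    return [n - d for d, n in zip(data, noise)]


def retention_loss(predict_velocity, data, noise, t, condition):
    """MSE between the model's velocity for (u_t, t, c) and the reference velocity."""
    u_t = interpolate(data, noise, t)
    pred = predict_velocity(u_t, t, condition)
    target = target_velocity(data, noise)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


# Hypothetical latent of an irrelevant-concept video and a noise draw.
data = [1.0, 2.0]
noise = [0.0, 4.0]

# Two toy predictors: one that knows the reference velocity, one that outputs zeros.
perfect = lambda u_t, t, c: [-1.0, 2.0]
zeros = lambda u_t, t, c: [0.0, 0.0]

loss_perfect = retention_loss(perfect, data, noise, t=0.5, condition="the beach")
loss_zeros = retention_loss(zeros, data, noise, t=0.5, condition="the beach")
```

Keeping this loss small on the irrelevant concept training set is what anchors the adapter so that forgetting one concept does not degrade the others.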

[0102] Step S403: Calculate semantic self-comparison loss based on the cross-attention matrix output by the second model, wherein the semantic self-comparison loss is obtained by comparing the similarity of the target concept attention vector, the irrelevant concept attention vector, and the synonym attention vector;

[0103] In some embodiments, a major problem exists in the forgetting process of generative models: synonyms and misspellings are not forgotten. For example, when forgetting "ball", a specific type of ball ("soccer") or a misspelling ("bal") may not be forgotten. Therefore, a semantic self-contrast loss L_sc is calculated so that the model retains irrelevant concept knowledge while also being able to handle synonyms. The steps are as follows: First, attention vectors for the target concept, irrelevant concepts, and synonyms are extracted from the cross-attention matrix output by the second model. Then, the similarity between the target concept attention vector and the irrelevant concept attention vector is calculated, as well as the similarity between the target concept attention vector and the synonym attention vector. Finally, using a contrastive learning approach, a semantic self-contrast loss function is designed to minimize the similarity between the target concept and irrelevant concepts, while maximizing the similarity with synonyms. This allows the model to accurately distinguish between irrelevant concepts and synonyms even when forgetting the target concept.

[0104] In some embodiments, the expression for the semantic self-comparison loss is specifically:

[0105]

[0106] In the formula, L_sc is the semantic self-comparison loss; F_un is the target concept attention vector; F_ir is the attention vector for the irrelevant concepts; and F_syn is the attention vector for the synonym.

[0107] It should be noted that the semantic self-comparison loss minimizes the similarity between the concept to be forgotten and irrelevant concepts, while maximizing the similarity with the concept set of synonyms. This allows the semantic self-comparison loss to preserve the model's fine-grained understanding of semantics.
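The direction of this loss can be illustrated with cosine similarity. The difference-of-similarities form below is only one simple contrastive objective that matches the description (lower when the target is close to its synonym and far from the irrelevant concept); the exact expression, the function names, and the toy vectors are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def self_contrast_loss(f_un, f_ir, f_syn):
    """Penalize similarity to the irrelevant concept, reward similarity to the synonym."""
    return cosine(f_un, f_ir) - cosine(f_un, f_syn)


# Hypothetical attention vectors.
f_un = [1.0, 0.0]    # target concept, e.g. "ball"
f_syn = [1.0, 0.0]   # synonym or misspelling, e.g. "bal"
f_ir = [0.0, 1.0]    # irrelevant concept

best_case = self_contrast_loss(f_un, f_ir, f_syn)    # synonym aligned, irrelevant orthogonal
worst_case = self_contrast_loss(f_un, f_syn, f_ir)   # roles swapped
```

Pulling synonyms toward the target means that the suppression learned for the target token also reaches its variants, while irrelevant concepts are pushed out of the affected region.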

[0108] Step S404: Combine the irrelevant concept retention loss and the semantic self-comparison loss into a second joint loss function, and update the LoRA weights based on the second joint loss function until the values of the first joint loss function and the second joint loss function are both lower than a preset threshold, and no forgotten target concept is detected in the random prompt generation results within a preset number of times, thus determining the trained target video generation model.

[0109] In some embodiments, the irrelevant concept retention loss and the semantic self-comparison loss are combined into a second joint loss function, and the second joint loss function is combined with the first joint loss function to form the final optimization objective, L = L_esd + L_ir + L_attn + L_sc. Then, the LoRA weights are updated according to the optimization objective, and the lower-level optimization and upper-level optimization steps are repeated until the following conditions are met: 1) the values of the first joint loss function and the second joint loss function are both lower than a preset threshold (e.g., 0.01); 2) within a preset number of iterations (e.g., 3 times), the external detection model does not detect the forgotten target concept in the results generated from random prompts. When the above conditions are met, the trained target video generation model is determined.
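The two stopping conditions can be written as a small predicate. The function name, the argument layout, and the representation of detector results as a list of booleans (True meaning the concept was detected in that round) are assumptions; the default values mirror the examples given in the text (0.01 and 3).

```python
def training_finished(first_joint, second_joint, detections,
                      threshold=0.01, clean_rounds=3):
    """Return True when both joint losses are below the threshold and the
    external detector found no target concept in the last `clean_rounds` checks."""
    losses_ok = first_joint < threshold and second_joint < threshold
    recent = detections[-clean_rounds:]
    detector_ok = len(recent) == clean_rounds and not any(recent)
    return losses_ok and detector_ok


# Losses are low and the last three random-prompt checks were clean.
done = training_finished(0.004, 0.008, [True, False, False, False])
# Losses are low but the concept reappeared in the most recent check.
not_done = training_finished(0.004, 0.008, [False, False, True])
# Detector is clean but the second joint loss is still above the threshold.
still_high = training_finished(0.004, 0.05, [False, False, False])
```

Requiring both conditions guards against a model whose losses look converged but which still produces the concept for prompts outside the training set.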

[0110] It should be noted that the training process of the target video generation model is shown in Figure 5. a) In the lower-level optimization, videos with and without the target concept (i.e., videos generated from the concept text and videos generated from empty text) are first passed through the network to compute L_esd. Simultaneously, the tensor F_attn of the cross-attention layer is extracted for the videos with the target concept, and the columns indexed by the target concept are summed to form L_attn. Only the LoRA weights Δθ are updated. b) The upper-level optimization first constructs synonym and irrelevant concept prompt pairs to obtain the retain set, and then calculates the LoRA regularization L_ir and the self-contrast loss L_sc. This aims to narrow the gap between the target concept and its synonyms, and to widen the gap between it and irrelevant concepts.

[0111] By inputting videos containing irrelevant concepts, the model can learn the features and context of those irrelevant concepts. By optimizing the second joint loss function, the model can retain its ability to generate irrelevant concepts while further optimizing the forgetting effect on the target concepts.

[0112] In some embodiments, the video content forgetting method based on two-level optimization further includes: when multiple target concepts need to be forgotten simultaneously, the LoRA weights corresponding to each target concept are linearly normalized to achieve parallel deletion of multiple concepts. Specifically: first, when multiple target concepts need to be forgotten simultaneously, for each target concept to be forgotten, its corresponding LoRA weights are trained separately. These weights are obtained through the above-mentioned two-level optimization method, enabling the model to forget its corresponding target concepts; then, the LoRA weights corresponding to multiple target concepts are linearly combined to obtain combined weights; then, after applying the combined weights, the above-mentioned two-level optimization method is used for training. Specifically, for each target concept, the global forgetting loss and local attention loss still need to be calculated; for irrelevant concepts, the irrelevant concept retention loss and semantic self-comparison loss still need to be calculated. By optimizing these loss functions, the combined weights are further adjusted until all target concepts are forgotten and the generation ability of irrelevant concepts is preserved.
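The linear combination step can be sketched as a normalized weighted sum of per-concept adapter updates. Two assumptions are made here: "linear normalization" is read as scaling the coefficients so that they sum to 1, and each adapter is represented by its merged update matrix per layer (the low-rank product) rather than by its two factors. The function name, layer name, and values are hypothetical.

```python
def combine_lora_updates(adapters, coefficients):
    """Combine per-concept LoRA update matrices with coefficients normalized to sum to 1."""
    total = sum(coefficients)
    weights = [c / total for c in coefficients]
    combined = {}
    for name in adapters[0]:
        rows = len(adapters[0][name])
        cols = len(adapters[0][name][0])
        combined[name] = [
            [sum(w * a[name][i][j] for w, a in zip(weights, adapters))
             for j in range(cols)]
            for i in range(rows)
        ]
    return combined


# Hypothetical updates for one layer, trained separately for two target concepts.
adapter_a = {"cross_attn.q": [[2.0, 0.0]]}
adapter_b = {"cross_attn.q": [[0.0, 4.0]]}

equal_mix = combine_lora_updates([adapter_a, adapter_b], [1.0, 1.0])
skewed_mix = combine_lora_updates([adapter_a, adapter_b], [3.0, 1.0])
```

The combined update serves only as the starting point; as described above, it is then trained further with the same two-level optimization, and the coefficients can be adjusted if the external detector still finds one of the concepts.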

[0113] It's important to note that during training, it's crucial to periodically verify whether the model has successfully forgotten all target concepts. This can be done by using an external detection model to examine the generated videos and check if they still contain the target concepts. If some target concepts are found to be incompletely forgotten, the combination of normalization weights can be adjusted appropriately, or the number of training iterations can be increased to ensure that all target concepts are forgotten.

[0114] By combining the LoRA weights corresponding to multiple target concepts through linear normalization to achieve parallel deletion of multiple concepts, the range of forgetting ability is expanded and the forgetting efficiency is improved. This not only maintains the stability and consistency of the model, enhances the accuracy of forgetting, but also improves the model's versatility and adaptability.

[0115] Step S103: Input the obtained text description containing the target forgetting concept into the target video generation model to generate a video that does not contain the target forgetting concept.

[0116] In some embodiments, the acquired text description containing the forgotten target concept is input into the target video generation model. The model internally encodes the text using its architecture (such as the Diffusion Transformer architecture of the Wan2.1 model) and combines it with a LoRA low-rank adapter for the generation process to obtain a video without the forgotten target concept. For example, suppose the forgotten target concept is "Pikachu," and the user-input text description is "A Pikachu is running on a golden beach." The model, trained with two levels of optimization, will forget the concept of "Pikachu," and the generated video will not contain the image of "Pikachu," but will retain other conceptual content such as "golden beach," generating a video without "Pikachu" but with a beach background.

[0117] It's important to note that the trained target video generation model effectively avoids outputting specified sensitive content. When a user inputs text containing sensitive concepts (such as specific brand logos, public figures, or copyrighted characters), the generation model automatically filters these elements and generates video content that meets the requirements.

[0118] It should be noted that the advantage of this application's embodiments lies in achieving precise forgetting of a specific concept during video generation, while ensuring that other concepts remain unaffected, maintaining consistency in the generated video's main body and good video coherence, while keeping the inference time consistent with the original model without adding additional overhead. It is worth clarifying that forgetting in the video generation model refers to causing the model to lose its ability to generate specific concepts, rather than deleting specific concepts from the video through video editing. This invention integrates some techniques commonly used in the field of image generation forgetting into video generation. Based on our newly constructed paired dataset and LoRA lightweight parameterization, it achieves efficient, high-quality, robust, and zero-inference-overhead concept deletion, providing a general solution for the safety and compliance of video generation models.

[0119] This application's embodiments construct a target forgotten concept training set and an irrelevant concept training set to facilitate subsequent model learning how to identify and suppress the target forgotten concept while retaining the generation of other irrelevant concepts. By inputting the target forgotten concept training set, the model can learn the features and context of the target concept. By optimizing the first joint loss function, the model can gradually suppress the generation ability of the target concept while retaining the generation ability of other concepts. By inputting the irrelevant concept training set into the second video generation model, the model can learn the features and context of irrelevant concepts. By optimizing the second joint loss function, the model can retain the generation ability of irrelevant concepts while further optimizing the forgetting effect of the target concept. By focusing on forgetting the target concept in the lower layer optimization and focusing on retaining irrelevant concepts in the upper layer optimization, the model can maintain the generation ability of other concepts while forgetting the target concept through alternating optimization of these two levels. By inputting text descriptions containing the target forgotten concept into the target video generation model, the target concept can be intelligently filtered out. Even if the input text description contains the target concept, the model can still generate a video without the target concept. Compared with the prior art, this application can achieve accurate forgetting of specific concepts in video generation.

[0120] As shown in Figure 6, corresponding apparatus embodiments are provided based on the above method embodiments.

[0121] One embodiment of the present invention provides a video content forgetting system based on dual-level optimization, comprising: an acquisition module 100, a training module 200, and a generation module 300;

[0122] The acquisition module 100 is used to construct a training set of target forgotten concepts and a training set of irrelevant concepts;

[0123] The training module 200 is used to input the target forgotten concept training set into a preset first video generation model, determine a first joint loss function based on the output of the first model, and perform lower-level optimization based on the first joint loss function to obtain a second video generation model. The irrelevant concept training set is input into the second video generation model, a second joint loss function is determined based on the output of the second model, and upper-level optimization is performed based on the second joint loss function until both the first joint loss function and the second joint loss function meet preset conditions, thereby determining the trained target video generation model. The first video generation model is obtained by inserting a LoRA low-rank adapter into the cross-attention layer of the pre-trained initial video generation model and freezing the model parameters of the initial video generation model. In each iteration, the LoRA weights are updated according to the first joint loss function and the second joint loss function.

[0124] The generation module 300 is used to input the acquired text description containing the target forgetting concept into the target video generation model to generate a video that does not contain the target forgetting concept.

[0125] It is understood that the above-described device embodiments correspond to the method embodiments of the present invention, and can implement the video content forgetting method based on two-level optimization provided by any of the above-described method embodiments of the present invention.

[0126] It should be noted that the device embodiments described above are merely illustrative, and some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Furthermore, in the accompanying drawings of the device embodiments provided by this invention, the connection relationships between modules indicate that they have communication connections, which can specifically be implemented as one or more communication buses or signal lines. Those skilled in the art can understand and implement this without any creative effort.

[0127] Based on the above embodiments of the video content forgetting method based on dual-level optimization, another embodiment of the present invention provides a terminal device, which includes a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor. When the processor executes the computer program, it implements the video content forgetting method based on dual-level optimization of any embodiment of the present invention.

[0128] For example, in this embodiment, the computer program can be divided into one or more modules, which are stored in the memory and executed by the processor to complete the present invention. The one or more modules may be a series of computer program instruction segments capable of performing a specific function, which describe the execution process of the computer program in the terminal device.

[0129] The terminal device may be a desktop computer, laptop, handheld computer, or cloud server, etc. The terminal device may include, but is not limited to, a processor and a memory.

[0130] The processor can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor can be a microprocessor or any conventional processor. The processor is the control center of the terminal device, connecting all parts of the terminal device via various interfaces and lines.

[0131] Based on the above-described method embodiments, another embodiment of the present invention provides a computer-readable storage medium including a stored computer program, wherein, when the computer program is executed, it controls the device where the computer-readable storage medium is located to execute the video content forgetting method based on two-level optimization described in any of the above-described method embodiments of the present invention.

[0132] The modules / units integrated in the device / terminal equipment, if implemented as software functional units and sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. The computer-readable medium can include: any entity or device capable of carrying the computer program code, recording media, USB flash drives, portable hard drives, magnetic disks, optical disks, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, etc.

[0133] The above description represents the preferred embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principles of the present invention, and these improvements and modifications are also considered to be within the scope of protection of the present invention.

Claims

1. A video content forgetting method based on two-level optimization, characterized in that it comprises: constructing a target forgetting concept training set and an irrelevant concept training set, wherein the target forgetting concept training set includes a first video containing the target forgetting concept and a second video not containing the target forgetting concept, paired under the same background, and the irrelevant concept training set includes a third video containing irrelevant concepts; inputting the target forgetting concept training set into a preset first video generation model, determining a first joint loss function based on the obtained first model output result, and performing lower-level optimization based on the first joint loss function to obtain a second video generation model; inputting the irrelevant concept training set into the second video generation model, determining a second joint loss function based on the obtained second model output result, and performing upper-level optimization based on the second joint loss function until both the first joint loss function and the second joint loss function satisfy preset conditions, thereby determining the trained target video generation model, wherein the first video generation model is obtained by inserting a LoRA low-rank adapter into the cross-attention layer of the pre-trained initial video generation model and freezing the model parameters of the initial video generation model, and in each iteration the LoRA weights are updated according to the first joint loss function and the second joint loss function; and inputting the obtained text description containing the target forgetting concept into the target video generation model to generate a video that does not contain the target forgetting concept.

2. The video content forgetting method based on dual-level optimization according to claim 1, characterized in that, The target forgetting concept training set is generated by the SkyReels-A2 model and includes a first video with the target forgetting concept and a second video without the target forgetting concept paired under the same background; the irrelevant concept training set is generated by the initial video generation model and includes a third video with irrelevant concepts.

3. The video content forgetting method based on dual-level optimization according to claim 2, characterized in that, The process involves inputting the target forgotten concept training set into a preset first video generation model, determining a first joint loss function based on the output of the first model, and performing lower-level optimization based on the first joint loss function to obtain a second video generation model. Specifically: The target forgetting concept training set is input into a preset first video generation model to obtain the output result of the first model; The global forgetting loss is calculated based on the output of the first model, wherein the global forgetting loss is obtained by comparing the output difference of the first video generation model for the first video containing the target forgetting concept and the second video not containing the target forgetting concept; Based on the attention matrix output by the cross-attention layer of the first video generation model, the local attention loss is calculated, wherein the local attention loss is the sum of the columns corresponding to the target concept in the attention matrix in the spatial and temporal dimensions; The first joint loss function is formed by combining the global forgetting loss and the local attention loss, and the LoRA weights are updated according to the first joint loss function to obtain the second video generation model.

4. The video content forgetting method based on dual-level optimization according to claim 3, characterized in that the formula for calculating the global forgetting loss is as follows: ; in the formula, L_esd is the global forgetting loss; E denotes the expectation; c_un is the text representing the concept to be forgotten; ∅ indicates empty text; x_t and x_t′ represent the paired video data in the dataset that respectively include and do not include the concept to be forgotten; θ_o represents the original model; Δθ represents the LoRA module; v represents the predicted target velocity of the diffusion model; t represents the time step of the diffusion model; and η is a hyperparameter controlling the degree of forgetting.

5. The video content forgetting method based on dual-level optimization according to claim 2, characterized in that the step of inputting the irrelevant concept training set into the second video generation model, determining the second joint loss function based on the obtained second model output result, and performing upper-level optimization based on the second joint loss function until both the first joint loss function and the second joint loss function satisfy preset conditions, thereby determining the trained target video generation model, specifically comprises: inputting the irrelevant concept training set into the second video generation model to obtain the second model output result; calculating the irrelevant concept retention loss based on the second model output result, wherein the irrelevant concept retention loss is obtained by comparing the difference between the irrelevant concept video and the model prediction output; calculating the semantic self-comparison loss based on the cross-attention matrix output by the second model, wherein the semantic self-comparison loss is obtained by comparing the similarity of the target concept attention vector, the irrelevant concept attention vector, and the synonym attention vector; and combining the irrelevant concept retention loss and the semantic self-comparison loss into a second joint loss function, and updating the LoRA weights based on the second joint loss function until the values of the first joint loss function and the second joint loss function are both lower than a preset threshold and no forgotten target concept is detected in the random prompt generation results within a preset number of times, thereby determining the trained target video generation model.

6. The video content forgetting method based on dual-level optimization according to claim 5, characterized in that the semantic self-comparison loss is expressed as follows: [formula not reproduced in this text]; where the terms of the formula denote, in order: the semantic self-comparison loss; the target concept attention vector; the irrelevant concept attention vector; and the synonym attention vector.
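Because the claimed expression is not reproduced in this text, the following is only an illustrative sketch of how a loss could compare the similarity of the three attention vectors. It is an assumption, not the formula of the claim: it uses cosine similarity with a softmax-style contrast, treats the synonym vector as the positive and the irrelevant vector as the negative for the target vector, and introduces a hypothetical `temperature` parameter.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def semantic_contrast_loss(
    target: Sequence[float],
    irrelevant: Sequence[float],
    synonym: Sequence[float],
    temperature: float = 0.1,
) -> float:
    """Illustrative contrastive loss; NOT the expression of the claim.

    Assumption: the loss is small when the target attention vector is
    close to the synonym vector and far from the irrelevant vector.
    """
    pos = math.exp(cosine(target, synonym) / temperature)
    neg = math.exp(cosine(target, irrelevant) / temperature)
    return -math.log(pos / (pos + neg))
```

Under this assumed form, a misspelled or synonymous prompt whose attention pattern drifts away from the target concept raises the loss, while an irrelevant concept that starts to resemble the target also raises it.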

7. The video content forgetting method based on dual-level optimization according to claim 1, characterized in that the rank of the LoRA low-rank adapter is set to a value from 2 to 8, and the adapter is inserted only into the forward propagation path of the query projection matrix and the key projection matrix of the cross-attention layer in the initial video generation model; the remaining parameters of the initial video generation model remain frozen until the end of training.
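The adapter placement described in this claim can be sketched with a plain-Python linear layer. This is a minimal sketch under stated assumptions: the class name `LoRALinear`, the `alpha / rank` scaling, and the initialization (small Gaussian `A`, zero `B`) follow common LoRA practice and are not taken from the claim. Only the rank range of 2 to 8 and the frozen base weight come from the text.

```python
import random
from typing import List

Matrix = List[List[float]]


def transpose(m: Matrix) -> Matrix:
    return [list(row) for row in zip(*m)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply an (n x k) matrix by a (k x m) matrix."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update B @ A."""

    def __init__(self, weight: Matrix, rank: int = 4, alpha: float = 4.0, seed: int = 0):
        if not 2 <= rank <= 8:
            raise ValueError("rank is expected to lie in the claimed range 2 to 8")
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen: this matrix is never modified
        # A gets a small random init; B starts at zero so the adapted layer
        # initially reproduces the frozen layer exactly (assumed convention).
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank  # assumed scaling, not stated in the claim

    def forward(self, x: Matrix) -> Matrix:
        """x has shape (n_tokens, d_in); the result has shape (n_tokens, d_out)."""
        base = matmul(x, transpose(self.weight))
        delta = matmul(matmul(x, transpose(self.A)), transpose(self.B))
        return [[b + self.scale * d for b, d in zip(b_row, d_row)]
                for b_row, d_row in zip(base, delta)]
```

In a cross-attention layer, one such wrapper would replace the query projection and another the key projection, while the value and output projections stay as plain frozen layers. During training only `A` and `B` would receive gradient updates.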

8. The video content forgetting method based on dual-level optimization according to claim 1, characterized in that the method further comprises: when multiple target concepts need to be forgotten simultaneously, combining the LoRA weights corresponding to each target concept by linear normalization to achieve parallel deletion of the multiple concepts.
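The claim does not define "linear normalization" further. One plausible reading, shown here purely as an assumption, is a weighted sum of the per-concept LoRA weight updates whose coefficients are rescaled to sum to 1. The function name `merge_lora_deltas` and the coefficient argument are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def merge_lora_deltas(deltas: Sequence[Matrix], coeffs: Sequence[float]) -> Matrix:
    """Combine per-concept weight updates with coefficients normalized to sum to 1.

    Assumed interpretation of "linear normalization"; each entry of `deltas`
    is the full low-rank update (B @ A) learned for one target concept.
    """
    if not deltas or len(deltas) != len(coeffs):
        raise ValueError("need one coefficient per LoRA update")
    total = sum(coeffs)
    if total <= 0:
        raise ValueError("coefficients must have a positive sum")
    weights = [c / total for c in coeffs]  # linear normalization step
    rows, cols = len(deltas[0]), len(deltas[0][0])
    merged = [[0.0] * cols for _ in range(rows)]
    for w, delta in zip(weights, deltas):
        for i in range(rows):
            for j in range(cols):
                merged[i][j] += w * delta[i][j]
    return merged
```

With equal coefficients this reduces to a plain average of the per-concept updates, so no single concept's adapter dominates the merged weights.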

9. A video content forgetting system based on dual-level optimization, characterized by comprising: an acquisition module, a training module, and a generation module; the acquisition module is used to construct a target forgotten concept training set and an irrelevant concept training set, wherein the target forgotten concept training set includes a first video containing the target forgotten concept and a second video not containing the target forgotten concept, the two videos being paired under the same background, and the irrelevant concept training set includes a third video containing irrelevant concepts; the training module is used to input the target forgotten concept training set into a preset first video generation model, determine a first joint loss function based on the first model output, and perform lower-level optimization based on the first joint loss function to obtain a second video generation model, and to input the irrelevant concept training set into the second video generation model, determine a second joint loss function based on the second model output, and perform upper-level optimization based on the second joint loss function until both the first joint loss function and the second joint loss function satisfy preset conditions, thereby determining the trained target video generation model; wherein the first video generation model is obtained by inserting a LoRA low-rank adapter into the cross-attention layer of a pre-trained initial video generation model and freezing the model parameters of the initial video generation model, and in each iteration the LoRA weights are updated according to the first joint loss function and the second joint loss function; the generation module is used to input an acquired text description containing the target forgotten concept into the target video generation model to generate a video that does not contain the target forgotten concept.
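The alternation performed by the training module can be sketched as a loop that applies a lower-level update from the first joint loss and then an upper-level update from the second joint loss in each iteration. This is a toy illustration under stated assumptions: the LoRA weights are a short list of floats, each loss is a hypothetical callable returning its value and gradient, plain gradient descent is used, and a single shared threshold stands in for the "preset conditions".

```python
from typing import Callable, List, Tuple

Vector = List[float]
LossAndGrad = Callable[[Vector], Tuple[float, Vector]]


def sgd_step(weights: Vector, grad: Vector, lr: float) -> Vector:
    return [w - lr * g for w, g in zip(weights, grad)]


def dual_level_train(
    lora_weights: Vector,
    first_joint: LossAndGrad,   # forgetting-side loss (lower level)
    second_joint: LossAndGrad,  # retention-side loss (upper level)
    lr: float = 0.1,
    threshold: float = 1e-3,
    max_iters: int = 1000,
) -> Vector:
    """Alternate lower-level and upper-level updates of the LoRA weights."""
    for _ in range(max_iters):
        # Lower level: update with the gradient of the first joint loss.
        _, g1 = first_joint(lora_weights)
        lora_weights = sgd_step(lora_weights, g1, lr)
        # Upper level: update with the gradient of the second joint loss.
        _, g2 = second_joint(lora_weights)
        lora_weights = sgd_step(lora_weights, g2, lr)
        # Stop once both losses are below the assumed shared threshold.
        loss1, _ = first_joint(lora_weights)
        loss2, _ = second_joint(lora_weights)
        if loss1 < threshold and loss2 < threshold:
            break
    return lora_weights


# Hypothetical quadratic stand-ins for the two joint losses.
def toy_first(w: Vector) -> Tuple[float, Vector]:
    return (w[0] - 1.0) ** 2, [2.0 * (w[0] - 1.0), 0.0]


def toy_second(w: Vector) -> Tuple[float, Vector]:
    return (w[1] + 2.0) ** 2, [0.0, 2.0 * (w[1] + 2.0)]
```

Calling `dual_level_train([0.0, 0.0], toy_first, toy_second)` drives the two toy losses below the threshold. In a real system the two callables would evaluate the frozen video model with the adapter on the target forgotten concept training set and the irrelevant concept training set respectively.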

10. A computer-readable storage medium, characterized by comprising a stored computer program, wherein the computer program, when executed, controls a device on which the computer-readable storage medium is located to perform the steps of the video content forgetting method based on dual-level optimization according to any one of claims 1-8.