A method and system for maintaining character consistency

By acquiring a reference image set of the target character for feature extraction and enhancement, and combining the denoising process of the diffusion model with feature deviation detection, the problems of inconsistent image and low quality in AIGC character generation are solved, achieving consistency in character image and improvement in generation quality.

CN122289482A · Pending · Publication Date: 2026-06-26 · HUNAN QIANBO TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HUNAN QIANBO TECH CO LTD
Filing Date
2026-06-01
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing AIGC character generation technology suffers from problems such as identity deviation, component shape deformation, character referencing confusion, and temporal drift of identity features in continuous frame generation, resulting in inconsistent and low-quality generated content.

Method used

Feature extraction is performed by acquiring a reference image set of the target character, and a steady-state dataset of the character is generated by combining a preset feature enhancement mechanism. A diffusion model is used to constrain and guide the denoising process, and feature deviation detection and user feedback are combined to optimize the consistency and quality of the generated images.

Benefits of technology

It effectively solves the problems of inconsistent character images and low quality in AIGC generated content, ensuring that the generated image fits the steady-state requirements of the character in a single generation and maintains the consistency of the character image in continuous frame generation, thereby improving the generation accuracy and stability.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention provides a method and system for maintaining character image consistency, comprising: acquiring a target character reference image set and performing feature extraction and enhancement; generating a character steady-state dataset; acquiring a single text prompt word; combining the character steady-state dataset; acquiring and applying constraints and guiding corrections to the diffusion model denoising process based on component constraint strength vectors and character condition vectors; generating an initial generated image; performing feature deviation detection and correction on the initial generated image; generating an optimized generated image; acquiring and parsing a continuous text prompt word sequence and a multi-character steady-state dataset; generating an image consistency constraint set; combining the diffusion model; generating a character-consistent image sequence; acquiring and parsing user feedback and historical generation data corresponding to the character-consistent image sequence; generating and optimizing the character image consistency maintenance system based on the feedback optimization dataset, thereby ensuring the consistency of character images in AIGC generated content.

Description

Technical Field

[0001] This invention relates to the field of artificial intelligence technology, and more specifically, to a method and system for maintaining character image consistency.

Background Technology

[0002] With the rapid development of AI-generated content technology, the application of generating character images and continuous frame sequences based on text prompts is becoming increasingly widespread. It is widely used in animation production, short video generation, virtual character display and high-tech video fields, covering typical high-tech video application scenarios such as ultra-high-definition video, immersive video, VR / AR virtual character creation, and interactive video character adaptation.

[0003] Current AIGC character generation technology still has certain limitations. First, single-shot character generation is prone to issues such as identity deviation and component shape deformation. Second, continuous frame generation often suffers from character referencing confusion and temporal drift of identity features. These issues result in inconsistent character appearances and low-quality generated content.

Summary of the Invention

[0004] In view of the aforementioned problems of inconsistent character appearances and low quality of generated content, in conjunction with the first aspect of the present invention, embodiments of the present invention provide a method for maintaining character appearance consistency, comprising:

[0005] S1: Obtain the target character reference image set and extract features, and combine it with a preset feature enhancement mechanism to enhance features and generate a character steady-state dataset. The character steady-state dataset contains at least an initial hybrid anchor vector and a component family set.

[0006] S2: Obtain a single text prompt word, and in conjunction with the character steady-state dataset, obtain and apply constraints and guide corrections to the diffusion model denoising process based on the component constraint strength vector and the character condition vector to obtain the initial generated image;

[0007] S3: Based on the initial generated image and the character steady-state dataset, perform feature deviation detection, and correct the initial generated image according to the feature deviation detection results to generate an optimized generated image;

[0008] S4: Obtain and parse the continuous text prompt word sequence and the multi-character steady-state dataset, generate the image consistency constraint set, and output the character-consistent image sequence by combining the process of steps S2-S3;

[0009] S5: Obtain and parse user feedback and historical generation data corresponding to the character-consistent image sequence, generate a feedback optimization dataset, and optimize the character image consistency maintenance system based on the feedback optimization dataset.

[0010] Furthermore, embodiments of the present invention also provide a character image consistency maintenance system, comprising:

[0011] The feature extraction module is used to extract features from the target character reference image set and enhance the features in combination with a preset feature enhancement mechanism to generate a character steady-state dataset.

[0012] An image generation module is used to apply constraints and guide the correction of the denoising process of the diffusion model based on a single text prompt word and a character steady-state dataset to obtain an initial generated image.

[0013] An image correction module is used to detect and correct feature deviations in the initial generated image to generate an optimized generated image.

[0014] The sequence generation module is used to parse the continuous text prompt word sequence and the multi-character steady-state dataset and, in combination with the process of steps S2-S3, generate a character-consistent image sequence.

[0015] The feedback optimization module is used to parse user feedback data and historical generated data, generate and optimize the corresponding modules of the system based on the feedback optimization dataset.

[0016] Compared with the prior art, the present invention has the following beneficial effects:

[0017] Step S1 establishes a unified character constraint standard and defines the hybrid anchor and the component family library, providing a benchmark for subsequent generation. Step S2 integrates character identity features into the generation process so that each generated image conforms to the character's steady-state requirements. Step S3 performs steady-state judgment and correction: it inspects the initial generated image, identifies the root causes of non-compliance, and makes targeted corrections to further improve the quality of each generation. Step S4 resolves character references to analyze the character associations contained in a continuous frame generation scene, and detects and intervenes in real time to keep the character image consistent across continuous frames. Step S5 combines user feedback and historical data for feedback optimization to continuously improve generation accuracy and stability. Acting together, these steps effectively solve the problems of inconsistent character images and low generation quality in AIGC-generated content.

[0018] This invention can be applied to animation production, short video generation, virtual character display, and high-tech video fields, covering typical high-tech video application scenarios such as ultra-high-definition video, immersive video, VR / AR virtual character creation, and interactive video character adaptation.

Attached Figure Description

[0019] Figure 1 is a flowchart of the steps of a method for maintaining character image consistency according to the present invention;

[0020] Figure 2 is a schematic diagram of a character image consistency maintenance system according to the present invention.

Detailed Implementation

[0021] The present invention will now be described in detail with reference to the accompanying drawings. Figure 1 is a flowchart of the steps of a method for maintaining character image consistency according to the present invention. The method is described in detail below.

[0022] Step S1: Obtain the target character reference image set and extract its features, and combine it with a preset feature enhancement mechanism to enhance the features and generate a character steady-state dataset.

[0023] Specifically, step S1 includes steps S11-S14:

[0024] Step S11: Acquire reference images of the target character from multiple perspectives, poses, and lighting conditions, and combine them with the character type label corresponding to the target character to construct a reference image set for the target character. Specifically:

[0025] The reference image set should contain no fewer than 10 images. If the character features are complex, such as a cartoon character with many details or a realistic character with rich facial features, the number of images should be increased to 15 to 20 to ensure comprehensive coverage of all the character's features.

[0026] The reference images should include at least three perspectives: front, side, and 45° angle, to ensure that the character's three-dimensional outline features are fully captured.

[0027] The reference image should include at least a variety of facial expressions such as neutral, smiling, frowning, and surprised, as well as various postures such as standing, sitting, and walking, in order to avoid deviations in the mixed anchor points caused by a single facial expression or posture.

[0028] Reference images should cover environmental images under different lighting conditions, such as natural light, strong outdoor light, backlight, and low light, so as to reduce the interference of lighting factors on feature extraction.

[0029] The character type label is an optional input used to match the appropriate pre-trained visual encoder branch, thereby improving the accuracy of feature extraction. Specifically:

[0030] Realistic figures use a face recognition-specific pre-trained visual encoder, cartoon characters use a cartoon image-specific pre-trained visual encoder, and anthropomorphic animals use a corresponding specific pre-trained visual encoder.

[0031] If no role type label is entered, a general-purpose pre-trained visual encoder such as DINOv2, CLIP, or ResNet-50 will be used by default to adapt to the feature extraction of various roles.
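The branch selection described in [0029] to [0031] amounts to a lookup with a general-purpose fallback. The following is a minimal sketch only: the branch identifiers and the `select_encoder_branch` helper are hypothetical names chosen for illustration, and treating an unrecognized label like a missing label is an assumption not stated in the text.

```python
from typing import Optional

# Hypothetical branch identifiers; the text describes the branches only by purpose.
SPECIALIZED_BRANCHES = {
    "realistic": "face-recognition-encoder",
    "cartoon": "cartoon-image-encoder",
    "anthropomorphic_animal": "anthropomorphic-animal-encoder",
}

# Stands for a general-purpose encoder such as DINOv2, CLIP, or ResNet-50.
GENERAL_BRANCH = "general-encoder"


def select_encoder_branch(character_type: Optional[str]) -> str:
    """Return the encoder branch for an optional character type label."""
    if character_type is None:
        return GENERAL_BRANCH
    # Assumption: unknown labels also fall back to the general branch.
    return SPECIALIZED_BRANCHES.get(character_type, GENERAL_BRANCH)
```

Because the label is optional, callers can pass `None` and still obtain a usable encoder branch.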

[0032] The training content of each pre-trained visual encoder branch is set as follows:

[0033] The face-recognition-specific pre-trained visual encoder uses realistic face datasets such as CelebA and LFW as training samples, focusing on the ability to extract core identity features such as facial contours, facial feature points, and skin texture. The model parameters are optimized through contrastive learning tasks so that the unique facial features of realistic people can be accurately captured.

[0034] The cartoon image specialization pre-trained visual encoder uses cartoon datasets such as CartoonSet and AnimeFace as training samples. It is specifically trained on features such as the line outline, color style, and exaggerated facial features of cartoon characters to enhance the recognition and extraction accuracy of the core features of cartoon characters.

[0035] The anthropomorphic animal-specific pre-trained visual encoder uses anthropomorphic animal datasets such as AnimalFace and Pet3K as training samples, focusing on the core features of anthropomorphic animals such as facial features, body proportions, and hair texture. Through feature alignment training, it achieves accurate extraction of the identity features of anthropomorphic animal characters.

[0036] Step S12: Use a pre-trained visual encoder to extract features from each reference image in the target character reference image set, and take the mean of all extracted feature vectors as the initial hybrid anchor vector. Specifically:

[0037] Each reference image $I_i$ undergoes preprocessing such as size normalization, and the preprocessed reference image is input into the visual encoder. After the convolution, pooling, and fully connected operations of the visual encoder, a d-dimensional hybrid feature vector $f_i$ is output. This hybrid feature vector contains various information such as the character's identity, posture, expression, lighting, and background.

[0038] The hybrid feature vectors $f_i$ of all reference images are averaged to obtain the initial hybrid anchor vector $a_0$, that is, $a_0 = \frac{1}{N}\sum_{i=1}^{N} f_i$, where $N$ is the number of reference images.

[0039] For example, a pre-trained DINOv2-base model is chosen as the general visual encoder, with output feature dimension d = 768. First, the 15 images in the reference image set are preprocessed: each is resized to 224×224 pixels and its pixel values are normalized to the [0,1] range. Then, the 15 preprocessed images are sequentially input into the visual encoder and processed by the model's 12 Transformer encoder layers, which output $f_1$ to $f_{15}$, 15 hybrid feature vectors of 768 dimensions each. Finally, the mean of the 15 hybrid feature vectors is calculated to obtain the initial 768-dimensional hybrid anchor vector $a_0$.
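The averaging in step S12 can be sketched as follows. This is an illustration only: random vectors stand in for the encoder outputs, and the function name `initial_hybrid_anchor` is a hypothetical choice, not part of the source.

```python
import numpy as np


def initial_hybrid_anchor(features: np.ndarray) -> np.ndarray:
    """Average N feature vectors of shape (N, d) into one anchor of shape (d,)."""
    if features.ndim != 2 or features.shape[0] == 0:
        raise ValueError("expected a non-empty (N, d) array of feature vectors")
    return features.mean(axis=0)


# Stand-in for encoder outputs: 15 reference images, d = 768 as in the example.
rng = np.random.default_rng(0)
hybrid_features = rng.normal(size=(15, 768))
anchor = initial_hybrid_anchor(hybrid_features)
print(anchor.shape)  # (768,)
```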

[0040] Step S13: Obtain the component family set and define the semantic mapping table.

[0041] Specifically, step S13 includes steps S131-S132:

[0042] Step S131: Using a pre-trained component parsing network, hierarchical component decomposition is performed on the target character reference image set to generate a component family set. Specifically:

[0043] First, a pre-trained component parsing network is used to parse representative images of the target character reference image set that are clear, complete, unobstructed, and pose-neutral. The images are decomposed into a hierarchical component family set C, and each component family is assigned an invariance level of 1 to 4. The annotation method is manual annotation or automatic annotation by a pre-trained classifier followed by manual verification.

[0044] Among them, the component parsing network needs to be selected according to the character type. For example, realistic characters can use human body parsing models such as HRNet and DeepLabV3+, while cartoon characters can use cartoon component parsing models such as Cartoon Parsing Net.

[0045] The component family is divided into primary and secondary components according to the principle of moving from the whole to the part to ensure hierarchy and integrity.

[0046] Invariance levels are divided into four levels based on their contribution to identifiability, specifically:

[0047] Level 1 is the core identity component and cannot be changed; Level 2 is the high-contribution component, whose core features must be maintained but which can be slightly adjusted; Level 3 is the general-contribution component and can be adjusted within limits; Level 4 is the low-contribution component and can be freely adjusted.

[0048] During the annotation process, spatial mask templates for each component are generated synchronously. The binary mask image is output and normalized through the component parsing network, and its size is consistent with the size of the reference image.

[0049] For example, for a realistic character A, HRNet is chosen as the component parsing network. First, a representative image of the character in a neutral pose with no occlusion and high definition under natural light is selected. Then, after parsing by the component parsing network, five primary components are obtained: facial features, body posture, hairstyle, clothing, and accessories. The facial features are further broken down into secondary components such as eyes, nose, and mouth. During annotation, a manual annotation method is used, with facial features labeled as level 1, body posture and hairstyle as level 2, clothing as level 3, and accessories as level 4.
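One possible in-memory form of the component family set for character A is sketched below. The `Component` class and its field names are hypothetical; the component names and invariance levels are those of the example, and secondary components are stored as plain names because the text assigns them no separate levels.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    name: str
    invariance_level: int  # 1 (core identity) to 4 (low contribution)
    secondary: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.invariance_level not in (1, 2, 3, 4):
            raise ValueError("invariance level must be 1, 2, 3, or 4")


character_a = [
    Component("facial features", 1, ["eyes", "nose", "mouth"]),
    Component("body posture", 2),
    Component("hairstyle", 2),
    Component("clothing", 3),
    Component("accessories", 4),
]
levels = [c.invariance_level for c in character_a]
```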

[0050] Step S132: Combining the component decomposition and invariance level annotation results of the role, a semantic mapping table is constructed. This semantic mapping table defines the correspondence between semantic labels for different scenarios and the constraint strength of each component. Specifically:

[0051] Combining the application scenarios of the characters and the invariance levels of the components, the semantic tags of the main scenarios in which the characters may appear are covered. For example, the scenario tags for realistic characters may include battle, dinner, daily life, sports, business, etc., while the scenario tags for cartoon characters may include fairy tale, adventure, school, family, etc.

[0052] The dimension of the component constraint strength vector equals the number of primary components in the component family set. Each element $s_k \in [0,1]$ indicates the constraint strength for the $k$-th primary component: the larger the value, the stronger the constraint and the smaller the component's range of variation; the smaller the value, the weaker the constraint and the larger the component's range of variation.

[0053] The constraint strength is set according to the invariance level of the component and the requirements of the scene semantics, following these principles:

[0054] For Level 1 components, the constraint strength is set to $s_k \geq 0.9$ to ensure that the core features of the component remain unchanged.

[0055] For Level 2 components, the constraint strength is set to $s_k \in [0.7, 0.9]$, with the value within this range selected according to the scene semantics corresponding to each scene semantic label, so that the essential characteristics of the component do not change.

[0056] For Level 3 components, the constraint strength is set to $s_k \in [0.3, 0.7]$, with the value within this range likewise selected according to the scene semantics corresponding to each scene semantic label.

[0057] For Level 4 components, the constraint strength is set to $s_k \in [0, 0.3]$ and can be freely adjusted according to the scene semantics, that is, almost no constraints are imposed.
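The per-level ranges above can be checked mechanically. The sketch below is illustrative: the helper name `out_of_range` is hypothetical, and the upper bound of 1.0 for Level 1 is inferred from the requirement that every element lies in [0,1].

```python
from typing import List, Tuple

# Allowed constraint-strength interval per invariance level.
LEVEL_RANGES = {
    1: (0.9, 1.0),
    2: (0.7, 0.9),
    3: (0.3, 0.7),
    4: (0.0, 0.3),
}


def out_of_range(levels: List[int], strengths: List[float]) -> List[Tuple[int, float]]:
    """Return (index, strength) pairs whose strength lies outside the level's interval."""
    if len(levels) != len(strengths):
        raise ValueError("levels and strengths must have the same length")
    violations = []
    for i, (level, s) in enumerate(zip(levels, strengths)):
        low, high = LEVEL_RANGES[level]
        if not (low <= s <= high):
            violations.append((i, s))
    return violations
```

For a hypothetical vector `[1.0, 0.8, 0.7, 0.3, 0.2]` with levels `[1, 2, 2, 3, 4]`, the function returns an empty list. Returning the violations instead of raising lets a reviewer decide whether a scene-specific exception is intended.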

[0058] Among them, the adjustment rules for level 3 components are that, under the premise that core attributes such as the overall style of clothing, the basic length of hairstyle, and the core type of accessories cannot be changed, some detailed parameters can be adjusted according to the constraint strength of the corresponding scene in the semantic mapping table, and the adjustment range is controlled within 30%. For example, clothing styles can be switched within the same style, hairstyle details can be trimmed but the overall outline remains unchanged, and accessories can be changed to different styles of the same type.

[0059] The specific operation process is as follows: first, extract the scene semantic tags, match the constraint strength, and determine the adjustment range; then, locate the specific component position based on the spatial mask template corresponding to each component and make local detail adjustments. After adjustment, the result must either pass manual review or show a similarity of at least 70% between the component before and after adjustment, to ensure that character recognition is not affected.

[0060] The adjustment rules for level 4 components are that there are no limits to the adjustment range. Components can be completely replaced, added, or deleted without retaining their original style, material, details, or other features.

[0061] The specific operation process is as follows: adjust the components directly according to the semantic tags of the scene. For example, accessories can be freely changed to any type according to the scene, clothing patterns can be completely replaced, and hairstyle details can be modified at will. In order to avoid adjustments that are seriously inconsistent with the overall image of the character, such as realistic characters paired with exaggerated cartoon accessories, after adjustment, sampling selection and manual review are required for quality screening.

[0062] For example, suppose that in the component family set of a realistic character A, facial features are a Level 1 component, body posture and hairstyle are Level 2 components, clothing is a Level 3 component, and accessories are a Level 4 component, and that the corresponding scene semantic tags include five common types: combat, dinner, daily life, sports, and business. Then the component constraint strength vector corresponding to each scene semantic tag can be represented as:

[0063] In the combat scenario, {tag: combat, component constraint strength vector: [1.0, 0.8, 0.7, 0.3, 0.2]}, the dimension of the component constraint strength vector is 5, corresponding to facial features, body posture, hairstyle, clothing, and accessories respectively. Among them, the constraint strength of facial features is 1.0, which is used to ensure that the identity characteristics remain absolutely unchanged; the constraint strength of body posture is 0.8, which allows for moderate adjustment of posture to adapt to the combat scenario while maintaining the essential characteristics of the body posture; the constraint strength of hairstyle is 0.7, which allows for adjustment of details such as bangs to maintain the overall outline; the constraint strength of clothing is 0.3, which can be freely adjusted to combat clothing to adapt to the needs of the scenario; and the constraint strength of accessories is 0.2, which can be freely adjusted to combat-related accessories such as helmets and protective gear.

[0064] Similarly, for the dinner scene, {label: dinner, component constraint strength vector: [0.95, 0.85, 0.8, 0.8, 0.5]}; for the daily scene, {label: daily, component constraint strength vector: [1.0, 0.9, 0.6, 0.6, 0.4]}; for the sports scene, {label: sports, component constraint strength vector: [1.0, 0.7, 0.5, 0.3, 0.1]}; for the business scene, {label: business, component constraint strength vector: [0.95, 0.85, 0.75, 0.75, 0.4]}.
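The five example entries can be held in a plain dictionary keyed by scene label. This is one possible representation, not a prescribed one; the names `SEMANTIC_MAPPING_TABLE` and `strength_by_component` are hypothetical, while the values are copied from the example above.

```python
from typing import Dict

# Order of the vector elements, as given in the combat example.
COMPONENTS = ["facial features", "body posture", "hairstyle", "clothing", "accessories"]

SEMANTIC_MAPPING_TABLE = {
    "combat": [1.0, 0.8, 0.7, 0.3, 0.2],
    "dinner": [0.95, 0.85, 0.8, 0.8, 0.5],
    "daily": [1.0, 0.9, 0.6, 0.6, 0.4],
    "sports": [1.0, 0.7, 0.5, 0.3, 0.1],
    "business": [0.95, 0.85, 0.75, 0.75, 0.4],
}


def strength_by_component(label: str) -> Dict[str, float]:
    """Pair each primary component with its constraint strength for one scene."""
    return dict(zip(COMPONENTS, SEMANTIC_MAPPING_TABLE[label]))
```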

[0065] Step S14: Construct a lightweight learnable network consisting of fully connected layers as a decoupling projector. Its input is a mixed feature vector. Through the nonlinear transformation of the network, the identity features and interference features in the mixed features are separated, and the output is a decoupled feature vector. Specifically:

[0066] The decoupling projector consists of 2 to 3 fully connected layers, each followed by an activation function. Its input and output dimensions are the same as the dimension of the mixed feature vector. The activation functions are chosen according to the decoupling requirements, and a combination of ReLU and Sigmoid is preferred. Specifically:

[0067] The ReLU activation function effectively alleviates the vanishing gradient problem and enhances the nonlinear expressive power of the network, making it suitable for intermediate layers; the Sigmoid activation function maps the output feature values to the [0,1] interval, making the decoupled features more stable, which makes it suitable for the output layer.
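A forward pass through such a projector can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the hidden width is taken equal to the input dimension (the text fixes only the input and output dimensions), weights use Xavier uniform initialization, which the description lists as one option for random initialization, and the class name `DecouplingProjector` is illustrative.

```python
import numpy as np


def xavier_uniform(rng: np.random.Generator, fan_in: int, fan_out: int) -> np.ndarray:
    """Xavier/Glorot uniform initialization for a (fan_in, fan_out) weight matrix."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


class DecouplingProjector:
    """Fully connected layers: ReLU on intermediate layers, Sigmoid on the output."""

    def __init__(self, dim: int, num_layers: int = 3, seed: int = 0) -> None:
        if num_layers not in (2, 3):
            raise ValueError("the description specifies 2 to 3 layers")
        rng = np.random.default_rng(seed)
        # Assumption: every layer keeps the same width as the input.
        self.weights = [xavier_uniform(rng, dim, dim) for _ in range(num_layers)]
        self.biases = [np.zeros(dim) for _ in range(num_layers)]

    def __call__(self, x: np.ndarray) -> np.ndarray:
        h = x
        last = len(self.weights) - 1
        for i, (w, b) in enumerate(zip(self.weights, self.biases)):
            h = h @ w + b
            if i < last:
                h = np.maximum(h, 0.0)        # ReLU on intermediate layers
            else:
                h = 1.0 / (1.0 + np.exp(-h))  # Sigmoid on the output layer
        return h
```

Calling `DecouplingProjector(dim=768)(f)` on a 768-dimensional mixed feature vector `f` returns a vector of the same dimension with every element in the [0,1] interval.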

[0068] There are two methods for initializing the weights of the decoupling projector: random initialization and small-scale pre-training.

[0069] Random initialization is suitable for scenarios with special character types and a lack of corresponding pre-trained datasets. Xavier random initialization or He random initialization can be used to ensure that the initial weight distribution is reasonable and to avoid gradient explosion or gradient vanishing problems during network training.

[0070] Small-scale pre-training is suitable for scenarios with relatively common character types, such as realistic figures and common cartoon characters. The decoupling projector is pre-trained using small-scale general face / character contrastive learning datasets such as CelebA and CartoonSet. Specifically:

[0071] A contrastive learning strategy is employed, using a small-scale, general face / character contrastive learning dataset as training samples. Images of the same character under different poses, expressions, and lighting conditions are used as positive sample pairs, while images of different characters are used as negative sample pairs. The sample images are input into the aforementioned visual encoder to extract mixed features, which are then input into the decoupling projector to obtain the decoupled features. Training proceeds by comparing the similarity of the decoupled features of positive and negative sample pairs;

[0072] During training, the total loss is a weighted sum of the InfoNCE loss and an L2 regularization loss. The InfoNCE loss serves as the core loss function: it constructs the loss value from the difference between the similarity of the decoupled features of positive sample pairs and the similarity of the decoupled features of negative sample pairs, guiding the decoupling projector to separate identity features from interference features. The L2 regularization loss constrains the network weight parameters and prevents overfitting.
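The weighted sum of InfoNCE and L2 regularization can be sketched for a single anchor sample as follows. The cosine similarity measure, the temperature of 0.1, and the L2 weight of 1e-4 are assumptions for illustration; the text does not specify them.

```python
import numpy as np


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE for one anchor: negative log-softmax of the positive's similarity."""
    def normalize(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    a = normalize(anchor)
    candidates = normalize(np.vstack([positive, negatives]))  # positive is row 0
    logits = candidates @ a / temperature                     # scaled cosine similarities
    logits = logits - logits.max()                            # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum())
    return float(-log_prob[0])


def total_loss(anchor, positive, negatives, weights, l2_weight=1e-4):
    """Weighted sum of InfoNCE and the squared L2 norm of the projector weights."""
    l2 = sum(float(np.sum(w ** 2)) for w in weights)
    return info_nce(anchor, positive, negatives) + l2_weight * l2
```

The loss is small when the decoupled feature of the positive sample is closer to the anchor than those of the negatives, which is the behavior the training is meant to reward.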

[0073] Step S2: Obtain a single text prompt word, and in conjunction with the character steady-state dataset, obtain and apply constraints and guide corrections to the diffusion model denoising process based on the component constraint strength vector and the character condition vector to obtain the initial generated image.

[0074] Specifically, step S2 includes steps S21-S24:

[0075] Step S21: Input a single text prompt word into a pre-trained text conditional diffusion model to obtain the text embedding vector and scene semantic label. Based on the scene semantic label, query the semantic mapping table to obtain the component constraint strength vector.

[0076] First, obtain the single text prompt word corresponding to a single frame image. The single text prompt word is represented as a natural language string and must contain at least the character name and scene information, such as "Character A is in a battle scene, holding a long sword, with a serious expression, and wearing a black combat suit".

[0077] If the text prompt does not contain scene information and has no historical data, the default scene is "daily", and the constraint strength vector corresponding to the daily scene is enabled.

[0078] Next, the input single text prompts are processed to remove redundancy and standardize them, eliminating meaningless words such as "good-looking" and "beautiful", and standardizing the expression format.

[0079] The preprocessed single text prompt word is input into the text encoder of the text conditional diffusion model to obtain the text embedding vector.

[0080] The text embedding vector is input into a lightweight text classifier to extract the scene semantic label. Based on the scene semantic label, the semantic mapping table is queried to obtain the component constraint strength vector.

[0081] It should be noted that if the scene semantic label extracted from the text prompt does not match any label in the semantic mapping table, the default constraint strength vector will be enabled.
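The query with fallback can be sketched as a dictionary lookup. Using the "daily" vector as the default for unmatched labels is an assumption: the text names "daily" as the default scene when scene information is missing, but does not say which vector serves as the default for unmatched labels. The function name and the two-entry table are illustrative.

```python
from typing import Dict, List, Optional

DEFAULT_SCENE = "daily"  # default scene named in the description


def lookup_constraint_vector(
    table: Dict[str, List[float]], scene_label: Optional[str]
) -> List[float]:
    """Return the constraint strength vector for a label, or the default vector."""
    if scene_label is None or scene_label not in table:
        # Assumption: missing and unmatched labels share the same default.
        return table[DEFAULT_SCENE]
    return table[scene_label]


table = {
    "daily": [1.0, 0.9, 0.6, 0.6, 0.4],
    "combat": [1.0, 0.8, 0.7, 0.3, 0.2],
}
```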

[0082] The training samples for the lightweight text classifier are paired data of manually labeled scene semantic labels and corresponding text embedding vectors. The training task is a multi-classification task, and the loss function is cross-entropy loss.

[0083] Step S22: Decouple the initial hybrid anchor vector contained in the character steady-state dataset by projection and conditional mapping to obtain the character conditional vector. Specifically:

[0084] The initial hybrid anchor vector is input into the decoupling projector constructed in step S14, which outputs a decoupled anchor vector.

[0085] The decoupled anchor vector is input into a linear projection network and linearly mapped to generate the character condition vector. For example, a fully connected layer can serve as the linear projection network.

[0086] Step S23: The text conditional diffusion model performs standard denoising guided by text and role conditions step by step according to the text embedding vector and role condition vector, and performs intermediate identity preservation intervention and correction during the denoising process.

[0087] Specifically, step S23 includes steps S231-S232:

[0088] Step S231 injects dual constraints into the denoising process based on the text embedding vector, role condition vector, component constraint strength vector, and corresponding spatial mask template. Specifically:

[0089] First, set the number of noise reduction steps. A linear noise scheduler is selected, the spatial mask template corresponding to the component is loaded, and it is normalized to the latent variables of the text conditional diffusion model. Generate a mask matrix with consistent size. .

[0090] Next, time-step denoising is performed, for each time step... Perform the following operations:

[0091] The character condition vector and the text embedding vector are fused by element-wise weighting to obtain the fusion condition vector.

[0092] The fusion condition vector, the embedding vector of the current time step $t$, and the current latent variable are input into the U-Net denoising network to predict the noise in the current latent variable.

[0093] Simultaneously, based on the component constraint strength vector, a guiding coefficient is calculated for the mask channel corresponding to each primary component: $g_k = \lambda \cdot s_k$, where $\lambda$ is the preset component guidance coefficient and $s_k$ is the component constraint strength corresponding to the $k$-th component.

[0094] It should be noted that $\lambda$ can be adjusted along three core dimensions: the invariance level of the components, the semantic requirements of the scene, and the balance of generation quality, with 1.0 as the baseline value. First, to match component invariance levels: Level 1 and Level 2 core components require stronger constraints, so $\lambda$ can be raised to [1.2, 1.5] to keep the form of the core components stable; Level 3 and Level 4 secondary components need scene adaptability, so $\lambda$ can be lowered to [0.5, 0.8] to leave more room for adjustment. Second, to match the semantic complexity of the scene: the more complex the scene semantics (for example a battle scene or an elaborate banquet scene), the finer the required component control and the higher $\lambda$ should be, i.e. raised to [1.0, 1.3] to avoid component deformation; for semantically simple scenes such as everyday leisure scenes, $\lambda$ can be reduced to [0.7, 1.0] to improve generation flexibility. Third, to balance generation quality and computational efficiency: too high a value leads to rigid component shapes and longer generation time, while too low a value easily causes component deformation and weaker identity recognition. The overall value range is therefore set to [0.5, 1.5].

[0095] The mask matrix and the guiding coefficients are multiplied channel by channel to obtain the component guidance mask. The component guidance mask is used to accurately locate the spatial position of each component, ensuring that the guidance acts only on the corresponding component area and does not affect other components.

[0096] The component guidance mask is multiplied by the predicted noise to obtain the component-guided noise. This allows differentiated noise suppression for different components: the noise suppression intensity is high for core components and low for secondary components.
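The guidance computation described above can be illustrated with a minimal sketch. It assumes the relation $g_k = \lambda \cdot s_k$ implied by the worked example (with $\lambda = 1.0$ the guiding coefficients equal the constraint strengths), and it assumes that the scaled mask channels are summed into a single guidance mask; the source does not state how the channels are combined. All function names are hypothetical, and plain nested lists stand in for latent tensors.

```python
from typing import List

Matrix = List[List[float]]


def guiding_coefficients(strengths: List[float], lam: float = 1.0) -> List[float]:
    """Guiding coefficient per component, assumed to be lam * strength."""
    return [lam * s for s in strengths]


def component_guidance_mask(masks: List[Matrix], coeffs: List[float]) -> Matrix:
    """Scale each mask channel by its coefficient, then sum the channels.

    Summing the channels is an assumption made for this sketch.
    """
    height, width = len(masks[0]), len(masks[0][0])
    out = [[0.0] * width for _ in range(height)]
    for mask, g in zip(masks, coeffs):
        for i in range(height):
            for j in range(width):
                out[i][j] += g * mask[i][j]
    return out


def guide_noise(noise: Matrix, guidance: Matrix) -> Matrix:
    """Element-wise product of the predicted noise and the guidance mask."""
    return [[n * m for n, m in zip(n_row, m_row)]
            for n_row, m_row in zip(noise, guidance)]


# Two toy 2x2 mask channels: a core component and a secondary component.
masks = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 1.0], [1.0, 1.0]],
]
coeffs = guiding_coefficients([0.95, 0.4], lam=1.0)
guidance = component_guidance_mask(masks, coeffs)
guided = guide_noise([[1.0, 1.0], [2.0, -1.0]], guidance)
```

In a real pipeline the masks and noise would be tensors with the same spatial size as the latent variable; the nested lists here only keep the sketch dependency-free.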

[0097] Based on the noise scheduler's update formula and the component-guided noise, the latent variable is updated:

[0098] That is, the latent variable of the next denoising step is computed from the current latent variable and the component-guided noise;

[0099] where $\alpha_t$ and $\bar{\alpha}_t$ are the preset parameters of the noise scheduler, specifically:

[0100] $\alpha_t$ and $\bar{\alpha}_t$ are preset as follows. First, the noise intensity sequence $\beta_t$ ($t \in [0, T]$) is preset using a linear scheduling strategy, increasing linearly from the initial value $\beta_{\text{start}}$ to the terminal value $\beta_{\text{end}}$ so that the noise intensity increases smoothly with the time step. Then $\alpha_t = 1 - \beta_t$ is calculated, which indicates the noise-reduction retention coefficient of step $t$. $\bar{\alpha}_t$ is the cumulative product of $\alpha_t$, i.e. $\bar{\alpha}_t = \prod_{i \le t} \alpha_i$, which measures the overall noise accumulation level from the initial time to step $t$. These parameters are preset and fixed for the entire denoising process and require no dynamic adjustment.
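A minimal sketch of the linear schedule described above follows. The endpoint values `beta_start=1e-4` and `beta_end=0.02` are assumptions borrowed from common diffusion-model practice, not values given in the source, and the relation $\alpha_t = 1 - \beta_t$ is the standard one assumed here.

```python
from typing import List, Tuple


def linear_schedule(
    num_steps: int,
    beta_start: float = 1e-4,  # assumed initial noise intensity
    beta_end: float = 0.02,    # assumed terminal noise intensity
) -> Tuple[List[float], List[float], List[float]]:
    """Return (betas, alphas, alpha_bars) for a linear noise schedule.

    Requires num_steps >= 2 so that the linear interpolation is defined.
    """
    betas = [
        beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        for t in range(num_steps)
    ]
    alphas = [1.0 - b for b in betas]
    alpha_bars = []
    running = 1.0
    for a in alphas:
        running *= a  # cumulative product of the retention coefficients
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


betas, alphas, alpha_bars = linear_schedule(50)
```

Because the three sequences depend only on the step count and the two endpoints, they can be computed once and reused for every generation, which matches the statement that they need no dynamic adjustment.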

[0101] For example, for character A in a business scenario, assume the number of denoising steps is $T = 50$, a linear noise scheduler is selected, and the preset component guidance coefficient is $\lambda = 1.0$. Assuming the component constraint strength vector corresponding to the 5 components of character A is [0.95, 0.85, 0.75, 0.75, 0.4], the corresponding guiding coefficients are 0.95, 0.85, 0.75, 0.75, and 0.4 respectively. The spatial mask templates of the 5 components are loaded and normalized to a size consistent with the latent variable, yielding a 5-channel mask matrix; multiplying the mask matrix by the guiding coefficients channel by channel yields the component guidance mask. At time step $t = 50$, the latent variable, the fusion condition vector, and the embedding vector of $t = 50$ are input into the U-Net to obtain the predicted noise; the component guidance mask is multiplied by the predicted noise to suppress noise in each component area; the latent variable of the next step is then calculated with the update formula. After denoising is complete, the final clean latent variable is obtained.

[0102] Step S232: Based on the latent variable of the current time step and the decoupling anchor vector, intermediate identity-preserving intervention and correction are performed during the denoising process, specifically:

[0103] First, when the time step $t$ satisfies $t \bmod K = 0$, where $\bmod$ denotes the modulo operation, and $t$ is no smaller than the specified lower bound, the current time step is determined to be a critical intervention time step. Here $K$ is a preset hyperparameter whose value is set based on user satisfaction with historically generated images and is no less than 7. For example, if the user satisfaction rate for historically generated images is 80%, meaning that 80% of users reported being satisfied with the images, $K$ can be set to $0.8 \times 10 = 8$; when $T = 50$ and $K = 7$, the critical intervention time steps are $t$ = 49, 42, 35, 28, 21, 14, 7;
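The selection rule above can be checked with a short sketch. It assumes the interval is obtained by rounding ten times the satisfaction rate and clamping it to the stated minimum of 7, and that time steps run from the total step count down to 1; both function names are hypothetical.

```python
from typing import List


def intervention_interval(satisfaction_rate: float, minimum: int = 7) -> int:
    """Interval K derived from the historical satisfaction rate, never below the minimum."""
    return max(minimum, round(satisfaction_rate * 10))


def critical_steps(num_steps: int, interval: int) -> List[int]:
    """Time steps, in denoising order (high to low), that are multiples of the interval."""
    return [t for t in range(num_steps, 0, -1) if t % interval == 0]


steps = critical_steps(50, 7)
```

With 50 denoising steps and an interval of 7 this reproduces the seven intervention points listed in the example, so the decode-and-check overhead is paid on 7 of the 50 steps only.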

[0104] If the current time step is a critical intervention time step, the subsequent intervention operation is executed; otherwise it is skipped and denoising continues.

[0105] Next, the latent variable at the critical intervention time step is input to the VAE decoder to generate an intermediate image;

[0106] The intermediate image is passed through the visual encoder mentioned in step S12 to extract mixed features, which are then input to the decoupling projector to obtain the intermediate decoupling features.

[0107] Subsequently, the identity distance $d_{id}$ between the intermediate decoupling features and the decoupling anchor vector is calculated using the Euclidean distance, i.e. the L2 norm of their difference.

[0108] If $d_{id} \le \tau_{id}$, where $\tau_{id}$ is the preset identity distance tolerance threshold, the current identity is stable and denoising continues.

[0109] If $d_{id} > \tau_{id}$, the current identity is unstable and a correction operation needs to be performed.

[0110] Next, the noise correction amount is calculated. The calculation uses a preset hyperparameter whose value is set according to $d_{id}$ in conjunction with a mapping table generated from historical data; for example, if the historical data show that the mean value of the hyperparameter is 0.05 for a given range of $d_{id}$, the hyperparameter is set to 0.05 for that range, and so on.

[0111] The noise correction amount is transformed into a noise vector with the same dimensions as the latent variable and used to correct the latent variable, yielding the corrected latent variable; the corrected latent variable then serves as the latent variable at the current time step, and the denoising process continues.

[0112] It should be noted that if, at a certain critical time step, $d_{id}$ is much larger than $\tau_{id}$ so that a single correction is insufficient, the correction operation can be performed multiple times, for example with a maximum of 3 repetitions.

[0113] $d_{id}$ is recalculated after each correction until $d_{id} \le \tau_{id}$. If the requirement still cannot be met after repeated corrections, the current generation process has deviated seriously; generation must be terminated immediately and the corresponding hyperparameters adjusted before regenerating, for example, increasing … and reducing ….
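The check-and-correct loop described above can be sketched as follows. The sketch works directly in feature space: the `correct` callback is a hypothetical stand-in for "apply the noise correction to the latent variable, decode it, and re-extract the decoupling features", since the exact correction formula is not recoverable from the source. The `halfway` function is a toy correction used only to exercise the loop.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def identity_distance(features: Vector, anchor: Vector) -> float:
    """Euclidean distance between decoupling features and the decoupling anchor vector."""
    return math.sqrt(sum((f - a) ** 2 for f, a in zip(features, anchor)))


def correct_until_stable(
    features: Vector,
    anchor: Vector,
    tolerance: float,
    correct: Callable[[Vector, Vector], Vector],
    max_repeats: int = 3,
) -> Tuple[Vector, bool]:
    """Apply up to max_repeats corrections; the flag reports whether identity is stable."""
    for _ in range(max_repeats):
        if identity_distance(features, anchor) <= tolerance:
            return features, True
        features = correct(features, anchor)
    # Final check after the last allowed correction.
    return features, identity_distance(features, anchor) <= tolerance


def halfway(features: Vector, anchor: Vector) -> Vector:
    """Toy correction (assumption): move the features halfway toward the anchor."""
    return [f + 0.5 * (a - f) for f, a in zip(features, anchor)]


anchor = [0.0, 0.0]
fixed, stable = correct_until_stable([3.0, 4.0], anchor, tolerance=1.0, correct=halfway)
_, still_unstable = correct_until_stable([3.0, 4.0], anchor, tolerance=0.5, correct=halfway)
```

A `False` flag corresponds to the case in which generation is terminated and the hyperparameters are adjusted before regenerating.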

[0114] Finally, after denoising is complete, the final clean latent variable is decoded to generate the initial generated image.

[0115] It should be noted that the training data for the text conditional diffusion model consists of image and text pairing data for various scenes and characters. This training data must include samples of characters under different postures, expressions, and lighting conditions, and match the corresponding text descriptions.

[0116] The classic denoising training strategy of the diffusion model is employed: starting from pure-noise latent variables, the model learns the denoising process step by step over the time steps. Text embedding vectors are injected to guide the model toward content that matches the text, and contrastive learning is used as auxiliary training to improve the matching accuracy between text and image. During training, a joint loss function combining mean squared error loss and contrastive loss is used, and an L2 regularization loss is introduced to prevent overfitting.

[0117] Step S3: Based on the initial generated image and the character steady-state dataset, perform feature deviation detection, and correct the initial generated image according to the feature deviation detection results to generate an optimized generated image.

[0118] Specifically, step S3 includes steps S31-S33:

[0119] Step S31, perform steady-state determination on the initially generated image, specifically:

[0120] First, the initial generated image is input into the visual encoder mentioned in step S12 for feature extraction, and the result is passed through the decoupling projector to obtain candidate decoupling features.

[0121] At the same time, the decoupling projector is used to perform decoupling projection on the initial hybrid anchor vector contained in the character steady-state dataset, obtaining the decoupling anchor vector.

[0122] Next, the identity distance $d_{id}$ between the candidate decoupling features and the decoupling anchor vector is obtained and compared with the preset identity distance tolerance threshold $\tau_{id}$:

[0123] If $d_{id} \le \tau_{id}$, the current identity is determined to be stable;

[0124] If $d_{id} > \tau_{id}$, the current identity is determined to be unstable;

[0125] Meanwhile, a pre-trained keypoint detection model is used to extract geometric features from the initial generated image and the reference image in the target character reference image set. The geometric features include at least the keypoint coordinates, contour dimensions, and proportional relationships of each component contained in the target character.

[0126] The keypoint detection model can select the corresponding keypoint detection model based on the character type label in the target character reference image set. For example, the MediaPipe human body keypoint detection model is used for realistic characters, and the cartoon keypoint detection model is used for cartoon characters.

[0127] The MediaPipe human keypoint detection model uses the COCO human keypoint dataset and the MPII human pose dataset as its core training sets, and is further trained using a custom dataset of realistic human characters with different poses, expressions, clothing, and lighting conditions, along with corresponding keypoint annotations. The training task combines keypoint regression and classification, aiming to accurately detect 17 core human keypoints and corresponding keypoints of character components. Its loss function uses a combination of mean squared error loss and cross-entropy loss to constrain the accuracy of keypoint coordinate regression and keypoint classification.

[0128] The cartoon keypoint detection model uses cartoon human keypoint datasets such as CartoonPose and AnimePose, and combines the target cartoon character type (2D / 3D, Q version / realistic cartoon) to construct a custom annotation dataset. The annotation content covers the key points of the core components of the cartoon character, such as facial features, torso and limbs, and clothing features.

[0129] The training objective is to accurately locate key points of cartoon character components; the loss function adopts smooth L1 loss to reduce the impact of abnormal annotation points on the training effect; data augmentation for cartoon style transfer is added during training to ensure that the model can adapt to character generation scenarios of different cartoon styles and accurately extract the geometric features of cartoon components.

[0130] The difference in geometric features between the initial generated image and the reference image is quantified using the chamfer distance to obtain the geometric distance $d_{geo}$, which is compared with the preset geometric distance tolerance threshold $\tau_{geo}$:

[0131] If $d_{geo} \le \tau_{geo}$, the geometric detection passes;

[0132] Otherwise, the geometric detection fails, indicating that component deformation exists.

[0133] The initial image that is determined to have a stable current identity and passes geometric detection is determined to be a steady-state compliant image;

[0134] Otherwise, it is judged as deviating from steady state and defined as a non-compliant image.
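The dual determination above can be sketched with a symmetric chamfer distance over 2D keypoints. The sketch assumes the mean nearest-neighbour form of the chamfer distance (the source does not say which variant is used) and assumes keypoints have already been extracted; all names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def _nearest(p: Point, points: List[Point]) -> float:
    """Distance from p to its nearest neighbour in points."""
    return min(math.dist(p, q) for q in points)


def chamfer_distance(a: List[Point], b: List[Point]) -> float:
    """Symmetric chamfer distance: mean nearest-neighbour distance in both directions."""
    forward = sum(_nearest(p, b) for p in a) / len(a)
    backward = sum(_nearest(q, a) for q in b) / len(b)
    return forward + backward


def is_steady_state(d_id: float, d_geo: float, tau_id: float, tau_geo: float) -> bool:
    """Compliant only if both the identity check and the geometric check pass."""
    return d_id <= tau_id and d_geo <= tau_geo


reference_keypoints = [(0.0, 0.0), (1.0, 0.0)]
generated_keypoints = [(0.0, 0.0), (1.0, 1.0)]
d_geo = chamfer_distance(generated_keypoints, reference_keypoints)
```

Identical keypoint sets give a distance of zero, so the geometric tolerance threshold directly bounds how far component keypoints may move away from the reference.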

[0135] Step S32: For the steady-state compliant image, perform post-processing operations such as deblurring, color correction, and detail enhancement to eliminate minor imperfections that may occur during the generation process, thereby improving image clarity, color uniformity, and detail richness; at the same time, record the key parameters of this generation, generate the final compliant image, and output it as the optimized generated image.

[0136] For example, for the steady-state compliant image of character A, a 3×3 Gaussian filter is used to remove blur and eliminate slight blurring; histogram equalization is used to correct the color, making the tone of the business scene more natural; the Laplacian operator is used to enhance details and highlight features such as suit folds and facial textures; key parameters such as the component constraint strength vector and denoising time step are recorded; finally, an optimized generated image is output that shows "character A in a business scene with stable identity features, clothing and posture adapted to the scene, and clear details".

[0137] Step S33: For non-compliant images, the problem is attributed to its cause based on the corresponding identity distance $d_{id}$ and geometric distance $d_{geo}$, and redrawing correction is performed. Specifically:

[0138] First, the problem is attributed to its cause by comparing the identity distance $d_{id}$ and the geometric distance $d_{geo}$ with their tolerance thresholds $\tau_{id}$ and $\tau_{geo}$:

[0139] If $d_{id} > \tau_{id}$ and $d_{geo} \le \tau_{geo}$, the identity constraint is judged to be insufficient; for example, the weight given to the character condition vector when it is fused with the text embedding vector in step S231 is too low, so the character identity gradually deviates during generation. In this case, the corresponding weight coefficient needs to be increased.

[0140] If $d_{id} \le \tau_{id}$ and $d_{geo} > \tau_{geo}$, the component constraint strength is judged to be insufficient; for example, the guiding coefficient of the corresponding component is too low, which leads to component deformation, and it needs to be increased accordingly;

[0141] If $d_{id} > \tau_{id}$ and $d_{geo} > \tau_{geo}$, the problem is judged to be insufficient dual constraints: both the identity constraint and the corresponding component constraint are insufficient, and both need to be adjusted simultaneously.

[0142] Next, after adjusting the corresponding parameters, the generation process of step S2 is re-executed based on the adjusted parameters to generate a new initial generated image.

[0143] Next, the new initially generated image is subjected to the steady-state determination process of step S31. If it is determined to be a steady-state compliant image, the process of step S32 is carried out; if it is still determined to be a non-compliant image, the above problem attribution and redrawing correction process is repeated until it is determined to be a steady-state compliant image or the preset redrawing correction limit is reached.

[0144] If the maximum number of redraws is reached and a compliant image is still not obtained, an error message will be output and the process will be terminated.
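The attribution and redraw loop of step S33 can be sketched as below. The `Params` fields, the fixed adjustment step of 0.1, and the `measure` callback (a stand-in for regenerating the image with step S2 and measuring both distances with step S31) are all assumptions for illustration; the cap of 1.5 follows the value range stated for the component guidance coefficient.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Params:
    identity_weight: float  # fusion weight of the character condition vector (assumed name)
    guidance_coeff: float   # preset component guidance coefficient


def attribute(d_id: float, d_geo: float, tau_id: float, tau_geo: float) -> str:
    """Map the two distance checks to a cause label."""
    id_bad, geo_bad = d_id > tau_id, d_geo > tau_geo
    if id_bad and geo_bad:
        return "dual"
    if id_bad:
        return "identity"
    if geo_bad:
        return "component"
    return "compliant"


def adjust(params: Params, cause: str, step: float = 0.1) -> Params:
    """Raise whichever constraint the attribution found insufficient."""
    if cause in ("identity", "dual"):
        params = replace(params, identity_weight=params.identity_weight + step)
    if cause in ("component", "dual"):
        params = replace(params, guidance_coeff=min(1.5, params.guidance_coeff + step))
    return params


def redraw_until_compliant(
    params: Params,
    measure: Callable[[Params], Tuple[float, float]],
    tau_id: float,
    tau_geo: float,
    max_redraws: int = 3,
) -> Optional[Params]:
    """Return the parameters that produced a compliant image, or None if the limit is hit."""
    for _ in range(max_redraws + 1):  # one initial attempt plus the allowed redraws
        d_id, d_geo = measure(params)
        cause = attribute(d_id, d_geo, tau_id, tau_geo)
        if cause == "compliant":
            return params
        params = adjust(params, cause)
    return None


def toy_measure(p: Params) -> Tuple[float, float]:
    """Toy model (assumption): both distances shrink as the constraints grow."""
    return 1.0 - p.identity_weight, 1.5 - p.guidance_coeff


result = redraw_until_compliant(Params(0.5, 1.0), toy_measure, tau_id=0.35, tau_geo=0.35)
failed = redraw_until_compliant(Params(0.5, 1.0), toy_measure, 0.35, 0.35, max_redraws=1)
```

A `None` result corresponds to the case in which an error message is output and the process is terminated.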

[0145] Step S4: Obtain and parse the continuous text prompt word sequence and the multi-role steady-state dataset to generate an image consistency constraint set. Combined with the process of steps S2-S3, output the role-consistent image sequence.

[0146] Specifically, step S4 includes steps S41-S43:

[0147] Step S41: Input the continuous text prompt sequence into the pre-trained referential resolution model, and label the referential type and confidence of each character in each frame of prompts. Specifically:

[0148] First, for a continuous text prompt sequence, remove meaningless modifiers such as "good-looking," "handsome," and "very" from each frame's prompts;

[0149] Different descriptions of the same character are unified into a standard title corresponding to a unique character ID. For example, titles such as Character A, A, Protagonist, and Male Lead with Glasses are all confirmed to refer to Character A after manual pre-association. Therefore, all the above titles are uniformly corrected to "Character A (ID: A001)".

[0150] The colloquial expressions are converted into standard written language. For example, "newbie" and "expert" are corrected to their corresponding character ID titles, and ambiguity is eliminated. For example, "he is running" is clarified in context as "character A (ID: A001) is running", thus ensuring that there is no ambiguity in the character references in each frame's prompts.

[0151] Next, a pre-trained referential resolution model fine-tuned from BERT processes the preprocessed continuous text prompt word sequence frame by frame, obtaining the character reference type, the corresponding character ID, and the reference confidence for each frame. The referential resolution model is trained on a multi-character text corpus, adopts a BERT-based architecture, uses cross-entropy loss as the loss function, and uses data augmentation to improve adaptability to multiple scenarios.

[0152] The character reference type is an enumeration type, and its value can be any one of {core protagonist, secondary character, temporary character, no explicit reference}.

[0153] The reference confidence is computed by the fine-tuned referential resolution model through semantic similarity calculation. The higher the confidence, the clearer the referential relationship.

[0154] For example, suppose the sequence of three consecutive text prompts is: T1 = "Character A (A001) is working in a business setting, with a briefcase in front of him", T2 = "He looks up at the window, and Character B (A002) next to him is handing over a document", T3 = "The person in the suit takes the document, and a temporary staff member is taking notes next to him"; then the output after referential annotation is: T1: {Referential type: core protagonist, character ID: A001, referential confidence: 0.98}; T2: {Referential type 1: core protagonist, character ID: A001, referential confidence: 0.96; Referential type 2: secondary character, character ID: A002, referential confidence: 0.95}; T3: {Referential type 1: core protagonist, character ID: A001, referential confidence: 0.89; Referential type 3: temporary character, character ID: A003, referential confidence: 0.82}.

[0155] Subsequently, the role constraint strength coefficient is assigned according to the character's reference type: for example, core protagonist ∈ [0.8, 1.0], secondary character ∈ [0.4, 0.7], and temporary character ∈ [0.1, 0.3]; a prompt with no explicit reference is bound to the core protagonist by default.

[0156] Run the process of steps S21-S22 to obtain the corresponding component constraint strength vector and character condition vector. For a single character frame, directly call the original character condition vector.

[0157] For situations where multiple characters are in the same frame, a fusion weight is calculated for each character from its reference confidence and its role constraint strength coefficient.

[0158] Using these fusion weights, the condition vectors of the characters in the frame are combined by weighted fusion to generate a global fusion condition vector.
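One plausible reading of the multi-character fusion above is sketched below. The weight formula is not recoverable from the source, so the sketch assumes that each weight is the product of reference confidence and role constraint strength coefficient, normalized to sum to 1; the function names and toy vectors are hypothetical.

```python
from typing import List


def fusion_weights(confidences: List[float], strengths: List[float]) -> List[float]:
    """Assumed weighting: confidence times constraint strength, normalized to sum to 1."""
    raw = [c * r for c, r in zip(confidences, strengths)]
    total = sum(raw)
    return [x / total for x in raw]


def fuse_conditions(vectors: List[List[float]], weights: List[float]) -> List[float]:
    """Weighted sum of the per-character condition vectors."""
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) for i in range(dim)]


# Frame T2 of the example: core protagonist (0.96) and secondary character (0.95).
weights = fusion_weights([0.96, 0.95], strengths=[0.9, 0.5])
fused = fuse_conditions([[1.0, 0.0], [0.0, 1.0]], weights)
```

Under this assumption the core protagonist dominates the fused condition even when both references are detected with similar confidence, because its role constraint strength coefficient is higher.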

[0159] Step S42: Call the corresponding process of step S2 to generate a single-frame initial generated image, and call the process of step S3 to perform steady-state determination and redrawing correction to generate a steady-state compliant image.

[0160] The steady-state compliant image is input into the visual encoder mentioned in step S12 for feature extraction, and the result is passed through the decoupling projector to obtain the candidate decoupling features of the corresponding role in the frame.

[0161] The candidate decoupling features are then updated to the decoupling feature state sequence of the corresponding role.

[0162] The decoupling feature state sequence uses a queue data structure to store the candidate decoupling features of each role for each frame. When the sequence length exceeds the queue capacity, the earliest frame feature at the head of the queue is removed on a first-in-first-out basis, so that only the features of the most recent frames are retained.
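A bounded first-in-first-out queue per role can be sketched with the standard-library `collections.deque`, whose `maxlen` argument discards the oldest entry automatically when the capacity is exceeded. The class name and the role IDs are hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List


class FeatureStateSequences:
    """Per-role bounded queues of candidate decoupling features."""

    def __init__(self, capacity: int) -> None:
        self._queues: Dict[str, Deque[List[float]]] = defaultdict(
            lambda: deque(maxlen=capacity)
        )

    def update(self, role_id: str, features: List[float]) -> None:
        """Append the latest frame's features; the oldest entry drops out when full."""
        self._queues[role_id].append(features)

    def sequence(self, role_id: str) -> List[List[float]]:
        """Features for the role, ordered from oldest to newest."""
        return list(self._queues[role_id])


states = FeatureStateSequences(capacity=3)
for frame in range(4):
    states.update("A001", [float(frame)])
history = states.sequence("A001")
```

After four updates with a capacity of three, the first frame's features have been evicted and the three most recent frames remain.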

[0163] Step S43: For roles whose decoupled feature state sequence length exceeds a preset threshold, perform progressive drift trend detection, and make preventative adjustments based on the detection results to generate a role-consistent image sequence. Specifically:

[0164] First, the cosine similarity between candidate decoupling features in adjacent frames within the decoupling feature state sequence is obtained, and the case where the similarity of three consecutive frames is lower than the preset drift threshold is defined as having a gradual drift trend.

[0165] Next, when a drift trend is detected, the constraint parameters corresponding to the core protagonist are increased; for example, the role constraint strength coefficient of the core protagonist is increased by an amount in [0.1, 0.15], and the component constraint strength of the corresponding Level 1 components is increased at the same time.

[0166] After the constraint parameters are improved, there is no need to regenerate the completed frame; the improved constraint parameters are only used when generating subsequent frames.

[0167] If no drift trend is detected, the original constraint parameters remain unchanged, and subsequent frame generation proceeds normally.
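The drift detection and preventative adjustment of step S43 can be sketched as follows. The sketch interprets "three consecutive frames" as three consecutive adjacent-frame similarities below the drift threshold, and it clamps the raised coefficient to 1.0, the upper end of the core protagonist range; both points are assumptions, as are the function names.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def has_drift_trend(sequence: List[Vector], drift_threshold: float, run_length: int = 3) -> bool:
    """True if run_length consecutive adjacent-frame similarities fall below the threshold."""
    run = 0
    for prev, cur in zip(sequence, sequence[1:]):
        if cosine_similarity(prev, cur) < drift_threshold:
            run += 1
            if run >= run_length:
                return True
        else:
            run = 0
    return False


def raise_constraint(strength: float, increment: float = 0.1, upper: float = 1.0) -> float:
    """Increase a constraint strength coefficient, clamped to an assumed upper bound."""
    return min(upper, strength + increment)


stable_sequence = [[1.0, 0.0]] * 4
drifting_sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
strength = 0.85
if has_drift_trend(drifting_sequence, drift_threshold=0.9):
    strength = raise_constraint(strength)  # used for subsequent frames only
```

Completed frames are left untouched; only the value of `strength` carried into later frames changes.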

[0168] Finally, once all frames have been generated, they are stitched together into a continuous image sequence with consistent character appearance, i.e., a character-consistent image sequence.

[0169] Step S5: Obtain and parse the user feedback and historical generation data corresponding to the character-consistent image sequence, generate a feedback optimization dataset, and optimize the character image consistency maintenance system based on it.

[0170] Specifically, feedback data for the character-consistent image sequence is obtained, valid feedback is filtered, and the accumulated generation history stored in steps S3 and S4 is retrieved at the same time. Each data entry must include a unique task ID, generation time, text prompt, generation parameters, generated image path, identity distance, geometric distance, redrawing records, compliance status, and similar information.

[0171] For example, suppose the user feedback data contains 50 records, of which 42 are valid binary ratings (35 satisfactory and 7 unsatisfactory) and 8 are fuzzy ratings such as "okay" or "average". The fuzzy ratings are removed to obtain the valid feedback data; the generation history of character A in steps S3 and S4 is retrieved, and 35 complete records are filtered out.

[0172] Subsequently, records marked as satisfactory by users are placed into the positive sample pool, and records marked as unsatisfactory are placed into the negative sample pool.

[0173] When the number of samples in the positive sample pool exceeds a preset threshold, the identity distances and geometric distances of all samples in the pool are retrieved, and their 95th percentile values are used as the new preset identity distance tolerance threshold and geometric distance tolerance threshold respectively.
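The feedback filtering and threshold update above can be sketched as follows. The record layout, the rating labels, and the toy distance values are hypothetical; the counts mirror the example (35 satisfactory, 7 unsatisfactory, 8 fuzzy). The percentile uses the nearest-rank definition, which is an assumption since the source does not name a method.

```python
import math
from typing import Dict, List, Tuple

Record = Dict[str, object]


def split_feedback(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Keep only binary ratings and split them into positive and negative pools."""
    positive = [r for r in records if r["rating"] == "satisfied"]
    negative = [r for r in records if r["rating"] == "unsatisfied"]
    return positive, negative


def percentile(values: List[float], q: float) -> float:
    """Nearest-rank percentile of a non-empty list, for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


# Toy feedback data: identity distances are illustrative values only.
records: List[Record] = (
    [{"rating": "satisfied", "d_id": 0.01 * i} for i in range(1, 36)]
    + [{"rating": "unsatisfied", "d_id": 0.9} for _ in range(7)]
    + [{"rating": "okay", "d_id": 0.5} for _ in range(8)]
)
positive, negative = split_feedback(records)
new_tau_id = percentile([float(r["d_id"]) for r in positive], 95)
```

Taking the 95th percentile of distances from satisfactory samples means the refreshed threshold accepts most images that users already approved while excluding the most extreme outliers.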

[0174] Next, the mixed features corresponding to each target role recorded in the positive sample pool are taken as valid features recognized by users and used as positive samples, while the mixed features corresponding to each target role in the negative sample pool are used as negative samples. The InfoNCE loss function is adopted with the goal of minimizing the loss, and the parameters of the decoupling projector are updated incrementally by gradient descent.
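For reference, the InfoNCE loss for a single anchor can be sketched as below. The sketch assumes cosine similarity as the scoring function and a temperature of 0.1, neither of which is specified in the source, and it computes only the loss value; the gradient update of the decoupling projector is not shown.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def info_nce(anchor: Vector, positive: Vector, negatives: List[Vector],
             temperature: float = 0.1) -> float:
    """InfoNCE loss: negative log of the positive's share of the exponentiated similarities."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


anchor_vec = [1.0, 0.0]
good = info_nce(anchor_vec, positive=[1.0, 0.0], negatives=[[0.0, 1.0]])
bad = info_nce(anchor_vec, positive=[0.0, 1.0], negatives=[[1.0, 0.0]])
```

The loss is small when the projected anchor is close to user-approved features and far from rejected ones, which is the behaviour that minimizing it is meant to encourage.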

[0175] The specific usage and function of this embodiment are explained below:

[0176] The target character reference image set is obtained and its features are extracted. The features are then enhanced by a preset feature enhancement mechanism to generate a character steady-state dataset. The character steady-state dataset contains at least an initial mixed anchor vector and a component family set. This step uses a visual encoder to extract the initial mixed features and initialize a learnable decoupled projector to generate a comprehensive dataset containing various information such as the character's identity, posture, expression, lighting, and background, providing a data foundation for subsequent steps.

[0177] A single text prompt word is obtained, and combined with the character steady-state dataset, constraints are applied to the diffusion model denoising process and guided correction is performed based on the component constraint strength vector and the character condition vector to obtain the initial generated image. This step avoids the generation stiffness caused by indiscriminate constraints by applying two complementary constraints simultaneously during the diffusion model denoising process. Furthermore, identity-preserving gradient correction is performed at preset key time steps to correct the identity deviation trend in time during denoising, reducing the probability of image distortion in the final generated result.

[0178] Based on the initial generated image and the character steady-state dataset, feature deviation detection is performed, and the initial generated image is corrected according to the feature deviation detection results to generate an optimized generated image. In this step, the generated image is projected onto the decoupled feature space to perform dual determination of identity distance and geometric distance, thereby effectively reducing the unusability rate of the generated results and reducing the workload of manual screening and repair for users.

[0179] Acquire and parse the continuous text prompt word sequence and multi-role steady-state dataset, generate image consistency constraint set, and output the role consistent image sequence by combining the process of steps S2-S3. This step uses the referential resolution technology to parse the role referential relationship in the continuous prompt word sequence and performs progressive drift detection. Before the image deviation exceeds the preset threshold, the constraint intensity is actively increased, thereby effectively suppressing the cumulative deformation phenomenon in multi-step generation.

[0180] User feedback and historical generation data corresponding to the character-consistent image sequence are acquired and parsed, a feedback optimization dataset is generated, and the character image consistency maintenance system is optimized based on it. This step uses user feedback and generation history to construct positive and negative samples and optimizes the decoupling projector through contrastive learning, so that the system's ability to represent character identity gradually improves as usage increases.

[0181] Figure 2 is a schematic diagram of a character image consistency maintenance system according to the present invention.

[0182] Specifically, a system for maintaining character consistency includes:

[0183] The feature extraction module is used to extract features from the target character reference image set and enhance the features in combination with a preset feature enhancement mechanism to generate a character steady-state dataset.

[0184] An image generation module is used to apply constraints and guide the correction of the denoising process of the diffusion model based on a single text prompt word and a character steady-state dataset to obtain an initial generated image.

[0185] An image correction module is used to detect and correct feature deviations in the initial generated image to generate an optimized generated image.

[0186] The sequence generation module is used to parse the continuous text prompt word sequence and the multi-role steady-state dataset, and in combination with the process of steps S2-S3, generate a role-consistent image sequence.

[0187] The feedback optimization module is used to parse user feedback data and historical generated data, generate and optimize the corresponding modules of the system based on the feedback optimization dataset.

[0188] It should be noted that the above formulas are all dimensionless calculations. The formulas were obtained through software simulation on a large amount of collected data so as to approximate real-world conditions as closely as possible. The preset parameters in the formulas are set by those skilled in the art according to the actual situation.

[0189] It should be understood that, in the embodiments of the present invention, the order of the above-mentioned process numbers does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.

[0190] The above-described embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be included within the protection scope of the present invention.

Claims

1. A method for maintaining character image consistency, characterized in that it comprises: S1: Obtain the target character reference image set and extract features, and perform feature enhancement in combination with a preset feature enhancement mechanism to generate a character steady-state dataset, wherein the character steady-state dataset contains at least an initial hybrid anchor vector and a component family set; S2: Obtain a single text prompt word and, in conjunction with the character steady-state dataset, apply constraints and guided correction to the diffusion model denoising process based on the component constraint strength vector and the character condition vector to obtain the initial generated image; S3: Based on the initial generated image and the character steady-state dataset, perform feature deviation detection, and correct the initial generated image according to the feature deviation detection results to generate an optimized generated image; S4: Obtain and parse the continuous text prompt word sequence and the multi-role steady-state dataset, generate the image consistency constraint set, and output the role-consistent image sequence by combining the process of steps S2-S3; S5: Obtain and parse user feedback and historical generated data corresponding to the role-consistent image sequence, generate a feedback optimization dataset, and optimize the character image consistency maintenance system based on the feedback optimization dataset.

2. The method for maintaining character image consistency according to claim 1, characterized in that, Step S1 includes: Collect reference images of the target character from multiple perspectives, poses, and lighting conditions, and construct a reference image set of the target character; A pre-trained visual encoder is used to extract features from each reference image in the target character reference image set, and the mean of all extracted feature vectors is taken as the initial mixing anchor vector. The initial hybrid anchor vector includes at least the target character's identity, posture, expression, and lighting features; A pre-trained component parsing network is used to hierarchically decompose the target character reference image set into components, generating a component family set. The initial hybrid anchor vector and the component family set are encapsulated into a role steady-state dataset for output.

3. The method for maintaining character image consistency according to claim 2, characterized in that, The method further includes: The component family set contains the basic components that make up the target role; The basic components include at least the target character's facial features, body shape, hairstyle, clothing, and accessories.

4. The method for maintaining character image consistency according to claim 1, characterized in that, Step S2 includes: Input a single text prompt word into a pre-trained text conditional diffusion model to obtain text embedding vectors and scene semantic labels; Based on the scene semantic tags, query the semantic mapping table to obtain the component constraint strength vector; Decouple the initial hybrid anchor vector contained in the character steady-state dataset by projection and conditional mapping to obtain the character condition vector; The text conditional diffusion model performs standard denoising guided by text and character conditions step by step according to the text embedding vector and the character condition vector; In the standard denoising process, role offset analysis is performed within the key intervention time step, and the latent variables corresponding to the current time step are guided and corrected based on the results of the role offset analysis. After denoising, the latent variables of the final text conditional diffusion model are decoded to generate the initial generated image.

5. The method for maintaining character image consistency according to claim 4, characterized in that, Role shift analysis is performed within key intervention time steps, and the latent variables corresponding to the current time step are guided and corrected based on the results of the role shift analysis, including: Determine whether the time step corresponding to the current text conditional diffusion model is a key intervention time step based on preset rules; If the current time step is determined to be a non-critical intervention time step, no action will be taken; If the current time step is determined to be a critical intervention time step, then the latent variables corresponding to the current time step are obtained, and an intermediate image is generated based on the latent variables. Feature extraction and decoupling projection are performed on the intermediate image to obtain intermediate decoupling features. Based on the aforementioned intermediate decoupling features, a role shift analysis is performed. If the results show that a role has shifted, the latent variables are then guided and corrected. If the result shows that the character has not shifted, no action is taken.

6. The method for maintaining character image consistency according to claim 1, characterized in that step S3 includes: performing feature extraction and decoupling projection on the initial generated image to obtain candidate decoupled features; performing decoupling projection on the initial hybrid anchor vector contained in the character steady-state dataset to obtain a decoupled anchor vector; obtaining the identity distance between the candidate decoupled features and the decoupled anchor vector, and the geometric distance between the initial generated image and a preset reference image, and performing steady-state determination in combination with the preset threshold; if the steady-state determination shows that the image is in a steady state, directly outputting the initial generated image as the optimized generated image; and if the steady-state determination shows a deviation from the steady state, obtaining a correction condition vector based on the candidate decoupled features and the decoupled anchor vector, and injecting the correction condition vector into the diffusion model to generate the optimized generated image.
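
As a non-limiting sketch, the steady-state determination of step S3 could be expressed as two distance checks. This example assumes that the identity distance is the Euclidean distance between feature vectors, that the geometric distance is the mean distance between matched 2D landmark points, and that the correction condition vector extrapolates from the candidate features past the anchor. None of these formulas, thresholds, or names is given in the claim.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Point = Tuple[float, float]


def identity_distance(candidate: Vector, anchor: Vector) -> float:
    """Assumed identity metric: Euclidean distance between feature vectors."""
    return math.dist(candidate, anchor)


def geometric_distance(generated: Sequence[Point], reference: Sequence[Point]) -> float:
    """Assumed geometric metric: mean distance between matched landmark points."""
    return sum(math.dist(p, q) for p, q in zip(generated, reference)) / len(reference)


def is_steady(
    candidate: Vector,
    anchor: Vector,
    generated: Sequence[Point],
    reference: Sequence[Point],
    id_threshold: float = 0.3,    # assumed preset threshold
    geo_threshold: float = 5.0,   # assumed preset threshold, in pixels
) -> bool:
    """Steady state requires both distances to stay within their thresholds."""
    return (
        identity_distance(candidate, anchor) <= id_threshold
        and geometric_distance(generated, reference) <= geo_threshold
    )


def correction_condition_vector(candidate: Vector, anchor: Vector, gain: float = 0.5) -> Vector:
    """Assumed correction: push past the anchor, away from the drifted candidate."""
    return [a + gain * (a - c) for c, a in zip(candidate, anchor)]
```

If `is_steady` returns `False`, the correction condition vector would be passed to the diffusion model for regeneration; how it is injected is outside the scope of this sketch.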

7. The method for maintaining character image consistency according to claim 1, characterized in that step S4 includes: inputting the continuous text prompt word sequence into a pre-trained reference resolution model and labeling the reference type and confidence of each character in the prompt for each frame; during frame-by-frame generation, assigning a character constraint strength coefficient according to the reference type and, when multiple characters appear in the same frame, performing weighted fusion of the character condition vectors corresponding to the characters; invoking the corresponding processes of steps S2 and S3 to generate a single-frame image; after each frame is generated, obtaining the candidate decoupled features corresponding to the single-frame image according to the feature extraction and decoupling projection process of step S3, and updating the decoupled feature state sequence of the corresponding character with the candidate decoupled features; and, for each character whose decoupled feature state sequence length exceeds a preset threshold, performing progressive drift trend detection and making preventive adjustments based on the detection result, to generate the character-consistent image sequence.
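
Two parts of step S4 lend themselves to a short, non-limiting sketch: the weighted fusion of character condition vectors and the progressive drift trend detection. The reference types, their coefficients, the normalized weighted average, and the least-squares slope test over per-frame identity distances are all assumptions for illustration; the claim names the operations but not their formulas.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]

# Hypothetical character constraint strength coefficient per reference type.
REFERENCE_STRENGTH: Dict[str, float] = {"explicit": 1.0, "pronoun": 0.7, "implicit": 0.4}


def fuse_conditions(characters: Sequence[Tuple[Vector, str, float]]) -> Vector:
    """Fuse (condition_vector, reference_type, confidence) triples by weighted average."""
    weights = [REFERENCE_STRENGTH[ref_type] * conf for _, ref_type, conf in characters]
    total = sum(weights)
    dim = len(characters[0][0])
    return [
        sum(w * vec[i] for w, (vec, _, _) in zip(weights, characters)) / total
        for i in range(dim)
    ]


def drift_slope(distances: Sequence[float]) -> float:
    """Least-squares slope of identity distance against frame index."""
    n = len(distances)
    mean_x = (n - 1) / 2.0
    mean_y = sum(distances) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(distances))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def detect_drift(distances: Sequence[float], min_len: int = 3, slope_limit: float = 0.05) -> bool:
    """Flag drift only once the sequence length exceeds min_len and the trend is rising."""
    if len(distances) <= min_len:
        return False
    return drift_slope(distances) > slope_limit
```

A slope test reacts to a steady rise in distance before any single frame crosses the steady-state threshold, which is one way to read "preventive adjustment".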

8. The method for maintaining character image consistency according to claim 7, characterized in that invoking the corresponding processes of steps S2 and S3 to generate a single-frame image includes: invoking the process of step S2 to generate a single-frame initial generated image, and invoking the process of step S3 to perform steady-state determination and redrawing correction, to generate a steady-state-compliant image.

9. The method for maintaining character image consistency according to claim 1, characterized in that step S5 includes: obtaining user feedback on the character-consistent image sequence and the intermediate data generated in steps S3 and S4, and constructing positive and negative sample pools from them; and updating the parameters of the corresponding modules of the character image consistency maintenance system based on the samples in the positive and negative sample pools.
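
The construction of the sample pools in step S5 might, as a non-limiting example, look like the sketch below. It assumes that user feedback arrives as a 1 to 5 rating per frame and that the intermediate data is a single identity distance; the record fields and both rating thresholds are hypothetical. The parameter update itself is not shown.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class FeedbackRecord:
    frame_id: int
    rating: int               # hypothetical user score from 1 (bad) to 5 (good)
    identity_distance: float  # assumed intermediate data kept from steps S3 and S4


def build_sample_pools(
    records: Sequence[FeedbackRecord],
    like_threshold: int = 4,     # assumed cut-off for positive samples
    dislike_threshold: int = 2,  # assumed cut-off for negative samples
) -> Tuple[List[FeedbackRecord], List[FeedbackRecord]]:
    """Split records into (positives, negatives); neutral ratings are left out."""
    positives = [r for r in records if r.rating >= like_threshold]
    negatives = [r for r in records if r.rating <= dislike_threshold]
    return positives, negatives
```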

10. A character image consistency maintenance system, used to implement the method according to any one of claims 1 to 9, characterized in that it includes: a feature extraction module, used to extract features from the target character reference image set and enhance the features in combination with a preset feature enhancement mechanism to generate the character steady-state dataset; an image generation module, used to apply constraints to, and guide and correct, the denoising process of the diffusion model based on the single text prompt word and the character steady-state dataset to obtain the initial generated image; an image correction module, used to detect and correct feature deviations in the initial generated image to generate the optimized generated image; a sequence generation module, used to parse the continuous text prompt word sequence and the multi-character steady-state dataset and, in combination with the processes of steps S2 and S3, generate the character-consistent image sequence; and a feedback optimization module, used to parse user feedback data and historical generation data, generate a feedback optimization dataset, and optimize the corresponding modules of the system based on the feedback optimization dataset.