A video image rewriting method, system and device based on candidate frame quality score and multi-level constraint control
By using a method based on candidate frame quality scoring and multi-level constraint control, adaptive sparse frame extraction and multi-feature fusion scoring are performed. After selecting the best frame, multi-level verification and repair are carried out. Combined with visual language model and semantic desensitization, the problems of inaccurate frame selection and unstable generation in the existing technology are solved, and an efficient and stable video image rewriting process is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- XIAMEN CHANJING TECH CO LTD
- Filing Date
- 2026-04-29
- Publication Date
- 2026-06-19
AI Technical Summary
Existing video image rewriting solutions cannot automatically select the optimal frame with a clear subject, a complete frontal face, no obstructions, and no subtitles. They lack pre-generation constraint verification and post-generation quality acceptance mechanisms, resulting in problems such as abnormal characters, structural distortion, and residual subtitles in the generated results. Furthermore, they have not constructed an integrated control logic for candidate frame scoring, multi-level quality gates, and closed-loop backoff, making it impossible to achieve autonomous optimization of the processing flow and stable output.
By using a candidate frame quality scoring and multi-level constraint control method, adaptive sparse frame extraction is used to extract candidate frames. The optimal key frame is selected by combining a multi-feature fusion quality scoring function. Multi-level constraint verification is performed before generation. Subtitles, text, and color block regions are jointly detected and repaired. Cross-modal semantic translation is performed using a visual language model. Pixel-to-semantic decoupling is achieved through a semantic desensitization filter. Latent space semantic consistency verification is introduced to construct a multi-dimensional post-quality inspection and system self-iterative optimization.
It achieves efficient selection of the optimal frame, reduces redundant calculations, ensures the stability and quality of the generated image, avoids the risk of distortion and infringement of the generated result, supports batch scheduling and system self-iterative optimization, and improves the throughput and traceability of engineering production.
Figure CN122248218A_ABST
Description
Technical Field
[0001] This invention relates to the field of data processing technology, and in particular to a video image rewriting method, system, and device based on candidate frame quality scoring and multi-level constraint control.
Background Technology
[0002] Existing video image rewriting solutions have significant shortcomings, failing to meet the demands of batch, stable, and high-quality engineering processing. At the frame selection level, most solutions use fixed time intervals for frame extraction, failing to automatically select the optimal frames that are clear, have complete frontal faces, and are free of obstructions and subtitles, resulting in inefficiency due to reliance on manual frame selection. At the image preprocessing level, there is a lack of detection and repair capabilities for subtitles, overlaid text, and occluded areas, with inferior frames directly entering the generation stage, significantly increasing the failure rate. At the generation control level, there is a lack of pre-generation constraint verification and post-generation quality acceptance mechanisms, leading to issues such as abnormal characters, structural distortion, and residual subtitles in the generated results, resulting in insufficient stability. At the system engineering level, there is a lack of cache reuse, idempotent control, concurrent scheduling, and anomaly rollback mechanisms, resulting in excessive repetitive calculations, high failure rates, and poor scalability. Furthermore, existing solutions lack integrated control logic for candidate frame scoring, multi-level quality gates, and closed-loop rollback, making it impossible to achieve autonomous optimization of the processing flow and stable output.
[0003] In addition, the existing solution only processes a single frame independently and does not consider the consistency of cross-frame timing, which is prone to problems such as jitter in subtitle repair and inconsistent picture style; it does not have the ability to predict the success rate of rewriting before generation and cannot avoid invalid generation in advance; it does not perform compliance verification on the structure of the characters, which is prone to generation defects such as facial deformities and limb abnormalities; and it does not establish an intelligent source tracing and compensation mechanism for failed tasks, resulting in insufficient engineering iteration and optimization capabilities.
[0004] 1. Relying on manual browsing of videos one by one and manually capturing frames is inefficient and difficult to scale up.
[0005] 2. Conventional frame extraction schemes often extract frames at fixed time intervals, which cannot guarantee that the subject is clear, the face is visible, and there are no obstructions. This is especially true for spoken videos (hosts, presenters, etc.), where the subject is fixed and the timing is redundant. Dense frame extraction results in a waste of computing power.
[0006] 3. For video footage featuring people, issues such as multiple people, side profiles, small faces in distant shots, subtitles obscuring the view, or abnormal proportions in keyframes will significantly reduce the quality of subsequent image generation.
[0007] 4. Existing image rewriting systems mostly adopt the "image-to-image" mode, where pixel-level features are directly passed to the generation model. This can easily lead to the generated result being highly similar to the face / background of the reference frame, posing a risk of infringement of portrait rights and background copyrights. Furthermore, they lack an automated and compliant desensitization mechanism.
[0008] 5. The lack of specific quality inspections for human body structure (such as fingers and facial features), image compliance, and downstream audio driver compatibility after generation results in the generated images being unusable for downstream links such as Lip-Sync, leading to a high rework rate.
[0009] 6. Most systems lack unified scheduling, exception recovery, result caching, idempotent control, and quality closed-loop capabilities for batch tasks, which is not conducive to engineering implementation.
Summary of the Invention
[0010] To solve the above-mentioned technical problems, the present invention provides the following technical solution. A video image rewriting method based on candidate frame quality scoring and multi-level constraint control includes: acquiring the video to be processed and constructing a hierarchical cache; performing adaptive sparse frame extraction based on the characteristics of the broadcast scene, and extracting face, clarity, occlusion, and subtitle features to form a candidate frame set; weighting and sorting the candidate frame set based on a multi-feature fusion quality scoring function to select the optimal keyframe and candidate keyframes, and establishing a three-layer elimination and backoff strategy of coarse filtering, fine sorting, and anomaly backoff; performing multi-level constraint verification on the keyframes before generation, and switching candidate keyframes in sequence if they are unqualified; jointly detecting and locally repairing the keyframes through three branches covering subtitles, text, and color block regions, and forcibly excluding the lip extension region during the repair process to protect the lip region and ensure the compatibility of downstream audio drivers; inputting the verified and repaired keyframes into a visual language (VL) model to perform cross-modal semantic translation and output a four-dimensional structured text prompt, including character features, environment setting, composition parameters, and style tags; generalizing biometric-level unique features into category-level features through a semantic desensitization filter, achieving pixel-to-semantic decoupling; inputting the structured text prompt into a text-to-image (T2I) model, generating a parameter chain by dynamically matching CFG intensity, sampling steps, and ControlNet weights with the style and composition parameters, and introducing latent space semantic consistency verification to complete the generative image rewriting; performing multi-dimensional post-quality checks on the generated images, covering human structure, facial similarity compliance threshold verification, background copyright, audio driver adaptation, and clarity verification, with failure to meet the standards triggering multi-level backtracking; performing metadata cleaning on images that pass the quality checks and embedding an invisible traceability watermark before output; and processing multiple tasks in parallel to enable batch scheduling and system self-iterative optimization.
[0011] A video image rewriting system based on candidate frame quality scoring and multi-level constraint control is provided. The device is used to execute executable instructions to perform the above-mentioned video image rewriting method based on candidate frame quality scoring and multi-level constraint control.
[0012] Its beneficial effects are as follows. To address the low-motion nature of spoken video, it uses a default sampling step of 3-5 seconds and dynamically adjusts the frame density based on optical flow amplitude; it uses human detection bounding boxes to lock the main subject area and extracts only effective region features, reducing redundant computation by 90%. For spoken video scenarios, it prioritizes the clarity of the face, the visibility of the lips, and the sense of eye contact, selects the optimal keyframe and candidate frames through a quality scoring function, and sets a layered elimination strategy to avoid "stopping upon hit." Replacing the traditional approach of passing the image directly to the generator, it converts keyframes into structured text prompts (person features, environment setup, composition parameters, style tags) and incorporates a semantic desensitization filter that generalizes descriptions pointing to the uniqueness of a specific individual into category-level style words, decoupling visual information from semantic information.
[0013] The parameter chain is dynamically generated based on the VL output text, and semantic alignment verification is performed in the latent space to ensure that the generated image retains the environmental logic and temperament characteristics of the broadcast scene, rather than being a mechanical replication. New features include finger / joint integrity detection, facial feature vector similarity compliance threshold control (<θ_legal), background trademark / copyright mark interception, and audio-driven adaptability verification for frontal angle / lip occlusion / illumination uniformity to eliminate distortion and infringement risks. A four-level fallback chain is constructed: "prompt word fine-tuning -- parameter switching -- alternative frame re-translation -- failed feature rewriting," combined with cache reuse, idempotency verification, backpressure control, and invisible watermark traceability to achieve high-throughput, traceable, and self-iterative engineered production.
Attached Figure Description
[0014] Figure 1 is a flowchart of a video image rewriting method based on candidate frame quality scoring and multi-level constraint control provided in an embodiment of the present invention; Figure 2 is a schematic diagram of a video image rewriting system based on candidate frame quality scoring and multi-level constraint control provided in an embodiment of the present invention.
Detailed Implementation
[0015] The preferred embodiments of the present invention will be described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described herein are for illustration and explanation only and are not intended to limit the present invention. Figure 1 shows a video image rewriting method based on candidate frame quality scoring and multi-level constraint control according to an exemplary embodiment of the present application.
[0016] In this application embodiment, a video image rewriting method based on candidate frame quality scoring and multi-level constraint control is provided, as shown in Figure 1. S101: Obtain the video to be processed and build a hierarchical cache. Perform adaptive sparse frame extraction based on the characteristics of the broadcast scene, and extract face, clarity, occlusion, and subtitle features to form a candidate frame set.
[0017] In one implementation, the system first reads an externally received list of tasks to be processed, which includes a unique video identifier and storage path information. The system locates the storage location of the original video based on the unique video identifier and completes the reading and loading of the original video data stream. During data reading, the system first checks if there is already loaded video data with the same identifier in the local cache path. If the corresponding video data exists in the local cache path, the cached data is directly reused, and the video reading and decoding operations are not repeated. If the corresponding video data does not exist in the local cache path, the data is written to a hierarchical cache structure after the video reading is completed. The hierarchical cache structure stores data hierarchically according to video identifier, frame number, and timestamp, enabling fast indexing and reuse of frame data and reducing resource consumption caused by repeated readings.
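The cache lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation disclosed here: the class name `HierarchicalFrameCache`, the in-memory dictionary layout, and the stand-in decoder `fake_decode` are all hypothetical.

```python
from typing import Callable, Dict, Tuple

# Hypothetical frame payload: (timestamp in seconds, raw frame bytes).
Frame = Tuple[float, bytes]


class HierarchicalFrameCache:
    """Two-level index: video identifier -> frame number -> (timestamp, data)."""

    def __init__(self) -> None:
        self._store: Dict[str, Dict[int, Frame]] = {}
        self.decode_calls = 0  # counts how often the decoder actually ran

    def load_video(self, video_id: str,
                   decode: Callable[[str], Dict[int, Frame]]) -> Dict[int, Frame]:
        # Reuse cached data when the identifier is already present.
        if video_id in self._store:
            return self._store[video_id]
        # Otherwise decode once and write the result into the cache.
        self.decode_calls += 1
        self._store[video_id] = decode(video_id)
        return self._store[video_id]

    def get_frame(self, video_id: str, frame_no: int) -> Frame:
        return self._store[video_id][frame_no]


def fake_decode(video_id: str) -> Dict[int, Frame]:
    # Stand-in decoder for illustration: three frames at an assumed 25 fps.
    return {n: (n / 25.0, f"{video_id}:{n}".encode()) for n in range(3)}


cache = HierarchicalFrameCache()
cache.load_video("vid-001", fake_decode)
cache.load_video("vid-001", fake_decode)  # second call hits the cache
```

In this sketch the second `load_video` call returns the stored frames without invoking the decoder again, which is the reuse behavior the paragraph describes.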
[0018] To address the characteristics of scenarios where the main subject in a spoken video is fixed in position and has high temporal redundancy, the system initiates an adaptive sparse frame extraction process. The system samples the video using a default time step of three to five seconds. Simultaneously, it calculates the average optical flow amplitude between adjacent frames in real time and compares the result with a preset static threshold. When the average optical flow amplitude is lower than the static threshold, the default sparse sampling step size remains unchanged. When the average optical flow amplitude is higher than the static threshold, it is determined that a sudden change in the scene or a scene transition has occurred, and the sampling step size is temporarily reduced to increase the frame sampling density in the area of scene change. During the frame extraction process, only valid frames containing the main subject are retained, excluding empty shots and transitional frames without a subject, thus completing the initial screening of candidate frames.
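The step-size logic above might look like the following sketch. It assumes the mean optical-flow amplitude has already been computed once per second by some other component; the function name, the 4-second default step, the 1-second dense step, and the threshold value are illustrative choices rather than values fixed by this description.

```python
from typing import List, Sequence


def adaptive_sample_times(flow_per_second: Sequence[float],
                          default_step: int = 4,
                          dense_step: int = 1,
                          static_threshold: float = 1.0) -> List[int]:
    """Return the sampled time positions in seconds.

    flow_per_second[t] is the mean optical-flow amplitude between adjacent
    frames around second t (assumed to be computed elsewhere).
    """
    times: List[int] = []
    next_due = 0
    for t, magnitude in enumerate(flow_per_second):
        moving = magnitude > static_threshold
        spaced = not times or t - times[-1] >= dense_step
        # Sample when the sparse schedule is due, or early when motion appears.
        if t >= next_due or (moving and spaced):
            times.append(t)
            next_due = t + (dense_step if moving else default_step)
    return times


# 12 seconds of video with a burst of motion at seconds 4 and 5.
flow = [0.1, 0.1, 0.1, 0.1, 3.0, 2.5, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1]
sampled = adaptive_sample_times(flow)
```

Static stretches are sampled every `default_step` seconds, while the burst of motion is sampled every `dense_step` seconds until the flow amplitude drops below the threshold again.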
[0019] The system performs multi-dimensional feature extraction operations on each selected candidate frame. Facial pose features are extracted to determine whether the person in the frame is facing forward, sideways, or exhibiting tilt or rotation. Facial area features are extracted to measure the proportion of the face in the image. Image sharpness features are extracted to assess the presence of motion blur or out-of-focus issues. Subject integrity features are extracted to determine if there are any cropped or missing parts of the face and key body areas. Occlusion features are extracted to detect whether the face is obscured by masks, sunglasses, hands, or other foreground objects. Subtitle risk features are extracted to calculate the probability of text overlay in the image. Composition features are extracted to assess the subject's centering, the proportion of white space, and the aspect ratio. All features are jointly output by the human keypoint analysis module, image statistical feature calculation module, and rule calculation module of the object detection model, forming a unified candidate frame feature vector.
[0020] The system will arrange all candidate frames for feature extraction in chronological order within the video. Each candidate frame is bound to a corresponding frame number, timestamp, and complete feature vector. The system will then aggregate all candidate frames that meet the basic image requirements to form a structured candidate frame set. This set serves as the sole data source for subsequent quality scoring and keyframe selection, ensuring that subsequent processing stages can quickly access frame data and corresponding features to achieve automated selection of high-quality keyframes.
[0021] S102: Based on the multi-feature fusion quality scoring function, weight and sort the candidate frame set to select the best keyframe and candidate keyframes, and establish a three-layer elimination and back-off strategy of coarse filtering, fine sorting, and anomaly back-off.
[0022] In one implementation, the candidate frame feature set is weighted and fused through the multi-feature fusion quality scoring function, and a dual correction using a time decay coefficient and an image stability coefficient is applied, to generate candidate frame quality scores and ranking results. The candidate frame feature set includes facial pose, facial area, sharpness, subject integrity, occlusion, caption risk, and compositional feature data. The multi-feature fusion quality scoring function includes a subject quality sub-score, a pose usability sub-score, an image cleanliness sub-score, a compositional fit sub-score, a time decay coefficient, and an image stability coefficient. The system takes the candidate frame feature set as input and feeds the features into the corresponding sub-score calculation modules to obtain the subject quality sub-score, pose usability sub-score, image cleanliness sub-score, and composition fit sub-score in sequence.
[0023] The subject quality sub-score is obtained by weighting the proportion of face area, image sharpness, and subject integrity mask coverage, and then mapping the result to the zero-to-one range using a normalization function. The pose usability sub-score is obtained by weighting the average face rotation angle, lip visibility, and the angle between the line of sight and the camera, and then mapping the result to the zero-to-one range using a normalization function. The image cleanliness sub-score is obtained by weighting the subtitle risk density, occlusion area proportion, and background interference entropy, normalizing the result, and then subtracting it from one, so that a cleaner image receives a higher score. The composition fit sub-score is obtained by weighting the subject centering offset, image white space ratio, and aspect ratio fit, and then mapping the result to the zero-to-one range using a normalization function.
[0024] The system weights and fuses the four sub-scores according to preset weighting coefficients to obtain an initial quality score. Then, a time decay coefficient and a frame stabilization coefficient are introduced to double-correct the initial quality score. The time decay coefficient is calculated based on the relative distance between the candidate frame's time position and the center of the video's climax, used to suppress scores from unstable narration frames at the beginning and end of the video. The frame stabilization coefficient is calculated based on the variance of optical flow amplitude between adjacent frames; the smaller the variance of optical flow amplitude, the larger the frame stabilization coefficient value. After this double correction, the final total quality score of the candidate frames is obtained, and all candidate frames are ranked according to their total quality scores.
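A minimal sketch of the scoring chain follows. The description does not give the weights or the exact form of the two correction coefficients, so everything numeric here is an assumption: the sub-score weights, the Gaussian-shaped time decay, and the `1 / (1 + variance)` stability coefficient are illustrative stand-ins. Taking the cleanliness sub-score as one minus the normalized penalty is likewise an interpretation.

```python
import math
from statistics import pvariance
from typing import Dict, Sequence


def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def weighted(values: Sequence[float], weights: Sequence[float]) -> float:
    # Weighted mean mapped into [0, 1]; inputs are assumed pre-normalized.
    return clamp01(sum(v * w for v, w in zip(values, weights)) / sum(weights))


def cleanliness(subtitle_density: float, occlusion_ratio: float,
                background_entropy: float) -> float:
    # Penalty terms are weighted, then subtracted from one (assumed weights).
    penalty = weighted([subtitle_density, occlusion_ratio, background_entropy],
                       [0.5, 0.3, 0.2])
    return 1.0 - penalty


def total_quality(sub: Dict[str, float], t: float, t_center: float,
                  duration: float, flow_magnitudes: Sequence[float]) -> float:
    # Illustrative weights for the four sub-scores (not from the source).
    w = {"subject": 0.35, "pose": 0.30, "clean": 0.20, "composition": 0.15}
    initial = sum(w[k] * sub[k] for k in w)
    # Time decay: assumed Gaussian falloff with distance from the center time.
    decay = math.exp(-((t - t_center) / (0.5 * duration)) ** 2)
    # Stability: smaller optical-flow variance gives a larger coefficient.
    stability = 1.0 / (1.0 + pvariance(flow_magnitudes))
    return initial * decay * stability


sub = {"subject": 0.9, "pose": 0.8,
       "clean": cleanliness(0.1, 0.0, 0.2), "composition": 0.7}
center = total_quality(sub, t=30, t_center=30, duration=60,
                       flow_magnitudes=[0.2, 0.2, 0.2])
edge = total_quality(sub, t=2, t_center=30, duration=60,
                     flow_magnitudes=[0.2, 0.2, 0.2])
shaky = total_quality(sub, t=30, t_center=30, duration=60,
                      flow_magnitudes=[0.1, 2.0, 0.3])
```

With identical sub-scores, the frame near the start of the video (`edge`) and the frame with fluctuating optical flow (`shaky`) both score lower than the stable frame at the center, which is the effect the double correction is meant to produce.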
[0025] The quality scores and ranking results are subjected to stratified filtering to generate keyframe and candidate keyframe lists. The keyframe and candidate keyframe lists include the best keyframe, the second-best candidate keyframe, and the coarse filter layer quality score threshold. The system compares the total quality score with a preset coarse filtering layer threshold, directly discarding candidate frames with total quality scores below the threshold, completing the first layer of coarse filtering. The remaining candidate frames after coarse filtering are then sorted in descending order of total quality score, completing the second layer of fine filtering. The system selects the first-ranked candidate frame as the optimal keyframe and the second to fourth-ranked candidate frames as suboptimal alternative keyframes. The system summarizes the optimal keyframe and all alternative keyframes to generate a list of keyframes and alternative keyframes, clearly recording the frame number, timestamp, and total quality score of each frame.
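The two filtering layers can be sketched as below, assuming each candidate already carries its total quality score. The `Candidate` and `Selection` types and the threshold value are hypothetical; the choice of one best frame plus the second to fourth ranked frames follows the paragraph above.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    frame_no: int
    timestamp: float
    score: float


class Selection(NamedTuple):
    best: Optional[Candidate]
    alternates: List[Candidate]
    coarse_threshold: float


def select_keyframes(candidates: List[Candidate], coarse_threshold: float,
                     n_alternates: int = 3) -> Selection:
    # Layer 1: discard candidates whose total score is below the threshold.
    kept = [c for c in candidates if c.score >= coarse_threshold]
    # Layer 2: sort the survivors by total score, highest first.
    ranked = sorted(kept, key=lambda c: c.score, reverse=True)
    best = ranked[0] if ranked else None
    return Selection(best, ranked[1:1 + n_alternates], coarse_threshold)


scores = [0.42, 0.91, 0.77, 0.30, 0.85, 0.66, 0.58]
frames = [Candidate(i * 100, i * 4.0, s) for i, s in enumerate(scores)]
selection = select_keyframes(frames, coarse_threshold=0.5)
```

Here frame 100 becomes the optimal keyframe and frames 400, 200, and 500 become the alternates; the fifth-ranked survivor (frame 600) is not kept as an alternate.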
[0026] The quality scores and ranking results, along with the keyframe and candidate keyframe lists, are processed to generate hierarchical elimination and backoff constraints. Based on the aforementioned processing flow, the system employs three core constraints. First, the coarse filtering constraint (below the scoring threshold) stipulates that if the total quality score of a candidate frame does not reach the coarse filtering layer's threshold, the frame is directly removed from the candidate set and does not proceed to subsequent sorting and filtering stages. Second, the fine filtering constraint (descending order of score) requires that the coarsely filtered candidate frames be strictly arranged by total quality score from highest to lowest, with keyframes and candidate keyframes selected only from the top-ranked candidate frames to ensure the quality priority of the selected frames. Third, the candidate frame back-off constraint (switching to candidate frames in order when generation or quality inspection fails) stipulates that when the optimal keyframe fails in the subsequent generation or quality inspection stages, the system switches to the candidate keyframes in ranking order until a frame that can pass processing is found, preventing process interruption due to a single selection failure. These constraints together constitute the hierarchical elimination and back-off control logic, ensuring the stability and continuity of keyframe selection and anomaly handling.
[0027] S103: Perform multi-level constraint verification on keyframes before generation; if the verification fails, switch to alternative keyframes in sequence.
[0028] In one implementation, before a keyframe enters the image generation process, the system initiates a joint decision gate to perform full constraint verification. The verification process takes the keyframe image as input and uses a face detection module, a human body analysis module, a sharpness assessment module, a lip keypoint detection module, and a background complexity analysis module as parallel processing units to perform seven hard condition checks on the keyframe. The judgment results from all modules are converged to the constraint decision unit, which performs a logical AND operation. If any constraint is not met, the current keyframe is deemed unqualified, the system immediately terminates the subsequent processing of that frame, and initiates a sequential switching mechanism for alternative frames.
[0029] The calculation and judgment logic for each of the seven pre-generation constraints is as follows. For the people constraint verification, the system takes the keyframe image as input and sends it to the human detection module. The module outputs the number of people appearing in the image and their corresponding detection boxes. The system determines that the constraint passes if only a single person exists in the image. The constraint fails if two or more people appear in the image. For example, if a keyframe shows both the anchor and a staff member, then this constraint fails.
[0030] For subject integrity constraint verification, the system sends keyframes to the face and torso parsing module to extract the integrity information of the facial and torso regions. The system determines that the constraint passes if the face is not cropped and the key areas of the torso are not missing. The constraint fails if the face is cropped at the edge or if the shoulders or head are truncated. For example, if a person's face is close to the right edge of the image and is cropped in a keyframe, then this constraint fails.
[0031] For posture constraint verification, the system calculates the yaw, pitch, and roll angles of the person's head in keyframes using the face pose estimation module. The system calculates the average deviation of these three angles; a deviation of no more than 20 degrees passes the constraint. The constraint fails if the facial tilt exceeds the threshold or if the head is raised or lowered too far. For example, if the person's face is tilted by 30 degrees in a keyframe, this constraint fails.
[0032] For sharpness constraint verification, the system uses the Laplacian variance algorithm to calculate the sharpness of keyframes without reference. The constraint is passed when the system's calculation result is higher than a preset sharpness threshold. The constraint fails when motion blur, out-of-focus blur, or fogging exists in the image. For example, if a keyframe is blurred due to camera shake, this constraint will fail.
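The no-reference sharpness measure can be illustrated with a small pure-Python sketch, assuming a grayscale image given as a list of rows at least 3 by 3 in size. It applies the common 4-neighbour Laplacian kernel to interior pixels only; the threshold value of 100.0 is an arbitrary placeholder, since the preset threshold is not stated.

```python
from statistics import pvariance
from typing import List, Sequence


def laplacian_variance(gray: Sequence[Sequence[float]]) -> float:
    """Variance of the 4-neighbour Laplacian response over interior pixels."""
    height, width = len(gray), len(gray[0])
    responses: List[float] = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(gray[y - 1][x] + gray[y + 1][x]
                             + gray[y][x - 1] + gray[y][x + 1]
                             - 4 * gray[y][x])
    return pvariance(responses)


def sharpness_passes(gray: Sequence[Sequence[float]], threshold: float) -> bool:
    # The constraint passes only when the measure exceeds the preset threshold.
    return laplacian_variance(gray) > threshold


# A checkerboard has strong edges everywhere; a flat image has none.
sharp = [[255 * ((x + y) % 2) for x in range(8)] for y in range(8)]
flat = [[128 for _ in range(8)] for _ in range(8)]
```

A blurred or defocused frame behaves like the flat image: its Laplacian response is nearly constant, so the variance falls below the threshold and the constraint fails.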
[0033] For occlusion constraint verification, the system sends keyframes to the face occlusion detection module to detect occlusions such as masks, sunglasses, hands, and props. The system passes the constraint if the face is completely unobstructed. It fails the constraint if any form of occlusion is present on the face. For example, if a character touches their cheek in a keyframe, the constraint fails.
[0034] For lip visibility constraint verification, the system extracts the lip region and its extended protection region using the facial keypoint localization module, and detects whether the lips are occluded. The system passes the constraint if the lip region is completely visible, unobstructed, and uncovered. The constraint fails if the lips are obscured by hands, objects, or text. For example, if subtitles cover the lips in a keyframe, this constraint fails.
[0035] For background complexity constraint verification, the system calculates texture entropy and detects interfering elements in the keyframe background. The system approves the constraint if the background is simple, free of high-frequency cluttered textures, and lacks strong interfering targets. It disqualifies the constraint if the background is too cluttered, contains multiple interfering objects, or has areas with high texture conflict. For example, if a keyframe background is a crowded shelf, this constraint will fail.
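As a rough illustration of an entropy-based complexity measure, the sketch below computes the Shannon entropy of the gray-level histogram of the background pixels. This is a simplification chosen for illustration: the description speaks of texture entropy without defining it, and the threshold of 5.0 bits is a placeholder.

```python
import math
from collections import Counter
from typing import Sequence


def gray_entropy(pixels: Sequence[int]) -> float:
    """Shannon entropy, in bits, of the gray-level distribution."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def background_passes(pixels: Sequence[int], max_entropy: float = 5.0) -> bool:
    # A simple background concentrates on few gray levels, giving low entropy.
    return gray_entropy(pixels) <= max_entropy


plain_wall = [200] * 256          # a single gray level
cluttered = list(range(256))      # every gray level appears once
```

The plain wall yields zero entropy and passes, while the cluttered example reaches the 8-bit maximum and fails, in the same way a crowded shelf background would.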
[0036] The constraint decision unit receives the judgment results of all seven constraints. If all constraints are passed, the keyframe is allowed to proceed to the subsequent subtitle repair and semantic translation process. If any constraint fails, the decision unit outputs a frame failure command.
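The logical AND performed by the constraint decision unit can be sketched as follows. The seven checks read hypothetical, precomputed fields from a feature dictionary; the field names and the sharpness and entropy thresholds are illustrative, while the 20-degree pose limit follows the posture constraint described above.

```python
from typing import Callable, Dict, List, Tuple

SHARPNESS_THRESHOLD = 100.0  # illustrative value, not from the source
ENTROPY_THRESHOLD = 5.0      # illustrative value, not from the source

# Each check reads hypothetical fields produced by upstream detection modules.
CHECKS: Dict[str, Callable[[dict], bool]] = {
    "single_person": lambda f: f["person_count"] == 1,
    "subject_complete": lambda f: not f["face_cropped"] and not f["torso_truncated"],
    "pose": lambda f: f["mean_pose_deviation_deg"] <= 20.0,
    "sharpness": lambda f: f["laplacian_variance"] > SHARPNESS_THRESHOLD,
    "no_occlusion": lambda f: not f["face_occluded"],
    "lips_visible": lambda f: f["lips_visible"],
    "background_simple": lambda f: f["background_entropy"] <= ENTROPY_THRESHOLD,
}


def decision_gate(features: dict) -> Tuple[bool, List[str]]:
    """Logical AND over all constraints; also report which ones failed."""
    failed = [name for name, check in CHECKS.items() if not check(features)]
    return (not failed, failed)


good = {
    "person_count": 1, "face_cropped": False, "torso_truncated": False,
    "mean_pose_deviation_deg": 8.0, "laplacian_variance": 350.0,
    "face_occluded": False, "lips_visible": True, "background_entropy": 3.2,
}
bad = dict(good, person_count=2, face_occluded=True)
```

Returning the list of failed constraint names alongside the boolean is a design choice of this sketch; it gives the error log something concrete to record when a frame failure command is issued.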
[0037] Upon receiving a frame failure instruction, the system reads the sequence of candidate frames sorted by quality score from the keyframe and candidate keyframe lists. The system then sends the next candidate frame into the multi-level constraint verification process, in order of score from highest to lowest. The system repeats the verification process until a candidate frame passes all constraints. If all candidate frames fail, the system marks the current task as a frame selection failure and writes an error log. For example, if the optimal keyframe fails the posture constraint, the system automatically switches to the second-ranked candidate frame. If this frame passes all constraints, it becomes the new keyframe and the process continues.
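The sequential switching behavior can be sketched as a simple loop over the ranked frames. The function name, the string frame identifiers, and the use of a plain list as the error log are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Sequence


def pick_verified_keyframe(ranked_frames: Sequence[str],
                           passes_all_constraints: Callable[[str], bool],
                           error_log: List[str]) -> Optional[str]:
    """Try frames best-first and return the first one that passes verification."""
    for frame_id in ranked_frames:  # already sorted by score, highest first
        if passes_all_constraints(frame_id):
            return frame_id
    # Every candidate failed: mark the task and record the failure.
    error_log.append("frame selection failure: no candidate passed verification")
    return None


log: List[str] = []
ranked = ["frame-100", "frame-400", "frame-200", "frame-500"]
# Assume the best frame fails the posture constraint and the others pass.
chosen = pick_verified_keyframe(ranked, lambda f: f != "frame-100", log)
none_ok = pick_verified_keyframe(ranked, lambda f: False, log)
```

In the first call the second-ranked frame is promoted to keyframe; in the second call no frame passes, so the function returns `None` and one entry is written to the log.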
[0038] This verification process deeply integrates the requirements for generating audio-visual videos with visual algorithms. All constraint thresholds are preset for audio-driven lip-syncing tasks, ensuring that the selected keyframes meet the input conditions for subsequent image rewriting and lip-syncing. Inferior frames are directly intercepted through a pre-emptive hard constraint gate, reducing invalid generation calculations and improving overall process stability and engineering throughput.
[0039] S104: Perform joint detection and local repair on the three branches of subtitles, text, and color block regions in the keyframe. During the repair process, the lip expansion area is forcibly excluded to protect the lip area and ensure the compatibility of downstream audio drivers.
[0040] In one implementation, feature extraction processing is performed on the keyframe image region, the lip protection region, and the repair constraints to generate subtitle text region features, color block overlay region features, lip outward protection features, and downstream audio driver adaptation features. The keyframe image region features include spatial position and overlap data of subtitles, text, and color blocks determined through three branches: color segmentation, geometric contour analysis, and text region detection. The lip protection region features include lip outward exclusion zone data defined based on facial key points. The downstream audio driver adaptation features include constraint data indicating that the lip region is unblurred, unobstructed, and has a complete lip shape. The system takes the keyframe image, which has undergone multi-level constraint verification before generation, as input and initiates a multi-dimensional feature extraction process. The system performs full-domain feature analysis on the keyframe image, extracting four types of core features. The first type is the subtitle text region features, which extract high-contrast text regions in the image through the color segmentation branch, regular rectangular overlay regions through the geometric contour analysis branch, and character structure regions through the text region detection branch, ultimately obtaining the spatial position and overlap data of the subtitle text. The second category is color block overlay region features. The system detects solid color blocks and graphic overlay regions with non-natural textures in the image, recording the position and coverage of the color blocks. The third category is lip outward protection features. The system locates the lip region based on a facial key point detection model, and expands a safe range outward from the lip key points to form lip forbidden zone data. 
This area is prohibited from being modified by any repair algorithm. The fourth category is downstream audio-driven adaptation features. The system detects the clarity, occlusion status, and lip shape integrity of the lip region, forming constraint data for lip shape synthesis. The system outputs the above four types of features uniformly as the basis for subsequent joint detection and repair.
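One way to derive the lip forbidden zone from facial key points is to take the bounding box of the lip points and grow it by a margin, as in the sketch below. The margin ratio of 0.5 and the rectangular shape of the zone are assumptions; the description only states that a safe range is expanded outward from the lip key points.

```python
from typing import Sequence, Tuple

Point = Tuple[int, int]          # (x, y) pixel coordinates
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def lip_exclusion_zone(lip_points: Sequence[Point], margin_ratio: float,
                       width: int, height: int) -> Box:
    """Bounding box of the lip key points, expanded outward and clipped to the image."""
    xs = [p[0] for p in lip_points]
    ys = [p[1] for p in lip_points]
    margin_x = int(round((max(xs) - min(xs)) * margin_ratio))
    margin_y = int(round((max(ys) - min(ys)) * margin_ratio))
    return (max(0, min(xs) - margin_x),
            max(0, min(ys) - margin_y),
            min(width, max(xs) + 1 + margin_x),
            min(height, max(ys) + 1 + margin_y))


# Four hypothetical lip key points: left corner, right corner, top, bottom.
lips = [(40, 60), (60, 60), (50, 55), (50, 65)]
zone = lip_exclusion_zone(lips, margin_ratio=0.5, width=100, height=100)
```

Any repair step can then test pixel coordinates against `zone` and skip those that fall inside it.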
[0041] For example, suppose a horizontal subtitle bar exists at the bottom of the keyframe. The color segmentation branch identifies the high-contrast edges of the subtitles, the geometric contour analysis branch determines that the subtitles are regular rectangles, and the text region detection branch identifies the subtitle characters. The three together output the subtitle text region features. At the same time, facial key points locate the lip position and expand outward to form a protective area. The system detects that the lips are unobstructed and have a clear outline, forming downstream audio driver adaptation features.
[0042] The system performs feature extraction for the joint detection and local repair of subtitle, text, and color block regions, generating three-branch detection fusion features, local content repair features, and lip exclusion zone masking features. The three-branch detection fusion features include comprehensive mask data that accounts for spatial location, overlap, text density, and stability across consecutive frames. The local content repair features include processing data for content completion, background reconstruction, masking, and cropping. The lip exclusion zone masking features include restriction data that prevents the repair algorithm from operating on the area around the lips. Based on the detection results of the three branches, the system performs feature fusion and repair feature extraction, generating three categories of processing features. The first category is the three-branch detection fusion feature: the system performs a weighted fusion of the outputs of color segmentation, geometric contour analysis, and text region detection, taking into account spatial location, overlap, text density, and stability across consecutive frames, to generate accurate comprehensive mask data for subtitles and color blocks. The second category is the local content repair feature: the system selects the repair method based on the mask data, namely missing content completion, background texture reconstruction, region masking, or edge cropping, forming executable repair operation data. The third category is the lip exclusion zone masking feature: the system sets the outward protection area of the lips as an unmodifiable region and establishes masking constraints for the repair algorithm so that the repair process does not cover, blur, or distort the lip area. The system combines these three categories of features into a complete set of repair control features.
[0043] The system integrates the detection results from the three branches, accurately locates the subtitle coverage area and generates a mask, while simultaneously locking the lip restricted area. It then determines that the subtitle area should be repaired using background reconstruction, thus forming local content repair features and lip restricted area masking features.
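A weighted fusion of the three branch outputs, with the lip restricted area locked out of the repair mask, might look like the sketch below. The function name `fuse_branch_masks`, the weights, and the 0.5 threshold are assumptions for illustration; the text and frame-stability terms mentioned above are omitted to keep the example small.

```python
from typing import List, Optional, Sequence

Mask = List[List[float]]


def fuse_branch_masks(color: Mask, contour: Mask, text: Mask,
                      weights: Sequence[float] = (0.3, 0.3, 0.4),
                      threshold: float = 0.5,
                      protected: Optional[List[List[int]]] = None) -> List[List[int]]:
    """Per-pixel weighted vote of the three branches; protected pixels are never repaired."""
    w_color, w_contour, w_text = weights
    total = w_color + w_contour + w_text
    fused = []
    for y in range(len(color)):
        row = []
        for x in range(len(color[0])):
            score = (w_color * color[y][x]
                     + w_contour * contour[y][x]
                     + w_text * text[y][x]) / total
            locked = protected is not None and bool(protected[y][x])
            # A pixel enters the repair mask only if the vote passes and it is not locked.
            row.append(1 if score >= threshold and not locked else 0)
        fused.append(row)
    return fused


# Tiny 2x3 example: each branch reports a per-pixel confidence in [0, 1].
color_mask = [[1, 1, 0], [0, 0, 0]]
contour_mask = [[1, 0, 0], [0, 0, 0]]
text_mask = [[1, 1, 1], [0, 0, 0]]
lip_zone = [[0, 1, 0], [0, 0, 0]]  # hypothetical protected pixel at (row 0, col 1)

repair_mask = fuse_branch_masks(color_mask, contour_mask, text_mask, protected=lip_zone)
print(repair_mask)
```

Applying the protection inside the fusion step, rather than after repair, means no downstream repair method ever receives a mask that touches the lips.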
[0044] Based on the image region restoration constraints, the system analyzes the subtitle text region features, color block overlay region features, lip outward protection features, and downstream audio-driven adaptation features, together with the three-branch detection fusion features, local content restoration features, and lip restricted area shielding features. During local restoration, the system verifies that the restoration area avoids the lip restricted area and meets the requirements of audio-driven lip-syncing. It generates a restored keyframe image characterized by no subtitle interference, no text residue, and a clear and complete lip shape, forming an image region restoration result that covers three-branch detection, mandatory lip protection, and downstream adaptation. The system jointly analyzes the restoration constraints and all extracted features, following the order of first protecting the restricted area, then performing restoration, and finally verifying adaptation. The system first verifies that the restoration area completely avoids the lip restricted area, so that the two do not overlap. It then restores the subtitle, text, and color block regions according to the local content restoration features, using background reconstruction or content completion to eliminate interfering elements. After restoration, the system verifies that the lip area remains clear, complete, unblurred, unobstructed, and undistorted, so that it meets the input requirements for audio-driven lip-sync synthesis. Once all verifications pass, the system outputs a restored keyframe image. This image has no subtitle interference and no text residue, has a clear and complete lip shape and a clean visual appearance, and can proceed directly to the cross-modal semantic translation stage. 
The final result is an integrated image region restoration process encompassing three-branch detection, mandatory lip protection, and downstream audio driver adaptation. The system performs background reconstruction and restoration on the bottom subtitle area, avoiding any contact with the lip restricted zone. After restoration, the lips are checked for clarity, completeness, and unobstructed appearance, with no subtitle residue, meeting lip-sync requirements and generating a qualified restored keyframe image.
[0045] S105 inputs the verified and repaired keyframes into the Visual Language (VL) model to perform cross-modal semantic translation, outputting four-dimensional structured text prompts of character features, environment layout, composition parameters, and style tags. It also generalizes the biometric-level unique features into category-level features through a semantic desensitization filter, achieving pixel-to-semantic decoupling.
[0046] In one implementation, the input to the VL model is preprocessed and adapted based on the verified and repaired keyframe images. The image is then subjected to structured translation through a visual language model, constructing four-dimensional prompt word parsing rules based on the image's visual features. The system uses the keyframe images, after subtitle restoration and lip protection, as input data for the visual language model. The system performs size normalization and channel alignment on the input image to ensure the image format meets the model's input requirements. After processing, the image is fed into the visual encoding layer of the visual language model. The model performs global feature parsing on the image through a multi-layer visual feature extraction network, outputting a high-dimensional visual feature vector containing characters, environment, composition, and style. Based on the scenario of voice-over video applications, the system pre-constructs four-dimensional prompt word parsing rules, mapping the visual feature vectors to four parsing dimensions: character features, environment layout, composition parameters, and style tags, establishing a unified conversion framework for subsequent structured text output.
[0047] The system unifies the repaired keyframes with inconsistent resolution and channels into a standard input format, sends them into the visual language model, extracts complete visual features of character clothing, background environment, image composition, and overall style, and then completes feature partitioning according to preset four-dimensional rules.
[0048] The system converts image features into semantic descriptions, using templated output to map visual information such as people, backgrounds, composition, and style precisely from features to text, thus realizing a standardized translation from pixel-level information to semantic-level information. The system initiates this conversion by inputting the high-dimensional visual feature vectors into the text generation layer of the visual language model. The generation layer analyzes the visual features dimension by dimension according to the four-dimensional parsing rules. For character features, it generates text descriptions of clothing style, posture angle, hairstyle outline, and lighting direction. For environment layout, it generates text descriptions of background type, prop position, lighting layout, and spatial depth. For composition parameters, it generates text descriptions of the proportion of the person in the frame, gaze direction, white space ratio, and perspective relationships. For style tags, it generates text descriptions of photographic quality, color tendency, and rendering style. All descriptions are output according to a fixed template, achieving precise mapping from visual features to semantic text and completing the standardized translation from pixel-level information to semantic-level information.
[0049] For example, after the model analyzes a keyframe, it outputs the following. Character features: the character is wearing a simple top and facing the camera. Environment layout: the background is a solid-color wall with even, soft lighting. Composition parameters: the character is centered with appropriate white space. Style tags: realistic photography with natural color tones. Together these form a complete structured text.
[0050] This system employs an adaptive generalization approach, combining biometric and categorical features, to define semantic desensitization regularization terms for unique identity-related features. This enhances the universality of the desensitized descriptions and their ability to avoid infringement. The system initiates a semantic desensitization filtering process, inputting structured text into the desensitization processing module. The module incorporates semantic desensitization regularization terms to perform word-by-word detection and recognition of the text content. The system replaces unique biometric features pointing to a specific natural person with general category-level descriptions. It replaces exclusive identifiers and personalized features with general style descriptions. Through adaptive generalization, it eliminates unique information that can be used for identity recognition while retaining general semantic information such as scene, composition, and style, achieving a standardized conversion from biometric features to categorical features. This effectively eliminates the risk of portrait infringement and content replication at the text level. For example, the system generalizes the unique features in the text, such as a noticeable mole on a person's left eyebrow and wearing a necklace from a certain brand, into a general description of a clean face and simple accessories, thus completing the desensitization process.
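The generalization from identity-specific phrases to category-level descriptions can be illustrated with a small rule-based filter. This is only a sketch under stated assumptions: the patent does not disclose how the desensitization regularization terms are implemented, and the `RULES` table, its regular expressions, and the `desensitize` function are hypothetical.

```python
import re
from typing import List, Tuple

# Hypothetical rule table: (pattern for an identity-specific phrase, generic replacement).
RULES: List[Tuple[re.Pattern, str]] = [
    # Distinguishing facial marks become a generic face description.
    (re.compile(r"\ban? (?:noticeable )?(?:mole|scar|birthmark) (?:on|above|near) [^,;.]+",
                re.IGNORECASE), "a clean face"),
    # Branded accessories become a generic accessory description.
    (re.compile(r"\ban? (?:necklace|watch|bracelet) from [^,;.]+", re.IGNORECASE),
     "simple accessories"),
]


def desensitize(text: str) -> Tuple[str, int]:
    """Apply every rule in order; return the rewritten text and the number of replacements."""
    replaced = 0
    for pattern, generic in RULES:
        text, count = pattern.subn(generic, text)
        replaced += count
    return text, replaced


raw = "a noticeable mole on the left eyebrow, wearing a necklace from BrandX, facing the camera"
clean, hits = desensitize(raw)
print(clean, hits)
```

Scene, composition, and style phrases pass through unchanged because no rule matches them, which mirrors the stated goal of removing only information usable for identity recognition.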
[0051] The system integrates and optimizes the structured prompt text by combining the semantic translation output with the desensitization filtering rules, generating standardized semantic prompts that contain character features, environment layout, composition parameters, and style tags. These prompts serve as the pre-generation features for image-text cross-modal decoupling and compliant generation. The system merges the semantic translation output with the desensitized text and splices the descriptions in the fixed order of character features, environment layout, composition parameters, and style tags to form a continuous and fluent standardized semantic prompt. The system then performs logical verification on the spliced prompt text to ensure that the descriptions of the dimensions neither conflict nor repeat one another. The resulting standardized semantic prompt fully represents the cross-modal decoupling information from image to text while meeting the pre-generation requirements for compliant generation, and it can be input directly into the text-to-image model for subsequent generative image rewriting. For example, the system splices the desensitized four-dimensional description into: a figure wearing a simple top, facing the camera, with a solid-color wall background, even and soft lighting, the figure centered, and moderate white space in the image, in an overall style of realistic photography with natural tones. This forms a standard prompt text that can be used directly for generation.
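Splicing the four dimensions in a fixed order with a basic redundancy check could be sketched as below. The field names in `FIELD_ORDER` and the `build_prompt` function are illustrative assumptions; only duplicate removal is shown, and conflict detection between dimensions is left out because the text does not describe how it is done.

```python
from typing import Dict, List

# Fixed splice order taken from the description above; the key names are hypothetical.
FIELD_ORDER = ("character_features", "environment_layout",
               "composition_parameters", "style_tags")


def build_prompt(fields: Dict[str, List[str]]) -> str:
    """Join the four dimensions in fixed order, dropping repeated phrases."""
    missing = [name for name in FIELD_ORDER if not fields.get(name)]
    if missing:
        raise ValueError(f"missing prompt dimensions: {missing}")
    seen = set()
    parts = []
    for name in FIELD_ORDER:
        for phrase in fields[name]:
            key = phrase.strip().lower()
            # Case-insensitive de-duplication keeps the first occurrence only.
            if key and key not in seen:
                seen.add(key)
                parts.append(phrase.strip())
    return ", ".join(parts)


prompt = build_prompt({
    "character_features": ["a figure wearing a simple top", "facing the camera"],
    "environment_layout": ["solid-color wall background", "even and soft lighting"],
    "composition_parameters": ["figure centered", "moderate white space"],
    "style_tags": ["realistic photography", "natural tones", "Realistic photography"],
})
print(prompt)
```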
[0052] This process deeply integrates visual language models with the compliant rewriting requirements of spoken video. Four-dimensional structured rules ensure complete and clear semantic descriptions. De-identification regularization terms achieve biometric generalization, meeting portrait rights compliance requirements. Standardized output ensures the input stability of the text-to-image model. The entire process achieves secure decoupling from pixel images to semantic text, preserving core information while completely avoiding infringement risks, providing compliant, stable, and high-quality input conditions for subsequent generation stages.
[0053] S106 uses a text-to-image (T2I) model to generate images from the structured text prompts. It combines style and composition parameters to dynamically match the classifier-free guidance (CFG) strength, sampling steps, and ControlNet weights into a generation parameter chain, and introduces latent space semantic consistency verification to complete generative image rewriting.
[0054] In one implementation, standardized semantic prompts and dynamic parameters are integrated based on the constraints of voice-over scene generation and latent space semantic alignment rules. A text-image generation dual-track mapping model maps semantic prompt features and generation parameter chains to the latent space creation dimension. The system globally integrates the de-identified standardized semantic prompts with the voice-over scene generation constraints and latent space semantic alignment rules. The system inputs the integrated semantic prompt features and dynamic parameters into the text-image generation dual-track mapping model. The model synchronously projects text semantic features and generation parameter chains into the model's latent space, completing the dimensional transformation from text information to the creation space and establishing a unified semantic benchmark for subsequent image generation. The system combines semantic prompts such as a centered, frontal portrait, a simple background, and a realistic style with the voice-over video generation constraints, projecting them into the latent space through the dual-track mapping model to form a stable creation benchmark.
[0055] Using a multi-parameter collaborative scheduling mechanism, the system dynamically routes on style tags and composition parameters and adapts the generation parameters to output image creation alignment results. The system initiates the multi-parameter collaborative scheduling mechanism and routes the generation parameters based on the style tags and composition parameters in the semantic prompts. It automatically matches key parameters such as classifier-free guidance (CFG) strength, sampling steps, and control network weights to the text content, forming a generation parameter chain adapted to the current semantics. Through this dynamic parameter adaptation, the system keeps the generation process consistent with the semantic prompts and outputs image alignment results that meet the creative requirements. For example, when the semantic prompts include a realistic photography style, the system routes to a parameter combination with high CFG strength, a high number of sampling steps, and standard control network weights, matching the requirements of realistic-style generation.
[0056] By combining latent space semantic consistency constraints with rationality regularization of the generation process, the system removes compliance desensitization threshold checks and retains only real-time latent space semantic deviation checks. This ensures that the generated image retains the environmental logic and character's temperament rather than being a mechanical replica, suppressing excessive similarity between the generated result and the original image. The system generates a rewritten target image that unifies character structure, background logic, and style. Throughout the generation process, the system performs real-time latent space semantic consistency checks, continuously comparing the semantic alignment between the current generated latent state and the original semantic prompts. The system removes compliance desensitization threshold checks and retains only latent space deviation monitoring. When the deviation exceeds a preset range, the system automatically adjusts the sampling strategy and generation step size, implementing rationality regularization for the generation process. This mechanism ensures that the generated image follows the original environmental logic and character's temperament, avoiding mechanical replication of the original image and suppressing excessive similarity between the generated result and the original image. When the latent state deviates from the simple background semantics during generation, the system automatically corrects the sampling direction, maintaining the simple background attribute and not copying the specific textures and details of the original image.
[0057] After completing the entire generation and verification process, the system outputs a rewritten target image that includes standardized character structure, reasonable background logic, and a unified style. This image retains the core features of the voiceover scene while achieving pixel-level decoupling from the original keyframes, meeting both compliant rewriting and downstream audio-driven requirements. The system ultimately outputs a rewritten image with a proper posture, clean background, and unified style, without retaining facial details and textures from the original image, only inheriting the semantic level of character demeanor and environmental logic.
[0058] In another implementation, the system employs a text-to-image (T2I) model based on a diffusion architecture as the core generation model. The system uses standardized semantic prompts output by a visual language model as the sole text input condition for the T2I model. The system inputs the semantic prompts into the T2I model's text encoder, which encodes fixed-dimensional text semantic features. The system then loads the anonymized standardized semantic prompt results. The standardized semantic prompts depict a young woman facing the camera, against a solid-color wall background, with the subject centered, exhibiting a realistic photographic style. The anonymization process replaces the unique facial features with general category features, eliminating identity-specific information.
[0059] The system integrates standardized semantic prompts and dynamic parameters based on constraints for voice-over scene generation and latent space semantic alignment rules. Constraints for voice-over scene generation include an unobstructed face, clear lip contours, a background free of copyright notices and sensitive information, and a 16:9 aspect ratio. Latent space semantic alignment rules include maintaining a centered composition, a realistic style, and a face-to-face orientation. The integration process combines semantic prompts, scene constraints, and alignment rules into a unified input sequence.
[0060] The system inputs the integrated semantic prompt features and dynamic parameters into the text-image generation dual-track mapping model. This model includes a text encoding branch and a parameter encoding branch. The text encoding branch performs multi-level linear transformations and feature normalization on the semantic prompt features. The parameter encoding branch performs vector embedding and dimension alignment on the dynamic parameters. The dual-track mapping model encodes both kinds of features simultaneously and projects the fused feature vector into the latent space of the T2I model, thus transforming the text information into the creation dimension of the latent space. The system establishes a fixed semantic baseline vector within the latent space, corresponding to a centered, frontal portrait with a simple background and a realistic photographic style, suitable for use in voice-over scenarios.
[0061] The system initiates a multi-parameter collaborative scheduling mechanism. Based on the style tag and composition parameters in the semantic prompts, the system dynamically routes the generation parameters. In this example, the style tag is "realistic photography," and the composition parameters are "center subject" and "16:9 aspect ratio." The system has a built-in parameter mapping table. The realistic photography style corresponds to a high classifier-free guidance (CFG) strength range. A centered subject composition corresponds to a standard control network weight range. High-resolution images correspond to a high sampling step range. The system automatically matches the CFG strength, sampling steps, and control network weights according to the mapping table. The system forms a generation parameter chain adapted to the current semantics, containing a CFG strength of 10, a sampling step count of 30, and a control network weight of 0.75. The system injects the generation parameter chain into the inference configuration of the T2I model, keeping the generation process consistent with the semantic prompts. The system outputs an image creation alignment result that meets the creative requirements.
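The mapping-table lookup that produces the parameter chain can be sketched as a few dictionary lookups. The values 10, 30, and 0.75 come from the example above; the table layout, the `route_parameters` function, and the fallback defaults (7.0, 20, 0.5) are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParameterChain:
    cfg_scale: float          # classifier-free guidance strength
    sampling_steps: int
    controlnet_weight: float


# Hypothetical mapping tables; only the listed entries are taken from the text.
STYLE_TO_CFG = {"realistic photography": 10.0}
COMPOSITION_TO_CONTROLNET = {"center subject": 0.75}
HIGH_RES_STEPS = 30

# Assumed fallbacks for tags that are not in the tables.
DEFAULT_CFG, DEFAULT_STEPS, DEFAULT_CONTROLNET = 7.0, 20, 0.5


def route_parameters(style_tag: str, composition: str, high_resolution: bool) -> ParameterChain:
    """Look up each generation parameter from its own table, falling back to defaults."""
    return ParameterChain(
        cfg_scale=STYLE_TO_CFG.get(style_tag.lower(), DEFAULT_CFG),
        sampling_steps=HIGH_RES_STEPS if high_resolution else DEFAULT_STEPS,
        controlnet_weight=COMPOSITION_TO_CONTROLNET.get(composition.lower(), DEFAULT_CONTROLNET),
    )


chain = route_parameters("realistic photography", "center subject", high_resolution=True)
print(chain)
```

Keeping each parameter in its own table lets style, composition, and resolution be routed independently, which matches the three correspondences the paragraph lists.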
[0062] The system incorporates latent space semantic consistency constraints to apply rationality regularization to the generation process. It removes compliance and anonymization threshold checks, retaining only real-time latent space semantic deviation checks. At each sampling step, the system extracts the current latent state vector and calculates its cosine similarity with the semantic baseline vector. The system continuously compares the semantic alignment of the generated latent state with the original semantic prompts. When the latent state deviates from the semantic requirement of a concise background during generation, the cosine similarity falls below a preset threshold. The system determines that the deviation exceeds a preset range, automatically reduces the noise input intensity of the current sampling step, automatically corrects the noise prediction direction of the generation step, and applies rationality regularization to the generation process, bringing the latent state back within the semantic baseline range.
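The per-step deviation check reduces to a cosine similarity against the baseline vector followed by a conditional correction. The sketch below assumes the latent state has already been projected into the same vector space as the baseline, which the text does not detail; the threshold of 0.8 and the damping factor of 0.5 are placeholder values, and `check_step` is a hypothetical name.

```python
import math
from typing import Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def check_step(latent: Sequence[float], baseline: Sequence[float],
               noise_scale: float, threshold: float = 0.8,
               damping: float = 0.5) -> Tuple[float, float, bool]:
    """Return (similarity, noise scale to use, whether a correction was applied)."""
    similarity = cosine_similarity(latent, baseline)
    if similarity < threshold:
        # Deviation detected: reduce the noise intensity for this sampling step.
        return similarity, noise_scale * damping, True
    return similarity, noise_scale, False


baseline_vec = [1.0, 0.0, 0.0]
aligned = check_step([0.9, 0.1, 0.0], baseline_vec, noise_scale=1.0)
drifted = check_step([0.1, 0.9, 0.3], baseline_vec, noise_scale=1.0)
print(aligned, drifted)
```

Only the noise-reduction half of the correction is shown; adjusting the noise prediction direction depends on the sampler and is not modeled here.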
[0063] Through the aforementioned mechanisms, the system ensures that the generated image follows the original environment logic and the character's temperament, avoids mechanical replication of the original image, and suppresses excessive similarity between the generated result and the original image. After the entire generation and verification process is complete, the system outputs the rewritten target image. The rewritten target image has a standardized character structure, with normal facial proportions and no limb deformities. It has reasonable background logic, with no copyright information and no messy textures. It has a unified style and temperament, is realistic and clear, and has soft lighting. It does not retain facial details or textures from the original image and inherits only the semantic-level character temperament and environment logic, so it is decoupled from the original keyframes at the pixel level. The rewritten target image meets the compliant rewriting requirements and the downstream audio-driven adaptation requirements: its lip area is complete and clear, and it can be used directly for lip-sync synthesis.
[0064] S107 performs multi-dimensional post-generation quality checks on the generated images, covering human body structure, face similarity compliance threshold verification, background copyright, audio-driven adaptation, and clarity verification. If an image fails to meet the requirements, multi-level rollback is triggered.
[0065] In one implementation, the rewritten target image and the generation parameters are integrated based on generated-image quality standards, portrait rights compliance standards, and downstream audio-driven adaptation standards. A dual-track verification model for multi-dimensional quality inspection and multi-level rollback maps the verification items for human body structure, facial similarity, background copyright, audio-driven adaptation, and clarity to a pass / fail judgment space. The system takes the rewritten target image and generation parameters as input and globally integrates the three sets of standards. It feeds the integrated standards into the dual-track verification model, which maps the five types of verification items (human body structure integrity, facial similarity, background copyright, audio-driven adaptation, and clarity) to a unified pass / fail judgment space, establishing a comprehensive quantitative judgment benchmark. For example, the system integrates the image quality requirements, portrait rights similarity requirements, and lip-sync adaptation requirements for images generated from voice-over video, and maps the five types of detection items to a unified judgment space through the dual-track verification model, forming a complete pass / fail judgment benchmark.
[0066] Using a multi-dimensional parallel quality inspection mechanism, the system outputs image quality and compliance verification results through a shared base image layer and independent verification branches. The system initiates the parallel quality inspection process, using the shared base image layer as a unified input and running five checks simultaneously in independent branches. The first check verifies the integrity of the human body structure: a hand keypoint model verifies the number of fingers and the joint morphology, and a face analysis model verifies the symmetry of facial features and the plausibility of eyelid and lip closure. The second check is face similarity compliance verification: facial feature vectors are extracted from the generated image and the original keyframe, and their cosine similarity is calculated. The third check is background legality verification: an object detection model scans the background area to identify trademarks, copyright marks, and sensitive symbols. The fourth check is downstream audio-driven compatibility verification: it detects frontal face angle deviation, lip occlusion status, and facial illumination uniformity. The fifth check is sharpness and image quality evaluation: it calculates the no-reference quality index and the Laplacian variance. All branches output independent verification results in parallel, and the system aggregates them into the global image quality and compliance verification result. For example, the branches share the same generated image and simultaneously complete five tasks: finger and facial feature detection, face similarity calculation, background trademark scanning, lip shape matching detection, and sharpness calculation, with the results output in parallel.
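Of the five checks, the Laplacian variance used for sharpness is simple enough to show in full. The sketch below applies the standard 4-neighbor Laplacian to a grayscale image given as nested lists; a production system would more likely use an image library, and the function name is an assumption.

```python
from typing import List, Sequence


def laplacian_variance(gray: Sequence[Sequence[float]]) -> float:
    """Variance of the 4-neighbor Laplacian response over interior pixels.

    Low values indicate few edges (a blurry or flat image); high values indicate sharp detail.
    """
    height, width = len(gray), len(gray[0])
    if height < 3 or width < 3:
        raise ValueError("image must be at least 3x3")
    responses: List[float] = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(gray[y - 1][x] + gray[y + 1][x]
                             + gray[y][x - 1] + gray[y][x + 1]
                             - 4 * gray[y][x])
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


flat = [[128] * 5 for _ in range(5)]                      # no edges at all
checker = [[255 if (x + y) % 2 == 0 else 0 for x in range(5)] for y in range(5)]
print(laplacian_variance(flat), laplacian_variance(checker))
```

The acceptance threshold for this value depends on resolution and content, so it has to be calibrated on real samples rather than fixed in advance.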
[0067] The generated images are regularized for reasonableness based on the downstream lip-sync adaptation constraints of the voice-over video, and a face similarity compliance threshold is applied. The system applies the following constraints: face similarity within 0.60 to 0.65, frontal face angle deviation of at most 15°, no-reference image quality metrics (BRISQUE / NIQE and Laplacian variance) meeting their thresholds, intact human body structure, and no copyright-sensitive information in the background. These constraints ensure that images meet portrait rights compliance requirements, have normal structure, and are usable for audio-driven synthesis. They suppress deviations such as image distortion, infringement, and incompatibility with downstream components, and the system generates quality inspection and rollback control information that includes a pass / fail conclusion and a rollback instruction from the four-level rollback state machine. The system combines the downstream lip-sync adaptation constraints for voice-over video to apply reasonableness regularization to the generated images and judges each item against its preset constraint threshold. Facial similarity must fall within the compliance range, the frontal face angle deviation must not exceed 15 degrees, the no-reference image quality metrics and Laplacian variance must meet their threshold requirements, the human body structure must remain intact, and the background must be free of copyright-sensitive information. Through these constraints, the system ensures that images meet portrait rights compliance requirements, have normal human body structure, and can be used directly for audio-driven video synthesis, effectively suppressing issues such as image distortion, infringement, and incompatibility with downstream components. 
For example, the system determines that a generated image has a facial similarity of 0.62 and a frontal angle deviation of 10 degrees, that its clarity index meets the standard, that the fingers and facial features are normal, and that the background has no trademarks, so all constraints are met.
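The item-by-item judgment can be sketched as a function that returns the list of failed constraints. The similarity range and the 15° limit are taken from the text; the `QualityMetrics` record, the `failed_checks` name, and the Laplacian variance floor of 100 are assumptions, and BRISQUE / NIQE are omitted because they require a trained model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class QualityMetrics:
    face_similarity: float      # cosine similarity to the original keyframe face
    frontal_angle_deg: float    # deviation from a frontal pose
    laplacian_variance: float   # sharpness proxy
    structure_ok: bool          # fingers and facial features look normal
    background_clean: bool      # no trademarks or copyright marks detected


def failed_checks(m: QualityMetrics,
                  similarity_range: Tuple[float, float] = (0.60, 0.65),
                  max_angle_deg: float = 15.0,
                  min_laplacian_variance: float = 100.0) -> List[str]:
    """Return the names of all failed constraints; an empty list means the image passes."""
    low, high = similarity_range
    failures = []
    if not (low <= m.face_similarity <= high):
        failures.append("face_similarity")
    if abs(m.frontal_angle_deg) > max_angle_deg:
        failures.append("frontal_angle")
    if m.laplacian_variance < min_laplacian_variance:
        failures.append("sharpness")
    if not m.structure_ok:
        failures.append("body_structure")
    if not m.background_clean:
        failures.append("background")
    return failures


good = QualityMetrics(0.62, 10.0, 250.0, True, True)
bad = QualityMetrics(0.80, 20.0, 250.0, True, True)
print(failed_checks(good), failed_checks(bad))
```

Returning the failed item names rather than a bare boolean gives the rollback stage something to act on, for instance choosing a prompt adjustment that targets the specific failure.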
[0068] The system generates a final pass / fail conclusion based on the results of all verification items. If all verification items pass, a pass conclusion is output, and the image proceeds to the subsequent traceability output stage. If any item fails, a fail conclusion is output, triggering the four-level rollback state machine to generate the corresponding rollback instruction. The rollback instructions, in priority order, are: (1) fine-tune the prompt words and regenerate; (2) switch the generation seed and parameter chain; (3) enable an alternative keyframe and re-execute cross-modal semantic translation; and (4) mark the task as failed and write it to the compensation pool. The system's final output is quality control and rollback control information containing the pass / fail conclusion and the rollback instructions. For example, if a generated image is deemed unqualified due to an abnormal finger structure, the system triggers the first-level rollback instruction, fine-tunes the prompt words, and re-executes the generation process.
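The four-level rollback order can be expressed as a small escalation loop. This sketch assumes one retry per level and a caller-supplied `attempt` callback that regenerates and re-checks the image; the enum names, the callback shape, and `run_with_rollback` are hypothetical.

```python
from enum import IntEnum
from typing import Callable, List, Optional, Tuple


class RollbackLevel(IntEnum):
    TUNE_PROMPT = 1             # fine-tune the prompt words and regenerate
    SWITCH_SEED_AND_PARAMS = 2  # switch the generation seed and parameter chain
    USE_ALTERNATE_KEYFRAME = 3  # redo semantic translation on another keyframe
    MARK_FAILED = 4             # give up and write the task to the compensation pool


RECOVERABLE = (RollbackLevel.TUNE_PROMPT,
               RollbackLevel.SWITCH_SEED_AND_PARAMS,
               RollbackLevel.USE_ALTERNATE_KEYFRAME)


def run_with_rollback(
        attempt: Callable[[Optional[RollbackLevel]], bool]
) -> Tuple[str, List[RollbackLevel]]:
    """Run the initial attempt, then escalate one level at a time until a check passes."""
    if attempt(None):
        return "passed", []
    history: List[RollbackLevel] = []
    for level in RECOVERABLE:
        history.append(level)
        if attempt(level):
            return "passed", history
    history.append(RollbackLevel.MARK_FAILED)
    return "failed", history


# Simulated task that only succeeds after the seed and parameter chain are switched.
outcome = run_with_rollback(lambda level: level is RollbackLevel.SWITCH_SEED_AND_PARAMS)
print(outcome)
```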
[0069] S108 performs metadata cleaning and embeds invisible traceability watermarks on qualified images before outputting them. It processes multiple tasks in parallel to achieve batch scheduling and system self-iterative optimization.
[0070] In one implementation, the system uses qualified images that have undergone comprehensive multi-dimensional post-quality inspection as processing objects and initiates a full-domain metadata cleanup process. The system iterates through all additional information in the image file, removing all metadata, including shooting device information, shooting time information, original video source information, and editing history. The system standardizes and reconstructs the image file header and data footer, retaining only the basic information required for image encoding and eliminating additional data that may pose traceability risks and copyright concerns. The cleaned image maintains its image quality and size, removing only irrelevant supplementary information, laying the foundation for subsequent secure output. The system performs cleanup on a qualified generated image, removing additional data such as the original camera model, shooting location, and editing software version, retaining only the image encoding information, achieving file lightweighting and information purification.
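As one concrete illustration of metadata cleanup, the sketch below strips ancillary chunks from a PNG file while leaving the pixel data untouched. The choice of PNG is an assumption (the text does not name a format), as is the set of chunks kept in `KEEP`; `make_chunk` is a helper used only to build a demonstration file.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# Assumed keep-list: chunks needed to decode the pixels. Text, EXIF, and timestamp
# chunks (tEXt, iTXt, zTXt, eXIf, tIME, ...) are dropped.
KEEP = {b"IHDR", b"PLTE", b"tRNS", b"IDAT", b"IEND"}


def make_chunk(chunk_type: bytes, payload: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, payload, CRC over type + payload."""
    crc = zlib.crc32(chunk_type + payload)
    return struct.pack(">I", len(payload)) + chunk_type + payload + struct.pack(">I", crc)


def strip_png_metadata(data: bytes) -> bytes:
    """Copy only the chunks in KEEP; the compressed pixel data is passed through as-is."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    out = [PNG_SIGNATURE]
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        end = pos + 12 + length  # 4 length + 4 type + payload + 4 CRC
        if chunk_type in KEEP:
            out.append(data[pos:end])
        pos = end
        if chunk_type == b"IEND":
            break
    return b"".join(out)


# Build a 1x1 grayscale PNG that carries a software tag, then clean it.
ihdr = make_chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
text = make_chunk(b"tEXt", b"Software\x00SomeEditor 1.0")
idat = make_chunk(b"IDAT", zlib.compress(b"\x00\x00"))
iend = make_chunk(b"IEND", b"")
cleaned = strip_png_metadata(PNG_SIGNATURE + ihdr + text + idat + iend)
print(len(cleaned), b"tEXt" in cleaned)
```

Because the `IDAT` bytes are copied verbatim, image quality and dimensions are unchanged, consistent with the requirement above that only supplementary information is removed.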
[0071] The system embeds a frequency domain digital watermark in images that have undergone metadata cleaning. The watermark embedding process uses an invisible frequency domain transform algorithm: the unique task identifier, authorized usage identifier, and generation time information are encoded into a digital sequence, which is then embedded into the high-frequency component region of the image. The embedded watermark does not alter the image's visual appearance and does not affect image quality or subsequent use, and it is tamper-proof, extractable, and traceable. The watermark information can be extracted using dedicated decoding tools for subsequent ownership verification and process traceability. For example, the system combines the task number, authorization number, and generation time into a digital sequence and embeds it invisibly into the high-frequency region of the image, without altering the image's appearance and while preserving its complete usability.
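The step that turns the three identifiers into a digital sequence can be sketched independently of the embedding itself. The field widths (32-bit task and authorization numbers, 64-bit Unix time), the CRC-32 integrity check, and the function names are assumptions; the frequency-domain embedding of the resulting bits is not shown.

```python
import struct
import zlib
from typing import List, Tuple

PACKET_BYTES = 20  # 4 (task) + 4 (authorization) + 8 (time) + 4 (CRC-32)


def encode_payload(task_id: int, auth_id: int, generated_at: int) -> List[int]:
    """Pack the identifiers, append a CRC-32, and expand to bits (most significant first)."""
    body = struct.pack(">IIQ", task_id, auth_id, generated_at)
    packet = body + struct.pack(">I", zlib.crc32(body))
    return [(byte >> shift) & 1 for byte in packet for shift in range(7, -1, -1)]


def decode_payload(bits: List[int]) -> Tuple[int, int, int]:
    """Rebuild the packet from bits and reject it if the checksum does not match."""
    if len(bits) != PACKET_BYTES * 8:
        raise ValueError("unexpected payload length")
    packet = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | (bit & 1)
        packet.append(byte)
    body, (crc,) = bytes(packet[:-4]), struct.unpack(">I", bytes(packet[-4:]))
    if zlib.crc32(body) != crc:
        raise ValueError("payload failed integrity check")
    return struct.unpack(">IIQ", body)


bits = encode_payload(task_id=42, auth_id=7, generated_at=1781827200)
print(len(bits), decode_payload(bits))
```

A checksum only detects accidental or naive modification of the extracted bits; resisting deliberate tampering would need a keyed signature, which is outside this sketch.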
[0072] The system applies unified encoding to the watermarked images and outputs them in a common image format. Output paths are organized and stored by task number and video identifier, forming result files that can be searched and managed. The system also records the processing status, the timing of key steps, the quality score, the verification results, the generation parameters, and other information for the task, and writes them to the task log library for subsequent process tracing and system iteration. For example, the system saves the processed image to a designated directory and at the same time records the prompts used for that image, the generation parameters, the quality inspection results, the frame number, and other information, forming a complete task archive.
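A minimal sketch of such an archive layout follows. The directory scheme, file naming, and record fields are hypothetical choices for illustration; the application only states that outputs are organized by task number and video identifier and that a log record is written.

```python
import json
from pathlib import PurePosixPath

def output_path(root, task_id, video_id, frame_index, ext="png"):
    """Build root/<task>/<video>/frame_<index>.<ext> (hypothetical layout)."""
    return PurePosixPath(root) / task_id / video_id / f"frame_{frame_index:06d}.{ext}"

def archive_record(task_id, video_id, frame_index, prompt, gen_params,
                   qc_results, quality_score):
    """Serialize one task-log entry as a JSON line."""
    record = {
        "task_id": task_id,
        "video_id": video_id,
        "frame_index": frame_index,
        "prompt": prompt,
        "generation_params": gen_params,
        "qc_results": qc_results,
        "quality_score": quality_score,
    }
    return json.dumps(record, ensure_ascii=False, sort_keys=True)
```

Writing one JSON line per task keeps the log append-only and easy to query later for the self-iteration step.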
[0073] The system uses a parallel processing mechanism for batch video rewriting tasks. A thread pool and a task queue allocate processing resources to multiple video tasks at the same time. An idempotency check is integrated into task execution to prevent the same task from being processed more than once. A hierarchical caching and reuse mechanism caches processed video and feature data to reduce redundant computation. The system integrates failure retry, timeout interruption, dynamic rate limiting, and backpressure control strategies to ensure stable operation under concurrent multi-task load. Each functional module is deployed independently and supports horizontal scaling to meet high-throughput engineering requirements. For example, the system can process ten voice-over video rewriting tasks simultaneously; each task independently executes frame extraction, scoring, verification, repair, translation, generation, and quality inspection without interfering with the others, and overall throughput increases linearly with the number of tasks.
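The combination of a thread pool, an idempotency check, and failure retry can be sketched as follows. The class name `IdempotentRunner`, its parameters, and the use of a task key as the idempotency token are assumptions for illustration; timeout interruption, rate limiting, and backpressure are deliberately left out to keep the example short.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class IdempotentRunner:
    """Run tasks on a thread pool, at most once per task key, with retries."""

    def __init__(self, worker, max_workers=4, max_retries=2):
        self._worker = worker
        self._max_retries = max_retries
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()
        self._futures = {}  # task key -> Future (the idempotency record)

    def submit(self, task_key, payload):
        """Return the existing Future if this key was already submitted."""
        with self._lock:
            if task_key not in self._futures:
                self._futures[task_key] = self._pool.submit(
                    self._run_with_retry, payload)
            return self._futures[task_key]

    def _run_with_retry(self, payload):
        last_error = None
        for _ in range(self._max_retries + 1):
            try:
                return self._worker(payload)
            except Exception as exc:  # retry on any failure in this sketch
                last_error = exc
        raise last_error

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

Submitting the same key twice returns the same `Future`, so duplicate requests share one result instead of triggering a second run.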
[0074] The system stores all compliant image features, prompt templates, generation parameters, constraint thresholds, and quality score data in a vector database. It analyzes historical successful tasks using a sliding-window statistical method. Based on the distribution of successful samples, the system dynamically optimizes the prompt templates of the visual language model to improve semantic translation accuracy. Based on historical compliance rates, it dynamically adjusts the face similarity compliance threshold to balance compliance against the generation success rate. Based on the characteristics of failed tasks, it optimizes the quality score weights and constraint thresholds to improve keyframe selection accuracy. By continuously accumulating effective data, the system improves the stability, success rate, and engineering usability of the overall process. For example, after analyzing nearly a thousand successful tasks, the system may find that a certain type of prompt paired with a certain set of parameters has the highest pass rate, and it then automatically optimizes the output template of the visual language model to improve the generation quality and efficiency of subsequent tasks.
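The threshold adjustment can be sketched with a fixed-size sliding window. The update rule, the window size, the target pass rate, and the step size below are assumptions, as is the direction of the check: the sketch assumes an image passes when its face similarity is at or below the threshold, so raising the threshold loosens the gate. The bounds 0.60 and 0.65 are taken from the range given in claim 6.

```python
from collections import deque

class ThresholdTuner:
    """Nudge a compliance threshold based on the recent pass rate."""

    def __init__(self, threshold=0.62, lo=0.60, hi=0.65,
                 window=100, target_pass_rate=0.8, step=0.005):
        self.threshold = threshold
        self.lo, self.hi = lo, hi
        self.target = target_pass_rate
        self.step = step
        self._outcomes = deque(maxlen=window)  # sliding window of pass/fail

    def record(self, passed):
        self._outcomes.append(bool(passed))

    def pass_rate(self):
        if not self._outcomes:
            return None
        return sum(self._outcomes) / len(self._outcomes)

    def update(self):
        """Loosen when too few tasks pass, tighten when many pass; clamp to [lo, hi]."""
        rate = self.pass_rate()
        if rate is None:
            return self.threshold
        if rate < self.target:
            self.threshold += self.step
        elif rate > self.target:
            self.threshold -= self.step
        self.threshold = min(self.hi, max(self.lo, self.threshold))
        return self.threshold
```

Clamping keeps the adaptive value inside the permitted range no matter how skewed the recent history is.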
[0075] As shown in Figure 2, a video image rewriting system based on candidate frame quality scoring and multi-level constraint control includes the following modules:
The video hierarchical caching and adaptive frame extraction module 201 is used to acquire the voice-over video to be processed and build a hierarchical cache. Based on the characteristics of the voice-over scene, it performs adaptive sparse frame extraction and extracts face, clarity, occlusion, and subtitle information to form a candidate frame set.
The multi-feature keyframe filtering module 202 is used to weight and rank the candidate frame set based on the multi-feature fusion quality scoring function, select the best keyframe and the alternative keyframes, and establish a three-layer elimination and fallback strategy consisting of coarse filtering, fine ranking, and fallback on anomalies.
The keyframe pre-constraint verification module 203 is used to perform multi-level constraint verification on the keyframe before generation. If verification fails, the alternative keyframes are tried in sequence.
The keyframe local intelligent repair module 204 is used to perform three-branch joint detection and local repair on the subtitle, text, and color block regions in the keyframe. During repair, the expanded lip region is forcibly excluded in order to protect the lips and ensure compatibility with downstream audio-driven processing.
The cross-modal semantic structured translation module 205 is used to input the verified and repaired keyframe into the visual language (VL) model to perform cross-modal semantic translation, output four-dimensional structured text prompts covering character features, environment layout, composition parameters, and style tags, and generalize biometric-level unique features into category-level features through a semantic desensitization filter, thereby decoupling pixels from semantics.
The generative image latent space rewriting module 206 is used to input the structured text prompts into a text-to-image (T2I) model, dynamically match the CFG strength, sampling steps, and ControlNet weights according to the style and composition parameters to produce a generation parameter chain, and introduce latent space semantic consistency verification to complete the generative image rewriting.
The multi-dimensional post-generation quality inspection module 207 is used to perform multi-dimensional quality inspection on the generated images, covering human body structure, the face similarity compliance threshold, background copyright, audio-driven adaptation, and clarity. If the inspection fails, multi-level rollback is triggered.
The image output and system iteration module 208 is used to clean the metadata of images that pass quality inspection, embed an invisible traceability watermark, and output the images. It can process multiple tasks in parallel to achieve batch scheduling and self-iterative optimization of the system.
[0076] A computing device includes a memory for storing computer program instructions and a processor for executing the computer program instructions, wherein, when the computer program instructions are executed by the processor, the device is triggered to execute any of the above video image rewriting methods based on candidate frame quality scoring and multi-level constraint control.
[0077] The methods and / or embodiments in this application can be implemented as computer software programs. For example, embodiments of this disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowchart. When the computer program is executed by a processing unit, it performs the functions defined in the methods of this application.
[0078] It should be noted that the computer-readable medium described in this application can be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above. In this application, a computer-readable medium can be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. It will be apparent to those skilled in the art that this application is not limited to the details of the exemplary embodiments described above, and that this application can be implemented in other specific forms without departing from the spirit or essential characteristics of this application.
Claims
1. A video image rewriting method based on candidate frame quality scoring and multi-level constraint control, characterized in that the method comprises:
acquiring the voice-over video to be processed and building a hierarchical cache, and, based on the characteristics of the voice-over scene, performing adaptive sparse frame extraction and extracting face, clarity, occlusion, and subtitle information to form a candidate frame set;
weighting and ranking the candidate frame set based on a multi-feature fusion quality scoring function to select the best keyframe and alternative keyframes, and establishing a three-layer elimination and fallback strategy consisting of coarse filtering, fine ranking, and fallback on anomalies;
performing multi-level constraint verification on the keyframe before generation, and, if verification fails, switching to the alternative keyframes in sequence;
performing three-branch joint detection and local repair on the subtitle, text, and color block regions in the keyframe, wherein the expanded lip region is forcibly excluded during repair in order to protect the lips and ensure compatibility with downstream audio-driven processing;
inputting the verified and repaired keyframe into a visual language (VL) model to perform cross-modal semantic translation, outputting four-dimensional structured text prompts covering character features, environment setting, composition parameters, and style tags, and generalizing biometric-level unique features into category-level features through a semantic desensitization filter, thereby decoupling pixels from semantics;
inputting the structured text prompts into a text-to-image (T2I) model, dynamically matching the CFG strength, sampling steps, and ControlNet weights according to the style and composition parameters to produce a generation parameter chain, and introducing latent space semantic consistency verification to complete the generative image rewriting;
performing multi-dimensional post-generation quality inspection on the generated image, covering human body structure, the face similarity compliance threshold, background copyright, audio-driven adaptation, and clarity, and triggering multi-level rollback if the inspection fails; and
cleaning the metadata of images that pass quality inspection, embedding an invisible traceability watermark, and outputting the images, wherein multiple tasks are processed in parallel to achieve batch scheduling and self-iterative optimization of the system.
2. The video image rewriting method based on candidate frame quality scoring and multi-level constraint control according to claim 1, characterized in that weighting and ranking the candidate frame set based on the multi-feature fusion quality scoring function to select the best keyframe and alternative keyframes, and establishing the three-layer elimination and fallback strategy consisting of coarse filtering, fine ranking, and fallback on anomalies, comprises:
performing weighted fusion of the candidate frame feature set with the multi-feature fusion quality scoring function and applying a dual correction by a time decay coefficient and an image stability coefficient to generate candidate frame quality scores and a ranking result, wherein the candidate frame feature set includes face pose, face area, sharpness, subject integrity, occlusion, subtitle risk, and composition feature data, and the multi-feature fusion quality scoring function includes a subject quality sub-score, a pose usability sub-score, an image cleanliness sub-score, a composition fit sub-score, the time decay coefficient, and the image stability coefficient;
performing layered filtering on the quality scores and ranking result to generate a list of keyframes and alternative keyframes, wherein the list includes the best keyframe, the second-best alternative keyframes, and the coarse-filter-layer quality score threshold parameter; and
processing the quality scores and ranking result together with the list of keyframes and alternative keyframes to generate layered elimination and fallback constraints, wherein the constraints include coarse filtering that eliminates frames scoring below the threshold, fine ranking in descending order of score, and a fallback constraint under which alternative frames are tried in order when an anomaly or failure occurs during generation or quality inspection.
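As an illustrative note on claim 2, the scoring and fallback logic can be sketched as below. The multiplicative form of the two corrections, the weight values, and the coarse threshold are assumptions chosen for the example; the claim names the sub-scores and coefficients but does not fix the formula.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    frame_index: int
    subject: float       # subject quality sub-score in [0, 1]
    pose: float          # pose usability sub-score
    cleanliness: float   # image cleanliness sub-score
    composition: float   # composition fit sub-score
    time_decay: float    # time decay coefficient
    stability: float     # image stability coefficient

# Hypothetical weights; they sum to 1.0.
WEIGHTS = {"subject": 0.35, "pose": 0.25, "cleanliness": 0.25, "composition": 0.15}

def quality_score(c, weights=WEIGHTS):
    """Weighted fusion of sub-scores, corrected by both coefficients."""
    fused = sum(w * getattr(c, name) for name, w in weights.items())
    return fused * c.time_decay * c.stability

def rank_candidates(candidates, coarse_threshold=0.5):
    """Coarse filter (drop low scores), then fine ranking (descending score)."""
    scored = [(quality_score(c), c) for c in candidates]
    kept = [(s, c) for s, c in scored if s >= coarse_threshold]
    kept.sort(key=lambda sc: sc[0], reverse=True)
    return [c for _, c in kept]

def select_with_fallback(ranked, is_usable):
    """Fallback: try frames in rank order until one passes downstream checks."""
    for candidate in ranked:
        if is_usable(candidate):
            return candidate
    return None
```

Here `is_usable` stands in for the pre-generation constraint verification or the post-generation quality inspection, whichever rejects the current keyframe.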
3. The video image rewriting method based on candidate frame quality scoring and multi-level constraint control according to claim 1, characterized in that performing three-branch joint detection and local repair on the subtitle, text, and color block regions in the keyframe, wherein the expanded lip region is forcibly excluded during repair in order to protect the lips and ensure compatibility with downstream audio-driven processing, comprises:
performing feature extraction on the keyframe image regions, the lip protection region, and the repair constraints to generate subtitle and text region features, color block overlay region features, expanded-lip protection features, and downstream audio-driven adaptation features, wherein the keyframe image region features include spatial position and overlap data of subtitles, text, and color blocks determined by three branches, namely color segmentation, geometric contour analysis, and text region detection; the lip protection region features include data for an expanded lip exclusion zone defined from facial key points; and the downstream audio-driven adaptation features include constraint data requiring the lip region to be unblurred, unoccluded, and complete in shape;
performing feature extraction on the joint detection and local repair process for the subtitle, text, and color block regions to generate three-branch detection fusion features, local content repair features, and lip exclusion zone masking features, wherein the three-branch detection fusion features include combined mask data on spatial position, overlap, text density, and stability across consecutive frames; the local content repair features include processing data for content completion, background reconstruction, masking, and cropping; and the lip exclusion zone masking features include restriction data specifying that the repair algorithm is not applied to the expanded lip region; and
based on the image region repair constraints, analyzing the subtitle and text region features, the color block overlay region features, the expanded-lip protection features, and the downstream audio-driven adaptation features together with the three-branch detection fusion features, the local content repair features, and the lip exclusion zone masking features, and verifying during local repair that the repair range avoids the lip exclusion zone and meets the lip-sync requirements of audio-driven processing, so as to generate a repaired keyframe image that is free of subtitle interference and text residue and has a clear and complete lip shape, thereby obtaining an image region repair result that incorporates three-branch detection, forced lip protection, and downstream adaptation.
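As an illustrative note on claim 3, the forced exclusion of the expanded lip region can be sketched with plain pixel sets. Representing masks as sets of (x, y) coordinates, using an axis-aligned box around the lip key points, and the margin value are all simplifying assumptions for this example.

```python
def dilated_box(points, margin, width, height):
    """Bounding box of the lip key points, expanded by `margin` and clipped."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(0, min(xs) - margin), max(0, min(ys) - margin),
            min(width - 1, max(xs) + margin), min(height - 1, max(ys) + margin))

def box_pixels(box):
    """All (x, y) pixels inside an inclusive box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return {(x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)}

def constrain_repair_mask(repair_pixels, lip_points, margin, width, height):
    """Split the detected repair mask into an allowed part and a protected part.

    The allowed part never touches the expanded lip region, so the repair
    step cannot blur or overwrite the lips.
    """
    exclusion = box_pixels(dilated_box(lip_points, margin, width, height))
    allowed = repair_pixels - exclusion
    protected = repair_pixels & exclusion
    return allowed, protected
```

A non-empty `protected` set signals that a subtitle or color block overlaps the lips, which a caller could treat as grounds for switching to an alternative keyframe rather than repairing.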
4. The video image rewriting method based on candidate frame quality scoring and multi-level constraint control according to claim 3, characterized in that inputting the verified and repaired keyframe into the visual language (VL) model to perform cross-modal semantic translation, outputting four-dimensional structured text prompts covering character features, environment setting, composition parameters, and style tags, and generalizing biometric-level unique features into category-level features through the semantic desensitization filter, thereby decoupling pixels from semantics, comprises:
preprocessing and adapting the verified and repaired keyframe image as input to the VL model, performing structured translation of the image through the visual language model, and constructing four-dimensional prompt parsing rules for the visual features of the image;
performing a joint transformation of image features and semantic descriptions, and using templated output to achieve precise feature-to-text mapping for visual information such as people, background, composition, and style, thereby realizing a standardized translation from pixel-level information to semantic-level information;
applying adaptive generalization from biometric features to category features, and defining a semantic desensitization regularization term for features that point to a unique identity, so as to make the desensitized description more general and better able to avoid infringement; and
combining the semantic translation output with the desensitization filtering rules to integrate and optimize the structured prompt text, and generating standardized semantic prompt results containing character features, environment setting, composition parameters, and style tags, the results characterizing the cross-modal decoupling of image and text and serving as the compliant precondition for generation.
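As an illustrative note on claim 4, a rule-based desensitization filter over a four-part structured prompt might look like the sketch below. The field names, the list of identity-pointing keys, and the age-band rule are hypothetical; the claim does not specify how the filter is implemented.

```python
# Hypothetical keys that point to a unique identity and must not reach the T2I model.
IDENTITY_KEYS = {"name", "distinctive_marks", "exact_age", "face_measurements"}

def age_band(age):
    """Generalize an exact age to a decade band, e.g. 34 -> '30s'."""
    return f"{(age // 10) * 10}s"

def desensitize(prompt):
    """Return a copy of the structured prompt with identity-level detail generalized.

    `prompt` has four parts: character, environment, composition, style.
    Only the character part is filtered; the input dict is left unmodified.
    """
    character = dict(prompt["character"])
    if "exact_age" in character:
        character["age_band"] = age_band(character["exact_age"])
    for key in IDENTITY_KEYS:
        character.pop(key, None)
    return {**prompt, "character": character}
```

Category-level attributes such as hair style or clothing pass through unchanged, while identity-pointing fields are either dropped or replaced with a coarser category.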
5. The video image rewriting method based on candidate frame quality scoring and multi-level constraint control according to claim 1, characterized in that inputting the structured text prompts into the text-to-image (T2I) model, dynamically matching the CFG strength, sampling steps, and ControlNet weights according to the style and composition parameters to produce the generation parameter chain, and introducing latent space semantic consistency verification to complete the generative image rewriting, comprises:
based on the voice-over scene generation constraints and the latent space semantic alignment rules, integrating the standardized semantic prompt results with the basis for the dynamic parameters, and mapping the semantic prompt features and the generation parameter chain to the latent space creation dimension through a text-image generation dual-track mapping model;
using a multi-parameter collaborative scheduling mechanism to dynamically route the style tags and composition parameters, and generating a structurally aligned output image with dynamically adapted parameters; and
applying latent space semantic consistency constraints to regularize the generation process, omitting the compliance desensitization threshold verification and retaining only real-time verification of latent space semantic deviation, so that the generated image preserves the environmental logic and the character's temperament rather than mechanically replicating the original, thereby suppressing excessive similarity between the generated result and the original image and generating a rewritten target image with a unified character structure, background logic, style, and temperament.
6. The video image rewriting method based on candidate frame quality scoring and multi-level constraint control according to claim 5, characterized in that performing multi-dimensional post-generation quality inspection on the generated image, covering human body structure, the face similarity compliance threshold, background copyright, audio-driven adaptation, and clarity, and triggering multi-level rollback if the inspection fails, comprises:
based on the generated-image quality standard, the portrait rights compliance standard, and the downstream audio-driven adaptation standard, integrating the rewritten target image with the generation parameters, and mapping the human body structure, face similarity, background copyright, audio-driven adaptation, and clarity verification items to the pass/fail decision space through a dual-track verification model combining multi-dimensional quality inspection and multi-level rollback;
using a multi-dimensional parallel quality inspection mechanism, in which the checks share a basic image layer and are verified in independent branches, to output the image quality and compliance verification results;
combining the downstream lip-sync adaptation constraints of the voice-over video to regularize the generated image, and applying the following constraints: a face similarity compliance threshold θlegal = 0.60 to 0.65, a frontal face angle deviation ≤ 15°, the no-reference image quality indicators BRISQUE/NIQE and Laplacian variance, human body structure integrity, and no copyright-sensitive information in the background, thereby ensuring that the image meets the requirements of portrait rights compliance, normal structure, and usability for audio-driven processing, and suppressing deviations such as deformed, infringing, or downstream-incompatible outputs; and
generating quality inspection and rollback control information that contains the pass/fail conclusion and the rollback instructions of a four-level rollback state machine.
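As an illustrative note on claim 6, the post-generation gate can be sketched as a set of independent checks over pre-computed measurements. The face similarity range and the 15° angle limit come from the claim; the Laplacian variance and BRISQUE limits are hypothetical, NIQE is omitted for brevity, and the sketch assumes an image passes the portrait rights check when its similarity to the original face is at or below θlegal. Mapping the failures to the four rollback levels is not shown because the levels are not defined in this section.

```python
from dataclasses import dataclass

@dataclass
class QCInput:
    face_similarity: float   # similarity between generated and original face
    yaw_deg: float           # deviation from a frontal face, in degrees
    laplacian_var: float     # sharpness proxy (higher is sharper)
    brisque: float           # no-reference quality score (lower is better)
    body_structure_ok: bool
    lip_region_ok: bool      # lips clear, unoccluded, complete
    background_clean: bool   # no copyright-sensitive content detected

def run_quality_gate(m, theta_legal=0.62, max_yaw=15.0,
                     min_laplacian_var=100.0, max_brisque=40.0):
    """Return the list of failed check names; an empty list means the image passes."""
    failures = []
    if m.face_similarity > theta_legal:
        failures.append("face_similarity")
    if abs(m.yaw_deg) > max_yaw:
        failures.append("frontal_angle")
    if m.laplacian_var < min_laplacian_var or m.brisque > max_brisque:
        failures.append("clarity")
    if not m.body_structure_ok:
        failures.append("body_structure")
    if not m.lip_region_ok:
        failures.append("audio_driven_adaptation")
    if not m.background_clean:
        failures.append("background_copyright")
    return failures
```

Returning the names of the failed checks, rather than a single boolean, gives a rollback controller enough information to decide which earlier stage to return to.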
7. A video image rewriting system based on candidate frame quality scoring and multi-level constraint control, characterized in that the system is configured to execute executable instructions so as to perform the video image rewriting method based on candidate frame quality scoring and multi-level constraint control according to any one of claims 1 to 6.
8. An electronic device, characterized in that it comprises: a first processor; and a memory for storing executable instructions of the first processor; wherein the first processor is configured to perform, by executing the executable instructions, the video image rewriting method based on candidate frame quality scoring and multi-level constraint control according to any one of claims 1 to 6.
9. A computing device, the device comprising a memory for storing computer program instructions and a processor for executing the computer program instructions, wherein, when the computer program instructions are executed by the processor, the device is triggered to execute the video image rewriting method based on candidate frame quality scoring and multi-level constraint control according to any one of claims 1 to 6.