A method and system for real-time generation of anime-style virtual character animation

Multi-dimensional style features extracted from anime reference images are combined with portrait video to generate a dynamic semantic identity anchor stream, addressing the problems of style consistency and temporal coherence in anime-style virtual character animation and enabling high-quality real-time generation.

CN122265486A · Pending · Publication Date: 2026-06-23 · HANGZHOU QIGUO CULTURE MEDIA CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HANGZHOU QIGUO CULTURE MEDIA CO LTD
Filing Date
2026-03-12
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing technologies struggle to maintain stylistic consistency, temporal coherence, and interactive realism in anime-style virtual character animations. Furthermore, the balance between computational complexity and real-time performance is difficult to achieve, resulting in unstable and substandard outputs.

Method used

Multidimensional style features are extracted from anime reference images and combined with real-time captured portrait video to generate a dynamic semantic identity anchor stream. Controlled diffusion inference in the latent space then generates the animation frame sequence, which is temporally smoothed and style-corrected.

Benefits of technology

It achieves high stylistic consistency, temporal coherence, and interactive realism in anime-style virtual character animation, improving the stability and quality of the generated results and ensuring real-time generation capabilities.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122265486A_ABST

Abstract

The application provides a method and system for real-time generation of anime-style virtual character animation, comprising: extracting multidimensional style features from an anime reference image, wherein the multidimensional style features include line art features, color features, texture features, and character form features; extracting a portrait key point sequence and an interactive object sequence from a real-time captured portrait video, and generating a dynamic semantic identity anchor stream according to the interactive object sequence; with the portrait key point sequence as a structural constraint and the multidimensional style features as global guidance, embedding the dynamic semantic identity anchor stream into the latent space as a latent variable constraint, and performing controlled diffusion inference in the latent space to obtain an animation latent feature frame sequence; performing three-dimensional causal decoding and temporal smoothing filtering on the animation feature frame sequence to obtain a smooth animation frame sequence; and performing real-time style feedback and style correction on the smooth animation frame sequence to obtain the virtual character animation.

Description

Technical Field

[0001] This invention relates to the field of virtual video technology, and in particular to a method and system for real-time generation of anime-style virtual character animations.

Background Technology

[0002] Anime-style virtual character animation generation is an important technological direction in digital content creation, virtual reality, and digital human interaction, widely used in animation production, virtual anchors, game character generation, and immersive interactive scenarios. This technology typically achieves continuous animation with a specific anime art style by performing pose-driven, motion-smoothing, and stylized rendering on character animation sequences. However, in practical applications, existing technologies still face several challenges.

[0003] First, traditional methods based on single-frame style transfer or static style mapping struggle to maintain style consistency across animation sequences. Because each frame is processed independently, they are prone to flickering and style drift, which degrades the overall viewing experience. Second, existing stylization methods often rely on fixed style losses or global feature constraints and lack fine-grained perception of local structures, texture details, and dynamic changes within animation frames. This leads to over- or under-correction during style adjustment. Furthermore, some methods fail to adequately consider temporal continuity constraints, ignoring the relationships between consecutive frames during style adjustment, which can cause abrupt changes and temporal instability. In addition, with the increasing demand for real-time virtual character animation, existing high-precision stylization algorithms struggle to balance computational complexity and real-time performance, and cannot meet the requirement of real-time generation together with high-quality style expression. Therefore, there is an urgent need for a virtual character animation generation technology that can achieve precise, controllable, and real-time animation style feedback and correction while ensuring temporal stability.

Summary of the Invention

[0004] (I) Technical Problems to Be Solved: To address the shortcomings of existing technologies, this invention provides a real-time generation method and system for anime-style virtual character animations, which has advantages such as high style consistency, strong temporal continuity, and outstanding interactive realism. It solves the problems of difficulty in maintaining artistic style, significant temporal jitter, and low plausibility of character interaction in anime video generation scenarios.

[0005] (II) Technical Solution: To achieve the above objectives, the present invention provides a method for real-time generation of anime-style virtual character animations, comprising the following steps: extracting multidimensional style features from anime reference images, the multidimensional style features including line art features, color features, texture features, and character form features; extracting a portrait key point sequence and an interactive object sequence from real-time captured portrait video, and generating a dynamic semantic identity anchor stream based on the interactive object sequence; using the portrait key point sequence as a structural constraint and the multidimensional style features as global guidance, embedding the dynamic semantic identity anchor stream into the latent space as a latent variable constraint, and performing controlled diffusion inference in the latent space to obtain an animation latent feature frame sequence; performing 3D causal decoding and temporal smoothing filtering on the animation feature frame sequence to obtain a smooth animation frame sequence; and performing real-time style feedback and style correction on the smooth animation frame sequence to obtain the virtual character animation.

[0006] According to a preferred embodiment of the present invention, multi-dimensional style features are extracted from anime reference images, including: The anime reference image is scaled and color space converted, and layered feature extraction is performed to obtain shallow and deep features. Edge detection and curve fitting are performed on the shallow features to obtain the line art features; Color clustering is performed on the shallow features to obtain clustered color clusters, and the spatial color probability distribution corresponding to the clustered color clusters is extracted. Color features are generated based on the clustered color clusters and the spatial color probability distribution. Using the line art features as spatial constraints and the color features as attribute constraints, patch matching and feature distribution calculations are performed on the anime reference image to obtain texture features; Anime characters are identified and separated from the line art features based on the deep features. Face alignment and skeletal proportion analysis are performed on the anime characters. Character morphological features are generated based on the face alignment results and skeletal proportion analysis results.

[0007] According to another preferred embodiment of the present invention, using the line art features as spatial constraints and the color features as attribute constraints, patch matching and feature distribution calculation are performed on the anime reference image to obtain texture features, including: Based on the line art features, the spatial outline of the animation reference image is extracted. Using the spatial outline as the boundary, the animation reference image after color space conversion is decomposed into a set of tile patches with overlapping areas. The patch color features of each patch in the patch set are calculated based on color features, and the similarity of each patch color feature is calculated to obtain a color association matrix. Based on the patch color association matrix, color classification is performed on the color features of each patch to obtain a semantic color patch cluster set; Texture sampling is performed on the tile patches corresponding to each semantic color patch cluster in the semantic color patch cluster set to obtain a local texture feature set, and the texture correlation matrix corresponding to the local texture feature set is calculated. Based on the texture correlation matrix, statistical modeling is performed on the local texture feature set to obtain texture features.

[0008] According to another preferred embodiment of the present invention, extracting the portrait key point sequence and the interactive object sequence from the real-time captured portrait video includes: splitting the real-time captured portrait video into a portrait video frame sequence; performing multi-object detection on the first video frame in the sequence to obtain a primary human bounding box and a primary object bounding box near the human bounding box; performing human pose estimation within the primary human bounding box and extracting primary human key points from the pose estimation results; performing instance segmentation and semantic encoding on the primary object bounding box to extract primary interactive objects; establishing a joint state transition matrix based on the displacement of the primary human key points and the centroid displacement of the primary interactive objects between adjacent video frames, and generating a target motion vector from the joint state transition matrix; using motion vector prediction, extracting the human key points and interactive objects frame by frame from the video frames after the first, according to the target motion vector, to obtain a primary key point sequence and a primary object sequence; and applying confidence-based position correction and temporal smoothing to the primary key point sequence and the primary object sequence to obtain the portrait key point sequence and the interactive object sequence.

[0009] According to another preferred embodiment of the present invention, generating a dynamic semantic identity anchor stream based on the interactive object sequence includes: performing cross-frame association and visibility analysis on the interactive object sequence, based on the portrait key point sequence, to generate an object trajectory sequence with occlusion labels; performing pose analysis on the object trajectory sequence to obtain dynamic object pose features; constructing, from the object trajectory sequence and the dynamic object pose features, a cooperative motion mode between the portrait key point sequence and the interactive object sequence to obtain a time-varying human-object interaction state sequence; performing occlusion decoupling and semantic consistency modeling on the time-varying human-object interaction state sequence to obtain a semantic identity anchor stream; and extracting interaction stability weights from the time-varying human-object interaction state sequence and applying spatiotemporal smoothing constraints to the semantic identity anchor stream based on those weights to obtain the dynamic semantic identity anchor stream.

[0010] According to another preferred embodiment of the present invention, using the portrait key point sequence as a structural constraint and the multidimensional style features as global guidance, embedding the dynamic semantic identity anchor stream into the latent space as a latent variable constraint, and performing controlled diffusion inference within the latent space to obtain an animation latent feature frame sequence includes: topologically encoding the portrait key point sequence to obtain a structural constraint feature sequence, and projecting it into the latent space to obtain a structural constraint latent vector sequence; projecting the multidimensional style features into the latent space to obtain a global style prior distribution; embedding the dynamic semantic identity anchor stream into the latent space as a latent variable constraint to obtain a semantic constraint latent condition sequence; in the latent space, using the structural constraint latent vector sequence as a hard structural constraint and the global style prior distribution as a distribution guidance condition, performing iterative denoising sampling with temporal consistency constraints based on a cross-attention mechanism combined with the semantic constraint latent condition sequence; and performing inter-frame smoothing on the sampled result based on cross-frame latent feature similarity to obtain the animation latent feature frame sequence.
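The final inter-frame smoothing step can be illustrated with a minimal sketch. The text does not specify the similarity measure or the blending rule, so the cosine similarity, the `strength` parameter, and the function names below are all assumptions made for illustration: each latent vector is pulled toward the previous smoothed latent in proportion to how similar the two are, so near-identical frames are stabilized while dissimilar frames are left largely untouched.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)


def smooth_latents(latents, strength=0.5):
    """Blend each latent with the previous smoothed latent.

    The blend weight is strength * max(0, cosine similarity), an assumed rule:
    similar neighbouring frames are smoothed more, dissimilar ones less.
    """
    out = [list(latents[0])]
    for z in latents[1:]:
        prev = out[-1]
        w = strength * max(0.0, cosine(prev, z))
        out.append([(1 - w) * cur + w * p for cur, p in zip(z, prev)])
    return out
```

A real system would operate on per-frame latent tensors rather than flat lists, and would likely apply the smoothing inside the sampling loop; plain lists are used here only to keep the sketch self-contained.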

[0011] According to another preferred embodiment of the present invention, the animation feature frame sequence is subjected to three-dimensional causal decoding and temporal smoothing filtering to obtain a smooth animation frame sequence, including: Tensor quantization and spatiotemporal alignment are performed on the animation feature frame sequence to obtain a reconstructed feature sequence, and three-dimensional causal convolution is performed on the reconstructed feature sequence to obtain an aligned feature sequence. The alignment feature sequence is subjected to three-dimensional causal decoding to obtain the original animation frame sequence; Inter-frame optical flow calculations are performed on the original animation frame sequence to obtain a temporal optical flow residual sequence; Temporal noise filtering is performed on the original animation frame sequence based on the temporal optical flow residual sequence to obtain a smooth animation frame sequence.
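The causal property of the convolution in this step can be sketched along the time axis alone. This is a simplified illustration, not the claimed three-dimensional operator: spatial dimensions are omitted, and the kernel layout (`kernel[0]` weights the current frame, `kernel[i]` the frame `i` steps earlier) is an assumption. The point it shows is that padding only on the past side makes each output frame depend solely on the current and earlier frames.

```python
def causal_temporal_conv(frames, kernel):
    """Causal convolution over time for a list of equal-length feature vectors.

    Output at time t uses frames t, t-1, ..., t-(k-1) only; frames before the
    start of the sequence are treated as zeros (left padding).
    """
    k = len(kernel)
    dim = len(frames[0])
    padded = [[0.0] * dim for _ in range(k - 1)] + [list(f) for f in frames]
    out = []
    for t in range(len(frames)):
        acc = [0.0] * dim
        for i, w in enumerate(kernel):
            src = padded[t + (k - 1) - i]  # frame at time t - i
            acc = [a + w * s for a, s in zip(acc, src)]
        out.append(acc)
    return out
```

Because no future frame is read, each output frame can be emitted as soon as its input frame arrives, which is what makes this style of decoding compatible with streaming generation.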

[0012] According to another preferred embodiment of the present invention, performing temporal noise filtering on the original animation frame sequence based on the temporal optical flow residual sequence to obtain a smooth animation frame sequence includes: taking each original animation frame in the original animation frame sequence in turn as the target animation frame, and calculating a pixel confidence map from the temporal optical flow residual corresponding to the target animation frame; extracting local contrast and edge information from the target animation frame, and constructing a dynamic filtering weight matrix from the local contrast, the edge information, and the pixel confidence map; taking the original animation frame immediately before the target animation frame as the preceding animation frame, and spatially remapping it according to the temporal optical flow residual corresponding to the target animation frame; performing a weighted aggregation of the spatially remapped preceding animation frame and the target animation frame according to the dynamic filtering weight matrix to obtain a denoised animation frame; performing residual concatenation on the denoised animation frame based on the target animation frame, and applying bilateral filtering to the result to obtain a smooth animation frame; and aggregating the smooth animation frames corresponding to all target animation frames into the smooth animation frame sequence.
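The confidence-weighted aggregation at the core of this step can be sketched as follows. The formulas are assumptions chosen for illustration, since the text does not give them: confidence is taken as `exp(-|residual| / sigma)`, the blend weight is scaled down by a normalized edge strength in [0, 1] (local contrast is omitted), and the preceding frame is assumed to be already remapped. The residual and bilateral filtering stages are not shown.

```python
import math


def confidence_map(flow_residual, sigma=1.0):
    """Per-pixel confidence in (0, 1]: small optical-flow residual means high confidence."""
    return [[math.exp(-abs(r) / sigma) for r in row] for row in flow_residual]


def temporal_filter(target, warped_prev, flow_residual, edge_strength,
                    max_blend=0.5, sigma=1.0):
    """Blend the remapped preceding frame into the target frame.

    weight = max_blend * confidence * (1 - edge_strength), an assumed rule:
    consistent-motion, flat regions are smoothed strongly; edges and regions
    with a large flow residual (complex motion, occlusion) are left nearly intact.
    """
    conf = confidence_map(flow_residual, sigma)
    out = []
    for y, row in enumerate(target):
        out_row = []
        for x, cur in enumerate(row):
            w = max_blend * conf[y][x] * (1.0 - edge_strength[y][x])
            out_row.append((1.0 - w) * cur + w * warped_prev[y][x])
        out.append(out_row)
    return out
```

Frames are single-channel 2D lists here for brevity; an actual implementation would work on multi-channel image arrays.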

[0013] According to another preferred embodiment of the present invention, performing real-time style feedback and style correction on the smooth animation frame sequence to obtain the virtual character animation includes: performing multidimensional style extraction on each smooth animation frame in the smooth animation frame sequence to obtain smooth style features; calculating the degree of feature difference between the smooth style features and the multidimensional style features, and generating a style deviation gradient map based on that difference; generating a style compensation operator based on the style deviation gradient map, and performing secondary pixel remapping on the smooth animation frame sequence with the style compensation operator to obtain a style calibration frame sequence; and applying style stability smoothing to the style calibration frame sequence and stitching the smoothed frames together to form the virtual character animation.
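The feedback loop described here (measure the frame's style, compare with the reference, compensate) can be illustrated with a deliberately reduced sketch. It is an assumption-heavy stand-in: the only "style feature" used is the per-channel mean color, the "deviation" is a single global vector rather than a gradient map, and the "compensation operator" is a uniform shift with a hypothetical `gain` parameter. None of these specifics come from the text.

```python
def mean_color(frame):
    """Per-channel mean of a frame given as a 2D list of (c1, c2, c3) tuples."""
    n = sum(len(row) for row in frame)
    return tuple(sum(px[c] for row in frame for px in row) / n for c in range(3))


def style_deviation(frame, reference_mean):
    """Difference between the reference style statistic and the frame's statistic."""
    current = mean_color(frame)
    return tuple(r - c for r, c in zip(reference_mean, current))


def correct_frame(frame, reference_mean, gain=0.5):
    """Shift every pixel part of the way toward the reference statistic, clamped to [0, 255]."""
    dev = style_deviation(frame, reference_mean)
    return [
        [tuple(min(255.0, max(0.0, px[c] + gain * dev[c])) for c in range(3)) for px in row]
        for row in frame
    ]
```

Using a gain below 1 corrects gradually over successive frames, which is one simple way to keep the correction itself from introducing visible jumps.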

[0014] To achieve at least one of the above-mentioned objectives, the present invention further provides a real-time generation system for anime-style virtual character animation, the system comprising a feature extraction module, an anchor point recognition module, a controlled diffusion module, a noise smoothing module, and a style correction module, wherein: the feature extraction module is used to extract multidimensional style features from anime reference images, the multidimensional style features including line art features, color features, texture features, and character form features; the anchor point recognition module is used to extract the portrait key point sequence and the interactive object sequence from real-time captured portrait video, and to generate a dynamic semantic identity anchor stream based on the interactive object sequence; the controlled diffusion module is used to embed the dynamic semantic identity anchor stream into the latent space as a latent variable constraint, with the portrait key point sequence as a structural constraint and the multidimensional style features as global guidance, and to perform controlled diffusion inference in the latent space to obtain the animation latent feature frame sequence; the noise smoothing module is used to perform three-dimensional causal decoding and temporal smoothing filtering on the animation feature frame sequence to obtain a smooth animation frame sequence; and the style correction module is used to perform real-time style feedback and style correction on the smooth animation frame sequence to obtain the virtual character animation.

[0015] The present invention further provides a computer-readable storage medium storing a computer program, which is executed by a processor to implement the above-described method for real-time generation of anime-style virtual character animation.

[0016] (III) Beneficial Effects: Compared with existing technologies, the present invention provides a method and system for real-time generation of anime-style virtual character animations with the following beneficial effects. The method extracts layered features from anime reference images and establishes multiple constraints on structure, color, and semantics between shallow and deep features, achieving decoupled modeling of the anime style. Stable line art features are generated through edge detection and curve fitting, providing clear structural boundaries for subsequent style transfer. Color clustering and spatial color probability distribution modeling create color features with regional consistency, effectively avoiding color drift during generation. Patch-level texture modeling is performed under line art spatial constraints and color attribute constraints, ensuring continuity and consistency of texture features at both the spatial and semantic levels and thereby suppressing flickering noise in the animation sequence. Character recognition, face alignment, and skeletal proportion analysis guided by deep features extract character morphological features, which provide high-level constraints for the consistency of virtual character identity. By jointly modeling the portrait key point sequence and the interactive object sequence, and introducing a joint state transition matrix and a motion vector prediction mechanism, the method achieves collaborative tracking and stable extraction of portraits and interactive objects in the temporal dimension, effectively reducing recognition errors caused by detection jitter, short-term occlusion, or rapid pose changes. By constructing a human-object collaborative motion mode, low-level geometric motion features are mapped to high-level interaction state semantics; the semantic identity anchor stream and interaction stability weights are then introduced to impose spatiotemporal smoothing constraints on the interaction semantics, maintaining the continuity and consistency of interaction identity and semantic expression in complex dynamic interaction scenarios and improving the robustness and interpretability of human-object interaction modeling.

[0017] This real-time generation method for anime-style virtual character animation achieves fine-grained control over the animation generation process by introducing a multi-condition controlled diffusion mechanism that incorporates structural constraints, style priors, and semantic anchors in the latent space. By utilizing the structural constraint latent vector constructed from the sequence of human key points, the stability of the human body's topological structure can be guaranteed at the latent level, effectively avoiding structural distortion problems in the generated animation. By mapping multi-dimensional style features to a global style prior distribution, the diffusion sampling process is guided to always be constrained by the target anime style space, significantly improving the style consistency of the generated results. Simultaneously, by introducing dynamic semantic identity anchor streams as latent variable conditions, the diffusion model explicitly perceives the semantics of human-object interaction and its temporal continuity during the generation process. Combined with temporal consistency constraints and cross-frame smoothing mechanisms in the iterative process, it can improve the stability and coherence of the animation in the temporal dimension while ensuring the quality of a single frame, providing a high-quality and controllable latent feature foundation for subsequent animation decoding and rendering.

[0018] This real-time generation method for anime-style virtual character animation, through tensor quantization reconstruction and spatiotemporal alignment, unifies discrete animation features into a stable spatiotemporal representation, providing consistent input for 3D causal convolution and decoding and effectively avoiding frame order misalignment and scale inconsistency. The combination of 3D causal convolution and causal decoding ensures that animation frame generation relies solely on historical information, guaranteeing the system's real-time generation capability while maintaining motion continuity. By introducing a temporal noise filtering mechanism based on optical flow residuals, strong smoothing can be applied to regions with consistent motion and weak smoothing to regions with complex motion or occlusion, thereby suppressing flicker and jitter while avoiding blurring and motion smearing.

Attached Figure Description

[0019] Figure 1 is a flowchart of the real-time generation method for anime-style virtual character animation according to the present invention.

Detailed Implementation

[0020] The following description is intended to disclose the present invention and enable those skilled in the art to implement it. The preferred embodiments described below are merely examples, and other obvious modifications will occur to those skilled in the art. The basic principles of the invention defined in the following description can be applied to other embodiments, modifications, improvements, equivalents, and other technical solutions that do not depart from the spirit and scope of the invention.

[0021] It is understood that the term "a" should be understood as "at least one" or "one or more", that is, in one embodiment, the number of an element can be one, while in another embodiment, the number of the element can be multiple, and the term "a" should not be understood as a limitation on the number.

[0022] Embodiment 1: With reference to Figure 1, this invention discloses a method for real-time generation of anime-style virtual character animations, the method comprising the following steps: Multidimensional style features are extracted from anime reference images, including line art features, color features, texture features, and character form features.

[0023] The anime reference images refer to reference pictures or image sets with a specific anime style, used to guide the generation of virtual character animations; the portrait video is a real-time captured video containing elements such as real people, interactive objects, and backgrounds.

[0024] In detail, multi-dimensional style features are extracted from anime reference images, including: The anime reference image is scaled and color space converted, and layered feature extraction is performed to obtain shallow and deep features. Edge detection and curve fitting are performed on the shallow features to obtain the line art features; Color clustering is performed on the shallow features to obtain clustered color clusters, and the spatial color probability distribution corresponding to the clustered color clusters is extracted. Color features are generated based on the clustered color clusters and the spatial color probability distribution. Using the line art features as spatial constraints and the color features as attribute constraints, patch matching and feature distribution calculations are performed on the anime reference image to obtain texture features; Anime characters are identified and separated from the line art features based on the deep features. Face alignment and skeletal proportion analysis are performed on the anime characters. Character morphological features are generated based on the face alignment results and skeletal proportion analysis results.

[0025] The scale normalization refers to standardizing the size of the anime reference image by scaling or cropping it proportionally. The color space conversion refers to converting the anime reference image from its original color space to a unified and easily processed color space, such as HSV or Lab color space. The hierarchical feature extraction refers to using shallow and deep components in a pre-trained convolutional neural network or residual neural network (such as VGG network or ResNet network) to extract features from the color space-converted anime reference image. Shallow features are extracted by convolutional layers near the input layer, that is, features including lines, edges, and color points are extracted through convolutional kernels with smaller receptive fields. Deep features are extracted by convolutional layers near the output layer, that is, semantic features are extracted through convolutional kernels with larger receptive fields.

[0026] Edge detection can be performed using the Canny operator or Holistically-Nested Edge Detection (HED). Curve fitting with Bézier curves or B-splines can be used to smooth the detected lines, thereby improving the representation of line art features. Color clustering can be performed using the K-Means++ method to obtain the distribution of pixels of each color in the shallow features. The spatial color probability distribution can be extracted statistically, by calculating the pixel proportion and the spatial distribution concentration of each color cluster in the shallow features; these values serve as the color features. Identifying and separating anime characters based on the deep features involves using Mask R-CNN to locate characters within the deep features, generating the corresponding Regions of Interest (ROIs) in the line art features, aligning each ROI, classifying the aligned ROIs by character, adjusting character positions using bounding box regression, and separating the anime characters using masks. Face alignment involves identifying key points specific to anime characters, including the corners of the eyes, the nose, and the mouth, rotating the anime character to a standard frontal position using an affine transformation, and extracting facial contour features. Skeletal proportion analysis involves extracting the skeletal nodes of the anime character after the affine transformation, calculating the ratio of head length to body length and the ratio of eye width to face width, and combining these with the face alignment key points and facial contour features to form a morphological vector representing the character morphological features.
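The color clustering and spatial color statistics described above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the patented implementation: it clusters raw pixel tuples rather than shallow network features, uses a fixed number of Lloyd iterations, and measures "spatial distribution concentration" as the RMS distance of a cluster's pixels from their centroid (a smaller `spread` means a more concentrated cluster). All function names are hypothetical.

```python
import math
import random


def dist2(p, q):
    """Squared Euclidean distance between two color tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(pixels, k, rng):
    """K-Means++ seeding: sample new centers with probability proportional to squared distance."""
    centers = [rng.choice(pixels)]
    while len(centers) < k:
        d2 = [min(dist2(p, c) for c in centers) for p in pixels]
        total = sum(d2)
        if total == 0:  # every pixel already coincides with a center
            centers.append(rng.choice(pixels))
            continue
        r, acc, chosen = rng.uniform(0, total), 0.0, pixels[-1]
        for p, d in zip(pixels, d2):
            acc += d
            if acc >= r:
                chosen = p
                break
        centers.append(chosen)
    return centers


def assign(pixels, centers):
    """Index of the nearest center for every pixel."""
    return [min(range(len(centers)), key=lambda j: dist2(p, centers[j])) for p in pixels]


def cluster_colors(pixels, k, iters=10, seed=0):
    """Lloyd iterations after K-Means++ seeding; returns (centers, labels)."""
    centers = kmeans_pp_init(pixels, k, random.Random(seed))
    for _ in range(iters):
        labels = assign(pixels, centers)
        for j in range(k):
            members = [p for p, l in zip(pixels, labels) if l == j]
            if members:
                centers[j] = tuple(sum(ch) / len(members) for ch in zip(*members))
    return centers, assign(pixels, centers)


def color_features(image, k, seed=0):
    """Per-cluster color center, pixel proportion, and spatial spread for a 2D image."""
    height, width = len(image), len(image[0])
    pixels = [px for row in image for px in row]
    centers, labels = cluster_colors(pixels, k, seed=seed)
    features = []
    for j, center in enumerate(centers):
        coords = [divmod(i, width) for i, l in enumerate(labels) if l == j]
        if not coords:
            continue
        cy = sum(y for y, _ in coords) / len(coords)
        cx = sum(x for _, x in coords) / len(coords)
        spread = math.sqrt(sum((y - cy) ** 2 + (x - cx) ** 2 for y, x in coords) / len(coords))
        features.append({"center": center,
                         "proportion": len(coords) / (height * width),
                         "spread": spread})
    return features
```

In practice an array library and an optimized K-Means implementation would replace the list comprehensions; the sketch only makes the two statistics (proportion and concentration) concrete.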

[0027] Specifically, using the line art features as spatial constraints and the color features as attribute constraints, patch matching and feature distribution calculation are performed on the anime reference image to obtain texture features, including: Based on the line art features, the spatial outline of the animation reference image is extracted. Using the spatial outline as the boundary, the animation reference image after color space conversion is decomposed into a set of tile patches with overlapping areas. The patch color features of each patch in the patch set are calculated based on color features, and the similarity of each patch color feature is calculated to obtain a color association matrix. Based on the patch color association matrix, color classification is performed on the color features of each patch to obtain a semantic color patch cluster set; Texture sampling is performed on the tile patches corresponding to each semantic color patch cluster in the semantic color patch cluster set to obtain a local texture feature set, and the texture correlation matrix corresponding to the local texture feature set is calculated. Based on the texture correlation matrix, statistical modeling is performed on the local texture feature set to obtain texture features.

[0028] The spatial contour can be extracted using the watershed algorithm or connected component analysis. The overlap rate between tile patches can be 25%, and each tile patch can be a 16px × 16px tile. Calculating patch color features involves using color moments to calculate the average chromaticity and corresponding sub-color features of each tile patch based on the clustered color clusters and the spatial color probability distribution. Color classification involves dividing the tiles based on the color association matrix and classifying the patch color features into types such as smooth color patches, shadow transition color patches, or highlight texture color patches. Texture sampling involves using a smaller sliding window to calculate Local Binary Patterns (LBP) or gray-level co-occurrence matrices for the tile patches corresponding to each semantic color patch cluster to obtain local texture features. Similarly, the texture correlation matrix can be obtained by calculating the similarity between the local texture features. Statistical modeling is performed using Gram matrix modeling or Gaussian mixture modeling. This involves determining local texture feature classes based on the texture correlation matrix, calculating the mean, standard deviation, and other statistics of each local texture feature class to form class region texture features that represent the class, and then fitting the class region texture features into the texture features.
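The tiling and texture sampling described above can be sketched as follows, under two assumptions that the text does not settle: a 25% overlap on 16px tiles is read as a stride of 12px, and the LBP variant is the basic 8-neighbour code with a `>=` comparison. Contour-bounded tiling and the clustering by semantic color patch are omitted; function names are hypothetical.

```python
def tile_origins(height, width, tile=16, overlap=0.25):
    """Top-left corners of overlapping square tiles that fit inside the image."""
    stride = max(1, int(round(tile * (1 - overlap))))  # 16px tile, 25% overlap -> 12px
    ys = range(0, max(height - tile, 0) + 1, stride)
    xs = range(0, max(width - tile, 0) + 1, stride)
    return [(y, x) for y in ys for x in xs]


# Clockwise 8-neighbourhood starting at the top-left neighbour.
_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(gray, y, x):
    """Basic Local Binary Pattern code (0..255) of an interior pixel."""
    center = gray[y][x]
    code = 0
    for bit, (dy, dx) in enumerate(_OFFSETS):
        if gray[y + dy][x + dx] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(gray, y0, x0, tile=16):
    """Normalized 256-bin LBP histogram over the interior pixels of one tile."""
    hist = [0] * 256
    for y in range(y0 + 1, y0 + tile - 1):
        for x in range(x0 + 1, x0 + tile - 1):
            hist[lbp_code(gray, y, x)] += 1
    total = sum(hist)
    return [v / total for v in hist]
```

The per-tile histograms play the role of local texture features; a texture correlation matrix could then be built from pairwise similarities between them.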

[0029] By extracting layered features from anime reference images and establishing multiple constraints on structure, color, and semantics between shallow and deep features, decoupled modeling of anime style is achieved. Stable line art features are generated through edge detection and curve fitting, providing clear structural boundaries for subsequent style transfer. Color clustering and spatial color probability distribution modeling form color features with regional consistency, effectively avoiding color drift during generation. Furthermore, patch-level texture modeling is performed under line art spatial and color attribute constraints, ensuring the continuity and consistency of texture features at the spatial and semantic levels, thereby suppressing flickering noise in the animation sequence. Character recognition, face alignment, and skeletal proportion analysis guided by deep features extract character morphological features, providing high-level constraints for the consistency of virtual character identity.

[0030] Extract the portrait key point sequence and the interactive object sequence from the real-time captured portrait video, and generate a dynamic semantic identity anchor stream based on the interactive object sequence.

[0031] In detail, extracting the portrait key point sequence and the interactive object sequence from the real-time captured portrait video includes: splitting the real-time captured portrait video into a portrait video frame sequence; performing multi-object detection on the first video frame of the portrait video frame sequence to obtain a primary human bounding box and a primary object bounding box near the primary human bounding box; performing human pose estimation within the primary human bounding box and extracting primary human key points from the pose estimation result; performing instance segmentation and semantic encoding on the primary object bounding box to extract primary interactive objects; establishing a joint state transition matrix based on the displacement of the primary human key points and the centroid displacement of the primary interactive objects between adjacent video frames of the portrait video frame sequence, and generating a target motion vector based on the joint state transition matrix; using motion vector prediction, extracting the key points and interactive objects of each portrait video frame after the first video frame one by one according to the target motion vector, to obtain a primary key point sequence and a primary object sequence; and performing confidence-based position correction and temporal smoothing on the primary key point sequence and the primary object sequence to obtain the portrait key point sequence and the interactive object sequence.

[0032] Multi-object detection means using a lightweight object detection model (such as YOLOv8-tiny) to divide the first video frame into a grid and using a human detection head to predict the probability that each grid cell contains a human. Grid cells with high human probability are selected and merged through connectivity analysis to form the primary human bounding box. An object detection head predicts the entity probability of each grid cell, and cells with high entity probability whose Euclidean distance to the primary human bounding box is less than a distance threshold are selected to generate the primary object bounding box. Human pose estimation means using components such as OpenPose or HRNet to extract the pixel coordinates of each joint point and each facial key point within the primary human bounding box, thereby obtaining the primary human key points. Instance segmentation and semantic encoding mean taking the pixel-level object mask output by the segmentation head and combining it with the high-dimensional semantic feature vector output by the object detection head to obtain the primary interactive objects. The joint state transition matrix is a transition matrix whose state variables are the coordinates of the human key points and the centroid coordinates of the interactive objects. Generating the target motion vector means calculating, based on the joint state transition matrix and the frame rate, the movement speed of the primary human key points and of the primary interactive object centroids between adjacent video frames (for example, the first and second video frames), and forming from these displacements a motion vector that contains the direction vector and speed of each point. The coordinate regions of the human key points and interactive objects in subsequent video frames are then predicted from the motion vector using linear regression prediction or Kalman filtering, and the primary key point sequence and primary object sequence are extracted within those regions by local feature matching and regression, together with their confidence scores. Position correction means that, when a confidence score is lower than a preset threshold, the positions in the primary key point sequence and primary object sequence are compensated by weighting with, or linearly interpolated from, the motion vector prediction values. Temporal smoothing can be performed using a sliding time window, first-order lag filtering, or adaptive low-pass filtering.
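
The confidence-based position correction and first-order lag filtering described in paragraph [0032] can be sketched as follows. This is a minimal illustration, assuming that a low-confidence observation is blended with the motion-vector prediction using the confidence itself as the blend weight. The function name, the threshold of 0.5, and the filter coefficient `alpha` are hypothetical choices, not values from the text.

```python
def correct_and_smooth(observed, predicted, confidence, threshold=0.5, alpha=0.6):
    """Correct low-confidence points, then apply a first-order lag filter.

    observed, predicted: per-frame (x, y) tuples for one key point.
    confidence: per-frame confidence score in [0, 1].
    """
    smoothed = []
    state = None
    for obs, pred, conf in zip(observed, predicted, confidence):
        if conf < threshold:
            # Weighted compensation: lean on the prediction when confidence is low.
            point = tuple(conf * o + (1.0 - conf) * p for o, p in zip(obs, pred))
        else:
            point = tuple(obs)
        if state is None:
            state = point  # the first frame initializes the filter
        else:
            # First-order lag: new state = alpha * input + (1 - alpha) * old state.
            state = tuple(alpha * c + (1.0 - alpha) * s for c, s in zip(point, state))
        smoothed.append(state)
    return smoothed
```

A larger `alpha` follows the input more closely; a smaller one smooths more strongly at the cost of lag.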

[0033] Specifically, generating a dynamic semantic identity anchor stream based on the interactive object sequence includes: performing cross-frame association and visibility analysis on the interactive object sequence, based on the portrait key point sequence, to generate an object trajectory sequence with occlusion labels; performing pose analysis on the object trajectory sequence to obtain dynamic object pose features; constructing, based on the object trajectory sequence and the dynamic object pose features, a cooperative motion mode between the portrait key point sequence and the interactive object sequence to obtain a human-object time-varying interaction state sequence; performing occlusion decoupling and semantic consistency modeling on the human-object time-varying interaction state sequence to obtain the semantic identity anchor stream; and extracting interaction stability weights from the human-object time-varying interaction state sequence and applying spatiotemporal smoothing constraints to the semantic identity anchor stream based on the interaction stability weights to obtain the dynamic semantic identity anchor stream.

[0034] Cross-frame association means assigning a unique and continuous object identity ID to each interactive object in each video frame, based on the object bounding boxes and corresponding object semantics in the interactive object sequence. Visibility analysis means comparing the overlap between object bounding boxes, or between an object bounding box and the bounding box of the corresponding primary human key points, calculating the IoU overlap score, generating an occlusion label according to the interval in which the overlap score falls, and labeling the interactive objects that have been assigned object identity IDs with these occlusion labels to obtain the object trajectory sequence. Pose analysis means calculating the position of each object trajectory in the object trajectory sequence relative to the human body, using the center of the human key points as the origin of the base coordinate system; extracting the dynamic appearance features of the objects from the temporal changes in the shape and texture features of the interactive object on each trajectory; and then modeling the pose change of the interactive objects, calculating their rotational angular velocity and orientation change trend, and vectorizing these into the dynamic object pose features. Constructing the cooperative motion mode means calculating the cosine similarity and the relative distance variance between the motion vectors of the hand key points in the portrait key point sequence and the motion vectors of the object centroids in the object trajectory sequence, and then, based on temporal features such as the rate of change of the cosine similarity and of the relative distance variance, state-encoding the motion vectors of the hand key points and object centroids to obtain the human-object time-varying interaction state sequence.
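
Two of the quantities in paragraph [0034], the IoU overlap score used for occlusion labels and the cosine similarity between hand and object motion vectors, can be sketched in a few lines. The label names and the interval boundaries (0.1 and 0.5) are assumptions chosen for illustration; the text states only that labels are derived from overlap score intervals.

```python
import math

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def occlusion_label(score, partial=0.1, heavy=0.5):
    """Map an overlap score to a label (interval boundaries are assumed)."""
    if score < partial:
        return "visible"
    if score < heavy:
        return "partial"
    return "occluded"

def cosine_similarity(u, v):
    """Cosine of the angle between two motion vectors; 0.0 if either is zero."""
    nu, nv = math.hypot(*u), math.hypot(*v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)
```

A hand and an object moving in the same direction give a cosine similarity near 1, which is the kind of signal the cooperative motion mode encodes over time.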

[0035] In detail, occlusion decoupling means using an attention mechanism to split the image of the overlapping area indicated by the occlusion label, assigning the occluded interactive object to the background layer and the human body to the foreground layer. Semantic consistency modeling means projecting the dynamic appearance features and dynamic object pose features of the interactive object sequence into the semantic manifold space corresponding to the multidimensional style features, thereby generating a semantic identity anchor for each interactive object in each video frame and yielding the semantic identity anchor stream. The interaction stability weight is obtained by weighting the cooperative motion pattern matching degree, the occlusion duration of the occlusion label, and the overlap score associated with the human-object time-varying interaction state sequence. The spatiotemporal smoothing constraint means applying Kalman filtering or an exponentially weighted moving average to the semantic identity anchor stream according to the interaction stability weight, thereby eliminating jitter of the semantic identity anchor stream in both the spatial and temporal dimensions.
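
One possible reading of the interaction stability weight and the exponentially weighted moving average in paragraph [0035] is sketched below. The linear combination, its coefficients, the 30-frame normalization of occlusion duration, and the choice to scale the smoothing factor by the weight are all assumptions made for illustration; the text does not specify how the three terms are combined.

```python
def stability_weight(match_degree, occluded_frames, overlap,
                     max_occluded=30, coeffs=(0.5, 0.3, 0.2)):
    """Combine three cues into a weight in [0, 1] (the combination is assumed)."""
    occl_term = 1.0 - min(occluded_frames, max_occluded) / max_occluded
    w = (coeffs[0] * match_degree
         + coeffs[1] * occl_term
         + coeffs[2] * (1.0 - overlap))
    return min(1.0, max(0.0, w))

def smooth_anchor_stream(anchors, weights, base_alpha=0.8):
    """Exponentially weighted moving average over anchor vectors.

    A stable interaction (weight near 1) lets the new anchor through;
    an unstable one (weight near 0) keeps the previous smoothed anchor.
    """
    out, state = [], None
    for anchor, w in zip(anchors, weights):
        if state is None:
            state = list(anchor)
        else:
            alpha = base_alpha * w
            state = [alpha * a + (1.0 - alpha) * s for a, s in zip(anchor, state)]
        out.append(list(state))
    return out
```

With this design, a briefly occluded object keeps its previous anchor instead of jumping, which is one way to realize the jitter suppression the paragraph describes.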

[0036] By jointly modeling the sequence of human key points and the sequence of interactive objects, and introducing a joint state transition matrix and motion vector prediction mechanism, we have achieved collaborative tracking and stable extraction of human images and interactive objects in the temporal dimension. This effectively reduces recognition errors caused by detection jitter, short-term occlusion, or rapid changes in posture. By constructing a human-object collaborative motion mode, we map low-level geometric motion features to high-level interactive state semantics. Furthermore, we introduce semantic identity anchor flow and interaction stability weights to impose spatiotemporal smoothing constraints on interactive semantics. This maintains the continuity and consistency of interactive identity and semantic expression in complex dynamic interaction scenarios, improving the robustness and interpretability of human-object interaction modeling.

[0037] Using the portrait key point sequence as structural constraints and the multidimensional style features as global guidance, the dynamic semantic identity anchor stream is embedded as a latent variable constraint in the latent space, and controlled diffusion inference is performed in the latent space to obtain the animation latent feature frame sequence.

[0038] Specifically, using the portrait key point sequence as structural constraints and the multidimensional style features as global guidance, embedding the dynamic semantic identity anchor stream into the latent space as a latent variable constraint, and performing controlled diffusion inference within the latent space to obtain the animation latent feature frame sequence includes: The portrait key point sequence is topologically encoded to obtain a structural constraint feature sequence. The structural constraint feature sequence is projected into the latent space to obtain a structural constraint latent vector sequence. The multidimensional style features are projected into the latent space to obtain a global style prior distribution. The dynamic semantic identity anchor stream is embedded as a latent variable constraint in the latent space to obtain a semantic constraint latent condition sequence. In the latent space, the structural constraint latent vector sequence is used as a hard structural constraint, the global style prior distribution is used as a distribution guiding condition, and iterative noise-reduction sampling and temporal consistency constraints are performed based on the cross-attention mechanism combined with the semantic constraint latent condition sequence to obtain the animation latent feature frame sequence. The animation latent feature frame sequence is obtained by performing inter-frame smoothing based on cross-frame latent feature similarity.

[0039] The topological relationship encoding can be achieved by constructing a graph structure from the key points in the portrait key point sequence and encoding the graph with a graph attention network to obtain the structural constraint feature sequence. A pre-trained multilayer perceptron can be used to project the structural constraint feature sequence into the latent space. The structural constraint latent vectors are a set of tensors representing the geometric relationships of the human skeleton, specifying where pixels must appear in space. An image encoder is used to project the style features into the latent space. Hard structural constraints mean imposing inviolable structural consistency constraints on the latent features at the corresponding spatial locations in the diffusion network. The global style prior distribution is a probability distribution describing the overall statistical regularity of color, contrast, and brushstrokes in the anime reference images; it can be implemented using classifier-free guidance (CFG). The iterative noise-reduction sampling formula is defined over the following quantities, written here in conventional diffusion notation: $z_{t-1}$ is the animation latent feature frame sequence at step $t-1$; $\alpha_t$ is the noise-reduction coefficient of the preset step $t$, representing the proportion of the original signal retained; $z_t$ is the animation latent feature frame sequence at step $t$; $\bar{\alpha}_t$ is the cumulative proportion of signal retained from step 1 to step $t$; $\epsilon_\theta$ is the noise prediction function, whose noise components are predicted by the neural network; $\theta$ denotes the learnable parameters; $c$ denotes the constraint conditions, comprising the structural constraint latent vector sequence, the global style prior distribution, and the semantic constraint latent condition sequence; $\sigma_t$ is the step-size weight of step $t$; and $\epsilon$ is random Gaussian noise. The temporal consistency constraint means introducing temporal attention into each noise-reduction step, so that each iteration considers not only the latent feature constraints of the current frame but also those of the previous and next steps, ensuring that pixel movement conforms to the motion vector prediction. Inter-frame smoothing means calculating the cosine similarity between the feature maps of two adjacent frames; if the features of a region change abruptly, interpolation or weighted fusion is used to carry the valid features of the previous frame into the current frame.
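
The quantities defined in paragraph [0039] (a per-step retention coefficient, its cumulative counterpart, a learned noise predictor conditioned on the constraints, a step weight, and Gaussian noise) are consistent with the standard DDPM reverse update. Assuming that form, and with every symbol name chosen here for illustration because the original notation is not preserved in the text, one sampling step could be written as:

```latex
z_{t-1} = \frac{1}{\sqrt{\alpha_t}}
          \left( z_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,
                 \epsilon_\theta(z_t, t, c) \right)
          + \sigma_t\,\epsilon,
\qquad
\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s,
\qquad
\epsilon \sim \mathcal{N}(0, I)
```

Here $c$ bundles the structural constraint latent vectors, the global style prior, and the semantic constraint latent conditions. This is a sketch of the conventional form under the stated assumption, not a confirmed transcription of the formula in the application.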

[0040] By introducing a multi-condition controlled diffusion mechanism that incorporates structural constraints, style priors, and semantic anchors into the latent space, fine-grained control over the animation generation process is achieved. By utilizing latent structural constraint vectors constructed from portrait keypoint sequences, the stability of the human body's topological structure is ensured at the latent level, effectively avoiding structural distortion issues in the generated animation. By mapping multi-dimensional style features to a global style prior distribution, the diffusion sampling process is guided to always be constrained by the target animation style space, significantly improving the style consistency of the generated results. Simultaneously, by introducing dynamic semantic identity anchor streams as latent variable conditions, the diffusion model explicitly perceives the semantics of human-object interaction and its temporal continuity during the generation process. Combined with temporal consistency constraints and cross-frame smoothing mechanisms during the iteration process, it can improve the stability and coherence of the animation in the temporal dimension while ensuring the quality of a single frame, providing a high-quality and controllable latent feature foundation for subsequent animation decoding and rendering.

[0041] The animation feature frame sequence is subjected to three-dimensional causal decoding and temporal smoothing filtering to obtain a smooth animation frame sequence.

[0042] Specifically, performing 3D causal decoding and temporal smoothing filtering on the animation feature frame sequence to obtain a smooth animation frame sequence includes: Tensor reconstruction and spatiotemporal alignment are performed on the animation feature frame sequence to obtain a reconstructed feature sequence, and three-dimensional causal convolution is performed on the reconstructed feature sequence to obtain an aligned feature sequence. The aligned feature sequence is subjected to three-dimensional causal decoding to obtain the original animation frame sequence. Inter-frame optical flow calculations are performed on the original animation frame sequence to obtain a temporal optical flow residual sequence. Temporal noise filtering is performed on the original animation frame sequence based on the temporal optical flow residual sequence to obtain a smooth animation frame sequence.

[0043] Tensor reconstruction means reconstructing the discrete animation feature frame sequence into a five-dimensional feature tensor with a unified spatiotemporal scale. Spatiotemporal alignment means aligning the feature tensor in space and time based on timestamp consistency and spatial key region constraints; frame-order misalignment or missing frames are repaired through interpolation. Three-dimensional causal convolution means padding the aligned feature sequence along the temporal dimension. For example, with a temporal convolution kernel spanning 3 steps, two empty frames are padded on the left side of the aligned feature sequence, so that the output feature of frame $t$ is determined only by frames $t$, $t-1$, and $t-2$. This preserves the causality required for real-time generation: future actions cannot interfere with the current frame. 3D causal decoding means progressively upsampling the aligned feature sequence through a series of convolutions to generate the corresponding animation frames. For example, for an aligned feature sequence with input dimensions (512, 16, 64, 64), i.e., 512 channels, a sequence length of 16, a height of 64 pixels, and a width of 64 pixels, spatiotemporal joint residual learning through the ResNet3D Block module yields a residual feature sequence with dimensions (512, 16, 64, 64). Two 3D causal upsampling operations are then performed, each followed by a channel compression convolution that reduces the number of channels from 512 to 256 and then to 128, giving a final upsampled feature sequence with dimensions (128, 16, 256, 256). This upsampled feature sequence is passed through an output convolutional layer to obtain the original animation frame sequence with dimensions (3, 16, 512, 512). Inter-frame optical flow can be calculated using the Farneback algorithm or the Recurrent All-Pairs Field Transforms (RAFT) model.
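
The left-padding idea behind the causal convolution in paragraph [0043] can be shown on a one-dimensional sequence. This is a simplified sketch, with scalar values standing in for frame features and a hypothetical function name; a real 3D causal convolution applies the same padding along the time axis of a five-dimensional tensor.

```python
def causal_conv1d(frames, kernel):
    """Temporal convolution where output t depends only on frames <= t.

    The sequence is padded on the left with len(kernel) - 1 zeros, so a
    kernel of length 3 sees frames t-2, t-1 and t for output t.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(frames)
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(frames))]
```

Changing a later frame leaves all earlier outputs unchanged, which is the causality property that allows frames to be emitted as soon as they are computed.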

[0044] Specifically, performing temporal noise filtering on the original animation frame sequence based on the temporal optical flow residual sequence to obtain a smooth animation frame sequence includes: Each original animation frame in the original animation frame sequence is selected in turn as the target animation frame, and the corresponding pixel confidence map is calculated from the temporal optical flow residual of the target animation frame in the temporal optical flow residual sequence. Local contrast and edge information are extracted from the target animation frame, and a dynamic filtering weight matrix is constructed from the local contrast, the edge information, and the pixel confidence map. The original animation frame immediately before the target animation frame is taken as the preceding animation frame and is spatially remapped according to the temporal optical flow residual of the target animation frame. The spatially remapped preceding animation frame and the target animation frame are weighted and aggregated according to the dynamic filtering weight matrix to obtain the denoised animation frame. A residual connection with the target animation frame is applied to the denoised animation frame, and bilateral filtering is performed on the result to obtain a smooth animation frame. The smooth animation frames corresponding to all target animation frames in the original animation frame sequence are then aggregated into the smooth animation frame sequence.

[0045] The pixel confidence map is composed of the pixel confidence score of each pixel in the target animation frame. Each score is computed as an exponential function of the temporal optical flow residual: $C$ denotes the pixel confidence, $\exp$ is the exponential function, $r$ is the temporal optical flow residual of the corresponding pixel, and $\sigma$ is the smoothing factor. When $\sigma$ is larger, the confidence decreases slowly as the residual grows; when $\sigma$ is smaller, the confidence collapses rapidly. The dynamic filtering weight matrix is obtained by weighting the local contrast, edge information, and pixel confidence of each pixel region: pixels with high local contrast, strong edge information, and low pixel confidence are assigned smaller weights, while pixels with low local contrast, little edge information, and high pixel confidence are assigned larger weights.
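
The confidence map and the weighted aggregation of paragraphs [0044] and [0045] can be sketched as below. The exact confidence formula is not recoverable from the text, so the form `exp(-|r| / sigma)` is an assumption, as are the multiplicative weight rule, the `max_weight` cap, and the requirement that contrast and edge strength be normalized to [0, 1]. The preceding frame is assumed to have been warped by optical flow already.

```python
import math

def pixel_confidence(residual, sigma=1.0):
    """Assumed form: confidence decays exponentially with the flow residual."""
    return math.exp(-abs(residual) / sigma)

def filter_weight(confidence, contrast, edge, max_weight=0.5):
    """Weight given to the warped previous frame (contrast, edge in [0, 1]).

    High contrast, strong edges or low confidence all push the weight down.
    """
    return max_weight * confidence * (1.0 - contrast) * (1.0 - edge)

def blend_frame(current, warped_prev, residuals, contrast, edges, sigma=1.0):
    """Per-pixel weighted aggregation of the current and warped previous frame."""
    out = []
    for y, row in enumerate(current):
        out_row = []
        for x, cur in enumerate(row):
            w = filter_weight(pixel_confidence(residuals[y][x], sigma),
                              contrast[y][x], edges[y][x])
            out_row.append(w * warped_prev[y][x] + (1.0 - w) * cur)
        out.append(out_row)
    return out
```

Flat, well-tracked regions are averaged with the previous frame to suppress flicker, while edges and poorly tracked pixels keep the current frame's value to avoid smearing.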

[0046] Tensor reconstruction and spatiotemporal alignment unify the discrete animation features into a stable spatiotemporal representation, providing consistent input for 3D causal convolution and decoding and effectively avoiding frame-order misalignment and scale inconsistency. Combining 3D causal convolution with causal decoding ensures that each animation frame depends only on historical information, giving the system real-time generation capability while maintaining motion continuity. The temporal noise filtering mechanism based on optical flow residuals applies strong smoothing to regions with consistent motion and weak smoothing to regions with complex motion or occlusion, suppressing flicker and jitter while avoiding blur and motion blur.

[0047] Real-time style feedback and style correction are performed on the smooth animation frame sequence to obtain virtual character animation.

[0048] In detail, performing real-time style feedback and style correction on the smooth animation frame sequence to obtain the virtual character animation includes: performing multidimensional style extraction on each smooth animation frame in the smooth animation frame sequence to obtain smooth style features; calculating the feature difference degree between the smooth style features and the multidimensional style features, and generating a style deviation gradient map based on the feature difference degree; generating a style compensation operator based on the style deviation gradient map, and performing secondary pixel remapping on the smooth animation frame sequence based on the style compensation operator to obtain a style calibration frame sequence; and performing style stability smoothing on the style calibration frame sequence, then stitching the smoothed style calibration frame sequence into the virtual character animation.

[0049] The multidimensional style extraction uses the same method as the extraction of multidimensional style features from the anime reference image in the steps above, and is not repeated here. The feature difference degree is the difference between the smooth style features and the multidimensional style features in the latent space. It can be calculated by computing the Gram matrices of the smooth style features and the multidimensional style features in the latent space and using these matrices to obtain the style statistical deviation, which is then backpropagated to the smooth animation frame through a differentiable feature mapping to obtain the response deviation. The style deviation gradient map is obtained by taking the partial derivative of the response deviation with respect to the smooth animation frame. A lightweight multilayer perceptron can map the style deviation gradient map to a set of style compensation parameters, which form the style compensation operator. Secondary pixel remapping can be implemented with functions such as the remap function in OpenCV or a differentiable sampler: the smooth animation frame is resampled according to the style compensation operator using an interpolation method such as linear interpolation and a boundary mode such as reflection, generating a new image and thereby the style calibration frame sequence. Style stability smoothing means that, when the style compensation operator is generated, a sliding window storing the style compensation operators of the previous several frames is used to temporally filter the style compensation operator of the current frame, preventing over-correction and abrupt changes in the image.
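
The Gram-matrix comparison and the sliding-window filtering of the compensation operator in paragraph [0049] can be sketched as follows. Features are represented as a list of channels, each a flat list of activations. The squared-difference measure, the plain mean over the window, and all names are illustrative assumptions rather than details from the text.

```python
from collections import deque

def gram_matrix(features):
    """Channel-by-channel inner products, normalized by activations per channel."""
    n = len(features[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
            for fi in features]

def style_difference(features_a, features_b):
    """Sum of squared differences between two Gram matrices (assumed measure)."""
    ga, gb = gram_matrix(features_a), gram_matrix(features_b)
    return sum((x - y) ** 2 for ra, rb in zip(ga, gb) for x, y in zip(ra, rb))

class OperatorSmoother:
    """Averages compensation parameters over the last `window` frames."""

    def __init__(self, window=5):
        self.history = deque(maxlen=window)

    def update(self, params):
        self.history.append(list(params))
        n = len(self.history)
        return [sum(column) / n for column in zip(*self.history)]
```

Averaging the operator over a short window damps frame-to-frame swings in the correction, at the cost of reacting a few frames late to a genuine style drift.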

[0050] By quantifying differences at the style feature level and limiting the correction to pixel remapping and parametric compensation, style consistency is enhanced without disrupting the original temporal motion continuity. Compared to direct style transfer or regenerating animation frames, style correction effectively avoids flickering, style drift, and structural distortion, thereby further converging the preceding temporal smoothing results at the visual style level. This ensures the style uniformity and visual stability of the final virtual character animation during long-term playback, improving the quality of animation generation.

[0051] Embodiment 2: This invention discloses a real-time generation system for anime-style virtual character animation. The system includes a feature extraction module, an anchor point recognition module, a controlled diffusion module, a noise smoothing module, and a style correction module, wherein: The feature extraction module is used to extract multidimensional style features from anime reference images, wherein the multidimensional style features include line art features, color features, texture features, and character shape features. The anchor point recognition module is used to extract the portrait key point sequence and the interactive object sequence from real-time captured portrait video, and to generate a dynamic semantic identity anchor stream based on the interactive object sequence. The controlled diffusion module is used to embed the dynamic semantic identity anchor stream into the latent space as a latent variable constraint, with the portrait key point sequence as a structural constraint and the multidimensional style features as global guidance, and to perform controlled diffusion inference in the latent space to obtain the animation latent feature frame sequence. The noise smoothing module is used to perform three-dimensional causal decoding and temporal smoothing filtering on the animation feature frame sequence to obtain a smooth animation frame sequence. The style correction module is used to perform real-time style feedback and style correction on the smooth animation frame sequence to obtain the virtual character animation.

[0052] The processes described above with reference to the flowcharts in the embodiments disclosed in this invention can be implemented as computer software programs. The embodiments disclosed in this invention include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via a communication component, and / or installed from a removable medium. When the computer program is executed by a central processing unit (CPU), it performs the functions defined in the methods of this application. It should be noted that the computer-readable medium described above in this application can be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: electrical connections having one or more wire segments, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof. In this application, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in connection with an instruction execution system, apparatus, or device. In this application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. 
Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on a computer-readable medium may be transmitted using any suitable medium, including but not limited to: wireless links, wired links, optical fibers, RF, etc., or any suitable combination thereof.

[0053] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0054] Those skilled in the art should understand that the embodiments of the present invention described above and shown in the accompanying drawings are merely examples and do not limit the present invention. The purpose of the present invention has been fully and effectively achieved. The functions and structural principles of the present invention have been shown and explained in the embodiments. Without departing from the stated principles, the implementation of the present invention may have any variations or modifications.

Claims

1. A method for real-time generation of anime-style virtual character animation, characterized in that the method includes: extracting multidimensional style features from an anime reference image, wherein the multidimensional style features include line art features, color features, texture features, and character form features; extracting a portrait key point sequence and an interactive object sequence from a real-time captured portrait video, and generating a dynamic semantic identity anchor stream based on the interactive object sequence; using the portrait key point sequence as a structural constraint and the multidimensional style features as global guidance, embedding the dynamic semantic identity anchor stream into a latent space as a latent variable constraint, and performing controlled diffusion inference in the latent space to obtain an animation latent feature frame sequence; performing three-dimensional causal decoding and temporal smoothing filtering on the animation latent feature frame sequence to obtain a smooth animation frame sequence; and performing real-time style feedback and style correction on the smooth animation frame sequence to obtain the virtual character animation.

2. The method for real-time generation of anime-style virtual character animation according to claim 1, characterized in that extracting the multidimensional style features from the anime reference image includes: scaling the anime reference image and converting its color space, and performing layered feature extraction to obtain shallow features and deep features; performing edge detection and curve fitting on the shallow features to obtain the line art features; performing color clustering on the shallow features to obtain color clusters, extracting the spatial color probability distribution corresponding to the color clusters, and generating the color features based on the color clusters and the spatial color probability distribution; using the line art features as spatial constraints and the color features as attribute constraints, performing patch matching and feature distribution calculation on the anime reference image to obtain the texture features; and identifying and separating the anime character from the line art features based on the deep features, performing face alignment and skeletal proportion analysis on the anime character, and generating the character form features based on the face alignment result and the skeletal proportion analysis result.
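As a non-limiting illustration of the color clustering step of claim 2, the following sketch clusters pixel colors and then estimates, for each cluster, the probability of its pixels falling into each cell of a coarse grid. The use of plain k-means, the grid-based form of the spatial color probability distribution, and the names `kmeans_colors` and `spatial_color_distribution` are assumptions made for this example; the claim does not specify them.

```python
def kmeans_colors(pixels, k, iters=10):
    """Cluster RGB tuples with plain k-means.

    Centers start at the first k distinct colors (an assumption that keeps
    this sketch deterministic).
    """
    centers = []
    for p in pixels:
        if p not in centers:
            centers.append(p)
        if len(centers) == k:
            break

    def nearest(p):
        # Index of the center with the smallest squared Euclidean distance.
        return min(range(len(centers)),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))

    labels = [0] * len(pixels)
    for _ in range(iters):
        # Assignment step.
        labels = [nearest(p) for p in pixels]
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(pixels, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(ch) / len(members) for ch in zip(*members))
    return centers, labels


def spatial_color_distribution(labels, width, height, k, grid=2):
    """For each cluster, the probability of its pixels lying in each grid cell.

    `labels` is row-major, one label per pixel of a width x height image.
    """
    counts = [[0] * (grid * grid) for _ in range(k)]
    for idx, lab in enumerate(labels):
        y, x = divmod(idx, width)
        cell = (y * grid // height) * grid + (x * grid // width)
        counts[lab][cell] += 1
    dist = []
    for row in counts:
        total = sum(row)
        dist.append([c / total if total else 0.0 for c in row])
    return dist


# A 2x2 image in row-major order: red on the top row, blue on the bottom row.
pixels = [(255, 0, 0), (255, 0, 0), (0, 0, 255), (0, 0, 255)]
centers, labels = kmeans_colors(pixels, k=2)
dist = spatial_color_distribution(labels, width=2, height=2, k=2)
```

In this example the red cluster is concentrated in the two upper grid cells and the blue cluster in the two lower ones, which is the kind of spatial information a color feature could carry in addition to the cluster centers.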

3. The method for real-time generation of anime-style virtual character animation according to claim 2, characterized in that using the line art features as spatial constraints and the color features as attribute constraints, performing patch matching and feature distribution calculation on the anime reference image to obtain the texture features includes: extracting the spatial outline of the anime reference image based on the line art features, and, using the spatial outline as the boundary, decomposing the color-space-converted anime reference image into a set of tile patches with overlapping areas; calculating the patch color feature of each tile patch in the set based on the color features, and calculating the similarity between the patch color features to obtain a color association matrix; based on the color association matrix, classifying the patch color features by color to obtain a set of semantic color patch clusters; performing texture sampling on the tile patches corresponding to each semantic color patch cluster in the set to obtain a local texture feature set, and calculating the texture correlation matrix corresponding to the local texture feature set; and performing statistical modeling on the local texture feature set based on the texture correlation matrix to obtain the texture features.
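As a non-limiting illustration of the tile decomposition and color association matrix of claim 3, the following sketch cuts an image into overlapping square tiles, takes the mean color of each tile as its patch color feature, and compares the features pairwise. It is a simplification under stated assumptions: the spatial outline boundary is ignored, the mean color stands in for the patch color feature, and a Gaussian similarity with a hypothetical bandwidth `sigma` stands in for the unspecified similarity measure.

```python
import math


def tile_origins(length, tile, stride):
    """Start offsets of tiles along one axis; stride < tile yields overlap."""
    origins = list(range(0, max(length - tile, 0) + 1, stride))
    if origins[-1] + tile < length:
        origins.append(length - tile)  # extra tile flush with the far border
    return origins


def overlapping_tiles(image, tile, stride):
    """Return (y0, x0, mean_rgb) per tile; `image` is a list of rows of RGB tuples."""
    h, w = len(image), len(image[0])
    tiles = []
    for y0 in tile_origins(h, tile, stride):
        for x0 in tile_origins(w, tile, stride):
            pix = [p for row in image[y0:y0 + tile] for p in row[x0:x0 + tile]]
            mean = tuple(sum(ch) / len(pix) for ch in zip(*pix))
            tiles.append((y0, x0, mean))
    return tiles


def color_association_matrix(colors, sigma=64.0):
    """Pairwise Gaussian similarity between patch colors (1.0 means identical)."""
    def sim(a, b):
        d2 = sum((x - y) ** 2 for x, y in zip(a, b))
        return math.exp(-d2 / (2.0 * sigma ** 2))
    return [[sim(a, b) for b in colors] for a in colors]


# Left half red, right half blue; 2x2 tiles with stride 1 overlap by one column.
R, B = (255, 0, 0), (0, 0, 255)
image = [[R, R, B, B], [R, R, B, B]]
tiles = overlapping_tiles(image, tile=2, stride=1)
matrix = color_association_matrix([t[2] for t in tiles])
```

Here the middle tile straddles the red/blue border, so its association with the pure red tile is higher than the association between the pure red and pure blue tiles. A subsequent grouping step could threshold or cluster the rows of `matrix` to form the semantic color patch clusters.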

4. The method for real-time generation of anime-style virtual character animation according to claim 1, characterized in that extracting the portrait key point sequence and the interactive object sequence from the real-time captured portrait video includes: splitting the real-time captured portrait video into a portrait video frame sequence, and performing multi-object detection on the first video frame in the portrait video frame sequence to obtain a primary human body bounding box and a primary object bounding box near the primary human body bounding box; performing human pose estimation within the primary human body bounding box, and extracting primary human key points based on the human pose estimation result; performing instance segmentation and semantic encoding on the primary object bounding box to extract primary interactive objects; establishing a joint state transition matrix based on the displacement of the primary human key points and the centroid displacement of the primary interactive objects between adjacent video frames in the portrait video frame sequence, and generating a target motion vector based on the joint state transition matrix; using motion vector prediction, extracting the human key points and the interactive objects frame by frame from the portrait video frames after the first video frame according to the target motion vector, to obtain a primary key point sequence and a primary object sequence; and performing confidence-based position correction and temporal smoothing on the primary key point sequence and the primary object sequence to obtain the portrait key point sequence and the interactive object sequence.
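As a non-limiting illustration of the prediction, correction, and smoothing steps of claim 4, the following sketch tracks a single 2D point. A constant-velocity model is used here in place of the joint state transition matrix, a linear blend is used for the confidence-based position correction, and an exponential moving average is used for the temporal smoothing; all three choices and the function names are assumptions made for this example.

```python
def predict_next(prev, curr):
    """Constant-velocity prediction: next = curr + (curr - prev)."""
    return tuple(c + (c - p) for p, c in zip(prev, curr))


def confidence_correct(predicted, detected, confidence):
    """Blend a detection with the prediction; `confidence` lies in [0, 1].

    A confident detection dominates; an unreliable one falls back to the
    motion-based prediction.
    """
    return tuple(confidence * d + (1.0 - confidence) * p
                 for p, d in zip(predicted, detected))


def smooth_track(points, alpha=0.5):
    """Exponential moving average over a sequence of points."""
    out = [points[0]]
    for pt in points[1:]:
        out.append(tuple(alpha * c + (1.0 - alpha) * s
                         for c, s in zip(pt, out[-1])))
    return out


# A key point moved from (0, 0) to (1, 2); predict where it goes next,
# then correct the prediction with a half-trusted detection at (4, 4).
predicted = predict_next((0.0, 0.0), (1.0, 2.0))
corrected = confidence_correct(predicted, (4.0, 4.0), confidence=0.5)
track = smooth_track([(0.0, 0.0), (1.0, 2.0), corrected])
```

The same three functions apply unchanged to an object centroid, which is how a key point sequence and an object sequence could share one prediction scheme in this sketch.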

5. The method for real-time generation of anime-style virtual character animation according to claim 1, characterized in that generating the dynamic semantic identity anchor stream based on the interactive object sequence includes: based on the portrait key point sequence, performing cross-frame association and visibility analysis on the interactive object sequence to generate an object trajectory sequence with occlusion labels; performing pose analysis on the object trajectory sequence to obtain dynamic object pose features; based on the object trajectory sequence and the dynamic object pose features, constructing a cooperative motion pattern between the portrait key point sequence and the interactive object sequence to obtain a time-varying human-object interaction state sequence; performing occlusion decoupling and semantic consistency modeling on the time-varying human-object interaction state sequence to obtain a semantic identity anchor stream; and extracting interaction stability weights from the time-varying human-object interaction state sequence, and applying spatiotemporal smoothing constraints to the semantic identity anchor stream based on the interaction stability weights to obtain the dynamic semantic identity anchor stream.
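As a non-limiting illustration of the last step of claim 5, the following sketch derives a stability weight per frame and uses it to control how strongly each anchor vector is smoothed toward its predecessor. The interaction state is reduced to a single hand-to-object distance, the weight formula `1 / (1 + |change in distance|)` and the cap `max_weight` are hypothetical, and the anchors are plain lists of floats; none of these details come from the claim.

```python
import math


def stability_weights(distances):
    """Weight 1.0 when the human-object distance is steady, lower when it jumps.

    The first frame has no predecessor and is given weight 1.0.
    """
    return [1.0] + [1.0 / (1.0 + abs(b - a))
                    for a, b in zip(distances, distances[1:])]


def smooth_anchor_stream(anchors, weights, max_weight=0.5):
    """Pull each anchor toward the previous smoothed anchor.

    A stable interaction (weight near 1) is smoothed more strongly; an
    abrupt change is smoothed less so the anchor can follow it.
    """
    out = [list(anchors[0])]
    for anchor, w in zip(anchors[1:], weights[1:]):
        k = max_weight * w
        out.append([(1.0 - k) * x + k * p for x, p in zip(anchor, out[-1])])
    return out


# Hand and object positions per frame give the interaction distances.
hands = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
objects = [(1.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
distances = [math.dist(h, o) for h, o in zip(hands, objects)]
weights = stability_weights(distances)
stream = smooth_anchor_stream([[0.0], [2.0], [4.0]], weights)
```

In this example the object moves away in the third frame, so that frame's weight drops and its anchor is smoothed less than the second frame's.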

6. The method for real-time generation of anime-style virtual character animation according to claim 1, characterized in that using the portrait key point sequence as a structural constraint and the multidimensional style features as global guidance, embedding the dynamic semantic identity anchor stream into the latent space as a latent variable constraint, and performing controlled diffusion inference in the latent space to obtain the animation latent feature frame sequence includes: topologically encoding the portrait key point sequence to obtain a structural constraint feature sequence, and projecting the structural constraint feature sequence into the latent space to obtain a structural constraint latent vector sequence; projecting the multidimensional style features into the latent space to obtain a global style prior distribution; embedding the dynamic semantic identity anchor stream into the latent space as a latent variable constraint to obtain a semantic constraint latent condition sequence; and, in the latent space, using the structural constraint latent vector sequence as a hard structural constraint and the global style prior distribution as a distribution guidance condition, performing iterative denoising sampling and applying temporal consistency constraints based on a cross-attention mechanism combined with the semantic constraint latent condition sequence to obtain the animation latent feature frame sequence, wherein the animation latent feature frame sequence is obtained by performing inter-frame smoothing based on cross-frame latent feature similarity.
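As a non-limiting illustration of the inter-frame smoothing based on cross-frame latent feature similarity recited in claim 6, the following sketch blends each latent vector with its smoothed predecessor, using cosine similarity to decide how much to blend. Cosine similarity, the `strength` parameter, and the flat list representation of a latent frame are assumptions made for this example; the diffusion sampling itself is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def smooth_latents(frames, strength=0.5):
    """Blend similar consecutive latents; leave dissimilar ones untouched.

    The blend weight is strength * max(similarity, 0), so a scene change
    (low similarity) is not smeared into the next frame.
    """
    out = [list(frames[0])]
    for frame in frames[1:]:
        w = strength * max(cosine(out[-1], frame), 0.0)
        out.append([(1.0 - w) * x + w * p for x, p in zip(frame, out[-1])])
    return out


# The second latent points the same way as the first and is pulled toward it;
# the third is orthogonal to its predecessor and passes through unchanged.
latents = [[2.0, 0.0], [4.0, 0.0], [0.0, 1.0]]
smoothed = smooth_latents(latents)
```

The design choice here is that similarity gates the smoothing: frames that already agree are stabilized further, while genuinely different content is preserved.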

7. The method for real-time generation of anime-style virtual character animation according to claim 1, characterized in that performing three-dimensional causal decoding and temporal smoothing filtering on the animation latent feature frame sequence to obtain the smooth animation frame sequence includes: performing tensor quantization and spatiotemporal alignment on the animation latent feature frame sequence to obtain a reconstructed feature sequence, and performing three-dimensional causal convolution on the reconstructed feature sequence to obtain an aligned feature sequence; performing three-dimensional causal decoding on the aligned feature sequence to obtain an original animation frame sequence; performing inter-frame optical flow calculation on the original animation frame sequence to obtain a temporal optical flow residual sequence; and performing temporal noise filtering on the original animation frame sequence based on the temporal optical flow residual sequence to obtain the smooth animation frame sequence.
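As a non-limiting illustration of what makes the convolution of claim 7 causal, the following sketch convolves a feature sequence along the time axis only, padding on the past side so that the output at time t never depends on a later frame. A full three-dimensional causal convolution would also convolve over height and width; that part, the replicate padding, and the function name are simplifications or assumptions of this example.

```python
def causal_temporal_conv(frames, kernel):
    """Convolve along time using only the current and earlier frames.

    `frames` is a list of equal-length feature vectors, one per time step.
    `kernel[-1]` weights the current frame and `kernel[0]` the oldest one.
    The past is padded by repeating the first frame (an assumption; zero
    padding is another common choice).
    """
    k = len(kernel)
    padded = [frames[0]] * (k - 1) + list(frames)
    width = len(frames[0])
    out = []
    for t in range(len(frames)):
        window = padded[t:t + k]  # frames t-k+1 .. t, never t+1
        out.append([sum(w * f[i] for w, f in zip(kernel, window))
                    for i in range(width)])
    return out


# Averaging each frame with its predecessor.
frames = [[1.0], [2.0], [3.0]]
result = causal_temporal_conv(frames, kernel=[0.5, 0.5])
```

Because no future frame enters any window, frames can be processed as they arrive, which is the property that matters for real-time generation.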

8. The method for real-time generation of anime-style virtual character animation according to claim 1, characterized in that performing temporal noise filtering on the original animation frame sequence based on the temporal optical flow residual sequence to obtain the smooth animation frame sequence includes: selecting each original animation frame in the original animation frame sequence in turn as a target animation frame, and calculating a corresponding pixel confidence map based on the temporal optical flow residual corresponding to the target animation frame in the temporal optical flow residual sequence; extracting local contrast and edge information from the target animation frame, and constructing a dynamic filtering weight matrix based on the local contrast, the edge information, and the pixel confidence map; taking the original animation frame immediately preceding the target animation frame as a preceding animation frame, spatially remapping the preceding animation frame according to the temporal optical flow residual corresponding to the target animation frame, and performing weighted aggregation of the spatially remapped preceding animation frame and the target animation frame according to the dynamic filtering weight matrix to obtain a denoised animation frame; performing residual concatenation on the denoised animation frame based on the target animation frame, and performing bilateral filtering on the residual-concatenated denoised animation frame to obtain a smooth animation frame; and aggregating the smooth animation frames corresponding to all target animation frames in the original animation frame sequence into the smooth animation frame sequence.
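As a non-limiting illustration of the pixel confidence map and weighted aggregation of claim 8, the following sketch maps each optical flow residual to a confidence in (0, 1] and blends the already remapped preceding frame into the target frame only where the confidence is high. The Gaussian confidence function, the `sigma` and `max_weight` parameters, and the grayscale nested-list frames are assumptions made for this example; the local contrast and edge terms of the dynamic filtering weight matrix, the spatial remapping, and the bilateral filtering are omitted.

```python
import math


def confidence_map(residual, sigma=1.0):
    """Small flow residual -> confidence near 1; large residual -> near 0."""
    return [[math.exp(-(r * r) / (2.0 * sigma * sigma)) for r in row]
            for row in residual]


def temporal_blend(target, warped_prev, confidence, max_weight=0.5):
    """Per-pixel weighted aggregation of the target and the warped previous frame.

    The previous frame contributes at most `max_weight`, and only where the
    motion estimate is trusted.
    """
    return [[(1.0 - max_weight * c) * t + max_weight * c * p
             for t, p, c in zip(t_row, p_row, c_row)]
            for t_row, p_row, c_row in zip(target, warped_prev, confidence)]


# Two pixels: the first has a perfect flow match, the second a large residual.
residual = [[0.0, 100.0]]
target = [[10.0, 10.0]]
warped_prev = [[20.0, 20.0]]
conf = confidence_map(residual)
denoised = temporal_blend(target, warped_prev, conf)
```

In this example the first pixel is averaged with the preceding frame while the second keeps its target value, which avoids ghosting where the flow is unreliable.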

9. The method for real-time generation of anime-style virtual character animation according to claim 8, characterized in that performing real-time style feedback and style correction on the smooth animation frame sequence to obtain the virtual character animation includes: performing multidimensional style extraction on each smooth animation frame in the smooth animation frame sequence to obtain smooth style features; calculating the feature difference degree between the smooth style features and the multidimensional style features, and generating a style deviation gradient map based on the feature difference degree; generating a style compensation operator based on the style deviation gradient map, and performing a second pixel remapping on the smooth animation frame sequence based on the style compensation operator to obtain a style calibration frame sequence; and performing style stability smoothing on the style calibration frame sequence, and stitching the smoothed style calibration frame sequence into the virtual character animation.
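As a non-limiting illustration of the feedback loop of claim 9, the following sketch measures how far a frame's style statistics have drifted from the reference and shifts the pixels back. It is heavily simplified under stated assumptions: the style feature is reduced to per-channel mean color, the deviation is a single vector rather than a gradient map, and the compensation operator is a clamped additive shift scaled by a hypothetical `gain`.

```python
def channel_means(frame):
    """Per-channel mean color of a frame given as rows of RGB tuples."""
    pix = [p for row in frame for p in row]
    return [sum(ch) / len(pix) for ch in zip(*pix)]


def style_deviation(frame_feature, reference_feature):
    """Signed difference: how far the frame must move to match the reference."""
    return [r - f for f, r in zip(frame_feature, reference_feature)]


def compensate(frame, deviation, gain=1.0):
    """Shift every pixel by gain * deviation, clamped to [0, 255]."""
    return [[tuple(min(255.0, max(0.0, v + gain * d))
                   for v, d in zip(p, deviation))
             for p in row]
            for row in frame]


# The frame is slightly too dark in red and too bright in blue
# relative to the reference style statistics.
frame = [[(100, 100, 100), (120, 100, 100)]]
reference = [120.0, 100.0, 90.0]
deviation = style_deviation(channel_means(frame), reference)
calibrated = compensate(frame, deviation)
```

A `gain` below 1.0 would apply the correction gradually over several frames, which is one way the subsequent style stability smoothing could be approximated in this sketch.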

10. A real-time generation system for anime-style virtual character animation, characterized in that the system includes a feature extraction module, an anchor point recognition module, a controlled diffusion module, a noise smoothing module, and a style correction module, wherein: the feature extraction module is used to extract multidimensional style features from an anime reference image, wherein the multidimensional style features include line art features, color features, texture features, and character form features; the anchor point recognition module is used to extract a portrait key point sequence and an interactive object sequence from a real-time captured portrait video, and to generate a dynamic semantic identity anchor stream based on the interactive object sequence; the controlled diffusion module is used to embed the dynamic semantic identity anchor stream into a latent space as a latent variable constraint, with the portrait key point sequence as a structural constraint and the multidimensional style features as global guidance, and to perform controlled diffusion inference in the latent space to obtain an animation latent feature frame sequence; the noise smoothing module is used to perform three-dimensional causal decoding and temporal smoothing filtering on the animation latent feature frame sequence to obtain a smooth animation frame sequence; and the style correction module is used to perform real-time style feedback and style correction on the smooth animation frame sequence to obtain the virtual character animation.