Human action editing method and device based on sketch guidance, terminal and medium
By combining motion autoencoders and diffusion generative models, two-dimensional sketch data is aligned with three-dimensional human motion sequences in the latent feature space, and text prompts are used for editing. This solves the problem that sketches cannot correspond to three-dimensional human joint data, and improves the controllability and robustness of human motion editing.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- PEKING UNIV SHENZHEN GRADUATE SCHOOL
- Filing Date
- 2026-05-22
- Publication Date
- 2026-06-19
AI Technical Summary
In existing technologies, sketches cannot establish a stable correspondence with 3D human joint data, making it difficult to introduce sketch information into human motion editing models, resulting in insufficient controllability and robustness of human motion editing.
A trained motion autoencoder is used to align the 2D sketch data and the initial 3D human motion sequence in the latent feature space to determine the latent features of the sketch. A trained diffusion generation model is then used to add noise and perform inverse denoising on the initial 3D human motion sequence based on guiding conditions. Combined with text prompt data, the sequence is decoded and reconstructed to determine the 3D human motion editing image.
It improves the controllability and robustness of human motion editing, achieves stable correspondence of sketch information in the human motion editing model, and improves interaction efficiency and editing accuracy.
Figure CN122244401A_ABST
Description
Technical Field
[0001] This invention relates to the field of image editing technology, and in particular to a sketch-guided method, apparatus, terminal, and medium for editing human motion.
Background Technology
[0002] Human motion editing refers to modifying existing three-dimensional human motion sequences locally or entirely while maintaining the original motion temporal structure, in order to meet the user's new motion intentions or action constraint requirements.
[0003] Existing methods mostly use text descriptions to control conditions. Users input natural language commands, and the model modifies the original motion based on the semantics of the text. Although these methods are intuitive and easy to combine with related generative models, natural language has problems such as semantic ambiguity and imprecise expression. It is difficult to describe fine spatial poses and joint configurations, and prompts often need to be adjusted multiple times, resulting in low interaction efficiency.
[0004] Hand-drawn sketches, as an intuitive visual expression, can directly reflect the user's understanding of spatial structure and posture. However, sketches are usually two-dimensional, abstract, and noisy representations, making it impossible to establish a stable correspondence with three-dimensional human joint data. This also makes it difficult for existing technologies to introduce sketch information into human motion editing models to improve the controllability and robustness of human motion editing.
[0005] Therefore, existing technologies still need improvement and development.
Summary of the Invention
[0006] The technical problem to be solved by the present invention is to provide a sketch-guided human motion editing method, device, terminal and medium to address the above-mentioned defects of the prior art. The aim is to solve the problem that sketches cannot establish a stable correspondence with three-dimensional human joint data in the prior art, which makes it difficult to introduce sketch information into the human motion editing model, so as to improve the controllability and robustness of human motion editing.
[0007] The technical solution adopted by this invention to solve the problem is as follows: In a first aspect, embodiments of the present invention provide a sketch-guided method for editing human motion, wherein the method includes: acquiring an initial 3D human motion sequence and 2D sketch data, and using a trained motion autoencoder to align the 2D sketch data and the initial 3D human motion sequence in the latent feature space to determine the latent features of the sketch; using the latent features of the sketch as guiding conditions, and using a trained diffusion generation model to add noise to and perform inverse denoising on the initial three-dimensional human motion sequence based on the guiding conditions to determine the target image features, wherein the guiding conditions also include text prompt data and the initial three-dimensional human motion sequence; and decoding and reconstructing the target image features to determine the three-dimensional human motion editing image.
[0008] In one implementation, the motion autoencoder includes a sketch encoder, a human motion encoder, and a shared decoder; the training method for the motion autoencoder includes: constructing a training set of sketch and human motion pairs, which includes standard 3D human motion images and corresponding standard sketch data; encoding the standard sketch data using the sketch encoder to determine initial sketch features; encoding the standard three-dimensional human motion images using the human motion encoder to determine initial human motion features; decoding and reconstructing the initial sketch features and the initial human motion features using the shared decoder to determine the corresponding reconstructed sketch data and reconstructed human motion image; calculating an alignment loss based on the reconstructed sketch data, the reconstructed human motion image, the initial sketch features, and the initial human motion features; and optimizing the parameters of the motion autoencoder based on the alignment loss.
[0009] In one implementation, calculating an alignment loss based on the reconstructed sketch data, the reconstructed human motion image, the initial sketch features, and the initial human motion features includes: A contrast loss function based on cosine similarity is used to calculate the contrast loss based on the initial sketch features and the initial human motion features; The reconstruction loss is calculated based on the reconstructed sketch data, the reconstructed human motion image, the initial sketch features, and the initial human motion features. The alignment loss is determined based on the contrast loss and the reconstruction loss.
[0010] In one implementation, calculating a reconstruction loss based on the reconstructed sketch data, the reconstructed human motion image, the initial sketch features, and the initial human motion features includes: The sketch reconstruction loss is determined based on the reconstructed sketch data and the standard sketch data; The human motion reconstruction loss is determined based on the reconstructed human motion image and the initial human motion features; The reconstruction loss is determined based on the sketch reconstruction loss and the human motion reconstruction loss.
[0011] In one implementation, using the latent features of the sketch as guiding conditions and using a trained diffusion generation model to add noise to and perform inverse denoising on the initial 3D human motion sequence based on the guiding conditions to determine the target image features includes: obtaining editing requirements and generating a time mask based on those requirements; extracting the human motion image to be edited from the initial three-dimensional human motion sequence according to the time mask; and using the trained diffusion generation model to add noise to and perform inverse denoising on the human motion image to be edited based on the guiding conditions to determine the target image features.
[0012] In one implementation, the training method for the diffusion generation model includes: Noise-containing latent features are obtained by forward diffusion noise addition based on standard 3D human motion images; Using the latent features of the sketch data corresponding to the standard sketch data as guiding conditions, the noise prediction network of the diffusion generation model is used to perform noise prediction and inverse denoising on the noisy latent features based on the guiding conditions, and the target image features corresponding to the standard three-dimensional human motion image are determined. The guiding conditions also include the standard three-dimensional human motion image and standard text prompt data. The diffusion generation model loss is calculated based on the noisy latent features, the sketch latent features, and the target image features, and the parameters of the diffusion generation model are optimized based on the diffusion generation model loss.
[0013] In one implementation, calculating the diffusion generation model loss based on the noisy latent features, the sketch latent features, and the target image features includes: calculating the diffusion loss based on the noisy latent features; calculating the sketch consistency loss based on the sketch latent features and the target image features; and determining the diffusion generation model loss based on the diffusion loss and the sketch consistency loss.
[0014] In a second aspect, embodiments of the present invention also provide a sketch-guided human motion editing device, wherein the sketch-guided human motion editing device includes: a sketch alignment module, used to acquire the initial three-dimensional human motion sequence and two-dimensional sketch data, and to align the two-dimensional sketch data and the initial three-dimensional human motion sequence in the latent feature space using a trained motion autoencoder to determine the latent features of the sketch; an image editing module, used to take the latent features of the sketch as guiding conditions and to use a trained diffusion generation model to add noise to and inversely denoise the initial three-dimensional human motion sequence based on the guiding conditions to determine the target image features, wherein the guiding conditions also include text prompt data and the initial three-dimensional human motion sequence; and a feature decoding module, used to decode and reconstruct the target image features to determine the three-dimensional human motion editing image.
[0015] In a third aspect, embodiments of the present invention also provide a terminal, the terminal including a memory and one or more processors; the memory stores one or more programs; the programs include instructions for executing the sketch-guided human motion editing method as described above; the processor is used to execute the programs.
[0016] In a fourth aspect, embodiments of the present invention also provide a computer-readable storage medium storing a plurality of instructions, wherein the instructions are adapted to be loaded and executed by a processor to implement any of the sketch-guided human motion editing methods described above.
[0017] The beneficial effects of this invention are as follows: In this embodiment, a trained motion autoencoder aligns two-dimensional sketch data and an initial three-dimensional human motion sequence in a latent feature space to determine the sketch's latent features. Using these latent features as guiding conditions, a trained diffusion generation model adds noise and performs inverse denoising on the initial three-dimensional human motion sequence based on these guiding conditions to determine the target image features. The guiding conditions also include text prompt data and the initial three-dimensional human motion sequence. The target image features are then decoded and reconstructed to determine the three-dimensional human motion editing image. Therefore, this invention effectively solves the problem in existing technologies where sketches cannot establish a stable correspondence with three-dimensional human joint data, making it difficult to incorporate sketch information into human motion editing models to improve the controllability and robustness of human motion editing.
Description of the Drawings
[0018] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0019] Figure 1 is a flowchart of the sketch-guided human motion editing method provided in an embodiment of the present invention.
[0020] Figure 2 is an example diagram of two-dimensional sketch data provided in an embodiment of the present invention.
[0021] Figure 3 is a schematic diagram comparing the motion editing of three-dimensional human motion images provided in an embodiment of the present invention.
[0022] Figure 4 is a schematic diagram of the feature embedding and alignment results of two-dimensional sketch data and three-dimensional human motion images provided in an embodiment of the present invention.
[0023] Figure 5 is a schematic diagram of the generation process of the sketch-guided human motion editing method provided in an embodiment of the present invention.
[0024] Figure 6 is a schematic diagram of the internal modules of the sketch-guided human motion editing device provided in an embodiment of the present invention.
[0025] Figure 7 is a schematic diagram of the terminal provided in an embodiment of the present invention.
Detailed Description
[0026] This invention discloses a method, apparatus, terminal, and medium for editing human motion based on sketch guidance. To make the objectives, technical solutions, and effects of this invention clearer and more explicit, the invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only for explaining the invention and are not intended to limit the invention.
[0027] Those skilled in the art will understand that, unless specifically stated otherwise, the singular forms “a,” “an,” “said,” and “the” used herein may also include the plural forms. It should be further understood that the term “comprising” as used in this specification means the presence of the stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It should be understood that when an element is referred to as being “connected” or “coupled” to another element, it can be directly connected or coupled to the other element, or there may be intermediate elements. Furthermore, “connected” or “coupled” as used herein can include wireless connections or wireless coupling. The term “and/or” as used herein includes all or any units and all combinations of one or more associated listed items.
[0028] It will be understood by those skilled in the art that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. It should also be understood that terms such as those defined in general dictionaries should be understood to have the same meaning as in the context of the prior art, and should not be interpreted in an idealized or overly formal sense unless specifically defined as herein.
[0029] Hand-drawn sketches, as an intuitive visual expression, can directly reflect the user's understanding of spatial structure and posture. However, sketches are usually two-dimensional, abstract, and noisy representations, making it impossible to establish a stable correspondence with three-dimensional human joint data. This also makes it difficult for existing technologies to introduce sketch information into human motion editing models to improve the controllability and robustness of human motion editing.
[0030] To address the aforementioned shortcomings of existing technologies, this invention provides a sketch-guided human motion editing method. The method employs a trained motion autoencoder to align two-dimensional sketch data and an initial three-dimensional human motion sequence in a latent feature space, determining the sketch's latent features. Using these latent features as guiding conditions, a trained diffusion generation model is used to add noise and perform inverse denoising on the initial three-dimensional human motion sequence based on these guiding conditions, determining the target image features. The guiding conditions also include text prompt data and the initial three-dimensional human motion sequence. Finally, the target image features are decoded and reconstructed to determine the three-dimensional human motion editing image. Therefore, this method effectively solves the problem in existing technologies where sketches cannot establish a stable correspondence with three-dimensional human joint data, making it difficult to incorporate sketch information into human motion editing models to improve the controllability and robustness of human motion editing.
[0031] Exemplary method: As shown in Figure 1, the method includes: Step S100: Obtain the initial three-dimensional human motion sequence and two-dimensional sketch data, and use a trained motion autoencoder to align the two-dimensional sketch data and the initial three-dimensional human motion sequence in the latent feature space to determine the latent features of the sketch.
[0032] In the image editing process, the system first receives the initial 3D human motion sequence and 2D sketch data input by the user. The initial 3D human motion sequence is temporal 3D data used to characterize the dynamic changes of the user's human body. It consists of multiple 3D human poses arranged chronologically, used to record or describe the evolution of the human body's position, posture, trajectory, and relative relationships between limbs in space over time. The initial 3D human motion sequence contains several 3D human motion images to be edited. The 2D sketch data corresponds to some of the 3D human motion images in the initial 3D human motion sequence. It consists of sparse outlines, lines, region boundaries, or local structures generated by the user through hand-drawing or other methods, used to express the user's editing intentions regarding the shape, layout, structure, posture, or region division of the 3D human motion images to be edited.
[0033] Due to significant differences in data dimension, structural form, and information density between 2D sketch data and 3D human motion sequences, directly establishing a mapping relationship between the two is quite difficult. To address this issue, this method proposes a motion autoencoder to learn a shared latent representation between sketches and human motion. This motion autoencoder includes a sketch encoder, a human motion encoder, and a shared decoder. The sketch encoder extracts the spatial structural features of the 2D sketch data, while the human motion encoder extracts the spatial-temporal features of the human joints. The outputs of both are mapped to the same low-dimensional latent space, and the shared decoder performs reconstruction constraints.
[0034] After acquiring the initial 3D human motion sequence and 2D sketch data input by the user, the 2D sketch data and the initial 3D human motion sequence are mapped to a unified latent feature space through a trained motion autoencoder. Matching and alignment are performed in the latent feature space, so that the sketch features and human joint features representing the same pose are highly consistent in the latent feature space.
[0035] In one implementation, the training method for the motion autoencoder includes: Step S101: Construct a sketch and human motion pairing training set, wherein the sketch and human motion pairing training set includes standard three-dimensional human motion images and corresponding standard sketch data; Step S102: Encode the standard sketch data using a sketch encoder to determine the initial sketch features; Step S103: Encode the standard three-dimensional human motion image using a human motion encoder to determine the initial human motion features; Step S104: Decode and reconstruct the initial sketch features and the initial human motion features using a shared decoder to determine the corresponding reconstructed sketch data and reconstructed human motion image; Step S105: Calculate the alignment loss based on the reconstructed sketch data, the reconstructed human motion image, the initial sketch features, and the initial human motion features, and update the parameters of the motion autoencoder based on the alignment loss.
[0036] Existing publicly available human motion datasets generally lack hand-drawn sketches in one-to-one correspondence with the motion data. To address this issue, before training the motion autoencoder, a sketch and human motion pairing training set is constructed to support the training and inference of the motion autoencoder and the subsequent diffusion generation model. Specifically, the construction method for the sketch and human motion pairing training set includes: selecting 3D human motion sequences from an existing standard 3D human motion dataset as the basic data source; extracting key motion frames or representative poses from the selected 3D human motion sequences as standard 3D human motion images in the sketch and human motion pairing training set; and obtaining the corresponding 2D sketch data based on the standard 3D human motion images as standard sketch data in the sketch and human motion pairing training set, thus forming the sketch and human motion pairing training set.
[0037] The 2D sketch data corresponding to the standard 3D human motion images can be acquired by having manual annotators draw 2D human posture sketches based on the corresponding poses; these serve as the basic sketch data, such as the hand-drawn data shown in Figure 2. The manual drawing process places no strict restrictions on drawing style, line thickness, or stroke continuity. This introduces the jitter, simplification, and irregularities common in real user drawings, resulting in sketches with greater diversity and realism. After a certain scale of high-quality hand-drawn sketch samples (i.e., basic sketch data) has been obtained, the sketch data is further augmented using a procedural extension method based on a generative model, resulting in extended sketch data, such as the synthetic data shown in Figure 2. Specifically, the method involves: projecting the 3D human joint coordinates (which can be extracted from standard 3D human motion or basic sketch data) onto a 2D plane to generate an initial skeleton outline; smoothing and curving the initial skeleton outline to simulate continuous hand-drawn lines; and then redrawing the initial skeleton outline using a style transfer-based or image-to-image generative model to obtain extended sketch data that is visually close to a human hand-drawn sketch. Through this method, sketch samples with random perturbations and style variations can be generated on a large scale while maintaining consistency in posture structure. Both the basic sketch data and the extended sketch data can be used as 2D sketch data corresponding to standard 3D human motion images.
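The first two steps of the procedural extension (projection and smoothing) can be illustrated with a minimal sketch. This is only an illustration under stated assumptions: the orthographic projection, the Chaikin corner-cutting smoother, and the function names `project_orthographic` and `smooth_polyline` are hypothetical choices that the source does not specify, and the final style-transfer redrawing step is omitted.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project_orthographic(joints: List[Point3D]) -> List[Point2D]:
    """Project 3D joints onto the x-y plane by dropping depth (assumed projection)."""
    return [(x, y) for x, y, _z in joints]


def smooth_polyline(points: List[Point2D], iterations: int = 2) -> List[Point2D]:
    """Chaikin corner cutting: rounds a skeleton polyline so it looks more like a drawn stroke."""
    for _ in range(iterations):
        if len(points) < 3:
            break
        smoothed = [points[0]]  # keep the first endpoint fixed
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            smoothed.append((0.75 * x0 + 0.25 * x1, 0.75 * y0 + 0.25 * y1))
            smoothed.append((0.25 * x0 + 0.75 * x1, 0.25 * y0 + 0.75 * y1))
        smoothed.append(points[-1])  # keep the last endpoint fixed
        points = smoothed
    return points


# Hypothetical arm chain (shoulder, elbow, wrist) in 3D coordinates.
arm_3d = [(0.0, 1.5, 0.2), (0.3, 1.2, 0.1), (0.6, 1.4, 0.0)]
outline = project_orthographic(arm_3d)
stroke = smooth_polyline(outline, iterations=2)
```

Random jitter or a learned redrawing model could then be applied to `stroke` to imitate hand-drawn variation; that part depends on the generative model chosen and is not shown here.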
[0038] After constructing a training set of sketch and human motion pairs, the motion autoencoder is trained using standard 3D human motion images and standard sketch data matched in the training set. The training process is as follows: the sketch encoder encodes the standard sketch data to obtain initial sketch features; the human motion encoder encodes the standard 3D human motion images to obtain initial human motion features; the shared decoder reconstructs the initial sketch features and initial human motion features, resulting in reconstructed sketch data (reconstructed data based on the initial sketch features) and reconstructed human motion images (reconstructed data based on the initial human motion features); the alignment loss is calculated based on the reconstructed sketch data, reconstructed human motion images, initial sketch features, and initial human motion features; the parameters of the motion autoencoder are optimized based on the alignment loss so that the optimized motion autoencoder can align the input 3D human motion sequence and 2D sketch data in a unified latent feature space.
[0039] In one implementation, calculating the alignment loss based on the reconstructed sketch data, the reconstructed human motion image, the initial sketch features, and the initial human motion features includes: Step S1051: Calculate the contrast loss based on the initial sketch features and the initial human motion features using a contrast loss function based on cosine similarity; Step S1052: Calculate the reconstruction loss based on the reconstructed sketch data, the reconstructed human motion image, the initial sketch features, and the initial human motion features; Step S1053: Determine the alignment loss based on the contrast loss and the reconstruction loss.
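[0040] Specifically, the alignment loss consists of two parts: contrast loss and reconstruction loss. The contrast loss employs a cosine similarity-based contrast loss function, calculated based on the initial sketch features and initial human motion features. Assume the standard sketch data is $s$ and the corresponding standard 3D human motion image is $m$. Then the initial sketch features obtained by the sketch encoder $E_s$ from the standard sketch data are $z_s$, and the initial human motion features obtained by the human motion encoder $E_m$ from the standard 3D human motion image are $z_m$. The encoding process is represented as follows: $z_s = E_s(s)$, $z_m = E_m(m)$.
[0041] The cosine similarity-based contrastive loss function is expressed as: $\mathcal{L}_{con} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\left(\mathrm{sim}(z_s^{i}, z_m^{i})/\tau\right)}{\sum_{j=1}^{N}\exp\left(\mathrm{sim}(z_s^{i}, z_m^{j})/\tau\right)}$, where $\tau$ is the temperature coefficient, $N$ is the batch size, $z_m^{j} = E_m(m^{j})$ with $m^{j}$ being the $j$-th standard three-dimensional human motion image in the batch, $z_s^{i}$ is the initial sketch feature of the $i$-th standard sketch, $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity function, and $\mathcal{L}_{con}$ is the contrast loss.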
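As an illustration of a cosine-similarity contrast loss with a temperature coefficient and a batch of paired features, the following is a minimal pure-Python sketch. It assumes the common one-directional InfoNCE form (each sketch feature is contrasted against all motion features in the batch); the source does not confirm this exact form, and the names `cosine_similarity` and `contrast_loss` as well as the default temperature are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrast_loss(
    sketch_feats: List[Sequence[float]],
    motion_feats: List[Sequence[float]],
    temperature: float = 0.1,  # assumed value, not taken from the source
) -> float:
    """Mean over the batch of -log softmax score of the matching (i, i) pair."""
    n = len(sketch_feats)
    total = 0.0
    for i in range(n):
        logits = [
            cosine_similarity(sketch_feats[i], motion_feats[j]) / temperature
            for j in range(n)
        ]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits)
        )
        total += log_denominator - logits[i]
    return total / n


# Toy batch: aligned pairs should score a much lower loss than swapped pairs.
sketch_batch = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = contrast_loss(sketch_batch, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = contrast_loss(sketch_batch, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing such a loss pulls the sketch feature and motion feature of the same pose together while pushing apart features of different poses within the batch.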
[0042] The reconstruction loss is calculated based on the reconstructed sketch data, the reconstructed human motion image, the initial sketch features, and the initial human motion features. The specific calculation process is as follows: determine the sketch reconstruction loss based on the reconstructed sketch data and the standard sketch data; determine the human motion reconstruction loss based on the reconstructed human motion image and the initial human motion features; and determine the reconstruction loss based on the sketch reconstruction loss and the human motion reconstruction loss.
[0043] The reconstruction loss can be calculated as follows: $\mathcal{L}_{rec} = \lVert \hat{s} - s \rVert_2^2 + \lVert \hat{m} - m \rVert_2^2$, where $\hat{m} = D(z_m)$ and $\hat{s} = D(z_s)$. Here $\hat{m}$ represents the reconstructed human motion image, obtained from the initial human motion features $z_m$; $\hat{s}$ represents the reconstructed sketch data, obtained from the initial sketch features $z_s$; $s$ and $m$ are the standard sketch data and the standard 3D human motion image; $D$ denotes the shared decoder; and $\mathcal{L}_{rec}$ is the reconstruction loss.
[0044] Based on the reconstruction loss and contrast loss described above, a target alignment loss function for the motion autoencoder is constructed, and the alignment loss is calculated using this target alignment loss function: $\mathcal{L}_{align} = \mathcal{L}_{rec} + \lambda\,\mathcal{L}_{con}$, where $\lambda$ is the weight parameter. In the target alignment loss function, reconstruction constraints are imposed through the reconstruction loss to ensure that sketch features and human joint features representing the same pose are highly consistent in the latent space. On this basis, contrastive learning constraints are introduced to further strengthen the alignment between sketch and human motion in the latent space, so that the sketch drawn by the user can be stably mapped to a latent representation that is semantically consistent with human motion, thereby providing reliable spatial constraints for subsequent motion editing.
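How a reconstruction term and a contrast term might be combined into one alignment loss can be sketched as follows. The mean-squared-error distance, the single weight on the contrast term, and the names `mse`, `reconstruction_loss`, and `alignment_loss` are assumptions made for illustration; the source only states that the alignment loss is determined from the two losses with a weight parameter.

```python
from typing import Sequence


def mse(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally sized flat vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def reconstruction_loss(
    sketch: Sequence[float],
    recon_sketch: Sequence[float],
    motion: Sequence[float],
    recon_motion: Sequence[float],
) -> float:
    """Sketch reconstruction error plus human motion reconstruction error."""
    return mse(recon_sketch, sketch) + mse(recon_motion, motion)


def alignment_loss(rec_loss: float, con_loss: float, weight: float = 0.5) -> float:
    """Weighted sum of the two terms; the weight value is a hypothetical default."""
    return rec_loss + weight * con_loss


# Toy flattened data: a perfect reconstruction gives zero reconstruction loss.
rec_perfect = reconstruction_loss([1.0, 2.0], [1.0, 2.0], [0.0, 0.0], [0.0, 0.0])
rec_noisy = reconstruction_loss([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [0.0, 2.0])
total = alignment_loss(rec_noisy, con_loss=2.0, weight=0.5)
```

In a real training loop the two terms would be computed on batched tensors and differentiated with respect to the encoder and decoder parameters; the scalar version here only shows how the terms are weighted.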
[0045] Step S200: Using the latent features of the sketch as guiding conditions, the trained diffusion generation model is used to add noise and perform inverse denoising on the initial three-dimensional human motion sequence based on the guiding conditions to determine the target image features. The guiding conditions also include text prompt data and the initial three-dimensional human motion sequence.
[0046] After completing the shared feature learning, the learned sketch latent features (the aligned output of the motion autoencoder based on 2D sketch data) are used as guiding conditions. A trained diffusion generation model is then used to edit the initial 3D human motion sequence based on these guiding conditions to obtain the target image features. Optionally, the guiding conditions can use only the sketch latent features, or they can incorporate feature vectors corresponding to text prompt data, the initial 3D human motion sequence, etc., in addition to the sketch latent features.
[0047] In one implementation, the latent features of the sketch are used as guiding conditions. A trained diffusion generation model is then used to add noise and perform inverse denoising on the initial 3D human motion sequence based on these guiding conditions to determine the target image features, including: Step S201: Obtain editing requirements and generate a time mask based on the editing requirements; Step S202: Extract the human motion image to be edited from the initial three-dimensional human motion sequence according to the time mask; Step S203: Using the trained diffusion model based on guiding conditions, add noise and perform reverse denoising on the human motion image to be edited to determine the target image features.
[0048] Before inputting the initial 3D human motion sequence into the diffusion generation model, local time segments of the initial 3D human motion sequence are masked, and only the parts that need modification are generated, thus maintaining the continuity and naturalness of the unedited motion segments. Let the initial 3D human motion sequence $X$ be represented as: $X = \{x_1, x_2, \dots, x_T\}$, where $T$ is the total number of motion frames and $x_i$ is the 3D human motion image at frame $i$. A time mask $M \in \{0, 1\}^{T}$ is generated based on editing requirements. The mask value for an edited frame is 0, and the mask value for a retained frame is 1. The masked human motion image to be edited, $\tilde{X}$, is: $\tilde{X} = M \odot X$, where $\odot$ denotes frame-wise multiplication.
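The time-mask convention described above (0 for frames to edit, 1 for retained frames) can be illustrated with a small sketch. The half-open frame ranges used to express the editing requirements and the names `build_time_mask`, `apply_time_mask`, and `frames_to_edit` are assumptions for illustration only.

```python
from typing import List, Tuple

Frame = List[float]


def build_time_mask(num_frames: int, edit_ranges: List[Tuple[int, int]]) -> List[int]:
    """Return a per-frame mask: 1 = retained frame, 0 = frame to edit.

    edit_ranges holds half-open [start, end) frame intervals (assumed input format).
    """
    mask = [1] * num_frames
    for start, end in edit_ranges:
        for i in range(max(start, 0), min(end, num_frames)):
            mask[i] = 0
    return mask


def apply_time_mask(sequence: List[Frame], mask: List[int]) -> List[Frame]:
    """Frame-wise product: retained frames are kept, edited frames are zeroed."""
    return [[value * m for value in frame] for frame, m in zip(sequence, mask)]


def frames_to_edit(mask: List[int]) -> List[int]:
    """Indices of the frames the diffusion model is expected to regenerate."""
    return [i for i, m in enumerate(mask) if m == 0]


# Toy 6-frame sequence with 2 values per frame; frames 2 and 3 are to be edited.
sequence = [[float(i), float(i) + 0.5] for i in range(6)]
mask = build_time_mask(len(sequence), [(2, 4)])
masked_sequence = apply_time_mask(sequence, mask)
edit_indices = frames_to_edit(mask)
```

Keeping the retained frames untouched is what allows the unedited segments to stay continuous with the regenerated ones.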
[0049] In one implementation, the training method for the diffusion generation model includes: Noise-containing latent features are obtained by forward diffusion noise addition based on standard 3D human motion images; Using the latent features of the sketch data corresponding to the standard sketch data as guiding conditions, the noise prediction network of the diffusion generation model is used to perform noise prediction and inverse denoising on the noisy latent features based on the guiding conditions, and the target image features corresponding to the standard three-dimensional human motion image are determined. The guiding conditions also include the standard three-dimensional human motion image and standard text prompt data. The diffusion generation model loss is calculated based on the noisy latent features, the sketch latent features, and the target image features, and the parameters of the diffusion generation model are optimized based on the diffusion generation model loss.
[0050] Specifically, the diffusion generation model is trained using a constructed training set of paired sketches and human motion images. Standard 3D human motion images from the paired sketch and human motion training set are input into the diffusion generation model, and forward diffusion with noise is performed based on the standard 3D human motion images to obtain noisy latent features.
[0051] During the diffusion process, the model applies a noise perturbation to the standard 3D human motion image at time step $t$: $z_t = \sqrt{\bar{\alpha}_t}\,z_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, where $z_0$ represents the latent features corresponding to one of the human motion image to be edited, the initial 3D human motion features, or the standard 3D human motion image, the specific choice being determined by the application scenario of the diffusion generation model (training scenario or actual application scenario); $z_t$ represents the noisy latent features after noise is added at time step $t$; $\bar{\alpha}_t$ represents the predefined noise scheduling parameter; and $\epsilon$ represents Gaussian noise.
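The forward noising step can be sketched in a few lines. This example assumes the standard closed-form noising used by denoising diffusion models, with a linear variance schedule; the schedule endpoints, the number of steps, and the names `linear_beta_schedule`, `cumulative_alpha_bars`, and `add_noise` are illustrative assumptions rather than values from the source.

```python
import math
import random
from typing import List, Tuple


def linear_beta_schedule(
    num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02
) -> List[float]:
    """Linearly spaced per-step noise variances (an assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + step * i for i in range(num_steps)]


def cumulative_alpha_bars(betas: List[float]) -> List[float]:
    """alpha_bar at step t is the running product of (1 - beta) up to t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(
    z0: List[float], t: int, alpha_bars: List[float], rng: random.Random
) -> Tuple[List[float], List[float]]:
    """Return (noisy latent at step t, the Gaussian noise that was added)."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    z_t = [signal * z + noise * e for z, e in zip(z0, eps)]
    return z_t, eps


rng = random.Random(0)
alpha_bars = cumulative_alpha_bars(linear_beta_schedule(1000))
z0 = [0.5, -1.0, 2.0]  # toy latent features
z_early, eps_early = add_noise(z0, 0, alpha_bars, rng)
z_late, eps_late = add_noise(z0, 999, alpha_bars, rng)
```

At an early step the noisy latent stays close to the clean latent, while at the last step it is dominated by Gaussian noise, which is the behavior the reverse denoising process has to undo.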
[0052] The diffusion generation model uses the sketch latent features, or the sketch latent features together with the standard text prompt data and an unmasked initial 3D human motion image, as guiding conditions to predict the noise term:

$$\hat{\epsilon} = \epsilon_\theta(x_t, t, z_s, c, x_m)$$

where $z_s$ denotes the sketch latent features; $c$ denotes the standard text prompt data or the text prompt data; $\epsilon_\theta$ denotes the noise prediction neural network with parameters $\theta$; $\theta$ denotes the set of learnable parameters of the noise prediction network; and $x_m$ denotes the initial 3D human motion image or the standard 3D human motion image (the human motion image to be edited) that has been masked.
[0053] Using the predicted noise, the target image features $\hat{x}_0$ are gradually recovered from the noisy latent features $x_t$. During the noise addition and inverse denoising processes of the diffusion generation model, the diffusion generation model loss is calculated based on the noisy latent features, the sketch latent features, and the target image features at each time step. The parameters of the diffusion generation model are then optimized based on the diffusion generation model loss.
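As a hedged illustration of how predicted noise leads back to a clean feature, the sketch below inverts the standard closed-form noising relation in a single step. The helper name `estimate_x0` and the toy values are assumptions for illustration only; in practice the recovery is performed gradually over many time steps, and the predicted noise comes from the trained noise prediction network rather than being known exactly.

```python
import math


def estimate_x0(xt, eps_hat, alpha_bar_t):
    """Solve x_t = sqrt(abar) * x0 + sqrt(1 - abar) * eps for x0, given predicted noise."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [(x - noise * e) / signal for x, e in zip(xt, eps_hat)]


# Toy check: if the predicted noise equals the true noise, x0 is recovered.
alpha_bar_t = 0.36
x0 = [1.0, -2.0, 0.5]
eps = [0.3, -0.1, 0.8]
xt = [math.sqrt(alpha_bar_t) * v + math.sqrt(1.0 - alpha_bar_t) * e for v, e in zip(x0, eps)]
x0_hat = estimate_x0(xt, eps, alpha_bar_t)
```

The closer the predicted noise is to the noise that was actually added, the closer the estimate is to the original feature, which is why the training objective focuses on noise prediction accuracy.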
[0054] In one implementation, calculating the diffusion generation model loss based on the noisy latent features, the sketch latent features, and the target image features includes: calculating the diffusion loss based on the noisy latent features; calculating the sketch consistency loss based on the sketch latent features and the target image features; and determining the diffusion generation model loss based on the diffusion loss and the sketch consistency loss.
[0055] The diffusion generation model loss consists of two parts, the diffusion loss and the sketch consistency loss, and is expressed as follows:

$$\mathcal{L} = \mathcal{L}_{diff} + \lambda\,\mathcal{L}_{sketch}$$

where $\mathcal{L}_{diff}$ is the diffusion loss, $\mathcal{L}_{sketch}$ is the sketch consistency loss, and $\lambda$ is the weight parameter.
[0056] The diffusion loss is the standard loss function used in training diffusion generation models. It measures the difference between the noise predicted by the neural network and the noise actually added during the forward diffusion process. The diffusion loss is the mean square error between the predicted noise and the actual noise, taken in expectation over real samples drawn from the data distribution, randomly sampled diffusion time steps, and random Gaussian noise. This error is used to optimize the parameters of the noise prediction neural network, so that the network gradually learns to predict the noise corresponding to each diffusion time step accurately. In addition to the diffusion loss, a sketch consistency loss is introduced to strengthen the spatial constraint that the sketch latent features impose at keyframes (the 3D human motion images corresponding to the sketch latent features). The sketch consistency loss is calculated as follows:

$$\mathcal{L}_{sketch} = \frac{1}{K}\sum_{k=1}^{K}\left\lVert \hat{x}_{\tau_k} - s_k \right\rVert_2^2$$

where $\mathcal{L}_{sketch}$ denotes the sketch consistency loss, $s_k$ denotes the $k$-th keyframe information obtained from the sketch, $\hat{x}$ denotes the generated motion, $K$ denotes the number of keyframes, and $\tau_k$ denotes the keyframe index information.
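The two-part loss can be sketched on toy values as follows. This is a minimal illustration under stated assumptions: the squared Euclidean distance is used for both terms, keyframes are given as a mapping from frame index to the pose derived from the sketch, and the names `squared_l2`, `sketch_consistency_loss`, `diffusion_model_loss`, and the weight value 0.5 are hypothetical and not taken from this description.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sketch_consistency_loss(generated_frames, sketch_keyframes):
    """Average squared distance between generated poses and sketch poses at keyframes.

    sketch_keyframes maps a frame index to the pose derived from the sketch.
    """
    total = sum(squared_l2(generated_frames[i], pose) for i, pose in sketch_keyframes.items())
    return total / len(sketch_keyframes)


def diffusion_model_loss(eps, eps_hat, generated_frames, sketch_keyframes, weight=0.5):
    """Diffusion loss plus weighted sketch consistency loss (weight is hypothetical)."""
    diffusion = squared_l2(eps, eps_hat)
    return diffusion + weight * sketch_consistency_loss(generated_frames, sketch_keyframes)


eps, eps_hat = [1.0, 0.0], [0.0, 0.0]                      # true vs. predicted noise
generated_frames = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]    # toy generated motion
sketch_keyframes = {1: [1.0, 3.0]}                         # sketch constrains frame 1
loss = diffusion_model_loss(eps, eps_hat, generated_frames, sketch_keyframes)
```

In this toy case the diffusion term is 1.0 and the sketch term is 4.0, so the combined loss with weight 0.5 is 3.0. Only the frames named in the keyframe mapping contribute to the sketch term, which matches the idea of constraining the generated motion at keyframes rather than everywhere.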
[0057] Furthermore, considering that a single-frame sketch cannot fully express the temporal semantics of motion, a textual semantic enhancement mechanism can be introduced. A visual-language model infers the action semantics implicit in the 2D sketch data and supplements or corrects the original textual prompts, ensuring semantic consistency between the textual and sketch conditions. In the specific implementation, the 2D sketch data, the original textual description, and the 3D human motion sequence during the editing period are input into the semantic inference model to generate an enhanced textual description, which is then used as the textual condition input for the diffusion model. This mechanism effectively avoids the motion incoherence problem caused by semantic conflicts between the sketch and the text, making the generated results more natural in terms of temporal continuity and action logic.
[0058] Step S300: Decode and reconstruct the features of the target image to determine the three-dimensional human motion editing image.
[0059] The target image features output by the diffusion generation model are first transformed into spatiotemporal feature tensors through linear mapping and dimensional reshaping. The temporal motion features are then decoupled through 3D convolution and gated recurrent units. Transposed convolution and interpolation are used to improve the spatial resolution and fuse the spatiotemporal features. Subsequently, the coordinates of the 3D human joints are regressed and smoothed by fully connected layers. Based on the Skinned Multi-Person Linear (SMPL) model, the joints are reconstructed into a 3D human mesh. Finally, the mesh is rendered into a 2D image sequence by a 3D renderer, and after post-processing such as Gaussian filtering and color calibration, the 3D human motion editing image is obtained.
[0060] Based on the above technical solution, this invention realizes an intuitive, robust, and high-precision method for human motion editing, with advantages in motion controllability, semantic consistency, and user interaction experience. Figure 3 is a comparative illustration of motion editing on 3D human motion images: Figure 3(a) is the unedited 3D human motion image, and Figure 3(b) is the edited 3D human motion image. Existing methods for this kind of editing create a skeletal structure, set key poses for the skeleton at different time points, bind the target's mesh model to the skeleton so that the skin deforms with the skeletal movement, and then use interpolation to transition the skeletal motion smoothly between keyframes, ultimately generating a smooth character animation sequence. In the present method, by contrast, the user only hand-draws the desired motion, and the remaining parts are generated automatically from the hand-drawn motion, which is simpler than the existing methods.
[0061] Figure 4 shows the alignment results between the 2D sketch data and the 3D human motion images. Out of 32 samples, only a very small number are not aligned. Figure 5 shows a sample of action editing. By combining text prompts (or text prompts with enhanced text) and two-dimensional sketch data, existing actions can be edited accurately and results that meet the user's intent can be generated.
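For comparison, the interpolation step of the conventional keyframe workflow mentioned above can be sketched as follows. This is only an illustrative assumption of how such interpolation is commonly done (plain linear blending of pose vectors); the function name `interpolate_keyframes` and the pose representation as a flat list of numbers are hypothetical, and production animation tools typically interpolate joint rotations with smoother curves.

```python
def interpolate_keyframes(key_poses, num_frames):
    """Linearly interpolate poses between keyframes; hold the nearest pose outside them.

    key_poses maps a frame index to a pose given as a list of floats.
    """
    indices = sorted(key_poses)
    frames = []
    for f in range(num_frames):
        if f <= indices[0]:
            frames.append(list(key_poses[indices[0]]))
        elif f >= indices[-1]:
            frames.append(list(key_poses[indices[-1]]))
        else:
            # Find the pair of keyframes that brackets frame f.
            lo = max(i for i in indices if i <= f)
            hi = min(i for i in indices if i > f)
            w = (f - lo) / (hi - lo)
            frames.append([(1 - w) * p + w * q for p, q in zip(key_poses[lo], key_poses[hi])])
    return frames


frames = interpolate_keyframes({0: [0.0, 0.0], 4: [4.0, 8.0]}, 5)
```

Every key pose in this workflow has to be authored by hand, which is the manual effort that the sketch-guided approach aims to reduce.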
[0062] Based on the above embodiments, the present invention also provides a sketch-guided human motion editing device. As shown in Figure 6, the device includes: a sketch alignment module 01, used to acquire the initial three-dimensional human motion sequence and two-dimensional sketch data, and to use a trained motion autoencoder to align the two-dimensional sketch data and the initial three-dimensional human motion sequence in the latent feature space to determine the latent features of the sketch; an image editing module 02, used to take the latent features of the sketch as guiding conditions, and to use a trained diffusion generation model to add noise and perform inverse denoising on the initial three-dimensional human motion sequence based on the guiding conditions to determine the target image features, where the guiding conditions also include text prompt data and the initial three-dimensional human motion sequence; and a feature decoding module 03, used to decode and reconstruct the target image features to determine the three-dimensional human motion editing image.
[0063] Based on the above embodiments, the present invention also provides a terminal, a schematic block diagram of which may be as shown in Figure 7. The terminal includes a processor, memory, a network interface, and a display screen connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for running the operating system and computer programs stored in the non-volatile storage media. The network interface is used to communicate with external terminals via a network connection. When executed by the processor, the computer program implements the sketch-guided human motion editing method. The display screen can be an LCD screen or an e-ink screen.
[0064] Those skilled in the art will understand that the schematic diagram shown in Figure 7 is merely a partial structural diagram related to the present invention and does not constitute a limitation on the terminal to which the present invention is applied. A specific terminal may include more or fewer components than those shown in the figure, combine certain components, or have a different component arrangement.
[0065] In one implementation, the terminal's memory stores one or more programs, and these programs are configured to be executed by one or more processors, and the programs contain instructions for performing a sketch-guided human motion editing method.
[0066] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided by this invention can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
[0067] In summary, this invention discloses a sketch-guided human motion editing method, apparatus, terminal, and medium. The method aligns two-dimensional sketch data and an initial three-dimensional human motion sequence in a latent feature space using a trained motion autoencoder to determine latent sketch features. These latent sketch features are then used as guiding conditions. A trained diffusion generation model is employed to add noise and perform inverse denoising on the initial three-dimensional human motion sequence based on these guiding conditions to determine target image features. The guiding conditions also include text prompt data and the initial three-dimensional human motion sequence. Finally, the target image features are decoded and reconstructed to determine the three-dimensional human motion editing image. This method therefore solves the problem in existing technologies that sketches cannot establish a stable correspondence with three-dimensional human joint data, which makes it difficult to incorporate sketch information into human motion editing models, and thereby improves the controllability and robustness of human motion editing.
[0068] It should be understood that the application of the present invention is not limited to the examples above. Those skilled in the art can make improvements or modifications based on the above description, and all such improvements and modifications should fall within the protection scope of the appended claims.
Claims
1. A sketch-guided method for editing human motion, characterized in that, The method includes: Acquire initial 3D human motion sequence and 2D sketch data, and use a trained motion autoencoder to align the 2D sketch data and the initial 3D human motion sequence in the latent feature space to determine the latent features of the sketch. Using the latent features of the sketch as guiding conditions, a trained diffusion generation model is used to add noise and perform inverse denoising on the initial three-dimensional human motion sequence based on the guiding conditions to determine the target image features. The guiding conditions also include text prompt data and the initial three-dimensional human motion sequence. The target image features are decoded and reconstructed to determine the three-dimensional human motion editing image.
2. The sketch-guided human motion editing method according to claim 1, characterized in that, The motion autoencoder includes a sketch encoder, a human motion encoder, and a shared decoder; the training method for the motion autoencoder includes: Construct a training set of sketches and human motion pairs, which includes standard 3D human motion images and corresponding standard sketch data; Standard sketch data is encoded using a sketch encoder to determine initial sketch features; The initial human motion features are determined by encoding standard three-dimensional human motion images using a human motion encoder. The initial sketch features and the initial human motion features are decoded and reconstructed using a shared decoder to determine the corresponding reconstructed sketch data and reconstructed human motion image. Alignment loss is calculated based on the reconstructed sketch data, the reconstructed human motion image, the initial sketch features, and the initial human motion features. The parameters of the motion autoencoder are then optimized based on the alignment loss.
3. The sketch-guided human motion editing method according to claim 2, characterized in that, The alignment loss is calculated based on the reconstructed sketch data, the reconstructed human motion image, the initial sketch features, and the initial human motion features, including: A contrast loss function based on cosine similarity is used to calculate the contrast loss based on the initial sketch features and the initial human motion features; The reconstruction loss is calculated based on the reconstructed sketch data, the reconstructed human motion image, the initial sketch features, and the initial human motion features. The alignment loss is determined based on the contrast loss and the reconstruction loss.
4. The sketch-guided human motion editing method according to claim 3, characterized in that, The reconstruction loss is calculated based on the reconstructed sketch data, the reconstructed human motion image, the initial sketch features, and the initial human motion features, including: The sketch reconstruction loss is determined based on the reconstructed sketch data and the standard sketch data; The human motion reconstruction loss is determined based on the reconstructed human motion image and the initial human motion features; The reconstruction loss is determined based on the sketch reconstruction loss and the human motion reconstruction loss.
5. The sketch-guided human motion editing method according to claim 1, characterized in that, Using the latent features of the sketch as guiding conditions, a trained diffusion generation model is employed to add noise and perform inverse denoising on the initial 3D human motion sequence based on these guiding conditions, thereby determining the target image features, including: Obtain editing requirements and generate a time mask based on those requirements; Extract the human motion image to be edited from the initial three-dimensional human motion sequence according to the time mask; The trained diffusion model is used to add noise and perform inverse denoising on the human motion image to be edited based on guiding conditions to determine the target image features.
6. The sketch-guided human motion editing method according to claim 2, characterized in that, The training method for the diffusion generation model includes: Noise-containing latent features are obtained by forward diffusion noise addition based on standard 3D human motion images; Using the latent features of the sketch data corresponding to the standard sketch data as guiding conditions, the noise prediction network of the diffusion generation model is used to perform noise prediction and inverse denoising on the noisy latent features based on the guiding conditions, and the target image features corresponding to the standard three-dimensional human motion image are determined. The guiding conditions also include the standard three-dimensional human motion image and standard text prompt data. The diffusion generation model loss is calculated based on the noisy latent features, the sketch latent features, and the target image features, and the parameters of the diffusion generation model are optimized based on the diffusion generation model loss.
7. The sketch-guided human motion editing method according to claim 6, characterized in that, The diffusion generation model loss is calculated based on the noisy latent features, the sketch latent features, and the target image features, including: Calculate the diffusion loss based on the noisy latent features; Calculate the sketch consistency loss based on the sketch latent features and the target image features; The diffusion generation model loss is determined based on the diffusion loss and the sketch consistency loss.
8. A sketch-guided human motion editing device, characterized in that, The device includes: The sketch alignment module is used to acquire the initial three-dimensional human motion sequence and two-dimensional sketch data, and to align the two-dimensional sketch data and the initial three-dimensional human motion sequence in the latent feature space using a trained motion autoencoder to determine the latent features of the sketch. The image editing module is used to use the latent features of the sketch as guiding conditions, and to use a trained diffusion generation model to add noise and inversely denoise the initial three-dimensional human motion sequence based on the guiding conditions to determine the target image features. The guiding conditions also include text prompt data and the initial three-dimensional human motion sequence. The feature decoding module is used to decode and reconstruct the features of the target image to determine the three-dimensional human motion editing image.
9. A terminal, characterized in that, The terminal includes a memory and one or more processors; the memory stores one or more programs; the programs contain instructions for executing the sketch-guided human motion editing method as described in any one of claims 1-7; the processors are used to execute the programs.
10. A computer-readable storage medium storing a plurality of instructions thereon, characterized in that, The instructions are loaded and executed by a processor to implement the steps of the sketch-guided human motion editing method according to any one of claims 1-7.