A dance motion generation method and device

By acquiring the semantic descriptions of limbs input by the user together with conditional information, and by using a diffusion model with a U-Net architecture to generate dance movements, the method addresses the insufficient fine-grained control of contextual information in text- and music-driven dance generation models, thereby improving the quality of the generated dance movements.

CN122244488A · Pending · Publication Date: 2026-06-19 · BEIJING JIAOTONG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING JIAOTONG UNIV
Filing Date
2026-01-21
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing text and music-generated dance models are limited by the scarcity of high-quality text-music-action triplet data, resulting in insufficient fine-grained control over contextual information and the generation of dance movements that cannot meet the needs of practical applications.

Method used

The semantic description of limbs input by the user and the conditional information are acquired, and the semantic description is feature-encoded to construct an initial dance movement with noise. Then, a diffusion model is used for denoising to generate the target dance movement. Feature fusion is performed by combining the U-Net architecture with a cross-modal attention mechanism.

Benefits of technology

It significantly improves the quality of dance movements, achieves fine-grained control at the body part level, and meets the actual application needs of users.

✦ Generated by Eureka AI based on patent content.


Abstract

This application provides a method and apparatus for generating dance movements. The method includes: acquiring semantic description information of limbs in a target dance movement to be generated, input by a user, and conditional information used to generate the target dance movement; performing feature encoding on the semantic description information to obtain multi-dimensional joint motion features of the dancer in the target dance movement; constructing a noisy initial dance movement based on the multi-dimensional joint motion features; inputting the initial dance movement and conditional information into a dance movement generation model to obtain the target dance movement. The dance movement generation model is used to denoise the initial dance movement based on the diffusion principle according to the conditional information to obtain the target dance movement. The dance movement generation model is trained on a diffusion model using noisy dance movement samples, conditional information samples, and clean dance movement labels corresponding to the noisy dance movement samples as input. Using the method of this application can improve the quality of the generated dance movements.

Description

Technical Field

[0001] This application relates to the field of motion choreography technology, and in particular to a method and apparatus for generating dance movements.

Background Technology

[0002] In recent years, music-to-dance motion generation technology has evolved from traditional methods to deep learning. Early methods were mainly based on motion retrieval and graph matching techniques, which retrieved dance motion fragments matching the music beat from a motion database and spliced them together. However, these methods lacked flexibility and were difficult to adapt to changing music rhythms.

[0003] To further improve the quality of generated dance movements, the Text and Music to Dance (TM2D) model attempts to break through the limitations of a single music modality. It integrates text descriptions and music features in a dual-modal approach, using a Contrastive Language-Image Pre-training (CLIP) text encoder and a music feature extractor, and achieves feature fusion through a cross-modal attention mechanism. Finally, it generates dance movements based on a diffusion model. However, the TM2D model is limited by the scarcity of high-quality text-music-movement triplet data, resulting in insufficient fine-grained control over contextual information. Consequently, the quality of the generated dance movements still fails to meet the needs of practical applications.

Summary of the Invention

[0004] This application provides a method and apparatus for generating dance movements, which addresses the shortcomings of existing text- and music-driven dance generation models: they are limited by the scarcity of high-quality text-music-movement triplet data and have insufficient fine-grained control over contextual information. The method can significantly improve the quality of the generated target dance movements and better meet users' dance movement generation needs.

[0005] The first aspect of this application provides a method for generating dance movements, including: obtaining semantic description information, input by the user, of the limbs in the target dance movement to be generated, and condition information on which the generation of the target dance movement is based; performing feature encoding on the semantic description information to obtain the multi-dimensional joint motion features of the dancer in the target dance movement; constructing an initial dance movement with noise based on the multi-dimensional joint motion features; and inputting the initial dance movement and the conditional information into the dance movement generation model to obtain the target dance movement. The dance movement generation model is used to denoise the initial dance movement based on the conditional information and the diffusion principle to obtain the target dance movement. The dance movement generation model is obtained by training the diffusion model with noisy dance movement samples, conditional information samples, and clean dance movement labels corresponding to the noisy dance movement samples as input.

[0006] A second aspect of this application provides a dance movement generation device, comprising: an acquisition module, used to acquire semantic description information, input by the user, of the limbs in the target dance movement to be generated, and condition information on which the generation of the target dance movement is based; an encoding module, used to perform feature encoding on the semantic description information to obtain the multi-dimensional joint motion features of the dancer in the target dance movement; a construction module, used to construct an initial dance movement with noise based on the multi-dimensional joint motion features; and an input module, used to input the initial dance movement and the conditional information into the dance movement generation model to obtain the target dance movement. The dance movement generation model is used to denoise the initial dance movement based on the conditional information and the diffusion principle to obtain the target dance movement. The dance movement generation model is obtained by training the diffusion model with noisy dance movement samples, conditional information samples, and clean dance movement labels corresponding to the noisy dance movement samples as input.

[0007] A third aspect of this application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement a dance motion generation method as described in the first aspect.

[0008] The fourth aspect of this application provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements a dance motion generation method as described in the first aspect.

[0009] The fifth aspect of this application provides a computer program product, including a computer program that, when executed by a processor, implements the dance motion generation method described in the first aspect.

[0010] In this application, the semantic description information of the limbs in the target dance movement to be generated, as input by the user, and the conditional information on which the target dance movement is generated are first obtained. Next, the semantic description information is feature-encoded to obtain the multi-dimensional joint motion features of the dancer in the target dance movement. Then, based on the multi-dimensional joint motion features, a noisy initial dance movement is constructed. Finally, the initial dance movement and conditional information are input into the dance movement generation model to obtain the target dance movement. The dance movement generation method of this application can fully utilize detailed contextual information such as the semantic description information of the limbs and the multi-dimensional joint motion features to enhance the control of the limbs in the target dance movement. That is, it can achieve fine-grained control at the body part level during the generation of the target dance movement, effectively solving the problem of insufficient fine-grained control of contextual information in existing technologies. Therefore, it can significantly improve the quality of the generated target dance movement, thereby better meeting the actual application needs of users.

Attached Figure Description

[0011] To more clearly illustrate the technical solutions in this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0012] Figure 1 is a flowchart illustrating a dance motion generation method according to an embodiment of this application.

[0013] Figure 2 is a schematic diagram of the structure of the dance motion generation model shown in the embodiments of this application.

[0014] Figure 3 is a structural block diagram of a dance motion generation device provided in this application.

Detailed Implementation

[0015] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0016] The dance movement generation method of this application is described below with reference to Figures 1 to 3.

[0017] To address the problem that existing dance motion generation methods suffer from insufficient fine-grained control over contextual information, resulting in limited quality of generated dance motions and failing to meet practical application requirements, this application provides a novel dance motion generation method. The execution subject of this method can be a dance motion generation device or an electronic device. The electronic device can be a mobile electronic device or a non-mobile electronic device. For example, mobile electronic devices can be mobile phones, tablets, laptops, handheld computers, ultra-mobile personal computers (UMPCs), etc., while non-mobile electronic devices can be servers, personal computers (PCs), etc., and this application does not specifically limit the types of devices used.

[0018] Figure 1 is a flowchart illustrating a dance motion generation method according to an embodiment of this application. Referring to Figure 1, the method of this application may include the following steps: Step S101: Obtain the semantic description information, input by the user, of the limbs in the target dance movement to be generated, and the condition information on which the generation of the target dance movement is based.

[0019] In this embodiment, the dance movements are a sequence of actions, specifically including the dancer's body postures at multiple consecutive points in time. The dance movements consist of multiple frames, and the data content of one frame represents the dancer's body posture at a single point in time. The body posture can be described using the motion characteristics of multiple joints of the dancer's body.

[0020] Semantic descriptions of limbs refer to fine-grained textual descriptions of body parts, such as hands raised overhead or the right leg bent and raised backward. These semantic descriptions are typically used to describe the dancer's body posture in keyframes within a target dance movement.

[0021] In this embodiment, the conditional information is equivalent to the constraint information for the complete generation process of the target dance movement. The conditional information may include one or more of the following: accompaniment audio file, dancer's identity information, and style information of the target dance movement. The style information can be determined based on the dance segment number of the target dance movement. Of course, in addition to including one or more of the information listed above, the conditional information may also include other types of information as needed.

[0022] Step S102: Encode the semantic description information to obtain the multi-dimensional joint motion features of the dancer in the target dance movement.

[0023] In this embodiment, multidimensional joint motion features refer to the individual motion features of multiple joints of a dancer, described based on multiple dimensions. These multiple joints include root joints, which can be set according to actual needs, such as the pelvic nodes or hip centers of the dancer's skeletal model.

[0024] The multidimensional joint motion features include at least relative joint rotation parameters and relative joint position parameters based on the root joint. The relative joint position parameters represent the relative three-dimensional coordinate positions of the joints (excluding the root joint) relative to their parent joints. The relative joint rotation parameters represent the rotational orientation of the joints (excluding the root joint) relative to their parent joints.

[0025] In one implementation, a pre-trained PoseScript model can be used to parse the semantic description information to obtain the dancer's relative joint rotation parameters and relative joint position parameters. The specific processing steps of the PoseScript model are as follows. First, the semantic description information is parsed to obtain multiple sets of relative rotation parameters for multiple (e.g., 22) joints based on the root joint. Then, multiple (e.g., 12) candidate body poses are presented to the user, each corresponding to one set of relative joint rotation parameters, and the user's selection of the candidate body pose that best matches their creative goals or choreography requirements is received through interactive means. The set of relative joint rotation parameters corresponding to the selected candidate body pose is used as the relative joint rotation parameters for subsequent target dance movement generation. Finally, based on the selected relative joint rotation parameters, a forward kinematics algorithm is applied to calculate the positions of the remaining joints relative to the root joint, obtaining the relative joint position parameters.
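The forward kinematics step described above can be illustrated with a minimal sketch. It assumes each joint stores a rotation relative to its parent as a 3x3 matrix and a fixed bone offset; the three-joint chain, the `rot_z` helper, and all names are hypothetical and are not taken from PoseScript or from this application.

```python
import math

def rot_z(deg):
    """3x3 rotation matrix about the z axis (helper for the toy example)."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def forward_kinematics(parents, offsets, local_rots):
    """Positions of all joints relative to the root joint.

    parents[j]    index of joint j's parent (-1 for the root); parents precede children
    offsets[j]    bone vector from the parent to joint j, in the parent's frame
    local_rots[j] rotation of joint j relative to its parent (3x3 matrix)
    """
    n = len(parents)
    world_rots = [None] * n
    positions = [None] * n
    for j in range(n):
        p = parents[j]
        if p < 0:
            world_rots[j] = local_rots[j]
            positions[j] = [0.0, 0.0, 0.0]  # the root joint is the origin
        else:
            # Accumulate rotations down the chain, then place the joint.
            world_rots[j] = matmul(world_rots[p], local_rots[j])
            rotated = matvec(world_rots[p], offsets[j])
            positions[j] = [positions[p][i] + rotated[i] for i in range(3)]
    return positions

# Hypothetical 3-joint chain: root -> elbow -> wrist, each bone 1 unit along x.
parents = [-1, 0, 1]
offsets = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
local_rots = [rot_z(0), rot_z(90), rot_z(0)]
positions = forward_kinematics(parents, offsets, local_rots)
```

In this toy chain the 90-degree bend at the second joint swings the last bone from the x axis onto the y axis, so the final joint ends up one unit along x and one unit along y from the root.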

[0026] In practice, in addition to using the PoseScript model, other models with similar functions can also be used to obtain multidimensional joint motion features.

[0027] Step S103: Based on the multidimensional joint motion characteristics, construct the initial dance movements with noise.

[0028] In this embodiment, the data dimension of the initial dance movement with noise is the same as that of the target dance movement, for example, including the same number of frames, and the feature dimension of each frame is the same.

[0029] Step S104: Input the initial dance movement and conditional information into the dance movement generation model to obtain the target dance movement. The dance movement generation model is used to denoise the initial dance movement based on the diffusion principle according to the conditional information to obtain the target dance movement. The dance movement generation model is trained on the diffusion model by taking noisy dance movement samples, conditional information samples and clean dance movement labels corresponding to the noisy dance movement samples as input.

[0030] In this embodiment, the dance motion generation model is essentially a diffusion model decoder that implements a reverse denoising process. This embodiment can be based on the U-Net architecture to construct the dance motion generation model. Specifically, the dance motion generation model may include a downsampling encoder, an intermediate residual temporal module, an upsampling decoder, and a convolutional-based output layer.

[0031] When generating the target dance movement, the initial dance movement is input into a downsampling encoder, which extracts and reduces its dimensions to obtain a multi-scale latent feature representation. Next, the multi-scale latent feature representation and conditional information are input into an intermediate residual temporal module. This module performs temporal dependency modeling and cross-modal attention fusion based on the conditional information to obtain contextual features incorporating the conditional information. Then, the contextual features are input into an upsampling decoder, which performs upsampling and feature reconstruction to obtain decoded features with the same dimensions as the initial dance movement. Finally, the decoded features are convolutionally mapped through the output layer, projecting them from the feature space back into the original dance movement data space to obtain the denoised target dance movement.

[0032] In this embodiment, the semantic description information of the limbs in the target dance movement to be generated, input by the user, and the conditional information used to generate the target dance movement are first obtained. Next, the semantic description information is feature-encoded to obtain the multi-dimensional joint motion features of the dancer in the target dance movement. Then, based on the multi-dimensional joint motion features, a noisy initial dance movement is constructed. Finally, the initial dance movement and conditional information are input into the dance movement generation model to obtain the target dance movement. The dance movement generation method of this embodiment can fully utilize detailed contextual information such as the semantic description information of the limbs and the multi-dimensional joint motion features to enhance the control of the limbs in the target dance movement. That is, it can achieve fine-grained control at the body part level during the generation of the target dance movement, effectively solving the problem of insufficient fine-grained control of contextual information in existing technologies. Therefore, it can significantly improve the quality of the generated target dance movement, thereby better meeting the actual application needs of users.

[0033] In conjunction with the above embodiments, in one implementation, step S103 may include: Step S1031: Generate keyframes in the target dance movement based on the multidimensional joint motion characteristics.

[0034] In this embodiment, the number of keyframes to be generated can be determined based on the semantic description information of the limbs input by the user. For example, if the semantic description information of the limbs is "two seconds after raising both hands above the head, raise the right leg backward with the knee bent", it can be determined that two keyframes need to be generated. The first keyframe corresponds to the body posture of raising both hands above the head, and the second keyframe corresponds to the body posture of raising the right leg backward with the knee bent.

[0035] In this embodiment, the feature dimension of each keyframe is the same as that of each frame in the target dance motion, for example, 272. The feature dimension of each frame in the target dance motion is used to describe two types of information: global motion components and local motion components. Global motion components include, but are not limited to, relative root rotation parameters (representing the change in rotation dimension of the root joint in the current frame relative to the root joint in the previous frame), relative root position parameters (representing the change in displacement of the root joint in the current frame relative to the root joint in the previous frame in three-dimensional space), absolute root rotation parameters (representing the absolute rotational posture of the root joint in the current frame in the global coordinate system), and absolute root position parameters (representing the absolute three-dimensional spatial coordinates of the root joint in the current frame in the global coordinate system). Local motion components include relative joint position parameters (representing the relative 3D coordinate positions of joints other than the root joint relative to their parent joint in the current frame), relative joint rotation parameters (representing the rotational posture of joints other than the root joint relative to their parent joint in the current frame), global joint velocity (representing the instantaneous motion velocity of all joints, including the root joint, in global space), and foot contact state (representing whether the four key points of the left and right feet (toes and heels) are in contact with the ground in the current frame).

[0036] In actual implementation, the feature dimensions occupied by the global motion component and the local motion component can be set according to actual needs. For example, the global motion component can occupy 13 feature dimensions, and the local motion component can occupy 259 feature dimensions.

[0037] When generating the first keyframe, a blank frame is first obtained. This blank frame has the exact same format as the frames in the target dance movement, meaning it contains the same feature dimensions, which are used to describe global and local motion components. Next, the PoseScript model is used to parse the dancer's hands raised overhead, obtaining the dancer's multi-dimensional joint motion features, including relative joint rotation parameters and relative joint position parameters. Then, the feature dimensions corresponding to the relative joint rotation parameters in the blank frame are filled using the relative joint rotation parameters, and the feature dimensions corresponding to the relative joint position parameters in the blank frame are filled using the relative joint position parameters. In other words, the dancer's relative joint rotation and position parameters are mapped to their corresponding positions in the blank frame. Next, the remaining unfilled positions are masked, for example, marked as 0. Areas marked as 0 belong to the areas to be processed in subsequent generation processes. The blank frame after completing these processing steps is the keyframe.
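The keyframe-filling procedure can be sketched as follows. The 272-dimensional frame and the 13/259 split come from the text; the ordering of the components and the split of the 259 local dimensions (63 + 126 + 66 + 4 for 22 joints, with a 6-value rotation encoding) are assumptions chosen only because they add up, and `make_keyframe` is a hypothetical helper.

```python
FRAME_DIM = 272                   # per-frame feature dimension given in the text
GLOBAL = slice(0, 13)             # global motion component (13 dims in the text's example)
# Hypothetical layout of the 259 local dims for 22 joints (21 non-root):
REL_POS = slice(13, 76)           # 21 joints x 3 coordinates
REL_ROT = slice(76, 202)          # 21 joints x 6 (assumed 6-value rotation encoding)
VELOCITY = slice(202, 268)        # 22 joints x 3
FOOT_CONTACT = slice(268, 272)    # 4 contact flags (toes and heels)

def make_keyframe(rel_pos, rel_rot):
    """Fill a blank frame with pose features and leave every other slot at 0.

    Returns (frame, known), where known[i] is True for slots filled from the pose.
    """
    if len(rel_pos) != 63 or len(rel_rot) != 126:
        raise ValueError("unexpected pose feature length")
    frame = [0.0] * FRAME_DIM     # blank frame, same format as a target frame
    known = [False] * FRAME_DIM
    frame[REL_POS] = rel_pos
    frame[REL_ROT] = rel_rot
    for sl in (REL_POS, REL_ROT):
        for i in range(sl.start, sl.stop):
            known[i] = True
    return frame, known

# Placeholder pose values stand in for the parsed joint parameters.
frame, known = make_keyframe([0.1] * 63, [0.5] * 126)
```

Slots left at 0 (the global component, velocities, and foot contacts in this layout) are the ones the generation model is expected to fill in later.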

[0038] The principle of generating the second keyframe is exactly the same as that of generating the first keyframe, and will not be repeated here in this embodiment.

[0039] Step S1032: Construct a dance movement template based on keyframes. The dance movement template includes keyframes and blank frames other than keyframes. The data dimension of the dance movement template is the same as the data dimension of the target dance movement.

[0040] In this embodiment, the dance motion template includes keyframes and blank frames, with blank frames being frames that need to be filled with data. The dance motion template is equivalent to an incomplete version of the target dance motion data.

[0041] Step S1033: Add noise to the dance movement template to obtain the initial dance movement with noise.

[0042] In this step, Gaussian noise can be added to the dance move template, and the dance move template with Gaussian noise added will become the initial dance move.

[0043] In this embodiment, keyframes in the target dance movement are generated by using multi-dimensional joint motion features. This allows fine-grained features at the level of the dancer's body parts to be integrated into the initial dance movement. As a result, when generating the target dance movement in the future, the semantic description information of the limbs and detailed contextual information such as multi-dimensional joint motion features can be fully utilized to enhance the control of the limbs in the target dance movement. This effectively solves the problem of insufficient fine-grained control of contextual information in the prior art.

[0044] In conjunction with the above embodiments, in one implementation, step S1033 may include: constructing a mask matrix whose data dimension is the same as that of the target dance movement, in which the element value at each position corresponding to a keyframe is a first value and the element value at each position corresponding to a blank frame is a second value; generating standard Gaussian noise with the same data dimension as the dance movement template; and generating the noisy initial dance movement from the dance movement template, the mask matrix, and the standard Gaussian noise using a formula whose terms are the noisy initial dance movement, the mask matrix, the dance movement template, the standard Gaussian noise, and element-wise multiplication.

[0045] In this embodiment, a binary-valued tensor is first defined as the mask matrix; its two dimensions are the total number of frames in the target dance movement and the feature dimension of a single frame in the target dance movement.

[0046] In this embodiment, the element value at the position corresponding to the keyframe is specifically the value of each feature dimension corresponding to the keyframe, with the first value being 1. The element value at the position corresponding to the blank frame is specifically the value of each feature dimension corresponding to the blank frame, with the second value being 0.

[0047] In this embodiment, by constructing a mask matrix, noise can be added to the dance motion template to obtain an initial dance motion with noise, thereby ensuring that the subsequent dance motion generation model can successfully generate the target dance motion.
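The text names the ingredients of the noising formula (mask matrix, dance movement template, standard Gaussian noise, element-wise multiplication) but the formula itself does not survive in this text. The sketch below assumes one common masked form, x = M * template + (1 - M) * eps, in which keyframe entries are kept and blank-frame entries are replaced by noise. This form is an assumption, not the formula of this application, and all function names are hypothetical.

```python
import random

def build_mask(num_frames, frame_dim, keyframe_indices):
    """Mask with 1 on every feature of a keyframe and 0 on every blank frame."""
    keys = set(keyframe_indices)
    return [[1.0 if t in keys else 0.0] * frame_dim for t in range(num_frames)]

def noisy_initial_motion(template, mask, rng):
    """Assumed form: x = M * template + (1 - M) * eps, applied element-wise."""
    out = []
    for row_t, row_m in zip(template, mask):
        out.append([m * x + (1.0 - m) * rng.gauss(0.0, 1.0)
                    for x, m in zip(row_t, row_m)])
    return out

rng = random.Random(0)                 # fixed seed so the sketch is repeatable
T, D = 6, 4                            # toy sizes: frames x per-frame features
template = [[0.0] * D for _ in range(T)]
template[2] = [0.3, -0.2, 0.8, 0.1]    # one keyframe at frame 2
mask = build_mask(T, D, [2])
x_init = noisy_initial_motion(template, mask, rng)
```

Under this assumption the keyframe row of `x_init` equals the template row, while every blank-frame row is pure standard Gaussian noise with the same shape as the target dance movement.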

[0048] In one implementation, the condition information includes the accompaniment audio file, the dancer's identity information, and the style information of the target dance movement. The style information can be determined based on the dance segment number of the target dance movement.

[0049] In this case, inputting the initial dance movement and the conditional information into the dance movement generation model can include the following. The initial dance movement and the mask matrix are concatenated along the feature dimension to obtain the motion concatenation features. A pre-trained Jukebox model is used to extract features from the accompaniment audio file, obtaining initial audio features. These initial audio features are input into a conditional projection layer for dimensionality mapping, which aligns them to the target dimension and forms a conditional label sequence suitable for model processing. A conditional label is the basic unit processed by the Transformer; one conditional label is a segment of audio features obtained by slicing the original audio features. Absolute positional encoding is applied to the conditional label sequence to obtain a conditional label sequence with positional information. Because the Transformer architecture processes its inputs in parallel, it cannot by itself recognize the order of the sequence; absolute positional encoding is equivalent to giving each audio segment feature a clear timestamp, such as frame 1, frame 2, and so on, so that the model can both understand the content of the music and accurately perceive its temporal logic. The conditional label sequence with positional information is then input into a conditional Transformer model for deep feature extraction, and the output of the conditional Transformer model is transposed to obtain the target audio features corresponding to the accompaniment audio file. The dancer's identity information is encoded through an nn.Embedding embedding layer to obtain the dancer embedding vector, and the style information of the target dance movement is encoded through an nn.Embedding embedding layer to obtain the choreography embedding vector. Next, the dancer embedding vector and the choreography embedding vector are each repeated along the sequence dimension so that they align with the target audio features; after this expansion their sequence dimension is the same as that of the conditional label sequence, that is, the number of conditional labels is the same.

[0050] The target audio features, the dancer embedding vector, and the choreography embedding vector are concatenated along the feature dimension to form the final multimodal conditional feature representation. The motion concatenation features, the multimodal conditional feature representation, and the time step are input into the dance movement generation model, and the dance movement generation model outputs the generated target dance movement. The time step can be input by the user or determined by the dance movement generation model based on pre-configuration.
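The repeat-and-concatenate bookkeeping for the conditional features can be sketched with plain lists. The sizes are toy values, the two embedding vectors are hard-coded stand-ins for nn.Embedding lookups, and `tile` and `concat_features` are hypothetical helpers; only the shape logic is illustrated, not the Jukebox or Transformer stages.

```python
def tile(vec, seq_len):
    """Repeat one embedding vector along the sequence dimension."""
    return [list(vec) for _ in range(seq_len)]

def concat_features(*sequences):
    """Concatenate same-length sequences along the feature dimension."""
    lengths = {len(s) for s in sequences}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    return [[v for part in parts for v in part] for parts in zip(*sequences)]

# Toy sizes (assumptions): 4 conditional labels, 3 audio dims, 2-dim embeddings.
audio = [[0.1 * i, 0.2 * i, 0.3 * i] for i in range(4)]
dancer_vec = [1.0, 0.0]   # stand-in for the dancer embedding vector
choreo_vec = [0.0, 1.0]   # stand-in for the choreography embedding vector
cond = concat_features(audio, tile(dancer_vec, 4), tile(choreo_vec, 4))

# Motion input: initial movement and mask concatenated along the feature dimension.
x_init = [[0.5, -0.5], [0.0, 0.0]]
mask = [[1.0, 1.0], [0.0, 0.0]]
motion_in = concat_features(x_init, mask)
```

Each row of `cond` then carries the audio features of one conditional label followed by the same dancer and choreography vectors, which is why the embeddings must first be expanded to the number of conditional labels.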

[0051] In conjunction with the above embodiments, in one implementation, the condition information includes an accompaniment audio file for the target dance movement. In this case, this application also provides a method for determining the optimal insertion position of keyframes in the target dance movement. Accordingly, step S1032, constructing a dance movement template based on keyframes, includes: Step A1: Obtain the beat data of the accompaniment audio file. The beat data includes multiple beat time points.

[0052] In this embodiment, the original accompaniment audio file input by the user is first loaded at a target sampling rate (e.g., 48,000 Hz) and then preprocessed in two steps. First, the length of the original accompaniment audio file is checked against a preset length: if it exceeds the preset length, the first part is truncated; if it is shorter, zeros are padded at the end, so that the resulting audio has the preset length. Second, the sampling rate of the audio obtained in the first step is unified to a preset sampling rate (e.g., 16 kHz), and librosa.util.normalize is then used to normalize the music data, standardizing the amplitude range to the interval [-1, 1]. The audio file obtained after the second step is the accompaniment audio file referred to in this application.
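The length check and amplitude normalization can be sketched without any audio library. The sketch assumes that truncation keeps the leading samples, and it uses peak normalization (dividing by the largest absolute value), which is what librosa.util.normalize does with its default norm. Resampling is omitted, and both function names are hypothetical.

```python
def fix_length(samples, target_len):
    """Keep the first target_len samples, or zero-pad at the end."""
    if len(samples) >= target_len:
        return list(samples[:target_len])
    return list(samples) + [0.0] * (target_len - len(samples))

def peak_normalize(samples):
    """Scale so the largest absolute value is 1 (silent input is returned unchanged)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    return [s / peak for s in samples]

# Toy signal: three samples padded to a preset length of five, then normalized.
clip = peak_normalize(fix_length([0.2, -0.5, 0.25], 5))
```

After these two steps every sample lies in [-1, 1] and the clip has exactly the preset length, regardless of the length of the input.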

[0053] Next, the librosa.beat.beat_track function is used to estimate the global tempo (beats per minute, BPM) of the accompaniment audio file. Then, based on the global tempo, the beat times (i.e., beat positions) in the music are detected, and each beat time point is returned in seconds.

[0054] Step A2: Identify the blank frames corresponding to each beat time point in the blank dance movement template as candidate blank frames. The data dimension of the blank dance movement template is the same as the data dimension of the target dance movement. The data dimension of the target dance movement in this application is... In the blank dance template, each frame is a blank frame.

[0055] In step A2, the target frame rate is first set (e.g., 60 fps), and the beat times are converted, using this frame rate, into the corresponding frame indices in the blank dance movement template. Based on these indices, the corresponding frames are located; these frames are collectively referred to as candidate blank frames. Each candidate blank frame is the frame in the blank dance movement template that corresponds to one beat time point.
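As a non-limiting illustration, the conversion from beat time points to candidate frame indices can be sketched as follows. The function name `beat_times_to_frames`, the rounding to the nearest frame, and the optional bound `num_frames` are assumptions made for this sketch; the application states only that beat times are converted to frame indices at the target frame rate.

```python
import numpy as np

def beat_times_to_frames(beat_times_s, fps=60, num_frames=None):
    """Map beat times in seconds to frame indices at the given frame rate.

    Assumption: indices are obtained by rounding time * fps to the nearest frame.
    """
    frames = np.rint(np.asarray(beat_times_s, dtype=np.float64) * fps).astype(int)
    if num_frames is not None:
        # Discard beats that fall beyond the end of the blank template.
        frames = frames[frames < num_frames]
    # Sorted, duplicate-free indices of the candidate blank frames.
    return np.unique(frames)
```

With a frame rate of 60 fps, beats at 0.5 s, 1.0 s, and 1.5 s map to frames 30, 60, and 90.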

[0056] Step A3: Determine the matching degree between the audio features of the corresponding audio segments in the accompaniment audio file for each candidate blank frame and the multidimensional joint motion features of the keyframe, and determine the candidate blank frame with the highest matching degree as the target blank frame.

[0057] Step A3 is used to determine the optimal insertion position of the keyframe within the blank dance motion template (or target dance motion). Specifically, for each keyframe, its multidimensional joint motion features can be calculated, along with the matching degree between these features and the audio features of the corresponding audio segments in the accompaniment audio file for each candidate blank frame. The candidate blank frame with the highest matching degree is then determined as the target blank frame. The position of the target blank frame is the optimal insertion position of the keyframe within the target dance motion.

[0058] Step A4: Replace the target blank frame with a keyframe, and obtain the dance motion template based on the replaced blank dance motion template.

[0059] Steps A1-A4 above describe the process of inserting a single keyframe into the blank dance motion template. In actual implementation, the principle of inserting each keyframe into the blank dance motion template is the same.

[0060] It should be noted that if there are multiple keyframes, a greedy allocation strategy can be used when inserting each keyframe: after calculating the matching degree between the multi-dimensional joint motion features of the keyframe and the audio features of the audio segment corresponding to each candidate blank frame, the candidate blank frames are sorted in descending order of matching degree, the first candidate blank frame that has not yet been used is selected as the target blank frame, and this target blank frame is marked as used. This prevents two keyframes from being assigned to the same candidate blank frame, so that no keyframe is overwritten and lost in the final generated target dance movement, which ensures the quality of the generated target dance movement.
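As a non-limiting illustration, the greedy allocation strategy can be sketched as follows. The function name `greedy_assign` and the layout of the `scores` matrix (one row per keyframe, one column per candidate blank frame) are assumptions of this sketch; the matching degrees themselves are taken as given.

```python
import numpy as np

def greedy_assign(scores):
    """Assign each keyframe to its best still-unused candidate blank frame.

    scores[k, c] is the matching degree between keyframe k and candidate c.
    Returns a list whose k-th entry is the candidate index chosen for keyframe k.
    """
    scores = np.asarray(scores, dtype=np.float64)
    used = set()
    assignment = []
    for row in scores:
        # Visit candidates in descending order of matching degree.
        for c in np.argsort(-row, kind="stable"):
            if int(c) not in used:
                used.add(int(c))          # mark the target blank frame as used
                assignment.append(int(c))
                break
        else:
            raise ValueError("more keyframes than candidate blank frames")
    return assignment
```

Keyframes are processed in the order given, so an earlier keyframe takes precedence when two keyframes prefer the same candidate blank frame.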

[0061] This embodiment allows keyframes to be inserted into the optimal position within the target dance movement, enabling the dancer's body posture represented by the keyframes to be displayed at the most appropriate beat time, thus effectively improving the quality of the generated target dance movement.

[0062] In conjunction with the above embodiments, in one implementation, step A3, determining the matching degree between the audio features of the audio segment corresponding to each candidate blank frame in the accompaniment audio file and the multidimensional joint motion features of the keyframe, includes: inputting the audio features into a feature alignment model, which maps the audio features to the feature space of the multi-dimensional joint motion features, thus obtaining audio features aligned with the multi-dimensional joint motion features, wherein the feature alignment model is obtained by training a neural-network-based initial feature alignment model with a contrastive learning loss function on paired audio feature samples and multi-dimensional joint motion feature samples; and determining the matching degree between the aligned audio features of each candidate blank frame and the multidimensional joint motion features of the keyframe.

[0063] In this embodiment, for each candidate blank frame, the audio segment belonging to that frame can be selected from the accompaniment audio file to obtain the original audio features of that audio segment. Next, L2 normalization is performed on the original audio features, and the normalized result is encoded by a Wav2CLIP encoder to obtain the audio features. The audio features are then input into the feature alignment model, which maps them to the feature space containing the multi-dimensional joint motion features, resulting in audio features aligned with the multi-dimensional joint motion features.

[0064] Next, for the keyframe, pose embedding features are generated by the MotionCLIP encoder and then subjected to L2 normalization to obtain the normalized pose embedding features of the keyframe.

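[0065] Next, the cosine similarity between the aligned audio features of each candidate blank frame and the normalized pose embedding features of the keyframe is calculated to obtain the semantic relevance of the keyframe's pose embedding features to the musical context at each beat time point. Writing $a_i$ for the aligned audio features of the $i$-th candidate blank frame and $p$ for the pose embedding features of the keyframe, the cosine similarity is $\mathrm{sim}(a_i, p) = \frac{a_i \cdot p}{\lVert a_i \rVert \, \lVert p \rVert}$. In this embodiment, the finally determined target blank frame is the candidate blank frame with index $i^{*} = \arg\max_i \mathrm{sim}(a_i, p)$, where $i^{*}$ is the variable used to store the result (the target blank frame).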
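As a non-limiting illustration, selecting the target blank frame by cosine similarity can be sketched as follows. The function name `best_candidate` and the array shapes are assumptions of this sketch; the Wav2CLIP, MotionCLIP, and feature alignment steps are assumed to have already produced the input vectors.

```python
import numpy as np

def best_candidate(audio_feats, pose_feat):
    """Return the index of the candidate whose audio features best match the pose.

    audio_feats: array of shape (num_candidates, dim), aligned audio features.
    pose_feat:   array of shape (dim,), pose embedding features of one keyframe.
    """
    audio_feats = np.asarray(audio_feats, dtype=np.float64)
    pose_feat = np.asarray(pose_feat, dtype=np.float64)
    # L2-normalize both sides so that the dot product equals the cosine similarity.
    a = audio_feats / np.linalg.norm(audio_feats, axis=1, keepdims=True)
    p = pose_feat / np.linalg.norm(pose_feat)
    sims = a @ p
    return int(np.argmax(sims)), sims
```

The returned similarity vector can also serve as one row of the matching-degree matrix when several keyframes are allocated greedily.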

[0066] This embodiment provides an adaptive keyframe localization method. By combining contrastive learning and beat awareness technology, it can significantly optimize the insertion position of keyframes in the target dance movement by determining the beat time point with the highest matching degree of keyframes in the accompaniment audio file. This ensures that keyframes can always be inserted into the optimal position in the target dance movement, thus significantly improving the alignment accuracy between semantic description information and the rhythm of the accompaniment audio file, thereby improving the quality of the generated target dance movement.

[0067] In one embodiment, based on the above embodiments, the initial feature alignment model includes a first submodule, a second submodule, a third submodule, a nonlinear activation unit, and a residual connection unit. The first submodule includes a first multilayer perceptron, the second submodule includes a second multilayer perceptron and a first normalization layer, and the third submodule includes a third multilayer perceptron and a second normalization layer.

[0068] Accordingly, the feature alignment model is trained through the following steps: Step B1: Input the audio feature sample into the first submodule, and perform feature projection processing through the first multilayer perceptron to obtain the first intermediate feature.

[0069] In this step, the music segment sequence of each data point in the dance motion sample dataset is first encoded using a pre-trained Wav2CLIP encoder to obtain encoded audio features, which serve as the audio feature samples. Similarly, for the dance motion sequence of each data point in the dance motion sample dataset, a pre-trained MotionCLIP encoder is used to encode, for each frame of the sequence, the relative joint rotation parameters in the dance movement data (the multidimensional joint motion features) of that frame, and the encoded frames are then combined along the time axis to obtain the semantic feature encoding of the dance motion sequence. Here, the dance movement data of each frame corresponds to the multidimensional joint motion feature samples mentioned earlier. The feature alignment model in this application is specifically obtained by training on paired audio feature samples and semantic feature encodings. After the above steps are completed, the audio feature samples are input into the first submodule, and the first multilayer perceptron performs linear feature projection on the audio feature samples, mapping them from the original feature dimension to the preset hidden layer dimension, thereby obtaining the first intermediate features.

[0070] Step B2: Input the first intermediate feature into the second submodule, perform dimensionality upscaling through the second multilayer perceptron, and normalize the features output by the second multilayer perceptron through the first normalization layer to obtain the second intermediate feature.

[0071] Next, the first intermediate feature is input into the second submodule, and the second multilayer perceptron performs dimensionality upscaling on the first intermediate feature, for example, expanding the dimension to four times the dimension of the hidden layer, to increase the sparsity and expressive power of the feature. Then, the features output by the second multilayer perceptron are normalized through the first normalization layer to stabilize the feature distribution and accelerate model convergence, thereby obtaining the high-dimensional second intermediate feature.

[0072] Step B3: Input the second intermediate feature into the third submodule, perform dimensionality reduction processing through the third multilayer perceptron, and normalize the features output by the third multilayer perceptron through the second normalization layer to obtain the third intermediate feature.

[0073] Next, the second intermediate feature is input into the third submodule, where it undergoes dimensionality reduction processing by the third multilayer perceptron, mapping it back to the original hidden layer dimensions. Then, the features output by the third multilayer perceptron are normalized by the second normalization layer to obtain the third intermediate feature.

[0074] Step B4: Perform nonlinear transformation on the third intermediate feature through a nonlinear activation unit, and perform residual fusion between the output of the nonlinear activation unit and the first intermediate feature through a residual connection unit to obtain the predicted audio feature aligned with the multidimensional joint motion feature sample.

[0075] To enhance the nonlinear expressive power of the feature alignment model while preserving original information, this step inputs the third intermediate feature into a nonlinear activation unit (e.g., the GELU function) for nonlinear transformation. Next, a skip connection operation is performed using a residual connection unit, specifically adding the output of the nonlinear activation unit element-wise to the first intermediate feature obtained in step B1. This residual fusion mechanism effectively avoids the gradient vanishing problem.

[0076] Predicted audio features are audio features that, after alignment processing, are close to multidimensional joint motion feature samples in semantic space.
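As a non-limiting illustration, the forward pass of steps B1 to B4 can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: each multilayer perceptron is reduced to a single randomly initialized linear map, the normalization layers are layer normalization without learnable parameters, GELU uses its tanh approximation, and the class name `AlignmentSketch` and all dimensions are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each feature vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

class AlignmentSketch:
    """Illustrative stand-in for the initial feature alignment model."""

    def __init__(self, d_in, hidden=512, expanded=1024, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.02, (d_in, hidden))      # first submodule
        self.w2 = rng.normal(0.0, 0.02, (hidden, expanded))  # second submodule
        self.w3 = rng.normal(0.0, 0.02, (expanded, hidden))  # third submodule

    def forward(self, audio):
        h1 = audio @ self.w1            # B1: projection to the hidden dimension
        h2 = layer_norm(h1 @ self.w2)   # B2: dimension expansion, then normalization
        h3 = layer_norm(h2 @ self.w3)   # B3: dimension reduction, then normalization
        return gelu(h3) + h1            # B4: nonlinear activation plus residual fusion
```

The residual term `h1` is the first intermediate feature, so the output keeps the hidden layer dimension regardless of how far the second submodule expands it.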

[0077] Step B5: Determine the information noise contrast estimation loss between the predicted audio features and the relative joint rotation feature samples in the multidimensional joint motion feature samples, and update the parameters of the initial feature alignment model based on the information noise contrast estimation loss.

[0078] Here, the relative joint rotation feature samples are the semantic feature encodings of the dance movement sequences mentioned above.

[0079] After obtaining the predicted audio features, the Information Noise Contrastive Estimation (InfoNCE) loss function is used to calculate the Information Noise Contrastive Estimation loss between the predicted audio features and the relative joint rotation feature samples. The Information Noise Contrastive Estimation loss function aims to bring positive sample pairs (matched audio feature samples and relative joint rotation feature samples) closer together in the feature space, while simultaneously widening the distance between negative sample pairs (mismatched audio feature samples and relative joint rotation feature samples).

[0080] Next, based on the calculated information noise contrast estimation loss, the gradient is calculated using the backpropagation algorithm, and the parameters in the initial feature alignment model are updated.

[0081] Step B6: Obtain the feature alignment model based on the updated initial feature alignment model.

[0082] By repeating steps B1-B5 above, iterative training is performed on a large number of samples until the model converges or reaches the preset number of training rounds. The initial feature alignment model after training is completed is the feature alignment model.

[0083] In one implementation, the dimensions of the first, second, and third multilayer perceptron layers can be set according to actual needs, as long as the dimensions of the first and third multilayer perceptron layers are the same, and the dimension of the second multilayer perceptron layer is greater than the dimensions of both the first and third multilayer perceptron layers. For example, the dimension of the first multilayer perceptron layer can be 512, the dimension of the second multilayer perceptron layer can be 1024, and the dimension of the third multilayer perceptron layer can be 512.

[0084] In one implementation, the audio feature samples and the multidimensional joint motion feature samples can be combined into a training batch consisting of n sequence pairs, which is optimized using the InfoNCE loss function. In its standard form, consistent with the definitions that follow, the loss is $\mathcal{L}_{\mathrm{InfoNCE}} = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{\exp(\mathrm{sim}(a_i, m_i)/\tau)}{\sum_{j=1}^{n} \exp(\mathrm{sim}(a_i, m_j)/\tau)}$, where $\mathcal{L}_{\mathrm{InfoNCE}}$ represents the InfoNCE loss function; $a_i$ denotes the normalized features of the $i$-th audio feature sample in the training batch; $m_i$ denotes the normalized features of the relative joint rotation features in the $i$-th multidimensional joint motion feature sample in the training batch; $\mathrm{sim}(\cdot,\cdot)$ represents the function for calculating cosine similarity; $\tau$ represents a temperature parameter used to scale the logits; and $n$ indicates the number of sequence pairs contained in the training batch.
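As a non-limiting illustration, an InfoNCE loss over a batch of paired features can be sketched as follows. This sketch uses the common one-directional form (audio as query, motion as key) averaged over the batch; whether this application uses a one-directional or symmetric variant is not stated, and the function name `info_nce` and the temperature value 0.07 are assumptions.

```python
import numpy as np

def info_nce(audio, motion, tau=0.07):
    """One-directional InfoNCE loss for n paired (audio, motion) feature rows.

    Row i of audio and row i of motion form the positive pair; all other
    rows of motion act as negatives for audio row i.
    """
    audio = np.asarray(audio, dtype=np.float64)
    motion = np.asarray(motion, dtype=np.float64)
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    m = motion / np.linalg.norm(motion, axis=1, keepdims=True)
    logits = (a @ m.T) / tau                      # cosine similarities scaled by temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Average negative log-probability of the positive (diagonal) pairs.
    return float(-np.mean(np.diag(log_prob)))
```

The loss is close to zero when every audio feature is most similar to its own motion feature, and grows when positives are less similar than negatives.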

[0085] During the contrastive training process, the weight parameters of the MotionCLIP encoder and the Wav2CLIP encoder are frozen, and only the parameters of the initial feature alignment model are updated.

[0086] In this embodiment, by constructing a multi-level perceptron architecture that includes feature dimensionality-changing projection and normalization processing, high-order semantic information of audio feature samples can be effectively extracted. Simultaneously, with the residual connection mechanism, the nonlinear expressive power of features can be enhanced while effectively avoiding the gradient vanishing problem in network training and preserving original details. Furthermore, this embodiment utilizes information-noise contrastive estimation loss for contrastive learning optimization, forcibly narrowing the distribution distance between audio features and motion features. This enables precise alignment of cross-modal features in the latent space, significantly improving the matching degree and coordination between the generated target dance movements and the accompanying audio files.

[0087] The structure of the dance motion generation model used in this application can be as shown in Figure 2, which is a schematic diagram of the structure of a dance motion generation model shown in an embodiment of this application. In one implementation, the dance motion generation model includes an intermediate residual temporal module, which comprises a first module, a second module, and a third module. The first module includes a self-attention layer and a first feature linear modulation layer; the second module includes a cross-attention layer and a second feature linear modulation layer; and the third module includes a feedforward network layer and a third feature linear modulation layer, as shown in Figure 2.

[0088] In this embodiment, the intermediate residual timing module includes an attention-based Transformer module, and the first module, the second module, and the third module are located within the attention-based Transformer module.

[0089] Accordingly, step S104 may include: Step S1041: Determine the diffusion time step of the dance motion generation model.

[0090] Step S1042: Input the initial dance movement into the first module, capture the temporal dependency features of the initial dance movement through the self-attention layer, and perform linear transformation modulation on the temporal dependency features based on the diffusion time step through the first feature linear modulation layer.

[0091] In this embodiment, inputting the initial dance movement into the first module means indirectly inputting the initial dance movement into the first module. Since the initial dance movement is first input into a downsampling encoder when generating the target dance movement, and then subjected to feature extraction and dimensionality reduction processing by the downsampling encoder to obtain a multi-scale latent feature representation, step S1042, inputting the initial dance movement into the first module, actually means inputting the initial dance movement, which exists as a multi-scale latent feature representation, into the first module.

[0092] In this step, the first module also includes a normalization layer. The first module first performs layer normalization operation on the input content through the normalization layer to achieve normalization processing. Then, through the self-attention layer, the normalized features are subjected to self-attention operation to capture the temporal dependent features in the initial dance movement. Next, through the first feature linear modulation layer, the temporal dependent features are linearly transformed and modulated based on the diffusion time step, and residual connections are performed with the input of the first module to maintain the stability of the input features.

[0093] In this embodiment, the intermediate residual timing module may further include a linear layer for linearly processing the initial dance movement and inputting the linearly processed initial dance movement into the first module, corresponding to the input position shown in Figure 2.

[0094] Step S1043: Input the output of the first module and the initial dance movement into the second module, calculate the interactive attention features between the input content and the conditional information through the cross attention layer, and perform linear transformation modulation on the interactive attention features based on the time step through the second feature linear modulation layer.

[0095] In this step, the second module also includes a normalization layer. The second module first performs a normalization operation on the input content through the normalization layer to achieve normalization processing. Then, through the cross-attention layer, a cross-attention operation is performed on the normalized output features to obtain the interactive attention features between the input content and the conditional information. Next, through the second feature linear modulation layer, the interactive attention features are linearly transformed and modulated based on the time step, and residual connections are performed with the input of the second module to maintain the stability of the input features.

[0096] Step S1044: Input the output results of the second module and the first module into the third module, perform nonlinear feature transformation through the feedforward network layer, and perform linear transformation modulation on the output features of the feedforward network layer based on the time features through the third feature linear modulation layer.

[0097] In this step, the third module also includes a normalization layer. The third module first performs a normalization operation on the input content through the normalization layer to achieve normalization processing. Then, through the feedforward network layer, a nonlinear feature transformation is performed on the normalized output features. Next, through the third feature linear modulation layer, a linear transformation modulation is performed on the output features of the feedforward network layer based on the time step.

[0098] Step S1045: Perform residual connection between the output of the third module and the initial dance movement to obtain the output feature sequence of the intermediate residual time sequence module.

[0099] In this embodiment, the output feature sequence of the intermediate residual timing module is a context feature that incorporates conditional information.

[0100] Step S1046: Decode the output feature sequence to obtain the target dance movement.

[0101] In this step, the output feature sequence is first input into the upsampling decoder. The upsampling decoder performs upsampling and feature reconstruction to obtain decoded features with the same dimensions as the initial dance movement. Then, the output layer performs convolutional mapping on the decoded features to project them from the feature space back to the original dance movement data space, resulting in the denoised target dance movement.

[0102] In this embodiment, an intermediate residual temporal module with a three-level cascaded structure is constructed. First, the self-attention mechanism captures the internal temporal dependencies of the dance movement features, which supports the continuity of the movements, and the cross-attention mechanism enables deep interaction between the dance movement features and the conditional information, enhancing the matching degree between the target dance movement and the conditional information. Second, each module incorporates the diffusion time step for feature linear modulation, dynamically injecting time-aware information throughout feature extraction and transformation, which enables adaptive, fine-grained adjustment of the feature distribution. Furthermore, combined with the nonlinear transformation of the feedforward network and the global residual connection mechanism, the dance movement generation model of this application can mitigate the vanishing gradient problem and preserve movement details while strengthening the model's deep feature representation capability, thereby improving the quality of the generated target dance movements.

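[0103] In conjunction with the above embodiments, in one implementation, the first, second, and third feature linear modulation layers each apply a linear transformation modulation in which a scaling adjustment factor and a translation adjustment factor are derived from the embedded features of the time step and applied to the input features. In the standard feature-wise linear modulation form consistent with these definitions, $\gamma = f_{\gamma}(e_t)$, $\beta = f_{\beta}(e_t)$, and $\hat{x} = \gamma \odot x + \beta$, where $\hat{x}$ represents the features after linear transformation modulation; $x$ indicates the initial dance movement; $e_t$ indicates the embedded features representing the time step; $\gamma$ indicates the scaling adjustment factor; $\beta$ indicates the translation adjustment factor; $\odot$ represents the element-wise multiplication operation; $f_{\gamma}$ and $f_{\beta}$ denote the mappings that produce the two factors; and the three layers use different linear transformation modulation operations.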
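As a non-limiting illustration, feature-wise linear modulation conditioned on a time step embedding can be sketched as follows. This sketch assumes the common form in which a scaling factor and a translation factor are each produced by a linear map of the time step embedding; the function name `film` and the parameter names are hypothetical, and the exact operations used by the three layers of this application may differ.

```python
import numpy as np

def film(features, t_emb, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift features using factors derived from a time step embedding.

    features: array of shape (num_frames, dim)
    t_emb:    array of shape (emb_dim,), embedded features of the time step
    w_*, b_*: weights (emb_dim, dim) and biases (dim,) of the two linear maps
    """
    gamma = t_emb @ w_gamma + b_gamma   # scaling adjustment factor, shape (dim,)
    beta = t_emb @ w_beta + b_beta      # translation adjustment factor, shape (dim,)
    # Element-wise multiplication and addition, broadcast over all frames.
    return gamma * features + beta
```

Because the factors depend only on the time step embedding, the same scale and shift are applied to every frame at a given diffusion time step.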

[0104] In this embodiment, each module combines diffusion time step for feature linear modulation, dynamically injecting time-aware information throughout the feature extraction and transformation process, which enables adaptive and fine control of feature distribution and significantly improves the quality of the final generated target dance movement.

[0105] The training process of the dance motion generation model is described below with reference to Figure 2, and includes the following steps: Step C1: Collect a dance movement sample dataset. This dataset includes multiple sample data points, all of which have the same sequence length (i.e., ... frames). Each sample data point can be represented as a four-tuple consisting of: a sequence of music segments; a dance movement sequence; the dancer's identity information; and the dance segment number of the dance movement sequence, which corresponds to style information. The data collection method for the dance movement sample dataset is consistent with that of the open-source music-dance movement dataset AIST.

[0106] Step C2: Represent the dance movement sequence from step C1 as a two-dimensional tensor whose two dimensions are the number of frames in the dance sequence and the feature dimension of each frame. Next, the dance movement data of each frame is decomposed into a joint representation of a global motion component and a local motion component.

[0107] The global motion component is composed of: relative root rotation parameters, relative root position parameters, absolute root rotation parameters, and absolute root position parameters.

[0108] The local motion component is composed of: relative joint position parameters, relative joint rotation parameters, global joint velocity, and foot contact state.

[0109] Step C3: From the dance movement sequence of each sample data point in the dance movement sample dataset obtained in C2, randomly select m frames as keyframes, and use these keyframes to build the key dance movement data.

[0110] Step C4: Define a binary tensor as the mask matrix. In this mask matrix, the element values at the positions corresponding to the keyframes selected in C3 are set to 1, and the values at all other positions are set to 0.

[0111] Step C5: For the music segment sequence of each sample data point in the dance movement sample dataset in C2, a pre-trained Jukebox model is used to extract the initial audio features of the music segment sequence, and these are transformed through a conditional projection layer to obtain a conditional label sequence. Absolute positional encoding is applied to the conditional label sequence to obtain a conditional label sequence with positional information. This conditional label sequence with positional information is then input into a conditional Transformer model for deep feature extraction. Finally, the output of the conditional Transformer model is transposed to obtain the target audio features corresponding to the music segment sequence.

[0112] Step C6: For each sample in the dance movement sample dataset, obtain the dancer's ID number and the dance segment number. The dancer's identity information is encoded through an nn.Embedding embedding layer to obtain the dancer embedding vector, and the style information of the target dance movement is encoded through an nn.Embedding embedding layer to obtain the choreography embedding vector. Next, the dancer embedding vector and the choreography embedding vector are each repeated along the sequence dimension so that their sequence dimension is the same as that of the conditional label sequence, i.e., equal to the number of conditional labels.

[0113] Step C7: Concatenate the target audio features, the dancer embedding vector, and the choreography embedding vector along the feature dimension to form the final multimodal conditional feature representation. This multimodal conditional feature representation is the conditional information sample.
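As a non-limiting illustration, steps C6 and C7 can be sketched as follows. In this sketch, plain NumPy lookup tables stand in for the nn.Embedding layers, and the function name `build_condition`, the table names, and all dimensions are hypothetical.

```python
import numpy as np

def build_condition(audio_feats, dancer_id, choreo_id, dancer_table, choreo_table):
    """Build a multimodal conditional feature representation.

    audio_feats:  array of shape (seq_len, audio_dim), target audio features
    dancer_table: array of shape (num_dancers, d_dim), stand-in for an embedding layer
    choreo_table: array of shape (num_segments, c_dim), stand-in for an embedding layer
    """
    seq_len = audio_feats.shape[0]
    # Look up each embedding vector and repeat it along the sequence dimension.
    dancer = np.tile(dancer_table[dancer_id], (seq_len, 1))
    choreo = np.tile(choreo_table[choreo_id], (seq_len, 1))
    # Concatenate along the feature dimension.
    return np.concatenate([audio_feats, dancer, choreo], axis=-1)
```

The output has the same sequence length as the audio features, and its feature dimension is the sum of the three input feature dimensions.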

[0114] Step C8: Construct an interpolation diffusion model, which includes a forward diffusion process and a backward denoising process. The forward diffusion process gradually adds noise to the ground-truth data until it approaches standard Gaussian noise. The process follows a predefined variance schedule $\beta_1, \dots, \beta_T$: over $T$ discrete time steps, Gaussian noise is gradually introduced into the data $x_0$ (the dance movement sequences of the dance movement sample dataset). Specifically, the transition probability from $x_{t-1}$ to $x_t$ is defined as $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I)$, where $I$ represents the identity matrix. By employing the reparameterization technique, a closed-form solution for sampling at any given time step $t$ is derived, enabling efficient generation of noise samples at any intermediate stage of the diffusion process: $\alpha_t = 1 - \beta_t$; $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$; $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$, with $\epsilon \sim \mathcal{N}(0, I)$. Step C9: Using element-wise multiplication with the mask matrix, the noised motion samples obtained through the forward diffusion process of the interpolation diffusion model are combined with the key dance movement data to obtain the noisy dance motion samples. Step C10: The noisy dance motion samples and the mask matrix are concatenated along the channel dimension to obtain the action splicing features.
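As a non-limiting illustration, steps C8 to C10 can be sketched as follows. This sketch makes several assumptions that are not confirmed by the text: a linear variance schedule with commonly used default values, and a blend of the form mask * key data + (1 - mask) * noised sample, so that keyframe positions keep their clean values while all other positions carry noise. The function names are hypothetical.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed linear variance schedule and its cumulative products."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, noise):
    """Closed-form forward diffusion sample at time step index t."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

def masked_noisy_sample(x0, keyframe_idx, t, alpha_bars, rng):
    """Blend clean keyframes with a noised sequence, then append the mask."""
    mask = np.zeros_like(x0)
    mask[keyframe_idx] = 1.0                 # 1 at keyframe positions, 0 elsewhere
    key_data = mask * x0                     # key dance movement data
    x_t = q_sample(x0, t, alpha_bars, rng.standard_normal(x0.shape))
    # Assumed blend: keyframes stay clean, other frames come from x_t.
    x_tilde = mask * key_data + (1.0 - mask) * x_t
    # Concatenate along the channel (feature) dimension.
    return np.concatenate([x_tilde, mask], axis=-1), mask
```

Appending the mask as extra channels lets a downstream model distinguish constrained keyframe positions from positions that still need to be denoised.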

[0115] The action splicing features and the multimodal conditional feature representation are input into the dance movement generation model shown in Figure 2 to obtain the prediction result of the target dance movement.

[0116] Next, based on the prediction result and the corresponding true value, the loss value is calculated using a loss function that minimizes the mean squared error between the generated result and the true value. In terms of the quantities defined here, the loss takes the form $\mathcal{L} = \mathbb{E}_{x \sim q,\; t \sim \mathcal{U}} \big[ \lVert x - G(\tilde{x}_t, t, c) \rVert_2^2 \big]$, where $x$ represents the true value, which is the dance movement sequence in the dance movement sample dataset obtained in step C2; $\tilde{x}_t$ represents the noisy dance movement sample; $t$ indicates the time step; $c$ indicates the conditional information sample; $q$ represents the real data distribution; $\mathcal{U}$ indicates a uniform distribution over the time steps; and $G(\tilde{x}_t, t, c)$ represents the prediction result for the target dance movement produced by the dance motion generation model.

[0117] In summary, the method of this application has the following technical effects: First, by introducing the PoseScript model to generate key body poses, it breaks through the dependence of traditional methods on large-scale text-music-motion triple data and realizes fine-grained semantic control at the body part level.

[0118] Second, we designed an adaptive keyframe localization method that combines contrastive learning and beat awareness techniques. By determining the beat time point in the accompaniment audio file where the keyframe has the highest matching degree, we can significantly optimize the insertion position of the keyframe in the target dance movement. This ensures that the keyframe can always be inserted into the optimal position in the target dance movement, thus significantly improving the alignment accuracy between semantic description information and the rhythm of the accompaniment audio file.

[0119] Third, a motif-based choreography method (a creative method that takes core postures as the starting point of generation and constructs a unified, thematic dance sequence through systematic transformation and development) is adopted. Through the text-guided core body posture generation mechanism, the semantic intention of dance creation is effectively captured, so that the generated dance movements have better thematic consistency and artistic expression.

[0120] Fourth, by using spatiotemporal masking matrices, diffusion models, and FiLM modulation mechanisms, the system architecture is made flexible and scalable, and multimodal information is efficiently fused, providing reliable technical support for high-quality dance generation.

[0121] The following describes a dance movement generation device provided in this application. The dance movement generation device described below and the dance movement generation method described above can be referred to in correspondence with each other. Figure 3 is a structural block diagram of a dance movement generation device provided in this application. Referring to Figure 3, the dance motion generation device 300 of this application may include: an acquisition module 301, used to acquire the semantic description information of the limbs in the target dance movement to be generated, input by the user, and the condition information on which the generation of the target dance movement is based; an encoding module 302, used to perform feature encoding on the semantic description information to obtain the multi-dimensional joint motion features of the dancer in the target dance movement; a construction module 303, used to construct an initial dance movement with noise based on the multidimensional joint motion features; and an input module 304, used to input the initial dance movement and the condition information into the dance movement generation model to obtain the target dance movement. The dance movement generation model is used to denoise the initial dance movement based on the condition information and the diffusion principle to obtain the target dance movement. The dance movement generation model is obtained by training the diffusion model with noisy dance movement samples, condition information samples, and clean dance movement labels corresponding to the noisy dance movement samples as input.

[0122] According to the dance motion generation device 300 of this application, the construction module 303 is specifically used for: generating keyframes in the target dance motion based on the multi-dimensional joint motion features; constructing a dance motion template based on the keyframes, wherein the dance motion template includes the keyframes and blank frames other than the keyframes, and the data dimension of the dance motion template is the same as the data dimension of the target dance motion; and adding noise to the dance motion template to obtain the initial dance motion with noise.
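As a rough illustration of the template construction described above, the following sketch places keyframes into an otherwise blank template of the same size as the target motion. It assumes, purely for illustration, that a motion is a list of per-frame feature vectors and that a blank frame is an all-zero vector; the function name `build_template` and this data layout are hypothetical and are not taken from this application.

```python
def build_template(num_frames, feat_dim, keyframes):
    """Return a num_frames x feat_dim template holding the given keyframes.

    keyframes maps a frame index to a pose vector of length feat_dim.
    Blank frames are represented as zeros (an assumption of this sketch).
    """
    template = [[0.0] * feat_dim for _ in range(num_frames)]
    for idx, pose in keyframes.items():
        if not 0 <= idx < num_frames:
            raise IndexError(f"keyframe index {idx} is outside the template")
        if len(pose) != feat_dim:
            raise ValueError("pose dimension must match the template")
        template[idx] = list(pose)
    return template


# Toy example: 6 frames, 3 features per frame, keyframes at frames 1 and 4.
template = build_template(6, 3, {1: [0.1, 0.2, 0.3], 4: [0.4, 0.5, 0.6]})
```

The template keeps the full frame count and feature dimension of the target motion, so later stages can operate on it without reshaping.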

[0123] According to the dance motion generation device 300 of this application, the construction module 303 is specifically used for: constructing a mask matrix, wherein the data dimension of the mask matrix is the same as the data dimension of the target dance motion, the element value at each position corresponding to a keyframe in the mask matrix is a first value, and the element value at each position corresponding to a blank frame is a second value; generating standard Gaussian noise with the same data dimension as the dance motion template; and generating the initial dance motion with noise from the dance motion template, the mask matrix, and the standard Gaussian noise using a formula. The formula and its symbols are not reproduced in this text; its terms denote, in order, the initial dance motion with noise, the mask matrix, the dance motion template, the standard Gaussian noise, and element-wise multiplication.
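The masked noising step can be sketched as follows. Because the formula itself is not available in this text, the sketch assumes one plausible form, x = M ⊙ T + (1 − M) ⊙ ε, with the first value equal to 1 (keyframe positions keep the template) and the second value equal to 0 (blank positions are filled with noise). That form, the values 1 and 0, and the names `make_mask` and `noisy_initial_motion` are assumptions of this example, not statements of the application.

```python
import random


def make_mask(num_frames, feat_dim, keyframe_indices, first=1.0, second=0.0):
    """Mask with `first` on keyframe rows and `second` on blank rows."""
    keys = set(keyframe_indices)
    return [[first if t in keys else second] * feat_dim for t in range(num_frames)]


def noisy_initial_motion(template, mask, rng):
    """Assumed form: x = M * T + (1 - M) * eps, applied element-wise."""
    motion = []
    for t_row, m_row in zip(template, mask):
        motion.append(
            [m * x + (1.0 - m) * rng.gauss(0.0, 1.0) for x, m in zip(t_row, m_row)]
        )
    return motion


rng = random.Random(0)  # fixed seed so the toy run is repeatable
template = [[0.0, 0.0], [0.5, -0.5], [0.0, 0.0]]  # keyframe at frame 1
mask = make_mask(3, 2, [1])
x_init = noisy_initial_motion(template, mask, rng)
```

Under this assumed form, keyframe rows pass through unchanged while every blank row is replaced by standard Gaussian noise, which gives the denoising model fixed anchor poses to build around.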

[0124] According to the dance motion generation device 300 of this application, the condition information includes an accompaniment audio file of the target dance motion. The construction module 303 is specifically used for: acquiring the beat data of the accompaniment audio file, the beat data including multiple beat time points; determining the blank frames corresponding to each beat time point in a blank dance motion template as candidate blank frames, the data dimension of the blank dance motion template being the same as the data dimension of the target dance motion, and each frame in the blank dance motion template being a blank frame; determining, for each candidate blank frame, the matching degree between the audio features of the audio segment in the accompaniment audio file that corresponds to that candidate blank frame and the multidimensional joint motion features of the keyframe, and determining the candidate blank frame with the highest matching degree as the target blank frame; and replacing the target blank frame with the keyframe, and obtaining the dance motion template from the blank dance motion template after the replacement.
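The beat-aligned placement above can be illustrated with a small sketch. It assumes that beat time points are given in seconds, that a beat maps to a frame by rounding `time * fps`, that the matching degree is cosine similarity, and that the per-frame audio features already live in the same space as the keyframe features. None of these choices, nor the name `place_keyframe`, is specified by this application.

```python
import math


def cosine(a, b):
    """Cosine similarity, used here as a stand-in matching degree."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def place_keyframe(beat_times, fps, num_frames, audio_feats, key_feat, key_pose):
    """Insert key_pose at the beat frame whose audio feature best matches key_feat."""
    template = [[0.0] * len(key_pose) for _ in range(num_frames)]  # all blank
    candidates = sorted({min(num_frames - 1, round(t * fps)) for t in beat_times})
    target = max(candidates, key=lambda f: cosine(audio_feats[f], key_feat))
    template[target] = list(key_pose)
    return template, target


# Toy data: beats at 0.5 s, 1.0 s, 1.5 s with 4 frames per second.
audio_feats = {2: [1.0, 0.0], 4: [0.6, 0.8], 6: [0.0, 1.0]}
template, target = place_keyframe(
    [0.5, 1.0, 1.5], 4, 8, audio_feats, [0.0, 1.0], [0.3, -0.2, 0.9]
)
```

Only frames that fall on a beat are considered as candidates, so the keyframe always lands on a beat of the accompaniment.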

[0125] According to the dance motion generation device 300 of this application, the construction module 303 is specifically used for: inputting the audio features into a feature alignment model, mapping the audio features to the feature space where the multidimensional joint motion features are located through the feature alignment model, and obtaining audio features aligned with the multidimensional joint motion features, wherein the feature alignment model is obtained by training an initial feature alignment model based on a neural network using a contrastive learning loss function based on paired audio feature samples and multidimensional joint motion feature samples; and determining the matching degree between the audio features aligned with the multidimensional joint motion features of each candidate blank frame and the multidimensional joint motion features of the key frame.

[0126] According to the dance motion generation device 300 of this application, the initial feature alignment model includes a first sub-module, a second sub-module, a third sub-module, a nonlinear activation unit, and a residual connection unit. The first sub-module includes a first multilayer perceptron, the second sub-module includes a second multilayer perceptron and a first normalization layer, and the third sub-module includes a third multilayer perceptron and a second normalization layer. The feature alignment model is trained through the following steps: the audio feature samples are input into the first sub-module, and feature projection is performed by the first multilayer perceptron to obtain a first intermediate feature; the first intermediate feature is input into the second sub-module, its dimensionality is expanded by the second multilayer perceptron, and the features output by the second multilayer perceptron are normalized by the first normalization layer to obtain a second intermediate feature; the second intermediate feature is input into the third sub-module, its dimensionality is reduced by the third multilayer perceptron, and the features output by the third multilayer perceptron are normalized by the second normalization layer to obtain a third intermediate feature; the third intermediate feature is nonlinearly transformed by the nonlinear activation unit, and the output of the nonlinear activation unit is residually fused with the first intermediate feature by the residual connection unit to obtain predicted audio features aligned with the multidimensional joint motion feature samples; the InfoNCE (information noise-contrastive estimation) loss between the predicted audio features and the relative joint rotation feature samples in the multidimensional joint motion feature samples is determined, and the parameters of the initial feature alignment model are updated based on the InfoNCE loss; and the feature alignment model is obtained from the updated initial feature alignment model.
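The forward pass and loss described above can be sketched in plain Python. This is a simplified illustration under stated assumptions: each multilayer perceptron is collapsed to a single linear layer, the normalization layers are taken to be layer normalization, the activation is tanh, and the InfoNCE loss uses cosine similarity with a temperature of 0.1 and in-batch negatives. The parameter update step is omitted, and all function names are hypothetical.

```python
import math
import random


def linear(x, w, b):
    """Affine map; each row of w holds the weights of one output unit."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def layer_norm(x, eps=1e-5):
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]


def init_layer(rng, n_in, n_out):
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


def align_forward(audio, layers):
    (w1, b1), (w2, b2), (w3, b3) = layers
    h1 = linear(audio, w1, b1)               # first sub-module: projection
    h2 = layer_norm(linear(h1, w2, b2))      # second: expand, then normalize
    h3 = layer_norm(linear(h2, w3, b3))      # third: reduce, then normalize
    act = [math.tanh(v) for v in h3]         # nonlinear activation (assumed tanh)
    return [a + r for a, r in zip(act, h1)]  # residual fusion with h1


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def info_nce(pred, motion, temperature=0.1):
    """pred[i] is paired with motion[i]; other rows in the batch are negatives."""
    total = 0.0
    for i, p in enumerate(pred):
        logits = [cosine(p, m) / temperature for m in motion]
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(pred)


rng = random.Random(0)
layers = [init_layer(rng, 4, 3), init_layer(rng, 3, 6), init_layer(rng, 6, 3)]
aligned = align_forward([0.2, -0.1, 0.4, 0.3], layers)
matched = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In the toy batch, correctly paired features give a much smaller loss than mismatched pairs, which is the signal a contrastive objective of this kind uses to pull paired audio and motion features together.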

[0127] According to the dance motion generation device 300 of this application, the dance motion generation model includes an intermediate residual temporal module, which includes a first module, a second module, and a third module; the first module includes a self-attention layer and a first feature linear modulation layer, the second module includes a cross-attention layer and a second feature linear modulation layer, and the third module includes a feedforward network layer and a third feature linear modulation layer. The input module 304 is specifically used for: determining the diffusion time step of the dance motion generation model; inputting the initial dance motion into the first module, capturing the temporal dependency features of the initial dance motion through the self-attention layer, and performing linear transformation modulation on the temporal dependency features based on the diffusion time step through the first feature linear modulation layer; inputting the output of the first module and the initial dance motion into the second module, calculating the interaction attention features between the input content and the condition information through the cross-attention layer, and performing linear transformation modulation on the interaction attention features based on the time step through the second feature linear modulation layer; inputting the outputs of the second module and the first module into the third module, performing nonlinear feature transformation through the feedforward network layer, and performing linear transformation modulation on the output features of the feedforward network layer based on the time-step feature through the third feature linear modulation layer; residually connecting the output of the third module with the initial dance motion to obtain the output feature sequence of the intermediate residual temporal module; and decoding the output feature sequence to obtain the target dance motion.
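The data flow through the three modules can be sketched as below. The sketch makes several simplifying assumptions that are not stated in this application: attention uses no learned projections, the feedforward layer is replaced by a ReLU, the two inputs of the second and third modules are combined by addition, and the modulation parameters derived from the time step are passed in directly as scale and shift vectors. All names are hypothetical.

```python
import math


def softmax(v):
    peak = max(v)
    e = [math.exp(x - peak) for x in v]
    s = sum(e)
    return [x / s for x in e]


def attention(queries, keys, values):
    """Scaled dot-product attention without learned projections."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append(
            [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))]
        )
    return out


def film(h, scale, shift):
    """Feature-wise linear modulation: scale * h + shift, per feature."""
    return [[s * x + b for x, s, b in zip(row, scale, shift)] for row in h]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def ffn(h):
    return [[max(0.0, x) for x in row] for row in h]  # stand-in feedforward


def residual_temporal_block(x, cond, film_params):
    (g1, b1), (g2, b2), (g3, b3) = film_params
    h1 = film(attention(x, x, x), g1, b1)                   # first module
    h2 = film(attention(add(h1, x), cond, cond), g2, b2)    # second module
    h3 = film(ffn(add(h2, h1)), g3, b3)                     # third module
    return add(h3, x)                                       # residual connection


x = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # 3 frames, 2 features
cond = [[1.0, 0.0], [0.0, 1.0]]            # toy condition tokens
zero = ([0.0, 0.0], [0.0, 0.0])
identity = ([1.0, 1.0], [0.0, 0.0])
out_zero = residual_temporal_block(x, cond, [zero, zero, zero])
out = residual_temporal_block(x, cond, [identity, identity, identity])
```

With all scale and shift values set to zero, every module contributes nothing and the block reduces to its residual path, so the output equals the input. This shows how the modulation layers gate each module's contribution.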

[0128] According to the dance motion generation device 300 of this application, the first feature linear modulation layer, the second feature linear modulation layer, and the third feature linear modulation layer use the same three linear transformation modulation formulas. The formulas and most of their symbols are not reproduced in this text; the symbols defined are, in order: the features after linear transformation modulation; the initial dance movement; the embedded features of the time step; the scaling adjustment factor; and the translation adjustment factor. ⊙ denotes the element-wise multiplication operation, and σ, ζw, and ζb denote different linear transformation modulation operations.
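A common way to realize feature-wise linear modulation conditioned on a time step is sketched below. Since the formulas are not available in this text, the sketch assumes the conventional FiLM form: a time-step embedding e, a scale γ = W_s·e + b_s, a shift β = W_b·e + b_b, and an output γ ⊙ h + β. The sinusoidal embedding, the single linear maps, and every name in the code are assumptions of this example. They are not claimed to match σ, ζw, or ζb as defined in this application.

```python
import math


def timestep_embedding(t, dim):
    """Sinusoidal embedding of a diffusion time step (assumed choice)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def linear(x, w, b):
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def film(h, t_emb, w_scale, b_scale, w_shift, b_shift):
    """Scale and shift every frame of h using factors predicted from t_emb."""
    gamma = linear(t_emb, w_scale, b_scale)  # scaling adjustment factor
    beta = linear(t_emb, w_shift, b_shift)   # translation adjustment factor
    return [[g * x + b for x, g, b in zip(row, gamma, beta)] for row in h]


emb0 = timestep_embedding(0, 4)
h = [[1.0, 2.0], [3.0, 4.0]]
zeros_w = [[0.0] * 4, [0.0] * 4]
# Zero weights with unit scale bias and zero shift bias give the identity map.
same = film(h, emb0, zeros_w, [1.0, 1.0], zeros_w, [0.0, 0.0])
# Changing only the shift bias translates every feature by one.
shifted = film(h, emb0, zeros_w, [1.0, 1.0], zeros_w, [1.0, 1.0])
```

Because the scale and shift depend only on the time step, the same layer can modulate features differently at each stage of denoising.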

[0129] According to the dance motion generation device 300 of this application, the condition information includes at least one of the following: accompaniment audio file, dancer's identity information, and style information of the target dance motion.

[0130] This application also provides an electronic device that may include a processor, a communications interface, a memory, and a communication bus, wherein the processor, communications interface, and memory communicate with each other via the communication bus. The processor can invoke logical instructions in the memory to execute the dance motion generation method described above.

[0131] Furthermore, the logical instructions in the aforementioned memory can be implemented as software functional units and sold or used as independent products, and can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0132] In another aspect, this application also provides a computer program product, which includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer is able to execute the dance motion generation method provided above.

[0133] In yet another aspect, this application also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, performs the dance motion generation method provided above.

[0134] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0135] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0136] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.

Claims

1. A method for generating dance movements, characterized in that it comprises: obtaining semantic description information, input by the user, of the limbs in the target dance movement to be generated, and condition information on which generation of the target dance movement is based; performing feature encoding on the semantic description information to obtain the multi-dimensional joint motion features of the dancer in the target dance movement; constructing an initial dance movement with noise based on the multi-dimensional joint motion features; and inputting the initial dance movement and the condition information into a dance movement generation model to obtain the target dance movement, wherein the dance movement generation model is used to denoise the initial dance movement based on the condition information and the diffusion principle to obtain the target dance movement, and the dance movement generation model is obtained by training a diffusion model on noisy dance movement samples, condition information samples, and the clean dance movement labels corresponding to the noisy dance movement samples.

2. The dance movement generation method according to claim 1, characterized in that, The step of constructing an initial dance movement with noise based on the multidimensional joint motion characteristics includes: Based on the multi-dimensional joint motion features, keyframes in the target dance movement are generated; A dance movement template is constructed based on the keyframes. The dance movement template includes the keyframes and blank frames other than the keyframes. The data dimension of the dance movement template is the same as the data dimension of the target dance movement. The dance movement template is subjected to noise processing to obtain the initial dance movement with noise.

3. The dance movement generation method according to claim 2, characterized in that the step of adding noise to the dance movement template to obtain the initial dance movement with noise includes: constructing a mask matrix, wherein the data dimension of the mask matrix is the same as the data dimension of the target dance movement, the element value at each position corresponding to a keyframe in the mask matrix is a first value, and the element value at each position corresponding to a blank frame is a second value; generating standard Gaussian noise with the same data dimension as the dance movement template; and generating the initial dance movement with noise from the dance movement template, the mask matrix, and the standard Gaussian noise using a formula. The formula and its symbols are not reproduced in this text; its terms denote, in order, the initial dance movement with noise, the mask matrix, the dance movement template, the standard Gaussian noise, and element-wise multiplication.

4. The dance movement generation method according to claim 2, characterized in that, The condition information includes the accompaniment audio file for the target dance movement, and the step of constructing a dance movement template based on the keyframes includes: Obtain the beat data of the accompaniment audio file, wherein the beat data includes multiple beat time points; The blank frames corresponding to each beat time point in the blank dance movement template are determined as candidate blank frames. The data dimension of the blank dance movement template is the same as the data dimension of the target dance movement. Each frame in the blank dance movement template is a blank frame. Determine the matching degree between the audio features of the audio segments corresponding to each candidate blank frame in the accompaniment audio file and the multidimensional joint motion features of the key frame, and determine the candidate blank frame with the highest matching degree as the target blank frame; The target blank frame is replaced with the key frame, and the dance movement template is obtained based on the replaced blank dance movement template.

5. The dance movement generation method according to claim 4, characterized in that, The determination of the matching degree between the audio features of the audio segments corresponding to each of the candidate blank frames in the accompaniment audio file and the multidimensional joint motion features of the keyframes includes: The audio features are input into the feature alignment model, and the audio features are mapped to the feature space where the multidimensional joint motion features are located through the feature alignment model to obtain the audio features aligned with the multidimensional joint motion features. The feature alignment model is obtained by training an initial feature alignment model based on a neural network using a contrastive learning loss function based on paired audio feature samples and multidimensional joint motion feature samples. Determine the matching degree between the audio features of each candidate blank frame that are aligned with the multidimensional joint motion features and the multidimensional joint motion features of the key frame.

6. The dance movement generation method according to claim 5, characterized in that the initial feature alignment model includes a first sub-module, a second sub-module, a third sub-module, a nonlinear activation unit, and a residual connection unit. The first sub-module includes a first multilayer perceptron, the second sub-module includes a second multilayer perceptron and a first normalization layer, and the third sub-module includes a third multilayer perceptron and a second normalization layer. The feature alignment model is trained through the following steps: the audio feature samples are input into the first sub-module, and feature projection is performed by the first multilayer perceptron to obtain a first intermediate feature; the first intermediate feature is input into the second sub-module, its dimensionality is expanded by the second multilayer perceptron, and the features output by the second multilayer perceptron are normalized by the first normalization layer to obtain a second intermediate feature; the second intermediate feature is input into the third sub-module, its dimensionality is reduced by the third multilayer perceptron, and the features output by the third multilayer perceptron are normalized by the second normalization layer to obtain a third intermediate feature; the third intermediate feature is nonlinearly transformed by the nonlinear activation unit, and the output of the nonlinear activation unit is residually fused with the first intermediate feature by the residual connection unit to obtain predicted audio features aligned with the multidimensional joint motion feature samples; the InfoNCE (information noise-contrastive estimation) loss between the predicted audio features and the relative joint rotation feature samples in the multidimensional joint motion feature samples is determined, and the parameters of the initial feature alignment model are updated based on the InfoNCE loss; and the feature alignment model is obtained from the updated initial feature alignment model.

7. The dance movement generation method according to claim 1, characterized in that the dance motion generation model includes an intermediate residual temporal module, which includes a first module, a second module, and a third module. The first module includes a self-attention layer and a first feature linear modulation layer. The second module includes a cross-attention layer and a second feature linear modulation layer. The third module includes a feedforward network layer and a third feature linear modulation layer. The step of inputting the initial dance movement and the condition information into the dance movement generation model to obtain the target dance movement includes: determining the diffusion time step of the dance motion generation model; inputting the initial dance movement into the first module, capturing the temporal dependency features of the initial dance movement through the self-attention layer, and performing linear transformation modulation on the temporal dependency features based on the diffusion time step through the first feature linear modulation layer; inputting the output of the first module and the initial dance movement into the second module, calculating the interaction attention features between the input content and the condition information through the cross-attention layer, and performing linear transformation modulation on the interaction attention features based on the time step through the second feature linear modulation layer; inputting the outputs of the second module and the first module into the third module, performing nonlinear feature transformation through the feedforward network layer, and performing linear transformation modulation on the output features of the feedforward network layer based on the time-step feature through the third feature linear modulation layer; residually connecting the output of the third module with the initial dance movement to obtain the output feature sequence of the intermediate residual temporal module; and decoding the output feature sequence to obtain the target dance movement.

8. The dance movement generation method according to claim 7, characterized in that the first feature linear modulation layer, the second feature linear modulation layer, and the third feature linear modulation layer use the same three linear transformation modulation formulas. The formulas and most of their symbols are not reproduced in this text; the symbols defined are, in order: the features after linear transformation modulation; the initial dance movement; the embedded features of the time step; the scaling adjustment factor; and the translation adjustment factor. ⊙ denotes the element-wise multiplication operation, and σ, ζw, and ζb denote different linear transformation modulation operations.

9. The dance movement generation method according to any one of claims 1-8, characterized in that, The condition information includes at least one of the following: the accompaniment audio file, the dancer's identity information, and the style information of the target dance movement.

10. A dance movement generation device, characterized in that it comprises: an acquisition module, used to acquire semantic description information, input by the user, of the limbs in the target dance movement to be generated, and condition information on which generation of the target dance movement is based; an encoding module, used to perform feature encoding on the semantic description information to obtain the multi-dimensional joint motion features of the dancer in the target dance movement; a construction module, used to construct an initial dance movement with noise based on the multi-dimensional joint motion features; and an input module, used to input the initial dance movement and the condition information into a dance movement generation model to obtain the target dance movement, wherein the dance movement generation model is used to denoise the initial dance movement based on the condition information and the diffusion principle to obtain the target dance movement, and the dance movement generation model is obtained by training a diffusion model on noisy dance movement samples, condition information samples, and the clean dance movement labels corresponding to the noisy dance movement samples.