Dance video generation method based on multi-modal music driving and frequency-space dual-flow decomposition
By employing a multimodal music-driven and frequency-spatial dual-stream decomposition method, the generation stability issues of existing dance generation methods in motion synchronization, visual detail separation, and occlusion scenarios are resolved. This approach enables high-fidelity dance video generation, reduces motion synchronization errors, and improves generation stability in visual detail loss and occlusion scenarios.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- 湖南马栏山视频先进技术研究院有限公司
- Filing Date
- 2025-03-31
- Publication Date
- 2026-06-23
AI Technical Summary
Existing music-driven dance generation methods have errors in capturing the complex temporal relationship between music beats and dance movements. Single-stream generation frameworks cannot effectively separate spatial pose motion from frequency domain details, resulting in motion lag, misbeats, edge blurring, and texture distortion. Furthermore, global optimization strategies lack dynamic evaluation of local joint confidence, leading to limb breaks or reverse joint anomalies in the generated pose sequences when occluded.
This study employs a multimodal music-driven approach and a frequency-spatial dual-stream decomposition method. It achieves spatiotemporal alignment of music and visual features through gated cross-modal attention, uses a part-specific Transformer decoder to predict joint motion trajectories, optimizes the frequency domain consistency between high-frequency movements and music beats through wavelet decomposition, generates body region masks by combining graph convolutional networks, separates low-frequency energy maps and high-frequency residual features using Butterworth filter banks, constructs a parallel diffusion framework for spatial and frequency domain flows, optimizes video sequences by combining optical flow consistency constraints, and enhances high-frequency details by employing Laplacian pyramid reconstruction and subpixel convolution.
It achieves high fidelity in dance video generation, reduces motion synchronization error to 118ms, solves the problem of visual detail loss, improves generation stability in occluded scenes, and can still generate reasonable ergonomic movements in dance videos with 50% limb occlusion.
Figure CN120238708B_ABST
Description
Technical Field
[0001] This invention relates to the field of computer vision and artificial intelligence content generation technology, specifically to a method for generating dance videos based on multimodal music driving and frequency-spatial dual-stream decomposition.
Background Technology
[0002] Existing music-driven dance generation methods suffer from three main drawbacks: First, single-modal feature alignment strategies struggle to capture the complex temporal relationship between music beats and dance movements, leading to delayed or out-of-beat movements. For example, existing CNN-LSTM-based models exhibit synchronization errors exceeding 200ms under strong rhythmic music. Second, single-stream generation frameworks cannot effectively separate spatial pose motion from frequency domain details, especially during rapid limb movements, which can result in edge blurring and texture distortion. Test data shows that traditional methods achieve a PSNR of only 28.4dB in clothing folds. Third, global optimization strategies lack dynamic evaluation of local joint confidence. When the input image is occluded (e.g., an arm is covered by an object), the generated pose sequence is prone to limb breakage or reverse joint abnormalities.
[0003] The background description provided herein is for the purpose of generally presenting the context of this disclosure. Unless otherwise indicated herein, the material described in this section is not prior art to the claims of this application and should not be acknowledged as prior art by virtue of its inclusion in this section.
Summary of the Invention
[0004] To address the aforementioned technical problems in related technologies, this invention proposes a dance video generation method based on multimodal music driving and frequency-spatial dual-stream decomposition, comprising:
[0005] S1. Obtain the beat sequence and music style features of the music; and use gated cross-modal attention to achieve spatiotemporal alignment of music-visual features to output multimodal aligned features;
[0006] S2. A segmented Transformer decoder is used to predict joint motion trajectories, and the frequency domain consistency between high-frequency movements and music beats is optimized by wavelet decomposition to obtain a global pose sequence; the segmented Transformer is specifically a four-head attention mechanism that processes different body parts respectively.
[0007] S3. The global pose sequence is processed based on graph convolutional network to generate a body region mask. The non-body regions in the reference image are processed using the Butterworth filter bank to separate the low-frequency energy map and high-frequency residual features.
[0008] S4. Construct a parallel diffusion framework for spatial flow and frequency domain flow, and optimize the original video sequence by combining optical flow consistency constraints to maintain temporal smoothness and obtain the optimized video sequence. The spatial flow is guided by AdaIN pose, and the frequency domain flow is a wavelet modulation network.
[0009] S5. Reconstruct cross-scale features using the Laplacian pyramid and enhance high-frequency details using sub-pixel convolution to generate high-fidelity dance videos.
[0010] Furthermore, step S1 specifically includes:
[0011] S11. Perform dual feature encoding on the music waveform signal and use the Librosa toolkit to extract the binarized beat sequence. A Transformer encoder using a Jukebox pre-trained model is used to extract music style features.
[0012] S12. Extract the low-level texture features and high-level semantic features of the reference image in the video. Then, using a gated cross-modal attention mechanism, concatenate the temporal attention results with the low-level texture features upsampled by bilinear interpolation to output multimodal aligned features. The beat-gated cross-modal attention mechanism is constructed from the beat features and dynamically adjusts the weights for music-visual feature fusion; σ denotes the Sigmoid activation function.
[0013] Furthermore, in step S12, concatenating the temporal attention results with the low-level texture features through the gated cross-modal attention mechanism to output multimodal aligned features specifically involves: using learnable parameter matrices, the music style features are projected into query vectors, and the visual high-level semantic features are mapped to keys and values.
[0014] A dynamic gating factor $\gamma_t$ is generated from the beat features and modulates the attention weights of each frame. Finally, the temporal attention results are concatenated with the low-level features to output multimodal aligned features.
[0015] Furthermore, step S2 specifically includes:
[0016] S21. Spatiotemporal window division is performed on the multimodal aligned features $F_{align}$ to obtain a sequence of overlapping windows $\{W_k = F_{align}[8k-L+1 : 8k]\}$; each window is compressed to 256 channels using a 1×1 convolution, where T represents the total number of video frames, and H/4 and W/4 are the spatial dimensions after downsampling;
[0017] S22. Predict joint motion trajectories using the segmented Transformer. The segmented Transformer specifically employs a four-head attention mechanism to process different body parts; the query vector of each head is generated from the position-encoded features with a projection matrix, and the key-value pairs $K_i$, $V_i$ are obtained by mapping the context features; the four-head attention results are fused by a fully connected layer to output the pose prediction within the window.
[0018] S23. A discrete wavelet transform decomposes the pose sequence of each window into low-frequency and high-frequency components; the frequency domain consistency loss $L_{freq}$ is calculated between the short-time Fourier transform of the high-frequency components and the music beat features; and a weighted average over all window prediction results that include frame t is calculated to obtain the global pose sequence.
[0019] Furthermore, step S3 specifically includes the following steps:
[0020] S31. Model the human body topology on the pose sequence using a graph convolutional network to generate a body region mask.
[0021] S32. The reference image is weighted by the body region mask to obtain the low-frequency component $LL_t$. The non-body regions are processed by a fourth-order Butterworth high-pass filter bank to separate high-frequency components in three directions: horizontal $LH_t$, vertical $HL_t$, and diagonal $HH_t$. The low-frequency component $LL_t$ passes through a 3×3 convolution and layer normalization to generate the low-frequency energy map. The high-frequency components are concatenated along the channel dimension, $\mathrm{Cat}(LH_t, HL_t, HH_t)$, compressed to 32 dimensions using a 1×1 convolution, and then instance-normalized to generate the high-frequency residual map.
[0022] Furthermore, step S4 specifically includes the following steps:
[0023] S41. Process the initial video sequence based on AdaIN to obtain the first optimized video sequence;
[0024] S42. Process the frequency domain features using a wavelet modulation network to obtain...
[0025] S43. Perform time-series consistency optimization on the first optimized video sequence using optical flow to obtain a time-optimized video sequence.
[0026] Furthermore, the pose projection network of AdaIN employs a 3-layer MLP structure.
[0027] Furthermore, the depthwise separable convolution kernel of the wavelet modulation network has a size of 3×3 and a dilation rate of 2.
[0028] Furthermore, step S5 specifically includes the following steps:
[0029] S51. Construct a four-level Laplacian pyramid from the optimized video sequence;
[0030] S52. Through cross-scale attention fusion, the features of each level of the Laplacian pyramid are mapped to a unified dimension to obtain projected features, from which query, key, and value vectors are constructed. Cross-scale association weights are calculated using a multi-head attention mechanism to obtain the synthesized features.
[0031] S53. Calculate the Laplacian residual between the original frame and the upsampled low-frequency component. Detail information is recovered using the improved sub-pixel convolutional network Deconv(·).
[0032] Furthermore, the improved sub-pixel convolutional network first expands the number of channels to $3 \times r^2 = 12$ through a 1×1 convolutional layer, whose weight matrix is initialized using He initialization. Subsequently, the PixelShuffle operation is applied to increase the spatial resolution of the feature map to 2H×2W and reduce the channel dimension to 3, forming intermediate features. To further enhance the ability to express details, a three-layer convolutional structure is used: the first layer is a 3×3 convolution to extract local texture features; the second layer is a 5×5 dilated convolution to expand the receptive field; and the third layer is a 1×1 convolution to generate a detail weight map. Finally, the detail enhancement result is calculated by fusing the skip connection with the residual features, where ⊙ denotes channel-wise multiplication. After bicubic interpolation downsampling to the original resolution, the result is combined proportionally with the cross-scale fused features, and the output is truncated to the range [0, 255].
[0033] This invention extracts multi-granular music features using a composite encoder (Librosa+Jukebox) and employs a beat-gated attention mechanism to ensure that key dance movements such as raising hands and kicking legs are strictly aligned with the music's downbeats. Verification using a test dataset shows that the synchronization error is reduced to 118ms. To address the issue of lost visual details, a frequency-spatial dual-stream decomposition architecture is proposed. A Butterworth filter bank decouples the reference image into a low-frequency energy map (encoding the overall body movement trend) and a high-frequency residual (preserving clothing texture and lighting details). A dual-stream diffusion mechanism optimizes global pose and local details respectively. To improve generation stability in occluded scenes, a joint confidence prediction module is introduced. A temporal sliding window weighted fusion strategy dynamically corrects the motion trajectories of abnormal joints, enabling the generation of ergonomically sound movements even with 50% limb occlusion.
Attached Figure Description
[0034] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0035] Figure 1 is a schematic diagram of a dance video generation method based on multimodal music driving and frequency-space dual-stream decomposition provided in an embodiment of the present invention.
Detailed Implementation
[0036] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention are within the scope of protection of the present invention.
[0037] Example 1
[0038] Referring to Figure 1, this embodiment discloses a dance video generation method based on multimodal music driving and frequency-space dual-stream decomposition, including:
[0039] S1. Obtain the beat sequence and music style features of the music; and use gated cross-modal attention to achieve spatiotemporal alignment of music-visual features to output multimodal aligned features;
[0040] This embodiment extracts music features using Librosa beat detection and Jukebox style encoding; specifically, it includes the following steps:
[0041] S11. Perform dual feature encoding on the music waveform signal and use the Librosa toolkit to extract the binarized beat sequence; use the Transformer encoder of the Jukebox pre-trained model to extract music style features;
[0042] Specifically, in this embodiment, the original audio can be decoded to obtain time-series data that constitutes a music waveform signal.
[0043] Dual feature encoding is performed on the original music waveform signal, where $T_m$ represents the number of audio sampling points and F = 128 is the Mel-spectrum dimension. A binary beat sequence is extracted using the Librosa toolkit; each element indicates whether the corresponding video frame contains a beat. Simultaneously, the Transformer encoder of the Jukebox pre-trained model extracts the music style features of the original music waveform signal. The music style features are then resampled using cubic spline interpolation to obtain music style features aligned to the video frame rate.
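The beat binarization described above can be illustrated with a minimal sketch. It assumes that beat times in seconds have already been detected (for example with Librosa's beat tracker); the function name, the frame rate, and the sample beat times are hypothetical and are not taken from the patent.

```python
def beats_to_frame_sequence(beat_times, num_frames, fps):
    """Return a 0/1 list with one entry per video frame; 1 marks a frame that contains a beat."""
    sequence = [0] * num_frames
    for t in beat_times:
        idx = int(t * fps)  # index of the video frame whose time span contains the beat
        if 0 <= idx < num_frames:
            sequence[idx] = 1
    return sequence


# Hypothetical input: three beats in a 2-second clip at 25 frames per second.
beat_times = [0.5, 1.0, 1.5]
sequence = beats_to_frame_sequence(beat_times, num_frames=50, fps=25)
```

Each element of `sequence` plays the role of the per-frame beat indicator; the resampling of style features to the video frame rate is not covered by this sketch.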
[0044] S12. Extract the low-level texture features and high-level semantic features of the reference image in the video, and concatenate the temporal attention results with the low-level texture features through a gated cross-modal attention mechanism to output multimodal aligned features; the beat-gated cross-modal attention mechanism is constructed from the beat features and dynamically adjusts the weights for music-visual feature fusion, where σ denotes the Sigmoid activation function;
[0045] Specifically, concatenating the temporal attention results with the low-level texture features through the gated cross-modal attention mechanism to output multimodal aligned features involves: using learnable parameter matrices, the music style features are projected into query vectors, and the visual high-level semantic features are mapped to keys and values.
[0046] A dynamic gating factor $\gamma_t$ is generated from the beat features and modulates the attention weights of each frame. Finally, the temporal attention results are concatenated with the low-level features upsampled by bilinear interpolation to output multimodal aligned features. The gating factor $\gamma_t$ provides the dynamic weighting of the beat-gated cross-modal attention mechanism, modulating the fusion strength of the music-visual features.
[0047] From the reference image, a 3×3 convolutional network extracts the low-level texture features, and residual blocks encode the high-level semantic features. The beat-gated cross-modal attention mechanism is constructed to dynamically adjust the weights for music-visual feature fusion, where the gating parameters are learnable.
[0048] For the reference image, low-level texture features are extracted using a 3×3 convolutional layer. High-level semantic features are then further encoded using residual blocks consisting of two 3×3 convolutional layers and skip connections.
[0049] The cross-modal alignment stage uses learnable parameter matrices to project the music style features into query vectors and to map the visual high-level features to keys and values.
[0050] A dynamic gating factor $\gamma_t$ is generated from the beat features and modulates the attention weights of each frame. Finally, the temporal attention results are concatenated with the low-level features upsampled by bilinear interpolation to output multimodal aligned features. In this process, d = 256 is the feature projection dimension, σ denotes the Sigmoid activation function, and the residual block operation is defined as $F_{out} = \mathrm{Conv}(\mathrm{ReLU}(\mathrm{Conv}(F_{in}))) + F_{in}$. Spatial downsampling is achieved through a convolution operation with a stride of 2.
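A rough sketch of beat-gated cross-modal attention is given below. The exact gate and attention formulas are not recoverable from the text, so this sketch assumes a scalar gate $\gamma_t = \sigma(w_g b_t + b_g)$ and standard scaled dot-product attention; all shapes, the random weights, and the names `w_g` and `b_g` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def beat_gated_attention(style, visual, beats, W_q, W_k, W_v, w_g, b_g):
    """style: (T, d_m) music features, visual: (N, d_v) image tokens, beats: (T,) 0/1 sequence."""
    Q = style @ W_q                       # queries from music style features, (T, d)
    K = visual @ W_k                      # keys from visual semantic features, (N, d)
    V = visual @ W_v                      # values from visual semantic features, (N, d)
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))  # (T, N) attention weights per frame
    gamma = sigmoid(w_g * beats + b_g)    # assumed gate: one factor per frame
    return gamma[:, None] * (attn @ V), gamma


rng = np.random.default_rng(0)
T, N, d_m, d_v, d = 6, 10, 8, 12, 256     # d = 256 follows the text; other sizes are toy values
beats = np.array([1, 0, 0, 1, 0, 0])
fused, gamma = beat_gated_attention(
    rng.normal(size=(T, d_m)), rng.normal(size=(N, d_v)), beats,
    rng.normal(size=(d_m, d)), rng.normal(size=(d_v, d)), rng.normal(size=(d_v, d)),
    w_g=2.0, b_g=-1.0,
)
```

With a positive `w_g`, frames that contain a beat receive a larger gate and therefore a stronger visual contribution. The concatenation with the upsampled low-level features is omitted here.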
[0051] S2. A segmented Transformer decoder is used to predict joint motion trajectories, and the frequency domain consistency between high-frequency movements and music beats is optimized by wavelet decomposition to obtain a global pose sequence; the segmented Transformer is specifically a four-head attention mechanism that processes different body parts respectively.
[0052] Step S2 specifically includes the following steps:
[0053] S21. Spatiotemporal window division is performed on the multimodal aligned features $F_{align}$ to obtain a sequence of overlapping windows $\{W_k = F_{align}[8k-L+1 : 8k]\}$; each window is compressed to 256 channels using a 1×1 convolution, where T represents the total number of video frames, and H/4 and W/4 are the spatial dimensions after downsampling;
[0054] First, spatiotemporal windowing is performed on the multimodal aligned features $F_{align}$, where T represents the total number of video frames, and H/4 and W/4 are the downsampled spatial dimensions. The temporal features are segmented by a sliding window slicing operation into a sequence of overlapping windows of length L = 16, $\{W_k = F_{align}[8k-L+1 : 8k]\}$, and each window is compressed to 256 channels using a 1×1 convolution.
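The window indexing can be sketched as follows. The sketch assumes that the slice $[8k-L+1 : 8k]$ is 1-based and inclusive, and that k starts at the first value for which the window fits inside the sequence; both are interpretations rather than statements from the text.

```python
def sliding_windows(num_frames, length=16, stride=8):
    """Return half-open 0-based (start, end) frame ranges for W_k = F[8k-L+1 : 8k]."""
    windows = []
    k = -(-length // stride)  # smallest k with stride*k - length + 1 >= 1 (ceiling division)
    while stride * k <= num_frames:
        start = stride * k - length           # 0-based index of 1-based frame 8k-L+1
        windows.append((start, stride * k))   # covers exactly `length` frames
        k += 1
    return windows


windows = sliding_windows(40)
```

For a 40-frame sequence this yields four windows, and every interior frame is covered by two windows, which is what later allows the overlapping predictions to be averaged.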
[0055] S22. Predict joint motion trajectories using the segmented Transformer. The segmented Transformer specifically employs a four-head attention mechanism to process different body parts; the query vector of each head is generated from the position-encoded features with a projection matrix, and the key-value pairs $K_i$, $V_i$ are obtained by mapping the context features; the four-head attention results are fused by a fully connected layer to output the pose prediction within the window.
[0056] In the improved Transformer decoding stage, a four-head attention mechanism is used to process different body parts separately. The query vector of each head is generated from the position-encoded features with a projection matrix, and the key-value pairs $K_i$, $V_i$ are obtained by mapping the context features. The pose prediction within the window is output after fusing the four-head attention results through a fully connected layer. Each of the 25 key points is described by three values: its x and y positions and its confidence level c.
[0057] S23. A discrete wavelet transform decomposes the pose sequence of each window into low-frequency and high-frequency components; the frequency domain consistency loss $L_{freq}$ is calculated between the short-time Fourier transform of the high-frequency components and the music beat features; and a weighted average over all window prediction results that include frame t is calculated to obtain the global pose sequence, where $C_k \in [0,1]^{L \times 25}$ represents the joint confidence scores. The global pose sequence includes the three-dimensional coordinates of the joints and their confidence scores.
[0058] The pose sequence is decomposed into low-frequency and high-frequency components using a discrete wavelet transform, and the frequency domain consistency loss $L_{freq}$ is calculated between the short-time Fourier transform of the high-frequency components and the music beat features. The frequency domain consistency loss is the L1-norm difference between the motion spectrum and the music spectrum within the beat region: the amplitude spectrum of the high-frequency motion components and the amplitude spectrum of the music beat features are subtracted point by point in the mask-covered area, while the weights in the uncovered areas are set to zero. The loss is expressed in terms of $M_{beat}$, the beat mask, and $S_{motion}$ and $S_{music}$, the STFT results of the motion and the music, respectively. This loss function forces the frequency domain energy distribution of high-frequency limb movements (such as hand tremors and footstep frequencies) to align with the music signal at the moments corresponding to the music beats.
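The masked L1 comparison described above can be sketched as below. The normalization of the loss is not stated in the text, so this sketch simply sums the masked differences; the function name and the toy spectra are hypothetical.

```python
import numpy as np


def freq_consistency_loss(S_motion, S_music, M_beat):
    """Masked L1 difference between amplitude spectra; entries outside the beat mask contribute zero."""
    diff = np.abs(np.abs(S_motion) - np.abs(S_music))  # point-by-point amplitude difference
    return float((M_beat * diff).sum())                # unnormalized sum (an assumption)


# Toy 2x2 time-frequency grids with complex STFT values.
S_motion = np.array([[3 + 4j, 2 + 0j], [1 + 0j, 0 + 0j]])
S_music = np.array([[1 + 0j, 1 + 0j], [1 + 0j, 2 + 0j]])
M_beat = np.array([[1, 0], [0, 1]])
loss = freq_consistency_loss(S_motion, S_music, M_beat)
```

Only the two masked cells contribute: the amplitude gaps are 4 and 2, so the toy loss is 6.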
[0059] The STFT uses a 512-point frame length and a Hamming window, and the beat region mask $M_{mask}$ is generated by binarization with a threshold τ = 0.7. The beat region mask $M_{mask}$ is primarily used to optimize the synchronization between high-frequency motion components and musical beats in the frequency domain. Its core function is to binarize and label the time-frequency regions corresponding to musical beats, so that the frequency domain consistency loss focuses on the spectral alignment of these regions. This forces the frequency domain energy distribution of highly dynamic movements such as hand tremors and footstep rhythms to strictly match the characteristics of the musical beats. The confidence prediction module processes the pose sequence through a one-dimensional convolutional network with a kernel size of 3, and outputs the joint confidence $C_k \in [0,1]^{L \times 25}$ after Sigmoid activation. Finally, a weighted average over all window prediction results that include frame t is calculated to obtain the global pose sequence.
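The fusion of overlapping window predictions can be sketched as a confidence-weighted average. The weighting formula itself is missing from the text, so using the joint confidences $C_k$ directly as weights is an assumption, and the function name and toy inputs are hypothetical.

```python
import numpy as np


def fuse_windows(preds, confs, starts, num_frames, eps=1e-8):
    """preds: list of (L, J, 2) coordinates; confs: list of (L, J) confidences; starts: 0-based frames."""
    num_joints = preds[0].shape[1]
    num = np.zeros((num_frames, num_joints, 2))
    den = np.zeros((num_frames, num_joints, 1))
    for P, C, s in zip(preds, confs, starts):
        L = P.shape[0]
        num[s:s + L] += C[..., None] * P   # confidence-weighted coordinates
        den[s:s + L] += C[..., None]       # accumulated confidence per frame and joint
    return num / (den + eps)               # weighted average; eps avoids division by zero


# Two overlapping toy windows (L = 4, one joint) over six frames.
preds = [np.zeros((4, 1, 2)), np.ones((4, 1, 2))]
confs = [np.full((4, 1), 0.25), np.full((4, 1), 0.75)]
fused_pose = fuse_windows(preds, confs, starts=[0, 2], num_frames=6)
```

In the overlap (frames 2 and 3) the higher-confidence window dominates, which is the mechanism by which a low-confidence, occluded joint in one window can be corrected by a neighbouring window.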
[0060] In this embodiment, the window slicing strategy uses a window length L = 16 that is a multiple of the step size 8 to ensure motion continuity. The Haar wavelet basis functions, in orthogonal form, achieve the frequency domain decomposition.
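A single-level orthonormal Haar decomposition of one joint coordinate track can be sketched as follows. The basis functions are not legible in the text, so the sketch assumes the standard orthonormal Haar filters $[1, 1]/\sqrt{2}$ and $[1, -1]/\sqrt{2}$; the sample track is hypothetical.

```python
import numpy as np


def haar_dwt(x):
    """One-level orthonormal Haar transform of an even-length 1-D signal."""
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2.0)    # low-frequency (approximation) coefficients
    high = (even - odd) / np.sqrt(2.0)   # high-frequency (detail) coefficients
    return low, high


def haar_idwt(low, high):
    """Inverse of haar_dwt."""
    out = np.empty(2 * len(low))
    out[0::2] = (low + high) / np.sqrt(2.0)
    out[1::2] = (low - high) / np.sqrt(2.0)
    return out


# Hypothetical track over one window of L = 16 frames: slow drift plus frame-to-frame jitter.
track = np.linspace(0.0, 1.0, 16) + 0.05 * (-1.0) ** np.arange(16)
low, high = haar_dwt(track)
```

Because the transform is orthonormal, it preserves signal energy and is exactly invertible, so the high-frequency part can be analysed separately without losing information.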
[0061] S3. The global pose sequence is processed based on graph convolutional network to generate a body region mask. The non-body regions in the reference image are processed using the Butterworth filter bank to separate the low-frequency energy map and high-frequency residual features.
[0062] Step S3 specifically includes the following steps:
[0063] S31. Model the human body topology on the pose sequence using a graph convolutional network to generate a body region mask.
[0064] Body region masks are generated by modeling human topological relationships using Graph Convolutional Networks (GCNs). A predefined adjacency matrix $A \in \{0,1\}^{25 \times 25}$ is used, based on the SMPL human anatomical structure: if joints i and j are physically connected (e.g., shoulder-elbow), then $A_{ij} = 1$. Joint dependencies are captured through the normalized adjacency matrix, where D is the degree matrix and $D_{ii}$ represents the number of nodes adjacent to joint i. The first graph convolution layer multiplies the pose coordinates $P_t$ by a weight matrix and, after ReLU activation, outputs 64-dimensional features. The second layer uses another weight matrix to generate spatial attention scores. The body region mask is obtained after Sigmoid activation and bilinear interpolation upsampling, precisely segmenting the main action area.
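The two graph convolution layers can be sketched as below. The normalization formula is not legible in the text, so the sketch assumes the symmetric form $D^{-1/2} A D^{-1/2}$ without self-loops; the chain-shaped skeleton, the random weights, and the function names are hypothetical and do not reproduce the SMPL topology.

```python
import numpy as np


def normalize_adjacency(A):
    """Assumed symmetric normalization D^{-1/2} A D^{-1/2}."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)                       # D_ii = number of neighbours of joint i
    d_inv_sqrt = np.zeros_like(deg)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]


def joint_attention(P, A_hat, W1, W2):
    """P: (25, 3) joint values. Returns per-joint attention scores in (0, 1)."""
    H = np.maximum(A_hat @ P @ W1, 0.0)       # layer 1: graph convolution + ReLU, 64-d features
    scores = A_hat @ H @ W2                   # layer 2: one attention logit per joint
    return 1.0 / (1.0 + np.exp(-scores))      # Sigmoid activation


# Hypothetical skeleton: 25 joints connected in a simple chain.
A = np.zeros((25, 25))
for i in range(24):
    A[i, i + 1] = A[i + 1, i] = 1.0

rng = np.random.default_rng(0)
A_hat = normalize_adjacency(A)
joint_scores = joint_attention(rng.normal(size=(25, 3)), A_hat,
                               rng.normal(size=(3, 64)), rng.normal(size=(64, 1)))
```

The sketch stops at per-joint scores; turning them into an image-sized mask by bilinear interpolation upsampling is not shown.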
[0065] S32. The reference image is weighted by the body region mask to obtain the low-frequency component $LL_t$. The non-body regions are processed by a fourth-order Butterworth high-pass filter bank to separate high-frequency components in three directions: horizontal $LH_t$, vertical $HL_t$, and diagonal $HH_t$. The low-frequency component $LL_t$ passes through a 3×3 convolution and layer normalization to generate the low-frequency energy map. The high-frequency components are concatenated along the channel dimension, $\mathrm{Cat}(LH_t, HL_t, HH_t)$, compressed to 32 dimensions using a 1×1 convolution, and then instance-normalized to generate the high-frequency residual map.
[0066] Next comes the wavelet domain decomposition stage. The mask-weighted reference image is passed through 3×3 average pooling (AvgPool) with a stride of 2 to generate the low-frequency component $LL_t$, which captures the overall motion energy. The non-body regions are processed by a fourth-order Butterworth high-pass filter bank to separate high-frequency components in three directions, horizontal $LH_t$, vertical $HL_t$, and diagonal $HH_t$, with the cutoff frequency set to 0.1 times the Nyquist frequency to match the spectral characteristics of dance movements. The low-frequency component $LL_t$ passes through a 3×3 convolution (Xavier initialization) and layer normalization (LayerNorm) to generate the low-frequency energy map, which stabilizes the global feature distribution. The high-frequency components are concatenated along the channel dimension, $\mathrm{Cat}(LH_t, HL_t, HH_t)$, compressed to 32 dimensions using a 1×1 convolution, and then instance-normalized (InstanceNorm) to preserve local contrast; finally, the high-frequency residual map is generated with a fusion weight α = 0.3.
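The high-pass stage can be sketched with a frequency-domain Butterworth response. The transfer function is not legible in the text, so the sketch assumes the common form $H(D) = 1 / (1 + (D_0 / D)^{2n})$ with order n = 4 and cutoff $D_0$ equal to 0.1 of the Nyquist frequency. It uses a single isotropic filter and does not reproduce the directional LH/HL/HH split; the function names are hypothetical.

```python
import numpy as np


def butterworth_highpass(shape, cutoff=0.1, order=4):
    """2-D Butterworth high-pass response on the FFT grid; cutoff is a fraction of Nyquist."""
    fy = np.fft.fftfreq(shape[0])[:, None]    # cycles per sample, Nyquist = 0.5
    fx = np.fft.fftfreq(shape[1])[None, :]
    D = np.sqrt(fx ** 2 + fy ** 2) / 0.5      # radial frequency as a fraction of Nyquist
    H = np.zeros(shape)
    nz = D > 0                                # leave the DC term at zero gain
    H[nz] = 1.0 / (1.0 + (cutoff / D[nz]) ** (2 * order))
    return H


def apply_filter(img, H):
    """Filter a 2-D image by multiplying its spectrum with the response H."""
    return np.real(np.fft.ifft2(np.fft.fft2(img) * H))


H = butterworth_highpass((20, 20))
flat = apply_filter(np.full((20, 20), 5.0), H)  # a constant image has no high-frequency content
```

The gain is one half exactly at the cutoff and approaches one well above it, so smooth background variation is suppressed while edges and texture pass through.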
[0067] S4. Construct a parallel diffusion framework for spatial flow and frequency domain flow, and optimize the original video sequence by combining optical flow consistency constraints to maintain temporal smoothness and obtain the optimized video sequence. The spatial flow is guided by AdaIN pose, and the frequency domain flow is a wavelet modulation network.
[0068] Step S4 specifically includes the following steps:
[0069] S41. Process the initial video sequence based on AdaIN to obtain the first optimized video sequence;
[0070] The spatial stream takes the initial video sequence (where T is the total number of frames, and H = 1024 and W = 768 are the frame resolution) and the frequency domain features as input, and applies a dynamic adaptive instance normalization mechanism, expressed as $V_t^{(k)} = (1-\alpha_k) \cdot V_t^{(k-1)} + \alpha_k \cdot \mathrm{AdaIN}(V_t^{(k-1)}, P_t)$, where $k \in \{1, \dots, K\}$ is the diffusion iteration index (default K = 50) and $\alpha_k = 0.1 + 0.8(k-1)/(K-1)$ grows linearly with k. In the AdaIN operation, $\mu_p, \sigma_p = \mathrm{MLP}(P_t)$, and $\mu_v, \sigma_v$ are the statistics of the input video frame. In this embodiment, when processing the initial video sequence based on AdaIN, the input of the pose projection network is the concatenation of the frequency domain features and the pose sequence. Affine parameters are generated by a 3-layer MLP to dynamically modulate the feature distribution of the video frames, ensuring that the global motion is aligned with the music beat.
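The iterative blending can be sketched as follows. The blending rule and the schedule $\alpha_k$ follow the text; the AdaIN formula itself is not legible there, so the sketch assumes the standard form $\sigma_p (V - \mu_v)/\sigma_v + \mu_p$, and it replaces the MLP output with fixed hypothetical values for `mu_p` and `sigma_p`.

```python
import numpy as np


def alpha_schedule(K=50):
    """alpha_k = 0.1 + 0.8 (k - 1) / (K - 1) for k = 1..K."""
    return [0.1 + 0.8 * (k - 1) / (K - 1) for k in range(1, K + 1)]


def adain(frame, mu_p, sigma_p, eps=1e-5):
    """Assumed standard AdaIN: normalize per channel, then apply the pose-derived scale and shift."""
    mu_v = frame.mean(axis=(0, 1), keepdims=True)    # per-channel mean of the frame
    sigma_v = frame.std(axis=(0, 1), keepdims=True)  # per-channel standard deviation
    return sigma_p * (frame - mu_v) / (sigma_v + eps) + mu_p


def diffuse(frame, mu_p, sigma_p, K=50):
    """V^(k) = (1 - alpha_k) V^(k-1) + alpha_k AdaIN(V^(k-1), P)."""
    for a in alpha_schedule(K):
        frame = (1 - a) * frame + a * adain(frame, mu_p, sigma_p)
    return frame


alphas = alpha_schedule()
rng = np.random.default_rng(0)
mu_p = np.array([0.2, 0.5, 0.8])      # hypothetical stand-in for the MLP output
sigma_p = np.array([1.0, 1.0, 1.0])   # hypothetical stand-in for the MLP output
result = diffuse(rng.normal(size=(8, 8, 3)), mu_p, sigma_p)
```

Early iterations use a small $\alpha_k$ and so stay close to the input frame, while later iterations pull the channel statistics toward the pose-derived targets.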
[0071] S42. The frequency domain features are processed by the wavelet modulation network to obtain the cross-modally modulated low-frequency energy map and the time-frequency enhanced high-frequency residual features, which are then input into the spatial stream.
[0072] Detail injection in the frequency domain stream is achieved through the wavelet modulation network, which uses a depthwise separable convolution operation. SpatialDropout randomly masks spatial regions with probability p = 0.2 during the training phase.
[0073] S43. An optimized video sequence is obtained through pose-driven optical flow fusion optimization based on the PWC-Net optical flow estimation network and differentiable bilinear interpolation, wherein PWC-Net calculates the motion field of adjacent frames in the video sequence, and differentiable bilinear interpolation is used to compute the pose-driven optical flow from the pose sequence.
[0074] The temporal consistency constraint module uses a pre-trained PWC-Net optical flow estimation network φ(·) to calculate the motion field of adjacent frames. The pose-driven optical flow is obtained through differentiable bilinear interpolation. Finally, the optical flow consistency loss is constructed, with a TV regularization coefficient λ = 0.05 used to smooth abnormal motion. The video frames after dual-stream fusion are output after 5 iterations of optimization.
[0075] In this embodiment, the AdaIN pose projection network uses a 3-layer MLP structure (256-128-64 nodes) to extract pose parameters; the depthwise separable convolution kernel size of the wavelet modulation network is 3×3, and the dilation rate is set to 2 to expand the receptive field; the pose difference in the optical flow loss calculation is smoothed through quaternion interpolation, and the interpolation weight coefficient β = 0.7 is obtained through end-to-end learning. The symbols are explained as follows: $V_t^{(k)}$ represents the intermediate video frame in the k-th diffusion iteration; the linear growth strategy of $\alpha_k$ ensures that image fidelity is preserved in the initial stage; the affine parameters are generated from the pose condition; and the normalized coordinate transformation matrix, which maps joint displacements to the image plane, is initialized with camera parameters and fine-tuned through backpropagation.
[0076] S5. Reconstruct cross-scale features using the Laplacian pyramid and enhance high-frequency details using sub-pixel convolution to generate high-fidelity dance videos.
[0077] Specifically, the high-frequency residual features and the enhanced high-frequency features output by the wavelet modulation network are fed into the Laplacian pyramid after cross-scale attention fusion, where they are concatenated with the upsampled low-frequency component $LL_t$ to generate multi-scale features. The sub-pixel convolutional network generates a detail weight map, which enhances the texture details of the original frame's Laplacian residual channel by channel, and finally a high-fidelity video sequence is synthesized.
[0078] Step S5 specifically includes the following steps:
[0079] S51. Construct a four-level Laplacian pyramid from the optimized video sequence;
[0080] A four-level Laplacian pyramid is constructed from the optimized video sequence. Each level is generated from the previous one by bicubic interpolation downsampling, satisfying $V_t^{l+1} = \mathrm{PyrDown}(V_t^{l})$, and the downsampling kernel matrix $K_{down}$ is in two-dimensional separable form with weights [1, 4, 6, 4, 1]/16.
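The downsampling recursion $V_t^{l+1} = \mathrm{PyrDown}(V_t^{l})$ can be sketched with the separable [1, 4, 6, 4, 1]/16 kernel. The sketch blurs and then keeps every second pixel, which is the classic pyramid construction rather than the bicubic resampling mentioned in the text, and it uses reflective borders; both choices are assumptions, as are the function names.

```python
import numpy as np

KERNEL_1D = np.array([1, 4, 6, 4, 1], dtype=float) / 16.0


def blur_separable(img):
    """Apply the 5-tap kernel along columns and then rows (reflective borders assumed)."""
    h, w = img.shape
    padded = np.pad(img, 2, mode="reflect")
    horiz = sum(KERNEL_1D[i] * padded[:, i:i + w] for i in range(5))  # horizontal pass
    return sum(KERNEL_1D[i] * horiz[i:i + h, :] for i in range(5))    # vertical pass


def pyr_down(img):
    """Blur, then keep every second row and column."""
    return blur_separable(img)[::2, ::2]


def build_levels(img, levels=4):
    """Return [V^0, V^1, ..., V^(levels-1)], each half the size of the previous one."""
    out = [img]
    for _ in range(levels - 1):
        out.append(pyr_down(out[-1]))
    return out


pyramid = build_levels(np.full((32, 32), 7.0))
```

Because the kernel weights sum to one, a constant image stays constant at every level, which is a quick sanity check on the filter.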
[0081] S52. By cross-scale attention fusion, the features of each level of the Laplacian pyramid are mapped to a unified dimension to obtain projected features. Construct query vectors key vector Sum value vector Cross-scale association weights are calculated using a multi-head attention mechanism. Synthetic features
[0082] By employing cross-scale attention fusion, features from each level are mapped to a unified dimension d=256 through 3×3 convolution to obtain projected features. Then construct the query vector key vector Sum value vector Where the projection matrix These are learnable parameters. Cross-scale association weights are calculated using a multi-head attention mechanism. Final synthetic features in The hierarchy importance coefficients are initialized to [0.4, 0.3, 0.2, 0.1] and optimized through backpropagation;
[0083] S53. Calculate the Laplacian residual between the original frame and the upsampled low-frequency component, $R_t = V_t^{(5)} - \mathrm{PyrUp}(V_t^{1})$, and recover detail information through the improved sub-pixel convolutional network Deconv(·).
[0084] The high-frequency detail compensation stage calculates the Laplacian residual between the original frame and the upsampled low-frequency component, $R_t = V_t^{(5)} - \mathrm{PyrUp}(V_t^{1})$. Detail information is recovered through the improved sub-pixel convolutional network Deconv(·), whose core operation is $D_t = \mathrm{Conv}_{3\times3}(\mathrm{PixelShuffle}(R_t W_{sub}))$, where $W_{sub}$ is the sub-pixel convolution kernel (magnification factor r = 2). The values of the final output frame are truncated to the range [0, 255].
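The channel bookkeeping of the sub-pixel step can be sketched as follows. The sketch assumes that applying $W_{sub}$ means a 1×1 convolution that expands 3 channels to 3 × r² = 12, and that PixelShuffle uses the usual sub-pixel ordering in which output pixel (h·r + i, w·r + j) of channel c is read from input channel c·r² + i·r + j. The helper names `he_init`, `conv1x1`, and `pixel_shuffle` are hypothetical, and the trailing 3×3 convolution is omitted.

```python
import numpy as np


def he_init(rng, c_out, c_in):
    # He (Kaiming) normal initialization for a 1x1 kernel: std = sqrt(2 / fan_in).
    return rng.standard_normal((c_out, c_in)) * np.sqrt(2.0 / c_in)


def conv1x1(x, weight):
    """1x1 convolution. x: (C_in, H, W); weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def pixel_shuffle(x, r):
    """Rearrange (C*r*r, H, W) into (C, H*r, W*r)."""
    c_rr, h, w = x.shape
    c = c_rr // (r * r)
    x = x.reshape(c, r, r, h, w)      # index order: [c, i, j, h, w]
    x = x.transpose(0, 3, 1, 4, 2)    # index order: [c, h, i, w, j]
    return x.reshape(c, h * r, w * r)


rng = np.random.default_rng(0)
residual = rng.standard_normal((3, 4, 5))   # stand-in for R_t with H=4, W=5
w_sub = he_init(rng, c_out=12, c_in=3)      # 3 * r^2 = 12 channels for r = 2
shuffled = pixel_shuffle(conv1x1(residual, w_sub), r=2)
print(shuffled.shape)                       # three channels at 2H x 2W
```

Choosing 12 intermediate channels is what lets PixelShuffle return to 3 channels while doubling both spatial dimensions.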
[0085] The low-frequency component in this embodiment is the low-frequency component obtained in step S32.
[0086] In this embodiment, the Laplacian pyramid is constructed using a five-tap Gaussian kernel with a standard deviation σ = 1.6. Cross-scale attention employs four heads computed in parallel. The sub-pixel convolutional network has a three-layer structure with channel counts of [32, 64, 32]. The symbol PyrUp(·) represents bilinear interpolation upsampling, and its interpolation coefficient matrix is... The mathematical symbols are defined as follows: $V_t^{l}$ is generated from the initial video frame through l downsampling operations; the weight distribution of $K_{down}$ conforms to the Gaussian second derivative property; $\beta_{l,m}$ reflects the strength of the dependence of level l on level m, modeled parametrically through multi-head attention; $R_t$ characterizes the detail differences between the original image and the low-frequency reconstruction, and its computation uses zero-padding to handle boundaries; $W_{sub}$ has a channel dimension of $3 \times r^2$, which ensures that the number of output channels after PixelShuffle matches the number of input channels.
[0087] The technical process for constructing a four-level Laplacian pyramid from the video sequence in step S5 is as follows:
[0088] With the initial video frame as level 0 of the pyramid, $V_t^{0}$, a progressive downsampling operation is performed using a two-dimensional separable convolution kernel. Specifically, the downsampling kernel matrix $K_{down}$ is constructed from the horizontal kernel $k_h = [1, 4, 6, 4, 1] / 16$ and the vertical kernel $k_v$, and the resolution is halved through convolution with a stride of 2. To generate the l-th pyramid level, reflection padding (padding = 2) is first applied to $V_t^{l-1}$ to handle boundary effects, followed by one-dimensional convolution along the horizontal direction to compute the intermediate features $F_t^{l,inter} = \mathrm{Conv1D}(V_t^{l-1}, k_h)$. The intermediate features are then convolved with the same kernel along the vertical direction to obtain the downsampling result $V_t^{l} = \mathrm{Conv1D}(F_t^{l,inter}, k_v)$. This process is repeated three times, generating a four-level pyramid with progressively decreasing resolution at levels 1, 2, and 3. The equivalent standard deviation of the five-tap Gaussian kernel is σ = 1.6, and the kernel approximates the continuous Gaussian function through discretization of its coefficients. After each pyramid level is constructed, it is restored to the original resolution through bicubic interpolation upsampling, and the Laplacian residual is calculated. PyrUp(·) performs bilinear interpolation with a coefficient matrix, ultimately forming a complete pyramid representation that comprises the low-frequency main structure ($V_t^{3}$) and multi-level high-frequency details.
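The separable downsampling described above can be sketched in NumPy as follows. The sketch assumes the vertical kernel equals the horizontal kernel [1, 4, 6, 4, 1] / 16, that frames are arrays of shape (H, W) or (H, W, C) with even dimensions, and that stride 2 keeps every second sample; the function names `conv1d_stride2`, `pyr_down`, and `build_pyramid` are hypothetical.

```python
import numpy as np

# Five-tap kernel from the description, used here for both directions (assumption).
K_H = np.array([1, 4, 6, 4, 1], dtype=np.float64) / 16.0


def conv1d_stride2(x, kernel, axis):
    """Reflect-pad by 2, filter along `axis` with the 5-tap kernel, keep every 2nd sample."""
    pad = [(0, 0)] * x.ndim
    pad[axis] = (2, 2)
    xp = np.pad(x, pad, mode="reflect")
    n = x.shape[axis]
    out = np.zeros_like(x, dtype=np.float64)
    for i, w in enumerate(kernel):
        # Shifted copies of the padded signal, weighted by the kernel taps.
        out += w * np.take(xp, np.arange(i, i + n), axis=axis)
    return np.take(out, np.arange(0, n, 2), axis=axis)


def pyr_down(frame):
    inter = conv1d_stride2(frame, K_H, axis=1)   # horizontal pass
    return conv1d_stride2(inter, K_H, axis=0)    # vertical pass


def build_pyramid(frame, levels=4):
    """Return [V^0, V^1, ..., V^(levels-1)], each level half the size of the previous."""
    pyramid = [np.asarray(frame, dtype=np.float64)]
    for _ in range(levels - 1):
        pyramid.append(pyr_down(pyramid[-1]))
    return pyramid


frame = np.random.default_rng(0).random((64, 48))
for level, v in enumerate(build_pyramid(frame)):
    print(level, v.shape)
```

Because the kernel weights sum to 1, a constant frame stays constant at every level, which is a convenient sanity check for the padding and stride handling.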
[0089] The specific implementation process of high-frequency detail compensation in step S5 is as follows:
[0090] First, the upsampled low-frequency component $\mathrm{PyrUp}(V_t^{1})$ is subtracted pixel by pixel from the optimized original video frame to generate the Laplacian residual matrix. The upsampling operation uses a bilinear interpolation kernel, and boundary effects are handled through reflection padding. The residual features are input into an improved sub-pixel convolutional network, which first expands the number of channels to $3 \times r^2 = 12$ (magnification factor r = 2) using a 1×1 convolutional layer whose weight matrix $W_{sub}$ is initialized with He initialization. The PixelShuffle operation is then applied to increase the spatial resolution of the feature map to 2H×2W and reduce the channel dimension to 3, forming the intermediate features $F_t^{shuffle}$. To further enhance the ability to express details, a three-layer convolutional structure is used: the first layer is a 3×3 convolution (32 input / output channels, ReLU activation) that extracts local texture features; the second layer is a 5×5 dilated convolution (dilation rate 2, 64 channels) that expands the receptive field; and the third layer is a 1×1 convolution (32 channels, Sigmoid activation) that generates the detail weight map $D_w$. Finally, through a skip connection with the residual features, the detail enhancement result is calculated as $D_t = D_w \odot F_t^{shuffle} + (1 - D_w) \odot \mathrm{Conv}_{3\times3}(F_t^{shuffle})$, where ⊙ denotes channel-by-channel multiplication. The result is downsampled to the original resolution by bicubic interpolation and combined proportionally with the cross-scale fused features, and the output is truncated to the range [0, 255].
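The gated blend $D_t = D_w \odot F_t^{shuffle} + (1 - D_w) \odot \mathrm{Conv}_{3\times3}(F_t^{shuffle})$ and the final truncation can be illustrated with the sketch below. It makes two simplifying assumptions: the learned 3×3 convolution is replaced by a fixed 3×3 mean filter, and the detail weight map is taken to have the same shape as $F_t^{shuffle}$. The names `box3x3`, `gated_detail`, and `to_uint8` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def box3x3(x):
    """Stand-in for the learned Conv3x3: 3x3 mean filter with reflect padding. x: (C, H, W)."""
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)), mode="reflect")
    h, w = x.shape[1:]
    return sum(xp[:, i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0


def gated_detail(f_shuffle, weight_logits):
    """Blend raw and smoothed features with a per-element weight map D_w in (0, 1)."""
    d_w = sigmoid(weight_logits)
    return d_w * f_shuffle + (1.0 - d_w) * box3x3(f_shuffle)


def to_uint8(frame):
    # Round, then truncate to the displayable range [0, 255].
    return np.clip(np.rint(frame), 0, 255).astype(np.uint8)


rng = np.random.default_rng(0)
f_shuffle = rng.random((3, 8, 8)) * 300.0 - 20.0   # deliberately exceeds [0, 255]
logits = rng.standard_normal((3, 8, 8))
frame = to_uint8(gated_detail(f_shuffle, logits))
print(frame.min(), frame.max())
```

Where $D_w$ is close to 1 the unsmoothed detail passes through, and where it is close to 0 the filtered version dominates, which is the behavior the weight map is meant to control.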
[0091] This embodiment integrates multimodal music analysis and image frequency-domain decomposition to construct an end-to-end generative framework. The music beat sequence is extracted with the Librosa toolkit and combined with the 128-dimensional style features encoded by the Jukebox model. A beat-gated attention mechanism dynamically adjusts the fusion weights of the visual and musical features so that strong beat points correspond to large-scale body movements. In the image processing stage, a human body region mask is generated using a graph convolutional network, and the image is decomposed by the Butterworth filter bank into a low-frequency energy map and high-frequency residuals. The low-frequency component encodes the overall motion energy of the body, while the high-frequency component preserves clothing texture and lighting details. Through a dual-stream diffusion mechanism, these two components drive spatial pose generation and detail enhancement, respectively. This technology can be widely applied in scenarios such as real-time virtual idol creation, personalized digital entertainment creation, and film and television special effects pre-visualization.
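A beat-gated cross-modal attention of the kind described above might look like the following sketch. The exact gating formula is not reproduced in this text, so the sketch assumes a gate g_t = Sigmoid(γ · b_t) computed from a binarized beat sequence b_t and a learnable scalar γ, applied to the output of standard scaled dot-product attention in which the queries come from the music style features and the keys and values come from the visual features. All names and dimensions other than the 128-dimensional style features are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def beat_gated_attention(music_feats, visual_feats, beats, w_q, w_k, w_v, gamma):
    """music_feats: (T, 128); visual_feats: (N, C); beats: (T,) with 1 on beat frames.

    Returns the gated attention output (T, d) and the per-frame gate (T, 1).
    """
    q = music_feats @ w_q                 # queries from music style features
    k = visual_feats @ w_k                # keys from high-level visual features
    v = visual_feats @ w_v                # values from high-level visual features
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)   # (T, N)
    gate = sigmoid(gamma * beats)[:, None]                    # assumed gating form
    return gate * (attn @ v), gate


rng = np.random.default_rng(0)
T, N, C, d = 6, 10, 64, 32
music = rng.standard_normal((T, 128))
visual = rng.standard_normal((N, C))
beats = np.array([1, 0, 0, 1, 0, 0], dtype=np.float64)
w_q = rng.standard_normal((128, d)) / np.sqrt(128)
w_k = rng.standard_normal((C, d)) / np.sqrt(C)
w_v = rng.standard_normal((C, d)) / np.sqrt(C)
out, gate = beat_gated_attention(music, visual, beats, w_q, w_k, w_v, gamma=2.0)
print(out.shape, gate.ravel())
```

With a positive γ, frames that fall on a beat receive a larger gate than off-beat frames, so the attended visual features contribute more strongly at strong beat points.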
[0092] Furthermore, the beat-gated cross-modal attention mechanism in this embodiment dynamically adjusts the visual-music feature fusion weights according to the music beat features to achieve millisecond-level action-beat synchronization.
[0093] The frequency-space dual-stream decomposition architecture uses a Butterworth filter bank to separate the low-frequency energy map from the high-frequency residual features, and optimizes pose motion and detail texture separately.
[0094] The local mask generation technique based on graph convolution accurately segments body regions through graph convolutional networks with human topological constraints, thereby improving the generation robustness in occluded scenes.
[0095] A joint confidence-driven weighted fusion strategy, combining a time-domain sliding window and confidence assessment, dynamically corrects abnormal joint motion trajectories.
[0096] The dual-stream diffusion collaborative optimization framework uses parallel iterations of the spatial stream (AdaIN pose guidance) and the frequency-domain stream (wavelet modulation network) to balance global plausibility and local fidelity.
[0097] A cross-scale attention fusion mechanism achieves efficient fusion and detail enhancement of multi-scale features through Laplacian pyramid reconstruction and sub-pixel convolution.
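Among the features listed above, the Butterworth separation of low-frequency and high-frequency content can be illustrated with a frequency-domain sketch. It assumes an isotropic fourth-order Butterworth response applied through the FFT with a normalized cutoff of 0.1 cycles per pixel, and it omits the directional (horizontal, vertical, diagonal) filter bank; the cutoff value and the function names are hypothetical. The body mask is taken as a given array here rather than produced by a graph convolutional network.

```python
import numpy as np


def butterworth_filter(image, cutoff=0.1, order=4, highpass=False):
    """Filter a 2-D image with an isotropic Butterworth response in the frequency domain."""
    h, w = image.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    dist = np.sqrt(fx ** 2 + fy ** 2)
    lowpass = 1.0 / (1.0 + (dist / cutoff) ** (2 * order))
    # The Butterworth high-pass response equals 1 minus the low-pass response.
    response = 1.0 - lowpass if highpass else lowpass
    return np.real(np.fft.ifft2(np.fft.fft2(image) * response))


def split_by_mask(image, body_mask, cutoff=0.1, order=4):
    """Low-frequency map from the body region, high-frequency residual from the rest."""
    low = butterworth_filter(image * body_mask, cutoff, order, highpass=False)
    high = butterworth_filter(image * (1.0 - body_mask), cutoff, order, highpass=True)
    return low, high


rng = np.random.default_rng(0)
image = rng.random((32, 32))
mask = np.zeros((32, 32))
mask[8:24, 12:20] = 1.0                      # toy body region
low_map, high_res = split_by_mask(image, mask)
print(low_map.shape, high_res.shape)
```

For a single image without masking, the low-pass and high-pass outputs sum back to the input, so the split loses no information before the two streams process them separately.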
[0098] It should be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Furthermore, in the accompanying drawings of the device embodiments provided by this invention, the connection relationships between modules indicate that they have communication connections, which can be specifically implemented as one or more communication buses or signal lines. Those skilled in the art can understand and implement this without any creative effort.
[0099] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A method for generating dance videos based on multimodal music driving and frequency-spatial dual-stream decomposition, characterized in that it comprises:
S1. Obtaining the beat sequence and the music style features of the music; extracting the low-level texture features and the high-level semantic features of the reference image; generating a query vector Q from the music style features, a key K and a value V from the high-level visual semantic features, and a dynamic gating factor from the beat sequence; calculating the temporal attention result; and concatenating the temporal attention result with the upsampled low-level texture features to obtain the multimodal alignment features;
S2. Performing spatiotemporal windowing on the multimodal alignment features, and using a part-specific Transformer decoder to predict the pose sequence within each window; performing discrete wavelet decomposition on the pose sequence within the window to obtain low-frequency components and high-frequency components; optimizing the pose prediction based on the frequency-domain consistency loss between the high-frequency components and the music beat features; and computing a weighted average of the window prediction results to obtain the global pose sequence;
S3. Generating a body region mask by processing the global pose sequence with a graph convolutional network; using the Butterworth filter bank to process the body region in the reference image to extract a low-frequency energy map; and extracting high-frequency residual features from the non-body regions;
S4. Using a wavelet modulation network to process the low-frequency energy map and the high-frequency residual features obtained in step S3, to obtain the cross-modally modulated low-frequency energy map and the time-frequency-enhanced high-frequency residual features; and optimizing the original video sequence by combining the spatial stream, the frequency stream, and optical flow consistency constraints to obtain the optimized video sequence; wherein step S4 specifically includes the following steps: S41. processing the initial video sequence based on AdaIN to obtain the first optimized video sequence; S42. using the wavelet modulation network to process the frequency-domain features to obtain the cross-modally modulated low-frequency energy map and the time-frequency-enhanced high-frequency residual features, the modulated low-frequency energy map and the high-frequency residual features being respectively input into the spatial stream; S43. performing temporal consistency optimization on the first optimized video sequence through optical flow to obtain the temporally optimized video sequence;
S5. Constructing a four-level Laplacian pyramid from the optimized video sequence, and performing cross-scale attention fusion on the features of each pyramid level to obtain cross-scale fused features; calculating the Laplacian residual; recovering the detail enhancement result from the residual through the sub-pixel convolutional network; and combining the detail enhancement result proportionally with the cross-scale fused features to output the high-fidelity dance video, wherein the term subtracted in the Laplacian residual is the upsampling result of the first low-frequency layer of the Laplacian pyramid.
2. The method according to claim 1, characterized in that step S1 specifically includes: S11. Performing dual feature encoding on the music waveform signal: using the Librosa toolkit to extract the binarized beat sequence, and using the Transformer encoder of the Jukebox pre-trained model to extract the music style features, where T represents the total number of video frames; S12. Extracting the low-level texture features and the high-level visual semantic features of the reference image in the video, and using a gated cross-modal attention mechanism to concatenate the temporal attention result with the low-level texture features upsampled by bilinear interpolation to output the multimodal alignment features; specifically: generating a query vector Q from the music style features, generating a key K and a value V from the high-level visual semantic features, generating a dynamic gating factor from the beat sequence, and calculating the temporal attention result; a beat-gated cross-modal attention mechanism is constructed based on the beat features to dynamically adjust the weights for music-visual feature fusion, in which the gating factor is computed with the Sigmoid activation function and a learnable gating parameter, and t is the time corresponding to the t-th frame after alignment with the video frame rate.
3. The method according to claim 2, characterized in that, in step S12, using the gated cross-modal attention mechanism to concatenate the temporal attention result with the low-level texture features to output the multimodal alignment features specifically involves: using learnable parameter matrices to project the music style features into the query vector Q and to map the high-level visual semantic features to the key K and the value V; generating the dynamic gating factor from the beat features and using it to modulate the attention weights of each frame; and finally concatenating the temporal attention result with the low-level texture features to output the multimodal alignment features, where d is the feature projection dimension and the transposed terms are the transposes of the query and the key, respectively.
4. The method according to claim 3, characterized in that step S2 specifically includes: S21. Dividing the multimodal alignment features into spatiotemporal windows to obtain a sequence of overlapping windows, each of which is compressed to 256 channel dimensions using a 1×1 convolution, where T indicates the total number of frames in the video, the spatial dimension is that obtained after downsampling, L is the length of each window, and KT is the number of spatiotemporal windows; S22. Predicting joint motion trajectories using the part-specific Transformer, which employs a four-head attention mechanism to handle different body parts; the query vector is generated from the position-encoded features with a projection matrix, and the key-value pairs are obtained from the context features through projection mappings; the four-head attention results are fused by a fully connected layer, and the output is the pose prediction within the window, where i indexes each position in the input sequence, the position-encoded features are obtained from the compressed window features combined with positional encoding, and the context features are obtained from the compressed window features; S23. Decomposing the pose sequence of each window into low-frequency components and high-frequency components by discrete wavelet transform; calculating the frequency-domain consistency loss between the short-time Fourier transform of the high-frequency components and the music beat features; and, for each frame t, computing the weighted average of the prediction results of all windows containing that frame to obtain the global pose sequence, where the weighting coefficient is that of the k-th spatiotemporal window at time step t.
5. The method according to claim 4, characterized in that step S3 specifically includes the following steps: S31. Modeling the human body topology of the pose sequence using a graph convolutional network to generate the body region mask; S32. Weighting the reference image using the body region mask; obtaining the low-frequency component from the body region; processing the non-body region with a fourth-order Butterworth high-pass filter bank to obtain high-frequency components in three directions (horizontal, vertical, and diagonal); generating the low-frequency energy map from the low-frequency component through 3×3 convolution and layer normalization; and concatenating the high-frequency components along the channel dimension, compressing them to 32 dimensions by 1×1 convolution, and applying instance normalization to finally generate the high-frequency residual map.
6. The method according to claim 5, characterized in that the pose projection network of AdaIN uses a three-layer MLP structure.
7. The method according to claim 6, characterized in that the depthwise separable convolution kernel of the wavelet modulation network has a size of 3×3 and a dilation rate of 2.
8. The method according to claim 7, characterized in that step S5 specifically includes the following steps: S51. Constructing a four-level Laplacian pyramid from the optimized video sequence; S52. Through cross-scale attention fusion, mapping the features of each level of the Laplacian pyramid to a unified dimension to obtain projected features; constructing the query, key, and value vectors; calculating the cross-scale association weights using a multi-head attention mechanism; and synthesizing the fused features; S53. Calculating the Laplacian residual between the original frame and the upsampled low-frequency layer of the first pyramid level, and recovering detail information through the improved sub-pixel convolutional network, where the input of the sub-pixel convolutional network is the Laplacian residual and its output is the detail enhancement result. The improved sub-pixel convolutional network first expands the number of channels to 12 through a 1×1 convolutional layer whose weight matrix is initialized with He initialization. The PixelShuffle operation is then applied to increase the spatial resolution of the feature map and reduce the channel dimension to 3, forming the intermediate features. To further enhance the ability to express details, a three-layer convolutional structure is used: the first layer is a 3×3 convolution that extracts local texture features; the second layer is a 5×5 dilated convolution that expands the receptive field; and the third layer is a 1×1 convolution that generates a detail weight map. Finally, the detail enhancement result is calculated through a skip connection with the residual features, using channel-wise multiplication; the result is downsampled to the original resolution by bicubic interpolation, combined proportionally with the cross-scale fused features, and the output is truncated to the range [0, 255]; where m is the head index in multi-head attention; t is the time step or position index; l is the Laplacian pyramid scale index; the query projection matrix corresponds to the l-th scale; the key and value projection matrices are used to generate the key and the value, respectively; the transposed term is the transpose of the corresponding matrix; d is the projection dimension of the query vector and the key vector in the attention mechanism; and the remaining coefficients are the weighting coefficients associated with each head.