Method, medium and system for automatically generating marine forecast digital human videos
By constructing a parallel scheduling algorithm based on a task conflict graph and by encoding audio and video synchronously, the invention addresses the low GPU resource utilization in generating digital human videos for marine forecasting, reduces video generation latency, and enables near real-time broadcasting.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIHAI FORECASTING CENT OF STATE OCEANIC ADMINISTRATION ((QINGDAO MARINE FORECASTING STATION OF STATE OCEANIC ADMINISTRATION) (QINGDAO MARINE ENVIRONMENT MONITORING CENT OF STATE OCEANIC ADMINISTRATION))
- Filing Date
- 2026-04-20
- Publication Date
- 2026-06-23
Figure CN122069415B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of marine forecasting digital human technology, specifically, it relates to a method, medium and system for automatically generating marine forecasting digital human videos.
Background Technology
[0002] The automatic generation technology for digital human videos in marine forecasting integrates multiple technical aspects, including numerical oceanographic model output processing, visualization rendering, text-to-speech synthesis, and digital human-driven processes. In current marine forecasting systems, rendering subtasks such as digital human face rendering, marine background vector field rendering, chart animation rendering, and subtitle synthesis are typically executed on the GPU in a serial or simple parallel manner. The lack of systematic modeling of data dependencies and memory sharing relationships between these subtasks leads to frequent conflicts and waiting during task scheduling, making it difficult to fully utilize GPU resources.
[0003] In current multi-GPU rendering scheduling practices, the common approach is to allocate rendering tasks one by one based on task priority or a fixed order. This approach ignores the complexity of the topological dependencies between tasks and cannot systematically eliminate parallel conflict bubbles in multi-task concurrent scenarios.
[0004] In existing technologies, due to the large number of rendering subtasks and their complex dependencies, serial or simple parallel scheduling methods cannot accurately identify and synchronously allocate conflict-free task sets to different GPUs for execution. This results in high overall video generation latency, making it difficult to meet the release window constraints for near real-time marine early warning broadcasts. In other words, existing technologies suffer from low GPU resource utilization and excessively long video generation latency due to data dependencies and memory contention in the multimodal rendering tasks during the generation of marine forecast digital human videos.
Summary of the Invention
[0005] In view of this, the present invention provides a method, medium and system for automatically generating digital human videos for marine forecasting, which can solve the technical problems in the prior art of low GPU resource utilization and excessive video generation latency caused by data dependence and memory contention in the multimodal rendering task during the generation of digital human videos for marine forecasting.
[0006] The present invention is implemented as follows: The first aspect of the present invention provides a method for automatically generating digital human videos for marine forecasting, comprising the following steps:
[0007] Obtain the forecast field file output by the numerical ocean model, extract the ocean element field, perform a difference operation with the previous forecast field to obtain the difference forecast field, and push the difference forecast field to the message queue.
[0008] The message queue triggers the visualization microservice to perform local re-rendering on the differential forecast field, merges the pre-rendered keyframe template with the local re-rendering result, and generates a dynamic image frame sequence of the ocean background. At the same time, it triggers the text generation microservice to output the forecast text.
[0009] The forecast text is input into the text-to-speech synthesis service to obtain the Mel spectrum sequence and phoneme-aligned timestamps. The Mel spectrum sequence, phoneme-aligned timestamps, digital human face 3D key point sequence, ocean background dynamic image frames and speaker identity embedding vector are input into the acoustic-visual collaborative cross-modal alignment diffusion network, which outputs the frame-by-frame lip shape key point sequence.
[0010] The rendering subtasks are divided into conflict-free parallel batches based on a multi-GPU rendering task conflict-free parallel scheduling algorithm based on graph coloring problem mapping. Digital human face rendering, ocean background vector field rendering, chart animation rendering and caption compositing are executed in parallel on multiple GPUs according to batch number, and precise synchronization between batches is achieved through CUDA streaming concurrency mechanism.
[0011] Calculate the lip consistency score, and adjust the number of denoising steps of the acoustic-visual collaborative cross-modal alignment diffusion network based on the lip consistency score by calling the dynamic step size adjustment function. Then, encode each rendering result and the speech in audio-video synchronization, generate a digital human prediction video, and push it to the release queue.
[0012] The system monitors the arrival timestamps of differential forecast fields in the message queue and the completion timestamps of videos in the publishing queue, calculates end-to-end latency, and reduces the rendering resolution of the visualization microservice or the streamline density of the ocean background vector field rendering when the end-to-end latency exceeds the latency threshold, in order to ensure that the forecast video is published within the publishing window.
[0013] The difference forecast field refers to the grid-by-grid difference field between the current forecast field and the previous forecast field. Local re-rendering is triggered only in grid areas where the absolute value of the difference exceeds the change threshold. The change threshold is obtained by performing percentile statistical analysis on the difference distribution of historical forecast fields.
[0014] Specifically, the determination of the change threshold involves performing percentile analysis on the absolute value of the grid point difference of the difference field over several consecutive days, and taking the 85th percentile value as the change threshold.
[0015] The message queue is implemented using a distributed stream processing middleware based on the publish-subscribe pattern. Each microservice acts as an independent subscriber and asynchronously consumes the forecast field data in the message queue, decoupling the serial process into a parallel pipeline.
[0016] The pre-rendered keyframe template refers to a background layer pre-generated offline for the typical spatial distribution of the ocean element field, and in the real-time processing stage, only the local area corresponding to the differential forecast field is incrementally overlaid and rendered.
[0017] The acoustic-visual collaborative cross-modal alignment diffusion network uses a variant of the denoising diffusion probabilistic model as its backbone, and the noise prediction network adopts a dual-stream U-Net architecture. The audio stream branch takes the Mel spectrum sequence as input, and the visual stream branch takes the digital human face 3D key point sequence as input. The two branches are fused at the bottleneck layer of the U-Net through a cross-modal cross-attention mechanism.
[0018] Specifically, the cross-modal cross-attention mechanism uses audio stream features as the Query matrix and visual stream features as the Key and Value matrices to perform forward cross-attention. Simultaneously, visual stream features are used as the Query matrix and audio stream features as the Key and Value matrices to perform reverse cross-attention. The two attention outputs are then added together and sent to the decoder.
[0019] In the acoustic-visual collaborative cross-modal alignment diffusion network, a topological regularization layer is introduced to apply non-degeneracy constraints to the triangular mesh formed by facial key points. The speaker identity embedding vector is mapped by a fully connected layer and then injected into the normalization parameters of each U-Net layer as an affine transformation.
[0020] Specifically, the multi-GPU rendering task conflict-free parallel scheduling algorithm based on graph coloring problem mapping models the rendering subtasks as a set of vertices in a graph. Edges are added between subtasks that have data dependencies or share video memory resources to form a task conflict graph. An improved DSatur heuristic coloring algorithm is executed on the task conflict graph, and sets of vertices with the same color form conflict-free parallel batches.
[0021] Specifically, the DSatur heuristic coloring algorithm prioritizes coloring the vertex with the highest saturation at each step. Saturation is defined as the number of colors that have been used by adjacent vertices for that vertex. When saturations are the same, the vertex degree is used as a secondary sorting criterion to gradually assign each vertex the smallest color number that does not conflict with adjacent vertices.
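The coloring rule described above can be sketched in a few lines. This is a minimal illustration of plain DSatur, not the "improved" variant the text refers to, whose modifications are not specified. The task names, the example edges, and the alphabetical tie-break are assumptions made only for this sketch.

```python
from typing import Dict, List, Set


def dsatur_batches(conflicts: Dict[str, Set[str]]) -> List[List[str]]:
    """Color a task conflict graph with DSatur and group tasks by color.

    `conflicts` maps each task to the set of tasks it conflicts with
    (the graph is assumed to be symmetric).
    """
    color: Dict[str, int] = {}
    while len(color) < len(conflicts):
        def priority(v: str):
            # Saturation: number of distinct colors already used by neighbors.
            saturation = len({color[n] for n in conflicts[v] if n in color})
            # Vertex degree is the secondary sorting criterion.
            return (saturation, len(conflicts[v]))

        # Sorting makes ties deterministic (assumption: alphabetical order).
        uncolored = sorted(v for v in conflicts if v not in color)
        vertex = max(uncolored, key=priority)

        # Assign the smallest color number not used by any neighbor.
        used = {color[n] for n in conflicts[vertex] if n in color}
        c = 0
        while c in used:
            c += 1
        color[vertex] = c

    batches: List[List[str]] = [[] for _ in range(max(color.values(), default=-1) + 1)]
    for v in sorted(color):
        batches[color[v]].append(v)
    return batches


# Hypothetical conflict graph: the edges are illustrative, not taken from the text.
example = {
    "face": {"subtitle"},
    "background": {"chart"},
    "chart": {"background", "subtitle"},
    "subtitle": {"face", "chart"},
}
print(dsatur_batches(example))  # [['chart', 'face'], ['background', 'subtitle']]
```

Each inner list is one batch whose members share no edge in the conflict graph, so they could be dispatched to different GPUs at the same time.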
[0022] Specifically, the dynamic step size adjustment function adjusts the number of denoising steps according to the range of the lip consistency score. After adjustment, it re-infers and updates the lip consistency score, and repeats the process until the lip consistency score reaches the qualified threshold or the number of adjustments reaches the upper limit of the number of adjustments.
[0023] The lip shape consistency score is obtained by normalizing and inverting the average Euclidean distance between the frame-by-frame predicted lip shape keypoint sequence and the reference keypoint sequence. The normalized baseline value is obtained by statistically analyzing the average Euclidean distance of all frame pairs on the validation set.
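One possible reading of "normalizing and inverting" is sketched below. It assumes the score is `1 - d / baseline`, clipped at zero, where `d` is the mean Euclidean distance over all key points and frames; the exact formula is not given in the text, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]
Frames = Sequence[Sequence[Point]]  # frames -> key points -> (x, y, z)


def mean_keypoint_distance(pred: Frames, ref: Frames) -> float:
    """Average Euclidean distance over all key points of all frame pairs."""
    total, count = 0.0, 0
    for pred_frame, ref_frame in zip(pred, ref):
        for p, r in zip(pred_frame, ref_frame):
            total += math.dist(p, r)
            count += 1
    return total / count


def lip_consistency_score(pred: Frames, ref: Frames, baseline: float) -> float:
    """Assumed form: normalize by the validation-set baseline, then invert."""
    d = mean_keypoint_distance(pred, ref)
    return max(0.0, 1.0 - d / baseline)


ref = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
pred = [[(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]]  # distances 3 and 4, mean 3.5
print(lip_consistency_score(pred, ref, baseline=7.0))  # 0.5
```

Under this assumed form, identical sequences score 1.0 and a mean distance equal to the baseline scores 0.0.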
[0024] Specifically, reducing the rendering resolution involves gradually lowering the rendering resolution of the ocean background dynamic image frame according to the resolution levels, recalculating the prediction latency after each reduction, until the prediction latency is lower than the latency threshold; if the resolution is reduced to the lowest level but still exceeds the latency threshold, the streamline density of the ocean background vector field rendering is simultaneously reduced gradually according to the step size ratio.
[0025] Specifically, the training of the acoustic-visual collaborative cross-modal alignment diffusion network involves using a weighted sum of mean squared error loss, dynamic time warping soft alignment loss, and topological regularization loss as the total loss, and employing a cosine annealing learning rate strategy for multiple rounds of training. The checkpoint with the highest lip consistency score on the validation set is then used as the final model weight.
[0026] The statistical analysis of the change threshold uses data from 30 consecutive days. The base value for the number of denoising steps is 50, the adjustment step size is 10 steps, the upper limit of the number of adjustments is 3, the latency threshold is 120 seconds, the streamline density reduction step size is 20%, and the resolution levels are 4K, 2K, and 1080P.
[0027] A second aspect of the present invention provides a computer-readable storage medium storing program instructions that, when executed in a computer, are used to perform the above-described method for automatically generating digital human videos for marine forecasting.
[0028] A third aspect of the present invention provides an automatic generation system for digital human videos for marine forecasting, comprising the aforementioned computer-readable storage medium, wherein the system is a computer, the computer-readable storage medium is disposed within the system, and the system is provided with a microprocessor for executing program instructions stored in the computer-readable storage medium.
[0029] This invention models rendering subtasks as task conflict graphs and employs an improved DSatur heuristic coloring algorithm based on graph coloring problem mapping to perform conflict-free parallel scheduling of multi-GPU rendering tasks. It divides the set of conflict-free tasks with the same color into parallel batches and combines the CUDA streaming concurrency mechanism to achieve strict parallelism within batches and synchronization between batches. This solves the technical problem of low GPU resource utilization caused by data dependency and memory contention in multimodal rendering tasks.
[0030] Graph coloring methods transform the task scheduling problem into a combinatorial optimization problem. The DSatur heuristic strategy prioritizes coloring the vertices with the highest saturation, ensuring that the task with the tightest conflict constraints receives an independent color group first. This fundamentally avoids the phenomenon of conflict-free tasks being forced to wait sequentially, thus systematically improving the utilization of multiple GPUs.
[0031] In summary, this invention solves the technical problems mentioned in the background art, such as low GPU resource utilization and excessively long video generation latency caused by data dependence and memory contention in the multimodal rendering task during the generation of digital human videos for marine forecasting.
Attached Figure Description
[0032] Figure 1 is a flowchart of the method of the present invention.
[0033] Figure 2 is a schematic diagram showing the alignment between the frame-by-frame lip key point displacement output by the acoustic-visual collaborative cross-modal alignment diffusion network and the phoneme boundaries.
[0034] Figure 3 illustrates the end-to-end latency distribution and the triggering of adaptive resolution degradation during a 7-day continuous test.
[0035] Figure 4 is a screenshot of the digital human prediction video webpage in Example 2.
Detailed Implementation
[0036] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below.
[0037] Figure 1 shows a flowchart of the method for automatically generating digital human videos for marine forecasting provided by the first aspect of this invention. The method includes the following steps:
[0038] S01. Obtain the NetCDF format forecast field file output by the numerical ocean model, extract the sea surface temperature, current field, wave height and other element fields, perform difference operation with the previous forecast field to obtain the difference forecast field, and push the difference forecast field to the message queue.
[0039] S02. The message queue triggers the visualization microservice to perform local re-rendering on the differential forecast field, merges the pre-rendered keyframe template with the local re-rendering result, and generates a dynamic image frame sequence of the ocean background. At the same time, it triggers the text generation microservice to output the forecast text.
[0040] S03. Input the forecast text into the text-to-speech synthesis service to obtain the Mel spectrum sequence and phoneme-aligned timestamps. Input the Mel spectrum sequence, phoneme-aligned timestamps, digital human face 3D key point sequence, ocean background dynamic image frames and speaker identity embedding vector into the acoustic-visual collaborative cross-modal alignment diffusion network to output the frame-by-frame lip shape key point sequence.
[0041] S04. The rendering subtasks in the task conflict graph are divided into conflict-free parallel batches according to the multi-GPU rendering task conflict-free parallel scheduling algorithm based on graph coloring problem mapping. Digital human face rendering, ocean background vector field rendering, chart animation rendering and subtitle synthesis are executed in parallel on multiple GPUs according to the batch number. Precise synchronization between batches is achieved through CUDA streaming concurrency mechanism.
[0042] S05. Calculate the lip consistency score, and adjust the number of denoising steps of the acoustic-visual collaborative cross-modal alignment diffusion network by calling the dynamic step size adjustment function according to the score range. Then, encode each rendering result and the speech in audio-video synchronization to generate a digital human prediction video and push it to the release queue.
[0043] S06. Monitor the arrival timestamps of differential forecast fields in the message queue and the completion timestamps of videos in the publishing queue, calculate the end-to-end latency, and when the end-to-end latency exceeds the latency threshold, reduce the rendering resolution of the visualization microservice or reduce the streamline density of the ocean background vector field rendering to ensure that the forecast video is published within the publishing window.
[0044] The difference forecast field refers to the grid-by-grid difference field between the current forecast field and the previous forecast field. Local re-rendering is triggered only in grid areas where the absolute value of the difference exceeds a preset change threshold. The preset change threshold is obtained by statistical analysis of the difference distribution of historical forecast fields. Specifically, the absolute value of the grid difference across the entire field of the hourly difference field for 30 consecutive days is analyzed as a percentile, and the 85th percentile value is taken as the threshold, so that about 15% of the significantly changing areas are identified as areas that need to be re-rendered.
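The thresholding described in this paragraph can be sketched as follows. The sketch uses plain Python lists in place of real gridded fields and a nearest-rank percentile; the text does not say which percentile definition is used, so that choice and all function names are assumptions.

```python
import math
from typing import List

Grid = List[List[float]]


def difference_field(current: Grid, previous: Grid) -> Grid:
    """Grid-by-grid difference between the current and previous forecast fields."""
    return [[c - p for c, p in zip(cur_row, prev_row)]
            for cur_row, prev_row in zip(current, previous)]


def percentile(values: List[float], q: float) -> float:
    """Nearest-rank percentile (assumption: the source does not name a method)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def change_threshold(history: List[Grid], q: float = 85.0) -> float:
    """Percentile of |difference| over all grid points of all historical difference fields."""
    abs_values = [abs(v) for field in history for row in field for v in row]
    return percentile(abs_values, q)


def rerender_mask(diff: Grid, threshold: float) -> List[List[bool]]:
    """True where the absolute difference exceeds the change threshold."""
    return [[abs(v) > threshold for v in row] for row in diff]


# Toy history: two 1x10 difference fields whose absolute values are 1..20.
history = [[[float(i) for i in range(1, 11)]],
           [[-float(i) for i in range(11, 21)]]]
threshold = change_threshold(history)          # 17.0 with nearest-rank
print(rerender_mask([[18.0, -3.0], [17.0, -19.5]], threshold))
# [[True, False], [False, True]]
```

With the 85th percentile as the cut-off, roughly the top 15% of grid points by absolute change are flagged for local re-rendering, which matches the proportion stated above.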
[0045] The message queue is implemented using a distributed stream processing middleware based on the publish-subscribe pattern. Each microservice acts as an independent subscriber and asynchronously consumes the forecast field data in the message queue. The processing delay of any microservice does not block the parallel progress of other microservices, thereby decoupling the serial process into a parallel pipeline.
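The decoupling effect of the publish-subscribe pattern can be illustrated with an in-process stand-in. The `Broker` class below is hypothetical and only mimics the behavior described above with one queue per subscriber; an actual deployment would use the distributed stream processing middleware, which the text does not name.

```python
from collections import deque
from typing import Deque, Dict, Optional


class Broker:
    """Toy publish-subscribe broker: every subscriber gets its own queue."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[dict]] = {}

    def subscribe(self, name: str) -> None:
        self._queues[name] = deque()

    def publish(self, message: dict) -> None:
        # Each subscriber receives the message independently of the others.
        for queue in self._queues.values():
            queue.append(message)

    def poll(self, name: str) -> Optional[dict]:
        queue = self._queues[name]
        return queue.popleft() if queue else None

    def backlog(self, name: str) -> int:
        return len(self._queues[name])


broker = Broker()
broker.subscribe("visualization")
broker.subscribe("text_generation")
broker.publish({"cycle": 1})
broker.publish({"cycle": 2})

# The text generation subscriber drains its queue while visualization lags behind.
while broker.poll("text_generation") is not None:
    pass
print(broker.backlog("visualization"), broker.backlog("text_generation"))  # 2 0
```

Because each subscriber has its own queue, a slow consumer accumulates a backlog without delaying the others, which is the property the paragraph relies on.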
[0046] The pre-rendered keyframe template refers to a background layer that is pre-generated offline for the typical spatial distribution of ocean element fields. During the real-time processing stage, only the local area corresponding to the differential forecast field is incrementally overlaid and rendered to avoid re-rendering the entire field each time, thereby reducing the amount of real-time rendering computation.
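A minimal sketch of the incremental overlay, assuming the template, the re-rendered values and the change mask are all grids of the same shape. Real frames would be image arrays; nested lists and the function name are used here only for illustration.

```python
from typing import List

Frame = List[List[float]]
Mask = List[List[bool]]


def overlay_patch(template: Frame, rerendered: Frame, mask: Mask) -> Frame:
    """Keep template pixels except where the mask marks a re-rendered cell."""
    return [
        [new if changed else old
         for old, new, changed in zip(t_row, r_row, m_row)]
        for t_row, r_row, m_row in zip(template, rerendered, mask)
    ]


template = [[1.0, 1.0], [1.0, 1.0]]
rerendered = [[9.0, 9.0], [9.0, 9.0]]
mask = [[False, True], [False, False]]
print(overlay_patch(template, rerendered, mask))  # [[1.0, 9.0], [1.0, 1.0]]
```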
[0047] The specific structure of the acoustic-visual collaborative cross-modal alignment diffusion network is as follows. A variant of the denoising diffusion probabilistic model serves as the backbone, and the noise prediction network employs a dual-stream U-Net architecture. The audio stream branch takes the Mel spectrum sequence as input and extracts temporal acoustic features through multi-layer convolution and multi-head self-attention encoders. The visual stream branch takes the 3D key point sequence of the digital human face as input and extracts temporal geometric features of the face through a graph convolutional network and positional encoding. The outputs of the two encoders are fused at the bottleneck layer of the U-Net through a cross-modal cross-attention mechanism: forward cross-attention uses the audio stream features as the query matrix and the visual stream features as the key and value matrices, while reverse cross-attention uses the visual stream features as the query matrix and the audio stream features as the key and value matrices. The sum of the two attention outputs is then fed into the subsequent decoder. In each denoising step of the reverse diffusion process, the phoneme-aligned timestamp embedding is used as a time-step conditional vector, which is concatenated with the denoising step index and input into the noise prediction network to synchronize the denoising iteration with the phoneme boundaries. In each denoising step, phoneme constraint correction is also performed on the predicted lip shape key point positions. A topological regularization layer is introduced into the network to apply non-degeneracy constraints to the triangular mesh formed by the facial key points, preventing the geometric topology from degenerating under extreme lip shapes. 
The speaker identity embedding vector is fused at the conditional input: it is mapped by a fully connected layer and injected into the normalization parameters of each U-Net layer as an affine transformation, so that a single model can drive different digital human images. The dynamic image frames of the ocean background are processed by a lightweight convolutional encoder to extract background context features, which are additively fused into the U-Net decoder so that the output lip shape key points remain consistent with the background scene in lighting and style. The network output is the frame-by-frame facial 3D key point displacement, which is superimposed on the basic pose of the digital human to obtain the frame-by-frame lip shape key point sequence.
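The bidirectional fusion at the bottleneck can be illustrated with a dependency-free sketch. Several simplifications are assumptions of this sketch rather than details from the text: scaled dot-product attention is used, the learned query, key and value projections are omitted, and the audio and visual feature sequences are assumed to have the same length and width so that the two outputs can be added.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query: Matrix, key: Matrix, value: Matrix) -> Matrix:
    """Scaled dot-product attention without learned projections."""
    d = len(query[0])
    out: Matrix = []
    for q in query:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in key]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, value))
                    for j in range(len(value[0]))])
    return out


def bidirectional_cross_attention(audio: Matrix, visual: Matrix) -> Matrix:
    """Forward (audio queries visual) plus reverse (visual queries audio)."""
    if len(audio) != len(visual):
        raise ValueError("sketch assumes both streams have the same number of frames")
    forward = attention(audio, visual, visual)
    reverse = attention(visual, audio, audio)
    return [[f + r for f, r in zip(f_row, r_row)]
            for f_row, r_row in zip(forward, reverse)]


audio = [[1.0, 0.0], [0.0, 1.0]]
visual = [[0.0, 1.0], [1.0, 0.0]]
fused = bidirectional_cross_attention(audio, visual)
print(len(fused), len(fused[0]))  # 2 2
```

The length check makes the shape requirement explicit: summing the two attention outputs only works when both streams are aligned to the same frame count.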
[0048] The steps for establishing the training dataset of the acoustic-visual collaborative cross-modal alignment diffusion network specifically include: collecting real broadcast videos in marine forecasting scenarios, extracting corresponding Mel spectrum sequences, phoneme alignment timestamps, and facial 3D key point sequences; using manual annotation tools to finely correct phoneme boundaries; generating pseudo-labels for the facial 3D key point sequences using a 3D face reconstruction algorithm; collecting diverse identity samples from no fewer than 50 broadcasters and extracting speaker identity embedding vectors; performing domain text-to-speech synthesis data augmentation on the marine forecast text to ensure coverage of marine professional terminology pronunciation; and finally constructing a training sample set containing audio, key points, timestamps, background frames, and identity vector quintuples.
[0049] The specific steps for training the acoustic-visual collaborative cross-modal alignment diffusion network include: calculating the difference between the predicted keypoint displacement and the actual keypoint displacement using mean squared error loss; measuring the alignment degree between the predicted lip shape sequence and the phoneme timestamp using dynamic time warping soft alignment loss, with the weights of the dynamic time warping soft alignment loss determined by grid search on the validation set; constraining the non-degeneracy of triangulation using topological regularization loss; weighted summing of the three losses to obtain the total loss; and using a cosine annealing learning rate strategy for multiple rounds of training, evaluating the lip shape consistency score on the validation set in each round, and taking the checkpoint with the highest lip shape consistency score as the final model weight.
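The weighted total loss and the learning rate schedule described above can be written compactly. The symbols are introduced here for illustration only and do not appear in the source: the weights \lambda are the loss weights (the text states that the dynamic time warping weight is chosen by grid search), and the schedule is the commonly used form of cosine annealing with hypothetical bounds \eta_{\min} and \eta_{\max} over T training steps.

```latex
\[
\mathcal{L}_{\text{total}}
  = \lambda_{\text{mse}}\,\mathcal{L}_{\text{mse}}
  + \lambda_{\text{dtw}}\,\mathcal{L}_{\text{dtw}}
  + \lambda_{\text{topo}}\,\mathcal{L}_{\text{topo}}
\]
\[
\eta_t = \eta_{\min}
  + \tfrac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)
    \left(1 + \cos\frac{\pi t}{T}\right),
  \qquad 0 \le t \le T
\]
```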
[0050] The acoustic-visual collaborative cross-modal alignment diffusion network brings the following technical effects to the solution: the cross-modal cross-attention mechanism enables fine-grained alignment between audio temporal rhythms and facial geometric movements in the feature space; the phoneme timestamp constraint corrects the lip shape key points in each step of the reverse diffusion, making the generated lip shape boundaries highly consistent with the phoneme switching moments; the topological regularization layer ensures that the mesh does not degenerate under extreme lip shapes, so the output geometry remains continuous and stable; the speaker identity embedding injection enables a single model to drive multiple digital human images, avoiding separate training for each image; the background context fusion keeps the lip-driven results consistent with the ocean background in lighting and style, improving the naturalness and professional credibility of the broadcast video as a whole.
[0051] The dynamic step size adjustment function is described as follows. The lip shape consistency score, denoted as ..., is obtained by normalizing and inverting the average Euclidean distance between the frame-by-frame predicted lip shape key point sequence and the reference key point sequence. The normalization baseline value is obtained by calculating the average Euclidean distance of all frame pairs on the validation set. The number of denoising steps is denoted as ...; its baseline value is set to 50, determined through experiments on the validation set that sweep integer step counts from 20 to 100 with the lip consistency score as the indicator. When the score satisfies ..., add 10 steps; when ..., keep the base value unchanged; when ..., reduce by 10 steps; when ..., reduce by 20 steps. After each adjustment, inference is re-run and the score is updated. The process is executed in a loop until the score reaches the qualified threshold or the number of adjustments reaches the upper limit of 3, which is determined by the 95th percentile of the maximum number of rounds required for convergence on the test set.
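The adjustment loop can be sketched as follows. The score bounds that select each adjustment and the value of the qualified threshold are not recoverable from the text, so `LOW`, `MID`, `HIGH` and `QUALIFIED` below are placeholders chosen only to make the sketch runnable. The loop order (adjust, re-infer, then check the stop conditions) and stopping when the selected adjustment is zero are likewise assumptions.

```python
from typing import Callable, Tuple

BASE_STEPS = 50          # baseline number of denoising steps (from the text)
MAX_ADJUSTMENTS = 3      # upper limit of adjustments (from the text)

# Placeholder bounds: the real score ranges are not given in the text.
LOW, MID, HIGH = 0.6, 0.8, 0.9
QUALIFIED = 0.8          # placeholder qualified threshold


def step_delta(score: float) -> int:
    """Map a score range to one of the four adjustments listed in the text."""
    if score < LOW:
        return 10
    if score < MID:
        return 0
    if score < HIGH:
        return -10
    return -20


def tune_denoising_steps(infer: Callable[[int], float]) -> Tuple[int, float, int]:
    """`infer(steps)` runs inference and returns the lip consistency score."""
    steps = BASE_STEPS
    score = infer(steps)
    adjustments = 0
    while adjustments < MAX_ADJUSTMENTS:
        delta = step_delta(score)
        if delta == 0:
            break                      # nothing to change, so no re-inference
        steps += delta
        score = infer(steps)           # re-infer and update the score
        adjustments += 1
        if score >= QUALIFIED:
            break
    return steps, score, adjustments


# Fake inference function for illustration: the score never improves.
print(tune_denoising_steps(lambda steps: 0.3))  # (80, 0.3, 3)
```

The adjustment cap bounds the extra inference passes, which matters because each pass adds to the end-to-end latency.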
[0052] The principle and specific implementation of the multi-GPU rendering task conflict-free parallel scheduling algorithm based on graph coloring problem mapping are as follows. Rendering subtasks such as digital human face rendering, ocean background vector field rendering, chart animation rendering, subtitle synthesis, and audio-video synchronization encoding are modeled as a set of vertices in a graph. If there is a data dependency or a shared GPU memory resource between two subtasks, an edge is added between them to form a task conflict graph. An improved DSatur heuristic coloring algorithm is executed on the task conflict graph. In each step, the DSatur algorithm prioritizes coloring the vertex with the highest saturation, where saturation is defined as the number of distinct colors already used by adjacent vertices. When saturations are equal, the vertex degree is used as a secondary sorting criterion, and each vertex is assigned the smallest color number that does not conflict with its adjacent vertices. After coloring is completed, each set of vertices of the same color constitutes a conflict-free parallel batch. Tasks within a batch have no data dependencies and do not share GPU memory, allowing them to be safely allocated to different GPUs for parallel execution. The batch execution order is determined by color number from smallest to largest. Combined with the CUDA streaming concurrency mechanism, strict parallelism of tasks within a batch is achieved. Synchronization barriers are inserted between batches to ensure the correctness of data dependencies. 
The technical effects of this algorithm are as follows: by transforming the task scheduling problem into a combinatorial optimization problem through graph coloring, the DSatur heuristic strategy obtains a near-optimal coloring scheme in polynomial time, systematically eliminating parallel conflict waiting bubbles in the rendering pipeline, maximizing the utilization of multiple GPUs, and significantly improving the overall throughput efficiency of video generation, especially when there are many rendering subtasks with complex dependencies.
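The batch-wise execution with barriers between batches can be illustrated on the CPU. This sketch uses a thread pool as a stand-in for multiple GPUs and CUDA streams, which is an assumption made so that the example runs anywhere; the task names and the `run_batches` helper are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_batches(batches: List[List[str]],
                tasks: Dict[str, Callable[[], str]],
                workers: int = 2) -> List[List[str]]:
    """Run each batch in parallel and wait for it before starting the next."""
    results: List[List[str]] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in batches:
            # Tasks in the same batch are conflict-free and run concurrently.
            futures = [pool.submit(tasks[name]) for name in batch]
            # Collecting every result acts as the synchronization barrier.
            results.append([future.result() for future in futures])
    return results


tasks = {name: (lambda n=name: f"{n}: done")
         for name in ["chart", "face", "background", "subtitle"]}
print(run_batches([["chart", "face"], ["background", "subtitle"]], tasks))
# [['chart: done', 'face: done'], ['background: done', 'subtitle: done']]
```

In a GPU implementation the same shape would presumably map to one stream per task within a batch and a device or event synchronization between batches.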
[0053] The end-to-end latency threshold is set to 120 seconds. This threshold is determined based on the requirements of the near real-time marine early warning broadcasting service. It is obtained by statistically analyzing the measured latency of each link in the historical forecast release process, taking the 95th percentile of the sum of the latency of each link, and leaving a 10% margin.
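The derivation of the threshold can be sketched as below. It assumes a nearest-rank percentile and reads "leaving a 10% margin" as multiplying the percentile by 1.1; the text supports either direction for the margin, so both choices are assumptions. The sample data is invented for illustration.

```python
import math
from typing import List


def latency_threshold(samples: List[List[float]],
                      q: float = 95.0,
                      margin: float = 0.10) -> float:
    """Percentile of the summed per-link latencies, widened by a margin.

    Each element of `samples` holds the measured latencies (seconds) of all
    links for one historical release.
    """
    totals = sorted(sum(links) for links in samples)
    rank = max(1, math.ceil(q * len(totals) / 100))   # nearest-rank percentile
    return totals[rank - 1] * (1.0 + margin)


# Invented history: 20 releases whose summed latencies are 10, 20, ..., 200 s.
history = [[4.0 * i, 6.0 * i] for i in range(1, 21)]
print(round(latency_threshold(history), 6))  # 209.0
```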
[0054] The specific adjustment method for reducing the rendering resolution or the streamline density is as follows: when the end-to-end latency exceeds the threshold, the rendering resolution of the ocean background dynamic image frames is reduced by one level, the rendering resolution levels being 4K, 2K, and 1080P. After each reduction, the predicted latency is recalculated, and the process repeats until the predicted latency is lower than the threshold. If the resolution has been reduced to 1080P and the latency still exceeds the threshold, the streamline density of the ocean background vector field rendering is gradually reduced in steps of 20%. The lower limit of the streamline density is determined by expert review as the minimum density that keeps the forecast information readable. The threshold for each level and the step size of the streamline density are determined iteratively by conducting multiple rounds of experimental measurements of the rendering latency under different configurations in a simulation test environment, with the goal of satisfying the end-to-end latency constraint.
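The two-stage degradation can be sketched as follows. The resolution levels, the 120-second threshold and the 20% step come from the text. The lower density limit of 40%, the reading of the step as 20 percentage points of the original density, and the `predict_latency` callback are assumptions of this sketch.

```python
from typing import Callable, List, Tuple

RESOLUTION_LEVELS: List[str] = ["4K", "2K", "1080P"]
LATENCY_THRESHOLD_S = 120.0
DENSITY_STEP_PCT = 20
MIN_DENSITY_PCT = 40     # hypothetical lower limit (set by expert review in the text)


def degrade(predict_latency: Callable[[str, int], float]) -> Tuple[str, int, float]:
    """Lower resolution first, then streamline density, until latency fits.

    `predict_latency(resolution, density_pct)` returns the predicted
    end-to-end latency in seconds for a configuration.
    """
    level = 0
    density_pct = 100
    latency = predict_latency(RESOLUTION_LEVELS[level], density_pct)

    # Stage 1: step down through the resolution levels.
    while latency > LATENCY_THRESHOLD_S and level < len(RESOLUTION_LEVELS) - 1:
        level += 1
        latency = predict_latency(RESOLUTION_LEVELS[level], density_pct)

    # Stage 2: at the lowest resolution, thin out the streamlines.
    while (latency > LATENCY_THRESHOLD_S
           and density_pct - DENSITY_STEP_PCT >= MIN_DENSITY_PCT):
        density_pct -= DENSITY_STEP_PCT
        latency = predict_latency(RESOLUTION_LEVELS[level], density_pct)

    return RESOLUTION_LEVELS[level], density_pct, latency


# Invented latency table for illustration.
table = {("4K", 100): 300.0, ("2K", 100): 200.0,
         ("1080P", 100): 140.0, ("1080P", 80): 112.0}
print(degrade(lambda res, dens: table[(res, dens)]))  # ('1080P', 80, 112.0)
```

If even the lowest resolution and the minimum density exceed the threshold, the function returns that last configuration together with its latency so the caller can decide how to proceed.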
[0055] The specific implementation of step S01 is as follows. Technicians first read the NetCDF format forecast field file from the numerical oceanographic model system and extract oceanographic element field data such as sea surface temperature, current field, and wave height field. These element fields are stored on a regular latitude-longitude grid. Then, a grid-by-grid difference operation is performed between the current forecast field and the previous forecast field to obtain the difference forecast field. Only grid regions in the difference forecast field whose absolute difference exceeds the change threshold are marked as local areas requiring re-rendering. The change threshold is determined by performing percentile statistical analysis on the absolute values of the grid differences across the entire field over 30 consecutive days and taking the 85th percentile value. Typical reference values are approximately 0.5℃ (sea surface temperature) or 0.1m (wave height). After marking, the difference forecast field data is serialized and pushed to a message queue for asynchronous consumption by downstream microservices.
[0056] The specific implementation of step S02 is as follows. The message queue is implemented using a distributed stream processing middleware based on the publish-subscribe pattern. Each microservice acts as an independent subscriber, asynchronously consuming messages. The visualization microservice and the text generation microservice start in parallel without blocking each other. After receiving the differential forecast field, the visualization microservice only performs re-rendering on local grid areas where the difference exceeds the change threshold, avoiding full rendering of the entire field each time. The pre-rendered keyframe template is a background layer pre-generated offline for the typical spatial distribution of the ocean element field. During the real-time processing stage, the local re-rendering results are merged into the pre-rendered keyframe template in an incremental overlay manner to generate a complete sequence of dynamic ocean background image frames. Based on the element change information in the differential forecast field, the text generation microservice calls a large language model or rule template to generate structured forecast text. The forecast text covers descriptions of changes in elements such as sea surface temperature, current field, and wave height, ensuring semantic consistency with the visualization content.
[0057] The specific implementation of step S03 is as follows. The forecast text is input into a text-to-speech synthesis service, which outputs a Mel spectrum sequence and phoneme-aligned timestamps. The phoneme-aligned timestamps record the start and end frame positions of each phoneme and are later used to synchronize the lip-shape keypoints with the phoneme boundaries. The acoustic-visual collaborative cross-modal alignment diffusion network uses a variant of the denoising diffusion probabilistic model as its backbone, and the noise prediction network adopts a dual-stream U-Net architecture. The audio stream branch takes the Mel spectrum sequence as input and extracts temporal acoustic features through multi-layer convolution and a multi-head self-attention encoder. The visual stream branch takes the 3D keypoint sequence of the digital human face as input and extracts temporal geometric features of the face through a graph convolutional network and positional encoding. The outputs of the two branch encoders are fused at the U-Net bottleneck layer through a bidirectional cross-modal cross-attention mechanism. The forward direction uses the audio stream features as the query matrix and the visual stream features as the key and value matrices; the reverse direction uses the visual stream features as the query matrix and the audio stream features as the key and value matrices. The two attention outputs are added together and fed into the decoder. The phoneme-aligned timestamps are embedded as a time-step conditional vector, concatenated with the denoising step index, and input into the noise prediction network so that the denoising iterations stay synchronized with the phoneme boundaries. A topology regularization layer applies non-degeneracy constraints to the triangulated mesh of facial keypoints, preventing geometric degradation of the lower facial mesh under extreme lip shapes. The speaker identity embedding vector is mapped through a fully connected layer and then injected into the normalization parameters of each U-Net layer as an affine transformation, so that multiple digital human images can be driven by one network. The dynamic ocean background image frames are processed by a lightweight convolutional encoder to extract background context features, which are added into the U-Net decoder so that the output lip-shape keypoints are consistent with the background lighting and style. The network output is the frame-by-frame displacement of the 3D facial keypoints, which is superimposed on the digital human's basic pose to obtain the frame-by-frame lip-shape keypoint sequence.
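The bidirectional cross-attention fusion can be sketched in plain Python. This is a simplified illustration under stated assumptions: the learned query, key, and value projection matrices are omitted, both streams are assumed to have already been brought to the same sequence length and feature dimension so that the two attention outputs can be added, and all function names are hypothetical.

```python
import math

def matmul(a, b):
    """Matrix product of two row-major lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, context):
    """Scaled dot-product attention: queries from one stream, keys and values from the other."""
    scale = 1.0 / math.sqrt(len(query[0]))
    scores = matmul(query, transpose(context))
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, context)

def bidirectional_fusion(audio, visual):
    """Sum of forward (audio queries) and reverse (visual queries) cross-attention."""
    if len(audio) != len(visual):
        raise ValueError("this sketch assumes both streams have the same length")
    forward = cross_attention(audio, visual)  # audio as query, visual as key/value
    reverse = cross_attention(visual, audio)  # visual as query, audio as key/value
    return [[f + r for f, r in zip(f_row, r_row)]
            for f_row, r_row in zip(forward, reverse)]

fused = bidirectional_fusion([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]])
```

The equal-length requirement is a consequence of adding the two outputs element-wise; how the source network aligns the audio and video frame counts at the bottleneck is not stated, so the check in `bidirectional_fusion` is only this sketch's assumption.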
[0058] The specific implementation of step S04 is as follows. Technicians model rendering subtasks such as digital human face rendering, ocean background vector field rendering, chart animation rendering, subtitle synthesis, and audio-video synchronous encoding as the vertex set of a graph. If two subtasks have a data dependency or share GPU memory resources, an edge is added between them, forming a task conflict graph. An improved DSatur heuristic coloring algorithm is then executed on the task conflict graph. At each step, this algorithm colors the uncolored vertex with the highest saturation, where saturation is defined as the number of distinct colors already used by adjacent vertices. When saturation is tied, the vertex degree is used as the secondary sorting criterion. The selected vertex is assigned the smallest color number that does not conflict with its adjacent vertices. After coloring, the set of vertices with the same color constitutes a conflict-free parallel batch. Tasks within a batch have no data dependency and do not share memory, so they can be safely allocated to different GPUs for parallel execution. The batch execution order follows the color number from smallest to largest. Strict parallelism of tasks within a batch is achieved using the CUDA streaming concurrency mechanism, and synchronization barriers are inserted between batches to ensure the correctness of data dependencies.
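The saturation-first coloring can be sketched as below. This is a plain DSatur variant that follows the selection rule stated above (highest saturation, ties broken by vertex degree); whatever the "improved" variant adds is not specified in the text and is not modeled here. The task names and conflict edges are hypothetical and chosen only for illustration.

```python
def dsatur(vertices, edges):
    """Return {vertex: color} with colors 1, 2, ... using saturation-first selection."""
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    color = {}
    while len(color) < len(vertices):
        def key(v):
            # Saturation: number of distinct colors already used by neighbors.
            saturation = len({color[u] for u in neighbors[v] if u in color})
            return (saturation, len(neighbors[v]))
        # max() returns the first maximal element, so remaining ties follow input order.
        chosen = max((v for v in vertices if v not in color), key=key)
        used = {color[u] for u in neighbors[chosen] if u in color}
        c = 1
        while c in used:
            c += 1
        color[chosen] = c
    return color

def to_batches(color):
    """Group vertices by color; batches are ordered by ascending color number."""
    grouped = {}
    for v, c in color.items():
        grouped.setdefault(c, []).append(v)
    return [sorted(grouped[c]) for c in sorted(grouped)]

# Hypothetical conflict graph (edges are illustrative, not from the source).
tasks = ["face", "background", "chart", "subtitle", "encode"]
conflicts = [("face", "subtitle"), ("background", "chart"), ("chart", "subtitle"),
             ("background", "encode"), ("chart", "encode")]
coloring = dsatur(tasks, conflicts)
batches = to_batches(coloring)
```

For this hypothetical graph the triangle formed by `background`, `chart`, and `encode` forces three colors, so three batches is the minimum any scheduler could reach.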
[0059] The specific implementation of step S05 is as follows. The lip-shape consistency score is obtained by normalizing and inverting the average Euclidean distance between the frame-by-frame predicted lip-shape keypoint sequence and the reference keypoint sequence. The normalization baseline is obtained by averaging the Euclidean distance over all frame pairs on the validation set. A higher score indicates better lip-shape generation quality. The dynamic step size adjustment function adjusts the number of denoising steps according to the range in which the score falls: 10 steps are added when the score is below 0.60; the base number of 50 steps is kept when the score is at least 0.60 and below 0.75; 10 steps are removed when the score is at least 0.75 and below 0.90; and 20 steps are removed when the score is 0.90 or higher. After each adjustment, inference is re-run and the score is updated, and this process repeats until the score is at least 0.75 or the maximum of 3 adjustments is reached. Each rendering result is then encoded together with the speech in audio-video synchronization to generate the digital human forecast video, which is pushed to the publishing queue.
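The step-adjustment rule can be sketched as follows. The thresholds, step deltas, base value of 50, and limit of 3 adjustments come from the paragraph above; the loop structure is one possible reading of it, and `infer` is a hypothetical callback that runs diffusion inference with a given number of denoising steps and returns the resulting score.

```python
BASE_STEPS = 50
TARGET_SCORE = 0.75
MAX_ADJUSTMENTS = 3

def step_delta(score):
    """Change in denoising steps for a given lip-shape consistency score."""
    if score < 0.60:
        return 10
    if score < 0.75:
        return 0
    if score < 0.90:
        return -10
    return -20

def refine(infer, base_steps=BASE_STEPS):
    """Adjust the step count until the score reaches the target or the limit is hit.

    Returns (final_steps, final_score, number_of_adjustments).
    """
    steps = base_steps
    score = infer(steps)
    adjustments = 0
    for _ in range(MAX_ADJUSTMENTS):
        delta = step_delta(score)
        if delta == 0:
            break  # the step count stays unchanged, so re-running would change nothing
        steps += delta
        score = infer(steps)  # re-run inference after every adjustment
        adjustments += 1
        if score >= TARGET_SCORE:
            break
    return steps, score, adjustments

# Hypothetical score table standing in for real inference runs.
fake_scores = {50: 0.55, 60: 0.58, 70: 0.80}
result = refine(fake_scores.__getitem__)
```

In this reading, a score in the 0.60 to 0.75 band leaves the step count at its current value and ends the loop, which matches the case where an initial score of 0.68 keeps the base value of 50 steps.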
[0060] The specific implementation of step S06 is as follows. The system continuously monitors the arrival timestamp of the differential forecast field in the message queue and the completion timestamp of the video in the publishing queue, takes the difference between the two as the end-to-end latency, and sets the latency threshold to 120 seconds. When the end-to-end latency exceeds the latency threshold, the rendering resolution of the ocean background dynamic image frames is first reduced one level at a time through the resolution levels 4K, 2K, and 1080P. After each reduction, the predicted latency is recalculated, until it falls below the latency threshold. If the resolution has already been reduced to 1080P and the predicted latency still exceeds the latency threshold, the streamline density of the rendered ocean background vector field is additionally reduced step by step in 20% increments. The lower limit of the streamline density is determined by expert review to ensure the readability of the forecast information. The threshold for each level and the streamline density step size are determined iteratively after multiple rounds of experimental measurements of the rendering latency under different configurations in a simulation test environment.
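The degradation ladder can be sketched as below. The 120-second threshold, the three resolution levels, and the 20% step come from the paragraph above. Several details are assumptions of this sketch: the 20% step is treated as 20 percentage points of the full density, the lower limit `min_density=0.4` is a placeholder for the value set by expert review, and `estimate_latency` is a hypothetical callback returning the predicted latency in seconds for a configuration.

```python
RESOLUTIONS = ["4K", "2K", "1080P"]
LATENCY_THRESHOLD_S = 120.0

def degrade(estimate_latency, min_density=0.4, density_step=0.2):
    """Lower resolution first, then streamline density, until latency fits the threshold.

    Returns (resolution, density, predicted_latency). Density is a fraction
    of the full streamline density (1.0 means no reduction).
    """
    level, density = 0, 1.0
    latency = estimate_latency(RESOLUTIONS[level], density)
    while latency > LATENCY_THRESHOLD_S:
        if level < len(RESOLUTIONS) - 1:
            level += 1  # drop one resolution level
        elif density - density_step >= min_density - 1e-9:
            density = round(density - density_step, 10)  # thin out streamlines
        else:
            break  # both knobs are exhausted; report the best effort
        latency = estimate_latency(RESOLUTIONS[level], density)
    return RESOLUTIONS[level], density, latency

# Hypothetical latency table: 4K is too slow, 2K fits the window.
table = {("4K", 1.0): 150.0, ("2K", 1.0): 110.0}
choice = degrade(lambda resolution, dens: table[(resolution, dens)])
```

Resolution is exhausted before density is touched, which mirrors the ordering in the text: streamline density is only reduced once the resolution is already at 1080P.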
[0061] It should be noted that the key technologies of this invention include: a local re-rendering and pre-rendered keyframe template merging mechanism based on differential prediction fields, which avoids redundant computation of full-field re-rendering by triggering incremental overlay rendering only for significantly changing local grid regions, thus greatly improving the real-time response capability of the visualization microservice; a bidirectional cross-modal cross-attention and phoneme timestamp constraint mechanism in the acoustic-visual collaborative cross-modal alignment diffusion network, which enables fine-grained alignment of audio temporal rhythms and facial geometric movements in the feature space, and the topology regularization layer further ensures the geometric continuity of the facial mesh; and a conflict-free parallel scheduling algorithm for multi-GPU rendering tasks based on graph coloring problem mapping, which formalizes task conflict relationships into a graph structure and systematically eliminates parallel conflict bubbles through the DSatur heuristic coloring system, maximizing GPU utilization. The above three key technologies work together to reduce the computational load of single-frame visualization through local re-rendering, leaving more GPU computing power for inference in the cross-modal alignment diffusion network, while the graph coloring scheduling algorithm maximizes the parallel efficiency of the remaining rendering tasks. Together, they ensure that the end-to-end latency meets the publishing window constraint of near real-time broadcasting.
[0062] It should be noted that this invention also solves the following technical problem: In the scenario of digital human video generation for marine forecasting, the high-precision synchronization between the digital lip-shape driving result and the speech phoneme boundary has always been a difficult problem to solve. Traditional methods usually use diffusion inference with a fixed step size, which cannot adaptively adjust the denoising iteration depth according to the lip-shape generation quality of the current frame, resulting in some frames showing misalignment between lip shape and phoneme switching time. This invention uses a dynamic step size adjustment function to adjust the number of denoising steps in real time based on the lip-shape consistency score feedback. When the score is low, the number of denoising steps is increased to improve detail accuracy, and when the score is high, the number of denoising steps is reduced to save computational resources. In addition, phoneme constraint correction is performed in each denoising step in combination with the phoneme alignment timestamp, so that the generation process of lip shape key points is always synchronized with the phoneme boundary, fundamentally solving the technical problem of insufficient lip shape and speech temporal alignment accuracy.
[0063] A second aspect of the present invention provides a computer-readable storage medium storing program instructions that, when executed in a computer, are used to perform the above-described method for automatically generating digital human videos for marine forecasting.
[0064] A third aspect of the present invention provides an automatic generation system for marine forecast digital human videos, comprising the aforementioned computer-readable storage medium. The system is any one of a computer, a server, or a microcontroller. The computer-readable storage medium is disposed within the system, and the system is provided with a microprocessor that executes the program instructions stored in the computer-readable storage medium.
[0065] Specifically, the principle of this invention is as follows: The reason why the technical solution of this invention can solve the above-mentioned technical problems is that the construction of the task conflict graph explicitly transforms the data dependencies and memory sharing relationships between tasks into a graph structure, enabling subsequent scheduling algorithms to formally perceive global conflict constraints. The DSatur heuristic coloring algorithm prioritizes coloring the vertex with the highest saturation at each step. Saturation reflects the number of colors already occupied by the task's neighbors; that is, the task with the greatest conflict pressure is grouped first. This strategy makes the coloring result close to the optimal number of colors in polynomial time, thus minimizing the number of parallel batches and maximizing the number of tasks within each batch. Tasks within the same color batch have no data dependencies and do not share memory, allowing them to be safely mapped to different GPUs for parallel execution, eliminating the serial waiting bubbles caused by the lack of conflict detection in traditional scheduling. The CUDA streaming concurrency mechanism further ensures hardware-level parallelism of tasks within a batch, while the synchronization barrier between batches ensures the correctness of data dependencies. The combined effect of these mechanisms maximizes the utilization of GPU computing resources in the time dimension, systematically reducing video generation latency.
[0066] The following provides a specific embodiment 1 of the present invention, and the specific implementation of each step in this embodiment 1 is described in detail below.
[0067] The specific implementation method of step S01 is as follows.
[0068] Sea surface temperature, current field, wave height, and other element fields are extracted from the NetCDF-format file output by the numerical oceanographic model to construct the forecast field matrix for the current time period, $F_t$, where $t$ is the index of the current time period. The forecast field matrix for the previous time period is denoted $F_{t-1}$. The grid-by-grid difference operation is expressed as follows:
[0069] $$\Delta F_t(i,j) = F_t(i,j) - F_{t-1}(i,j);$$
[0070] In the formula, $\Delta F_t(i,j)$ is the difference between the current time period and the previous time period at the grid point in row $i$ and column $j$, $F_t(i,j)$ is the forecast value at this grid point for the current time period, and $F_{t-1}(i,j)$ is the forecast value at this grid point for the previous time period. All three have the same dimension, which is determined by the extracted element. The change threshold $\theta$ is obtained by percentile analysis of the absolute grid-point differences across the entire field for the hourly difference fields over a continuous 30-day historical period, taking the 85th percentile value; $\theta$ and $\Delta F_t(i,j)$ have the same dimension. Only when $|\Delta F_t(i,j)| > \theta$ is the grid point marked as a region that needs to be locally re-rendered, and the difference forecast field is then pushed to the message queue.
[0071] The specific implementation method of step S02 is as follows.
[0072] The message queue triggers the parallel execution of the visualization microservice and the text generation microservice based on a publish-subscribe pattern. The visualization microservice reads marked local areas from the differential forecast field, performs incremental overlay rendering only on these areas, and merges them with pre-rendered keyframe templates to generate a sequence of dynamic ocean background image frames. The text generation microservice asynchronously outputs the forecast text; both processes proceed in parallel without blocking each other.
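The publish-subscribe fan-out can be sketched with an in-memory broker. This is only an illustration using `asyncio`; the source describes a distributed stream processing middleware, which this sketch does not reproduce. `Broker`, `visualization_service`, `text_service`, and the message fields are hypothetical names, and the sleeps stand in for real rendering and text generation work.

```python
import asyncio

class Broker:
    """Minimal in-memory publish-subscribe hub: every subscriber gets every message."""

    def __init__(self):
        self._queues = []

    def subscribe(self):
        queue = asyncio.Queue()
        self._queues.append(queue)
        return queue

    async def publish(self, message):
        for queue in self._queues:
            await queue.put(message)

async def visualization_service(queue, results):
    diff_field = await queue.get()
    await asyncio.sleep(0.02)  # stands in for local re-rendering
    results["frames"] = f"frames for {len(diff_field['marked'])} marked cells"

async def text_service(queue, results):
    diff_field = await queue.get()
    await asyncio.sleep(0.01)  # stands in for LLM or rule-template text generation
    results["text"] = f"forecast text for {diff_field['element']}"

async def run_once(message):
    """Publish one differential forecast field and wait for both consumers."""
    broker = Broker()
    results = {}
    consumers = [
        asyncio.create_task(visualization_service(broker.subscribe(), results)),
        asyncio.create_task(text_service(broker.subscribe(), results)),
    ]
    await broker.publish(message)
    await asyncio.gather(*consumers)  # both services run concurrently
    return results

outputs = asyncio.run(run_once({"element": "sst", "marked": [(0, 1), (1, 1)]}))
```

Because each subscriber has its own queue, the slower visualization service never delays the text service, which is the non-blocking behavior the paragraph describes.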
[0073] The specific implementation method of step S03 is as follows.
[0074] The forecast text is processed by a text-to-speech synthesis service to obtain a Mel spectrum sequence $M \in \mathbb{R}^{T_a \times d_m}$ and a phoneme-aligned timestamp sequence $P = \{(p_n, s_n, e_n)\}_{n=1}^{N}$, where $T_a$ is the total number of audio frames, $d_m$ is the Mel spectrum feature dimension, $p_n$ is the identifier of the $n$-th phoneme, $s_n$ is the start frame number of the $n$-th phoneme, $e_n$ is the end frame number of the $n$-th phoneme, and $N$ is the total number of phonemes. The digital human facial 3D keypoint sequence is denoted $K \in \mathbb{R}^{T_v \times J \times 3}$, where $T_v$ is the total number of video frames and $J$ is the total number of facial keypoints. The speaker identity embedding vector is denoted $e_{id} \in \mathbb{R}^{d_{id}}$, where $d_{id}$ is the dimension of the identity embedding vector. The acoustic-visual collaborative cross-modal alignment diffusion network uses a variant of the denoising diffusion probabilistic model as its backbone, and the noise prediction network employs a dual-stream U-Net architecture. The audio stream branch extracts temporal acoustic features $H_a$ from $M$ using multi-layer convolution and a multi-head self-attention encoder. The visual stream branch extracts temporal facial geometric features $H_v$ from $K$ through a graph convolutional network and positional encoding, where $d$ is the unified feature dimension of the two branch encoder outputs $H_a$ and $H_v$. The two branches are fused at the bottleneck layer by bidirectional cross-modal cross-attention: the forward cross-attention uses $H_a$ as the query matrix and $H_v$ as the key and value matrices, the reverse cross-attention uses $H_v$ as the query matrix and $H_a$ as the key and value matrices, and the two attention outputs are summed and fed into the decoder. The speaker identity embedding vector $e_{id}$ is mapped through a fully connected layer and injected into the parameters of each normalization layer as an affine transformation. The ocean background image frames are processed by a lightweight convolutional encoder to extract background context features $C_{bg}$, which are incorporated into the decoder additively. In each denoising step $\tau$, the phoneme-aligned timestamps are embedded as a time-step conditional vector $c_p \in \mathbb{R}^{d_c}$, where $d_c$ is the embedding dimension of the time-step conditional vector, empirically set to 128; $c_p$ is concatenated with the denoising step index and input into the noise prediction network. Phoneme constraint correction is then applied to the predicted lip-shape keypoints. A topology regularization layer applies non-degeneracy constraints to the triangulated mesh of facial keypoints. The network output is the frame-by-frame displacement of the 3D facial keypoints, $\Delta\hat{K}$, which is superimposed on the basic pose $K_0$ to obtain the frame-by-frame lip-shape keypoint sequence. $K_0$ holds the keypoint coordinates of the digital human's basic pose in a resting state and is extracted offline from the initial frame of the digital human image by a 3D face reconstruction algorithm. The formula is expressed as follows:
[0075] $$\hat{K} = K_0 + \Delta\hat{K};$$
[0076] In the formula, $\hat{K}$ is the frame-by-frame lip-shape keypoint sequence; $K_0$ is broadcast along the time dimension to the same shape as $\Delta\hat{K}$, and the two are added element-wise; all three have the same dimension.
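[0077] During the training phase, the mean squared error loss $\mathcal{L}_{mse}$ is expressed as follows:
[0078] $$\mathcal{L}_{mse} = \frac{1}{T_v J}\sum_{t=1}^{T_v}\sum_{k=1}^{J}\left\|\Delta\hat{K}_{t,k} - \Delta K_{t,k}\right\|_2^2;$$
[0079] In the formula, $T_v$ is the total number of video frames, $J$ is the total number of facial keypoints, $\Delta\hat{K}_{t,k}$ is the predicted displacement of the $k$-th keypoint in the $t$-th frame, $\Delta K_{t,k}$ is the corresponding actual displacement, and $\|\cdot\|_2^2$ is the square of the Euclidean norm. The dimension of $\mathcal{L}_{mse}$ is the square of the keypoint coordinate unit. The dynamic time warping soft alignment loss $\mathcal{L}_{dtw}$ is expressed as follows:
[0080] $$\mathcal{L}_{dtw} = \sum_{n=1}^{N}\min_{\pi \in \mathcal{A}_n}\sum_{t \in \pi}\left\|\hat{p}_t - r_n\right\|_2^2;$$
[0081] In the formula, $\hat{p}_t$ is the predicted lip keypoint subset of the $t$-th frame, containing $J_{lip}$ keypoints, where $J_{lip}$ is the number of lip keypoints; $r_n$ is the reference lip-shape keypoints corresponding to the $n$-th phoneme; $N$ is the total number of phonemes; $\pi$ is an alignment path; and $\mathcal{A}_n$ is the set of all valid alignment paths within the time interval of the $n$-th phoneme. The dimension of $\mathcal{L}_{dtw}$ is the square of the keypoint coordinate unit. The topological regularization loss $\mathcal{L}_{topo}$ is expressed as follows:
[0082] $$\mathcal{L}_{topo} = \frac{1}{T_v\,|\mathcal{T}|}\sum_{t=1}^{T_v}\sum_{f \in \mathcal{T}}\max\left(0,\ \epsilon - A_{t,f}\right);$$
[0083] In the formula, $\mathcal{T}$ is the set of triangular facets of the triangulated mesh of facial keypoints, $A_{t,f}$ is the area of the $f$-th triangular facet in the $t$-th frame, and $\epsilon$ is the lower-limit threshold for non-degenerate area, set empirically; $\epsilon$ and $A_{t,f}$ have the dimension of area. The total loss $\mathcal{L}$ is a weighted sum of the three terms and is expressed as follows:
[0084] $$\mathcal{L} = \lambda_1\frac{\mathcal{L}_{mse}}{\bar{\mathcal{L}}_{mse}} + \lambda_2\frac{\mathcal{L}_{dtw}}{\bar{\mathcal{L}}_{dtw}} + \lambda_3\frac{\mathcal{L}_{topo}}{\bar{\mathcal{L}}_{topo}};$$
[0085] In the formula, $\bar{\mathcal{L}}_{mse}$ is the normalization baseline for the mean squared error loss, taken as the mean of $\mathcal{L}_{mse}$ on the validation set in the initial training round, with the same dimension as $\mathcal{L}_{mse}$; $\bar{\mathcal{L}}_{dtw}$ is the normalization baseline for the dynamic time warping soft alignment loss, taken as the mean of $\mathcal{L}_{dtw}$ on the validation set in the initial training round, with the same dimension as $\mathcal{L}_{dtw}$; and $\bar{\mathcal{L}}_{topo}$ is the normalization baseline for the topological regularization loss, taken as the mean of $\mathcal{L}_{topo}$ on the validation set in the initial training round, with the same dimension as $\mathcal{L}_{topo}$. Each of the three terms becomes dimensionless after division by its baseline, so the total loss $\mathcal{L}$ is dimensionless. $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weighting coefficients of the three losses and are determined through a grid search on the validation set. A cosine annealing learning rate strategy is used for multiple rounds of training, the lip-shape consistency score $S$ is evaluated on the validation set in each round, and the checkpoint with the highest $S$ is selected as the final model weights.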
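The baseline-normalized weighted sum, together with a triangle-area check of the kind a topology regularization term needs, can be sketched as follows. Two points are assumptions of this sketch rather than statements of the source: the hinge form of `topology_penalty` (penalizing only facets whose area falls below the threshold) and every numeric weight, baseline, and loss value in the example.

```python
import math

def triangle_area(p, q, r):
    """Area of a 3-D triangle from the cross product of two edge vectors."""
    u = [q[i] - p[i] for i in range(3)]
    v = [r[i] - p[i] for i in range(3)]
    cross = (u[1] * v[2] - u[2] * v[1],
             u[2] * v[0] - u[0] * v[2],
             u[0] * v[1] - u[1] * v[0])
    return 0.5 * math.sqrt(sum(c * c for c in cross))

def topology_penalty(keypoints, triangles, min_area):
    """Mean hinge penalty on facets whose area falls below min_area (assumed form)."""
    penalties = [max(0.0, min_area - triangle_area(*(keypoints[i] for i in tri)))
                 for tri in triangles]
    return sum(penalties) / len(penalties)

def total_loss(losses, baselines, weights):
    """Weighted sum of losses, each divided by its baseline so every term is dimensionless."""
    return sum(weights[name] * losses[name] / baselines[name] for name in losses)

# Illustrative values only: one healthy facet and one collapsed (collinear) facet.
keypoints = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (2, 0, 0)]
penalty = topology_penalty(keypoints, [(0, 1, 2), (0, 1, 3)], min_area=0.1)
loss = total_loss({"mse": 2.0, "dtw": 4.0, "topo": 1.0},
                  {"mse": 4.0, "dtw": 8.0, "topo": 2.0},
                  {"mse": 1.0, "dtw": 0.5, "topo": 0.1})
```

Dividing each term by its own validation-set baseline puts the three losses on a comparable scale before weighting, so the weights express relative importance rather than compensating for unit differences.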
[0086] The specific implementation method of step S04 is as follows.
[0087] The rendering subtasks, such as digital human face rendering, ocean background vector field rendering, chart animation rendering, and subtitle compositing, are modeled as the vertex set $V$ of a graph $G = (V, E)$, where $E$ is the set of edges of the graph. If two subtasks have a data dependency or share memory resources, an edge is added between them, forming the task conflict graph. A saturation-first coloring heuristic is then applied to the task conflict graph, in which the saturation $\mathrm{sat}(v)$ and the degree $\deg(v)$ of a vertex $v$ are defined as follows:
[0088] $$\mathrm{sat}(v) = \left|\{\, c(u) : u \in N(v),\ u \text{ already colored} \,\}\right|;$$
[0089] $$\deg(v) = |N(v)|;$$
[0090] In the formula, $N(v)$ is the set of vertices adjacent to vertex $v$ in the conflict graph, $c(u)$ is the color number assigned to vertex $u$, $|\cdot|$ is the cardinality of a set, and $\mathrm{sat}(v)$ and $\deg(v)$ are both dimensionless integers. Each step selects the uncolored vertex with the largest $\mathrm{sat}(v)$; if several are tied, the one with the largest $\deg(v)$ is selected. The selected vertex is assigned the smallest color number that satisfies the following condition:
[0091] $$c(v) = \min\{\, k \in \mathbb{Z}^{+} : k \neq c(u)\ \text{for every already colored } u \in N(v) \,\};$$
[0092] In the formula, $\mathbb{Z}^{+}$ is the set of positive integers and $c(v)$ is the minimum conflict-free color number assigned to vertex $v$. This step is repeated until all vertices are colored. After coloring, the set of vertices with the same color forms a conflict-free parallel batch. The batch execution order follows the color number from smallest to largest. Combined with the CUDA streaming concurrency mechanism, strict parallelism is achieved within each batch, and synchronization barriers are inserted between batches to ensure the correctness of data dependencies.
[0093] The specific implementation method of step S05 is as follows.
[0094] The lip-shape consistency score $S$ is obtained by normalizing and inverting the average Euclidean distance between the frame-by-frame predicted lip-shape keypoint sequence and the reference keypoint sequence, as follows:
[0095] $$S = 1 - \frac{\frac{1}{T_v}\sum_{t=1}^{T_v}\left\|\hat{p}_t - p_t^{ref}\right\|_2}{d_0};$$
[0096] In the formula, $\hat{p}_t$ is the predicted lip-shape keypoints of the $t$-th frame, $p_t^{ref}$ is the reference keypoints corresponding to the $t$-th frame, $\|\cdot\|_2$ is the Euclidean norm, and $T_v$ is the total number of video frames. The average Euclidean distance in the numerator and the normalization baseline $d_0$ have the same dimension; $d_0$ is obtained by averaging the Euclidean distance over all frame pairs on the validation set. Their quotient is dimensionless, so $S$ is a dimensionless score ranging from 0 to 1. The dynamic step size adjustment function adjusts the number of denoising steps $N_s$ based on $S$. The base value of $N_s$ defaults to 50 and is determined experimentally by sweeping integer step counts in the range of 20 to 100 on the validation set. When $S < 0.60$, 10 steps are added; when $0.60 \le S < 0.75$, the base value is kept unchanged; when $0.75 \le S < 0.90$, 10 steps are removed; when $S \ge 0.90$, 20 steps are removed. After each adjustment, inference is re-run and $S$ is updated. The loop is executed until $S \ge 0.75$ or the maximum of three adjustments is reached; this maximum is determined from the 95th percentile of the number of rounds required for convergence on the test set. After each rendering result and the speech are encoded in audio-video synchronization, the digital human forecast video is generated and pushed to the release queue.
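The score computation can be sketched as follows, assuming the score is one minus the ratio of the mean per-frame Euclidean distance to the validation baseline. That reading of "normalizing and inverting", the clamp to the interval from 0 to 1, and the function names are assumptions of this sketch. Each frame is given as a list of `(x, y, z)` lip keypoints.

```python
import math

def frame_distance(pred_frame, ref_frame):
    """Euclidean norm of the difference between two frames of (x, y, z) keypoints."""
    return math.sqrt(sum((a - b) ** 2
                         for p, r in zip(pred_frame, ref_frame)
                         for a, b in zip(p, r)))

def lip_consistency_score(predicted, reference, baseline_distance):
    """One minus the baseline-normalized mean frame distance, clamped to [0, 1]."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("sequences must be non-empty and equally long")
    mean_distance = sum(frame_distance(p, r)
                        for p, r in zip(predicted, reference)) / len(predicted)
    return min(1.0, max(0.0, 1.0 - mean_distance / baseline_distance))

# Toy sequences with one lip keypoint per frame (illustrative values).
predicted = [[(0.0, 0.0, 0.0)], [(3.0, 4.0, 0.0)]]
reference = [[(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)]]
score = lip_consistency_score(predicted, reference, baseline_distance=10.0)
```

Here the per-frame distances are 0 and 5, the mean is 2.5, and with a baseline of 10 the score is 0.75, which sits exactly on the boundary of the convergence target.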
[0097] The specific implementation method of step S06 is as follows.
[0098] The arrival timestamp $t_{arr}$ of the differential forecast field in the message queue and the completion timestamp $t_{pub}$ of the video in the publishing queue are monitored. The end-to-end latency $T_{e2e}$ is calculated as follows:
[0099] $$T_{e2e} = t_{pub} - t_{arr};$$
[0100] In the formula, $t_{arr}$ is the timestamp at which the difference forecast field arrives in the message queue and $t_{pub}$ is the timestamp at which the video is completed and pushed to the publishing queue. $T_{e2e}$, $t_{arr}$, and $t_{pub}$ all have the dimension of seconds. The end-to-end latency threshold $T_{th}$ is set to 120 s. When $T_{e2e} > T_{th}$, the rendering resolution of the ocean background dynamic image frames is reduced by one level, the resolution levels being 4K, 2K, and 1080P. After each reduction, the predicted latency is recalculated, until the predicted latency is lower than $T_{th}$. If the resolution has been reduced to 1080P and the predicted latency still exceeds the threshold, the streamline density is reduced step by step in 20% increments. The lower limit of streamline density is determined by expert review as the minimum density that keeps the forecast information readable. The thresholds for each level and the streamline density increments are determined iteratively after multiple rounds of experimental measurements of rendering latency under different configurations in a simulation test environment.
[0101] To better understand and implement this invention, a specific application scenario, Example 2, is given below. To verify the effectiveness of the invention, technicians built a test environment connected to a nearshore numerical oceanographic model system, used hourly forecast data for 7 consecutive days as input to run the complete marine forecast digital human video automatic generation method described in this invention, and measured and recorded key indicators such as end-to-end latency, lip-shape consistency score, and GPU utilization.
[0102] In step S01, technicians read the NetCDF format forecast field file from the numerical oceanographic model system, extract three types of element fields: sea surface temperature, current field, and wave height, and perform percentile statistical analysis on the absolute values of the full-field grid differences for the hourly difference fields over 30 consecutive days. The statistical results are shown in Table 1.
[0103] Table 1. Statistical table of the 85th percentile values of the field differences of various factors.
[0104]
[0105] Based on the statistical results in Table 1, the change thresholds for each element field were set to the corresponding 85th percentile values. In the subsequent difference forecast field, only grid regions with absolute difference values exceeding the above thresholds were marked as regions requiring local re-rendering, and approximately 15% of the grid points were identified as regions with significant changes.
[0106] In step S02, the visualization microservice and the text generation microservice are started in parallel through a distributed stream processing middleware. The visualization microservice performs incremental overlay rendering only on the marked areas, merging the local re-rendering results into a pre-generated offline pre-rendered keyframe template to generate a complete sequence of dynamic ocean background image frames, with the initial rendering resolution set to 4K. The text generation microservice outputs structured forecast text based on the variation information of differential forecast field elements, including location descriptions of sea surface temperature anomalies and wave height variation trends.
[0107] In step S03, the forecast text is input into a text-to-speech synthesis service, which outputs a Mel spectrum sequence and phoneme-aligned timestamps. The acoustic-visual collaborative cross-modal alignment diffusion network uses a variant of the denoising diffusion probabilistic model as its backbone, employs a dual-stream U-Net architecture, and injects the speaker identity embedding vector into the normalization parameters to uniformly drive 50 announcer identity samples. The network takes Mel spectrum sequences, phoneme-aligned timestamps, digital human facial 3D keypoint sequences, dynamic ocean background image frames, and speaker identity embedding vectors as inputs, fuses audio and visual features through a bidirectional cross-modal cross-attention mechanism, and outputs a frame-by-frame lip-shape keypoint sequence. As shown in Figure 2, the displacement trend of the lip keypoints output by the network at different phoneme boundaries is highly consistent with the phoneme switching times. The topology regularization layer effectively prevents geometric degradation of the lower facial mesh under extreme mouth shapes.
[0108] In step S04, the technicians modeled five rendering sub-tasks—digital human face rendering (vertex A), ocean background vector field rendering (vertex B), chart animation rendering (vertex C), subtitle compositing (vertex D), and audio / video synchronization encoding (vertex E)—as a set of vertices in a task conflict graph. Analysis revealed a memory sharing relationship between vertices A and D, a data dependency relationship between vertices C and E, and a data dependency relationship between vertices B and E. Edges were added accordingly to form the task conflict graph. The DSatur heuristic coloring algorithm was then applied to the task conflict graph, and the coloring results are shown in Table 2.
[0109] Table 2. Rendering Task Conflict Graph Coloring Results
[0110]
[0111] Based on the coloring results in Table 2, batch 1 contains vertices A, B, and C. These three vertices have no data dependencies and do not share video memory, so they can be safely allocated to different GPUs for parallel execution. Batch 2 contains vertex D and executes after batch 1 is completed. Batch 3 contains vertex E and executes after batch 2 is completed. Combined with the CUDA streaming concurrency mechanism, the three subtasks within batch 1 are strictly parallelized on three different GPUs, and synchronization barriers between batches ensure the correctness of data dependencies.
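The batch-then-barrier execution order can be sketched on the CPU as below. Worker threads stand in for CUDA streams on separate GPUs, which is purely an assumption of this illustration; `run_batches` and `task_fn` are hypothetical names. The batch contents are the ones stated above.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def run_batches(batches, task_fn):
    """Run the tasks of each batch concurrently and finish a batch before the next starts.

    Returns a log of (batch_number, task_name) in completion order.
    """
    log = []
    lock = threading.Lock()

    def run(batch_number, name):
        task_fn(name)
        with lock:
            log.append((batch_number, name))

    with ThreadPoolExecutor(max_workers=max(len(b) for b in batches)) as pool:
        for number, batch in enumerate(batches, start=1):
            futures = [pool.submit(run, number, name) for name in batch]
            for future in futures:
                future.result()  # waiting on every task acts as the inter-batch barrier
    return log

# Batches as described in the text: A, B, C in parallel, then D, then E.
log = run_batches([["A", "B", "C"], ["D"], ["E"]], lambda name: name.lower())
```

The order of A, B, and C inside the log may vary from run to run, but D never starts before all three have finished, and E never starts before D has finished.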
[0112] In step S05, the lip-shape consistency score of the current frame sequence is calculated. The initial inference score is 0.68, which falls within the range of 0.60 to 0.75, so the number of denoising steps remains at the base value of 50 steps. After audio-video synchronous encoding, the digital human forecast video is pushed to the release queue.
[0113] In step S06, technicians monitor the arrival timestamps of the differential forecast fields in the message queue and the video completion timestamps in the publishing queue to calculate the end-to-end latency. As shown in Figure 3, at a rendering resolution of 4K, the end-to-end latency exceeded the 120-second latency threshold during some periods of the 7-day continuous test. The system automatically triggered the resolution degradation mechanism, reducing the rendering resolution from 4K to 2K. After recalculation, the predicted latency returned to within the threshold, and all videos in the publishing queue were pushed within the publishing window. The digital human forecast video generated in this embodiment is shown in Figure 4.
[0114] It should be noted that, compared with traditional serial or simple parallel rendering scheduling methods, the graph coloring scheduling algorithm of this invention formalizes task conflict relationships into a graph structure, so that conflict-free task sets can be accurately identified and allocated to different GPUs at the same time for parallel execution. This eliminates the serial waiting bubbles caused by missing conflict detection and allows GPU computing resources to be used more efficiently in the time dimension. Compared with fixed-step diffusion inference, the dynamic step size adjustment function adjusts the number of denoising steps in real time based on the lip-shape consistency score, so that lip-shape keypoint generation keeps an adaptive balance between quality and efficiency and low-quality frames do not accumulate synchronization errors because of insufficient inference depth. Compared with full-field re-rendering, the local re-rendering mechanism based on differential forecast fields performs incremental calculations only on significantly changing regions, so the computational load of the visualization microservice decreases adaptively with the degree of change in the forecast field, which provides a basic guarantee for overall end-to-end latency control.
[0115] It should be noted that the variables involved in this invention are explained in detail in Tables 3 and 4.
[0116] Table 3. Variable Explanation Table (Part 1)
[0117]
[0118] Table 4. Variable Explanation Table (Part Two)
[0119]
[0120] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in the present invention should be included within the scope of protection of the present invention.
Claims
1. A method for automatically generating digital human videos for marine forecasting, characterized in that, Includes the following steps: Obtain the forecast field file output by the numerical ocean model, extract the ocean element field, perform a difference operation with the previous forecast field to obtain the difference forecast field, and push the difference forecast field to the message queue. The message queue triggers the visualization microservice to perform local re-rendering on the differential forecast field, merges the pre-rendered keyframe template with the local re-rendering result, and generates a dynamic image frame sequence of the ocean background. At the same time, it triggers the text generation microservice to output the forecast text. The forecast text is input into the text-to-speech synthesis service to obtain the Mel spectrum sequence and phoneme-aligned timestamp. The Mel spectrum sequence, phoneme-aligned timestamp, digital human face 3D key point sequence, ocean background dynamic image frame and speaker identity are embedded into the vector input acoustic-visual collaborative cross-modal alignment diffusion network to output frame-by-frame lip shape key point sequence. The rendering subtasks are divided into conflict-free parallel batches based on a multi-GPU rendering task conflict-free parallel scheduling algorithm based on graph coloring problem mapping. Digital human face rendering, ocean background vector field rendering, chart animation rendering and subtitle synthesis are executed in parallel on multiple GPUs according to batch number. Precise synchronization between batches is achieved through CUDA streaming concurrency mechanism. The multi-GPU rendering task conflict-free parallel scheduling algorithm executes the DSatur heuristic coloring algorithm on the task conflict graph. The set of vertices with the same color constitutes a conflict-free parallel batch. 
Calculate the lip consistency score, and adjust the number of denoising steps of the acoustic-visual collaborative cross-modal alignment diffusion network according to the lip consistency score by calling the dynamic step size adjustment function. Then encode each rendering result together with the speech using audio-video synchronous encoding, generate the digital human forecast video, and push it to the release queue. Monitor the arrival timestamps of differential forecast fields in the message queue and the completion timestamps of videos in the release queue, calculate the end-to-end latency, and, when the end-to-end latency exceeds the latency threshold, reduce the rendering resolution of the visualization microservice or the streamline density of the ocean background vector field rendering, so that the forecast video is published within the publishing window. The acoustic-visual collaborative cross-modal alignment diffusion network uses a variant of the denoising diffusion probabilistic model as its backbone, and its noise prediction network adopts a dual-stream U-Net architecture. The audio stream branch takes the Mel spectrum sequence as input, and the visual stream branch takes the digital human face 3D key point sequence as input. The two branches are fused at the U-Net bottleneck layer through a cross-modal cross-attention mechanism. Specifically, forward cross-attention is performed with the audio stream features as the Query matrix and the visual stream features as the Key and Value matrices. Simultaneously, reverse cross-attention is performed with the visual stream features as the Query matrix and the audio stream features as the Key and Value matrices. The two attention outputs are then added together and sent to the decoder.
The speaker identity embedding vector is mapped by a fully connected layer and then injected into the normalization parameters of each U-Net layer in the form of an affine transformation. The phoneme-aligned timestamps are embedded as a time-step conditioning vector, which is concatenated with the denoising step index and then input into the noise prediction network, so that the denoising iterations are synchronized with the phoneme boundaries. The dynamic image frames of the ocean background are processed by a lightweight convolutional encoder to extract background context features, which are then fused into the U-Net decoder by addition.
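As an illustration of the batching step recited in claim 1, the following is a minimal Python sketch of DSatur coloring applied to a task conflict graph. The task names and the conflict edges are hypothetical and chosen only for the example; the claims do not specify which rendering subtasks conflict, and GPU assignment and CUDA stream synchronization are outside the scope of the sketch.

```python
from collections import defaultdict


def dsatur_batches(tasks, conflicts):
    """Color the conflict graph with DSatur; each color is one parallel batch.

    tasks: list of task names (graph vertices).
    conflicts: iterable of (a, b) pairs that must not share a batch (edges).
    """
    adj = {t: set() for t in tasks}
    for a, b in conflicts:
        adj[a].add(b)
        adj[b].add(a)

    color = {}
    while len(color) < len(tasks):
        def key(t):
            # Saturation: number of distinct colors already used by neighbors.
            saturation = len({color[n] for n in adj[t] if n in color})
            return (saturation, len(adj[t]))  # ties broken by degree

        vertex = max((t for t in tasks if t not in color), key=key)
        used = {color[n] for n in adj[vertex] if n in color}
        c = 0
        while c in used:  # smallest color not used by any neighbor
            c += 1
        color[vertex] = c

    batches = defaultdict(list)
    for t in tasks:
        batches[color[t]].append(t)
    return [batches[c] for c in sorted(batches)]


# Hypothetical conflict graph over the four rendering subtasks.
tasks = ["face", "background", "chart", "subtitle"]
conflicts = [("face", "subtitle"), ("background", "chart"), ("chart", "subtitle")]
print(dsatur_batches(tasks, conflicts))
```

With these assumed conflicts, the sketch places face and chart rendering in one batch and background and subtitle rendering in the other, so no two conflicting subtasks run in the same batch.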
2. The method for automatically generating digital human videos for marine forecasting according to claim 1, characterized in that the differential forecast field is the grid-point-by-grid-point difference field between the current forecast field and the previous forecast field. Local re-rendering is triggered only in grid areas where the absolute value of the difference exceeds the change threshold. The change threshold is obtained by performing percentile statistical analysis on the difference distribution of historical forecast fields.
3. The method for automatically generating digital human videos for marine forecasting according to claim 2, characterized in that the change threshold is determined by performing percentile analysis on the absolute values of the differences at all grid points of the hourly difference fields over several consecutive days, and taking the 85th percentile value as the change threshold.
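The threshold selection and re-rendering trigger of claims 2 and 3 can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: fields are nested lists of floats, the percentile uses linear interpolation between order statistics (the claims do not state an interpolation rule), and the function names `percentile`, `change_threshold` and `rerender_mask` are hypothetical.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def change_threshold(historical_diff_fields, q=85.0):
    """Pool |difference| over every grid point of every historical hourly
    difference field and return the q-th percentile (85th per claim 3)."""
    pooled = [abs(v)
              for field in historical_diff_fields
              for row in field
              for v in row]
    return percentile(pooled, q)


def rerender_mask(current, previous, threshold):
    """True at grid points whose absolute difference exceeds the threshold;
    only these points would be locally re-rendered."""
    return [[abs(c - p) > threshold for c, p in zip(cur_row, prev_row)]
            for cur_row, prev_row in zip(current, previous)]


# Toy 2x2 fields standing in for hourly difference fields.
history = [[[0.0, 1.0], [2.0, 3.0]], [[-4.0, 5.0], [6.0, -7.0]]]
threshold = change_threshold(history)
mask = rerender_mask([[10.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 6.0]], threshold)
```

In a production setting the same logic would more likely run on array data, but the control flow would be unchanged: one offline pass fixes the threshold, and each new forecast field is compared against its predecessor to select the regions to redraw.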
4. The method for automatically generating digital human videos for marine forecasting according to claim 3, characterized in that the message queue is implemented using a distributed stream processing middleware based on the publish-subscribe pattern. Each microservice acts as an independent subscriber and asynchronously consumes the forecast field data in the message queue, thereby decoupling the serial process into a parallel pipeline.
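The publish-subscribe decoupling of claim 4 can be illustrated with a small in-process stand-in. The sketch below uses `asyncio` queues rather than a distributed middleware, which the claim does not name; the `Broker` class, the topic name `diff_field` and the subscriber names are hypothetical, and `None` is used as an assumed shutdown sentinel.

```python
import asyncio


class Broker:
    """Toy in-process publish-subscribe broker: one private queue per subscriber."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, topic):
        inbox = asyncio.Queue()
        self._subscribers.setdefault(topic, []).append(inbox)
        return inbox

    async def publish(self, topic, message):
        # Every subscriber receives its own copy of the message.
        for inbox in self._subscribers.get(topic, []):
            await inbox.put(message)


async def microservice(name, inbox, results):
    """Consume messages independently until the None sentinel arrives."""
    while True:
        message = await inbox.get()
        if message is None:
            break
        results.append((name, message))


async def demo():
    broker = Broker()
    results = []
    workers = [
        asyncio.create_task(microservice(name, broker.subscribe("diff_field"), results))
        for name in ("visualization", "text_generation")
    ]
    await broker.publish("diff_field", "field_t0")
    await broker.publish("diff_field", None)  # shut both subscribers down
    await asyncio.gather(*workers)
    return results


if __name__ == "__main__":
    print(sorted(asyncio.run(demo())))
```

Because each subscriber owns its queue, the visualization and text generation services process the same differential forecast field without waiting on each other, which is the property the claim relies on.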
5. The method for automatically generating digital human videos for marine forecasting according to claim 4, characterized in that the pre-rendered keyframe template is a background layer pre-generated offline based on the typical spatial distribution of the ocean element field. During the real-time processing stage, only the local area corresponding to the differential forecast field is incrementally overlaid and rendered.
6. The method for automatically generating digital human videos for marine forecasting according to claim 5, characterized in that the acoustic-visual collaborative cross-modal alignment diffusion network introduces a topological regularization layer that imposes a non-degeneracy constraint on the triangular mesh formed by the facial key points.
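Claim 6 does not state how the non-degeneracy constraint is computed. One plausible form, shown here purely as an assumption, is a hinge penalty on triangle area: any mesh triangle whose area falls below a small floor contributes to a regularization term. The function names, the `min_area` value and the sample mesh are hypothetical.

```python
import math


def triangle_area(p, q, r):
    """Area of the 3D triangle (p, q, r), computed from the cross product."""
    ux, uy, uz = (q[i] - p[i] for i in range(3))
    vx, vy, vz = (r[i] - p[i] for i in range(3))
    cx = uy * vz - uz * vy
    cy = uz * vx - ux * vz
    cz = ux * vy - uy * vx
    return 0.5 * math.sqrt(cx * cx + cy * cy + cz * cz)


def degeneracy_penalty(points, triangles, min_area=1e-3):
    """Sum of hinge penalties max(0, min_area - area) over all triangles.

    points: list of (x, y, z) facial key points.
    triangles: list of (i, j, k) index triples into points.
    """
    return sum(
        max(0.0, min_area - triangle_area(points[i], points[j], points[k]))
        for i, j, k in triangles
    )


# Toy mesh: the second triangle is collinear and therefore degenerate.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (2.0, 0.0, 0.0)]
penalty = degeneracy_penalty(points, [(0, 1, 2), (0, 1, 3)])
```

A well-formed triangle contributes nothing, while a collapsed one contributes up to `min_area`. An area-based penalty detects collapsed triangles but not flipped ones, so an actual implementation might add an orientation check.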
7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores program instructions which, when executed on a computer, perform the method for automatically generating digital human videos for marine forecasting according to any one of claims 1-6.
8. A system for automatically generating digital human videos for marine forecasting, characterized in that the system comprises the computer-readable storage medium of claim 7, wherein the system is a computer, the computer-readable storage medium is disposed within the system, and the system is provided with a microprocessor that executes the program instructions stored in the computer-readable storage medium.