A Method and System for Audio Spectrum Fluidized Interactive Presentation Based on Shader Computation Power
By employing neurophysical evolution based on shader computing power and a competitive rendering method, the shortcomings of existing technologies in converting audio information into visual representations are addressed. This achieves a high degree of integration between audio and user interaction, as well as efficient rendering, generating vibrant and dynamic visual effects.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHENGDU LIBI TECH CO LTD
- Filing Date
- 2026-04-17
- Publication Date
- 2026-06-30
Figure CN122050418B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of image processing technology, and in particular to an audio spectrum fluidized interactive presentation method and system based on shader computing power.
Background Technology
[0002] Real-time graphics rendering and multimedia interactive technologies play a crucial role in fields such as virtual reality, gaming, and digital art. In these applications, effectively transforming abstract audio information into concrete, dynamic visual representations and achieving natural, fluid interaction with user behavior is a key challenge in enhancing user experience. Existing graphics processing units (GPUs), with their powerful parallel computing capabilities, provide the hardware foundation for real-time complex physics simulations and high-quality rendering, prompting the emergence of many innovative algorithms, especially in fluid simulation and data visualization. By performing calculations within shaders, processing speed can be significantly improved.
[0003] Existing technologies typically employ rule-based or simplified physical models to visualize the audio spectrum. These methods extract simple features of the audio and map them onto changes in the scale, color, or position of visual elements. Meanwhile, user interaction is often achieved by directly altering particle properties or triggering animations within pre-defined scenes. For example, some physics-based fluid simulation systems use the Navier-Stokes equations or their simplified forms to simulate fluid motion and leverage compute shaders to accelerate calculations, feeding audio features directly into the simulation as force fields or density sources.
[0004] However, existing technical solutions have several shortcomings. First, mapping methods based on simple rules often fail to capture the complex dynamics and semantic information of audio, resulting in a monotonous visual presentation lacking depth and artistic appeal. Second, traditional physical simulation methods, even with GPU acceleration, still require a trade-off between computational accuracy and real-time performance, and easily sacrifice real-time response or visual detail, especially when dealing with complex interactions and subtle fluid behaviors. Furthermore, existing systems often lack efficient, nonlinear coupling mechanisms for integrating multimodal data, so that the influence of audio and interaction on the visuals is unnatural and not unified, making it difficult to create truly "fluid" and vibrant dynamic visual effects.
Summary of the Invention
[0005] To address the aforementioned issues, this invention provides an audio spectrum fluidized interactive presentation method and system based on shader computing power. Employing neurophysical evolution and competitive rendering based on shader computing power, it enables nonlinear coupling between multidimensional audio features and user interaction to drive fluid evolution, and performs multi-semantic layer competitive visual mapping, providing real-time, high-quality, and highly integrated audiovisual interactive presentation.
[0006] To achieve the above objectives, this application adopts the following technical solution:
[0007] In a first aspect, a method for audio spectrum fluidized interactive presentation based on shader computing power is provided, comprising: acquiring audio streams and user interaction data in real time, wherein the audio streams are used to extract multi-dimensional audio features, and the user interaction data are used to generate an interaction potential energy gradient map; in the parallel computing shader of a graphics processing unit (GPU), processing the multi-dimensional audio features and the interaction potential energy gradient map through nonlinear coupling operations to generate a total dynamic field of audio interaction that covers the simulation space and is spatiotemporally unified; in the fragment shader of the GPU, inputting the total dynamic field of audio interaction as a driving signal to a preset neural network model for dynamically modulating fluid physical behavior, and using the neural network model to perform forward inference on the fluid state texture to complete the neurophysical evolution of the fluid and generate the evolved fluid state, wherein the fluid state texture includes a velocity field and a density gradient field; in the GPU rendering pipeline, semantically separating the multi-dimensional audio features to obtain several audio semantic layers, and, based on the evolved fluid state, performing competitive visual attribute mapping on each audio semantic layer to generate the final rendered image; and, through the GPU's asynchronous computing engine, organizing the processes of generating the total dynamic field of audio interaction, performing the neurophysical evolution, and generating the final rendered image into several parallel and asynchronously executed processing pipelines that are scheduled and output.
[0008] Based on the above technical solution, the audio spectrum fluidized interactive presentation method based on shader computing power provided in this application adopts neurophysical evolution and competitive rendering based on shader computing power, which can realize the nonlinear coupling between audio multidimensional features and user interaction to drive fluid evolution, and perform multi-semantic layer competitive visual mapping to provide real-time, high-quality, and highly integrated audiovisual interactive presentation.
[0009] In conjunction with the first aspect above, in one possible implementation, generating the total dynamic field of audio interaction that covers the simulation space and is spatiotemporally unified through nonlinear coupling operations includes: mapping the multidimensional audio features to a frequency-energy space and mapping the interaction potential energy gradient map to a spatial coordinate-force space; performing tensor product fusion operations on the data in the frequency-energy space and the data in the spatial coordinate-force space using a nonlinear function for fusing multidimensional data to generate an initial coupling field; identifying, in the initial coupling field, the field vortex core parameters determined by the current audio features and the interaction focus; and performing structured enhancement on the initial coupling field based on the field vortex core parameters to generate the total dynamic field of audio interaction.
[0010] In conjunction with the first aspect above, in one possible implementation, the nonlinear function for fusing multidimensional data includes: receiving the instantaneous timbre texture vector from the multidimensional audio features and the potential gradient values from the interaction potential energy gradient map; calculating the attention weights of the instantaneous timbre texture vector at different spatial locations through an attention mechanism network; and using the attention weights to perform weighted fusion and nonlinear transformation on the potential gradient values to form the basic data of the initial coupling field.
[0011] In conjunction with the first aspect above, in one possible implementation, using the neural network model to perform forward inference on the fluid state texture to complete the neurophysical evolution of the fluid and generate the evolved fluid state includes the following. The neural network model is a preset physics-aware neural operator network whose internal structure includes cascaded feature encoding layers, spatiotemporal convolutional evolution layers, and physical consistency constraint layers, wherein the spatiotemporal convolutional evolution layers extract the perturbation features of the total dynamic field of audio interaction in the spatial dimension and perform temporal state transition calculations in conjunction with the fluid state texture of the previous frame. The fluid state texture of the current frame is read, and the total dynamic field of audio interaction and the fluid state texture are input into the neural network model; within the neural network model, multi-level feature extraction and state transition calculations modulated by the total dynamic field of audio interaction are performed; and the updated velocity field and density gradient field are output and written into a new fluid state texture as the evolved fluid state.
[0012] In conjunction with the first aspect above, in one possible implementation, the completion of the neurophysical evolution of the fluid further includes implementing an entropy stabilization compensation loop based on visual perception importance: analyzing the evolved fluid state and calculating a perception importance map that identifies key visual regions and potentially unstable regions; feeding the perception importance map back to the audio interaction total dynamic field generation process of the next calculation cycle; and when generating the audio interaction total dynamic field of the next frame, directionally correcting the interaction potential energy gradient map based on the perception importance map, wherein the directional correction is manifested as the injection of virtual mass or energy into the regions identified by the map.
[0013] In conjunction with the first aspect above, in one possible implementation, the calculation of the perceptual importance map identifying key visual regions and potentially unstable regions includes: extracting fluid velocity field and density gradient field information from the evolved fluid state; calculating the spatial distance weight between each pixel position and the field vortex core by combining the field vortex core parameters in the total dynamic field of the audio interaction in the current frame; combining the fluid velocity field, density gradient field, and spatial distance weights, calculating the importance score of each pixel position through a function used to evaluate visual perception importance, and generating a perceptual importance map based on the importance score.
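The importance scoring described above can be illustrated with a minimal sketch. The source does not specify the scoring function, so everything here is an assumption made for illustration: the function name `importance_map`, the Gaussian distance weighting around a single vortex core, the max-normalization of each term, and the mixing weights `(0.5, 0.3, 0.2)`.

```python
import numpy as np

def importance_map(velocity, density_grad, core_xy, core_radius,
                   weights=(0.5, 0.3, 0.2)):
    """Per-pixel importance score from an evolved fluid state (illustrative only).

    velocity, density_grad: arrays of shape (H, W, 2).
    core_xy: (x, y) pixel position of one field vortex core.
    core_radius: core radius in pixels, used as the distance falloff scale.
    weights: assumed mixing weights for speed, gradient, and distance terms.
    """
    h, w = velocity.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]

    # Spatial distance weight: 1 at the vortex core, decaying with distance.
    dist = np.hypot(xs - core_xy[0], ys - core_xy[1])
    dist_w = np.exp(-(dist / core_radius) ** 2)

    speed = np.linalg.norm(velocity, axis=-1)
    grad = np.linalg.norm(density_grad, axis=-1)

    def norm01(a):
        # Scale to [0, 1]; an all-zero field is returned unchanged.
        m = a.max()
        return a / m if m > 0 else a

    a, b, c = weights
    return a * norm01(speed) + b * norm01(grad) + c * dist_w
```

With weights that sum to 1, the score stays in [0, 1], so a fixed threshold could be used to select the regions that receive the directional correction in the next frame.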
[0014] In conjunction with the first aspect above, in one possible implementation, performing competitive visual attribute mapping for each of the audio semantic layers includes: assigning an independent visual representation channel to each of the separated audio semantic layers, with each visual representation channel associated with a set of visual attribute influencing factors; for each pixel on the screen, obtaining the instantaneous energy intensity of each audio semantic layer at the evolved fluid state corresponding to the current pixel; based on the instantaneous energy intensity, competitively allocating rendering resources among the visual representation channels using a nonlinear function used to determine the allocation of rendering resources, and determining the channel that dominates the final visual attribute of the current pixel; and, based on the competition result, mixing the visual contributions of each channel to synthesize the final color and optical attributes of the pixel.
[0015] In conjunction with the first aspect above, in one possible implementation, the competitive allocation of rendering resources among the visual representation channels using a nonlinear function to determine the allocation of rendering resources includes: inputting the instantaneous energy intensity of each audio semantic layer at the current pixel into a normalized exponential function to calculate the competitive weight of each visual representation channel; weighting the visual attribute influence factors associated with each channel according to the competitive weights; and superimposing the weighted visual attribute influence factors of each channel, the superposition result defining the final visual attribute of the pixel.
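The normalized exponential function named above is the softmax. A minimal per-pixel sketch follows, assuming each channel's visual attribute influence factor is simply an RGB color; the `temperature` parameter and the function name `competitive_blend` are illustrative additions, not part of the source.

```python
import numpy as np

def competitive_blend(energies, channel_colors, temperature=1.0):
    """Softmax competition between visual representation channels.

    energies: (H, W, K) instantaneous energy of K audio semantic layers per pixel.
    channel_colors: (K, 3) assumed influence factor (RGB) for each channel.
    Returns (weights (H, W, K), dominant channel index (H, W), color (H, W, 3)).
    """
    z = energies / temperature
    z = z - z.max(axis=-1, keepdims=True)      # subtract max for numerical stability
    w = np.exp(z)
    w /= w.sum(axis=-1, keepdims=True)         # competitive weights sum to 1 per pixel
    dominant = w.argmax(axis=-1)               # channel that dominates each pixel
    color = w @ channel_colors                 # weighted superposition of channel factors
    return w, dominant, color
```

A lower `temperature` sharpens the competition toward winner-takes-all, while a higher value blends the channels more evenly.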
[0016] In conjunction with the first aspect above, in one possible implementation, the process of generating the total dynamic field of audio interaction, performing the neurophysical evolution, and generating the final rendered image through the asynchronous computing engine of the GPU is organized into several parallel and asynchronously executed processing pipelines, and scheduled and output. This includes: encapsulating the logic for generating the total dynamic field of audio interaction into a first asynchronous computing pipeline; encapsulating the logic for performing the neurophysical evolution into a second graphics rendering pipeline; encapsulating the logic for generating the final rendered image into a third graphics rendering pipeline; setting different execution cycles for the first asynchronous computing pipeline, the second graphics rendering pipeline, and the third graphics rendering pipeline, and using a phase-shifting clock driving strategy for scheduling; configuring a dual or triple buffering mechanism for the fluid state texture, so that the write operation of the second graphics rendering pipeline and the read operation of the third graphics rendering pipeline are performed asynchronously on different buffers.
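The scheduling idea can be sketched on the CPU as a tick-based simulation. This is only a model of the data flow, not GPU code: the source does not define the phase-shifting clock strategy, so the `(period, phase)` pairs, the `TripleBuffer` class, and the `run` function below are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class TripleBuffer:
    """Three slots: the writer fills `back` while the reader sees `front`."""
    slots: list = field(default_factory=lambda: [None, None, None])
    front: int = 0
    back: int = 1

    def write(self, state):
        self.slots[self.back] = state
        # Publish the freshly written slot, then advance to the next slot.
        self.front, self.back = self.back, (self.back + 1) % 3

    def read(self):
        return self.slots[self.front]

def due(tick, period, phase):
    """True when a pipeline with this period and phase offset fires on `tick`."""
    return tick >= phase and (tick - phase) % period == 0

def run(ticks, field_cfg=(2, 0), evolve_cfg=(1, 1), render_cfg=(1, 2)):
    """Simulate the three pipelines; each cfg is an assumed (period, phase) pair.

    Returns (render_tick, fluid_tick, field_tick) for every rendered frame.
    """
    buf, field_tick, frames = TripleBuffer(), None, []
    for t in range(ticks):
        # The renderer reads the last published fluid state.
        if due(t, *render_cfg) and buf.read() is not None:
            frames.append((t,) + buf.read())
        # The evolution pipeline writes into a different slot.
        if due(t, *evolve_cfg) and field_tick is not None:
            buf.write((t, field_tick))
        # The dynamic-field pipeline runs at its own, slower cadence.
        if due(t, *field_cfg):
            field_tick = t
    return frames
```

In this model the renderer always consumes a fluid state published on an earlier tick, and the slot being written is never the slot being read, which is the property the multi-buffering mechanism is meant to provide.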
[0017] In a second aspect, an audio spectrum fluidized interactive presentation system based on shader computing power is provided, comprising: a feature acquisition module for acquiring audio streams and user interaction data in real time, wherein the audio streams are used to extract multi-dimensional audio features, and the user interaction data are used to generate an interaction potential energy gradient map; a total dynamic field generation module for processing the multi-dimensional audio features and the interaction potential energy gradient map through nonlinear coupling operations in the parallel computing shader of the graphics processing unit (GPU) to generate a total dynamic field of audio interaction that covers the simulation space and is spatiotemporally unified; a neurophysical evolution module for inputting, in the fragment shader of the GPU, the total dynamic field of audio interaction as a driving signal into a preset neural network model for dynamically modulating fluid physical behavior, and using the neural network model to perform forward inference on the fluid state texture to complete the neurophysical evolution of the fluid and generate the evolved fluid state, wherein the fluid state texture includes a velocity field and a density gradient field; a competitive rendering module for performing semantic separation on the multi-dimensional audio features in the GPU rendering pipeline to obtain several audio semantic layers, and, based on the evolved fluid state, performing competitive visual attribute mapping for each audio semantic layer to generate the final rendered image; and an asynchronous scheduling engine module for organizing, through the asynchronous computing engine of the GPU, the processes of generating the total dynamic field of audio interaction, performing the neurophysical evolution, and generating the final rendered image into several parallel and asynchronous processing pipelines, and scheduling and outputting them.
[0018] Compared with the prior art, the present invention has the following advantages:
[0019] This invention improves the real-time performance and computational efficiency of audio-driven visual presentation by integrating multiple computational stages, including audio-interactive dynamic field generation, neurophysical evolution, and competitive rendering, onto a GPU, supplemented by an asynchronous scheduling mechanism. This method effectively utilizes the parallel computing capabilities of the GPU, ensuring complex fluid physics simulations and dynamic rendering at high frame rates, providing users with a smooth and highly responsive interactive experience.
[0020] This invention utilizes nonlinear coupling computation to deeply fuse multidimensional audio features with user interaction data, generating a spatiotemporally unified audio interaction dynamic field, which is then input as a driving signal into a neurophysical model. This innovative coupling method enables the dynamic behavior of fluids to respond precisely to subtle changes in audio and real-time user actions, avoiding the disconnect between audio and visual representation found in traditional methods, and enhancing the richness and expressiveness of the visual effects.
[0021] The competitive visual attribute mapping mechanism introduced in this invention allows different audio elements to participate in the fluid presentation with their own independent visual styles, and dynamically determines the final visual attributes of pixels through a competitive mechanism. This multi-semantic-level visual encoding not only gives the final rendered image higher artistic value and information density, but also enhances the user's perception of changes in audio content, achieving a more delicate and multi-layered audiovisual fusion.
[0022] It should be understood that the descriptions of technical features, technical solutions, beneficial effects, or similar language in this application do not imply that all features and advantages can be achieved in any single embodiment. Rather, the description of a feature or beneficial effect means that a specific technical feature, technical solution, or beneficial effect is included in at least one embodiment. Therefore, the descriptions of technical features, technical solutions, or beneficial effects in this specification do not necessarily refer to the same embodiment. Furthermore, the technical features, technical solutions, and beneficial effects described in the embodiments can be combined in any suitable manner. Those skilled in the art will understand that an embodiment can be implemented without one or more of the specific technical features, technical solutions, or beneficial effects of a particular embodiment. In other embodiments, additional technical features and beneficial effects may be identified that are not present in all embodiments.
Brief Description of the Drawings
[0023] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0024] Figure 1 is a structural architecture diagram of the audio spectrum fluidized interactive presentation system based on shader computing power provided in the embodiments of this application;
[0025] Figure 2 is a flowchart of the audio spectrum fluidized interactive presentation method based on shader computing power provided in the embodiments of this application;
[0026] Figure 3 is a schematic diagram of the perceptual importance map provided in the embodiments of this application;
[0027] Figure 4 is a competitive weight curve diagram of the different audio semantic layers provided in the embodiments of this application.
Detailed Implementation
[0028] It should be noted that, in this application, the terms "exemplary" or "for example" are used to indicate that something is being described as an example, instance, or illustration. Any embodiment or design described as "exemplary" or "for example" in this application should not be construed as being more preferred or advantageous than other embodiments or designs. Specifically, the use of terms such as "exemplary" or "for example" is intended to present the relevant concepts in a concrete manner.
[0029] The audio spectrum fluidized interactive presentation method based on shader computing power provided in the embodiments of this application can be applied to, for example, the audio spectrum fluidized interactive presentation system 100 based on shader computing power shown in Figure 1. As shown in Figure 1, the system includes:
[0030] The feature acquisition module is used to acquire audio streams and user interaction data in real time. The audio streams are used to extract multi-dimensional audio features, and the user interaction data is used to generate an interaction potential gradient map.
[0031] The total dynamic field generation module is used to process the audio multidimensional features and the interaction potential energy gradient map through nonlinear coupling operations in the parallel computing shader of the graphics processing unit (GPU), generating a total dynamic field of audio interaction that covers the simulation space and is spatiotemporally unified.
[0032] The neurophysical evolution module is used, in the fragment shader of the GPU, to input the total dynamic field of audio interaction as a driving signal to a preset neural network model for dynamically modulating the physical behavior of fluids, and to use the neural network model to perform forward inference on the fluid state texture to complete the neurophysical evolution of the fluid and generate the evolved fluid state, wherein the fluid state texture includes a velocity field and a density gradient field.
[0033] The competitive rendering module is used to perform semantic separation on the audio multidimensional features in the GPU rendering pipeline to obtain several audio semantic layers, and to perform competitive visual attribute mapping for each audio semantic layer based on the evolved fluid state to generate the final rendered image.
[0034] The asynchronous scheduling engine module is used to organize the processes of generating the total dynamic field of the audio interaction, performing the neurophysical evolution, and generating the final rendered image into several parallel and asynchronous processing pipelines through the asynchronous computing engine of the GPU, and to schedule and output them.
[0035] As shown in Figure 2, the embodiments of this application provide an audio spectrum fluidized interactive presentation method based on shader computing power, including:
[0036] Real-time acquisition of audio streams and user interaction data, wherein the audio streams are used to extract multi-dimensional audio features and the user interaction data are used to generate an interaction potential gradient map;
[0037] In the parallel computing shader of the graphics processing unit (GPU), the audio multidimensional features and interaction potential gradient map are processed through nonlinear coupling operations to generate a total dynamic field of audio interaction that covers the simulation space and is unified in time and space.
[0038] In the fragment shader of the GPU, the total dynamic field of audio interaction is used as a driving signal and input to a preset neural network model for dynamically modulating the physical behavior of fluids. The neural network model is then used to perform forward inference on the fluid state texture to complete the neurophysical evolution of the fluid and generate the evolved fluid state. The fluid state texture includes a velocity field and a density gradient field.
[0039] In the GPU rendering pipeline, the audio multidimensional features are semantically separated to obtain several audio semantic layers, and based on the evolved fluid state, competitive visual attribute mapping is performed on each audio semantic layer to generate the final rendered image.
[0040] The asynchronous computing engine of the GPU organizes the processes of generating the total dynamic field of the audio interaction, performing the neurophysical evolution, and generating the final rendered image into several parallel and asynchronous processing pipelines, which are then scheduled and output.
[0041] It is worth noting that a highly optimized multiphysics coupling and neurophysical evolution system is constructed using the GPU's parallel computing shaders to achieve a fluidized interactive presentation of the audio spectrum. Going beyond traditional physical simulation methods, this approach fuses multidimensional audio features and user interaction data through nonlinear coupling operations to generate a total dynamic field of audio interaction that drives fluid behavior. This dynamic field then serves as input to a preset neural network model, which performs forward inference in the fragment shader to achieve rapid and realistic evolution of the fluid state. By combining semantically separated audio features with the evolved fluid state, a competitive visual attribute mapping mechanism allocates visual resources to the different audio semantic layers in the rendering pipeline. The entire process is scheduled by the GPU's asynchronous computing engine, which organizes each processing stage into parallel and asynchronous processing pipelines to maximize the utilization of hardware computing power.
[0042] In one possible implementation of the embodiments of this application, with reference to Figure 2, generating the total dynamic field of audio interaction that covers the simulation space and is spatiotemporally unified through nonlinear coupling operations includes:
[0043] The audio multidimensional features are mapped to the frequency-energy space, and the interactive potential energy gradient map is mapped to the spatial coordinate-force space.
[0044] The initial coupling field is generated by performing tensor product fusion operation on the data in the frequency-energy space and the data in the spatial coordinate-force space through a nonlinear function used to fuse multi-dimensional data.
[0045] In the initial coupling field, the field vortex core parameters determined by the current audio features and the interaction focus are identified;
[0046] Based on the field vortex core parameters, the initial coupling field is structurally enhanced to generate the total dynamic field of audio interaction.
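The two mappings in the first step can be sketched as follows. This is a minimal illustration under stated assumptions: the band edges, the Hann window, the Gaussian falloff around the pointer, and the function names `band_energies` and `interaction_gradient_map` are all hypothetical choices, since the source only states that the audio is mapped to a frequency-energy space and the interaction to a spatial coordinate-force space.

```python
import numpy as np

def band_energies(samples, sample_rate, edges_hz):
    """Map an audio window to a 1-D vector of per-band spectral energy."""
    windowed = samples * np.hanning(len(samples))           # reduce spectral leakage
    power = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    return np.array([power[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in zip(edges_hz[:-1], edges_hz[1:])])

def interaction_gradient_map(shape, pointer_xy, pointer_velocity, sigma=20.0):
    """Build an (H, W, 2) texture of virtual force vectors around the pointer.

    The force direction follows the pointer velocity; its magnitude decays
    with an assumed Gaussian falloff of width `sigma` pixels.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (xs - pointer_xy[0]) ** 2 + (ys - pointer_xy[1]) ** 2
    falloff = np.exp(-d2 / (2.0 * sigma ** 2))
    return falloff[..., None] * np.asarray(pointer_velocity, dtype=float)
```

For instance, a 0.2-second window at 44.1 kHz gives 8820 samples and a frequency resolution of 5 Hz, which is fine enough to separate a 20-200 Hz low band from the mid and high bands.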
[0047] In some implementations, a standardized mapping of the data space is performed first. For the multidimensional audio features, the audio stream samples from the most recent 0.1 to 0.5 seconds are converted into a frequency-domain representation using a Fast Fourier Transform (FFT), generating a one-dimensional vector $A$ containing energy values across several frequency ranges. This vector forms the data foundation of the frequency-energy space. For the user interaction data, the screen-space coordinates and velocity vectors of user input devices, such as mice or touchscreens, are converted into a two-dimensional texture covering the entire simulation space, i.e., the interaction potential gradient map. Each pixel in this texture stores a two-dimensional vector representing the direction and magnitude of the virtual force experienced at that point due to user interaction, forming the data of the spatial coordinate-force space.
Next, the tensor product fusion operation is performed through a nonlinear function to generate the initial coupling field, so that the global features of the audio are distributed into the local spatial influence of the interaction. This is implemented using a nonlinear transformation function that receives the frequency-energy space data $A$ and the spatial coordinate-force space data $F_{\text{int}}$ as input. At each coordinate point $p$ in the simulation space, the value of the initial coupling field $C$ at point $p$ is calculated using the following formula:
$$C(p) = M(A)\,F_{\text{int}}(p)$$
where $p$ is a two-dimensional or three-dimensional coordinate point in the simulation space, $F_{\text{int}}(p)$ is the force vector of the interaction potential gradient map at point $p$, and $M(A)$ is a transformation matrix whose elements are calculated from the multidimensional audio feature vector $A$ using a set of pre-defined nonlinear functions. For example, the rotation component of the matrix can be associated with the mid-to-high frequency energy of the audio, while the scaling component can be associated with the low-frequency energy or total loudness of the audio. This tensor-product-like operation ensures that the audio features are not simply superimposed on the interaction forces, but rather that the direction and magnitude of the interaction forces are structurally modulated, thus forming a complex initial coupling field.
The field vortex core parameters are then identified in the generated initial coupling field. Key structural information that can dominate macroscopic fluid dynamics, such as visually significant behaviors like rotation and curling, is extracted from the coupling result. This is achieved by calculating the curl field of the initial coupling field, i.e., $\nabla \times C$, and identifying the local extrema within the curl field. The locations, curl intensities, and radii of influence of these extrema are parameterized into a set of field vortex core parameters, including vortex core coordinates, circulation intensity, and core radius. For example, a sharp, high-frequency percussion sound combined with a rapid user swipe interaction might generate a series of high-intensity, small-radius vortex cores along the interaction trajectory.
Based on the extracted field vortex core parameters, the initial coupling field is structurally enhanced to generate the final total dynamic field of audio interaction. The visual dynamic core, jointly determined by the audio and the interaction, is amplified and stabilized, resulting in a more explicit and expressive effect in the fluid evolution. Each identified set of field vortex core parameters is substituted into an analytical vortex model, such as the Rankine vortex model, to generate an idealized vortex force field. Then, all the generated idealized vortex force fields are weighted and superimposed onto the initial coupling field $C$ to obtain the total dynamic field of audio interaction $D$. The calculation formula is:
$$D(p) = C(p) + \sum_{i} w_i\,V_i(p)$$
where $V_i(p)$ is the value at point $p$ of the analytical vortex force field generated by the $i$-th field vortex core, and $w_i$ is the corresponding weighting coefficient, which is typically positively correlated with the energy of the vortex core, i.e., the curl intensity, and with the overall instantaneous energy of the audio, and controls the magnitude of the enhancement. Through this series of steps, audio features and user interaction are dynamically integrated into a spatiotemporally unified dynamic field, providing a rich and controllable driving source for the subsequent fluid visual presentation.
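The coupling, curl, and vortex enhancement steps of this paragraph can be sketched in two dimensions. Several things here are assumptions made for illustration: the transformation matrix is taken to be a uniform scaling composed with a planar rotation, the vortex cores are passed in by the caller rather than detected from the curl extrema, and the function names are hypothetical. The Rankine profile itself is the standard one: solid-body rotation inside the core and 1/r decay outside.

```python
import numpy as np

def coupling_matrix(scale, angle_rad):
    """Assumed form of the audio-driven matrix: scaling times 2-D rotation."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return scale * np.array([[c, -s], [s, c]])

def curl_2d(vec_field):
    """Scalar curl dFy/dx - dFx/dy of an (H, W, 2) field on a unit pixel grid."""
    dfy_dx = np.gradient(vec_field[..., 1], axis=1)
    dfx_dy = np.gradient(vec_field[..., 0], axis=0)
    return dfy_dx - dfx_dy

def rankine_vortex(shape, center_xy, circulation, core_radius):
    """Idealized Rankine vortex force field of shape (H, W, 2)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    dx, dy = xs - center_xy[0], ys - center_xy[1]
    r = np.hypot(dx, dy)
    r_safe = np.where(r > 0, r, 1.0)                     # avoid division by zero at the core
    v_theta = np.where(r <= core_radius,
                       circulation * r / (2 * np.pi * core_radius ** 2),
                       circulation / (2 * np.pi * r_safe))
    # Tangential unit direction is (-dy, dx) / r.
    return np.stack([-dy / r_safe * v_theta, dx / r_safe * v_theta], axis=-1)

def total_dynamic_field(force_map, scale, angle_rad, cores):
    """Return (initial coupling field, enhanced total field).

    cores: iterable of (center_xy, circulation, core_radius, weight) tuples.
    """
    m = coupling_matrix(scale, angle_rad)
    coupled = force_map @ m.T                            # applies the matrix at every pixel
    total = coupled.copy()
    for center, circulation, radius, weight in cores:
        total += weight * rankine_vortex(force_map.shape[:2], center, circulation, radius)
    return coupled, total
```

Inside the core the Rankine field has constant vorticity equal to the circulation divided by the core area, which is why the curl field is a natural place to look for vortex cores in the first place.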
[0048] For example, when generating the total dynamic field of audio interaction, a 0.2-second audio stream with a sampling rate of 44.1 kHz is first transformed into a one-dimensional vector in the frequency-energy space using an FFT. If the current audio has strong energy in the low-frequency range (20-200 Hz), the scaling component of the extracted transformation matrix $M(A)$ is calculated to be 1.5, while its rotation component is set to 30° based on the mid-to-high frequency energy. At a coordinate point $p$ in the simulation space, if a quick user swipe generates the interaction potential energy gradient vector $F_{\text{int}}(p) = (2.0, 0)$, then the initial coupling field calculated as $C(p) = M(A)\,F_{\text{int}}(p)$ is approximately (2.6, 1.5). The curl field of this initial coupling field is then calculated, and the field vortex core located at its center is identified; assume the identified circulation intensity is 10.0 and the influence radius is 50 pixels. These parameters are substituted into the Rankine vortex model to generate an idealized force field. If the analytical vortex force value at point $p$ is (0.5, 0.5) and the corresponding weighting coefficient is set to 0.8 based on the instantaneous audio energy, the final total dynamic field of audio interaction at $p$ is $C(p) + 0.8 \times (0.5, 0.5) \approx (3.0, 1.9)$. This computation enables structured modulation and enhancement of the interaction forces based on audio features.
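The arithmetic of this example can be checked by hand, assuming the transformation matrix is a uniform scaling by 1.5 composed with a planar rotation by 30°. This form is an interpretation that reproduces the stated value (2.6, 1.5), not one the source confirms, and the symbols $M$, $C$, and $D$ are notation chosen here for the transformation matrix, the initial coupling field, and the total dynamic field.

```latex
M = 1.5
\begin{pmatrix} \cos 30^\circ & -\sin 30^\circ \\ \sin 30^\circ & \cos 30^\circ \end{pmatrix}
\approx
\begin{pmatrix} 1.299 & -0.750 \\ 0.750 & 1.299 \end{pmatrix},
\qquad
C = M \begin{pmatrix} 2.0 \\ 0 \end{pmatrix}
\approx
\begin{pmatrix} 2.598 \\ 1.500 \end{pmatrix}

D = C + 0.8 \begin{pmatrix} 0.5 \\ 0.5 \end{pmatrix}
\approx
\begin{pmatrix} 2.998 \\ 1.900 \end{pmatrix}
```

The swipe force is both lengthened by the low-frequency scaling and turned by 30° toward the positive second axis, and the vortex term then adds a small, audio-weighted offset on top.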
[0049] In one possible implementation, with reference to Figure 2, the nonlinear function used for fusing multi-dimensional data includes:
[0050] Receive the instantaneous timbre texture vector from the audio multidimensional features, and the potential gradient values from the interactive potential gradient map;
[0051] The attention weights of the instantaneous timbre texture vector at different spatial locations are calculated using an attention mechanism network.
[0052] The potential gradient values are weighted, fused, and nonlinearly transformed using the attention weights to form the basic data of the initial coupled field.
[0053] In some implementations, two key inputs are acquired in parallel. The first is the instantaneous timbre texture vector $T$ from the audio multidimensional features. This vector is the result of advanced acoustic feature extraction over a short time window of the audio signal, typically 20 to 50 milliseconds, using features such as Mel-frequency cepstral coefficients (MFCCs) or spectral centroids. It forms a feature vector with 12 to 40 dimensions that describes timbre attributes such as the brightness and roughness of the sound. The second is the interactive potential gradient map, a two-dimensional or three-dimensional texture in which each pixel $p$ stores a potential energy gradient value $\nabla\Phi(p)$ representing a virtual force, i.e., a two-dimensional or three-dimensional vector. This vector is generated directly from user interaction behaviors, such as the speed and direction of mouse dragging. For each point $p$ in the simulation space, a scalar attention weight $A(p)$ is calculated through a pre-defined attention mechanism network. In engineering practice, this network is typically a lightweight multilayer perceptron whose inputs are the instantaneous timbre texture vector $T$ and the spatial coordinates of the current point $p$. The role of the network is to learn or pre-define a mapping that determines how strongly a sound with specific timbre characteristics should influence different locations in space. For example, a crisp, high-frequency timbre might be assigned a higher attention weight near the user's interaction point, while a dull, low-frequency timbre might have a more even weight across the entire space.
The attention weight $A(p)$ at point $p$ can be expressed as: $A(p) = f_{att}(T, p) = \sigma(W\,[T; p] + b)$, where $f_{att}$ is the function mapping of the attention mechanism network; $T$ is the instantaneous timbre texture vector; $p$ is the spatial location coordinate; $[T; p]$ is the tensor obtained by concatenating the instantaneous timbre texture vector with the spatial position coordinates; $W$ is a preset feature mapping matrix used to couple timbre features and spatial location into the same feature space; $b$ is the bias vector; and $\sigma$ is a nonlinear activation function, typically the sigmoid function, which constrains the output to between 0 and 1. The output $A(p)$ represents the strength of influence of timbre $T$ at point $p$. The calculated attention weights are used to weight and fuse the potential gradient values, followed by a nonlinear transformation, to generate the basic data of the initial coupled field. For each point $p$ in space, the attention weight $A(p)$ is first multiplied component-wise with the potential energy gradient vector $\nabla\Phi(p)$ at that point to achieve the weighting. Then, to introduce more complex dynamic responses and prevent numerical overflow, the weighted result is passed through a nonlinear transformation function, for example the hyperbolic tangent function $\tanh$ or a custom S-curve function. This process generates the basic data $F_{base}$ of the initial coupled field.
Its value at point $p$ is calculated as: $F_{base}(p) = f(\tilde{g}(p)) = G \cdot \tanh(k \cdot \tilde{g}(p))$, with $\tilde{g}(p) = A(p)\,\nabla\Phi(p)$, where the function $f$ nonlinearly maps the input vector to adjust its response curve; $\tanh$ is the hyperbolic tangent function used to perform this nonlinear mapping of vectors; $G$ is the preset amplitude gain operator, which controls the global strength of the final driving force; $k$ is the dynamic response sensitivity coefficient, which adjusts the steepness with which the audio features modulate the interaction force field; and $\tilde{g}(p)$ is the interaction potential gradient vector after spatial attention weighting. The resulting $F_{base}$ is a vector field with the same dimensions as the interactive potential gradient map. Each vector in it is driven not only by the user's direct interaction but also by the real-time timbre texture of the audio, which modulates it in a spatially refined, non-uniform way, providing high-quality input for the subsequent generation of a structured audio interaction dynamic field.
[0054] For example, consider the point with coordinates (100, 100) in the simulation space. The instantaneous timbre texture vector of the current frame, containing 12-dimensional Mel-frequency cepstral coefficients (MFCCs), and the potential energy gradient vector generated by user interaction at that point are obtained in parallel. The pre-defined attention mechanism network maps the timbre features and the spatial coordinates to a scalar attention weight, which reflects the degree of real-time modulation of the audio timbre at that point. The basic data of the initial coupling field is then calculated with the nonlinear transformation formula, using the set amplitude gain operator and dynamic response sensitivity coefficient. The resulting vector field achieves fine-grained spatial modulation of the local interactive force field by the audio timbre.
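The two steps described above (attention weighting, then a bounded nonlinear transform) can be sketched as follows. This is a minimal illustration, assuming the simplest possible attention network (a single linear layer followed by a sigmoid) rather than the multilayer perceptron the text mentions; every numeric value, including the timbre vector, the weights, the gradient, the gain, and the sensitivity, is a hypothetical placeholder, since the original example values were not preserved.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def attention_weight(timbre, point, weights, bias):
    """Scalar attention weight from the concatenated [timbre; point] features.

    Single linear layer plus sigmoid: an assumed stand-in for the attention network.
    """
    features = list(timbre) + list(point)
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return sigmoid(z)

def coupled_field_base(attention, gradient, gain, sensitivity):
    """gain * tanh(sensitivity * attention * gradient), applied per component."""
    return tuple(gain * math.tanh(sensitivity * attention * g) for g in gradient)

# Hypothetical inputs: a 12-dimensional MFCC-like vector and the point (100, 100)
# normalized by an assumed 512-pixel grid.
timbre = [0.1 * i for i in range(12)]
point = (100 / 512, 100 / 512)
weights = [0.05] * 14          # one weight per concatenated feature
bias = -0.2

a = attention_weight(timbre, point, weights, bias)
base = coupled_field_base(a, gradient=(2.0, -1.0), gain=1.5, sensitivity=0.8)
zero = coupled_field_base(a, gradient=(0.0, 0.0), gain=1.5, sensitivity=0.8)
```

Because the hyperbolic tangent saturates, each output component stays within the gain regardless of how large the user-generated gradient becomes, which is the overflow protection the text refers to.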
[0055] In one possible implementation, with reference to Figure 2, using the neural network model to perform forward inference on the fluid state texture, completing the neurophysical evolution of the fluid and generating the evolved fluid state, includes:
[0056] The neural network model is a pre-defined physical perception neural operator network. Its internal structure includes a cascaded feature encoding layer, a spatiotemporal convolutional evolution layer, and a physical consistency constraint layer. The spatiotemporal convolutional evolution layer is used to extract the perturbation features of the total dynamic field of the audio interaction in the spatial dimension and to perform temporal state transition calculations in combination with the fluid state texture of the previous frame;
[0057] Read the fluid state texture of the current frame, and input the total dynamic field of the audio interaction and the fluid state texture into the neural network model;
[0058] Within the neural network model, perform multi-level feature extraction and state transition calculations modulated by the total dynamic field of the audio interaction;
[0059] The updated velocity field and density gradient field are output and written into a new fluid state texture as the evolved fluid state.
[0060] In some implementations, two core input textures are first read. The first is the fluid state texture representing the fluid state of the current frame. This is a high-precision floating-point texture, for example in RGBA32F format, whose R and G channels store the two-dimensional velocity field vector at each point of the current simulation space, while the B and A channels store the density gradient field information. The second is the generated total dynamic field of the audio interaction, a vector field texture of the same dimensions. These two input textures are fed into a pre-defined physical perception neural operator network for a complete forward inference pass. Structurally, this network is designed as three tightly cascaded functional layers, executed entirely within the fragment shader in a pixel-parallel manner. The first layer is the feature encoding layer, which maps input data with different physical meanings (velocity, density gradient, and external forces) into a unified high-dimensional feature space, facilitating effective feature fusion and evolution in the subsequent network layers. This layer typically consists of several convolution operations that linearly combine and nonlinearly activate the input fluid state texture and the total dynamic field of audio interaction along the channel dimension, generating an intermediate feature map with, for example, 32 or 64 channels. The second layer is the spatiotemporal convolutional evolution layer, which is the core of the network and is responsible for calculating the transition of the fluid state over time. This layer first uses spatial convolution kernels, or depthwise separable convolutions, to extract local perturbation features of the total dynamic field of audio interaction in the high-dimensional feature space, representing how external forces affect the local motion trend of the fluid.
These perturbation features are then fused with the features encoded from the fluid state of the previous frame. This fusion is similar to the update gate mechanism in a gated recurrent unit (GRU): it combines the historical state with the current input to calculate the new state, thus completing the state transition calculation from time step $t$ to $t+1$. If it is the first frame after startup, a preset initial fluid state texture is read. The third layer is the physical consistency constraint layer, whose purpose is to decode the evolved high-dimensional features and correct the results so that they conform to basic fluid physics laws, such as mass conservation and momentum conservation. This layer decodes the high-dimensional feature map output by the spatiotemporal convolutional evolution layer back to physical space to obtain the initial velocity field and density gradient field. Subsequently, a correction operation is applied, such as performing a Helmholtz decomposition through fast Fourier transform to split the velocity field into a divergence-free part and a curl-free part, and discarding the curl-free part, which carries the divergence that may be introduced by network errors, thereby enforcing the incompressibility of the fluid and ensuring the stability and realism of the visual effect. The updated velocity field and density gradient field output by the physical consistency constraint layer are written into the new fluid state texture by the fragment shader. This new texture constitutes the evolved fluid state and is used as input in the next rendering frame. This process is repeated to realize the continuous dynamic evolution of the fluid. The update of the entire evolution process from time step $t$ to $t+1$ can be expressed as: $S_{t+1} = \mathcal{P}(\mathcal{T}(\mathcal{E}(S_t, F_t)))$, where $S_t$ represents the fluid state texture at time $t$ and $F_t$ represents the total dynamic field of audio interaction.
The function $\mathcal{E}$ corresponds to the feature encoding layer: $H_t = \sigma(W_e\,[S_t; F_t])$, i.e., the weight matrix $W_e$ and the activation $\sigma$ apply a linear mapping followed by a nonlinear activation to the concatenated input $[S_t; F_t]$, transforming it into the high-dimensional feature space. The function $\mathcal{T}$ corresponds to the core state transition calculation, which appears as a superposition term: $H_{t+1} = H_t + \Delta H_t$, where the increment $\Delta H_t$ is computed by an operator that uses a spatial convolution kernel $K$ to extract the local perturbations of the total dynamic field of audio interaction in the high-dimensional feature space and combines them with the historical state to obtain the incremental evolution of the fluid state. The function $\mathcal{P}$ corresponds to the decoding and corrective projection operations, which include the following steps: linear decoding, in which the output weight matrix $W_o$ maps the evolved high-dimensional features back to the initial velocity field $u^* = W_o H_{t+1}$ in physical space; and divergence correction, in which the projection $u_{t+1} = u^* - \nabla\varphi$ eliminates the divergent component of the velocity field and forces the fluid to remain incompressible, where the pressure potential field $\varphi$ is solved from the Poisson equation $\nabla^2\varphi = \nabla\cdot u^*$, for example by fast Fourier transform. The final result $S_{t+1}$ is the calculated fluid state texture for the next moment, which is written into a new texture by the fragment shader to achieve closed-loop continuous evolution of the fluid state.
[0061] For example, for a pixel in the simulation space, the fluid state texture of the current frame is read at that point, in which the R and G channels store the two-dimensional velocity field vector and the B and A channels store the density gradient field information, together with the total dynamic field of audio interaction at that point. The feature encoding layer maps the concatenated input vector to a high-dimensional intermediate feature map; the spatiotemporal convolutional evolution layer then extracts local perturbations using spatial convolution kernels and calculates the incremental evolution term. The physical consistency constraint layer first uses the output weight matrix to decode the high-dimensional features, with the incremental term superimposed, into the initial velocity field.
[0062] The projection operator is then executed to perform divergence correction: if the pressure potential gradient at this point obtained by fast Fourier transform is (0.02, 0.01), the velocity field component at the next moment is obtained by subtracting this gradient from the initial velocity. The final updated fluid state is written into a new texture by the fragment shader in parallel.
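The divergence correction step can be illustrated on the CPU with a small sketch. The text solves the pressure potential by fast Fourier transform; the sketch below swaps in plain Gauss-Seidel sweeps on a small periodic grid to stay dependency-free, so it is an assumption-laden stand-in rather than the shader implementation. Grid size, sweep count, and the test field are all hypothetical. Backward differences are used for the divergence and forward differences for the gradient so that their composition is the usual 5-point Laplacian.

```python
import math

def divergence(u, v):
    """Backward-difference divergence on a periodic n x n grid (unit spacing)."""
    n = len(u)
    return [[(u[i][j] - u[i - 1][j]) + (v[i][j] - v[i][j - 1])
             for j in range(n)] for i in range(n)]

def solve_pressure(rhs, sweeps=500):
    """Gauss-Seidel sweeps for the 5-point Poisson equation laplacian(p) = rhs."""
    n = len(rhs)
    p = [[0.0] * n for _ in range(n)]
    for _ in range(sweeps):
        for i in range(n):
            for j in range(n):
                neighbours = (p[(i + 1) % n][j] + p[i - 1][j]
                              + p[i][(j + 1) % n] + p[i][j - 1])
                p[i][j] = (neighbours - rhs[i][j]) / 4.0
    return p

def project(u, v, sweeps=500):
    """Subtract the forward-difference pressure gradient from the velocity field."""
    n = len(u)
    p = solve_pressure(divergence(u, v), sweeps)
    u_new = [[u[i][j] - (p[(i + 1) % n][j] - p[i][j]) for j in range(n)]
             for i in range(n)]
    v_new = [[v[i][j] - (p[i][(j + 1) % n] - p[i][j]) for j in range(n)]
             for i in range(n)]
    return u_new, v_new

def max_abs(field):
    return max(abs(x) for row in field for x in row)

# Hypothetical decoded velocity field: u has divergence, v does not.
n = 8
u = [[math.sin(2 * math.pi * i / n) for j in range(n)] for i in range(n)]
v = [[math.sin(2 * math.pi * i / n) for j in range(n)] for i in range(n)]

before = max_abs(divergence(u, v))
u_proj, v_proj = project(u, v)
after = max_abs(divergence(u_proj, v_proj))
```

At a single pixel this reduces to the subtraction in the example: the corrected velocity is the decoded velocity minus the local pressure potential gradient, here (0.02, 0.01).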
[0063] In one possible implementation, with reference to Figure 2, completing the neurophysical evolution of the fluid further includes implementing an entropy stabilization compensation loop based on visual perception importance:
[0064] The evolved fluid state is analyzed to calculate a perceptual importance map that identifies visually critical regions and potentially unstable regions;
[0065] The perceptual importance map is fed back into the generation of the total dynamic field of audio interaction in the next calculation cycle;
[0066] When generating the total dynamic field of audio interaction for the next frame, the interaction potential gradient map is directionally corrected according to the perceptual importance map, wherein the directional correction takes the form of injecting virtual mass or energy into the regions marked in the map.
[0067] In some implementations, the analysis process begins immediately after the neurophysical evolution of the current frame is complete, with the newly generated evolved fluid state texture as input. This process is executed in the GPU's compute shader and generates a single-channel grayscale map, i.e., the perceptual importance map. Each pixel value in the map, typically normalized to between 0 and 1, quantifies the visual importance of the corresponding spatial location. High-value regions represent high-speed fluid movement, fine structures with high density gradients, or potentially unstable numerical points. This perceptual importance map is not used directly for rendering the current frame but is cached as a key modulation signal and fed back into the generation of the audio interaction dynamic field in the next computation cycle. This constitutes a feedback loop delayed by one frame, i.e., the analysis results of one frame are used to guide the dynamic field generation of the following frame. At the start of the next calculation cycle, when preparing to generate the total dynamic field of the audio interaction for the new frame, the real-time interaction potential energy gradient map generated by the user is read, along with the perceptual importance map cached from the previous frame. The interaction potential energy gradient map is then directionally corrected based on this map. The correction injects virtual energy into the high-importance regions identified in the map. Specifically, for each point $p$ in the simulation space, the corrected interaction potential gradient is calculated as: $\nabla\Phi'(p) = \nabla\Phi(p) + \lambda\, M(p)\, \hat{d}(p)$, where $\nabla\Phi'(p)$ is the modified potential gradient vector that will be used to generate the total dynamic field of the new frame of audio interaction, and $\nabla\Phi(p)$ is the original potential energy gradient vector directly generated by user interaction at the current moment.
$M(p)$ is the scalar value at point $p$ of the perceptual importance map calculated from the previous frame, and represents the intensity of the compensation. $\lambda$ is a globally adjustable scalar compensation coefficient, with a typical value range between 0.01 and 0.5, used to control the gain of the entire feedback loop. $\hat{d}(p)$ is a direction vector, usually taken as the normalized velocity vector of the fluid state after the evolution of the previous frame. This ensures that the injected energy or momentum is consistent with the current motion trend of the fluid, thereby enhancing rather than disturbing the existing structure. Through this directional correction, entropy increase is actively counteracted and energy is replenished in vortices or details that are about to decay, ensuring the lasting vitality of the visual effect. Figure 3 presents a heatmap of the distribution of perceptual importance in the simulation space, calculated from fluid velocity, density gradient, and distance to the field vortex core. High-brightness white areas in the figure mark visual focal points or potentially numerically unstable regions, guiding the subsequent entropy stabilization compensation.
[0068] For example, when implementing the entropy stabilization compensation cycle, after the fluid evolution of a frame is completed, a perceptual importance map is generated by a compute shader. If the fluid motion at a certain point in the simulation space is intense and the structure is complex, the scalar value at this point is normalized to 0.80. On entering the calculation cycle of the next frame, when the total dynamic field of audio interaction is generated, the original potential energy gradient vector generated by the user in real time at that point is read, and the normalized velocity direction vector at that position in the previous frame is obtained. With the global compensation coefficient set to 0.25, the corrected potential energy gradient is obtained by substituting these values into the correction formula. This targeted correction injects virtual energy consistent with the fluid's motion trend into key visual areas, effectively counteracting the dynamic decay caused by numerical dissipation and ensuring the enduring vitality of the fluid visual effects during long-term interaction.
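A minimal sketch of the correction step follows, assuming the additive form described above (original gradient plus coefficient times importance times direction). The importance value 0.80 and the coefficient 0.25 come from the example; the gradient (1.0, 0.0) and the previous-frame velocity (3.0, 4.0) are hypothetical placeholders, because the original values were not preserved.

```python
import math

def normalize(v):
    """Unit vector in the direction of v; the zero vector is returned unchanged."""
    length = math.hypot(v[0], v[1])
    if length == 0.0:
        return (0.0, 0.0)
    return (v[0] / length, v[1] / length)

def compensate_gradient(gradient, importance, velocity, coeff):
    """Inject energy along the previous-frame flow direction, scaled by importance."""
    d = normalize(velocity)
    return (gradient[0] + coeff * importance * d[0],
            gradient[1] + coeff * importance * d[1])

corrected = compensate_gradient(gradient=(1.0, 0.0),   # hypothetical user gradient
                                importance=0.80,        # from the example
                                velocity=(3.0, 4.0),    # hypothetical previous velocity
                                coeff=0.25)             # from the example
still = compensate_gradient((1.0, 0.0), 0.80, (0.0, 0.0), 0.25)
```

With these placeholder inputs the injected term is 0.25 × 0.80 = 0.2 along the direction (0.6, 0.8), so the corrected gradient is about (1.12, 0.16); where the fluid is at rest the gradient is left unchanged.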
[0069] In one possible implementation, with reference to Figure 2, calculating the perceptual importance map that identifies visually critical regions and potentially unstable regions includes:
[0070] From the evolved fluid state, extract information on the fluid velocity field and density gradient field;
[0071] Based on the field vortex core parameters in the total dynamic field of the audio interaction in the current frame, calculate the spatial distance weight between each pixel position and the field vortex core;
[0072] By combining the fluid velocity field, the density gradient field, and the spatial distance weights, an importance score is calculated for each pixel location using a visual perception importance function, and the perceptual importance map is generated from the importance scores.
[0073] In some implementations, core physical information is extracted from the evolved fluid state texture of the current frame. The fluid state texture typically contains the velocity field $v(p)$ in the R and G channels, i.e., the two-dimensional velocity vector at each pixel $p$, and the density gradient field $\nabla\rho(p)$ in the B and A channels, i.e., the gradient vector at each pixel $p$ representing the direction and magnitude of density change. This information directly reflects the dynamic characteristics and structural details of the fluid. Using the field vortex core parameters in the total dynamic field of audio interaction generated for the current frame, the spatial distance weight between each pixel location and the field vortex cores is calculated. For each pixel $p$, all identified field vortex cores are traversed. Each core $k$ is defined by its spatial location $c_k$ and radius of influence $R_k$. The Euclidean distance $d_k(p) = \lVert p - c_k \rVert$ from pixel $p$ to each vortex core is calculated and then converted into a distance weight using a decay function, such as an exponential decay function or a smoothstep function. For example, when $d_k(p)$ is less than $R_k$ the weight is high, and it decays rapidly with increasing distance. Finally, the spatial distance weight $W_d(p)$ of pixel $p$ is obtained by aggregating the distance weights of all vortex cores, for example by taking the maximum value or a weighted average. With exponential decay and maximum aggregation, this can be written as $W_d(p) = \max_k \exp(-\kappa\, d_k(p) / R_k)$, where $\kappa$ is the attenuation coefficient, typically between 2 and 5, and $d_k(p)$ is the distance from the center of vortex core $k$ to pixel $p$. $W_d(p)$ is a scalar between 0 and 1 representing the proximity of pixel $p$ to the vortex cores, which matters because the vortex core is often the focal point of visual dynamics. Taking the extracted physical quantities together, a pre-defined visual perception importance function is used to calculate the importance score of each pixel location and generate the final perceptual importance map.
This function is typically a nonlinear weighted combination designed to simulate the sensitivity of the human visual system to motion, structure, and anomalous regions. The importance score $I(p)$ of each pixel $p$ can be expressed, before normalization, as: $I(p) = a_v \lVert v(p) \rVert + a_\rho \lVert \nabla\rho(p) \rVert + a_d W_d(p)$, where $\lVert v(p) \rVert$ is the magnitude of the velocity vector at pixel $p$, representing the intensity of the fluid motion; $\lVert \nabla\rho(p) \rVert$ is the magnitude of the density gradient vector at pixel $p$, representing the sharpness of fluid density changes or structural edges; and $W_d(p)$ is the spatial distance weight calculated earlier. $a_v$, $a_\rho$, and $a_d$ are normalization coefficients that adjust the contribution of each term, and their sum is 1. Typical values are $a_v = 0.4$, $a_\rho = 0.4$, and $a_d = 0.2$, which can be optimized through experimental iteration so that the generated map effectively highlights visually critical regions and potentially unstable areas. The calculated $I(p)$, i.e., the value of the perceptual importance map at point $p$, is a normalized scalar. The higher the value, the higher the visual perception importance of the area or the higher its potential instability, and thus the more entropy stabilization compensation it requires.
[0074] For example, for a pixel in the simulation space, the velocity field vector at this point is extracted from the evolved fluid state texture, with a magnitude of 15.0, and the density gradient field vector is extracted, with a magnitude of 8.0. The nearest field vortex core is identified; the distance from its center coordinates to the point is 20 pixels and its radius of influence is 60 pixels. With the attenuation coefficient set to 3.0, the spatial distance weight is calculated according to the decay formula. With the normalization coefficients set, the importance score of this point is calculated through the visual perception importance function and, after normalization, is used as the perceptual importance map value at that point. This quantification process can accurately identify visual dynamic focal points and potentially unstable regions.
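The example above can be reproduced with a short sketch. Two assumptions are made because the original formulas were not preserved: the distance weight is taken as exponential decay in distance divided by radius with maximum aggregation, and the score is a plain weighted sum using the typical coefficients 0.4, 0.4, and 0.2 from paragraph [0073]. The final normalization step is not specified in the text and is left out.

```python
import math

def distance_weight(pixel, cores, attenuation):
    """Maximum over vortex cores of exp(-attenuation * distance / radius) (assumed form)."""
    best = 0.0
    for (cx, cy), radius in cores:
        d = math.hypot(pixel[0] - cx, pixel[1] - cy)
        best = max(best, math.exp(-attenuation * d / radius))
    return best

def importance_score(speed, density_grad, dist_weight, coeffs=(0.4, 0.4, 0.2)):
    """Weighted sum of speed, density-gradient magnitude and distance weight (unnormalized)."""
    a_v, a_rho, a_d = coeffs
    return a_v * speed + a_rho * density_grad + a_d * dist_weight

# One vortex core 20 pixels away with a 60-pixel radius of influence, attenuation 3.0.
cores = [((20.0, 0.0), 60.0)]
w = distance_weight((0.0, 0.0), cores, attenuation=3.0)
score = importance_score(speed=15.0, density_grad=8.0, dist_weight=w)
```

Under these assumptions the distance weight is exp(-1), about 0.37, and the unnormalized score is about 9.27, dominated by the velocity and density gradient terms.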
[0075] In one possible implementation, with reference to Figure 2, performing competitive visual attribute mapping for each of the audio semantic layers includes:
[0076] Each of the separated audio semantic layers is assigned an independent visual representation channel, and each visual representation channel is associated with a set of visual attribute influencing factors.
[0077] For each pixel on the screen, obtain the instantaneous energy intensity of each audio semantic layer at the evolved fluid state corresponding to the current pixel;
[0078] Based on the instantaneous energy intensity, rendering resources are competitively allocated among the visual representation channels using a nonlinear function for determining rendering resource allocation, thereby determining the channel that dominates the final visual attributes of the current pixel;
[0079] Based on the competition results, the visual contributions of each channel are combined to synthesize the final color and optical properties of the pixel.
[0080] In some implementations, semantic separation of the previously extracted multidimensional audio features is performed in real time within the GPU's rendering pipeline. This is typically achieved through a pre-trained lightweight classification network or rule set, which takes audio features such as spectral bandwidth, instantaneous loudness, and timbre as input and decomposes them into several independent audio semantic layers, such as a "percussion layer," a "melody instrument layer," a "bass rhythm layer," and a "vocal layer." Each semantic layer represents an independent dimension of the audio content. An independent visual representation channel is assigned to each separated audio semantic layer, and each visual representation channel is associated with a set of adjustable visual attribute influencing factors, such as color, brightness, texture, particle emission density, or distortion intensity. For example, a percussion layer might be associated with highly saturated instantaneous color changes and rapid particle bursts, while a melody instrument layer might be associated with smoother gradients and halo effects. For each pixel $p$ on the screen, the instantaneous energy intensity $E_i(p)$ of each audio semantic layer $i$ at that pixel of the evolved fluid state is obtained. The energy intensity $E_i(p)$ is not simply volume intensity, but a comprehensive metric obtained by mapping the current acoustic energy of each audio semantic layer onto the fluid state at that pixel, such as the magnitude of the velocity field and the local value of the density field, through a fusion function. For example, when the energy of a semantic layer is high and the fluid velocity at that pixel is also high, its instantaneous energy intensity is even higher.
A nonlinear function for determining rendering resource allocation is then used to allocate rendering resources competitively among the visual representation channels and to determine the channel that dominates the final visual attributes of the current pixel. This nonlinear function is typically implemented as a softmax variant or a weighted exponential function, which receives the instantaneous energy intensities $E_i(p)$ of the audio semantic layers at the current pixel as input and outputs the competition weight $w_i(p)$ of each channel. The competition weight is calculated as: $w_i(p) = \exp(\beta E_i(p)) / Z(p)$, with $Z(p) = \sum_j \exp(\beta E_j(p))$, where $\beta$ is a positive concentration parameter, typically ranging from 0.5 to 5.0, that controls the "sharpness" of the competition; a larger $\beta$ makes the weight of the dominant channel closer to 1 and the weights of the other channels closer to 0, creating a "harder" competition. $Z(p)$ is the sum of the exponential energies of all channels, used for normalization. At each pixel $p$, the competition weights of all channels sum to 1. These weights indicate the degree of dominance of each audio semantic layer in the final visual representation at the current pixel. Based on the competition results, the visual contributions of each channel are combined to synthesize the final color and optical properties of pixel $p$. Each visual attribute, such as a color component, brightness, or transparency, is defined as the weighted sum of the corresponding influence factors across all channels. For example, the red channel value $C_R(p)$ of the final color of pixel $p$ can be expressed as: $C_R(p) = \sum_i w_i(p)\, c_{i,R}$, where $c_{i,R}$ is the preset red-component visual attribute influence factor of the $i$-th audio semantic layer channel. The green and blue components and other optical attributes are calculated in the same way.
This weighted blending ensures that the final rendered image not only reflects the physical evolution of the fluid but also achieves deep coupling and real-time dynamic interaction between audio content and visual representation, by dynamically adjusting the contribution of each audio semantic layer to the visual effect.
[0081] For example, when performing competitive visual attribute mapping for a specific pixel on the screen, assume that two semantic layers, a "percussion layer" and a "melody layer," have been separated, and that their instantaneous energy intensities at that pixel, after fluid velocity modulation, have been obtained. With the concentration parameter set, the competition weight of each channel is calculated using the nonlinear function.
[0082] Correspondingly, if the preset red component influence factor of the percussion channel corresponds to a highly saturated color and that of the melody channel to a soft color, the red channel value of the final color of this pixel is calculated as the sum of the two influence factors weighted by their competition weights.
[0083] By dynamically adjusting weights, the visual representation can accurately synthesize a final rendered image that is deeply coupled with the audio content at the pixel level, based on the real-time energy competition of the audio semantics.
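The competition and blending steps can be sketched in a few lines. The softmax form follows the description in paragraph [0080]; the energies, the concentration parameter, and the red influence factors below are hypothetical placeholders, because the example values were not preserved. Subtracting the maximum energy before exponentiating is a common numerical-stability measure added here and does not change the weights.

```python
import math

def competition_weights(energies, concentration):
    """Softmax over concentration * energy, one weight per semantic-layer channel."""
    peak = max(energies)
    exps = [math.exp(concentration * (e - peak)) for e in energies]
    total = sum(exps)
    return [x / total for x in exps]

def blend_attribute(weights, factors):
    """Final attribute value as the weight-averaged channel influence factors."""
    return sum(w * f for w, f in zip(weights, factors))

# Hypothetical pixel: percussion layer energy 1.2, melody layer energy 0.4.
energies = [1.2, 0.4]
weights = competition_weights(energies, concentration=2.0)

# Hypothetical red influence factors: saturated percussion (0.9), soft melody (0.3).
red = blend_attribute(weights, [0.9, 0.3])
```

With these placeholder inputs the percussion channel wins most of the weight, so the blended red value lies between the two influence factors but closer to the percussion factor.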
[0084] In one possible implementation, with reference to Figure 2, competitively allocating rendering resources among the visual representation channels using the nonlinear function for determining rendering resource allocation includes:
[0085] The instantaneous energy intensity of each audio semantic layer at the current pixel is input into a normalized exponential function to calculate the competitive weight of each visual representation channel.
[0086] Based on the competition weights, the visual attribute influence factors associated with each channel are weighted.
[0087] The weighted visual attribute influence factors of each channel are superimposed, and the superposition result defines the final visual attribute of the pixel.
[0088] In some implementations, for each pixel on the screen The instantaneous energy intensity of all audio semantic layers at that pixel point. As input. Here It is aimed at the first An audio semantic layer, at the pixel level The instantaneous energy intensities are obtained by combining their acoustic energy with the corresponding fluid state. These instantaneous energy intensities are then input into a normalized exponential function to calculate the competitive weight of each visual representation channel. Based on these competitive weights, the visual attribute influencing factors associated with each channel are weighted. Each visual performance channel is pre-associated with a set of attribute influencing factors that define its visual characteristics, such as color, brightness, transparency, and luminous intensity. For each pixel... and a specific visual attribute to be calculated. Apply this attribute to each channel Influence factor Multiply by the corresponding competition weight For example, to calculate pixels The final red component Then each channel Preset red influence factor Will be multiplied The weighted visual attribute influence factors of each channel are superimposed, and the superimposed result defines the pixel. The final visual attributes. For pixels. any final visual attribute The calculation formula is as follows: ;in, Representing pixels The final visual attribute value. Representing the The attributes associated with each visual representation channel The visual attributes of the image are influenced by factors such as its preset red value. This overlay operation makes the color and optical properties of the final pixel a weighted mixture of all active audio semantic layers, with the weights competitively determined by real-time audio energy, achieving a tight, dynamic, and non-linear coupling between audio content and fluid rendering. 
Figure 4 illustrates the rendering competition weight curves assigned by the normalized exponential function at a specific pixel location as the instantaneous energies of the different audio semantic layers change. The graph shows the winner-takes-all characteristic under the influence of the concentration parameter, which keeps the visual presentation consistent with the dominant audio features.
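The sharpening effect of the concentration parameter can be illustrated with a minimal sketch. The function name `competition_weights`, the parameter name `beta`, and the choice to apply the concentration parameter as a multiplier on the energies before the normalized exponential function are assumptions made for illustration; the energy values are hypothetical and not taken from the specification.

```python
import math

def competition_weights(energies, beta):
    """Normalized exponential (softmax) over per-layer energies.

    beta is an assumed concentration parameter applied as a multiplier:
    larger values push the result toward winner-takes-all.
    """
    scaled = [beta * e for e in energies]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [x / total for x in exps]

if __name__ == "__main__":
    # Two hypothetical audio semantic layers at one pixel.
    energies = [0.6, 0.4]
    for beta in (1.0, 5.0, 20.0):
        weights = competition_weights(energies, beta)
        print(beta, [round(w, 3) for w in weights])
```

Under these assumptions, the weight of the higher-energy layer grows toward 1 as `beta` increases, which is the winner-takes-all behavior the figure describes.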
[0089] For example, when performing competitive rendering resource allocation for a pixel, the instantaneous energy intensities of two audio semantic layers are obtained and the concentration parameter is set. The competition weights of the two channels are calculated by substituting these values into the normalized exponential function. Given the preset transparency influence factor of the first channel and that of the second channel, the final transparency attribute of the pixel is determined according to the superposition formula, that is, as the sum of each channel's transparency influence factor multiplied by its competition weight. By adjusting the concentration (competitive sharpness) parameter, the higher-energy audio semantic layer can dominate the pixel-level visual attribute synthesis with a higher weight ratio, thus achieving a deep, non-linear dynamic coupling between audio content and rendering effects.
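As a purely illustrative worked calculation, assume two layers with instantaneous energy intensities $E_1 = 0.8$ and $E_2 = 0.2$, a concentration parameter $\beta = 5$ applied as a multiplier inside the normalized exponential function, and preset transparency influence factors $F_1 = 0.9$ and $F_2 = 0.3$. All of these values, the symbols, and the multiplicative form of $\beta$ are assumptions chosen for this sketch, not figures from the specification.

$$w_1 = \frac{e^{\beta E_1}}{e^{\beta E_1} + e^{\beta E_2}} = \frac{e^{4}}{e^{4} + e^{1}} = \frac{1}{1 + e^{-3}} \approx 0.953, \qquad w_2 = 1 - w_1 \approx 0.047$$

$$A_{\text{transparency}} = w_1 F_1 + w_2 F_2 \approx 0.953 \times 0.9 + 0.047 \times 0.3 \approx 0.872$$

Under these assumed values, the first layer holds about 95% of the weight, so the final transparency sits close to that layer's own factor of 0.9.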
[0090] In one possible implementation, with reference to Figure 2, organizing the processes of generating the total dynamic field of the audio interaction, performing the neurophysical evolution, and generating the final rendered image into several parallel, asynchronously executed processing pipelines through the asynchronous computing engine of the GPU, and scheduling and outputting them, includes:
[0091] The logic for generating the total dynamic field of the audio interaction is encapsulated as a first asynchronous computation pipeline;
[0092] The logic for performing the neurophysical evolution is encapsulated as a second graphics rendering pipeline;
[0093] The logic for generating the final rendered image is encapsulated as a third graphics rendering pipeline;
[0094] Different execution cycles are set for the first asynchronous computing pipeline, the second graphics rendering pipeline, and the third graphics rendering pipeline, and a staggered clock driving strategy is used for scheduling.
[0095] A double or triple buffering mechanism is configured for the fluid state texture so that the write operations of the second graphics rendering pipeline and the read operations of the third graphics rendering pipeline are performed asynchronously on different buffers.
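The staggered execution cycles in the steps above can be sketched on the CPU side as a simple per-frame plan. The names `schedule`, `field_period`, and the labels `P1`, `P2`, `P3` are hypothetical, and the sketch assumes, for simplicity, that a newly computed dynamic field is available to the second pipeline in the same frame in which the first pipeline runs.

```python
def schedule(num_frames, field_period=2):
    """Return (frame, pipelines_run, field_version) for each frame.

    P1: low-frequency asynchronous compute pipeline (dynamic field).
    P2: per-frame fluid evolution pipeline.
    P3: per-frame final rendering pipeline.
    field_version is the frame in which the field used by P2 was produced.
    """
    plan = []
    field_version = None
    for frame in range(num_frames):
        ran = []
        if frame % field_period == 0:
            field_version = frame       # P1 refreshes the cached field
            ran.append("P1")
        ran += ["P2", "P3"]             # graphics pipelines run every frame
        plan.append((frame, ran, field_version))
    return plan

if __name__ == "__main__":
    for frame, ran, field in schedule(4):
        print(frame, ran, "uses field from frame", field)
```

On frames where `P1` does not run, `P2` keeps using the most recently cached field, which mirrors the fallback behavior described for the second graphics rendering pipeline.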
[0096] In some implementations, the logic for generating the total dynamic field of the audio interaction is encapsulated as a first asynchronous computation pipeline. This pipeline performs the computationally intensive tasks of data preparation, nonlinear coupling operations, and identification and enhancement of the field vortex core parameters. Because these operations are pure data processing and do not depend directly on graphics rendering, they are assigned to the GPU's compute cores, scheduled asynchronously, and executed in parallel with the other pipelines in the command queue. The pipeline is typically implemented as a compute shader that receives the multidimensional audio features, the user interaction data, and the perceptual importance map of the previous frame as input, and outputs an updated texture of the total dynamic field of the audio interaction.
The logic for performing the neurophysical evolution is encapsulated as a second graphics rendering pipeline, which is responsible for updating the fluid state. In the fragment shader, it applies multiple forward inference passes and physical consistency constraints to the fluid state texture and the total dynamic field of the audio interaction to generate the evolved fluid state. Although the core computation runs in the fragment shader, it still relies on the frame buffer and texture access mechanisms provided by the graphics pipeline, and is therefore classified as a graphics rendering pipeline. It is implemented as a full-screen quad rendering pass whose input is the current fluid state texture and the latest total dynamic field of the audio interaction, and whose output is the updated fluid state texture.
The logic for generating the final rendered image is encapsulated as a third graphics rendering pipeline, which is responsible for visual attribute mapping and final image compositing during the GPU rendering stage.
It reads the evolved fluid state texture and the current audio semantic layer features, calculates the final color and optical attributes of each pixel according to the competitive mapping rules, and draws them to the screen frame buffer. This pipeline is also implemented as a full-screen quad rendering pass and is usually the last graphics rendering stage.
To achieve efficient concurrent execution, different execution cycles are set for the three pipelines, and a staggered clock driving strategy is used for scheduling. For example, the first asynchronous computation pipeline can run at a relatively low frequency, updating the total dynamic field of the audio interaction every 2-4 frames or whenever the audio or the user interaction changes significantly. The second graphics rendering pipeline performs the fluid physics evolution once per frame at a fixed frequency, for example approximately once every 16 milliseconds, corresponding to a frame rate of 60 Hz. The third graphics rendering pipeline is synchronized with the second and also executes every frame, but it is submitted to the command queue as early as possible to reduce user-visible rendering latency. This staggered (phase-shifted) strategy makes use of otherwise idle GPU computing units; for example, while the rendering pipeline waits for material loading, the computation pipeline can update the dynamic field.
To handle data dependencies between the pipelines efficiently and without contention, especially the continuous read/write operations on the fluid state texture, a double or triple buffering mechanism is configured for the fluid state texture. When the first asynchronous computation pipeline has not produced new output, the second graphics rendering pipeline continues to use the most recently generated total dynamic field texture cached in video memory for its evolution calculations.
For example, with triple buffering, the fluid state texture has three copies: texture D, texture B, and texture C. In a given frame, the second graphics rendering pipeline reads the old state from texture D and writes the new state to texture B; at the same time, the third graphics rendering pipeline reads the completed fluid state from texture C and renders it. In the next frame, the input and output texture roles rotate cyclically; for example, the second pipeline reads from texture B and writes to texture C, while the third pipeline reads from texture D. This mechanism allows the write operations of the second graphics rendering pipeline and the read operations of the third graphics rendering pipeline to proceed asynchronously on different buffers, avoiding memory read/write conflicts and unnecessary synchronization waits, and thus effectively hiding the latency of the rendering pipeline.
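The cyclic role rotation described in this paragraph can be sketched as follows. The function name `buffer_roles` and the dictionary keys are hypothetical; the texture labels and the rotation order follow the example in this paragraph, and the sketch only tracks which texture plays which role, not any actual GPU resources.

```python
from collections import deque

def buffer_roles(num_frames, textures=("D", "B", "C")):
    """Rotate three fluid state textures through three roles per frame.

    Position 0: texture the evolution pipeline reads (old state).
    Position 1: texture the evolution pipeline writes (new state).
    Position 2: texture the final rendering pipeline reads.
    """
    ring = deque(textures)
    roles = []
    for _ in range(num_frames):
        read_tex, write_tex, display_tex = ring
        # The writer and the renderer must never share a buffer.
        assert write_tex != display_tex
        roles.append({"p2_read": read_tex,
                      "p2_write": write_tex,
                      "p3_read": display_tex})
        ring.rotate(-1)                 # shift every role to the next texture
    return roles

if __name__ == "__main__":
    for frame, r in enumerate(buffer_roles(4)):
        print(frame, r)
```

Because the write target and the render source are always different textures, neither pipeline has to wait on the other for buffer access in this simplified model.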
[0097] For example, during asynchronous scheduling, the GPU asynchronous computing engine organizes the tasks into three independent pipelines and performs staggered (phase-shifted) scheduling based on a 60 Hz frame rate. The first asynchronous computing pipeline is configured to execute once every two frames, with each execution taking approximately 2.5 ms, while the second and third graphics rendering pipelines execute every frame and take 6 ms and 7 ms respectively. At the start of a frame, the asynchronous computing engine launches the first pipeline in the compute queue to generate the total dynamic field texture. Meanwhile, in the graphics queue, asynchronous reads and writes are implemented with the triple buffering mechanism: the second pipeline reads the state from texture D and writes to texture B, while the third pipeline simultaneously reads from texture C the state completed in the previous frame. The total GPU processing time is therefore not a simple sum of the three components; through staggered parallelism it is kept within the frame budget of approximately 16.6 ms. In the next frame cycle the buffer roles are switched, so that the second pipeline writes to texture C and the third pipeline reads texture B, which effectively hides the memory read/write latency between fluid state evolution and image synthesis.
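The timing figures in this example can be checked with a small calculation. The sketch assumes a simplified two-queue model in which the second and third pipelines run back to back on the graphics queue and the first pipeline overlaps them fully on the compute queue; the function and constant names are hypothetical, and real GPU overlap depends on hardware and driver behavior.

```python
FRAME_BUDGET_MS = 1000.0 / 60.0     # about 16.67 ms per frame at 60 Hz

def frame_time_ms(frame, compute_ms=2.5, evolve_ms=6.0, render_ms=7.0,
                  compute_period=2):
    """Return (serial_ms, overlapped_ms) for one frame.

    serial_ms:     all pipelines executed one after another.
    overlapped_ms: compute queue runs concurrently with the graphics queue,
                   so the frame takes as long as the slower of the two.
    """
    compute = compute_ms if frame % compute_period == 0 else 0.0
    graphics = evolve_ms + render_ms
    serial = compute + graphics
    overlapped = max(compute, graphics)
    return serial, overlapped

if __name__ == "__main__":
    for frame in range(2):
        serial, overlapped = frame_time_ms(frame)
        print(frame, serial, overlapped, overlapped <= FRAME_BUDGET_MS)
```

Under this model, a frame in which the compute pipeline runs costs 15.5 ms when serialized and 13 ms when overlapped, so the overlap leaves roughly 2.5 ms of additional headroom within the 60 Hz budget.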
[0098] It should be noted that all equivalent changes and modifications made in accordance with the teachings of this invention remain within the scope of this invention. Those skilled in the art will readily conceive of other embodiments of this invention upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of this invention that follow its general principles and include common knowledge or conventional techniques in the art not disclosed herein.
Claims
1. An audio spectrum fluidized interactive presentation method based on shader computing power, characterized in that the method includes: acquiring an audio stream and user interaction data in real time, wherein the audio stream is used to extract multidimensional audio features and the user interaction data is used to generate an interaction potential gradient map, the interaction potential gradient map being a two-dimensional texture covering the entire simulation space, generated by converting the screen-space coordinates and motion velocity vector of a user input device, such as a mouse or touch screen, contained in the user interaction data; in a parallel compute shader of a graphics processing unit (GPU), performing nonlinear coupling operations on the multidimensional audio features and the interaction potential gradient map to generate a unified total dynamic field of the audio interaction covering the simulation space and time, which includes: mapping the multidimensional audio features to a frequency-energy space and mapping the interaction potential gradient map to a spatial coordinate-force space; performing a tensor product fusion operation on the data in the frequency-energy space and the spatial coordinate-force space using a nonlinear function for fusing multidimensional data, to generate an initial coupling field; identifying, within the initial coupling field, the field vortex core parameters determined by the current audio features and interaction focus; and performing structured enhancement on the initial coupling field based on the field vortex core parameters to generate the total dynamic field of the audio interaction; in a fragment shader of the GPU, inputting the total dynamic field of the audio interaction as a driving signal into a preset neural network model for dynamically modulating the physical behavior of the fluid, and using the neural network model to perform forward inference on the fluid state texture to complete the neurophysical evolution of the fluid and generate the evolved fluid state, wherein the fluid state texture includes a velocity field and a density gradient field; in the GPU rendering pipeline, performing semantic separation on the multidimensional audio features to obtain several audio semantic layers, and performing competitive visual attribute mapping on each audio semantic layer based on the evolved fluid state to generate the final rendered image; and, through the asynchronous computing engine of the GPU, organizing the processes of generating the total dynamic field of the audio interaction, performing the neurophysical evolution, and generating the final rendered image into several parallel, asynchronously executed processing pipelines, and scheduling and outputting them.
2. The audio spectrum fluidized interactive presentation method based on shader computing power according to claim 1, characterized in that the nonlinear function for fusing multidimensional data includes: receiving the instantaneous timbre texture vector from the multidimensional audio features and the potential gradient values from the interaction potential gradient map; calculating the attention weights of the instantaneous timbre texture vector at different spatial locations using an attention mechanism network; and using the attention weights to weight, fuse, and nonlinearly transform the potential gradient values to form the basic data of the initial coupling field.
3. The audio spectrum fluidized interactive presentation method based on shader computing power according to claim 1, characterized in that using the neural network model to perform forward inference on the fluid state texture to complete the neurophysical evolution of the fluid and generate the evolved fluid state includes: the neural network model being a preset physics-aware neural operator network whose internal structure includes a cascaded feature encoding layer, spatiotemporal convolutional evolution layer, and physical consistency constraint layer, wherein the spatiotemporal convolutional evolution layer extracts the perturbation features of the total dynamic field of the audio interaction in the spatial dimension and performs temporal state transition calculations in combination with the fluid state texture of the previous frame; reading the fluid state texture of the current frame, and inputting the total dynamic field of the audio interaction and the fluid state texture into the neural network model; performing, within the neural network model, multi-level feature extraction and state transition calculations modulated by the total dynamic field of the audio interaction; and outputting the updated velocity field and density gradient field and writing them into a new fluid state texture as the evolved fluid state.
4. The audio spectrum fluidized interactive presentation method based on shader computing power according to claim 3, characterized in that completing the neurophysical evolution of the fluid further includes implementing an entropy-stabilizing compensation loop based on visual perceptual importance: analyzing the evolved fluid state to calculate a perceptual importance map identifying visually critical regions and potentially unstable regions; feeding the perceptual importance map back to the generation of the total dynamic field of the audio interaction in the next calculation cycle; and, when generating the total dynamic field of the audio interaction for the next frame, applying a directional correction to the interaction potential gradient map according to the perceptual importance map, wherein the directional correction takes the form of injecting virtual mass or energy into the regions marked by the map.
5. The audio spectrum fluidized interactive presentation method based on shader computing power according to claim 4, characterized in that calculating the perceptual importance map identifying visually critical regions and potentially unstable regions includes: extracting the fluid velocity field and density gradient field from the evolved fluid state; calculating the spatial distance weight between each pixel position and the field vortex core based on the field vortex core parameters in the total dynamic field of the audio interaction in the current frame; and combining the fluid velocity field, the density gradient field, and the spatial distance weights to calculate an importance score for each pixel position using a function for evaluating visual perceptual importance, and generating the perceptual importance map from the importance scores.
6. The audio spectrum fluidized interactive presentation method based on shader computing power according to claim 1, characterized in that performing competitive visual attribute mapping on each audio semantic layer includes: assigning an independent visual representation channel to each separated audio semantic layer, each visual representation channel being associated with a set of visual attribute influence factors; for each pixel on the screen, obtaining the instantaneous energy intensity of each audio semantic layer at the evolved fluid state corresponding to the current pixel; based on the instantaneous energy intensities, performing a competitive allocation of rendering resources among the visual representation channels using a nonlinear function for determining rendering resource allocation, thereby determining the channel that dominates the final visual attributes of the current pixel; and combining the visual contributions of the channels according to the competition result to synthesize the final color and optical attributes of the pixel.
7. The audio spectrum fluidized interactive presentation method based on shader computing power according to claim 6, characterized in that performing the competitive allocation of rendering resources among the visual representation channels using the nonlinear function for determining rendering resource allocation includes: inputting the instantaneous energy intensity of each audio semantic layer at the current pixel into a normalized exponential function to calculate the competition weight of each visual representation channel; weighting the visual attribute influence factors associated with each channel based on the competition weights; and superimposing the weighted visual attribute influence factors of the channels, the superposition result defining the final visual attribute of the pixel.
8. The audio spectrum fluidized interactive presentation method based on shader computing power according to claim 1, characterized in that organizing, through the asynchronous computing engine of the GPU, the processes of generating the total dynamic field of the audio interaction, performing the neurophysical evolution, and generating the final rendered image into several parallel, asynchronously executed processing pipelines, and scheduling and outputting them, includes: encapsulating the logic for generating the total dynamic field of the audio interaction as a first asynchronous computation pipeline; encapsulating the logic for performing the neurophysical evolution as a second graphics rendering pipeline; encapsulating the logic for generating the final rendered image as a third graphics rendering pipeline; setting different execution cycles for the first asynchronous computation pipeline, the second graphics rendering pipeline, and the third graphics rendering pipeline, and scheduling them using a staggered clock driving strategy; and configuring a double or triple buffering mechanism for the fluid state texture so that the write operations of the second graphics rendering pipeline and the read operations of the third graphics rendering pipeline are performed asynchronously on different buffers.
9. An audio spectrum fluidized interactive presentation system based on shader computing power, characterized in that the system is used to implement the audio spectrum fluidized interactive presentation method based on shader computing power according to any one of claims 1-8, the system comprising: a feature acquisition module, used to acquire an audio stream and user interaction data in real time, wherein the audio stream is used to extract multidimensional audio features and the user interaction data is used to generate an interaction potential gradient map, the interaction potential gradient map being a two-dimensional texture covering the entire simulation space, generated by converting the screen-space coordinates and motion velocity vector of a user input device, such as a mouse or touch screen, contained in the user interaction data; a total dynamic field generation module, used to perform, in a parallel compute shader of a graphics processing unit (GPU), nonlinear coupling operations on the multidimensional audio features and the interaction potential gradient map to generate a unified total dynamic field of the audio interaction covering the simulation space and time, which includes: mapping the multidimensional audio features to a frequency-energy space and mapping the interaction potential gradient map to a spatial coordinate-force space; performing a tensor product fusion operation on the data in the frequency-energy space and the spatial coordinate-force space using a nonlinear function for fusing multidimensional data, to generate an initial coupling field; identifying, within the initial coupling field, the field vortex core parameters determined by the current audio features and interaction focus; and performing structured enhancement on the initial coupling field based on the field vortex core parameters to generate the total dynamic field of the audio interaction; a neurophysical evolution module, used to input, in a fragment shader of the GPU, the total dynamic field of the audio interaction as a driving signal into a preset neural network model for dynamically modulating the physical behavior of the fluid, and to use the neural network model to perform forward inference on the fluid state texture to complete the neurophysical evolution of the fluid and generate the evolved fluid state, wherein the fluid state texture includes a velocity field and a density gradient field; a competitive rendering module, used to perform semantic separation on the multidimensional audio features in the GPU rendering pipeline to obtain several audio semantic layers, and to perform competitive visual attribute mapping on each audio semantic layer based on the evolved fluid state to generate the final rendered image; and an asynchronous scheduling engine module, used to organize, through the asynchronous computing engine of the GPU, the processes of generating the total dynamic field of the audio interaction, performing the neurophysical evolution, and generating the final rendered image into several parallel, asynchronously executed processing pipelines, and to schedule and output them.