AIGC video generation interaction method based on immersive computing

By combining multimodal data acquired through eye tracking and a ring microphone array, the method generates a two-dimensional dynamic attention mask matrix and uses it to adjust the self-attention weight distribution. This addresses the blurred texture in the focus area in immersive computing environments and aligns the focus of immersive multimodal perception with the focus of video generation.

CN122199752A · Pending · Publication Date: 2026-06-12 · BEIJING YUANDIAN FUTURE TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: BEIJING YUANDIAN FUTURE TECH CO LTD
Filing Date: 2026-04-24
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

In immersive computing environments, existing video generation methods based on head orientation cannot accurately determine the user's actual visual gaze focus and the location of ambient sound sources, resulting in blurred textures in the generated video in the focus area. This makes it impossible to align the immersive multimodal perception focus with the computational focus within the video generation model.

Method used

An eye-tracking sensor obtains the relative position data between the pupil center and the corneal reflective spot. The gaze point coordinates are calculated by combining this data with the three-dimensional spatial data collected by a spatial depth camera. A ring microphone array locates the azimuth and pitch angles of the sound source. From these inputs, a two-dimensional dynamic attention mask matrix is generated and used to adjust the distribution of the self-attention weight matrix, guiding the AIGC video diffusion model to perform high-frequency detail rendering in the area where the gaze point and the sound source overlap.

Benefits of technology

It achieves alignment between the immersive multimodal perception focus and the pixel-level rendering area of video generation, overcoming the blurred texture in the focus area caused by allocating computing power based on head orientation, and ensuring that the computing power of video generation is directed to the focus area that the user actually perceives.

✦ Generated by Eureka AI based on patent content.

[Figure CN122199752A_ABST]

Abstract

The present application relates to the technical field of human-computer interaction, and specifically to an AI-generated content (AIGC) video generation interaction method based on immersive computing. The method calculates three-dimensional gaze point coordinates from the relative position of the pupil center and the corneal reflection spot, and obtains the azimuth and pitch angles of the sound source through beamforming processing of the ring microphone array signals; the gaze point coordinates and the sound source angles are input into a spatiotemporal alignment network to generate a two-dimensional dynamic attention mask matrix corresponding to the latent-space resolution of the video generation model. The scheme maps the visual features and the sound field features into attention constraints inside the model, so that computing power is directed to the perceptual focus area, overcoming the blurred texture in the focus area caused by allocating computing power based on head orientation and aligning the perceptual focus with the video pixel rendering area.

Description

Technical Field

[0001] This invention relates to the field of human-computer interaction technology, and more specifically to an AIGC video generation and interaction method based on immersive computing.

Background Technology

[0002] In immersive computing environments, video generation interaction methods based on generative models typically rely on head-orientation-based field-of-view optimization or on uniform-resolution output. Conventional approaches use inertial sensors worn on the user's head to acquire yaw and pitch angles and estimate the user's current viewing area from them. During video generation, the model designates the 2D image region corresponding to the head orientation as the primary rendering window and allocates more convolution kernels and network layers to image features within this window. Edge regions outside the window are downsampled, or some of their denoising iterations are skipped, so that a basic video frame generation rate can be maintained within the limited computing power of the graphics processing unit.

[0003] The conventional approach relies solely on head orientation for region segmentation and incorporates neither the physiological features of eye movement nor the physical features of the environmental sound field. As a result, the model cannot determine the user's true visual focus under the current head orientation, or the spatial location of sound-emitting objects in the environment. When the user's head is fixed but the gaze shifts, or when an environmental sound source lies at the edge of the head-orientation region, the existing technology distributes computational resources evenly across the entire head-orientation window. The image features in the area where the true gaze point coincides with the sound source therefore receive insufficient weight, which blurs the pixel textures in that focal area and leaves the immersive multimodal perception focus misaligned with the computational focus inside the video generation model.

Summary of the Invention

[0004] The purpose of this invention is to provide a solution that can effectively address the problems described in the background section.

[0005] To achieve the above objective, the present invention adopts the following technical solution. The AIGC video generation and interaction method based on immersive computing includes the following steps: acquiring relative position data of the pupil center and the corneal reflective spot through an eye-tracking sensor, and calculating the coordinates of the user's gaze point in the three-dimensional environment by combining the three-dimensional spatial data collected by a spatial depth camera; collecting multi-channel ambient audio signals through a ring microphone array, and performing beamforming processing to locate the azimuth and pitch angles of the sound source in three-dimensional space; inputting the gaze point coordinates, the azimuth angle, and the pitch angle into a spatiotemporal alignment network to generate a two-dimensional dynamic attention mask matrix corresponding to the latent-space resolution of the AIGC video diffusion model; and, in the self-attention calculation stage of AIGC video generation, using the two-dimensional dynamic attention mask matrix to perform dot-product filtering on the self-attention weight matrix, so as to suppress the self-attention weights of regions that contain neither the gaze point nor a sound source and amplify the self-attention weights of regions where the gaze point and the sound source overlap, guiding the AIGC video diffusion model to perform high-frequency detail rendering in the overlapping regions to generate the target video frames.

[0006] Preferably, the step of acquiring the relative position data of the pupil center and the corneal reflective spot through the eye-tracking sensor and calculating the coordinates of the user's gaze point in the three-dimensional environment by combining the three-dimensional spatial data collected by the spatial depth camera includes: emitting an infrared beam from the eye-tracking sensor to form multiple corneal reflective spots on the surface of the user's cornea, and extracting the two-dimensional pixel coordinates of the multiple corneal reflective spots and of the pupil center; acquiring the environmental 3D point cloud data output by the spatial depth camera, and projecting the 2D pixel coordinates of the multiple corneal reflective spots and of the pupil center onto the corresponding 3D spatial plane in the environmental 3D point cloud data; constructing a three-dimensional fitted spherical model of the eyeball from the coordinates of the multiple corneal reflective spots in three-dimensional space; and taking the offset vector of the pupil center relative to the center of the fitted spherical model, calculating the intersection point of this offset vector with a virtual plane in the environmental 3D point cloud data, and using the coordinates of the intersection point as the gaze point coordinates.

[0007] Preferably, the step of collecting multi-channel ambient audio signals through the ring microphone array and performing beamforming processing to locate the azimuth and pitch angles of the sound source in three-dimensional space includes: performing a fast Fourier transform on the time-domain audio signal acquired by each microphone node in the ring microphone array to extract frequency-domain signal features, and calculating the phase difference between any two microphone nodes; inputting the phase difference values into a preset generalized cross-correlation function to generate a spatial spectrum estimation result, and searching for the peak coordinates in the spatial spectrum estimation result; converting the peak coordinates, based on the physical geometry of the ring microphone array, into the azimuth and pitch angles of the sound source relative to the center of the ring microphone array; and applying Kalman filtering to the azimuth and pitch angles over multiple consecutive time frames to predict and eliminate abnormal coordinate jumps caused by environmental reverberation, and outputting the smoothed azimuth and pitch angles of the target sound source.
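
The core of the localization step above can be illustrated with a minimal two-microphone sketch. It assumes NumPy, uses the common phase-transform (PHAT) weighting for the generalized cross-correlation, and only estimates the inter-microphone delay; the full spatial spectrum search over a ring array and the Kalman smoothing are omitted. The function name `gcc_phat_delay` and the synthetic signals are hypothetical.

```python
import numpy as np

def gcc_phat_delay(sig, ref, fs):
    """Estimate the delay (in seconds) of `sig` relative to `ref` with GCC-PHAT."""
    n = len(sig) + len(ref)                      # zero-pad to avoid circular wrap-around
    S = np.fft.rfft(sig, n=n)
    R = np.fft.rfft(ref, n=n)
    cross = S * np.conj(R)                       # cross-power spectrum (carries the phase difference)
    cross /= np.abs(cross) + 1e-12               # PHAT weighting: keep the phase, drop the magnitude
    cc = np.fft.irfft(cross, n=n)                # generalized cross-correlation over lags
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))  # reorder to lags -max..+max
    lag = int(np.argmax(cc)) - max_shift         # peak search
    return lag / fs

# Synthetic check: `sig` is `ref` delayed by 5 samples (assumed 16 kHz sampling rate).
rng = np.random.default_rng(0)
base = rng.standard_normal(1000)
ref = np.concatenate((base, np.zeros(24)))
sig = np.concatenate((np.zeros(5), base, np.zeros(19)))
delay = gcc_phat_delay(sig, ref, fs=16000)
```

For a microphone pair with known spacing, such a delay constrains the arrival angle; a ring array combines the delays of many pairs into the spatial spectrum whose peak is searched.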

[0008] Preferably, the step of inputting the gaze point coordinates, the azimuth angle, and the pitch angle into the spatiotemporal alignment network to generate the two-dimensional dynamic attention mask matrix corresponding to the latent-space resolution of the AIGC video diffusion model includes: constructing the spatiotemporal alignment network with a visual feature extraction branch and an auditory feature extraction branch; converting the gaze point coordinates into a visual focus Gaussian heatmap in the visual feature extraction branch; converting the azimuth and pitch angles of the target sound source into an auditory focus Gaussian heatmap in the auditory feature extraction branch; aligning the frame rates of the visual focus Gaussian heatmap and the auditory focus Gaussian heatmap to a unified time reference using a time-dimension interpolation algorithm; inputting the aligned visual focus and auditory focus Gaussian heatmaps into the feature fusion layer and adding them element-wise; and mapping the sum to the two-dimensional grid size of the latent space of the AIGC video diffusion model to generate the two-dimensional dynamic attention mask matrix with values ranging from 0 to 1.
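
The heatmap construction and fusion described above can be sketched as follows. This is an illustrative sketch only: it assumes NumPy, assumes the gaze point and sound source direction have already been projected to latent-grid coordinates, uses linear interpolation for the time alignment, and normalizes by the maximum to reach the 0 to 1 range. The function names, the 32×32 grid, and the `sigma` value are hypothetical.

```python
import numpy as np

def gaussian_heatmap(center_xy, grid_hw, sigma):
    """2D Gaussian centered at (x, y) on an (h, w) latent grid."""
    h, w = grid_hw
    ys, xs = np.mgrid[0:h, 0:w]
    cx, cy = center_xy
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def resample_track(t_src, positions, t_dst):
    """Linearly interpolate (T, 2) focus positions onto the target timestamps."""
    positions = np.asarray(positions, dtype=float)
    cols = [np.interp(t_dst, t_src, positions[:, i]) for i in range(positions.shape[1])]
    return np.stack(cols, axis=1)

def fuse_mask(visual, auditory):
    """Element-wise addition followed by max-normalization to [0, 1] (assumed mapping)."""
    fused = visual + auditory
    return fused / fused.max()

# Gaze samples at t = 0.0 s and 0.1 s, resampled to a video frame at t = 0.05 s.
gaze_xy = resample_track([0.0, 0.1], [[10.0, 10.0], [20.0, 10.0]], [0.05])[0]
visual = gaussian_heatmap(gaze_xy, (32, 32), sigma=3.0)
auditory = gaussian_heatmap((15.0, 10.0), (32, 32), sigma=3.0)
mask = fuse_mask(visual, auditory)
```

Here the two focus points coincide at grid position (x=15, y=10), so the mask peaks there and decays smoothly elsewhere.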

[0009] Preferably, the step of performing dot-product filtering on the self-attention weight matrix using the two-dimensional dynamic attention mask matrix in the self-attention calculation stage of AIGC video generation includes: obtaining the self-attention weight matrix calculated from the query vectors and key vectors in the attention calculation layer of the current denoising step of the AIGC video diffusion model; replicating and extending the two-dimensional dynamic attention mask matrix along the channel dimension so that its dimensions match the spatial dimensions of the self-attention weight matrix; performing a Hadamard product of the extended two-dimensional dynamic attention mask matrix and the self-attention weight matrix; setting to 0 the regions of the self-attention weight matrix whose corresponding mask values are less than a preset threshold, while retaining the values of the regions whose corresponding mask values are greater than or equal to the preset threshold, thereby generating the mask-modulated target self-attention weight matrix; and computing the weighted sum of the value vectors using the target self-attention weight matrix.
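
A single-head sketch of this modulation, under stated assumptions: NumPy, one latent token per grid cell, the mask applied along the key axis (one mask value per attended position), and no renormalization after masking, since the text does not mention one. The function name `masked_self_attention` and the tensor sizes are hypothetical.

```python
import numpy as np

def masked_self_attention(q, k, v, mask2d, threshold=0.5):
    """Softmax attention whose weights are modulated by a 2D spatial mask."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)          # self-attention weight matrix
    m = mask2d.reshape(1, -1)                               # one mask value per key token
    modulated = weights * m                                 # Hadamard product (broadcast over queries)
    modulated = np.where(m >= threshold, modulated, 0.0)    # zero out sub-threshold regions
    return modulated @ v, modulated                         # weighted sum of value vectors

# 4x4 latent grid (16 tokens), 8-dimensional features, focus on the central 2x2 cells.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
mask2d = np.zeros((4, 4))
mask2d[1:3, 1:3] = 1.0
out, target_weights = masked_self_attention(q, k, v, mask2d)
```

Whether the mask should act on keys, queries, or both is a design choice the claim leaves open; the key-axis reading is used here because it directly limits which positions contribute to each output.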

[0010] Preferably, the step of guiding the AIGC video diffusion model to perform high-frequency detail rendering in the overlapping region to generate the target video frame includes: during upsampling and decoding of the latent feature map by the AIGC video diffusion model, extracting the connected regions of the two-dimensional dynamic attention mask matrix whose values are greater than the preset threshold as high-frequency enhancement regions; within the high-frequency enhancement regions, applying a high-pass filter based on the Laplacian operator to the latent feature map to extract its edge gradient features in the spatial dimension; adding the edge gradient features to the latent feature map within the high-frequency enhancement regions by residual addition, thereby enhancing the contrast between adjacent pixels within those regions; applying low-pass filtering to the non-enhancement regions, whose values are less than or equal to the preset threshold, for smoothing; and splicing the processed high-frequency enhancement regions and non-enhancement regions together in the spatial dimension and inputting the result to the final decoder layer to output the target video frame.
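
The region-wise filtering above can be sketched on a single-channel feature map. The sketch assumes NumPy, a 3×3 Laplacian kernel for the high-pass filter, a 3×3 box blur for the low-pass filter, and a plain threshold in place of connected-component extraction; the `gain` parameter and all names are hypothetical.

```python
import numpy as np

LAPLACIAN = np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]], dtype=float)  # high-pass kernel
BOX_BLUR = np.full((3, 3), 1.0 / 9.0)                                     # low-pass kernel

def filter3x3(feat, kernel):
    """Apply a 3x3 kernel with edge padding (both kernels here are symmetric)."""
    h, w = feat.shape
    padded = np.pad(feat, 1, mode="edge")
    out = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def enhance_latent(feat, mask, threshold=0.5, gain=0.5):
    """Sharpen inside the enhancement region, smooth outside, then merge spatially."""
    region = mask > threshold
    sharpened = feat + gain * filter3x3(feat, LAPLACIAN)   # residual addition of edge features
    smoothed = filter3x3(feat, BOX_BLUR)
    return np.where(region, sharpened, smoothed)

# A vertical step edge; the top half of the map is the high-frequency enhancement region.
feat = np.zeros((8, 8))
feat[:, 4:] = 1.0
mask = np.zeros((8, 8))
mask[:4, :] = 1.0
out = enhance_latent(feat, mask)
```

In the enhanced rows the step across the edge grows from 1.0 to 2.0, while in the smoothed rows it shrinks, which is the contrast behavior the paragraph describes.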

[0011] Preferably, the step of extracting the two-dimensional pixel coordinates of the multiple corneal reflective spots and of the pupil center includes: acquiring ambient light intensity data collected by the ambient light sensor of the immersive head-mounted device in which the eye-tracking sensor is located; when the ambient light intensity data is greater than a preset light threshold, increasing the emission power of the infrared beam and dynamically increasing the exposure gain of the image sensor in the infrared band, then using an adaptive threshold segmentation algorithm to separate the multiple corneal reflective spots from the background ambient light noise in the current frame and calculating the gray-level centroid of each separated spot candidate region as its two-dimensional pixel coordinates; when the ambient light intensity data is less than or equal to the preset light threshold, reducing the emission power of the infrared beam to reduce the thermal effect on the eye; and using an edge detection operator to extract the contour of the pupil, fitting an ellipse to the contour, and taking the center of the ellipse as the two-dimensional pixel coordinates of the pupil center.
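
The gray-level centroid computation can be sketched for a region of interest that contains one reflective spot. The sketch assumes NumPy and uses "mean plus k standard deviations" as the adaptive threshold, which is an illustrative rule and not one specified in the text; `spot_centroid` and `k` are hypothetical names.

```python
import numpy as np

def spot_centroid(roi, k=2.0):
    """Intensity-weighted centroid (x, y) of the pixels above an adaptive threshold."""
    roi = np.asarray(roi, dtype=float)
    thresh = roi.mean() + k * roi.std()          # adaptive threshold (assumed rule)
    weights = np.where(roi > thresh, roi, 0.0)   # keep spot pixels, suppress background
    total = weights.sum()
    if total == 0:
        return None                              # no spot found in this region
    ys, xs = np.indices(roi.shape)
    return (xs * weights).sum() / total, (ys * weights).sum() / total

# Synthetic infrared patch: background level 10, a 3x3 glint of level 200.
roi = np.full((20, 20), 10.0)
roi[5:8, 10:13] = 200.0
cx, cy = spot_centroid(roi)
```

With several spots in one frame, each candidate region would first need to be separated (for example by connected-component labeling) before the centroid is computed per region.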

[0012] Preferably, the step of converting the peak coordinates into the azimuth and pitch angles of the sound source relative to the center of the ring microphone array includes: acquiring three-dimensional head posture data collected by the inertial measurement unit, the data including the head yaw angle, head pitch angle, and head roll angle; calculating, from the three-dimensional head posture data, the first rotation matrix of the center of the ring microphone array in the global world coordinate system; obtaining the fixed installation offset of the ring microphone array relative to the user's head, and combining it with the first rotation matrix to convert the offset into the absolute coordinates of the microphone array in the global world coordinate system; determining the initial positioning coordinates of the sound source in the global world coordinate system using the absolute coordinates of the microphone array; transforming the initial positioning coordinates into a head-relative local coordinate system whose origin is the center of the ring microphone array and whose longitudinal axis points straight ahead from the midpoint of the line connecting the user's two ears; and calculating the azimuth and pitch angles in the head-relative local coordinate system.
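
The coordinate transformation above can be sketched as follows. The conventions are assumptions made for illustration, not taken from the text: NumPy, a right-handed frame with x forward, y left, z up, intrinsic yaw-pitch-roll (Z-Y-X) Euler angles, azimuth measured from the forward axis toward the left, and elevation measured from the horizontal plane. All function and parameter names are hypothetical.

```python
import numpy as np

def rotation_from_ypr(yaw, pitch, roll):
    """Head-to-world rotation matrix for Z-Y-X Euler angles (radians)."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
    ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    return rz @ ry @ rx

def source_angles_head_frame(source_world, head_pos_world, ypr, mount_offset):
    """Azimuth and elevation of a world-frame source in the array-centered head frame."""
    r = rotation_from_ypr(*ypr)
    array_origin = head_pos_world + r @ mount_offset   # absolute position of the array
    local = r.T @ (source_world - array_origin)        # world frame -> head-relative frame
    azimuth = np.arctan2(local[1], local[0])
    elevation = np.arctan2(local[2], np.hypot(local[0], local[1]))
    return azimuth, elevation

# Head turned 90 degrees to the left; the source sits straight ahead of the turned head.
az, el = source_angles_head_frame(
    source_world=np.array([0.0, 2.0, 0.1]),
    head_pos_world=np.zeros(3),
    ypr=(np.pi / 2, 0.0, 0.0),
    mount_offset=np.array([0.0, 0.0, 0.1]),
)
```

Because the source is expressed relative to the rotated array origin, head rotation alone changes the reported angles while the world-frame source position stays fixed.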

[0013] Preferably, the step of inputting the aligned visual focus Gaussian heatmap and auditory focus Gaussian heatmap into the feature fusion layer and adding them element-wise includes: calculating the Euclidean distance between the Gaussian center point in the aligned visual focus Gaussian heatmap and the Gaussian center point in the auditory focus Gaussian heatmap; when the Euclidean distance is less than a preset spatial conflict threshold, performing the element-wise addition directly; when the Euclidean distance is greater than or equal to the preset spatial conflict threshold, obtaining the sound source category label parsed from the multi-channel audio signal by the speech recognition model and the continuous dwell time of the Gaussian center point in the visual focus Gaussian heatmap; when the sound source category label belongs to a preset strong interaction category and the continuous dwell time is greater than a preset duration threshold, setting the weight coefficient of the auditory focus Gaussian heatmap greater than that of the visual focus Gaussian heatmap; otherwise, keeping the weight coefficient of the visual focus Gaussian heatmap greater than that of the auditory focus Gaussian heatmap; and performing weighted fusion according to the weight coefficients.
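
The decision logic above reduces to a small function. In this sketch the threshold values, the label set, and the 0.7/0.3 weight split are hypothetical placeholders chosen only to make the branches concrete; the text specifies the ordering of the weights, not their values.

```python
import math

def fusion_weights(visual_center, auditory_center, label, dwell_s,
                   conflict_dist=8.0, strong_labels=("speech", "alarm"),
                   dwell_threshold_s=0.8):
    """Return (w_visual, w_auditory) for fusing the two Gaussian heatmaps."""
    dist = math.dist(visual_center, auditory_center)   # Euclidean distance between centers
    if dist < conflict_dist:
        return 1.0, 1.0                                # no conflict: plain element-wise addition
    if label in strong_labels and dwell_s > dwell_threshold_s:
        return 0.3, 0.7                                # auditory focus outweighs visual focus
    return 0.7, 0.3                                    # default: visual focus outweighs auditory

near = fusion_weights((10, 10), (12, 11), "speech", 1.2)
far_strong = fusion_weights((10, 10), (40, 30), "speech", 1.2)
far_weak = fusion_weights((10, 10), (40, 30), "music", 1.2)
```

The fused heatmap would then be `w_visual * visual + w_auditory * auditory`, followed by whatever normalization the mask generation uses.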

[0014] Preferably, the step of performing a Hadamard product of the extended two-dimensional dynamic attention mask matrix and the self-attention weight matrix includes: before performing the Hadamard product, extracting the set of boundary pixel coordinates of the internal region with a value of 1 in the two-dimensional dynamic attention mask matrix; for each boundary pixel coordinate in the set, calculating the shortest spatial distance from that coordinate to the external region with a value of 0; using the shortest spatial distance as the independent variable, calculating the attenuation weight of the corresponding boundary pixel with a preset Gaussian attenuation function, wherein the width parameter of the Gaussian attenuation function is negatively correlated with the time step of the current denoising step of the AIGC video diffusion model; replacing the mask values at the boundary pixel coordinates with values weighted by the attenuation weights, so that the stepped boundary from 1 to 0 in the two-dimensional dynamic attention mask matrix becomes a smooth attenuation boundary; and then performing the Hadamard product.
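
One possible reading of this boundary smoothing is sketched below. The text does not give the attenuation function explicitly, so the sketch makes several assumptions: NumPy, a weight of `1 - exp(-d^2 / (2 * sigma^2))` where `d` is the distance to the nearest zero-valued pixel (so pixels adjacent to the outside are attenuated most), a linear schedule for the width, and reweighting of every inside pixel (deep interior pixels stay close to 1). The distance search is brute force and only suitable for small latent grids. All names are hypothetical.

```python
import numpy as np

def sigma_for_step(t, total_steps, sigma_min=0.5, sigma_max=3.0):
    """Width parameter that decreases linearly as the time-step index grows."""
    return sigma_max - (sigma_max - sigma_min) * t / (total_steps - 1)

def soften_mask(mask, sigma):
    """Replace the hard 1-to-0 edge of a binary mask with a smooth Gaussian ramp."""
    mask = np.asarray(mask, dtype=float)
    inside = np.argwhere(mask == 1)
    outside = np.argwhere(mask == 0)
    if len(inside) == 0 or len(outside) == 0:
        return mask.copy()
    diff = inside[:, None, :] - outside[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)     # shortest distance to the 0 region
    weight = 1.0 - np.exp(-dist ** 2 / (2.0 * sigma ** 2))   # assumed attenuation form
    out = mask.copy()
    out[inside[:, 0], inside[:, 1]] = weight
    return out

mask = np.zeros((9, 9))
mask[2:7, 2:7] = 1.0
soft = soften_mask(mask, sigma=1.0)
```

A wider `sigma` produces a longer ramp, so tying the width to the denoising step changes how sharply the focus region is delimited as generation proceeds.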

[0015] Compared with the prior art, the beneficial effects of the present invention are as follows: 1. The invention calculates the three-dimensional gaze point coordinates from the relative position of the pupil center and the corneal reflective spot, and obtains the azimuth and pitch angles of the sound source through beamforming processing with a ring microphone array. This multimodal data is input into a spatiotemporal alignment network to generate a two-dimensional dynamic attention mask matrix. In the self-attention calculation stage, the mask matrix is used to perform dot-product filtering on the self-attention weight matrix, amplifying the self-attention weights in the areas where the gaze point and the sound source overlap and suppressing the weights in non-overlapping areas. This mechanism maps biological visual features and physical sound field features directly into attention constraints in the latent space of the generation model. It changes how computational weight is distributed across spatial locations of the model's internal feature maps, so that the computing power of video generation is directed to the focus area the user actually perceives. This overcomes the blurred texture in the focus area caused by allocating computing power based on head orientation, and aligns the immersive multimodal perception focus with the pixel-level rendering area of video generation.

[0016] 2. In the mask matrix generation and modulation process: constructing a three-dimensional fitted spherical model of the eyeball and calculating the intersection of the offset vector improves the accuracy of the three-dimensional solution for the gaze point coordinates; processing the multi-channel audio signal with a generalized cross-correlation function and Kalman filtering eliminates abnormal jumps in the sound source coordinates caused by environmental reverberation; dynamically adjusting the infrared emission power and exposure gain according to the ambient light intensity ensures reliable extraction of pupil and spot features under different lighting conditions; calculating the rotation matrix from head posture data and performing the coordinate transformation eliminates the interference of head movement with the sound source positioning coordinates; when the visual and auditory focus conflict spatially, dynamically adjusting the fusion weights according to the sound source category label and the dwell time avoids mutual interference of the weights when the two focus points are separated; and smoothing the mask boundary with a Gaussian attenuation function, whose width changes dynamically with the denoising time step, eliminates the step-like hard truncation at the mask boundary and avoids pixel stitching seams at the junction of the focus area and the background area in the final target video frame.

Attached Figure Description

[0017] Figure 1 is an overall flowchart of the video generation and interaction method based on immersive computing of the present invention;
Figure 2 is a flowchart of the gaze point coordinate calculation process of the present invention;
Figure 3 is a flowchart of the sound source localization and coordinate transformation process of the present invention;
Figure 4 is a flowchart of the mask matrix generation process of the present invention;
Figure 5 is a flowchart of the self-attention weight modulation process of the present invention;
Figure 6 is a flowchart of the high-frequency detail rendering process of the present invention.

Detailed Implementation

[0018] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those skilled in the art based on the embodiments of this application without creative effort are within the scope of protection of this application.

[0019] Referring to Figure 1, this embodiment provides an AIGC video generation and interaction method based on immersive computing, running on a computing system equipped with an immersive head-mounted display device. The head-mounted display device integrates an eye-tracking sensor, a spatial depth camera, a ring microphone array, an inertial measurement unit, and an ambient light sensor. All of these hardware components are connected to the main processor built into the head-mounted display device via an internal high-speed bus. The main processor loads and runs an AIGC video diffusion model, specifically a video generation network based on a latent diffusion architecture, including an encoder, a decoder, and a temporal denoising network with multiple self-attention computation layers. In the latent diffusion architecture, video frames are mapped to a low-dimensional latent space for processing, and the feature map resolution of the latent space is a preset fixed size. When the user wears the head-mounted display device in a three-dimensional virtual or augmented reality environment, the main processor acquires the relative position data of the pupil center and the corneal reflective spot through the eye-tracking sensor and combines it with the three-dimensional spatial data of the environment acquired by the spatial depth camera to calculate the coordinates of the user's gaze point in the three-dimensional environment. At the same time, the main processor acquires multi-channel ambient audio signals through the ring microphone array and performs beamforming processing to locate the azimuth and pitch angles of the sound source in three-dimensional space.
After acquiring these visual and auditory spatial coordinates, the main processor inputs the gaze point coordinates and the azimuth and pitch angles of the sound source into a pre-trained spatiotemporal alignment network to generate a two-dimensional dynamic attention mask matrix corresponding to the latent-space resolution of the AIGC video diffusion model. During the denoising process of AIGC video generation, when the self-attention calculation stage is reached, the main processor uses the generated two-dimensional dynamic attention mask matrix to perform dot-product filtering on the self-attention weight matrix. This suppresses the self-attention weights in areas that contain neither the gaze point nor a sound source, and amplifies the self-attention weights in areas where the gaze point and the sound source overlap. The AIGC video diffusion model is thereby guided to concentrate computational resources on high-frequency detail rendering in the overlapping areas, and the target video frame is finally generated and output through the decoder. This mechanism maps biological visual features and physical sound field features directly to attention constraints in the latent space of the generation model, changing how computational weight is distributed across spatial locations of the model's internal feature maps. The computing power of video generation is thus directed to the focus area the user actually perceives, overcoming the blurred texture in the focus area caused by allocating computing power based on head orientation in conventional solutions and aligning the immersive multimodal perception focus with the pixel-level rendering area of video generation.

[0020] In one embodiment, referring to Figure 2, the coordinates of the user's gaze point in the three-dimensional environment are calculated as follows. The eye-tracking sensor incorporates an infrared emitter and an infrared image sensor. When the system activates eye tracking, the infrared emitter emits infrared light of a specific wavelength toward the user's eye. This light undergoes specular reflection at the anterior surface of the cornea, forming multiple corneal reflective spots. At the same time, part of the infrared light enters the pupil and, after reflection from the fundus, produces a dark-pupil feature. The infrared image sensor acquires infrared images of the eye, containing the corneal reflective spots and the dark pupil, at a preset frame rate. After acquiring the current frame, the main processor uses image processing algorithms to extract the 2D pixel coordinates of the multiple corneal reflective spots and of the pupil center. The main processor simultaneously acquires the environmental 3D point cloud data output by the spatial depth camera, which uses time-of-flight or structured-light measurement to obtain the depth of each point in the environment and generate a dense 3D point cloud.
Based on the extrinsic calibration matrix between the eye-tracking sensor and the spatial depth camera, the main processor applies an inverse perspective projection to project the 2D pixel coordinates of the corneal reflective spots and of the pupil center onto the corresponding 3D spatial plane in the environmental 3D point cloud data, thereby obtaining the initial 3D coordinates of these feature points. To calculate the gaze direction, the main processor constructs a three-dimensional fitted spherical model of the eyeball from the 3D coordinates of the multiple corneal reflective spots. Assuming the eyeball is a standard sphere, the spherical equation can be expressed as:

$$(x - x_c)^2 + (y - y_c)^2 + (z - z_c)^2 = R^2$$

[0021] where $(x_c, y_c, z_c)$ are the coordinates of the center of the three-dimensional fitted spherical model of the eyeball, $R$ is the radius of the sphere, and $(x, y, z)$ are the three-dimensional coordinates of a corneal reflective spot. Because actual measurements contain noise, the main processor fits the sphere to the coordinates of the multiple spots with a nonlinear least squares method, constructing the error function $E$:

$$E(x_c, y_c, z_c, R) = \sum_{i=1}^{N} \left[ (x_i - x_c)^2 + (y_i - y_c)^2 + (z_i - z_c)^2 - R^2 \right]^2$$

[0022] where $N$ is the number of corneal reflective spots and $i$ is the spot index. The Levenberg-Marquardt algorithm is used to iteratively minimize the error function and find the optimal sphere center coordinates. After obtaining the three-dimensional fitted spherical model of the eyeball, the main processor calculates the offset vector of the pupil center in three-dimensional space relative to the center of the model. The offset vector $\vec{v}$ is calculated as:

$$\vec{v} = (v_x, v_y, v_z) = (x_p - x_c,\; y_p - y_c,\; z_p - z_c)$$

[0023] where $(x_p, y_p, z_p)$ are the coordinates of the pupil center in three-dimensional space. The physical meaning of the offset vector $\vec{v}$ is the current visual axis direction of the eyeball. To obtain the user's actual gaze location within the environment, the main processor constructs a virtual plane in the environmental 3D point cloud data. This virtual plane is typically set at a specific distance from the user and perpendicular to the initial gaze direction. The main processor calculates the intersection point of the offset vector with this virtual plane by first writing the line of sight in parametric form:

$$x = x_c + t\,v_x, \quad y = y_c + t\,v_y, \quad z = z_c + t\,v_z$$

[0024] where $v_x$, $v_y$, and $v_z$ are the components of the line-of-sight vector along the three coordinate axes of the three-dimensional Cartesian coordinate system, and $t$ is the line parameter. These expressions are substituted into the general equation of the virtual plane, $Ax + By + Cz + D = 0$, where $A$, $B$, and $C$ are the components of the normal vector of the virtual plane along the three coordinate axes and $D$ is the constant term of the plane equation. Rearranging the terms gives the parameter $t$:

$$t = -\frac{A x_c + B y_c + C z_c + D}{A v_x + B v_y + C v_z}$$

[0025] Substituting the obtained parameter $t$ back into the parametric equations yields the coordinates $(x, y, z)$, which serve as the user's gaze point coordinates in the 3D environment. Constructing a three-dimensional fitted spherical model of the eyeball and calculating the intersection of the offset vector eliminates the gaze offset error caused by overall head movement, thus improving the accuracy of the three-dimensional solution for the gaze point coordinates.
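
The sphere fitting and line-plane intersection described above can be sketched numerically. Note the substitution: instead of the Levenberg-Marquardt iteration named in the text, this sketch uses the closed-form linear (algebraic) least-squares sphere fit, which is often used as an initial estimate for such an iteration. It assumes NumPy; the function names and the synthetic coordinates (in millimeters) are hypothetical.

```python
import numpy as np

def fit_sphere(points):
    """Linear least-squares sphere fit; needs at least 4 non-coplanar points."""
    p = np.asarray(points, dtype=float)
    # |p|^2 = 2 c.p + (R^2 - |c|^2) is linear in the center c and the last term.
    a = np.hstack((2.0 * p, np.ones((len(p), 1))))
    b = (p ** 2).sum(axis=1)
    w, *_ = np.linalg.lstsq(a, b, rcond=None)
    center = w[:3]
    radius = np.sqrt(w[3] + center @ center)
    return center, radius

def gaze_point(center, pupil, plane):
    """Intersect the line from the sphere center through the pupil with Ax+By+Cz+D=0."""
    a, b, c, d = plane
    normal = np.array([a, b, c], dtype=float)
    v = np.asarray(pupil, dtype=float) - center          # offset (visual axis) vector
    denom = normal @ v
    if abs(denom) < 1e-12:
        return None                                      # line of sight parallel to the plane
    t = -(normal @ center + d) / denom
    return center + t * v

# Synthetic spots on a sphere of radius 12 centered at (1, 2, 3).
true_center = np.array([1.0, 2.0, 3.0])
dirs = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [-1, 0, 0], [0.6, 0.8, 0], [0, 0.6, 0.8]])
center, radius = fit_sphere(true_center + 12.0 * dirs)
pupil = true_center + 12.0 * np.array([0.0, 0.0, 1.0])
gaze = gaze_point(center, pupil, plane=(0.0, 0.0, 1.0, -500.0))   # virtual plane z = 500
```

With noisy spot coordinates, the linear estimate can seed a nonlinear refinement of the center and radius.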

[0026] In one embodiment, referring to Figure 3, the process of acquiring multi-channel ambient audio signals through a ring microphone array and performing beamforming to locate the azimuth and pitch angles of the sound source in three-dimensional space involves signal processing and spatial spectrum estimation. The ring microphone array consists of $M$ microphone nodes evenly distributed on a circle of radius $r$. During the acquisition phase, each microphone node synchronously acquires a time-domain audio signal. The main processor performs a Fast Fourier Transform on the time-domain audio signal acquired by each microphone node to extract frequency-domain signal features. For the $m$-th microphone node with time-domain signal $x_m(n)$, the frequency-domain signal obtained after a Fast Fourier Transform of length $L$ is:

$$X_m(k) = \sum_{n=0}^{L-1} x_m(n)\, e^{-j 2\pi k n / L}$$

[0027] Here, $k$ is the frequency index and $n$ is the index of the time sampling points. After acquiring the frequency-domain signals of all nodes, the main processor calculates the phase difference between any two microphone nodes. Taking a reference node as an example, the phase difference $\Delta\psi_m(k)$ between node $m$ and the reference node $\mathrm{ref}$ at frequency $k$ is extracted from the cross-power spectrum:

$$\Delta\psi_m(k) = \arg\left( X_m(k)\, X_{\mathrm{ref}}^{*}(k) \right)$$

[0028] Here, $\arg(\cdot)$ denotes the phase angle of a complex number, $X_{\mathrm{ref}}(k)$ is the frequency-domain signal of the reference node, and $X_{\mathrm{ref}}^{*}(k)$ is its complex conjugate. To estimate the sound source location accurately in the presence of environmental reverberation and noise, the main processor inputs the phase difference values into a preset generalized cross-correlation function to generate a spatial spectrum estimation result. Specifically, a generalized cross-correlation function with phase transform weighting is used, whose frequency-domain calculation formula is:

$$R_m(\tau) = \int_{-\infty}^{\infty} \frac{X_m(\omega)\, X_{\mathrm{ref}}^{*}(\omega)}{\left| X_m(\omega)\, X_{\mathrm{ref}}^{*}(\omega) \right|}\, e^{j\omega\tau}\, d\omega$$

[0029] Here, $\tau$ is the time delay and $\omega$ is the angular frequency. $X_m(\omega)$ is the frequency-domain signal of node $m$ and $X_{\mathrm{ref}}^{*}(\omega)$ is the conjugate of the frequency-domain signal of the reference node. The phase transform weighting term in the denominator whitens the amplitude spectrum of the signal, retaining only phase information and suppressing the interference of environmental reverberation with time delay estimation. The main processor pre-constructs a steering vector based on the physical geometry of the ring microphone array. For any candidate sound source point in three-dimensional space whose azimuth angle relative to the center of the microphone array is $\theta$ and whose pitch angle is $\varphi$, the theoretical time delay $\tau_m$ with which the candidate point reaches the $m$-th microphone is:

$$\tau_m(\theta, \varphi) = \frac{r \cos\varphi\, \cos(\theta - \theta_m)}{c}$$

[0030] Here, $\theta_m$ is the angular position of the $m$-th microphone on the ring, $c$ is the speed of sound, and $r$ is the radius of the ring microphone array. The main processor searches a three-dimensional spatial grid, calculates the generalized cross-correlation spatial spectrum output power for each candidate azimuth and pitch angle, and searches for the peak coordinates in the spatial spectrum estimation result. The azimuth angle $\hat{\theta}$ and pitch angle $\hat{\varphi}$ that maximize the output power serve as the preliminary positioning result. Because of transient noise in the real environment, the preliminary positioning coordinates may exhibit abnormal jumps. The main processor therefore performs Kalman filtering prediction on the azimuth and pitch angles across multiple consecutive time frames. The state vector of the Kalman filter is $\mathbf{x}_t = [\theta_t, \dot{\theta}_t, \varphi_t, \dot{\varphi}_t]^{T}$, where $\theta_t$ is the azimuth angle at time $t$, $\dot{\theta}_t$ is the angular velocity of the azimuth angle at time $t$, $\varphi_t$ is the pitch angle at time $t$, and $\dot{\varphi}_t$ is the angular velocity of the pitch angle at time $t$; the state thus contains both the current angles and their angular velocities. The state transition equation is:

$$\mathbf{x}_t = \mathbf{F}\, \mathbf{x}_{t-1} + \mathbf{w}_{t-1}$$

[0031] Here, $\mathbf{x}_t$ is the state vector at time $t$, $\mathbf{F}$ is the state transition matrix of the uniform motion model, and $\mathbf{w}_{t-1}$ represents the process noise. The observation equation is:

$$\mathbf{z}_t = \mathbf{H}\, \mathbf{x}_t + \mathbf{v}_t$$

[0032] Here, $\mathbf{z}_t$ is the peak coordinate obtained by searching the generalized cross-correlation function for the current frame, $\mathbf{H}$ is the observation matrix, and $\mathbf{v}_t$ is the observation noise. The main processor computes the prediction and update steps recursively, using the Kalman gain to dynamically balance the reliability of the predicted and observed values. This eliminates abnormal coordinate jumps caused by environmental reverberation and outputs the smoothed azimuth and pitch angles of the target sound source.
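The prediction and update recursion of paragraphs [0030] to [0032] can be illustrated with a constant-velocity Kalman filter for a single angle. This is a sketch under stated assumptions: the class name `AngleKalman`, the noise parameters `q` and `r`, the initial covariance, and the decision to filter azimuth and pitch independently (which matches the four-state model only when the two angles are uncoupled) are illustrative choices, and angle wrap-around is not handled.

```python
class AngleKalman:
    """Constant-velocity Kalman filter with state [angle, angular_rate]."""

    def __init__(self, angle0, dt, q=1e-3, r=4.0):
        self.x = [angle0, 0.0]             # state estimate
        self.P = [[1.0, 0.0], [0.0, 1.0]]  # estimate covariance (assumed)
        self.dt, self.q, self.r = dt, q, r

    def step(self, z):
        dt, P = self.dt, self.P
        # Predict with F = [[1, dt], [0, 1]]: x = F x, P = F P F^T + q I.
        angle = self.x[0] + dt * self.x[1]
        rate = self.x[1]
        p00 = P[0][0] + dt * (P[0][1] + P[1][0]) + dt * dt * P[1][1] + self.q
        p01 = P[0][1] + dt * P[1][1]
        p10 = P[1][0] + dt * P[1][1]
        p11 = P[1][1] + self.q
        # Update with H = [1, 0]: only the angle is observed.
        s = p00 + self.r              # innovation variance
        k0, k1 = p00 / s, p10 / s     # Kalman gain
        y = z - angle                 # innovation
        self.x = [angle + k0 * y, rate + k1 * y]
        self.P = [[(1 - k0) * p00, (1 - k0) * p01],
                  [p10 - k1 * p00, p11 - k1 * p01]]
        return self.x[0]


# A single 90-degree outlier in an otherwise steady 10-degree azimuth track.
azimuth_filter = AngleKalman(angle0=10.0, dt=0.1)
smoothed = [azimuth_filter.step(z) for z in [10.0, 10.0, 10.0, 90.0, 10.0]]
```

The gain lies strictly between zero and one, so the filtered value at the outlier frame moves only part of the way toward the jump; a larger `r` trusts the observations less and smooths more.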

[0033] In one embodiment, referring to Figure 4, the process of inputting the gaze point coordinates, azimuth angle, and pitch angle into a spatiotemporal alignment network to generate a two-dimensional dynamic attention mask matrix corresponding to the latent-space resolution of the AIGC video diffusion model involves spatiotemporal alignment and dimensional mapping of multimodal features. The main processor constructs a spatiotemporal alignment network containing a visual feature extraction branch and an auditory feature extraction branch, both composed of multilayer perceptrons or convolutional neural networks. In the visual feature extraction branch, the main processor first projects the gaze point coordinates $(x_g, y_g, z_g)$ in the three-dimensional environment onto a two-dimensional image plane through the perspective projection model of a virtual camera, obtaining the two-dimensional coordinates $(u_v, v_v)$ of the visual focus. A Gaussian heatmap $H_v$ of the visual focus is then generated in the region centered on these two-dimensional coordinates. The formula for generating it is:

$$H_v(u, v) = \exp\left( -\frac{(u - u_v)^2 + (v - v_v)^2}{2\sigma_v^2} \right)$$

[0034] Here, $(u, v)$ are the pixel coordinates on the image plane and $\sigma_v$ is the Gaussian kernel width parameter of the visual focus, which determines the perceptual diffusion range of the visual attention area. In the auditory feature extraction branch, the main processor maps the azimuth angle $\theta$ and pitch angle $\varphi$ of the target sound source to the same two-dimensional image plane coordinate system, obtaining the two-dimensional coordinates $(u_a, v_a)$ of the auditory focus, and uses the same logic to generate a Gaussian heatmap $H_a$ of the auditory focus:

$$H_a(u, v) = \exp\left( -\frac{(u - u_a)^2 + (v - v_a)^2}{2\sigma_a^2} \right)$$

[0035] Here, $\sigma_a$ is the Gaussian kernel width parameter of the auditory focus. Because the visual and auditory sensors use different sampling hardware, the frame rate of the visual focus Gaussian heatmap typically differs from that of the auditory focus Gaussian heatmap. The main processor aligns both frame rates to a unified time base using a time-dimension interpolation algorithm. Assume the timestamp sequence of the visual heatmap is $\{t^{v}_i\}$, the timestamp sequence of the auditory heatmap is $\{t^{a}_j\}$, and the frame interval of the unified time base is $\Delta t$. For time points on the unified time base where no corresponding heatmap exists, the main processor uses linear interpolation to calculate the heatmap value $H(t)$ at that time point:

$$H(t) = H(t_1) + \frac{t - t_1}{t_2 - t_1} \left( H(t_2) - H(t_1) \right)$$

[0036] Here, $t_1$ and $t_2$ are the two known heatmap timestamps adjacent to the current time point $t$ to be interpolated, and $H(t_1)$ and $H(t_2)$ are the known heatmap values at $t_1$ and $t_2$, respectively. After time alignment, the main processor inputs the aligned visual focus Gaussian heatmap and auditory focus Gaussian heatmap into the top feature fusion layer of the spatiotemporal alignment network and performs element-wise addition to obtain the fused heatmap. To make the fused heatmap applicable to the AIGC video diffusion model, the main processor maps its spatial dimensions to the two-dimensional grid size of the model's latent space, whose resolution is typically much smaller than the original image resolution. The main processor scales the fused heatmap to the two-dimensional grid size of the latent space (for example, by bilinear interpolation downsampling or adaptive average pooling). Finally, the main processor truncates and normalizes the scaled matrix values, constraining all values to the range of zero to one, and generates the final two-dimensional dynamic attention mask matrix $M \in \mathbb{R}^{h \times w}$, where $h$ and $w$ are the height and width of the latent-space feature map, respectively.
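The mask generation steps of paragraphs [0033] to [0036] can be sketched as follows. This is an illustrative example only: the function names, the 8 by 8 image size, the pooling factor, and the use of plain clipping for the "truncate and normalize" step are assumptions, and fixed-factor average pooling stands in for the adaptive average pooling mentioned in the text.

```python
import math


def gaussian_heatmap(height, width, center, sigma):
    """Gaussian heatmap centered on the 2D focus (u, v)."""
    cu, cv = center
    return [[math.exp(-((u - cu) ** 2 + (v - cv) ** 2) / (2.0 * sigma ** 2))
             for u in range(width)] for v in range(height)]


def interpolate(map1, map2, t1, t2, t):
    """Linear interpolation between two heatmaps known at t1 and t2."""
    a = (t - t1) / (t2 - t1)
    return [[p + a * (q - p) for p, q in zip(r1, r2)]
            for r1, r2 in zip(map1, map2)]


def average_pool(grid, factor):
    """Downsample by averaging non-overlapping factor x factor blocks."""
    h, w = len(grid) // factor, len(grid[0]) // factor
    return [[sum(grid[y * factor + dy][x * factor + dx]
                 for dy in range(factor) for dx in range(factor)) / factor ** 2
             for x in range(w)] for y in range(h)]


def attention_mask(visual, auditory, factor):
    """Element-wise addition, downsampling to latent size, clipping to [0, 1]."""
    fused = [[a + b for a, b in zip(rv, ra)] for rv, ra in zip(visual, auditory)]
    pooled = average_pool(fused, factor)
    return [[min(1.0, max(0.0, x)) for x in row] for row in pooled]


visual = gaussian_heatmap(8, 8, center=(2, 2), sigma=1.5)
auditory = gaussian_heatmap(8, 8, center=(2, 2), sigma=1.5)
mask = attention_mask(visual, auditory, factor=2)  # 4 x 4 latent-size mask
```

When both focal points coincide, the summed response exceeds one near the focus and is clipped, so the overlapping region saturates at the maximum mask value.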

[0037] In one embodiment, referring to Figure 5, in the self-attention computation stage of AIGC video generation, the process of using the two-dimensional dynamic attention mask matrix to perform dot product filtering on the self-attention weight matrix intervenes directly in the internal feature interaction logic of the diffusion model. The denoising network of the AIGC video diffusion model adopts a spatiotemporal Transformer architecture. In an attention computation layer of the current denoising step, the input latent feature map tensor is $Z \in \mathbb{R}^{T \times h \times w \times C}$, where $\mathbb{R}$ represents the set of real numbers, $T$ is the number of time frames, and $C$ is the number of feature channels. The main processor first reshapes the latent feature map into a two-dimensional sequence $Z' \in \mathbb{R}^{S \times C}$, where the sequence length is $S = T \cdot h \cdot w$. The query matrix $Q$, key matrix $K$, and value matrix $V$ are generated through three linear projection layers, each of dimension $S \times d$, where $d$ is the dimension of the attention head. The main processor calculates the matrix product of the query matrix and the transpose of the key matrix to obtain the original self-attention weight matrix:

$$A = \frac{Q K^{T}}{\sqrt{d}}$$

[0038] Dividing by $\sqrt{d}$ scales the dot product result so that it does not become too large at higher dimensions, which would push the subsequent Softmax function into its saturation region and produce very small gradients. At this point, the main processor obtains the two-dimensional dynamic attention mask matrix generated in the previous steps. Because the self-attention weight matrix $A$ has dimension $S \times S$, the main processor replicates and expands the two-dimensional dynamic attention mask matrix in the channel and time dimensions. Specifically, the mask matrix is replicated $T$ times in the time dimension, and the expanded mask matrix $M_{\mathrm{exp}}$ is either laid out as a block-diagonal matrix by sequence index in the spatial dimension or mapped directly by spatial position index, so that its dimensions remain completely consistent with those of the self-attention weight matrix. During expansion, the mask value of a spatial location is assigned to all query-key interaction rows and columns corresponding to that location. The main processor then performs a Hadamard product on the expanded two-dimensional dynamic attention mask matrix and the self-attention weight matrix, that is, multiplies them element-wise:

$$\tilde{A} = A \odot M_{\mathrm{exp}}$$

[0039] Here, $\tilde{A}$ is the self-attention weight matrix after masking. After performing the Hadamard product, the main processor introduces a preset threshold $\epsilon$ for filtering. Entries of the self-attention weight matrix whose corresponding mask values are less than the preset threshold are set directly to zero, that is, $\tilde{A}_{ij} = 0$ where $M_{\mathrm{exp},ij} < \epsilon$. The original values of entries whose corresponding mask values are greater than or equal to the preset threshold are retained. This operation cuts off the feature information transmission path of non-focal regions and regions without sound sources in the latent space. Finally, the main processor performs Softmax normalization (exponential-sum normalization of each row) on the zeroed target self-attention weight matrix and uses the normalized matrix to compute a weighted sum of the value vectors:

$$O = \mathrm{softmax}(\tilde{A})\, V$$

[0040] Here, $O$ is the final output matrix of the self-attention module. Through this mask modulation mechanism, when the model calculates the features of the overlapping region, it can extract more contextual information related to that region from the value vectors of the whole image. When it calculates the non-overlapping regions, their attention weights are forcibly suppressed, which directs computing power toward the perceptual focus.
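The masked attention computation of paragraphs [0037] to [0040] can be sketched for a single head on a tiny sequence. This follows the order of operations in the text (scale, Hadamard product, threshold to zero, Softmax, weighted sum). The function name and the expansion rule are assumptions: the text assigns a location's mask value to its "rows and columns", which is read here as the product of the query-side and key-side mask values.

```python
import math


def masked_self_attention(Q, K, V, token_mask, threshold):
    """Single-head attention with a per-token mask applied before Softmax.

    token_mask holds one mask value per sequence position, i.e. the 2D mask
    flattened and repeated for every time frame.
    """
    n, d = len(Q), len(Q[0])
    output = []
    for i in range(n):
        logits = []
        for j in range(n):
            score = sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d)
            m = token_mask[i] * token_mask[j]   # assumed expansion rule
            logits.append(score * m if m >= threshold else 0.0)
        # Row-wise Softmax, shifted by the row maximum for numerical stability.
        peak = max(logits)
        exps = [math.exp(a - peak) for a in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        output.append([sum(weights[j] * V[j][c] for j in range(n))
                       for c in range(len(V[0]))])
    return output


Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0], [3.0]]
focused = masked_self_attention(Q, K, V, token_mask=[1.0, 1.0], threshold=0.5)
background = masked_self_attention(Q, K, V, token_mask=[0.0, 0.0], threshold=0.5)
```

Note that a logit of zero still receives weight `exp(0)` under Softmax, so a fully masked row falls back to uniform averaging of the value vectors rather than contributing nothing. Implementations that need a hard cut-off commonly set masked logits to negative infinity instead; this sketch keeps the zeroing described in the text.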

[0041] In one embodiment, referring to Figure 6, the process of guiding the AIGC video diffusion model to perform high-frequency detail rendering in the overlapping regions to generate target video frames occurs at the decoding stage at the end of the denoising network. While the AIGC video diffusion model performs upsampling decoding of the latent feature map, the main processor extracts the regions whose values are greater than a preset threshold from the two-dimensional dynamic attention mask matrix. Because this mask matrix represents the overlapping regions of the audiovisual focus, the main processor runs a connected component extraction algorithm on it. Specifically, a two-pass scanning method based on breadth-first search is used to label all connected pixel sets with values greater than or equal to the threshold, and the area of each connected component is calculated. The main processor filters out isolated noisy connected components whose areas are smaller than a set pixel threshold and uses the remaining largest connected component, or all valid connected components, as the high-frequency enhancement region. Within the high-frequency enhancement region, the main processor applies a high-pass filter based on the Laplacian operator to the latent feature map $F$ of the current layer to extract its edge gradient features in the spatial dimension. The convolution kernel $K$ of the discrete two-dimensional Laplacian operator is expressed as:

[0042] This operator responds to regions of sharp gray-level change in the feature map, that is, high-frequency texture and edge information, by calculating the second-order spatial derivative. The main processor then adds the calculated edge gradient features as a residual to the latent feature map within the high-frequency enhancement region:

$$F_{\mathrm{enh}} = F + \lambda\, (K * F)$$

[0043] Here, $F_{\mathrm{enh}}$ denotes the features of the high-frequency enhancement region after residual enhancement, $*$ denotes two-dimensional convolution, and $\lambda$ is the residual scaling factor, which controls the injection intensity of the high-frequency features. This residual addition directly increases the contrast between adjacent pixels within the high-frequency enhancement region, so that the pixels generated by subsequent decoding show clearer boundaries and textures in the focal region. At the same time, for the non-high-frequency enhancement regions of the mask matrix, whose values are less than or equal to the preset threshold, the main processor applies a low-pass filter to smooth the latent feature map within those regions. The convolution is performed with a mean filter or a Gaussian smoothing kernel:

$$F_{\mathrm{smooth}}(x, y) = \frac{1}{|\Omega_{x,y}|} \sum_{(i,j) \in \Omega_{x,y}} F(i, j), \quad \forall (x, y) \notin R_{\mathrm{hf}}$$

[0044] Here, $\Omega_{x,y}$ is the filtering window neighborhood centered on $(x, y)$, $|\Omega_{x,y}|$ is the total number of pixels in the neighborhood, $F(i, j)$ is the original feature value at coordinates $(i, j)$, $F_{\mathrm{smooth}}$ denotes the smoothed features of the non-high-frequency enhancement region, $R_{\mathrm{hf}}$ is the high-frequency enhancement region, and $\forall$ is the universal quantifier. Low-pass smoothing suppresses noise and irrelevant high-frequency components in the background region, reducing the computational cost of rendering the background. After processing, the main processor concatenates and fuses the residual-enhanced features of the high-frequency enhancement region and the smoothed features of the non-high-frequency enhancement region in the spatial dimension according to the position index of the mask region, restoring a complete latent feature map tensor. This tensor is then input into the final decoder layer (such as a transposed convolutional layer) to output the target video frame with high-frequency detail rendered in the focal region.
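The region-dependent filtering of paragraphs [0041] to [0044] can be sketched on a single-channel feature map. This is an illustration under explicit assumptions: the kernel used here is the four-neighbor form with a positive center (`4 * center - neighbors`), chosen so that adding the residual sharpens edges; the kernel, the edge-replication border handling, the 3 by 3 mean window, and the function names are not taken from the text.

```python
def _at(grid, y, x):
    """Read grid[y][x] with edge replication outside the borders (assumed)."""
    y = min(max(y, 0), len(grid) - 1)
    x = min(max(x, 0), len(grid[0]) - 1)
    return grid[y][x]


def enhance_and_smooth(feat, region, lam):
    """Residual high-pass boost inside `region`, 3x3 mean filter outside it."""
    h, w = len(feat), len(feat[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if region[y][x]:
                # High-pass response of the assumed four-neighbor kernel.
                high = (4.0 * feat[y][x]
                        - _at(feat, y - 1, x) - _at(feat, y + 1, x)
                        - _at(feat, y, x - 1) - _at(feat, y, x + 1))
                out[y][x] = feat[y][x] + lam * high
            else:
                # Mean over the in-bounds part of the 3x3 neighborhood.
                window = [feat[j][i]
                          for j in range(max(0, y - 1), min(h, y + 2))
                          for i in range(max(0, x - 1), min(w, x + 2))]
                out[y][x] = sum(window) / len(window)
    return out


step_edge = [[0.0, 0.0, 1.0, 1.0] for _ in range(3)]
everywhere = [[True] * 4 for _ in range(3)]
sharpened = enhance_and_smooth(step_edge, everywhere, lam=1.0)
```

On the step edge, the values on either side of the transition are pushed apart, which is the contrast increase the text attributes to the residual addition; pixels outside the region are averaged with their neighbors instead.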

[0045] In one embodiment, the step of extracting the two-dimensional pixel coordinates of the multiple corneal reflective spots and of the pupil center must overcome the complex and variable lighting interference of the immersive environment. The main processor acquires the ambient light intensity data, denoted $I_{\mathrm{env}}$, collected by the ambient light sensor on the immersive head-mounted device that carries the eye-tracking sensor. The main processor stores a preset lighting threshold $I_{\mathrm{th}}$, which is set based on the saturation exposure of the infrared image sensor under conditions without infrared illumination. When the ambient light intensity exceeds the preset threshold, the ambient light contains strong infrared stray light or strong visible light, causing local overexposure of the image sensor. In this case, the main processor increases the emission power of the infrared beam through a pulse width modulation signal to improve the signal-to-noise ratio, while dynamically increasing the exposure gain of the image sensor in the infrared band. In the image processing stage, the main processor uses an adaptive threshold segmentation algorithm to separate the multiple corneal reflective spots from the background ambient light noise in the current frame. Specifically, a local adaptive thresholding method is used: for any pixel $(x, y)$ in the image, the mean gray value $\mu(x, y)$ and the standard deviation $\sigma(x, y)$ within a surrounding $w \times w$ window are calculated, and the segmentation threshold $T(x, y)$ of the pixel is:

$$T(x, y) = \mu(x, y) + k\, \sigma(x, y)$$

[0046] Here, $k$ is an empirical constant. Binarizing the image with this dynamic threshold filters out background noise blocks caused by localized strong reflections. For each separated candidate light spot, the main processor calculates the gray-level centroid as its precise two-dimensional pixel coordinates:

$$x_c = \frac{\sum_{(x,y) \in S} x\, I(x, y)}{\sum_{(x,y) \in S} I(x, y)}$$

[0047]

$$y_c = \frac{\sum_{(x,y) \in S} y\, I(x, y)}{\sum_{(x,y) \in S} I(x, y)}$$

[0048] Here, $S$ is the set of pixels in the candidate region of the light spot and $I(x, y)$ is the gray value of the pixel. When the ambient light intensity is less than or equal to the preset lighting threshold, the environment is a low-light environment. The main processor reduces the emission power of the infrared beam to limit the thermal effect on the eye caused by long-term high-power operation of the infrared LED, and reduces the exposure gain to avoid overexposure that would obscure the dark pupil features. Under these conditions, the main processor uses an edge detection operator (such as the Canny operator) to extract the set of contour boundary pixels of the pupil in the image. A direct least squares ellipse fitting algorithm is then used to fit an ellipse equation to the set of boundary pixels:

$$a x^2 + b x y + c y^2 + d x + e y + f = 0$$

[0049] The ellipse parameters are obtained by constructing the design matrix and solving the normal equations. The geometric center of the ellipse is then calculated, and its coordinates are used as the two-dimensional pixel coordinates of the pupil center. This extraction strategy, which switches dynamically according to ambient light intensity, ensures reliable extraction of pupil and light spot features under different lighting conditions.
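Two of the closed-form steps in paragraphs [0046] to [0049] are small enough to sketch directly: the gray-level centroid of a candidate spot and the geometric center of a fitted ellipse. The function names and the `(x, y, gray)` pixel representation are assumptions for the example, and the ellipse fitting itself is not reproduced; `conic_center` only assumes that fitted coefficients of a general conic $a x^2 + b x y + c y^2 + d x + e y + f = 0$ are already available.

```python
def gray_centroid(pixels):
    """Gray-level centroid of a spot given as (x, y, gray) triples."""
    pixels = list(pixels)
    total = sum(g for _, _, g in pixels)
    cx = sum(x * g for x, _, g in pixels) / total
    cy = sum(y * g for _, y, g in pixels) / total
    return cx, cy


def conic_center(a, b, c, d, e):
    """Center of the conic a*x^2 + b*x*y + c*y^2 + d*x + e*y + f = 0.

    The center is where the gradient of the quadratic form vanishes, so the
    constant term f does not appear.
    """
    det = b * b - 4.0 * a * c
    if det >= 0:
        raise ValueError("coefficients do not describe an ellipse")
    return (2.0 * c * d - b * e) / det, (2.0 * a * e - b * d) / det


# A two-pixel spot, and the unit circle centered on (1, 2):
# x^2 + y^2 - 2x - 4y + 4 = 0.
spot_center = gray_centroid([(0, 0, 1), (2, 0, 3)])
pupil_center = conic_center(1.0, 0.0, 1.0, -2.0, -4.0)
```

The centroid weights each pixel by its gray value, so the brighter pixel pulls the estimate toward itself and the result can fall between pixel positions.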

[0050] In one embodiment, the step of converting the peak coordinates into the azimuth and pitch angles of the sound source relative to the center of the ring microphone array must eliminate the interference of user head movement with sound source localization. The main processor acquires the three-dimensional head pose data of the user collected at high frequency by the inertial measurement unit, typically output as a quaternion $q = (q_w, q_x, q_y, q_z)$ with four components. The main processor first converts the quaternion into a rotation matrix that represents the head pose in the global world coordinate system. The calculation formula of the rotation matrix $R$ is:

$$R = \begin{bmatrix} 1 - 2(q_y^2 + q_z^2) & 2(q_x q_y - q_w q_z) & 2(q_x q_z + q_w q_y) \\ 2(q_x q_y + q_w q_z) & 1 - 2(q_x^2 + q_z^2) & 2(q_y q_z - q_w q_x) \\ 2(q_x q_z - q_w q_y) & 2(q_y q_z + q_w q_x) & 1 - 2(q_x^2 + q_y^2) \end{bmatrix}$$

[0051] This rotation matrix is the first rotation matrix of the user's three-dimensional head pose data in the global world coordinate system. Because the ring microphone array is typically mounted at a fixed location on the head-mounted device (such as above or to the sides of the forehead), its physical center does not coincide exactly with the origin of the inertial measurement unit (IMU). The main processor acquires the pre-calibrated fixed mounting offset vector $\mathbf{o} = (o_x, o_y, o_z)^{T}$ of the ring microphone array relative to the IMU origin on the user's head. Its components are the fixed offsets of the physical center of the ring microphone array from the IMU coordinate origin along the three axes of the IMU local coordinate system. Using the first rotation matrix, the main processor converts these fixed offsets into the absolute coordinates $\mathbf{P}_{\mathrm{mic}}$ of the microphone array in the global world coordinate system:

$$\mathbf{P}_{\mathrm{mic}} = \mathbf{P}_{\mathrm{head}} + R\, \mathbf{o}$$

[0052] Here, $\mathbf{P}_{\mathrm{head}}$ denotes the position coordinates of the head in the global world coordinate system. The peak coordinates obtained by searching the spatial spectrum of the generalized cross-correlation function correspond to the preliminary location coordinates $\mathbf{P}_{\mathrm{src}}$ of the sound source in the global world coordinate system. To obtain a relative position of the sound source that conforms to human auditory perception, the main processor transforms the preliminary location coordinates using the absolute coordinates of the microphone array, calculating the relative position vector $\mathbf{d}$ of the sound source with respect to the center of the microphone array:

$$\mathbf{d} = \mathbf{P}_{\mathrm{src}} - \mathbf{P}_{\mathrm{mic}}$$

[0053] The inverse (that is, the transpose) of the first rotation matrix is then used to transform the relative position vector into the head-relative local coordinate system, whose origin is the center of the ring microphone array and whose longitudinal axis points straight ahead from the midpoint of the line connecting the user's two ears:

$$\mathbf{d}_{\mathrm{local}} = R^{T} \mathbf{d}$$

[0054] Let the position vector of the sound source in the head-relative local coordinate system be $\mathbf{d}_{\mathrm{local}} = (x_l, y_l, z_l)$, where $x_l$ points straight ahead, $y_l$ points to the right, and $z_l$ points upward. The main processor calculates the azimuth angle $\theta$ and pitch angle $\varphi$ of the sound source from trigonometric relationships:

[0055]

$$\theta = \operatorname{atan2}(y_l, x_l), \quad \varphi = \arctan\left( \frac{z_l}{\sqrt{x_l^2 + y_l^2}} \right)$$

[0056] This coordinate transformation eliminates the interference of the user's head translation and rotation with the sound source positioning coordinates, ensuring that the obtained azimuth and pitch angles reflect the physical spatial position of the sound source relative to the user's binaural perception.
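The chain of transformations in paragraphs [0050] to [0055] can be sketched end to end. This is a minimal illustration: the function names are hypothetical, the quaternion is assumed to be a unit quaternion in `(w, x, y, z)` order, and `atan2` is used for both angles so that the result is defined in every quadrant, which is a choice of this example rather than something the text specifies.

```python
import math


def quat_to_matrix(w, x, y, z):
    """Rotation matrix of a unit quaternion (w, x, y, z)."""
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def source_angles(quat, head_pos, mic_offset, source_pos):
    """Azimuth and pitch of a world-frame source in the head-local frame."""
    R = quat_to_matrix(*quat)
    # Microphone array center in the world frame: head position + R * offset.
    mic = [head_pos[i] + sum(R[i][j] * mic_offset[j] for j in range(3))
           for i in range(3)]
    # Source position relative to the array center, still in the world frame.
    d = [source_pos[i] - mic[i] for i in range(3)]
    # World -> local: multiply by the transpose of R.
    x_l, y_l, z_l = [sum(R[j][i] * d[j] for j in range(3)) for i in range(3)]
    azimuth = math.atan2(y_l, x_l)
    pitch = math.atan2(z_l, math.hypot(x_l, y_l))
    return azimuth, pitch


# Head rotated 90 degrees about the vertical axis; the source then sits
# straight ahead of the rotated head.
s = math.sqrt(0.5)
azimuth, pitch = source_angles((s, 0.0, 0.0, s), (0, 0, 0), (0, 0, 0), (0, 2, 0))
```

Multiplying by the transpose undoes the head rotation, so a source that is off-axis in world coordinates can still come out at zero azimuth once the head has turned to face it.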

[0057] In one embodiment, the step of adding the aligned visual focus Gaussian heatmap and auditory focus Gaussian heatmap element-wise in the feature fusion layer can be adjusted dynamically when the visual and auditory focus conflict spatially. In multimodal perception, the visual focus of the user's gaze and the actual auditory focus of sound in the environment may be spatially separated. The main processor calculates the Euclidean distance $d$ between the Gaussian center point coordinates $(u_v, v_v)$ of the aligned visual focus Gaussian heatmap and the Gaussian center point coordinates $(u_a, v_a)$ of the auditory focus Gaussian heatmap:

$$d = \sqrt{(u_v - u_a)^2 + (v_v - v_a)^2}$$

[0058] The main processor stores a preset spatial conflict threshold $d_{\mathrm{th}}$. When the Euclidean distance is less than the preset spatial conflict threshold, the visual and auditory focal points overlap or are very close in space. The main processor then directly performs element-wise addition of the two Gaussian heatmaps to maximize the mask value of the overlapping region. When the Euclidean distance is greater than or equal to the preset spatial conflict threshold, the visual and auditory focal points are spatially separated. Direct addition in this case would produce two separate high-response regions in the mask matrix and disperse the model's computing power. The main processor therefore further obtains the sound source category label that a speech recognition model parses from the multi-channel audio signals collected by the ring microphone array. This speech recognition model is built on a convolutional recurrent neural network. Its front end extracts Mel-frequency cepstral coefficient features, and its back end outputs a probability distribution over a preset set of sound source categories, taking the category with the highest probability as the sound source category label. At the same time, the main processor calculates the dwell duration of the Gaussian center point coordinates of the visual focus Gaussian heatmap. Specifically, it maintains a sliding window queue of length $W$ that records the visual center coordinates of the most recent $W$ frames. When the variance of these coordinates is less than a set threshold, a counter is incremented, and the value of this counter is the continuous dwell duration.
The main processor determines whether the sound source category label belongs to a preset strong-interaction category (such as human voices, alarm sounds, and prompts, which require immediate visual feedback from the user) and whether the dwell duration exceeds a preset duration threshold. When the sound source category label belongs to the preset strong-interaction category and the dwell duration exceeds the preset duration threshold, the user is focusing on a certain area but a sudden sound requiring attention has appeared in the environment. In this case, the auditory stimulus should dominate. The main processor dynamically adjusts the weight coefficient of the auditory focus Gaussian heatmap to be greater than that of the visual focus Gaussian heatmap. The weight coefficients are adjusted using a distance-based sigmoid mapping function:

[0059]

[0060] Here, $k$ is the parameter that adjusts the steepness of the curve and $\sigma(\cdot)$ is the Sigmoid activation function. Otherwise (that is, the sound source does not belong to the strong-interaction category, or the visual dwell time is short, indicating that the user is in a stable gaze state and there is irrelevant background noise in the environment), the main processor keeps the weight coefficient of the visual focus Gaussian heatmap greater than that of the auditory focus Gaussian heatmap. Finally, the main processor performs a weighted fusion of the visual heatmap $H_v$ and the auditory heatmap $H_a$ according to the determined weight coefficients $w_v$ and $w_a$:

$$H_f = w_v H_v + w_a H_a$$

[0061] Here, $H_f$ is the final attention heatmap after fusion. When the audiovisual focal points are separated, this mechanism dynamically adjusts the fusion weights based on the sound source category label and the dwell duration, preventing non-critical background noise from drawing rendering computing power away from the visual gaze area.
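The decision logic of paragraphs [0058] to [0061] can be sketched as a small weight-selection function. The sigmoid mapping itself is not reproduced in the text, so the form used here, `sigmoid(steepness * (distance - conflict_threshold))` for the dominant modality with the remainder going to the other, is a hypothetical choice made only for illustration, as are the function names and default values.

```python
import math


def fusion_weights(distance, conflict_threshold, is_strong_category,
                   dwell_frames, dwell_threshold, steepness=0.1):
    """Return (visual_weight, auditory_weight) for the heatmap fusion."""
    if distance < conflict_threshold:
        # Focal points overlap: plain element-wise addition.
        return 1.0, 1.0
    # Hypothetical distance-based sigmoid; at least 0.5 in this branch.
    dominant = 1.0 / (1.0 + math.exp(-steepness * (distance - conflict_threshold)))
    if is_strong_category and dwell_frames > dwell_threshold:
        return 1.0 - dominant, dominant   # auditory focus dominates
    return dominant, 1.0 - dominant       # visual focus dominates


def fuse(visual, auditory, w_v, w_a):
    """Weighted element-wise fusion of two heatmaps of equal size."""
    return [[w_v * a + w_a * b for a, b in zip(rv, ra)]
            for rv, ra in zip(visual, auditory)]


# Separated focal points, an alarm-like sound, and a long visual dwell.
w_v, w_a = fusion_weights(distance=50.0, conflict_threshold=20.0,
                          is_strong_category=True, dwell_frames=100,
                          dwell_threshold=30)
fused = fuse([[1.0]], [[2.0]], 0.25, 0.75)
```

With this assumed mapping the two weights are equal when the distance sits exactly at the threshold and diverge as the focal points move further apart, so the dominance grows with the degree of spatial conflict.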

[0062] In one embodiment, before the Hadamard product is performed on the expanded two-dimensional dynamic attention mask matrix and the self-attention weight matrix, a mask boundary smoothing attenuation mechanism is introduced to eliminate pixel stitching seams at the boundary between the focus region and the background region in the final target video frame. Because the two-dimensional dynamic attention mask matrix undergoes threshold truncation and downsampling during generation, its values are typically close to one in the inner region and zero in the outer region, with a step-like hard truncation at the boundary. After the Hadamard product, this hard boundary causes abrupt spatial changes in the self-attention weights, which appear as grid-like stitching seams during decoding and rendering. The main processor first extracts from the two-dimensional dynamic attention mask matrix the set of pixel coordinates on the boundary of the inner region whose value is one. Specifically, a morphological erosion algorithm erodes the binary mask matrix with an all-ones structuring element. The eroded matrix is then XORed with the original mask matrix to extract the set of boundary pixels at the transition from one to zero. Here, $M_b$ denotes the original binary mask matrix and $M_e$ the binary mask matrix after the erosion operation. For each boundary pixel coordinate in the boundary pixel coordinate set, the main processor calculates the shortest spatial distance from that coordinate to the outer region whose value is zero. To improve computational efficiency, the main processor uses a distance transform algorithm that calculates the Euclidean distance from each pixel of the mask matrix to the nearest zero-value pixel in two linear scans.
After obtaining the shortest spatial distance, the main processor uses it as the independent variable of a preset Gaussian attenuation function to calculate the attenuation weight value of the corresponding boundary pixel coordinate:

[0063] The width parameter $\sigma$ of the Gaussian attenuation function is negatively correlated with the time step $t$ of the current denoising step in the AIGC video diffusion model. During the denoising process of the diffusion model, the time step $t$ decreases gradually from the maximum time step. The main processor dynamically calculates the width parameter at the current time step according to the following formula:

[0064] Here, $\sigma_{\max}$ is the preset maximum width constant. The physical meaning of this negative correlation is as follows. In the initial stage of denoising (larger time step $t$), the latent feature map mainly contains global large-scale structural information, so a narrower attenuation boundary (smaller $\sigma$) is required to define the focus area accurately. In the later stage of denoising (smaller time step $t$), the latent feature map enters the high-frequency detail generation stage, so a wider attenuation boundary (larger $\sigma$) is required to achieve a smooth transition between the focal area and the background area and to prevent abrupt texture breaks. The main processor uses the calculated attenuation weight values to perform weighted replacement of the mask values at the boundary pixel coordinates, replacing each original value with its attenuation-weighted value. After this processing, the step-like hard boundary of the two-dimensional dynamic attention mask matrix, which jumps from one to zero, becomes a smoothly decaying transition boundary that follows a Gaussian profile. After completing the boundary smoothing, the main processor uses the updated two-dimensional dynamic attention mask matrix for the Hadamard product, eliminating abrupt changes in attention weights at the mask boundaries and ensuring the spatial continuity and realism of the final video frames.
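The boundary extraction part of paragraph [0062] can be sketched as follows. This example covers only the steps whose form is unambiguous in the text: erosion with an all-ones structuring element, the XOR that yields the boundary set, and the distance to the nearest zero-valued pixel. The 3 by 3 element size, the treatment of out-of-bounds pixels as zero, and the brute-force distance search (used in place of the two-scan distance transform for brevity) are assumptions, and the attenuation function and width schedule are deliberately left out.

```python
import math


def erode(mask):
    """Binary erosion with a 3x3 all-ones element; outside counts as zero."""
    h, w = len(mask), len(mask[0])

    def is_one(y, x):
        return 0 <= y < h and 0 <= x < w and mask[y][x] == 1

    return [[1 if all(is_one(y + dy, x + dx)
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)) else 0
             for x in range(w)] for y in range(h)]


def boundary_pixels(mask):
    """Pixels that are one in the mask but zero after erosion (XOR)."""
    eroded = erode(mask)
    return [(y, x) for y in range(len(mask)) for x in range(len(mask[0]))
            if mask[y][x] ^ eroded[y][x]]


def distance_to_zero(mask, y, x):
    """Euclidean distance from (y, x) to the nearest zero-valued pixel."""
    return min((math.hypot(y - j, x - i)
                for j in range(len(mask)) for i in range(len(mask[0]))
                if mask[j][i] == 0), default=float("inf"))


# A 3x3 block of ones inside a 5x5 mask.
mask = [[1 if 1 <= y <= 3 and 1 <= x <= 3 else 0 for x in range(5)]
        for y in range(5)]
boundary = boundary_pixels(mask)
distances = [distance_to_zero(mask, y, x) for y, x in boundary]
```

For production-size masks a true linear-time Euclidean distance transform would replace the brute-force search; the distances it returns for the boundary set are what the attenuation function takes as input.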

Claims

1. An AIGC video generation and interaction method based on immersive computing, characterized in that it comprises the following steps: obtaining relative position data between the pupil center and the corneal reflective spots through an eye-tracking sensor, and calculating the coordinates of the user's gaze point in the three-dimensional environment by combining three-dimensional spatial data collected by a spatial depth camera; collecting multi-channel audio signals from the environment through a ring microphone array, and performing beamforming processing to locate the azimuth and pitch angles of the sound source in three-dimensional space; inputting the gaze point coordinates, azimuth angle, and pitch angle into a spatiotemporal alignment network to generate a two-dimensional dynamic attention mask matrix corresponding to the latent-space resolution of the AIGC video diffusion model; and, in the self-attention computation stage of AIGC video generation, using the two-dimensional dynamic attention mask matrix to perform dot product filtering on the self-attention weight matrix so as to suppress the self-attention weight values of non-focal regions and regions without sound sources and amplify the self-attention weight values of regions where the focal point and the sound source overlap, thereby guiding the AIGC video diffusion model to perform high-frequency detail rendering in the overlapping regions to generate target video frames.

2. The AIGC video generation and interaction method based on immersive computing according to claim 1, characterized in that the step of obtaining the relative position data between the pupil center and the corneal reflective spot through the eye-tracking sensor and calculating the coordinates of the user's gaze point in the three-dimensional environment by combining the three-dimensional spatial data collected by the spatial depth camera includes: emitting an infrared beam using the eye-tracking sensor to form multiple corneal reflective spots on the surface of the user's cornea, and extracting the two-dimensional pixel coordinates of the multiple corneal reflective spots and the two-dimensional pixel coordinates of the pupil center; acquiring the environmental three-dimensional point cloud data output by the spatial depth camera, and projecting the two-dimensional pixel coordinates of the multiple corneal reflective spots and the two-dimensional pixel coordinates of the pupil center onto the corresponding three-dimensional spatial plane in the environmental three-dimensional point cloud data; constructing a three-dimensional fitted spherical model of the eyeball based on the coordinates of the multiple corneal reflective spots in three-dimensional space; and, using the offset vector of the pupil center in three-dimensional space relative to the center of the three-dimensional fitted spherical model of the eyeball, calculating the intersection point of the offset vector with the virtual plane in the environmental three-dimensional point cloud data, and using the coordinates of the intersection point as the gaze point coordinates.
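
The geometric part of claim 2 (sphere fitting followed by a ray-plane intersection) can be illustrated with the sketch below. It is not the claimed implementation: the linear least-squares sphere fit, the function names `fit_sphere` and `gaze_point`, and the representation of the virtual plane as a point plus a normal vector are all assumptions chosen for illustration. The fit needs at least four non-coplanar points.

```python
import numpy as np

def fit_sphere(points):
    """Least-squares sphere through 3D points; returns (center, radius)."""
    p = np.asarray(points, dtype=float)
    # |p|^2 = 2 c.p + (r^2 - |c|^2) is linear in the unknowns (c, r^2 - |c|^2).
    a = np.hstack([2.0 * p, np.ones((len(p), 1))])
    b = (p ** 2).sum(axis=1)
    sol, *_ = np.linalg.lstsq(a, b, rcond=None)
    center = sol[:3]
    radius = float(np.sqrt(sol[3] + center @ center))
    return center, radius

def gaze_point(center, pupil, plane_point, plane_normal):
    """Intersect the ray center -> pupil with a plane; None if there is no hit."""
    center = np.asarray(center, dtype=float)
    direction = np.asarray(pupil, dtype=float) - center  # offset vector
    normal = np.asarray(plane_normal, dtype=float)
    denom = direction @ normal
    if abs(denom) < 1e-9:  # ray parallel to the plane
        return None
    s = ((np.asarray(plane_point, dtype=float) - center) @ normal) / denom
    if s < 0:              # plane lies behind the eye
        return None
    return center + s * direction
```

In use, the 3D glint coordinates would be passed to `fit_sphere`, and the returned center together with the 3D pupil center would be passed to `gaze_point` along with a plane taken from the point cloud.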

3. The AIGC video generation and interaction method based on immersive computing according to claim 1, characterized in that the step of collecting multi-channel ambient audio signals through the ring microphone array and performing beamforming processing to locate the azimuth angle and pitch angle of the sound source in three-dimensional space includes: performing a fast Fourier transform on the time-domain audio signal acquired by each microphone node in the ring microphone array to extract frequency-domain signal features, and calculating the phase difference between any two microphone nodes; inputting the phase difference values into a preset generalized cross-correlation function to generate a spatial spectrum estimation result, and searching for the peak coordinates in the spatial spectrum estimation result; converting the peak coordinates into the azimuth angle and pitch angle of the sound source relative to the center of the ring microphone array based on the physical geometry of the ring microphone array; and performing Kalman filtering on the azimuth angles and pitch angles of multiple consecutive time frames to predict and remove abnormal coordinate jumps caused by environmental reverberation, and outputting the smoothed target sound source azimuth angle and target sound source pitch angle.
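
The frequency-domain cross-correlation step of claim 3 can be illustrated for a single microphone pair. The sketch below uses the PHAT weighting, a common variant of generalized cross-correlation; the claim does not say which weighting is used, so that choice, the function name `gcc_phat_delay`, and the restriction to one pair are assumptions. Building the full spatial spectrum over the ring array and the Kalman smoothing are not shown.

```python
import numpy as np

def gcc_phat_delay(sig, ref, fs):
    """Estimate how many seconds `sig` lags `ref` using GCC-PHAT."""
    n = len(sig) + len(ref)              # zero-pad to avoid circular wrap-around
    sig_f = np.fft.rfft(sig, n=n)
    ref_f = np.fft.rfft(ref, n=n)
    cross = sig_f * np.conj(ref_f)       # cross-spectrum carries the phase difference
    cross /= np.abs(cross) + 1e-12       # PHAT: keep phase, discard magnitude
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    # Reorder so that index 0 corresponds to a lag of -max_shift samples.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(cc)) - max_shift  # peak search
    return shift / fs
```

For a pair of microphones a known distance apart, the estimated delay constrains the direction of arrival; combining the delays of several pairs on the ring yields the azimuth and pitch angles.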

4. The AIGC video generation and interaction method based on immersive computing according to claim 1, characterized in that the step of inputting the gaze point coordinates, the azimuth angle, and the pitch angle into a spatiotemporal alignment network to generate a two-dimensional dynamic attention mask matrix corresponding to the latent space resolution of the AIGC video diffusion model includes: constructing the spatiotemporal alignment network containing a visual feature extraction branch and an auditory feature extraction branch; converting the gaze point coordinates into a visual focus Gaussian heatmap in the visual feature extraction branch; converting the target sound source azimuth angle and the target sound source pitch angle into an auditory focus Gaussian heatmap in the auditory feature extraction branch; aligning the frame rate of the visual focus Gaussian heatmap and the frame rate of the auditory focus Gaussian heatmap to a unified time reference using a time-dimension interpolation algorithm; inputting the aligned visual focus Gaussian heatmap and auditory focus Gaussian heatmap into a feature fusion layer and adding them element-wise; and mapping the addition result to the two-dimensional grid size of the latent space of the AIGC video diffusion model to generate the two-dimensional dynamic attention mask matrix with values ranging from 0 to 1.
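
The heatmap construction and fusion in claim 4 can be sketched as below. This is an illustration under stated assumptions: the nearest-neighbour resampling to the latent grid and the min-max normalization to the range 0 to 1 are stand-ins, since the claim does not specify how the mapping is done, and the function names are hypothetical. The projection of the sound source angles to a pixel position and the temporal interpolation are not shown.

```python
import numpy as np

def gaussian_heatmap(center_xy, size_hw, sigma):
    """2D Gaussian bump centred at pixel (x, y) on an (h, w) grid."""
    h, w = size_hw
    ys, xs = np.mgrid[0:h, 0:w]
    cx, cy = center_xy
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def fuse_to_mask(visual, auditory, latent_hw):
    """Add two heatmaps element-wise and map the sum to the latent grid."""
    fused = visual + auditory
    h, w = fused.shape
    lh, lw = latent_hw
    # Assumed nearest-neighbour resampling to the latent resolution.
    rows = (np.arange(lh) * h) // lh
    cols = (np.arange(lw) * w) // lw
    resized = fused[np.ix_(rows, cols)]
    span = resized.max() - resized.min()
    if span == 0:
        return np.zeros_like(resized)
    # Assumed min-max normalization so that the mask lies in [0, 1].
    return (resized - resized.min()) / span
```

For example, a 64 x 64 visual heatmap and a 64 x 64 auditory heatmap can be fused and reduced to a 16 x 16 mask with `fuse_to_mask(visual, auditory, (16, 16))`.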

5. The AIGC video generation and interaction method based on immersive computing according to claim 1, characterized in that the step of performing dot product filtering on the self-attention weight matrix using the two-dimensional dynamic attention mask matrix in the self-attention mechanism calculation stage of AIGC video generation includes: obtaining the self-attention weight matrix calculated from the query vector and the key vector in the attention calculation layer of the current denoising step of the AIGC video diffusion model; copying and extending the two-dimensional dynamic attention mask matrix in the channel dimension so that its dimension is consistent with the spatial dimension of the self-attention weight matrix; performing a Hadamard product operation on the extended two-dimensional dynamic attention mask matrix and the self-attention weight matrix; setting to zero the regions in the self-attention weight matrix whose corresponding mask values are less than a preset threshold, while retaining the original values of the regions whose corresponding mask values are greater than or equal to the preset threshold, thereby generating the mask-modulated target self-attention weight matrix; and performing a weighted sum of the value vectors using the target self-attention weight matrix.
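
The mask modulation in claim 5 can be illustrated with a single-head sketch. Several details are assumptions: the two-dimensional mask is flattened and applied along the key axis (broadcast over all query rows), the weights come from a standard scaled dot-product softmax, the modulated weights are not renormalized, and the names `softmax` and `masked_self_attention` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(q, k, v, mask_2d, threshold=0.1):
    """Scaled dot-product attention with a spatial mask on the key positions."""
    n, d = q.shape
    weights = softmax(q @ k.T / np.sqrt(d))   # (n, n) self-attention weights
    key_mask = np.asarray(mask_2d, dtype=float).reshape(1, n)
    modulated = weights * key_mask            # Hadamard product via broadcasting
    # Zero every column whose mask value falls below the threshold.
    modulated = np.where(key_mask < threshold, 0.0, modulated)
    return modulated @ v, modulated           # weighted sum of the value vectors
```

Here `n` must equal the number of cells in the latent grid, so a 2 x 2 mask pairs with four tokens.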

6. The AIGC video generation and interaction method based on immersive computing according to claim 5, characterized in that the step of guiding the AIGC video diffusion model to perform high-frequency detail rendering in the overlapping region to generate the target video frame includes: while the AIGC video diffusion model upsamples and decodes the latent feature map, extracting the connected regions in the two-dimensional dynamic attention mask matrix whose values are greater than the preset threshold as high-frequency enhancement regions; within the high-frequency enhancement regions, applying a high-pass filter based on the Laplacian operator to the latent feature map to extract the edge gradient features of the latent feature map in the spatial dimension; adding the edge gradient features to the latent feature map within the high-frequency enhancement regions by residual addition, thereby enhancing the contrast between adjacent pixels within the high-frequency enhancement regions; applying low-pass filtering to the non-high-frequency enhancement regions, whose values are less than or equal to the preset threshold, for smoothing; and splicing and fusing the processed high-frequency enhancement regions and non-high-frequency enhancement regions in the spatial dimension and inputting the result to the final decoder layer to output the target video frame.
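
The region-dependent filtering of claim 6 can be sketched on a single-channel feature map. The 4-neighbour Laplacian kernel, the 3 x 3 box filter used as the low-pass filter, the `gain` parameter, and the function names are assumptions for illustration. Connected-region extraction is simplified to a per-pixel threshold test.

```python
import numpy as np

# Positive-centre Laplacian: responds to local intensity changes (high-pass).
LAPLACIAN = np.array([[0.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 0.0]])
BOX = np.full((3, 3), 1.0 / 9.0)  # simple low-pass (mean) kernel

def filter3x3(img, kernel):
    """Apply a 3x3 kernel with edge replication at the borders."""
    img = np.asarray(img, dtype=float)
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def enhance_focus(latent, mask, threshold=0.5, gain=1.0):
    """Sharpen where mask > threshold, smooth everywhere else."""
    latent = np.asarray(latent, dtype=float)
    focus = np.asarray(mask) > threshold
    sharpened = latent + gain * filter3x3(latent, LAPLACIAN)  # residual addition
    smoothed = filter3x3(latent, BOX)
    return np.where(focus, sharpened, smoothed)               # splice the regions
```

On a step edge, the sharpened region shows a larger jump between neighbouring pixels than the original, while the smoothed region shows a smaller one.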

7. The AIGC video generation and interaction method based on immersive computing according to claim 2, characterized in that the step of extracting the two-dimensional pixel coordinates of the multiple corneal reflective spots and the two-dimensional pixel coordinates of the pupil center includes: acquiring ambient light intensity data collected by an ambient light sensor of the immersive head-mounted device in which the eye-tracking sensor is located; when the ambient light intensity data is greater than a preset light threshold, increasing the emission power of the infrared beam and dynamically increasing the exposure gain value of the image sensor in the infrared band; using an adaptive threshold segmentation algorithm to separate the multiple corneal reflective spots from the background ambient light noise in the current frame image, and calculating the gray-level centroid of each separated spot candidate region as its accurate two-dimensional pixel coordinates; when the ambient light intensity data is less than or equal to the preset light threshold, reducing the emission power of the infrared beam to reduce the thermal effect on the eye; and using an edge detection operator to extract the contour boundary of the pupil, fitting an ellipse to the contour, and taking the center of the ellipse as the two-dimensional pixel coordinates of the pupil center.
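
The gray-level centroid computation in claim 7 can be illustrated for a single reflective spot. The threshold rule (image mean plus `k` standard deviations) is an assumed stand-in for the adaptive threshold segmentation algorithm, which the claim does not specify, and `glint_centroid` is a hypothetical name. Handling several spots would additionally require labelling connected regions, which is not shown.

```python
import numpy as np

def glint_centroid(image, k=2.0):
    """Intensity-weighted centroid (x, y) of pixels above mean + k * std."""
    img = np.asarray(image, dtype=float)
    thresh = img.mean() + k * img.std()  # assumed global statistic threshold
    ys, xs = np.nonzero(img > thresh)
    if len(xs) == 0:
        return None                      # no candidate spot found
    w = img[ys, xs]                      # gray levels act as weights
    return float((xs * w).sum() / w.sum()), float((ys * w).sum() / w.sum())
```

Weighting by gray level gives sub-pixel coordinates when the spot is brighter on one side than the other.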

8. The AIGC video generation and interaction method based on immersive computing according to claim 3, characterized in that the step of converting the peak coordinates into the azimuth angle and pitch angle of the sound source relative to the center of the ring microphone array includes: acquiring three-dimensional head posture data collected by an inertial measurement unit, wherein the three-dimensional head posture data includes a head yaw angle, a head pitch angle, and a head roll angle; calculating a first rotation matrix of the center of the ring microphone array in the global world coordinate system based on the user's three-dimensional head posture data; obtaining the fixed installation offset of the ring microphone array relative to the user's head, and combining it with the first rotation matrix to convert the fixed installation offset into the absolute coordinates of the microphone array in the global world coordinate system; obtaining the initial positioning coordinates of the sound source in the global world coordinate system by transformation using the absolute coordinates of the microphone array; transforming the initial positioning coordinates into a head-relative local coordinate system whose origin is the center of the ring microphone array and whose longitudinal axis points straight ahead from the midpoint of the line connecting the user's two ears; and calculating the azimuth angle and the pitch angle in the head-relative local coordinate system.
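
The coordinate transformation of claim 8 can be sketched as below. The axis convention (head-local x forward, y left, z up), the Z-Y-X rotation order for yaw, pitch, and roll, the availability of a head position `head_world`, and all function names are assumptions; the claim does not fix any of them. Angles are in radians on input and degrees on output.

```python
import numpy as np

def rotation_from_ypr(yaw, pitch, roll):
    """Head-to-world rotation, assumed order Rz(yaw) @ Ry(pitch) @ Rx(roll)."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
    ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    return rz @ ry @ rx

def source_angles(source_world, head_world, mount_offset, yaw, pitch, roll):
    """Azimuth and pitch (degrees) of a world-frame source seen from the array."""
    r = rotation_from_ypr(yaw, pitch, roll)
    # Absolute position of the array centre in the world frame.
    array_world = np.asarray(head_world, float) + r @ np.asarray(mount_offset, float)
    # World -> head-relative local frame with the array centre as origin.
    local = r.T @ (np.asarray(source_world, float) - array_world)
    azimuth = np.arctan2(local[1], local[0])
    elevation = np.arctan2(local[2], np.hypot(local[0], local[1]))
    return float(np.degrees(azimuth)), float(np.degrees(elevation))
```

Because the source is expressed in the head-relative frame, a source straight ahead of the user yields an azimuth of zero regardless of where the head is pointing in the world frame.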

9. The AIGC video generation and interaction method based on immersive computing according to claim 4, characterized in that the step of inputting the aligned visual focus Gaussian heatmap and auditory focus Gaussian heatmap into the feature fusion layer and adding them element-wise includes: calculating the Euclidean distance between the coordinates of the Gaussian center point in the aligned visual focus Gaussian heatmap and the coordinates of the Gaussian center point in the auditory focus Gaussian heatmap; when the Euclidean distance is less than a preset spatial conflict threshold, performing the element-wise addition operation; when the Euclidean distance is greater than or equal to the preset spatial conflict threshold, obtaining the sound source category label parsed by a speech recognition model from the multi-channel audio signals and the continuous dwell time of the coordinates of the Gaussian center point in the visual focus Gaussian heatmap; when the sound source category label belongs to a preset strong interaction category and the continuous dwell time is greater than a preset duration threshold, adjusting the weight coefficient of the auditory focus Gaussian heatmap to be greater than the weight coefficient of the visual focus Gaussian heatmap; otherwise, keeping the weight coefficient of the visual focus Gaussian heatmap greater than the weight coefficient of the auditory focus Gaussian heatmap; and performing weighted fusion according to the weight coefficients.
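
The decision logic of claim 9 can be sketched as a small function. Every numeric value and label below (`conflict_thresh`, `strong_labels`, `dwell_thresh_s`, `w_major`) is a hypothetical placeholder, since the claim refers only to preset thresholds and a preset strong interaction category without giving values.

```python
import numpy as np

def fuse_heatmaps(visual, auditory, v_center, a_center, label, dwell_s,
                  conflict_thresh=8.0, strong_labels=("speech", "alarm"),
                  dwell_thresh_s=0.5, w_major=0.7):
    """Fuse visual and auditory heatmaps, resolving spatial conflicts."""
    distance = float(np.hypot(v_center[0] - a_center[0],
                              v_center[1] - a_center[1]))
    if distance < conflict_thresh:
        return visual + auditory           # no conflict: element-wise addition
    if label in strong_labels and dwell_s > dwell_thresh_s:
        w_a, w_v = w_major, 1.0 - w_major  # auditory focus dominates
    else:
        w_v, w_a = w_major, 1.0 - w_major  # visual focus dominates
    return w_v * visual + w_a * auditory   # weighted fusion
```

Keeping the conflict test separate from the weighting makes it straightforward to swap in a different priority rule without touching the addition path.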

10. The AIGC video generation and interaction method based on immersive computing according to claim 5, characterized in that the step of performing a Hadamard product operation on the extended two-dimensional dynamic attention mask matrix and the self-attention weight matrix includes: before performing the Hadamard product operation, extracting the set of boundary pixel coordinates of the internal region whose value is 1 in the two-dimensional dynamic attention mask matrix; for each boundary pixel coordinate in the set of boundary pixel coordinates, calculating the shortest spatial distance from that boundary pixel coordinate to the external region whose value is 0; using the shortest spatial distance as the independent variable, calculating the attenuation weight value of the corresponding boundary pixel coordinate with a preset Gaussian attenuation function, wherein the width parameter of the Gaussian attenuation function is negatively correlated with the time step of the current denoising step of the AIGC video diffusion model; performing weighted replacement on the mask values corresponding to the set of boundary pixel coordinates using the attenuation weight values, so that the stepped boundary transitioning from 1 to 0 in the two-dimensional dynamic attention mask matrix is transformed into a smooth attenuation transition boundary; and then performing the Hadamard product operation.