Method and system for real-time VR rendering with low-bitrate cloud rendering streaming
By combining scene-customized models and video encoding on VR devices, and utilizing lightweight super-resolution models and multi-scale discriminators, the computation and bandwidth issues of 4K real-time rendering on VR devices are solved, achieving high-quality, low-latency image display suitable for a variety of demanding application scenarios.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- 北京渲光科技有限公司
- Filing Date
- 2025-08-22
- Publication Date
- 2026-06-26
AI Technical Summary
Existing technologies face challenges in achieving 4K real-time rendering on VR devices, including insufficient computing power and high bandwidth requirements. Existing super-resolution technologies also fall short in terms of image generation quality and efficiency, impacting user experience.
By combining scene-customized models with video coding, using lightweight super-resolution models for cloud training and local inference, and employing cross-scale Transformer modules and multi-scale discriminator design, combined with diffusion models and variational autoencoders, low bitrate cloud rendering streaming is achieved.
The invention achieves a high-quality, low-latency visual experience with limited resources, improves image quality metrics, and reduces computational overhead, making it suitable for applications such as large-scale cloud simulation and VR tourism.
Figure CN121074333B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the fields of artificial intelligence, computer vision, and computer graphics, and specifically relates to a method and system for real-time VR rendering with low-bitrate cloud rendering streaming.
Background Technology
[0002] Achieving 4K real-time rendering on VR devices is revolutionary for large-scale real-time simulation applications. For example, in the field of smart grid fault detection, 4K real-time rendering can help engineers observe the grid structure and its operating status more clearly, quickly locate and resolve potential problems, and ensure the stable operation of the power system. In VR tourism, tourists can wear VR devices to virtually tour famous landmarks around the world, enjoying an unprecedented immersive experience.
[0003] However, existing technologies face numerous challenges in achieving real-time 4K rendering on VR devices. First, real-time rendering of 4K images directly on VR devices places extremely high demands on computing power, a requirement that most VR devices struggle to meet due to hardware limitations. Cloud rendering has become a viable solution, but transmitting 4K image quality from the cloud to the VR device requires enormous bandwidth, increasing costs and potentially causing latency issues that negatively impact user experience. Another approach is to first transmit lower-resolution images (e.g., 1080p) and then upscale them to 4K on the VR device. Currently, this super-resolution technology primarily relies on Generative Adversarial Networks (GANs). While GANs have made significant progress in image generation, they still fall short in processing speed, efficiency, and generation quality. Particularly in terms of detail preservation, edge smoothness, and overall consistency, existing methods often fail to achieve ideal results, affecting the final visual experience.
[0004] To address the aforementioned limitations, this invention proposes a method and system for real-time VR rendering using low-bitrate cloud streaming.
Summary of the Invention
[0005] To address the problems existing in the prior art, this invention provides a method and system for real-time VR rendering using low-bitrate cloud rendering streaming. This avoids the high bandwidth pressure associated with directly transmitting 4K video and overcomes the hardware limitations of low-spec devices unable to perform complex rendering tasks. By combining scene-customized models with video encoding, a high-quality, low-latency visual experience is achieved with limited resources. This is suitable for various application scenarios with high requirements for image quality and real-time performance, such as large-scale cloud simulation, VR tourism, and smart grid monitoring.
[0006] To achieve the above objectives, the present invention provides the following solution:
[0007] A method for real-time VR rendering with low bitrate cloud rendering streaming, the method comprising:
[0008] Obtaining a rendered frame image;
[0009] Constructing a lightweight super-resolution model;
[0010] Inputting the rendered frame image into the lightweight super-resolution model to obtain a reconstructed 4K-resolution image;
[0011] Rendering and displaying the reconstructed 4K-resolution image in real time on a VR device.
[0012] Preferably, the lightweight super-resolution model includes:
[0013] A U-Net generator G composed of $N_{CSTB} = 4$ CSTB modules; a discriminator network $D_{gan}$; a diffusion model; and a pre-trained variational autoencoder (VAE), wherein the VAE is divided into an encoder $E_{vae}$ and a decoder $D_{vae}$;
[0014] The generator G replaces the standard convolutional layers in U-Net with cross-scale Transformer modules (CSTB). Specifically, it replaces all 3×3 convolutional layers in the U-Net encoder and decoder, retains downsampling and upsampling operations, and adds CSTB at skip connections to enhance cross-scale information transfer. The structure is as follows:
[0015] $z_i = \mathrm{CSTB}(z_{i-1}) = \mathrm{DeformConv}(\mathrm{LayerNorm}(z_{i-1}) + \mathrm{MultiHeadAttn}(z_{i-1}, z_{i-1}, z_{i-1}))$
[0016] Where MultiHeadAttn denotes multi-head attention, DeformConv denotes deformable convolution, LayerNorm denotes a normalization layer, and $z_{i-1}$ is the output feature map of the $(i-1)$-th CSTB;
[0017] The discriminator network $D_{gan}$ adopts a multi-scale structure: a multi-scale discriminator $\{D_1, D_2, D_3\}$ in which each discriminator uses the same PatchGAN backbone, trains its parameters independently, and outputs a probability at its last layer;
[0018] The diffusion model includes a forward process and a reverse process;
[0019] The forward process creates a noisy version $z_t$ of $z_0$ at a random time step $t$, according to the formula:
[0020]
[0021] Where $x_0$ is the latent representation of the original image, i.e., without added noise; $x_t$ is the noisy latent representation at time step $t$, i.e., the state of the original image after $t$ steps of noise have been added; $q(x_t \mid x_{t-1})$ is the conditional probability distribution of generating $x_t$ from $x_{t-1}$, which follows a Gaussian distribution; $\alpha_t = 1 - \beta_t$ is the single-step noise retention coefficient, representing the proportion of original information retained in step $t$; $\bar{\alpha}_t$ is the cumulative noise retention factor, i.e., $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$; $q(x_t \mid x_0)$ is the conditional distribution of generating $x_t$ directly from $x_0$; and $\beta_t$ is the noise schedule, which controls the noise level at time step $t$. As $t$ increases (i.e., $t \to T$), $x_t$ gradually approaches pure noise $N(0, I)$. Reparameterizing $q(x_t \mid x_{t-1})$ with $\alpha_t = 1 - \beta_t$ yields $q(x_t \mid x_0)$, where $I$ is the identity matrix;
[0022] Reverse process: additional diffusion steps are applied to $z_0$ and $\hat{z}_0$ up to the same time step $s$ to generate $z_s$ and $\hat{z}_s$; the pre-trained decoder $D_{vae}$ then decodes the results back into pixel space to obtain the images $x_s$ and $\hat{x}_s$, which are input to the discriminator $D_{gan}$ for evaluation:
[0023]
[0024] Where $I$ is the identity matrix; $\bar{\alpha}_s$ is the cumulative noise coefficient, which controls the noise level at time step $s$; $z_s$ is the noisy version of the true latent representation $z_0$ at time step $s$; and $\hat{z}_s$ is the noisy version of the generator output $\hat{z}_0$ at time step $s$.
[0025] Preferably, a meta-path controller is introduced to dynamically skip redundant CSTB computation units based on the local complexity of the feature map, thereby reducing computational overhead while maintaining generation quality.
[0026] Local complexity sensing is performed on the concatenated feature $z_{init}$, and a complexity map $\Omega$ is output, expressed as:
[0027]
[0028] Where $z_{init}^{(i,j)}$ is the feature vector of $z_{init}$ at position $(i,j)$, $\|\cdot\|^2$ is the squared L2 norm, $\gamma$ is the balance factor, $\mathrm{Entropy}(\cdot)$ is the channel distribution entropy of the local 3×3 window, and $p_k$ is the probability of the $k$-th bin of the local window's 256-bin histogram;
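The complexity map can be illustrated with a small pure-Python sketch. The formula itself is not reproduced in this text, so the combination used here, squared feature norm plus $\gamma$ times the Shannon entropy of the 3×3 window's 256-bin histogram, is an assumption inferred from the symbol definitions above. The function names, the value range [0, 1], and the default `gamma=0.5` are hypothetical.

```python
import math

def window_entropy(values, bins=256, lo=0.0, hi=1.0):
    """Shannon entropy (in nats) of a `bins`-bin histogram of `values` on [lo, hi]."""
    counts = [0] * bins
    for v in values:
        # Clamp into the range, then map to a bin index.
        u = min(max((v - lo) / (hi - lo), 0.0), 1.0)
        counts[min(int(u * bins), bins - 1)] += 1
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in counts if c)

def complexity_map(z, gamma=0.5):
    """z: H x W x C nested lists with values in [0, 1]. Returns an H x W map.

    Assumed form: omega[i][j] = ||z[i][j]||^2 + gamma * entropy(3x3 window).
    """
    h, w = len(z), len(z[0])
    omega = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            energy = sum(c * c for c in z[i][j])
            # Gather all channel values in the 3x3 neighbourhood (clipped at borders).
            window = [c
                      for di in (-1, 0, 1) for dj in (-1, 0, 1)
                      if 0 <= i + di < h and 0 <= j + dj < w
                      for c in z[i + di][j + dj]]
            omega[i][j] = energy + gamma * window_entropy(window)
    return omega
```

A flat region yields zero entropy, so its complexity reduces to the feature energy alone; textured regions score higher and would be less likely to be skipped.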
[0029] The skip probability is predicted by a lightweight convolutional network, expressed as:
[0030] $M_{skip} = \sigma(\mathrm{Conv}_{3\times3}(\mathrm{ReLU}(\mathrm{Conv}_{1\times1}(\Omega))))$
[0031] Where $\sigma$ is the Sigmoid function, which outputs a probability map in the range [0,1], and $\mathrm{Conv}_{1\times1}$ is a 1×1 convolutional layer whose purpose is to reduce the dimensionality to 8 channels;
[0032] During the inference phase, a hard decision is used, that is...; then, at position $(i,j)$, the current CSTB computation is skipped and the feature from the previous level is reused directly. During the training phase, a soft mask is used, that is, Gumbel-Softmax replaces the hard decision, as shown below:
[0033]
[0034] Where $G$ and $G'$ are injected random noises that follow Gumbel(0,1), and $\tau$ is the temperature coefficient, which controls smoothness. When $\tau \to 0^{+}$, the output approaches a hard decision, i.e., 0 or 1; when $\tau \to \infty$, the output approaches a uniform distribution.
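The soft and hard skip decisions can be sketched as follows. The patent's own formula is not reproduced in this text, so the two-class Gumbel-Softmax form used here is the standard relaxation and is an assumption, as are the function names and the inference threshold of 0.5 (the source elides the actual hard-decision condition).

```python
import math
import random

def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample by inverse transform sampling."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))

def soft_skip(m_skip, tau, g, g_prime, eps=1e-12):
    """Relaxed skip decision in [0, 1] for one position (training phase).

    Two-class softmax over (log m + g) / tau and (log(1 - m) + g') / tau,
    written as a numerically stable sigmoid of the logit difference.
    """
    d = ((math.log(m_skip + eps) + g) - (math.log(1.0 - m_skip + eps) + g_prime)) / tau
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)

def hard_skip(m_skip, prev_feat, new_feat, threshold=0.5):
    """Inference phase: reuse the previous-level feature wherever m_skip > threshold.

    `threshold` is a hypothetical value; the source does not state it here.
    """
    return [p if m > threshold else n for m, p, n in zip(m_skip, prev_feat, new_feat)]
```

With a small temperature the soft output saturates towards 0 or 1, and with a large temperature it flattens towards 0.5, matching the limiting behaviour described for $\tau$.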
[0035] Feature consistency is enhanced through global dependency modeling;
[0036] An offset $\Delta p \in \mathbb{R}^{H \times W \times 2N}$ is learned dynamically, where $N$ is the kernel size, and the convolution sampling positions are adjusted accordingly:
[0037]
[0038] Where $z$ is the input feature map, i.e., the feature tensor to be convolved; $p$ is the target position coordinate, i.e., the pixel position currently being computed on the output feature map; $p_n$ is the predefined sampling offset, i.e., the fixed relative offset of the $n$-th sampling point in the convolution kernel; $\Delta p_n$ is the dynamically learned offset, i.e., the adaptive position adjustment for the $n$-th sampling point; $w_n$ is the convolution weight at the $n$-th sampling point; and $N$ is the total number of sampling points of the convolution kernel.
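The symbol definitions above match the usual deformable-convolution sum $y(p) = \sum_{n=1}^{N} w_n \, z(p + p_n + \Delta p_n)$. The sketch below evaluates that sum at one output position for a single-channel map. Bilinear interpolation at fractional positions, zero padding outside the map, and the 3×3 grid are assumptions taken from common practice, not from the text; all names are illustrative.

```python
import math

def bilinear(z, y, x):
    """Bilinearly sample the 2-D grid z at fractional (y, x); outside counts as 0."""
    h, w = len(z), len(z[0])
    y0, x0 = math.floor(y), math.floor(x)
    total = 0.0
    for yy, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                total += wy * wx * z[yy][xx]
    return total

# Fixed 3x3 sampling grid p_n as (row, col) offsets, so N = 9 sampling points.
GRID = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def deform_conv_at(z, p, weights, offsets):
    """Compute y(p) = sum_n w_n * z(p + p_n + delta_p_n).

    weights: N kernel weights w_n; offsets: N learned (dy, dx) pairs delta_p_n.
    """
    py, px = p
    return sum(
        w_n * bilinear(z, py + dy + ody, px + dx + odx)
        for w_n, (dy, dx), (ody, odx) in zip(weights, GRID, offsets)
    )
```

With all learned offsets set to zero the function reduces to an ordinary 3×3 convolution at `p`, which is a convenient sanity check.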
[0039] Preferably, the method for inputting the rendered frame image into a lightweight super-resolution model to obtain the reconstructed 4K resolution image includes:
[0040] The rendered frame image is encoded into the latent space using a pre-trained variational autoencoder, and noise is gradually added through a diffusion process, allowing the generator to learn to recover the latent representation of the high-resolution image from the noise. The latent representation is then decoded back into the pixel space by the decoder to obtain the super-resolution image.
[0041] The present invention also provides a system for real-time VR rendering with low bitrate cloud rendering streaming, the system being used to implement the aforementioned method, the system comprising: an acquisition module, a construction module, a reconstruction module, and a rendering module;
[0042] The acquisition module is used to acquire the rendered frame image;
[0043] The construction module is used to construct a lightweight super-resolution model;
[0044] The reconstruction module is used to input the rendered frame image into the lightweight super-resolution model to obtain the reconstructed 4K resolution image.
[0045] The rendering module is used to perform real-time rendering and display on VR devices based on the reconstructed 4K resolution image.
[0046] Preferably, the lightweight super-resolution model includes:
[0047] A U-Net generator G composed of $N_{CSTB} = 4$ CSTB modules; a discriminator network $D_{gan}$; a diffusion model; and a pre-trained variational autoencoder (VAE), wherein the VAE is divided into an encoder $E_{vae}$ and a decoder $D_{vae}$;
[0048] The generator G replaces the standard convolutional layers in U-Net with cross-scale Transformer modules (CSTB). Specifically, it replaces all 3×3 convolutional layers in the U-Net encoder and decoder, retains downsampling and upsampling operations, and adds CSTB at skip connections to enhance cross-scale information transfer. The structure is as follows:
[0049] $z_i = \mathrm{CSTB}(z_{i-1}) = \mathrm{DeformConv}(\mathrm{LayerNorm}(z_{i-1}) + \mathrm{MultiHeadAttn}(z_{i-1}, z_{i-1}, z_{i-1}))$
[0050] Where MultiHeadAttn denotes multi-head attention, DeformConv denotes deformable convolution, LayerNorm denotes a normalization layer, and $z_{i-1}$ is the output feature map of the $(i-1)$-th CSTB;
[0051] The discriminator network $D_{gan}$ adopts a multi-scale structure: a multi-scale discriminator $\{D_1, D_2, D_3\}$ in which each discriminator uses the same PatchGAN backbone, trains its parameters independently, and outputs a probability at its last layer;
[0052] The diffusion model includes a forward process and a reverse process;
[0053] The forward process creates a noisy version $z_t$ of $z_0$ at a random time step $t$, according to the formula:
[0054]
[0055] Where $x_0$ is the latent representation of the original image, i.e., without added noise; $x_t$ is the noisy latent representation at time step $t$, i.e., the state of the original image after $t$ steps of noise have been added; $q(x_t \mid x_{t-1})$ is the conditional probability distribution of generating $x_t$ from $x_{t-1}$, which follows a Gaussian distribution; $\alpha_t = 1 - \beta_t$ is the single-step noise retention coefficient, representing the proportion of original information retained in step $t$; $\bar{\alpha}_t$ is the cumulative noise retention factor, i.e., $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$; $q(x_t \mid x_0)$ is the conditional distribution of generating $x_t$ directly from $x_0$; and $\beta_t$ is the noise schedule, which controls the noise level at time step $t$. As $t$ increases (i.e., $t \to T$), $x_t$ gradually approaches pure noise $N(0, I)$. Reparameterizing $q(x_t \mid x_{t-1})$ with $\alpha_t = 1 - \beta_t$ yields $q(x_t \mid x_0)$, where $I$ is the identity matrix;
[0056] Reverse process: additional diffusion steps are applied to $z_0$ and $\hat{z}_0$ up to the same time step $s$ to generate $z_s$ and $\hat{z}_s$; the pre-trained decoder $D_{vae}$ then decodes the results back into pixel space to obtain the images $x_s$ and $\hat{x}_s$, which are input to the discriminator $D_{gan}$ for evaluation:
[0057]
[0058] Where $I$ is the identity matrix; $\bar{\alpha}_s$ is the cumulative noise coefficient, which controls the noise level at time step $s$; $z_s$ is the noisy version of the true latent representation $z_0$ at time step $s$; and $\hat{z}_s$ is the noisy version of the generator output $\hat{z}_0$ at time step $s$.
[0059] Preferably, a meta-path controller is introduced to dynamically skip redundant CSTB computation units based on the local complexity of the feature map, thereby reducing computational overhead while maintaining generation quality.
[0060] Local complexity sensing is performed on the concatenated feature $z_{init}$, and a complexity map $\Omega$ is output, expressed as:
[0061]
[0062] Where $z_{init}^{(i,j)}$ is the feature vector of $z_{init}$ at position $(i,j)$, $\|\cdot\|^2$ is the squared L2 norm, $\gamma$ is the balance factor, $\mathrm{Entropy}(\cdot)$ is the channel distribution entropy of the local 3×3 window, and $p_k$ is the probability of the $k$-th bin of the local window's 256-bin histogram;
[0063] The skip probability is predicted by a lightweight convolutional network, expressed as:
[0064] $M_{skip} = \sigma(\mathrm{Conv}_{3\times3}(\mathrm{ReLU}(\mathrm{Conv}_{1\times1}(\Omega))))$
[0065] Where $\sigma$ is the Sigmoid function, which outputs a probability map in the range [0,1], and $\mathrm{Conv}_{1\times1}$ is a 1×1 convolutional layer whose purpose is to reduce the dimensionality to 8 channels;
[0066] During the inference phase, a hard decision is used, that is...; then, at position $(i,j)$, the current CSTB computation is skipped and the feature from the previous level is reused directly. During the training phase, a soft mask is used, that is, Gumbel-Softmax replaces the hard decision, as shown below:
[0067]
[0068] Where $G$ and $G'$ are injected random noises that follow Gumbel(0,1), and $\tau$ is the temperature coefficient, which controls smoothness. When $\tau \to 0^{+}$, the output approaches a hard decision, i.e., 0 or 1; when $\tau \to \infty$, the output approaches a uniform distribution.
[0069] Feature consistency is enhanced through global dependency modeling;
[0070] An offset $\Delta p \in \mathbb{R}^{H \times W \times 2N}$ is learned dynamically, where $N$ is the kernel size, and the convolution sampling positions are adjusted accordingly:
[0071]
[0072] Where $z$ is the input feature map, i.e., the feature tensor to be convolved; $p$ is the target position coordinate, i.e., the pixel position currently being computed on the output feature map; $p_n$ is the predefined sampling offset, i.e., the fixed relative offset of the $n$-th sampling point in the convolution kernel; $\Delta p_n$ is the dynamically learned offset, i.e., the adaptive position adjustment for the $n$-th sampling point; $w_n$ is the convolution weight at the $n$-th sampling point; and $N$ is the total number of sampling points of the convolution kernel.
[0073] Preferably, the process of inputting the rendered frame image into a lightweight super-resolution model to obtain the reconstructed 4K resolution image includes:
[0074] The rendered frame image is encoded into the latent space using a pre-trained variational autoencoder, and noise is gradually added through a diffusion process, allowing the generator to learn to recover the latent representation of the high-resolution image from the noise. The latent representation is then decoded back into the pixel space by the decoder to obtain the super-resolution image.
[0075] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement any one of the methods described above.
[0076] The present invention also provides a computer-readable storage medium storing a computer program that, when executed, implements any one of the methods described above.
[0077] Compared with the prior art, the beneficial effects of the present invention are as follows:
[0078] The present invention proposes a method and system for real-time VR rendering using low-bitrate cloud streaming. The method uses a pre-trained variational autoencoder (VAE) to encode images into a latent space and gradually adds noise through a diffusion process, allowing the generator to learn how to recover the latent representation of the high-resolution image from the noise. Finally, a decoder decodes the latent representation back into pixel space to obtain a super-resolution image. To further optimize training performance, a dynamic time-step adjustment strategy is adopted, which balances the training progress of the generator and the discriminator based on the discriminator's accuracy. Furthermore, the discriminator employs a multi-scale structure, enabling accurate discrimination of images at different scales, and improves discrimination performance through spectral normalization and multi-scale weight fusion. Compared to traditional methods, this method addresses the difficulty that U-Net generators have in capturing global contextual information, enhances texture consistency, and improves the generation of multi-scale edges and structural details. At the same time, the multi-scale discriminator design improves the discrimination of high-frequency and low-frequency structural features and enhances training stability. Experimental results show that this method not only improves image quality metrics (such as PSNR and SSIM) but also significantly reduces FID values, demonstrating higher training and inference efficiency and stronger robustness.
Description of the Drawings
[0079] To more clearly illustrate the technical solution of the present invention, the drawings used in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0080] Figure 1 is a diagram of the lightweight super-resolution model architecture according to an embodiment of the present invention;
[0081] Figure 2 is a schematic diagram of the meta-path controller according to an embodiment of the present invention;
[0082] Figure 3 is a schematic diagram of the inference process according to an embodiment of the present invention;
[0083] Figure 4 is a schematic diagram of the structure of an electronic device according to an embodiment of the present invention.
Reference numerals:
[0085] 1010, Processor; 1020, Memory; 1030, Input/Output Interface; 1040, Communication Interface; 1050, Bus.
Detailed Description of the Embodiments
[0086] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0087] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0088] Embodiment 1
[0089] This embodiment provides a method for real-time VR rendering with low bitrate cloud rendering streaming, the method comprising:
[0090] Obtaining a rendered frame image;
[0091] Constructing a lightweight super-resolution model;
[0092] Inputting the rendered frame image into the lightweight super-resolution model to obtain a reconstructed 4K-resolution image;
[0093] Rendering and displaying the reconstructed 4K-resolution image in real time on a VR device.
[0094] In this embodiment, as shown in Figure 1, (1) the lightweight super-resolution model architecture includes a U-Net generator G (composed of $N_{CSTB} = 4$ CSTB modules), a discriminator network $D_{gan}$, a diffusion model, and a pre-trained variational autoencoder (VAE), wherein the VAE is divided into an encoder $E_{vae}$ and a decoder $D_{vae}$.
[0095] (2) Training data processing: For each pair of high-resolution image $x_0$ and low-resolution image $x_{low}$, the VAE encoder $E_{vae}$ first encodes them into the latent representations $z_0$ and $z_{low}$. This latent encoding enables efficient processing in a lower-dimensional space. In this invention, $x$ denotes an image and $z$ denotes an encoded feature map.
[0096] (3) In the overall architecture diagram, the "high-quality output image" is the output of the model inference process, and the "real/fake order" is the output of the model training process.
[0097] (4) The core objective of this technology is to achieve real-time rendering and display of 4K ultra-high-definition images on low-performance VR devices. By training a lightweight model in the cloud and combining local inference with video encoding and transmission, it solves the performance bottleneck caused by hardware limitations or high bandwidth requirements in traditional solutions. This technology is divided into two main stages: the model training stage (cloud) and the model inference stage (local + cloud collaboration).
[0098] 1) During the model training phase, the system independently models and trains each virtual scene based on a high-performance cloud server. Specifically, 1080P and 4K resolution images of the same scene are rendered simultaneously in the cloud as input and target output. This data is used to train the image super-resolution network, enabling the model to convert low-resolution images into high-quality 4K images. To further optimize the model's computational efficiency and meet the operational requirements of resource-constrained devices, we adopt an "overfitting training" strategy, i.e., training a highly lightweight model separately for each specific virtual scene. Although this approach sacrifices the model's generalization ability, since the content of each virtual scene is relatively fixed, it can significantly compress the model parameter size while ensuring reconstruction quality and improving inference efficiency. After training, all models for different virtual scenes are uniformly stored in a cloud model library for subsequent deployment.
[0099] 2) During the model inference phase, the entire process achieves collaborative work between the cloud and the local device. The cloud server renders 1080P resolution images in real-time based on the current virtual scene the user is in, and uses H.265 high-efficiency encoding for video compression to significantly reduce transmission bitrate and bandwidth usage. The compressed video stream is transmitted over a network (such as WebRTC technology) to the local low-performance VR device for decoding, restoring a series of frame images. Subsequently, the decoded frames are automatically matched and input into the corresponding lightweight super-resolution model based on their virtual scene type. These models are pre-downloaded from the cloud model library and cached on the local device, specifically designed for image enhancement tasks in specific scenes. The output of the processed models is a reconstructed 4K resolution image, which is ultimately rendered and displayed in real-time by the local VR device.
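The local side of this inference flow (look up the cached scene model, then upscale each decoded frame) can be sketched as below. This is only an illustration under stated assumptions: `SceneModelCache`, `download`, `reconstruct`, and the scene identifier are hypothetical names, frames are toy nested lists, and `nearest_2x` is a placeholder standing in for a trained super-resolution model. The only detail taken from the text is that 1080P to 4K is a 2× enlargement per axis and that models are fetched from the cloud library and cached locally.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Frame = List[List[float]]          # toy stand-in for one decoded video frame
Model = Callable[[Frame], Frame]   # a per-scene super-resolution model

@dataclass
class SceneModelCache:
    """Local cache of per-scene models fetched from a cloud model library."""
    download: Callable[[str], Model]            # hypothetical fetch function
    _models: Dict[str, Model] = field(default_factory=dict)

    def get(self, scene_id: str) -> Model:
        # Download once per scene, then serve from the local cache.
        if scene_id not in self._models:
            self._models[scene_id] = self.download(scene_id)
        return self._models[scene_id]

def nearest_2x(frame: Frame) -> Frame:
    """Placeholder model: 2x nearest-neighbour upscaling along both axes."""
    out: Frame = []
    for row in frame:
        wide = [v for v in row for _ in (0, 1)]
        out.append(wide)
        out.append(list(wide))
    return out

def reconstruct(frames: List[Frame], scene_id: str, cache: SceneModelCache) -> List[Frame]:
    """Route decoded low-resolution frames through the model matching their scene."""
    model = cache.get(scene_id)
    return [model(f) for f in frames]
```

Keying the cache by scene reflects the per-scene "overfitting" design: each model is small because it only has to cover one scene, so switching scenes means switching models rather than running one large general model.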
[0100] The advantage of this architecture lies in that it avoids the high bandwidth pressure caused by directly transmitting 4K video while overcoming the hardware limitations of low-end devices that cannot perform complex rendering tasks. By combining scene-customized models with video encoding, it achieves a high-quality, low-latency visual experience with limited resources, making it suitable for various application scenarios with high requirements for image quality and real-time performance, such as large-scale cloud simulation, VR tourism, and smart grid monitoring.
[0101] In this embodiment, the diffusion process is as follows:
[0102] (1) Forward process: A noisy version $z_t$ of $z_0$ is created at a random time step $t$. The formula is as follows:
[0103]
[0104] Where $x_0$ is the latent representation of the original image (without added noise); $x_t$ is the noisy latent representation at time step $t$, i.e., the state of the original image after $t$ steps of noise have been added; $q(x_t \mid x_{t-1})$ is the conditional probability distribution of generating $x_t$ from $x_{t-1}$, which follows a Gaussian distribution; $\alpha_t = 1 - \beta_t$ is the single-step noise retention coefficient, representing the proportion of original information retained in step $t$; $\bar{\alpha}_t$ is the cumulative noise retention factor, i.e., $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$; $q(x_t \mid x_0)$ is the conditional distribution of generating $x_t$ directly from $x_0$, which avoids step-by-step calculation; and $\beta_t$ is the noise schedule, which controls the noise level at time step $t$. As $t$ increases, i.e., $t \to T$, $x_t$ gradually approaches pure noise $N(0, I)$. Reparameterizing $q(x_t \mid x_{t-1})$ with $\alpha_t = 1 - \beta_t$ yields $q(x_t \mid x_0)$. $I$ is the identity matrix.
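The one-step sampling that this paragraph describes can be sketched as follows, assuming the standard closed form $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ with $\epsilon \sim N(0, I)$ and $\bar{\alpha}_t$ the running product of $1 - \beta_i$. The function names and the flat-list latent are illustrative only.

```python
import math
import random

def cumulative_alpha(betas):
    """Return alpha_bar_t = prod_{i<=t} (1 - beta_i) for t = 1..T."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def noise_latent(z0, t, alpha_bars, rng):
    """Sample z_t ~ N(sqrt(alpha_bar_t) * z0, (1 - alpha_bar_t) * I) directly.

    t is 1-based; z0 is a flat list of latent values.
    """
    a = alpha_bars[t - 1]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in z0]
```

Because $\bar{\alpha}_t$ shrinks towards 0 as $t$ grows, the signal term fades and the sample approaches pure Gaussian noise, without iterating through the intermediate steps.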
[0105] 1) Cosine noise scheduling design: If $\beta_t$ follows a linear noise schedule, its inconsistent rates of change at the two ends of the time range can cause abrupt changes in high-frequency noise and affect training stability. Linear scheduling is therefore replaced with cosine noise scheduling: because the rate of change of the cosine function is distributed symmetrically along the time axis, $\beta_t$ changes slowly in the initial stage ($t \approx 0$) and the final stage ($t \approx T$) and relatively rapidly in the intermediate stage. This schedule avoids the abrupt changes of linear scheduling, makes the noise transition smooth, and reduces gradient oscillations during training; it also accelerates noise addition in the intermediate stage, covering the data distribution more efficiently. The formula for cosine noise scheduling is:
[0106]
[0107] Where $\alpha_{min}$ and $\alpha_{max}$ are the minimum and maximum coefficients of the noise schedule, used to control the noise range (default $\alpha_{min} = 0.001$, $\alpha_{max} = 0.02$); $t$ is the current time step, $t \in \{1, 2, \ldots, T\}$; and $T$ is the total number of diffusion steps (default $T = 1000$).
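One schedule with the described behaviour (slow change near both ends, fastest change in the middle, bounded by the two coefficients) is a half-cosine ramp, sketched below. The exact formula is not reproduced in this text, so the form $\beta_t = \alpha_{min} + \tfrac{1}{2}(\alpha_{max} - \alpha_{min})(1 - \cos(\pi t / T))$ is an assumption chosen to match the description. Only the defaults 0.001, 0.02, and 1000 come from the source.

```python
import math

def cosine_betas(T=1000, a_min=0.001, a_max=0.02):
    """Half-cosine ramp from just above a_min (t = 1) to a_max (t = T).

    Assumed form: beta_t = a_min + 0.5 * (a_max - a_min) * (1 - cos(pi * t / T)).
    """
    return [
        a_min + 0.5 * (a_max - a_min) * (1.0 - math.cos(math.pi * t / T))
        for t in range(1, T + 1)
    ]

if __name__ == "__main__":
    betas = cosine_betas()
    # The per-step change is proportional to sin(pi * t / T): small near the
    # ends of the range and largest around t = T / 2.
    print(betas[1] - betas[0], betas[500] - betas[499], betas[-1] - betas[-2])
```

A linear schedule would give a constant per-step change instead; the ramp here trades that for gentler transitions at the start and end of the diffusion range.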
[0108] 2) Adaptive Step Size Selection Strategy: During the training phase, the model is trained using the complete $T$-step diffusion process to ensure that it learns the noise distribution across all steps. During the inference phase, the effective number of steps $T_{eff}$ of the reverse process is adjusted dynamically according to the complexity of the low-resolution latent representation $z_{low}$, and only the first $T_{eff}$ steps of the reverse process are executed, reducing redundant computation for simple samples.
[0109]
[0110] Where $\|z_{low}\|_2$ is the L2 norm of the low-resolution latent representation $z_{low}$, which measures complexity (a larger value indicates richer details), and $\gamma$ is a smoothing factor (default $\gamma = 0.1$) that prevents the denominator from being zero and controls the adjustment magnitude.
[0111] (2) Generator input: the triplet $(z_t, t, z_{low})$ is fed into the generator G, whose goal is to predict $\hat{z}_0$ so that it is as close as possible to the true latent representation $z_0$.
[0112] (3) The reverse process of the standard diffusion model: converting pure noise $x_T \sim N(0, I)$ into clean data $x_0$.
[0113] 1) Formula:
[0114]
[0115] 2) Model learning: The goal is to learn a model $p_\theta(x_{t-1} \mid x_t)$ whose parameters $\theta$ minimize the KL divergence from the tractable posterior distributions at all time steps.
[0116]
[0117] Where $\theta^{*}$ denotes the optimal parameters of the reverse process of the diffusion model, i.e., the generator parameters obtained by minimizing the sum of KL divergences. This can be simplified to the L2 norm between the true noise $\epsilon_t$ in the noisy state $x_t$ and the predicted noise.
[0118] 3) Sample generation: With the trained model $p_\theta(x_{t-1} \mid x_t)$, new samples can be generated by starting from $x_T \sim N(0, I)$ and denoising iteratively, i.e.
[0119]
[0120] Where $x_T$ is the latent variable of the diffusion process at the final time step $T$, and $q(x_T)$ is the marginal probability distribution of $x_T$ as defined in the forward process of the diffusion model.
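For reference, the three steps above correspond to the standard DDPM formulation, which is usually written as follows. This is a sketch in common textbook notation, assuming a noise-prediction network $\epsilon_\theta$ (a symbol not used in the text); it is not necessarily the exact form of the formulas in the original filing.

```latex
% 1) Reverse process as a Markov chain that starts from pure noise
p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t),
\qquad p(x_T) = \mathcal{N}(0, I)

% 2) Learning objective: match the tractable forward posteriors,
%    which simplifies to a noise-prediction loss
\theta^{*} = \arg\min_\theta \sum_{t=1}^{T}
  D_{\mathrm{KL}}\big( q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t) \big)
\;\;\longrightarrow\;\;
\min_\theta \; \mathbb{E}_{x_0, \epsilon, t}
  \big\| \epsilon - \epsilon_\theta(x_t, t) \big\|_2^2

% 3) Sampling by iterative denoising
x_T \sim \mathcal{N}(0, I), \qquad
x_{t-1} \sim p_\theta(x_{t-1} \mid x_t), \quad t = T, \ldots, 1
```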
[0121] (4) The reverse process of this method: To further optimize the prediction, additional diffusion steps are applied to $z_0$ and $\hat{z}_0$ up to the same time step $s$ to generate $z_s$ and $\hat{z}_s$. The pre-trained decoder $D_{vae}$ then decodes the results back into pixel space to obtain the images $x_s$ and $\hat{x}_s$, which are input into the discriminator $D_{gan}$ for evaluation.
[0122]
[0123] Where $I$ is the identity matrix; $\bar{\alpha}_s$ is the cumulative noise coefficient, which controls the noise level at time step $s$; $z_s$ is the noisy version of the true latent representation $z_0$ at time step $s$; and $\hat{z}_s$ is the noisy version of the generator output $\hat{z}_0$ at time step $s$.
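Under the closed-form forward process, noising the true latent $z_0$ and the generator output $\hat{z}_0$ to the shared time step $s$ would plausibly take the form below, with $\bar{\alpha}_s$ the cumulative noise coefficient. This is a sketch derived from the symbol definitions; the text does not state whether the two latents share one noise sample, so the independent samples $\epsilon$ and $\epsilon'$ are an assumption.

```latex
z_s = \sqrt{\bar{\alpha}_s}\, z_0 + \sqrt{1 - \bar{\alpha}_s}\, \epsilon,
\qquad
\hat{z}_s = \sqrt{\bar{\alpha}_s}\, \hat{z}_0 + \sqrt{1 - \bar{\alpha}_s}\, \epsilon',
\qquad
\epsilon, \epsilon' \sim \mathcal{N}(0, I)

% Decoding both noised latents back to pixel space for the discriminator
x_s = D_{vae}(z_s), \qquad \hat{x}_s = D_{vae}(\hat{z}_s)
```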
[0124] (5) Dynamically Adjusting the Time Step s: The accuracy of the discriminator is monitored using an Exponential Moving Average (EMA) mechanism, and the time step s is adjusted accordingly. This ensures that the discriminator is neither too powerful nor too weak, thus maintaining a balance between the generator and the discriminator. The formula for dynamically adjusting the time step is as follows:
[0125]
[0126] Where $acc_{ema}^{i}$ is the EMA accuracy at the $i$-th training iteration, $acc_{batch}$ is the discriminator accuracy on the current batch, $\lambda_{ema}$ is the EMA weight ($\lambda_{ema} = 0.05$), $T$ is the maximum diffusion time step, and $s$ is the dynamically adjusted time step.
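The accuracy-tracking part of this mechanism can be sketched as a plain exponential moving average. The update $acc_{ema}^{i} = (1 - \lambda_{ema})\, acc_{ema}^{i-1} + \lambda_{ema}\, acc_{batch}$ is the conventional EMA form and is assumed here, with the weight applied to the new batch; the initial value of 0.5 is hypothetical. The mapping from the EMA accuracy to the time step $s$ is not reproduced in this text and is deliberately left out.

```python
def ema_update(acc_ema, acc_batch, lam=0.05):
    """One EMA step: blend the running accuracy with the current batch accuracy."""
    return (1.0 - lam) * acc_ema + lam * acc_batch

def track_accuracy(batch_accuracies, init=0.5, lam=0.05):
    """Return the EMA accuracy after each training iteration.

    `init` is a hypothetical starting value (chance level for a binary decision).
    """
    history, acc = [], init
    for a in batch_accuracies:
        acc = ema_update(acc, a, lam)
        history.append(acc)
    return history
```

With a weight of 0.05, a single unusually easy or hard batch moves the running accuracy only slightly, so any adjustment of $s$ driven by it reacts to sustained trends rather than batch-level noise.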
[0127] Note that the forward noise-addition process is essentially the same as in the standard diffusion model (with only two minor innovations) and adds noise up to time step $T$. The reverse process, however, differs from the standard diffusion model: it does not run over $T$ time steps but only to a time step $s$, which can be adjusted dynamically. Subsequently, $z_s$ and $\hat{z}_s$ are fed into the decoder, which outputs $x_s$ and $\hat{x}_s$. The two images are then concatenated in random order and input into the discriminator, which determines which image was decoded from $z_s$, the noised latent of the "high-quality input image" (the real one), and which was decoded from $\hat{z}_s$, the noised latent that the generator produced from the "low-quality input image" (the fake one). The final output is the real/fake order, with four possible results: real-fake, fake-real, real-real, and fake-fake.
[0128] In this embodiment, generator G:
[0129] (1) Generator Structure: A generator G built on a plain U-Net has two limitations. First, traditional convolutions are restricted to a local receptive field and struggle to capture global context, resulting in insufficient texture consistency. Second, their ability to generate multi-scale edges and structural details is limited. To address this, this invention replaces the standard convolutional layers in U-Net with cross-scale Transformer modules (CSTB). Specifically, it replaces all 3×3 convolutional layers in the U-Net encoder and decoder, retains the downsampling and upsampling operations, and adds CSTB at the skip connections to enhance cross-scale information transfer. Its structure is as follows:
[0130] $z_i = \mathrm{CSTB}(z_{i-1}) = \mathrm{DeformConv}(\mathrm{LayerNorm}(z_{i-1} + \mathrm{MultiHeadAttn}(z_{i-1}, z_{i-1}, z_{i-1})))$
[0131] Where MultiHeadAttn denotes multi-head attention, DeformConv denotes deformable convolution, LayerNorm denotes the normalization layer, and $z_{i-1}$ is the output feature map of the (i-1)-th CSTB.
[0132] 1) Generator input data: The generator takes three inputs: the noisy latent state $z_t$, the time step t, and the latent representation $z_{low}$ of the low-resolution image. Input 1, the current latent state $z_t$, is the noisy latent representation produced by the forward diffusion process; it carries the information of the latent representation $z_0$, encoded from the original high-resolution image $x_0$, after noise has been added. Input 2, the time step t, represents the current stage of the diffusion process; it tells the generator how far denoising has progressed and therefore how much denoising remains to be done. Input 3, the latent representation $z_{low}$ of the low-resolution image, is obtained by encoding the low-resolution image $x_{low}$ with the pre-trained VAE encoder $E_{vae}$; it provides the generator with structural and semantic information about the image, guiding it to produce a high-resolution image consistent with the low-resolution image.
[0133] 2) Feature fusion of input data: First, $z_t \in R^{H\times W\times C}$ and $z_{low} \in R^{H\times W\times C}$ are concatenated along the channel dimension to form the initial feature $z_{t\text{-}low} \in R^{H\times W\times 2C}$, where H×W×C denotes the height, width, and number of channels of the feature map. Then, a 1×1 convolution reduces the concatenated feature to the original number of channels, yielding $z_{init} \in R^{H\times W\times C}$. The formula is expressed as:
[0134] $z_{init} = \mathrm{Conv}_{1\times 1}(z_t \oplus z_{low})$
[0135] Here, ⊕ denotes concatenation along the channel dimension. $z_{init}$ serves as the first input to the CSTB; after $N_{CSTB}$ CSTB iterations (for example, the generator consists of $N_{CSTB}=4$ CSTB modules), the final output $\hat{z}_0$ is obtained.
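The fusion step can be illustrated with a short NumPy sketch. A 1×1 convolution is a per-pixel linear map over channels, so it is written here as an `einsum`. The function name `fuse_inputs`, the tensor sizes, and the random stand-in weights are assumptions made for illustration; in the actual model the kernel would be learned.

```python
import numpy as np

def fuse_inputs(z_t, z_low, weight, bias):
    """Concatenate along channels, then apply a 1x1 convolution
    (a per-pixel linear map from 2C channels back to C channels)."""
    z_cat = np.concatenate([z_t, z_low], axis=-1)            # (H, W, 2C)
    return np.einsum("hwk,kc->hwc", z_cat, weight) + bias    # (H, W, C)

H, W, C = 8, 8, 4
rng = np.random.default_rng(0)
z_t = rng.standard_normal((H, W, C))             # noisy latent state
z_low = rng.standard_normal((H, W, C))           # latent of the low-resolution image
weight = 0.1 * rng.standard_normal((2 * C, C))   # stand-in for the learned 1x1 kernel
bias = np.zeros(C)
z_init = fuse_inputs(z_t, z_low, weight, bias)
```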
[0136] 3) Meta-Path Controller (MPC): Traditional dynamic path computation decides whether to skip a CSTB module based only on the global L2 norm $\|z_{low}\|_2$, but complexity is unevenly distributed across image regions. For example, in a VR scene, a person's face requires detailed computation, while a solid-color background can be simplified. The MPC is a spatially adaptive computation scheduling mechanism. Its core idea is to dynamically skip redundant CSTB computation units based on the local complexity of the feature map, significantly reducing computational overhead while maintaining generation quality. The flow is shown in Figure 2:
[0137] a) Local complexity analysis: Local complexity awareness is performed on the concatenated feature $z_{init}$, and the complexity map Ω is output.
[0138]
[0139] Where $z_{init}^{(i,j)}$ is the feature vector of $z_{init}$ at position (i,j), $\|\cdot\|_2$ is the L2 norm, and γ is the balance factor (default γ = 0.5). Entropy(·) is the channel distribution entropy of the local 3×3 window, and $p_k$ is the probability distribution of the local window histogram over 256 bins.
[0140] b) Dynamic mask generation: A lightweight convolutional network predicts the skip probability.
[0141] $M_{skip} = \sigma(\mathrm{Conv}_{3\times 3}(\mathrm{ReLU}(\mathrm{Conv}_{1\times 1}(\Omega))))$
[0142] Where σ is the Sigmoid function, which outputs a probability map in the range [0,1], and $\mathrm{Conv}_{1\times 1}$ is a 1×1 convolutional layer whose purpose is to reduce the dimensionality to 8 channels.
[0143] c) Gumbel-Softmax mechanism: During the inference phase, a hard mask is used: wherever the binarized mask marks position (i,j) as skipped, the current CSTB computation is skipped and the previous-level feature is reused directly. During the training phase, a soft mask is used, i.e., Gumbel-Softmax replaces the hard decision.
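[0144] $\tilde{M}_{skip} = \frac{\exp\left((\log M_{skip} + G)/\tau\right)}{\exp\left((\log M_{skip} + G)/\tau\right) + \exp\left((\log(1 - M_{skip}) + G')/\tau\right)}$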
[0145] Where G and G' are injected random noise terms, each following Gumbel(0,1), and τ is the temperature coefficient that controls smoothness: as τ→0⁺, the output approaches a hard decision (0 or 1); as τ→∞, the output approaches a uniform distribution.
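The difference between the soft training mask and the hard inference mask can be sketched as follows. The sketch assumes the common binary Gumbel-Softmax relaxation, in which the skip and keep log-probabilities are perturbed with independent Gumbel(0,1) noise and passed through a two-way softmax with temperature τ. The function names, the 0.5 inference threshold, and the way the mask blends features in `apply_skip` are assumptions for illustration.

```python
import numpy as np

def sample_gumbel(shape, rng):
    """Draw Gumbel(0, 1) noise as -log(-log(U)) with U ~ Uniform(0, 1)."""
    u = rng.uniform(1e-12, 1.0, size=shape)
    return -np.log(-np.log(u))

def soft_skip_mask(m_skip, tau, rng, eps=1e-8):
    """Training-time relaxed mask in [0, 1] from skip probabilities m_skip."""
    g = sample_gumbel(m_skip.shape, rng)
    g_prime = sample_gumbel(m_skip.shape, rng)
    logit_skip = (np.log(m_skip + eps) + g) / tau
    logit_keep = (np.log(1.0 - m_skip + eps) + g_prime) / tau
    # A two-way softmax equals a sigmoid of the logit difference; clip to avoid overflow.
    diff = np.clip(logit_skip - logit_keep, -60.0, 60.0)
    return 1.0 / (1.0 + np.exp(-diff))

def hard_skip_mask(m_skip, threshold=0.5):
    """Inference-time binary decision; the 0.5 threshold is an assumption."""
    return (m_skip > threshold).astype(np.float64)

def apply_skip(z_prev, z_cstb, mask):
    """Reuse the previous-level feature where mask is 1, the CSTB output where it is 0.
    This only blends outputs; a real implementation would avoid computing skipped positions."""
    return mask[..., None] * z_prev + (1.0 - mask[..., None]) * z_cstb

rng = np.random.default_rng(0)
m_skip = np.array([[0.9, 0.1], [0.5, 0.99]])
soft = soft_skip_mask(m_skip, tau=1.0, rng=rng)
hard = hard_skip_mask(m_skip)
fused = apply_skip(np.zeros((2, 2, 3)), np.ones((2, 2, 3)), hard)
```

Lower temperatures push the soft mask toward 0 or 1, which narrows the gap between training behavior and the hard decisions used at inference.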
[0146] 1) Multi-Head Attention (MultiHeadAttn): Enhances feature consistency through global dependency modeling. For the input feature map $z_{i-1} \in R^{H\times W\times C}$ (the first input is $z_{init}$), global dependencies are computed through query, key, and value projections:
[0147] $\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$
[0148] Here, Q is the query matrix, which is compared with the key matrix to determine which information is important. K is the key matrix; its comparison with the query matrix yields the attention weights, which determine which information is relevant. V is the value matrix, which is weighted by the computed attention weights to produce the final output. $d_k$ is the dimension of the key vectors; it scales the attention scores to prevent the vanishing gradients of the Softmax function that excessively large inner products would cause, keeping the model stable. In addition, the multi-head mechanism splits the computation into h parallel heads (default h = 8) to enhance multi-scale feature fusion.
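A minimal NumPy sketch of multi-head self-attention over a feature map is given below, treating each of the H×W positions as one token and using scaled dot-product attention with h = 8 heads. The projection matrices are random stand-ins for learned parameters, and the function names and tensor sizes are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(z, w_q, w_k, w_v, w_o, h=8):
    """z: (H, W, C) feature map -> (H, W, C), attending over all H*W positions."""
    H, W, C = z.shape
    d_k = C // h                                   # per-head key dimension
    tokens = z.reshape(H * W, C)

    def split(x):                                  # (L, C) -> (h, L, d_k)
        return x.reshape(-1, h, d_k).transpose(1, 0, 2)

    q, k, v = split(tokens @ w_q), split(tokens @ w_k), split(tokens @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)          # (h, L, L)
    attn = softmax(scores, axis=-1)                           # rows sum to 1
    out = (attn @ v).transpose(1, 0, 2).reshape(H * W, C)     # merge heads
    return (out @ w_o).reshape(H, W, C), attn

rng = np.random.default_rng(0)
H, W, C = 4, 4, 16
z = rng.standard_normal((H, W, C))
w_q, w_k, w_v, w_o = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(4))
out, attn = multi_head_self_attention(z, w_q, w_k, w_v, w_o, h=8)
```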
[0149] 2) Deformable Convolution (DeformConv): Dynamically adjusts the receptive field to capture multi-scale details and improve edge sharpness. It learns offsets $\Delta p \in R^{H\times W\times 2N}$, where N is the kernel size (the number of sampling points), and adjusts the convolution sampling positions accordingly:
[0150] $y(p) = \sum_{n=1}^{N} w_n \cdot z(p + p_n + \Delta p_n)$
[0151] Where z is the input feature map, i.e., the feature tensor to be convolved (usually the output of the previous layer); p is the target position, i.e., the pixel position currently being computed on the output feature map; $p_n$ is the predefined sampling offset, i.e., the fixed relative offset of the n-th sampling point in the convolution kernel; $\Delta p_n$ is the dynamically learned offset, i.e., the adaptive position adjustment for the n-th sampling point; $w_n$ is the convolution weight at the n-th sampling point; and N is the total number of sampling points of the convolution kernel.
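The sampling rule of deformable convolution can be illustrated for a single output position. The sketch below assumes a 3×3 kernel (N = 9), a single-channel input, bilinear interpolation for fractional sampling locations, and zero padding outside the map; these choices and the function names are assumptions, and the offsets are passed in by hand rather than predicted by a network.

```python
import numpy as np

def bilinear_sample(z, y, x):
    """Sample a 2D map z at a fractional location (y, x); outside the map counts as zero."""
    H, W = z.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    val = 0.0
    for yy, xx in [(y0, x0), (y0, x0 + 1), (y0 + 1, x0), (y0 + 1, x0 + 1)]:
        if 0 <= yy < H and 0 <= xx < W:
            val += z[yy, xx] * (1 - abs(y - yy)) * (1 - abs(x - xx))
    return val

def deform_conv_at(z, p, weights, offsets):
    """Weighted sum over the 9 sampling points, each shifted by its own offset."""
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]   # fixed offsets p_n
    out = 0.0
    for (dy, dx), w_n, (oy, ox) in zip(grid, weights, offsets):
        out += w_n * bilinear_sample(z, p[0] + dy + oy, p[1] + dx + ox)
    return out

z = np.arange(25, dtype=float).reshape(5, 5)       # z[y, x] = 5 * y + x
weights = np.full(9, 1.0 / 9.0)                    # averaging kernel
no_shift = deform_conv_at(z, (2, 2), weights, [(0.0, 0.0)] * 9)
half_shift = deform_conv_at(z, (2, 2), weights, [(0.5, 0.5)] * 9)
```

With zero offsets the result reduces to an ordinary 3×3 convolution; nonzero offsets move each sampling point independently, which is what lets the receptive field follow edges and structures.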
[0152] 3) LayerNorm: Normalizes the attention output to stabilize the training process.
[0153] (2) Advantages of CSTB:
[0154] 1) Global context modeling: Multi-head self-attention mechanism captures long-distance dependencies and solves the texture breakage problem caused by local convolution in the original method.
[0155] 2) Multi-scale detail enhancement: Deformable convolution dynamically adapts to edges and structures, improving PSNR by approximately 0.8dB and SSIM by 5%.
[0156] 3) Training and inference efficiency (requires optimization of the inference process, such as dynamic computation path and cached attention matrix): Mixed precision training reduces memory usage by 40%, and dynamic computation path enables inference speed to reach 25 FPS (1080Ti).
[0157] 4) Robustness improvement (requires combination with cross-scale consistency loss function): Cross-scale consistency loss suppresses multi-scale generation inconsistency, reducing FID by 12.3%.
[0158] (3) The role of the generator and the training objective
[0159] 1) During training, the goal of generator G is to learn to gradually recover a clean latent representation $z_0$ from the noisy latent state $z_t$. Through adversarial training against the discriminator $D_{gan}$, the generator continuously improves its predictions so that it generates more realistic, higher-quality high-resolution images.
[0160] 2) The training objective of the generator is to minimize the difference between the generated image and the real image in the latent space, while deceiving the discriminator into being unable to distinguish between the generated and real images. This is achieved through the generator's loss function, which is a combination of content loss and adversarial loss.
[0161] In this embodiment, the discriminator $D_{gan}$: A single-scale convolutional discriminator has two drawbacks. First, it attends only to features at the original resolution, so its ability to discriminate high-frequency textures (such as hair and texture details) and low-frequency structures (such as object contours) is insufficient. Second, traditional normalization methods (such as BatchNorm) may lead to gradient explosion or mode collapse, making training unstable. Therefore, this invention designs a multi-scale discriminator:
[0162] (1) Structure of the multi-scale discriminator: The multi-scale discriminator consists of {D1, D2, D3}. Each discriminator uses the same backbone network (such as PatchGAN) but trains its parameters independently, and its last layer outputs a probability.
[0163] (2) Input and output of the multi-scale discriminator: Since the denoising process runs only to time step s, the discriminator inputs are $x_s$ and $\hat{x}_s$ rather than the actual high-resolution image $x_0$ and the generated super-resolution image $\hat{x}_0$.
[0164] 1) D1 receives the randomly permuted input and outputs the probability of real or fake. The input is a random permutation of the real and generated images, concatenated along the channel dimension; the discriminator must predict the correct order of this permutation, which improves the robustness of adversarial training.
[0165]
[0166] Where $x_s$ is the image decoded from $z_s$ at time step s, $\hat{x}_s$ is the image decoded from $\hat{z}_s$ at time step s, ⊕ denotes concatenation along the channel dimension, and the random-permutation operator denotes concatenation in random order.
[0167] 2) D2 receives the downsampled map of this input and outputs the probability of real or fake.
[0168] 3) D3 receives the further downsampled map of this input and outputs the probability of real or fake.
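Building the inputs for the three discriminators can be sketched as follows. The sketch assumes block-average downsampling by factors of 2 and 4 for D2 and D3 and samples only the two mixed orderings (real first or generated first); the factors, the function names, and the constant stand-in images are assumptions made for illustration.

```python
import numpy as np

def random_pair(x_s, x_s_hat, rng):
    """Concatenate real and generated images along channels in random order.
    Returns the stacked input and a label: 1 if the real image comes first, else 0."""
    real_first = int(rng.integers(0, 2))
    pair = (x_s, x_s_hat) if real_first else (x_s_hat, x_s)
    return np.concatenate(pair, axis=-1), real_first

def avg_pool(x, factor):
    """Downsample an (H, W, C) image by block averaging; H and W must be divisible by factor."""
    H, W, C = x.shape
    return x.reshape(H // factor, factor, W // factor, factor, C).mean(axis=(1, 3))

def discriminator_inputs(x_s, x_s_hat, rng, factors=(1, 2, 4)):
    """One input per scale: full resolution for D1, downsampled copies for D2 and D3."""
    stacked, label = random_pair(x_s, x_s_hat, rng)
    return [avg_pool(stacked, f) for f in factors], label

rng = np.random.default_rng(0)
x_s = np.ones((8, 8, 3))          # stand-in for the decoded real image
x_s_hat = np.zeros((8, 8, 3))     # stand-in for the decoded generated image
inputs, label = discriminator_inputs(x_s, x_s_hat, rng)
```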
[0169] (3) Spectral normalization: Spectral normalization is applied after each convolutional layer of the discriminator, and the weight matrix is updated before each forward pass. The formula is as follows:
[0170] $W_{SN} = \frac{W}{\sigma(W)}$
[0171] Where σ(W) is the spectral norm of the weight matrix (i.e., its largest singular value), approximated by power iteration; W is the weight matrix of the convolutional layer; and $W_{SN}$ is the normalized weight, which ensures that the network satisfies Lipschitz continuity (i.e., constrains the gradient magnitude).
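The power-iteration estimate of σ(W) can be sketched in NumPy. The sketch assumes that a convolution kernel is flattened to a 2D matrix of shape (out_channels, -1) before the spectral norm is estimated; the function name, the iteration count, and the small diagonal test matrix are assumptions chosen for illustration.

```python
import numpy as np

def spectral_normalize(w, n_iters=50, rng=None):
    """Return (W / sigma(W), sigma), with sigma estimated by power iteration."""
    if rng is None:
        rng = np.random.default_rng(0)
    w2d = w.reshape(w.shape[0], -1)            # flatten a conv kernel to 2D
    u = rng.standard_normal(w2d.shape[0])
    for _ in range(n_iters):
        v = w2d.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = w2d @ v
        u /= np.linalg.norm(u) + 1e-12
    sigma = u @ w2d @ v                        # estimate of the largest singular value
    return w / sigma, sigma

w = np.diag([3.0, 1.0])                        # largest singular value is 3
w_sn, sigma = spectral_normalize(w)
```

After normalization the largest singular value of the weight matrix is approximately 1, which bounds how much the layer can amplify its input.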
[0172] (4) The output of each discriminator is obtained as a scalar value through global average pooling, which represents the "realism score" of the image at that scale.
[0173] (5) Multi-scale weighted fusion: The discrimination results at different scales are combined by weighted averaging, and a scale confidence factor $\alpha_k$ is introduced to adjust the weights dynamically and enhance robustness.
[0174]
[0175] Where $w_k$ is the scale weighting coefficient, satisfying $\sum_k w_k = 1$ (default values $w_1 = 0.5$, $w_2 = 0.3$, $w_3 = 0.2$, reflecting an emphasis on high-frequency details), and $\alpha_k$ is the scale confidence factor: if the discriminator at a given scale outputs a high probability ($D_k \to 1$), its confidence $\alpha_k$ increases, and vice versa. A threshold is set; if the fused score exceeds it, the image is judged real; otherwise, it is judged generated (i.e., not real). During the training phase, the weights can be further optimized based on the gradient magnitude of the loss of the k-th discriminator, with τ as the temperature coefficient (τ = 0.1) controlling the magnitude of the weight adjustment.
[0176] In this embodiment, the loss function is:
[0177] (1) Cross-scale consistency loss:
[0178]
[0179] Where $\mathrm{Downsample}_k$ denotes the k-th level of downsampling (the ratio is...), which constrains multi-scale feature alignment.
[0180] (2) Generator loss function: The generator loss function is a weighted combination of multiple loss functions. By balancing these loss terms, the generator can improve perceptual quality and realism while maintaining structural accuracy.
[0181] 1) Content loss $L_{mse}$: The content loss uses the mean squared error (MSE) to measure the structural difference between the generated image and the real image in the latent space.
[0182] $L_{mse} = \|z_0 - \hat{z}_0\|_2^2$
[0183] Where $z_0$ is the latent representation of the true high-resolution image, and $\hat{z}_0$ is the latent representation of the high-resolution image output by the generator.
[0184] 2) Multi-scale adversarial loss $L_{adv}$: The adversarial loss encourages the generator to produce images that deceive the discriminator, so that it cannot distinguish generated images from real ones.
[0185]
[0186] Where $w_k$ is the weight of each scale, $D_k(\cdot)$ is the mean of the output probability map of the k-th discriminator, and E denotes the expected value.
[0187] 3) Perceptual loss $L_{perc}$: High-level features are extracted with a pre-trained VGG19 network, and the distance between the generated image and the real image is computed in feature space. The perceptual loss constrains the generated image through high-level features, making it more consistent with human visual perception and reducing the oversmoothing caused by the MSE loss.
[0188]
[0189] Where φ is the feature extractor of the Conv4_4 layer of the VGG19 network; this layer balances the representational power of low-level textures (such as edges) and high-level semantics (such as object structure).
[0190] 4) Gradient consistency loss $L_{grad}$: Computes the difference between the generated image and the real image in gradient space, forcing the generated image to match the real image's edge distribution and suppressing blur artifacts.
[0191]
[0192] Where ∇ denotes the image gradient computed with the Sobel operator, and $\|\cdot\|_1$ denotes the L1 norm, which is more sensitive to edge differences and strengthens the sparsity constraint.
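A gradient consistency term of this kind can be sketched with the two 3×3 Sobel kernels and an L1 difference. The sketch assumes single-channel images, valid-mode filtering, and averaging of the absolute differences over pixels; these choices, the function names, and the ramp test image are assumptions made for illustration.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter3x3(img, kernel):
    """Valid-mode 3x3 cross-correlation of a 2D image."""
    H, W = img.shape
    out = np.zeros((H - 2, W - 2))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * img[dy:dy + H - 2, dx:dx + W - 2]
    return out

def gradient_l1_loss(generated, real):
    """Mean absolute difference between the Sobel gradients of two images."""
    loss = 0.0
    for kernel in (SOBEL_X, SOBEL_Y):
        loss += np.abs(filter3x3(generated, kernel) - filter3x3(real, kernel)).mean()
    return loss

ramp = np.tile(np.arange(6, dtype=float), (6, 1))   # intensity grows by 1 per column
flat = np.zeros((6, 6))
loss_same = gradient_l1_loss(ramp, ramp)
loss_diff = gradient_l1_loss(ramp, flat)
```

A blurred output has weaker gradients at edges than the reference, so this term grows with blur even when the pixel-wise error stays small.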
[0193] 5) Loss function of meta-path controller
[0194] $L_{MPC} = \lambda_{consist} \cdot L_{consist} + (1-\lambda_{consist}) \cdot L_{balance}$
[0195] Where $L_{consist}$ is the decision consistency loss, used to prevent frequent switching of computation states and to maintain spatiotemporal continuity; it is computed on the mask value at each position (i,j), detecting abrupt mask changes with the Laplacian operator and penalizing outlier decisions (such as skipping single pixels). $L_{balance}$ is the computational load balancing loss, used to keep the computational load balanced across regions, where $\mu_{target}$ is the target skip rate, defaulting to $\mu_{target}=0.4$ and dynamically adjusted according to the L2 norm of $z_{low}$, with $\|z_{low}\|_{2,max}$ being the maximum L2 norm over all $z_{low}$. $\lambda_{consist}$ is the continuity weight (e.g., $\lambda_{consist}=0.4$).
[0196] 6) Generator loss function L G
[0197] $L_G = L_{mse} + \lambda_{adv} \cdot L_{adv} + \lambda_{CS} \cdot L_{CS} + \lambda_{perc} \cdot L_{perc} + \lambda_{grad} \cdot L_{grad} + \lambda_{MPC} \cdot L_{MPC}$
[0198] Where $\lambda_{adv}$ is the weight of the adversarial loss (e.g., $\lambda_{adv}=1\times 10^{-3}$), $\lambda_{CS}$ is the weight of the cross-scale consistency loss (e.g., $\lambda_{CS}=0.1$), $\lambda_{perc}$ is the weight of the perceptual loss (e.g., $\lambda_{perc}=0.01$), $\lambda_{grad}$ is the weight of the gradient consistency loss (e.g., $\lambda_{grad}=0.05$), and $\lambda_{MPC}$ is the weight of the MPC loss (e.g., $\lambda_{MPC}=0.05$); these weights balance the impact of each loss term.
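How the individual terms combine into the generator loss can be shown with plain Python. The example weights are the ones given in the text; the individual loss values passed in are made-up numbers, and the names `DEFAULT_WEIGHTS`, `mpc_loss`, and `generator_loss` are assumptions introduced for this sketch.

```python
# Example weights quoted in the text (lambda_adv, lambda_CS, lambda_perc, lambda_grad, lambda_MPC).
DEFAULT_WEIGHTS = {"adv": 1e-3, "cs": 0.1, "perc": 0.01, "grad": 0.05, "mpc": 0.05}

def mpc_loss(l_consist, l_balance, lam_consist=0.4):
    """Convex combination of the consistency and load-balancing terms."""
    return lam_consist * l_consist + (1.0 - lam_consist) * l_balance

def generator_loss(l_mse, terms, weights=DEFAULT_WEIGHTS):
    """Content loss plus the weighted sum of the auxiliary loss terms."""
    return l_mse + sum(weights[name] * value for name, value in terms.items())

# Made-up loss values, for illustration only.
l_mpc = mpc_loss(l_consist=1.0, l_balance=2.0)
total = generator_loss(
    l_mse=0.2,
    terms={"adv": 1.0, "cs": 1.0, "perc": 1.0, "grad": 1.0, "mpc": l_mpc},
)
```

Because the content loss enters with weight 1 and the other weights are small, the auxiliary terms act as regularizers around the latent-space reconstruction objective.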
[0199] (3) Parameter update: The model weights are updated by alternating between discriminator loss and generator loss. This alternating training method helps maintain the balance between the generator and discriminator during training, thereby improving the stability of the model and the quality of the generated images.
[0200] In this embodiment, the inference process is as follows (the training and inference processes are shown in Figure 3):
[0201] (1) Encoding the low-resolution image: The low-resolution input image $x_{low}$ is encoded into the latent space to obtain the latent representation $z_{low}$.
[0202] (2) Reverse diffusion process
[0203] 1) Starting from pure Gaussian noise, the reverse diffusion process is carried out in the latent space.
[0204] 2) Use generator G to progressively reduce noise and generate a high-quality, high-resolution latent representation of the image.
[0205] (3) Inference optimization
[0206] 1) Dynamic path calculation: Based on the complexity of the input $z_{low}$ (measured by $\|z_{low}\|_2$), the computation of some CSTB modules is adaptively skipped:
[0207]
[0208] Where θ is the threshold (default θ = 0.5), which can be dynamically adjusted based on the validation set.
[0209] 2) Cached attention matrix: The attention weights at a fixed time step t are cached to reduce redundant calculations and improve inference speed by 30%.
[0210] (4) Decoding the latent representation: The final generated latent representation $\hat{z}_0$ is decoded into pixel space by the pre-trained decoder $D_{vae}$, yielding the high-resolution output image.
[0211] The beneficial effects of this embodiment are as follows: 1. Generator level
[0212] (1) Global texture consistency: The introduction of a cross-scale Transformer module replaces the traditional convolution. The multi-head self-attention mechanism captures long-distance dependencies from the global view, completely solving the texture breakage problem caused by local convolution and generating images with coherent and natural textures.
[0213] (2) Multi-scale detail rendering: Deformable convolution dynamically adjusts the receptive field according to the image content, accurately capturing multi-scale details from fine hair to grand outlines, significantly improving edge sharpness. Compared with conventional methods, PSNR is improved by about 0.8dB and SSIM is improved by 5%, resulting in richer visual details.
[0214] (3) High efficiency of training and inference: The hybrid precision training strategy reduces the memory usage by 40%, and the dynamic calculation path intelligently skips redundant calculation nodes based on the input complexity. The inference speed on the 1080Ti graphics card can reach 25FPS, achieving a fine balance between efficiency and quality.
[0215] 2. Discriminator level
[0216] (1) Multi-scale feature discrimination: Construct a multi-scale discriminator matrix to examine the authenticity of the image from the original scale to the step-by-step downsampling. It neither overlooks the subtle flaws of high-frequency textures nor ignores the overall logic of low-frequency structures. The discrimination granularity is far greater than that of a single-scale scheme.
[0217] (2) Training stability guarantee: Spectrum normalization replaces the traditional BatchNorm, strictly constrains the gradient magnitude, eliminates the risk of gradient explosion and mode collapse, the training process is smooth and the model convergence is significantly enhanced.
[0218] (3) Dynamic weight fusion: The innovative introduction of scale confidence factor and gradient-driven dynamic weight adjustment mechanism intelligently allocates weights according to the confidence and discrimination contribution of different scales, which is more accurate and reliable in judging the authenticity of images compared with fixed weight scheme.
[0219] 3. Overall Generation Effect
[0220] (1) Ultra-high image quality: The cross-scale consistency loss function strictly constrains the alignment of multi-scale features, and is supplemented by multi-dimensional optimization of perceptual loss and gradient consistency loss. The FID index is reduced by 12.3%, and the generated image achieves a qualitative leap in structural accuracy and perceptual quality, which is both faithful and realistic.
[0221] (2) Comprehensively improved robustness: From the noise-resistant reverse diffusion process of the generator to the multi-scale redundant discrimination of the discriminator, the entire architecture forms a robust closed loop. It can stably output high-quality super-resolution results even for low-quality, blurry input images, which greatly expands the range of applicable scenarios.
[0222] Example 2
[0223] This embodiment provides a system for real-time VR rendering with low bitrate cloud rendering and streaming. The system is used to implement the method described in Embodiment 1. The system includes: an acquisition module, a construction module, a reconstruction module, and a rendering module.
[0224] The acquisition module is used to acquire the rendered frame image;
[0225] The construction module is used to build the lightweight super-resolution model;
[0226] The reconstruction module is used to input the rendered frame image into the lightweight super-resolution model to obtain the reconstructed 4K resolution image;
[0227] The rendering module is used for real-time rendering and display on VR devices based on the reconstructed 4K resolution image.
[0228] In this embodiment, the lightweight super-resolution model includes:
[0229] A U-Net generator G, wherein the generator consists of $N_{CSTB}=4$ CSTB modules; a discriminator network $D_{gan}$; a diffusion model; and a pre-trained variational autoencoder (VAE), wherein the VAE is divided into an encoder $E_{vae}$ and a decoder $D_{vae}$;
[0230] The generator G replaces the standard convolutional layers in U-Net with cross-scale Transformer modules (CSTB). Specifically, it replaces all 3×3 convolutional layers in the U-Net encoder and decoder, retains downsampling and upsampling operations, and adds CSTB at skip connections to enhance cross-scale information transfer. The structure is as follows:
[0231] $z_i = \mathrm{CSTB}(z_{i-1}) = \mathrm{DeformConv}(\mathrm{LayerNorm}(z_{i-1} + \mathrm{MultiHeadAttn}(z_{i-1}, z_{i-1}, z_{i-1})))$
[0232] Where MultiHeadAttn denotes multi-head attention, DeformConv denotes deformable convolution, LayerNorm denotes the normalization layer, and $z_{i-1}$ is the output feature map of the (i-1)-th CSTB;
[0233] The discriminator network $D_{gan}$ is designed as a multi-scale discriminator {D1, D2, D3}; each discriminator uses the same PatchGAN backbone network, trains its parameters independently, and outputs a probability at its last layer;
[0234] Diffusion models include: forward process and reverse process;
[0235] The forward process creates a noisy version $z_t$ of $z_0$ at a random time step t. The formula is:
[0236] $q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t; \sqrt{\alpha_t}\, x_{t-1}, \beta_t I\right), \quad q(x_t \mid x_0) = \mathcal{N}\left(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1-\bar{\alpha}_t) I\right)$
[0237] Where $x_0$ is the latent representation of the original image, i.e., without added noise; $x_t$ is the noisy latent representation at time step t, i.e., the state of the original image after t steps of noise; $q(x_t \mid x_{t-1})$ is the conditional probability distribution of generating $x_t$ from $x_{t-1}$, which follows a Gaussian distribution; $\alpha_t = 1-\beta_t$ is the single-step noise retention coefficient, representing the proportion of original information retained at step t; $\bar{\alpha}_t$ is the cumulative noise retention factor, i.e., $\bar{\alpha}_t = \prod_{i=1}^{t}\alpha_i$; $q(x_t \mid x_0)$ is the conditional distribution of generating $x_t$ directly from $x_0$; and $\beta_t$ is the noise schedule that controls the noise level at time step t. As t increases (i.e., t→T), $x_t$ gradually approaches pure noise $\mathcal{N}(0, I)$. Reparameterizing $q(x_t \mid x_{t-1})$ with $\alpha_t = 1-\beta_t$ yields the formula for $q(x_t \mid x_0)$, where I is the identity matrix;
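[0238] Reverse process: Additional diffusion steps are applied to $z_0$ and $\hat{z}_0$ up to the same time step s to generate $z_s$ and $\hat{z}_s$; the pre-trained decoder $D_{vae}$ then decodes the results back into pixel space to obtain the images $x_s$ and $\hat{x}_s$, which are input to the discriminator $D_{gan}$ for evaluation:
[0239] $q(z_s \mid z_0) = \mathcal{N}\left(z_s; \sqrt{\bar{\alpha}_s}\, z_0, (1-\bar{\alpha}_s) I\right), \quad q(\hat{z}_s \mid \hat{z}_0) = \mathcal{N}\left(\hat{z}_s; \sqrt{\bar{\alpha}_s}\, \hat{z}_0, (1-\bar{\alpha}_s) I\right)$
[0240] Where I is the identity matrix, $\bar{\alpha}_s$ is the cumulative noise factor that controls the noise level at time step s, $z_s$ is the noisy version of the true latent representation $z_0$ at time step s, and $\hat{z}_s$ is the noisy version of the generator output $\hat{z}_0$ at time step s.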
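The closed form $q(x_t \mid x_0)$ referred to in the forward process can be obtained by unrolling the single-step update, as the following short derivation sketches. It assumes the standard parameterization in which each step draws independent noise $\epsilon_t \sim \mathcal{N}(0, I)$; the symbols $\epsilon_t$ and $\bar{\epsilon}$ are introduced here for illustration only.

```latex
\begin{aligned}
x_t &= \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_t \\
    &= \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2}
       + \sqrt{\alpha_t(1-\alpha_{t-1})}\,\epsilon_{t-1}
       + \sqrt{1-\alpha_t}\,\epsilon_t \\
    &= \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2}
       + \sqrt{1-\alpha_t\alpha_{t-1}}\,\bar{\epsilon}
       \qquad \text{since } \alpha_t(1-\alpha_{t-1}) + (1-\alpha_t) = 1-\alpha_t\alpha_{t-1} \\
    &\;\;\vdots \\
    &= \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\bar{\epsilon},
       \qquad \bar{\alpha}_t = \prod_{i=1}^{t}\alpha_i .
\end{aligned}
```

The third line uses the fact that the sum of two independent zero-mean Gaussians is Gaussian with the summed variance. Applying the final expression with t = s to $z_0$ and to the generator output gives the two noisy latents that are decoded and passed to the discriminator.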
[0241] In this embodiment, a meta-path controller is introduced to dynamically skip redundant CSTB computation units based on the local complexity of the feature map, thereby reducing computational overhead while maintaining generation quality.
[0242] Local complexity awareness is performed on the concatenated feature $z_{init}$, and the complexity map Ω is output, represented as:
[0243]
[0244] Where $z_{init}^{(i,j)}$ is the feature vector of $z_{init}$ at position (i,j), $\|\cdot\|_2$ is the L2 norm, γ is the balance factor, Entropy(·) is the channel distribution entropy of the local 3×3 window, and $p_k$ is the probability distribution of the local window histogram over 256 bins;
[0245] The skip probability is predicted with a lightweight convolutional network, expressed as:
[0246] $M_{skip} = \sigma(\mathrm{Conv}_{3\times 3}(\mathrm{ReLU}(\mathrm{Conv}_{1\times 1}(\Omega))))$
[0247] Where σ is the Sigmoid function, which outputs a probability map in the range [0,1], and $\mathrm{Conv}_{1\times 1}$ is a 1×1 convolutional layer whose purpose is to reduce the dimensionality to 8 channels;
[0248] During the inference phase, a hard mask is used: wherever the binarized mask marks position (i,j) as skipped, the current CSTB computation is skipped and the previous-level feature is reused directly. During the training phase, a soft mask is used, i.e., Gumbel-Softmax replaces the hard decision, as follows:
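[0249] $\tilde{M}_{skip} = \frac{\exp\left((\log M_{skip} + G)/\tau\right)}{\exp\left((\log M_{skip} + G)/\tau\right) + \exp\left((\log(1 - M_{skip}) + G')/\tau\right)}$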
[0250] Where G and G' are injected random noise terms, each following Gumbel(0,1), and τ is the temperature coefficient that controls smoothness: as τ→0⁺, the output approaches a hard decision, i.e., 0 or 1; as τ→∞, the output approaches a uniform distribution.
[0251] Multi-head attention enhances feature consistency through global dependency modeling;
[0252] Deformable convolution learns offsets $\Delta p \in R^{H\times W\times 2N}$, where N is the kernel size (the number of sampling points), and adjusts the convolution sampling positions accordingly:
[0253] $y(p) = \sum_{n=1}^{N} w_n \cdot z(p + p_n + \Delta p_n)$
[0254] Where z is the input feature map, i.e., the feature tensor to be convolved; p is the target position, i.e., the pixel position currently being computed on the output feature map; $p_n$ is the predefined sampling offset, i.e., the fixed relative offset of the n-th sampling point in the convolution kernel; $\Delta p_n$ is the dynamically learned offset, i.e., the adaptive position adjustment for the n-th sampling point; $w_n$ is the convolution weight at the n-th sampling point; and N is the total number of sampling points of the convolution kernel.
[0255] In this embodiment, the process of inputting the rendered frame image into the lightweight super-resolution model to obtain the reconstructed 4K resolution image includes:
[0256] The rendered frame image is encoded into the latent space using a pre-trained variational autoencoder, and noise is gradually added through a diffusion process, allowing the generator to learn to recover the latent representation of the high-resolution image from the noise. The latent representation is then decoded back into the pixel space by the decoder to obtain the super-resolution image.
[0257] Example 3
[0258] Based on the same inventive concept, corresponding to any of the above embodiments, this disclosure also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the method described in any of the above embodiments.
[0259] Figure 4 shows a more specific hardware structure of the electronic device provided by this embodiment, which may include a processor 1010, a memory 1020, an input / output interface 1030, a communication interface 1040, and a bus 1050. The processor 1010, memory 1020, input / output interface 1030, and communication interface 1040 are interconnected within the device via the bus 1050.
[0260] The processor 1010 can be implemented using a general-purpose CPU (Central Processing Unit), microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of this specification.
[0261] The memory 1020 can be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), static storage device, dynamic storage device, etc. The memory 1020 can store the operating system and other applications. When the technical solutions provided in the embodiments of this specification are implemented by software or firmware, the relevant program code is stored in the memory 1020 and is called and executed by the processor 1010.
[0262] The input / output interface 1030 is used to connect input / output modules to realize information input and output. Input / output modules can be configured as components within the device (not shown in the figure) or externally connected to the device to provide corresponding functions. Input devices may include keyboards, mice, touchscreens, microphones, various sensors, etc., while output devices may include displays, speakers, vibrators, indicator lights, etc.
[0263] The communication interface 1040 is used to connect a communication module (not shown in the figure) to enable communication between this device and other devices. The communication module can communicate via wired means (such as USB (Universal Serial Bus), network cable, etc.) or wireless means (such as mobile network, WIFI (Wireless Fidelity), Bluetooth, etc.).
[0264] Bus 1050 includes a pathway for transmitting information between various components of the device, such as processor 1010, memory 1020, input / output interface 1030, and communication interface 1040.
[0265] It should be noted that although the above-described device only shows the processor 1010, memory 1020, input / output interface 1030, communication interface 1040, and bus 1050, in specific implementations, the device may also include other components necessary for normal operation. Furthermore, those skilled in the art will understand that the above-described device may only include the components necessary for implementing the embodiments of this specification, and not necessarily all the components shown in the figures.
[0266] The electronic device described in the above embodiment is used to implement the corresponding method of any of the foregoing embodiments and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.
[0267] Example 4
[0268] Based on the same inventive concept, corresponding to the methods of any of the above embodiments, this disclosure also provides a non-transitory computer-readable storage medium that stores computer instructions for causing the computer to perform the methods described in any of the above embodiments.
[0269] The computer-readable medium of this embodiment includes permanent and non-permanent, removable and non-removable media, and information storage can be implemented by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
[0270] The computer instructions stored in the storage medium of the above embodiments are used to cause the computer to perform the methods described in any of the above embodiments, and have the beneficial effects of the corresponding method embodiments, which will not be repeated here.
[0271] Those skilled in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of this disclosure (including the claims) is limited to these examples; within the framework of this disclosure, the technical features of the above embodiments or different embodiments can also be combined, the steps can be implemented in any order, and there are many other variations of different aspects of the embodiments of this disclosure as described above, which are not provided in detail for the sake of brevity.
[0272] Additionally, to simplify the description and discussion, and to avoid obscuring the embodiments of this disclosure, the provided drawings may or may not show well-known power / ground connections to integrated circuit (IC) chips and other components. Furthermore, the apparatus may be shown in block diagram form to avoid obscuring the embodiments of this disclosure, and this also takes into account the fact that the details of implementation of these block diagram apparatuses are highly dependent on the platform on which the embodiments of this disclosure will be implemented (i.e., these details should be fully understood by those skilled in the art). While specific details (e.g., circuitry) have been set forth to describe exemplary embodiments of this disclosure, it will be apparent to those skilled in the art that the embodiments of this disclosure may be implemented without these specific details or with variations thereof. Therefore, these descriptions should be considered illustrative rather than restrictive.
[0273] Although this disclosure has been described in conjunction with specific embodiments thereof, many substitutions, modifications, and variations of these embodiments will be apparent to those skilled in the art from the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may be used with the embodiments discussed.
[0274] Therefore, the units of the various examples described in the embodiments of this application can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0275] The embodiments described above are merely preferred embodiments of the present invention and are not intended to limit the scope of the present invention. Various modifications and improvements made to the technical solutions of the present invention by those skilled in the art without departing from the spirit of the present invention should fall within the protection scope defined by the claims of the present invention.
Claims
1. A method for real-time VR rendering with low bitrate cloud rendering streaming, characterized in that the method includes:
obtaining a rendered frame image;
constructing a lightweight super-resolution model;
inputting the rendered frame image into the lightweight super-resolution model to obtain a reconstructed 4K resolution image;
performing real-time rendering and display on a VR device based on the reconstructed 4K resolution image;
the lightweight super-resolution model includes: a U-Net generator composed of CSTB modules; a discriminator network; a diffusion model; and a pre-trained variational autoencoder (VAE), wherein the VAE is divided into an encoder and a decoder;
in the generator, the standard convolutional layers of the U-Net are replaced with cross-scale Transformer modules (CSTB); specifically, all convolutional layers in the U-Net encoder and decoder are replaced, the downsampling and upsampling operations are retained, and a CSTB is added at each skip connection to enhance cross-scale information transfer; the CSTB structure combines multi-head attention, deformable convolution, and a normalization layer to produce the output feature map of each CSTB;
the discriminator network is designed as a multi-scale discriminator, in which each discriminator uses the same PatchGAN backbone network with independently trained parameters, and the last layer outputs probabilities;
the diffusion model includes a forward process and a reverse process;
in the forward process, a noised version of the latent representation is created at a random time step, starting from the latent representation of the original image, i.e., the latent representation without added noise; the noised latent representation at a given time step is the state of the original image after that many steps of noise have been added; the conditional distribution that generates the noised latent representation is Gaussian; the single-step noise retention coefficient represents the proportion of original information retained at each step; the cumulative noise retention coefficient, accumulated from the single-step coefficients, determines the conditional distribution that generates the noised latent representation directly from the original latent representation; the noise schedule controls the noise level at each time step and increases gradually with the time step, so that the noised latent representation gradually approaches pure noise; reparameterizing this conditional distribution with standard Gaussian noise, whose covariance is the identity matrix, gives the noised latent representation in closed form;
the noise schedule is a cosine noise schedule defined by a minimum and a maximum noise-control coefficient, which bound the noise range, together with the current time step and the total number of diffusion steps;
the effective number of steps of the reverse process is adjusted dynamically according to the complexity of the low-resolution latent representation, computed from the L2 norm of the low-resolution latent representation and a smoothing factor;
in the reverse process, the real latent representation and the generator output each undergo an additional diffusion step at the same time step to generate their respective noised versions; the pre-trained decoder then decodes these results back into pixel space to obtain two images, which are input to the discriminator for evaluation; in this additional diffusion step, the noise covariance is the identity matrix and the cumulative noise coefficient controls the noise level at that time step.
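The forward process and cosine noise schedule recited in claim 1 can be illustrated with a minimal Python sketch. The formulas themselves are not reproduced in this text, so the sketch assumes the widely used diffusion parameterization in which the noised latent is sqrt(a_bar_t) * z_0 + sqrt(1 - a_bar_t) * eps, with a_bar_t the running product of the single-step retention coefficients. The function names, the default values of `beta_min` and `beta_max`, and the exact shape of the cosine ramp are assumptions made for illustration and are not taken from the claim.

```python
import math
import random


def cosine_betas(num_steps, beta_min=1e-4, beta_max=0.02):
    """Assumed cosine ramp of the per-step noise level from beta_min to beta_max."""
    betas = []
    for t in range(1, num_steps + 1):
        ramp = 0.5 * (1.0 - math.cos(math.pi * t / num_steps))  # rises to 1 at t = T
        betas.append(beta_min + (beta_max - beta_min) * ramp)
    return betas


def cumulative_retention(betas):
    """a_bar_t: running product of the single-step retention 1 - beta_s."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def forward_noise(z0, t, alpha_bars, rng):
    """Reparameterized sample: sqrt(a_bar_t) * z0 + sqrt(1 - a_bar_t) * eps."""
    a_bar = alpha_bars[t - 1]  # time steps are 1-indexed
    keep, spread = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
    return [keep * v + spread * rng.gauss(0.0, 1.0) for v in z0]


betas = cosine_betas(1000)
alpha_bars = cumulative_retention(betas)
z_t = forward_noise([0.3, -1.2, 0.8], t=500, alpha_bars=alpha_bars,
                    rng=random.Random(0))
```

Because the cumulative coefficient shrinks toward zero as the time step grows, the sample at the final step is dominated by the Gaussian term, which matches the statement that the noised latent gradually approaches pure noise.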
2. The method according to claim 1, characterized in that a meta-path controller is introduced to dynamically skip redundant CSTB computation units based on the local complexity of the feature map, thereby reducing computational overhead while maintaining generation quality;
local complexity perception is performed on the concatenated features to output a complexity map, which combines the L2 norm of the feature vector at each position with the channel distribution entropy of the local window at that position, weighted by a balancing factor; the entropy is computed from the probability distribution of the local window histogram over 256 bins;
the skip probability is predicted by a lightweight convolutional network, in which a convolutional layer reduces the dimensionality to 8 channels and a Sigmoid function outputs a probability map in the range [0,1];
during the inference phase, a hard decision is used: when the skip probability at a position exceeds the threshold, the current CSTB computation is skipped at that position and the features from the previous level are reused directly;
during the training phase, a soft mask is used, i.e., Gumbel-Softmax replaces the hard decision, where the injected random noise terms follow a Gumbel distribution and the temperature coefficient is a smoothness control parameter: as the temperature approaches zero the output approaches a hard decision, i.e., 0 or 1, and as the temperature increases the output approaches a uniform distribution;
feature consistency is enhanced through global dependency modeling;
dynamically learned offsets, whose number depends on the kernel size, adjust the convolution sampling positions of the deformable convolution, in which: the input feature map is the feature tensor to be convolved; the target location coordinates are the pixel position currently being computed on the output feature map; the predefined sampling offset is the fixed relative offset of each sampling point in the convolution kernel; the dynamically learned offset is the adaptive position adjustment of each sampling point; the convolution weights are the kernel weights of each sampling point; and the total number of sampling points is the number of sampling points of the convolution kernel.
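The skip logic of the meta-path controller in claim 2 can be sketched for a single spatial position as follows. This is only an illustration under stated assumptions: the complexity score is taken to be the L2 norm plus a balancing factor times the window entropy, the hard-decision threshold of 0.5 is hypothetical, and the two-class Gumbel-Softmax form is the standard one rather than a formula recovered from the claim. All function names are invented for the example.

```python
import math
import random


def window_entropy(values, bins=256):
    """Shannon entropy of the histogram of values (assumed in [0, 1)) over `bins` bins."""
    counts = [0] * bins
    for v in values:
        counts[min(max(int(v * bins), 0), bins - 1)] += 1
    total = len(values)
    return -sum((c / total) * math.log(c / total) for c in counts if c)


def local_complexity(feature_vec, window_values, lam=0.5):
    """Assumed combination: L2 norm of the feature vector + lam * window entropy."""
    l2 = math.sqrt(sum(v * v for v in feature_vec))
    return l2 + lam * window_entropy(window_values)


def stable_sigmoid(x):
    """Sigmoid that avoids overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def hard_skip(prob, threshold=0.5):
    """Inference: skip the CSTB at this position (threshold value is assumed)."""
    return prob > threshold


def gumbel(rng):
    """Draw Gumbel(0, 1) noise from a uniform sample."""
    u = max(rng.random(), 1e-12)
    return -math.log(-math.log(u))


def soft_skip(prob, tau, rng):
    """Training: two-class Gumbel-Softmax weight of the 'skip' option."""
    p = min(max(prob, 1e-6), 1.0 - 1e-6)
    skip_logit = math.log(p) + gumbel(rng)
    keep_logit = math.log(1.0 - p) + gumbel(rng)
    # A softmax over two logits equals a sigmoid of their difference.
    return stable_sigmoid((skip_logit - keep_logit) / tau)
```

With a small temperature the soft weight is pushed toward 0 or 1, and with a very large temperature it stays close to 0.5, which mirrors the two limiting cases described in the claim.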
3. The method according to claim 1, characterized in that inputting the rendered frame image into the lightweight super-resolution model to obtain the reconstructed 4K resolution image includes: encoding the rendered frame image into the latent space using the pre-trained variational autoencoder; gradually adding noise through the diffusion process, so that the generator learns to recover the latent representation of the high-resolution image from the noise; and decoding the latent representation back into pixel space with the decoder to obtain the super-resolution image.
4. A system for real-time VR rendering with low bitrate cloud rendering streaming, the system being used to implement the method described in any one of claims 1-3, characterized in that the system includes: an acquisition module, a construction module, a reconstruction module, and a rendering module; the acquisition module is used to acquire the rendered frame image; the construction module is used to construct the lightweight super-resolution model; the reconstruction module is used to input the rendered frame image into the lightweight super-resolution model to obtain the reconstructed 4K resolution image; and the rendering module is used to perform real-time rendering and display on VR devices based on the reconstructed 4K resolution image.
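The division of the system into four modules can be pictured with a small Python sketch. Everything named here is hypothetical: `RenderingSystem`, `step`, and the `Frame` type are illustrative, and `nearest_upscale` is only a placeholder that repeats pixels, standing in for the lightweight super-resolution model, which this sketch does not implement.

```python
from typing import Callable, List

Frame = List[List[float]]  # hypothetical stand-in for an image: rows of pixel values


def nearest_upscale(frame: Frame, factor: int = 2) -> Frame:
    """Placeholder for the super-resolution model: repeats each pixel."""
    out = []
    for row in frame:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


class RenderingSystem:
    def __init__(self,
                 acquire: Callable[[], Frame],
                 build_model: Callable[[], Callable[[Frame], Frame]],
                 display: Callable[[Frame], None]) -> None:
        self.acquire = acquire        # acquisition module
        self.model = build_model()    # construction module, run once
        self.display = display        # rendering module

    def step(self) -> Frame:
        frame = self.acquire()
        reconstructed = self.model(frame)  # reconstruction module
        self.display(reconstructed)
        return reconstructed


shown: List[Frame] = []
system = RenderingSystem(lambda: [[1.0, 2.0], [3.0, 4.0]],
                         lambda: nearest_upscale,
                         shown.append)
output = system.step()
```

Passing the modules in as callables keeps the acquisition source, the model, and the display target independently replaceable, which is one plausible reading of the modular structure in the claim.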
5. The system according to claim 4, characterized in that the lightweight super-resolution model includes: a U-Net generator composed of CSTB modules; a discriminator network; a diffusion model; and a pre-trained variational autoencoder (VAE), wherein the VAE is divided into an encoder and a decoder;
in the generator, the standard convolutional layers of the U-Net are replaced with cross-scale Transformer modules (CSTB); specifically, all convolutional layers in the U-Net encoder and decoder are replaced, the downsampling and upsampling operations are retained, and a CSTB is added at each skip connection to enhance cross-scale information transfer; the CSTB structure combines multi-head attention, deformable convolution, and a normalization layer to produce the output feature map of each CSTB;
the discriminator network is designed as a multi-scale discriminator, in which each discriminator uses the same PatchGAN backbone network with independently trained parameters, and the last layer outputs probabilities;
the diffusion model includes a forward process and a reverse process;
in the forward process, a noised version of the latent representation is created at a random time step, starting from the latent representation of the original image, i.e., the latent representation without added noise; the noised latent representation at a given time step is the state of the original image after that many steps of noise have been added; the conditional distribution that generates the noised latent representation is Gaussian; the single-step noise retention coefficient represents the proportion of original information retained at each step; the cumulative noise retention coefficient, accumulated from the single-step coefficients, determines the conditional distribution that generates the noised latent representation directly from the original latent representation; the noise schedule controls the noise level at each time step and increases gradually with the time step, so that the noised latent representation gradually approaches pure noise; reparameterizing this conditional distribution with standard Gaussian noise, whose covariance is the identity matrix, gives the noised latent representation in closed form;
in the reverse process, the real latent representation and the generator output each undergo an additional diffusion step at the same time step to generate their respective noised versions; the pre-trained decoder then decodes these results back into pixel space to obtain two images, which are input to the discriminator for evaluation; in this additional diffusion step, the noise covariance is the identity matrix and the cumulative noise coefficient controls the noise level at that time step.
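The reverse-process step of claim 5, in which the real latent and the generator output are noised at the same time step before decoding, can be sketched as below. The sketch assumes the usual closed-form noising with a cumulative coefficient `alpha_bar_t`; the names `noise_at_step` and `discriminator_inputs` are hypothetical, and `decode` is a caller-supplied stand-in for the pre-trained VAE decoder.

```python
import math
import random


def noise_at_step(latent, alpha_bar_t, rng):
    """Additional diffusion step: sqrt(a_bar) * z + sqrt(1 - a_bar) * eps, eps ~ N(0, I)."""
    keep, spread = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [keep * v + spread * rng.gauss(0.0, 1.0) for v in latent]


def discriminator_inputs(z_real, z_fake, alpha_bar_t, decode, rng):
    """Noise both latents at the same time step, then decode each to pixel space."""
    noisy_real = noise_at_step(z_real, alpha_bar_t, rng)
    noisy_fake = noise_at_step(z_fake, alpha_bar_t, rng)
    return decode(noisy_real), decode(noisy_fake)


# Usage with an identity "decoder" (list) purely for illustration.
image_real, image_fake = discriminator_inputs(
    [0.2, -0.4], [0.1, -0.3], alpha_bar_t=0.6, decode=list, rng=random.Random(0)
)
```

Sharing one cumulative coefficient between the two branches means the discriminator compares real and generated content at an identical noise level, so differences it detects come from the latents rather than from the amount of added noise.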
6. The system according to claim 5, characterized in that a meta-path controller is introduced to dynamically skip redundant CSTB computation units based on the local complexity of the feature map, thereby reducing computational overhead while maintaining generation quality;
local complexity perception is performed on the concatenated features to output a complexity map, which combines the L2 norm of the feature vector at each position with the channel distribution entropy of the local window at that position, weighted by a balancing factor; the entropy is computed from the probability distribution of the local window histogram over 256 bins;
the skip probability is predicted by a lightweight convolutional network, in which a convolutional layer reduces the dimensionality to 8 channels and a Sigmoid function outputs a probability map in the range [0,1];
during the inference phase, a hard decision is used: when the skip probability at a position exceeds the threshold, the current CSTB computation is skipped at that position and the features from the previous level are reused directly;
during the training phase, a soft mask is used, i.e., Gumbel-Softmax replaces the hard decision, where the injected random noise terms follow a Gumbel distribution and the temperature coefficient is a smoothness control parameter: as the temperature approaches zero the output approaches a hard decision, i.e., 0 or 1, and as the temperature increases the output approaches a uniform distribution;
feature consistency is enhanced through global dependency modeling;
dynamically learned offsets, whose number depends on the kernel size, adjust the convolution sampling positions of the deformable convolution, in which: the input feature map is the feature tensor to be convolved; the target location coordinates are the pixel position currently being computed on the output feature map; the predefined sampling offset is the fixed relative offset of each sampling point in the convolution kernel; the dynamically learned offset is the adaptive position adjustment of each sampling point; the convolution weights are the kernel weights of each sampling point; and the total number of sampling points is the number of sampling points of the convolution kernel.
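The deformable convolution described at the end of claim 6 can be illustrated in one dimension. The sketch assumes the standard deformable form, in which the output at a target position is the weighted sum of the input sampled at the target position plus a fixed kernel offset plus a learned offset, with linear interpolation for fractional positions. The 1D simplification, the zero padding outside the signal, and the function names are choices made for this example only.

```python
import math


def sample_linear(signal, pos):
    """Linearly interpolate `signal` at fractional index `pos` (zero outside)."""
    lo = math.floor(pos)
    frac = pos - lo

    def at(i):
        return signal[i] if 0 <= i < len(signal) else 0.0

    return (1.0 - frac) * at(lo) + frac * at(lo + 1)


def deformable_conv1d_at(signal, p0, weights, base_offsets, learned_offsets):
    """y(p0) = sum over k of w_k * x(p0 + p_k + delta_p_k)."""
    return sum(
        w * sample_linear(signal, p0 + pk + dpk)
        for w, pk, dpk in zip(weights, base_offsets, learned_offsets)
    )


signal = [0.0, 1.0, 2.0, 3.0, 4.0]
# With zero learned offsets this reduces to an ordinary 3-tap convolution.
plain = deformable_conv1d_at(signal, 2, [1.0, 1.0, 1.0], [-1, 0, 1], [0.0, 0.0, 0.0])
# Learned offsets of 0.5 shift every sampling point half a step to the right.
shifted = deformable_conv1d_at(signal, 2, [1.0, 1.0, 1.0], [-1, 0, 1], [0.5, 0.5, 0.5])
```

In the two-dimensional case each sampling point carries a learned offset per axis and the interpolation becomes bilinear, but the weighted-sum structure is the same.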
7. The system according to claim 5, characterized in that inputting the rendered frame image into the lightweight super-resolution model to obtain the reconstructed 4K resolution image includes: encoding the rendered frame image into the latent space using the pre-trained variational autoencoder; gradually adding noise through the diffusion process, so that the generator learns to recover the latent representation of the high-resolution image from the noise; and decoding the latent representation back into pixel space with the decoder to obtain the super-resolution image.
8. An electronic device, characterized in that it includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method as described in any one of claims 1 to 3.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed, implements the method as described in any one of claims 1 to 3.