A 4K ultra-high-definition real-time cloud streaming and rendering method and system

By building a super-resolution model for a specific virtual scene in the cloud and transmitting low-quality images, combined with spatial and channel attention mechanisms, the network bottleneck of 4K image transmission on low-end devices is solved, achieving efficient 4K ultra-high-definition real-time rendering and improving the user experience.

CN120856947B · Active · Publication Date: 2026-06-26 · 北京渲光科技有限公司 +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
北京渲光科技有限公司
Filing Date
2025-08-04
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing technologies face challenges in transmitting 4K ultra-high-definition images from high-end cloud devices to low-end devices, including high network bandwidth consumption, high latency, and insufficient hardware in low-end devices. Furthermore, general super-resolution algorithms struggle to achieve ideal visual effects in all scenarios.

Method used

For each virtual scene, an overfitted super-resolution model is built in the cloud. Low-quality images are transmitted to the client and processed using a lightweight super-resolution model. Combined with spatial attention mechanism and efficient bidirectional channel attention mechanism, scene adaptability and detail performance are enhanced.

Benefits of technology

Real-time rendering of 4K ultra-high-definition images in resource-constrained environments reduces bandwidth requirements and latency while improving visual effects and processing efficiency, making it suitable for fields such as digital twins, simulation, and cloud gaming.

✦ Generated by Eureka AI based on patent content.

Figure CN120856947B_ABST

Abstract

The application discloses a 4K ultra-high-definition real-time cloud streaming and rendering method and system. An overfitted super-resolution model is trained in the cloud for each virtual scene and transmitted to the client, so that only low-quality images need to be transmitted, which greatly reduces bandwidth demand and latency. The client uses the lightweight super-resolution model to process the low-quality images and generate 4K output images, overcoming the performance bottleneck of low-spec devices so that 4K ultra-high-definition real-time rendering can be achieved in resource-constrained environments. In addition, the application enhances scene adaptability and detail representation through multi-scale gradient loss and a dynamic spatial attention mechanism, avoiding the blurred details of general-purpose algorithms. It thereby provides visual quality and processing efficiency far exceeding traditional methods in specific virtual scenes, significantly improving the user experience in fields such as digital twins, simulation, and cloud gaming.

Description

Technical Field

[0001] This invention relates to the field of high-definition image rendering, specifically to a 4K ultra-high-definition real-time cloud streaming and rendering method and system.

Background Technology

[0002] In fields such as digital twins, simulation, and cloud gaming, achieving real-time rendering of 4K ultra-high-definition images is crucial for enhancing user experience. For example, in digital twin technology, by accurately reproducing objects and environments in the physical world, engineers can be provided with a virtual platform for product design, testing, and optimization. This not only accelerates the R&D cycle but also reduces experimental costs. In simulation, whether it's medical surgery simulation or complex urban planning, high-definition image quality enhances immersion, making users feel as if they are operating and making decisions in a real environment. In cloud gaming, a smooth experience at 4K resolution allows players to enjoy the ultimate game visuals, enhancing entertainment and engagement. Furthermore, high-quality video streaming can be applied to scenarios such as distance education and virtual tourism, greatly enriching the forms of content presentation and interactive methods.

[0003] Current technologies face numerous challenges in transmitting 4K ultra-high-definition image quality from high-end cloud devices to low-end devices. First, directly transmitting 4K video streams consumes significant network bandwidth, which is impractical for most users' internet connections due to high data transmission costs and potential latency issues that negatively impact user experience. Conversely, transmitting lower-resolution footage first and then upscaling it on the local device requires powerful GPU support to complete the conversion from low resolution to 4K. However, many low-end devices lack the hardware capabilities for this. Even with appropriate software solutions, complex network structures and numerous parameters slow down processing speeds, failing to meet the demands of real-time video processing. Furthermore, because the requirements for image detail vary across different scenarios, general-purpose super-resolution algorithms struggle to achieve ideal visual effects in all situations, particularly in edge detail and texture representation.

Summary of the Invention

[0004] To address the aforementioned technical challenges, this invention proposes a 4K ultra-high-definition real-time cloud streaming and rendering method and system. The method constructs a lightweight single-image super-resolution network model for low-spec devices, using spatial attention and efficient bidirectional channel attention mechanisms to significantly reduce the computational burden while maintaining high reconstruction quality. This approach is particularly suitable for resource-constrained or time-sensitive application environments, such as real-time video playback on mobile devices. To further optimize performance, an overfitted super-resolution model is trained for each virtual scene. This allows lower-quality video to be transmitted to the client and fewer model parameters to be used, thereby accelerating transmission and improving the quality of the final output. However, overfitting also weakens the model's generalization ability and limits its applicability to the specific virtual scene, so a customized super-resolution model is needed for each application scenario. Although this limits the applicability of each model, targeted design allows it to provide visual effects and processing efficiency far exceeding traditional methods in its specific scenario.

[0005] To achieve the above objectives, the present invention provides the following solution:

[0006] A method for real-time cloud streaming and rendering of 4K ultra-high definition includes the following steps:

[0007] For each virtual scene, a super-resolution model is built in the cloud, and the super-resolution model and label data are stored in the cloud model library;

[0008] Synchronously download the super-resolution models from the cloud model library to the client model library;

[0009] The rendering frames are processed in the cloud to generate low-quality images, which are then encoded and transmitted to the client via WebRTC.

[0010] The client receives low-quality images, processes them using super-resolution models from the client's model library to generate 4K output images, and then displays them.

[0011] Preferably, the steps for building a super-resolution model using the cloud include:

[0012] After uploading a new virtual scene to the cloud, training data is generated from the rendering frames. The training data includes low-quality images and corresponding 4K ultra-high-definition images.

[0013] Using the generated training data, train a super-resolution model for each virtual scene until the preset training stopping condition is met.

[0014] Preferably, the super-resolution model is updated using a knowledge distillation update method based on parameter differences, the steps of which include:

[0015] The new model $\theta_B$ is extracted from the cloud and compared with the baseline model $\theta_A$ to compute the raw difference parameters:

[0016] $\Delta\theta_{\mathrm{raw}} = \theta_B - \theta_A$

[0017] Then, sparsity filtering is performed:

[0018] $\Delta\theta = \mathcal{T}(\Delta\theta_{\mathrm{raw}})$

[0019] where $\mathcal{T}(\cdot)$ denotes the threshold filtering function, and $\Delta\theta$ denotes the difference parameters between the new model and the baseline model;

[0020] On the client, the difference parameters $\Delta\theta$ are fused into the baseline model $\theta_A$ to obtain the enhanced model:

[0021] $\theta'_A = \theta_A + \Delta\theta$
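The difference-based update above can be illustrated with a minimal Python sketch. It assumes a hard magnitude threshold as the filtering function and represents each model as a dictionary of flat parameter lists; the function names, the threshold value `tau`, and the toy parameters are hypothetical and are not taken from the patent.

```python
def sparse_delta(theta_a, theta_b, tau):
    """Compute theta_B - theta_A per parameter and zero out small entries.

    The hard threshold |d| >= tau is an assumed form of the filtering function.
    """
    delta = {}
    for name, base in theta_a.items():
        raw = [b - a for a, b in zip(base, theta_b[name])]
        delta[name] = [d if abs(d) >= tau else 0.0 for d in raw]
    return delta


def apply_delta(theta_a, delta):
    """Fuse the sparse difference into the baseline model: theta_A + delta."""
    return {
        name: [a + d for a, d in zip(base, delta[name])]
        for name, base in theta_a.items()
    }


# Toy parameters (hypothetical): one layer with three weights.
theta_a = {"conv1": [0.10, -0.20, 0.30]}
theta_b = {"conv1": [0.11, 0.30, 0.30]}

delta = sparse_delta(theta_a, theta_b, tau=0.05)
theta_a_new = apply_delta(theta_a, delta)
```

Only the second weight changes by more than the threshold, so only that entry would need to be sent to the client.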

[0022] And a distillation loss function is used for constraint:

[0023]

[0024] where $x$ is the input low-quality image, $F_\theta(\cdot)$ is the forward inference output of the super-resolution model, $p_\theta(x)$ is the feature distribution of the model output at an intermediate layer, $D_{KL}(p\,\|\,q)$ denotes the KL divergence, $\alpha_{\mathrm{consis}}$ is the weighting coefficient of the output consistency loss, and $\beta_{\mathrm{feature}}$ is the weighting coefficient of the feature distribution alignment loss;
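As a rough illustration of how the two weighted terms described above could be combined, the sketch below assumes a mean squared error for the output consistency term and a KL divergence between softmax-normalized intermediate features for the alignment term. These specific forms, the direction of the KL divergence, and all names and default weights are assumptions made for the example, not the patent's formula.

```python
import math


def softmax(values):
    """Normalize a feature vector into a probability distribution."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distill_loss(out_student, out_teacher, feat_student, feat_teacher,
                 alpha_consis=1.0, beta_feature=0.1):
    """Weighted sum of output consistency and feature distribution alignment."""
    consistency = sum(
        (s - t) ** 2 for s, t in zip(out_student, out_teacher)
    ) / len(out_student)
    alignment = kl_divergence(softmax(feat_teacher), softmax(feat_student))
    return alpha_consis * consistency + beta_feature * alignment


# Toy outputs and intermediate features (hypothetical values).
out_a, out_b = [0.2, 0.4], [0.2, 0.5]
feat_a, feat_b = [1.0, 2.0, 3.0], [1.0, 2.5, 2.0]

loss_same = distill_loss(out_a, out_a, feat_a, feat_a)
loss_diff = distill_loss(out_a, out_b, feat_a, feat_b)
```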

[0025] Finally, the priority of model updates is defined by access frequency and model change intensity, and the model to be updated is selected based on the priority. The priority calculation method is as follows:

[0026]

[0027] where $S_{\mathrm{Rate}}$ is the access frequency of the scene on the client side, $\mathrm{MAX}_{\mathrm{Rate}}$ is the maximum access frequency threshold set by the system, and $\|\Delta\theta\|_F$ is the Frobenius norm of the difference parameter matrix.

[0028] Preferably, the overall network structure of the super-resolution model updated in the client includes: spatial and channel integrated attention module, depthwise separable convolutional layer, pixel shuffling layer and bilinear mechanism;

[0029] The spatial and channel integrated attention module contains several attention blocks, each combining a spatial attention mechanism and an efficient bidirectional channel attention mechanism. The spatial attention mechanism is used to capture pixel-level relationships in the feature map, while the bidirectional channel attention mechanism uses forward and backward convolutions to extract importance information between channels.

[0030] The depthwise separable convolutional layer is used to extract and transform features;

[0031] The pixel shuffling layer is used for image upsampling;

[0032] The bilinear mechanism is used to assist in the upsampling operation.

[0033] Preferably, before the super-resolution model in the client model library processes low-quality images, preprocessing is required, including the following steps:

[0034] For static virtual scenes, a lightweight classification network is used to determine the complexity of the input image and select a shallow or deep path; for simple images, some network modules are skipped; for complex images, all network modules are activated.

[0035] For dynamic virtual scenes, the complexity of the dynamic scene is determined by designing a neural network.

[0036] Preferably, the steps for determining the complexity of a dynamic scene by designing a neural network include:

[0037] First, a feature extraction network is used to extract scene complexity representations from the input frame in real time:

[0038] $z = g_\phi(I_{LR})$

[0039] where $I_{LR} \in \mathbb{R}^{C \times H \times W}$ is the low-resolution input image; $C$, $H$, and $W$ are the number of channels, height, and width of the low-resolution image, respectively; $g_\phi$ is a lightweight network consisting of three depthwise separable convolution layers; and $z$ is a 128-dimensional latent code that represents the complexity of the scene.

[0040] Then, the convolution kernel parameters of the neural network are dynamically adjusted based on z:

[0041] $\Delta W_k = \mathrm{Reshape}(\mathrm{Linear}_k(z))$

[0042] $W'_k = W_k + \alpha_W \cdot \Delta W_k$

[0043] where $W_k$ is the convolution kernel of the $k$-th layer of the network, $\Delta W_k$ is the weight offset of the $k$-th convolution kernel, Reshape reshapes a vector into a convolution kernel tensor, $\mathrm{Linear}_k$ is a fully connected layer that takes the 128-dimensional code as input and outputs a vector whose size is determined by the numbers of input and output feature channels of the $k$-th layer, and $\alpha_W$ is a learnable scaling factor;
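The kernel adjustment above can be sketched in plain Python with nested lists. The layer sizes, the random weights, and the helper names below are illustrative assumptions; in particular, the output length of the fully connected layer is assumed to equal the number of elements in the kernel.

```python
import random


def linear(z, weight, bias):
    """Fully connected layer; weight has shape [out][in]."""
    return [sum(w * x for w, x in zip(row, z)) + b for row, b in zip(weight, bias)]


def reshape(flat, c_out, c_in, k):
    """Reshape a flat vector into a [c_out][c_in][k][k] kernel tensor."""
    assert len(flat) == c_out * c_in * k * k
    it = iter(flat)
    return [[[[next(it) for _ in range(k)] for _ in range(k)]
             for _ in range(c_in)] for _ in range(c_out)]


def adjust_kernel(W, z, fc_weight, fc_bias, alpha_w):
    """Return W' = W + alpha_w * Reshape(Linear(z))."""
    c_out, c_in, k = len(W), len(W[0]), len(W[0][0])
    dW = reshape(linear(z, fc_weight, fc_bias), c_out, c_in, k)
    return [[[[W[o][i][r][c] + alpha_w * dW[o][i][r][c] for c in range(k)]
              for r in range(k)] for i in range(c_in)] for o in range(c_out)]


# Toy setup (hypothetical sizes): a 128-dimensional code and a 2x1x3x3 kernel.
rng = random.Random(0)
c_out, c_in, k, z_dim = 2, 1, 3, 128
n_out = c_out * c_in * k * k
z = [rng.uniform(-1, 1) for _ in range(z_dim)]
fc_weight = [[rng.uniform(-0.01, 0.01) for _ in range(z_dim)] for _ in range(n_out)]
fc_bias = [0.0] * n_out
W = [[[[0.5] * k for _ in range(k)] for _ in range(c_in)] for _ in range(c_out)]

W_adjusted = adjust_kernel(W, z, fc_weight, fc_bias, alpha_w=0.1)
W_unchanged = adjust_kernel(W, z, fc_weight, fc_bias, alpha_w=0.0)
```

Setting the scaling factor to zero leaves the kernel untouched, which makes the static kernel a special case of the dynamic one.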

[0044] Dynamically adjust the convolution kernel size of the neural network according to the complexity of the scene:

[0045] $W_{\mathrm{dynamic}} = \mathrm{FC}_1(z) + \mathrm{FC}_2(\mathrm{GAP}(X))$

[0046]

[0047] where $\mathrm{GAP}(X)$ is the global average pooling of the feature map $X$, $\mathrm{FC}_1$ and $\mathrm{FC}_2$ are fully connected layers that fuse the latent code and the features, $\mathrm{FC}_3$ is a fully connected layer that outputs a scalar, $\sigma$ is the sigmoid activation function, $\lfloor\cdot\rfloor$ denotes rounding down, and $K_{\mathrm{size}}$ is the dynamic convolution kernel size;

[0048] Finally, the loss function is adjusted, and the total loss function is:

[0049]

[0050] where the loss terms denote the multi-scale MSE loss, the perceptual loss, the edge enhancement loss, and a regularization constraint, and λ1, λ2, and λ3 are weight coefficients.

[0051] Preferably, spatial attention mechanisms include:

[0052] $F_{\mathrm{spatial}} = \sigma(\mathrm{Conv1D}(\mathrm{DepthwiseConv}(X, W_{\mathrm{dynamic}}) + X))$

[0053] $W_{\mathrm{dynamic}} = \mathrm{FC}(\mathrm{GAP}(X))$

[0054] where $W_{\mathrm{dynamic}}$ denotes the dynamic convolution kernel weights, $\mathrm{GAP}(X)$ denotes global average pooling of the input feature map $X$, FC denotes a fully connected layer, and DepthwiseConv denotes depthwise separable convolution. $\mathrm{Conv1D}(\mathrm{DepthwiseConv}(X, W_{\mathrm{dynamic}}) + X)$ combines the original features with the dynamically convolved features and fuses multi-scale information through a 1×1 convolution, and $\sigma$ is the sigmoid activation function.

[0055] Preferably, the efficient bidirectional channel attention mechanism generates global channel weights by combining forward and backward attention:

[0056] $X_{\mathrm{channel}} = \sigma(\mathrm{FC}(\mathrm{concat}(F_{\mathrm{forward}}, F_{\mathrm{backward}})))$

[0057] where $F_{\mathrm{forward}}$ is the forward channel attention, $F_{\mathrm{backward}}$ is the backward channel attention, $\sigma$ is the sigmoid activation function, FC is a fully connected layer, and concat is the channel concatenation operation.

[0058] The present invention also provides a 4K ultra-high-definition real-time cloud streaming and rendering system, the system being used to implement the above method, comprising: a cloud module, a transmission module, a client module, and a generation module;

[0059] The cloud module is used to construct super-resolution models and store the super-resolution models and label data in the cloud model library;

[0060] The transmission module is used to synchronously download the super-resolution models in the cloud model library to the client module;

[0061] The generation module is used to process the rendering frame using the cloud module to generate a low-quality image, which is then encoded and transmitted to the client via WebRTC.

[0062] The client module is used to receive low-quality images, process the low-quality images using the super-resolution model to generate 4K output images, and then display them.

[0063] Preferably, the process of constructing the super-resolution model includes:

[0064] After the cloud module uploads a new virtual scene, it generates training data from the rendering frame. The training data includes low-quality images and corresponding 4K ultra-high-definition images.

[0065] Using the generated training data, train a super-resolution model for each virtual scene until the preset training stopping condition is met.

[0066] Compared with the prior art, the beneficial effects of the present invention are as follows:

[0067] This invention trains an overfitted super-resolution model for each virtual scene in the cloud and transmits it to the client. This requires transmitting only low-quality images, significantly reducing bandwidth requirements and latency. Simultaneously, the client uses a lightweight super-resolution model to process the low-quality images and generate 4K output images, overcoming the performance bottleneck of low-end devices and enabling real-time rendering of 4K ultra-high-definition images in resource-constrained environments. Furthermore, this invention enhances scene adaptability and detail representation through mechanisms such as multi-scale gradient loss and dynamic spatial attention, avoiding the detail blurring problem of general algorithms. This provides visual effects and processing efficiency far exceeding traditional methods in specific virtual scenes, significantly improving the user experience in fields such as digital twins, simulation, and cloud gaming.

Description of the Drawings

[0068] To more clearly illustrate the technical solution of the present invention, the drawings used in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0069] Figure 1 is a flowchart of the method of the present invention.

Detailed Implementation

[0070] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0071] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0072] The technical terms used in this invention will be explained below.

[0073] Cloud: High-performance computing devices.

[0074] Client: Devices with low computing power, such as mobile phones.

[0075] Rendered frames: Frames obtained through real-time rendering by the application running in the "cloud" (for a video, the video is played in real time in the "cloud"; for an application such as a game, these are the frames rendered in real time). After a series of operations in the "cloud", the rendered frames yield the training data, which mainly consists of low-quality images. If a super-resolution model is to be trained in the "cloud", the 4K ultra-high-definition images corresponding to the low-quality images are also required.

[0076] Example 1

[0077] This embodiment provides a method for real-time cloud streaming and rendering of 4K ultra-high definition, the steps of which include:

[0078] S1. For each virtual scene, construct a super-resolution model using the cloud and store the super-resolution model and label data in the cloud model library.

[0079] Upload a new virtual scene to the "cloud" (e.g., for interactive games, or a video). The "cloud" then processes the data from "rendering frames" to "generating training data," at which point a corresponding 4K ultra-high-definition image is required (for training the super-resolution model).

[0080] Specifically, the "cloud" employs video encoding, classifying video frames into keyframes (I-frames), prediction frames (P-frames), and bidirectional prediction frames (B-frames). P-frames and B-frames are reconstructed from keyframes, thus having lower bitrates, while keyframes have higher bitrates and contain complete image information. P-frames are frames encoded using forward predictive coding of keyframes, reducing redundant information and improving compression ratio. B-frames are frames bidirectionally encoded using keyframes before and after them, further reducing redundant information and achieving an even higher compression ratio. On the server side, the raw video is encoded using a standard video encoder (such as H.264) to reduce bandwidth usage.

[0081] Then, through video segmentation, using a variable-length video segmentation method, the video is divided into multiple segments by detecting visually significant changes between consecutive frames (when the difference between the current frame and the previous frame exceeds a predefined threshold, a new segment is started), and each segment is represented by a keyframe.
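A minimal Python sketch of the variable-length segmentation rule described above follows. It assumes the mean absolute pixel difference as the change measure and uses the first frame of each segment as its keyframe; both choices, along with the threshold value and the toy frames, are assumptions made for illustration.

```python
def mean_abs_diff(frame_a, frame_b):
    """Average absolute pixel difference between two flattened frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def segment_video(frames, threshold):
    """Split frames into variable-length segments.

    Returns a list of (start, end_exclusive, keyframe_index) tuples.
    A new segment starts whenever the difference to the previous frame
    exceeds the threshold; the first frame of a segment is its keyframe.
    """
    if not frames:
        return []
    segments = []
    start = 0
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[i - 1]) > threshold:
            segments.append((start, i, start))
            start = i
    segments.append((start, len(frames), start))
    return segments


# Toy frames (hypothetical): each frame is a flat list of four pixel values.
frames = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [200, 200, 200, 200],
    [201, 200, 200, 200],
    [10, 10, 10, 10],
]
segments = segment_video(frames, threshold=50)
```

With these toy frames, the two large jumps produce three segments, each represented by its first frame.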

[0082] To extract high-level features from keyframes, a variational autoencoder (VAE) is used for feature extraction on each keyframe. The VAE consists of an encoder and a decoder, learning the mapping from the input image to the latent space and the mapping from the latent vectors back to the original image by minimizing the reconstruction error between the input and reconstructed data. The encoder maps the input image to the distribution in the latent space, and the decoder maps the latent vectors back to the reconstructed image. Through training, the VAE can effectively capture high-level features of the image. The loss function of the VAE consists of a reconstruction error term and a regularization term.

[0083] Finally, a super-resolution model is trained for each virtual scene using the generated "training data". Because the models in this embodiment are all overfitted, a separate super-resolution model must be trained for each virtual scene; transmitting this scene-specific model from the "cloud" to the "client" is what allows only low-quality images to be streamed, greatly reducing bandwidth and latency. The trained super-resolution model, along with label data (such as ID, virtual scene name, update time, and version number), is stored in the "cloud model library". The specific steps are as follows:

[0084] The K-means algorithm is used to cluster the extracted latent feature vectors. Specifically, it iteratively searches for the centroid that minimizes the inertia within the cluster, thus grouping visually similar video clips into the same group.

[0085] First, the silhouette coefficient is used to determine the number of clusters. The silhouette coefficient measures how well a data point matches its own cluster and how well it is separated from other clusters; a larger silhouette coefficient indicates a better match between the data point and its own cluster. Relying solely on the silhouette coefficient may lead to suboptimal clustering performance, so the constraints of the minimum working model are also incorporated when searching for the number of clusters. With the silhouette coefficient as the evaluation metric, the number of clusters with the largest silhouette coefficient is selected as the optimal solution $K^{*}$:

[0086] $K^{*} = \arg\max_{K \in \{1, \dots, 10\}} \mathrm{Silhouette}(K)$

[0087] where $K$ is the candidate number of clusters, ranging from 1 to 10; $\arg\max$ selects, from the set of candidate cluster numbers, the value of $K$ that maximizes the objective function $\mathrm{Silhouette}(K)$; and $\mathrm{Silhouette}(K)$ is the indicator for evaluating the quality of the clustering result, namely the silhouette coefficient computed when the number of clusters is $K$.
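The selection of the cluster count by maximum silhouette coefficient can be sketched with a small, dependency-free K-means. The deterministic initialization, the toy 2-D points standing in for latent vectors, and the candidate range starting at 2 (the silhouette coefficient is not defined for a single cluster) are assumptions made for this example.

```python
import math


def kmeans(points, k, iters=50):
    """Tiny K-means with a deterministic init spread over the input order."""
    centroids = [points[(i * len(points)) // k] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient over all points."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        return -1.0  # undefined for one cluster; treated as worst case here
    scores = []
    for i, p in enumerate(points):
        same = [math.dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(same) / len(same)
        b = min(
            sum(math.dist(p, q) for j, q in enumerate(points) if labels[j] == c)
            / sum(1 for lab in labels if lab == c)
            for c in clusters if c != labels[i]
        )
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)


def best_k(points, k_values):
    """Return the K with the largest silhouette coefficient."""
    return max(k_values, key=lambda k: silhouette(points, kmeans(points, k)))


# Toy latent vectors (hypothetical): three well-separated groups.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
k_star = best_k(points, range(2, 6))
```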

[0088] For each cluster, keyframes of all segments within that cluster are extracted, and their low-quality versions and corresponding high-quality frames are used as training data to train the super-resolution model. It's important to note that because super-resolution models are prone to overfitting, to enhance their adaptability, this embodiment trains a separate super-resolution model for each virtual scene. The trained super-resolution models are then labeled and stored in a "cloud model library."

[0089] S2. Synchronously download the super-resolution models from the cloud model library to the client model library.

[0090] The "client" downloads installation packages from the "cloud" (such as application installers, game installers, and video players) and simultaneously downloads the "cloud model library" into the "client model library". This avoids downloading models from the cloud to the client in real time, thereby reducing bandwidth usage and latency.

[0091] The overall network structure of the super-resolution model consists of the following parts: a spatial and channel integrated attention module, convolutional layers, pixel shuffling layers, and a bilinear mechanism. The core part is the spatial and channel integrated attention module, which integrates spatial attention and an efficient bidirectional channel attention mechanism. It can capture the relationships between pixels and the dependencies between channels in the feature map, thereby enhancing the feature representation.

[0092] Spatial and Channel Integrated Attention Module: This module contains multiple attention blocks, each combining spatial attention mechanisms and efficient bidirectional channel attention mechanisms. These attention blocks improve feature selectivity by dynamically adjusting channel importance, thereby enhancing image reconstruction quality while maintaining efficiency. Specifically, the spatial attention module captures pixel-level relationships in the feature map, while the bidirectional channel attention module utilizes forward and backward convolutions to extract inter-channel importance information.

[0093] The purpose of spatial attention mechanisms is to capture the relationships between pixels in the spatial dimension of feature maps, thereby enhancing the model's ability to perceive image details. The specific steps are as follows:

[0094] Input feature map: the input feature map is $X \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of channels, $H$ is the height of the feature map, and $W$ is the width of the feature map.

[0095] Spatial attention convolution:

[0096] $F_{\mathrm{spatial}} = \sigma(\mathrm{Conv1D}(\mathrm{DepthwiseConv}(X, W_{\mathrm{dynamic}}) + X))$

[0097] $W_{\mathrm{dynamic}} = \mathrm{FC}(\mathrm{GAP}(X))$

[0098] where $W_{\mathrm{dynamic}}$ denotes the dynamic convolution kernel weights. $\mathrm{GAP}(X)$ applies global average pooling to the input feature map, producing C×1×1 channel descriptors that are used to dynamically generate convolution kernel parameters. FC is a fully connected layer (with ReLU activation in the middle) that generates weight offsets for the 3×3 depthwise convolution kernels to enhance the adaptability of spatial features. DepthwiseConv denotes depthwise separable convolution, using a 3×3 depthwise convolution and a 1×1 pointwise convolution combined with dynamic kernel weight adjustment (that is, part of the kernel parameters are generated dynamically from the input features) to enhance local feature capture. Residual connections are used to avoid vanishing gradients while the parameter count stays low, ensuring high reconstruction quality despite the lightweight design. $\mathrm{Conv1D}(\mathrm{DepthwiseConv}(X, W_{\mathrm{dynamic}}) + X)$ combines the original features with the dynamically convolved features and fuses multi-scale information through a 1×1 convolution. $\sigma$ is the sigmoid activation function, which limits the output to between 0 and 1, representing the importance of each pixel position: the closer the value is to 1, the more important the pixel is in the feature map.

[0099] Applying spatial attention: the generated spatial attention map $F_{\mathrm{spatial}}$ is multiplied pixel-by-pixel with the input feature map $X$ as a weighting operation, yielding $X_{\mathrm{aug}}$. This enhances or suppresses specific spatial locations in the feature map.

[0100] $X_{\mathrm{aug}} = X \times F_{\mathrm{spatial}}$

[0101] The purpose of efficient bidirectional channel attention mechanisms is to capture the dependencies between channels in the channel dimension of feature maps, thereby enhancing the model's ability to perceive important information in feature maps.

[0102] Adaptive average pooling: for the input feature map $X$, adaptive average pooling dynamically adjusts the size of the pooling window to generate an output feature map $F$ of a specified size.

[0103]

[0104] where $H_{in}$ and $W_{in}$ are the height and width of the input feature map $X$; $H_{out}$ and $W_{out}$ are the target height and target width of the output feature map $F$; $h_{pool}$ and $w_{pool}$ are the theoretical pooling window height and width; and $h_{step}$ and $w_{step}$ are the actual pooling window step sizes, that is, the steps in the input height and width directions when the pooling window moves from one output position to the next in the actual computation.
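To make the idea of an input-dependent pooling window concrete, the sketch below implements adaptive average pooling on a single-channel map. It assumes the window bounds commonly used by deep learning frameworks (start at floor(i·H_in/H_out), end at ceil((i+1)·H_in/H_out)); this is an assumed convention and may differ from the window and step definitions used in the patent.

```python
import math


def adaptive_avg_pool2d(x, h_out, w_out):
    """Pool a [H_in][W_in] nested list down to [h_out][w_out].

    Window bounds follow an assumed floor/ceil convention, so the window
    size adapts to the ratio between input and output sizes.
    """
    h_in, w_in = len(x), len(x[0])
    out = []
    for i in range(h_out):
        r0 = (i * h_in) // h_out
        r1 = math.ceil((i + 1) * h_in / h_out)
        row = []
        for j in range(w_out):
            c0 = (j * w_in) // w_out
            c1 = math.ceil((j + 1) * w_in / w_out)
            window = [x[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(window) / len(window))
        out.append(row)
    return out


# Toy 4x4 feature map (hypothetical values).
x = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled_2x2 = adaptive_avg_pool2d(x, 2, 2)
pooled_1x1 = adaptive_avg_pool2d(x, 1, 1)
```

Pooling to 1×1 reduces to global average pooling, which is the case that produces one descriptor per channel.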

[0105] This pooling method can flexibly extract features and is invariant to changes in the size of the feature map, thus compressing the input feature map into a smaller size.

[0106] Forward channel attention: Performs 1D convolution operations in the channel dimension. By learning the interactions between channels, it can capture the dependencies between channels, thereby dynamically adjusting the importance weight of each channel.

[0107] $F_{\mathrm{forward}} = \mathrm{Conv1D}(F[:, h, w])$

[0108] where Conv1D is a 1D convolution with a kernel size of 3, $F[:, h, w]$ is the input feature, and $F_{\mathrm{forward}}$ is the forward channel attention.

[0109] Backward channel attention: To capture the relationships between channels more comprehensively, backward channel attention is performed in parallel. First, the pooled feature maps are flipped (i.e., flipped along the last dimension). Then, a 1D convolution is applied to the flipped feature maps, allowing the model to learn the dependencies between channels from the reversed channel order.

[0110] $F_{\mathrm{backward}} = \sigma(\mathrm{Conv1D}(\mathrm{flip}(F[:, h, w])))$

[0111] where $F_{\mathrm{backward}}$ is the backward channel attention, $\sigma$ is the sigmoid activation function, and flip denotes the flipping operation on the feature map.

[0112] Combining forward and backward attention: After concatenating the bidirectional attention weights, global channel weights are generated through a lightweight fully connected layer (compression ratio of 8).

[0113] $X_{\mathrm{channel}} = \sigma(\mathrm{FC}(\mathrm{concat}(F_{\mathrm{forward}}, F_{\mathrm{backward}})))$

[0114] where $\sigma$ is the sigmoid activation function, FC is a fully connected layer, and concat is the channel concatenation operation.
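The forward, backward, and fusion steps of the bidirectional channel attention can be sketched on a pooled per-channel descriptor. The sketch assumes the feature map has already been pooled to one value per channel, uses zero padding for the size-3 convolution, and places the compression ratio of 8 in a two-layer bottleneck with ReLU; the random weights and all names are hypothetical.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def conv1d_same(seq, kernel):
    """1D convolution along the channel axis with zero padding."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [sum(k * padded[i + t] for t, k in enumerate(kernel))
            for i in range(len(seq))]


def dense(vec, weight):
    """Fully connected layer without bias; weight has shape [out][in]."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def channel_attention(desc, k_fwd, k_bwd, w1, w2):
    """Return per-channel weights from forward and backward attention."""
    f_forward = conv1d_same(desc, k_fwd)
    # Backward branch: flip the channel order, convolve, then apply sigmoid.
    f_backward = [sigmoid(v) for v in conv1d_same(desc[::-1], k_bwd)]
    fused = f_forward + f_backward            # concat, length 2C
    hidden = [max(0.0, h) for h in dense(fused, w1)]  # bottleneck with ReLU
    return [sigmoid(v) for v in dense(hidden, w2)]


# Toy setup (hypothetical): 8 channels, bottleneck width 2C / 8 = 2.
rng = random.Random(0)
C = 8
hidden_dim = (2 * C) // 8
desc = [rng.uniform(0, 1) for _ in range(C)]
k_fwd = [rng.uniform(-1, 1) for _ in range(3)]
k_bwd = [rng.uniform(-1, 1) for _ in range(3)]
w1 = [[rng.uniform(-1, 1) for _ in range(2 * C)] for _ in range(hidden_dim)]
w2 = [[rng.uniform(-1, 1) for _ in range(hidden_dim)] for _ in range(C)]

x_channel = channel_attention(desc, k_fwd, k_bwd, w1, w2)
```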

[0115] Combining spatial attention and channel attention, we finally obtain:

[0116] $X_{\mathrm{output}} = X_{\mathrm{aug}} \times X_{\mathrm{channel}}$

[0117] where $X_{\mathrm{output}}$ is the enhanced feature map after dual modulation by spatial attention and channel attention, and $X_{\mathrm{channel}}$ is the global channel weight.

[0118] Depthwise separable convolutional layers: These layers are used to extract and transform features, reducing the number of parameters compared to regular convolutional layers. They pass feature maps between attention modules and learn spatial features of the image through convolutional operations. Convolutional layers are typically accompanied by batch normalization and non-linear activation functions (such as ReLU) to stabilize the training process and introduce non-linearity.

[0119] Pixel Shuffling Layer: The pixel shuffling layer is used for image upsampling. It rearranges the spatial and channel dimensions of the feature map, thereby increasing the spatial resolution of the image. In this method, the pixel shuffling layer is applied to the tail module of the network to generate the final super-resolution image.
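The rearrangement performed by a pixel shuffling layer can be shown with a short sketch. It follows the widely used convention in which an input of shape [C·r²][H][W] becomes an output of shape [C][H·r][W·r]; the channel ordering is that common convention, assumed here rather than taken from the patent.

```python
def pixel_shuffle(x, r):
    """Rearrange [C*r*r][H][W] into [C][H*r][W*r] for upscale factor r."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    assert c_in % (r * r) == 0, "channel count must be divisible by r*r"
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(h * r):
            for j in range(w * r):
                # The sub-pixel offset (i % r, j % r) selects the source channel.
                src = c * r * r + (i % r) * r + (j % r)
                out[c][i][j] = x[src][i // r][j // r]
    return out


# Toy input (hypothetical): four 1x1 channels become one 2x2 channel.
x = [[[0]], [[1]], [[2]], [[3]]]
y = pixel_shuffle(x, 2)
```

Because the layer only moves values between the channel and spatial axes, it adds no parameters, which suits the lightweight tail module described above.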

[0120] Bilinear Mechanism: The bilinear mechanism is used as an auxiliary means of upsampling. It enlarges the low-resolution image to the target size through bilinear interpolation and then combines it with features processed by the attention module to generate a higher-quality super-resolution image.

[0121] Loss function: The loss function is used to measure the difference between the model output and the target image. By optimizing the loss function, the model can learn how to minimize this difference, thereby improving the reconstruction quality of the super-resolution image.

[0122] Mean Squared Error (MSE): Squaring the per-pixel difference between the model's output image and the target image amplifies large deviations, so the model pays more attention to larger errors.

[0123] $L_{MSE} = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\left(\hat{X}(i,j) - X(i,j)\right)^2$

[0124] where $\hat{X}$ is the output image of the model, $X$ is the target image, $H$ and $W$ are the height and width of the image, and $(i, j)$ is the position of a pixel, uniquely determined by the row index $i$ and the column index $j$.

[0125] Multi-scale MSE loss: MSE is calculated in the 2× and 4× upsampling stages to enhance detail recovery.

[0126] $L_{ms} = \alpha \cdot \mathrm{MSE}(\hat{Y}_{2\times}, Y_{2\times}) + \beta \cdot \mathrm{MSE}(\hat{Y}_{4\times}, Y_{4\times})$

[0127] where $\alpha$ and $\beta$ are weighting coefficients that control the contribution of the MSE loss at the 2× and 4× upsampling stages to the total loss, respectively; $Y_{2\times}$ and $Y_{4\times}$ are the target (real) images at the 2× and 4× upsampling stages; and $\hat{Y}_{2\times}$ and $\hat{Y}_{4\times}$ are the predicted output images of the model at the 2× and 4× upsampling stages.
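As a non-limiting illustration, the MSE and its multi-scale weighting over the 2× and 4× upsampling stages can be sketched in plain Python for single-channel images. The default values of alpha and beta and all names are assumptions; the source does not give numeric values for these weights.

```python
def mse(pred, target):
    """Mean squared error between two H x W images given as nested lists."""
    h, w = len(target), len(target[0])
    total = sum((pred[i][j] - target[i][j]) ** 2
                for i in range(h) for j in range(w))
    return total / (h * w)


def multi_scale_mse(pred_2x, target_2x, pred_4x, target_4x,
                    alpha=0.5, beta=0.5):
    """Weighted sum of the MSE at the 2x stage and the MSE at the 4x stage.

    alpha and beta are hypothetical defaults chosen for this sketch.
    """
    return alpha * mse(pred_2x, target_2x) + beta * mse(pred_4x, target_4x)
```

Supervising the intermediate 2× output as well as the final 4× output gives the first upsampling stage its own error signal instead of relying only on gradients that pass through the second stage.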

[0128] Perceptual loss: Features are extracted using a pre-trained lightweight MobileNetV2, and feature map differences are calculated.

[0129]

[0130] where $\phi_l(Y)$ is the layer-$l$ feature map obtained when the real high-definition image $Y$ is input into the pre-trained lightweight MobileNetV2 network, and $\phi_l(X)$ is the layer-$l$ feature map obtained when the model output image $X$ is input into the same network.

[0131] Edge enhancement loss: Extract image edges using the Sobel operator and calculate the MSE of the edge regions.

[0132] $L_{edge} = \mathrm{MSE}(\mathrm{Sobel}(Y), \mathrm{Sobel}(X))$

[0133] where $\mathrm{Sobel}(Y)$ is the edge intensity map obtained by applying the Sobel operator to the real high-resolution image $Y$, and $\mathrm{Sobel}(X)$ is the edge intensity map obtained by applying the Sobel operator to the model's output image $X$.
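As a non-limiting illustration, an edge enhancement loss of this kind can be sketched in plain Python for single-channel images. Using the gradient magnitude as the edge intensity, and evaluating only interior pixels without padding, are assumptions of this sketch, as are all names.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel(img):
    """Edge intensity map of an H x W image; output is (H-2) x (W-2)."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(1, h - 1):
        row = []
        for x in range(1, w - 1):
            gx = sum(SOBEL_X[a][b] * img[y - 1 + a][x - 1 + b]
                     for a in range(3) for b in range(3))
            gy = sum(SOBEL_Y[a][b] * img[y - 1 + a][x - 1 + b]
                     for a in range(3) for b in range(3))
            row.append(math.hypot(gx, gy))  # gradient magnitude
        out.append(row)
    return out


def edge_loss(pred, target):
    """MSE between the Sobel edge maps of the prediction and the target."""
    ep, et = sobel(pred), sobel(target)
    n = len(et) * len(et[0])
    return sum((ep[i][j] - et[i][j]) ** 2
               for i in range(len(et)) for j in range(len(et[0]))) / n
```

A flat prediction compared with a target containing a sharp vertical step yields a large loss, which is the behavior that penalizes blurred edges.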

[0134] Joint loss function:

[0135]

[0136] where λ1, λ2 and λ3 are weight coefficients, which can be adaptively adjusted according to the training stage (e.g., the MSE loss dominates in the initial stage, and the weight of the perceptual loss is increased in the later stage). The joint loss also includes a regularization loss term.

[0137] S3. Utilize the cloud to process the rendered frames to generate low-quality images, which are then encoded and transmitted to the client via WebRTC.

[0138] Low-quality images output from the cloud need to be encoded (e.g., H.264) before being transmitted to the client. Generally, when installing a cloud application, models from the cloud model library are downloaded to the client's model library. However, if models in the cloud model library need updating, models of a specific scene need to be transferred from the cloud to the client's model library. In short, model transfer is a low-frequency event, and most of the time, it involves transmitting low-quality images. The transmission process uses the mature WebRTC technology.

[0139] After receiving a "low-quality image," the client first performs video decoding before proceeding with subsequent operations. Once the client receives a video clip and its corresponding miniature super-resolution model, it uses a video decoder to decode the keyframes in the video clip and temporarily stores them in a Decoded Picture Buffer (DPB) for use in decoding subsequent P-frames and B-frames.

[0140] Keyframe Super-Resolution: After decoding the keyframes, the decoding process of subsequent P-frames and B-frames is paused, and super-resolution processing is performed on the keyframes in the DPB. Since the keyframes in the DPB are in YUV format by default, while the miniature SR model accepts RGB format, format conversion is required first. After converting the YUV format keyframes to RGB format, the super-resolution model is used to enhance the quality of the keyframes.

[0141] Decode other frames: Convert the enhanced keyframes back to YUV format and resume the decoding process so that P-frames and B-frames can be decoded with reference to the enhanced keyframes.
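As a non-limiting illustration, the keyframe path (YUV to RGB, super-resolution, RGB back to YUV) can be sketched per pixel in plain Python. The source does not name a conversion matrix; the full-range BT.601 coefficients below are an assumption, and `sr_model` is a hypothetical callable standing in for the miniature super-resolution model. Real decoded frames are stored as separate, usually chroma-subsampled planes, and the model changes the resolution, neither of which is modeled here.

```python
def _clip(v):
    """Round and clamp to the 8-bit range."""
    return max(0, min(255, int(round(v))))


def yuv_to_rgb(y, u, v):
    """Full-range BT.601 YUV -> RGB (assumed matrix)."""
    d, e = u - 128, v - 128
    return (_clip(y + 1.402 * e),
            _clip(y - 0.344136 * d - 0.714136 * e),
            _clip(y + 1.772 * d))


def rgb_to_yuv(r, g, b):
    """Full-range BT.601 RGB -> YUV (assumed matrix)."""
    return (_clip(0.299 * r + 0.587 * g + 0.114 * b),
            _clip(-0.168736 * r - 0.331264 * g + 0.5 * b + 128),
            _clip(0.5 * r - 0.418688 * g - 0.081312 * b + 128))


def enhance_keyframe(yuv_pixels, sr_model):
    """Convert a keyframe to RGB, enhance it, and convert it back to YUV.

    `sr_model` is hypothetical: any callable mapping a list of RGB tuples
    to a list of RGB tuples.
    """
    rgb = [yuv_to_rgb(*p) for p in yuv_pixels]
    enhanced = sr_model(rgb)
    return [rgb_to_yuv(*p) for p in enhanced]
```

Writing the enhanced keyframe back in YUV is what allows the unmodified decoder to keep using it as the reference picture for subsequent P-frames and B-frames.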

[0142] If the model version in the "cloud model library" is updated, it needs to be updated in the "client model library" via the network. If the "cloud" application adds a new virtual scene, the newly trained model needs to be synchronously transferred to the "client model library". Except for these two cases, the "cloud" transmits "low-quality images" to the "client" at other times, thus significantly reducing bandwidth requirements and lowering latency.

[0143] Model updates require a full transfer of the overfitted model for the new scene. When the virtual scene is frequently updated, full model transfer can lead to bandwidth bottlenecks. Therefore, this method employs a knowledge distillation update method based on parameter differences. The specific steps are as follows:

[0144] Cloud-based parameter difference extraction: In the cloud, the parameters that differ significantly between the new model $\theta_B$ and the cloud baseline model $\theta_A$ are extracted (here $\theta_A$ is the model stored in the cloud model library).

[0145] (1) Calculation of differences:

[0146] $\Delta\theta_{raw} = \theta_B - \theta_A$

[0147] where $\theta_A$ is the parameter vector of the baseline model and $\theta_B$ is the parameter vector of the new model.

[0148] (2) Sparsification filtering:

[0149] $\Delta\theta = T_{\lambda}(\Delta\theta_{raw})$

[0150] where $T_{\lambda}$ is a threshold filtering function that retains only the parameters that change significantly; the dynamic threshold $\lambda$ can be set to the 95th percentile of $|\Delta\theta_{raw}|$.

[0151] Client-side knowledge distillation and fusion: The difference parameter $\Delta\theta$ is fused into the local baseline model $\theta_A$ (here $\theta_A$ is the model stored in the client model library; the cloud $\theta_A$ and the client $\theta_A$ are the same model and differ only in storage location), yielding the enhanced model:

[0152] $\theta'_A = \theta_A + \Delta\theta$.
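As a non-limiting illustration, the difference calculation, the percentile-based sparsification, and the client-side fusion can be sketched in plain Python over flat parameter lists. The nearest-rank percentile, the use of "greater than or equal to" at the threshold, the sparse index-to-value representation, and all names are assumptions of this sketch.

```python
def percentile_nearest_rank(values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, -(-q * len(ordered) // 100))  # integer ceiling of q*n/100
    return ordered[rank - 1]


def sparse_delta(theta_a, theta_b, q=95):
    """Cloud side: keep only the entries of theta_B - theta_A whose
    magnitude reaches the q-th percentile of |delta_raw|."""
    raw = [b - a for a, b in zip(theta_a, theta_b)]
    lam = percentile_nearest_rank([abs(d) for d in raw], q)
    return {i: d for i, d in enumerate(raw) if abs(d) >= lam}


def fuse(theta_a, delta):
    """Client side: theta'_A = theta_A + delta, applied to sparse entries."""
    fused = list(theta_a)
    for i, d in delta.items():
        fused[i] += d
    return fused
```

With a 95th-percentile threshold only about 5% of the parameters are transmitted, which is the source of the bandwidth saving compared with a full model transfer. The distillation step that follows fusion is not shown.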

[0153] To improve the performance of the fusion model, a distillation loss function is used for constraint (only updating the parameters corresponding to Δθ):

[0154]

[0155] where $x$ is the input low-quality image (received from the cloud); $f_\theta(\cdot)$ is the forward inference output of the super-resolution model; $p_\theta(x)$ is the feature distribution output by the model at the intermediate layer; $D_{KL}(p\|q)$ is the KL divergence, which measures the difference between distributions $p$ and $q$; $\alpha_{consis}$ is the weighting coefficient of the output consistency loss (e.g., $\alpha_{consis} = 0.7$); and $\beta_{feature}$ is the weighting coefficient of the feature distribution alignment loss (e.g., $\beta_{feature} = 0.3$).

[0156] Dynamic bandwidth allocation strategy: The priority of a model update is defined by the access frequency and the intensity of the model change, and the models to update are selected by priority (specifically, if the real-time bandwidth is below a bandwidth threshold, only the difference parameters Δθ with a priority greater than 0.8 are transmitted; otherwise, all updated parameters are transmitted). The priority is calculated as follows:

[0157]

[0158] where $\Delta\theta$ is the difference parameter between the new model and the baseline model; $S_{Rate}$ is the client-side access frequency of the scene; $MAX_{Rate}$ is the maximum access frequency threshold set by the system; $\|\Delta\theta\|_F$ is the Frobenius norm of the difference parameter matrix, used to measure the strength of the model change; and w1 and w2 are weight coefficients (in this embodiment, w1 = 0.6 and w2 = 0.4).
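As a non-limiting illustration, a priority score and the bandwidth gate can be sketched in plain Python. The priority formula itself is not reproduced in this text, so the weighted sum below, the clamping to [0, 1], and the normalization constant `norm_cap` are hypothetical choices made only for this sketch; the weights 0.6 and 0.4 and the cutoff 0.8 come from the description.

```python
import math


def frobenius_norm(delta):
    """Frobenius norm of a flattened difference parameter list."""
    return math.sqrt(sum(d * d for d in delta))


def update_priority(s_rate, max_rate, delta, norm_cap, w1=0.6, w2=0.4):
    """Assumed form: weighted sum of normalized access frequency and
    normalized change strength. `norm_cap` is a hypothetical constant."""
    freq = min(s_rate / max_rate, 1.0)
    change = min(frobenius_norm(delta) / norm_cap, 1.0)
    return w1 * freq + w2 * change


def select_updates(priorities, bandwidth, bandwidth_threshold, cutoff=0.8):
    """Below the bandwidth threshold, send only updates whose priority
    exceeds the cutoff; otherwise send every pending update.

    `priorities` maps a scene identifier to its priority score.
    """
    if bandwidth < bandwidth_threshold:
        return [scene for scene, p in priorities.items() if p > cutoff]
    return list(priorities)
```

Under constrained bandwidth this favors scenes that are visited often and whose models changed substantially, deferring the rest until bandwidth recovers.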

[0159] S4. The client receives a low-quality image, processes it using a super-resolution model from the client's model library to generate a 4K output image, and then displays it.

[0160] Before using a super-resolution model to process images, preprocessing is required. A lightweight classification network determines the complexity of the input image (such as texture richness) and selects shallow or deep paths accordingly. For simple images, some network modules are skipped; for complex images, all network modules are activated to enhance reconstruction capabilities.

[0161] For dynamic virtual scenes (such as sudden changes in lighting, rapid movement of objects, etc.), this embodiment designs a neural network to determine the complexity of the dynamic scene.

[0162] First, a feature extraction network is used to extract scene complexity representations from the input frames in real time:

[0163] $z = \phi(I_{LR}), \quad I_{LR} \in \mathbb{R}^{C \times H \times W}, \quad z \in \mathbb{R}^{128}$

[0164] where $I_{LR}$ is the input low-resolution image (YUV format); $C$, $H$, and $W$ are the number of channels, height, and width of the low-resolution image, respectively; $\phi$ is a lightweight network consisting of 3 depthwise separable convolution layers; and $z$ is a 128-dimensional implicit code that represents the complexity of the scene.

[0165] The lightweight network structure consisting of the above 3 depthwise separable convolutions is as follows:

[0166] First layer: $\mathrm{Conv}_{dw}$ (3×3, stride=2), output 64 channels;

[0167] Second layer: $\mathrm{Conv}_{dw}$ (3×3, stride=2), output 128 channels;

[0168] Third layer: $\mathrm{Conv}_{dw}$ (3×3, stride=1), followed by a global average pooling layer and a fully connected layer that outputs 128 channels.
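As a non-limiting illustration, the tensor shapes produced by the three layers listed above can be traced in plain Python. The padding of 1, the 128 output channels of the third convolution, the example input size, and all names are assumptions; the source only gives kernel sizes, strides, and the channel counts stated above.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_complexity_net(c, h, w):
    """Return (layer name, output shape) pairs for the 3-layer network."""
    shapes = [("input", (c, h, w))]
    layers = (("conv_dw1", 64, 2), ("conv_dw2", 128, 2), ("conv_dw3", 128, 1))
    for name, out_c, stride in layers:
        h = conv_out(h, stride=stride)
        w = conv_out(w, stride=stride)
        shapes.append((name, (out_c, h, w)))
    shapes.append(("global_avg_pool", (128,)))  # one value per channel
    shapes.append(("fc", (128,)))               # the implicit code z
    return shapes
```

For a hypothetical 3×540×960 input, the two stride-2 layers reduce the spatial size to 135×240 before pooling, so most of the computation runs on a map with one sixteenth of the input pixels.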

[0169] Then, the convolution kernel parameters are dynamically adjusted according to z (applied only to 20% of key layers, such as attention modules).

[0170] $\Delta W_k = \mathrm{Reshape}(\mathrm{Linear}_k(z))$

[0171] $W'_k = W_k + \alpha_W \cdot \Delta W_k$

[0172] where $W_k$ is the convolution kernel of the k-th layer of the network; $\Delta W_k$ is the weight offset of the k-th convolution kernel; $W'_k$ is the dynamically adjusted weight of the k-th layer convolution kernel; Reshape reshapes the vector into a convolution kernel tensor; $\mathrm{Linear}_k$ is a fully connected layer that takes the 128-dimensional features as input and outputs ..., where the two channel counts are the numbers of channels of the input and output features of the k-th layer; and $\alpha_W$ is a learnable scaling factor (initial value $\alpha_W = 0.01$).
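As a non-limiting illustration, the kernel adjustment (a linear projection of z, reshaped to the kernel's shape and added with a small scale) can be sketched in plain Python. The output size of the fully connected layer is elided in this text; the sketch assumes it equals the number of kernel entries (output channels × input channels × k × k). All names are hypothetical.

```python
def linear(z, weight, bias):
    """Fully connected layer; `weight` holds one row per output value."""
    return [sum(w * v for w, v in zip(row, z)) + b
            for row, b in zip(weight, bias)]


def reshape_kernel(flat, c_out, c_in, k):
    """Reshape a flat vector into a c_out x c_in x k x k nested list."""
    if len(flat) != c_out * c_in * k * k:
        raise ValueError("vector length does not match kernel shape")
    it = iter(flat)
    return [[[[next(it) for _ in range(k)] for _ in range(k)]
             for _ in range(c_in)] for _ in range(c_out)]


def adjust_kernel(w_k, z, lin_w, lin_b, alpha_w=0.01):
    """W'_k = W_k + alpha_W * Reshape(Linear_k(z)); returns a new kernel."""
    c_out, c_in, k = len(w_k), len(w_k[0]), len(w_k[0][0])
    delta = reshape_kernel(linear(z, lin_w, lin_b), c_out, c_in, k)
    return [[[[w_k[o][i][a][b] + alpha_w * delta[o][i][a][b]
               for b in range(k)] for a in range(k)]
             for i in range(c_in)] for o in range(c_out)]
```

Starting the scaling factor at a small value such as 0.01 keeps the adjusted kernel close to the trained kernel early on, so the scene-dependent offset acts as a gentle correction rather than a replacement.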

[0173] Adaptive adjustment of convolution kernel size: Dynamically adjust the convolution kernel size according to the complexity of the scene.

[0174] $W_{dynamic} = \mathrm{FC1}(z) + \mathrm{FC2}(\mathrm{GAP}(X))$

[0175]

[0176] where $W_{dynamic}$ is the weight offset of the dynamic convolution kernel; $K_{size}$ is the dynamically adjusted convolution kernel size (such as 3×3, 5×5, or 7×7); $\mathrm{GAP}(X)$ is global average pooling of the feature map $X$; FC1 and FC2 are fully connected layers that fuse the implicit code and the features; FC3 is a fully connected layer that outputs a scalar; $\sigma$ is the sigmoid activation function (output in [0, 1]); and $\lfloor\cdot\rfloor$ denotes rounding down. A 3×3 convolution kernel is used for simple scenes and a 7×7 convolution kernel for complex scenes.

[0177] Adjust the loss function: Add a regularization constraint to the original loss function.

[0178]

[0179] where $\Delta W_k$ is the weight offset of the k-th convolution kernel, $\|\cdot\|_F$ is the Frobenius norm, and $\gamma$ is the regularization coefficient ($\gamma = 10^{-5}$).

[0180] The total loss function is:

[0181]

[0182] After preprocessing, the input image first passes through a "hybrid module," which consists of convolutional layers and batch normalization layers and performs initial feature extraction and transformation in preparation for the subsequent attention stages.

[0183] The output of the hybrid module is fed into the "Spatial and Channel Integrated Attention Module," where it is processed by multiple attention blocks. Within each attention block, the spatial and channel attention mechanisms operate on the feature map to capture crucial information in both the spatial and channel dimensions. Note that, based on the "image complexity determination," a simple image requires only one attention block before two-stage upsampling produces the final "4K output image." A complex image (for example, a virtual scene with numerous objects or complex lighting conditions) requires multiple attention blocks (a minimum of 3N, as shown in the architecture diagram), connected by skip connections, before two-stage upsampling outputs the final "4K output image."

[0184] The feature maps processed by the attention module then enter the "hybrid module," which, like the previous one, consists of convolutional layers and batch normalization layers for further feature extraction and transformation. During this process, non-linear activation functions (such as ReLU) are applied to the output of the convolutional layers to introduce non-linear characteristics.

[0185] To ensure stable training and preserve critical information, the entire network employs residual connections. These residual connections directly pass the output of the previous layer to subsequent layers, thereby helping gradients flow better during training and avoiding the vanishing gradient problem.

[0186] Upsampling: A two-stage progressive upsampling process is used. The first stage employs pixel shuffle for a 2× upsampling to generate intermediate high-resolution features. The second stage combines bilinear interpolation and 1×1 convolution to enhance the details of the intermediate features, generating the final 4K output. The flowchart of this embodiment is shown in Figure 1.

[0187] Example 2

[0188] This embodiment also provides a 4K ultra-high-definition real-time cloud streaming and rendering system, including: a cloud module, a transmission module, a client module, and a generation module; the cloud module is used to construct a super-resolution model and store the super-resolution model and label data in a cloud model library; the transmission module is used to synchronously download the super-resolution model in the cloud model library to the client module; the generation module is used to process the rendering frames using the cloud module to generate low-quality images, which are then encoded and transmitted to the client via WebRTC; the client module is used to receive the low-quality images, process them using the super-resolution model to generate 4K output images, and display them.

[0189] The process of building a super-resolution model includes: after a new virtual scene is uploaded to the cloud module, training data is generated from the rendered frames. The training data includes low-quality images and corresponding 4K ultra-high-definition images; using the generated training data, a super-resolution model is trained for each virtual scene until a preset training stopping condition is met.

[0190] The embodiments described above are merely preferred embodiments of the present invention and are not intended to limit the scope of the present invention. Various modifications and improvements made to the technical solutions of the present invention by those skilled in the art without departing from the spirit of the present invention should fall within the protection scope defined by the claims of the present invention.

Claims

1. A 4K ultra-high-definition real-time cloud streaming and rendering method, characterized in that it includes the following steps: for each virtual scene, a super-resolution model is built in the cloud, and the super-resolution model and label data are stored in the cloud model library; the super-resolution models are synchronously downloaded from the cloud model library to the client model library; the rendering frames are processed in the cloud to generate low-quality images, which are then encoded and transmitted to the client via WebRTC; the client receives the low-quality images, processes them using the super-resolution model from the client model library to generate 4K output images, and then displays them. The super-resolution model is updated using a knowledge distillation update method based on parameter differences, whose steps include: in the cloud, a difference calculation is performed between the new model $\theta_B$ and the baseline model $\theta_A$ to obtain the difference parameters: $\Delta\theta_{raw} = \theta_B - \theta_A$; sparsification filtering is then performed with a threshold filtering function to obtain $\Delta\theta$, the difference parameters between the new model and the baseline model; in the client, the difference parameters $\Delta\theta$ are fused into the baseline model $\theta_A$ to obtain the enhanced model: $\theta'_A = \theta_A + \Delta\theta$; and a distillation loss function is used for constraint, where $x$ is the low-quality input image, $f_\theta(\cdot)$ is the forward inference output of the super-resolution model, $p_\theta(x)$ is the feature distribution output by the model at the intermediate layer, $D_{KL}$ denotes the KL divergence, $\alpha_{consis}$ is the weighting coefficient of the output consistency loss, and $\beta_{feature}$ is the weighting coefficient of the feature distribution alignment loss; finally, the priority of model updates is defined by access frequency and model change intensity, and the model to be updated is selected based on the priority, where, in the priority calculation, $S_{Rate}$ is the client-side access frequency of the scene, $MAX_{Rate}$ is the maximum access frequency threshold set by the system, and $\|\Delta\theta\|_F$ is the Frobenius norm of the difference parameter matrix. The overall network structure of the super-resolution model updated in the client includes: a spatial and channel integrated attention module, a depthwise separable convolutional layer, a pixel shuffling layer, and a bilinear mechanism; the spatial and channel integrated attention module contains several attention blocks, each combining a spatial attention mechanism and an efficient bidirectional channel attention mechanism; the spatial attention mechanism is used to capture pixel-level relationships in the feature map, while the bidirectional channel attention mechanism uses forward and backward convolutions to extract importance information between channels; the depthwise separable convolutional layer is used to extract and transform features; the pixel shuffling layer is used for image upsampling; the bilinear mechanism is used to assist the upsampling operation. Before the super-resolution model in the client model library processes low-quality images, preprocessing is required, whose steps include: for static virtual scenes, a lightweight classification network is used to determine the complexity of the input image and select a shallow or deep path, where for simple images some network modules are skipped and for complex images all network modules are activated; for dynamic virtual scenes, the complexity of the dynamic scene is determined by a designed neural network.

2. The 4K ultra-high-definition real-time cloud streaming and rendering method according to claim 1, characterized in that the steps for building a super-resolution model in the cloud include: after a new virtual scene is uploaded to the cloud, training data is generated from the rendering frames, the training data including low-quality images and corresponding 4K ultra-high-definition images; using the generated training data, a super-resolution model is trained for each virtual scene until the preset training stopping condition is met.

3. The 4K ultra-high-definition real-time cloud streaming and rendering method according to claim 1, characterized in that the steps of designing a neural network to determine the complexity of a dynamic scene include: first, a feature extraction network is used to extract a scene complexity representation from the input frame in real time, where $I_{LR}$ is the low-resolution input image; $C$, $H$ and $W$ are the number of channels, height, and width of the low-resolution image, respectively; $\phi$ is a lightweight network consisting of 3 depthwise separable convolutional layers; and $z$ is a 128-dimensional implicit code that represents the complexity of the scene; then, according to $z$, the convolution kernel parameters of the neural network are dynamically adjusted: $\Delta W_k = \mathrm{Reshape}(\mathrm{Linear}_k(z))$, $W'_k = W_k + \alpha_W \cdot \Delta W_k$, where $W_k$ is the convolution kernel of the k-th layer of the network; $\Delta W_k$ is the weight offset of the k-th convolution kernel; Reshape means reshaping the vector into a convolution kernel tensor; $\mathrm{Linear}_k$ is a fully connected layer that takes 128-dimensional features as input and outputs ..., where the two channel counts are the numbers of channels of the input and output features of the k-th layer; and $\alpha_W$ is a learnable scaling factor; the convolution kernel size of the neural network is dynamically adjusted according to the complexity of the scene: $W_{dynamic} = \mathrm{FC1}(z) + \mathrm{FC2}(\mathrm{GAP}(X))$, where $\mathrm{GAP}(X)$ is global average pooling of the feature map $X$; FC1 and FC2 are fully connected layers that fuse the implicit code and the features; FC3 is a fully connected layer that outputs a scalar; $\sigma$ is the Sigmoid activation function; $\lfloor\cdot\rfloor$ denotes rounding down; and $K_{size}$ is the dynamic convolution kernel size; finally, the loss function is adjusted so that the total loss function combines the multi-scale MSE loss, the perceptual loss, the edge enhancement loss, and the regularization constraint using weighting coefficients.

4. The 4K ultra-high-definition real-time cloud streaming and rendering method according to claim 1, characterized in that the spatial attention mechanism is defined in terms of: the dynamic convolution kernel weights; global average pooling performed on the input feature map; a fully connected layer; depthwise separable convolution; concatenation of the original features with the features obtained from dynamic convolution; a ... convolution that fuses the multi-scale information; and the Sigmoid activation function.

5. The 4K ultra-high-definition real-time cloud streaming and rendering method according to claim 1, characterized in that the efficient bidirectional channel attention mechanism generates global channel weights by combining forward and backward attention: $X_{channel} = \sigma(\mathrm{FC}(\mathrm{concat}(F_{forward}, F_{backward})))$, where $F_{forward}$ is the forward channel attention, $F_{backward}$ is the backward channel attention, $\sigma$ is the Sigmoid activation function, FC is a fully connected layer, and concat is the channel concatenation operation.

6. A 4K ultra-high-definition real-time cloud streaming and rendering system, the system being used to implement the method according to any one of claims 1-5, characterized in that it includes: a cloud module, a transmission module, a client module, and a generation module; the cloud module is used to construct super-resolution models and store the super-resolution models and label data in the cloud model library; the transmission module is used to synchronously download the super-resolution models in the cloud model library to the client module; the generation module is used to process the rendering frames using the cloud module to generate low-quality images, which are then encoded and transmitted to the client via WebRTC; the client module is used to receive the low-quality images, process them using the super-resolution model to generate 4K output images, and then display them.

7. The 4K ultra-high-definition real-time cloud streaming and rendering system according to claim 6, characterized in that the process of constructing the super-resolution model includes: after a new virtual scene is uploaded to the cloud module, training data is generated from the rendering frames, the training data including low-quality images and corresponding 4K ultra-high-definition images; using the generated training data, a super-resolution model is trained for each virtual scene until the preset training stopping condition is met.