A T2X satellite-ground cooperative image reconstruction method, device and equipment

By masking and feature fusion of vehicle images and using an entropy model for encoding, the problem of image transmission under the limitation of satellite communication bandwidth was solved, and high-quality image reconstruction was achieved under the constraint of extremely low data volume.

CN120302015B · Active · Publication Date: 2026-06-19 · TSINGHUA UNIVERSITY +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
TSINGHUA UNIVERSITY
Filing Date
2025-03-27
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In areas with insufficient cellular network coverage, satellite communication bandwidth resources are limited, which severely restricts the amount of raw image data transmitted back for vehicle driver monitoring services, affecting the real-time performance and effectiveness of data transmission.

Method used

By adding a mask to the target region in the original image, global features and mask features are extracted and fused in the channel and spatial dimensions. An entropy model is used to generate a probability distribution for encoding, and a bit stream is sent for image reconstruction.

Benefits of technology

Under the constraint of extremely low data volume, the local peak signal-to-noise ratio of the reconstructed image was improved, information redundancy was reduced, and the clarity of the reconstructed image and model performance were enhanced.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN120302015B_ABST

Abstract

This application provides an image reconstruction method, apparatus, and device for T2X satellite-ground collaboration. The method involves adding a mask to the target region of the original image to obtain a mask image. Feature extraction is performed on both the original image and the mask image to obtain global features and mask features, respectively. The global features and mask features are fused in the channel dimension and spatial dimension, respectively. The fusion results are then fused in the channel and spatial dimensions according to fusion weights to obtain the latent space representation of the original image. The latent space representation is quantized to obtain a quantized latent space representation. This quantized latent space representation is encoded based on a probability distribution generated by an entropy model and sent to the receiving end for decoding to obtain the reconstructed image. By using a mask to sparsify the original image, the sparsity of the extracted latent space features is enhanced, reducing the amount of latent space feature data. This improves the local peak signal-to-noise ratio of the reconstructed image under extremely low data constraints, thereby enhancing the clarity of the reconstructed image at extremely low bit rates.

Description

Technical Field

[0001] This application relates to the field of image processing technology, and in particular to an image reconstruction method, apparatus and equipment for T2X satellite-ground collaboration.

Background Technology

[0002] In existing vehicle transportation safety management systems, real-time monitoring of abnormal driver behavior has become a core element in ensuring freight safety. However, in mountainous areas and other regions with insufficient cellular network coverage, signal blind spots are prone to occur, leading to data transmission interruptions or delays. This makes it impossible to provide seamless network services for vehicles solely relying on cellular networks. In this context, satellite communication, as a technology capable of providing all-weather communication services, combined with cellular networks, can effectively compensate for this deficiency and provide seamless network coverage for vehicles.

[0003] Nevertheless, when vehicles are in areas where cellular networks are unavailable, data transmission can rely only on satellite communication, whose bandwidth resources are extremely limited. This severely restricts the amount of raw image data that can be transmitted back for vehicle driver monitoring. A contradiction therefore arises between the high bandwidth requirements of vehicle driver monitoring and the low bandwidth of satellite communication, which imposes an extremely low data volume constraint. This contradiction directly affects the real-time performance and effectiveness of data transmission.

Summary of the Invention

[0004] In view of this, this application provides an image reconstruction method, apparatus and device for T2X satellite-ground cooperation to address the shortcomings of related technologies.

[0005] In a first aspect of this application, an image reconstruction method for T2X satellite-ground collaboration is provided, the method comprising:

[0006] A mask is added to the target region in the original image to obtain a mask image. Feature extraction is then performed on the original image and the mask image to obtain global features and mask features, respectively.

[0007] The global features and the mask features are fused in the channel dimension and the spatial dimension, respectively, and the fusion results are fused in the channel dimension and the spatial dimension according to the fusion weight to obtain the latent space representation of the original image.

[0008] The latent space representation is quantized to obtain a quantized latent space representation, and the quantized latent space representation and the latent space representation are input into a preset entropy model to obtain the probability distribution of the quantized latent space representation;

[0009] The quantized latent space representation is encoded based on the probability distribution, and the encoding result is sent to the receiving end in the form of a bit stream so that the receiving end can decode the bit stream to obtain the reconstructed image.

[0010] According to one embodiment of this application, the original image is a driver monitoring image, and the method further includes:

[0011] The original image is semantically segmented, and the image region with the semantic category of "non-driver" is determined as the target region.

[0012] According to one embodiment of this application, the fusion of the global features and the mask features in the channel dimension and the spatial dimension, respectively, includes:

[0013] The global feature and the mask feature are concatenated and adjusted in the channel dimension to obtain the first feature;

[0014] Local spatial features are extracted from the first feature, and location encoding information is added.

[0015] According to one embodiment of this application, the step of inputting the quantized latent space representation and the latent space representation into a preset entropy model to obtain the probability distribution of the quantized latent space representation includes:

[0016] The quantized latent space representation is divided into multiple slices along the channel dimension;

[0017] The multiple slices and the latent space representation are input into the entropy model, so that the entropy model sequentially processes each slice to obtain the probability distribution of each slice.

[0018] The encoding of the quantized latent space representation based on the probability distribution includes:

[0019] For each slice, the slice is encoded based on the probability distribution of the slice.

[0020] According to one embodiment of this application, based on the input latent space representation, hyperprior context information is generated through a hyperprior network;

[0021] For the current input slice, based on the current slice and all slices input before the current slice, the context network generates the channel context information, local context information, intra-slice global context information, and inter-slice global context information of the current slice.

[0022] Based on the aforementioned hyperprior context information, as well as the current slice's channel context information, local context information, intra-slice global context information, and inter-slice global context information, the probability prediction subnetwork g_ep is used to generate the probability distribution of the current slice.

[0023] According to one embodiment of this application, sending the encoding result to the receiving end in the form of a bitstream, so that the receiving end can decode the bitstream to obtain a reconstructed image, includes:

[0024] The encoding result is sent to the receiving end in the form of a bit stream, so that the receiving end decodes each of the plurality of slices sequentially based on the same probability distribution as the sending end, reconstructs the quantized latent space representation, and generates a reconstructed image.

[0025] In a second aspect of this application, an image reconstruction apparatus for T2X satellite-ground collaboration is provided, the apparatus comprising:

[0026] The extraction unit is used to add a mask to the target region in the original image to obtain a mask image, and to extract features from the original image and the mask image respectively to obtain global features and mask features;

[0027] The fusion unit is used to fuse the global features and the mask features in the channel dimension and the spatial dimension, respectively, and to fuse the fusion results in the channel dimension and the spatial dimension according to the fusion weights to obtain the latent space representation of the original image.

[0028] The processing unit is configured to quantize the latent space representation to obtain a quantized latent space representation, and input the quantized latent space representation and the latent space representation into a preset entropy model to obtain the probability distribution of the quantized latent space representation;

[0029] The encoding unit is used to encode the quantized latent space representation based on the probability distribution and send the encoding result to the receiving end in the form of a bit stream so that the receiving end can decode the bit stream to obtain the reconstructed image.

[0030] In a third aspect of this application, an electronic device is provided, including a processor and a memory, the memory storing machine-executable instructions executable by the processor, the processor executing the machine-executable instructions to implement the steps of the method proposed in the above embodiments.

[0031] In a fourth aspect of this application, a machine-readable storage medium is provided, wherein machine-executable instructions are stored therein, and when executed by a processor, the machine-executable instructions implement the steps of the method proposed in the above embodiments.

[0032] In a fifth aspect of this application, a computer program product is provided, including a computer program / instructions that, when executed by a processor, implement the steps of the method proposed in the above embodiments.

[0033] As can be seen from the above technical solution, a mask image is obtained by adding a mask to the target region in the original image. Global features and mask features are then extracted from both the original image and the mask image. These global and mask features are fused along the channel and spatial dimensions, respectively. The fusion results are then fused along the channel and spatial dimensions according to fusion weights to obtain the latent space representation of the original image. The latent space representation is quantized to obtain a quantized latent space representation. The quantized latent space representation and the mask image are input into a predefined entropy model to obtain the probability distribution of the quantized latent space representation. The quantized latent space representation is then encoded based on the probability distribution, and the encoded result is sent to the receiving end in the form of a bitstream. The receiving end decodes the bitstream to obtain the reconstructed image. Masking sparsifies the original image, resulting in stronger sparsity of the extracted latent space features and reducing the amount of data for latent space features. This improves the local peak signal-to-noise ratio of the reconstructed image under extremely low data constraints, enhancing the clarity of the reconstructed image at extremely low bitrates. Furthermore, fusing features along the channel and spatial dimensions according to fusion weights reduces information redundancy caused by mask feature extraction, improving model performance.

[0034] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit this application.

Attached Figure Description

[0035] Figure 1 is a schematic flowchart of an image reconstruction method for T2X satellite-ground collaboration provided in an embodiment of this application;

[0036] Figure 2 is a schematic diagram of a truck-to-ground cooperative communication system provided in an embodiment of this application;

[0037] Figure 3 is a schematic flowchart of an image reconstruction method for T2X satellite-ground collaboration provided in an embodiment of this application;

[0038] Figure 4 is a performance comparison chart provided in an embodiment of this application;

[0039] Figure 5 is a visualization comparison chart of latent space features and bit allocation provided in an embodiment of this application;

[0040] Figure 6 is a visualization comparison chart of the latent space features and bit allocation of the five channels with the highest entropy provided in an embodiment of this application;

[0041] Figure 7 is a schematic diagram of the structure of an image reconstruction device for T2X satellite-ground collaboration provided in an embodiment of this application;

[0042] Figure 8 is a schematic diagram of the hardware structure of an electronic device illustrated in an exemplary embodiment of this application.

Detailed Implementation

[0043] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.

[0044] The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The singular forms “a,” “said,” and “the” used in this application and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise.

[0045] To enable those skilled in the art to better understand the technical solutions provided in the embodiments of this application, and to make the above-mentioned objectives, features and advantages of the embodiments of this application more apparent and understandable, the technical solutions in the embodiments of this application will be further described in detail below with reference to the accompanying drawings.

[0046] In the existing vehicle transportation safety management system, real-time monitoring of abnormal driver behavior has become a core element in ensuring freight safety. Taking trucks as an example, trucks generate continuous communication needs during road transportation. The communication needs of various truck services are defined as Truck-to-Everything (T2X). For instance, truck drivers need timely and reliable communication with transportation companies, customers, and other vehicles to ensure the safe arrival of goods at their destination. Simultaneously, truck drivers need to obtain real-time road condition information and accurate navigation and positioning information to improve transportation efficiency and reduce delays and losses. For freight transportation companies and customers, tracking and monitoring the location and status of goods is crucial. Trucks need to communicate and provide real-time feedback on the status and location of goods to ensure safe transportation. Logistics companies need to schedule, allocate, and manage transportation tasks, as well as manage drivers and vehicles. Trucks need to receive real-time tracking and management information for transportation tasks. Management departments need to monitor truck drivers' driving behavior in real time and intervene and warn them promptly for abnormal driving behavior.

[0047] However, in mountainous areas and other regions with insufficient cellular network coverage, signal blind spots can easily occur, leading to data transmission interruptions or delays. This makes it impossible to provide seamless network services for vehicles by relying solely on cellular networks. In this situation, satellite communication, as a technology capable of providing all-weather communication services, can effectively compensate for this deficiency when combined with cellular networks, providing seamless network coverage for vehicles.

[0048] Nevertheless, when vehicles are in areas where cellular networks are unavailable, data transmission can rely only on satellite communication, whose bandwidth resources are extremely limited. This severely restricts the amount of raw image data that can be transmitted back for vehicle driver monitoring. A contradiction therefore arises between the high bandwidth requirements of vehicle driver monitoring and the low bandwidth of satellite communication, which imposes an extremely low data volume constraint. This contradiction directly affects the real-time performance and effectiveness of data transmission.

[0049] In view of this, this application discloses an image reconstruction method for T2X satellite-ground collaboration to address the shortcomings of related technologies.

[0050] Figure 1 is a flowchart illustrating an image reconstruction method for T2X satellite-ground collaboration provided in an embodiment of this application. The method can be executed by a computing device. For example, the computing device can be an embedded system, a distributed computing node, or a smart terminal device with data processing capabilities. This application does not limit the type of computing device. For example, the computing device can be mounted on a vehicle, a drone, or another carrier. This application does not limit the deployment method of the computing device.

[0051] For example, this image reconstruction method for T2X satellite-to-ground collaboration can be applied in scenarios with extremely low data volume constraints, such as satellite-to-ground collaborative communication and UAV emergency communication scenarios. Take truck satellite-to-ground collaborative communication as an example. Figure 2 is a schematic diagram of a truck-to-ground collaborative communication system provided in an embodiment of this application. Its core architecture consists of a ground station (truck), a geostationary satellite, and a receiver. The truck, equipped with a satellite antenna, uploads image data for driver monitoring services to the geostationary satellite via a satellite link. The data is then relayed by the geostationary satellite to the receiver for image reconstruction. However, satellite communication bandwidth resources are typically extremely limited, so the high bandwidth requirements of truck driver monitoring services conflict with the low bandwidth of satellite communication, and the images must be transmitted under an extremely low data volume constraint.

[0052] This T2X satellite-ground collaborative image reconstruction method may include the following steps:

[0053] S201: Add a mask to the target region in the original image to obtain a mask image, and extract features from the original image and the mask image respectively to obtain global features and mask features.

[0054] A mask is added to the target region in the acquired original image to obtain a mask image. The original image and the mask image are then processed in parallel by two pipelines: one pipeline extracts features from the original image to obtain global features, and the other pipeline extracts features from the mask image to obtain mask features.

[0055] In some embodiments, semantic segmentation can be performed on the original image to determine the target region in the original image, and then a mask can be added to the target region in the original image to obtain a masked image. For example, the target region can be a specific region of interest or non-interest in the original image.

[0056] In some embodiments, the original image may be a vehicle driver monitoring image, and the target area may be a specific area in the original image that is not of interest, such as a non-driver area.

[0057] Specifically, semantic segmentation can be performed on vehicle driver monitoring images, classifying each pixel of the vehicle driver monitoring image into a specific semantic category, such as driver, window, steering wheel, etc. After semantic segmentation, it is identified which pixels belong to the driver semantic category, thereby classifying other pixels into the non-driver semantic category. The area where the pixels with the semantic category of driver are located is determined as the driver region, and the area where the pixels with the semantic category of non-driver are located is determined as the non-driver region.

[0058] After determining the non-driver region, a mask matrix can be generated based on the semantic classification results. In this mask matrix, the matrix element value of the driver region is 1, and the matrix element value of the non-driver region is 0. Then, the mask matrix and the vehicle driver monitoring image are multiplied element by element to obtain the mask image.
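The mask generation and element-wise multiplication described above can be sketched as follows. This is a minimal illustration using NumPy, not the patent's implementation: the label id `DRIVER`, the function name `apply_driver_mask`, and the toy segmentation output are all assumptions, and the semantic segmentation model itself is outside the sketch.

```python
import numpy as np

DRIVER = 1  # hypothetical label id for the "driver" semantic category


def apply_driver_mask(image, seg_labels, driver_label=DRIVER):
    """Zero out every pixel whose semantic label is not `driver_label`.

    image:      H x W x C array (driver monitoring image)
    seg_labels: H x W integer array produced by some segmentation model
    """
    # Mask matrix: 1 in the driver region, 0 in the non-driver region.
    mask = (seg_labels == driver_label).astype(image.dtype)
    # Element-wise multiplication, broadcast over the colour channels.
    masked = image * mask[..., None]
    return mask, masked


# Toy 2 x 3 RGB image in which the driver occupies the middle column.
image = np.full((2, 3, 3), 200, dtype=np.uint8)
seg_labels = np.array([[0, 1, 2],
                       [0, 1, 2]])
mask, masked = apply_driver_mask(image, seg_labels)
```

Pixels in the driver column keep their original values, while every other pixel becomes 0.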

[0059] In this embodiment, by adding a mask to the non-driver region, the original image is made sparsified, which in turn makes the latent space features sparsified, thereby reducing the amount of latent space feature data and improving model performance.

[0060] In some embodiments, for a model used to extract mask features, the image regions that the model can learn can be explicitly specified through mask constraints during the training phase of the model. This guides the model's learning behavior during the training phase, forces the model to learn the regional distribution information of the image, and learns which regions need more bits to represent and which regions can be allocated fewer bits, thereby optimizing the bit allocation regions and proportions of the image.

[0061] For example, in a driver monitoring image, it can be divided into a driver region and a non-driver region. A mask is added to the non-driver region, setting the pixel values of these regions to 0 (inactive state), while the pixel values of the driver region remain unchanged. Through the mask constraint, the model can only learn the features of the driver region during the training phase, while the features of the non-driver region are effectively masked, thus ensuring that the mask feature extraction is strictly limited to the area defined by the mask constraint.

[0062] For example, the above mask constraint can be flexibly configured at different layers of the model, such as the input layer or intermediate feature layers. For instance, in a convolutional layer, activation values in the masked (non-driver) region can be suppressed by multiplication. Specifically, the mask is multiplied with the feature map element-wise, so that feature values in the non-driver region are set to zero, thereby constraining the model's learning behavior.
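Applying the mask constraint at an intermediate feature layer can be sketched as below. The source does not say how the mask is matched to the lower resolution of a feature map, so the block-wise maximum in `downsample_mask` is an assumption, as are the function names and the downsampling factor.

```python
import numpy as np


def downsample_mask(mask, factor):
    """Reduce an H x W binary mask by `factor` using a block-wise maximum.

    Assumption: a feature location is kept if any pixel in its block
    belongs to the driver region.
    """
    h, w = mask.shape
    assert h % factor == 0 and w % factor == 0
    blocks = mask.reshape(h // factor, factor, w // factor, factor)
    return blocks.max(axis=(1, 3))


def constrain_features(features, mask, factor):
    """Zero the activations of a C x h x w feature map in the non-driver region."""
    small = downsample_mask(mask, factor)
    # The same spatial mask is broadcast across all channels.
    return features * small[None, :, :]


mask = np.zeros((4, 4), dtype=np.float32)
mask[:, 2:] = 1.0                        # driver occupies the right half
features = np.ones((8, 2, 2), dtype=np.float32)
constrained = constrain_features(features, mask, factor=2)
```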

[0063] S202: The global features and the mask features are fused in the channel dimension and the spatial dimension, respectively, and the fusion results are fused in the channel dimension and the spatial dimension according to the fusion weight to obtain the latent space representation of the original image.

[0064] Latent space representation is an intermediate representation generated by neural networks. It has a more compact dimension and is able to capture the key features of the data.

[0065] The lack of information interaction between global features extracted from the original image and mask features extracted from the mask image results in information redundancy between the two. In the embodiments of this application, global features and mask features are fused in the channel and spatial dimensions respectively. Then, based on the fusion weights, the fusion result is simultaneously fused in both the channel and spatial dimensions to obtain the latent space representation of the original image. This allows for adaptive feature fusion from both the channel and spatial dimensions, reducing information redundancy between features and improving model performance.

[0066] In some embodiments, the global feature and the mask feature are fused in the channel dimension and the spatial dimension, respectively, including: performing feature concatenation and dimensional adjustment on the channel dimension of the global feature and the mask feature to obtain a first feature; extracting local spatial features from the first feature and adding position encoding information.

[0067] Specifically, the global features and mask features can be concatenated along the channel dimension. This involves merging two feature tensors into a new feature tensor. During this process, the width (W) and height (H) of the feature tensor remain unchanged, while the channel counts (C) of the two tensors are summed. For example, the global features have a dimension of C1×H×W, and the mask features have a dimension of C2×H×W. Concatenating these two feature tensors along the channel dimension results in a new feature tensor with a dimension of (C1+C2)×H×W. This new feature tensor contains all channels of both the global and mask features. This allows the model to utilize information from both the global and mask features simultaneously, thereby improving its representational power and performance.

[0068] The concatenated features are dimensionally adjusted along the channel dimension to obtain the first feature. A 1x1 convolutional kernel can be used to process the concatenated features, adjusting the number of channels to a preset number M. Essentially, a 1x1 convolutional kernel linearly combines the channels at each spatial location (H×W). Since the kernel size is 1x1, it does not change the spatial dimensions of the feature map (i.e., height H and width W), but it fuses and reorganizes information from different channels along the channel dimension. To adjust the number of channels in the concatenated feature tensor to the preset number M, M 1x1 convolutional kernels can be used, resulting in a feature tensor with dimensions M×H×W.
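The channel concatenation and the 1x1 convolution can be sketched as follows. This is an illustrative NumPy version in which the weights are random rather than learned; the function name `concat_and_project` and the concrete sizes are assumptions.

```python
import numpy as np


def concat_and_project(global_feat, mask_feat, weight, bias=None):
    """Concatenate along channels, then apply a 1x1 convolution.

    global_feat: C1 x H x W
    mask_feat:   C2 x H x W
    weight:      M x (C1 + C2), one row per 1x1 kernel
    """
    stacked = np.concatenate([global_feat, mask_feat], axis=0)  # (C1+C2) x H x W
    # A 1x1 convolution is a linear combination of channels at each location.
    out = np.einsum("mc,chw->mhw", weight, stacked)
    if bias is not None:
        out = out + bias[:, None, None]
    return out


rng = np.random.default_rng(0)
C1, C2, M, H, W = 4, 3, 5, 6, 7
g = rng.standard_normal((C1, H, W))
m = rng.standard_normal((C2, H, W))
w = rng.standard_normal((M, C1 + C2))
first_feature = concat_and_project(g, m, w)   # M x H x W
```

The output keeps the spatial size H×W and has exactly M channels, matching the dimension M×H×W stated above.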

[0069] Local spatial features are extracted from the first feature. This can be achieved using 3×3 depthwise separable convolution. Depthwise separable convolution is an efficient convolution operation where each input channel is assigned a corresponding convolution kernel. Assuming the feature tensor has dimensions M×H×W, depthwise convolution uses M 3×3 kernels, each operating only on its corresponding input channel. Through these 3×3 kernels, local spatial information within a 3×3 neighborhood of each pixel can be captured, such as edges and corners.
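The depthwise step, in which each channel has its own 3×3 kernel, can be sketched as below. The sketch covers only the depthwise part with zero padding and stride 1; the kernel values are placeholders rather than learned weights, and the function name is an assumption.

```python
import numpy as np


def depthwise_conv3x3(x, kernels):
    """Depthwise 3x3 convolution with zero padding and stride 1.

    x:       M x H x W feature tensor
    kernels: M x 3 x 3, one kernel per channel
    """
    m, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))   # zero padding keeps H x W
    out = np.zeros(x.shape, dtype=float)
    for dy in range(3):
        for dx in range(3):
            # Each channel is combined only with its own kernel entry.
            out += kernels[:, dy, dx, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out


x = np.ones((2, 4, 4))
box = np.ones((2, 3, 3))            # placeholder kernels that sum the 3x3 neighbourhood
y = depthwise_conv3x3(x, box)
```

With these placeholder kernels, interior outputs sum nine input values, while corner outputs sum only four, because the rest of the neighbourhood falls in the zero padding.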

[0070] Positional encoding information is then added. It can be added to the features through the 1x1 convolution and / or the 3x3 depthwise separable convolution operations. For example, convolution operations can implicitly add positional encoding information through zero padding and boundary effects. The positional information for each spatial location can be absolute positional information (such as the row and column position of a pixel in an image) or relative positional information (such as the relative distance between a pixel and other pixels). In this embodiment, by adding positional encoding information, the model can more effectively process and understand spatial relationships.

[0071] In the embodiments of this application, after fusing the global features and the mask features in the channel dimension and the spatial dimension respectively, the fusion result will be fused in the channel dimension and the spatial dimension according to the fusion weight.

[0072] The fusion weights include fusion weights for different channel dimensions and fusion weights for different spatial dimensions. Regarding the fusion weights for the channel dimension, the weight of each channel represents the contribution of that channel's features to the current task. For example, in driver behavior monitoring, channels related to the driver, such as those related to the driver's face and hands, will be assigned higher weights, while channels related to background interference will be assigned lower weights. Regarding the fusion weights for the spatial dimension, the fusion weight of each spatial location represents the importance of that region to the current task. For example, in driver behavior monitoring, areas related to the driver, such as the facial area and the area where hands interact with the steering wheel, will be assigned higher weights, while non-driver areas such as background areas will be assigned lower weights.
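One simple way to picture channel and spatial fusion weights is as multiplicative gates on the feature tensor. The sketch below assumes that form purely for illustration; the source does not state how the weights are applied, and all names and values here are hypothetical.

```python
import numpy as np


def apply_fusion_weights(features, channel_w, spatial_w):
    """Re-weight an M x H x W tensor per channel and per spatial location.

    Assumption: the weights act as multiplicative gates.
    """
    assert channel_w.shape == (features.shape[0],)
    assert spatial_w.shape == features.shape[1:]
    return features * channel_w[:, None, None] * spatial_w[None, :, :]


features = np.ones((3, 2, 2))
channel_w = np.array([1.0, 0.5, 0.1])    # e.g. driver-related channels weighted higher
spatial_w = np.array([[1.0, 0.2],
                      [1.0, 0.2]])       # left column stands for the driver area
latent = apply_fusion_weights(features, channel_w, spatial_w)
```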

[0073] In some embodiments, the fusion weights can be fixed or dynamically calculated.

[0074] In some embodiments, a designated module can be used to dynamically calculate the fusion weights, and the fusion results can be fused from the channel dimension and the spatial dimension according to the fusion weights. For example, the designated module may be Swin Transformer, Swin Transformer v2, residual Swin Transformer v2, etc., and the embodiments of this application are not limited thereto.

[0075] Taking the residual Swin Transformer v2 as an example, it can adjust the fusion weights of different channel dimensions based on multi-head self-attention (MSA), multilayer perceptron (MLP) layers, and image patch merging. At the same time, it can adjust the fusion weights of different spatial dimensions based on window-based self-attention and shifted window mechanisms.
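The window and shifted-window mechanisms mentioned above can be illustrated with the partitioning step alone. The sketch below follows the publicly described Swin Transformer scheme (non-overlapping windows, then a cyclic shift by half a window); it omits the attention computation entirely and is not the patent's module. Function names and the toy input are assumptions.

```python
import numpy as np


def window_partition(x, ws):
    """Split an H x W x C map into (num_windows, ws, ws, C) non-overlapping windows."""
    h, w, c = x.shape
    x = x.reshape(h // ws, ws, w // ws, ws, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws, ws, c)


def shifted_windows(x, ws):
    """Cyclically shift the map by half a window, then partition it."""
    shift = ws // 2
    rolled = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(rolled, ws)


x = np.arange(16).reshape(4, 4, 1)
regular = window_partition(x, 2)    # windows aligned with the grid
shifted = shifted_windows(x, 2)     # windows straddling the previous boundaries
```

Self-attention would then be computed inside each window; alternating regular and shifted partitions lets information cross window boundaries.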

[0076] S203: Quantize the latent space representation to obtain a quantized latent space representation, and input the quantized latent space representation and the latent space representation into a preset entropy model to obtain the probability distribution of the quantized latent space representation.

[0077] Quantization is the process of mapping continuous numerical values or vectors to discrete values or intervals, thereby reducing the complexity of data representation, storage requirements, or computational load. The discrete latent space representation obtained through quantization is called the quantized latent space representation.
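As a minimal illustration of quantization, the sketch below rounds each latent value to the nearest integer. Rounding is a common choice in learned image compression, but the source does not specify the quantizer, so this is an assumption.

```python
import numpy as np


def quantize(latent):
    """Map continuous latent values to integers by rounding to the nearest one."""
    return np.rint(latent).astype(np.int32)


y = np.array([[-1.7, -0.4, 0.2],
              [0.6, 1.49, 2.51]])
y_hat = quantize(y)
```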

[0078] Entropy models are models used to estimate the probability distribution of data. In the embodiments of this application, a pre-trained entropy model can output the probability distribution of the values of the quantized latent space representation based on the input latent space representation and the quantized latent space representation.

[0079] In some embodiments, the quantized latent space representation and the latent space representation are input into a preset entropy model to obtain the probability distribution of the quantized latent space representation, including: dividing the quantized latent space representation into multiple slices along the channel dimension; inputting the multiple slices and the latent space representation into the entropy model so that the entropy model sequentially processes each slice to obtain the probability distribution of each slice.

[0080] Specifically, the quantized latent space representation can be equally divided along the channels to obtain multiple slices. For example, if the quantized latent space representation has 320 channels, it can be equally divided along the channels to obtain slices 1 to 10, each slice having 32 channels.
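
The equal division in the example above can be sketched as follows. The helper name `split_channels` is hypothetical, and only channel indices are tracked rather than actual feature data.

```python
def split_channels(num_channels, num_slices):
    """Return the channel indices that belong to each equal-sized slice."""
    if num_channels % num_slices != 0:
        raise ValueError("channels must divide evenly into slices")
    size = num_channels // num_slices
    return [list(range(k * size, (k + 1) * size)) for k in range(num_slices)]

# 320 channels divided into slices 1 to 10, as in the example above.
slices = split_channels(320, 10)
print(len(slices), len(slices[0]))  # 10 32
```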

[0081] The multiple slices and their latent space representations are input into the entropy model. The entropy model processes each slice one after another, and the processing of each slice depends on the other slices that have already been processed. For example, slices 1 to 10 and their latent space representations are input into the entropy model. The entropy model processes slices 1 to 10 sequentially to obtain the probability distribution of each slice.
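
The sequential dependency described above can be sketched as a simple loop. Here `predict` is a hypothetical callback standing in for the context network and the probability prediction subnetwork; it is not an interface defined by this application.

```python
def process_slices(slices, predict):
    """Process slices in order; each slice sees only the earlier slices."""
    distributions = []
    for k, current in enumerate(slices):
        previous = slices[:k]  # slices that have already been processed
        distributions.append(predict(current, previous))
    return distributions

# Toy stand-in: report how many earlier slices were available as context.
seen = process_slices(["s1", "s2", "s3"], lambda cur, prev: len(prev))
print(seen)  # [0, 1, 2]
```

The same loop order would be followed at the receiving end, which is why each slice may only depend on slices that precede it.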

[0082] In some embodiments, the preset entropy model is specifically used for:

[0083] S2031: Based on the input latent space representation, generate hyperprior context information through a hyperprior network.

[0084] Specifically, the hyperprior network includes the hyperprior analysis transform network h_a and the hyperprior synthesis transform network h_s. The latent space representation is input into the hyperprior analysis transform network h_a to obtain the auxiliary information z, which is then quantized to obtain the quantized auxiliary information. The quantized auxiliary information is encoded by an arithmetic encoder (AE) to obtain a bitstream, which is then decoded by an arithmetic decoder (AD) and finally passed through the hyperprior synthesis transform network h_s to obtain the hyperprior context information.

[0085] S2032: For the current input slice, based on the current slice and all slices input before the current slice, generate the channel context information, local context information, intra-slice global context information, and inter-slice global context information of the current slice through the context network.

[0086] Specifically, the context network can be composed of channel context modules, local context modules, intra-slice global context modules, and inter-slice global context modules.

[0087] For the current slice, the channel context module can capture channel context information from all slices input before the current slice. Each slice can include anchor and non-anchor portions. The local context module can generate local context information based on the anchor portion of the current slice. The intra-slice global context module can generate intra-slice global context information based on the anchor portion of the current slice and the correlation between the anchor and non-anchor portions of the previous slice. The inter-slice global context module can capture the global correlation between the anchor portion of the current slice and the previous slice to generate inter-slice global context information.
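
The source does not state how a slice is divided into its anchor and non-anchor portions. One common choice in learned image compression is a checkerboard pattern over spatial positions; the sketch below illustrates that pattern purely as an assumption, and `checkerboard_split` is a hypothetical helper.

```python
def checkerboard_split(height, width):
    """Split spatial positions into two interleaved sets.

    Assumption: positions with even (row + col) are anchors. Which
    parity counts as the anchor is a convention, not from the source.
    """
    anchor, non_anchor = [], []
    for r in range(height):
        for c in range(width):
            (anchor if (r + c) % 2 == 0 else non_anchor).append((r, c))
    return anchor, non_anchor

anchor, non_anchor = checkerboard_split(2, 2)
print(anchor, non_anchor)  # [(0, 0), (1, 1)] [(0, 1), (1, 0)]
```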

[0088] S2033: Based on the hyperprior context information and the current slice's channel context information, local context information, intra-slice global context information, and inter-slice global context information, generate the probability distribution of the current slice through a probability prediction subnetwork g_ep.

[0089] The probability prediction subnetwork g_ep can be a pre-trained neural network that generates the probability distribution of the current slice based on the input context information.

[0090] For the current slice, the channel context information, local context information, intra-slice global context information, inter-slice global context information, and hyperprior context information of the current slice are input into the probability prediction subnetwork g_ep to generate the probability distribution of the current slice.
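
This application does not specify the form of the probability distribution produced by the probability prediction subnetwork. A frequently used form in learned image compression is a Gaussian whose mean and scale are predicted per element and then discretized over integer symbols. The sketch below assumes that form; the parameter values are illustrative and the function names are hypothetical.

```python
import math

def gaussian_cdf(x, mean, scale):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (scale * math.sqrt(2.0))))

def symbol_probability(symbol, mean, scale):
    """Probability mass of an integer symbol under a discretized Gaussian."""
    upper = gaussian_cdf(symbol + 0.5, mean, scale)
    lower = gaussian_cdf(symbol - 0.5, mean, scale)
    return upper - lower

# Hypothetical parameters a prediction network might output for one element.
probs = [symbol_probability(s, mean=0.3, scale=1.2) for s in range(-50, 51)]
```

Symbols close to the predicted mean receive a larger probability mass and therefore cost fewer bits when entropy coded.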

[0091] S204: Encode the quantized latent space representation based on the probability distribution, and send the encoding result to the receiving end in the form of a bit stream, so that the receiving end can decode the bit stream to obtain the reconstructed image.

[0092] The probability distribution generated by the entropy model is used to entropy encode the quantized latent space representation. The encoded data is then converted into a bitstream and sent to the receiving end. After receiving the bitstream, the receiving end uses the same probability distribution as the sending end to decode it, reconstruct the quantized latent space representation, and generate a reconstructed image.
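
For intuition about why an accurate probability distribution reduces the transmitted data volume, the ideal code length of a symbol with probability p is -log2(p) bits, which entropy coders such as arithmetic coding approach closely. The sketch below only estimates this length and does not implement an arithmetic coder; `estimated_bits` is a hypothetical helper.

```python
import math

def estimated_bits(probabilities):
    """Ideal code length in bits: the sum of -log2(p) over coded symbols."""
    return sum(-math.log2(p) for p in probabilities)

# Two symbols with probabilities 1/2 and 1/4 cost 1 + 2 bits.
print(estimated_bits([0.5, 0.25]))  # 3.0
```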

[0093] In some embodiments, the quantized latent space representation is divided into multiple slices along the channel dimension.

[0094] The quantized latent space representation is then encoded based on the probability distribution, including: for each slice, encoding the slice based on the probability distribution of that slice.

[0095] The encoded result is sent to the receiving end in the form of a bitstream, so that the receiving end can decode the bitstream to obtain the reconstructed image, including:

[0096] The encoding result is sent to the receiving end in the form of a bit stream, so that the receiving end decodes each of the multiple slices in sequence based on the same probability distribution as the sending end, reconstructs the quantized latent space representation, and generates the reconstructed image.

[0097] Specifically, the decoding of each slice depends on the decoding of other slices that have already been decoded.

[0098] In the embodiments of this application, a mask image is obtained by adding a mask to the target region in the original image. Global features and mask features are then extracted from both the original image and the mask image. The global features and mask features are fused along the channel and spatial dimensions, respectively. The fusion results are then fused along the channel and spatial dimensions according to fusion weights to obtain the latent space representation of the original image. The latent space representation is quantized to obtain a quantized latent space representation. The quantized latent space representation and the latent space representation are input into a preset entropy model to obtain a probability distribution of the quantized latent space representation. The quantized latent space representation is encoded based on the probability distribution, and the encoded result is sent to the receiving end in the form of a bitstream. The receiving end then decodes the bitstream to obtain the reconstructed image. By using a mask, the original image becomes sparse, resulting in stronger sparsity of the extracted latent space features and reducing the amount of data in the latent space features. This improves the local peak signal-to-noise ratio of the reconstructed image under extremely low data constraints, thereby enhancing the clarity of the reconstructed image at extremely low bitrates. Furthermore, fusing features along the channel and spatial dimensions according to fusion weights reduces information redundancy caused by mask feature extraction, improving model performance.

[0099] As shown in Figure 3, Figure 3 is a schematic flowchart of a T2X satellite-ground cooperative image reconstruction method provided in an embodiment of this application. This method can be applied to a satellite-ground cooperative communication scenario with extremely low data volume constraints, such as the one shown in Figure 2, and can be executed by a computing device installed on a vehicle.

[0100] Semantic segmentation is performed on the driver monitoring image x, classifying each pixel of the driver monitoring image x into a specific semantic category, such as driver, window, steering wheel, etc. After semantic segmentation, it is identified which pixels belong to the driver semantic category, and other pixels are assigned to the non-driver semantic category. The area where the pixels with the semantic category of non-driver are located is determined as the non-driver area.

[0101] A mask is added to the non-driver area to obtain the mask image x_masked.
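
The masking of the non-driver area can be sketched as follows, assuming a per-pixel semantic label map is already available and that masked pixels are replaced with a constant fill value of 0. The fill value, the label strings, and the helper name `mask_non_driver` are assumptions made for this example.

```python
def mask_non_driver(image, labels, driver_label="driver", fill=0):
    """Keep pixels labeled as driver and replace all others with fill.

    Assumption: the mask overwrites non-driver pixels with a constant;
    the source does not state the fill value.
    """
    return [
        [px if lab == driver_label else fill for px, lab in zip(img_row, lab_row)]
        for img_row, lab_row in zip(image, labels)
    ]

image = [[10, 20], [30, 40]]
labels = [["driver", "window"], ["steering wheel", "driver"]]
print(mask_non_driver(image, labels))  # [[10, 0], [0, 40]]
```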

[0102] The driver monitoring image x is input into a pre-trained global feature extraction module (GlobalExtraction) to obtain the global features, and the mask image x_masked is input into a pre-trained mask feature extraction module to obtain the mask features.

[0103] The residual block "Residual,N,↑ / ↓" in the global feature extraction module and the mask feature extraction module contains the following components:

[0104] 1. Conv3×3,N,↑ / ↓: This indicates a 3×3 convolutional layer, where “N” represents the number of output channels, “↑” indicates upsampling, and “↓” indicates downsampling.

[0105] 2. GELU: The Gaussian Error Linear Unit (GELU) is an activation function used to introduce nonlinearity.

[0106] 3. Conv3×3,N, / : This indicates a 3×3 convolutional layer, where "N" represents the number of output channels and " / " is a parameter separator.

[0107] 4. GDN / IGDN: GDN (Generalized Divisive Normalization) transforms image data into a form closer to a Gaussian distribution, reducing the correlation between different data points. Gaussianization is crucial for image data compression because the Gaussian distribution has simpler statistical properties, allowing compression models to more easily learn the data structure. IGDN (Inverse Generalized Divisive Normalization) is the inverse of GDN. During compression and decompression, IGDN effectively restores the compressed Gaussianized data to the original data, ensuring the reversibility and high fidelity of the compression and decompression processes.

[0108] 5. Addition operator (+): Represents a residual connection. Residual connections add the input directly to the output of the module, which helps to alleviate the gradient vanishing problem in deep networks and improve the training effect of the model.
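
The GDN operation in item 4 above is commonly written as y_i = x_i / sqrt(beta_i + sum_j gamma_ij * x_j^2), where beta and gamma are learned parameters. The sketch below applies that commonly used form across the channels of a single spatial position; the parameter values are illustrative only and are not taken from this application.

```python
import math

def gdn(x, beta, gamma):
    """Apply GDN across the channels of one spatial position."""
    n = len(x)
    return [
        x[i] / math.sqrt(beta[i] + sum(gamma[i][j] * x[j] ** 2 for j in range(n)))
        for i in range(n)
    ]

# Illustrative (not learned) parameters for two channels.
out = gdn([3.0, 4.0], beta=[0.0, 0.0], gamma=[[1.0, 1.0], [1.0, 1.0]])
```

With these parameters each channel is divided by the joint magnitude of both channels, so larger joint activity yields stronger normalization. IGDN is typically implemented as the corresponding multiplication with its own learned parameters.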

[0109] The residual block "Residual,N," in the global feature extraction module and the mask feature extraction module contains the following components: "Conv3×3,N, / ", "GELU", and the addition operator (+). Further details are omitted here.

[0110] For the mask feature extraction module, during the training phase of the model, the image region (i.e., the driver region) that the model can learn is explicitly specified through mask constraints. This guides the model's learning behavior during the training phase, forcing the model to learn the regional distribution information of the image, and to learn which regions need more bits to represent and which regions can be allocated fewer bits, thereby optimizing the bit allocation region and ratio of the image.

[0111] For example, the above mask constraint can be flexibly configured at different layers of the model, such as the input layer or intermediate feature layers. For instance, in a convolutional layer, activation values in the masked region can be suppressed through channel-wise multiplication. Specifically, the mask is multiplied with the feature map element-wise, causing feature values outside the mask-constrained region to be set to zero, thereby constraining the model's learning behavior.
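
The element-wise multiplication described above can be sketched as follows, assuming a binary mask (1 keeps a position, 0 zeroes it) that has already been resized to the spatial resolution of the feature map. The helper name `apply_mask` and the toy values are hypothetical.

```python
def apply_mask(features, mask):
    """Multiply every channel of a C x H x W feature map by an H x W mask."""
    return [
        [[v * m for v, m in zip(f_row, m_row)] for f_row, m_row in zip(channel, mask)]
        for channel in features
    ]

features = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]  # 2 channels
mask = [[1, 0], [0, 1]]  # assumed binary mask: 1 keeps, 0 suppresses
masked = apply_mask(features, mask)
```

The same mask is reused for every channel, so suppression is applied consistently across the channel dimension.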

[0112] After obtaining the global and mask features, the lack of information interaction between them leads to information redundancy. The global and mask features can be input into a Masked-Global Fusion module. Within this module, the global and mask features are first concatenated along the channel dimension, i.e., chained together. Then, a 1x1 convolution kernel is used to adjust the number of channels to M, followed by a 3x3 depthwise separable convolution to extract local spatial features and add positional encoding information. Finally, the residual Swin Transformer v2 module adaptively fuses the features from both the channel and spatial dimensions according to the fusion weights, yielding the latent space representation y of the driver monitoring image x.
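
The first two steps of the fusion described above, channel concatenation followed by a 1x1 convolution that adjusts the channel count to M, can be sketched as follows. The sketch omits the bias term, the 3x3 depthwise separable convolution, the positional encoding, and the residual Swin Transformer v2 module; the weights shown are illustrative rather than learned, and the function names are hypothetical.

```python
def concat_channels(a, b):
    """Concatenate two C x H x W feature maps along the channel dimension."""
    return a + b

def conv1x1(features, weights):
    """Mix channels independently at every spatial position.

    features: C_in x H x W nested lists; weights: M x C_in.
    """
    height, width = len(features[0]), len(features[0][0])
    return [
        [[sum(w[c] * features[c][r][col] for c in range(len(features)))
          for col in range(width)] for r in range(height)]
        for w in weights
    ]

global_feat = [[[1.0, 2.0]]]  # 1 channel, 1 x 2 spatial size
mask_feat = [[[3.0, 4.0]]]    # 1 channel, 1 x 2 spatial size
fused = conv1x1(concat_channels(global_feat, mask_feat), weights=[[1.0, 1.0]])
```

Here two input channels are reduced to M = 1 output channel; in practice M and the weights would be determined by the trained model.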

[0113] The residual Swin Transformer v2 module contains the following components: Feature Embedding (FE), multiple Swin Transformer v2 modules, Feature Unembedding (FU), and the addition operator (+).

[0114] The latent space representation y is input into the entropy model and processed by the hyperprior analysis transform network h_a within the entropy model to obtain the auxiliary information z, which is quantized to obtain the quantized auxiliary information. The quantized auxiliary information is encoded by the arithmetic encoder (AE), and the encoding result is sent to the receiving end as a bitstream. The sending end itself also decodes the encoding result with the arithmetic decoder (AD) and passes it through the hyperprior synthesis transform network h_s to obtain the hyperprior context information.

[0115] The latent space representation y is quantized to obtain its discrete form, i.e., the quantized latent space representation. The quantized latent space representation is divided equally along the channel dimension to obtain multiple slices.

[0116] All slices are input into the entropy model, which processes each slice sequentially to obtain the probability distribution of each slice. Specifically, for the current slice, the channel-wise context module, local context module, intra-slice global context module, and inter-slice global context module within the entropy model produce the channel context information, local context information, intra-slice global context information, and inter-slice global context information of the current slice. This context information, together with the hyperprior context information, is input into the probability prediction subnetwork g_ep to generate the probability distribution of the current slice.

[0117] For each slice, it is encoded based on its probability distribution. Finally, the encoded result is sent to the receiving end as a bit stream.

[0118] The receiver decodes each slice sequentially based on the same probability distribution as the transmitter, reconstructs the quantized latent space representation, and generates the reconstructed image.

[0119] Regarding decoding, the receiver is configured with an entropy model that has the same model parameters as the entropy model used by the transmitter. The receiver receives the bitstream corresponding to the quantized auxiliary information, decodes it with the arithmetic decoder (AD) within the entropy model, and passes the result through the hyperprior synthesis transform network h_s to obtain the hyperprior context information.

[0120] For the current slice, channel context information, local context information, intra-slice global context information, and inter-slice global context information are generated based on the slices that have already been decoded. For details on how this context information is generated, refer to S2032; they are not repeated here. It should be noted that the anchor portion of the current slice used in generating the above context information can be generated based on the hyperprior context information and the channel context information.

[0121] The hyperprior context information, channel context information, local context information, intra-slice global context information, and inter-slice global context information are input into the probability prediction subnetwork g_ep to generate the probability distribution of the current slice, and the current slice is decoded based on that probability distribution.

[0122] In the embodiments of this application, the original image is sparsified by masking, which makes the extracted latent space features sparser and reduces the amount of latent space feature data. This improves the local peak signal-to-noise ratio of the reconstructed image under extremely low data constraints, thereby improving the clarity of the reconstructed image at extremely low bit rates. Furthermore, features are fused in the channel and spatial dimensions according to the fusion weights, which reduces the information redundancy caused by mask feature extraction and improves model performance.

[0123] The image reconstruction method for T2X satellite-ground collaboration proposed in this application is applied in a satellite-ground cooperative communication scenario with extremely low data volume constraints, such as the one shown in Figure 2. Assuming the original service image transmitted by the driver monitoring service has a resolution of 720P, the peak signal-to-noise ratio (PSNR) of the reconstructed image is simulated at different bit rates (bits per pixel, bpp). The performance of this scheme is compared with that of the global feature extraction scheme trained without mask guidance (MGALIC w/o mask), as well as existing schemes such as MLIC, ELIC, WACNN, and STF. The comparison results are shown in Figure 4, where the MGALIC curve marked with an asterisk represents the simulation results of this technical solution. It can be seen that this technical solution can effectively reduce the amount of latent space feature data of the image and improve the local PSNR of the reconstructed image.
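
The two evaluation quantities mentioned above can be sketched as follows. The sketch assumes that "local PSNR" means PSNR computed only over a region of interest (for example, the driver region), that pixel values are 8-bit with a peak of 255, and that 720P corresponds to 1280 x 720 pixels. This application does not give the exact definition, so these are assumptions, and both function names are hypothetical.

```python
import math

def local_psnr(original, reconstructed, region, peak=255.0):
    """PSNR in dB computed only over pixels where region is nonzero."""
    errors = [
        (o - r) ** 2
        for o_row, r_row, m_row in zip(original, reconstructed, region)
        for o, r, m in zip(o_row, r_row, m_row)
        if m
    ]
    if not errors:
        raise ValueError("region is empty")
    mse = sum(errors) / len(errors)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

def bits_per_pixel(num_bits, width, height):
    """Bit rate in bits per pixel (bpp) for a coded image."""
    return num_bits / (width * height)

# Only the first pixel lies in the region, so the large error on the
# second pixel does not affect the result.
value = local_psnr([[100, 50]], [[90, 0]], [[1, 0]])
rate = bits_per_pixel(92160, 1280, 720)
```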

[0124] As shown in Figure 5, a visual comparison of the latent space features and bit allocation of this technical solution and other solutions was conducted. It can be seen that the latent space features extracted by this technical solution are sparser than those of other solutions. Furthermore, this solution effectively removes the spatial correlation of the extracted latent space features and reduces spatial information redundancy. Compared with other solutions, the bit allocation region of this solution is more concentrated and efficient.

[0125] As shown in Figure 6, the role of the mask in guiding model training was analyzed by visualizing the latent space features and bit allocation of the five channels with the highest entropy. As can be seen from Figure 6, the mask effectively constrains the model's learning behavior. Mask feature extraction is completely confined to the mask-constrained region. By sparsifying the original input using the mask, the mask features are effectively sparsified. Global features are not sparse and have spatial information redundancy with mask features. The mask-global feature fusion module effectively fuses features from both the channel and spatial dimensions, reducing spatial information redundancy and simultaneously sparsifying the latent space features. The mask enables the model to learn the regional distribution information of the image during the training phase, ensuring that bit allocation is concentrated within the mask-constrained region.

[0126] The above description describes the method provided in this application. The following description describes the apparatus provided in this application:

[0127] Figure 7 is a schematic diagram of the structure of an image reconstruction device for T2X satellite-ground collaboration provided in an embodiment of this application.

[0128] As shown in Figure 7, the device may include:

[0129] The extraction unit 710 is used to add a mask to the target region in the original image to obtain a mask image, and to extract features from the original image and the mask image respectively to obtain global features and mask features.

[0130] The fusion unit 720 is used to fuse the global features and the mask features in the channel dimension and the spatial dimension, respectively, and to fuse the fusion results in the channel dimension and the spatial dimension according to the fusion weights to obtain the latent space representation of the original image.

[0131] The processing unit 730 is used to quantize the latent space representation to obtain a quantized latent space representation, and input the quantized latent space representation and the latent space representation into a preset entropy model to obtain the probability distribution of the quantized latent space representation;

[0132] The encoding unit 740 is used to encode the quantized latent space representation based on the probability distribution and send the encoding result to the receiving end in the form of a bit stream so that the receiving end can decode the bit stream to obtain the reconstructed image.

[0133] Optionally, the original image is a driver monitoring image, and the extraction unit 710 is further configured to:

[0134] The original image is semantically segmented, and the image region with the semantic category of "non-driver" is determined as the target region.

[0135] Optionally, the fusion unit 720 is specifically used for:

[0136] The global feature and the mask feature are concatenated and adjusted in the channel dimension to obtain the first feature;

[0137] Local spatial features are extracted from the first feature, and location encoding information is added.

[0138] Optionally, the processing unit 730 is specifically used for:

[0139] The quantized latent space representation is divided into multiple slices along the channel dimension;

[0140] The multiple slices and the latent space representation are input into the entropy model, so that the entropy model sequentially processes each slice to obtain the probability distribution of each slice.

[0141] The encoding of the quantized latent space representation based on the probability distribution includes:

[0142] For each slice, the slice is encoded based on the probability distribution of the slice.

[0143] Optionally, the entropy model is specifically used for:

[0144] Based on the input latent space representation, hyperprior context information is generated through a hyperprior network;

[0145] For the current input slice, based on the current slice and all slices input before the current slice, the context network generates the channel context information, local context information, intra-slice global context information, and inter-slice global context information of the current slice.

[0146] Based on the aforementioned hyperprior context information, as well as the current slice's channel context information, local context information, intra-slice global context information, and inter-slice global context information, the probability distribution of the current slice is generated through the probability prediction subnetwork g_ep.

[0147] Optionally, the encoding result can be sent to the receiving end in the form of a bit stream, so that the receiving end can decode each of the plurality of slices sequentially based on the same probability distribution as the sending end, reconstruct the quantized latent space representation, and generate a reconstructed image.

[0148] The specific implementation process of the functions and roles of each unit in the above device can be found in the implementation process of the corresponding steps in the above method, and will not be repeated here.

[0149] This application also provides a hardware structure. Figure 8 is a structural diagram of an electronic device provided in an embodiment of this application. As shown in Figure 8, the hardware structure may include: a processor and a machine-readable storage medium, the machine-readable storage medium storing machine-executable instructions that can be executed by the processor; the processor is used to execute the machine-executable instructions to implement the method disclosed in the above example of this application.

[0150] Based on the same application concept as the above method, this application embodiment also provides a machine-readable storage medium storing a plurality of computer instructions, which, when executed by a processor, can implement the method disclosed in the above examples of this application.

[0151] For example, the aforementioned machine-readable storage medium can be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions, data, etc. For instance, machine-readable storage media can be: RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, storage drives (such as hard disk drives), solid-state drives, any type of storage disk (such as optical discs, DVDs, etc.), or similar storage media, or combinations thereof.

[0152] It should be noted that, in this document, relational terms such as "objective" and "target" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0153] The above description is merely a preferred embodiment of this application and is not intended to limit this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of protection of this application.

Claims

1. A T2X satellite-ground cooperative image reconstruction method, characterized in that the method includes: adding a mask to the target region in the original image to obtain a mask image, and performing feature extraction on the original image and the mask image to obtain global features and mask features, respectively; fusing the global features and the mask features in the channel dimension and the spatial dimension, respectively, and fusing the fusion results in the channel dimension and the spatial dimension according to fusion weights to obtain the latent space representation of the original image; quantizing the latent space representation to obtain a quantized latent space representation, and inputting the quantized latent space representation and the latent space representation into a preset entropy model to obtain the probability distribution of the quantized latent space representation; and encoding the quantized latent space representation based on the probability distribution, and sending the encoding result to the receiving end in the form of a bitstream so that the receiving end can decode the bitstream to obtain the reconstructed image; wherein fusing the global features and the mask features in the channel dimension and the spatial dimension, respectively, includes: concatenating the global features and the mask features along the channel dimension, and adjusting the number of channels of the concatenated feature to a preset number to obtain the first feature; and extracting local spatial features from the first feature and adding position encoding information; wherein inputting the quantized latent space representation and the latent space representation into a preset entropy model to obtain the probability distribution of the quantized latent space representation includes: dividing the quantized latent space representation into multiple slices along the channel dimension; and inputting the multiple slices and the latent space representation into the entropy model so that the entropy model sequentially processes each slice to obtain the probability distribution of each slice; and wherein the entropy model is specifically used to: generate hyperprior context information through a hyperprior network based on the input latent space representation; for the input current slice, generate channel context information, local context information, intra-slice global context information, and inter-slice global context information of the current slice through a context network based on the current slice and all slices input before the current slice; and generate the probability distribution of the current slice through a probability prediction subnetwork g_ep based on the hyperprior context information and the channel context information, local context information, intra-slice global context information, and inter-slice global context information of the current slice.

2. The method of claim 1, wherein the original image is a driver monitoring image, and the method further includes: performing semantic segmentation on the original image, and determining the image region with the semantic category of "non-driver" as the target region.

3. The method of claim 1, wherein encoding the quantized latent space representation based on the probability distribution includes: for each slice, encoding the slice based on the probability distribution of that slice.

4. The method of claim 3, wherein sending the encoding result to the receiving end in the form of a bitstream, so that the receiving end can decode the bitstream to obtain the reconstructed image, includes: sending the encoding result to the receiving end in the form of a bitstream, so that the receiving end decodes each of the multiple slices sequentially based on the same probability distribution as the sending end, reconstructs the quantized latent space representation, and generates the reconstructed image.

5. An image reconstruction device for T2X satellite-ground collaboration, characterized in that the device includes: an extraction unit, used to add a mask to the target region in the original image to obtain a mask image, and to extract features from the original image and the mask image respectively to obtain global features and mask features; a fusion unit, used to fuse the global features and the mask features in the channel dimension and the spatial dimension, respectively, and to fuse the fusion results in the channel dimension and the spatial dimension according to fusion weights to obtain the latent space representation of the original image, wherein fusing the global features and the mask features in the channel dimension and the spatial dimension, respectively, includes: concatenating the global features and the mask features along the channel dimension, and adjusting the number of channels of the concatenated feature to a preset number to obtain the first feature; and extracting local spatial features from the first feature and adding position encoding information; a processing unit, configured to quantize the latent space representation to obtain a quantized latent space representation, and to input the quantized latent space representation and the latent space representation into a preset entropy model to obtain the probability distribution of the quantized latent space representation, wherein inputting the quantized latent space representation and the latent space representation into a preset entropy model to obtain the probability distribution of the quantized latent space representation includes: dividing the quantized latent space representation into multiple slices along the channel dimension; and inputting the multiple slices and the latent space representation into the entropy model so that the entropy model sequentially processes each slice to obtain the probability distribution of each slice; wherein the entropy model is specifically used to: generate hyperprior context information through a hyperprior network based on the input latent space representation; for the input current slice, generate channel context information, local context information, intra-slice global context information, and inter-slice global context information of the current slice through a context network based on the current slice and all slices input before the current slice; and generate the probability distribution of the current slice through a probability prediction subnetwork g_ep based on the hyperprior context information and the channel context information, local context information, intra-slice global context information, and inter-slice global context information of the current slice; and an encoding unit, used to encode the quantized latent space representation based on the probability distribution and to send the encoding result to the receiving end in the form of a bitstream so that the receiving end can decode the bitstream to obtain the reconstructed image.

6. An electronic device, comprising a processor and a memory, the memory storing machine-executable instructions that can be executed by the processor, the processor being used to execute the machine-executable instructions to implement the method as described in any one of claims 1-4.

7. A machine-readable storage medium, characterized in that the machine-readable storage medium stores machine-executable instructions which, when executed by a processor, implement the method as described in any one of claims 1-4.

8. A computer program product comprising a computer program / instructions, characterized in that, when the computer program / instructions are executed by a processor, they implement the steps of the method according to any one of claims 1-4.