Lightweight compressive imaging method based on frequency-spatial domain modulation of attention

By employing a lightweight single-pixel imaging method that modulates attention in the frequency and spatial domains and utilizing a dual-domain collaborative attention deep reconstruction network, this method addresses the issues of high computational complexity and insufficient robustness in high-resolution reconstruction of traditional single-pixel imaging methods, thereby achieving efficient and high-quality image reconstruction in edge devices.

CN122049083B · Active · Publication Date: 2026-06-26 · TIANJIN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
TIANJIN UNIV
Filing Date
2026-04-17
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Traditional single-pixel imaging methods suffer from high computational complexity, long reconstruction time, strong dependence on prior information, and insufficient robustness when reconstructing at high resolutions, making them difficult to adapt to complex scenarios. Furthermore, existing deep learning methods consume high computational resources and are difficult to deploy in edge devices.

Method used

A lightweight single-pixel imaging method based on frequency-spatial domain modulation attention is adopted. Image reconstruction is performed through a dual-domain collaborative attention deep reconstruction network. Global structural information is extracted using frequency domain features and local detail information is extracted using spatial domain features. The reconstruction accuracy and robustness are improved by a residual-guided update module and a multi-scale-depth dynamic convolution fusion module.

Benefits of technology

It significantly reduces computational complexity and the number of parameters, improves reconstruction accuracy and robustness, and is suitable for deployment in resource-constrained edge devices, enabling efficient and high-quality image reconstruction.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122049083B_ABST
Patent Text Reader

Abstract

The application relates to the technical field of image processing and discloses a lightweight single-pixel imaging method based on frequency domain-spatial domain modulation attention, comprising a single-pixel image data generator and a computational reconstruction generator. An initial inversion estimate is obtained through inverse mapping with the measurement matrix and is then input into a dual-domain collaborative attention deep reconstruction network. The network consists of a plurality of dual-domain collaborative attention modules cascaded in series. Within each module, a multi-head spectral calibration module realizes long-distance information reconstruction through the fusion and interaction of dual-domain feature information; a multi-scale-depth dynamic convolution fusion module extracts features in the spatial domain, accurately processes high-frequency edges, and significantly enhances the reconstruction of texture details; and a residual-guided update module corrects the features. Through this dual-domain collaborative mechanism, the reconstruction precision and detail recovery capability of single-pixel imaging are effectively improved while the number of network parameters and the computational complexity are greatly reduced, making the method suitable for edge deployment.

Description

Technical Field

[0001] This invention relates to the field of image processing technology, and in particular to a lightweight single-pixel imaging method based on frequency-spatial domain modulation attention.

Background Technology

[0002] Single-pixel imaging (SPI) is a novel imaging method combining spatial light modulation and compressed sensing theory. In the absence of high-resolution two-dimensional imaging devices, it encodes the target image multiple times using a modulation matrix, acquires the projection signal using a single photodetector with no spatial resolution, and finally reconstructs the original image using a reconstruction algorithm. Traditional SPI methods are mainly based on compressed sensing (CS) theory. They encode the target scene using a set of preset or pseudo-random spatial modulation patterns (such as Hadamard, random binary, or sine wave patterns), then acquire the modulated light intensity response using a single-pixel detector, and finally reconstruct the original image using a mathematical inversion algorithm.

[0003] This imaging method has unique advantages in specific wavelength bands (such as terahertz, infrared, ultraviolet, etc.) and extreme environments (low light, high scattering, strong interference), and is widely used in remote sensing, biomedical imaging, low light field imaging, security monitoring and military reconnaissance, thus having important research value and engineering application prospects.

[0004] With the ever-increasing performance requirements of single-pixel imaging systems, especially the growing demand for high-resolution reconstructed images in tasks such as target detection and structural analysis, traditional SPI systems face the following three key challenges:

[0005] (1) Traditional single-pixel reconstruction methods have low reconstruction efficiency: Common image reconstruction algorithms based on compressed sensing include gradient descent, the alternating direction method of multipliers (ADMM), total variation (TV) reconstruction, and sparse regularization (such as the L1 norm). These methods perform well when the sampling rate is high, but they have the following technical bottlenecks: ① High computational complexity and long reconstruction time: most traditional compressed sensing reconstruction algorithms rely on iterative optimization, which is computationally expensive and converges slowly, making it difficult to meet the needs of high-resolution or real-time processing; ② Strong dependence on prior information: compressed sensing methods usually assume that the image is sparse in some transform domain (such as DCT or wavelet). This prior does not always hold in practice, and when it fails the reconstruction is prone to artifacts or loss of detail; ③ Insufficient robustness: in the presence of optical distortion, noise interference, or modulation error, the reconstruction quality of traditional methods degrades significantly, and they adapt poorly to complex scenes.

[0006] (2) Limited imaging resolution and difficulty in efficiently supporting large-size reconstruction: Due to limitations in optical modulation efficiency, system signal-to-noise ratio, and reconstruction algorithm complexity, traditional SPI systems typically achieve reconstruction resolutions between 32×32 and 128×128 pixels. Once the target imaging size exceeds 512×512 pixels, the reconstruction computation of conventional neural networks on such high-dimensional data becomes extremely large and may fail to converge, directly affecting the system's real-time performance and reconstruction accuracy.

[0007] (3) Current single-pixel imaging networks are complex and consume high computational resources: To address the above limitations, researchers have in recent years introduced deep learning methods to model and solve the single-pixel imaging problem. Higher-quality image restoration can be achieved with Transformer architectures built around the multi-head attention mechanism. However, networks represented by the Vision Transformer (ViT) usually contain a large number of parameters and a highly coupled global self-attention mechanism whose computational complexity grows quadratically with image size, so their computational and memory requirements are far higher than those of traditional structures such as convolutional neural networks (CNNs). This structural redundancy not only causes computational and memory overhead to rise sharply when processing high-resolution images, but also severely limits practical use on mobile platforms, unmanned systems, and portable devices, especially when deploying to edge devices or embedded systems.

[0008] To address the aforementioned issues, existing research has attempted improvements from multiple angles. For example, some methods reduce model computation by constructing lightweight CNN architectures (such as MobileNet and ShuffleNet); others simplify the computational structure of Transformers through strategies like sparse attention and adaptive windowing, improving their adaptability to high-resolution images. However, these methods still suffer from limitations such as insufficient model expressive power, complex network structure design, or limited support for high-quality reconstruction, making it difficult to meet practical engineering needs.

[0009] Therefore, lightweight design based on the Transformer architecture is an important research direction. What is needed is a lightweight deep neural network for single-pixel image reconstruction that has low computational complexity, is deployable, and reconstructs images with high accuracy, so as to realize an efficient single-pixel imaging method for practical application scenarios while ensuring the quality of computational image reconstruction.

Summary of the Invention

[0010] The present invention aims to solve at least one of the technical problems existing in the related art. To this end, the present invention provides a lightweight single-pixel imaging method based on frequency-spatial domain modulation attention.

[0011] A lightweight single-pixel imaging method based on frequency-spatial domain modulation attention, comprising the following steps:

[0012] Acquire single-pixel compressed measurement data of the target scene;

[0013] The single-pixel compressed measurement data is inversely mapped using the transpose mapping matrix obtained by transposing the measurement matrix to generate an initial reconstruction estimate of the single-pixel image;

[0014] The initial reconstruction estimate of the single-pixel image is input into the dual-domain collaborative attention deep reconstruction network; the dual-domain collaborative attention deep reconstruction network contains several levels of cascaded dual-domain collaborative attention modules;

[0015] In the dual-domain collaborative attention module, frequency domain feature extraction and spatial domain feature extraction are performed in parallel on the input feature information, and the frequency domain features and spatial domain features are fused; wherein, the frequency domain feature extraction is used to capture global structural information, and the spatial domain feature extraction utilizes dynamic convolution to reconstruct local detail information;

[0016] The internal structure of the dual-domain collaborative attention module includes, in sequence: a residual-guided update module, a multi-head spectral calibration module, and a multi-scale-depth dynamic convolution fusion module.

[0017] The residual-guided update module computes the measurement residual between the single-pixel compressed measurement data and the current reconstruction estimate at each level of the dual-domain collaborative attention module, and applies a gradient-update correction to the feature tensor based on that residual;

[0018] The multi-head spectral calibration module modulates the features output by the residual-guided update module with dual-domain feature information. Through dual-domain feature information fusion and interaction, it realizes long-distance structural modeling of input image information. The dual-domain feature information includes frequency domain features and spatial domain features.

[0019] The multi-scale-depth dynamic convolution fusion module performs multi-scale dynamic convolution in the spatial domain on the features output by the multi-head spectral calibration module to express feature information at multiple scales, enabling the network to recover local detail such as high-frequency textures and edges;

[0020] The output of the multi-head spectral calibration module is fused with the output of the multi-scale-depth dynamic convolution fusion module to serve as the output of the current dual-domain collaborative attention module.

[0021] The outputs of the dual-domain collaborative attention deep reconstruction network are integrated to obtain the final single-pixel reconstructed image.

[0022] Furthermore, the method of inversely mapping the single-pixel compressed measurement data using the transpose mapping matrix obtained by transposing the measurement matrix to generate an initial reconstruction estimate of the single-pixel image is specifically calculated using the following formula:

[0023]

[0024] where the result is the initial reconstruction estimate of the single-pixel image; the input is the single-pixel compressed measurement data acquired by the single-pixel imaging system, in two-dimensional form; the two preset measurement matrices are applied by left-multiplication and right-multiplication, respectively; and their transposes are the transpose mapping matrices applied on the left and on the right, respectively.

[0025] The single-pixel compressed measurement data is calculated from the single-pixel original image by left-multiplying and right-multiplying it with the measurement matrices.
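As an illustrative sketch only (not the patent's implementation), the measurement and transpose back-projection described above can be written in NumPy. The names `Phi_L`, `Phi_R`, `X`, `Y`, `X0` and all sizes are assumptions introduced here, because the original symbols are not reproduced in this text; random matrices stand in for the learnable measurement matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

H = W = 32   # image height and width (assumed values)
m = 16       # measurements kept per direction (assumed value)

X = rng.random((H, W))                             # stand-in original image
Phi_L = rng.standard_normal((m, H)) / np.sqrt(H)   # left measurement matrix
Phi_R = rng.standard_normal((W, m)) / np.sqrt(W)   # right measurement matrix

# Forward model: left- and right-multiply the image by the measurement matrices.
Y = Phi_L @ X @ Phi_R            # 2-D compressed measurement, shape (m, m)

# Inverse mapping: left- and right-multiply by the transposed matrices.
X0 = Phi_L.T @ Y @ Phi_R.T       # initial reconstruction estimate, shape (H, W)

print(Y.shape, X0.shape)
```

The back-projection is cheap (two matrix products) and only gives a rough estimate; the deep reconstruction network is what refines it.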

[0026] The training process of the dual-domain collaborative attention deep reconstruction network adopts a hybrid loss function, which includes pixel-level reconstruction error, structural similarity loss and orthogonality constraint loss.

[0027] The orthogonality constraint loss is used to constrain the structural orthogonality of the measurement matrix.
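A minimal sketch of such a hybrid loss follows. The source names the three terms but not their exact forms or weights, so everything concrete here is an assumption: mean squared error for the pixel term, a single-window SSIM, a squared Frobenius penalty for orthogonality, and the weights `w_ssim` and `w_orth`.

```python
import numpy as np

def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM over the whole image (inputs scaled to [0, 1])."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def orthogonality_loss(phi):
    """Squared Frobenius distance between phi @ phi.T and the identity."""
    gram = phi @ phi.T
    return np.sum((gram - np.eye(phi.shape[0])) ** 2)

def hybrid_loss(x_rec, x_true, phi_l, phi_r, w_ssim=0.1, w_orth=0.01):
    """Pixel error + SSIM loss + orthogonality penalty (weights are assumed)."""
    pixel = np.mean((x_rec - x_true) ** 2)
    ssim_term = 1.0 - global_ssim(x_rec, x_true)
    # phi_l is (m, H) and phi_r is (W, m); both are penalized as m-row matrices.
    orth = orthogonality_loss(phi_l) + orthogonality_loss(phi_r.T)
    return pixel + w_ssim * ssim_term + w_orth * orth

rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.standard_normal((16, 16)))
phi_l, phi_r = q[:8, :], q[:, :8]        # orthonormal rows / columns
img = rng.random((16, 16))
print(hybrid_loss(img, img, phi_l, phi_r))
```

With a perfect reconstruction and orthonormal measurement matrices, all three terms vanish, which is the optimum the loss is designed to pull training toward.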

[0028] Furthermore, the dual-domain collaborative attention deep reconstruction network includes an input preprocessing layer connected in sequence, a chain structure composed of several levels of the dual-domain collaborative attention modules connected in series, and an output reconstruction layer;

[0029] The input preprocessing layer performs convolution processing on the initial reconstruction estimate of the single-pixel image and outputs initial features;

[0030] In the chain structure, the output of the previous dual-domain collaborative attention module serves as the input of the next dual-domain collaborative attention module.

[0031] The output reconstruction layer processes the output of the last dual-domain collaborative attention module and superimposes it with the initial reconstruction estimate of the single-pixel image via a skip connection to generate the final single-pixel reconstructed image.

[0032] Furthermore, the specific processing steps of the residual-guided update module include:

[0033] Calculate the measurement residual between the observed value corresponding to the input feature of the current dual-domain collaborative attention module and the single-pixel compressed measurement data;

[0034] The input features of the current dual-domain collaborative attention module are the reconstructed estimates calculated and output by the previous dual-domain collaborative attention module;

[0035] The measurement residual is mapped back to the feature space by multiplying it by the transpose mapping matrix on the left and on the right, to obtain the residual back-projection signal;

[0036] Based on the learnable gradient descent step size parameter, the residual back projection signal is used to perform gradient update on the current input features, and the corrected feature information tensor is output.
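The update above can be sketched as one gradient step on the data-fidelity term. This is a simplified illustration under stated assumptions: it operates on a single-channel estimate rather than a feature tensor, the names `rgu_step`, `phi_l`, `phi_r` and `step` are introduced here, and the step size is a fixed number rather than a learned parameter.

```python
import numpy as np

def rgu_step(x, y, phi_l, phi_r, step):
    """One residual-guided update on a single-channel estimate x."""
    residual = phi_l @ x @ phi_r - y            # measurement residual
    back_proj = phi_l.T @ residual @ phi_r.T    # residual back-projection
    # Gradient step on 0.5 * ||phi_l @ x @ phi_r - y||^2.
    return x - step * back_proj

rng = np.random.default_rng(0)
H = W = 16
m = 8
q1, _ = np.linalg.qr(rng.standard_normal((H, H)))
q2, _ = np.linalg.qr(rng.standard_normal((W, W)))
phi_l = q1[:m, :]    # orthonormal rows (assumed, for a clean demonstration)
phi_r = q2[:, :m]    # orthonormal columns

x_true = rng.random((H, W))
y = phi_l @ x_true @ phi_r
x = np.zeros((H, W))
x_new = rgu_step(x, y, phi_l, phi_r, step=1.0)

before = np.linalg.norm(phi_l @ x @ phi_r - y)
after = np.linalg.norm(phi_l @ x_new @ phi_r - y)
print(before, after)
```

With orthonormal measurement matrices, a unit step makes the estimate exactly consistent with the measurements; with general matrices a smaller step only reduces the residual, which is why a learnable step size is useful.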

[0037] Furthermore, the multi-head spectral calibration module includes parallel spatial domain branches and frequency domain branches;

[0038] The processing steps of the frequency domain branch include: the features output by the residual-guided update module are layer-normalized, transformed to the frequency domain by a two-dimensional fast Fourier transform, processed there to extract feature information, and then transformed back to the spatial domain by the inverse fast Fourier transform to obtain the frequency domain reprojection features.

[0039] The frequency domain reprojection features and spatial domain features are divided into multiple non-overlapping image blocks. Element-level local linear attention modulation is performed at the image block level, and local linear attention is calculated.

[0040] The calculated local linear attention is concatenated within each block and along the channel direction, and position encoding is introduced to obtain the output of the multi-head spectral calibration module.
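The frequency domain branch can be sketched as follows. The source does not specify how features are extracted in the frequency domain, so the element-wise spectral weight `freq_weight` is an assumption, as are the function names and the choice to normalize over the channel axis.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each spatial position over the channel axis (x: C x H x W)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def frequency_branch(x, freq_weight):
    """LayerNorm -> 2-D FFT -> element-wise spectral weighting -> inverse FFT."""
    spec = np.fft.rfft2(layer_norm(x), axes=(-2, -1))   # C x H x (W//2 + 1), complex
    spec = spec * freq_weight                           # hypothetical learnable filter
    return np.fft.irfft2(spec, s=x.shape[-2:], axes=(-2, -1))

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 8))
weight = np.ones((4, 8, 5))             # all-pass filter: output equals the normalized input
reproj = frequency_branch(feat, weight)
print(reproj.shape)
```

Each frequency coefficient depends on every pixel, so a single element-wise product in the frequency domain mixes information across the whole image.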

[0041] Furthermore, the specific process of the element-level local linear attention modulation is as follows:

[0042] The numerical vectors obtained by convolutional coding of the frequency domain reprojection features and spatial domain branches are reshaped and divided into non-overlapping image blocks.

[0043] Element-wise multiplication is performed on the features in each image patch to compute local linear attention;

[0044] The element-wise product is the Hadamard product.
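A minimal sketch of the patch partition and Hadamard modulation, assuming channel-first arrays and a patch size `p` that divides the height and width. The helper names are introduced here for illustration, and the convolutional encoding and position encoding steps are omitted.

```python
import numpy as np

def to_patches(x, p):
    """Split C x H x W into non-overlapping p x p patches: (nH*nW) x C x p x p."""
    c, h, w = x.shape
    x = x.reshape(c, h // p, p, w // p, p)
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, c, p, p)

def from_patches(patches, h, w):
    """Inverse of to_patches: reassemble patches into a C x H x W array."""
    _, c, p, _ = patches.shape
    x = patches.reshape(h // p, w // p, c, p, p)
    return x.transpose(2, 0, 3, 1, 4).reshape(c, h, w)

def patch_hadamard_modulation(freq_feat, spatial_feat, p):
    """Element-wise (Hadamard) product of the two branches, patch by patch."""
    _, h, w = freq_feat.shape
    out = to_patches(freq_feat, p) * to_patches(spatial_feat, p)
    return from_patches(out, h, w)

rng = np.random.default_rng(0)
f = rng.standard_normal((2, 8, 8))
s = rng.standard_normal((2, 8, 8))
mod = patch_hadamard_modulation(f, s, p=4)
print(mod.shape)
```

Because the product is element-wise, no query-key similarity matrix is formed, so the cost grows linearly with the number of pixels.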

[0045] Furthermore, the specific processing steps of the multi-scale-depth dynamic convolutional fusion module include:

[0046] The input features are divided into multiple sub-feature groups along the channel direction;

[0047] For each sub-feature group, channel description vectors are generated using global average pooling, and corresponding dynamic convolutional kernel weights are generated through a dynamic convolutional kernel generation network.

[0048] By utilizing the generated dynamic convolution kernel weights, dynamic depth convolutions of different scales are applied to the corresponding sub-feature groups to obtain multi-path dynamic convolution output features;

[0049] The multi-path dynamic convolution output features are concatenated and shuffled along the channel dimension, and the module output is obtained after convolutional fusion.

[0050] Furthermore, the number of the multiple sub-feature groups is 3, and the kernel sizes of the corresponding dynamic depthwise convolutions at different scales are respectively... , and ;

[0051] The dynamic convolutional kernel generation network includes a linear transformation layer activated by GELU and a linear transformation layer activated by Sigmoid.
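The steps of this module can be sketched in NumPy as below. The kernel sizes 3, 5 and 7, the hidden width, the 1×1 fusion convolution and all function names are placeholder assumptions chosen for illustration; the source's kernel sizes are not reproduced in this text.

```python
import numpy as np

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_generator(rng, channels, k, hidden=8):
    """Random weights for a two-layer kernel generator (stand-in for learned ones)."""
    return {"w1": rng.standard_normal((hidden, channels)) * 0.1,
            "w2": rng.standard_normal((channels * k * k, hidden)) * 0.1,
            "k": k}

def generate_kernels(params, group):
    """Global average pooling -> Linear + GELU -> Linear + Sigmoid -> kernels."""
    desc = group.mean(axis=(1, 2))                    # channel description vector
    hidden = gelu(params["w1"] @ desc)
    weights = sigmoid(params["w2"] @ hidden)
    return weights.reshape(group.shape[0], params["k"], params["k"])

def depthwise_conv(x, kernels):
    """Same-padded depthwise convolution; x: C x H x W, kernels: C x k x k."""
    _, h, w = x.shape
    k = kernels.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            out += kernels[:, i, j, None, None] * xp[:, i:i + h, j:j + w]
    return out

def channel_shuffle(x, groups):
    """Interleave channels so that each group contributes to every neighborhood."""
    c, h, w = x.shape
    return x.reshape(groups, c // groups, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)

def mddc(x, generators, w_fuse):
    groups = np.split(x, len(generators), axis=0)     # split along channels
    outs = [depthwise_conv(g, generate_kernels(p, g)) for g, p in zip(groups, generators)]
    y = channel_shuffle(np.concatenate(outs, axis=0), len(generators))
    return np.einsum("oc,chw->ohw", w_fuse, y)        # 1x1 convolution fusion

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8, 8))
generators = [make_generator(rng, 2, k) for k in (3, 5, 7)]   # assumed kernel sizes
w_fuse = rng.standard_normal((6, 6)) * 0.1
out = mddc(x, generators, w_fuse)
print(out.shape)
```

The kernels are recomputed from each input's pooled statistics, which is what makes the convolution "dynamic" compared with fixed learned kernels.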

[0052] Furthermore, the processing steps of the output reconstruction layer include:

[0053] Perform convolution operations on the fused features to expand the channels;

[0054] After the activation function is applied, the convolution operation is performed again to convert the number of channels to the preset number of color channels.

[0055] The calculation results are superimposed with the initial reconstruction estimate of the single-pixel image through a skip connection, and the final single-pixel reconstructed image is obtained through a dimensionality compression operation.

[0056] The above-described one or more technical solutions in the embodiments of the present invention have at least one of the following technical effects:

[0057] Significantly reduces computational complexity and parameter count, enabling lightweight deployment and eliminating the quadratic growth of computational complexity found in traditional Transformers. Instead of a global self-attention mechanism, this network uses a multi-head spectral calibration module (MSC). By using the Fast Fourier Transform (FFT) and element-wise multiplication for feature modulation in the frequency domain, the complexity of global information interaction is reduced to the linear-logarithmic level, O(N log N). This significantly reduces the parameter storage size and computational resource consumption when processing high-resolution images, making the network more suitable for integration into resource-constrained edge devices or embedded systems.
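To give a feel for the gap between quadratic and linear-logarithmic growth, the following back-of-the-envelope sketch compares the two operation counts. It assumes one token per pixel and ignores constant factors and channel counts, so the numbers indicate scaling only, not measured cost.

```python
import math

def attention_pairs(n_tokens):
    """Pairwise interactions in global self-attention: n^2."""
    return n_tokens ** 2

def fft_ops(n_tokens):
    """Order-of-magnitude cost of an FFT over n points: n * log2(n)."""
    return n_tokens * math.log2(n_tokens)

for side in (128, 256, 512):
    n = side * side
    ratio = attention_pairs(n) / fft_ops(n)
    print(side, attention_pairs(n), round(fft_ops(n)), round(ratio))
```

The ratio itself grows with image size, which is why the difference matters most at the high resolutions the method targets.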

[0058] To significantly improve reconstruction accuracy while considering both global structure and local details, a dual-domain collaborative attention mechanism is proposed. Spatial and frequency domain information are addressed in parallel within the multi-head spectral calibration module (MSC). By utilizing Fast Fourier Transform and element-level local linear attention modulation in the frequency domain, the MSC directly achieves information fusion and interaction between frequency and spatial domain features, successfully realizing global modeling of the overall image structure and long-range information. Based on the global modeling of dual-domain information completed by the MSC, a multi-scale-depth dynamic convolutional fusion module (MDDC) is connected in series in the spatial domain. This module generates dynamic convolutional kernels to adaptively extract features based on their spatial location, specifically designed for accurately processing high-frequency texture details and edge information. Experiments show that the proposed "first dual-domain global fusion, then spatial local refinement" approach outperforms existing mainstream methods in peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) at different sampling rates, effectively solving the problems of blurred details and severe artifacts in traditional methods.

[0059] To enhance the robustness of the reconstruction process, physical model constraints are introduced by embedding a Residual Guided Update (RGU) module in each level of the deep network. This module uses the physical measurement model to calculate the observation residuals and back-projects them into the feature space to correct the intermediate features of the network. This hybrid "physics-driven + data-driven" approach keeps the reconstruction results consistent with the actual observation data, improving the network's adaptability and stability under different sampling rates and noise environments.

[0060] Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Description of the Drawings

[0061] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0062] Figure 1 is a flowchart of the lightweight single-pixel imaging method based on frequency-spatial domain modulation attention according to the present invention.

[0063] Figure 2 is a structural diagram of the dual-domain collaborative attention module (DSAB) of the present invention.

[0064] Figure 3 is a structural diagram of the multi-head spectral calibration module (MSC) of the present invention.

[0065] Figure 4 is a structural diagram of the multi-scale-depth dynamic convolution fusion module (MDDC) of the present invention.

[0066] Figure 5 is an example of single-pixel reconstructed images obtained with the method of this invention on the Set11 dataset at different sampling rates.

[0067] Figure 6 is a set of images comparing the method of the present invention with classic single-pixel reconstruction methods or networks on the Set11 dataset at a sampling rate of 10%.

Detailed Implementation

[0068] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. Based on the embodiments of this invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this invention. The following embodiments are used to illustrate this invention but cannot be used to limit the scope of this invention.

[0069] The technical solution of the present invention is described below with reference to Figures 1 to 4. Figure 1 shows the overall algorithm flow of the technical solution, and Figures 2 to 4 show the structure of its main functional modules.

[0070] As shown in Figure 1, this embodiment of the invention provides a lightweight single-pixel imaging method based on frequency-spatial domain modulation attention, which mainly includes a single-pixel image data generator and a computational reconstruction generator.

[0071] The single-pixel image data generator uses the Kronecker product single-pixel imaging (Kronecker SPI) model to produce simulated single-pixel compressed measurement data, which is used for subsequent network model training.

[0072] The computational reconstruction generator mainly consists of two parts: the Inverse Mapping Preliminary Inversion Estimation Module (IMPIE) and the Dual-Domain Collaborative Attention Deep Reconstruction Network.

[0073] The Inverse Mapping Preliminary Inversion Estimation Module (IMPIE) uses the transposes of the two learnable measurement matrices, i.e., the transpose mapping matrices, to perform the inverse mapping: the measurement data is left-multiplied by one transpose mapping matrix and right-multiplied by the other. This yields the preliminary initial reconstruction estimate of the single-pixel image.

[0074] The dual-domain collaborative attention deep reconstruction network achieves global modeling of the overall image structure mainly through the interaction of dual-domain feature information in the frequency and spatial domains. It replaces the query-key-value computation of a conventional Transformer with local large-kernel convolution, which significantly reduces the computational complexity and parameter storage of the network. In addition, multi-scale depthwise dynamic convolution provides feature extraction that adapts to spatial location, significantly enhancing the reconstruction of texture and detail.

[0075] The method of this invention significantly reduces the parameter storage size of the network model without affecting the overall image reconstruction accuracy, thus achieving a lightweight network structure that is more suitable for integration into edge devices or real-time imaging scenarios.

[0076] The specific steps are as follows:

[0077] Step (1): Construct a single-pixel image data generator based on the Kronecker Product (Kronecker SPI) model. This module can generate simulated single-pixel data based on a general image dataset, which is used to simulate the encoding process of single-pixel imaging.

[0078] The single-pixel imaging method is mainly divided into two parts: a single-pixel image data generator and a computational reconstruction generator. The computational reconstruction generator mainly includes two parts: an inverse mapping preliminary inversion estimation module (IMPIE) and a dual-domain collaborative attention deep reconstruction network.

[0079] The image data generator is primarily based on the Kronecker product single-pixel imaging (Kronecker SPI) model and encapsulates the physical model as a computational module. First, the large 2D images of varying sizes in the original public dataset are randomly cropped to a specified size in each iteration of each training epoch, giving the original input image that is to be reconstructed by single-pixel imaging; this serves as the input image data for the subsequent networks. The height and width of the cropped image are fixed in advance, and N denotes the total number of pixels of the original input image, which is also the total number of data points in the reconstructed image.

[0080] Based on the Kronecker SPI model, two measurement matrices that simulate compressed sampling are constructed. They are applied to the original input image to perform independent compression modulation along the horizontal and vertical directions, respectively. Both measurement matrices are learnable during training and are updated through backpropagation of the loss function, so that the measurement matrices and the reconstruction results are optimized jointly. M denotes the total number of samples in the final single-pixel acquisition system, i.e., the number of data points after compressed sampling measurement; the remaining dimension symbols denote the height or width of the matrices.

[0081] The measurement matrices realize sparse observation at a given compressed sampling rate. The value of M is determined by the sampling rate (SR) and the total number of pixels N of the input single-pixel original image, and several preset sampling rates can be selected.

[0082] For example, once M has been calculated for a given sampling rate, the original input image is left-multiplied by one measurement matrix and right-multiplied by the other to obtain the simulated single-pixel image data. The calculation formula is:

[0083]
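The bookkeeping for the data generator can be sketched as follows. The crop size 256, the sampling rate 0.25, and the assumption that the M measurements are arranged as a square `m_side × m_side` array are illustrative choices made here, not values taken from the source.

```python
import math
import numpy as np

def measurement_sizes(height, width, sampling_rate):
    """Return N, M and the per-direction measurement count."""
    n = height * width
    m_total = round(sampling_rate * n)     # M = SR * N
    m_side = math.isqrt(m_total)           # assumed: M arranged as m_side x m_side
    return n, m_total, m_side

def random_crop(image, size, rng):
    """Crop a random size x size window from a 2-D image."""
    h, w = image.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return image[top:top + size, left:left + size]

rng = np.random.default_rng(0)
image = rng.random((300, 400))             # stand-in dataset image
patch = random_crop(image, 256, rng)
n, m_total, m_side = measurement_sizes(256, 256, 0.25)
print(patch.shape, n, m_total, m_side)
```

Under these assumptions each measurement matrix would be `m_side × 256` (or its transpose), so the two matrices together hold far fewer entries than a single dense M × N matrix.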

[0084] It should also be noted that, when data acquired by an actual single-pixel acquisition system is used, the measurement matrix of the actual system can adopt a specific structural design, such as a Hadamard matrix or one of its variants (e.g., cake-cutting Hadamard matrices), a random Gaussian matrix, or a random Bernoulli matrix, to detect and sample the target and obtain the actual compressed measurement signal. This signal is a one-dimensional undersampled real-valued vector. M still denotes the total number of samples in the final single-pixel acquisition system, i.e., the number of data points after compressed sampling measurement, and N still denotes the total number of pixels in the single-pixel image, i.e., the total number of data points in the reconstructed image. The signal is the raw data acquired by the actual single-pixel acquisition system.

[0085] Before entering the single-pixel image reconstruction network, the data is reshaped into a two-dimensional matrix so that it matches the input format used in the training phase. It is then fed into the proposed single-pixel image reconstruction network for reconstruction; as in training, reconstruction performance is then tested at different sampling rates.

[0086] Here M is the number of data points after compressed sampling measurement and N is the total number of data points of the reconstructed image.
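The reshape step can be illustrated with the standard Kronecker identity that links a one-dimensional measurement to the two-sided matrix form. This sketch assumes the one-dimensional measurement matrix has Kronecker structure and uses row-major flattening; the names and sizes are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
H = W = 8
m = 4
X = rng.random((H, W))
Phi_L = rng.standard_normal((m, H))
Phi_R = rng.standard_normal((W, m))

# Equivalent 1-D measurement matrix, shape (m*m) x (H*W).
Phi = np.kron(Phi_L, Phi_R.T)

y_vec = Phi @ X.reshape(-1)    # 1-D undersampled vector of length M = m*m
Y = y_vec.reshape(m, m)        # 2-D form expected by the reconstruction network

print(y_vec.shape, Y.shape)
```

For a measurement matrix without this structure (for example a general random Gaussian matrix), the reshape is still possible but the two-sided form no longer describes the physics exactly.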

[0087] Step (2): Construct a computational reconstruction generator that comprises an inverse mapping preliminary inversion estimation module (IMPIE) based on the transpose mapping matrices obtained by transposing the measurement matrices, multi-level cascaded dual-domain collaborative attention modules (DSAB) based on the fusion and interaction of frequency domain and spatial domain feature information, multiple reconstruction convolutional layers, and multi-level skip connections for memory enhancement, and that outputs the computationally reconstructed single-pixel image.

[0088] First, the single-pixel compressed measurement data is input to the preliminary inversion estimation module, which performs the inverse mapping using the transposes of the two measurement matrices learned through training iterations: the data is left-multiplied by one transpose mapping matrix and right-multiplied by the other. This generates a preliminary inversion estimate of the input image, i.e., the initial reconstruction estimate of the single-pixel image. The calculation formula is:

[0089]

[0090] Here, both measurement matrices are defined over the real number field, that is, all of their elements are real numbers. The dimensions of the two measurement matrices are determined by the dimensions of the single-pixel compressed measurement data and of the original two-dimensional input image.

[0091] The acquired measurement data are rearranged into a two-dimensional matrix, which then takes part in the single-pixel reconstruction.

[0092] Where N represents the total number of pixels in a single-pixel image, that is, the total number of data points in the reconstructed image; M represents the total number of sampled data points in the final single-pixel acquisition system, that is, the number of data points after compressed sampling measurement.

[0093] This step provides a rough and reasonable starting point for the subsequent deep reconstruction network. The output is then fed into the deep reconstruction network.
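The inverse mapping described above can be sketched in NumPy. This is a minimal illustration under stated assumptions: the names `a_left` and `a_right` for the two measurement matrices, the two-sided forward model Y = A_L X A_R, and the toy sizes are all assumptions made for the example. In the patent the matrices are learned during training, whereas here they are fixed orthonormal matrices.

```python
import numpy as np


def measure(x, a_left, a_right):
    """Assumed forward model: two-sided compressed measurement Y = A_L @ X @ A_R."""
    return a_left @ x @ a_right


def initial_estimate(y, a_left, a_right):
    """Inverse mapping with the transposed matrices: X0 = A_L^T @ Y @ A_R^T."""
    return a_left.T @ y @ a_right.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h = w = 8   # toy image size (illustrative)
    m = 4       # measurements per side, so Y is m x m (illustrative)
    # Rows of an orthogonal matrix give A_L @ A_L.T = I.
    q, _ = np.linalg.qr(rng.standard_normal((h, h)))
    a_left = q[:m, :]        # shape (m, h)
    a_right = q[:m, :].T     # shape (w, m)
    x = rng.standard_normal((h, w))
    y = measure(x, a_left, a_right)
    x0 = initial_estimate(y, a_left, a_right)
    print(y.shape, x0.shape)
```

With orthonormal measurement matrices, re-measuring the initial estimate reproduces the measurement data exactly, which is one reason the estimate is a reasonable starting point for the deep reconstruction network.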

[0094] The deep reconstruction network consists of an input preprocessing layer, several cascaded dual-domain collaborative attention modules (DSAB), and an output reconstruction layer.

[0095] The initial reconstruction estimate of the single-pixel image calculated by the inverse mapping preliminary inversion estimation module (IMPIE) first passes through an input preprocessing layer (3×3 convolution), whose output serves as the input to the chain of cascaded dual-domain collaborative attention modules (DSAB).

[0096] The dual-domain collaborative attention module (DSAB) chain consists of cascaded units with identical structures. Experiments comparing reconstruction accuracy and efficiency showed that a larger number of levels gives better reconstruction performance but also increases the number of parameters. The number of DSAB levels is therefore adjustable; the number of levels used in this implementation represents the best trade-off found between the two.

[0097] As shown in Figure 2, the internal structure of the DSAB at each level includes, in sequence: a residual guided update module (RGU), a multi-head spectral calibration module (MSC), and a multi-scale-depth dynamic convolution fusion module (MDDC). The prior guidance information output by the RGU is added, through a residual connection, to the result computed by the MSC and the MDDC, forming the output of the module. The output of the DSAB at the previous level serves as the input to the DSAB at the current level. After the features have been processed by all DSAB levels in sequence, the output is passed through two 3×3 convolutional layers to form a refined reconstruction result. Finally, the output reconstruction layer fuses the refined reconstruction result with the initial reconstruction estimate of the single-pixel image from the preliminary inversion estimation module through a skip connection, generating the single-pixel reconstructed image that is the final output of the network.

[0098] The specific steps of the residual guided update module (RGU) are as follows. In the DSAB at the current level, the current reconstruction estimate output by the DSAB at the previous level serves as the input. It is described in the tensor notation conventional in deep learning, by the height, width, and number of channels of the current feature map.

[0099] For example, in a specific embodiment, the current reconstruction estimate has a given height, width, and number of channels. The error residual between the single-pixel compressed measurement data and the current reconstruction estimate from the previous-level DSAB is calculated. The residual signal is then mapped back to the feature space by left-multiplying it by the left transpose mapping matrix and right-multiplying it by the right transpose mapping matrix (residual back-projection). Using a learnable gradient descent step size parameter, a tensor gradient update is performed on the basis of this measurement residual to obtain an iteratively corrected image reconstruction estimate, which provides prior guidance information to the subsequent modules of the reconstruction network (the multi-head spectral calibration module (MSC) and the multi-scale-depth dynamic convolution fusion module (MDDC)). The prior guidance information tensor likewise has a height, width, and number of channels specified in the embodiment. In the DSAB at the current level, the prior guidance information output by the RGU is calculated as:

[0100]

[0101] Here, the learnable gradient descent step size parameter is determined by a specific gradient factor, which is initialized as a trainable single-valued tensor. The update corrects the current reconstruction estimate in the direction that better matches the observation data, ensuring data consistency. The current reconstruction estimate is the output of the DSAB at the previous level; the two transpose mapping matrices are the left-multiplication and right-multiplication transpose mapping matrices, respectively; and the observation data are the single-pixel compressed measurement data.
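The residual-guided update can be sketched as a single data-consistency gradient step. The sketch assumes a two-sided measurement model Y = A_L X A_R acting on one two-dimensional estimate, and treats the step size as a plain float; in the patent the step size is learnable and the tensors carry a channel dimension. The names `residual_guided_update`, `a_left`, `a_right`, and `step` are hypothetical.

```python
import numpy as np


def residual_guided_update(x_prev, y, a_left, a_right, step):
    """One gradient step toward data consistency (assumed form).

    residual        = A_L @ X @ A_R - Y          measurement-domain error
    back_projection = A_L^T @ residual @ A_R^T   residual mapped back to image space
    result          = X - step * back_projection
    """
    residual = a_left @ x_prev @ a_right - y
    back_projection = a_left.T @ residual @ a_right.T
    return x_prev - step * back_projection


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
    a_left, a_right = q[:4, :], q[:4, :].T      # orthonormal toy matrices
    y = a_left @ rng.standard_normal((8, 8)) @ a_right
    x = np.zeros((8, 8))
    for _ in range(5):
        x = residual_guided_update(x, y, a_left, a_right, step=0.5)
        print(np.linalg.norm(a_left @ x @ a_right - y))
```

With orthonormal matrices, each step with `step=0.5` halves the measurement residual, which illustrates how the update pulls the estimate toward agreement with the observation data before the attention modules refine it.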

[0102] The following section details the specific steps for implementing the functions of each module.

[0103] As shown in Figure 3, the specific steps of the multi-head spectral calibration module (MSC) are as follows. The prior guidance information output by the residual guided update module (RGU) is layer-normalized and then fed into the spatial domain branch and the frequency domain branch, respectively.

[0104] The spatial domain branch includes a 1×1 convolutional layer that performs feature modulation. The modulated feature information is split along the channel direction according to a preset number of heads. Then, following a preset image partitioning strategy, the feature information in each head is divided into non-overlapping sub-blocks. The number of heads and the sub-block size can be set to empirical values determined through comparative experiments; the values used in this embodiment are one suitable choice.

[0105] The frequency domain branch includes a two-dimensional discrete real fast Fourier transform unit, a single-layer 1×1 convolution that extracts feature information, and a two-dimensional discrete inverse real fast Fourier transform unit, which together yield the frequency domain reprojection features. The process is as follows:

[0106]

[0107] Here, the operators are the two-dimensional discrete real fast Fourier transform, the two-dimensional discrete inverse real fast Fourier transform, the 1×1 convolution operation, and the layer normalization operation, and the input is the prior guidance information.
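The data flow of the frequency domain branch (layer normalization, real FFT, 1×1 convolution, inverse real FFT) can be sketched with `numpy.fft`. This is an illustration only: the feature layout (H, W, C), the names `layer_norm`, `frequency_branch`, and `weight`, and the choice of a real-valued channel-mixing matrix as the 1×1 convolution on the complex spectrum are assumptions; the patent text does not state how the convolution handles the complex values.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalise each spatial position over the channel axis (last axis)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def frequency_branch(r, weight):
    """irfft2(conv1x1(rfft2(layer_norm(r)))) for features r of shape (H, W, C).

    `weight` is a real (C_out, C_in) matrix standing in for the 1x1 convolution
    (assumption: the same real weights act on real and imaginary parts).
    """
    h, w, _ = r.shape
    spectrum = np.fft.rfft2(layer_norm(r), axes=(0, 1))   # complex, (H, W//2 + 1, C)
    mixed = spectrum @ weight.T                            # per-frequency channel mixing
    return np.fft.irfft2(mixed, s=(h, w), axes=(0, 1))     # real, (H, W, C_out)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.standard_normal((8, 8, 4))
    out = frequency_branch(features, np.eye(4))
    print(out.shape)
```

Note that a purely linear, real-valued channel mixing commutes with the Fourier transform, so this sketch only shows the order of operations and the tensor shapes; a trained branch would differ in how it weights the spectrum.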

[0108] The obtained frequency domain reprojection features are split into heads and partitioned into blocks in the same way as in the spatial domain branch, yielding non-overlapping image patches. The size of each image patch can be set per embodiment, and the product of the height and width of the feature map is numerically equal to the total number of pixels of the original input image.

[0109] The above parameters can be set to empirical values determined through comparative experiments; the values used in this embodiment are one suitable choice. A large-kernel attention mechanism based on computationally simple element-wise multiplication is used in place of the computationally expensive multi-head self-attention mechanism. Drawing on the design of window-based multi-head attention, the block features of the frequency domain reprojection features and the block features of the spatial domain features undergo element-wise local linear attention modulation at the image patch level, yielding the local linear attention.

[0110] The specific steps of the patch-level local element-wise linear attention are as follows. The tensor dimensions of the frequency domain reprojection features and of the convolution-encoded spatial domain features are reshaped so that each is described by its height, width, and number of channels;

[0111] In addition, Head denotes the number of attention heads, and C denotes the number of channels, which is 64 in this embodiment. The above features are divided into non-overlapping image patches of size P×P, where P can be set per embodiment.

[0112] It should be noted that the blocks obtained from the frequency domain reprojection features A and from the convolution-encoded spatial domain features V all have the same size, P×P, which is half the original size along each side, that is, 128×128 for a 256×256 input.

[0113] It is worth noting that the parameter "4" here (i.e., each side reduced by 1/2) is an empirical value and can be changed for performance optimization; it is one suitable value for this embodiment.

[0114] Then, within each image patch, the block features of the frequency domain reprojection features A are multiplied element-wise with the block features of the spatial domain features V to compute the local linear attention representation;
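The patch-level element-wise modulation can be sketched as a reshape into heads and non-overlapping P×P patches followed by a Hadamard product. The feature layout (H, W, C), the function names, and the toy values for the number of heads and the patch size are assumptions for illustration; only the symbols A, V, Head, C, and P come from the description.

```python
import numpy as np


def to_patches(x, heads, patch):
    """(H, W, C) -> (heads, n_patches, patch, patch, C // heads), non-overlapping."""
    h, w, c = x.shape
    assert c % heads == 0 and h % patch == 0 and w % patch == 0
    x = x.reshape(h // patch, patch, w // patch, patch, heads, c // heads)
    x = x.transpose(4, 0, 2, 1, 3, 5)   # heads, patch rows, patch cols, P, P, C/heads
    return x.reshape(heads, -1, patch, patch, c // heads)


def local_linear_attention(a, v, heads, patch):
    """Element-wise (Hadamard) product of matching patches of A and V."""
    return to_patches(a, heads, patch) * to_patches(v, heads, patch)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((8, 8, 4))   # frequency domain reprojection features A (toy size)
    v = rng.standard_normal((8, 8, 4))   # spatial domain features V (toy size)
    out = local_linear_attention(a, v, heads=2, patch=4)
    print(out.shape)                     # heads, patches, P, P, C/heads
```

Because the modulation is a plain element-wise product, its cost grows linearly with the number of feature elements, in contrast to the quadratic cost of query-key attention within a window.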

[0115] The local linear attention results are concatenated in sequence, within blocks and along the channel dimension, to form joint features. The joint features are then weighted using the Softmax normalized exponential function and, after a learnable convolutional position encoding (CPE) term is introduced, fed into a 1×1 convolutional layer. A residual skip connection then fuses in the prior guidance information, and the result is taken as the output of the multi-head spectral calibration module (MSC). The entire process can be represented as follows:

[0116]

[0117] Here, the Hadamard product denotes element-wise multiplication, and the remaining operators are the 1×1 convolution operation, the concatenation operation along the channel dimension, the Softmax normalized exponential function, and the learnable convolutional position encoding term. The final output features of the multi-head spectral calibration module (MSC) are described by their height, width, and number of channels.

[0118] As shown in Figure 4, the specific steps of the multi-scale-depth dynamic convolution fusion module (MDDC) are as follows. The output features of the multi-head spectral calibration module (MSC) serve as the input features of the MDDC. They are first preprocessed by a 1×1 convolutional layer followed by a batch normalization layer. The preprocessed features are retained as the shortcut branch of the module for the later long-range skip connection, in which feature information is superimposed and integrated.

[0119] This process applies the 1×1 convolution to the input features and then batch-normalizes the result.

[0120] Here, the two operators are the 1×1 convolution operation and the batch normalization operation.

[0121] In the other branch, which performs deep feature extraction, the input features of this module are normalized, expanded to the specified channel dimension by a 1×1 convolution, and then input into the channel partitioning unit. At this point the height and width of the features are unchanged, and the number of channels is three times that of the input features of this module. The channel partitioning unit divides the features along the channel direction into three sub-feature groups, all of which have the same height, width, and number of channels. Each sub-feature group undergoes global average pooling (GAP) to obtain a channel description vector. To generate the dynamic convolution kernels, these description vectors are input into a dynamic kernel generation network consisting of a GELU-activated linear transform layer followed by a Sigmoid-activated linear transform layer. A channel-based grouping strategy is employed in which all channels are divided into groups, and for each group a set of dynamic convolution kernel weights is generated at each spatial location. The convolution kernel size differs between the three branches, and the number of candidate kernels is set per embodiment.

[0122] Here, the kernel weights are indexed by the local spatial coordinates within each feature group and by the displacement coordinates within the local window.

[0123] Following the steps described above, dynamic depthwise convolution at a different scale is applied to each sub-feature group. For each position of each sub-feature group, a local window of the corresponding kernel size is extracted, and each element in the window is multiplied by its dynamic kernel weight. This produces three dynamic convolution output features, one per branch. Taking one branch as an example, the calculation process is as follows:

[0124]

[0125] Here, the input is the second of the sub-feature groups previously divided along the channel direction, and the output is the dynamic convolution output feature of this branch. The formula further involves the dynamic convolution operation performed on this feature path, the number of groups into which the channels are divided, the local spatial coordinates within each sub-feature group, the displacement coordinates within the local window, and the range of those displacement coordinates. Because the convolution kernel weights are generated dynamically, they perform a position-wise weighted aggregation of local spatial features, thereby effectively modeling local spatial correlations.

[0126] The three dynamic convolution output features obtained above are concatenated along the channel dimension and then shuffled along the channel dimension to enhance cross-channel interaction and suppress channel redundancy. A 1×1 convolution then further fuses the feature information from different receptive fields. Finally, the preprocessed features retained at the input of the multi-scale-depth dynamic convolution fusion module (MDDC) are added through the long-range skip connection and integrated into the final output features of the MDDC, which are described by their height, width, and number of channels. The calculation process is as follows:
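A position-dependent depthwise convolution of the kind described above can be sketched as follows. The sketch assumes channels-first features (C, H, W), zero padding, and a kernel tensor that already holds one k×k kernel per channel group and per spatial position; how those weights are produced by the kernel generation network is not modelled. The function name and argument layout are hypothetical.

```python
import numpy as np


def dynamic_depthwise_conv(x, kernels, groups):
    """Position-dependent depthwise convolution (illustrative form).

    x       : (C, H, W) sub-feature group
    kernels : (groups, H, W, k, k), one k x k kernel per group and spatial position
    Channels are split into `groups` contiguous blocks that share kernel weights.
    """
    c, h, w = x.shape
    g, kh, kw, k, _ = kernels.shape
    assert g == groups and (kh, kw) == (h, w) and c % groups == 0 and k % 2 == 1
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))   # zero padding keeps H x W
    out = np.zeros_like(x, dtype=float)
    per_group = c // groups
    for ch in range(c):
        weights = kernels[ch // per_group]             # (H, W, k, k) for this channel's group
        for du in range(k):                            # displacement inside the window
            for dv in range(k):
                out[ch] += weights[:, :, du, dv] * xp[ch, du:du + h, dv:dv + w]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 6, 6))
    kernels = rng.random((2, 6, 6, 3, 3))
    print(dynamic_depthwise_conv(x, kernels, groups=2).shape)
```

Running the same function with kernel sizes of different odd values on the three sub-feature groups would give the three multi-scale branch outputs; a kernel that is 1 at the window centre and 0 elsewhere leaves the input unchanged.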

[0127]

[0128] Here, the operators are the 1×1 convolution operation, the shuffling operation along the channel dimension, and the concatenation operation along the channel dimension, and the inputs are the three dynamic convolution output features generated by the three branches in the middle of the module.

[0129] Finally, the prior guidance information calculated by the residual guided update module (RGU) is added, through a long-range skip connection, to the result computed by the multi-head spectral calibration module (MSC) and the multi-scale-depth dynamic convolution fusion module (MDDC), and integrated into the final output features. This result is the output of the dual-domain collaborative attention module (DSAB) at the current level and serves as the input to the DSAB at the next level. The computation is repeated with the level index running from 1 up to the configured number of DSAB levels.
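The fusion step (concatenate the three branch outputs, shuffle channels, apply a 1×1 convolution, add the retained shortcut) can be sketched as below. The channels-first layout, the function names, and the use of `einsum` as the 1×1 convolution are illustrative assumptions.

```python
import numpy as np


def channel_shuffle(x, groups):
    """Interleave the channels of a (C, H, W) tensor across `groups` blocks."""
    c, h, w = x.shape
    assert c % groups == 0
    return x.reshape(groups, c // groups, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)


def fuse_branches(branches, shortcut, weight):
    """Concat -> channel shuffle -> 1x1 convolution -> add shortcut.

    branches : list of (C, H, W) branch outputs
    shortcut : (C_out, H, W) preprocessed features kept for the skip connection
    weight   : (C_out, len(branches) * C) matrix acting as the 1x1 convolution
    """
    stacked = np.concatenate(branches, axis=0)
    shuffled = channel_shuffle(stacked, groups=len(branches))
    fused = np.einsum("oc,chw->ohw", weight, shuffled)
    return fused + shortcut


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    branches = [rng.standard_normal((4, 6, 6)) for _ in range(3)]
    shortcut = rng.standard_normal((4, 6, 6))
    weight = rng.standard_normal((4, 12))
    print(fuse_branches(branches, shortcut, weight).shape)
```

After the shuffle, neighbouring channels come from different branches, so the subsequent 1×1 convolution mixes information from all three receptive field sizes.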

[0130]

[0131] Here, the three terms are the output feature information of the dual-domain collaborative attention module (DSAB) at the current level, the feature information output by the multi-head spectral calibration module (MSC) and the multi-scale-depth dynamic convolution fusion module (MDDC), and the prior guidance information calculated by the residual guided update module (RGU).

[0132] The specific steps of the output reconstruction layer are as follows. A convolution operation is applied once to the final output features to expand the channels; after a GELU activation function, a second convolution operation converts the number of channels into the preset number of color channels. For example, setting the final number of color channels to 1 yields a single-channel output image. A skip connection then adds this result to the initial reconstruction estimate, and the final single-pixel reconstructed image is obtained through a dimension compression operation:
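The wiring of one DSAB level and of the cascade can be sketched with plain callables standing in for the three sub-modules. The function names `dsab_level` and `dsab_chain` and the toy sub-modules are hypothetical; only the order RGU, MSC, MDDC and the residual addition of the RGU output follow the description.

```python
from typing import Callable, Sequence, Tuple


def dsab_level(x_prev, rgu: Callable, msc: Callable, mddc: Callable):
    """One level: prior = RGU(x_prev); output = MDDC(MSC(prior)) + prior."""
    prior = rgu(x_prev)
    return mddc(msc(prior)) + prior


def dsab_chain(x0, levels: Sequence[Tuple[Callable, Callable, Callable]]):
    """Cascade: the output of each level is the input of the next."""
    x = x0
    for rgu, msc, mddc in levels:
        x = dsab_level(x, rgu, msc, mddc)
    return x


if __name__ == "__main__":
    # Toy scalar stand-ins for the sub-modules, used only to show the wiring.
    level = (lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: x - 3.0)
    print(dsab_chain(1.0, [level, level]))
```

In a real network each level would hold its own trained parameters; passing the sub-modules as arguments keeps the sketch independent of any particular framework.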

[0133]

[0134] Here, the two operators are the 1×1 convolution operation and the 3×3 convolution operation. The output reconstructed image, described by its height, width, and number of channels, is the final reconstructed single-pixel image.

[0135] Step (3): Construct a hybrid loss function based on pixel-level reconstruction error (MSE), structural similarity (SSIM), and an orthogonality constraint (ortho); train the proposed single-pixel image data generator and computational reconstruction generator on a conventional image dataset.

[0136] The hybrid loss function used to train the proposed single-pixel image data generator and computational reconstruction generator combines pixel-level reconstruction error (MSE), structural similarity (SSIM), and an orthogonality constraint (ortho). Its expression is:

[0137]

[0138] Here, each of the three loss terms has its own weight coefficient. The pixel-level reconstruction error loss, the mean squared error (MSE), measures the pixel-by-pixel difference between the reconstructed image and the original input image. It is computed over the total number of pixels from the original input image and the image output by the final single-pixel reconstruction network.

[0139] The structural similarity loss is based on the structural similarity index (SSIM), a structure-aware metric that compensates for the insufficient sensitivity of the MSE to changes in image texture, edges, and contrast. Its definition uses the mean pixel value of each image, the variances of the two images, the covariance of the two images, and stability constants that prevent division by zero.

[0140] The orthogonality constraint loss constrains the structural orthogonality of the left-multiplication and right-multiplication measurement matrices used in the single-pixel sensing process, which helps improve the stability and information decoupling capability of the compressed sensing process. Its definition uses the Frobenius norm and the identity matrix, applied to the left-multiplication and right-multiplication measurement matrices and to their transposes, the left-multiplication and right-multiplication transpose mapping matrices.

[0141] Step (4): Use the trained single-pixel computational reconstruction generator to run reconstruction tests on actually acquired one-dimensional undersampled single-pixel data and on public datasets at different sampling rates, and evaluate the reconstruction accuracy at each sampling rate using peak signal-to-noise ratio (PSNR) and structural similarity (SSIM).
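The three loss terms can be sketched in NumPy as follows. Several details are assumptions made for the example: SSIM is computed from whole-image statistics rather than sliding windows, the SSIM term enters the loss as 1 − SSIM, the orthogonality term penalizes the squared Frobenius distance of A_L A_L^T and A_R^T A_R from the identity, and the weights and stability constants are placeholders. The patent's exact expressions are not reproduced here.

```python
import numpy as np


def mse_loss(x, x_hat):
    """Mean squared pixel-wise difference."""
    return float(np.mean((x - x_hat) ** 2))


def ssim_global(x, x_hat, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM from whole-image statistics (assumption: no sliding window, range [0, 1])."""
    mu_x, mu_y = x.mean(), x_hat.mean()
    var_x, var_y = x.var(), x_hat.var()
    cov = ((x - mu_x) * (x_hat - mu_y)).mean()
    return float(((2 * mu_x * mu_y + c1) * (2 * cov + c2))
                 / ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)))


def ortho_loss(a_left, a_right):
    """Assumed form: ||A_L A_L^T - I||_F^2 + ||A_R^T A_R - I||_F^2."""
    eye_l = np.eye(a_left.shape[0])
    eye_r = np.eye(a_right.shape[1])
    return float(np.linalg.norm(a_left @ a_left.T - eye_l, "fro") ** 2
                 + np.linalg.norm(a_right.T @ a_right - eye_r, "fro") ** 2)


def hybrid_loss(x, x_hat, a_left, a_right, w_mse=1.0, w_ssim=1.0, w_ortho=1.0):
    """Weighted sum of the three terms; the weights are placeholders."""
    return (w_mse * mse_loss(x, x_hat)
            + w_ssim * (1.0 - ssim_global(x, x_hat))
            + w_ortho * ortho_loss(a_left, a_right))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.random((8, 8))
    q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
    print(hybrid_loss(x, x + 0.05, q[:4, :], q[:4, :].T))
```

For a perfect reconstruction with orthonormal measurement matrices all three terms vanish, so the loss is zero up to floating-point error.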

[0142] In summary, the above single-pixel reconstruction network model was built with the PyTorch framework and optimized with the Adam optimizer, with the optimizer hyperparameters left at their default values. The batch size and the number of training epochs were set to 4 and 200, respectively. The learning rate started from a set initial value and was reduced by 50% every 20 epochs of training. The gradient factor was initialized to 0.001. The remaining settings are configured per embodiment: the sampling rate SR and the corresponding sampled data size; the number of levels of the dual-domain collaborative attention module (DSAB); the number of attention heads in the multi-head spectral calibration module (MSC); the dynamic kernel sizes of the three branches in the multi-scale-depth dynamic convolution fusion module (MDDC), together with the number of channel groups and the number of candidate kernels; the number of output image channels, configured as needed; and the weight coefficients of the loss function.

[0143] The network is trained on the BSDS400 dataset at different sampling rates. When training a network for large-size image reconstruction, the DIV2K dataset with 2K resolution can be used, with images cropped and scaled to match the input size of the network structure. Reconstruction performance is tested on the Set11 dataset at different sampling rates.
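The learning-rate schedule described above (a 50% reduction every 20 epochs over 200 epochs) can be written as a small step-decay function. The initial learning rate is not given in this text, so `initial_lr` is a hypothetical parameter and the demo uses 1.0 as a normalized placeholder.

```python
def learning_rate(epoch: int, initial_lr: float, step: int = 20, factor: float = 0.5) -> float:
    """Step decay: multiply the learning rate by `factor` every `step` epochs."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return initial_lr * factor ** (epoch // step)


if __name__ == "__main__":
    # Relative learning rate (placeholder initial value of 1.0) at the start of each stage.
    for epoch in range(0, 200, 20):
        print(epoch, learning_rate(epoch, initial_lr=1.0))
```

In PyTorch the same schedule corresponds to `torch.optim.lr_scheduler.StepLR` with `step_size=20` and `gamma=0.5`; over 200 epochs the rate is halved nine times.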

[0144] All training and testing were performed on a machine with an AMD Ryzen 9 5950X 3.4 GHz CPU (16 cores, 32 threads; 64 GB RAM) and an RTX 3090 GPU (24 GB memory). Image reconstruction performance was measured using Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM), which quantify the similarity between the reconstructed image and the original input image and thus reflect the reconstruction ability and structure preservation capability of the single-pixel reconstruction network.

[0145] Table 1 reports, for different sampling rates (SR) on the Set11 evaluation dataset, the average PSNR / SSIM values of the proposed method and of classical single-pixel image reconstruction methods, together with their parameter storage sizes. All comparative results were obtained on the same test machine and dataset with a unified evaluation process, so that the comparison objectively reflects the performance advantages of the proposed method.
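The PSNR metric used in the evaluation can be computed as below. The sketch assumes images scaled to the range [0, 1], so the peak value defaults to 1.0; the function name and the handling of identical images are illustrative choices.

```python
import numpy as np


def psnr(reference, reconstruction, peak=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    ref = np.asarray(reference, dtype=float)
    rec = np.asarray(reconstruction, dtype=float)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((16, 16))
    noisy = np.clip(image + 0.05 * rng.standard_normal((16, 16)), 0.0, 1.0)
    print(f"{psnr(image, noisy):.2f} dB")
```

A uniform error of 0.1 on a [0, 1] image gives an MSE of 0.01 and therefore a PSNR of 20 dB, which is a convenient sanity check.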

[0146] Table 1

[0147]

[0148] In Table 1, bold values are the best in each row-wise comparison, underlined values are the second best, and "-" indicates that the original network did not provide a model at this sampling rate.

[0149] The proposed method was compared with several representative single-pixel image reconstruction networks on the Set11 dataset. Across the tested sampling rates (SR), the invention achieves the best (or tied-best) PSNR / SSIM scores, demonstrating consistent reconstruction advantages and stability across different sampling intensities. Compared to lightweight iterative unfolding networks (such as ISTA-Net+ and OPINE-Net), this invention significantly improves reconstruction accuracy while maintaining a lower parameter size. Compared to high-parameter networks such as Transformer architectures built around multi-head attention mechanisms (such as HATNet and IDM-Net), this invention achieves superior or comparable reconstruction performance with a significantly reduced parameter count, demonstrating higher feature representation efficiency and a better complexity-performance ratio.

[0150] As shown in Figure 5, on the Set11 dataset at sampling rates of 1%, 4%, 10%, 25%, and 50%, the method of the present invention stably recovers the main structural information and detailed features of the target image under the different sampling conditions. As the sampling rate increases, the sharpness and local texture of the reconstructed image gradually improve, and good structural integrity is still maintained at low sampling rates, indicating that the method of the present invention is robust to changes in the sampling rate.

[0151] As shown in Figure 6, taking a sampling rate of 10% as an example, the method of this invention is compared with various classic single-pixel reconstruction methods and existing network models. The present invention shows superior performance in terms of image sharpness, edge continuity, and local detail fidelity. Especially in detailed areas such as the "mouth" (the area marked by the red box in the figure), the method of this invention recovers grayscale level changes more accurately, greatly reducing blurring and artifacts.

[0152] In Figure 6, bold values are the best in each row-wise comparison, and underlined values are the second best.

[0153] Therefore, combining quantitative indicators and subjective visual effects, it can be seen that the method of this invention can achieve high-quality single-pixel reconstruction under different sampling rates, effectively improving detail recovery capabilities while ensuring image structural consistency, thus verifying the effectiveness and practicality of the method in single-pixel imaging reconstruction tasks. Furthermore, this invention effectively reduces model structural complexity and deployment overhead while ensuring reconstruction quality, making it more suitable for efficient single-pixel image reconstruction and compressed sensing reconstruction applications on resource-constrained platforms.

[0154] Compared with the prior art, the beneficial effects of the present invention are as follows:

[0155] (1) To address the issues of high computational complexity and large computational and memory overhead in current single-pixel reconstruction networks, this invention prioritizes lightweight design by introducing a local large-kernel convolution method to replace the computation between query and key values ​​in a typical Transformer. This mechanism combines multi-head and local block partitioning for feature attention computation, extracts local weight features, and modulates the value features using element-wise multiplication (Hadamard product), thereby achieving spatial correlation modeling within local regions. This method mitigates the high computational burden of a typical Transformer, allowing the network to maintain a lightweight structure without affecting reconstruction accuracy, making it more suitable for integration into edge devices or real-time imaging scenarios.

[0156] (2) To address the problem of limited receptive field in traditional deep reconstruction models for large-size images, this invention introduces a large receptive field attention mechanism based on frequency-spatial domain feature modulation. By fusing and interacting with dual-domain feature information in the frequency and spatial domains, global modeling of the overall structure of the image is achieved, obtaining a large receptive field without significantly increasing the computational load, thus improving the ability to model the long-distance structure of the input image information;

[0157] (3) In order to overcome the problem of limited detail representation in complex structural scenes by traditional fixed convolution kernels, this invention achieves spatial location adaptive feature extraction through multi-scale depth dynamic convolution, which significantly enhances the ability to reconstruct textures and details. With the help of the multi-scale-depth dynamic convolution fusion module (MDDC), this invention can generate corresponding convolution kernels for different positions according to changes in spatial content, and has multi-scale representation capabilities, enabling the network to process high-frequency textures, edges and other local areas more finely;

[0158] (4) In order to improve the stability and convergence speed of the reconstruction process, this invention introduces an iterative inversion structure guided by residual descent. By using the residual-guided update module (RGU) to integrate the physical imaging model constraints into the deep reconstruction process, it helps to improve the interpretability of the reconstruction and the stability of the optimization process, reduce the dependence of the deep model on the training data, and improve the generalization ability under different measurement conditions;

[0159] (5) This invention effectively improves the reconstruction degradation problem of traditional single-pixel reconstruction methods under high noise, high compression rate and high dimension measurement conditions by combining frequency domain-spatial domain feature modulation, dynamic convolution and iterative inversion, thereby obtaining higher image restoration accuracy and structure preservation capability, and can better adapt to complex light fields and measurement scenarios in practical applications.

[0160] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A lightweight single-pixel imaging method based on frequency-spatial domain modulation attention, characterized in that the method comprises the following steps:
acquiring single-pixel compressed measurement data of the target scene;
inversely mapping the single-pixel compressed measurement data using the transpose mapping matrix, obtained by transposing the measurement matrix, to generate an initial reconstruction estimate of the single-pixel image;
inputting the initial reconstruction estimate of the single-pixel image into a dual-domain collaborative attention deep reconstruction network, the network containing several levels of cascaded dual-domain collaborative attention modules;
in each dual-domain collaborative attention module, performing frequency domain feature extraction and spatial domain feature extraction in parallel on the input feature information and fusing the frequency domain features with the spatial domain features, wherein the frequency domain feature extraction captures global structural information and the spatial domain feature extraction uses dynamic convolution to reconstruct local detail information;
wherein the dual-domain collaborative attention module includes, in sequence, a residual-guided update module, a multi-head spectral calibration module, and a multi-scale-depth dynamic convolution fusion module;
the residual-guided update module performs a tensor gradient update correction using the measurement residual between the single-pixel compressed measurement data and the current reconstruction estimate of each level of dual-domain collaborative attention module;
the multi-head spectral calibration module modulates the features output by the residual-guided update module with dual-domain feature information and, through fusion and interaction of the dual-domain feature information, realizes long-range structural modeling of the input image information, the dual-domain feature information including frequency domain features and spatial domain features;
the multi-scale-depth dynamic convolution fusion module performs multi-scale dynamic convolution in the spatial domain on the features output by the multi-head spectral calibration module to obtain a multi-scale feature representation, enabling the network to recover local detail features such as high-frequency textures and edges;
the output of the multi-head spectral calibration module is fused with the output of the multi-scale-depth dynamic convolution fusion module to serve as the output of the current dual-domain collaborative attention module; and
the outputs of the dual-domain collaborative attention deep reconstruction network are integrated to obtain the final single-pixel reconstructed image.
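The data flow inside one dual-domain collaborative attention module, as recited in claim 1, can be sketched roughly as follows. This is only an illustrative outline: the three sub-modules are passed in as hypothetical callables, and fusing the two outputs by addition is an assumption, since the claim does not state the fusion operator.

```python
import numpy as np

def dual_domain_module(feat, residual_update, spectral_calibration, msd_fusion):
    """One module: residual-guided update -> spectral calibration -> multi-scale fusion."""
    corrected = residual_update(feat)             # residual-guided update module
    calibrated = spectral_calibration(corrected)  # multi-head spectral calibration module
    detailed = msd_fusion(calibrated)             # multi-scale-depth dynamic convolution fusion
    # Assumption: the two outputs are fused by element-wise addition.
    return calibrated + detailed

# Toy stand-ins for the three sub-modules (hypothetical, for shape checking only).
feat = np.ones((4, 8, 8))
out = dual_domain_module(
    feat,
    residual_update=lambda f: f - 0.1,
    spectral_calibration=lambda f: 2.0 * f,
    msd_fusion=lambda f: 0.5 * f,
)
```

The callable-based layout is chosen only to make the ordering of the three sub-modules explicit; a real network would use trained layers in their place.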

2. The lightweight single-pixel imaging method based on frequency-spatial domain modulation attention according to claim 1, characterized in that the initial reconstruction estimate of the single-pixel image is generated by inversely mapping the single-pixel compressed measurement data with the transpose mapping matrix obtained by transposing the measurement matrix, specifically by multiplying the single-pixel compressed measurement data on the left by the left transpose mapping matrix and on the right by the right transpose mapping matrix;
wherein the single-pixel compressed measurement data is acquired by a single-pixel imaging system and is in two-dimensional form; the measurement matrix comprises a preset left-multiplication measurement matrix and a preset right-multiplication measurement matrix, and the left and right transpose mapping matrices are the corresponding transposes; the single-pixel compressed measurement data is calculated from the single-pixel original image using the left-multiplication and right-multiplication measurement matrices;
the training process of the dual-domain collaborative attention deep reconstruction network adopts a hybrid loss function, which includes a pixel-level reconstruction error, a structural similarity loss, and an orthogonality constraint loss, the orthogonality constraint loss being used to constrain the structural orthogonality of the measurement matrix.
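A minimal NumPy sketch of the two-sided measurement and the transpose-based initial estimate described in claim 2 is given below. The symbol names `phi_l` and `phi_r` are hypothetical (the formula symbols are not preserved in this text), the measurement is assumed to be a left and right matrix product, and the form of the orthogonality loss (squared Frobenius distance of the Gram matrices from the identity) is likewise an assumption. The pixel-level and structural similarity terms of the hybrid loss are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def orthonormal_rows(m, n, rng):
    """Return an m x n matrix with orthonormal rows (requires m <= n)."""
    q, _ = np.linalg.qr(rng.standard_normal((n, m)))  # q is n x m, orthonormal columns
    return q.T

def measure(x, phi_l, phi_r):
    """Assumed two-sided compressed measurement: Y = phi_l @ X @ phi_r."""
    return phi_l @ x @ phi_r

def initial_estimate(y, phi_l, phi_r):
    """Inverse mapping with the transposes: X0 = phi_l^T @ Y @ phi_r^T."""
    return phi_l.T @ y @ phi_r.T

def orthogonality_loss(phi_l, phi_r):
    """Assumed penalty: squared distance of the Gram matrices from the identity."""
    gram_l = phi_l @ phi_l.T - np.eye(phi_l.shape[0])
    gram_r = phi_r.T @ phi_r - np.eye(phi_r.shape[1])
    return float(np.sum(gram_l ** 2) + np.sum(gram_r ** 2))

n, m = 16, 8                             # image side and measurement side (toy sizes)
phi_l = orthonormal_rows(m, n, rng)      # m x n left measurement matrix
phi_r = orthonormal_rows(m, n, rng).T    # n x m right measurement matrix
x = rng.standard_normal((n, n))          # stand-in for the original image
y = measure(x, phi_l, phi_r)             # m x m two-dimensional measurement data
x0 = initial_estimate(y, phi_l, phi_r)   # n x n initial reconstruction estimate
```

When the measurement matrices have orthonormal rows and columns as in this toy setup, the initial estimate reproduces the measurements exactly when re-measured, which is one plausible motivation for an orthogonality constraint during training.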

3. The lightweight single-pixel imaging method based on frequency-spatial domain modulation attention according to claim 1, characterized in that the dual-domain collaborative attention deep reconstruction network includes, connected in sequence, an input preprocessing layer, a chain structure composed of several levels of the dual-domain collaborative attention modules connected in series, and an output reconstruction layer; the input preprocessing layer performs convolution processing on the initial reconstruction estimate of the single-pixel image and outputs initial features; in the chain structure, the output of the previous dual-domain collaborative attention module serves as the input of the next dual-domain collaborative attention module; the output reconstruction layer processes the output of the last dual-domain collaborative attention module and superimposes it on the initial reconstruction estimate of the single-pixel image via a skip connection to generate the final single-pixel reconstructed image.
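The chain structure of claim 3 amounts to a sequential composition with a global skip connection. The sketch below illustrates that wiring only; `preprocess`, `modules`, and `output_layer` are hypothetical placeholders for the trained layers, and the toy lambdas are not meant to resemble the real operators.

```python
import numpy as np

def reconstruct(x0, preprocess, modules, output_layer):
    """Preprocess -> cascaded modules -> output layer, plus a skip from x0."""
    feat = preprocess(x0)          # input preprocessing layer (convolution in the claim)
    for module in modules:         # each module consumes the previous module's output
        feat = module(feat)
    return x0 + output_layer(feat) # skip connection to the initial estimate

# Toy stand-ins: lift 1 -> 4 channels, three scaling "modules", channel mean as output.
x0 = np.ones((8, 8))
image = reconstruct(
    x0,
    preprocess=lambda x: np.repeat(x[None], 4, axis=0),
    modules=[lambda f: 0.5 * f] * 3,
    output_layer=lambda f: f.mean(axis=0),
)
```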

4. The lightweight single-pixel imaging method based on frequency-spatial domain modulation attention according to claim 1, characterized in that the specific processing steps of the residual-guided update module include: calculating the measurement residual between the observed value corresponding to the input features of the current dual-domain collaborative attention module and the single-pixel compressed measurement data, the input features of the current dual-domain collaborative attention module being the reconstruction estimate calculated and output by the previous dual-domain collaborative attention module; mapping the measurement residual back to the feature space by multiplying it on the left and on the right by the transpose mapping matrices, to obtain the residual back-projection signal; and, based on a learnable gradient descent step size parameter, using the residual back-projection signal to perform a gradient update on the current input features and outputting the corrected feature information tensor.
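The three steps of claim 4 (measurement residual, back-projection, gradient step) can be illustrated with a short NumPy sketch. It assumes the same two-sided measurement model with hypothetical matrices `phi_l` and `phi_r`, treats the input features as a single-channel image estimate, and uses a fixed `step` where the claimed module learns the step size.

```python
import numpy as np

def residual_guided_update(x, y, phi_l, phi_r, step):
    """One gradient-style correction of the estimate x against measurements y."""
    residual = phi_l @ x @ phi_r - y            # measurement residual
    back_proj = phi_l.T @ residual @ phi_r.T    # residual back-projection signal
    return x - step * back_proj                 # gradient update with step size

rng = np.random.default_rng(0)
n, m = 16, 8
phi_l = np.linalg.qr(rng.standard_normal((n, m)))[0].T  # m x n, orthonormal rows
phi_r = np.linalg.qr(rng.standard_normal((n, m)))[0]    # n x m, orthonormal columns
x_true = rng.standard_normal((n, n))
y = phi_l @ x_true @ phi_r

x = np.zeros((n, n))
errors = []
for _ in range(5):
    errors.append(np.linalg.norm(phi_l @ x @ phi_r - y))  # residual norm before update
    x = residual_guided_update(x, y, phi_l, phi_r, step=0.5)
```

With orthonormal measurement matrices and a step of 0.5, each update halves the measurement residual in this toy setting; a trained network would instead learn a step size per module.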

5. The lightweight single-pixel imaging method based on frequency-spatial domain modulation attention according to claim 1, characterized in that the multi-head spectral calibration module includes a spatial domain branch and a frequency domain branch in parallel; the processing steps of the frequency domain branch include: after the features output by the residual-guided update module are layer-normalized, a two-dimensional fast Fourier transform is performed to convert them to the frequency domain, feature information is extracted in the frequency domain, and the inverse fast Fourier transform is then used to convert the result back to the spatial domain to obtain the frequency domain reprojection features; the frequency domain reprojection features and the spatial domain features are divided into multiple non-overlapping image blocks; element-level local linear attention modulation is performed at the image block level to calculate local linear attention; the calculated local linear attention is concatenated within each block and along the channel direction, and position encoding is introduced to obtain the output of the multi-head spectral calibration module.

6. The lightweight single-pixel imaging method based on frequency-spatial domain modulation attention according to claim 5, characterized in that the specific process of the element-level local linear attention modulation is as follows: the numerical vectors obtained by convolutional encoding of the frequency domain reprojection features and of the spatial domain branch are reshaped and divided into non-overlapping image blocks; element-wise multiplication is performed on the features in each image block to compute the local linear attention; the element-wise product is the Hadamard product.
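The frequency branch of claim 5 and the block-level Hadamard modulation of claim 6 can be illustrated roughly as below. This is a simplified sketch under stated assumptions: the frequency-domain feature extraction is replaced by a hypothetical element-wise `freq_filter`, and layer normalization, convolutional encoding, concatenation, and position encoding are omitted.

```python
import numpy as np

def frequency_branch(feat, freq_filter):
    """feat: (C, H, W). 2-D FFT -> element-wise filter (assumed) -> inverse FFT."""
    spec = np.fft.fft2(feat, axes=(-2, -1))
    return np.fft.ifft2(spec * freq_filter, axes=(-2, -1)).real

def to_patches(feat, p):
    """(C, H, W) -> (num_patches, C, p, p) non-overlapping blocks, row-major order."""
    c, h, w = feat.shape
    t = feat.reshape(c, h // p, p, w // p, p)
    return t.transpose(1, 3, 0, 2, 4).reshape(-1, c, p, p)

def from_patches(patches, c, h, w, p):
    """Inverse of to_patches."""
    t = patches.reshape(h // p, w // p, c, p, p)
    return t.transpose(2, 0, 3, 1, 4).reshape(c, h, w)

def patch_modulation(freq_feat, spat_feat, p):
    """Hadamard product of the two branches, computed block by block."""
    c, h, w = freq_feat.shape
    modulated = to_patches(freq_feat, p) * to_patches(spat_feat, p)
    return from_patches(modulated, c, h, w, p)

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 8))
freq_filter = np.ones((4, 8, 8))          # identity filter, for illustration only
freq_feat = frequency_branch(feat, freq_filter)
spat_feat = rng.standard_normal((4, 8, 8))
out = patch_modulation(freq_feat, spat_feat, p=4)
```

Because the Hadamard product is element-wise, the block partition does not change the product itself in this reduced sketch; in the claimed module the partition also determines how results are concatenated within each block and along the channel direction.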

7. The lightweight single-pixel imaging method based on frequency-spatial domain modulation attention according to claim 1, characterized in that the specific processing steps of the multi-scale-depth dynamic convolution fusion module include: dividing the input features into multiple sub-feature groups along the channel direction; for each sub-feature group, generating a channel description vector using global average pooling and generating the corresponding dynamic convolution kernel weights through a dynamic convolution kernel generation network; using the generated dynamic convolution kernel weights, applying dynamic depthwise convolutions of different scales to the corresponding sub-feature groups to obtain multi-path dynamic convolution output features; and concatenating the multi-path dynamic convolution output features, shuffling them in the channel dimension, and obtaining the module output after convolutional fusion.
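The steps of claim 7 can be sketched in NumPy as follows. Several details are assumptions made only for illustration: the kernel sizes in `kernel_sizes` are placeholders (the claimed values are not preserved in this text), the kernel generation network is reduced to two random linear maps with GELU (tanh approximation) and Sigmoid activations, and the final convolutional fusion is taken to be a 1x1 convolution.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def depthwise_conv(feat, kernels):
    """feat: (C, H, W); kernels: (C, k, k), odd k; zero padding keeps H x W."""
    c, h, w = feat.shape
    k = kernels.shape[-1]
    pad = k // 2
    padded = np.pad(feat, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(feat)
    for i in range(k):
        for j in range(k):
            out += kernels[:, i, j, None, None] * padded[:, i:i + h, j:j + w]
    return out

def dynamic_kernels(group, w1, w2, k):
    """Global average pooling -> linear + GELU -> linear + Sigmoid -> (C, k, k)."""
    desc = group.mean(axis=(1, 2))              # channel description vector
    return sigmoid(w2 @ gelu(w1 @ desc)).reshape(group.shape[0], k, k)

def channel_shuffle(feat, groups):
    c, h, w = feat.shape
    return feat.reshape(groups, c // groups, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)

def msd_fusion(feat, params, kernel_sizes, w_fuse):
    groups = np.split(feat, len(kernel_sizes), axis=0)   # split along channels
    outs = [depthwise_conv(g, dynamic_kernels(g, w1, w2, k))
            for g, (w1, w2), k in zip(groups, params, kernel_sizes)]
    mixed = channel_shuffle(np.concatenate(outs, axis=0), len(kernel_sizes))
    return np.einsum('oc,chw->ohw', w_fuse, mixed)       # assumed 1x1 fusion conv

rng = np.random.default_rng(0)
kernel_sizes = (3, 5, 7)          # placeholder values, not taken from the claims
feat = rng.standard_normal((6, 8, 8))
cg, hidden = 2, 4                 # channels per group, hidden width (hypothetical)
params = [(rng.standard_normal((hidden, cg)),
           rng.standard_normal((cg * k * k, hidden))) for k in kernel_sizes]
w_fuse = rng.standard_normal((6, 6))
out = msd_fusion(feat, params, kernel_sizes, w_fuse)
```

The kernels here depend on the input through its pooled channel statistics, which is what makes the convolution "dynamic"; the random weights stand in for parameters that would be learned.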

8. The lightweight single-pixel imaging method based on frequency-spatial domain modulation attention according to claim 7, characterized in that the number of the sub-feature groups is 3, and the kernel sizes of the corresponding dynamic depthwise convolutions at different scales are respectively... , and ; the dynamic convolution kernel generation network includes a linear transformation layer activated by GELU and a linear transformation layer activated by Sigmoid.

9. The lightweight single-pixel imaging method based on frequency-spatial domain modulation attention according to claim 3, characterized in that the processing steps of the output reconstruction layer include: performing a convolution operation on the fused features to expand the channels; after the activation function is applied, performing a convolution operation again to convert the number of channels to the preset number of color channels; and superimposing the result on the initial reconstruction estimate of the single-pixel image through a skip connection and obtaining the final single-pixel reconstructed image through a dimensionality compression operation.
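A rough NumPy sketch of the output reconstruction layer in claim 9 follows. The claim does not specify the kernel size, the activation function, or the exact dimensionality compression operation, so this sketch assumes 1x1 convolutions, ReLU, and `np.squeeze` respectively; the weight names are hypothetical.

```python
import numpy as np

def conv1x1(feat, weight):
    """1x1 convolution: feat (C_in, H, W), weight (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', weight, feat)

def output_reconstruction(feat, x0, w_expand, w_project):
    hidden = np.maximum(conv1x1(feat, w_expand), 0.0)  # expand channels, ReLU (assumed)
    img = conv1x1(hidden, w_project) + x0              # project to color channels + skip
    return np.squeeze(img)                             # drop singleton dims (assumed)

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 16, 16))   # fused features from the last module
x0 = rng.standard_normal((1, 16, 16))     # initial estimate, one color channel
w_expand = rng.standard_normal((16, 8))   # 8 -> 16 channels
w_project = rng.standard_normal((1, 16))  # 16 -> 1 color channel
img = output_reconstruction(feat, x0, w_expand, w_project)
```

Because of the skip connection, the layer only has to predict a correction to the initial estimate; with an all-zero projection it returns the initial estimate unchanged.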