MSI super-resolution reconstruction method based on bimodal adaptive registration and high-fidelity fusion

The MSI super-resolution reconstruction method, which combines dual-modal adaptive registration with high-fidelity fusion, solves the spatial resolution and texture problems in mass spectrometry imaging technology, and realizes high-quality MSI image reconstruction under low-quality input conditions. It is suitable for disease research, drug evaluation and tissue heterogeneity analysis.

CN122265038A · Pending · Publication Date: 2026-06-23 · JINAN UNIVERSITY

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: JINAN UNIVERSITY
Filing Date: 2026-05-07
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing mass spectrometry imaging techniques suffer from limited spatial resolution, low signal-to-noise ratio, and low acquisition efficiency. In particular, high-resolution acquisition is time-consuming, costly, and results in significant sample loss. Furthermore, cross-modal image reconstruction methods rely on large amounts of training data, and the reconstruction results are too smooth, making it difficult to present the metabolic details of true high-resolution MSI images.

Method used

We employ a super-resolution MSI reconstruction method based on dual-modal adaptive registration and high-fidelity fusion. We use a convolutional neural network for image registration and feature extraction, and combine Laplacian operator enhancement, dual encoder DIP network and spatial attention matrix to achieve multi-scale feature fusion and texture fidelity. We then inject real metabolic textures using high-resolution MSI reference images.

Benefits of technology

Without relying on a large amount of training data, MSI super-resolution reconstruction with consistent structure and realistic texture is achieved, which improves the reliability and detail fidelity of the reconstructed image and is suitable for batch sample or multi-region imaging scenarios.

✦ Generated by Eureka AI based on patent content.

Abstract

The application provides an MSI super-resolution reconstruction method based on bimodal adaptive registration and high-fidelity fusion, comprising the following steps: S1, image registration; S2, mapping; S3, multi-scale fusion; S4, texture fidelity. First, an intermediate fusion image with clear structure and smooth texture is generated through adaptive fusion. A high-resolution MSI reference image is then introduced, and an unsupervised texture transfer strategy injects the metabolic texture of the reference image into the fusion image, thereby reconstructing the texture of the high-resolution MSI. Spatial attention and brightness-adaptive modulation are combined in the fusion process, making the structure more accurate and the texture more realistic. The method adapts well: multiple low-resolution MSI images of the same scene can share the same HE tissue image and high-resolution MSI reference image as a unified structure and texture source, with no need to prepare separate training data or label information. It is therefore suitable for data-scarce scenarios, and the generated result is highly consistent with the real high-resolution MSI image in structure and texture.

Description

Technical Field

[0001] This invention relates to the field of mass spectrometry imaging image processing and cross-modal image reconstruction technology, specifically to an MSI super-resolution reconstruction method based on dual-modal adaptive registration and high-fidelity fusion.

Background Technology

[0002] Mass spectrometry imaging (MSI) is an important analytical technique for acquiring the spatial distribution of molecules on tissue sections, and it has wide applications in disease research, drug evaluation, and tissue heterogeneity analysis. However, due to the physical limitations of imaging equipment, MSI typically suffers from limited spatial resolution, low signal-to-noise ratio, and low acquisition efficiency. Especially in high-resolution mode, long acquisition times, high experimental costs, and significant sample loss make acquiring high-resolution MSI data very difficult in practical applications, resulting in a scarcity of usable high-quality data.

[0003] To overcome the aforementioned limitations, hematoxylin-eosin (HE) stained tissue images have been used as auxiliary information for reconstructing high-resolution images from low-resolution MSI data. However, MSI and HE images often originate from different devices or different slice layers, or contain minor deformations, so spatial misalignment, local tissue shifts, and morphological distortions between the two types of images are common. Such cross-modal errors introduce structural inconsistencies during fusion and reduce the reliability of the reconstructed images.

[0004] Existing cross-modal reconstruction methods typically rely on deep learning models, requiring large amounts of paired, fully labeled, and strictly aligned training data. However, obtaining large-scale, paired, and high-quality MSI-HE data under real-world experimental conditions is extremely difficult, limiting the training and generalization performance of the models. Furthermore, these methods often over-rely on the smooth structural features of histological images during reconstruction, failing to effectively recover the unique metabolic texture information (such as graininess) found in MSI images. Therefore, while many existing methods produce reconstructed results with clear structures, the textures are often too smooth, making it difficult to represent the rich metabolic details found in true high-resolution MSI images.

[0005] In summary, existing technologies generally suffer from the following problems: deep learning relies on a large amount of training data, high-resolution MSI data is difficult to obtain, multimodal images suffer from spatial misalignment, and the reconstruction results lack texture realism. How to achieve structurally consistent and texture-realistic MSI super-resolution reconstruction under low-quality input conditions without relying on a large amount of training data remains a key technical challenge that needs to be addressed.

Summary of the Invention

[0006] The purpose of this invention is to provide an MSI super-resolution reconstruction method based on dual-modal adaptive registration and high-fidelity fusion to solve the problems mentioned in the background art.

[0007] To achieve the above objectives, the present invention provides the following technical solution: a method for MSI super-resolution reconstruction based on dual-modal adaptive registration and high-fidelity fusion, comprising the following steps:

S1. Image Registration: Input the low-resolution MSI image and the high-resolution HE image into the spatial transformation network module. The feature extraction layer of the convolutional neural network captures the deep tissue structure features, and the parameter regressor outputs the affine transformation matrix. The spatially aligned corrected image is obtained by sampling through a differentiable grid.

S2. Mapping: The aligned and corrected image is input into a lightweight convolutional mapping module consisting of a head convolutional unit, a main convolutional unit, and a tail mapping unit connected in series. While maintaining the spatial resolution, the image is mapped from the pixel space to the target channel space to obtain a preliminary predicted image.

S3. Multi-scale Fusion: Apply Laplacian operator convolution to the preliminary predicted image to obtain a structure-enhanced image. Input the low-resolution MSI image and the structure-enhanced image into a dual-encoder DIP network respectively, and extract multi-scale feature sequences layer by layer. After channel dimensionality reduction of the output features of each layer of the dual encoder, construct a spatial attention matrix based on the dimensionality-reduced HE features, spatially weight the dimensionality-reduced MSI features, concatenate the weighted MSI features with the original HE features in the channel dimension, and then compress to the target number of channels to obtain multi-scale fusion features. Upsample the multi-scale fusion features layer by layer through the decoder, with skip connections to the features of the corresponding encoder layer, to gradually reconstruct a smooth structure reconstruction map. Input the smooth structure reconstruction map into the correction network for local compensation and fine-tuning to obtain the structure correction result. Through a weakly gated color correction mechanism, adaptively weight and fuse the structure correction result with the low-frequency components extracted by upsampling and local mean pooling of the low-resolution MSI image, using the gate weights obtained after Sigmoid activation of the learnable parameters, to obtain a smooth fusion map.

S4. Texture Preservation: Local mean filtering is applied to the reference image to extract high-frequency residuals as the initial grain texture. The initial grain texture is then subjected to mean removal and standardization, and its amplitude is adjusted using the texture intensity coefficient to obtain a standardized texture. Based on the local brightness information of the smooth fusion image, the standardized texture is adaptively modulated using a brightness modulation coefficient to enhance the texture in dark areas and weaken the texture in bright areas. The modulated texture is injected into the smooth fusion image, and global brightness correction is then performed to ensure that the output image is consistent with the low-resolution MSI image at the channel mean level, ultimately obtaining a high-resolution MSI super-resolution reconstructed image.

[0008] Preferably, in step S1, the spatial transformation network module includes a localization network, a parameter regressor, and a spatial sampler. The localization network captures deep tissue structure features based on the feature extraction layer of a convolutional neural network. After compression and flattening by an adaptive average pooling layer, it is input into the parameter regressor, which consists of two fully connected layers. The regressor outputs affine transformation parameters and reassembles them into an affine transformation matrix. The spatial sampler constructs a sampling grid based on the affine transformation matrix. It performs spatial transformation on the input image through differentiable grid sampling to obtain a corrected image. This achieves automatic correction of translation, rotation, and local nonlinear deformation between HE images and MSI images, providing a spatially consistent alignment basis for subsequent cross-modal fusion.

[0009] Preferably, in step S1, the convolutional neural network extracts spatial geometric features through three sets of sequentially connected convolution-pooling-activation structures, progressively compressing the features into higher-level representations, capturing deep spatial features of the tissue structure at different scales, and providing a more discriminative feature representation for accurate regression of the affine parameters.

[0010] Preferably, in step S2, in the lightweight convolutional mapping module, the head convolutional unit performs initial feature mapping on the input corrected image; the main convolutional unit consists of three consecutively stacked convolutional modules, which further model the output features; and the tail mapping unit maps and compresses the channel dimension of the high-dimensional features output by the main unit. The module continuously expands the receptive field without introducing downsampling, effectively maps HE information to the MSI feature space, and provides an initial reconstruction result with consistent structure and no resolution loss for subsequent fusion.

[0011] Preferably, in step S3, each encoder in the dual-encoder DIP network consists of four convolution-and-pooling layers. Each convolutional block contains two convolution operations and is followed by a max pooling operation that compresses the spatial size of the feature map layer by layer. The output feature maps of each encoder layer are collected to form a multi-scale feature sequence, so that multi-scale features from shallow details to deep semantics are extracted from the MSI and HE images respectively, providing a complete feature hierarchy for cross-modal complementary information fusion.

[0012] Preferably, in step S3, the spatial attention matrix is constructed based on the dimensionality-reduced HE features and generated through convolution and Sigmoid activation. The spatial attention matrix is used to perform element-wise multiplication and weighting on the dimensionality-reduced MSI features, so that the MSI color features in the regions with significant HE structural information are suppressed, while the MSI color information in the regions with insignificant structure is preserved. The cross-modal fusion is guided by the HE structure, which maintains the accuracy of tissue structure while avoiding the loss of MSI metabolic information.

[0013] Preferably, in step S3, the decoder reconstructs layer by layer starting from the deepest fused features, using bilinear interpolation for upsampling at each level. After each concatenation, feature reconstruction is performed through two consecutive convolutional blocks with the ReLU activation function, gradually restoring image detail and integrating deep semantic information with shallow spatial details to generate a structurally reconstructed image with clear edges and overall smoothness.

[0014] Preferably, in step S4, the spatial scale of the extracted texture particles is controlled by adjusting the window size of the local mean filter, where a larger window corresponds to coarser texture particles and a smaller window corresponds to finer texture features. The overall amplitude of the injected texture is controlled by the texture intensity coefficient, so that the coarseness and intensity of the metabolic texture particles can be flexibly controlled and adapted to different resolution requirements and tissue types.

[0015] Preferably, steps S1 and S2 are designated as the first stage and step S3 as the second stage, and a phased training strategy is adopted. The first stage is first trained separately, during which the second stage does not participate in model parameter updates. After this pre-training is completed, the first and second stages are trained jointly. Step S4 is used only in the inference stage and does not participate in loss calculation or model parameter updates during training. Because structural fusion is performed only after alignment and mapping have stabilized, optimization conflicts in end-to-end training are avoided and the stability and quality of the reconstruction results are improved.
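The phased schedule can be summarized as a small lookup of which modules receive parameter updates in each phase. This is only an illustrative sketch: the module names and phase labels below are hypothetical and do not come from the patent text.

```python
# Hypothetical module labels for steps S1-S3; S4 (texture preservation)
# has no entry because it is never trained.
STAGE_ONE = ("registration_s1", "mapping_s2")
STAGE_TWO = ("fusion_s3",)


def trainable_modules(phase):
    """Return the modules whose parameters are updated in a given phase."""
    if phase == "pretrain":
        return STAGE_ONE                # stage two receives no updates
    if phase == "joint":
        return STAGE_ONE + STAGE_TWO    # both stages are optimized together
    if phase == "inference":
        return ()                       # S4 runs here, nothing is updated
    raise ValueError(f"unknown phase: {phase!r}")


schedule = {p: trainable_modules(p) for p in ("pretrain", "joint", "inference")}
```

In a deep learning framework, this would typically translate into freezing or unfreezing the parameters of the listed modules before each phase.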

[0016] Compared with the prior art, the beneficial effects of the present invention are as follows. Through a deep image prior network and an unsupervised texture transfer strategy, this invention eliminates the dependence on large-scale, strictly registered training data and remains feasible and applicable in data-scarce scenarios.

[0017] This invention introduces a spatial transformation network into the cross-modal alignment process, which can adaptively correct translation, rotation and local nonlinear deformation, providing a stable and reliable spatial foundation for subsequent fusion.

[0018] This invention effectively preserves key tissue boundaries and detailed structures through the synergistic effects of Laplacian enhancement, the dual-encoder DIP network, spatial attention, and multi-scale fusion. At the same time, a grain-texture transfer strategy based on high-pass residuals extracts real metabolic textures from high-resolution MSI reference images and injects them into the result, avoiding excessive texture smoothing or artificial artifacts.

[0019] This invention allows multiple low-resolution MSI images in the same scene to share the same HE image and texture reference source, avoiding the need to prepare reference data separately for each image, and is suitable for batch sample or multi-region imaging scenarios.

Attached Figure Description

[0020] Figure 1 is a complete flowchart of the present invention.

[0021] Figure 2 is the high-resolution HE image of the first group of experiments (mouse brain tissue sample slice 1).

[0022] Figure 3 is the low-resolution MSI image of the first group of experiments (mouse brain tissue sample slice 1).

[0023] Figure 4 is the high-resolution MSI image of the first group of experiments (mouse brain tissue sample slice 1).

[0024] Figure 5 is the high-resolution MSI texture reference image of the first group of experiments (mouse brain tissue sample slice 1).

[0025] Figure 6 shows the alignment mapping result obtained after dual-modal adaptive registration for the first group of experiments (mouse brain tissue sample slice 1).

[0026] Figure 7 is the smoothed MSI super-resolution image obtained from the first three stages of reconstruction for the first group of experiments (mouse brain tissue sample slice 1).

[0027] Figure 8 is the high-quality MSI super-resolution image obtained from the final reconstruction for the first group of experiments (mouse brain tissue sample slice 1).

[0028] Figure 9 is the high-resolution HE image of the second group of experiments (mouse brain tissue sample slice 2).

[0029] Figure 10 is the low-resolution MSI image of the second group of experiments (mouse brain tissue sample slice 2).

[0030] Figure 11 is the high-resolution MSI image of the second group of experiments (mouse brain tissue sample slice 2).

[0031] Figure 12 is the high-resolution MSI texture reference image of the second group of experiments (mouse brain tissue sample slice 2).

[0032] Figure 13 shows the alignment mapping result obtained after dual-modal adaptive registration for the second group of experiments (mouse brain tissue sample slice 2).

[0033] Figure 14 is the smoothed MSI super-resolution image obtained from the first three stages of reconstruction for the second group of experiments (mouse brain tissue sample slice 2).

[0034] Figure 15 is the high-quality MSI super-resolution image obtained from the final reconstruction for the second group of experiments (mouse brain tissue sample slice 2).

Detailed Implementation

[0035] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0036] In the description of this invention, it should be noted that, unless otherwise explicitly specified and limited, the terms "installed," "equipped with," "sleeved with," "connected," etc., should be interpreted broadly. For example, "connection" can be a fixed connection, a detachable connection, or an integral connection; it can be a mechanical connection or an electrical connection; it can be a direct connection or an indirect connection through an intermediate medium; it can be a connection within two components. For those skilled in the art, the specific meaning of the above terms in this invention can be understood according to the specific circumstances.

[0037] Example: Referring to Figures 1 to 15, the present invention provides the following technical solution. This embodiment establishes an experimental platform in a Python environment to implement and verify the overall process and reconstruction effect of the MSI super-resolution reconstruction method. The data used in the experiment consist of two sets of slice data from mouse brain tissue samples, corresponding to different slice morphologies, to verify the stability and applicability of the method under different experimental conditions.

[0038] S1. Image Registration: Due to the significant differences between MSI images and HE images in terms of imaging principles and spatial resolution, this invention first designs a feature extraction layer based on a convolutional neural network (CNN) to capture deep features representing tissue structures. Specifically, the network extracts spatial geometric features through three sequentially connected "convolution-pooling-activation" structures: the first group uses a 7×7 convolution kernel to transform the input image from 3 channels to 8 channels, followed by a max pooling layer with a window size of 4 and a stride of 4 and the ReLU activation function for dimensionality reduction; the second group uses a 5×5 convolution kernel to increase the number of channels to 10 and repeats the same pooling and activation process to expand the receptive field; the third group uses a 1×1 convolution kernel to adjust the features back to 8 channels, followed again by max pooling and ReLU activation, progressively compressing the features into higher-level spatial representations.
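As a rough check of how quickly the three convolution-pooling-activation groups shrink the feature map, the following sketch traces channel counts and spatial sizes. The 512×512 input size and the unpadded convolutions are assumptions made for illustration; the text above does not state the padding or the input size.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


# (convolution kernel size, output channels) for the three groups.
GROUPS = [(7, 8), (5, 10), (1, 8)]
POOL_KERNEL, POOL_STRIDE = 4, 4


def trace_localization_net(height, width):
    """Return [(channels, height, width), ...] after each group."""
    shapes = []
    for kernel, out_channels in GROUPS:
        # Convolution; padding=0 is an assumption, not stated in the text.
        height, width = conv_out(height, kernel), conv_out(width, kernel)
        # 4x4 max pooling with stride 4; ReLU leaves the shape unchanged.
        height = conv_out(height, POOL_KERNEL, POOL_STRIDE)
        width = conv_out(width, POOL_KERNEL, POOL_STRIDE)
        shapes.append((out_channels, height, width))
    return shapes


shapes = trace_localization_net(512, 512)  # hypothetical input size
```

Under these assumptions each group reduces the side length by roughly a factor of four, which is why only a very small 8-channel map remains for the regression stage.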

[0039] To achieve automatic spatial alignment of the input image, this invention introduces a spatial transformation network module, which mainly consists of a localization network, an affine parameter regressor, and a spatial sampler. The feature map processed by the aforementioned feature extraction layer is compressed into a 1×1×8 feature vector through an adaptive average pooling layer and flattened before being input into the parameter regressor, which consists of two fully connected layers. In the parameter regression stage, the first fully connected layer maps the 8-dimensional features to 16 dimensions and utilizes the ReLU activation function to enhance nonlinear expressive power; the second fully connected layer outputs six scalar parameters, which are then recombined into a 2×3 affine transformation matrix.

[0040] After obtaining the affine parameters, a sampling grid is constructed from the target coordinate system to the source image coordinate system. A differentiable grid sampling operation is then used to transform the input image. In the resulting corrected image, rotation, scaling, and translation errors are automatically compensated, allowing the subsequent fusion network to learn under a more consistent spatial reference. This corrected image is denoted as … and continues to be used as the input for deep feature extraction in subsequent steps.
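The grid construction and sampling described above can be sketched in plain Python for a single-channel image. The normalized coordinate range of [-1, 1], the row-major packing of the six regressed parameters, and the zero value used outside the source image are assumptions of this sketch, chosen to mirror common spatial transformer implementations rather than taken from the patent.

```python
import math


def to_matrix(params):
    """Reassemble six regressed scalars into a 2x3 affine matrix (row-major)."""
    return [list(params[0:3]), list(params[3:6])]


def affine_warp(image, theta):
    """Warp a 2D image (list of rows, at least 2x2) with a 2x3 affine matrix.

    Each target pixel is mapped through theta to a source location, which
    is sampled bilinearly. Samples outside the source contribute zero.
    """
    h, w = len(image), len(image[0])

    def pixel(r, c):
        return image[r][c] if 0 <= r < h and 0 <= c < w else 0.0

    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Target pixel index -> normalized coordinates in [-1, 1].
            x = 2.0 * j / (w - 1) - 1.0
            y = 2.0 * i / (h - 1) - 1.0
            # Affine map from target to source coordinates.
            xs = theta[0][0] * x + theta[0][1] * y + theta[0][2]
            ys = theta[1][0] * x + theta[1][1] * y + theta[1][2]
            # Normalized source coordinates -> fractional pixel position.
            col = (xs + 1.0) * (w - 1) / 2.0
            row = (ys + 1.0) * (h - 1) / 2.0
            c0, r0 = math.floor(col), math.floor(row)
            dc, dr = col - c0, row - r0
            out[i][j] = (
                pixel(r0, c0) * (1 - dr) * (1 - dc)
                + pixel(r0, c0 + 1) * (1 - dr) * dc
                + pixel(r0 + 1, c0) * dr * (1 - dc)
                + pixel(r0 + 1, c0 + 1) * dr * dc
            )
    return out


image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
same = affine_warp(image, to_matrix([1, 0, 0, 0, 1, 0]))  # identity transform
```

With the identity parameters the image is returned unchanged up to rounding; a translation entry of 2/(w - 1) shifts the sampled content by one pixel. Because every step is a sum of products, the same computation is differentiable with respect to the six parameters when expressed in an automatic differentiation framework.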

[0041] S2. Mapping: After the spatially corrected image is obtained, this invention employs a three-stage convolutional mapping structure to perform feature extraction and deep texture modeling on the image. This structure consists of a head convolutional unit, a main convolutional unit, and a tail mapping unit connected in series.

[0042] First, the head convolutional unit performs initial feature mapping on the input corrected image. This unit employs a convolutional module with a three-channel input and a 64-channel output, a 3×3 kernel, and a stride of 1, and uses symmetrical padding to maintain the spatial resolution of the feature map. The module consists of a convolution operation, a normalization operation, and a non-linear activation function, and maps the input image from the original pixel space to a 64-dimensional feature space, thereby enhancing the ability to express local structure, texture details, and cross-channel information.

[0043] Subsequently, the main convolutional unit further models the aforementioned features. The main unit consists of three consecutively stacked convolutional modules, each employing a 3×3 convolutional structure with 64 input and 64 output channels and a stride of 1, and each maintaining the same spatial dimensions. Through the progressive stacking of multiple convolutional layers, the network can continuously expand its receptive field and enhance its contextual modeling ability without introducing downsampling, thereby achieving the joint extraction of deep semantic information and fine-grained texture features.

[0044] After the main features are extracted, the tail mapping unit maps and compresses the high-dimensional features along the channel dimension. This unit also uses a 3×3 convolutional structure to map the 64-channel deep features back to the number of output channels required by the task, adapting to subsequent MSI fusion or restoration tasks. This step returns from the high-dimensional semantic space to the original task channel space, enabling the network to output the final fusion or correction result, which is denoted as ….
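The claim that the mapping module enlarges its receptive field without downsampling can be checked with a short calculation. The sketch below assumes that each convolutional module contains exactly one 3×3, stride-1 convolution with a padding of 1, giving five such convolutions in total (head, three main modules, tail); the text does not confirm this count.

```python
def receptive_field(kernels, strides=None):
    """Receptive field, in input pixels, of a stack of convolutions."""
    strides = strides or [1] * len(kernels)
    field, jump = 1, 1
    for kernel, stride in zip(kernels, strides):
        field += (kernel - 1) * jump   # each layer widens the field
        jump *= stride                 # spacing between adjacent outputs
    return field


def output_size(size, kernel=3, stride=1, padding=1):
    """Spatial size after one convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


# head (1) + main (3) + tail (1) = five 3x3 convolutions (assumed count).
rf = receptive_field([3] * 5)
size_after = output_size(256)  # hypothetical 256-pixel side length
```

Under these assumptions every output pixel sees an 11×11 neighbourhood of the corrected image while the spatial size stays unchanged.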

[0045] S3. Multi-scale fusion: An improved deep image prior network is adopted and combined with a multi-scale feature fusion strategy to generate MSI reconstructed images with consistent structure, clear edges and overall smoothness.

[0046] To further enhance the edge and local structure information in the prediction results, a structure-enhancement operation based on the Laplacian operator is applied to the predicted image … obtained in the previous stage. The Laplacian convolution kernel used is defined as follows:

[0047] Convolving the predicted image with the Laplacian kernel yields the structure-enhanced image.

[0048] This step effectively highlights the edge contours and local structural changes in the image, providing a more discriminative structural prior for subsequent multi-scale feature extraction and fusion.
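The effect of Laplacian filtering on a simple edge can be illustrated as follows. The kernel definition is not reproduced in this text, so the standard 4-neighbour discrete Laplacian is used here purely as an assumption, as is the zero padding at the image border.

```python
# 4-neighbour discrete Laplacian (assumed; not taken from the patent).
LAPLACIAN = [[0, 1, 0],
             [1, -4, 1],
             [0, 1, 0]]


def filter3x3_same(image, kernel):
    """Apply a 3x3 kernel with zero padding; output keeps the input shape.

    The kernel is symmetric, so correlation and convolution coincide.
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    r, c = i + di, j + dj
                    if 0 <= r < h and 0 <= c < w:
                        acc += kernel[di + 1][dj + 1] * image[r][c]
            out[i][j] = acc
    return out


# A vertical step edge: left half 0, right half 1.
step = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
edges = filter3x3_same(step, LAPLACIAN)
```

In the interior rows the response is +1 on the dark side of the edge and -1 on the bright side, and 0 where the image is locally flat. Zero padding also produces responses along the image border, which a real implementation might suppress with replicate padding.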

[0049] In the structure reconstruction module, a dual-encoder DIP network extracts features from the low-resolution MSI image and the structure-enhanced HE image, respectively, to fully exploit the complementary information of the two modalities at different scales. The encoders extract multi-scale features layer by layer, from low-level details to high-level semantics, providing a basic representation for subsequent fusion and decoding.

[0050] Specifically, the low-resolution MSI image and the structure-enhanced HE image are fed into their corresponding encoders. Each encoder consists of four convolutional and pooling layers. Each convolutional block contains two 3×3 convolution operations with a stride of 1 and a padding of 1, and uses the ReLU activation function. The number of output channels of the convolutional blocks is set sequentially to … so that the feature representation capability is enhanced layer by layer. A 2×2 max pooling operation is applied after each convolutional block to compress the spatial size of the feature map layer by layer. This enhances the network's ability to model large-scale tissue structures while preserving significant structural information, providing a stable multi-scale representation for cross-modal feature fusion.

[0051] The output feature maps of each layer of the encoder are collected to form a multi-scale feature sequence:

[0052] where … denotes the shallowest features, which have high spatial resolution and a relatively small number of channels, and … denotes the deepest features, which have lower spatial resolution and richer semantic information.

[0053] To reduce the computational complexity of the multi-scale fusion process, channel dimensionality reduction is performed on the output features of each layer of the encoder. Specifically, the MSI features and the HE features output by the …-th encoder layer are each reduced in dimensionality through a 1×1 convolution projection to obtain the reduced features:

[0054] In this embodiment, the number of channels after dimensionality reduction in each layer is set to …. The 1×1 convolution projects the original number of channels to this target number of channels; that is, the channel dimension of each encoder output is reduced to one quarter of its original size.
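A 1×1 convolution is a per-pixel linear combination of channels, so reducing the channel count to one quarter can be sketched without any framework. The 8-channel input and the averaging weights below are hypothetical; in the described network the weights would be learned.

```python
def conv1x1(feature, weight):
    """Apply a 1x1 convolution (no bias) to a C x H x W nested-list feature map.

    weight has shape C_out x C_in; each output value combines the input
    channels at the same spatial location.
    """
    c_in, h, w = len(feature), len(feature[0]), len(feature[0][0])
    return [
        [
            [sum(weight[o][i] * feature[i][r][c] for i in range(c_in))
             for c in range(w)]
            for r in range(h)
        ]
        for o in range(len(weight))
    ]


# 8 input channels on a 2x2 grid; channel k holds the constant value k.
feature = [[[float(k)] * 2 for _ in range(2)] for k in range(8)]
# Hypothetical weights: each of the 8 // 4 = 2 outputs averages four inputs.
weight = [[0.25] * 4 + [0.0] * 4,
          [0.0] * 4 + [0.25] * 4]
reduced = conv1x1(feature, weight)
```

The spatial size is untouched; only the channel axis shrinks, which is what lowers the cost of the subsequent fusion.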

[0055] The dimensionality-reduced MSI features and HE features are input into a multi-scale fusion engine for cross-modal fusion. Before fusion, spatial attention is first extracted from the HE features to highlight their dominant role in structural information.

[0056] Let the number of HE feature channels be …. A spatial attention subnetwork is constructed using … convolution and ReLU activation. This subnetwork first compresses the number of channels to:

[0057] A further … convolution then maps the result to a single-channel feature, which is activated by Sigmoid to obtain the spatial attention matrix. The calculation formula is as follows:

[0058] Spatial weighting of MSI features is performed using this spatial attention matrix:

[0059] Where ⊙ denotes element-wise multiplication. At locations where HE information is significant, attention weights are smaller to suppress MSI color features; at locations where the structure is less significant, attention weights are larger to preserve MSI color information. Subsequently, the weighted MSI features are concatenated with the original HE features along the channel dimension to obtain the fused features:

[0060] HE features play a dominant structural role in the fusion process. The concatenated fusion features are then compressed back to the MSI output channel number using a 1×1 convolution so that the decoder can perform subsequent processing.
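The gating and concatenation steps can be sketched on nested lists. The pre-activation map `logits` stands in for the output of the attention subnetwork and is supplied by hand here; the function and variable names are hypothetical, and the final 1×1 compression is omitted.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attention_fuse(msi_feat, he_feat, logits):
    """Gate MSI features with a spatial attention map, then concatenate.

    msi_feat, he_feat: C x H x W nested lists. logits: H x W pre-activation
    map that the attention subnetwork would compute from the HE features.
    Returns (attention, fused); fused has C_msi + C_he channels.
    """
    attention = [[sigmoid(v) for v in row] for row in logits]
    weighted = [
        [[a * v for a, v in zip(att_row, feat_row)]
         for att_row, feat_row in zip(attention, channel)]
        for channel in msi_feat
    ]
    # Channel-wise concatenation: weighted MSI channels first, then HE.
    return attention, weighted + he_feat


msi = [[[2.0, 2.0], [2.0, 2.0]]]   # 1 channel, 2x2
he = [[[1.0, 0.0], [0.0, 1.0]]]    # 1 channel, 2x2
# Hypothetical logits: strongly negative where HE structure is salient.
logits = [[-20.0, 20.0], [20.0, -20.0]]
attention, fused = attention_fuse(msi, he, logits)
```

Where the logit is strongly negative the MSI value is driven toward zero, and where it is strongly positive the MSI value passes through almost unchanged, matching the suppress-or-preserve behaviour described above.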

[0061] The decoder performs upsampling and feature reconstruction layer by layer based on the multi-scale fused features. The decoding process starts from the deepest feature layer. At each level, bilinear interpolation is first used for upsampling, and the result is joined through a skip connection with the features of the corresponding encoder layer. If the spatial sizes do not match, the feature maps are cropped so that they can be concatenated.

[0062] After each concatenation, feature reconstruction is performed using two consecutive convolutional blocks with a kernel size of …, a stride of 1, a padding of 1, and the ReLU activation function, gradually recovering image detail information. Finally, a … convolution maps the number of channels to the target output channels, that is, the number of MSI image channels, generating the final smooth reconstructed image.

[0063] Based on the smooth structure map output by the decoder, a correction network is introduced to further improve the fidelity of local details and tissue edges. This network is used to perform local compensation and fine-tuning on the decoder output, improving structural accuracy without changing the number of output channels.

[0064] Correcting the network input to the decoder output image:

[0065] in The number of channels is consistent with the MSI (Medium Input Sequence) channel count. This network consists of two convolutional layers; the first convolutional layer takes the input channel count as input. Projected to The kernel size is The first convolution has a stride of 1, padding of 1, and uses the LeakyReLU activation function; the second convolution restores the number of channels to [previous value]. The kernel size is also... The step size is 1, the padding is 1, and there is no additional activation function. Its forward computation process can be represented as:

[0066] After obtaining the corrected network output, this invention further introduces a weakly gated color correction mechanism to further improve the matching degree between the output image and the low-resolution MSI image in terms of brightness distribution and color consistency.

[0067] First, the low-resolution MSI image is upsampled to the same spatial size as the output of the correction network, resulting in [symbol missing]. Then, local mean pooling is used to extract the low-frequency components:

[0068] Next, learnable parameters are introduced and constrained to the range [0, 1] through Sigmoid activation, yielding the gate weights:

[0069] The final color correction output is a weighted fusion of the correction network output and the low-frequency MSI:

[0070] Here, the weak gating parameters learn the proportion of structural versus color information selected at each location in the reconstructed image. This achieves adaptive fusion of structure and color and improves the naturalness and stability of the final reconstructed image.
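The weakly gated color correction described in paragraphs [0067] to [0070] can be sketched as follows. Because the original formulas are not available, this is only one plausible reading: the gate is assumed to multiply the correction network output and its complement the low-frequency MSI component, and the names `box_mean`, `gated_color_correction`, `theta`, and the window size `k` are hypothetical.

```python
import numpy as np

def box_mean(img, k):
    """Local mean over a k x k window (k odd), edge-padded, for a (C, H, W) array."""
    pad = k // 2
    padded = np.pad(img, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    h, w = img.shape[1:]
    out = np.zeros(img.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[:, dy:dy + h, dx:dx + w]
    return out / (k * k)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_color_correction(corrected, lr_upsampled, theta, k=5):
    """Blend the corrected image with the low-frequency MSI component.

    Assumption: output = g * corrected + (1 - g) * low_freq, with g = sigmoid(theta).
    """
    low_freq = box_mean(lr_upsampled, k)   # local mean pooling
    gate = sigmoid(theta)                  # gate weights constrained to [0, 1]
    return gate * corrected + (1.0 - gate) * low_freq

# Toy inputs: theta = 0 gives a gate of 0.5 at every location.
corrected = np.ones((2, 6, 6))
lr_upsampled = np.full((2, 6, 6), 2.0)
theta = np.zeros((1, 6, 6))
fused = gated_color_correction(corrected, lr_upsampled, theta, k=3)
```

In a trained model `theta` would be a learnable tensor updated by backpropagation; here it is a fixed array so that the blending arithmetic is easy to follow.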

[0071] S4. Texture Preservation: To further enhance the realism of the reconstructed MSI image in terms of high-frequency details and metabolic textures, this embodiment introduces a texture preservation module, which injects texture information that conforms to the characteristics of real MSI into the smooth reconstruction result while maintaining structural consistency.

[0072] First, a high-resolution MSI reference image R is used as the texture prior, and its high-frequency residual components are extracted through local mean filtering to obtain the initial grain texture representation:

[0073] where the filtering term denotes the local mean filtering operation performed with a given window size. The window size controls the spatial scale of texture extraction: larger windows correspond to coarser and more prominent texture grains, while smaller windows are better for capturing fine-grained texture features.

[0074] To adapt the extracted texture information to different image scales and region distributions, and to enhance the model's controllability over texture intensity, the high-pass residual is subjected to mean removal and standardization, and a texture intensity control coefficient is introduced. This yields a standardized texture representation:

[0075] where the two statistics are the mean and standard deviation of the high-pass residual, respectively, and the remaining term is a numerical stability term. This step normalizes the texture distribution to a uniform scale, and the texture intensity control coefficient adaptively adjusts the overall texture intensity.
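Texture extraction and standardization can be sketched in NumPy as follows. The sketch assumes that the high-frequency residual is the reference image minus its local mean, that the statistics are computed per channel, and that the stability term is added to the standard deviation; the names `extract_texture`, `standardize_texture`, `alpha`, and `eps` are hypothetical, since the original formulas are not available.

```python
import numpy as np

def box_mean(img, k):
    """Local mean over a k x k window (k odd), edge-padded, for a (C, H, W) array."""
    pad = k // 2
    padded = np.pad(img, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    h, w = img.shape[1:]
    out = np.zeros(img.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[:, dy:dy + h, dx:dx + w]
    return out / (k * k)

def extract_texture(reference, k=5):
    """High-frequency residual: reference minus its local mean (assumed form)."""
    return reference - box_mean(reference, k)

def standardize_texture(residual, alpha=1.0, eps=1e-6):
    """Remove the per-channel mean, divide by the std, scale by alpha."""
    mu = residual.mean(axis=(1, 2), keepdims=True)
    sigma = residual.std(axis=(1, 2), keepdims=True)
    return alpha * (residual - mu) / (sigma + eps)

rng = np.random.default_rng(0)
reference = rng.random((3, 16, 16))          # stand-in for reference image R
texture = standardize_texture(extract_texture(reference, k=5), alpha=0.5)
```

With this form, a larger `k` keeps coarser structures in the residual, and `alpha` sets the standard deviation of the resulting texture directly.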

[0076] Statistical analysis of MSI data revealed that dark areas exhibit more pronounced texture grain, while bright areas show weaker texture. To model this negative correlation between brightness and texture, this embodiment introduces a local brightness adaptive modulation mechanism.

[0077] Specifically, the average brightness of the smooth structure map S is calculated within each local region, and the standardized texture is modulated based on this brightness value:

[0078] where the modulation term is the luminance modulation coefficient, which controls the degree to which luminance suppresses texture intensity. This mechanism enhances texture in dark areas and appropriately weakens it in bright areas, making the generated texture distribution more consistent with the metabolic characteristics of real MSI.

[0079] Subsequently, the texture is injected into the smooth structure map S to obtain the texture fusion result:

[0080] This step generates an image that, while maintaining overall structural consistency, introduces high-frequency metabolic textures that conform to the characteristics of real MSI.
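Brightness-adaptive modulation and texture injection can be illustrated with a short sketch. The exact modulation function is not recoverable from the source, so a linear attenuation `1 - beta * brightness` is assumed here, with S taken to lie in [0, 1] and injection taken to be additive; `modulate_and_inject`, `beta`, and `k` are hypothetical names.

```python
import numpy as np

def box_mean(img, k):
    """Local mean over a k x k window (k odd), edge-padded, for a (C, H, W) array."""
    pad = k // 2
    padded = np.pad(img, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    h, w = img.shape[1:]
    out = np.zeros(img.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[:, dy:dy + h, dx:dx + w]
    return out / (k * k)

def modulate_and_inject(smooth, texture, beta=0.5, k=3):
    """Attenuate texture where the smooth map is bright, then add it to the map.

    Assumption: gain = 1 - beta * local_brightness, with brightness in [0, 1].
    """
    brightness = np.clip(box_mean(smooth, k), 0.0, 1.0)
    modulated = texture * (1.0 - beta * brightness)
    return smooth + modulated

# Toy example: dark left half, bright right half, constant unit texture.
smooth = np.zeros((1, 4, 8))
smooth[:, :, 4:] = 1.0
texture = np.ones((1, 4, 8))
fused = modulate_and_inject(smooth, texture, beta=0.5, k=3)
# Texture gain is 1.0 in the dark region and 0.5 in the bright region.
```

Any monotonically decreasing function of brightness would express the same negative correlation; the linear form is chosen only because it is the simplest to inspect.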

[0081] Since texture addition may change the overall brightness and intensity distribution, this invention further performs global brightness correction to make the generated result consistent with the original low-quality MSI image L in a statistical sense.

[0082] For each channel, the mean is adjusted so that the generated image is consistent with the low-quality MSI image in terms of channel mean:

[0083] This global correction operation ensures that the final output image is more consistent with the real MSI imaging pattern in terms of overall brightness distribution, channel intensity mean, and metabolic signal statistical characteristics.
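The per-channel mean alignment can be sketched as a simple additive shift. Whether the method uses an additive shift or a multiplicative rescaling is not recoverable from the source, so the additive form and the name `match_channel_means` are assumptions.

```python
import numpy as np

def match_channel_means(generated, low_res):
    """Shift each channel of `generated` so its mean equals that of `low_res`.

    Both inputs are (C, H, W) arrays; their spatial sizes may differ.
    """
    gen_mean = generated.mean(axis=(1, 2), keepdims=True)
    ref_mean = low_res.mean(axis=(1, 2), keepdims=True)
    return generated - gen_mean + ref_mean

rng = np.random.default_rng(1)
generated = rng.random((3, 32, 32)) + 0.3   # texture-injected image, shifted
low_res = rng.random((3, 8, 8))             # stand-in for low-quality MSI image L
corrected = match_channel_means(generated, low_res)
```

An additive shift leaves the injected high-frequency texture untouched, because it changes only the constant offset of each channel.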

[0084] Steps S1 and S2 are designated as the first stage, and step S3 as the second stage. A phased training strategy is adopted. In the first stage of training, in order to simultaneously constrain the consistency of the prediction results in the overall intensity distribution and edge structure features, a joint loss function that integrates pixel consistency constraints and edge structure constraints is constructed.

[0085] At the pixel level, the MSE loss is calculated between the prediction result and the low-resolution MSI image L:

[0086] This loss term is used to constrain the structure map generated by the mapping module to maintain consistency with the original low-resolution MSI image in terms of overall intensity distribution and low-frequency information, thereby avoiding global brightness shift or metabolic intensity distortion.

[0087] At the structural constraint level, this embodiment introduces an edge consistency loss based on the Sobel operator to enhance the ability of the prediction results to preserve tissue contours and local structural details.

[0088] The Sobel operator calculates the gradient of the image in the horizontal and vertical directions using discrete convolution kernels in two directions, respectively. The forms of the convolution kernels are as follows:

[0089] Let the input image be given. Its gradient responses in the horizontal and vertical directions can then be expressed as follows:

[0090] Furthermore, by synthesizing the gradients in the two directions mentioned above, the edge intensity representation of the image is obtained:

[0091] In this embodiment, the Sobel edge detection operation described above is performed on both the prediction result and the low-resolution MSI image L, and the mean squared error loss of the edges is calculated from the corresponding edge intensity maps to constrain the consistency of the prediction results in terms of structural contours and local details.

[0092] Combining pixel consistency constraints and edge structure constraints, the total loss function for the first stage is defined as:

[0093] By jointly optimizing the two types of loss terms, the network is guided to effectively enhance its ability to express tissue structure and edge details while maintaining the consistency of global metabolic distribution in low-resolution MSI.
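The first-stage joint loss can be sketched with the standard 3 x 3 Sobel kernels. The sketch assumes single-channel images of equal size, a gradient magnitude of the form sqrt(gx^2 + gy^2), and a hypothetical weight `lam` on the edge term; the weighting actually used in the total loss is not recoverable from the source.

```python
import numpy as np

# Standard Sobel kernels for horizontal and vertical gradients.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter3x3(img, kernel):
    """Cross-correlate a 2D image with a 3 x 3 kernel using edge padding."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def sobel_edges(img):
    """Edge intensity map from the two directional gradient responses."""
    gx = filter3x3(img, SOBEL_X)
    gy = filter3x3(img, SOBEL_Y)
    return np.sqrt(gx ** 2 + gy ** 2)

def stage1_loss(pred, target, lam=1.0):
    """Pixel MSE plus lam times the MSE between Sobel edge maps (assumed form)."""
    pixel = np.mean((pred - target) ** 2)
    edge = np.mean((sobel_edges(pred) - sobel_edges(target)) ** 2)
    return pixel + lam * edge

# Toy example: a vertical step edge compared with itself and with a flat image.
step = np.zeros((6, 6))
step[:, 3:] = 1.0
edges = sobel_edges(step)
loss_same = stage1_loss(step, step)
loss_diff = stage1_loss(step, np.zeros((6, 6)))
```

For multi-channel MSI data the same functions would be applied channel by channel and the results averaged.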

[0094] To further optimize the performance of the fused image at the structural and texture levels, this invention designs pixel-level reconstruction loss and high-frequency feature consistency loss for the fusion module to constrain the rationality of the fusion result in terms of overall strength and high-frequency texture.

[0095] The first is the L1 pixel reconstruction loss, and its formula is:

[0096] where the first term is the predicted value of the fusion map, the second is the low-resolution MSI image, and the normalizing term is the total number of pixels.

[0097] This loss is used to constrain the consistency of the fused image with the low-resolution MSI image in terms of overall pixel values, thereby ensuring that the brightness and color distribution are roughly correct and providing a stable basis for high-frequency structure constraints.

[0098] The second is the high-frequency edge consistency loss, which extracts high-frequency features (including edge and texture information) from the fused image and the HE image respectively using the Laplacian operator, and then calculates the L1 loss based on the results. The calculation formula is as follows:

[0099] This loss term is used to guide the fused image to maintain consistency with the HE image at the high-frequency structural level, thereby enhancing the clarity of tissue boundaries and the realism of texture in the fused result.

[0100] When training the fusion module, the total loss function for the second stage is defined as:

[0101] Through the above loss design, the fusion result can effectively introduce cross-modal structural and texture information while maintaining the consistency of metabolic intensity, thereby improving the structural credibility and visual consistency of the generated MSI image.
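The second-stage loss can be sketched in the same style. The 4-neighbor Laplacian kernel, the weight `lam`, and the assumption that the fused image, the low-resolution MSI image, and the HE image have been brought to the same single-channel shape are all illustrative choices, not details confirmed by the source.

```python
import numpy as np

# A common 4-neighbor discrete Laplacian kernel (assumed variant).
LAPLACIAN = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)

def filter3x3(img, kernel):
    """Cross-correlate a 2D image with a 3 x 3 kernel using edge padding."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def stage2_loss(fused, low_res, he, lam=1.0):
    """L1 pixel loss against the MSI image plus lam times the L1 loss
    between Laplacian responses of the fused image and the HE image."""
    pixel_l1 = np.mean(np.abs(fused - low_res))
    hf_l1 = np.mean(np.abs(filter3x3(fused, LAPLACIAN) - filter3x3(he, LAPLACIAN)))
    return pixel_l1 + lam * hf_l1

# Toy example: flat images have zero Laplacian, so only the pixel term remains.
fused = np.ones((5, 5))
low_res = np.zeros((5, 5))
he_flat = np.full((5, 5), 3.0)
loss_flat = stage2_loss(fused, low_res, he_flat)

# An HE image containing an edge adds a high-frequency penalty.
he_edge = np.zeros((5, 5))
he_edge[:, 2:] = 1.0
loss_edge = stage2_loss(fused, low_res, he_edge)
```

The toy example shows the division of labor between the two terms: the pixel term ties intensity to the MSI image, while the Laplacian term responds only to differences in high-frequency structure relative to the HE image.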

[0102] All other parts of this invention not described herein are the same as existing technologies, or are known technologies, or can be implemented using existing technologies, and will not be described in detail here.

[0103] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims

1. A method for MSI super-resolution reconstruction based on dual-modal adaptive registration and high-fidelity fusion, characterized in that it includes the following steps:

S1. Image Registration: Input the low-resolution MSI image and the high-resolution HE image into the spatial transformation network module. The feature extraction layer of the convolutional neural network captures the deep tissue structure features, and the parameter regressor outputs the affine transformation matrix. The spatially aligned corrected image is obtained by sampling through a differentiable grid.

S2. Mapping: The aligned and corrected image is input into a lightweight convolutional mapping module consisting of a head convolutional unit, a main convolutional unit, and a tail mapping unit connected in series. While maintaining the spatial resolution, the image is mapped from the pixel space to the target channel space to obtain a preliminary predicted image.

S3. Multi-scale Fusion: Apply Laplacian operator convolution to the preliminary prediction image to obtain a structure-enhanced image. The low-resolution MSI image and the structure-enhanced image are respectively input into the dual encoder DIP network to extract multi-scale feature sequences layer by layer. After channel-wise dimensionality reduction of the output features of each layer of the dual encoder, a spatial attention matrix is constructed based on the dimensionality-reduced HE features. The dimensionality-reduced MSI features are then spatially weighted, and the weighted MSI features are concatenated with the original HE features in the channel dimension. This concatenation is then compressed to the target number of channels to obtain multi-scale fusion features. The multi-scale fusion features are then upsampled layer by layer by the decoder and skip-connected with the features of the corresponding encoder layers to gradually reconstruct a smooth structure reconstruction map. The smooth structure reconstruction map is then input into a correction network for local compensation and fine-tuning to obtain the structure correction result. By using a weakly gated color correction mechanism, the structure correction result and the low-frequency components extracted from the low-resolution MSI image through upsampling and local mean pooling are adaptively weighted and fused using the gate weights obtained after Sigmoid activation of the learnable parameters, resulting in a smooth fused image.

S4. Texture Preservation: Perform local mean filtering on the reference image and extract the high-frequency residual as the initial grain texture. The initial grain texture is subjected to mean removal and normalization, and the amplitude is adjusted by the texture intensity coefficient to obtain a normalized texture. Based on the local brightness information of the smooth fusion map, the normalized texture is adaptively modulated using the brightness modulation coefficient to enhance the texture in dark areas and weaken the texture in bright areas. The modulated texture is injected into the smooth fusion map, and then global brightness correction is performed to make the output image consistent with the low-resolution MSI image at the channel mean level, finally yielding a high-resolution MSI super-resolution reconstructed image.

2. The MSI super-resolution reconstruction method based on dual-modal adaptive registration and high-fidelity fusion according to claim 1, characterized in that, in step S1, the spatial transformation network module includes a localization network, a parameter regressor, and a spatial sampler. The localization network captures deep tissue structure features based on the feature extraction layer of a convolutional neural network. After being compressed and flattened by an adaptive average pooling layer, the features are input to the parameter regressor, which consists of two fully connected layers. The regressor outputs affine transformation parameters and reassembles them into an affine transformation matrix. The spatial sampler constructs a sampling grid based on the affine transformation matrix and performs spatial transformation on the input image through differentiable grid sampling to obtain a corrected image.

3. The MSI super-resolution reconstruction method based on dual-modal adaptive registration and high-fidelity fusion according to claim 1, characterized in that, in step S1, the convolutional neural network extracts spatial geometric features through three sets of sequentially connected convolution-pooling-activation structures, thereby gradually compressing the features and focusing them on higher-level space.

4. The MSI super-resolution reconstruction method based on dual-modal adaptive registration and high-fidelity fusion according to claim 1, characterized in that, in step S2, in the lightweight convolutional mapping module, the head convolutional unit performs initial feature mapping on the input corrected image; the main convolutional unit consists of three consecutively stacked convolutional modules, which further model the output features; and the tail mapping unit performs channel dimension mapping and compression on the high-dimensional features output by the main unit.

5. The MSI super-resolution reconstruction method based on dual-modal adaptive registration and high-fidelity fusion according to claim 1, characterized in that, in step S3, each encoder in the dual encoder DIP network consists of a four-layer convolutional and pooling structure. Each convolutional block contains two convolution operations, and each convolutional block is followed by a max pooling operation to compress the spatial size of the feature map layer by layer. The output feature map of each layer of the encoder is collected to form a multi-scale feature sequence.

6. The MSI super-resolution reconstruction method based on dual-modal adaptive registration and high-fidelity fusion according to claim 1, characterized in that, in step S3, the spatial attention matrix is constructed based on the dimensionality-reduced HE features and generated through convolution and Sigmoid activation. The spatial attention matrix is used to perform element-wise multiplication and weighting on the dimensionality-reduced MSI features, so that the MSI color features in the saliency region of HE structural information are suppressed, while the MSI color information in the non-saliency region is preserved.

7. The MSI super-resolution reconstruction method based on dual-modal adaptive registration and high-fidelity fusion according to claim 1, characterized in that, in step S3, the decoder reconstructs the image layer by layer, starting from the deepest fused features, and uses bilinear interpolation for upsampling at each level. After each concatenation, feature reconstruction is performed through two consecutive convolutional blocks, and the ReLU activation function is used to gradually restore the image detail information.

8. The MSI super-resolution reconstruction method based on dual-modal adaptive registration and high-fidelity fusion according to claim 1, characterized in that, in step S4, the spatial scale of the extracted texture particles is controlled by adjusting the window size of the local mean filter, where a larger window corresponds to coarser texture particles and a smaller window corresponds to finer texture features; the overall amplitude of the injected texture is controlled by the texture intensity coefficient.

9. The MSI super-resolution reconstruction method based on dual-modal adaptive registration and high-fidelity fusion according to claim 1, characterized in that steps S1 and S2 are designated as the first stage and step S3 as the second stage, and a phased training strategy is adopted. The first stage is first trained separately, during which the second stage does not participate in model parameter updates. After this pre-training is completed, the first stage and the second stage are trained jointly. Step S4 is used only in the inference stage and does not participate in loss calculation or model parameter updates during training.