A medical image registration method based on a multi-layer perceptron

A multilayer perceptron-based medical image registration method combines a multi-scale feature extraction encoder with a correlation-aware registration decoder. It addresses the inability of existing methods to capture fine-grained long-range dependencies at full resolution, achieving high-precision medical image registration and improving feasibility for clinical applications.

CN120047504B · Active · Publication Date: 2026-06-12 · TIANJIN UNIVERSITY OF TECHNOLOGY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
TIANJIN UNIVERSITY OF TECHNOLOGY
Filing Date
2025-02-21
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing medical image registration methods based on convolutional neural networks struggle to maintain boundary accuracy when processing images with significant structural boundaries or discontinuous regions, and cannot capture fine-grained long-range dependencies at full resolution, thus limiting their application in clinical practice.

Method used

A medical image registration method based on multilayer perceptrons is proposed. It combines a multi-scale feature extraction encoder with a correlation-aware registration decoder, so that multi-scale feature maps and the correlations between successive registration steps are used together. The UCA-MLP module captures the multi-scale dependencies of local and global features, and its unstructured correlation-aware module captures local deformation, thus achieving coarse-to-fine registration.

Benefits of technology

It improves the accuracy and robustness of medical image registration, can capture fine-grained long-range dependencies at full resolution, solves the local deformation problem of traditional methods when processing large-scale or high-resolution images, and improves registration accuracy and model stability.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a medical image registration method based on a multilayer perceptron, which is realized by inputting medical images into a trained deep learning-based single-modal and multi-modal medical image registration network. The registration network is composed of a multi-scale feature extraction encoder and a correlation-aware registration decoder, and uses the multi-scale feature maps and the correlation between multiple registration steps as complementary information to provide key context information for each registration step. To capture the multi-scale dependencies of local and global features in the registration process, an unstructured correlation-aware multilayer perceptron is designed, which solves the problem that existing methods cannot capture long-distance dependencies at full resolution. At the same time, the added correlation information between multiple registration steps enhances robustness, ensures that each registration step can use the results of the previous step, and solves the problem of local deformation that traditional registration methods face when processing large-scale or high-resolution images.

Description

Technical Field

[0001] This invention relates to the interdisciplinary field of deep learning and medical image registration, and particularly to a medical image registration method based on a multilayer perceptron.

Background Technology

[0002] Medical image registration is a key technology in the field of medical image analysis. When registering two images, one image (moving image) is mapped onto another image (fixed image), ensuring that points with the same anatomical significance correspond to each other, thus achieving information matching and fusion. In clinical diagnosis, doctors often need to compare multiple images of a patient, such as computed tomography (CT) images, magnetic resonance imaging (MRI) images, positron emission tomography (PET) images, and ultrasound images, to obtain more comprehensive information. Through medical image registration, doctors can accurately compare the location and morphological characteristics of lesions, organ structures, or functional areas in images, thereby better understanding the development and impact of diseases.

[0003] Over the past decade, convolutional neural networks (CNNs) have become a focus of medical image research. Mok T.C. et al. proposed a fast symmetric diffeomorphic image registration method based on CNNs, achieving efficient medical image registration while maintaining the symmetry and diffeomorphic properties of the displacement field. Cao et al. used CNNs to map input image patches of a pair of 3D brain MR volumes one by one to their respective displacement vectors, then combined these displacement vectors to obtain the final registration displacement field. Ito M. et al. used CNNs to learn reasonable deformations to generate realistic displacement fields. De Vos B.D. et al. used unsupervised CNN methods for deformable image registration, not only calculating the image displacement field but also ensuring the consistency of anatomical structures during image deformation. Deep registration methods based on CNNs have been widely used for fast end-to-end registration. These methods learn the mapping from image pairs to spatial transformations based on a set of training data and outperform traditional methods in registration performance. Golland et al. proposed DiffPose, a self-supervised framework for differentiable registration of intraoperative 2D X-ray images with preoperative CT scans, ultimately achieving sub-millimeter registration accuracy on surgical datasets. Li et al. proposed a modality-independent structural representation learning method that utilizes deep neighborhood self-similarity (DNS) and contrastive learning to learn discriminative and contrast-invariant deep structural image representations, demonstrating good performance in multimodal registration tasks. Liu Lei et al. employed convolutional neural networks for feature detection and description while introducing graph neural networks (GNNs) for feature matching; this method achieved excellent performance on registration tasks with large rotations and scale variations.

[0004] Traditional image registration methods often struggle to maintain boundary accuracy when processing images with significant structural boundaries or discontinuous regions. Chen et al. proposed a Deep Discontinuity-preserving Image Registration network (DDIR), which captures and preserves details and boundary information in images by extracting image features at multiple scales and levels, focusing on boundaries and salient feature regions, and avoiding blurring and distortion of these important areas during registration. Balakrishnan G. et al. proposed VoxelMorph, based on the U-Net framework, for deformable registration of brain MRI. VoxelMorph consists of an encoder-decoder structure, where the encoder extracts high-level features of the image and the decoder maps these features to a displacement field. However, existing convolutional neural network-based registration methods lack an effective model to represent the spatial correspondence between moving and fixed images, limiting the widespread application of these methods in clinical practice.

[0005] In recent years, vision Transformers have addressed some of the shortcomings of convolutional neural networks in this area. The Transformer architecture provides a larger receptive field for the model, helping to capture long-range dependency information and achieve higher accuracy. Chen et al. proposed ViT-V-Net, which integrates the Vision Transformer (ViT) into a V-Net-style convolutional network. The ViT-V-Net model consists of a hybrid architecture of convolutional layers and Transformer layers. To propagate information effectively, it uses long skip connections between the encoder and decoder. The output of the decoder is a dense displacement field used to deform the anatomical structures. To overcome the limitations of traditional CNNs, such as local receptive fields and the difficulty in handling long-range dependencies, Ma M. et al. proposed a symmetric Transformer network called SymTrans, which adds a multi-head self-attention mechanism to the convolutional neural network and constructs a symmetric Transformer network architecture to model long-range spatial associations across images. Yang et al. proposed GraformerDIR, which utilizes graph convolution to model non-Euclidean spatial structures in images and combines the long-range dependency capture capability of Transformers to address the trade-off between registration accuracy and computational complexity in existing methods. However, although Transformer-based medical image registration methods can capture long-range dependencies between image features, they typically operate only on low-resolution features because of the computational and memory cost of the self-attention mechanism, and therefore fail to capture fine-grained long-range dependencies at full resolution. This poses a challenge for accurate nonlinear deformable registration of complex organs such as the cerebral cortex. Secondly, existing methods primarily focus on matching features between two images, without fully exploring the relationships between coarse-to-fine registration steps.

Summary of the Invention

[0006] The purpose of this invention is to provide a medical image registration method based on a multilayer perceptron that solves the above-mentioned technical problems.

[0007] Therefore, the technical solution of the present invention is as follows:

[0008] A medical image registration method based on a multilayer perceptron is proposed, which achieves registration by inputting medical images into a trained deep learning-based single-modal and multi-modal medical image registration network. The network consists of a multi-scale feature extraction encoder and a correlation-aware registration decoder.

[0009] The multi-scale feature extraction encoder consists of a first encoding module, a second encoding module, a third encoding module, and a fourth encoding module connected in sequence; each encoding module consists of a first convolutional layer, a second convolutional layer, a Leaky ReLU activation function, an instance normalization module, and a max pooling layer connected in sequence.

[0010] The correlation-aware registration decoder consists of a first decoding module, a second decoding module, a third decoding module, and a fourth decoding module. The first decoding module consists of a UCA-MLP, an R-Head, a displacement field module, an upsampling module, and a transform module connected in sequence. The second and third decoding modules each consist of a first UCA-MLP, a second UCA-MLP, an R-Head, a summation module, a displacement field module, an upsampling module, and a transform module connected in sequence. The fourth decoding module consists of a first UCA-MLP, a second UCA-MLP, an R-Head, a summation module, and a displacement field module connected in sequence.

[0011] The outputs of the R-Head, upsampling module, and transform module of the first decoding module are connected to the inputs of the second UCA-MLP, summing module, and first UCA-MLP of the second decoding module, respectively; the outputs of the R-Head, upsampling module, and transform module of the second decoding module are connected to the inputs of the second UCA-MLP, summing module, and first UCA-MLP of the third decoding module, respectively; the outputs of the R-Head, upsampling module, and transform module of the third decoding module are connected to the inputs of the second UCA-MLP, summing module, and first UCA-MLP of the fourth decoding module, respectively.

[0012] The output of the first encoding module is connected to the transformation module of the third decoding module and the input of the first UCA-MLP of the fourth decoding module, respectively. The output of the second encoding module is connected to the transformation module of the second decoding module and the input of the first UCA-MLP of the third decoding module, respectively. The output of the third encoding module is connected to the transformation module of the first decoding module and the input of the first UCA-MLP of the second decoding module, respectively. The output of the fourth encoding module is connected to the input of the UCA-MLP of the first decoding module.

[0013] Furthermore, in the multi-scale feature extraction encoder, the kernel size of the first and second convolutional layers of each encoding module is set to 3×3×3; the parameter of the Leaky ReLU activation function is set to 0.2; and the window size of the max pooling layer is set to 2×2×2.

[0014] Furthermore, in the correlation-aware registration decoder, each UCA-MLP consists of a correlation layer, a layer concatenation module, a third convolutional layer, a layer normalization module, a first-layer perception module, a second-layer perception module, a third-layer perception module, a first summation module, a local cross-channel attention mechanism, and a second summation module. The correlation layer, layer concatenation module, third convolutional layer, and layer normalization module are connected sequentially. The output of the layer normalization module is connected to the inputs of the first-layer perception module, the second-layer perception module, the third-layer perception module, and the second summation module, respectively. The outputs of the first-layer perception module, the second-layer perception module, and the third-layer perception module are connected to the input of the first summation module, the output of the first summation module is connected to the input of the local cross-channel attention mechanism, and the output of the local cross-channel attention mechanism is connected to the input of the second summation module.

[0015] Furthermore, in each UCA-MLP, the input of the layer concatenation module is connected to the output of the correlation layer and to the output of the module that feeds the input of the correlation layer, respectively, to obtain a concatenated map formed by sequentially concatenating the fixed image, the moving image, and the correlation map obtained by processing the fixed image and the moving image; the convolution kernel of the third convolutional layer is set to 2×2×2, and its output is also connected to the input of the second summation module; the first-layer perception module consists of a 3×3 sliding window, a gMLP module, and a Region Merge module connected in sequence; the second-layer perception module consists of a 5×5 sliding window, a gMLP module, and a Region Merge module connected in sequence; the third-layer perception module consists of a 7×7 sliding window, a gMLP module, and a Region Merge module connected in sequence; in the local cross-channel attention mechanism, K is set to 3.

[0016] Furthermore, the specific implementation steps of this multilayer perceptron-based medical image registration method are as follows:

[0017] Step 1: Obtain a set of medical images for training, which consists of several medical images with the same organ or the same structure;

[0018] Step 2: Preprocess all medical images obtained in Step 1 to have the same image specifications, and divide them into training set, test set and validation set. The medical images in each image set are composed of image pairs. In each image pair, one image is used as a moving image and the other image is used as a fixed image.

[0019] Step 3: Construct a deep learning-based single-modal and multi-modal medical image registration network;

[0020] Step 4: Use the medical images obtained from Step 2 to train the deep learning-based single-modal and multi-modal medical image registration network constructed in Step 3;

[0021] Step 5: Input the medical image to be registered into the deep learning-based single-modal and multi-modal medical image registration network trained in Step 4, and output the registration result image.

[0022] Furthermore, in step 1, the medical images are CT images and / or MRI images.

[0023] Furthermore, in step 2, the preprocessing steps for the medical image are as follows: the medical image is sequentially subjected to affine transformation, resampling, and cropping to preprocess it into an image with the same specifications as the MNI-152 brain template.

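[0024] Furthermore, in step 4, the loss function is defined as the unsupervised loss function L, whose expression is:

[0025] $L = \lambda_1 L_{sim} + \lambda_2 L_{reg}$,

[0026] In the formula, $\lambda_1$ and $\lambda_2$ are regulating factors;

[0027] The similarity penalty loss is expressed as follows:

[0028] $L_{sim} = -\sum_{p \in \Omega} \frac{\left( \sum_{p_i} \left( I_f(p_i) - \bar{I}_f(p) \right) \left( I_m(p_i) - \bar{I}_m(p) \right) \right)^2}{\left( \sum_{p_i} \left( I_f(p_i) - \bar{I}_f(p) \right)^2 \right) \left( \sum_{p_i} \left( I_m(p_i) - \bar{I}_m(p) \right)^2 \right)}$,

[0029] In the formula, $I_f$ is the fixed image, $I_m$ is the moving image, $p \in \Omega$ indicates that $p$ belongs to the image domain $\Omega$, $I_f(p_i)$ is the intensity of the fixed image at a neighboring point $p_i$ within the neighborhood region around $p$, $\bar{I}_f(p)$ is the average intensity of the fixed image within the local area centered on $p$, $I_m(p_i)$ is the intensity of the moving image at a neighboring point $p_i$ within the neighborhood region around $p$, and $\bar{I}_m(p)$ is the average intensity of the moving image within the local area centered on $p$;

[0030] The regularization loss is expressed as follows:

[0031] $L_{reg} = \sum_{p \in \Omega} \left\| \nabla \phi(p) \right\|^2$,

[0032] In the formula, $\nabla$ represents the spatial gradient operator and $\phi$ is the displacement field.

[0033] Furthermore, in the unsupervised loss function L in step 4, $\lambda_1$ is set to 0.8 and $\lambda_2$ is set to 0.2.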
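The weighted loss described above can be sketched in NumPy as follows. This is a minimal illustration, not the patented implementation: it assumes the similarity term is a negated local normalized cross-correlation computed over a 3×3×3 neighborhood, that the regularizer is the mean squared forward difference of the displacement field, and that the weights 0.8 and 0.2 apply to the similarity and regularization terms in that order. The function names, the neighborhood size, and the `eps` stabilizer are hypothetical.

```python
import numpy as np


def local_mean(x, n=3):
    """Mean over an n*n*n neighborhood around every voxel (edge-padded)."""
    r = n // 2
    padded = np.pad(x, r, mode="edge")
    out = np.zeros(x.shape, dtype=float)
    for dz in range(n):
        for dy in range(n):
            for dx in range(n):
                out += padded[dz:dz + x.shape[0],
                              dy:dy + x.shape[1],
                              dx:dx + x.shape[2]]
    return out / n ** 3


def lncc_loss(fixed, warped, n=3, eps=1e-8):
    """Negated local normalized cross-correlation (assumed similarity term)."""
    f = fixed.astype(float)
    m = warped.astype(float)
    mu_f, mu_m = local_mean(f, n), local_mean(m, n)
    cross = local_mean(f * m, n) - mu_f * mu_m   # local covariance
    var_f = local_mean(f * f, n) - mu_f ** 2     # local variance, fixed image
    var_m = local_mean(m * m, n) - mu_m ** 2     # local variance, moving image
    cc = cross ** 2 / (var_f * var_m + eps)
    return -float(cc.mean())


def smoothness_loss(disp):
    """Mean squared forward differences of a (3, D, H, W) displacement field."""
    return float(sum(np.mean(np.diff(disp, axis=a) ** 2) for a in (1, 2, 3)))


def total_loss(fixed, warped, disp, w_sim=0.8, w_reg=0.2):
    """Weighted sum of the similarity and regularization terms."""
    return w_sim * lncc_loss(fixed, warped) + w_reg * smoothness_loss(disp)
```

With identical images and a zero displacement field, the similarity term approaches -1 and the regularizer is exactly 0, so the total approaches -0.8.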

[0034] Compared to existing technologies, this multilayer perceptron-based medical image registration method utilizes multi-scale feature maps and the correlations between multiple registration steps as complementary information, providing key contextual information for each registration step. Secondly, to capture the multi-scale dependencies of local and global features during registration, an unstructured correlation-aware multilayer perceptron is designed to improve feature matching accuracy without losing important details, addressing the problem that existing methods cannot capture fine-grained long-distance dependencies at full resolution. Simultaneously, the added correlation information between registration steps further enhances the model's robustness, ensuring that each registration step can utilize the results of the previous step. Finally, to address the issue of local deformation in traditional registration methods when processing large-scale or high-resolution images, UCA-MLP introduces an unstructured correlation-aware module that captures local deformations and multi-scale dependencies, and uses a specific partitioning strategy to acquire local information at different window sizes.

Description of the Drawings

[0035] Figure 1 is a flowchart of the medical image registration method based on a multilayer perceptron according to the present invention;

[0036] Figure 2 is an overall framework diagram of the unstructured correlation-aware network based on a multilayer perceptron according to the present invention;

[0037] Figure 3 is a schematic diagram of the structure of the unstructured correlation-aware multilayer perceptron module of the present invention;

[0038] Figure 4 is a comparison chart of registration results of different methods on the SR-Reg dataset in the embodiments of the present invention;

[0039] Figure 5 is a comparison chart of registration error results for different methods on six brain datasets in an embodiment of the present invention.

Detailed Implementation

[0040] The present invention will be further described below with reference to the accompanying drawings and specific embodiments, but the following embodiments are by no means intended to limit the present invention.

[0041] Referring to Figure 1, the specific implementation steps of this multilayer perceptron-based medical image registration method are described below.

[0042] Step 1: Obtain a set of medical images for training, which consists of several medical images, specifically CT and / or MRI images of the same organ or structure.

[0043] In practical applications, the medical images in step 1 can be acquired independently or existing medical image datasets can be directly applied. In this embodiment, the medical image set uses existing datasets, specifically including the SR-Reg dataset, ADNI dataset, ABIDE dataset, ADHD dataset, IXI dataset, Mindboggle dataset, and Buckner dataset; among them, the SR-Reg dataset is a dataset containing 3D MRI and CT images of the brain, while the other datasets are datasets containing only 3D MRI images of the brain.

[0044] Based on this, two image registration tasks are designed in this embodiment: one task is a multimodal medical image registration task of 3D brain MRI images and CT images of the same patient, and the other task is a single-modal medical image registration task of 3D brain MRI images and MRI images of the same patient.

[0045] Step 2: Preprocess all medical images obtained in Step 1 to have the same image specifications, and divide them into training set, test set and validation set;

[0046] The specific implementation steps for step 2 are as follows:

[0047] Step 2.1: Perform affine transformation, resampling, and cropping on the medical image in sequence to preprocess it into an image with the same specifications as the MNI-152 brain template.

[0048] In this embodiment, the SR-Reg dataset is used for multimodal medical image registration. The original size of the medical image selected from the SR-Reg dataset is 192×208×176 voxels with a resolution of 1×1×1 mm³. Based on this, the voxels are first resampled to 128×128×128 through affine transformation and resampling to align with the MNI-152 brain template, which has isotropic voxels with 1 mm³. Then, it is cropped to a size of 144×192×160 to keep the size consistent with the MNI-152 brain template.

[0049] The ADNI, ABIDE, ADHD, IXI, Mindboggle, and Buckner datasets were used for the unimodal medical image registration task. Medical images selected from these six datasets were aligned to the MNI-152 brain template by affine transformation and resampling as described above, with 1 mm³ isotropic voxels, and then cropped to the final size of 144 × 192 × 160.
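As a rough illustration of the cropping step, the sketch below computes the slices that cut a 144×192×160 window out of a larger volume. The source does not state where the crop window is placed, so centering it is an assumption, and `center_crop_slices` is a hypothetical helper name. The 192×208×176 input size is the original SR-Reg image size quoted above, used here only as an example.

```python
def center_crop_slices(shape, target):
    """Return one slice per axis that crops `shape` to `target`, centered.

    Assumes the crop window is centered in the volume; the source does not
    specify its placement.
    """
    slices = []
    for size, want in zip(shape, target):
        if want > size:
            raise ValueError(f"target {want} exceeds available size {size}")
        start = (size - want) // 2
        slices.append(slice(start, start + want))
    return tuple(slices)


# Example: crop a 192 x 208 x 176 volume to 144 x 192 x 160.
crop = center_crop_slices((192, 208, 176), (144, 192, 160))
```

The resulting tuple can index a NumPy-style array directly, for example `volume[crop]`.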

[0050] Step 2.2: Divide the preprocessed images into training set, validation set, and test set;

[0051] In this embodiment, for the multimodal medical image registration task, 180 cases are randomly selected, of which 150 cases are used for training, 10 cases for validation, and the remaining 20 cases are used for testing; specifically, each case consists of CT and MRI images of the same patient's 3D brain; for the unimodal medical image registration task, 2,656 brain MRI images are randomly selected from the ADNI, ABIDE, ADHD, and IXI datasets as the training set, and 100 pairs of images are randomly selected from the Mindboggle and Buckner datasets to form 200 test image pairs as the test set;
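The 150/10/20 split of the 180 multimodal cases can be sketched as below. This is only an illustration: the source says the cases are randomly selected but gives no seed or procedure, so the fixed seed and the helper name `split_cases` are assumptions.

```python
import random


def split_cases(case_ids, n_train=150, n_val=10, n_test=20, seed=0):
    """Shuffle case identifiers and split them into train/val/test lists."""
    ids = list(case_ids)
    if len(ids) != n_train + n_val + n_test:
        raise ValueError("number of cases does not match the requested split")
    random.Random(seed).shuffle(ids)  # seeded for reproducibility (assumption)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


# Example with 180 hypothetical case identifiers.
train, val, test = split_cases(range(180))
```

Splitting by case rather than by image keeps the CT and MRI volumes of one patient in the same subset.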

[0052] Step 2.3: Pair the preprocessed images in the two training sets mentioned above. In each pair of preprocessed images, one image is used as the moving image and the other as the fixed image. In this embodiment, the CT image in the multimodal medical image registration task is used as the moving image, and the MRI image is used as the fixed image.

[0053] Step 3: Construct a deep learning-based single-modal and multi-modal medical image registration network, UCANet, including a multi-scale feature extraction encoder and an unstructured coarse-to-fine association-aware registration decoder.

[0054] Referring to Figure 2 and Figure 3, the specific implementation steps of step 3 are as follows:

[0055] Step 3.1: Construct a multi-scale feature extraction encoder;

[0056] Specifically, the multi-scale feature extraction encoder consists of a first encoding module S1, a second encoding module S2, a third encoding module S3, and a fourth encoding module S4 connected in sequence. Each encoding module consists of a first convolutional layer, a second convolutional layer, a Leaky ReLU activation function, an instance normalization module, and a max pooling layer connected in sequence. In this embodiment, the kernel size of the first and second convolutional layers of each encoding module is 3×3×3, and the stride is 1. The parameter of the Leaky ReLU activation function is set to 0.2. The window size of the max pooling layer is set to 2×2×2.
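To make the four scales concrete, the sketch below tracks the spatial size of the feature map after each encoding module. It assumes the 3×3×3 convolutions use "same" padding (the source gives kernel size and stride but not padding), so only the 2×2×2 max pooling changes the size. The 144×192×160 input is the cropped image size from the preprocessing step; `encoder_shapes` is a hypothetical helper.

```python
def encoder_shapes(input_shape, num_modules=4, pool=2):
    """Spatial size of the output of each encoding module.

    Assumes stride-1 convolutions with "same" padding, so each module only
    shrinks the volume through its max pooling layer.
    """
    shapes = []
    shape = tuple(input_shape)
    for _ in range(num_modules):
        shape = tuple(size // pool for size in shape)  # 2x2x2 max pooling
        shapes.append(shape)
    return shapes


# Example: the four scales for a 144 x 192 x 160 input volume.
scales = encoder_shapes((144, 192, 160))
```

Under these assumptions the coarsest feature map is 9×12×10, which is the resolution at which the first decoding module operates.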

[0057] The processing steps of this multi-scale feature extraction encoder are as follows: the paired moving image and fixed image are input into the multi-scale feature extraction encoder and pass through the four encoding modules in sequence, generating four multi-scale features for each of the moving image and the fixed image. As shown in Figure 2, $F_m^1$ and $F_f^1$ are the first-scale features of the moving image and the fixed image generated by the first encoding module, respectively; $F_m^2$ and $F_f^2$ are the second-scale features of the moving image and the fixed image generated by the second encoding module, respectively; $F_m^3$ and $F_f^3$ are the third-scale features of the moving image and the fixed image generated by the third encoding module, respectively; and $F_m^4$ and $F_f^4$ are the fourth-scale features of the moving image and the fixed image generated by the fourth encoding module, respectively.

[0058] Step 3.2: Construct an association-aware registration decoder:

[0059] This correlation-aware registration decoder employs unstructured correlation-aware multilayer perceptrons (UCA-MLP), thereby achieving coarse-to-fine registration. Specifically, the correlation-aware registration decoder consists of a first decoding module, a second decoding module, a third decoding module, and a fourth decoding module.

[0060] The first decoding module consists of a UCA-MLP, an R-Head, a displacement field module, an upsampling module, and a transformation module connected in sequence. The second and third decoding modules each consist of a first UCA-MLP, a second UCA-MLP, an R-Head, a summation module, a displacement field module, an upsampling module, and a transformation module connected in sequence. The fourth decoding module consists of a first UCA-MLP, a second UCA-MLP, an R-Head, a summation module, and a displacement field module connected in sequence. The R-Head output of the first decoding module is also connected to the input of the second UCA-MLP of the second decoding module, the output of its upsampling module is also connected to the input of the summation module of the second decoding module, and the output of its transformation module is connected to the input of the first UCA-MLP of the second decoding module; the R-Head output of the second decoding module is also connected to the input of the second UCA-MLP of the third decoding module, the output of its upsampling module is also connected to the input of the summation module of the third decoding module, and the output of its transformation module is connected to the input of the first UCA-MLP of the third decoding module; the R-Head output of the third decoding module is also connected to the input of the second UCA-MLP of the fourth decoding module, the output of its upsampling module is also connected to the input of the summation module of the fourth decoding module, and the output of its transformation module is connected to the input of the first UCA-MLP of the fourth decoding module.
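The way each decoding module reuses the previous module's result through its upsampling and summation modules can be sketched as a coarse-to-fine accumulation. This is a simplified illustration under stated assumptions: the per-level outputs are treated as residual displacement fields, upsampling is nearest-neighbor by a factor of 2, and displacement values are doubled when the grid is doubled (which presumes displacements are expressed in voxel units). None of these details is specified in the source, and the function names are hypothetical.

```python
import numpy as np


def upsample_displacement(disp, factor=2):
    """Nearest-neighbor upsampling of a (3, D, H, W) displacement field.

    Values are multiplied by `factor` on the assumption that displacements
    are measured in voxels of the current grid.
    """
    up = disp
    for axis in (1, 2, 3):
        up = np.repeat(up, factor, axis=axis)
    return up * factor


def coarse_to_fine(residuals):
    """Accumulate residual fields ordered from coarsest to finest scale."""
    field = residuals[0]
    for residual in residuals[1:]:
        # Summation module: upsampled previous field plus the new residual.
        field = upsample_displacement(field) + residual
    return field
```

With four levels that each predict a residual of one voxel, the accumulated field is 1, then 3, then 7, then 15 voxels, which shows how a coarse estimate is refined rather than recomputed at each step.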

[0061] In each decoding module, each UCA-MLP has the same structural composition. Specifically, each UCA-MLP consists of a Correlation Layer, a layer concatenation module, a third convolutional layer, a Layer Normalization module, a first-layer perception module, a second-layer perception module, a third-layer perception module, a first summation module, a local cross-channel attention mechanism, and a second summation module. The Correlation Layer, layer concatenation module, third convolutional layer, and layer normalization module are connected sequentially. The output of the layer normalization module is connected to the inputs of the first-layer perception module, the second-layer perception module, the third-layer perception module, and the second summation module, respectively. The outputs of the first-layer perception module, the second-layer perception module, and the third-layer perception module are connected to the input of the first summation module, the output of the first summation module is connected to the input of the local cross-channel attention mechanism, and the output of the local cross-channel attention mechanism is connected to the input of the second summation module. The input of the layer concatenation module is connected to the output of the correlation layer and to the output of the module that feeds the input of the correlation layer, to obtain a concatenated map formed by sequentially concatenating the fixed image, the moving image, and the correlation map obtained by processing the fixed and moving images. The third convolutional layer has a 2×2×2 kernel with a stride of 1, and its output is also connected to the input of the second summation module. The first-layer perception module consists of a 3×3 sliding window, a gMLP module, and a Region Merge module connected in sequence. The second-layer perception module consists of a 5×5 sliding window, a gMLP module, and a Region Merge module connected in sequence. The third-layer perception module consists of a 7×7 sliding window, a gMLP module, and a Region Merge module connected in sequence. In the local cross-channel attention mechanism, K is set to 3.

[0062] In each decoding module, the displacement field module contains a displacement field function that parameterizes the deformable registration problem as a displacement field; the displacement field function is $\phi = f_\theta(I_f, I_m)$, where $\phi$ is the displacement field and $\theta$ denotes the learnable parameters, which are learned through unsupervised learning during network training.
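How a displacement field is applied to an image (the role of the transformation module) can be sketched with a nearest-neighbor warp. This is a deliberately simplified stand-in: registration networks commonly use trilinear interpolation, and the source does not describe the interpolation used, so nearest-neighbor sampling, border clamping, and the name `warp_nearest` are all assumptions.

```python
import numpy as np


def warp_nearest(moving, disp):
    """Warp a (D, H, W) volume with a (3, D, H, W) displacement field.

    Each output voxel p takes the value of moving at round(p + disp(p)),
    with coordinates clamped to the volume borders.
    """
    d, h, w = moving.shape
    grid = np.stack(np.meshgrid(np.arange(d), np.arange(h), np.arange(w),
                                indexing="ij"))
    coords = np.rint(grid + disp).astype(int)
    for axis, size in enumerate((d, h, w)):
        coords[axis] = np.clip(coords[axis], 0, size - 1)
    return moving[coords[0], coords[1], coords[2]]
```

A zero field returns the input unchanged, and a constant field of +1 along the first axis shifts every slice by one position, repeating the last slice at the border.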

[0063] Referring to Figure 3, each UCA-MLP takes a fixed-image feature map $F_A$ and a moving-image feature map $F_B$ as input. The specific processing steps are as follows: first, $F_A$ and $F_B$ are input into the correlation layer, which calculates the local correlation between the two feature maps in the convolutional feature space, emphasizing local details in the deep feature representation. The formula for calculating the correlation between the two is:

[0064] $\mathrm{Corr}(P_i^A, P_j^B) = \sum_{o \in [-k,k]^3} \left\langle F_A(i + o), F_B(j + o) \right\rangle$,

[0065] In the formula, $P_i^A$ and $P_j^B$ represent the central voxels sampled from $F_A$ and $F_B$; $i$ and $j$ represent the center coordinates of these central voxels, respectively; $o \in [-k,k]^3$ indicates that $o$ iterates over the three-dimensional neighborhood $[-k,k] \times [-k,k] \times [-k,k]$ centered on $P_i^A$ and $P_j^B$, respectively; in this embodiment, k is set to 1;

[0066] Furthermore, sampling a set of points within a 3D region of size d×d×d can be achieved through 3D convolution; in this embodiment, d = 3 is set to calculate the local correlation within the 3D neighborhood, thereby outputting the correlation feature map $F_C$, whose shape is the same as that of the feature maps $F_A$ and $F_B$ and whose number of channels is d³ = 27;
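A simplified correlation layer with d = 3 can be sketched as follows. The sketch assumes one output channel per offset in the 3×3×3 search window, each holding the channel-wise dot product between a fixed-feature voxel and the moving-feature voxel at that offset, with zero padding at the borders. This is one common way to build a 27-channel correlation volume and may differ from the exact formula in the source; `local_correlation` is a hypothetical name.

```python
from itertools import product

import numpy as np


def local_correlation(feat_a, feat_b, k=1):
    """Correlation volume between two (C, D, H, W) feature maps.

    Returns an array of shape ((2k+1)**3, D, H, W): one channel per offset
    in [-k, k]^3, holding the dot product over feature channels.
    """
    _, d, h, w = feat_a.shape
    padded = np.pad(feat_b, ((0, 0), (k, k), (k, k), (k, k)))  # zero padding
    maps = []
    for dz, dy, dx in product(range(2 * k + 1), repeat=3):
        shifted = padded[:, dz:dz + d, dy:dy + h, dx:dx + w]
        maps.append((feat_a * shifted).sum(axis=0))
    return np.stack(maps)
```

With k = 1 the output has 27 channels and the same spatial size as the inputs, and the middle channel (index 13) corresponds to zero offset.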

[0067] Subsequently, $F_A$, $F_B$, and $F_C$ are concatenated along the channel dimension and then processed by the third convolutional layer (3×3×3) to generate the correlation-aware feature map $F_{corr}$. Then, after LayerNorm processing, $F_{corr}$ is input into multilayer perceptrons with different window sizes to capture multi-scale dependencies. Specifically, in this embodiment, sliding windows of sizes 3×3, 5×5, and 7×7 are used to divide the feature map into non-overlapping regions, denoted as $F_{RS}$. This helps capture fine-grained local information and enhances the model's sensitivity to features at different scales, enabling local interaction operations to be performed on the feature maps obtained through region partitioning; the gMLP module is used to process the output of the unstructured layers, and the Region Merge module is used for region integration, thereby fusing features from different regions through channel-weighted summation;
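The region partitioning and merging around the gMLP module can be sketched as a reshape round trip. Several assumptions are made here: windows are taken as cubic (w×w×w) although the source writes them as 3×3, 5×5, and 7×7; the volume is zero-padded up to a multiple of the window size; and merging is shown only as the inverse of partitioning, leaving out the gMLP processing and the channel-weighted fusion. The function names are hypothetical.

```python
import numpy as np


def region_partition(feat, w):
    """Split a (C, D, H, W) map into non-overlapping w*w*w regions.

    Returns tokens of shape (num_regions, w**3, C) and the padded size.
    """
    c = feat.shape[0]
    pads = [(0, 0)] + [(0, (-size) % w) for size in feat.shape[1:]]
    x = np.pad(feat, pads)                      # zero-pad to a multiple of w
    _, d, h, wd = x.shape
    x = x.reshape(c, d // w, w, h // w, w, wd // w, w)
    x = x.transpose(1, 3, 5, 2, 4, 6, 0)        # regions first, channels last
    return x.reshape(-1, w ** 3, c), (d, h, wd)


def region_merge(regions, padded_shape, out_shape, w):
    """Inverse of region_partition, cropping the padding away."""
    d, h, wd = padded_shape
    c = regions.shape[-1]
    x = regions.reshape(d // w, h // w, wd // w, w, w, w, c)
    x = x.transpose(6, 0, 3, 1, 4, 2, 5).reshape(c, d, h, wd)
    return x[:, :out_shape[0], :out_shape[1], :out_shape[2]]
```

Each region becomes a short token sequence, so a per-region MLP only mixes voxels inside one window; running the same map through window sizes 3, 5, and 7 gives three branches with different local extents.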

[0068] Next, the fused features from the different regions interact through a local cross-channel attention mechanism, which establishes dependencies between channels and thereby extracts more representative features. Specifically, a one-dimensional convolution of kernel size K is performed along the channel dimension to create dependencies between channels. The Sigmoid activation function is then applied to generate the local channel attention weights, expressed as:

[0069] $$\omega = \sigma\big(\mathrm{C1D}_K(y)\big),$$

[0070] In the formula, $\mathrm{C1D}_K$ denotes a one-dimensional convolution with kernel size K, $\sigma$ denotes the Sigmoid function, and $y$ denotes the feature map obtained after region merging. The final output is obtained by element-wise multiplying the attention weights $\omega$ with the original input feature map, yielding an attention-weighted feature map. As the number of channels increases, the range of local cross-channel interaction should expand accordingly. This non-linear relationship is expressed as:

[0071] $$C = 2^{\gamma K - b},$$

[0072] In the formula, K is the kernel size of the one-dimensional convolution, and $\gamma$ and $b$ are the scaling and offset parameters of the linear transformation;

[0073] Given the total number of channels C, the kernel size K can be adaptively determined as follows:

[0074] $$K = \left| \frac{\log_2 C}{\gamma} + \frac{b}{\gamma} \right|_{\mathrm{odd}},$$

[0075] In this embodiment, $|t|_{\mathrm{odd}}$ denotes rounding $t$ to the nearest odd number, $\gamma$ is set to 2, and $b$ is set to 1, which gives K = 3;
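The adaptive kernel size and the channel attention weights can be sketched as below. This is a hedged illustration only: the function names are hypothetical, the per-channel descriptor is obtained by global average pooling (an assumption borrowed from common channel-attention designs, not stated in this section), and the convolution weights are passed in rather than learned.

```python
import math
import numpy as np

def adaptive_kernel_size(channels, gamma=2, b=1):
    """K = |log2(C)/gamma + b/gamma|_odd, truncating and then bumping
    even values up to the next odd number."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 == 1 else t + 1

def local_channel_attention(x, kernel):
    """x: (C, D, H, W); kernel: 1D weights of odd length K.

    Returns the re-weighted feature map and the attention weights.
    """
    kernel = np.asarray(kernel, dtype=float)
    # Assumed descriptor: global average over the spatial dimensions.
    y = x.mean(axis=(1, 2, 3))
    pad = len(kernel) // 2
    y_pad = np.pad(y, pad)
    # 1D convolution across neighboring channels (zero-padded at the ends).
    z = np.array([np.dot(y_pad[i:i + len(kernel)], kernel)
                  for i in range(len(y))])
    weights = 1.0 / (1.0 + np.exp(-z))      # Sigmoid
    return x * weights[:, None, None, None], weights

k = adaptive_kernel_size(16)                # gamma = 2, b = 1
feat = np.ones((16, 2, 2, 2))
out, w = local_channel_attention(feat, np.zeros(k))
print(k, w.shape)
```

With $\gamma$ = 2 and $b$ = 1, any channel count from 8 to 127 yields K = 3 under this rounding rule, which is consistent with the value stated in the embodiment.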

[0076] Furthermore, the UCA-MLP modules in the decoding modules explore the potential correspondences between their two inputs through region partitioning and unstructured deformation layers, so as to specifically capture the dependencies relevant to deformable medical image registration.

[0077] Step 3.3: Construct the deep learning-based network model UCANet for single-modal and multi-modal medical image registration. The model consists of the multi-scale feature extraction encoder constructed in Step 3.1 and the association-aware registration decoder constructed in Step 3.2. The output of the first encoding module is connected to the transform module of the third decoding module and to the input of the first UCA-MLP of the fourth decoding module, so that the first-scale feature $F_m^1$ of the moving image is input to the transform module of the third decoding module, while the first-scale feature $F_f^1$ of the fixed image is input to the first UCA-MLP of the fourth decoding module. The output of the second encoding module is connected to the transform module of the second decoding module and to the input of the first UCA-MLP of the third decoding module, so that the second-scale feature $F_m^2$ of the moving image is input to the transform module of the second decoding module, while the second-scale feature $F_f^2$ of the fixed image is input to the first UCA-MLP of the third decoding module. The output of the third encoding module is connected to the transform module of the first decoding module and to the input of the first UCA-MLP of the second decoding module, so that the third-scale feature $F_m^3$ of the moving image is input to the transform module of the first decoding module, while the third-scale feature $F_f^3$ of the fixed image is input to the first UCA-MLP of the second decoding module. The output of the fourth encoding module is connected to the input of the UCA-MLP of the first decoding module, so that the fourth-scale feature $F_m^4$ of the moving image and the fourth-scale feature $F_f^4$ of the fixed image are both input to the UCA-MLP of the first decoding module.

[0078] In practical applications, the specific processing steps of the deep learning-based single-modal and multi-modal medical image registration network model UCANet are as follows:

[0079] The features $F_m^4$ and $F_f^4$ output by the fourth encoding module are fed into the UCA-MLP of the first decoding module, which explores their spatial correspondences to obtain relational features; these are then input into the deformable registration head (R-Head). The R-Head output is input into the displacement field module of the first decoding module to map out the displacement field $\psi_4$. The displacement field $\psi_4$, after upsampling, is input into the transform module together with $F_m^3$ output by the third encoding module, so that $F_m^3$ is warped by $\psi_4$, formally written as $F_m^3 \circ \psi_4$, which is used to guide the second decoding module.

[0080] Then, $F_f^3$ output by the third encoding module and $F_m^3 \circ \psi_4$ output by the first decoding module are fed into the first UCA-MLP of the second decoding module. The relational features obtained there are input, together with the R-Head output of the first decoding module, into the second UCA-MLP of the second decoding module, whose output is fed into the R-Head of the second decoding module. The R-Head output and the upsampled displacement field $\psi_4$ from the first decoding module are input into the summation module, and the sum is fed into the displacement field module of the second decoding module to map out the displacement field $\psi_3$. The displacement field $\psi_3$, after upsampling, is input into the transform module together with $F_m^2$ output by the second encoding module, and the resulting $F_m^2 \circ \psi_3$ is used to guide the third decoding module;

[0081] Next, $F_f^2$ output by the second encoding module and $F_m^2 \circ \psi_3$ output by the second decoding module are fed into the first UCA-MLP of the third decoding module. The relational features obtained there are input, together with the R-Head output of the second decoding module, into the second UCA-MLP of the third decoding module, whose output is fed into the R-Head of the third decoding module. The R-Head output and the upsampled displacement field $\psi_3$ from the second decoding module are input into the summation module, and the sum is fed into the displacement field module of the third decoding module to map out the displacement field $\psi_2$. The displacement field $\psi_2$, after upsampling, is input into the transform module together with $F_m^1$ output by the first encoding module, and the resulting $F_m^1 \circ \psi_2$ is used to guide the fourth decoding module;

[0082] Finally, $F_f^1$ output by the first encoding module and $F_m^1 \circ \psi_2$ output by the third decoding module are fed into the first UCA-MLP of the fourth decoding module. The relational features obtained there are input, together with the R-Head output of the third decoding module, into the second UCA-MLP of the fourth decoding module, whose output is fed into the R-Head of the fourth decoding module. The R-Head output and the upsampled displacement field $\psi_2$ from the third decoding module are input into the summation module, and the sum is fed into the displacement field module of the fourth decoding module to map out the displacement field $\psi_1$. The displacement field $\psi_1$ aligns the moving image $I_m$ with the fixed image $I_f$, and the output is the final registration result.
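The coarse-to-fine data flow of the four decoding modules can be sketched as a short loop. This is only a schematic under loud assumptions: `first_uca_mlp`, `second_uca_mlp`, and `r_head` are hypothetical stand-ins with no learned weights, the R-Head and displacement field module are merged into one step, upsampling and warping use nearest-neighbor interpolation, each scale is assumed to be half the resolution of the previous one, and the previous R-Head output is upsampled so that shapes match.

```python
import numpy as np

def upsample2(field):
    """Nearest-neighbor x2 upsampling of a (3, D, H, W) field; values are
    doubled so displacements stay in voxel units of the finer grid."""
    for axis in (1, 2, 3):
        field = np.repeat(field, 2, axis=axis)
    return 2.0 * field

def warp(feat, disp):
    """Transform module: nearest-neighbor warp of (C, D, H, W) features."""
    grid = np.stack(np.meshgrid(*[np.arange(s) for s in feat.shape[1:]],
                                indexing="ij"))
    idx = np.rint(grid + disp).astype(int)
    for axis, size in enumerate(feat.shape[1:]):
        idx[axis] = np.clip(idx[axis], 0, size - 1)
    return feat[:, idx[0], idx[1], idx[2]]

def first_uca_mlp(fixed_feat, moving_feat):
    """Hypothetical stand-in: relational features of the two inputs."""
    return fixed_feat - moving_feat

def second_uca_mlp(rel, prev_head):
    """Hypothetical stand-in: also sees the previous R-Head output."""
    return rel + prev_head.mean(axis=0, keepdims=True)

def r_head(rel):
    """Hypothetical stand-in: features -> 3-channel residual field (zeros)."""
    return np.zeros((3,) + rel.shape[1:])

def coarse_to_fine(fixed_feats, moving_feats):
    """Feature lists are ordered [scale 1 (finest), ..., scale 4 (coarsest)]."""
    # Decoding module 1: coarsest scale, single UCA-MLP, no previous field.
    head = r_head(first_uca_mlp(fixed_feats[3], moving_feats[3]))
    psi = head                                       # psi_4
    for level in (2, 1, 0):                          # decoding modules 2, 3, 4
        psi_up = upsample2(psi)
        warped = warp(moving_feats[level], psi_up)   # F_m composed with psi
        rel = first_uca_mlp(fixed_feats[level], warped)
        rel = second_uca_mlp(rel, upsample2(head))
        head = r_head(rel)
        psi = head + psi_up                          # summation module
    return psi                                       # psi_1

rng = np.random.default_rng(0)
feats_f = [rng.random((2, s, s, s)) for s in (16, 8, 4, 2)]
feats_m = [rng.random((2, s, s, s)) for s in (16, 8, 4, 2)]
psi_1 = coarse_to_fine(feats_f, feats_m)
print(psi_1.shape)
```

The point of the sketch is the wiring: each level warps the moving features with the upsampled field from the coarser level, predicts a residual, and adds it to that upsampled field.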

[0083] Step 4: Use the medical images obtained from Step 2 to train the deep learning-based single-modal and multi-modal medical image registration network constructed in Step 3.

[0084] The specific implementation steps for step 4 are as follows:

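[0085] Step 4.1: Define the loss function as an unsupervised loss function L, which consists of two parts: a similarity penalty loss $L_{sim}$ and a regularization loss $L_{reg}$; among which,

[0086] The similarity penalty loss $L_{sim}$ measures the similarity between the deformed image $I_m \circ \psi$ and the fixed image $I_f$, and is expressed as:

[0087] $$L_{sim}(I_f, I_m, \psi) = -\sum_{p \in \Omega} \frac{\left( \sum_{p_i} \big(I_f(p_i) - \bar{I}_f(p)\big)\big([I_m \circ \psi](p_i) - [\bar{I}_m \circ \psi](p)\big) \right)^2}{\left( \sum_{p_i} \big(I_f(p_i) - \bar{I}_f(p)\big)^2 \right)\left( \sum_{p_i} \big([I_m \circ \psi](p_i) - [\bar{I}_m \circ \psi](p)\big)^2 \right)},$$

[0088] In the formula, $I_f$ is the fixed image, $I_m$ is the moving image, and $p \in \Omega$ indicates that voxel $p$ belongs to the image domain $\Omega$; $I_f(p_i)$ is the intensity of the fixed image at a neighboring point $p_i$ within the $n^3$ neighborhood region around $p$, and $\bar{I}_f(p)$ is the average intensity of the fixed image within the local area centered on $p$; $[I_m \circ \psi](p_i)$ is the intensity of the deformed moving image at a neighboring point $p_i$ within the $n^3$ neighborhood region around $p$, and $[\bar{I}_m \circ \psi](p)$ is the average intensity of the deformed moving image within the local area centered on $p$;

[0089] The regularization loss $L_{reg}$ applies regularization to the deformable registration transformation $\psi$ to promote smooth and physically realistic transformations, and is expressed as:

[0090] $$L_{reg}(\psi) = \sum_{p \in \Omega} \lVert \nabla \psi(p) \rVert^2,$$

[0091] In the formula, $\nabla$ represents the spatial gradient operator;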

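[0092] Furthermore, the unsupervised loss function L is the weighted sum of the similarity penalty loss $L_{sim}$ and the regularization loss $L_{reg}$, expressed as:

[0093] $$L = \alpha L_{sim} + \beta L_{reg},$$

[0094] In the formula, $\alpha$ and $\beta$ are adjustment factors used to balance registration accuracy and smoothness; in this embodiment, $\alpha$ = 0.8 and $\beta$ = 0.2;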
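A minimal NumPy sketch of a loss of this shape is given below, assuming the similarity term is a local normalized cross-correlation and the regularizer is the squared spatial gradient of the displacement field. The function names, the window size n = 3, the small constant `eps`, the use of means instead of sums (which only rescales the terms), and the forward-difference gradient are all illustrative assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def lncc_loss(fixed, warped, n=3, eps=1e-5):
    """Negative local normalized cross-correlation over n^3 windows."""
    fw = sliding_window_view(fixed, (n, n, n))
    mw = sliding_window_view(warped, (n, n, n))
    axes = (-3, -2, -1)
    fd = fw - fw.mean(axis=axes, keepdims=True)   # deviation from local mean
    md = mw - mw.mean(axis=axes, keepdims=True)
    cross = (fd * md).sum(axis=axes)
    var_f = (fd ** 2).sum(axis=axes)
    var_m = (md ** 2).sum(axis=axes)
    cc = cross ** 2 / (var_f * var_m + eps)
    return -cc.mean()

def smoothness_loss(disp):
    """Mean squared forward-difference gradient of a (3, D, H, W) field."""
    total = 0.0
    for axis in (1, 2, 3):
        total += (np.diff(disp, axis=axis) ** 2).mean()
    return total / 3.0

def total_loss(fixed, warped, disp, alpha=0.8, beta=0.2):
    """Weighted sum with the embodiment's factors 0.8 and 0.2."""
    return alpha * lncc_loss(fixed, warped) + beta * smoothness_loss(disp)

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8, 8))
zero_field = np.zeros((3, 8, 8, 8))
print(total_loss(image, image, zero_field))
```

For a perfectly aligned pair and a zero displacement field, the similarity term is close to its minimum of -1 and the regularizer is exactly 0, so the total is close to -0.8.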

[0095] Step 4.2: During the training process, the medical image registration network model UCANet is trained and optimized using an unsupervised loss function L until the unsupervised loss function L reaches its minimum value and remains stable.

[0096] Step 5: Take the two medical images to be registered, one as the moving image and the other as the fixed image, input them into the deep learning-based single-modal and multi-modal medical image registration network UCANet trained in Step 4, and output the registration result of the two images.

[0097] To further demonstrate the accuracy and usability of the multilayer perceptron-based medical image registration method of the present invention in terms of registration effect, the Dice similarity coefficient (DSC) and normalized Jacobian determinant (NJD) were used as evaluation indicators.

[0098] DSC compares the similarity between the automatically registered segmented regions and the ground-truth segmented regions, while NJD measures the invertibility and smoothness of local image deformation by calculating the Jacobian determinant of the displacement field and evaluates whether the registered displacement field preserves the topological structure. The 95% Hausdorff distance (HD95) is used to evaluate the accuracy of registration, and the differential properties of the displacement field are evaluated by the percentage of non-positive Jacobian determinants (P|J|≤0). Mem (GB) denotes the amount of memory consumed during training and testing, Par (M) denotes the number of model parameters in millions, and Time denotes the time consumed by model execution.
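Two of these metrics are simple enough to sketch directly. The code below is an illustrative assumption of how DSC and the percentage of non-positive Jacobian determinants are commonly computed, not the evaluation code used for the reported tables; the function names are hypothetical, and the Jacobian is taken for the transformation x + u(x) using central differences from `np.gradient`.

```python
import numpy as np

def dice(seg_a, seg_b, label):
    """Dice similarity coefficient for one label in two segmentations."""
    a, b = seg_a == label, seg_b == label
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def nonpositive_jacobian_percent(disp):
    """Percentage of voxels whose Jacobian determinant is <= 0.

    disp: displacement field u of shape (3, D, H, W); the transformation
    is x + u(x), so its Jacobian is I + grad(u).
    """
    grads = [np.gradient(disp[i]) for i in range(3)]   # grads[i][j] = du_i/dx_j
    jac = np.empty(disp.shape[1:] + (3, 3))
    for i in range(3):
        for j in range(3):
            jac[..., i, j] = grads[i][j] + (1.0 if i == j else 0.0)
    det = np.linalg.det(jac)
    return 100.0 * np.mean(det <= 0)

seg = np.zeros((4, 4, 4), dtype=int)
seg[:2] = 1
shifted = np.zeros_like(seg)
shifted[1:3] = 1
identity_field = np.zeros((3, 4, 4, 4))
print(dice(seg, shifted, 1), nonpositive_jacobian_percent(identity_field))
```

A zero displacement field has determinant 1 everywhere, so no voxel is counted as folded; a field that reverses orientation along one axis is counted at every voxel.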

[0099] The performance comparison results of different registration methods on the SR-Reg dataset are shown in Table 1 below.

[0100] Table 1:

[0101]

[0102] As can be seen from the performance comparison in Table 1, the proposed UCANet achieves the best performance among the compared methods on all three of the Dice, HD95, and P|J|≤0 metrics. Specifically, UCANet achieves the highest Dice score of 83.63%, indicating superior overlap between its predictions and the actual segmentation results. Furthermore, UCANet exhibits the lowest HD95 value (1.33), reflecting its greater accuracy in boundary prediction compared to VoxelMorph, TransMorph, and VMambaMorph.

[0103] Furthermore, although UCANet has slightly higher memory usage and model parameters (23.72 million), its significant performance improvement justifies the additional computational cost. UCANet's runtime (0.49 seconds) is longer than other methods, but this trade-off is reasonable considering the significant improvements in accuracy and precision.

[0104] Figure 5 shows a visual comparison of the registration results of the present application and two of the other methods in Table 1 on the SR-Reg dataset; in this comparison, all three methods use the CT images of the SR-Reg dataset as fixed images and the MRI images as moving images. The visual comparison in Figure 5 shows that, in the regions specially marked in red and green, the UCANet result is significantly closer to the real situation, while the other methods suffer from over-registration or under-registration.

[0105] The performance comparison results of different methods on six brain datasets are shown in Table 2 below.

[0106] Table 2:

[0107]

[0108] As can be seen from the performance comparison in Table 2, on the Mindboggle dataset, UCANet achieves a DSC of 0.645, the highest among all methods, indicating high registration accuracy. Its NJD percentage is 1.915%, showing minimal topological distortion in the deformation field. On the Buckner dataset, UCANet continues to outperform other methods, with a DSC of 0.659 and an NJD of 1.825%. On a GPU, its average registration run time is 0.51 seconds, close to LapIRN (0.52 seconds) and Dual-PRNet++ (0.49 seconds), but slower than TransMorph (0.32 seconds). This compromise in runtime is reasonable given the improvements in registration accuracy and smoothness, as evidenced by the higher DSC and lower NJD. In conclusion, the above comparison results demonstrate the effectiveness of the proposed UCANet in addressing the challenges of deformable medical image registration.

[0109] Figure 6 compares the registration error of different methods on the six brain datasets in the single-modal brain image registration experiment. The more black areas there are, the greater the error. The UCANet proposed in this invention significantly reduces the error and shows excellent accuracy and stability.

[0110] Although the present invention has been described above with reference to embodiments, various modifications can be made and components can be replaced with equivalents without departing from the scope of the invention. In particular, as long as there is no structural conflict, the features in the disclosed embodiments can be combined with each other in any manner. The lack of an exhaustive description of these combinations in this specification is merely for the sake of brevity and resource conservation. Therefore, the present invention is not limited to the specific embodiments disclosed herein, but includes all technical solutions falling within the scope of the claims.

[0111] References:

[0112] [1] Balakrishnan, G.; Zhao, A.; Sabuncu, M.R.; Guttag, J.; Dalca, A.V. Voxelmorph: a learning framework for deformable medical image registration. IEEE Transactions on Medical Imaging 2019, 38, 1788–1800.

[0113] [2] Chen, J.; Frey, E.C.; He, Y.; Segars, W.P.; Li, Y.; Du, Y. Transmorph: Transformer for unsupervised medical image registration. Medical Image Analysis 2022, 82, 102615.

[0114] [3] Guo, T.; Wang, Y.; Meng, C. Mambamorph: a mamba-based backbone with contrastive feature learning for deformable mr-ct registration. arXiv preprint arXiv:2401.13934 2024.

[0115] [4] Wang, Z.; Zheng, J.Q.; Ma, C.; Guo, T. Vmambamorph: a visual mamba-based framework with cross-scan module for deformable 3d image registration. arXiv preprint arXiv:2404.05105 2024.

[0116] [5] Zhu, Y.; Lu, S. Swin-voxelmorph: A symmetric unsupervised learning model for deformable medical image registration using swin transformer. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2022, pp. 78–87.

[0117] [6] Mok, T.C.; Chung, A.C. Large deformation diffeomorphic image registration with Laplacian pyramid networks. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part III 23. Springer, 2020, pp. 211–221.

[0118] [7] Shu, Y.; Wang, H.; Xiao, B.; Bi, X.; Li, W. Medical image registration based on uncoupled learning and accumulative enhancement. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2021, pp. 3–13.

[0119] [8] Kang, M.; Hu, X.; Huang, W.; Scott, M.R.; Reyes, M. Dual-stream pyramid registration network. Medical Image Analysis 2022, 78, 102379.

[0120] [9] Zhou, S.; Hu, B.; Xiong, Z.; Wu, F. Self-distilled hierarchical network for unsupervised deformable image registration. IEEE Transactions on Medical Imaging 2023, 42, 2162–2175.

[0121] [10] Meng, M.; Bi, L.; Feng, D.; Kim, J. Non-iterative coarse-to-fine registration based on single-pass deep cumulative learning. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2022, pp. 88–97.

[0122] [11] Meng, M.; Bi, L.; Fulham, M.; Feng, D.; Kim, J. Non-iterative coarse-to-fine transformer networks for joint affine and deformable image registration. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2023, pp. 750–760.

Claims

1. A medical image registration method based on a multilayer perceptron, characterized in that it is implemented through a deep learning-based single-modal and multi-modal medical image registration network; the encoder consists of a first encoding module, a second encoding module, a third encoding module, and a fourth encoding module connected in sequence; each encoding module consists of a first convolutional layer, a second convolutional layer, a Leaky ReLU activation function, an instance normalization module, and a max pooling layer connected in sequence; the decoder consists of a first decoding module, a second decoding module, a third decoding module, and a fourth decoding module connected in sequence; the first decoding module consists of a UCA-MLP, an R-Head, a displacement field module, an upsampling module, and a transform module connected in sequence; the second and third decoding modules each consist of a first UCA-MLP, a second UCA-MLP, an R-Head, a summation module, a displacement field module, an upsampling module, and a transform module connected in sequence; the fourth decoding module consists of a first UCA-MLP, a second UCA-MLP, an R-Head, a summation module, and a displacement field module connected in sequence; the outputs of the R-Head and the upsampling module of the first decoding module are also connected to the inputs of the second UCA-MLP and the summation module of the second decoding module, respectively; the outputs of the R-Head and the upsampling module of the second decoding module are also connected to the inputs of the second UCA-MLP and the summation module of the third decoding module, respectively; the outputs of the R-Head and the upsampling module of the third decoding module are also connected to the inputs of the second UCA-MLP and the summation module of the fourth decoding module, respectively;
the output of the first encoding module is connected to the transform module of the third decoding module and to the input of the first UCA-MLP of the fourth decoding module, respectively; the output of the second encoding module is connected to the transform module of the second decoding module and to the input of the first UCA-MLP of the third decoding module, respectively; the output of the third encoding module is connected to the transform module of the first decoding module and to the input of the first UCA-MLP of the second decoding module, respectively; the output of the fourth encoding module is connected to the input of the UCA-MLP of the first decoding module; each UCA-MLP consists of a correlation layer, a layer splicing module, a third convolutional layer, a layer normalization module, a first layer perception module, a second layer perception module, a third layer perception module, a first summation module, a local cross-channel attention mechanism, and a second summation module; the correlation layer, the layer splicing module, the third convolutional layer, and the layer normalization module are connected in sequence; the output of the layer normalization module is connected to the inputs of the first, second, and third layer perception modules, respectively; the outputs of the first, second, and third layer perception modules are connected to the input of the first summation module; the output of the first summation module is connected to the input of the local cross-channel attention mechanism, and the output of the local cross-channel attention mechanism is connected to the input of the second summation module;
in each UCA-MLP, the input of the layer splicing module is connected to the output of the correlation layer and to the output of the module connected to the input of the correlation layer, respectively, to obtain a spliced image formed by sequentially splicing together the fixed image, the moving image, and the correlation image obtained by processing the fixed image and the moving image; the convolution kernel of the third convolutional layer is set to 2×2×2, and its output is also connected to the input of the second summation module; the first layer perception module consists of a 3×3 sliding window, a gMLP module, and a Region Merge module connected in sequence; the second layer perception module consists of a 5×5 sliding window, a gMLP module, and a Region Merge module connected in sequence; the third layer perception module consists of a 7×7 sliding window, a gMLP module, and a Region Merge module connected in sequence; in the local cross-channel attention mechanism, K is set to 3.

2. The medical image registration method based on a multilayer perceptron according to claim 1, characterized in that, in the multi-scale feature extraction encoder, the kernel size of the first and second convolutional layers of each encoding module is set to 3×3×3; the parameter of the Leaky ReLU activation function is set to 0.2; and the window size of the max pooling layer is set to 2×2×2.

3. The medical image registration method based on a multilayer perceptron according to claim 1, characterized in that the steps are as follows: Step 1: Obtain a set of medical images for training, which consists of several medical images of the same organ or the same structure; Step 2: Preprocess all medical images obtained in Step 1 to have the same image specifications, and divide them into a training set, a test set, and a validation set; the medical images in each set are organized into image pairs, and in each image pair one image is used as the moving image and the other as the fixed image; Step 3: Construct a deep learning-based single-modal and multi-modal medical image registration network; Step 4: Use the medical images obtained from Step 2 to train the deep learning-based single-modal and multi-modal medical image registration network constructed in Step 3; Step 5: Input the medical images to be registered into the deep learning-based single-modal and multi-modal medical image registration network trained in Step 4, and output the registration result image.

4. The medical image registration method based on a multilayer perceptron according to claim 3, characterized in that, in step 1, the medical images are CT images and/or MRI images.

5. The medical image registration method based on a multilayer perceptron according to claim 3, characterized in that, in step 2, the preprocessing steps for the medical image are as follows: the medical image is sequentially subjected to affine transformation, resampling, and cropping to preprocess it into an image with the same specifications as the MNI-152 brain template.

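6. The medical image registration method based on a multilayer perceptron according to claim 3, characterized in that, in step 4, the loss function is defined as the unsupervised loss function L, and its expression is:
$$L = \alpha L_{sim} + \beta L_{reg},$$
in the formula, $\alpha$ and $\beta$ are adjustment factors; the similarity penalty loss $L_{sim}$ is expressed as follows:
$$L_{sim}(I_f, I_m, \psi) = -\sum_{p \in \Omega} \frac{\left( \sum_{p_i} \big(I_f(p_i) - \bar{I}_f(p)\big)\big([I_m \circ \psi](p_i) - [\bar{I}_m \circ \psi](p)\big) \right)^2}{\left( \sum_{p_i} \big(I_f(p_i) - \bar{I}_f(p)\big)^2 \right)\left( \sum_{p_i} \big([I_m \circ \psi](p_i) - [\bar{I}_m \circ \psi](p)\big)^2 \right)},$$
in the formula, $I_f$ is the fixed image, $I_m$ is the moving image, and $p \in \Omega$ indicates that voxel $p$ belongs to the image domain $\Omega$; $I_f(p_i)$ is the intensity of the fixed image at a neighboring point $p_i$ within the $n^3$ neighborhood region around $p$, and $\bar{I}_f(p)$ is the average intensity of the fixed image within the local area centered on $p$; $[I_m \circ \psi](p_i)$ is the intensity of the deformed moving image at a neighboring point $p_i$ within the $n^3$ neighborhood region around $p$, and $[\bar{I}_m \circ \psi](p)$ is the average intensity of the deformed moving image within the local area centered on $p$; the regularization loss $L_{reg}$ is expressed as follows:
$$L_{reg}(\psi) = \sum_{p \in \Omega} \lVert \nabla \psi(p) \rVert^2,$$
in the formula, $\nabla$ represents the spatial gradient operator.

7. The medical image registration method based on a multilayer perceptron according to claim 6, characterized in that, in the unsupervised loss function L in step 4, the adjustment factors $\alpha$ and $\beta$ are set to 0.8 and 0.2, respectively.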