A deep learning image patch stitching method based on feature extraction and alignment fusion

By using an improved VGG alignment network and a U-Net fusion network, automatic, fast, and accurate stitching of pathological image blocks was achieved, solving the problems of long stitching time and poor quality in existing technologies, reducing reliance on hardware and manual annotation, and improving stitching efficiency and quality.

CN122243731A · Pending · Publication Date: 2026-06-19 · 西安应用光学研究所

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications (China)
Current Assignee / Owner
西安应用光学研究所
Filing Date
2026-02-09
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies for pathological image stitching suffer from problems such as long processing time, poor stitching quality, and high dependence on high-precision hardware and manual annotation, making it difficult to achieve automatic, fast, and accurate stitching of multiple pathological image blocks.

Method used

A deep learning-based image patch stitching method based on feature extraction and alignment fusion is adopted. By predicting the geometric transformation parameters of image patches through an improved VGG alignment network and STN network, and combining a channel-weighted fusion module and U-Net fusion network, accurate image alignment and seamless stitching are achieved.

Benefits of technology

It achieves high-quality, visually seamless panoramic pathological image stitching, reduces the reliance on high-precision hardware and manual annotation, and improves stitching efficiency and model generalization ability.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a deep learning-based image patch stitching method based on feature extraction and alignment fusion. The method first acquires pathological images and divides them into discrete image patches, generating training data pairs through random affine transformation. Then, the transformed image patches are input into an improved VGG alignment network, which utilizes introduced branch design to fuse multi-scale features to predict geometric transformation parameters. These parameters are then executed by a spatial transformation network to achieve accurate geometric correction of the image patches. The corrected image patches are then weighted and preliminarily fused using a weighted fusion module. Finally, the weighted feature maps are input into a U-Net fusion network, where feature extraction and image reconstruction are achieved through its encoder-decoder structure and skip connections, outputting a seamless panoramic pathological image. This invention achieves end-to-end automatic stitching, effectively improving the accuracy, efficiency, and robustness of pathological image stitching while reducing reliance on high-precision hardware and manual annotation.

Description

Technical Field

[0001] This invention relates to the field of pathological image processing technology, specifically to a deep learning-based image patch stitching method based on feature extraction and alignment fusion.

Background Technology

[0002] In the process of digitizing pathological slides, whole-slide scanners play a crucial role. Typically, to ensure that digital slide images provide diagnostic information comparable to traditional microscopy, the scanner must meticulously record the tissue contents under the microscope, encompassing key information such as the specimen's outline, microstructure, and cell morphology and distribution. However, due to the limited field of view of microscope objectives, it is impossible to simultaneously acquire high-magnification and wide-field-of-view images. Therefore, it is necessary to move the stage to acquire multiple local image patches, and then synthesize a complete panoramic image using image stitching techniques.

[0003] Traditional image stitching methods require multiple steps, including feature matching, homography matrix calculation, image transformation, image registration, and fusion, to complete the stitching operation. These steps are time-consuming, and when processing pathological images with complex textures and high background repetition, these methods face difficulties in feature extraction and have low matching accuracy, resulting in poor stitching quality. If these steps are to be skipped and stitching is to be performed directly, a high-precision motor platform is required to ensure the coordinate stability of the sequence of image blocks during acquisition, which is costly.

[0004] With the development of computer technology, deep learning-based image stitching methods have gradually emerged, showing excellent performance in terms of stitching speed, stitching quality, and robustness. However, pathological images have more complex texture features than natural images, with high background repetition and greater difficulty in feature extraction. Current technologies mainly focus on stitching pairs of two images, making them difficult to apply directly to scenarios where multiple image patches are stitched into a panoramic image at once. In addition, because they rely on feature point matching, existing methods depend heavily on large numbers of manually annotated image pairs and transformation matrices for supervised learning, resulting in high costs and limited generalization ability.

[0005] There is an urgent need in this field for a method that can automatically, quickly, and accurately stitch together multiple pathological image blocks, while reducing the reliance on high-precision hardware and manual annotation.

Summary of the Invention

[0006] To address the aforementioned issues, this invention proposes a deep learning-based image patch stitching method based on feature extraction and alignment fusion. This method introduces a branch-fusion-improved VGG alignment network combined with an STN network, which can accurately predict and correct complex geometric differences between image patches, thus maintaining excellent stitching quality even with varying offsets and overlap ratios. By automatically generating training data pairs, the method reduces reliance on high-precision hardware platforms and large amounts of manually labeled data, enhancing the model's generalization ability in practical applications. The combination of preliminary fusion using a channel-weighted fusion module and final fusion using a U-Net fusion network effectively eliminates potential seams from direct stitching after alignment, automatically generating high-quality, visually seamless panoramic pathological images with excellent image quality and high stitching efficiency.

[0007] The technical solution of this invention is as follows:

[0008] A deep learning-based image patch stitching method based on feature extraction and alignment fusion includes the following steps:

[0009] Step 1: Acquire target pathological images using a pathological slide scanner, and construct an image dataset from the multiple acquired target pathological images;

[0010] Step 2: The target pathological images acquired in Step 1 are preprocessed by division into discrete blocks and random geometric deformation to obtain discrete pathological image blocks. The discrete pathological image blocks are then input into the improved VGG alignment network to predict the geometric transformation parameters of each image block;

[0011] Step 3: Input the transformation parameters predicted in Step 2 into the STN spatial transformation network to obtain the sampler, resample the input image blocks to achieve accurate geometric correction, and map each discrete pathological image block to obtain the aligned first image block;

[0012] Step 4: Input the first image block obtained in Step 3 into the weighted fusion module. The weighted fusion module performs average pooling and max pooling operations on each channel of the first image block, concatenates the two pooling results by channel, and sends them into the multilayer perceptron (MLP) to obtain the weight coefficient of each channel. Multiply the weight coefficient of each channel with the first image block channel by channel to obtain the weighted feature map, thus completing the initial fusion.

[0013] Step 5: Input the weighted feature map obtained in Step 4 into the trained U-Net fusion network for overall feature extraction and image reconstruction to achieve final fusion and obtain a high-quality, visually seamless panoramic pathological image.

[0014] Furthermore, step 2 includes the following sub-steps:

[0015] Step 2.1: Divide each target pathological image acquired in Step 1 into a 5×5 grid to obtain 25 discrete square pathological image blocks of equal size, which are used to simulate multiple local fields of view that may be acquired in actual scanning; assuming that there is an overlapping area between adjacent square pathological image blocks, the overlap coefficient l is defined as equal to the length of the overlap of the square pathological image blocks divided by the side length of the square pathological image blocks, and the overlap coefficient l is selected from the set {0.15, 0.20, 0.25};

[0016] Step 2.2: To train the improved VGG alignment network to handle various possible misalignment situations, a random affine transformation is applied to each divided image block: the offset distance is randomly selected within a set offset range in the x-axis and y-axis directions, and the rotation angle θ is randomly selected within a set rotation angle range; the affine transformation matrix $M_{ij}$ is calculated based on the selected offset distance and rotation angle:

[0017] $$M_{ij} = \begin{bmatrix} \cos\theta & -\sin\theta & t_x \\ \sin\theta & \cos\theta & t_y \\ 0 & 0 & 1 \end{bmatrix}$$

[0018] In the formula, θ is the rotation angle of the square pathological image patch, $t_x$ is the offset distance of the square pathological image patch along the x-axis, and $t_y$ is the offset distance of the square pathological image patch along the y-axis.

[0019] The matrix $M_{ij}$ serves as the "real label" for training the improved VGG alignment network; it maps the original image patch to the transformed image patch. Specifically, the pixel coordinates of the divided square pathological image patch are multiplied by the transformation matrix to obtain the transformed pathological image patch. Thus, a data pair of "transformed image patch - real transformation matrix" is obtained for supervised learning.

[0020] Step 2.3: Input the transformed image patch into the improved VGG alignment network for training to predict the geometric transformation parameters of each image patch.

[0021] Furthermore, the improved VGG alignment network is constructed from five groups of 3×3 convolutional layers. The first and second groups consist of two convolutional layers for initial feature capture; the third, fourth, and fifth groups consist of three convolutional layers for mining deep semantic information; each group of convolutional layers is followed by a max-pooling layer to halve the feature map size and reduce the number of parameters; branch design is introduced between the fifth and seventh convolutional layers, and between the eighth and tenth convolutional layers, to improve the ability to extract complex pathological features by directly concatenating the input and output of the current group of convolutional layers; the last max-pooling layer is followed by two fully connected layers for parameter regression.

[0022] The improved VGG network processes input data as follows:

[0023] First, the first convolutional layer in the first group of convolutional layers receives the transformed pathological image patch. Each convolutional layer performs a convolution operation on the input data to obtain a feature map of the input data. After two convolutional layers, a max pooling layer is connected to perform a max pooling operation on the input feature map, which is then used as the input data for the second group of convolutional layers, and so on, until the last max pooling layer. In the above operation process, a branch design is introduced between the fifth and seventh convolutional layers, and between the eighth and tenth convolutional layers. By directly concatenating the input and output of the current group of convolutional layers, the fusion of features at different depths is achieved. The feature map obtained from the last max pooling layer is output to the fully connected layer. The fully connected layer summarizes the feature maps obtained by the convolutional layers and regresses the specific geometric transformation parameters of each image patch.

[0024] Furthermore, the loss function of the improved VGG alignment network is defined as the mean square error between the predicted transformation parameters and the true transformation matrix, and the specific expression is as follows:

[0025] $$L_{\mathrm{VGG}} = \frac{1}{N}\sum_{i=1}^{N}\left\| \hat{M}_i - M_i \right\|_2^2$$

[0026] In the formula, $L_{\mathrm{VGG}}$ is the loss function of the VGG alignment network, $N$ is the total number of images in the dataset, the index $i$ denotes the $i$-th training data, that is, the $i$-th data pair of "transformed image patch - true transformation matrix", and $\hat{M}_i$ and $M_i$ are the predicted transformation parameters and the true transformation matrix of the $i$-th data, respectively.

[0027] Furthermore, in step 4, the multilayer perceptron (MLP) obtains the weight coefficient of each channel of the first image block using the first formula, which is expressed as follows:

[0028] $$W = \sigma\Big(\mathrm{MLP}\big(\left[\mathrm{AvgPool}(I);\ \mathrm{MaxPool}(I)\right]\big)\Big)$$

[0029] In the formula, $\sigma$ is the Sigmoid activation function, $\mathrm{AvgPool}$ and $\mathrm{MaxPool}$ represent average pooling and max pooling, respectively, and $I$ represents the input feature map.

[0030] Furthermore, the U-Net network described in step 5 includes a contraction path and an expansion path, which are arranged in a U-shape.

[0031] The contraction path contains four contraction convolutional modules. Each module contains two sets of convolutional layers, two sets of BN layers, and two sets of ReLU layers, followed by a max pooling layer for downsampling. The convolutional layers use 3×3 kernels to extract feature maps, and the max pooling layer has a stride of 2, halving the output size of each contraction convolutional module.

[0032] The expansion path contains four expansion convolutional modules. Each module contains a deconvolutional layer, a batch normalization (BN) layer, and a ReLU layer for upsampling. After each upsampling, the result is concatenated with the feature map of the same scale from the intermediate layers of the contraction path, followed by a double convolutional module. The double convolutional module contains two sets of 3×3 convolutional layers, a batch normalization (BN) layer, and a ReLU layer for feature fusion.

[0033] The U-Net network processes the input data as follows: First, the input feature map passes through a double convolutional module for preliminary feature extraction and channel adjustment. Then, it enters the contraction path, where four consecutive contraction convolutional modules gradually reduce the size of the feature map and extract image features at different levels. After contraction, it enters the expansion path, where four expansion convolutional modules restore the size of the feature map. After each expansion convolutional module, a skip connection concatenates the feature map of the corresponding level in the contraction path with that in the expansion path, preserving the original information of the image. Finally, a double convolutional module restores the number of channels of the image, yielding the final stitched and fused image.

[0034] Furthermore, the loss function of the U-Net fusion network is a weighted sum of the mean squared error loss and the perceptual loss; the specific expression of the loss function of the U-Net fusion network is as follows:

[0035] $$L_{\mathrm{U\text{-}Net}} = \lambda_1 L_{\mathrm{MSE}} + \lambda_2 L_{\mathrm{per}}$$

[0036] In the formula, $L_{\mathrm{U\text{-}Net}}$ is the loss function of the U-Net fusion network, $L_{\mathrm{MSE}}$ is the mean squared error loss, $L_{\mathrm{per}}$ is the perceptual loss, and $\lambda_1$ and $\lambda_2$ are the weighting coefficients of the mean squared error loss and the perceptual loss, respectively, where the weight of the mean squared error loss $\lambda_1$ is 1 and the weight of the perceptual loss $\lambda_2$ is 2;

[0037] The formula for calculating the mean squared error loss is as follows:

[0038] $$L_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\left\| F(x_i) - y_i \right\|_2^2$$

[0039] In the formula, $x_i$ represents the input to the U-Net fusion network, $F(x_i)$ represents the output of the U-Net fusion network, $y_i$ represents the true panoramic image corresponding to the input image patch, and $N$ represents the total number of images in the dataset;

[0040] The formula for calculating the perceptual loss is as follows:

[0041] $$L_{\mathrm{per}} = \frac{1}{C H W}\left\| \phi\big(F(x)\big) - \phi(y) \right\|_2^2$$

[0042] In the formula, $\phi(\cdot)$ denotes the feature map on which the perceptual loss is computed, $H$ is the number of pixels in the height of the feature map, $W$ is the number of pixels in the width of the feature map, and $C$ is the number of channels in the feature map.

[0043] Furthermore, both the improved VGG alignment network and the U-Net fusion network were built using PyTorch and trained on an Nvidia RTX4070 GPU using the Adam optimizer, with an initial learning rate set to... Furthermore, a cosine annealing strategy is employed to dynamically adjust the learning rate, ensuring the smoothness and stability of the learning rate during its decrease.

[0044] The improved VGG alignment network has a batch size of 16 and is trained for 300 training epochs.

[0045] The weighted fusion module and the U-Net fusion network are trained together, with a batch size of 16 and 100 training epochs.

[0046] Furthermore, the present invention also provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the aforementioned deep learning image patch stitching method based on feature extraction and alignment fusion.

[0047] Furthermore, the present invention also provides a computer-readable storage medium storing a computer program thereon, wherein when the computer program is executed by a computer, it implements the aforementioned deep learning image patch stitching method based on feature extraction and alignment fusion.

[0048] Beneficial effects:

[0049] This invention proposes a deep learning-based image patch stitching method based on feature extraction and alignment fusion. This method introduces a branch-fusion-improved VGG alignment network combined with an STN network, which can accurately predict and correct complex geometric differences between image patches, thus maintaining excellent stitching quality even with varying offsets and overlap ratios. By automatically generating training data pairs, the method reduces reliance on high-precision hardware platforms and large amounts of manually labeled data, enhancing the model's generalization ability in practical applications. The combination of preliminary fusion using a channel-weighted fusion module and final fusion using a U-Net fusion network effectively eliminates potential seams from direct stitching after alignment, automatically generating high-quality, visually seamless panoramic pathological images with excellent image quality and high stitching efficiency.

Attached Figure Description

[0050] Figure 1 is a flowchart of a deep learning-based image patch stitching method based on feature extraction and alignment fusion according to an embodiment of the present invention;

[0051] Figure 2 is a schematic diagram of the improved VGG alignment network structure in an embodiment of the present invention;

[0052] Figure 3 is a schematic diagram of the weighted fusion module structure in an embodiment of the present invention;

[0053] Figure 4 is a schematic diagram of the U-Net fusion network structure in an embodiment of the present invention;

[0054] Figure 5 is a flowchart illustrating the creation of discrete pathological image blocks in an embodiment of the present invention.

Detailed Implementation

[0055] To make the technical problems solved, the technical solutions, and the beneficial effects of this invention clearer and to enable those skilled in the art to better understand the invention, the invention will be further described in detail and in full below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for illustrative purposes only and are not intended to limit the invention.

[0056] As shown in Figure 1, the deep learning image patch stitching method based on feature extraction and alignment fusion in this embodiment includes the following steps:

[0057] Step 1: Acquire target pathological images using a pathological slide scanner, and collect n target pathological images to form an image dataset;

[0058] In this embodiment, the acquisition device uses a Nikon plan apochromatic objective lens (CFI Plan Apochromat VC 20) with a numerical aperture (NA) of 0.75, in conjunction with the OASIS Glide OEM XYZ automated motion platform from Objective Imaging in the UK, to acquire sequential microscopic images at a magnification of 20x.

[0059] Step 2: The target pathological images acquired in Step 1 are preprocessed by division into discrete blocks and random geometric deformation to obtain discrete pathological image patches. These discrete pathological image patches are then input into an improved VGG alignment network to predict the geometric transformation parameters of each image patch. Specifically, this includes the following sub-steps:

[0060] Step 2.1: Divide each target pathological image acquired in Step 1 into a uniform 5×5 grid, as shown in Figure 5, to obtain 25 discrete square pathological image blocks of equal size, which simulate the multiple local fields of view that may be acquired in actual scanning. Adjacent square pathological image blocks are assumed to overlap. The overlap coefficient l is defined as the length of the overlap between adjacent square pathological image blocks divided by the side length of a square pathological image block, and is selected from the set {0.15, 0.20, 0.25}.
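As an illustrative, non-limiting sketch of Step 2.1, the grid division could be written in Python with NumPy as follows. The function name `split_into_patches`, the assumption of a square single-channel input, and the rule used to derive the patch side from the image side (image side = patch side × (5 − 4·l)) are illustrative assumptions and are not stated in this embodiment.

```python
import numpy as np


def split_into_patches(image, grid=5, overlap=0.20):
    """Split a square image into grid x grid overlapping square patches.

    `overlap` is the overlap coefficient l: overlap length / patch side.
    Returns the patches (row-major order), the patch side and the stride.
    """
    side = image.shape[0]
    assert image.shape[1] == side, "this sketch assumes a square image"
    # Assumed geometry: side = patch + (grid - 1) * patch * (1 - overlap)
    patch = int(round(side / (grid - (grid - 1) * overlap)))
    stride = int(round(patch * (1.0 - overlap)))
    patches = []
    for i in range(grid):
        for j in range(grid):
            # Clamp so that rounding never pushes a patch past the border.
            y = min(i * stride, side - patch)
            x = min(j * stride, side - patch)
            patches.append(image[y:y + patch, x:x + patch])
    return patches, patch, stride


if __name__ == "__main__":
    demo = np.arange(1050 * 1050).reshape(1050, 1050)
    blocks, patch_side, step = split_into_patches(demo, grid=5, overlap=0.20)
    print(len(blocks), patch_side, step)
```

With a 1050×1050 input and l = 0.20, this geometry gives 25 patches whose side is 250 pixels and whose neighbours share a 50-pixel strip.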

[0061] Step 2.2: To train the improved VGG alignment network to handle various possible misalignment situations, a random affine transformation is applied to each divided image block: the offset distance is randomly selected within a set offset range in the x-axis and y-axis directions, and the rotation angle θ is randomly selected within a set rotation angle range; the affine transformation matrix $M_{ij}$ is calculated based on the selected offset distance and rotation angle:

[0062] $$M_{ij} = \begin{bmatrix} \cos\theta & -\sin\theta & t_x \\ \sin\theta & \cos\theta & t_y \\ 0 & 0 & 1 \end{bmatrix}$$

[0063] In the formula, θ is the rotation angle of the square pathological image patch, $t_x$ is the offset distance of the square pathological image patch along the x-axis, and $t_y$ is the offset distance of the square pathological image patch along the y-axis.

[0064] The matrix $M_{ij}$ serves as the "real label" for training the improved VGG alignment network; it maps the original image patch to the transformed image patch. Specifically, the pixel coordinates of the divided square pathological image patch are multiplied by the transformation matrix to obtain the transformed pathological image patch. Thus, a data pair of "transformed image patch - real transformation matrix" is obtained for supervised learning.
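The generation of a "real label" in Step 2.2 might be sketched as below. This is a minimal illustration that assumes the usual rotation-plus-translation form in homogeneous coordinates; the names `random_affine` and `apply_to_points` and the ranges `max_shift` and `max_angle_deg` are hypothetical, since this embodiment only speaks of "a set offset range" and "a set rotation angle range". The sketch transforms coordinates only and does not resample pixel values.

```python
import numpy as np


def random_affine(rng, max_shift=10.0, max_angle_deg=5.0):
    """Draw a random rotation angle and x/y offsets and build a 3x3 matrix.

    The ranges are placeholder assumptions, not values from the source.
    """
    theta = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    tx, ty = rng.uniform(-max_shift, max_shift, size=2)
    c, s = np.cos(theta), np.sin(theta)
    matrix = np.array([[c, -s, tx],
                       [s, c, ty],
                       [0.0, 0.0, 1.0]])
    return matrix, (theta, tx, ty)


def apply_to_points(matrix, points_xy):
    """Multiply (x, y) pixel coordinates by a 3x3 affine matrix."""
    points_xy = np.asarray(points_xy, dtype=float)
    homogeneous = np.hstack([points_xy, np.ones((len(points_xy), 1))])
    return (homogeneous @ matrix.T)[:, :2]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    label, params = random_affine(rng)
    corners = [[0.0, 0.0], [255.0, 0.0], [255.0, 255.0], [0.0, 255.0]]
    print(label)
    print(apply_to_points(label, corners))
```

Because the label is drawn by the program rather than annotated by hand, each call yields one supervised training target at no labeling cost.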

[0065] Step 2.3: Input the transformed image patch into the improved VGG alignment network for training to predict the geometric transformation parameters of each image patch.

[0066] In this embodiment, as shown in Figure 2, the improved VGG alignment network consists of five groups of 3×3 convolutional layers. The first and second groups consist of two convolutional layers for initial feature capture; the third, fourth, and fifth groups consist of three convolutional layers for mining deep semantic information; each group of convolutional layers is followed by a max-pooling layer to halve the feature map size and reduce the number of parameters; a branch design is introduced between the fifth and seventh convolutional layers and between the eighth and tenth convolutional layers, which fuses features at different depths by directly concatenating the group's input with its output, thereby enhancing the ability to extract complex pathological features; the last max-pooling layer is followed by two fully connected layers that regress the transformation parameters;

[0067] The improved VGG network processes input data as follows:

[0068] First, the first convolutional layer in the first group of convolutional layers receives the transformed pathological image patch. Each convolutional layer performs a convolution operation on the input data to obtain a feature map of the input data. After two convolutional layers, a max pooling layer is connected to perform a max pooling operation on the input feature map, which is then used as the input data for the second group of convolutional layers, and so on, until the last max pooling layer. In the above operation process, a branch design is introduced between the fifth and seventh convolutional layers, and between the eighth and tenth convolutional layers. By directly concatenating the input and output of the current group of convolutional layers, the fusion of features at different depths is achieved. The feature map obtained from the last max pooling layer is output to the fully connected layer. The fully connected layer summarizes the feature maps obtained by the convolutional layers and regresses the specific geometric transformation parameters of each image patch.
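One possible PyTorch reading of the layer arrangement described above is sketched below. It is not a definitive implementation: the class name `AlignNet`, the channel widths, the ReLU activations after each convolution, the 64×64 input size, the hidden width of the first fully connected layer, and the choice of six output parameters (a 2×3 affine matrix) are all assumptions that this embodiment does not specify.

```python
import torch
import torch.nn as nn


def conv_group(in_ch, out_ch, n_convs):
    """A group of n_convs 3x3 convolutions (ReLU after each is assumed)."""
    layers = []
    for k in range(n_convs):
        layers.append(nn.Conv2d(in_ch if k == 0 else out_ch, out_ch,
                                kernel_size=3, padding=1))
        layers.append(nn.ReLU())
    return nn.Sequential(*layers)


class AlignNet(nn.Module):
    """Five conv groups (2, 2, 3, 3, 3 layers) with two concatenation branches."""

    def __init__(self, in_ch=3, widths=(16, 32, 64, 128, 128),
                 image_size=64, n_params=6):
        super().__init__()
        w1, w2, w3, w4, w5 = widths
        self.g1 = conv_group(in_ch, w1, 2)            # conv layers 1-2
        self.g2 = conv_group(w1, w2, 2)               # conv layers 3-4
        self.g3 = conv_group(w2, w3, 3)               # conv layers 5-7
        self.g4 = conv_group(w2 + w3, w4, 3)          # conv layers 8-10
        self.g5 = conv_group(w2 + w3 + w4, w5, 3)     # conv layers 11-13
        self.pool = nn.MaxPool2d(2)
        feat = image_size // 32                       # five 2x poolings
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(w5 * feat * feat, 256),
            nn.ReLU(),
            nn.Linear(256, n_params),
        )

    def forward(self, x):
        x = self.pool(self.g1(x))
        x = self.pool(self.g2(x))
        # Branch: concatenate the group's input with its output.
        x = self.pool(torch.cat([x, self.g3(x)], dim=1))
        x = self.pool(torch.cat([x, self.g4(x)], dim=1))
        x = self.pool(self.g5(x))
        return self.head(x)


if __name__ == "__main__":
    net = AlignNet()
    print(net(torch.zeros(2, 3, 64, 64)).shape)
```

Concatenating along the channel dimension means the fourth and fifth groups receive wider inputs than in a plain VGG layout, which is why their input widths are sums of earlier widths in this sketch.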

[0069] In this embodiment, the loss function of the improved VGG alignment network is defined as the mean square error between the predicted transformation parameters and the true transformation matrix, and the specific expression is as follows:

[0070] $$L_{\mathrm{VGG}} = \frac{1}{N}\sum_{i=1}^{N}\left\| \hat{M}_i - M_i \right\|_2^2$$

[0071] In the formula, $L_{\mathrm{VGG}}$ is the loss function of the VGG alignment network, $N$ is the total number of images in the dataset, the index $i$ denotes the $i$-th training data, that is, the $i$-th data pair of "transformed image patch - true transformation matrix", and $\hat{M}_i$ and $M_i$ are the predicted transformation parameters and the true transformation matrix of the $i$-th data, respectively.
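A small numerical sketch of this alignment loss is given below, assuming that the squared differences are summed over the entries of each matrix and then averaged over the N data pairs. The function name `alignment_loss` and the array layout (N, rows, columns) are illustrative assumptions.

```python
import numpy as np


def alignment_loss(predicted, true):
    """Mean over samples of the squared error between parameter sets.

    predicted, true: arrays of shape (N, ...) holding one parameter
    matrix (or vector) per training pair.
    """
    predicted = np.asarray(predicted, dtype=float)
    true = np.asarray(true, dtype=float)
    diff = (predicted - true).reshape(len(predicted), -1)
    per_sample = np.sum(diff ** 2, axis=1)
    return float(per_sample.mean())


if __name__ == "__main__":
    pred = np.zeros((2, 2, 3))
    target = np.ones((2, 2, 3))
    print(alignment_loss(pred, target))
```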

[0072] Step 3: Input the transformation parameters predicted in Step 2 into the STN spatial transformation network to obtain the sampler, resample the input image blocks to achieve accurate geometric correction, and map each discrete pathological image block to obtain the aligned first image block;
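The resampling in Step 3 can be illustrated with PyTorch's `torch.nn.functional.affine_grid` and `torch.nn.functional.grid_sample`, which together form the grid generator and sampler of a spatial transformer. This is a hedged sketch: the function name `stn_resample` is hypothetical, and it assumes the predicted parameters are arranged as a 2×3 matrix expressed in the normalized [-1, 1] coordinates that these two functions expect, mapping output positions to input positions.

```python
import torch
import torch.nn.functional as F


def stn_resample(patches, theta):
    """Resample image patches with a batch of 2x3 affine matrices.

    patches: (N, C, H, W) tensor of image patches.
    theta:   (N, 2, 3) affine matrices in normalized [-1, 1] coordinates.
    """
    grid = F.affine_grid(theta, patches.size(), align_corners=False)
    return F.grid_sample(patches, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=False)


if __name__ == "__main__":
    patches = torch.rand(2, 3, 16, 16)
    identity = torch.tensor([[1.0, 0.0, 0.0],
                             [0.0, 1.0, 0.0]]).repeat(2, 1, 1)
    aligned = stn_resample(patches, identity)
    print(torch.allclose(aligned, patches, atol=1e-4))
```

Bilinear sampling keeps the operation differentiable with respect to `theta`, so the alignment network can in principle be trained through the sampler.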

[0073] Step 4: As shown in Figure 3, the first image block obtained in step 3 is input into a weighted fusion module. The weighted fusion module performs average pooling and max pooling operations on each channel of the first image block, concatenates the two pooling results by channel, and sends them into a multilayer perceptron (MLP) to obtain the weight coefficient of each channel. The weight coefficient of each channel is multiplied by the first image block channel by channel to obtain the weighted feature map, thus completing the initial fusion.

[0074] In this embodiment, in step 4, the multilayer perceptron (MLP) obtains the weight coefficient of each channel of the first image block using the first formula, which is expressed as follows:

[0075] $$W = \sigma\Big(\mathrm{MLP}\big(\left[\mathrm{AvgPool}(I);\ \mathrm{MaxPool}(I)\right]\big)\Big)$$

[0076] In the formula, $\sigma$ is the Sigmoid activation function, $\mathrm{AvgPool}$ and $\mathrm{MaxPool}$ represent average pooling and max pooling, respectively, and $I$ represents the input feature map.
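The channel weighting of Step 4 might look as follows in NumPy. This is a sketch under stated assumptions: the MLP is taken to be two dense layers with a ReLU in between, mapping the 2C concatenated pooling values to C weights, and the names `channel_weighted_fusion`, `w1`, `b1`, `w2`, and `b2` are hypothetical; this embodiment does not give the layer sizes.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def channel_weighted_fusion(feat, w1, b1, w2, b2):
    """Weight each channel of a (C, H, W) feature map.

    Global average and max pooling give one value per channel each; the
    two vectors are concatenated (length 2C), passed through a two-layer
    MLP (assumed shape 2C -> hidden -> C), and squashed with a Sigmoid.
    """
    avg = feat.mean(axis=(1, 2))                  # (C,)
    mx = feat.max(axis=(1, 2))                    # (C,)
    pooled = np.concatenate([avg, mx])            # (2C,)
    hidden = np.maximum(0.0, w1 @ pooled + b1)    # ReLU is an assumption
    weights = sigmoid(w2 @ hidden + b2)           # (C,)
    return feat * weights[:, None, None], weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat = rng.normal(size=(4, 8, 8))
    w1, b1 = rng.normal(size=(8, 8)), np.zeros(8)
    w2, b2 = rng.normal(size=(4, 8)), np.zeros(4)
    fused, weights = channel_weighted_fusion(feat, w1, b1, w2, b2)
    print(fused.shape, weights)
```

With all MLP parameters set to zero, every channel weight is Sigmoid(0) = 0.5, so the module simply halves the input; training moves the weights away from this neutral point.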

[0077] Step 5: Input the weighted feature map obtained in Step 4 into the trained U-Net fusion network for overall feature extraction and image reconstruction to achieve final fusion and obtain a high-quality, visually seamless panoramic pathological image.

[0078] In this embodiment, as shown in Figure 4, the U-Net network includes a contraction path and an expansion path, which form a U-shaped structure.

[0079] The contraction path contains four contraction convolutional modules. Each module contains two sets of convolutional layers, two sets of BN layers, and two sets of ReLU layers, followed by a max pooling layer for downsampling. The convolutional layers use 3×3 kernels to extract feature maps, and the max pooling layer has a stride of 2, halving the output size of each contraction convolutional module.

[0080] The expansion path contains four expansion convolutional modules. Each module contains a deconvolutional layer, a batch normalization (BN) layer, and a ReLU layer for upsampling. After each upsampling, the result is concatenated with the feature map of the same scale from the intermediate layers of the contraction path, followed by a double convolutional module. The double convolutional module contains two sets of 3×3 convolutional layers, one BN layer, and one ReLU layer for feature fusion.

[0081] The U-Net network processes the input data as follows: First, the input feature map passes through a double convolutional module for preliminary feature extraction and channel adjustment. Then, it enters the contraction path, where four consecutive contraction convolutional modules gradually reduce the size of the feature map and extract image features at different levels. After contraction, it enters the expansion path, where four expansion convolutional modules restore the size of the feature map. After each expansion convolutional module, a skip connection concatenates the feature map of the corresponding level in the contraction path with that in the expansion path, preserving the original information of the image. Finally, a double convolutional module restores the number of channels of the image, yielding the final stitched and fused image.
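The data flow described above could be sketched in PyTorch as follows. The class name `FusionUNet`, the channel widths, the 2×2 transposed-convolution kernel, and the use of the same conv-BN-ReLU double convolution everywhere are assumptions made to keep the sketch short; the input height and width are assumed to be divisible by 16.

```python
import torch
import torch.nn as nn


def double_conv(in_ch, out_ch):
    """Two 3x3 convolutions, each followed by BN and ReLU (assumed layout)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
    )


class FusionUNet(nn.Module):
    def __init__(self, in_ch=3, out_ch=3, base=8):
        super().__init__()
        chs = [base * 2 ** k for k in range(5)]
        self.stem = double_conv(in_ch, chs[0])        # initial double conv
        self.down = nn.ModuleList(
            [double_conv(chs[k], chs[k + 1]) for k in range(4)])
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.up = nn.ModuleList([
            nn.Sequential(
                nn.ConvTranspose2d(chs[k + 1], chs[k + 1],
                                   kernel_size=2, stride=2),
                nn.BatchNorm2d(chs[k + 1]),
                nn.ReLU(),
            )
            for k in reversed(range(4))
        ])
        self.merge = nn.ModuleList(
            [double_conv(2 * chs[k + 1], chs[k]) for k in reversed(range(4))])
        self.head = double_conv(chs[0], out_ch)       # restores channel count

    def forward(self, x):
        x = self.stem(x)
        skips = []
        for block in self.down:                       # contraction path
            x = block(x)
            skips.append(x)                           # kept for the skip link
            x = self.pool(x)
        for up, merge in zip(self.up, self.merge):    # expansion path
            x = up(x)
            x = merge(torch.cat([skips.pop(), x], dim=1))
        return self.head(x)


if __name__ == "__main__":
    net = FusionUNet().eval()
    with torch.no_grad():
        print(net(torch.zeros(1, 3, 32, 32)).shape)
```

Each skip tensor is taken before pooling, so it has the same spatial size as the upsampled tensor it is concatenated with.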

[0082] In this embodiment, the loss function of the U-Net fusion network is a weighted sum of the mean squared error loss and the perceptual loss; the specific expression of the loss function of the U-Net fusion network is:

[0083] $$L_{\mathrm{U\text{-}Net}} = \lambda_1 L_{\mathrm{MSE}} + \lambda_2 L_{\mathrm{per}}$$

[0084] In the formula, $L_{\mathrm{U\text{-}Net}}$ is the loss function of the U-Net fusion network, $L_{\mathrm{MSE}}$ is the mean squared error loss, $L_{\mathrm{per}}$ is the perceptual loss, and $\lambda_1$ and $\lambda_2$ are the weighting coefficients of the mean squared error loss and the perceptual loss, respectively, where the weight of the mean squared error loss $\lambda_1$ is 1 and the weight of the perceptual loss $\lambda_2$ is 2;

[0085] The formula for calculating the mean squared error loss is as follows:

[0086] $$L_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\left\| F(x_i) - y_i \right\|_2^2$$

[0087] In the formula, $x_i$ represents the input to the U-Net fusion network, $F(x_i)$ represents the output of the U-Net fusion network, $y_i$ represents the true panoramic image corresponding to the input image patch, and $N$ represents the total number of images in the dataset;

[0088] The formula for calculating the perceptual loss is as follows:

[0089] $$L_{\mathrm{per}} = \frac{1}{C H W}\left\| \phi\big(F(x)\big) - \phi(y) \right\|_2^2$$

[0090] In the formula, $\phi(\cdot)$ denotes the feature map on which the perceptual loss is computed, $H$ is the number of pixels in the height of the feature map, $W$ is the number of pixels in the width of the feature map, and $C$ is the number of channels in the feature map.
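The weighted combination of the two fusion losses can be sketched numerically as below. Several points are assumptions: the mean squared error is normalized per element here, the feature extractor is passed in as a hypothetical callable `features` returning a (C, H, W) array (a pretrained network's feature maps are the usual choice for a perceptual loss, but this embodiment does not name one), and the default weights of 1 and 2 reflect one reading of the weighting coefficients given above and are exposed as parameters.

```python
import numpy as np


def mse_loss(pred, target):
    """Mean squared error, normalized per element in this sketch."""
    return float(np.mean((pred - target) ** 2))


def perceptual_loss(pred, target, features):
    """Squared feature distance divided by C * H * W of the feature map."""
    fp, ft = features(pred), features(target)
    c, h, w = fp.shape
    return float(np.sum((fp - ft) ** 2) / (c * h * w))


def fusion_loss(pred, target, features, w_mse=1.0, w_per=2.0):
    """Weighted sum of the MSE loss and the perceptual loss."""
    return (w_mse * mse_loss(pred, target)
            + w_per * perceptual_loss(pred, target, features))


if __name__ == "__main__":
    pred = np.ones((3, 4, 4))
    target = np.zeros((3, 4, 4))
    # Identity "features" stand in for a real feature extractor.
    print(fusion_loss(pred, target, features=lambda a: a))
```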

[0091] In this embodiment, both the improved VGG alignment network and the U-Net fusion network are built using PyTorch and trained on an Nvidia RTX4070 GPU using the Adam optimizer, with an initial learning rate set to... Furthermore, a cosine annealing strategy is employed to dynamically adjust the learning rate, ensuring the smoothness and stability of the learning rate during its decrease.

[0092] The improved VGG alignment network has a batch size of 16 and is trained for 300 training epochs.

[0093] The weighted fusion module and the U-Net fusion network are trained together, with a batch size of 16 and 100 training epochs.
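The cosine annealing schedule mentioned in this embodiment follows a simple closed form, sketched below in plain Python. The function name `cosine_annealing_lr` is illustrative, and the initial learning rate of 1e-3 used in the demonstration is a placeholder assumption rather than the value of this embodiment; only the 300-epoch schedule length is taken from the alignment network training described above.

```python
import math


def cosine_annealing_lr(epoch, total_epochs, lr_init, lr_min=0.0):
    """Learning rate after `epoch` epochs of a cosine-annealed schedule."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + cosine)


if __name__ == "__main__":
    # 300 epochs as for the alignment network; lr_init is a placeholder.
    for epoch in (0, 75, 150, 225, 300):
        print(epoch, cosine_annealing_lr(epoch, 300, lr_init=1e-3))
```

In PyTorch the same curve is provided by `torch.optim.lr_scheduler.CosineAnnealingLR`, which takes the optimizer, `T_max`, and `eta_min`. The rate falls slowly at first, fastest near the midpoint, and slowly again near the end, which gives the smooth decrease described above.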

[0094] To verify the beneficial effects of the method proposed in this invention, after training, the parameters of the VGG alignment network and the U-Net fusion network were fixed, and the algorithm was evaluated on a test set; at the same time, the method of this invention was compared with existing stitching algorithms.

[0095] In this embodiment, several representative stitching methods were selected for comparison: direct stitching, the automatic stitching algorithm of the AutoStitch software, and the Lucas-Kanade method as implemented in the BigStitcher tool. Direct stitching combines multiple blocks using the maximum value method, without aligning them first. The Lucas-Kanade method is an optical flow method: it applies a first-order Taylor series expansion (or finite differences) to the image signal and uses the partial derivatives with respect to the spatial and temporal coordinates to estimate the motion of each pixel, thereby registering and stitching different images. The BigStitcher tool, created by David et al., is software for processing and analyzing bioinformatics datasets. It integrates various image alignment and stitching fusion techniques, such as phase correlation and the Lucas-Kanade method, to ensure accurate alignment of medical pathology images.
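
As a non-limiting illustration of the direct stitching baseline, the sketch below places one-row tiles at their nominal grid positions and resolves each overlapping pixel by taking the maximum value. This is one plausible reading of the "maximum value method"; the function name, the one-dimensional layout, and the sample values are assumptions made for brevity.

```python
def direct_stitch_row(tiles, tile_width, overlap):
    """Place 1-row tiles at nominal positions; merge overlaps by pixelwise max.

    Assumes non-negative pixel values, since the canvas is initialised to 0.
    No registration is performed, so any misalignment stays in the result.
    """
    stride = tile_width - overlap
    canvas_width = stride * (len(tiles) - 1) + tile_width
    canvas = [0] * canvas_width
    for index, tile in enumerate(tiles):
        start = index * stride
        for offset, value in enumerate(tile):
            canvas[start + offset] = max(canvas[start + offset], value)
    return canvas


tiles = [[1, 2, 3, 4], [5, 1, 7, 8]]
panorama = direct_stitch_row(tiles, tile_width=4, overlap=2)
# panorama == [1, 2, 5, 4, 7, 8]
```

Because the overlap is resolved pixel by pixel with no geometric correction, shifted or rotated blocks produce visible seams, which is consistent with the weak baseline results discussed below.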

[0096] To comprehensively evaluate the performance of the stitching method of this invention, the overlap coefficient was fixed and three offset standards were defined: translation only, small offset, and large offset. Under these three preset transformation standards, the method of this invention was compared with the above comparative methods on the stitching of pathological images. Table 1 gives a detailed quality comparison of the stitched images produced by each method. In Table 1, PSNR denotes peak signal-to-noise ratio, SSIM denotes structural similarity, and Time denotes stitching time.

[0097] Peak signal-to-noise ratio (PSNR) is the most commonly used metric for evaluating the quality of reconstructed images. It is calculated based on the differences between pixels in two images; a higher PSNR value indicates a smaller difference between the reconstructed image and the original image, signifying higher quality. The specific calculation formula is as follows:

[0098] $$\mathrm{PSNR} = 10 \log_{10}\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right)$$

[0099] In the formula, $\mathrm{MAX}$ is the maximum pixel value of the image and MSE is the mean squared error between the two images. For multi-channel images, MSE is calculated as follows:

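[0100] $$\mathrm{MSE} = \frac{1}{MNC} \sum_{i=1}^{M} \sum_{j=1}^{N} \sum_{k=1}^{C} \left( X(i,j,k) - Y(i,j,k) \right)^2$$

[0101] In the formula, $M$ represents the height of the feature map in pixels, $N$ represents the width of the feature map in pixels, $C$ represents the number of channels, and $X$ and $Y$ are the two images being compared;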
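
As a non-limiting illustration, the multi-channel mean squared error and the PSNR derived from it can be computed as sketched below. Images are represented as nested lists indexed `[row][column][channel]`, and `max_value=255.0` assumes 8-bit images; both are choices made for this sketch rather than details of the embodiment.

```python
import math


def multichannel_mse(x, y):
    """MSE averaged over all M x N x C entries of two equally sized images."""
    m, n, c = len(x), len(x[0]), len(x[0][0])
    total = 0.0
    for i in range(m):
        for j in range(n):
            for k in range(c):
                total += (x[i][j][k] - y[i][j][k]) ** 2
    return total / (m * n * c)


def psnr(x, y, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    err = multichannel_mse(x, y)
    if err == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


reference = [[[10, 20, 30], [40, 50, 60]],
             [[70, 80, 90], [100, 110, 120]]]
stitched = [[[10, 20, 30], [40, 50, 60]],
            [[70, 80, 90], [100, 110, 126]]]

error = multichannel_mse(reference, stitched)  # one entry differs by 6
quality_db = psnr(reference, stitched)
```

Here a single entry differs by 6, so the squared error of 36 is averaged over 12 entries, giving an MSE of 3.0 and a PSNR of roughly 43 dB.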

[0102] Structural Similarity (SSIM) measures the similarity between two images, comprehensively evaluating them based on brightness, structure, and contrast. A higher SSIM value indicates less distortion and higher quality in the reconstructed image. The formula for calculating SSIM is:

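[0103] $$\mathrm{SSIM}(X, Y) = \frac{(2\mu_X \mu_Y + C_1)(2\sigma_{XY} + C_2)}{(\mu_X^2 + \mu_Y^2 + C_1)(\sigma_X^2 + \sigma_Y^2 + C_2)}$$

[0104] In the formula, $\mu_X$ and $\mu_Y$ are the average pixel values of the two images, respectively; $\sigma_X$ and $\sigma_Y$ are the standard deviations of the image pixels; $\sigma_{XY}$ is the covariance of $X$ and $Y$; and $C_1$ and $C_2$ are constants that stabilize the division;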
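
As a non-limiting illustration, the SSIM expression can be evaluated over a whole single-channel image as sketched below. Two simplifications are assumed: the statistics are computed globally rather than over sliding local windows, as many library implementations do, and the stabilizing constants use the conventional choices $C_1 = (0.01 \cdot \mathrm{MAX})^2$ and $C_2 = (0.03 \cdot \mathrm{MAX})^2$, which are not stated in this description.

```python
def global_ssim(x, y, max_value=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM over two flattened single-channel images.

    k1 and k2 are the conventional defaults, assumed here for illustration.
    """
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (k1 * max_value) ** 2
    c2 = (k2 * max_value) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


image_a = [10.0, 20.0, 30.0, 40.0]
image_b = [40.0, 30.0, 20.0, 10.0]

same = global_ssim(image_a, image_a)
different = global_ssim(image_a, image_b)
```

Identical images score 1, while the reversed image, whose covariance with the original is negative, scores well below 1.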

[0105] The stitching time t is the time required to stitch discrete pathological image blocks into a complete panoramic image.

[0106] Table 1. Comparison of quantitative results of various stitching algorithms under different offset standards.

[0107]

[0108] Regarding stitching quality, Table 1 shows the stitching results under different offset conditions. It can be observed that when the offset is fixed, directly stitching the pathological image blocks leads to uneven edges and imperfect pixel fusion, resulting in the worst performance among all methods. In contrast, the AutoStitch and Lucas-Kanade algorithms employ two different stitching strategies and ultimately achieve similar results. However, the stitching method proposed in this invention exhibits the best performance under various offset criteria, both in peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), two key indicators for evaluating image quality. As the offset gradually intensifies, longitudinal comparative analysis reveals that the stitching quality of all methods is affected to some extent. The direct stitching method, in particular, performs the worst due to its inability to effectively correct rotation parameters. Although the method proposed in this invention also shows some reduction in evaluation metrics, the decrease is smaller compared to traditional algorithms, and the stitching effect remains superior. This fully demonstrates the superiority and effectiveness of the stitching method proposed in this invention for the automatic stitching of pathological image blocks.

[0109] Regarding stitching speed, as shown in Table 1, the direct stitching algorithm is more efficient than the AutoStitch and Lucas-Kanade algorithms because it eliminates the registration step. However, it still requires block-by-block stitching, which limits its speed to some extent. The stitching method proposed in this invention extracts features from the input pathological image blocks using a neural network, directly outputting the stitched panoramic pathological image, thus achieving very high time efficiency, approximately 30 times faster than traditional methods. Furthermore, as the offset gradually worsens, the AutoStitch and Lucas-Kanade algorithms must calculate the offset parameters of each image individually to ensure stitching quality, leading to increased stitching time. Direct stitching skips the registration step; although the stitching time remains the same, its stitching quality drops significantly. In contrast, the stitching method proposed in this invention, because it directly outputs the fused image, does not show significant changes in stitching time when facing different degrees of offset, maintaining extremely high efficiency, which is particularly important in practical applications with large image sizes and numerous stitching tasks.

[0110] To verify the robustness of the stitching method proposed in this invention, the overlap coefficient between images was also varied at the small offset level. The overlap coefficient was set to 0.15, 0.20, and 0.25 in turn and the stitching quality was evaluated for each value; the results are shown in Table 2.

[0111] Table 2 Comparison of stitching results of various stitching algorithms at different overlap rates under small offset levels

[0112]

[0113] As can be seen from Table 2, the smaller the overlap area between adjacent pathological image blocks, the worse the stitching result. However, the image quality achieved by the stitching method proposed in this invention remains better than that of the traditional algorithms, which demonstrates the stability of the proposed method and shows that it can be applied to pathological image block stitching tasks with different overlap ratios.
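
As a non-limiting illustration of how the overlap coefficient determines the block geometry, consider a square image of side $S$ divided into an $n \times n$ grid of equal square blocks of side $s$, where adjacent blocks overlap by $l \cdot s$. If the blocks are assumed to cover the image exactly, then $n s - (n - 1) l s = S$, so $s = S / (n - (n - 1) l)$. The sketch below evaluates this for the three overlap coefficients; the image side of 4200 pixels is a hypothetical value chosen for round numbers.

```python
def tile_layout(image_side, grid=5, overlap_coeff=0.20):
    """Tile side and stride for an n x n grid of equal tiles with uniform overlap.

    Assumes the tiles cover the image exactly: n*s - (n-1)*l*s = image_side.
    """
    tile_side = image_side / (grid - (grid - 1) * overlap_coeff)
    stride = tile_side * (1.0 - overlap_coeff)
    return tile_side, stride


IMAGE_SIDE = 4200.0  # hypothetical image side in pixels

layouts = {l: tile_layout(IMAGE_SIDE, overlap_coeff=l) for l in (0.15, 0.20, 0.25)}
```

A smaller overlap coefficient yields smaller blocks with narrower shared borders, leaving less common content for alignment and fusion, which is in line with the trend reported for Table 2.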

[0114] Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention. Those skilled in the art can make changes, modifications, substitutions and variations to the above embodiments within the scope of the present invention without departing from the principles and spirit of the present invention.

Claims

1. A deep learning-based image patch stitching method based on feature extraction and alignment fusion, characterized in that it includes the following steps:
Step 1: Acquire target pathological images using a pathological slide scanner, and construct an image dataset from the multiple acquired target pathological images;
Step 2: Preprocess the target pathological images acquired in Step 1 by dividing them into discrete blocks and applying random geometric deformation, obtaining discrete pathological image blocks; then input the discrete pathological image blocks into the improved VGG alignment network to predict the geometric transformation parameters of each image block;
Step 3: Input the transformation parameters predicted in Step 2 into the STN spatial transformation network to obtain the sampler, resample the input image blocks to achieve accurate geometric correction, and map each discrete pathological image block to obtain the aligned first image block;
Step 4: Input the first image block obtained in Step 3 into the weighted fusion module. The weighted fusion module performs average pooling and max pooling on each channel of the first image block, concatenates the two pooling results by channel, and feeds them into the multilayer perceptron (MLP) to obtain the weight coefficient of each channel. The weight coefficient of each channel is multiplied with the first image block channel by channel to obtain the weighted feature map, thus completing the initial fusion;
Step 5: Input the weighted feature map obtained in Step 4 into the trained U-Net fusion network for overall feature extraction and image reconstruction to achieve final fusion and obtain a high-quality, visually seamless panoramic pathological image.

2. The deep learning image patch stitching method based on feature extraction and alignment fusion according to claim 1, characterized in that Step 2 includes the following sub-steps:
Step 2.1: Divide each target pathological image acquired in Step 1 into a 5×5 grid to obtain 25 discrete square pathological image blocks of equal size, which simulate the multiple local fields of view that may be acquired in actual scanning; assuming that adjacent square pathological image blocks share an overlapping area, the overlap coefficient $l$ is defined as the length of the overlap between adjacent square pathological image blocks divided by the side length of a block, and $l$ is selected from the set {0.15, 0.20, 0.25};
Step 2.2: To train the improved VGG alignment network to handle various possible misalignment situations, apply a random affine transformation to each divided image block: the offset distances along the x-axis and y-axis are randomly selected within a set offset range, and the rotation angle $\theta$ is randomly selected within a set rotation angle range; the affine transformation matrix $M_{ij}$ is calculated from the selected offset distances and rotation angle:
$$M_{ij} = \begin{bmatrix} \cos\theta & -\sin\theta & t_x \\ \sin\theta & \cos\theta & t_y \\ 0 & 0 & 1 \end{bmatrix}$$
In the formula, $\theta$ is the rotation angle of the square pathological image patch, $t_x$ is the offset distance of the patch along the x-axis, and $t_y$ is the offset distance of the patch along the y-axis. The matrix $M_{ij}$ serves as the "real label" for training the improved VGG alignment network and maps the original image patch to the transformed image patch. Specifically, the pixel coordinates of the divided square pathological image patch are multiplied by the transformation matrix to obtain the transformed pathological image patch; thus, a data pair of "transformed image patch - real transformation matrix" is obtained;
Step 2.3: Input the transformed image patch into the improved VGG alignment network for training to predict the geometric transformation parameters of each image patch.

3. The deep learning image patch stitching method based on feature extraction and alignment fusion according to claim 2, characterized in that: The improved VGG alignment network consists of five groups of 3×3 convolutional layers. The first and second groups each consist of two convolutional layers for initial feature capture; the third, fourth, and fifth groups each consist of three convolutional layers for mining deep semantic information. Each group of convolutional layers is followed by a max-pooling layer to halve the feature map size and reduce the number of parameters. An improved branch design is introduced between the fifth and seventh convolutional layers, and between the eighth and tenth convolutional layers. This design achieves feature fusion of different depths by directly concatenating the input and output of the current group of convolutional layers, thereby enhancing the extraction capability of complex pathological features. The last max-pooling layer is followed by two fully connected layers for parameter regression.

4. The deep learning image patch stitching method based on feature extraction and alignment fusion according to claim 3, characterized in that the loss function of the improved VGG alignment network is defined as the mean squared error between the predicted transformation parameters and the true transformation matrix, with the specific expression:
$$L_{\text{VGG}} = \frac{1}{N} \sum_{i=1}^{N} \left\| \hat{M}_i - M_i \right\|_2^2$$
where $L_{\text{VGG}}$ represents the loss function of the VGG alignment network, $N$ represents the total number of images in the dataset, $i$ indexes the $i$-th training sample, that is, the $i$-th data pair of "transformed image patch - true transformation matrix", and $\hat{M}_i$ and $M_i$ represent the predicted transformation parameters and the true transformation matrix of the $i$-th sample, respectively.

5. The deep learning image patch stitching method based on feature extraction and alignment fusion according to claim 1, characterized in that in Step 4, the multilayer perceptron (MLP) obtains the weight coefficients of each channel of the first image block using the first formula, which is:
$$W = \sigma\left(\mathrm{MLP}\left(\left[\mathrm{AvgPool}(I);\ \mathrm{MaxPool}(I)\right]\right)\right)$$
where $\sigma$ is the Sigmoid activation function, $\mathrm{AvgPool}$ and $\mathrm{MaxPool}$ represent average pooling and max pooling, respectively, $[\,\cdot\,;\,\cdot\,]$ denotes the channel-wise concatenation of the two pooling results, and $I$ represents the input feature map.

6. The deep learning image patch stitching method based on feature extraction and alignment fusion according to claim 1, characterized in that the U-Net fusion network described in Step 5 includes a contracting path and an expanding path, which form a U-shaped structure. The contracting path contains four contracting convolutional modules. Each module contains two sets of convolutional layers, two sets of batch normalization (BN) layers, and two sets of ReLU layers, followed by a max pooling layer for downsampling. The convolutional layers use 3×3 kernels to extract feature maps, and the max pooling layer has a stride of 2, halving the output size of each contracting convolutional module. The expanding path contains four expanding convolutional modules. Each module contains a deconvolutional layer, a BN layer, and a ReLU layer for upsampling. After each upsampling, the result is concatenated with the feature map of the same scale from the contracting path, followed by a double convolutional module. The double convolutional module contains two sets of 3×3 convolutional layers, one BN layer, and one ReLU layer for feature fusion.

7. The deep learning image patch stitching method based on feature extraction and alignment fusion according to claim 6, characterized in that the loss function of the U-Net fusion network is a weighted sum of the mean squared error loss and the perceptual loss, with the specific expression:
$$L_{\text{U-Net}} = \lambda_1 L_{\text{MSE}} + \lambda_2 L_{\text{perc}}$$
In the formula, $L_{\text{U-Net}}$ is the loss function of the U-Net fusion network, $L_{\text{MSE}}$ is the mean squared error loss, $L_{\text{perc}}$ is the perceptual loss, and $\lambda_1$ and $\lambda_2$ are the weighting coefficients for the mean squared error loss and the perceptual loss, respectively, where the weight of the mean squared error loss $\lambda_1$ is 1 and the weight of the perceptual loss $\lambda_2$ is 2;
the mean squared error loss is calculated as follows:
$$L_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \left\| F(x_i) - y_i \right\|_2^2$$
where $x_i$ represents the input to the U-Net fusion network, $F(x_i)$ represents the output of the U-Net fusion network, $y_i$ represents the true panoramic image corresponding to the input image patches, and $N$ represents the total number of images in the dataset;
in the formula for calculating the perceptual loss, the three size terms denote the height of the weighted feature map in pixels, the width of the weighted feature map in pixels, and the number of channels in the feature map.

8. The deep learning image patch stitching method based on feature extraction and alignment fusion according to claim 1, characterized in that both the improved VGG alignment network and the U-Net fusion network are built using PyTorch and trained on an NVIDIA RTX 4070 GPU using the Adam optimizer, with an initial learning rate set to... A cosine annealing strategy dynamically adjusts the learning rate so that it decreases smoothly and stably. The improved VGG alignment network is trained with a batch size of 16 for 300 epochs. The weighted fusion module and the U-Net fusion network are trained together, with a batch size of 16 for 100 epochs.

9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that: When the processor executes the computer program, it implements the method according to any one of claims 1 to 8.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by a computer, it implements the method described in any one of claims 1 to 8.