Method for registering image sequences

By decomposing the absolute displacement of the image sequence and fine-tuning it with a deep learning network, the problem of overcorrection in image registration is solved, achieving high-precision physiological motion preservation and random perturbation compensation, and providing an efficient image registration method.

CN121937501BActive Publication Date: 2026-06-26江苏富翰医疗产业发展有限公司

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
江苏富翰医疗产业发展有限公司
Filing Date
2026-03-30
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing image registration methods are prone to overcorrection when eliminating jitter in image sequences, which can damage the real physiological motion structure. Furthermore, deep learning methods that rely on a large amount of labeled data cannot guarantee physical rationality.

Method used

By decomposing the absolute displacement of the image sequence, the trend component of physiological motion and the high-frequency residual of random perturbation are extracted. A deep learning network is then used for local deformation fine-tuning. The compensation displacement is determined by combining the high-frequency residual and the displacement fine-tuning amount, thus achieving accurate image registration.

Benefits of technology

While compensating for random perturbations, it preserves physiological motion trends, avoids over-correction of registration results, and provides high-precision image registration effects.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN121937501B_ABST
    Figure CN121937501B_ABST
Patent Text Reader

Abstract

The application relates to the technical field of medical image processing, and provides a registration method of an image sequence. The method first calculates relative displacement of adjacent frames, accumulates the absolute displacement, decomposes the absolute displacement, extracts a trend component representing physiological motion and a high-frequency residual representing random disturbance, simultaneously inputs a current frame and a reference frame into a displacement fine-tuning network to obtain a displacement fine-tuning amount, adds the high-frequency residual after inversion to the displacement fine-tuning amount to obtain a compensation displacement, and performs spatial transformation on the image. The method separates motion components, compensates for random jitter while retaining the physiological motion trend, avoids overcorrection caused by integral accumulation error or end-to-end learning, and realizes high-precision image sequence registration.
Need to check novelty before this filing date? Find Prior Art

Description

Technical Field

[0001] This application relates to the field of medical image processing technology, and in particular to a method for registering image sequences. Background Technology

[0002] Imaging techniques that acquire three-dimensional volumetric data through cross-sectional acquisition, such as optical coherence tomography (OCT) and computed tomography (CT), can obtain the three-dimensional microstructure of biological tissues or objects, and are widely used in clinical diagnosis, industrial testing, and scientific research analysis. However, in actual acquisition, inter-frame displacement caused by the physiological movement, mechanical vibration, or environmental interference of the scanned object—i.e., motion artifacts—can disrupt the spatial continuity of volumetric data, leading to poor results in three-dimensional reconstruction and quantitative analysis.

[0003] To eliminate motion artifacts, registration methods based on cross-correlation, phase correlation, and deep learning can be employed. Cross-correlation estimates displacement by calculating the correlation between sliding windows of images, while phase correlation uses frequency-domain cross-correlation phase information to solve for sub-pixel translation. Both are simple in principle but sensitive to local deformations and struggle to handle rotation. Deep learning methods construct end-to-end regression networks to directly predict transformation parameters, achieving rapid registration after training.

[0004] However, none of the above methods can distinguish between the macroscopic physiological motion trend and high-frequency jitter components in the displacement sequence, which makes it impossible for the compensation process to retain the true low-frequency motion. The registration result is prone to overcorrection while eliminating jitter. Summary of the Invention

[0005] This application provides a registration method for image sequences to solve the problem that the registration results are prone to overcorrection while eliminating jitter.

[0006] This application provides a method for registering image sequences, including:

[0007] Obtain the image sequence to be registered, the image sequence comprising multiple frames of two-dimensional images;

[0008] Calculate the relative displacement between two adjacent two-dimensional images in the image sequence, and accumulate multiple relative displacements to obtain the absolute displacement of each image relative to the reference image.

[0009] The absolute displacement is decomposed to obtain a trend component and a high-frequency residual. The trend component is used to characterize physiological motion, and the high-frequency residual is used to characterize random perturbation.

[0010] The current frame image and the reference frame image are input into the displacement fine-tuning network to obtain the displacement fine-tuning amount of the current frame image relative to the reference frame image;

[0011] The compensation displacement of the current frame image is determined based on the high-frequency residual and the displacement fine-tuning amount.

[0012] Based on the compensated displacement, the current frame image is spatially transformed to obtain a registered image;

[0013] By traversing multiple frames of two-dimensional images in the image sequence, a registration image sequence is obtained, which includes registration images corresponding to multiple frames of two-dimensional images.

[0014] The method decomposes the absolute displacement to obtain the trend component representing physiological motion and the high-frequency residual representing random disturbance. Then, it combines the high-frequency residual with the displacement fine-tuning amount to determine the compensation displacement. This method can retain the physiological motion trend while compensating for random disturbances and avoid over-correction of the registration results.

[0015] In some feasible embodiments, before calculating the relative displacement between two adjacent two-dimensional images in the image sequence, the method further includes:

[0016] Traverse the two-dimensional images in the image sequence and calculate the global minimum gray level;

[0017] Subtract the global minimum gray value from the gray value of each pixel in each frame of the image to obtain a non-negative gray image;

[0018] Divide the gray value of each pixel in the non-negative grayscale image by the preset maximum gray value and map it to the normalization interval to obtain the preprocessed image sequence.

[0019] The calculation of the relative displacement between two adjacent two-dimensional images in the image sequence includes:

[0020] Fourier transform is performed on two adjacent frames of the preprocessed image sequence to obtain a first frequency domain image and a second frequency domain image.

[0021] Calculate the normalized cross power spectrum based on the first frequency domain image and the second frequency domain image;

[0022] Perform an inverse Fourier transform on the normalized cross-power spectrum to obtain the spatial domain function;

[0023] Locate the peak position of the spatial domain function;

[0024] Based on the peak position, the relative displacement between two adjacent two-dimensional images is determined, and the accuracy of the relative displacement is at the sub-pixel level.

[0025] By preprocessing the image grayscale values ​​to a normalized range, and combining Fourier transform and peak localization, subpixel-level relative displacement is obtained, providing input data for the decomposition of absolute displacement.

[0026] In some feasible embodiments, the reference frame image is the first frame image in the image sequence;

[0027] The accumulation of multiple relative displacements to obtain the absolute displacement of each frame image relative to the reference frame image includes:

[0028] The absolute displacement of the reference frame image is set to zero;

[0029] The relative displacement of the second frame image in the image sequence with respect to the first frame image is determined as the absolute displacement of the second frame image;

[0030] The relative displacements of the third frame and subsequent frames in the image sequence are sequentially added to the absolute displacement of the previous frame to obtain the absolute displacement of each frame relative to the first frame.

[0031] Using the first frame as a reference frame, an absolute displacement sequence is constructed by accumulating the relative displacement frame by frame, which ensures the continuity of trend component extraction and high-frequency residual calculation.

[0032] In some feasible embodiments, the decomposition of the absolute displacement to obtain trend components and high-frequency residuals includes:

[0033] The absolute displacement is filled by boundary reflection to obtain the filled displacement sequence;

[0034] A one-dimensional convolutional smoothing filter is used to process the padded displacement sequence to extract the trend component representing physiological movement. The smoothing window size of the one-dimensional convolutional smoothing filter is an odd number, and a normalized rectangular window function is set in the smoothing window. The rectangular window function is used to perform a weighted average on the padded displacement sequence.

[0035] Subtracting the trend component from the filled displacement sequence yields a high-frequency residual characterizing the random disturbance.

[0036] By extracting the trend component through boundary reflection filling and one-dimensional convolution smoothing filtering, the high-frequency residual is obtained by subtracting the trend component from the absolute displacement, thus achieving the separation of physiological motion from random disturbance.

[0037] In some feasible embodiments, the displacement fine-tuning network is a convolutional neural network, which includes convolutional layers, batch normalization layers, activation function layers, pooling layers, and fully connected layers, with the batch normalization layers and activation function layers disposed after the convolutional layers;

[0038] The step of inputting the current frame image and the reference frame image into the displacement fine-tuning network to obtain the displacement fine-tuning amount of the current frame image relative to the reference frame image includes:

[0039] The current frame image and the reference frame image are stitched together along the channel dimension to form a dual-channel input image;

[0040] After the dual-channel input image is input into multiple convolutional layers for feature extraction, an extracted feature map is obtained;

[0041] The extracted feature map is input into the pooling layer for spatial dimension compression to obtain a compressed feature map;

[0042] After flattening the compressed feature map, it is input into the fully connected layer, and a two-dimensional displacement vector is output, which is a displacement fine-tuning amount.

[0043] By employing a convolutional neural network, through dual-channel input and feature extraction, a two-dimensional displacement vector is output, providing fine-tuning capability for local deformation.

[0044] In some feasible embodiments, the displacement fine-tuning network is pre-trained in a supervised manner, the pre-training including:

[0045] Obtain a pre-registered image sequence, wherein each frame of the pre-registered image sequence has a corresponding real displacement label;

[0046] The current frame image and the reference frame image in the pre-registered image sequence are input into the displacement fine-tuning network to be trained to obtain the predicted displacement fine-tuning amount;

[0047] Calculate the mean square error between the predicted displacement fine-tuning amount and the actual displacement label;

[0048] Based on the mean square error, the parameters of the displacement fine-tuning network to be trained are updated to obtain the displacement fine-tuning network.

[0049] A supervised training method was adopted to train the network using real displacement labels of pre-registered image sequences. The network parameters were optimized by mean square error, enabling the displacement fine-tuning network to accurately predict local fine-tuning amounts.

[0050] In some feasible embodiments, the displacement fine-tuning network is trained online in an unsupervised manner, and the online training includes:

[0051] Calculate the loss function value, which is the sum of squares of the displacement fine-tuning amounts;

[0052] Based on the loss function value, the parameters of the translation fine-tuning network to be trained are updated by backpropagation to learn the implicit transformation that aligns the current frame image with the reference frame image, thus obtaining the translation fine-tuning network.

[0053] Unsupervised online training is adopted, and the sum of squares of displacement fine-tuning is used as the loss function to backpropagate and update parameters, enabling the network to learn implicit transformations for image alignment, thus adapting to application scenarios without pre-registration data.

[0054] In some feasible embodiments, determining the compensation displacement of the current frame image based on the high-frequency residual and the displacement fine-tuning amount includes:

[0055] The negative value of the high-frequency residual is used to obtain the reverse compensation value;

[0056] The reverse compensation value and the displacement fine-tuning amount are weighted and summed to obtain the compensation displacement, which is used to compensate for random disturbances and local deformations.

[0057] The compensation displacement is obtained by taking the negative value of the high-frequency residual and weighting it with the displacement fine-tuning amount. This process incorporates local fine-tuning while compensating for random disturbances, ensuring that the registration result takes into account both trend preservation and deformation correction.

[0058] In some feasible embodiments, the step of performing a spatial transformation on the current frame image based on the compensated displacement to obtain a registered image includes:

[0059] Based on the components of the compensated displacement in the depth and lateral directions, a two-dimensional translation transformation matrix is ​​constructed;

[0060] The pixel coordinates of the current frame image are mapped using the two-dimensional translation transformation matrix to generate a sampling grid;

[0061] For the current frame image, bilinear interpolation resampling is performed according to the sampling grid to obtain a registered image. The resampling is used to process boundary pixels using a reflection filling method.

[0062] A two-dimensional translation transformation matrix is ​​constructed based on the compensated displacement to generate a sampling grid. The registered image is obtained through bilinear interpolation resampling and reflection filling, thus realizing the execution of spatial transformation.

[0063] In some feasible embodiments, after traversing multiple frames of two-dimensional images in the image sequence to obtain the registered image sequence, the method further includes:

[0064] From the registered image sequence, continuous target regions are obtained along the depth direction to obtain a cropped image sequence;

[0065] For each column of pixels in the cropped image sequence, calculate the grayscale mean along the depth direction to obtain the projection value matrix;

[0066] The projection value matrix is ​​converted into a two-dimensional grayscale image to generate a two-dimensional frontal projection image.

[0067] A two-dimensional frontal projection image is generated by acquiring the target region along the depth direction from the registered image sequence and calculating the gray-scale mean, which can provide visualization results.

[0068] As can be seen from the above technical solutions, this application provides a method for registering an image sequence. The method includes: acquiring an image sequence comprising multiple frames of two-dimensional images to be registered; calculating the relative displacement between two adjacent frames of two-dimensional images in the image sequence; accumulating multiple relative displacements to obtain the absolute displacement of each frame image relative to a reference frame image; decomposing the absolute displacement to obtain a trend component and a high-frequency residual, wherein the trend component is used to characterize physiological motion and the high-frequency residual is used to characterize random disturbances; inputting the current frame image and the reference frame image into a displacement fine-tuning network to obtain the displacement fine-tuning amount of the current frame image relative to the reference frame image; determining the compensation displacement of the current frame image based on the high-frequency residual and the displacement fine-tuning amount; performing a spatial transformation on the current frame image based on the compensation displacement to obtain a registered image; traversing the multiple frames of two-dimensional images in the image sequence to obtain a registered image sequence, wherein the registered image sequence includes registered images corresponding to the multiple frames of two-dimensional images. The method decomposes the absolute displacement to obtain the trend component representing physiological motion and the high-frequency residual representing random disturbance. Then, it combines the high-frequency residual with the displacement fine-tuning amount to determine the compensation displacement. This method can retain the physiological motion trend while compensating for random disturbances and avoid over-correction of the registration results. Attached Figure Description

[0069] To more clearly illustrate the technical solution of this application, the drawings used in the embodiments will be briefly introduced below. Obviously, for those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0070] Figure 1 A schematic flowchart illustrating the image sequence registration method provided in this application embodiment;

[0071] Figure 2 This is a schematic diagram of the displacement fine-tuning network provided in the embodiments of this application;

[0072] Figure 3 This is a schematic diagram of the training process of the displacement fine-tuning network provided in the embodiments of this application;

[0073] Figure 4 This is a schematic diagram of the first type of registration image comparison provided in the embodiments of this application;

[0074] Figure 5 This is a schematic diagram of a second type of registration image comparison provided in an embodiment of this application. Detailed Implementation

[0075] The embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The embodiments described in the following examples do not represent all embodiments consistent with this application.

[0076] Imaging technologies that acquire three-dimensional volume data through cross-sectional acquisition, such as optical coherence tomography (OCT) and computed tomography (CT), can obtain three-dimensional structural information of tissues in a non-invasive manner. In ophthalmic clinical diagnosis, OCT can be used for morphological evaluation of structures such as the retina and optic nerve head; in the cardiovascular field, OCT can be used to observe the microstructure of blood vessel walls; in industrial inspection, CT can be used to analyze internal defects in workpieces. These applications all rely on three-dimensional volume data constructed from continuously acquired two-dimensional image sequences.

[0077] Taking OCT as an example, OCT is applied in the field of imaging the microstructure of biological tissues, enabling the acquisition of three-dimensional structural information within tissues in a non-invasive manner. In ophthalmological clinical diagnosis, OCT is used for morphological assessment of structures such as the retina and optic nerve head; in the cardiovascular field, OCT can be used to observe the microstructure of blood vessel walls; in dermatology, OCT can visualize the layered structure of the epidermis and superficial dermis. These applications all rely on three-dimensional volumetric data constructed from continuously acquired B-scan image sequences.

[0078] In actual imaging processes, due to the physiological activities of the subjects and the physical limitations of the imaging equipment, inter-frame shifts are unavoidable in the acquired image sequences. Taking ophthalmic OCT imaging as an example, the subject's eyeballs exhibit various types of movements. For instance, slow drift is a continuous, low-frequency movement caused by changes in the tension of the eye muscles; another example is micro-eye movements, which are high-frequency, small-amplitude random jerks used to prevent adaptation of retinal photoreceptor cells; in addition, respiration and heartbeat can also cause periodic micro-movements of the eyeballs through head conduction.

[0079] These motion components overlap, resulting in both continuous physiological motion trends and random jitter interference between adjacent B-scan images. If these motions are not effectively corrected, the reconstructed three-dimensional volume will exhibit structural breaks or distortions, affecting the quantitative analysis results.

[0080] To address the aforementioned issues, various image registration methods exist. For example, cross-correlation-based registration methods, or cross-correlation methods, estimate the displacement by calculating the sliding window correlation between two images in the spatial domain; the cross-correlation function reaches its peak when the two images are aligned. To improve computational efficiency, fast Fourier transform can be used to calculate the cross-correlation in the frequency domain. This method is intuitive, easy to implement, and has a certain degree of robustness to noise.

[0081] However, cross-correlation requires high grayscale consistency in the image. When there are local contrast variations or uneven illumination in the image, the correlation peak becomes flat, leading to a decrease in displacement estimation accuracy. Furthermore, cross-correlation assumes global translation between images, making it difficult to handle local deformations or rotations. In OCT images, due to the complexity of tissue structures and noise interference, cross-correlation often results in mismatches in low signal-to-noise ratio regions.

[0082] Building upon the cross-correlation method, a phase-correlation-based registration method is introduced, incorporating frequency domain analysis. The relative displacement is estimated using the cross-correlation phase information of two images in the frequency domain. By calculating the normalized cross-power spectrum and performing an inverse Fourier transform, the impulse function in the spatial domain is obtained, with its peak position corresponding to a sub-pixel displacement vector. The phase-correlation method is computationally efficient, can directly output sub-pixel accuracy, and is robust to changes in image grayscale; therefore, it is often used for initial alignment of OCT images. However, it still assumes only translational motion between the images and is insensitive to complex transformations such as rotation and scaling.

[0083] When images contain noise, local distortions, or low-contrast areas, cross-correlation peaks are prone to blurring or spurious peaks, leading to incorrect displacement estimation. More importantly, phase correlation methods can only estimate the relative displacement between adjacent frames. To obtain the absolute motion trajectory of the entire sequence, the relative displacement needs to be integrated. This integration process accumulates errors and cannot distinguish between the actual physiological movements of the eye and high-frequency random jitter. Direct integration compensates for both physiological movements and jitter, resulting in overcorrection and damaging the true anatomical structure.

[0084] Deep learning-based registration methods typically employ supervised registration, which constructs an end-to-end regression network. The input consists of image pairs to be registered, and the output is spatial transformation parameters. During network training, known registration ground values ​​are used as supervision signals. These methods can learn complex nonlinear transformation relationships and, theoretically, can handle complex situations such as local deformations.

[0085] However, supervised deep learning heavily relies on large-scale, high-quality labeled data, and obtaining accurate OCT volume ground truth is extremely costly. Manual labeling is not only time-consuming and labor-intensive, but also difficult to accurately label sub-pixel displacements due to the resolution limitations of the OCT image itself. Relying on other registration algorithms to generate pseudo-labels may introduce errors from the algorithm itself.

[0086] While unsupervised or self-supervised methods do not require ground truth, they often necessitate the design of complex loss functions, such as image similarity-based metrics. These methods are difficult to train and struggle to guarantee the physical plausibility of the registration. More importantly, existing deep learning methods directly map the entire image to transformation parameters, ignoring the inherent motion characteristics of OCT image sequences. Macroscopic physiological motion and high-frequency jitter have different frequency characteristics and physical meanings. During training, the network tends to learn low-frequency trends along with high-frequency jitter, resulting in registration that, while eliminating jitter, also smooths out real physiological motion.

[0087] In summary, OCT image registration techniques share common limitations in addressing motion artifacts. They fail to effectively distinguish between macroscopic physiological eye movements and high-frequency random jitter, resulting in either over-correction that destroys the true structure or under-correction that leaves artifacts. Phase correlation methods are sensitive to noise and suffer from cumulative errors, while deep learning methods rely on large amounts of labeled data and struggle to guarantee physical plausibility.

[0088] To address the aforementioned issues, this application provides a registration method for image sequences. By decomposing the absolute displacement sequence, the macroscopic physiological motion trend is explicitly separated from the high-frequency random perturbation. Then, a deep learning network is used to fine-tune the local deformation. Finally, the high-frequency residual is synthesized with the network-predicted displacement, achieving high-precision registration that preserves the real physiological motion while compensating for random perturbations.

[0089] like Figure 1 As shown, the method includes the following steps S101-S107.

[0090] S101: Obtain the OCT image sequence to be registered. The image sequence includes multiple frames of two-dimensional images.

[0091] In this embodiment, the OCT image sequence to be registered can be the raw data acquired by the optical coherence tomography device through continuous scanning of the retina.

[0092] Taking optical coherence tomography (OCT) as an example, the raw data can be a three-dimensional volumetric data V, whose dimensions (Y, Z, X) are defined by three parameters: Y represents the number of B-scan image frames, i.e., the number of images along the temporal or spatial scanning direction; Z represents the number of pixels in the depth direction, i.e., the length of the A-scan signal (A-scan) in each B-scan image; and X represents the number of pixels in the horizontal width, i.e., the number of horizontal points in the B-scan image. Therefore, the image sequence is a data set including multiple frames of two-dimensional images, with each frame (i.e., a B-scan) having dimensions of Z rows and X columns, and Y determining the total number of image frames included in the image sequence.

[0093] Multiple frames of two-dimensional images can be arranged in the order of scanning time or spatial order to form raw data reflecting the three-dimensional structure of the tissue.

[0094] Besides OCT, this application is also applicable to imaging technologies that acquire three-dimensional volume data through cross-sectional acquisition, such as computed tomography (CT), magnetic resonance imaging (MRI), and ultrasound imaging. For example, in CT, the image sequence to be registered can be a series of continuously acquired tomographic slice images, which undergo inter-frame displacement due to respiratory motion or changes in patient position; in MRI, the image sequence to be registered can be two-dimensional images acquired through multi-phase dynamic scanning, which produce motion artifacts due to heartbeat or respiration; in ultrasound imaging, the image sequence to be registered can be a series of continuously acquired two-dimensional ultrasound images, which undergo displacement due to probe movement or tissue motion. The image sequences acquired by the above imaging technologies can all be used as OCT image sequences to be registered.

[0095] After acquiring the OCT image sequence to be registered, in order to eliminate grayscale differences caused by different acquisition conditions and improve the stability and accuracy of subsequent processing, in some embodiments, the two-dimensional images in the image sequence are traversed and the global minimum grayscale value is calculated; the grayscale value of each pixel in each frame image is subtracted from the global minimum grayscale value to obtain a non-negative grayscale image; the grayscale value of each pixel in the non-negative grayscale image is divided by the preset maximum grayscale value and mapped to the normalization interval to obtain the preprocessed image sequence.

[0096] The above steps are the preprocessing steps for the image sequence. First, the original three-dimensional volume data V is normalized to map the gray values ​​of the image to a standard numerical range. Specifically, all two-dimensional images in the acquired image sequence are traversed to find and calculate the global minimum value min(V) and the global maximum value max(V) of the gray values ​​of all pixels. Then, for each frame of the image, the original gray value of each pixel is converted according to the following formula:

[0097] ;

[0098] in, It is a non-zero value, used to prevent the denominator from being zero and to ensure the stability of the calculation.

[0099] The above formula can map the original grayscale range to a normalized interval, for example, between [0, 1]. After the above processing, a preprocessed image sequence is obtained, providing input data with uniform format and optimized quality for displacement calculation.

[0100] S102: Calculate the relative displacement between two adjacent two-dimensional images in the image sequence, and accumulate multiple relative displacements to obtain the absolute displacement of each image relative to the reference image.

[0101] In this step, motion estimation is performed on the preprocessed image sequence to obtain the precise spatial location information of each frame.

[0102] Relative displacement refers to the amount of relative motion between two adjacent image frames. For example, the displacement of image I2 relative to image I1 describes the motion of the subject or imaging device during the time from the capture of I1 to the capture of I2.

[0103] Absolute displacement is the position of each frame of an image relative to a fixed reference frame. In this embodiment, a frame in the image sequence is selected as the reference frame image. For example, the first frame image is the reference frame image, and the position of this frame is defined as the zero point. The positions of all other frames are relative to this frame. By successively accumulating the relative displacements between adjacent frames, the absolute displacement sequence of each frame image relative to the reference frame can be constructed, thereby completely obtaining the motion trajectory during the scanning process.

[0104] To achieve high-precision displacement estimation, this embodiment uses the phase correlation method based on Fourier transform to calculate the relative displacement. The phase correlation method based on Fourier transform has significant advantages in terms of noise resistance and sub-pixel accuracy.

[0105] In some embodiments, Fourier transforms are performed on two adjacent frames of two-dimensional images in the preprocessed image sequence to obtain a first frequency domain image and a second frequency domain image; the normalized cross power spectrum is calculated based on the first frequency domain image and the second frequency domain image; an inverse Fourier transform is performed on the normalized cross power spectrum to obtain a spatial domain function; the peak position of the spatial domain function is located; and the relative displacement between two adjacent frames of two-dimensional images is determined based on the peak position, with the accuracy of the relative displacement being at the sub-pixel level.

[0106] Specifically, for the preprocessed image sequence, select any pair of adjacent two-dimensional images, let the (i-1)th frame image be... The i-th frame image is .

[0107] First, perform a two-dimensional Fourier transform on each of the two images to convert them from the spatial domain to the frequency domain. The corresponding frequency domain image is obtained using the following formula:

[0108] ;

[0109] ;

[0110] Where F represents the Fourier transform operation. Represents the coordinates in the frequency domain, corresponding to the frequency components in the spatial domain.

[0111] In this process, the Fourier transform operation decomposes the gray-level distribution of the image into sinusoidal components of different frequencies. The amplitude spectrum reflects the intensity of each frequency component, while the phase spectrum preserves the structural and positional information of the image. Next, based on these two frequency domain images, the normalized cross-power spectrum is calculated using the following formula:

[0112] ;

[0113] in, express The complex conjugate of , |·| denotes taking the modulus of the complex number.

[0114] The numerator is the cross-power spectrum of the two frequency domain images, containing phase difference information between them; the denominator normalizes the cross-power spectrum to eliminate the influence of amplitude information. Therefore, the normalized cross-power spectrum... Only the phase difference information is retained, so the displacement estimation based on it is not sensitive to grayscale changes in the image content and has good robustness.

[0115] After obtaining the normalized cross-power spectrum, an inverse Fourier transform is performed on it to convert it back from the frequency domain to the spatial domain, resulting in a spatial domain correlation function r(dx,dz), where dx and dz represent the displacement variables in the lateral and depth directions, respectively. According to the properties of the Fourier transform, this spatial domain correlation function r(dx,dz) is ideally a Dirac impulse function, and its peak position corresponds to the relative displacement between two frames.

[0116] Therefore, by locating the peak position of this spatial domain correlation function, the relative displacement between two adjacent two-dimensional images can be determined. Specifically, it is necessary to find the displacement coordinates (dx, dz) that maximize r(dx, dz), as calculated in the following formula:

[0117] ;

[0118] in, This represents the horizontal displacement of the i-th frame relative to the (i-1)-th frame. This indicates displacement in the depth direction.

[0119] When locating a peak, you can first find the peak coordinates at integer pixel positions, and then use the pixel values ​​around the peak to calculate the precise sub-pixel position through surface fitting or interpolation methods.

[0120] If the initial displacement estimation has a large error, these errors will be amplified during subsequent accumulation, ultimately affecting the registration quality. In this embodiment, since the Fourier phase correlation method utilizes continuous frequency domain information, the displacement estimation accuracy it can achieve can be much smaller than a single pixel unit, i.e., sub-pixel accuracy. For example, by using methods such as parabolic fitting or Gaussian fitting, displacement values ​​with an accuracy of 0.1 pixels or even higher can be calculated from discrete correlation function values.

[0121] Repeating the above operation for all adjacent frames in the image sequence yields a complete relative displacement sequence. Specifically, for a sequence containing Y frames, Y-1 relative displacement pairs can be obtained, where the value of i ranges from 1 to Y-1. For ease of subsequent processing, the displacement of the first frame (reference frame) is set to zero, but this is only used as the starting point for accumulation and does not participate in the displacement calculation of adjacent frames.

[0122] After obtaining the relative displacements of all adjacent frames, these relative displacements need to be accumulated to construct the absolute displacement of each frame image relative to a common reference frame. As described above, in this embodiment, the first frame image in the image sequence is set as the reference frame image.

[0123] In some embodiments, the absolute displacement of the reference frame image is set to zero; the relative displacement of the second frame image in the image sequence relative to the first frame image is determined as the absolute displacement of the second frame image; the relative displacements corresponding to the third frame image and subsequent frame images in the image sequence are sequentially accumulated onto the absolute displacement of the previous frame image to obtain the absolute displacement of each frame image relative to the first frame image.

[0124] The accumulation process specifically includes: First, setting the absolute displacement of the reference frame image (i.e., the first frame image I1) to zero, that is... , For each subsequent frame in the image sequence, the absolute displacement of the current frame is obtained by sequentially summing all adjacent relative displacements preceding the current frame. The calculation formula is as follows:

[0125] ;

[0126] ;

[0127] ;

[0128] Where k represents the frame index number, which is the number of all adjacent frame pairs from frame 1 to frame i.

[0129] By accumulating frame by frame, the absolute displacement of each frame relative to the first reference frame can be obtained, forming a complete absolute displacement sequence. This sequence records the spatial drift trajectory of each frame from the start to the end of the scan.

[0130] It is understandable that, in this embodiment, the Fourier phase correlation method can directly utilize frequency domain phase information, has high computational efficiency, and can output sub-pixel level accuracy. However, other methods that can estimate inter-image displacement, such as the aforementioned cross-correlation-based time-domain method and feature point matching method, can also be used to implement some steps of this process, provided that the sub-pixel accuracy requirement is met.

[0131] S103: Decompose the absolute displacement to obtain the trend component and the high-frequency residual. The trend component is used to characterize physiological motion, and the high-frequency residual is used to characterize random disturbance.

[0132] After obtaining the absolute displacement sequence, the absolute displacement sequence is decomposed. The purpose of the decomposition is to separate the different motion components mixed in the displacement sequence.

[0133] Among them, the trend component is the slowly changing and continuous part of the displacement sequence, which reflects the low-frequency motion caused by physiological activities such as macroscopic eye saccades, respiratory rhythm, and heartbeat. This part of the motion is the real physiological state of biological tissue and should be preserved in the registration process.

[0134] High-frequency residuals are the rapidly changing and highly random parts of the displacement sequence, mainly composed of high-frequency random disturbances caused by eye tremors, equipment mechanical vibrations, etc. These disturbances are undesirable noises introduced during the imaging process and need to be compensated and eliminated during registration.

[0135] To achieve separation, this embodiment employs a smoothing filter technique. In some embodiments, the absolute displacement is decomposed to obtain a trend component and a high-frequency residual, including: performing boundary reflection filling on the absolute displacement to obtain a filled displacement sequence; using a one-dimensional convolutional smoothing filter to process the filled displacement sequence to extract the trend component representing physiological motion, wherein the smoothing window size of the one-dimensional convolutional smoothing filter is an odd number, and a normalized rectangular window function is provided within the smoothing window, which is used to perform a weighted average on the filled displacement sequence; and subtracting the trend component from the filled displacement sequence to obtain the high-frequency residual representing random perturbation.

[0136] Specifically, the decomposition process first requires boundary reflection filling of the absolute displacement sequence. Boundary reflection filling involves reflecting the values ​​inside the sequence to the outside at the beginning and end of the sequence, with the boundary points as the axis of symmetry. This can eliminate the boundary effects that may be caused by subsequent filtering and ensure that the points at both ends of the sequence can also obtain effective filtering results.

[0137] Then, a one-dimensional convolutional smoothing filter is used to process the padded displacement sequence. The one-dimensional convolutional smoothing filter uses a sliding window to perform a weighted average of the sequence. The filtering method used in this embodiment is equivalent to a low-pass filter, which can effectively retain the slowly changing low-frequency components in the sequence while suppressing the rapidly changing high-frequency components.

[0138] For example, in a practical implementation, a smoothing window of odd size is set, and the smoothing window size... The number of points can be set to 5, 7, or 9 to ensure the symmetry of the window center. A normalized rectangular window function is set within the smoothing window. This rectangular window function is used to perform an equal-weighted average of the displacement sequence data within the window's coverage area, thereby calculating the smoothing value at the window's center point. By sliding the window from the start point to the end point of the sequence, a complete and smooth trend curve can be obtained, representing the trend component of physiological movement.

[0139] Specifically, for the lateral absolute displacement corresponding to the i-th frame image Its smoothed trend component Calculated in the following way:

[0140] ;

[0141] The summation range covers all valid indices after padding.

[0142] Absolute displacement in the depth direction Perform the same operation, see the following formula:

[0143] .

[0144] By sliding the window from the beginning to the end of the sequence, a complete sequence of macroscopic physiological movement trends can be obtained. and .

[0145] After extracting the trend component, subtracting the trend component from the original absolute displacement yields the high-frequency residual characterizing the random disturbance, as shown in the following formula:

[0146] ;

[0147] ;

[0148] Through the above decomposition, the absolute displacement is deconstructed into two parts: the trend component is retained, and the high-frequency residual is treated as an object that needs to be compensated in reverse.

[0149] S104: Input the current frame image and the reference frame image into the displacement fine-tuning network to obtain the displacement fine-tuning amount of the current frame image relative to the reference frame image.

[0150] To compensate for potential local nonlinear deformations in the image and complex displacements that the phase correlation method cannot perfectly handle, this embodiment uses deep learning technology to input the current frame image and the reference frame image into a displacement fine-tuning network to obtain the displacement fine-tuning amount of the current frame image relative to the reference frame image.

[0151] The translation fine-tuning network is a trained convolutional neural network model that learns the mapping relationship between two images and the translation between them. The translation fine-tuning amount is a two-dimensional vector that represents the amount of additional translation the network believes needs to be applied to the current frame image to achieve a finer alignment with the reference frame image.

[0152] It is important to note that the displacement fine-tuning amount is different from the high-frequency residual. The high-frequency residual comes from the physical decomposition of the global motion trajectory and mainly compensates for global random jitter, while the displacement fine-tuning amount comes from the network's analysis of the local content of the image and is used to correct small deformations and residuals in local areas.

[0153] like Figure 2 As shown, in some embodiments, the displacement fine-tuning network is a lightweight convolutional neural network, including multiple convolutional layers for extracting image features, batch normalization layers for accelerating training and stabilizing the model, activation function layers for introducing nonlinearity, pooling layers for reducing the dimensionality of the feature map space, flattening layers, and fully connected layers for finally regressing the displacement value. More specifically, the structure of the displacement fine-tuning network is shown in the table below:

[0154] Table 1

[0155]

[0156] Batch normalization layers and activation function layers are sequentially placed after each convolutional layer to form convolutional module 1 or convolutional module 2. The activation function layer can be a rectified linear unit (ReLU) function. The ReLU function is simple to calculate and can effectively alleviate the gradient vanishing problem.

[0157] In some embodiments, inputting the current frame image and a reference frame image into a displacement fine-tuning network to obtain the displacement fine-tuning amount of the current frame image relative to the reference frame image includes: concatenating the current frame image and the reference frame image in the channel dimension to form a dual-channel input image; inputting the dual-channel input image into multiple convolutional layers for feature extraction to obtain an extracted feature map; inputting the extracted feature map into a pooling layer for spatial dimension compression to obtain a compressed feature map; flattening the compressed feature map and inputting it into a fully connected layer to output a two-dimensional displacement vector, which is the displacement fine-tuning amount.

[0158] The design of stitching the current frame and the reference frame into a dual-channel input is based on the specific requirements of OCT image sequence registration: the current frame and the reference frame are images of the same anatomical location at different time points, and they are highly similar in spatial structure, differing only in position. The dual-channel input allows the network to simultaneously perceive the pixel distribution of the two images, directly learn the correspondence between them, and thus more accurately estimate the required displacement compensation.

[0159] In this embodiment, for the i-th frame image, the previous frame image, i.e., the (i-1)-th frame, can also be selected as the reference frame image. This dynamic reference method can avoid the accumulation of errors and is more in line with the temporal continuity characteristics of sequential images.

[0160] After determining the reference frame, the current frame image and the reference frame image are stitched together in the channel dimension to form a dual-channel input image. That is, the information of the two images are superimposed together and used as a dual-channel tensor input network, which makes it easier for the network to learn the differences and correspondences between them.

[0161] After the dual-channel input image enters the network, it first passes through Convolutional Layer 1 for feature extraction. Convolutional Layer 1 uses learnable convolutional kernels to slide across the image, extracting hierarchical features from edges and textures to higher-level semantics, transforming the original pixel information into a high-dimensional feature map, i.e., feature map extraction. Understandably, the extracted feature map is then sequentially fed into Batch Normalization Layer 1 and Activation Function Layer 1, and then into the pooling layer.

[0162] The pooling layer uses max pooling to compress the spatial dimension of the feature map. This reduces subsequent computation and expands the network's receptive field, allowing the network to focus on a wider range of contextual information, thus resulting in a compressed feature map.

[0163] Understandably, the compressed feature map is then sequentially fed into convolutional layer 2, batch normalization layer 2, activation function layer 2, and adaptive average pooling. Finally, the compressed multidimensional feature map is flattened into a one-dimensional vector and input into a fully connected layer. The fully connected layer maps the high-dimensional features to the output space through linear transformations and non-linear activations, ultimately outputting a two-dimensional displacement vector. This two-dimensional displacement vector is the fine-tuning amount of the displacement of the current frame image relative to the reference frame image predicted by the network, denoted as . ,in, This indicates the amount of fine-tuning in the lateral direction. This represents the amount of displacement adjustment in the depth direction.

[0164] like Figure 3 As shown, to obtain a high-performance displacement fine-tuning network, it needs to be trained, i.e., the network parameters need to be adjusted to make its prediction results as accurate as possible. This application includes two training modes: supervised training mode and unsupervised online training mode.

[0165] Supervised training is suitable for scenarios with high-quality pre-registration data. Pre-registration data consists of image sequences that have already achieved accurate registration results using other high-precision methods, along with their corresponding registered data.

[0166] In some embodiments, pre-training includes: acquiring a pre-registered image sequence, wherein each frame in the pre-registered image sequence has a corresponding real displacement label; inputting the current frame image and the reference frame image in the pre-registered image sequence into the displacement fine-tuning network to be trained to obtain a predicted displacement fine-tuning amount; calculating the mean square error between the predicted displacement fine-tuning amount and the real displacement label; and updating the parameters of the displacement fine-tuning network to be trained based on the mean square error to obtain the displacement fine-tuning network.

[0167] Specifically, for a training sample, i.e., a three-dimensional volume data, let the original data be denoted as . The registered data is To construct the true displacement labels required for training, it is necessary to calculate the absolute displacement of each frame in the original data to its corresponding frame after registration.

[0168] This application uses the same Fourier phase correlation method as step S102 for calculation, that is, for the i-th frame image, the phase correlation method is used to calculate... arrive The absolute displacement is used to obtain the true displacement label. ,in, Indicates the actual lateral displacement. This represents the true depth displacement. Perform this operation on all frames to obtain a set of true displacement labels of shape (Y, 2), where Y is the total number of frames in the image sequence.

[0169] During supervised training, all frames for each volume are sequentially input into the network for training. For the i-th frame (i≥1), the current frame image is... As a moving image M, its previous frame image is used. As a fixed image F1.

[0170] M and F1 are concatenated along the channel dimension to form a dual-channel input, which is then input into the displacement fine-tuning network to be trained. After forward propagation of the network, the predicted displacement fine-tuning amount is obtained. Then, the mean square error between the predicted displacement fine-tuning and the actual displacement label is calculated as the loss function value, as shown in the following formula:

[0171] ;

[0172] Where ||·||² represents the L2 norm, and this loss function measures the degree of deviation between the predicted value and the true value. Mean squared error is used as the loss function because it can effectively quantify the difference between the network's predicted displacement and the actual physical displacement, directly driving the network to learn and approximate the true displacement, ensuring that the network output remains consistent with the registered physical target.

[0173] After calculating the loss for all frames in the current volume in turn, the total loss is obtained by summing the losses of all frames. Then, the gradient is calculated and the network parameters are updated by backpropagation algorithm. During training, the Adam optimizer is used to optimize the parameters, and the initial learning rate can be set to 0.001.

[0174] Since each volume itself contains multiple frames of images, the batch size can be set to 1, that is, a complete volume is used as the input for one iteration. The number of training rounds is determined according to the loss change on the validation set. Training stops when the loss on the validation set no longer decreases, thus obtaining a displacement fine-tuning network that can accurately predict the displacement fine-tuning amount.

[0175] Unsupervised online training mode is suitable for scenarios where there is no pre-registration data but real-time registration of the currently acquired image sequence is required. Training can be performed directly on the image sequence to be registered without any external labels.

[0176] In some embodiments, online training includes: inputting the current frame image and the reference frame image into the translation fine-tuning network to be trained to obtain the translation fine-tuning amount; calculating the loss function value, the loss function value being the sum of squares of the translation fine-tuning amount; and updating the parameters of the translation fine-tuning network to be trained by backpropagation based on the loss function value, so as to learn the implicit transformation that aligns the current frame image with the reference frame image, thereby obtaining the translation fine-tuning network.

[0177] For the current volume to be registered The parameters of the displacement fine-tuning network can be initialized randomly, or a pre-trained model can be loaded as the initial value. The training process is carried out frame by frame in sequence, and the parameters are updated once after each frame is processed.

[0178] Specifically, for the i-th frame image, the current frame... As a moving image M, its previous frame As a fixed image F1, M and F1 are concatenated into a dual-channel input, which is then input into the displacement fine-tuning network to be trained. After forward propagation of the network, the predicted displacement fine-tuning amount is obtained. .

[0179] Unlike other supervised learning methods, there are no ground truth labels here. Therefore, a self-supervised loss function needs to be constructed. This embodiment utilizes the physical principle of image alignment, defining the loss function value as the sum of squares of the predicted displacement fine-tuning amounts, i.e. The physical meaning of this loss function is: if two images are perfectly aligned, the network should output zero displacement; conversely, if they are not aligned, the network will output a non-zero displacement to attempt to align them. Therefore, by minimizing the sum of squared displacements, the network is essentially forced to learn how to align two frames. This loss function design directly utilizes the physical essence of image registration—that aligned images should have zero displacement—closely integrating the goal of unsupervised learning with the inherent requirements of the registration task, driving network learning without the need for external labels.

[0180] Because it is online training, the network needs to quickly adapt to the characteristics of the current data. Therefore, the learning rate is usually set higher than in supervised mode. For example, it can be set to 0.01. By processing frame by frame and updating parameters online, the network can learn an implicit transformation that aligns the current frame image with the reference frame image.

[0181] In unsupervised online training, the loss function guides the network to find the displacement that minimizes the difference between two frames, which is a self-supervised alignment mechanism. However, if we rely solely on this loss function, the network may get stuck in trivial solutions, i.e., always outputting zero displacement.

[0182] To avoid this problem, the displacement fine-tuning amount output by the network is combined with the high-frequency residual obtained in step S103. The residual compensation mechanism can ensure that the network's learning always has a clear physical orientation.

[0183] After training or online updating of the displacement fine-tuning network, the forward inference of the network can be performed for the current frame image to obtain its displacement fine-tuning amount relative to the reference frame image, providing input for subsequent compensation synthesis.

[0184] S105: Determine the compensation displacement of the current frame image based on the high-frequency residual and the displacement fine-tuning amount.

[0185] After obtaining the high-frequency residual characterizing random perturbation and the displacement adjustment amount characterizing local fine-tuning, the two are combined to determine the compensation displacement, which is a correction amount applied to the current frame image to complete the correction of global random jitter and local minor deformation.

[0186] Regardless of whether the displacement fine-tuning network is pre-trained in a supervised manner or trained online in an unsupervised manner, the displacement fine-tuning amount predicted by the network is obtained. Afterwards, it needs to be compared with the high-frequency residual obtained in step S103. Synthesize the image to determine the final compensated displacement of the current frame.

[0187] In some embodiments, determining the compensation displacement of the current frame image based on the high-frequency residual and the displacement fine-tuning amount includes: taking a negative value for the high-frequency residual to obtain a reverse compensation value; and weighting and summing the reverse compensation value with the displacement fine-tuning amount to obtain the compensation displacement, which is used to compensate for random disturbances and local deformations.

[0188] The principle behind taking the negative value of the high-frequency residual is that the high-frequency residual represents the positive offset of the image's current position relative to its ideal position. For example, if the image shifts to the right horizontally due to random jitter... One pixel, offset downwards in the depth direction. To compensate for this offset, a displacement in the opposite direction, i.e., a leftward offset, needs to be applied. One pixel, offset upwards Each pixel. Therefore, taking the negative value of the high-frequency residual yields the inverse compensation value ( ). It is used directly to counteract random disturbances that have already occurred.

[0189] Secondly, the negative high-frequency residuals are weighted and summed with the displacement fine-tuning predicted by the network. During the summation process, a balance coefficient is introduced. This is used to adjust the contribution of the network's predicted displacement fine-tuning to the final compensated displacement. The formula for calculating the compensated displacement is:

[0190] ;

[0191] ;

[0192] in, and These represent the final compensated displacements of the i-th frame image in the horizontal and depth directions, respectively. and Let be the high-frequency residual of the i-th frame, and take a negative sign to indicate reverse compensation for random jitter; and The displacement adjustment amount predicted by the displacement adjustment network; This is a balancing coefficient used to adjust the magnitude of the network's predictive contribution.

[0193] Balance coefficient It can be adjusted according to the actual application scenario. In this application, A value of 1.0 is used, meaning that the high-frequency residual and the displacement fine-tuning are considered to contribute equally. However, in certain special cases, such as when there are severe local deformations in the image sequence, the value can be appropriately increased. The value of α can be adjusted to enhance the ability of the displacement fine-tuning network to correct local deformations; conversely, when random jitter dominates, the value of α can be appropriately reduced so that the compensation relies more on the reverse compensation of high-frequency residuals.

[0194] Through the above synthesis operation, the obtained compensation displacement includes both global random jitter compensation achieved by negative high-frequency residuals and local deformation fine-tuning contributed by the displacement fine-tuning network. More importantly, since the high-frequency residuals themselves are obtained by subtracting the trend component from the absolute displacement, and the trend component has been completely preserved in step S103, the compensation process will not affect the low-frequency trend component that characterizes physiological motion.

[0195] After the compensation displacement is calculated, the displacement amount will be sent to the next step to perform a spatial transformation on the current frame image, thereby obtaining the registered image.

[0196] S106: Based on the compensation displacement, perform spatial transformation on the current frame image to obtain the registration image.

[0197] Spatial transformation converts the calculated compensation displacement into actual pixel coordinate adjustments. In some embodiments, spatial transformation is performed on the current frame image based on the compensation displacement to obtain a registered image, including: constructing a two-dimensional translation transformation matrix based on the components of the compensation displacement in the depth and lateral directions; mapping the pixel coordinates of the current frame image using the two-dimensional translation transformation matrix to generate a sampling grid; and resampling the current frame image using bilinear interpolation according to the sampling grid, wherein boundary pixels are processed using reflection filling during the resampling process to obtain the registered image.

[0198] Since this application primarily deals with translational motion between images, the constructed affine transformation matrix only considers translational transformations and does not consider rotation or scaling. Let the width of the current frame image be X (corresponding to the number of pixels in the horizontal direction) and the depth be Z (corresponding to the number of pixels in the depth direction). The compensation displacement determined in step S105 is... .

[0199] Before constructing the transformation matrix, the pixel unit offset needs to be converted to a normalized coordinate system. The coordinate range of the normalized coordinate system is usually [-1, 1], where -1 corresponds to the left or top boundary of the image, and 1 corresponds to the right or bottom boundary of the image.

[0200] For example, the conversion process is as follows: compensation displacement in the lateral direction We need to divide by the image width X and multiply by 2 to obtain the horizontal translation t in the normalized coordinate system. x Similarly, the compensation displacement in the depth direction Divide by the image depth Z and multiply by 2 to obtain the depth translation t in the normalized coordinate system. z See the following formula:

[0201] ;

[0202] ;

[0203] The reason for using a negative sign here is that the affine transformation matrix defines the position of the target image pixel mapped back to the source image, rather than the displacement direction directly applied to the image.

[0204] Based on the above translation, a two-dimensional affine transformation matrix θ can be constructed, the specific form of which is shown in the following equation:

[0205] ;

[0206] The two-dimensional affine transformation matrix is ​​a 2×3 transformation matrix. The first two columns correspond to the identity matrix, indicating no rotation or scaling. The third column (t...) x , t z This is the translation amount.

[0207] After obtaining θ, a sampling grid is generated. The sampling grid is implemented by the function affine_grid, which takes θ and the size of the output image as input, calculates the corresponding coordinate position of each pixel in the output image in the input image. The generated sampling grid is a two-dimensional coordinate mapping table with the same size as the output image, where each element records which position in the input image the output pixel should be sampled from.

[0208] The current frame image is resampled according to the sampling grid. The resampling operation is implemented by the grid_sample function, which takes the original image M and the sampling grid as input, performs interpolation sampling on the image, and generates the transformed image M'.

[0209] Since the coordinates in the sampling grid may not be integer positions, an interpolation algorithm is needed to determine the value of the new pixel. This application uses bilinear interpolation for resampling. Bilinear interpolation can achieve a good balance between computational efficiency and image quality. It uses the gray values ​​of the four nearest integer pixels around the point to be determined to perform linear weighted calculation in two directions to obtain the gray value of the point.

[0210] During resampling, for sampling points falling outside the image boundary, a boundary filling mode needs to be set. This application uses a reflection filling method to process boundary pixels, that is, filling the outer area by mirroring the pixel values ​​of the inner edges of the image. For example, if a value needs to be taken at the position of a pixel on the left side of the image, the mirror value of the leftmost column of pixels in the image is used instead. This filling method can avoid producing unnatural black edges or repeated stripes and other artifacts at the image boundary, ensuring the overall quality of the registered image.

[0211] Through the above transformation, the registration image corresponding to the current frame image is finally obtained. Repeat the above operation for all frames, that is, perform spatial transformation sequentially from frame 1 to frame Y-1, to obtain a complete registered slice sequence. }, where i ranges from 0 to Y-1, and the 0th frame image (i.e., the reference frame image) remains unchanged during the transformation, resulting in the registered image. It is precisely aligned with the reference frame image in spatial location, and at the same time, through the aforementioned trend component preservation mechanism, it ensures that the real physiological motion information in the image sequence is not destroyed.

[0212] S107: Traverse the multiple frames of two-dimensional images in the image sequence to obtain the registration image sequence, which includes the registration images corresponding to the multiple frames of two-dimensional images.

[0213] Except for the reference frame image used as a baseline, steps S102 to S106 are repeated for each frame image in the image sequence.

[0214] Specifically, starting with the second frame, which is set as the current frame, operations such as relative displacement calculation, absolute displacement accumulation, trend decomposition, network prediction, compensation synthesis, and spatial transformation are performed to obtain its registered image. The third and fourth frames are then processed, and so on until the last frame. After all frames have been processed, all the obtained registered images are combined in their original order, that is, the registered images corresponding to each frame are combined. Arranged sequentially according to index i from 0 to Y-1, a complete sequence of registered images is finally formed. }, i=0,1,...,Y-1.

[0215] The registered image sequence corresponds one-to-one with the original image sequence in terms of frame number, but each frame image is spatially aligned to eliminate motion artifacts. At the same time, through the trend component preservation mechanism, the real physiological motion information represented by the trend component is preserved.

[0216] All registered 2D images are stitched together along the Y dimension in their original order to obtain the registered volume. , The dimensions remain consistent with the original input volume V, still being (Y, Z, X), but It has better spatial continuity and can more accurately reflect the three-dimensional anatomical structure of the examined tissue.

[0217] After obtaining the registered image sequence, it can be further reconstructed into three-dimensional volume data. In some embodiments, the method further includes: obtaining a continuous target region along the depth direction from the registered image sequence to obtain a cropped image sequence; calculating the gray-scale mean value of each column of pixels in the cropped image sequence along the depth direction to obtain a projection value matrix; and converting the projection value matrix into a two-dimensional grayscale image to generate a two-dimensional frontal projection image.

[0218] To facilitate clinical observation and quantitative analysis, this application also generates a two-dimensional frontal projection image. The two-dimensional frontal projection image is equivalent to a view obtained by looking vertically downward from above the three-dimensional data, which can intuitively show the frontal morphology of the tissue layer. Generating a two-dimensional frontal projection image includes two steps: depth cropping and mean projection.

[0219] First, a depth clipping operation is performed. Based on the imaging parameters of the imaging device and the location of the target region of interest, a continuous region along the depth direction (Z-axis direction) is selected as the region of interest, i.e., the target region. The target region is defined by the starting depth. and end depth Define and select a region that completely encompasses a specific tissue layer. For example, in a retinal OCT image, the region containing the retinal nerve fiber layer can be selected as the region of interest.

[0220] Through deep cropping, from Extract the location located at [ , The cropped volume is obtained by taking all pixels within the depth range. , The dimension is (Y, ( - (X), which means that the horizontal information of all frames is preserved, but the depth direction is restricted to the selected interval.

[0221] After depth cropping, a mean projection operation is performed. Mean projection compresses the cropped volume along the depth direction to generate a two-dimensional image. Specifically, for each horizontal position (X coordinate) and each frame index (Y coordinate), the arithmetic mean of all pixel values ​​in the corresponding column is calculated along the depth direction (Z direction), as shown in the following formula:

[0222] ;

[0223] in, This represents the pixel value at coordinates (y, x) of the generated two-dimensional frontal projection image, and the summation range covers all pixels within the selected depth interval.

[0224] By calculating point by point, a two-dimensional projection value matrix E is finally obtained, with dimensions (Y, X), which reflects the average reflectance distribution of the tissue layer under a frontal view.

[0225] Finally, the values ​​in the projection matrix E are appropriately mapped to grayscale to convert them into a standard two-dimensional grayscale image, which is a two-dimensional frontal projection image. This image can be saved in a common image format, such as BMP, so that clinicians can observe and perform preliminary analysis in conventional image viewing software.

[0226] Meanwhile, the registered three-dimensional volume data It can be saved in data formats commonly used in scientific computing, such as NumPy format (.npy file), so that it can be used with tools such as Python for more in-depth quantitative analysis, such as thickness measurement, volume calculation, and morphological parameter extraction.

[0227] like Figure 4 , Figure 5 As shown, Figure 4 , Figure 5 All images were displayed using black-and-white microscopic images, showing the vascular network structure, with blood vessels appearing as dark gray reticular patterns. Figure 4 a and Figure 5 Image a is a top-down view before registration. Due to high-frequency jitter in the original OCT image sequence, the blood vessels appear discontinuous and misaligned. Figure 4 b and Figure 5 In the image below (b), the top view after registration shows that the blood vessels are clear, coherent, and continuous. This demonstrates that the proposed method can effectively correct motion artifacts and restore a realistic, high-quality three-dimensional vascular structure.

[0228] Different experiments and data samples may result in varying computational loads for the acquired image sequences. The solution provided in this application can achieve fast processing on ordinary computing hardware, such as a 5070 graphics card. (See also...) Figure 4 It was completed in 2.88 seconds. Figure 5 It can be completed in 2.25 seconds, which is highly efficient.

[0229] Similar parts between the embodiments provided in this application can be referred to mutually. The specific implementation methods provided above are only a few examples under the overall concept of this application and do not constitute a limitation on the scope of protection of this application. For those skilled in the art, any other implementation methods extended from the solution of this application without creative effort shall fall within the scope of protection of this application.

Claims

1. A method for registering an image sequence, characterized in that, include: Obtain the image sequence to be registered, the image sequence comprising multiple frames of two-dimensional images; Calculate the relative displacement between two adjacent two-dimensional images in the image sequence, and accumulate multiple relative displacements to obtain the absolute displacement of each image relative to the reference image. The absolute displacement is decomposed to obtain a trend component and a high-frequency residual. The trend component is used to characterize physiological motion, and the high-frequency residual is used to characterize random perturbation. The absolute displacement is decomposed to obtain trend components and high-frequency residuals, including: The absolute displacement is filled by boundary reflection to obtain the filled displacement sequence; A one-dimensional convolutional smoothing filter is used to process the padded displacement sequence to extract the trend component representing physiological movement. The smoothing window size of the one-dimensional convolutional smoothing filter is an odd number, and a normalized rectangular window function is set in the smoothing window. The rectangular window function is used to perform a weighted average on the padded displacement sequence. Subtracting the trend component from the filled displacement sequence yields a high-frequency residual characterizing the random disturbance. The current frame image and the reference frame image are input into the displacement fine-tuning network to obtain the displacement fine-tuning amount of the current frame image relative to the reference frame image; The compensation displacement of the current frame image is determined based on the high-frequency residual and the displacement fine-tuning amount. Based on the compensated displacement, the current frame image is spatially transformed to obtain a registered image; By traversing multiple frames of two-dimensional images in the image sequence, a registration image sequence is obtained, which includes registration images corresponding to multiple frames of two-dimensional images.

2. The image sequence registration method according to claim 1, characterized in that, Before calculating the relative displacement between two adjacent two-dimensional images in the image sequence, the method further includes: Traverse the two-dimensional images in the image sequence and calculate the global minimum gray level; Subtract the global minimum gray value from the gray value of each pixel in each frame of the image to obtain a non-negative gray image; Divide the gray value of each pixel in the non-negative grayscale image by the preset maximum gray value and map it to the normalization interval to obtain the preprocessed image sequence. The calculation of the relative displacement between two adjacent two-dimensional images in the image sequence includes: Fourier transform is performed on two adjacent frames of the preprocessed image sequence to obtain a first frequency domain image and a second frequency domain image. Calculate the normalized cross power spectrum based on the first frequency domain image and the second frequency domain image; Perform an inverse Fourier transform on the normalized cross-power spectrum to obtain the spatial domain function; Locate the peak position of the spatial domain function; Based on the peak position, the relative displacement between two adjacent two-dimensional images is determined, and the accuracy of the relative displacement is at the sub-pixel level.

3. The image sequence registration method according to claim 1, characterized in that, The reference frame image is the first frame image in the image sequence; The accumulation of multiple relative displacements to obtain the absolute displacement of each frame image relative to the reference frame image includes: The absolute displacement of the reference frame image is set to zero; The relative displacement of the second frame image in the image sequence with respect to the first frame image is determined as the absolute displacement of the second frame image; The relative displacements of the third frame and subsequent frames in the image sequence are sequentially added to the absolute displacement of the previous frame to obtain the absolute displacement of each frame relative to the first frame.

4. The image sequence registration method according to claim 1, characterized in that, The shift fine-tuning network is a convolutional neural network, which includes convolutional layers, batch normalization layers, activation function layers, pooling layers, and fully connected layers. The batch normalization layers and activation function layers are placed after the convolutional layers. The step of inputting the current frame image and the reference frame image into the displacement fine-tuning network to obtain the displacement fine-tuning amount of the current frame image relative to the reference frame image includes: The current frame image and the reference frame image are stitched together along the channel dimension to form a dual-channel input image; The dual-channel input image is input into multiple convolutional layers for feature extraction to obtain an extracted feature map. The extracted feature map is input into the pooling layer for spatial dimension compression to obtain a compressed feature map; After flattening the compressed feature map, it is input into the fully connected layer, and a two-dimensional displacement vector is output, which is a displacement fine-tuning amount.

5. The image sequence registration method according to claim 1, characterized in that, The displacement fine-tuning network is pre-trained using a supervised method, and the pre-training includes: Obtain a pre-registered image sequence, wherein each frame of the pre-registered image sequence has a corresponding real displacement label; The current frame image and the reference frame image in the pre-registered image sequence are input into the displacement fine-tuning network to be trained to obtain the predicted displacement fine-tuning amount; Calculate the mean square error between the predicted displacement fine-tuning amount and the actual displacement label; Based on the mean square error, the parameters of the displacement fine-tuning network to be trained are updated to obtain the displacement fine-tuning network.

6. The image sequence registration method according to claim 1, characterized in that, The displacement fine-tuning network is trained online in an unsupervised manner, and the online training includes: Calculate the loss function value, which is the sum of squares of the displacement fine-tuning amounts; Based on the loss function value, the parameters of the translation fine-tuning network to be trained are updated by backpropagation to learn the implicit transformation that aligns the current frame image with the reference frame image, thus obtaining the translation fine-tuning network.

7. The image sequence registration method according to claim 1, characterized in that, Determining the compensation displacement of the current frame image based on the high-frequency residual and the displacement fine-tuning amount includes: The negative value of the high-frequency residual is used to obtain the reverse compensation value; The reverse compensation value and the displacement fine-tuning amount are weighted and summed to obtain the compensation displacement, which is used to compensate for random disturbances and local deformations.

8. The image sequence registration method according to claim 1, characterized in that, The step of performing a spatial transformation on the current frame image based on the compensated displacement to obtain a registered image includes: Based on the components of the compensated displacement in the depth and lateral directions, a two-dimensional translation transformation matrix is ​​constructed; The pixel coordinates of the current frame image are mapped using the two-dimensional translation transformation matrix to generate a sampling grid; For the current frame image, bilinear interpolation resampling is performed according to the sampling grid to obtain a registered image. The resampling is used to process boundary pixels using a reflection filling method.

9. The image sequence registration method according to claim 1, characterized in that, After traversing multiple frames of two-dimensional images in the image sequence to obtain the registered image sequence, the method further includes: From the registered image sequence, continuous target regions are obtained along the depth direction to obtain a cropped image sequence; For each column of pixels in the cropped image sequence, calculate the grayscale mean along the depth direction to obtain the projection value matrix; The projection value matrix is ​​converted into a two-dimensional grayscale image to generate a two-dimensional frontal projection image.