An endoscope virtual staining method and device based on a frequency domain guided diffusion model, equipment and medium
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- THE FIRST AFFILIATED HOSPITAL OF SOOCHOW UNIV
- Filing Date
- 2026-03-25
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies struggle to effectively decouple structural and stylistic information from images during diffusion model generation, leading to blurred vessel edges, structural hallucinations, or texture artifacts. Furthermore, the lack of multi-scale dynamic control over frequency information makes it difficult to meet clinical requirements for pixel-level image fidelity.
A frequency-domain guided diffusion model is adopted. The white light endoscope image is decoupled into multi-scale frequency components through frequency domain multi-scale decomposition. The frequency-domain guided loss is constructed by using the frequency weights that change dynamically with the diffusion time step to realize the gradient correction of noise prediction and generate a virtual staining image that conforms to the distribution of the target staining domain.
This keeps the generated image aligned with the fine structure of the original image at the pixel level, suppresses broken vessels, edge blurring, and texture artifacts, and improves the fidelity, robustness, and clinical reliability of virtual staining results.
Figure CN122243722A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer vision technology, and in particular to an endoscopic virtual staining method, apparatus, device, and medium based on a frequency domain guided diffusion model.
Background Technology
[0002] In gastrointestinal endoscopy, narrow-band imaging (NBI) significantly enhances the visibility of microvessels and glandular structures on the mucosal surface through illumination at specific wavelengths, which is of great value for the diagnosis of early tumors. However, due to limitations in hardware cost and equipment compatibility, NBI has a low adoption rate in primary healthcare institutions, and white light endoscopy (WLE) remains the primary imaging method in clinical practice. Researchers are therefore attempting to convert white light images into virtual stained images using image processing algorithms to improve the accessibility of early cancer screening. Current mainstream methods include unsupervised mapping methods based on Cycle-Consistent Generative Adversarial Networks (CycleGAN) and image translation methods based on diffusion models. The former achieves inter-domain transformation of unpaired data through adversarial loss and cycle consistency loss, while the latter reconstructs NBI-style images in the noise space using a pre-trained stable diffusion model combined with a self-attention injection strategy. However, existing methods have significant limitations in medical image processing: neither global mapping based on generative adversarial networks nor self-attention guidance based on diffusion models can effectively decouple the structural and stylistic information of images in the spatial domain, leading to problems such as blurred blood vessel edges, structural hallucinations, or texture artifacts. At the same time, diffusion models lack multi-scale dynamic control of frequency information during the generation process, and fixed-weight guidance strategies cannot balance the preservation of macroscopic structure with the restoration of microscopic details. In addition, the reliance of existing technologies on text prompts makes the generated results random when dealing with atypical lesions or complex scenes, and these technologies lack a posterior verification mechanism for the consistency of anatomical structures, making it difficult to meet the clinical requirements for pixel-level image fidelity.
[0003] As can be seen from the above, how to achieve multi-scale decoupling of frequency features and dynamic time-step collaborative guidance during diffusion model generation, so as to complete accurate virtual staining conversion while ensuring the authenticity of anatomical structures, is an urgent problem to be solved.
Summary of the Invention
[0004] In view of this, the purpose of this invention is to provide an endoscopic virtual staining method, apparatus, device, and medium based on a frequency domain guided diffusion model, which enable multi-scale decoupling of frequency features and dynamic time-step collaborative guidance during the diffusion model generation process, thereby achieving accurate virtual staining conversion while ensuring the realism of anatomical structures. The specific solution is as follows:
In a first aspect, this application provides an endoscopic virtual staining method based on a frequency domain guided diffusion model, comprising:
acquiring a white light endoscope image to be processed, and preprocessing the white light endoscope image to obtain a target white light endoscope image;
mapping the target white light endoscope image to the latent space using a preset variational autoencoder to obtain the corresponding initial latent variables, and obtaining, based on the initial latent variables and using a preset deterministic inversion algorithm, an initial noise state that is semantically consistent with the white light endoscope image to be processed;
performing frequency domain multi-scale decomposition on the target white light endoscope image to extract the corresponding multi-scale frequency components;
during reverse denoising from the initial noise state using the pre-trained latent space diffusion model, injecting frequency conditions based on the multi-scale frequency components into the noise prediction network, and constructing a frequency domain guided loss based on frequency weights that change dynamically with the diffusion time step, so as to correct the noise prediction using the gradient of the frequency domain guided loss and generate target latent variables that conform to the target staining domain distribution;
mapping the target latent variables to pixel space to obtain the corresponding virtual staining image.
[0005] Optionally, performing frequency domain multi-scale decomposition on the target white light endoscope image to extract the corresponding multi-scale frequency components includes: decomposing the target white light endoscope image into a first component, a second component, and a third component using a Laplacian pyramid, where the first component represents the fine structural information of the image, the second component represents the mesoscale texture information of the image, and the third component represents the global tone and illumination distribution information of the image. Each component is encoded separately using an independent convolutional coding branch to obtain a frequency conditional tensor aligned with the dimensions of the intermediate layer of the noise prediction network; the frequency conditional tensor is used to modulate the feature responses of the intermediate layer of the noise prediction network during the reverse denoising process.
[0006] Optionally, the frequency weights that change dynamically with the diffusion time step include: a first weight corresponding to the first component, which takes its largest value at the beginning of the reverse denoising process and decreases as the diffusion time steps advance; a second weight corresponding to the second component, which is enhanced by a preset function in the middle stage of the reverse denoising process to guide the generation of mesoscale textures; and a third weight corresponding to the third component, which increases monotonically as the reverse denoising process advances, so as to constrain the color distribution transfer.
[0007] Optionally, constructing the frequency domain guided loss based on the frequency weights that change dynamically with the diffusion time step includes: using a preset latent space frequency extraction operator to separate, from the current latent variable, the frequency band feature representations corresponding to the first component, the second component, and the third component; calculating the difference between each frequency band feature representation and the corresponding frequency conditional tensor using a preset frequency domain guided loss function; and performing a weighted summation of the differences using the first weight, the second weight, and the third weight to construct the frequency domain guided loss.
[0008] Optionally, correcting the noise prediction using the gradient of the frequency domain guided loss includes: multiplying the gradient of the frequency domain guided loss with respect to the current latent variable by a preset guidance step size parameter, and then fusing it with the original noise prediction value output by the noise prediction network to obtain a corrected noise prediction value; the corrected noise prediction value is used to guide the latent variable update in the current iteration step.
[0009] Optionally, the endoscopic virtual staining method based on the frequency domain guided diffusion model further includes: collecting a target staining domain image dataset, and using the dataset to perform distribution-transfer fine-tuning on the noise prediction network of the initial latent space diffusion model to obtain the pre-trained latent space diffusion model.
[0010] Optionally, after mapping the target latent variables to pixel space to obtain the corresponding virtual staining image, the method further includes: calculating the spectral energy difference between the virtual staining image and the target white light endoscope image in the frequency band corresponding to the first component; and, if the spectral energy difference exceeds a preset threshold, correcting the frequency band corresponding to the first component in the virtual staining image using a frequency domain mask to obtain a corrected virtual staining image.
[0011] In a second aspect, this application provides an endoscopic virtual staining device based on a frequency domain guided diffusion model, comprising:
an image acquisition module, used to acquire a white light endoscope image to be processed and to preprocess the white light endoscope image to obtain a target white light endoscope image;
an initial noise state acquisition module, used to map the target white light endoscope image to the latent space using a preset variational autoencoder to obtain the corresponding initial latent variables, and to obtain, based on the initial latent variables and using a preset deterministic inversion algorithm, an initial noise state that is semantically consistent with the white light endoscope image to be processed;
a frequency domain analysis module, used to perform frequency domain multi-scale decomposition on the target white light endoscope image to extract the corresponding multi-scale frequency components;
a target latent variable generation module, used to inject frequency conditions based on the multi-scale frequency components into the noise prediction network during reverse denoising from the initial noise state using the pre-trained latent space diffusion model, and to construct a frequency domain guided loss based on frequency weights that change dynamically with the diffusion time step, so as to correct the noise prediction using the gradient of the frequency domain guided loss and generate target latent variables that conform to the target staining domain distribution;
an image reconstruction module, used to map the target latent variables to pixel space to obtain the corresponding virtual staining image.
[0012] In a third aspect, this application provides an electronic device, comprising: a memory, used to store a computer program; and a processor, used to execute the computer program to implement the aforementioned endoscopic virtual staining method based on a frequency domain guided diffusion model.
[0013] In a fourth aspect, this application provides a computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the aforementioned endoscopic virtual staining method based on a frequency domain guided diffusion model.
[0014] This application provides a virtual endoscopic staining method based on a frequency-domain guided diffusion model. The method involves acquiring a white light endoscope image to be processed and preprocessing it to obtain a target white light endoscope image. A preset variational autoencoder is used to map the target white light endoscope image to a latent space to obtain corresponding initial latent variables. A preset deterministic inversion algorithm is then used to obtain an initial noise state semantically consistent with the white light endoscope image to be processed based on the initial latent variables. The target white light endoscope image is then subjected to frequency-domain multi-scale decomposition to extract corresponding multi-scale frequency components. During the reverse denoising process using a pre-trained latent space diffusion model based on the initial noise state, frequency conditions based on the multi-scale frequency components are injected into a noise prediction network. A frequency-domain guided loss is constructed based on frequency weights that dynamically change with the diffusion time step. The gradient of the frequency-domain guided loss is used to correct the noise prediction, thereby generating target latent variables that conform to the target staining domain distribution. Finally, the target latent variables are mapped to a pixel space to obtain a corresponding virtual staining image.
[0015] As shown above, this application decouples white light endoscopic images into frequency components carrying different anatomical meanings through multi-scale frequency domain decomposition, and injects these components as frequency conditions into the noise prediction network of the diffusion model. At the same time, a frequency domain guided loss is constructed by combining frequency weights that change dynamically with the diffusion time step, and its gradient is used to correct the noise prediction. This achieves differentiated guidance for anatomical structure preservation and staining style transfer during the reverse denoising process. This mechanism keeps the generated image aligned with the fine structure of the original image at the pixel level, suppresses problems such as broken vessels, edge blurring, and texture artifacts, and improves the fidelity, robustness, and clinical credibility of the virtual staining results. Thus, multi-scale decoupling of frequency features and dynamic time-step collaborative guidance can be achieved during the diffusion model generation process to complete accurate virtual staining conversion while ensuring the authenticity of the anatomical structure.
Description of the Drawings
[0016] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.
[0017] Figure 1 is a flowchart of an endoscopic virtual staining method based on a frequency domain guided diffusion model disclosed in this invention;
Figure 2 is a schematic diagram of multi-scale image decomposition based on frequency domain operators disclosed in this invention, in which Figure 2a is a schematic diagram of the original white light image, Figure 2b is a schematic diagram of the low-frequency component, Figure 2c is a schematic diagram of the mid-frequency component, and Figure 2d is a schematic diagram of the high-frequency component;
Figure 3 is a schematic diagram of a frequency domain guided diffusion generation structure disclosed in this invention;
Figure 4 is a schematic diagram of an endoscopic virtual staining device based on a frequency domain guided diffusion model disclosed in this invention;
Figure 5 is a structural diagram of an electronic device disclosed in this invention.
Detailed Implementation
[0018] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0019] In gastrointestinal endoscopy, narrow-band imaging (NBI) significantly enhances the visibility of microvessels and glandular structures on the mucosal surface through illumination at specific wavelengths, which is of great value for the diagnosis of early tumors. However, due to limitations in hardware cost and equipment compatibility, NBI has a low adoption rate in primary healthcare institutions, and white light endoscopy remains the primary imaging method in clinical practice. Researchers are therefore attempting to convert white light images into virtual stained images using image processing algorithms to improve the accessibility of early cancer screening. Current mainstream methods include unsupervised mapping methods based on Cycle-Consistent Generative Adversarial Networks (CycleGAN) and image translation methods based on diffusion models. The former achieves inter-domain transformation of unpaired data through adversarial loss and cycle consistency loss, while the latter reconstructs NBI-style images in the noise space using a pre-trained stable diffusion model combined with a self-attention injection strategy. However, existing methods have significant limitations in medical image processing: neither global mapping based on generative adversarial networks nor self-attention guidance based on diffusion models can effectively decouple the structural and stylistic information of images in the spatial domain, leading to problems such as blurred blood vessel edges, structural hallucinations, or texture artifacts. Furthermore, diffusion models lack multi-scale dynamic control of frequency information during generation, and fixed-weight guidance strategies struggle to balance macroscopic structural preservation with microscopic detail restoration. In addition, the reliance on text prompts in existing technologies makes the generated results random when processing atypical lesions or complex scenes, and these technologies lack a posterior verification mechanism for anatomical structural consistency, so they fail to meet clinical requirements for pixel-level image fidelity. Therefore, this application provides an endoscopic virtual staining method, apparatus, device, and medium based on a frequency domain guided diffusion model, which can achieve multi-scale decoupling of frequency features and dynamic time-step collaborative guidance during diffusion model generation, enabling accurate virtual staining conversion while ensuring the realism of anatomical structures.
[0020] As shown in Figure 1, this application discloses an endoscopic virtual staining method based on a frequency domain guided diffusion model, including:
Step S11: Acquire the white light endoscope image to be processed, and preprocess the white light endoscope image to obtain the target white light endoscope image.
[0021] In this embodiment, the white light endoscope image to be processed is acquired and processed with the Contrast Limited Adaptive Histogram Equalization (CLAHE) algorithm to eliminate uneven illumination and enhance the edge sharpness of blood vessels and mucosa, thereby obtaining the target white light endoscope image. The calculation formula is as follows: $I = \mathrm{CLAHE}(I_{\mathrm{raw}})$, where $I_{\mathrm{raw}}$ is the white light endoscope image to be processed and $I$ is the target white light endoscope image.
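The per-tile core of contrast-limited equalization can be sketched in a few lines. This is a minimal illustration rather than the embodiment's implementation: it assumes an 8-bit single-channel tile, and the function name `clipped_equalize` and the `clip_limit` default are hypothetical. Full CLAHE additionally splits the image into tiles and interpolates between neighbouring tile mappings, which is omitted here.

```python
import numpy as np

def clipped_equalize(tile, clip_limit=2.0, n_bins=256):
    """Contrast-limited histogram equalization of one uint8 tile (assumed helper)."""
    hist = np.bincount(tile.ravel(), minlength=n_bins).astype(np.float64)
    # Clip every bin at clip_limit times the mean bin height ...
    limit = clip_limit * hist.mean()
    excess = np.maximum(hist - limit, 0.0).sum()
    # ... and redistribute the clipped mass uniformly over all bins.
    hist = np.minimum(hist, limit) + excess / n_bins
    cdf = np.cumsum(hist)
    cdf /= cdf[-1]
    # The normalized CDF acts as the grey-level lookup table.
    lut = np.round(cdf * (n_bins - 1)).astype(np.uint8)
    return lut[tile]

# Low-contrast toy tile whose grey levels are confined to 100..120.
tile = (100 + np.arange(64 * 64) % 21).reshape(64, 64).astype(np.uint8)
equalized = clipped_equalize(tile)
```

Clipping bounds the slope of the mapping, so contrast is stretched without amplifying noise in nearly uniform regions as strongly as plain histogram equalization would.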
[0022] In this embodiment, a pre-trained Latent Diffusion Model (LDM) is employed, leveraging its advantage in probabilistic modeling within the latent space $Z$. A clinical NBI image dataset is collected. Frequency band components are extracted from the current NBI sample image as a physical structure reference for self-supervised training and are transformed by the convolutional coding branches into structural condition features $c$. The noise predictor $\epsilon_\theta$ is then fine-tuned using low-rank adaptation (LoRA). The training objective function is as follows: $\mathcal{L}_{\mathrm{LDM}} = \mathbb{E}_{z \sim \mathcal{E}(x),\, \epsilon \sim \mathcal{N}(0, I),\, t}\left[\lVert \epsilon - \epsilon_\theta(z_t, t, c) \rVert_2^2\right]$; where $\mathcal{L}_{\mathrm{LDM}}$ is the training loss function of the latent space diffusion model, used to minimize the error between the noise predicted by the model and the noise actually added; $\mathbb{E}$ is the expectation operator, indicating that the expectation is taken over all random variables in the brackets; $z$ is the latent space variable, obtained by encoding the NBI image $x$ with the variational autoencoder; $\mathcal{E}$ is the variational autoencoder; and $z_t$ is the latent variable after $t$ steps of noise addition, obtained from $z$ through the forward diffusion process. During this process, the output layer of each convolutional coding branch employs a zero-convolution strategy, allowing the model to gradually learn to constrain the generated latent variable trajectories with the structural condition features while maintaining the original LDM generation quality. Through this process, the model learns the characteristic spectral absorption properties of NBI, the high-contrast distribution of microvessels, and the statistical regularities of mucosal gland openings. This enables it to form a conditional score approximation of the statistical distribution of the target staining domain during the reverse diffusion process, so that sampling tends to generate latent variable trajectories that conform to the NBI statistical distribution.
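The noise-prediction objective described above can be illustrated with a small NumPy sketch. Everything named here is an assumption for illustration: the linear beta schedule, the helper names `make_alpha_bar`, `q_sample`, and `noise_prediction_loss`, and the placeholder predictor that ignores its inputs. A real noise predictor would be a conditioned U-Net trained by gradient descent on this loss.

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal retention for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(z0, t, eps, alpha_bar):
    """Forward diffusion: noise the clean latent z0 up to step index t."""
    a = alpha_bar[t]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps

def noise_prediction_loss(eps_model, z0, cond, alpha_bar, rng):
    """Single-sample Monte Carlo estimate of the noise-prediction objective."""
    t = int(rng.integers(0, len(alpha_bar)))
    eps = rng.standard_normal(z0.shape)
    z_t = q_sample(z0, t, eps, alpha_bar)
    # Mean squared error, i.e. the squared L2 norm divided by the element count.
    return float(np.mean((eps - eps_model(z_t, t, cond)) ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
z0 = rng.standard_normal((4, 32, 32))   # stand-in for an encoded NBI latent
cond = np.zeros_like(z0)                # stand-in structural condition features

def zero_model(z_t, t, c):
    """Untrained placeholder predictor that always outputs zero noise."""
    return np.zeros_like(z_t)

loss = noise_prediction_loss(zero_model, z0, cond, alpha_bar, rng)
```

With the zero predictor the loss is close to the variance of the added noise; a predictor that recovers the added noise exactly drives it to zero.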
[0023] Step S12: Map the target white light endoscope image to the latent space using a preset variational autoencoder to obtain the corresponding initial latent variables, and use a preset deterministic inversion algorithm to obtain an initial noise state that is semantically consistent with the white light endoscope image to be processed based on the initial latent variables.
[0024] In this embodiment, a preset variational autoencoder (VAE) is used to compress the image into the latent space to obtain the initial latent variables. The calculation formula is as follows: $z_0 = \mathcal{E}(I)$, where $z_0$ denotes the initial latent variables, $I$ is the target white light endoscope image, and $\mathcal{E}$ is the variational autoencoder.
[0025] Furthermore, the DDIM inversion algorithm is used to deterministically invert the initial latent variables to the noisy state over t steps, thereby obtaining the latent variables of the noisy state. In this way, an initial noise state that is semantically consistent with the original image is obtained through deterministic inversion.
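A minimal sketch of deterministic DDIM inversion and the matching deterministic sampler follows. The schedule `alpha_bar`, the toy noise predictor, and the function names are assumptions made for illustration; in practice the predictor is the pre-trained network. Inversion reuses the noise predicted at the current, less noisy state as an approximation of the noise at the next state, so the round trip is exact only when the prediction does not change along the trajectory.

```python
import numpy as np

def ddim_step(z, eps, a_from, a_to):
    """Deterministic DDIM move between cumulative alphas a_from and a_to."""
    z0_pred = (z - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * z0_pred + np.sqrt(1.0 - a_to) * eps

def ddim_invert(z0, eps_model, alpha_bar):
    """Walk a clean latent up the schedule (alpha_bar[0] = 1 is the clean end)."""
    z = z0
    for i in range(len(alpha_bar) - 1):
        eps = eps_model(z, i)
        z = ddim_step(z, eps, alpha_bar[i], alpha_bar[i + 1])
    return z

def ddim_sample(z_T, eps_model, alpha_bar):
    """Walk a noisy latent back down the schedule without adding fresh noise."""
    z = z_T
    for i in range(len(alpha_bar) - 1, 0, -1):
        eps = eps_model(z, i)
        z = ddim_step(z, eps, alpha_bar[i], alpha_bar[i - 1])
    return z

rng = np.random.default_rng(0)
alpha_bar = np.linspace(1.0, 0.02, 51)   # assumed toy schedule, clean to noisy

def toy_model(z, i):
    """Hypothetical stand-in for the pre-trained noise predictor."""
    return 0.1 * np.tanh(z)

z0 = rng.standard_normal((4, 8, 8))
z_T = ddim_invert(z0, toy_model, alpha_bar)
z_rec = ddim_sample(z_T, toy_model, alpha_bar)
```

Because no random noise is drawn, inverting the same latent twice yields the same noisy state, which is what ties that state to the content of the input image.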
[0026] Step S13: Perform frequency domain multi-scale decomposition on the target white light endoscope image to extract the corresponding multi-scale frequency components.
[0027] In this embodiment, image information is decoupled according to its frequency distribution to achieve differentiated control of different clinical features and to provide a physical reference for the guidance mechanism. Specifically, performing frequency domain multi-scale decomposition on the target white light endoscope image to extract the corresponding multi-scale frequency components may include: decomposing the target white light endoscope image into a first component, a second component, and a third component using a Laplacian pyramid, where the first component represents the fine structural information of the image, the second component represents the mesoscale texture information of the image, and the third component represents the global tone and illumination distribution information of the image. That is, a three-level Laplacian pyramid is constructed: the image is smoothed and downsampled to obtain the low-frequency component, and the high-frequency residual is obtained by subtracting the upsampled low-frequency image from the original image. As shown in Figure 2, the image $I$ is decomposed into $\{I_{\mathrm{low}}, I_{\mathrm{mid}}, I_{\mathrm{high}}\}$, where the low-frequency component $I_{\mathrm{low}}$ contains global style information such as the background base color and illumination distribution; the mid-frequency component $I_{\mathrm{mid}}$ contains the outlines of large blood vessels and tissue folds; and the high-frequency component $I_{\mathrm{high}}$ contains microvascular details, capillary networks, and mucosal pits.
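The three-band decomposition can be sketched as follows. This is a simplified illustration under stated assumptions: a single-channel image whose sides are divisible by 4, 2x2 average pooling in place of Gaussian smoothing, and nearest-neighbour upsampling. The function names are hypothetical.

```python
import numpy as np

def downsample(x):
    """Smooth and decimate by averaging each 2x2 block."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    """Nearest-neighbour upsampling by a factor of 2 along both axes."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_decompose(image):
    """Split an image into high-, mid-, and low-frequency components."""
    g1 = downsample(image)
    g2 = downsample(g1)
    high = image - upsample(g1)   # first component: fine structure
    mid = g1 - upsample(g2)       # second component: mesoscale texture
    low = g2                      # third component: global tone and illumination
    return high, mid, low

def laplacian_reconstruct(high, mid, low):
    """Invert the decomposition by adding the residuals back level by level."""
    return high + upsample(mid + upsample(low))

rng = np.random.default_rng(0)
image = rng.random((64, 64))
high, mid, low = laplacian_decompose(image)
restored = laplacian_reconstruct(high, mid, low)
```

The residual construction makes the decomposition lossless: summing the bands back up level by level recovers the input, so no structural information is discarded before the bands are encoded.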
[0028] Furthermore, utilizing the independently trained convolutional coding branches already completed in the aforementioned steps, the physical frequency features of the current white light image are injected into the generation process to achieve a hard constraint on the generated trajectory. Specifically, each component is encoded separately using independent convolutional coding branches to obtain a frequency conditional tensor aligned with the dimensions of the intermediate layers of the noise prediction network; the frequency conditional tensor is used to modulate the feature responses of the intermediate layers of the noise prediction network during the reverse denoising process.
[0029] Step S14: During reverse denoising from the initial noise state using the pre-trained latent space diffusion model, the frequency conditions based on the multi-scale frequency components are injected into the noise prediction network, and a frequency domain guided loss is constructed based on the frequency weights that change dynamically with the diffusion time step, so as to use the gradient of the frequency domain guided loss to correct the noise prediction and generate target latent variables that conform to the target staining domain distribution.
[0030] As shown in Figure 3, in this embodiment a dynamic frequency-aware mechanism is introduced into the denoising loop of the diffusion model to avoid anatomical structure drift and staining texture artifacts during diffusion generation. Specifically, the diffusion model focuses on different aspects of image information reconstruction at different stages of denoising. To accommodate this characteristic, this invention designs nonlinear frequency weighting functions that change dynamically with the diffusion time step $t$ ($t \in \{T, T-1, \dots, 1\}$) and are used to adjust the guidance strength of the different frequency features stage by stage. The frequency weights that change dynamically with the diffusion time step include: a first weight corresponding to the first component, which takes its largest value at the beginning of the reverse denoising process and decreases as the diffusion time steps advance; a second weight corresponding to the second component, which is enhanced by a preset function in the middle stage of the reverse denoising process to guide the generation of mesoscale textures; and a third weight corresponding to the third component, which increases monotonically as the reverse denoising process advances, so as to constrain the color distribution transfer. Specifically, the high-frequency structure weight (first weight) $w_h(t)$ follows an exponential decay function, so that it takes high values in the early stage of denoising. At this stage, the model primarily constructs the topological skeleton of the image; by maintaining strong high-frequency constraints, the model is forced to anchor the vessel orientation and edge contours of the original white light image, thus fixing the anatomical structure. The exponential decay function is expressed as follows: $w_h(t) = \lambda_h \exp\left(-\gamma_h \frac{T - t}{T}\right)$; where $\lambda_h$ and $\gamma_h$ are preset control parameters. The mid-frequency texture weight (second weight) $w_m(t)$ is enhanced in the intermediate stage using a Gaussian-like function to finely guide the texture generation of mucosal folds and glandular pits. The low-frequency color weight (third weight) $w_l(t)$ gradually increases as the denoising process progresses. At this stage, the image structure is largely finalized and the guidance shifts to the low-frequency components; strengthening the color distribution constraint makes the spectral transfer to the target NBI staining features more accurate and smooth. The specific expression is shown below: [expression missing from the source text].
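The staged behaviour of the three weights can be sketched with concrete functions. The functional forms and every parameter below are assumptions chosen only to match the qualitative description (exponential decay, a Gaussian-like bump in the middle stage, and a monotonic increase); the linear ramp for the low-frequency weight in particular is a placeholder. The helper name `frequency_weights` is hypothetical.

```python
import numpy as np

def frequency_weights(t, T, lam=(1.0, 1.0, 1.0), gamma=4.0, mu=0.5, sigma=0.15):
    """Return (w_high, w_mid, w_low) for diffusion time step t in {T, ..., 1}."""
    # Denoising progress: 0 at the first reverse step (t = T), close to 1 at t = 1.
    s = (T - np.asarray(t, dtype=np.float64)) / T
    w_high = lam[0] * np.exp(-gamma * s)                       # largest at the start
    w_mid = lam[1] * np.exp(-((s - mu) ** 2) / (2 * sigma ** 2))  # peaks mid-way
    w_low = lam[2] * s                                         # assumed linear ramp
    return w_high, w_mid, w_low

T = 100
steps = np.arange(T, 0, -1)   # order in which reverse denoising visits the steps
w_high, w_mid, w_low = frequency_weights(steps, T)
```

Plotting the three curves against `steps` shows the hand-over from structure guidance to texture guidance to color guidance as denoising proceeds.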
[0031] Furthermore, a latent space frequency extraction operator $\Phi_k$ is defined to separate, in real time, the high-, mid-, and low-frequency feature representations from the currently predicted latent variable $z_t$, and the frequency domain guided loss is then calculated. Specifically, constructing the frequency domain guided loss based on the frequency weights that change dynamically with the diffusion time step may include: using a preset latent space frequency extraction operator to separate, from the current latent variable, the frequency band feature representations corresponding to the first component, the second component, and the third component; using a preset frequency domain guided loss function to calculate the difference between each frequency band feature representation and the corresponding frequency conditional tensor; and performing a weighted summation of the differences using the first weight, the second weight, and the third weight to construct the frequency domain guided loss. That is, a frequency domain guided loss function $\mathcal{L}_{\mathrm{freq}}$ is constructed to compare, at the current iteration step, the predicted features with the frequency conditions encoded from the original white light image. The loss function measures the degree to which the generated image deviates physically from the original image, using the Euclidean distance between them. The frequency domain guided loss function is expressed as follows: $\mathcal{L}_{\mathrm{freq}}(z_t, t) = \sum_{k \in \{h, m, l\}} w_k(t)\, \lVert \Phi_k(z_t) - c_k \rVert_2^2$; where $w_h(t)$, $w_m(t)$, and $w_l(t)$ are the first, second, and third weights, and $c_k$ is the frequency conditional tensor of band $k$.
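One way to picture the weighted band loss is with FFT band masks standing in for the latent space frequency extraction operator. This is a sketch under assumptions: the radial cutoffs, the use of squared Euclidean distance, and the simplification that each frequency condition has the same shape as the latent are all illustrative choices, and the function names are hypothetical.

```python
import numpy as np

def radial_band_masks(h, w, cuts=(0.1, 0.3)):
    """Boolean masks that partition the 2-D spectrum into three radial bands."""
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    r = np.sqrt(fx ** 2 + fy ** 2)
    return {
        "high": r > cuts[1],
        "mid": (r > cuts[0]) & (r <= cuts[1]),
        "low": r <= cuts[0],
    }

def extract_band(z, mask):
    """Assumed extraction operator: keep one band of each channel's spectrum."""
    return np.fft.ifft2(np.fft.fft2(z) * mask).real

def frequency_guided_loss(z, targets, masks, weights):
    """Weighted sum of squared distances between latent bands and their targets."""
    total = 0.0
    for name, mask in masks.items():
        diff = extract_band(z, mask) - targets[name]
        total += weights[name] * float(np.sum(diff ** 2))
    return total

rng = np.random.default_rng(0)
masks = radial_band_masks(16, 16)
reference = rng.standard_normal((4, 16, 16))   # stand-in for the encoded conditions
targets = {k: extract_band(reference, m) for k, m in masks.items()}
weights = {"high": 1.0, "mid": 0.5, "low": 0.2}
z_t = rng.standard_normal((4, 16, 16))         # stand-in current latent
loss = frequency_guided_loss(z_t, targets, masks, weights)
```

The loss vanishes when every band of the latent matches its target, and the per-band weights decide which mismatches dominate at a given time step.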
[0032] Furthermore, the negative gradient of the frequency domain guided loss is injected into the noise predictor to achieve deep coupling between the generative prior and the physical constraints. Specifically, correcting the noise prediction using the gradient of the frequency domain guided loss may include: multiplying the gradient of the frequency domain guided loss with respect to the current latent variable by a preset guidance step size parameter, and then fusing it with the original noise prediction value output by the noise prediction network to obtain a corrected noise prediction value; the corrected noise prediction value is used to guide the latent variable update in the current iteration step. In other words, this mechanism forces the diffusion model, when generating pixel colors that conform to the NBI probability distribution, to satisfy the frequency energy constraint of the original white light image at the corresponding coordinates, so that key diagnostic features such as blood vessels remain statistically aligned with the high-frequency structure of the original image after the transformation. Through noise gradient fusion, the corrected final noise prediction value $\hat{\epsilon}_t$ is calculated as follows: $\hat{\epsilon}_t = \epsilon_\theta(z_t, t, c) + \eta\, \nabla_{z_t} \mathcal{L}_{\mathrm{freq}}(z_t, t)$; where $\epsilon_\theta(z_t, t, c)$ represents the original score obtained through the pre-training described in the aforementioned steps; $\nabla_{z_t} \mathcal{L}_{\mathrm{freq}}(z_t, t)$ represents the structural correction force from the frequency domain components of the original white light image, obtained by calculating the gradient of the loss function with respect to the current latent variable, which indicates the direction for correcting structural deviation; and $\eta$ is the step size parameter, used to adjust the balance between structural fidelity and generation flexibility. Because the predicted noise is subtracted when the latent variable is updated, adding this gradient term to the noise prediction moves the latent variable along the negative gradient of the loss.
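The gradient fusion step can be sketched for the special case where the band extraction is an FFT mask and the band difference is a squared Euclidean distance; both are assumptions for illustration, as are the function names. In that case each band operator is an orthogonal projection, so the gradient has a closed form and no automatic differentiation is needed. The sign convention assumes the usual update in which the predicted noise is subtracted from the latent.

```python
import numpy as np

def band_masks(h, w, cuts=(0.1, 0.3)):
    """Boolean masks partitioning the 2-D spectrum into high, mid, and low bands."""
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    r = np.sqrt(fx ** 2 + fy ** 2)
    return {"high": r > cuts[1], "mid": (r > cuts[0]) & (r <= cuts[1]), "low": r <= cuts[0]}

def project(z, mask):
    """Orthogonal projection of each channel onto one frequency band."""
    return np.fft.ifft2(np.fft.fft2(z) * mask).real

def freq_loss(z, targets, masks, weights):
    """Weighted sum of squared band differences."""
    return float(sum(weights[k] * np.sum((project(z, masks[k]) - targets[k]) ** 2)
                     for k in masks))

def freq_loss_grad(z, targets, masks, weights):
    """Closed-form gradient: sum over bands of 2 * w_k * P_k(P_k z - c_k)."""
    grad = np.zeros_like(z)
    for k in masks:
        residual = project(z, masks[k]) - targets[k]
        grad += 2.0 * weights[k] * project(residual, masks[k])
    return grad

def correct_noise(eps_pred, grad, eta):
    """Fuse the scaled loss gradient with the original noise prediction."""
    return eps_pred + eta * grad

rng = np.random.default_rng(0)
masks = band_masks(16, 16)
reference = rng.standard_normal((4, 16, 16))
targets = {k: project(reference, m) for k, m in masks.items()}
weights = {"high": 1.0, "mid": 0.5, "low": 0.2}
z_t = rng.standard_normal((4, 16, 16))
eps_pred = rng.standard_normal((4, 16, 16))   # stand-in network output
grad = freq_loss_grad(z_t, targets, masks, weights)
eps_hat = correct_noise(eps_pred, grad, eta=0.5)
```

A larger `eta` pulls the trajectory harder toward the frequency conditions at the cost of generation flexibility, which is the trade-off the step size parameter controls.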
[0033] Step S15: Map the target latent variable to the pixel space to obtain the corresponding virtual staining image.
[0034] In this embodiment, the final latent variable after denoising is mapped back to the pixel space through a decoder to obtain a preliminary virtual staining image. The image statistically conforms to the distribution characteristics of the target narrow-band staining domain, and the structural features of the white light image have already been constrained by the aforementioned frequency-guided diffusion process.
[0035] Furthermore, to ensure that the generated virtual staining image does not undergo unwanted spectral drift in the key structural frequency bands, a frequency domain consistency analysis is performed between the virtual staining image and the original white light image. Specifically, after mapping the target latent variable to the pixel space to obtain the corresponding virtual staining image, the process may further include: calculating the spectral energy difference between the virtual staining image and the target white light endoscope image in the frequency band corresponding to the first component; and, if the spectral energy difference exceeds a preset threshold, correcting the frequency band corresponding to the first component in the virtual staining image using a frequency domain mask to obtain a corrected virtual staining image. That is, when the energy distribution of the virtual staining image in the high-frequency band falls outside the preset threshold range, a restricted frequency domain correction operation is performed to further enhance structural fidelity. The frequency domain correction is computed from the following quantities: the spectral representations of the two images, obtained by applying a frequency domain transform operator to each image; a mask matrix limited to the high-frequency band, applied by element-wise multiplication; and frequency domain correction coefficients, which control the correction intensity.
Subsequently, the corrected virtual staining image is obtained through the inverse frequency domain transform. Through the above frequency domain consistency verification and restricted correction mechanism, the virtual staining result can be lightly adjusted within the structure-related frequency range without destroying the overall generation distribution of the diffusion model, thereby improving blood vessel clarity, edge sharpness, and visual stability.
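The consistency check and restricted correction can be sketched with a 2-D FFT. The exact correction formula is not reproduced here, so the blend below is only one plausible form and is an assumption: inside the high-frequency mask the stained spectrum is moved toward the white light spectrum by a single coefficient `lam`. The radial cutoff, the relative energy measure, the threshold value, and all function names are likewise hypothetical.

```python
import numpy as np

def high_band_mask(shape, cutoff=0.25):
    """Boolean mask selecting FFT bins whose radial frequency is at least `cutoff`."""
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    return np.sqrt(fx ** 2 + fy ** 2) >= cutoff

def band_energy_gap(stained, white, mask):
    """Relative difference of spectral energy inside the masked band."""
    e_s = np.sum(np.abs(np.fft.fft2(stained))[mask] ** 2)
    e_w = np.sum(np.abs(np.fft.fft2(white))[mask] ** 2)
    return abs(e_s - e_w) / (e_w + 1e-12)

def correct_high_band(stained, white, mask, lam=0.5):
    """Blend the white light spectrum into the stained spectrum inside the mask.

    Assumed form: F_corr = F_s + lam * M * (F_w - F_s), followed by an
    inverse FFT. Bins outside the mask are left untouched.
    """
    f_s, f_w = np.fft.fft2(stained), np.fft.fft2(white)
    f_corr = f_s + lam * mask * (f_w - f_s)
    return np.fft.ifft2(f_corr).real

rng = np.random.default_rng(2)
white = rng.standard_normal((32, 32))   # stands in for one channel of the white light image
stained = 0.6 * white                   # toy output whose fine detail is too weak
mask = high_band_mask(white.shape)
threshold = 0.2                         # hypothetical preset threshold

gap_before = band_energy_gap(stained, white, mask)
if gap_before > threshold:
    stained = correct_high_band(stained, white, mask, lam=0.5)
gap_after = band_energy_gap(stained, white, mask)
```

In this toy case the relative high-band energy gap falls from about 0.64 to about 0.36 after one correction, while all bins outside the mask keep their original values.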
[0036] As can be seen from the above, this invention introduces frequency domain multi-scale decomposition and a dynamic frequency-aware guidance mechanism into the diffusion process in order to preserve anatomical structures at the pixel level and suppress medical image artifacts. By performing explicit frequency domain decomposition on the original white light image, the high-frequency components carrying core anatomical information are used as rigid constraints. This addresses problems such as vascular rupture, edge blurring, and structural illusion, and keeps the direction of microvessels accurate for clinical diagnosis. A frequency weighting function that dynamically changes with the diffusion time step t is adopted, enabling precise reconstruction of the low-frequency color distribution in the later stages of sampling. Compared with the color drift and background noise that easily occur in existing technologies, the NBI-style image generated by this scheme not only visually approximates real NBI imaging, but its feature distribution also better conforms to the hemoglobin absorption characteristics under narrow-band spectra. By eliminating the reliance on vague text prompts and instead using deterministic frequency domain components as conditional guidance, the randomness of the generation model is greatly suppressed, and the method remains robust in complex endoscopic scenes containing reflective points, mucus occlusion, or motion blur. Because frequency domain decomposition uses the physical properties of images as priors, this embodiment exhibits stronger generalization ability in small-sample scenarios.
Compared with complex models that require tens of thousands of paired images for training, this embodiment requires only a small number of target domain samples for fine-tuning, which significantly reduces training computation and data acquisition costs while maintaining high performance and facilitates rapid deployment on primary healthcare equipment. By comparing the frequency domain energy distribution of the generated image with that of the input image, this embodiment can automatically identify and correct potentially abnormal generated regions, so that every virtual staining image output to doctors has been checked for structural consistency with its input image, which reduces the risk of misdiagnosis.
[0037] As shown in Figure 4, this application discloses an endoscopic virtual staining device based on a frequency domain guided diffusion model, comprising: an image acquisition module 11, used to acquire a white light endoscope image to be processed and to preprocess the white light endoscope image to obtain a target white light endoscope image; an initial noise state acquisition module 12, used to map the target white light endoscope image to the latent space using a preset variational autoencoder to obtain the corresponding initial latent variable, and to obtain, based on the initial latent variable and using a preset deterministic inversion algorithm, an initial noise state that is semantically consistent with the white light endoscope image to be processed; a frequency domain analysis module 13, used to perform frequency domain multi-scale decomposition on the target white light endoscope image to extract the corresponding multi-scale frequency components; a target latent variable generation module 14, used to inject frequency conditions based on the multi-scale frequency components into the noise prediction network during the reverse denoising process performed from the initial noise state using the pre-trained latent space diffusion model, and to construct a frequency domain guided loss based on the frequency weights that dynamically change with the diffusion time step, so as to correct the noise prediction using the gradient of the frequency domain guided loss and generate a target latent variable that conforms to the target staining domain distribution; and an image reconstruction module 15, used to map the target latent variable to the pixel space to obtain the corresponding virtual staining image.
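The multi-scale decomposition performed by the frequency domain analysis module 13 can be illustrated with a small Laplacian pyramid, the decomposition this description names for that module. The sketch below is a plain NumPy illustration under assumptions: the 5-tap binomial kernel, edge padding, pixel-repeat upsampling, the single-channel input, and the function names are all choices made for the example rather than details taken from this application.

```python
import numpy as np

KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0   # 5-tap binomial low-pass

def blur(img):
    """Separable low-pass filter with edge padding; output has the input's shape."""
    padded = np.pad(img, 2, mode="edge")
    rows = sum(k * padded[:, i:i + img.shape[1]] for i, k in enumerate(KERNEL))
    return sum(k * rows[i:i + img.shape[0], :] for i, k in enumerate(KERNEL))

def down(img):
    """Blur, then keep every second pixel."""
    return blur(img)[::2, ::2]

def up(img, shape):
    """Pixel-repeat upsampling followed by a blur, cropped to `shape`."""
    big = np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)[:shape[0], :shape[1]]
    return blur(big)

def laplacian_three_bands(img):
    """Return (fine, mid, coarse): two Laplacian levels and the low-pass residual."""
    g1 = down(img)
    g2 = down(g1)
    fine = img - up(g1, img.shape)     # first component: fine structure
    mid = g1 - up(g2, g1.shape)        # second component: mesoscale texture
    return fine, mid, g2               # third component: global tone and illumination

def reconstruct(fine, mid, coarse):
    """Invert the decomposition by upsampling and adding the levels back."""
    g1 = mid + up(coarse, mid.shape)
    return fine + up(g1, fine.shape)

rng = np.random.default_rng(3)
image = rng.random((64, 64))           # stands in for one channel of the preprocessed image
fine, mid, coarse = laplacian_three_bands(image)
restored = reconstruct(fine, mid, coarse)
```

Because each Laplacian level stores exactly what the upsampled coarser level misses, the three components add back to the input, so no structural information is lost before the components are encoded as frequency conditions.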
[0038] In some specific embodiments, the frequency domain analysis module 13 may specifically include: a frequency domain analysis unit, used to decompose the target white light endoscope image into a first component, a second component, and a third component using the Laplacian pyramid, wherein the first component represents the fine structure information of the image, the second component represents the mesoscale texture information of the image, and the third component represents the global tone and illumination distribution information of the image; and a frequency domain coding unit, used to encode each component separately using independent convolutional coding branches to obtain a frequency conditional tensor aligned with the dimensions of the intermediate layer of the noise prediction network, the frequency conditional tensor being used to modulate the feature response of the intermediate layer of the noise prediction network during the reverse denoising process. Accordingly, the frequency weights that dynamically change with the diffusion time step include: a first weight corresponding to the first component, which has its largest value at the beginning of the reverse denoising process and decreases as the diffusion time step progresses; a second weight corresponding to the second component, which is enhanced by a preset function in the middle stage of the reverse denoising process to guide the generation of mesoscale textures; and a third weight corresponding to the third component, which monotonically increases with the increase of the diffusion time step in the reverse denoising process to constrain the migration of the color distribution.
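The three time-dependent weights can be illustrated with simple closed-form schedules. Everything concrete below is an assumption made for illustration: the schedules are parameterized by the progress of reverse denoising (0 at the first reverse step, 1 at the last), the third weight is taken to grow toward the end of sampling, the linear and Gaussian shapes stand in for the unspecified preset functions, and `frequency_weights`, `mid_center`, and `mid_width` are hypothetical names.

```python
import math

def frequency_weights(progress, mid_center=0.5, mid_width=0.15):
    """Return (w_first, w_second, w_third) for reverse denoising progress in [0, 1].

    progress = 0 is the start of reverse denoising (most noise) and
    progress = 1 is the final step. All three shapes are illustrative.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    # First weight: largest at the start, then decays.
    w_first = 1.0 - progress
    # Second weight: a bump that peaks in the middle stage.
    w_second = math.exp(-((progress - mid_center) / mid_width) ** 2)
    # Third weight: grows monotonically as sampling advances.
    w_third = progress
    return w_first, w_second, w_third

num_steps = 50
schedule = [frequency_weights(i / (num_steps - 1)) for i in range(num_steps)]
```

Each entry of `schedule` supplies the first, second, and third weight for one reverse step, so early steps emphasize fine structure, middle steps emphasize mesoscale texture, and late steps emphasize global tone.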
Furthermore, the target latent variable generation module 14 may specifically include: a feature extraction unit, used to separate, from the current latent variable, the frequency band feature representations corresponding to the first component, the second component, and the third component using a preset latent space frequency extraction operator; a frequency domain guided loss construction unit, used to calculate the difference between each frequency band feature representation and the corresponding frequency conditional tensor using a preset frequency domain guided loss function, and to perform a weighted summation of the differences using the first weight, the second weight, and the third weight to construct the frequency domain guided loss; and a noise prediction generation unit, used to multiply the gradient of the frequency domain guided loss with respect to the current latent variable by a preset guided step size parameter, and then fuse it with the original noise prediction value output by the noise prediction network to obtain a corrected noise prediction value, the corrected noise prediction value being used to guide the latent variable update of the current iteration step.
[0039] In some specific embodiments, the endoscopic virtual staining device based on the frequency domain guided diffusion model may further include: a model training unit, used to collect a target staining domain image dataset and use the dataset to perform distribution transfer fine-tuning on the noise prediction network of the initial latent space diffusion model to obtain the pre-trained latent space diffusion model; a difference calculation unit, used to calculate the spectral energy difference between the virtual staining image and the target white light endoscope image in the frequency band corresponding to the first component; and an image correction unit, used to correct the frequency band corresponding to the first component in the virtual staining image by means of a frequency domain mask if the spectral energy difference exceeds a preset threshold, so as to obtain a corrected virtual staining image.
[0040] Furthermore, embodiments of this application also disclose an electronic device. Figure 5 is a structural diagram of an electronic device 20 according to an exemplary embodiment, and the content of the diagram should not be construed as limiting the scope of this application. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input/output interface 25, and a communication bus 26. The memory 22 stores a computer program, which is loaded and executed by the processor 21 to implement the relevant steps in the endoscopic virtual staining method based on a frequency domain guided diffusion model disclosed in any of the foregoing embodiments. Furthermore, the electronic device 20 in this embodiment may specifically be a computer.
[0041] In this embodiment, the power supply 23 is used to provide operating voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and external devices, and the communication protocol it follows can be any communication protocol applicable to the technical solution of this application, and is not specifically limited here; the input / output interface 25 is used to acquire external input data or output data to the outside world, and its specific interface type can be selected according to specific application needs, and is not specifically limited here.
[0042] In addition, the memory 22, as a carrier for resource storage, can be a read-only memory, random access memory, disk or optical disk, etc. The resources stored thereon can include operating system 221, computer program 222, etc., and the storage method can be temporary storage or permanent storage.
[0043] The operating system 221, which may be Windows Server, Netware, Unix, Linux, etc., is used to manage and control the various hardware devices on the electronic device 20 and the computer program 222. In addition to a computer program that, when executed by the electronic device 20, performs the endoscopic virtual staining method based on a frequency domain guided diffusion model disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs capable of performing other specific tasks.
[0044] Furthermore, this application also discloses a computer-readable storage medium for storing a computer program; wherein, when the computer program is executed by a processor, it implements the aforementioned endoscopic virtual staining method based on a frequency-domain guided diffusion model. Specific steps of this method can be found in the corresponding content disclosed in the foregoing embodiments, and will not be repeated here.
[0045] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. For the apparatus disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, the description is relatively simple; relevant parts can be referred to in the method section.
[0046] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0047] The steps of the methods or algorithms described in conjunction with the embodiments disclosed herein can be implemented directly by hardware, a software module executed by a processor, or a combination of both. The software module can be located in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the art.
[0048] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0049] The technical solutions provided in this application have been described in detail above. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the methods and core ideas of this application. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this application. Therefore, the content of this specification should not be construed as a limitation of this application.
Claims
1. An endoscopic virtual staining method based on a frequency domain guided diffusion model, characterized in that it comprises: acquiring a white light endoscope image to be processed, and preprocessing the white light endoscope image to obtain a target white light endoscope image; mapping the target white light endoscope image to the latent space using a preset variational autoencoder to obtain the corresponding initial latent variable, and obtaining, based on the initial latent variable and using a preset deterministic inversion algorithm, an initial noise state that is semantically consistent with the white light endoscope image to be processed; performing frequency domain multi-scale decomposition on the target white light endoscope image to extract the corresponding multi-scale frequency components; in the process of reverse denoising from the initial noise state using a pre-trained latent space diffusion model, injecting frequency conditions based on the multi-scale frequency components into the noise prediction network, and constructing a frequency domain guided loss based on frequency weights that dynamically change with the diffusion time step, so as to correct the noise prediction using the gradient of the frequency domain guided loss and generate a target latent variable that conforms to the target staining domain distribution; and mapping the target latent variable to the pixel space to obtain the corresponding virtual staining image.
2. The endoscopic virtual staining method based on a frequency domain guided diffusion model according to claim 1, characterized in that the step of performing frequency domain multi-scale decomposition on the target white light endoscope image to extract the corresponding multi-scale frequency components includes: decomposing the target white light endoscope image into a first component, a second component, and a third component using the Laplacian pyramid, wherein the first component represents the fine structure information of the image, the second component represents the mesoscale texture information of the image, and the third component represents the global tone and illumination distribution information of the image; and encoding each component separately using an independent convolutional coding branch to obtain a frequency conditional tensor aligned with the dimensions of the intermediate layer of the noise prediction network, the frequency conditional tensor being used to modulate the feature response of the intermediate layer of the noise prediction network during the reverse denoising process.
3. The endoscopic virtual staining method based on a frequency domain guided diffusion model according to claim 2, characterized in that the frequency weights that dynamically change with the diffusion time step include: a first weight corresponding to the first component, which has its largest value at the beginning of the reverse denoising process and decreases as the diffusion time step progresses; a second weight corresponding to the second component, which is enhanced by a preset function in the middle stage of the reverse denoising process to guide the generation of mesoscale textures; and a third weight corresponding to the third component, which monotonically increases with the increase of the diffusion time step in the reverse denoising process to constrain the migration of the color distribution.
4. The endoscopic virtual staining method based on a frequency domain guided diffusion model according to claim 3, characterized in that constructing the frequency domain guided loss based on the frequency weights that dynamically change with the diffusion time step includes: separating, from the current latent variable, the frequency band feature representations corresponding to the first component, the second component, and the third component using a preset latent space frequency extraction operator; and calculating the difference between each frequency band feature representation and the corresponding frequency conditional tensor using a preset frequency domain guided loss function, and performing a weighted summation of the differences using the first weight, the second weight, and the third weight to construct the frequency domain guided loss.
5. The endoscopic virtual staining method based on a frequency domain guided diffusion model according to claim 4, characterized in that correcting the noise prediction using the gradient of the frequency domain guided loss includes: multiplying the gradient of the frequency domain guided loss with respect to the current latent variable by a preset guided step size parameter, and then fusing it with the original noise prediction value output by the noise prediction network to obtain a corrected noise prediction value, the corrected noise prediction value being used to guide the latent variable update in the current iteration step.
6. The endoscopic virtual staining method based on a frequency domain guided diffusion model according to claim 1, characterized in that it further comprises: collecting a target staining domain image dataset, and using the dataset to perform distribution transfer fine-tuning on the noise prediction network of an initial latent space diffusion model to obtain the pre-trained latent space diffusion model.
7. The endoscopic virtual staining method based on a frequency domain guided diffusion model according to claim 2, characterized in that, after mapping the target latent variable to the pixel space to obtain the corresponding virtual staining image, the method further includes: calculating the spectral energy difference between the virtual staining image and the target white light endoscope image in the frequency band corresponding to the first component; and, if the spectral energy difference exceeds a preset threshold, correcting the frequency band corresponding to the first component in the virtual staining image using a frequency domain mask to obtain a corrected virtual staining image.
8. An endoscopic virtual staining device based on a frequency domain guided diffusion model, characterized in that it comprises: an image acquisition module, used to acquire a white light endoscope image to be processed and to preprocess the white light endoscope image to obtain a target white light endoscope image; an initial noise state acquisition module, used to map the target white light endoscope image to the latent space using a preset variational autoencoder to obtain the corresponding initial latent variable, and to obtain, based on the initial latent variable and using a preset deterministic inversion algorithm, an initial noise state that is semantically consistent with the white light endoscope image to be processed; a frequency domain analysis module, used to perform frequency domain multi-scale decomposition on the target white light endoscope image to extract the corresponding multi-scale frequency components; a target latent variable generation module, used to inject frequency conditions based on the multi-scale frequency components into the noise prediction network during the reverse denoising process performed from the initial noise state using a pre-trained latent space diffusion model, and to construct a frequency domain guided loss based on frequency weights that dynamically change with the diffusion time step, so as to correct the noise prediction using the gradient of the frequency domain guided loss and generate a target latent variable that conforms to the target staining domain distribution; and an image reconstruction module, used to map the target latent variable to the pixel space to obtain the corresponding virtual staining image.
9. An electronic device, characterized in that it comprises: a memory, used to store a computer program; and a processor, used to execute the computer program to implement the endoscopic virtual staining method based on a frequency domain guided diffusion model as described in any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that it is used to store a computer program, wherein the computer program, when executed by a processor, implements the endoscopic virtual staining method based on a frequency domain guided diffusion model as described in any one of claims 1 to 7.