3D medical image one-step generative segmentation method and system based on average flow model and medium

CN122244443A · Pending · Publication Date: 2026-06-19 · LANZHOU UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
LANZHOU UNIV
Filing Date
2026-03-24
Publication Date
2026-06-19


Abstract

This invention discloses a one-step generative segmentation method, system, and medium for 3D medical images based on the average flow model, belonging to the field of medical image processing technology. The invention acquires the 3D medical image to be segmented and anatomical condition information, samples Gaussian noise as an initial latent variable, and inputs it into a MeanFlow network after temporal embedding. Anatomical conditions are input into a VeloMod module to generate scale and offset tensors, and the MeanFlow network features are modulated pixel-by-pixel. The average velocity field is calculated using the average flow identity, and target distribution features are generated through one-step mapping. A 3D segmentation mask aligned with the original image space is output by a 3D decoder. This invention achieves single-step function evaluation and inference, significantly improving segmentation speed and anatomical fidelity. It possesses advantages such as small-sample generalization, multimodal robustness, missing modality compatibility, and strong interpretability, meeting the needs of real-time clinical navigation, intraoperative planning, and high-throughput screening. It has significant application value in the field of intelligent 3D medical image segmentation.

Description

Technical Field

[0001] This invention relates to the field of medical image processing technology. More specifically, this invention relates to a one-step generative segmentation method, system, and medium for 3D medical images based on an average flow model.

Background Technology

[0002] In modern medical diagnostic systems, precise segmentation of anatomical structures in 3D medical images such as CT, MRI, and PET scans is a crucial foundation for clinical diagnosis, radiotherapy planning, surgical navigation, and intelligent medical analysis. Related technologies have evolved from traditional image processing methods to deep learning-driven automated segmentation techniques.

[0003] In recent years, deep learning has been widely applied in the field of 3D medical image segmentation due to its powerful feature extraction and representation capabilities. Discriminative segmentation models, represented by 3D U-Net, construct an encoder-decoder structure through convolutional neural networks, effectively extracting local and global spatial features of images, and have become the mainstream solution in clinical practice and scientific research. With the development of generative artificial intelligence, diffusion models have provided a new technical direction for handling complex topological changes, noise interference, and morphological diversity in medical images. Existing technologies use a volumetric conditioning module (VCM) to conditionally guide 3D latent diffusion models (3D-LDM), improving the generation quality and segmentation accuracy of 3D anatomical structures by introducing semantic priors, local images, and other external conditions, demonstrating certain advantages in the reconstruction of complex anatomical structures.

[0004] Although generative segmentation methods based on diffusion models have made some progress, they still have several obvious shortcomings in practical applications. First, inference efficiency is low and real-time performance is poor. Diffusion models rely on a multi-step iterative Markov chain process for denoising, with the number of function evaluations (NFE) typically ranging from 50 to 1000. This results in high computational overhead and inference latency when processing high-resolution 3D volumetric data, making it difficult to meet the response-speed requirements of clinical scenarios such as surgical navigation and real-time diagnosis. Although some existing technologies use consistency models and similar methods to reduce the number of sampling steps, under single-step inference (1-NFE) conditions they are prone to blurred segmentation results and geometric distortion.

[0005] Second, the modeling capability for nonlinear evolution trajectories is insufficient. Traditional flow matching and diffusion models mainly model the instantaneous velocity field of probabilistic flows, while the mapping path from noise to the segmentation mask in 3D medical images has highly nonlinear and complex spatial topological characteristics. Existing methods can only perform local linear extrapolation during single-step generation, making it difficult to capture fine anatomical boundaries and easily leading to topological breaks, structural discontinuities, and blurred boundaries at the segmentation edges.

[0006] Third, the depth of conditional feature fusion is insufficient. Existing volume-guided methods often use simple methods such as shallow stitching and residual connections to fuse conditional information with the generation path, which makes it difficult to fully explore the complex relationship between grayscale features, lesion patterns and spatial structures in medical images, resulting in insufficient segmentation reliability in low-contrast and heterogeneous lesion scenarios.

[0007] Fourth, the model's generalization ability and interpretability are weak. Existing diffusion-type segmentation models have complex structures, opaque decision-making processes, and poor interpretability. When faced with different imaging devices, different modalities, or previously unseen heterogeneous lesions, model performance is prone to significant decline, and robustness falls short of actual clinical needs.

Summary of the Invention

[0008] One object of the present invention is to solve at least the above-mentioned problems and to provide at least the advantages that will be described later.

[0009] Another objective of this invention is to provide a one-step generative segmentation method, system, and medium for 3D medical images based on the average flow model. By combining the one-step generation characteristics of the average flow model with the spatial control capability of the asymmetric volume conditioning module, a direct mapping from the noise distribution to an accurate segmentation mask is achieved within a single inference step.

[0010] To achieve these objectives and other advantages according to the present invention, a one-step generative segmentation method for 3D medical images based on an average flow model is provided, comprising the following steps:
acquiring the 3D medical image data to be segmented and manually specified anatomical condition information;
sampling the initial latent variable z1 from a standard Gaussian distribution;
applying position encoding to the preset start time and end time separately, and processing the results through a multilayer perceptron to obtain the time embedding information;
inputting the anatomical condition information into the VeloMod module to extract multi-scale spatial volume features and generate modulation parameters that guide the generation process, the modulation parameters including the scale tensor S and the offset tensor B;
inputting the 3D medical image data to be segmented, the initial latent variable z1, and the time embedding information into the MeanFlow network while injecting the modulation parameters {S,B} into the intermediate feature map of the MeanFlow network by pixel-by-pixel linear modulation, so that the MeanFlow network fuses the spatial features of the 3D medical image data and, under the constraint of the modulation parameters, uses the average flow identity to calculate the average velocity field u from the noise distribution to the target segmentation mask over the interval from the start time to the end time;
generating the feature representation z0 of the target distribution from the average velocity field u through the one-step mapping formula z0=z1-u;
inputting the feature representation z0 into the 3D decoder, restoring it to a voxel space segmentation mask aligned with the spatial dimensions of the 3D medical image data to be segmented, and outputting the final 3D medical image segmentation result.

[0011] Preferably, the VeloMod module adopts an asymmetric 3D U-Net architecture in which the encoder is deeper than the decoder, and generates a scale tensor S and an offset tensor B through a dual-head output layer. These are used to perform pixel-wise linear modulation on the intermediate feature maps of the MeanFlow network. The modulation formula is as follows:

$$h' = S \odot h \oplus B$$

where h is the intermediate feature map before modulation, h' is the intermediate feature map after modulation, ⊙ represents element-wise multiplication, and ⊕ represents element-wise addition.

[0012] Preferably, the MeanFlow network is based on the Transformer architecture, integrates the adaLN-Zero mechanism, receives the time embedding information, and outputs an average velocity field u. The average velocity field u is defined as:

$$u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau$$

where u is the average velocity field, z_t is the state variable at time t, z_τ is the state variable at time τ, r is the start time, t is the end time, and v(z_τ, τ) is the instantaneous velocity field at time τ.

[0013] Preferably, the average flow identity is:

$$u(z_t, r, t) = v(z_t, t) - (t - r)\,\frac{d}{dt} u(z_t, r, t)$$

where u is the average velocity field and v is the instantaneous velocity field. During training, the instantaneous velocity field v is approximated by the difference between the noise n and the true segmentation mask x, i.e., v = n - x, and the average velocity field u is optimized in a self-supervised manner, where n is the initial latent variable z1 obtained by sampling from the standard Gaussian distribution.

[0014] Preferably, the VeloMod module further includes a modality latent dropout mechanism, which randomly sets the latent features of a specific modality to zero at the output of the multi-branch parallel encoder according to a preset probability, so as to handle missing modalities robustly.

[0015] Preferably, the total loss function of the method is:

$$L = L_{\mathrm{MeanFlow}} + \lambda \left( L_1(S) + L_1(B) \right)$$

where L_MeanFlow is the squared L2 error of the average flow identity, L1(S) is the L1 regularization term of the scale tensor S, L1(B) is the L1 regularization term of the offset tensor B, and λ is the regularization coefficient.

[0016] This invention also provides a one-step generative segmentation system for 3D medical images based on an average flow model, characterized in that it includes:
a data acquisition module, used to acquire the 3D medical image data to be segmented and the anatomical condition information specified by the user;
a noise sampling module, used to sample the initial latent variable z1 from a standard Gaussian distribution;
a time embedding module, used to apply position encoding to the preset start time and end time separately and process the results through a multilayer perceptron to obtain the time embedding information;
a VeloMod module, used to receive the anatomical condition information, extract multi-scale spatial volume features, and generate modulation parameters that guide the generation process, the modulation parameters including the scale tensor S and the offset tensor B;
a MeanFlow network module, used to receive the 3D medical image data to be segmented, the initial latent variable z1, and the time embedding information, fuse the spatial features of the 3D medical image data, receive the modulation parameters {S,B} and inject them into the intermediate feature map by pixel-by-pixel linear modulation, and, under the constraint of the modulation parameters, use the average flow identity to calculate the average velocity field u from the noise distribution to the target segmentation mask over the interval from the start time to the end time;
a one-step generation module, used to generate the feature representation z0 of the target distribution from the average velocity field u through the one-step mapping formula z0=z1-u;
a 3D decoder module, used to restore the feature representation z0 to a voxel space segmentation mask aligned with the spatial dimensions of the 3D medical image data to be segmented;
an output module, used to output the final 3D medical image segmentation result.

[0017] Preferably, the VeloMod module is an asymmetric 3D U-Net architecture, in which the number of encoder channels doubles layer by layer, the number of decoder channels halves layer by layer, and the output layer is a dual-head structure that outputs the scale tensor S and the offset tensor B respectively.

[0018] Preferably, the MeanFlow network module adopts a Transformer structure, integrates temporal embedding and adaLN-Zero mechanism, and supports average velocity field prediction under single-step inference.

[0019] The present invention also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the above-described one-step generative segmentation method for 3D medical images based on the average flow model.

[0020] This invention combines an average flow model with a velocity-driven volume modulation module (VeloMod) to construct a single-step generative 3D medical image segmentation architecture, which has the following significant advantages compared to existing technologies:

1. Order-of-magnitude improvement in inference efficiency, meeting real-time clinical needs: This invention employs an average flow model to achieve single-step function evaluation (1-NFE), simplifying the 50-1000 step iterative denoising process of the traditional diffusion model into a one-step direct mapping and significantly reducing computational overhead and memory usage. Under the same hardware conditions, the inference time for a single 3D medical image is only about 128 ms, far lower than traditional diffusion models and mainstream discriminative models. This meets the stringent response speed requirements of clinical scenarios such as real-time surgical navigation, intraoperative planning, and high-throughput image screening, and makes deployment on edge clinical devices easier.

[0021] 2. Significantly Improved Single-Step Generation Accuracy and Anatomical Fidelity: This invention overcomes the problems of edge blurring, topological breaks, and funnel effects caused by traditional flow matching relying solely on instantaneous velocity by modeling the global evolution trajectory from noise to the segmentation mask through average velocity field modeling. Combined with the VeloMod module, it achieves precise spatial feature modulation, effectively capturing complex anatomical boundaries and fine structures. On public datasets, the Dice coefficient reaches 92.45%, and the 95% Hausdorff distance (HD95) is as low as 3.12 mm. Both segmentation accuracy and geometric fidelity are superior to existing discriminative models and multi-step diffusion models.

[0022] 3. Strong learning ability with small samples and higher data utilization efficiency: This invention relies on the global distribution modeling capability of average flow and the strong conditional guidance capability of VeloMod to maintain excellent segmentation performance even in scenarios with scarce labeled data. Under the extreme condition of only 10 training samples, the segmentation Dice coefficient can still reach 89.2%, which is far higher than similar advanced models, significantly reducing the dependence on large-scale labeled data and adapting to practical scenarios where medical data labeling costs are high and sample sizes are limited.

[0023] 4. Strong multimodal adaptability and robustness to missing modalities: The VeloMod module has a built-in modal latent discarding mechanism, supporting various conditional inputs such as semantic masks, local images, and anatomical priors. It can also robustly handle missing modalities, achieving "plug-and-play" multimodal conditional guidance. The model can flexibly adapt to 3D medical images from different imaging devices and different imaging modalities, significantly improving its cross-device and cross-modal generalization ability.

[0024] 5. Deeper fusion of conditional features and more stable segmentation of complex scenes: This invention achieves deep fusion of conditional information and generated features through pixel-by-pixel linear modulation, which can fully explore information such as texture, contrast, and spatial topology in medical images. It has a stronger feature capture capability for complex scenes such as low contrast and morphologically heterogeneous lesions, effectively suppresses imaging artifacts and noise interference, and improves the reliability of segmentation results.

[0025] 6. Enhanced Model Interpretability, Supporting Transparent Clinical Decision-Making: The average velocity field learned in this invention has a clear physical meaning and can intuitively represent the displacement mapping relationship from noise to the segmentation mask. It transforms the "black box" iterative process of traditional diffusion models into an understandable and visualized flow field trajectory, making it easier for clinicians to understand the basis of model decisions and improving the credibility and usability of intelligent segmentation results in clinical diagnosis and treatment.

[0026] Other advantages, objectives and features of the present invention will become apparent in part from the following description, and in part to those skilled in the art through study and practice of the invention.

Attached Figure Description

[0027] Figure 1 is a flowchart of the VeloVox model described in this invention; Figure 2 is a flowchart of the one-step generative segmentation method for 3D medical images based on the average flow model described in this invention; Figure 3 is a flowchart of the workflow of the MeanFlow network and the one-step generation module described in this invention; Figure 4 is a flowchart of the workflow of the VeloMod module described in this invention.

Detailed Implementation

[0028] The present invention will now be described in further detail with reference to the accompanying drawings, so that those skilled in the art can implement it based on the description.

[0029] This invention provides a one-step generative segmentation method for 3D medical images based on an average flow model. This method is implemented by constructing a VeloVox (Velocity-driven Voxel-wise Volumetric Flow) model. As shown in Figure 1, its core architecture mainly consists of three parts: the VeloMod module, the MeanFlow network, and the 3D decoder.

[0030] Referring to Figures 1-4, the VeloVox model architecture described in this invention includes the following components: Input layer: Receives the 3D medical image data x to be segmented (e.g., an MRI volume of 160×224×160 voxels) and manually specified anatomical condition information y_i (for example, semantic masks or local grayscale images of certain regions). Simultaneously, the model samples an initial latent variable z1 from a standard Gaussian distribution, serving as the starting point for the generation process.

[0031] Temporal embedding layer: The preset start time r=0 and end time t=1 are sinusoidally position encoded and then processed by a two-layer multilayer perceptron (MLP) and summed to obtain temporal embedding information. This information will be injected into each computational layer of the MeanFlow network to indicate the time interval of the generation process.
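The temporal embedding described above can be sketched as follows. This is a minimal NumPy illustration, not the patented implementation: the embedding widths, the `max_period` constant, the use of one separate MLP per time value, the SiLU nonlinearity, and all function names are assumptions introduced here.

```python
import numpy as np

def sinusoidal_encoding(t, dim=64, max_period=10000.0):
    """Map a scalar time to sines and cosines at geometrically spaced frequencies."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def init_mlp(rng, dim_in, dim_hidden, dim_out):
    """Random weights for a two-layer MLP (illustrative initialisation only)."""
    return {
        "w1": rng.standard_normal((dim_in, dim_hidden)) / np.sqrt(dim_in),
        "b1": np.zeros(dim_hidden),
        "w2": rng.standard_normal((dim_hidden, dim_out)) / np.sqrt(dim_hidden),
        "b2": np.zeros(dim_out),
    }

def mlp(params, x):
    """Two-layer MLP with a SiLU nonlinearity between the layers (assumed)."""
    h = x @ params["w1"] + params["b1"]
    h = h / (1.0 + np.exp(-h))  # SiLU: h * sigmoid(h)
    return h @ params["w2"] + params["b2"]

def time_embedding(r, t, mlp_r, mlp_t, dim=64):
    """Encode r and t separately, pass each through its MLP, then sum the results."""
    return mlp(mlp_r, sinusoidal_encoding(r, dim)) + mlp(mlp_t, sinusoidal_encoding(t, dim))

rng = np.random.default_rng(0)
mlp_r = init_mlp(rng, 64, 128, 128)
mlp_t = init_mlp(rng, 64, 128, 128)
emb = time_embedding(0.0, 1.0, mlp_r, mlp_t)  # r = 0, t = 1 as in the text
```

Because r and t are fixed at 0 and 1 for one-step inference, this embedding could be computed once and cached.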

[0032] The VeloMod module employs an asymmetric 3D U-Net architecture whose encoder is significantly deeper than its decoder. Specifically, the encoder has a base width of 16 channels with channel multipliers [1, 2, 4, 8]. Each downsampling stage consists of residual blocks, each containing GroupNorm normalization, a Swish activation function, and a 3D convolutional layer. The decoder's channel count is halved layer by layer to reduce computational overhead. The VeloMod module also incorporates a Modality Latent Dropout mechanism. Unlike conventional dropout, which deactivates random neurons, this mechanism zeroes the entire latent feature tensor of a specific modality at the output of the multi-branch parallel encoder according to a preset probability. This enables robust handling of missing modality inputs and improves generalization and stability under multimodal conditional inputs. The VeloMod module receives the anatomical condition information y_i, extracts features from the different modalities through multiple 3D convolutional layers, processes them through the modality latent dropout mechanism, fuses the resulting features through a concatenation aggregator, and finally generates a scale tensor S and an offset tensor B through a dual-head output layer. The dimensions of these two tensors are strictly aligned with the dimensions of the intermediate feature maps of the MeanFlow network.
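The modality latent dropout and concatenation aggregator described above might look like the following sketch. The function names, the modality keys, the dropout probability, and the rule that at least one modality is always kept are assumptions made for illustration; the text only states that whole modality latents are zeroed with a preset probability.

```python
import numpy as np

def modality_latent_dropout(latents, p_drop, rng, training=True):
    """Zero the whole latent tensor of each modality independently with probability p_drop.

    latents maps a modality name to an array of shape (C, D, H, W).
    At least one modality is always kept (an assumption, so the condition is never empty).
    """
    if not training or p_drop <= 0.0:
        return dict(latents)
    names = list(latents)
    keep = rng.random(len(names)) >= p_drop
    if not keep.any():
        keep[rng.integers(len(names))] = True
    return {n: latents[n] if k else np.zeros_like(latents[n]) for n, k in zip(names, keep)}

def concat_aggregate(latents):
    """Concatenate modality latents along the channel axis in a fixed (sorted) order."""
    return np.concatenate([latents[n] for n in sorted(latents)], axis=0)

rng = np.random.default_rng(0)
latents = {"mask": np.ones((4, 2, 2, 2)), "patch": np.ones((4, 2, 2, 2))}
dropped = modality_latent_dropout(latents, p_drop=0.5, rng=rng)
fused = concat_aggregate(dropped)  # shape (8, 2, 2, 2) regardless of which modality was zeroed
```

Zeroing rather than removing a modality keeps the fused tensor shape fixed, so the same layers can be used at inference time when a modality is genuinely missing.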

[0033] The MeanFlow network employs a Transformer-based architecture (referencing the DiT / ViT design) and integrates the adaLN-Zero mechanism. This network receives the 3D medical image data to be segmented, the initial latent variable z1, and the temporal embedding information as input, fusing the spatial features of the 3D medical image data. It also receives the modulation parameters {S, B} from the VeloMod module. The modulation parameters are injected into the intermediate feature map of the network by pixel-wise linear modulation, with the modulation formula as follows:

$$h' = S \odot h \oplus B$$

where h is the intermediate feature map before modulation, h' is the intermediate feature map after modulation, ⊙ represents element-wise multiplication, and ⊕ represents element-wise addition. Under the constraints of these modulation parameters, the MeanFlow network calculates the average velocity field u from the noise distribution to the target segmentation mask within the time interval from the start time r=0 to the end time t=1.
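A minimal sketch of the pixel-wise linear modulation follows, assuming S and B have exactly the shape of the feature map as the text requires. The function name and the tensor layout (channels, depth, height, width) are illustrative assumptions.

```python
import numpy as np

def voxelwise_modulate(h, scale, shift):
    """Apply h' = S * h + B element-wise; S and B must have exactly the shape of h."""
    if scale.shape != h.shape or shift.shape != h.shape:
        raise ValueError("S and B must match the feature map shape")
    return scale * h + shift

rng = np.random.default_rng(0)
h = rng.standard_normal((8, 4, 4, 4))  # (channels, depth, height, width)
S = rng.standard_normal(h.shape)       # stand-in for the VeloMod scale head
B = rng.standard_normal(h.shape)       # stand-in for the VeloMod offset head
h_mod = voxelwise_modulate(h, S, B)
```

Unlike a per-channel scale and shift broadcast over the volume, every voxel here receives its own scale and offset, which is what allows the condition to steer the features spatially.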

[0034] One-step generation module: Generates the feature representation z0 of the target distribution from the average velocity field u through the one-step mapping formula z0=z1-u.

3D decoder: Restores the target distribution feature representation z0 output by the one-step generation module into a segmentation mask in voxel space, and outputs the final 3D medical image segmentation result, whose spatial dimensions are strictly aligned with the input image x.

[0035] Referring to Figure 2, and based on the VeloVox model described above, the one-step generative segmentation method for 3D medical images based on the average flow model provided by this invention includes the following specific steps: S101. Obtain the 3D medical image data to be segmented and the manually specified anatomical condition information. In this step, the acquired 3D medical image data can be medical images from modalities such as CT, MRI, or PET. The manually specified anatomical condition information can be semantic masks of partial regions, local images, or other forms of prior knowledge; this conditional information is used to guide the generation process.

[0036] S102. Sample the initial latent variable z1 from the standard Gaussian distribution to serve as the starting point for the generation process; In this step, an initial latent variable z1 is randomly sampled from a standard Gaussian distribution. This initial latent variable serves as the starting point for the generation process from the noise distribution to the target segmentation mask. The standard Gaussian distribution is used as the noise prior distribution and is denoted as p(n).

[0037] S103. Apply position encoding to the preset start time and end time separately, and process the results through a multilayer perceptron to obtain the time embedding information. In this step, sinusoidal position encoding is applied to the preset start time r=0 and end time t=1, and the two encodings are processed by a multilayer perceptron and then summed to obtain the time embedding information. This time embedding information is injected into each computational layer of the MeanFlow network to indicate the time interval over which the generation process takes place.

[0038] S104. The anatomical condition information is input into the VeloMod module. The VeloMod module first extracts the features of the different modalities through multiple 3D convolutional layers and randomly sets some modality latent features to zero through the modality latent dropout mechanism to adapt to missing-modality scenarios. It then extracts multi-scale spatial volume features and generates modulation parameters to guide the generation process. The modulation parameters include the scale tensor S and the offset tensor B. In this step, the anatomical condition information y_i obtained in step S101 is fed into the VeloMod module. The VeloMod module first extracts features from the different modalities through multiple 3D convolutional layers. It then randomly zeros the latent features of some modalities through the modality latent dropout mechanism to handle missing modalities robustly. Next, it aggregates the processed features with the current flow model state information through a concatenation aggregator. Finally, it extracts key anatomical control points through an asymmetric downsampling path and generates the scale tensor S and offset tensor B through a dual-head output layer.

[0039] S105. Input the 3D medical image data to be segmented, the initial latent variable z1, and the temporal embedding information into the MeanFlow network. At the same time, inject the modulation parameters {S,B} into the intermediate feature map of the MeanFlow network by pixel-by-pixel linear modulation. This allows the MeanFlow network to fuse the spatial features of the 3D medical image data and, under the constraint of the modulation parameters, use the average flow identity to calculate the average velocity field u from the noise distribution to the target segmentation mask from the start time to the end time. In this step, the 3D medical image data to be segmented obtained in step S101, the initial latent variable z1 obtained in step S102, and the temporal embedding information obtained in step S103 are jointly input into the MeanFlow network. Simultaneously, the modulation parameters {S, B} generated in step S104 are injected into the intermediate feature map of the MeanFlow network by pixel-wise linear modulation. The average velocity field u is defined as:

$$u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau$$

where u is the average velocity field, z_t is the state variable at time t, z_τ is the state variable at time τ, r is the start time, t is the end time, and v(z_τ, τ) is the instantaneous velocity field at time τ.

[0040] The training process does not rely on ground-truth values of the average velocity field, but instead uses the average flow identity as a self-supervised constraint. The average flow identity is:

$$u(z_t, r, t) = v(z_t, t) - (t - r)\,\frac{d}{dt} u(z_t, r, t)$$

where u is the average velocity field and v is the instantaneous velocity field. During training, the instantaneous velocity field v is approximated by the difference between the noise n and the true segmentation mask x, i.e., v = n - x, where n is the initial latent variable z1 obtained by sampling from the standard Gaussian distribution.
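The average flow identity, u = v - (t - r) · du/dt with du/dt taken along the trajectory, can be checked numerically on a toy problem. The sketch below is only a sanity check and not the training procedure: it assumes a one-dimensional field v(z, t) = A·z (chosen here because its average velocity has a closed form) and uses finite differences where a real implementation would presumably use automatic differentiation.

```python
import math

A = -0.7  # rate of the toy linear field v(z, t) = A * z (illustrative choice)

def v(z, t):
    """Instantaneous velocity of the toy field."""
    return A * z

def u(z, r, t):
    """Exact average velocity over [r, t]: (z_t - z_r) / (t - r), with z_r = z_t * exp(A * (r - t))."""
    return z * (1.0 - math.exp(A * (r - t))) / (t - r)

def identity_rhs(z, r, t, eps=1e-5):
    """v - (t - r) * du/dt, where du/dt = v * du/dz + du/dt_partial (central differences)."""
    du_dz = (u(z + eps, r, t) - u(z - eps, r, t)) / (2 * eps)
    du_dt = (u(z, r, t + eps) - u(z, r, t - eps)) / (2 * eps)
    total_derivative = v(z, t) * du_dz + du_dt
    return v(z, t) - (t - r) * total_derivative

z_t, r, t = 1.3, 0.0, 1.0
lhs = u(z_t, r, t)
rhs = identity_rhs(z_t, r, t)
z_r = z_t - (t - r) * lhs  # one-step jump from time t back to time r
```

Here `lhs` and `rhs` agree up to finite-difference error, and `z_r` matches the exact solution of the toy flow, which is the property the one-step mapping relies on.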

[0041] S106. Using the average velocity field u, the feature representation z0 of the target distribution is generated through the one-step mapping formula z0=z1-u. In this step, the average velocity field u calculated in step S105 is used to directly calculate the characteristic representation z0 of the target distribution through a one-step mapping formula. The one-step mapping formula is: z0=z1-u, where z1 is the initial latent variable and u is the average velocity field.
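The one-step mapping z0 = z1 - u can be illustrated with an oracle standing in for the trained MeanFlow network. The sketch assumes the straight path z_t = (1 - t)·x + t·n that is implied by v = n - x; on that path the velocity is constant, so the average velocity over [0, 1] is also n - x. The names `one_step_generate` and `oracle_u` are hypothetical.

```python
import numpy as np

def one_step_generate(z1, mean_velocity):
    """Single function evaluation: z0 = z1 - u(z1, r=0, t=1)."""
    return z1 - mean_velocity(z1, 0.0, 1.0)

rng = np.random.default_rng(0)
x = (rng.random((4, 4, 4)) > 0.5).astype(np.float64)  # toy target latent (binary mask)
z1 = rng.standard_normal(x.shape)                     # Gaussian starting point n

def oracle_u(z, r, t):
    """Stand-in for a perfectly trained network: returns n - x for the start point z = n."""
    return z - x

z0 = one_step_generate(z1, oracle_u)  # recovers x exactly with the oracle
```

With a learned network the recovery is approximate, and the 3D decoder then maps z0 back to voxel space.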

[0042] S107. Input the feature representation z0 into the 3D decoder to restore it into a voxel space segmentation mask aligned with the spatial dimensions of the 3D medical image data to be segmented, and output the final three-dimensional medical image segmentation result.

[0043] In this step, the feature representation z0 of the target distribution obtained in step S106 is input to the 3D decoder. The decoder then uses upsampling and convolution operations to restore the voxel space segmentation mask, and finally outputs the three-dimensional medical image segmentation result.

[0044] This invention also provides a training method for the VeloVox model, specifically including the following steps: S201. Data preprocessing: All 3D medical images are registered to MNI space, intensity clipping is performed to remove outliers, and the voxel values are mapped to the [0,1] range by Min-Max normalization. S202. Construct the loss function: The total loss function L consists of two parts:

$$L = L_{\mathrm{MeanFlow}} + \lambda \left( L_1(S) + L_1(B) \right)$$

where L_MeanFlow is the squared L2 error of the average flow identity, computed with a stop-gradient operator on the target so that higher-order gradients need not be calculated; L1(S) is the L1 regularization term of the scale tensor S; L1(B) is the L1 regularization term of the offset tensor B; and λ is the regularization coefficient. In this embodiment, λ = 0.01.
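The forward value of the total loss can be sketched as below. Two choices are assumptions not stated in the text: the squared L2 error and the L1 terms are averaged over elements rather than summed, and the target velocity is simply passed in as a constant array, which mirrors the stop-gradient only in spirit since NumPy has no automatic differentiation.

```python
import numpy as np

def total_loss(u_pred, u_target, S, B, lam=0.01):
    """L = L_MeanFlow + lam * (L1(S) + L1(B)), with mean reductions (assumed)."""
    l_meanflow = np.mean((u_pred - u_target) ** 2)  # squared L2 error against the fixed target
    l1_s = np.mean(np.abs(S))                       # sparsity penalty on the scale tensor
    l1_b = np.mean(np.abs(B))                       # sparsity penalty on the offset tensor
    return l_meanflow + lam * (l1_s + l1_b)

shape = (2, 2, 2)
u_t = np.ones(shape)
loss_reg_only = total_loss(u_t, u_t, 2.0 * np.ones(shape), -np.ones(shape))   # 0.01 * (2 + 1)
loss_fit_only = total_loss(np.zeros(shape), u_t, np.zeros(shape), np.zeros(shape))
```

With λ = 0.01 the regularizer acts as a mild penalty on the magnitude of the modulation tensors next to the identity term.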

[0045] S203. Set hyperparameters and train the model: The AdamW optimizer is used with a base learning rate of 5×10⁻⁵. Training runs for 10,000 epochs with a mini-batch size of 16. Axial flip data augmentation and linear learning rate annealing are applied during training.
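The augmentation and learning-rate schedule mentioned above could be sketched as follows. Which array axis counts as "axial", the flip probability, and annealing down to zero are all assumptions; the helper names are hypothetical and the learning rate is left as a parameter.

```python
import numpy as np

def random_axial_flip(image, mask, rng, axis=0, p=0.5):
    """Flip image and mask together along one axis with probability p."""
    if rng.random() < p:
        return np.flip(image, axis=axis).copy(), np.flip(mask, axis=axis).copy()
    return image, mask

def linear_lr(step, total_steps, base_lr, final_lr=0.0):
    """Linearly anneal from base_lr at step 0 to final_lr at total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return base_lr + frac * (final_lr - base_lr)

rng = np.random.default_rng(0)
image = np.arange(8, dtype=np.float64).reshape(2, 2, 2)
mask = (image > 3).astype(np.uint8)
flipped_image, flipped_mask = random_axial_flip(image, mask, rng, axis=0, p=1.0)
```

Flipping the image and its mask with the same decision keeps the labels spatially aligned with the voxels they describe.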

[0046] Experimental results: To verify the effectiveness of this invention, comparative experiments were conducted on the BraTS2021 brain tumor segmentation challenge dataset. The experiments selected nnU-NetV2, SwinUNETR, U-KAN, KM-UNet, and 3D-LDM+VCM as comparison models. Evaluation metrics included Dice coefficient (%), 95% Hausdorff distance (HD95, mm), number of function evaluations (NFE), and single-instance inference time (ms). Experimental results are shown in Table 1.

Table 1. Performance comparison of different architectures on the BraTS2021 task

As shown in Table 1, while KAN-based models (such as U-KAN) have potential in nonlinear representation, their 3D convolution and B-spline computational overhead is significant, resulting in an inference latency of 262 ms. Although KM-UNet, which incorporates Mamba, gains efficiency from its linear complexity, it is still limited by the feature bottleneck of discriminative learning. In contrast, the VeloVox proposed in this invention simplifies the generation process to a one-step displacement mapping using average flow theory, with an inference time of only 128 ms, significantly outperforming all the aforementioned models and demonstrating its application value in real-time clinical diagnosis.

[0047] Models such as U-KAN generally perform poorly in terms of HD95 (6.45 mm) when dealing with complex boundaries (such as enhancing tumor (ET) regions), because excessive parameterization of the KAN layer can smooth edges. This invention uses the volumetric conditioning provided by VeloMod and trains with the average flow identity, achieving significant improvements in both the Dice coefficient (92.45%) and boundary refinement (HD95 of 3.12 mm), demonstrating the superiority of generative flow models in capturing global anatomical topology.
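For reference, the Dice coefficient reported above is the standard overlap measure 2|P ∩ T| / (|P| + |T|) between a predicted mask P and a reference mask T. The sketch below computes it for binary volumes; returning 1.0 when both masks are empty is a convention chosen here, and the toy masks are illustrative rather than taken from the experiments.

```python
import numpy as np

def dice_coefficient(pred, target):
    """Dice = 2 |P & T| / (|P| + |T|) for binary masks; 1.0 when both masks are empty."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0
    return 2.0 * np.logical_and(pred, target).sum() / denom

pred = np.zeros((4, 4, 4), dtype=np.uint8)
target = np.zeros((4, 4, 4), dtype=np.uint8)
pred[:2] = 1       # 32 voxels predicted
target[1:3] = 1    # 32 voxels in the reference, 16 of them shared
score = dice_coefficient(pred, target)  # 2 * 16 / (32 + 32) = 0.5
```

HD95 complements Dice by measuring boundary distance rather than volume overlap, which is why the text reports both.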

[0048] In addition, to verify the generalization ability in small-sample scenarios, comparative experiments were conducted with 10, 50, and 500 training samples. The results are shown in Table 2.

Table 2. Comparison of Dice coefficient (%) under different training sample sizes

Training sample size | KM-UNet | U-KAN | VeloVox
10 cases             | 74.2    | 71.5  | 89.2
50 cases             | 82.5    | 80.8  | 91.1
500 cases            | 91.2    | 90.3  | 92.4

As shown in Table 2, in the extreme scenario with only 10 samples, U-KAN and KM-UNet suffered a significant performance decline because they rely on large amounts of data to learn nonlinear mappings. In contrast, the VeloVox of this invention, which uses the prior knowledge of a pre-trained flow model together with the plug-and-play nature of VeloMod, maintains a Dice coefficient of 89.2% even with this small sample size. This demonstrates the robustness and efficiency of this solution in data-constrained medical environments.

[0049] Corresponding to the above method embodiments, this embodiment also provides a one-step generative segmentation system for 3D medical images based on the average flow model, including:
a data acquisition module, used to acquire the 3D medical image data to be segmented and the anatomical condition information specified by the user;
a noise sampling module, used to sample the initial latent variable z1 from the standard Gaussian distribution as the starting point of the generation process;
a time embedding module, used to apply positional encoding to the preset start time and end time respectively, and then process the encodings through a multilayer perceptron to obtain the time embedding information;
a VeloMod module, used to receive the 3D medical image data and the anatomical condition information, extract multi-scale spatial volume features, and generate modulation parameters to guide the generation process, the modulation parameters including the scale tensor S and the offset tensor B;
a MeanFlow network module, used to receive the 3D medical image data to be segmented, the initial latent variable z1, and the time embedding information, fuse the spatial features of the 3D medical image data, and simultaneously receive the modulation parameters {S,B} and inject them into the intermediate feature map in a pixel-by-pixel linear modulation manner; under the constraint of the modulation parameters, the average velocity field u from the noise distribution to the target segmentation mask, from the start time to the end time, is calculated using the average flow identity;
a one-step generation module, used to generate the feature representation z0 of the target distribution from the average velocity field u through the one-step mapping formula z0=z1-u;
a 3D decoder module, used to restore the feature representation z0 to a segmentation mask in voxel space;
an output module, used to output the final 3D medical image segmentation result.
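The data flow between these modules can be illustrated with a short sketch. It is a toy under stated assumptions, not the system of the invention: the volume is a flat list of voxel intensities, the sinusoidal form of the positional encoding is assumed, the multilayer perceptron and the VeloMod modulation are omitted, and `toy_net` and `toy_decoder` are hypothetical stand-ins for the trained MeanFlow network and 3D decoder. The point is only that inference uses a single network evaluation followed by z0 = z1 - u.

```python
import math
import random


def sinusoidal_encoding(t, dim=8):
    """Sinusoidal positional encoding of a scalar time value (dim must be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def one_step_segment(image, mean_flow_net, decoder, r=0.0, t=1.0, seed=0):
    rng = random.Random(seed)
    z1 = [rng.gauss(0.0, 1.0) for _ in image]                   # noise sampling
    time_emb = sinusoidal_encoding(r) + sinusoidal_encoding(t)  # time embedding (MLP omitted)
    u = mean_flow_net(image, z1, time_emb)                      # one function evaluation (NFE = 1)
    z0 = [a - b for a, b in zip(z1, u)]                         # one-step mapping z0 = z1 - u
    return decoder(z0)                                          # back to a voxel-space mask


# Hypothetical stand-ins: an "oracle" network whose target mask is a simple
# intensity threshold, and a decoder that binarises the latent.
def toy_net(image, z1, time_emb):
    target = [1.0 if x > 0.5 else 0.0 for x in image]
    return [z - m for z, m in zip(z1, target)]  # average velocity = noise - target


def toy_decoder(z0):
    return [1 if z > 0.5 else 0 for z in z0]


mask = one_step_segment([0.9, 0.1, 0.7, 0.2], toy_net, toy_decoder)
```

Because the toy network returns exactly the displacement from the noise to its target, the single subtraction lands on the target mask regardless of the sampled noise.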

[0050] Corresponding to the above method embodiments, this embodiment also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the steps of the above-described one-step generative segmentation method for 3D medical images based on the average flow model.

[0051] Although embodiments of the present invention have been disclosed above, they are not limited to the applications listed in the specification and embodiments. They can be applied to various fields suitable for the present invention. For those skilled in the art, other modifications can be easily made. Therefore, without departing from the general concept defined by the claims and their equivalents, the present invention is not limited to the specific details and illustrations shown and described herein.

Claims

1. A one-step generative segmentation method for 3D medical images based on an average flow model, characterized in that it includes the following steps:
acquiring the 3D medical image data to be segmented and manually specified anatomical condition information;
sampling from a standard Gaussian distribution to obtain the initial latent variable z1;
applying positional encoding to the preset start time and end time respectively, and processing the encodings through a multilayer perceptron to obtain the time embedding information;
inputting the anatomical condition information into the VeloMod module to extract multi-scale spatial volume features and generate modulation parameters to guide the generation process, the modulation parameters including the scale tensor S and the offset tensor B;
inputting the 3D medical image data to be segmented, the initial latent variable z1, and the time embedding information into the MeanFlow network, while injecting the modulation parameters {S,B} into the intermediate feature map of the MeanFlow network in a pixel-by-pixel linear modulation manner, so that the MeanFlow network fuses the spatial features of the 3D medical image data and, under the constraint of the modulation parameters, calculates the average velocity field u from the noise distribution to the target segmentation mask, from the start time to the end time, using the average flow identity;
generating the feature representation z0 of the target distribution from the average velocity field u through the one-step mapping formula z0=z1-u;
inputting the feature representation z0 into the 3D decoder and restoring it to a voxel space segmentation mask aligned with the spatial dimensions of the 3D medical image data to be segmented, and outputting the final 3D medical image segmentation result.

2. The one-step generative segmentation method for 3D medical images based on the average flow model as described in claim 1, characterized in that the VeloMod module employs an asymmetric 3D U-Net architecture in which the encoder depth is greater than the decoder depth. It generates a scale tensor S and an offset tensor B through a dual-head output layer, which are used to perform pixel-wise linear modulation on the intermediate feature maps of the MeanFlow network. The modulation formula is as follows:

$$h' = S \odot h \oplus B$$

where h is the intermediate feature map before modulation, h' is the intermediate feature map after modulation, ⊙ represents element-level multiplication, and ⊕ represents element-level addition.
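The element-level modulation in this claim can be sketched in a few lines. The sketch assumes the form h' = S ⊙ h ⊕ B implied by the symbol definitions (scale applied by element-level multiplication, offset by element-level addition) and represents tensors as equally shaped nested lists; the function name `modulate` is hypothetical.

```python
def modulate(h, scale, offset):
    """Element-wise h' = S * h + B over equally shaped nested lists."""
    if isinstance(h, list):
        if not (len(h) == len(scale) == len(offset)):
            raise ValueError("h, scale and offset must have the same shape")
        # Recurse so that every element receives its own scale and offset.
        return [modulate(a, s, b) for a, s, b in zip(h, scale, offset)]
    return scale * h + offset


features = [[1.0, 2.0], [3.0, 4.0]]
S = [[1.0, 0.5], [0.0, 2.0]]
B = [[0.0, 1.0], [1.0, -1.0]]
modulated = modulate(features, S, B)
```

Because S and B have the same shape as the feature map, each position is modulated independently, which is what distinguishes this from a per-channel modulation with one scale and one offset per channel.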

3. The one-step generative segmentation method for 3D medical images based on the average flow model as described in claim 1, characterized in that the MeanFlow network is based on the Transformer architecture and integrates the adaLN-Zero mechanism. It receives the temporal embedding information and outputs an average velocity field u; the average velocity field u is defined as:

$$u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau$$

where u is the average velocity field, z_t is the state variable at time t, z_τ is the state variable at time τ, r is the start time, t is the end time, and v(z_τ, τ) represents the instantaneous velocity field at time τ.
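A small numerical example may help to make the average velocity concrete. It assumes the usual reading of the definition, namely that u is the time average of the instantaneous velocity v along the trajectory between the start time r and the end time t, so that (t - r)·u equals the displacement z_t - z_r. The scalar state, the midpoint integrator, and the example velocity field v(z, τ) = 2τ are illustrative choices only.

```python
def average_velocity(v, z_r, r, t, n_steps=1000):
    """Integrate dz/dtau = v(z, tau) from r to t with midpoint steps.

    Returns (u, z_t), where u approximates (1/(t-r)) * integral of v over [r, t].
    """
    if t <= r:
        raise ValueError("end time t must be greater than start time r")
    dt = (t - r) / n_steps
    z = z_r
    for i in range(n_steps):
        tau = r + i * dt
        z_mid = z + 0.5 * dt * v(z, tau)      # half step to the interval midpoint
        z = z + dt * v(z_mid, tau + 0.5 * dt)  # full step using the midpoint velocity
    u = (z - z_r) / (t - r)                    # displacement divided by elapsed time
    return u, z


# Example field: v = 2*tau, whose average over [r, t] is t + r in closed form.
u, z_t = average_velocity(lambda z, tau: 2.0 * tau, z_r=3.0, r=0.0, t=1.0)
z_r_recovered = z_t - (1.0 - 0.0) * u          # one-step jump back to the start
```

With r = 0 and t = 1 the last line is the one-step mapping z0 = z1 - u: a network that predicts u directly replaces the whole loop with a single subtraction.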

4. The one-step generative segmentation method for 3D medical images based on the average flow model as described in claim 1, characterized in that the average flow identity is:

$$u(z_t, r, t) = v(z_t, t) - (t - r)\,\frac{d}{dt} u(z_t, r, t)$$

where u is the average velocity field and v is the instantaneous velocity field. During training, the instantaneous velocity field v is approximated by the difference between the noise n and the true segmentation mask x, i.e., v = n - x, and the average velocity field u is optimized in a self-supervised manner, where n is the initial latent variable z1 obtained by sampling from the standard Gaussian distribution.
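For reference, the identity can be obtained from the integral definition of the average velocity in a few lines. The derivation below is a sketch in the notation of the claims; it assumes the start time r is held fixed while differentiating with respect to t, and the linear interpolation path in the last step is an assumption consistent with v = n - x rather than something the text states explicitly.

```latex
\begin{aligned}
% Definition of the average velocity, multiplied through by (t - r):
(t - r)\, u(z_t, r, t) &= \int_r^t v(z_\tau, \tau)\, d\tau \\
% Differentiate both sides with respect to t (product rule on the left,
% fundamental theorem of calculus on the right):
u(z_t, r, t) + (t - r)\,\frac{d}{dt} u(z_t, r, t) &= v(z_t, t) \\
% Rearranged, this is the average flow identity:
u(z_t, r, t) &= v(z_t, t) - (t - r)\,\frac{d}{dt} u(z_t, r, t) \\
% The total derivative expands along the trajectory dz_t/dt = v:
\frac{d}{dt} u(z_t, r, t) &= v(z_t, t)\,\partial_z u + \partial_t u \\
% Assumed linear path z_t = (1 - t)\,x + t\,n gives v = n - x, and with
% r = 0, t = 1 the definition reduces to the one-step mapping:
z_0 &= z_1 - u(z_1, 0, 1)
\end{aligned}
```

On the assumed linear path the instantaneous velocity is constant, so its average over the interval from 0 to 1 is also n - x, and the one-step mapping returns z0 = n - (n - x) = x.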

5. The one-step generative segmentation method for 3D medical images based on the average flow model as described in claim 1, characterized in that the VeloMod module also includes a modal latent discarding mechanism, which randomly sets the latent features of a specific modality to zero at the output of a multi-channel parallel encoder according to a preset probability, so as to achieve robust processing of missing modalities.
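The discarding mechanism can be sketched as follows. This is an illustration under assumptions: each modality's latent is a flat list keyed by modality name, every modality is dropped independently with the preset probability, and the `keep_at_least_one` safeguard is an addition of this sketch that the claim does not mention. The modality names in the example are the four MRI sequences of BraTS and are used only for illustration.

```python
import random


def drop_modal_latents(latents, p_drop, rng, keep_at_least_one=True):
    """Zero the latent of each modality with probability p_drop.

    latents: dict mapping modality name -> list of floats (encoder output).
    keep_at_least_one: hypothetical safeguard so the input is never all zeros.
    """
    names = list(latents)
    dropped = [name for name in names if rng.random() < p_drop]
    if keep_at_least_one and names and len(dropped) == len(names):
        dropped.remove(rng.choice(dropped))    # spare one modality at random
    return {
        name: ([0.0] * len(vec) if name in dropped else list(vec))
        for name, vec in latents.items()
    }


latents = {"T1": [0.2, 0.4], "T1ce": [0.1, 0.9], "T2": [0.5, 0.3], "FLAIR": [0.7, 0.6]}
augmented = drop_modal_latents(latents, p_drop=0.25, rng=random.Random(0))
```

Training with randomly zeroed latents exposes the downstream network to the same input pattern it would see at inference time when a modality is genuinely missing and its latent is set to zero.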

6. The one-step generative segmentation method for 3D medical images based on the average flow model as described in claim 1, characterized in that the total loss function of the method is:

$$\mathcal{L}_{total} = \mathcal{L}_{MeanFlow} + \lambda \left( L_1(S) + L_1(B) \right)$$

where L_MeanFlow is the squared L2 error of the average flow identity, L1(S) is the L1 regularization term of the scale tensor S, L1(B) is the L1 regularization term of the offset tensor B, and λ is the regularization coefficient.
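A sketch of how such a loss could be evaluated is given below. It rests on several assumptions that the text does not confirm: a single coefficient λ weights the sum of both regularization terms, all terms use a mean reduction, tensors are flat lists, and the regression target for u (the right-hand side of the average flow identity) has already been computed and is passed in as `u_target`. The function name is hypothetical.

```python
def total_loss(u_pred, u_target, scale, offset, lam):
    """Squared L2 error on u plus lam * (L1(S) + L1(B)), all as means.

    Assumed form: L_total = L_MeanFlow + lam * (L1(S) + L1(B)).
    """
    if len(u_pred) != len(u_target):
        raise ValueError("u_pred and u_target must have the same length")
    mean_flow = sum((p - q) ** 2 for p, q in zip(u_pred, u_target)) / len(u_pred)
    l1_scale = sum(abs(s) for s in scale) / len(scale)     # sparsity of the scale tensor
    l1_offset = sum(abs(b) for b in offset) / len(offset)  # sparsity of the offset tensor
    return mean_flow + lam * (l1_scale + l1_offset)


loss = total_loss(
    u_pred=[1.0, 2.0], u_target=[0.0, 0.0],
    scale=[1.0, -1.0], offset=[0.5, 0.5], lam=0.1,
)
```

In this example the squared-error term is 2.5 and the two regularization terms are 1.0 and 0.5, so the regularization contributes 0.15 to the total.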

7. A one-step generative segmentation system for 3D medical images based on an average flow model, characterized in that it includes:
a data acquisition module, used to acquire the 3D medical image data to be segmented and the anatomical condition information specified by the user;
a noise sampling module, used to sample the initial latent variable z1 from a standard Gaussian distribution;
a time embedding module, used to apply positional encoding to the preset start time and end time respectively, and then process the encodings through a multilayer perceptron to obtain the time embedding information;
a VeloMod module, used to receive the anatomical condition information, extract multi-scale spatial volume features, and generate modulation parameters to guide the generation process, the modulation parameters including the scale tensor S and the offset tensor B;
a MeanFlow network module, used to receive the 3D medical image data to be segmented, the initial latent variable z1, and the time embedding information, fuse the spatial features of the 3D medical image data, and simultaneously receive the modulation parameters {S,B} and inject them into the intermediate feature map in a pixel-by-pixel linear modulation manner; under the constraint of the modulation parameters, the average velocity field u from the noise distribution to the target segmentation mask, from the start time to the end time, is calculated using the average flow identity;
a one-step generation module, used to generate the feature representation z0 of the target distribution from the average velocity field u through the one-step mapping formula z0=z1-u;
a 3D decoder module, used to restore the feature representation z0 to a voxel space segmentation mask aligned with the spatial dimensions of the 3D medical image data to be segmented;
an output module, used to output the final 3D medical image segmentation result.

8. The one-step generative segmentation system for 3D medical images based on the average flow model as described in claim 7, characterized in that the VeloMod module is an asymmetric 3D U-Net architecture, in which the number of encoder channels doubles layer by layer, the number of decoder channels halves layer by layer, and the output layer is a dual-head structure that outputs the scale tensor S and the offset tensor B respectively.

9. The one-step generative segmentation system for 3D medical images based on the average flow model as described in claim 7, characterized in that the MeanFlow network module adopts a Transformer structure, integrates the temporal embedding and the adaLN-Zero mechanism, and supports average velocity field prediction under single-step inference.

10. A computer-readable storage medium having a computer program stored thereon, the computer program being executed by a processor to implement the one-step generative segmentation method for 3D medical images based on the average flow model as described in any one of claims 1 to 6.