A report-guided three-dimensional CT lesion localization and segmentation method based on a GLeVE model
By constructing a GLeVE neural network model that combines lesion semantic modeling, anatomical priors, and octree autoregressive refinement, the method addresses the lack of one-to-one correspondence between report descriptions and lesion localizations in existing techniques and achieves efficient, accurate segmentation of 3D CT lesions. It is suited to multi-lesion and low-contrast lesion scenarios.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: HANGZHOU DIANZI UNIV
- Filing Date: 2026-03-17
- Publication Date: 2026-06-19
AI Technical Summary
Existing 3D medical image segmentation methods struggle to establish a one-to-one correspondence between report descriptions and lesion instances, especially in scenarios with multiple lesions, small lesions, and low-contrast lesions where localization stability and boundary accuracy are insufficient. Furthermore, pixel-level mask annotations in clinical data are expensive and scarce, making it difficult for traditional strongly supervised segmentation frameworks to fully utilize weakly labeled samples.
A GLeVE neural network model is constructed. Through a lesion semantic modeling and query module, an anatomical prior lesion proposal and verification module, and an octree autoregressive lesion refinement module, it achieves one-to-one correspondence between report descriptions and lesions, interpretability, and pixel-level accurate localization and segmentation. Organ segmentation anatomical priors are combined with regional consistency verification, and the lesion boundaries are refined progressively.
Under fully supervised and weakly supervised settings, the system achieves improved semantic integrity and boundary accuracy at the lesion level in multi-lesion scenarios, enhances the stability and accuracy of localization and segmentation, reduces dependence on annotation, and exhibits good data efficiency and annotation robustness.
Figure CN122244868A_ABST
Description
Technical Field
[0001] This invention relates to the fields of medical image processing, computer vision, and medical report understanding, and in particular to a report-guided three-dimensional CT lesion localization and segmentation method based on the GLeVE model.
Background Technology
[0002] Radiology reports are crucial for physicians to summarize and record the location, organ affiliation, subregion, and imaging attributes of lesions in 3D CT images. They also serve as vital evidence for clinical diagnosis, efficacy evaluation, and treatment decisions. In a routine workflow, physicians typically need to retrieve lesion locations layer by layer from the original CT data based on the natural language descriptions in the report to verify and quantify key lesions. For multi-organ, multi-lesion scenarios, reports are often lengthy, and the conventional descriptions can easily obscure key lesion information. This leads clinicians to rely heavily on experience when quickly and accurately identifying and correlating imaging evidence, resulting in low efficiency and high subjectivity.
[0003] Existing 3D medical image segmentation methods mostly rely on global mask prediction. While they can output overall segmentation results, they struggle to establish a one-to-one correspondence between "report descriptions" and "lesion instances." Existing report-assisted supervision methods can reduce annotation dependence to some extent, but due to the lack of explicit inference-time matching mechanisms, they still cannot guarantee that each lesion description accurately maps to a unique lesion region. On the other hand, some medical vision-language localization methods mainly use phrase-level alignment, cross-modal attention heatmaps, or diffusion models for local localization. These methods often fail to treat an entire lesion description as a complete semantic unit with constraints on organ attribution, spatial relationships, and image attributes. Therefore, in scenarios with multiple lesions, small lesions, and low-contrast lesions, localization stability and boundary accuracy remain insufficient.
[0004] Furthermore, pixel-level mask annotations in clinical data are expensive and scarce, and a large number of cases contain only structured or free-text reports. Traditional strongly supervised segmentation frameworks cannot make full use of such weakly annotated samples. Meanwhile, the task of 3D CT lesion localization requires global coordinate consistency, lesion-level semantic interpretability, and local boundary fineness. Existing methods cannot meet actual clinical needs in all three aspects simultaneously.
Summary of the Invention
[0005] To address the above problems, this invention provides a report-guided 3D CT lesion localization and segmentation method based on the GLeVE model. By constructing a GLeVE neural network model, the complete lesion description in the radiology report is used as an atomic semantic unit for graph-structured reasoning. Combined with organ segmentation anatomical priors, region consistency verification, and octree autoregressive boundary refinement, the method achieves one-to-one correspondence between report descriptions and lesions, interpretability, and pixel-level accurate localization and segmentation.
[0006] A report-guided 3D CT lesion localization and segmentation method based on the GLeVE model includes the following steps:
[0007] Step 1: Construct a GLeVE neural network model for lesion localization and segmentation in 3D CT guided by radiology reports. The GLeVE neural network model includes a lesion semantic modeling and querying module (LeQu), an anatomy-prior lesion proposal and verification module (AnVer), and an octree autoregressive lesion refinement module (OcRe).
[0008] Step 2: Obtain a dataset containing 3D CT volume data, radiology reports, and organ segmentation priors. Perform structured parsing on the radiology reports to extract attributes such as the organ to which the lesion belongs, sublocation, size, and HU value. Divide the dataset into training, validation, and test sets.
[0009] Step 3: Train the GLeVE neural network model using the training set. For samples with pixel-level mask annotations, use segmentation loss for supervision, and for samples with missing pixel-level mask annotations, use report-driven weak supervision loss for optimization.
[0010] Step 4: Input the radiology report to be tested and the 3D CT volume data into the trained GLeVE neural network model, and sequentially perform lesion-level query generation, candidate proposal, region consistency verification and octree autoregressive boundary refinement, and output lesion localization results and segmentation masks that correspond one-to-one with the report description.
[0011] Furthermore, the Lesion Semantic Modeling and Querying (LeQu) module described in step 1 uses the complete lesion description as the atomic semantic unit. It employs a frozen Qwen3-8B large language model to perform structured parsing of the radiology report, obtaining fields such as lesion set, organ affiliation, sublocation, size, and HU value. It then constructs a lesion semantic graph composed of lesion nodes, anatomical nodes, and attribute nodes. Subsequently, a relation-aware graph Transformer passes messages over the lesion semantic graph, preserving the explicit relationships between lesions and organs, between lesions and attributes, and among lesions, and generating a stable lesion-level semantic representation and query vectors. The relation-aware graph reasoning process is as follows:
[0012] ,
[0013] ,
[0014] where $h_i^{(l)}$ denotes the embedding of node $i$ at layer $l$; $\mathcal{N}(i)$ denotes the neighborhood set of node $i$; $W_Q$, $W_K$, and $W_V$ denote the linear mapping matrices for the query, key, and value, respectively; $W_R$ denotes the relation embedding mapping matrix; $r_{ij}$ denotes the relation-type embedding of edge $(i, j)$; $\mathrm{FFN}(\cdot)$ denotes a feedforward network; $\alpha_{ij}^{(l)}$ denotes the relation-aware attention weight from node $j$ to node $i$ at layer $l$; $j$ denotes a node in the neighborhood of node $i$; and $d$ denotes the feature dimension.
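The two formulas for this step did not survive in the text. As a hedged reading aid only, one relation-aware attention update consistent with the listed quantities could be written as below. The symbol names ($h_i^{(l)}$ for the layer-$l$ embedding of node $i$, $\mathcal{N}(i)$ for its neighborhood, $r_{ij}$ for the relation-type embedding of edge $(i, j)$, $\alpha_{ij}^{(l)}$ for the attention weight), the placement of the relation term $W_R r_{ij}$, and the residual connection are all assumptions, not the patent's own formulas.

```latex
\begin{align}
\alpha_{ij}^{(l)} &= \operatorname{softmax}_{j \in \mathcal{N}(i)}
  \left( \frac{\bigl(W_Q h_i^{(l)}\bigr)^{\top} \bigl(W_K h_j^{(l)} + W_R r_{ij}\bigr)}{\sqrt{d}} \right), \\
h_i^{(l+1)} &= \mathrm{FFN}\!\left( h_i^{(l)} + \sum_{j \in \mathcal{N}(i)}
  \alpha_{ij}^{(l)} \bigl( W_V h_j^{(l)} + W_R r_{ij} \bigr) \right).
\end{align}
```

In this form the relation embedding biases both which neighbors a lesion node attends to and what information it receives, which is one way the lesion-organ and lesion-attribute edge types could stay explicit during message passing.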
[0015] After graph reasoning, the representation of the $k$-th lesion node, $z_k$, serves as the semantic summary vector of that lesion. The query library generator then maps it to multiple queries for the lesion, as shown in the following formula:
[0016] $\{q_{k,m}\}_{m=1}^{M} = G(z_k)$,
[0017] where $q_{k,m}$ denotes the $m$-th query vector corresponding to the $k$-th lesion; $G(\cdot)$ denotes the query library generator; and $M$ denotes the number of queries corresponding to each lesion.
[0018] This step can preserve the semantic integrity of lesions in multi-lesion scenarios and avoid the semantic fragmentation problem caused by relying solely on phrase-level alignment.
[0019] Furthermore, the Anatomy-Prior Lesion Proposal and Verification (AnVer) module described in step 1 uses the organ segmentation result as an anatomical prior, performs feature linear modulation on the voxel features output by the visual encoder, ensuring that the candidate lesion proposals are consistent with the target organ; it constructs a candidate set based on the voxel-level similarity response, and further achieves a one-to-one match between the report description and the lesion region through region consistency verification. The specific process is as follows:
[0020] First, Gaussian smoothing is applied to the binary organ mask output by the organ segmenter to obtain a soft mask. An embedding is learned for each organ, and the anatomical-prior feature linear modulation is constructed as follows:
[0021] ,
[0022] ,
[0023] where $F(p)$ denotes the original feature of the visual encoder at position $p$; $\tilde{F}(p)$ denotes the organ-aware feature after anatomical prior injection; $\mathrm{MLP}(\cdot)$ denotes a lightweight multilayer perceptron; $a(p)$ denotes the anatomical prior label at position $p$; $O$ denotes the total number of organ categories; $C$ denotes the feature channel dimension; $\gamma(p)$ and $\beta(p)$ denote the channel scaling factor and offset factor at position $p$, respectively; and $\Phi(\cdot)$ denotes the organ prior aggregation mapping.
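The two modulation formulas are missing from the text. A feature-wise linear modulation consistent with the listed quantities might look like the sketch below. It assumes that the aggregation mapping $\Phi$ is a soft-mask-weighted sum of per-organ embeddings $e_o$, and that the scale is applied directly as $\gamma \odot F$ rather than as $(1 + \gamma) \odot F$. Both choices, and all symbol names, are assumptions.

```latex
\begin{align}
\bigl[\gamma(p),\, \beta(p)\bigr] &= \mathrm{MLP}\bigl(\Phi(a(p))\bigr),
  \qquad \Phi(a(p)) = \sum_{o=1}^{O} a_o(p)\, e_o, \\
\tilde{F}(p) &= \gamma(p) \odot F(p) + \beta(p),
  \qquad \gamma(p),\, \beta(p) \in \mathbb{R}^{C}.
\end{align}
```

Because the soft mask $a_o(p)$ comes from Gaussian smoothing, the modulation in this sketch would vary smoothly across organ boundaries instead of switching abruptly.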
[0024] Subsequently, the multiple queries corresponding to the $k$-th lesion are matched against the organ-aware feature map by voxel-level cosine similarity to obtain the response map:
[0025] ,
[0026] Adaptive thresholding and 3D connected component analysis are then applied to the response map to obtain a high-recall candidate set. Feature pooling is performed on the candidate regions and, combined with evidence vectors such as organ coverage, volume, and average HU value, a region consistency score is calculated:
[0027] ,
[0028] where $v_j$ denotes the visual feature of the $j$-th candidate region; $e_{k,j}$ denotes the evidence of consistency between the candidate region and the reported attributes; $z_k$ denotes the lesion-level semantic vector; $s_{k,j}$ denotes the consistency score of the $j$-th candidate region for the $k$-th lesion; $\sigma(\cdot)$ denotes the activation function; $w$ and $b$ are learnable parameters; $j_k^{*}$ denotes the optimal candidate region for the $k$-th lesion description; and $J$ denotes the number of candidate regions.
[0029] To ensure that each lesion description corresponds to only one dominant candidate region, a temperature-scaled unimodal distribution constraint is adopted, as shown in the following formula:
[0030] ,
[0031] ,
[0032] where $\mathcal{L}_{\mathrm{uni}}$ denotes the unimodal distribution constraint loss; $p_{k,j}$ denotes the temperature-scaled soft-assignment probability of the $j$-th candidate region for the $k$-th lesion; $\tau$ denotes the temperature coefficient; $\epsilon$ denotes the numerical stability constant; $K$ denotes the number of lesion descriptions; and $J$ denotes the number of candidate regions corresponding to each lesion.
[0033] This step can suppress the noise response caused by voxel-level similarity peaks and improve the semantic alignment accuracy between reports and lesions.
[0034] Furthermore, the Octree Autoregressive Lesion Refinement (OcRe) module described in step 1 takes the optimal candidate region selected by the region consistency verification and builds an octree rooted at its circumscribed cube. A top-down, multi-level recursive refinement strategy progressively narrows the region of interest, achieving fine segmentation of small lesions and ambiguous boundaries. For each octree node, the cropped local organ-aware features and the upsampled coarse prediction from the previous level are used as conditional inputs to a parameter-shared 3D residual refinement head for autoregressive correction, as shown in the following formula:
[0035] ,
[0036] where $D_{\theta}$ denotes the 3D residual refinement decoder with shared parameters, and $\hat{M}^{(l)}$ denotes the refined prediction at the $l$-th level.
[0037] Considering the large amount of data lacking pixel-level mask annotations in clinical samples, the method also introduces a report-driven weakly supervised loss, which is jointly optimized with the segmentation loss of labeled samples, as shown in the following formula:
[0038] ,
[0039] ,
[0040] ,
[0041] where $\mathcal{L}_{\mathrm{sep}}$ penalizes overlap between the predictions of different lesions; $\mathcal{L}_{\mathrm{out}}$ penalizes predicted lesion mass that falls outside the target organ mask; $\mathcal{L}_{\mathrm{seg}}$ denotes the combination of Dice loss and cross-entropy loss; $\mathcal{L}_{\mathrm{weak}}$ denotes the weak supervision loss; $\lambda_{\mathrm{uni}}$, $\lambda_{\mathrm{con}}$, and $\lambda_{\mathrm{sep}}$ denote the weighting coefficients of the unimodal distribution constraint loss, consistency loss, and separation loss, respectively; $\mathcal{L}_{\mathrm{con}}$ denotes the consistency loss; $\mathcal{L}_{\mathrm{attr}}$ denotes the attribute consistency loss; $a$ ranges over the two attributes, volume and average HU value; $\hat{a}_k$ and $a_k$ denote the predicted and reported attribute values of the $k$-th lesion, respectively; $\epsilon$ is a constant that prevents a zero denominator; $\mathcal{L}_{\mathrm{total}}$ denotes the total loss; $\mathbb{1}[\cdot]$ denotes the indicator function, which takes the value 1 when sample $n$ belongs to the set of samples with pixel-level mask annotations and 0 otherwise; and $\lambda_{\mathrm{seg}}$ and $\lambda_{\mathrm{weak}}$ denote the weighting coefficients of the segmentation loss and the weak supervision loss, respectively.
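The loss formulas themselves are missing from the text. The sketch below shows one way the described terms could fit together. It assumes that the indicator switches between the segmentation loss and the weak supervision loss per sample, that the consistency loss is the sum of the attribute term and the out-of-organ term, and that the attribute term is a relative absolute error. The source lists three formulas, so this four-line grouping, the set name $\mathcal{S}_{\mathrm{mask}}$, and all symbol names are assumptions.

```latex
\begin{align}
\mathcal{L}_{\mathrm{total}} &= \sum_{n} \Bigl( \mathbb{1}[n \in \mathcal{S}_{\mathrm{mask}}]\,
  \lambda_{\mathrm{seg}}\, \mathcal{L}_{\mathrm{seg}}^{(n)}
  + \bigl(1 - \mathbb{1}[n \in \mathcal{S}_{\mathrm{mask}}]\bigr)\,
  \lambda_{\mathrm{weak}}\, \mathcal{L}_{\mathrm{weak}}^{(n)} \Bigr), \\
\mathcal{L}_{\mathrm{weak}} &= \lambda_{\mathrm{uni}}\, \mathcal{L}_{\mathrm{uni}}
  + \lambda_{\mathrm{con}}\, \mathcal{L}_{\mathrm{con}}
  + \lambda_{\mathrm{sep}}\, \mathcal{L}_{\mathrm{sep}}, \\
\mathcal{L}_{\mathrm{con}} &= \mathcal{L}_{\mathrm{attr}} + \mathcal{L}_{\mathrm{out}}, \\
\mathcal{L}_{\mathrm{attr}} &= \frac{1}{K} \sum_{k=1}^{K} \sum_{a \in \{\mathrm{vol},\, \mathrm{HU}\}}
  \frac{\lvert \hat{a}_k - a_k \rvert}{\lvert a_k \rvert + \epsilon}.
\end{align}
```

Normalizing by the reported value would keep the volume and HU terms on comparable scales even though their units differ.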
[0042] Furthermore, the dataset described in step 2 uses the AbdomenAtlas 3.0 dataset, which contains 9262 cases from 138 institutions. Each case includes a radiology report reviewed by 12 licensed radiologists and a tumor mask aligned with the voxels of the 3D CT volume data. Among them, there are 2122 cases of kidney lesions, 1472 cases of liver lesions, and 361 cases of pancreatic lesions. The dataset is divided into training set, validation set, and test set in a ratio of 8:1:1.
[0043] Further, step 3 implements the GLeVE neural network model using the PyTorch 2.4 framework and trains it on an NVIDIA H100 GPU. The model uses MedFormer as the visual encoder-decoder backbone network and TotalSegmentator to obtain organ segmentation priors. During training, the batch size is set to 1, the number of training epochs is set to 100, the random seed is set to 2026, the optimizer is AdamW, and the initial learning rate is... Furthermore, a cosine annealing strategy is employed to decay the learning rate.
[0044] Furthermore, step 4 uses Dice, HD95, Lesion Recall, and Lesion Localization Score to evaluate model performance, where Lesion Recall and Lesion Localization Score are defined as follows:
[0045] $\text{Lesion Recall} = N_{\mathrm{match}} / N_{\mathrm{gt}}$,
[0046] ,
[0047] where $N_{\mathrm{match}}$ denotes the number of predicted lesions that are successfully matched to actual lesions; $N_{\mathrm{gt}}$ denotes the actual number of lesions; $d_i$ denotes the distance between the center points of the $i$-th matched lesion pair, with a distance of 20 mm used for an unmatched real lesion; and $\hat{g}_i$ denotes the predicted lesion matched to the $i$-th actual lesion.
[0048] Compared with the prior art, the present invention has the following beneficial effects:
[0049] 1. This invention discloses a report-guided 3D CT lesion localization and segmentation method based on the GLeVE model. This method, based on the GLeVE neural network model, is specifically designed for lesion-level localization and segmentation tasks guided by radiology reports. By modeling the complete lesion description as atomic semantic units using a graph structure, this method overcomes the limitations of traditional phrase-level alignment methods, such as semantic fragmentation and susceptibility to confusion in multi-lesion scenarios.
[0050] 2. This invention provides a report-guided 3D CT lesion localization and segmentation method based on the GLeVE model. The lesion semantic modeling and query module captures the complex relationships between lesions and attributes such as organ attribution, spatial sublocation, size, and HU value. The anatomical prior lesion proposal and verification module integrates organ segmentation priors and verifies the consistency of candidate lesion regions. The octree autoregressive lesion refinement module progressively improves the segmentation accuracy of small lesions and fuzzy boundaries. This invention performs well under both fully supervised and weakly supervised settings. By combining lesion-level semantic graph reasoning, anatomical prior constraints, proposal verification, and hierarchical refinement, it can promote applications in report-guided 3D CT lesion localization and segmentation.
[0051] 3. The report-guided three-dimensional CT lesion localization and segmentation method based on the GLeVE model of the present invention remains strongly competitive under weakly supervised settings even when only a small proportion of mask annotations is used, and has good data efficiency and annotation robustness.
Description of the Drawings
[0052] Figure 1 is a flowchart of the overall process of the report-guided 3D CT lesion localization and segmentation method based on the GLeVE model of the present invention.
[0053] Figure 2 is a schematic diagram of the network structure of the GLeVE model in this invention, which includes a Lesion Semantic Modeling and Querying (LeQu) module, an Anatomy-Prior Lesion Proposal and Verification (AnVer) module, and an Octree Autoregressive Lesion Refinement (OcRe) module.
Detailed Implementation
[0054] To make the objectives, technical solutions, and beneficial effects of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and specific embodiments. The following embodiments are for illustrative purposes only and are not intended to limit the scope of protection of this invention.
[0055] Example 1
[0056] This invention provides a report-guided three-dimensional CT lesion localization and segmentation method based on the GLeVE model. As shown in Figure 1 and Figure 2, this method takes radiology reports and 3D CT volume data as input. First, it performs lesion-level structured parsing of the report with a large language model. Then, it uses graph reasoning to generate lesion-level queries. Next, it combines organ segmentation priors to complete candidate lesion proposal and verification. Finally, it uses an octree autoregressive refinement module to output lesion segmentation masks and localization results.
[0057] Specifically, the lesion localization and segmentation method of the present invention includes the following steps:
[0058] Step 1: Construct a GLeVE neural network model for lesion localization and segmentation in 3D CT guided by radiology reports. The GLeVE neural network model includes a lesion semantic modeling and querying module (LeQu), an anatomy-prior lesion proposal and verification module (AnVer), and an octree autoregressive lesion refinement module (OcRe).
[0059] Step 2: Obtain a dataset containing 3D CT volume data, radiology reports, and organ segmentation priors. Perform structured parsing on the radiology reports to extract attributes such as the organ to which the lesion belongs, sublocation, size, and HU value. Divide the dataset into training, validation, and test sets.
[0060] Step 3: Train the GLeVE neural network model using the training set. For samples with pixel-level mask annotations, use segmentation loss for supervision, and for samples with missing pixel-level mask annotations, use report-driven weak supervision loss for optimization.
[0061] Step 4: Input the radiology report to be tested and the 3D CT volume data into the trained GLeVE neural network model, and sequentially perform lesion-level query generation, candidate proposal, region consistency verification and octree autoregressive boundary refinement, and output lesion localization results and segmentation masks that correspond one-to-one with the report description.
[0062] Preferably, the Lesion Semantic Modeling and Querying (LeQu) module in step 1 uses the complete lesion description as the atomic semantic unit. It employs a frozen Qwen3-8B large language model to perform structured parsing of the radiology report, obtaining fields such as lesion set, organ affiliation, sublocation, size, and HU value. It then constructs a lesion semantic graph composed of lesion nodes, anatomical nodes, and attribute nodes. Subsequently, a relation-aware graph Transformer passes messages over the lesion semantic graph, preserving the explicit relationships between lesions and organs, between lesions and attributes, and among lesions, and generating a stable lesion-level semantic representation and query vectors. The relation-aware graph reasoning process is as follows:
[0063] ,
[0064] ,
[0065] where $h_i^{(l)}$ denotes the embedding of node $i$ at layer $l$; $\mathcal{N}(i)$ denotes the neighborhood set of node $i$; $W_Q$, $W_K$, and $W_V$ denote the linear mapping matrices for the query, key, and value, respectively; $W_R$ denotes the relation embedding mapping matrix; $r_{ij}$ denotes the relation-type embedding of edge $(i, j)$; $\mathrm{FFN}(\cdot)$ denotes a feedforward network; $\alpha_{ij}^{(l)}$ denotes the relation-aware attention weight from node $j$ to node $i$ at layer $l$; $j$ denotes a node in the neighborhood of node $i$; and $d$ denotes the feature dimension.
[0066] After graph reasoning, the representation of the $k$-th lesion node, $z_k$, serves as the semantic summary vector of that lesion. The query library generator then maps it to multiple queries for the lesion, as shown in the following formula:
[0067] $\{q_{k,m}\}_{m=1}^{M} = G(z_k)$,
[0068] where $q_{k,m}$ denotes the $m$-th query vector corresponding to the $k$-th lesion; $G(\cdot)$ denotes the query library generator; and $M$ denotes the number of queries corresponding to each lesion.
[0069] This step can preserve the semantic integrity of lesions in multi-lesion scenarios and avoid the semantic fragmentation problem caused by relying solely on phrase-level alignment.
[0070] Preferably, the Anatomy-Prior Lesion Proposal and Verification (AnVer) module in step 1 uses the organ segmentation result as an anatomical prior, performs feature linear modulation on the voxel features output by the visual encoder, and ensures that the candidate lesion proposals are consistent with the target organ; it constructs a candidate set based on the voxel-level similarity response, and further achieves one-to-one matching between the report description and the lesion region through region consistency verification. The specific process is as follows:
[0071] First, Gaussian smoothing is applied to the binary organ mask output by the organ segmenter to obtain a soft mask. An embedding is learned for each organ, and the anatomical-prior feature linear modulation is constructed as follows:
[0072] ,
[0073] ,
[0074] where $F(p)$ denotes the original feature of the visual encoder at position $p$; $\tilde{F}(p)$ denotes the organ-aware feature after anatomical prior injection; $\mathrm{MLP}(\cdot)$ denotes a lightweight multilayer perceptron; $a(p)$ denotes the anatomical prior label at position $p$; $O$ denotes the total number of organ categories; $C$ denotes the feature channel dimension; $\gamma(p)$ and $\beta(p)$ denote the channel scaling factor and offset factor at position $p$, respectively; and $\Phi(\cdot)$ denotes the organ prior aggregation mapping.
[0075] Subsequently, the multiple queries corresponding to the $k$-th lesion are matched against the organ-aware feature map by voxel-level cosine similarity to obtain the response map:
[0076] ,
[0077] Adaptive thresholding and 3D connected component analysis are then applied to the response map to obtain a high-recall candidate set. Feature pooling is performed on the candidate regions and, combined with evidence vectors such as organ coverage, volume, and average HU value, a region consistency score is calculated:
[0078] ,
[0079] where $v_j$ denotes the visual feature of the $j$-th candidate region; $e_{k,j}$ denotes the evidence of consistency between the candidate region and the reported attributes; $z_k$ denotes the lesion-level semantic vector; $s_{k,j}$ denotes the consistency score of the $j$-th candidate region for the $k$-th lesion; $\sigma(\cdot)$ denotes the activation function; $w$ and $b$ are learnable parameters; $j_k^{*}$ denotes the optimal candidate region for the $k$-th lesion description; and $J$ denotes the number of candidate regions.
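The candidate-generation step (adaptive thresholding followed by 3D connected component analysis) can be illustrated with a minimal pure-Python sketch. The text does not state the thresholding rule, so the rule used here (mean plus `k` standard deviations of the response map), the 6-connectivity, and the names `candidate_regions`, `NEIGHBORS`, and `k` are all assumptions for illustration. A practical implementation would operate on tensors rather than nested lists.

```python
from collections import deque
from statistics import mean, pstdev

# 6-connectivity: voxels that share a face are neighbors (assumed choice).
NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def candidate_regions(response, k=1.0):
    """Return (threshold, regions) for a response map given as nested lists [z][y][x].

    Each region is a list of (z, y, x) voxels forming one connected component
    of the voxels whose response exceeds the threshold.
    """
    depth, height, width = len(response), len(response[0]), len(response[0][0])
    values = [v for plane in response for row in plane for v in row]
    # Assumed adaptive rule: the threshold follows the statistics of this map.
    threshold = mean(values) + k * pstdev(values)

    seen = set()
    regions = []
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                if (z, y, x) in seen or response[z][y][x] <= threshold:
                    continue
                # Breadth-first flood fill collects one connected component.
                seen.add((z, y, x))
                queue = deque([(z, y, x)])
                region = []
                while queue:
                    cz, cy, cx = queue.popleft()
                    region.append((cz, cy, cx))
                    for dz, dy, dx in NEIGHBORS:
                        nz, ny, nx = cz + dz, cy + dy, cx + dx
                        inside = 0 <= nz < depth and 0 <= ny < height and 0 <= nx < width
                        if inside and (nz, ny, nx) not in seen and response[nz][ny][nx] > threshold:
                            seen.add((nz, ny, nx))
                            queue.append((nz, ny, nx))
                regions.append(region)
    return threshold, regions
```

Each returned region would then be pooled into a feature vector and scored by the region consistency verification described above.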
[0080] To ensure that each lesion description corresponds to only one dominant candidate region, a temperature-scaled unimodal distribution constraint is adopted, as shown in the following formula:
[0081] ,
[0082] ,
[0083] where $\mathcal{L}_{\mathrm{uni}}$ denotes the unimodal distribution constraint loss; $p_{k,j}$ denotes the temperature-scaled soft-assignment probability of the $j$-th candidate region for the $k$-th lesion; $\tau$ denotes the temperature coefficient; $\epsilon$ denotes the numerical stability constant; $K$ denotes the number of lesion descriptions; and $J$ denotes the number of candidate regions corresponding to each lesion.
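A temperature-scaled unimodal constraint of this kind is commonly realized as an entropy penalty on a softmax over candidate scores. The sketch below takes that reading, which is an assumption because the formulas are missing from the text. The function names and the default values of `tau` and `eps` are hypothetical.

```python
import math


def soft_assignment(scores, tau=0.1):
    """Temperature-scaled softmax over the candidate scores of one lesion."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp((s - top) / tau) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def unimodal_loss(score_rows, tau=0.1, eps=1e-8):
    """Mean entropy of the soft assignments, one row of scores per lesion.

    Low entropy means one candidate dominates, which is the behavior the
    unimodal constraint is meant to encourage (assumed formulation).
    """
    total = 0.0
    for scores in score_rows:
        probs = soft_assignment(scores, tau)
        total += -sum(p * math.log(p + eps) for p in probs)
    return total / len(score_rows)
```

Under this formulation a lesion whose scores single out one candidate incurs a smaller penalty than one whose scores are spread evenly, and a smaller `tau` sharpens the assignment.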
[0084] This step can suppress the noise response caused by voxel-level similarity peaks and improve the semantic alignment accuracy between reports and lesions.
[0085] Preferably, the Octree Autoregressive Lesion Refinement (OcRe) module in step 1 takes the optimal candidate region selected by the region consistency verification and builds an octree rooted at its circumscribed cube. A top-down, multi-level recursive refinement strategy progressively narrows the region of interest, achieving fine segmentation of small lesions and ambiguous boundaries. For each octree node, the cropped local organ-aware features and the upsampled coarse prediction from the previous level are used as conditional inputs to a parameter-shared 3D residual refinement head for autoregressive correction, as shown in the following formula:
[0086] ,
[0087] where $D_{\theta}$ denotes the 3D residual refinement decoder with shared parameters, and $\hat{M}^{(l)}$ denotes the refined prediction at the $l$-th level.
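The top-down octree traversal can be sketched independently of the learned refinement head. The example below only shows how a circumscribed cube might be subdivided and how the recursion could descend into octants that still contain predicted foreground. It does not implement the 3D residual refinement decoder. The function names, the power-of-two cube size, and the rule of skipping empty octants are assumptions.

```python
def split_octants(origin, size):
    """Split a cube given by its (z, y, x) origin and edge length into 8 children."""
    z, y, x = origin
    half = size // 2
    return [
        ((z + dz * half, y + dy * half, x + dx * half), half)
        for dz in (0, 1)
        for dy in (0, 1)
        for dx in (0, 1)
    ]


def contains_foreground(origin, size, voxels):
    """True if any predicted foreground voxel lies inside the cube."""
    z0, y0, x0 = origin
    return any(
        z0 <= z < z0 + size and y0 <= y < y0 + size and x0 <= x < x0 + size
        for z, y, x in voxels
    )


def octree_leaves(origin, size, voxels, max_depth):
    """Descend only into octants that contain foreground; return the leaf cubes."""
    if max_depth == 0 or size == 1:
        return [(origin, size)]
    leaves = []
    for child_origin, child_size in split_octants(origin, size):
        if contains_foreground(child_origin, child_size, voxels):
            leaves.extend(octree_leaves(child_origin, child_size, voxels, max_depth - 1))
    return leaves
```

In a full pipeline, each visited node would be where the cropped organ-aware features and the upsampled coarse prediction are passed to the shared refinement head, so computation concentrates on the small sub-cubes around the lesion boundary.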
[0088] Considering the large amount of data lacking pixel-level mask annotations in clinical samples, the method also introduces a report-driven weakly supervised loss, which is jointly optimized with the segmentation loss of labeled samples, as shown in the following formula:
[0089] ,
[0090] ,
[0091] ,
[0092] where $\mathcal{L}_{\mathrm{sep}}$ penalizes overlap between the predictions of different lesions; $\mathcal{L}_{\mathrm{out}}$ penalizes predicted lesion mass that falls outside the target organ mask; $\mathcal{L}_{\mathrm{seg}}$ denotes the combination of Dice loss and cross-entropy loss; $\mathcal{L}_{\mathrm{weak}}$ denotes the weak supervision loss; $\lambda_{\mathrm{uni}}$, $\lambda_{\mathrm{con}}$, and $\lambda_{\mathrm{sep}}$ denote the weighting coefficients of the unimodal distribution constraint loss, consistency loss, and separation loss, respectively; $\mathcal{L}_{\mathrm{con}}$ denotes the consistency loss; $\mathcal{L}_{\mathrm{attr}}$ denotes the attribute consistency loss; $a$ ranges over the two attributes, volume and average HU value; $\hat{a}_k$ and $a_k$ denote the predicted and reported attribute values of the $k$-th lesion, respectively; $\epsilon$ is a constant that prevents a zero denominator; $\mathcal{L}_{\mathrm{total}}$ denotes the total loss; $\mathbb{1}[\cdot]$ denotes the indicator function, which takes the value 1 when sample $n$ belongs to the set of samples with pixel-level mask annotations and 0 otherwise; and $\lambda_{\mathrm{seg}}$ and $\lambda_{\mathrm{weak}}$ denote the weighting coefficients of the segmentation loss and the weak supervision loss, respectively.
[0093] Preferably, in this embodiment, the dataset mentioned in step 2 is the AbdomenAtlas 3.0 dataset, which contains 9262 cases from 138 institutions. Each case includes a radiology report reviewed by 12 licensed radiologists and a tumor mask aligned with the voxels of the 3D CT volume data. Among them, there are 2122 cases of kidney lesions, 1472 cases of liver lesions, and 361 cases of pancreatic lesions. The dataset is divided into training set, validation set, and test set in a ratio of 8:1:1.
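An 8:1:1 split of the 9262 cases can be made reproducible with a seeded shuffle, as in the following minimal sketch. The text does not say how the split was drawn. Reusing the training seed 2026 for the split, rounding the training and validation sizes down, and the name `split_cases` are assumptions.

```python
import random


def split_cases(case_ids, seed=2026):
    """Shuffle case IDs with a fixed seed and split them 8:1:1."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 8 // 10  # integer arithmetic avoids float rounding
    n_val = len(ids) // 10
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

With 9262 cases this rounding yields 7409 training, 926 validation, and 927 test cases. Splitting at the case level keeps all lesions of one patient volume in the same subset.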
[0094] Preferably, step 3 uses the PyTorch 2.4 framework to implement the GLeVE neural network model and completes training on an NVIDIA H100 GPU; the model uses MedFormer as the visual encoder-decoder backbone network and TotalSegmentator to obtain organ segmentation priors; during training, the batch size is set to 1, the number of training epochs is set to 100, the random seed is set to 2026, the optimizer is AdamW, and the initial learning rate is... Furthermore, a cosine annealing strategy is employed to decay the learning rate.
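The cosine annealing schedule over 100 epochs can be written in closed form. The initial learning rate is elided in the text, so `base_lr` is left as a required argument here instead of being filled in. The minimum learning rate of zero and the function name are assumptions.

```python
import math


def cosine_annealed_lr(epoch, base_lr, total_epochs=100, min_lr=0.0):
    """Learning rate at a given epoch under cosine annealing without restarts."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)
```

The rate starts at `base_lr`, reaches half of it at epoch 50, and decays to `min_lr` at epoch 100. In PyTorch this corresponds to pairing the AdamW optimizer with a cosine annealing scheduler whose period equals the number of training epochs.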
[0095] Preferably, step 4 uses Dice, HD95, Lesion Recall, and Lesion Localization Score to evaluate model performance, where Lesion Recall and Lesion Localization Score are defined as follows:
[0096] $\text{Lesion Recall} = N_{\mathrm{match}} / N_{\mathrm{gt}}$,
[0097] ,
[0098] where $N_{\mathrm{match}}$ denotes the number of predicted lesions that are successfully matched to actual lesions; $N_{\mathrm{gt}}$ denotes the actual number of lesions; $d_i$ denotes the distance between the center points of the $i$-th matched lesion pair, with a distance of 20 mm used for an unmatched real lesion; and $\hat{g}_i$ denotes the predicted lesion matched to the $i$-th actual lesion.
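The two lesion-level metrics can be illustrated on a list of per-lesion center distances. Lesion Recall follows directly from the definitions above. The Lesion Localization Score formula is missing from the text, so the linear falloff used here (a score of 1 at zero distance, falling to 0 at 20 mm, with unmatched lesions assigned 20 mm) is purely an assumption. The function names and the use of `None` for unmatched lesions are likewise hypothetical.

```python
def lesion_recall(distances):
    """Fraction of real lesions that have a matched prediction.

    `distances` holds one entry per real lesion: the center distance in mm
    to its matched prediction, or None if the lesion is unmatched.
    """
    matched = sum(1 for d in distances if d is not None)
    return matched / len(distances)


def lesion_localization_score(distances, d_max=20.0):
    """Assumed score: mean of 1 - d / d_max, with unmatched lesions at d_max."""
    total = 0.0
    for d in distances:
        effective = d_max if d is None else min(d, d_max)
        total += 1.0 - effective / d_max
    return total / len(distances)
```

For four real lesions with distances of 0 mm, 10 mm, and two unmatched, this gives a recall of 0.5 and an assumed localization score of 0.375, so the score rewards both finding a lesion and centering the prediction on it.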
[0099] This invention compares the proposed GLeVE model with several state-of-the-art methods. The multimodal foundation models include M3D-LaMed, Merlin, SAT, R1Seg-3D, and CT-CLIP; the general segmentation methods include UNet, nnUNetV2, STUNet, SwinUNETR, MedFormer, and Med-SAM3; and the report-assisted method is R-Super. Experimental results show that, under fully supervised settings, this invention achieves a 7.26% improvement in average Dice over the best general segmentation method and a 7.3 mm reduction in HD95 compared with the state-of-the-art report-assisted method, and attains a Lesion Recall of 76.2% and a Lesion Localization Score of 33.7%. Under weakly supervised settings, it remains strongly competitive even when using only 10% or 25% of the mask annotations, indicating that this invention has good data efficiency and annotation robustness.
[0100] Furthermore, ablation experiments showed that replacing lesion semantic graph reasoning with structured descriptive text encoding reduced Lesion Recall by 2.9%, removing the anatomical prior lesion proposal and verification module reduced Lesion Localization Score by 7.3%, and removing the octree autoregressive lesion refinement module increased the average HD95 of organs by 4.2 mm. This indicates that the LeQu, AnVer, and OcRe modules play complementary roles in semantic modeling, candidate verification, and boundary refinement.
[0101] The CT lesion localization and segmentation method of this invention, based on the GLeVE neural network model, is specifically designed for lesion-level localization and segmentation tasks guided by radiology reports. By modeling the complete lesion description as an atomic semantic unit using a graph structure, this method overcomes the limitations of traditional phrase-level alignment methods, such as semantic fragmentation and susceptibility to confusion in multi-lesion scenarios. Specifically, the lesion semantic modeling and query module captures the complex relationships between lesions and attributes such as organ attribution, spatial sublocation, size, and HU value; the anatomical prior lesion proposal and verification module fuses organ segmentation priors and verifies the consistency of candidate lesion regions; and the octree autoregressive lesion refinement module progressively improves the segmentation accuracy of small lesions and ambiguous boundaries. This invention performs well under both fully supervised and weakly supervised settings, as verified in experiments on the AbdomenAtlas 3.0 dataset.
[0102] The embodiments and implementation process of the present invention have been described in detail above with reference to the accompanying drawings and tables, but the present invention is not limited to the described embodiments. For those skilled in the art, various changes, modifications, substitutions, and variations can be made to these embodiments, including components, without departing from the principles and spirit of the present invention, and these variations still fall within the protection scope of the present invention.
Claims
1. A report-guided three-dimensional CT lesion localization and segmentation method based on the GLeVE model, characterized in that it includes the following steps:
Step 1: Construct a GLeVE neural network model for lesion localization and segmentation in 3D CT guided by radiology reports. The GLeVE neural network model includes a lesion semantic modeling and query module, an anatomical prior lesion proposal and verification module, and an octree autoregressive lesion refinement module.
Step 2: Obtain a dataset containing 3D CT volume data, radiology reports, and organ segmentation priors. Perform structured parsing on the radiology reports to extract attributes such as the organ to which the lesion belongs, sublocation, size, and HU value. Divide the dataset into a training set, a validation set, and a test set.
Step 3: Train the GLeVE neural network model using the training set. For samples with pixel-level mask annotations, use the segmentation loss for supervision; for samples without pixel-level mask annotations, use the report-driven weak supervision loss for optimization.
Step 4: Input the radiology report to be tested and the 3D CT volume data into the trained GLeVE neural network model, sequentially perform lesion-level query generation, candidate proposal, region consistency verification, and octree autoregressive boundary refinement, and output lesion localization results and segmentation masks that correspond one-to-one with the report description.
2. The report-guided three-dimensional CT lesion localization and segmentation method based on the GLeVE model according to claim 1, characterized in that: The lesion semantic modeling and query module described in step 1 uses the complete lesion description as the atomic semantic unit. It employs a frozen large language model to perform structured parsing of the radiology report, obtaining fields such as the lesion set, organ affiliation, sublocation, size, and HU value. It then constructs a lesion semantic graph composed of lesion nodes, anatomical nodes, and attribute nodes. Subsequently, it performs relation-aware graph reasoning (message passing) on the lesion semantic graph, thereby preserving the explicit relationships between lesions and organs, between lesions and attributes, and among lesions, and generating a stable lesion-level semantic representation and query vectors.
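As an illustration of the graph construction described in this claim, the following is a minimal sketch that builds lesion, anatomical, and attribute nodes from already-parsed report fields. The field names (`organ`, `sublocation`, `size_mm`, `hu`), the relation labels, and the `SemanticGraph` container are hypothetical and are not taken from the source; the frozen language model parsing step is assumed to have been run beforehand.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticGraph:
    nodes: dict = field(default_factory=dict)  # node id -> node type
    edges: list = field(default_factory=list)  # (source, relation, target)

    def add_node(self, node_id, node_type):
        self.nodes[node_id] = node_type

    def add_edge(self, src, relation, dst):
        self.edges.append((src, relation, dst))


def build_lesion_graph(parsed_lesions):
    """Build lesion, anatomical, and attribute nodes from parsed report fields."""
    graph = SemanticGraph()
    for idx, lesion in enumerate(parsed_lesions):
        lesion_id = f"lesion_{idx}"
        graph.add_node(lesion_id, "lesion")
        # One anatomical node per organ, shared by all lesions in that organ.
        organ_id = f"organ:{lesion['organ']}"
        graph.add_node(organ_id, "anatomy")
        graph.add_edge(lesion_id, "located_in", organ_id)
        # Attribute nodes are only created for fields the report mentions.
        for attr in ("sublocation", "size_mm", "hu"):
            if lesion.get(attr) is not None:
                attr_id = f"{lesion_id}:{attr}={lesion[attr]}"
                graph.add_node(attr_id, "attribute")
                graph.add_edge(lesion_id, f"has_{attr}", attr_id)
    # Lesion-lesion edges keep the multi-lesion context explicit.
    lesion_ids = [n for n, t in graph.nodes.items() if t == "lesion"]
    for i, a in enumerate(lesion_ids):
        for b in lesion_ids[i + 1:]:
            graph.add_edge(a, "co_occurs_with", b)
    return graph


# Hypothetical parsed report with two liver lesions (HU missing for the second).
report = [
    {"organ": "liver", "sublocation": "segment 7", "size_mm": 12, "hu": 35},
    {"organ": "liver", "sublocation": "segment 4", "size_mm": 8, "hu": None},
]
graph = build_lesion_graph(report)
```

Keeping each lesion as its own node, rather than splitting the description into phrases, is what lets the later stages address one query set per lesion.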
3. The report-guided three-dimensional CT lesion localization and segmentation method based on the GLeVE model according to claim 2, characterized in that: The relation-aware graph reasoning process in step 1 is as follows:
$$\alpha_{ij}^{(l)} = \operatorname{softmax}_{j \in \mathcal{N}(i)}\left(\frac{\left(W_Q h_i^{(l)}\right)^{\top}\left(W_K h_j^{(l)} + W_R r_{ij}\right)}{\sqrt{d}}\right),$$
$$h_i^{(l+1)} = \mathrm{FFN}\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l)} W_V h_j^{(l)}\right),$$
where $h_i^{(l)}$ represents the embedding of node $v_i$ at layer $l$; $\mathcal{N}(i)$ represents the neighborhood set of node $v_i$; $W_Q$, $W_K$, and $W_V$ represent the linear mapping matrices for the query, key, and value, respectively; $W_R$ represents the relation embedding mapping matrix; $r_{ij}$ represents the relation type embedding of edge $(i, j)$; $\mathrm{FFN}$ indicates a feedforward network; $\alpha_{ij}^{(l)}$ indicates the relation-aware attention weight from node $v_j$ to node $v_i$ at layer $l$; $v_j$ represents a node in the neighborhood of node $v_i$; and $d$ indicates the feature dimension.
After graph reasoning, the $k$-th lesion node is taken as the semantic summary vector $z_k$ of the lesion. The query bank generator then maps it to multiple queries for the lesion, as shown in the following formula:
$$\{q_{k,m}\}_{m=1}^{M} = G(z_k),$$
where $q_{k,m}$ indicates the $m$-th query vector corresponding to the $k$-th lesion, $G$ indicates the query bank generator, and $M$ indicates the number of queries corresponding to each lesion.
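The relation-aware attention described in this claim can be sketched in NumPy as below. This is an illustrative single layer under stated assumptions: the relation embedding is added to the key before the scaled dot product, the feedforward network is omitted, and every node is assumed to have at least one neighbor. The function name, argument layout, and toy graph are hypothetical.

```python
import numpy as np


def relation_aware_layer(h, edges, rel_emb, W_q, W_k, W_v, W_r):
    """One attention layer over a typed graph.

    h: (N, d) node embeddings; edges: list of (i, j, relation_id) meaning
    node j is a neighbor of node i; rel_emb: (R, d) relation type embeddings.
    Returns the attention weights (N, N) and the aggregated messages (N, d).
    """
    n, d = h.shape
    q, k, v = h @ W_q.T, h @ W_k.T, h @ W_v.T
    scores = np.full((n, n), -np.inf)  # non-edges get zero attention weight
    for i, j, r in edges:
        key = k[j] + W_r @ rel_emb[r]  # relation-augmented key (assumption)
        scores[i, j] = q[i] @ key / np.sqrt(d)
    # Softmax restricted to each node's neighborhood.
    scores -= scores.max(axis=1, keepdims=True)
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha, alpha @ v


rng = np.random.default_rng(0)
d = 4
h = rng.normal(size=(3, d))          # node 0: lesion, 1: organ, 2: attribute
rel_emb = rng.normal(size=(2, d))    # relation 0: located_in, 1: has_size
W_q, W_k, W_v, W_r = (rng.normal(size=(d, d)) for _ in range(4))
edges = [(0, 1, 0), (1, 0, 0), (0, 2, 1), (2, 0, 1)]
alpha, messages = relation_aware_layer(h, edges, rel_emb, W_q, W_k, W_v, W_r)
```

In a full model the messages would pass through the feedforward network to give the next-layer embeddings, and the lesion rows of the final layer would feed the query bank generator.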
4. The report-guided three-dimensional CT lesion localization and segmentation method based on the GLeVE model according to claim 3, characterized in that: The anatomical prior lesion proposal and verification module described in step 1 uses the organ segmentation result as the anatomical prior and applies feature-wise linear modulation to the voxel features output by the visual encoder, so that the candidate lesion proposals are consistent with the target organ; a candidate set is constructed from the voxel-level similarity response, and regional consistency verification is then used to achieve one-to-one matching between the report description and the lesion region.
5. The report-guided three-dimensional CT lesion localization and segmentation method based on the GLeVE model according to claim 4, characterized in that: The specific processing procedure of the anatomical prior lesion proposal and verification module in step 1 is as follows: First, Gaussian smoothing is applied to the binary organ masks output by the organ segmenter to obtain soft masks, an embedding $e_o$ is learned for each organ, and the anatomical prior feature-wise linear modulation is constructed:
$$[\gamma(p), \beta(p)] = \mathrm{MLP}\left(\phi(a(p))\right),$$
$$\tilde{F}(p) = \gamma(p) \odot F(p) + \beta(p),$$
where $F(p) \in \mathbb{R}^{C}$ indicates the original feature of the visual encoder at position $p$; $\tilde{F}(p)$ indicates the organ-aware feature after anatomical prior injection; $\mathrm{MLP}$ is a lightweight multilayer perceptron; $a(p) \in \mathbb{R}^{O}$ indicates the anatomical prior at position $p$; $O$ indicates the total number of organ categories; $C$ indicates the feature channel dimension; $\gamma(p)$ and $\beta(p)$ represent the channel scaling factor and offset factor at position $p$, respectively; and $\phi$ represents the organ prior aggregation mapping.
Subsequently, the multiple queries $q_{k,m}$ corresponding to the $k$-th lesion are matched against the organ-aware feature map by voxel-level cosine similarity to obtain the response map:
$$R_{k,m}(p) = \frac{q_{k,m}^{\top}\, \tilde{F}(p)}{\lVert q_{k,m} \rVert \, \lVert \tilde{F}(p) \rVert}.$$
Adaptive thresholding and 3D connected component analysis are performed on the response map to obtain a high-recall candidate set. Next, feature pooling is performed on the candidate regions, and the region consistency score is calculated by combining evidence vectors such as organ coverage, volume, and average HU value:
$$s_{k,j} = \sigma\left(w^{\top}\left[v_j;\, c_{k,j};\, z_k\right] + b\right), \qquad j_k^{*} = \arg\max_{j \in \{1, \dots, J\}} s_{k,j},$$
where $v_j$ indicates the visual feature of the $j$-th candidate region; $c_{k,j}$ indicates the evidence of consistency between the candidate region and the reported attributes; $z_k$ represents the lesion-level semantic vector; $s_{k,j}$ indicates the consistency score of the $j$-th candidate region for the $k$-th lesion; $\sigma$ represents the activation function; $w$ and $b$ are learnable parameters; $j_k^{*}$ indicates the optimal candidate region for the $k$-th lesion description; and $J$ indicates the number of candidate regions.
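The modulation and matching steps in this claim can be sketched as follows. This is a minimal NumPy illustration, not the claimed implementation: a single linear map stands in for the lightweight MLP, the scale is parameterized as `1 + ...` so that zero weights leave features unchanged, and the adaptive threshold (mean plus one standard deviation) is a placeholder. All names and shapes are hypothetical, and the 3D connected component step is left out.

```python
import numpy as np


def film_modulate(feat, soft_masks, organ_emb, W_gamma, W_beta):
    """feat: (C, D, H, W); soft_masks: (O, D, H, W); organ_emb: (O, E)."""
    # Aggregate organ embeddings per voxel, weighted by the soft organ masks.
    prior = np.einsum("odhw,oe->edhw", soft_masks, organ_emb)
    gamma = 1.0 + np.einsum("ce,edhw->cdhw", W_gamma, prior)  # channel scale
    beta = np.einsum("ce,edhw->cdhw", W_beta, prior)          # channel offset
    return gamma * feat + beta


def cosine_response(query, feat, eps=1e-8):
    """Cosine similarity between one query (C,) and every voxel feature."""
    num = np.einsum("c,cdhw->dhw", query, feat)
    den = np.linalg.norm(query) * np.linalg.norm(feat, axis=0) + eps
    return num / den


rng = np.random.default_rng(0)
C, O, E, shape = 4, 2, 3, (3, 3, 3)
feat = rng.normal(size=(C, *shape))
soft_masks = rng.uniform(size=(O, *shape))   # stand-in for smoothed organ masks
organ_emb = rng.normal(size=(O, E))
W_gamma = 0.1 * rng.normal(size=(C, E))
W_beta = 0.1 * rng.normal(size=(C, E))

organ_aware = film_modulate(feat, soft_masks, organ_emb, W_gamma, W_beta)
query = rng.normal(size=C)
response = cosine_response(query, organ_aware)
# Placeholder adaptive threshold: mean plus one standard deviation.
candidate_voxels = response > response.mean() + response.std()
```

A labeling routine such as `scipy.ndimage.label` could then group `candidate_voxels` into 3D connected components, each of which would become one candidate region for consistency scoring.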
6. The report-guided three-dimensional CT lesion localization and segmentation method based on the GLeVE model according to claim 5, characterized in that: To ensure that each lesion description corresponds to only one dominant candidate region, a temperature-scaled unimodal distribution constraint is adopted, as shown in the following formulas:
$$p_{k,j} = \frac{\exp\left(s_{k,j} / \tau\right)}{\sum_{j'=1}^{J} \exp\left(s_{k,j'} / \tau\right)},$$
$$\mathcal{L}_{\mathrm{uni}} = -\frac{1}{K} \sum_{k=1}^{K} \sum_{j=1}^{J} p_{k,j} \log\left(p_{k,j} + \epsilon\right),$$
where $\mathcal{L}_{\mathrm{uni}}$ represents the unimodal distribution constraint loss; $p_{k,j}$ indicates the temperature-scaled soft-assignment probability of the $j$-th candidate region for the $k$-th lesion; $\tau$ indicates the temperature coefficient; $\epsilon$ represents the numerical stability constant; $K$ indicates the number of lesion descriptions; and $J$ indicates the number of candidate regions corresponding to each lesion.
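A small numerical sketch may help show how a temperature-scaled constraint favors a single dominant candidate. It assumes the constraint is realized as an entropy penalty on the softmax over consistency scores, which is one common way to encourage a unimodal assignment; the function names and the values of the temperature and stability constant are hypothetical.

```python
import math


def soft_assignment(scores, tau=0.1):
    """Temperature-scaled softmax over one lesion's candidate scores."""
    scaled = [s / tau for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def unimodal_loss(score_rows, tau=0.1, eps=1e-8):
    """Mean entropy of the per-lesion assignment distributions."""
    loss = 0.0
    for scores in score_rows:
        p = soft_assignment(scores, tau)
        loss -= sum(pi * math.log(pi + eps) for pi in p)
    return loss / len(score_rows)


# One lesion with a clear best candidate versus one with three tied candidates.
peaked = unimodal_loss([[0.9, 0.1, 0.1]])
flat = unimodal_loss([[0.5, 0.5, 0.5]])
```

The tied case yields the maximum entropy for three candidates, while the peaked case is close to zero, so minimizing this term pushes each lesion toward one dominant region. A lower temperature sharpens the distribution further.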
7. The report-guided three-dimensional CT lesion localization and segmentation method based on the GLeVE model according to claim 6, characterized in that: The octree autoregressive lesion refinement module in step 1 starts from the optimal candidate region selected by region consistency verification and constructs an octree with the circumscribed cube of that region as the root node. A top-down, multi-level recursive refinement strategy is used to progressively narrow the region of interest, achieving fine segmentation of small lesions and ambiguous boundaries. For each octree node $n$ at level $l$, the cropped local organ-aware features $\tilde{F}_n$ and the upsampled coarse prediction from the previous level $\mathrm{Up}(\hat{Y}^{(l-1)})$ are used as conditional inputs and fed into the parameter-shared 3D residual refinement head for autoregressive correction, with the formula as follows:
$$\hat{Y}_n^{(l)} = D_{\theta}\left(\tilde{F}_n,\ \mathrm{Up}\left(\hat{Y}^{(l-1)}\right)_n\right),$$
where $D_{\theta}$ represents the 3D residual refinement decoder with shared parameters, and $\hat{Y}^{(l)}$ indicates the refined prediction result at level $l$.
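The top-down octree traversal in this claim can be illustrated with a short pure-Python sketch. The neural refinement head is replaced here by a hypothetical `needs_refinement` predicate, and the cube side is rounded up to a power of two so that octants split evenly; both are assumptions made for illustration only.

```python
def circumscribed_cube(bbox):
    """bbox: (z0, y0, x0, z1, y1, x1), half-open. Returns (origin, side)."""
    z0, y0, x0, z1, y1, x1 = bbox
    side = max(z1 - z0, y1 - y0, x1 - x0)
    side = 1 << (side - 1).bit_length()  # round up to a power of two
    return (z0, y0, x0), side


def split_octants(origin, side):
    """Split a cube into its eight child cubes."""
    half = side // 2
    z, y, x = origin
    return [((z + dz * half, y + dy * half, x + dx * half), half)
            for dz in (0, 1) for dy in (0, 1) for dx in (0, 1)]


def refine(origin, side, needs_refinement, min_side=2, depth=0):
    """Descend only into nodes that the predicate flags for refinement."""
    if side <= min_side or not needs_refinement(origin, side):
        return [(origin, side, depth)]
    leaves = []
    for child_origin, child_side in split_octants(origin, side):
        leaves.extend(
            refine(child_origin, child_side, needs_refinement, min_side, depth + 1)
        )
    return leaves


# Hypothetical candidate region and a single uncertain boundary voxel.
root_origin, root_side = circumscribed_cube((10, 10, 10, 16, 14, 15))
boundary_voxel = (12, 11, 11)


def contains_boundary(origin, side):
    return all(o <= b < o + side for o, b in zip(origin, boundary_voxel))


leaves = refine(root_origin, root_side, contains_boundary)
```

Only the octants that contain uncertain voxels are subdivided, so compute concentrates on small lesions and ambiguous boundaries while the rest of the cube stays at a coarse level.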
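8. The report-guided three-dimensional CT lesion localization and segmentation method based on the GLeVE model according to claim 7, characterized in that: Because a large share of clinical samples lack pixel-level mask annotations, a report-driven weakly supervised loss is also introduced and jointly optimized with the segmentation loss of labeled samples, as shown in the following formulas:
$$\mathcal{L}_{\mathrm{weak}} = \lambda_{\mathrm{uni}} \mathcal{L}_{\mathrm{uni}} + \lambda_{\mathrm{con}} \mathcal{L}_{\mathrm{con}} + \lambda_{\mathrm{sep}} \mathcal{L}_{\mathrm{sep}},$$
$$\mathcal{L}_{\mathrm{con}} = \mathcal{L}_{\mathrm{attr}} + \mathcal{L}_{\mathrm{out}}, \qquad \mathcal{L}_{\mathrm{attr}} = \frac{1}{K} \sum_{k=1}^{K} \sum_{a \in \mathcal{A}} \frac{\left|\hat{a}_k - a_k\right|}{\left|a_k\right| + \epsilon},$$
$$\mathcal{L}_{\mathrm{total}} = \mathbb{1}\left[i \in \mathcal{S}_{\mathrm{mask}}\right] \lambda_{\mathrm{seg}} \mathcal{L}_{\mathrm{seg}} + \lambda_{\mathrm{weak}} \mathcal{L}_{\mathrm{weak}},$$
where $\mathcal{L}_{\mathrm{sep}}$ is used to penalize overlap between the predictions of different lesions; $\mathcal{L}_{\mathrm{out}}$ is used to penalize predicted lesion mass that falls outside the target organ's mask; $\mathcal{L}_{\mathrm{seg}}$ represents the combination of Dice loss and cross-entropy loss; $\mathcal{L}_{\mathrm{weak}}$ indicates the weak supervision loss; $\lambda_{\mathrm{uni}}$, $\lambda_{\mathrm{con}}$, and $\lambda_{\mathrm{sep}}$ represent the weighting coefficients of the unimodal distribution constraint loss, consistency loss, and separation loss, respectively; $\mathcal{L}_{\mathrm{con}}$ indicates the consistency loss; $\mathcal{L}_{\mathrm{attr}}$ represents the attribute consistency loss; $\mathcal{A}$ represents the two attributes, volume and average HU value; $\hat{a}_k$ and $a_k$ represent the predicted and reported attribute values of the $k$-th lesion; $\epsilon$ is a constant that prevents a zero denominator; $\mathcal{L}_{\mathrm{total}}$ indicates the total loss; $\mathbb{1}[\cdot]$ indicates the indicator function, which takes the value 1 when sample $i$ belongs to the sample set $\mathcal{S}_{\mathrm{mask}}$ with pixel-level mask annotations and 0 otherwise; and $\lambda_{\mathrm{seg}}$ and $\lambda_{\mathrm{weak}}$ represent the weighting coefficients of the segmentation loss and the weak supervision loss, respectively.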
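The mixed supervision described in this claim can be sketched with plain scalars. The relative-error form of the attribute term and all weight values are assumptions for illustration; the individual loss values passed in are placeholders rather than outputs of a real model, and the names are hypothetical.

```python
def attribute_consistency(pred, reported, eps=1e-6):
    """Mean relative error over (volume, mean HU) across lesions."""
    total = 0.0
    for p, r in zip(pred, reported):
        for key in ("volume", "mean_hu"):
            total += abs(p[key] - r[key]) / (abs(r[key]) + eps)
    return total / len(pred)


def total_loss(has_mask, seg, uni, con, sep,
               lam_seg=1.0, lam_weak=0.5,
               lam_uni=1.0, lam_con=1.0, lam_sep=1.0):
    """Segmentation loss only for masked samples; weak loss for all samples."""
    weak = lam_uni * uni + lam_con * con + lam_sep * sep
    seg_term = lam_seg * seg if has_mask else 0.0
    return seg_term + lam_weak * weak


# Hypothetical predicted versus reported attributes for one lesion.
pred = [{"volume": 900.0, "mean_hu": 40.0}]
reported = [{"volume": 1000.0, "mean_hu": 50.0}]
con = attribute_consistency(pred, reported)

labeled = total_loss(True, seg=0.4, uni=0.1, con=con, sep=0.0)
unlabeled = total_loss(False, seg=0.4, uni=0.1, con=con, sep=0.0)
```

For a sample without a mask the indicator removes the segmentation term, so the gradient comes entirely from the report-driven terms. This is how samples that only have a report can still contribute to training.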
9. The report-guided three-dimensional CT lesion localization and segmentation method based on the GLeVE model according to claim 1, characterized in that: The dataset described in step 2 is the AbdomenAtlas 3.0 dataset, which is divided into training, validation, and test sets in an 8:1:1 ratio. In step 3, the GLeVE neural network model is implemented in the PyTorch 2.4 framework and trained on an NVIDIA H100 GPU. The model uses MedFormer as the visual encoder-decoder backbone network and TotalSegmentator to obtain the organ segmentation priors. During training, a cosine annealing strategy is used to decay the learning rate.
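For reference, the cosine annealing decay mentioned in this claim follows a standard closed form, sketched below in plain Python. The maximum and minimum learning rates and the step count are hypothetical values, since the source does not state them.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max=1e-4, lr_min=1e-6):
    """Learning rate decayed from lr_max to lr_min along half a cosine period."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cos_term


schedule = [cosine_annealing_lr(s, 100) for s in range(101)]
```

In PyTorch the same schedule is available as `torch.optim.lr_scheduler.CosineAnnealingLR`, with `T_max` playing the role of `total_steps` and `eta_min` the role of `lr_min`.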
10. The report-guided three-dimensional CT lesion localization and segmentation method based on the GLeVE model according to claim 1, characterized in that: Step 4 uses Dice, HD95, Lesion Recall, and Lesion Localization Score (LLS) to evaluate model performance. Lesion Recall and Lesion Localization Score are defined as follows:
$$\text{Lesion Recall} = \frac{N_{\mathrm{match}}}{N_{\mathrm{gt}}},$$
$$\mathrm{LLS} = \frac{1}{N_{\mathrm{gt}}} \sum_{k=1}^{N_{\mathrm{gt}}} \max\left(0,\ 1 - \frac{d_k}{d_{\max}}\right),$$
where $N_{\mathrm{match}}$ indicates the number of predicted lesions that successfully match actual lesions; $N_{\mathrm{gt}}$ indicates the actual number of lesions; $d_k$ indicates the distance between the center points of the $k$-th actual lesion and its matched predicted lesion $\hat{L}_k$; $d_{\max}$ is taken as 20 mm; for an unmatched actual lesion, $d_k = d_{\max}$; and $\hat{L}_k$ indicates the predicted lesion that matches the $k$-th actual lesion.
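The two lesion-level metrics can be computed as in the sketch below. It assumes that matching between actual and predicted lesions has already been done (the matching rule is not specified here), that an unmatched lesion contributes a score of zero, and that the localization score decreases linearly with center distance up to 20 mm. The function name and the example coordinates are hypothetical.

```python
import math


def lesion_metrics(gt_centers, matched_pred_centers, d_max=20.0):
    """Return (lesion recall, lesion localization score).

    matched_pred_centers[k] is the center of the prediction matched to the
    k-th actual lesion, or None when that lesion is unmatched. Units: mm.
    """
    n_gt = len(gt_centers)
    n_match = sum(p is not None for p in matched_pred_centers)
    score = 0.0
    for g, p in zip(gt_centers, matched_pred_centers):
        d = d_max if p is None else math.dist(g, p)
        score += max(0.0, 1.0 - d / d_max)
    return n_match / n_gt, score / n_gt


# Three actual lesions: one matched 5 mm away, one missed, one matched exactly.
gt = [(0.0, 0.0, 0.0), (10.0, 10.0, 10.0), (50.0, 50.0, 50.0)]
matched = [(0.0, 0.0, 5.0), None, (50.0, 50.0, 50.0)]
recall, lls = lesion_metrics(gt, matched)
```

Because every matched lesion scores at most 1 and every missed lesion scores 0, the localization score under this definition can never exceed the recall.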