A method for automatically delineating nasopharyngeal carcinoma lymph node tumor target region based on plain scan and enhanced CT
By combining plain and enhanced CT techniques and utilizing the spatiotemporal phase fusion network STPFNet, the accuracy and consistency issues of delineating lymph node tumor target areas in nasopharyngeal carcinoma have been resolved. This has enabled automated delineation, improved delineation accuracy and consistency, and reduced the burden on physicians.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZHEJIANG UNIV
- Filing Date
- 2023-11-02
- Publication Date
- 2026-06-16
AI Technical Summary
Existing techniques present significant challenges in delineating the target area of nasopharyngeal carcinoma lymph node tumors, especially since the tumors are highly similar to surrounding tissues on plain CT scans, resulting in poor delineation accuracy and consistency. Furthermore, existing methods require MRI for auxiliary delineation, which introduces registration errors.
By combining plain and enhanced CT scans with a deep learning model, and by comparing the differences in lymph nodes of metastatic nasopharyngeal carcinoma at different imaging stages, a spatiotemporal phase fusion network STPFNet was designed to achieve automatic delineation of lymph node tumor target areas and avoid elastic registration errors.
The method improves the accuracy and consistency of nasopharyngeal carcinoma lymph node tumor target delineation, reduces the delineation burden on doctors, and gives subsequent radiotherapy a more accurate and consistent target.
Figure CN117455877B_ABST
Description
Technical Field
[0001] This invention relates to the field of lesion segmentation in medical image processing, specifically a method for automatically delineating the target area of nasopharyngeal carcinoma lymph node tumors based on plain and enhanced CT scans and a deep learning model.
Background Technology
[0002] Nasopharyngeal carcinoma is one of the most common cancers in East and Southeast Asia, and radiotherapy is one of the preferred treatment methods. Radiotherapy refers to the use of ionizing radiation from high-energy rays to kill cancer cells, thereby treating tumors. Successful radiotherapy for nasopharyngeal carcinoma relies on precise tumor target delineation. Before the advent of automated target delineation, doctors had to manually delineate the target area layer by layer, which was time-consuming and labor-intensive, and suffered from poor consistency due to inter-observer variability. In recent years, with the continuous development of deep learning, researchers have used deep learning models to achieve automated and accurate tumor target delineation, reducing the burden on doctors while improving the consistency of delineation.
[0003] The nasopharyngeal carcinoma target volume (GTV) mainly consists of the nasopharyngeal tumor target volume (GTVnx) and the lymph node tumor target volume (GTVnd). The lymph node tumor target volume (GTVnd) is very similar to the surrounding tissues on plain CT imaging, making it more difficult to delineate. Moreover, its characteristics are very different from those of the nasopharyngeal tumor target volume (GTVnx), but there are few methods to delineate it separately.
[0004] Because delineating the lymph node tumor target volume (GTVnd) of nasopharyngeal carcinoma on plain CT scans is quite challenging, current methods often use magnetic resonance imaging (MRI) for auxiliary delineation. Since MRI and CT are different modalities and the patient's position differs between the two scans, the delineation results from MRI must be registered onto CT before they can be used in subsequent radiotherapy procedures. However, such non-rigid registration often introduces significant errors, greatly compromising the accuracy of target volume delineation.
[0005] Enhanced CT technology refers to CT scanning performed after an intravenous injection of a contrast agent into the patient. The contrast agent circulates through the bloodstream, gradually spreading to various parts of the body, including the lesion area, and presents different images based on different enhancement patterns. Because multiple scans can be performed rapidly on the same device using both plain and enhanced CT techniques while maintaining a fixed patient position, registration between scan results only requires rigid transformation. After contrast agent injection, different tissues, including normal tissue and metastatic lymph nodes of nasopharyngeal carcinoma, will exhibit different enhancement characteristics; therefore, comparing different scan phases is more helpful in identifying the target area of nasopharyngeal carcinoma lymph node tumors.
Summary of the Invention
[0006] The main objective of this invention is to provide a method for automatically delineating the lymph node target region of nasopharyngeal carcinoma based on plain and enhanced CT scans. By comparing the imaging differences of metastatic lymph nodes in nasopharyngeal carcinoma across the plain, enhanced, and delayed CT phases, a deep learning model is designed to achieve automatic delineation of the lymph node tumor target region (GTVnd). In clinical practice, this method eliminates the need for elastic (non-rigid) registration and improves the accuracy of delineating the nasopharyngeal carcinoma lymph node tumor target region (GTVnd), reducing the workload of physicians while improving the consistency of the delineation.
[0007] To achieve the above objectives, the technical solution adopted by the present invention is as follows:
[0008] This invention provides a method for automatically delineating the target region of nasopharyngeal carcinoma lymph node tumors based on plain and enhanced CT techniques, comprising the following steps:
[0009] Step 1: Obtain plain, enhanced, and delayed CT images of the head and neck of nasopharyngeal carcinoma patients and annotate them;
[0010] Step 2: Preprocess the CT images obtained in Step 1;
[0011] Step 3: Divide the preprocessed CT image data into training set and validation set;
[0012] Step 4: Input the preprocessed training set data into the deep learning model for training, and at the same time use the preprocessed validation set data for validation to obtain the trained nasopharyngeal carcinoma lymph node tumor target area automatic delineation model.
[0013] Step 5: Input the preprocessed three-phase CT images into the model trained in Step 4 to obtain the automatic delineation results of the nasopharyngeal carcinoma lymph node tumor target area.
[0014] Furthermore, in step 1, the plain CT image is an image obtained by performing a head and neck CT scan without injecting contrast agent; the enhanced CT image is an image obtained by performing a head and neck CT scan 20 seconds after injecting contrast agent; and the delayed CT image is an image obtained by performing a head and neck CT scan 60 seconds after injecting contrast agent.
[0015] The lymph node tumor target area was manually annotated on all CT images, so that each case had plain, enhanced, and delayed images and annotated lymph node tumor target area outlines.
[0016] The specific steps of step 2 are as follows:
[0017] Step 2-1: Use the enhanced and delayed phase CT images as floating images and the plain scan image as a fixed image. Use the SimpleITK tool in Python to perform rigid registration of the three phase scans for each case.
[0018] Step 2-2: Normalize the three phase images. The normalization steps are as follows: voxels with intensity in the range of [-310, 390] are scaled to [0, 1], and voxel intensities below and above the range are set to 0 and 1 respectively.
[0019] Step 2-3: Calculate the minimum bounding box for each case, so that the voxel intensity outside the bounding box is 0 for all three phases of the same case; perform the same cropping operation on the three phases of scans and annotations according to the bounding box.
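The intensity normalization of step 2-2 can be sketched as follows. This is a minimal illustration in plain Python on a flat list of voxel values, not the patent's implementation; the function name `normalize_window` is hypothetical, and a real pipeline would apply the same mapping to each registered 3D volume.

```python
def normalize_window(voxels, low=-310.0, high=390.0):
    """Clip intensities to [low, high] and scale them linearly to [0, 1].

    Values below `low` map to 0 and values above `high` map to 1,
    as described in step 2-2.
    """
    span = high - low
    normalized = []
    for value in voxels:
        clipped = min(max(value, low), high)
        normalized.append((clipped - low) / span)
    return normalized


# Example: air, lower bound, mid-window, upper bound, dense bone/metal.
example = normalize_window([-1000, -310, 40, 390, 2000])
```

Because every voxel below the window becomes exactly 0, air outside the patient ends up as zero background, which is what the bounding box of step 2-3 relies on.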
[0020] Furthermore, the specific steps of step 4 are as follows:
[0021] Step 4-1: Select the spatiotemporal phase fusion network STPFNet as the deep learning model. Input the three phase images of the preprocessed training set cases, and select batchsize=N, that is, randomly select N cases in the training set for each training iteration. Randomly crop an image block of a fixed size from each selected case and input it into the deep learning model for training. Compare the forward inference results of the deep learning model with the labels, and calculate the loss function. The loss function is Dice Loss+BCE Loss, the optimizer is SGD, and a learning rate is set for training.
[0022] Step 4-2: During training, a validation is performed after every n iterations. The three phase images of the preprocessed validation set cases are simultaneously divided into several image blocks with 50% overlap using the sliding window method and input into the current model for automatic delineation. After the model automatically delineates each image block, the image blocks belonging to the same case are merged back to their original positions. The average of the prediction results of the two image blocks is taken as the final delineation result for the overlapping part, and the Dice coefficient is used as the evaluation index. The average result of the automatic delineation of the validation set cases is calculated. If the result on the validation set is better than the current best result, the current model parameters are saved; otherwise, they are not saved, and the current best-performing model parameters are retained.
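The combined Dice Loss+BCE Loss named in step 4-1 can be illustrated with a small plain-Python sketch on flattened voxel lists. The equal weighting of the two terms, the smoothing constants, and the function names are assumptions made for illustration; the patent only names the two loss terms.

```python
import math


def dice_loss(probs, labels, eps=1e-6):
    """Soft Dice loss: 1 - 2*|P.G| / (|P| + |G|), smoothed by eps (assumed)."""
    intersection = sum(p * g for p, g in zip(probs, labels))
    denominator = sum(probs) + sum(labels)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def bce_loss(probs, labels, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clamped to avoid log(0)."""
    total = 0.0
    for p, g in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(g * math.log(p) + (1 - g) * math.log(1.0 - p))
    return total / len(probs)


def dice_bce_loss(probs, labels):
    """Sum of the two terms with equal weights (an assumption)."""
    return dice_loss(probs, labels) + bce_loss(probs, labels)
```

A prediction that agrees with the label on every voxel drives both terms toward zero, while a confident wrong prediction is penalized by both.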
[0023] Furthermore, the spatiotemporal phase fusion network is U-shaped and consists of 5 encoder modules, 4 decoder modules, and 1 delineation prediction head. The encoder module is composed of a phase fusion module and a phase-by-phase downsampling module, which downsamples the input features step by step to expand the model's receptive field, extracts the semantic features of the lymph node target area, and inputs the features at each scale to the decoder modules at each level through skip connections. The decoder module is composed of a phase-by-phase upsampling module, a phase-by-phase splicing module, and a phase fusion module, which fuses the semantic and detailed features of the lymph node target area, performs phase-by-phase upsampling, and restores the features to the input size. Finally, the delineation prediction head provides the model prediction result, i.e., the automatically delineated result.
[0024] Furthermore, the phase fusion module includes a single-phase feature processing branch and a multi-phase feature fusion branch. The single-phase feature processing branch uses 3D convolution and cross-phase spatial attention modules to extract spatial information of target area features within a single phase. The multi-phase feature fusion branch uses 3D convolution and spatial attention modules to fuse temporal dimension information of features between different phases.
[0025] The input feature map of the phase fusion module is sent to the multi-phase feature fusion branch for time-dimensional feature fusion. At the same time, the input feature map is split into three single-phase features along the channel dimension, which are then input into the three single-phase feature processing branches to extract target area spatial information. Finally, the results of each branch are concatenated along the channel dimension and passed through BN and ReLU activation layers as the module output.
[0026] Furthermore, both the phase-by-phase upsampling module and the phase-by-phase downsampling module include a single-phase feature processing branch and a multi-phase feature fusion branch. The module input feature map is split into three single-phase features, which are then input into the single-phase feature processing branch to perform independent upsampling / downsampling of the features from each single phase. At the same time, the module input feature map is input into the multi-phase feature fusion branch for fusion upsampling / downsampling. After completing the feature upsampling / downsampling, the two results are concatenated along the channel dimension as the module output.
[0027]-[0028] Furthermore, the phase-by-phase stitching module has two inputs: the upsampled feature $F_{up}$ and the skip-connection feature $F_{skip}$. Each input feature map is split into single-phase features and a multi-phase fusion feature; features belonging to the same phase are then concatenated to form new single-phase features and a new multi-phase fusion feature; finally, the single-phase features and the multi-phase fusion feature are concatenated along the channel dimension as the module output.
[0029] Furthermore, the delineation prediction head consists of 3D convolutional layers and softmax layers. The output channels of the last decoder module are adjusted to 2, with the two channels representing the probability that the voxel is a nasopharyngeal carcinoma lymph node tumor target area or normal tissue.
[0030] The specific steps of step 5 are as follows:
[0031] Step 5-1: Load the model parameters from step 4, and divide the preprocessed three-phase images of the case to be processed into several image blocks with 50% overlap using the sliding window method. Input the blocks into the model for automatic delineation. After the model automatically delineates each image block, the image blocks belonging to the same case are merged back to their original positions. In overlapping regions, the average of the prediction results of the overlapping image blocks is taken as the final delineation result.
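The sliding-window inference with 50% overlap can be sketched along a single axis as follows. This is a simplified 1D illustration under stated assumptions: the helper names are hypothetical, `predict` stands in for the trained model, and the handling of the last window (shifting it back so the tail is covered) is one possible choice that the patent does not specify. A 3D version would apply the same logic per axis.

```python
def window_starts(length, size):
    """Start indices of windows of `size` with 50% overlap covering [0, length)."""
    if length <= size:
        return [0]
    step = max(size // 2, 1)
    starts = list(range(0, length - size + 1, step))
    if starts[-1] != length - size:
        starts.append(length - size)  # extra window so the tail is covered
    return starts


def merge_patches(length, size, predict):
    """Run `predict(start, end)` on each window and average overlapping outputs."""
    sums = [0.0] * length
    counts = [0] * length
    for start in window_starts(length, size):
        end = min(start + size, length)
        patch_prediction = predict(start, end)
        for offset, value in enumerate(patch_prediction):
            sums[start + offset] += value
            counts[start + offset] += 1
    return [total / count for total, count in zip(sums, counts)]
```

Positions covered by two windows receive the mean of both predictions, which matches the averaging rule described for overlapping image blocks.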
[0032] The beneficial effects of this invention are mainly reflected in:
[0033] This invention uses three phases of CT scans—plain, enhanced, and delayed—simultaneously as input, providing information to the deep learning model in both temporal and spatial dimensions. The spatiotemporal phase fusion network, while progressively fusing features from multiple phases, provides independent processing channels for each phase feature, preventing important features from being buried in the early stages of the model and providing more effective features for the delineation task. Compared to other methods using magnetic resonance imaging (MRI), this invention avoids the errors introduced by elastic registration, improving the model's accuracy in delineating the nasopharyngeal carcinoma lymph node tumor target region (GTVnd).
[0034] This invention utilizes target region features from plain, enhanced, and delayed CT scans, combined with a deep learning model, to automatically delineate the nasopharyngeal carcinoma lymph node tumor target region (GTVnd). This greatly reduces the burden of manual delineation for doctors, while improving the consistency of target region delineation and facilitating the subsequent radiotherapy process.
Attached Figure Description
[0035] The specific embodiments of the present invention will be further described in detail below with reference to the accompanying drawings.
[0036] Figure 1 is an overall flowchart of the present invention;
[0037] Figure 2 is a diagram of the spatiotemporal phase fusion network model architecture proposed in this invention;
[0038] Figure 3 is a structural diagram of the phase fusion module proposed in this invention;
[0039] Figure 4 is a structural diagram of the phase-by-phase upsampling / downsampling module proposed in this invention;
[0040] Figure 5 is a structural diagram of the phase-by-phase stitching module proposed in this invention.
Detailed Implementation
[0041] The present invention will be further described below with reference to specific embodiments, but the scope of protection of the present invention is not limited thereto.
[0042] As shown in Figure 1, the method for automatically delineating the target area of nasopharyngeal carcinoma lymph node tumors based on plain and enhanced CT technology provided by the present invention specifically includes the following steps:
[0043] Step 1: Acquire plain, enhanced, and delayed CT images of the head and neck of nasopharyngeal carcinoma patients, and have experts mark the contours of the lymph node tumor target area of nasopharyngeal carcinoma.
[0044] Step 2: Preprocess the three-phase CT images obtained in Step 1;
[0045] Step 3: Divide the preprocessed case data. In this embodiment, the data is divided into a training set, a validation set, and a test set. The test set is used in this embodiment to test the model's delineation accuracy.
[0046] Step 4: Input the preprocessed training set case data into the deep learning model for training, and at the same time use the preprocessed validation set case data for validation to obtain the trained nasopharyngeal carcinoma lymph node tumor target area automatic delineation model.
[0047] Step 5: After preprocessing, the three-phase CT images to be processed are input into the model trained in Step 4 to obtain the automatic delineation results of the nasopharyngeal carcinoma lymph node tumor target area. In this embodiment, this step uses a test set of cases to test and evaluate the delineation accuracy. Specifically, the preprocessed test set of cases are input into the trained model for prediction, and the results are compared with those of manual delineation by experts to calculate the delineation accuracy.
[0048] In one specific embodiment of the present invention, step 1 is used for the dataset collection process of the deep learning model, which is implemented by the following sub-steps:
[0049] Step 1-1: With the patient lying flat, use a thermoplastic mask to fix their upper body to the treatment baseboard;
[0050] Step 1-2: Perform a head and neck CT scan without injecting contrast agent to obtain the plain CT image;
[0051] Step 1-3: 20 seconds after injecting the contrast agent into the patient, perform a head and neck CT scan to obtain the enhanced CT image;
[0052] Step 1-4: 60 seconds after injecting the contrast agent into the patient, perform a head and neck CT scan to obtain the delayed-phase CT image;
[0053] Step 1-5: Doctors manually annotate the nasopharyngeal carcinoma lymph node tumor target areas of all collected cases. Each case then has three phase images (plain, enhanced, and delayed) together with the expert annotations as labels.
[0054] In one specific embodiment of the present invention, the purpose of step 2 is to preprocess the acquired three-phase CT scans and expert annotations, and the specific steps are as follows:
[0055] Step 2-1: Use the enhancement phase and delayed phase images as floating images and the plain scan image as a fixed image. Use the SimpleITK tool in Python to perform rigid registration of the three phase scans for each case.
[0056]-[0058] Step 2-2: Normalize the three phase images. Voxel intensities in the range [-310, 390] are linearly scaled to [0, 1]; voxel intensities below the range are set to 0 and those above the range are set to 1;
[0059] Step 2-3: Calculate the minimum bounding box for each case, ensuring that the voxel intensity outside the bounding box is 0 for all three phases of the same case's images. Perform the same cropping operation on the three phases of scans and expert annotations according to the bounding box.
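The bounding-box cropping of step 2-3 can be sketched as below. This is a plain-Python illustration on nested lists indexed as `volume[z][y][x]`; the function names and the end-exclusive box convention are assumptions, and a practical implementation would more likely use an array library.

```python
def bounding_box(volumes):
    """Smallest (z, y, x) box containing every nonzero voxel of all volumes.

    Returns ((z0, z1), (y0, y1), (x0, x1)) with end-exclusive upper bounds,
    so every voxel outside the box is 0 in each of the given volumes.
    """
    lower = [None, None, None]
    upper = [None, None, None]
    for volume in volumes:
        for z, plane in enumerate(volume):
            for y, row in enumerate(plane):
                for x, value in enumerate(row):
                    if value == 0:
                        continue
                    for axis, index in enumerate((z, y, x)):
                        if lower[axis] is None or index < lower[axis]:
                            lower[axis] = index
                        if upper[axis] is None or index + 1 > upper[axis]:
                            upper[axis] = index + 1
    if lower[0] is None:
        raise ValueError("all volumes are empty")
    return tuple(zip(lower, upper))


def crop(volume, box):
    """Apply the same box to a scan or to its annotation mask."""
    (z0, z1), (y0, y1), (x0, x1) = box
    return [[row[x0:x1] for row in plane[y0:y1]] for plane in volume[z0:z1]]
```

Computing one box from all three phases and reusing it for the annotation keeps the scans and labels voxel-aligned after cropping.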
[0060] In a specific embodiment of the present invention, the purpose of step 3 is to divide the dataset. Specifically, the cases are randomly divided into a non-overlapping training set, validation set, and test set according to a ratio of 7:1:2. Thus, 200 cases are divided into a training set of 140 cases, a validation set of 20 cases, and a test set of 40 cases.
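The 7:1:2 random split can be sketched as follows. The function name, the fixed seed, and the use of integer case identifiers are illustrative assumptions; the patent only specifies a random, non-overlapping split in that ratio.

```python
import random


def split_cases(case_ids, ratios=(7, 1, 2), seed=0):
    """Shuffle case IDs and split them into train / validation / test subsets."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility (assumed)
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


train_ids, val_ids, test_ids = split_cases(range(200))
```

Splitting at the case level, rather than at the patch level, keeps all three phases and all patches of one patient in the same subset.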
[0061] In one specific embodiment of the present invention, step 4 is the training process of an automatic delineation model for the nasopharyngeal carcinoma lymph node target area, and its specific steps are as follows:
[0062] Step 4-1: Input the three phases of images from the preprocessed training set cases, selecting batchsize=4, meaning that 4 cases are randomly selected from the training set in each training iteration. Due to limited GPU memory, randomly segment 32*192*192 image patches from the selected cases and input them into the deep learning model for training. Compare the difference between the model's forward inference results and the labels, and calculate the loss function. The loss function used is Dice Loss+BCE Loss, the optimizer is SGD, and the learning rate is set to 1e-2 for training.
[0063] Step 4-2: During training, validation is performed every 180 iterations. The preprocessed validation set images from all three phases are simultaneously divided into several image patches with 50% overlap using a sliding window method. These patches are then input into the model for automatic delineation. After automatic delineation of each patch, patches belonging to the same case are merged back into their original positions. The average prediction result of the two overlapping patches is taken as the final delineation result, and the Dice coefficient is used as the evaluation metric. The average result of the automatic delineation on the validation set is calculated. If the result on the validation set is better than the current best validation result, the current model parameters are saved; otherwise, they are not saved, and the best-performing model parameters are retained.
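The periodic validation and best-model retention of step 4-2 amount to the following control flow. This is a schematic sketch: `train_step` and `validate` are hypothetical callables standing in for one optimizer iteration and for the sliding-window Dice evaluation on the validation set.

```python
def train_with_validation(num_iters, val_every, train_step, validate):
    """Validate every `val_every` iterations and keep the best-scoring state."""
    best_dice = float("-inf")
    best_state = None
    for iteration in range(1, num_iters + 1):
        state = train_step(iteration)  # returns the current model parameters
        if iteration % val_every == 0:
            mean_dice = validate(state)  # mean Dice over validation cases
            if mean_dice > best_dice:
                # Only a strictly better result replaces the saved parameters.
                best_dice, best_state = mean_dice, state
    return best_dice, best_state
```

With `val_every=180` as in this embodiment, a later checkpoint that scores worse than an earlier one is simply discarded.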
[0064] As a preferred embodiment of the present invention, the deep learning model in step 4-1 adopts a Spatial-Temporal Phase Fusion Network (STPFNet). As shown in Figure 2, its overall structure is U-shaped, consisting of 5 encoder modules, 4 decoder modules, and 1 delineation prediction head. The encoder module comprises a phase fusion module and a phase-wise downsampling module. It downsamples the input features step-by-step to expand the model's receptive field, extracts the semantic features of the lymph node target area, and inputs the output features at each scale to the corresponding decoder module via skip connections. The decoder module consists of a phase-wise upsampling module, a phase-wise stitching module, and a phase fusion module. It fuses the semantic and detailed features of the lymph node target area, performs phase-wise upsampling, and restores the features to the input size. Finally, the delineation prediction head provides the model's prediction result, i.e., the automatically delineated result.
[0065] Specifically, the five encoder modules in this embodiment are configured as follows: In encoder modules 2 to 5, the module input is fed into two consecutive phase fusion modules after passing through one phase-by-phase downsampling module. Notably, encoder module 1 consists of only three phase fusion modules. The input and output feature dimensions of each encoder module are shown in the table below:
[0066]
| Name | Input dimensions | Output dimensions |
| --- | --- | --- |
| Encoder Module 1 | 3*32*192*192 | 44*32*192*192 |
| Encoder Module 2 | 44*32*192*192 | 88*16*96*96 |
| Encoder Module 3 | 88*16*96*96 | 176*8*48*48 |
| Encoder Module 4 | 176*8*48*48 | 352*4*24*24 |
| Encoder Module 5 | 352*4*24*24 | 704*2*12*12 |
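The encoder dimensions listed above follow a regular pattern: module 1 keeps the spatial size and produces 44 channels, and each later module doubles the channels and halves every spatial axis. The short sketch below reproduces the listed shapes; the function name and its default arguments are illustrative, taken from the numbers of this embodiment.

```python
def encoder_shapes(in_shape=(3, 32, 192, 192), base_channels=44, num_modules=5):
    """Return (input_shape, output_shape) per encoder module as (C, D, H, W)."""
    _, depth, height, width = in_shape
    # Module 1 contains only phase fusion modules, so there is no downsampling.
    out = (base_channels, depth, height, width)
    shapes = [(in_shape, out)]
    for _ in range(num_modules - 1):
        prev = out
        # Modules 2..5: phase-by-phase downsampling, then phase fusion.
        out = (prev[0] * 2, prev[1] // 2, prev[2] // 2, prev[3] // 2)
        shapes.append((prev, out))
    return shapes


for index, (shape_in, shape_out) in enumerate(encoder_shapes(), start=1):
    print(f"Encoder Module {index}: {shape_in} -> {shape_out}")
```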
[0067] Specifically, the decoder module in this embodiment consists of a phase-by-phase upsampling module, a phase-by-phase stitching module, and phase fusion modules. Unlike the encoder module, the decoder module has two inputs: one input receives features from the previous stage module, and the other input receives features from the skip connection. In decoder modules 1 to 4, the feature input from the previous stage module is first upsampled, then the upsampled features and the features from the skip connection are input to the phase-by-phase stitching module, and finally the output of the phase-by-phase stitching module is input to two consecutive phase fusion modules. The input and output dimensions of each decoder module are shown in the table below:
[0068]
| Name | Previous-stage input dimensions | Skip-connection input dimensions | Output dimensions |
| --- | --- | --- | --- |
| Decoder Module 1 | 704*2*12*12 | 352*4*24*24 | 352*4*24*24 |
| Decoder Module 2 | 352*4*24*24 | 176*8*48*48 | 176*8*48*48 |
| Decoder Module 3 | 176*8*48*48 | 88*16*96*96 | 88*16*96*96 |
| Decoder Module 4 | 88*16*96*96 | 44*32*192*192 | 44*32*192*192 |
[0069] Specifically, the delineation prediction head in this embodiment consists of a 1*1*1 3D convolution and a softmax layer. After the output of the decoder module 4 is passed through the 1*1*1 3D convolution, the number of output channels is adjusted to 2, and the two channels represent the probability that the voxel is a nasopharyngeal carcinoma lymph node tumor target area or normal tissue.
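The final softmax over the two output channels can be illustrated per voxel as follows. This sketch works on pairs of logits in plain Python; which of the two channels represents the target area is an assumption here (channel 0), as are the function names.

```python
import math


def softmax2(logit_a, logit_b):
    """Numerically stable softmax over two channel logits of one voxel."""
    shift = max(logit_a, logit_b)
    exp_a = math.exp(logit_a - shift)
    exp_b = math.exp(logit_b - shift)
    total = exp_a + exp_b
    return exp_a / total, exp_b / total


def delineate(logit_pairs, target_channel=0):
    """Label a voxel 1 when the target channel has the higher probability."""
    mask = []
    for pair in logit_pairs:
        probs = softmax2(*pair)
        is_target = probs[target_channel] > probs[1 - target_channel]
        mask.append(1 if is_target else 0)
    return mask
```

Because the two probabilities sum to 1, picking the larger channel is equivalent to thresholding the target probability at 0.5.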
[0070] Specifically, the phase fusion module in this embodiment is shown in Figure 3. It consists of three single-phase feature processing branches and one multi-phase feature fusion branch. The single-phase feature processing branches extract the spatial information of the target area from the single-phase features, while the multi-phase feature fusion branch mainly focuses on the temporal-dimension information between features of different phases. The n-channel input feature $F_{in}$ is partially split along the channel dimension into the single-phase features $F_{1}$, $F_{1C}$ and $F_{1D}$, which are input to the three single-phase feature processing branches respectively, while the input feature $F_{in}$ itself is input to the multi-phase feature fusion branch. In each single-phase feature processing branch, the single-phase feature undergoes a 3D convolution with kernel = [3,3,3], stride = [1,1,1], and padding = [1,1,1]; the feature map size and number of channels remain unchanged. An attention map is then obtained through a cross-phase spatial attention module, and the Hadamard product of the attention map and the convolved feature is the output of the single-phase feature branch. In the multi-phase feature fusion branch, the input feature $F_{in}$ passes through a 3*3*3 3D convolution that changes the number of channels while the feature map size remains unchanged, and then through a spatial attention module to obtain an attention map; the Hadamard product of the attention map and the convolved feature map is the output of the multi-phase feature fusion branch. Finally, the outputs of the three single-phase feature processing branches and the multi-phase feature fusion branch are concatenated along the channel dimension to obtain an n-channel result as the module output. In particular, in the first phase fusion module of the spatiotemporal phase fusion network, $F_{in}$ has only three channels: the plain phase, the enhanced phase, and the delayed phase. In this case, $F_{1}$, $F_{1C}$ and $F_{1D}$ each take the one corresponding channel, and after the 3D convolution in each branch, $F_{1}$, $F_{1C}$, $F_{1D}$ and $F_{in}$ are converted to 4, 4, 4 and 32 channels respectively.
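The channel bookkeeping of the first phase fusion module (4 + 4 + 4 + 32 = 44 output channels, matching the output of encoder module 1) can be written as a small helper. The 1:1:1:8 ratio is inferred only from those first-module numbers; whether deeper modules use the same ratio is an assumption, since the general channel fractions are not legible in the text, and the function name is hypothetical.

```python
def fusion_channels(n_out, ratio=(1, 1, 1, 8)):
    """Split n_out output channels across 3 single-phase branches and 1 fused branch.

    The default ratio is an assumption derived from the first module (4, 4, 4, 32).
    """
    total = sum(ratio)
    if n_out % total != 0:
        raise ValueError(f"{n_out} channels cannot be split in ratio {ratio}")
    unit = n_out // total
    return tuple(part * unit for part in ratio)


# First phase fusion module of this embodiment: 44 output channels in total.
single_a, single_b, single_c, fused = fusion_channels(44)
```

Under this assumption each phase keeps a small dedicated channel group at every scale, while most of the capacity goes to the fused branch.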
[0071] The cross-phase spatial attention module used in the phase fusion module is computed as follows:
[0072] Let the features after 3D convolution in the three single-phase feature processing branches be $F_{\alpha}$, $F_{\beta}$ and $F_{\gamma}$. Each is passed through an average pooling layer and a max pooling layer with kernel = [3,3,3], stride = [1,1,1], and padding = [1,1,1], and the results are concatenated along the channel dimension to obtain the reference feature $F_{ref}$, as shown in the following formula:
[0073] $F_{ref} = [\mathrm{AvgPool}(F_{\alpha}), \mathrm{AvgPool}(F_{\beta}), \mathrm{AvgPool}(F_{\gamma}), \mathrm{MaxPool}(F_{\alpha}), \mathrm{MaxPool}(F_{\beta}), \mathrm{MaxPool}(F_{\gamma})]$
Then $F_{ref}$ is passed through a 3D convolution with kernel=[3,3,3], stride=[1,1,1], and padding=[1,1,1] that adjusts the number of channels to 1, followed by a sigmoid layer, to obtain the attention feature map. Although the cross-phase attention module has the same structure in the three single-phase feature processing branches, each branch calculates its own attention feature map.
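The data flow of the cross-phase attention can be illustrated with a deliberately reduced 1D toy: each single-phase feature is one channel along one spatial axis, pooling and convolution use kernel 3, stride 1, padding 1, and the convolution weights are passed in by hand. All names are hypothetical, and the padding conventions (zeros counted in the average, padding ignored by the max) are assumptions borrowed from common deep learning defaults rather than stated in the patent.

```python
import math


def avg_pool(seq):
    """AvgPool, kernel 3, stride 1, zero padding 1 (padding counted in the mean)."""
    padded = [0.0] + list(seq) + [0.0]
    return [sum(padded[i:i + 3]) / 3.0 for i in range(len(seq))]


def max_pool(seq):
    """MaxPool, kernel 3, stride 1, padding 1 (padding never wins the max)."""
    padded = [float("-inf")] + list(seq) + [float("-inf")]
    return [max(padded[i:i + 3]) for i in range(len(seq))]


def cross_phase_attention(f_alpha, f_beta, f_gamma, weights, bias=0.0):
    """Return one attention value in (0, 1) per position.

    weights: six length-3 kernels, one per channel of F_ref, in the order
    AvgPool(alpha, beta, gamma) then MaxPool(alpha, beta, gamma).
    """
    phases = (f_alpha, f_beta, f_gamma)
    f_ref = [avg_pool(p) for p in phases] + [max_pool(p) for p in phases]
    attention = []
    for i in range(len(f_alpha)):
        acc = bias
        for channel, kernel in zip(f_ref, weights):
            padded = [0.0] + channel + [0.0]
            acc += sum(k * v for k, v in zip(kernel, padded[i:i + 3]))
        attention.append(1.0 / (1.0 + math.exp(-acc)))  # sigmoid
    return attention


def apply_attention(feature, attention):
    """Hadamard (element-wise) product of a feature with its attention map."""
    return [f * a for f, a in zip(feature, attention)]
```

Each single-phase branch would hold its own `weights`, so the three branches see the same reference feature but produce different attention maps.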
[0074] Specifically, the phase-by-phase upsampling / downsampling module in this embodiment is shown in Figure 4. It consists of three single-phase feature processing branches and one multi-phase feature fusion branch. The n-channel input feature $F_{in}$ is partially split along the channel dimension into the single-phase features $F_{1}$, $F_{1C}$ and $F_{1D}$, which are input to the three single-phase feature processing branches respectively, while the input feature $F_{in}$ itself is input to the multi-phase feature fusion branch; each branch performs upsampling / downsampling separately. In the phase-by-phase downsampling module, the input of each branch is downsampled by a 3DConv+BN+ReLU module with kernel=[3,3,3], stride=[2,2,2], and padding=[1,1,1], which also transforms the number of channels, to obtain the branch outputs $F_{out\_1}$, $F_{out\_1C}$, $F_{out\_1D}$ and $F_{out\_fused}$. In the phase-by-phase upsampling module, the input of each branch is upsampled by Transposed3DConv+BN+ReLU with kernel=[2,2,2], stride=[2,2,2], and padding=[0,0,0], which likewise transforms the number of channels, to obtain the branch outputs $F_{out\_1}$, $F_{out\_1C}$, $F_{out\_1D}$ and $F_{out\_fused}$. Finally, the outputs of the branches are concatenated along the channel dimension to obtain the module output $F_{out}$. For an n-channel input $F_{in}$, the output $F_{out}$ of the phase-by-phase downsampling module has 2n channels, and the channel count of the output $F_{out}$ of the phase-by-phase upsampling module is changed correspondingly.
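The kernel, stride, and padding values given for the downsampling and upsampling layers determine the spatial sizes through the standard convolution size formulas (dilation 1, no output padding). The helper names below are illustrative; the arithmetic reproduces the halving and doubling seen in the encoder and decoder dimension tables.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output length of a strided convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def transposed_conv_out(size, kernel=2, stride=2, padding=0):
    """Output length of a transposed convolution along one axis."""
    return (size - 1) * stride - 2 * padding + kernel


# Depth axis of this embodiment: 32 -> 16 -> 8 -> 4 -> 2, then back up to 32.
depths_down = [32]
for _ in range(4):
    depths_down.append(conv_out(depths_down[-1]))
depths_up = [depths_down[-1]]
for _ in range(4):
    depths_up.append(transposed_conv_out(depths_up[-1]))
```

With these settings an even input size is exactly halved on the way down and exactly doubled on the way up, so skip-connection features and upsampled features have matching spatial sizes.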
[0075] Specifically, the phase-by-phase stitching module in this embodiment is shown in Figure 5.
[0076] The n-channel upsampled input feature $F_{up}$ is decomposed along the channel dimension into the single-phase features $F_{up\_1}$, $F_{up\_1C}$, $F_{up\_1D}$ and the multi-phase fusion feature $F_{up\_fused}$. The n-channel input feature $F_{skip}$ from the skip connection is likewise decomposed along the channel dimension into the single-phase features $F_{skip\_1}$, $F_{skip\_1C}$, $F_{skip\_1D}$ and the multi-phase fusion feature $F_{skip\_fused}$. Then $F_{up\_1}$ and $F_{skip\_1}$, $F_{up\_1C}$ and $F_{skip\_1C}$, $F_{up\_1D}$ and $F_{skip\_1D}$, and $F_{up\_fused}$ and $F_{skip\_fused}$ are each concatenated along the channel dimension to form the new single-phase features $F_{out\_1}$, $F_{out\_1C}$, $F_{out\_1D}$ and the new multi-phase fusion feature $F_{fused}$. Finally, these are concatenated to form the module output $F_{out}$.
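The regrouping performed by the stitching module can be shown with channel labels instead of tensors. In this sketch, the channel layout (three single-phase groups followed by the fused group), the group sizes, and the up-before-skip order inside each group are assumptions made for illustration; the function names are hypothetical.

```python
def split_groups(channels, sizes):
    """Split a channel list into consecutive groups of the given sizes."""
    groups, start = [], 0
    for size in sizes:
        groups.append(channels[start:start + size])
        start += size
    if start != len(channels):
        raise ValueError("group sizes do not match the number of channels")
    return groups


def phase_wise_stitch(f_up, f_skip, sizes):
    """Concatenate same-phase groups of F_up and F_skip, then join all groups."""
    stitched = []
    for up_group, skip_group in zip(split_groups(f_up, sizes),
                                    split_groups(f_skip, sizes)):
        stitched.extend(up_group + skip_group)
    return stitched


up = ["u1", "u1C", "u1D", "uF0", "uF1"]
skip = ["s1", "s1C", "s1D", "sF0", "sF1"]
out = phase_wise_stitch(up, skip, sizes=(1, 1, 1, 2))
```

Unlike a plain channel concatenation of the two inputs, this keeps each phase's channels contiguous, so the following module can split them again by phase.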
[0077] In one specific embodiment of the present invention, step 5 is as follows:
[0078] The model parameters from step 4 are loaded, and the preprocessed three-phase images of the test set cases are divided into several image patches with 50% overlap using the sliding window method; each patch is input into the model for automatic delineation. After the model automatically delineates each image patch, image patches belonging to the same case are merged back to their original positions, and in overlapping regions the average of the prediction results of the overlapping image patches is taken as the final delineation result. The Dice coefficient between the automatic delineation result and the label of the test set cases is calculated as the evaluation index.
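The Dice coefficient used as the evaluation index can be computed on flattened binary masks as in the sketch below. The function name and the convention of returning 1.0 when both masks are empty are assumptions; the patent does not specify that edge case.

```python
def dice_coefficient(pred, label):
    """Dice = 2*|P & G| / (|P| + |G|) for binary masks given as 0/1 lists."""
    intersection = sum(1 for p, g in zip(pred, label) if p == 1 and g == 1)
    total = sum(pred) + sum(label)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumed)
    return 2.0 * intersection / total


def mean_dice(cases):
    """Average Dice over (prediction, label) pairs, one pair per case."""
    scores = [dice_coefficient(pred, label) for pred, label in cases]
    return sum(scores) / len(scores)
```

Averaging per-case scores, as in `mean_dice`, gives every test case equal weight regardless of its target volume.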
[0079] The spatiotemporal phase fusion network is designed with single-phase feature processing branches and multi-phase feature fusion branches, focusing on the spatial features of the target area within each single-phase scan and the temporal features between multiple phases, respectively. These two types of branches are integrated into every module of the network, better adapting to and fully utilizing the input from the plain scan, enhancement, and delayed scan phases. Compared to traditional 3D U-Net, which fuses multi-phase features in the first module, the spatiotemporal phase fusion network retains a single-phase feature processing branch at each feature scale. This allows the network to adaptively select the appropriate feature scale for fusing temporal and spatial features, thereby improving the accuracy of delineating the nasopharyngeal carcinoma lymph node tumor target area.
[0080] The system embodiments of the present invention can be applied to any device with data processing capabilities, such as a computer or other similar device. The system embodiments can be implemented in software, hardware, or a combination of both. Taking software implementation as an example, as a logical device, it is formed by the processor of any data processing device loading the corresponding computer program instructions from non-volatile memory into memory for execution.
[0081] Finally, it should be noted that the above examples are merely some specific embodiments of the present invention. Obviously, the present invention is not limited to the above embodiments and many variations are possible. All variations that can be directly derived or conceived by those skilled in the art from the disclosure of the present invention should be considered within the scope of protection of the present invention.
Claims
1. An automatic delineation method for nasopharyngeal carcinoma lymph node tumor target areas based on plain and enhanced CT scans, characterized in that it includes the following steps: Step 1: Obtain plain, enhanced, and delayed CT images of the head and neck of nasopharyngeal carcinoma patients and annotate them; Step 2: Preprocess the CT images obtained in Step 1; Step 3: Divide the preprocessed CT image data into a training set and a validation set; Step 4: Input the preprocessed training set data into the deep learning model for training, and at the same time use the preprocessed validation set data for validation, to obtain the trained model for automatic delineation of the nasopharyngeal carcinoma lymph node tumor target area. The specific steps of step 4 are as follows: Step 4-1: Select the spatiotemporal phase fusion network STPFNet as the deep learning model. Input the three-phase images of the preprocessed training set cases, and set the batch size to N, that is, randomly select N cases from the training set for each training iteration. Randomly crop an image patch of a given size from each selected case and input it into the deep learning model for training. Compare the forward inference results of the deep learning model with the labels and calculate the loss function. The loss function adopts Dice Loss + BCE Loss, the optimizer adopts SGD, and the learning rate is set for training. The spatiotemporal phase fusion network is U-shaped and consists of 5 encoder modules, 4 decoder modules and 1 delineation prediction head. The encoder module consists of a phase fusion module and a phase-by-phase downsampling module. It performs phase-by-phase downsampling on the input features to expand the receptive field of the model, extracts the semantic features of the lymph node target area, and inputs the features at each scale to the decoder modules at each level through skip connections. 
The decoder module consists of a phase-by-phase upsampling module, a phase-by-phase splicing module, and a phase fusion module. It fuses the semantic and detailed features of the lymph node target area, performs phase-by-phase upsampling, and restores the features to the input size. Finally, the delineation prediction head provides the model prediction result, which is the result of automatic delineation. Both the phase-by-phase upsampling module and the phase-by-phase downsampling module include a single-phase feature processing branch and a multi-phase feature fusion branch. The module input feature map is split into three single-phase features, which are then fed into the single-phase feature processing branch to perform independent upsampling / downsampling of the features from each single phase. At the same time, the module input feature map is fed into the multi-phase feature fusion branch for fused upsampling / downsampling. After the feature upsampling / downsampling is completed, the two results are concatenated along the channel dimension as the module output. Step 4-2: During training, validation is performed after every n iterations. The three-phase images of the preprocessed validation set cases are simultaneously divided into several image patches with 50% overlap using the sliding window method and input into the current model for automatic delineation. After the model automatically delineates each image patch, the image patches belonging to the same case are merged back to their original positions. Where two image patches overlap, the average of their prediction results is taken as the final delineation result, and the Dice coefficient is used as the evaluation index. The average result of the automatic delineation of the validation set cases is calculated. 
If the result on the validation set is better than the current best result, the current model parameters are saved; otherwise, they are not saved, and the current best-performing model parameters are retained. Step 5: Input the preprocessed three-phase CT images into the model trained in Step 4 to obtain the automatic delineation results of the nasopharyngeal carcinoma lymph node tumor target area.
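The loss named in Step 4-1, Dice Loss + BCE Loss, can be illustrated with a short NumPy sketch. The claim does not give the exact formulation, so the soft Dice form, the equal weighting of the two terms, the smoothing constant `eps`, and the function name `dice_bce_loss` are assumptions made for illustration only.

```python
import numpy as np

def dice_bce_loss(prob, target, eps=1e-6):
    """Soft Dice loss plus binary cross-entropy (equal weighting is assumed).

    prob:   predicted foreground probabilities in [0, 1], any shape
    target: binary ground-truth mask of the same shape
    """
    prob = np.clip(prob.astype(np.float64), eps, 1.0 - eps)  # avoid log(0)
    target = target.astype(np.float64)

    # Soft Dice loss: 1 - 2|P.T| / (|P| + |T|)
    intersection = (prob * target).sum()
    dice_loss = 1.0 - (2.0 * intersection + eps) / (prob.sum() + target.sum() + eps)

    # Binary cross-entropy averaged over all voxels
    bce_loss = -np.mean(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob))

    return dice_loss + bce_loss
```

A prediction that matches the label gives a loss close to 0, while a fully inverted prediction is penalized by both terms.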
2. The method for automatic delineation of nasopharyngeal carcinoma lymph node tumor target areas based on plain and enhanced CT scans according to claim 1, characterized in that, In step 1, the plain CT image is the image obtained by performing a head and neck CT scan without injecting contrast agent; the enhanced CT image is the image obtained by performing a head and neck CT scan 20 seconds after injecting contrast agent; and the delayed CT image is the image obtained by performing a head and neck CT scan 60 seconds after injecting contrast agent. The lymph node tumor target area is manually annotated on all CT images, so that each case has plain, enhanced, and delayed images together with the annotated lymph node tumor target area contours.
3. The method for automatic delineation of nasopharyngeal carcinoma lymph node tumor target areas based on plain and enhanced CT scans according to claim 1, characterized in that, The specific steps of step 2 are as follows: Step 2-1: Use the enhanced and delayed phase CT images as floating images and the plain scan image as the fixed image. Use the SimpleITK tool in Python to perform rigid registration of the three phase scans for each case. Step 2-2: Normalize the three phase images. The normalization is as follows: voxel intensities in the range [-310, 390] are linearly scaled to [0, 1], and voxel intensities below and above the range are set to 0 and 1 respectively. Step 2-3: Calculate the minimum bounding box for each case, such that the voxel intensity outside the bounding box is 0 for all three phases of the same case; perform the same cropping operation on the three phase scans and the annotation according to the bounding box.
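Steps 2-2 and 2-3 can be sketched with NumPy as below. This is an illustrative reading of the claim, not the applicant's code: the function names are hypothetical, the three registered phases are assumed to be stacked as a `(3, D, H, W)` array, and the case is assumed to contain at least one nonzero voxel after normalization. Rigid registration (Step 2-1) is not shown.

```python
import numpy as np

def normalize_intensity(volume, lower=-310.0, upper=390.0):
    """Scale intensities in [lower, upper] to [0, 1]; clamp values outside."""
    clipped = np.clip(volume.astype(np.float32), lower, upper)
    return (clipped - lower) / (upper - lower)

def crop_to_bounding_box(phases, label):
    """Crop all phases and the annotation to one shared minimum bounding box.

    phases: (3, D, H, W) normalized plain, enhanced, and delayed volumes
    label:  (D, H, W) annotation mask
    The box is the smallest one outside which every phase is 0.
    """
    nonzero = (phases > 0).any(axis=0)        # voxel is nonzero in any phase
    coords = np.argwhere(nonzero)
    start = coords.min(axis=0)
    stop = coords.max(axis=0) + 1             # exclusive upper bound
    region = tuple(slice(a, b) for a, b in zip(start, stop))
    return phases[(slice(None),) + region], label[region]
```

Using one box for all three phases and the annotation keeps them voxel-aligned after cropping.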
4. The method for automatic delineation of nasopharyngeal carcinoma lymph node tumor target areas based on plain and enhanced CT scans according to claim 1, characterized in that, The phase fusion module includes a single-phase feature processing branch and a multi-phase feature fusion branch. The single-phase feature processing branch uses 3D convolution and cross-phase spatial attention modules to extract the spatial information of the target area features within a single phase. The multi-phase feature fusion branch uses 3D convolution and spatial attention modules to fuse the temporal dimension information of features between different phases. The input feature map of the phase fusion module is sent to the multi-phase feature fusion branch for time-dimensional feature fusion. At the same time, the input feature map is split into three single-phase features along the channel dimension, which are then input into the three single-phase feature processing branches to extract target area spatial information. Finally, the results of each branch are concatenated along the channel dimension and passed through BN and ReLU activation layers as the module output.
5. The method for automatic delineation of nasopharyngeal carcinoma lymph node tumor target areas based on plain and enhanced CT scans according to claim 1, characterized in that, The phase-by-phase splicing module has two inputs: one comes from the upsampled features, and the other comes from the skip-connection features. Each of the two input feature maps is split into single-phase features and multi-phase fusion features. Subsequently, features from the same phase source are spliced together as new single-phase features and multi-phase fusion features; the single-phase features and multi-phase fusion features are then spliced together along the channel dimension as the module output.
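One possible reading of the phase-by-phase splicing operation is sketched below with NumPy. The channel layout is an assumption: each input is taken to hold four equal channel groups in the order phase 1, phase 2, phase 3, multi-phase fusion. The claim does not state the group sizes or ordering, and the function name is hypothetical.

```python
import numpy as np

def phase_wise_splice(upsampled, skip, num_groups=4):
    """Concatenate decoder and skip features group by group.

    upsampled, skip: (C, D, H, W) feature maps whose channels are assumed to be
    laid out as [phase 1 | phase 2 | phase 3 | multi-phase fusion], with C
    divisible by num_groups.
    """
    up_groups = np.split(upsampled, num_groups, axis=0)
    skip_groups = np.split(skip, num_groups, axis=0)
    # Pair features that originate from the same phase (or fusion) source.
    merged = [np.concatenate([u, s], axis=0)
              for u, s in zip(up_groups, skip_groups)]
    # Stack the per-source results back along the channel dimension.
    return np.concatenate(merged, axis=0)
```

Compared with a plain channel concatenation of the two inputs, this keeps the channels of each phase adjacent, so a later split along the channel dimension still yields per-phase features.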
6. The method for automatic delineation of nasopharyngeal carcinoma lymph node tumor target areas based on plain and enhanced CT scans according to claim 1, characterized in that, The delineation prediction head consists of 3D convolutional layers and softmax layers. The output channels of the last decoder module are adjusted to 2, and the two channels represent the probability that the voxel is the target area of nasopharyngeal carcinoma lymph node tumor or normal tissue.
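The final softmax step of the prediction head can be illustrated as follows. Only the channel-wise softmax and the conversion to a mask are shown; the preceding 3D convolution is omitted. Which of the two channels represents the tumor target area is not stated in the claim, so `target_channel=1` is an assumption, as are the function names.

```python
import numpy as np

def softmax_over_channels(logits):
    """Numerically stable softmax along the channel axis of (2, D, H, W)."""
    shifted = logits - logits.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

def delineation_mask(logits, target_channel=1):
    """Binary mask of voxels whose most probable class is the target area."""
    probs = softmax_over_channels(logits)
    return (probs.argmax(axis=0) == target_channel).astype(np.uint8)
```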
7. The method for automatic delineation of nasopharyngeal carcinoma lymph node tumor target areas based on plain and enhanced CT scans according to claim 1, characterized in that, The specific steps of step 5 are as follows: Step 5-1: Load the model parameters from step 4, and use the sliding window method to divide the preprocessed three-phase images of the cases to be processed into several image blocks with 50% overlap. Input the blocks into the model for automatic delineation. After the model automatically delineates each image block, the image blocks belonging to the same case are merged back to their original positions. For the overlapping part, the average of the prediction results of the two image blocks is taken as the final delineation result.