A medical image segmentation method for multi-modal brain tumor data

By constructing a 3D network segmentation model with multi-axis convolution and cross-dimensional semantic alignment, the problems of insufficient integration of multimodal MRI data and semantic consistency were solved, achieving accurate identification of tumor boundaries and improving the model's generalization performance, while reducing reliance on human experience.

CN122244455A · Pending · Publication Date: 2026-06-19 · WUHAN TEXTILE UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: WUHAN TEXTILE UNIV
Filing Date: 2026-05-22
Publication Date: 2026-06-19

Abstract

This invention discloses a medical image segmentation method for multimodal brain tumor data, relating to the field of medical image segmentation. The method includes: integrating and filtering multimodal magnetic resonance imaging (MRI) data of different subjects, together with the voxel-level brain tumor segmentation annotations corresponding to that data; constructing a three-dimensional network segmentation model based on multi-axis convolution; and training the model using the multimodal MRI data and the corresponding annotations. The invention combines multi-axis convolution and axial attention mechanisms to mine complementary information in MRI data of different modalities, extract more discriminative three-dimensional spatial features, and reduce semantic bias across different axes. A cross-dimensional semantic alignment mechanism establishes semantic consistency between different spatial dimensions and scales of three-dimensional medical images, thereby improving the accuracy of tumor boundary recognition.

Description

Technical Field

[0001] This invention relates to the field of medical image segmentation technology, specifically to a medical image segmentation method for multimodal brain tumor data.

Background Technology

[0002] Brain tumors are common and highly dangerous diseases of the central nervous system. Their diagnosis and treatment heavily rely on accurate analysis of the tumor's location, morphology, and internal structure. Magnetic resonance imaging (MRI) is widely used in the clinical examination of brain tumors due to its excellent imaging capabilities for soft tissues. In clinical practice, multiple modalities of MRI images are typically acquired, such as T1, T1c, T2, and FLAIR, as different modalities can reflect the characteristics of the tumor and its surrounding tissues from different perspectives.

[0003] Deep learning techniques, especially convolutional neural networks (CNNs) and semantic segmentation algorithms, have demonstrated extremely high efficiency and accuracy in image recognition and processing. Deep learning-based semantic segmentation methods, particularly convolutional neural networks and their derived 3D segmentation models, have made significant progress in the field of medical image segmentation.

[0004] However, traditional image segmentation methods still have certain shortcomings in terms of multimodal data modeling capabilities and spatial semantic consistency, which limit segmentation accuracy and model generalization performance. Traditional methods often struggle to effectively integrate and collaboratively analyze data from different magnetic resonance imaging modalities. This difficulty in data integration leads to limitations in analytical capabilities, making it difficult to fully explore the correlations of complex tumor regions at different scales and dimensions, thereby affecting the accuracy and comprehensiveness of tumor boundaries.

[0005] Chinese patent CN202210172181 discloses a "brain tumor medical image segmentation method based on spatial information and feature channels." This method improves the fully convolutional network by introducing dual attention mechanisms, namely spatial attention and channel attention. It enhances the response to key regions and effective features without significantly increasing network complexity, thus alleviating problems such as large parameter counts and long inference times in complex fully convolutional networks, and exhibits good module generalization. However, in utilizing multimodal MRI data, this method focuses more on spatial and channel reconstruction of fused features, lacking explicit modeling and adaptive selection mechanisms for intermodal complementary relationships. Furthermore, it has shortcomings in semantic consistency calibration across different axes in three-dimensional space, making it difficult to adapt to the complex spatial morphological features of brain tumors.

Summary of the Invention

[0006] In view of the above-mentioned shortcomings of the existing technology, the present invention provides a medical image segmentation method for multimodal brain tumor data, which can effectively solve the problems of the existing technology.

[0007] To achieve the above objectives, the present invention is implemented through the following technical solution. This invention discloses a medical image segmentation method for multimodal brain tumor data, comprising:
Step 1: Integrate and filter multimodal magnetic resonance image data of different subjects, as well as the voxel-level brain tumor segmentation annotations corresponding to the multimodal magnetic resonance image data;
Step 2: Construct a 3D network segmentation model based on multi-axis convolution, and use the multimodal magnetic resonance image data obtained in Step 1 and the corresponding voxel-level brain tumor segmentation annotations to train the 3D network segmentation model;
Step 3: Input the multimodal magnetic resonance image data to be predicted into the trained 3D network segmentation model, and automatically output the segmentation results of the brain tumor region.
The three-dimensional network segmentation model includes a data processing module, a multi-axis convolutional feature extraction module, a cross-dimensional semantic alignment module, and a semantic segmentation prediction module connected in sequence. In the training process of Step 2, the total loss function is composed of the segmentation loss function and the consistency loss function of the multi-axis convolutional module.

[0008] Furthermore, the execution process of the three-dimensional network segmentation model includes the following steps:
The multimodal magnetic resonance image data collected in Step 1 is input into the data processing module to complete data standardization;
The standardized data output from the data processing module is fed into the multi-axis convolutional feature extraction module to complete multi-scale three-dimensional spatial feature extraction;
The high-resolution feature information extracted by the shallow layers of the multi-axis convolutional feature extraction module is fed into the deeper layers of the network, which continue multi-axis convolutional feature extraction and output deep features;
The features output from different network layers of the multi-axis convolutional feature extraction module are fed into the cross-dimensional semantic alignment module to complete the semantic calibration and alignment of cross-layer features and obtain the semantic alignment feature map;
The feature results extracted layer by layer by the multi-axis convolutional feature extraction module, together with the semantic alignment features output by the cross-dimensional semantic alignment module, are fed into the semantic segmentation prediction module to complete the prediction and analysis of brain tumor regions.

[0009] Furthermore, when the data processing module standardizes the input multimodal magnetic resonance image data, it follows these rules:
Spatial filling and center cropping are applied to the medical images corresponding to the input multimodal magnetic resonance imaging data to unify all images to a preset three-dimensional voxel size;
Random spatial flipping and imaging intensity perturbation are applied to the medical images corresponding to the multimodal magnetic resonance imaging data. The imaging intensity perturbation is calculated as follows:

$$I' = \alpha I + \beta$$

In the formula: $I$ indicates the intensity value of the original medical image; $I'$ represents the intensity value of the transformed image; $\alpha$ and $\beta$ represent the scaling factor and the offset factor, respectively;
The multimodal magnetic resonance image data processed as described above is organized along the channel dimension in a preset order and converted into a three-dimensional tensor data format that can be processed by the network.
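The intensity perturbation above can be illustrated with a minimal Python sketch. It assumes the perturbation is the linear map I' = αI + β implied by the "scaling factor" and "offset factor"; the function names and the sampling ranges for α and β are hypothetical, since the source gives no values.

```python
import random

def perturb_intensity(voxels, alpha, beta):
    """Apply I' = alpha * I + beta to every intensity value."""
    return [alpha * v + beta for v in voxels]

def random_perturbation(voxels, rng, scale_range=(0.9, 1.1), shift_range=(-0.1, 0.1)):
    """Draw alpha and beta uniformly from assumed ranges, then perturb."""
    alpha = rng.uniform(*scale_range)
    beta = rng.uniform(*shift_range)
    return perturb_intensity(voxels, alpha, beta), alpha, beta

rng = random.Random(0)           # fixed seed so the augmentation is repeatable
original = [0.0, 0.5, 1.0]       # a flattened toy patch of intensities
scaled = perturb_intensity(original, 2.0, 0.1)
augmented, alpha, beta = random_perturbation(original, rng)
```

In practice the same α and β would presumably be applied to a whole volume of one modality, so that relative contrast inside the image is preserved.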

[0010] Furthermore, the multi-axis convolutional feature extraction module captures data features through an encoder structure: The three-dimensional tensor data output by the data processing module is downsampled through a convolutional layer with a kernel of 3×3×3 to obtain the downsampled three-dimensional feature map. The downsampled 3D feature map is input into a multi-axis convolutional feature extraction unit, and directional features are extracted along the three spatial axes through 7×1×1, 1×7×1 and 1×1×7 convolutional kernels respectively, to model the contextual dependencies of the 3D volume data in different directions. The convolution results in the three directions are added element by element to obtain a multi-axis aggregated feature map; The multi-axis aggregated feature map is expanded along three axes, and softmax normalization is performed on the corresponding axes. The weighted results of the three directions are then fused. The dimensions of the multi-axis aggregated feature map are C×H×W×D, where C is the number of feature channels, H is the feature map height, W is the feature map width, and D is the feature map depth. The fused feature map is then reconstructed using 1×1×7 and 1×7×1 convolutional kernels to complete channel semantic mapping and fusion, and output the final extracted features from the multi-axis convolutional feature extraction module.
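As a rough illustration of the directional convolutions described above, the following pure-Python sketch applies one length-7 kernel along each spatial axis of a single-channel volume and sums the three results element by element. The zero padding, the single channel, and the fixed averaging kernel are simplifying assumptions; the patent's kernels are learned and operate on C channels.

```python
def conv_along_axis(vol, kernel, axis):
    """Zero-padded, same-size 1D cross-correlation along one axis of an H x W x D nested list."""
    shape = (len(vol), len(vol[0]), len(vol[0][0]))
    half = len(kernel) // 2
    out = [[[0.0] * shape[2] for _ in range(shape[1])] for _ in range(shape[0])]
    for h in range(shape[0]):
        for w in range(shape[1]):
            for d in range(shape[2]):
                acc = 0.0
                for t, weight in enumerate(kernel):
                    src = [h, w, d]
                    src[axis] += t - half          # shift only along the chosen axis
                    if 0 <= src[axis] < shape[axis]:
                        acc += weight * vol[src[0]][src[1]][src[2]]
                out[h][w][d] = acc
    return out

def multi_axis_conv(vol, k_h, k_w, k_d):
    """Sum the 7x1x1, 1x7x1 and 1x1x7 style responses element by element."""
    per_axis = [conv_along_axis(vol, k, a) for a, k in enumerate((k_h, k_w, k_d))]
    H, W, D = len(vol), len(vol[0]), len(vol[0][0])
    return [[[sum(r[h][w][d] for r in per_axis) for d in range(D)]
             for w in range(W)] for h in range(H)]

ones = [[[1.0] * 7 for _ in range(7)] for _ in range(7)]
mean7 = [1.0 / 7.0] * 7                            # hypothetical fixed kernel
aggregated = multi_axis_conv(ones, mean7, mean7, mean7)
```

Three length-7 axial kernels use 21 weights per channel pair, compared with 343 for a full 7×7×7 kernel, which is one plausible motivation for the axial decomposition.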

[0011] Furthermore, when expanding the multi-axis aggregated feature map along the three axes, global average pooling is performed on the corresponding axes of height, width, and depth, and then the weights of the corresponding axes are normalized by the softmax function to obtain the attention weight map for each axis. The attention weight maps of the three axes are multiplied element-wise with the multi-axis aggregated feature map, and then weighted and fused to obtain a weighted feature map that incorporates multi-axis contextual information.
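One possible reading of this axial attention step is sketched below for a single-channel volume: for each axis, every slice is averaged to one score, the scores are softmax-normalized along that axis, and the three weighted copies of the feature map are fused. The equal-weight average used for fusion is an assumption, as the source does not state the fusion weights.

```python
import math

def softmax(values):
    peak = max(values)                      # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def axis_weights(vol, axis):
    """Average-pool every slice perpendicular to `axis`, then softmax along that axis."""
    shape = (len(vol), len(vol[0]), len(vol[0][0]))
    sums = [0.0] * shape[axis]
    for h in range(shape[0]):
        for w in range(shape[1]):
            for d in range(shape[2]):
                sums[(h, w, d)[axis]] += vol[h][w][d]
    per_slice = shape[0] * shape[1] * shape[2] // shape[axis]
    return softmax([s / per_slice for s in sums])

def axis_attention(vol):
    """Weight the volume by each axis's attention and average the three results (assumed fusion)."""
    wh, ww, wd = (axis_weights(vol, a) for a in range(3))
    return [[[vol[h][w][d] * (wh[h] + ww[w] + wd[d]) / 3.0
              for d in range(len(wd))] for w in range(len(ww))] for h in range(len(wh))]

ones = [[[1.0] * 4 for _ in range(3)] for _ in range(2)]
weights_h = axis_weights(ones, 0)
attended = axis_attention(ones)
```

For a constant volume the weights along each axis are uniform, so the sketch only rescales the input; on real features, slices with stronger average activation would receive larger weights.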

[0012] Furthermore, the cross-dimensional semantic alignment module performs semantic calibration and alignment operations on the features output by different network layers of the multi-axis convolutional feature extraction module, specifically including the following steps: The features output from different network layers of the multi-axis convolution feature extraction module are input into a convolutional layer with a kernel of 3×3×3 and a stride of 2 to complete feature downsampling and obtain a downsampled feature map. The downsampled feature map is input into two consecutive convolutional layers with a kernel size of 1×1×1, and channel expansion and compression mapping are performed according to a 2x expansion ratio to obtain a channel-mapped feature map. The channel mapping feature map is rearranged from the C×H×W×D dimension to H×W×D×C, and weighted and adjusted using a learnable scaling parameter corresponding to the channel dimension, and then restored to the channel calibration feature map of the C×H×W×D dimension. The channel calibration feature map is input into the residual unit to complete the residual feature mapping and output the semantic alignment feature map. The residual unit is a feature processing unit set within the cross-dimensional semantic alignment module, consisting of convolutional layers with a kernel size of 1×1×1 and a stride of 2. It is used to perform residual feature mapping on the input channel calibration feature map to achieve feature dimension matching and the preservation and transmission of core semantic information.
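The rearrange, scale, and restore step in this module can be sketched in plain Python as follows. The nested-list layout and the example values of the scaling parameter are hypothetical; in the described model the scale would be learned during training.

```python
def to_channels_last(x):
    """Rearrange a C x H x W x D nested list into H x W x D x C."""
    C, H, W, D = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    return [[[[x[c][h][w][d] for c in range(C)] for d in range(D)]
             for w in range(W)] for h in range(H)]

def to_channels_first(y):
    """Rearrange an H x W x D x C nested list back into C x H x W x D."""
    H, W, D, C = len(y), len(y[0]), len(y[0][0]), len(y[0][0][0])
    return [[[[y[h][w][d][c] for d in range(D)] for w in range(W)]
             for h in range(H)] for c in range(C)]

def channel_scale(x, gamma):
    """Multiply every voxel's channel vector by the per-channel scale gamma."""
    y = to_channels_last(x)
    y = [[[[v * g for v, g in zip(vec, gamma)] for vec in row]
          for row in plane] for plane in y]
    return to_channels_first(y)

x = [[[[1.0, 2.0], [3.0, 4.0]]], [[[5.0, 6.0], [7.0, 8.0]]]]   # C=2, H=1, W=2, D=2
calibrated = channel_scale(x, gamma=[2.0, 0.5])                 # hypothetical scale values
```

With the channel dimension last, the scale broadcasts naturally over every voxel, which is presumably why the source rearranges the tensor before applying it.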

[0013] Furthermore, the semantic segmentation prediction module is used to accurately classify different regions of brain tumors: Deep features are input into the transposed convolutional unit, and the spatial dimension of the feature map is adjusted to match the spatial dimension of the shallow features, resulting in a dimension-matched deep feature map. The semantically aligned feature map is fused with the deep feature map after dimensional matching, and the fused feature is input into the decoding unit. The features output from the decoding unit are input into two consecutive convolutional layers with a kernel size of 3×3×3 to refine the features and obtain the refined segmentation feature map. The refined segmentation feature map is input into a convolutional layer with a kernel size of 1×1×1, and the final predicted segmentation result of the brain tumor region is output. The decoding unit is a feature decoding structure built into the semantic segmentation prediction module, used to upsample and semantically restore the fused features.

[0014] Furthermore, the total loss function used in the training process in Step 2 is calculated as follows:

$$L_{total} = L_{seg} + \lambda L_{axis}$$

In the formula: $L_{total}$ represents the total loss function; $L_{seg}$ represents the segmentation loss function; $L_{axis}$ represents the consistency loss function of the multi-axis convolution module, used to constrain the semantic consistency between feature representations along different axes; $\lambda$ represents the weighting coefficient of the consistency loss, used to balance the axial consistency constraint against segmentation accuracy optimization.

[0015] Furthermore, the segmentation loss function is a weighted sum of the Dice loss function and the binary cross-entropy loss function, calculated as follows:

$$L_{seg} = \lambda_{Dice} L_{Dice} + \lambda_{BCE} L_{BCE}$$

In the formula: $\lambda_{Dice}$ and $\lambda_{BCE}$ are the loss weight coefficients, used to balance the optimization weights between the Dice loss and the binary cross-entropy loss.
The Dice loss function is calculated as follows:

$$L_{Dice} = 1 - Dice$$

$$Dice = \frac{1}{M} \sum_{m=1}^{M} \frac{2 \sum_{n=1}^{N} p_{n,m}\, g_{n,m} + \epsilon}{\sum_{n=1}^{N} p_{n,m} + \sum_{n=1}^{N} g_{n,m} + \epsilon}$$

In the formula: $Dice$ is the Dice similarity coefficient, which measures the degree of regional overlap between the predicted segmentation result and the actual segmentation annotation of the brain tumor at the voxel level; $N$ represents the total number of voxels; $M$ represents the total number of categories; $\epsilon$ indicates the smoothing term; $p_{n,m}$ represents the predicted probability output of the n-th voxel in the m-th class; $g_{n,m}$ represents the true label of the n-th voxel in the m-th class in one-hot encoded form.
The binary cross-entropy loss function is calculated as follows:

$$L_{BCE} = -\frac{1}{N'} \sum_{i=1}^{N'} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$

In the formula: $N'$ represents the total number of samples; $y_i$ represents the true probability value of the i-th sample; $\hat{y}_i$ represents the model's predicted probability output for the i-th sample.
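A minimal Python sketch of this segmentation loss follows. It assumes the standard soft Dice form averaged over classes and the standard binary cross-entropy, which may differ in detail from the formulas in the original filing; the equal weights of 0.5, the smoothing value, and the probability clipping are hypothetical choices.

```python
import math

def dice_loss(pred, target, eps=1e-5):
    """pred and target hold M class rows of N voxel values; returns 1 - mean per-class Dice."""
    scores = []
    for p_m, g_m in zip(pred, target):
        overlap = sum(p * g for p, g in zip(p_m, g_m))
        scores.append((2.0 * overlap + eps) / (sum(p_m) + sum(g_m) + eps))
    return 1.0 - sum(scores) / len(scores)

def bce_loss(pred, target, clip=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, clip), 1.0 - clip)
        total += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return -total / len(pred)

def segmentation_loss(pred, target, w_dice=0.5, w_bce=0.5):
    """Weighted sum of Dice loss and BCE; the weights here are placeholders."""
    flat_p = [p for row in pred for p in row]
    flat_g = [g for row in target for g in row]
    return w_dice * dice_loss(pred, target) + w_bce * bce_loss(flat_p, flat_g)

labels = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]      # one-hot labels, M=2 classes, N=3 voxels
perfect = segmentation_loss(labels, labels)
uncertain = bce_loss([0.5, 0.5], [1.0, 0.0])
```

Dice responds to region overlap and is less sensitive to class imbalance, while cross-entropy gives a per-voxel gradient, which is a common reason for combining the two.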

[0016] Furthermore, the consistency loss function of the multi-axis convolution module is calculated as follows:

$$L_{axis} = \sum_{(p,q) \in \Omega} w_{pq} \left\| \frac{f_p}{\| f_p \|_2} - \frac{f_q}{\| f_q \|_2} \right\|_2^2$$

$$\| f_p \|_2 = \sqrt{\sum_{k} f_{p,k}^{2}}$$

In the formula: $L_{axis}$ is used to constrain the semantic consistency between feature representations of different axes by comparing the distance between features of different axes, reducing the semantic bias caused by directional differences; $\| \cdot \|_2$ represents the L2 norm of a feature representation, used to normalize axial features; $\Omega$ represents the set of axial feature pairs, defined as $\Omega = \{(H, W), (H, D), (W, D)\}$, where H is the feature map height, W is the feature map width, and D is the feature map depth; $f_p$ and $f_q$ represent the axial feature representations obtained by aggregating along the p-th and q-th spatial dimensions, respectively; $f_{p,k}$ represents the value of the k-th feature component in the feature vector obtained by aggregating along the p-th spatial direction. The weight of each axial feature pair in the consistency loss is adaptively adjusted by introducing a corresponding axial weight coefficient $w_{pq}$, so that the constraint strength follows the importance of each spatial direction to the segmentation task. $L_{axis}$, as an auxiliary constraint term during the training phase, constitutes the total loss function together with the segmentation loss function and updates the parameters of the multi-axis convolutional feature extraction module through backpropagation.
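The sketch below shows one way such a consistency term could be computed. It assumes a weighted squared Euclidean distance between L2-normalized axial feature vectors over the pairs (H, W), (H, D), and (W, D); the exact distance in the filing is not recoverable from this text, and the feature values, pair weights, and the coefficient 0.1 are hypothetical.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Divide a feature vector by its L2 norm (eps guards against a zero vector)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]

def axis_consistency_loss(features, pair_weights):
    """features maps axis name -> vector; pair_weights maps (p, q) -> weight."""
    loss = 0.0
    for (p, q), weight in pair_weights.items():
        f_p = l2_normalize(features[p])
        f_q = l2_normalize(features[q])
        loss += weight * sum((a - b) ** 2 for a, b in zip(f_p, f_q))
    return loss

def total_loss(seg_loss, axis_loss, lam):
    """Segmentation loss plus the weighted consistency term."""
    return seg_loss + lam * axis_loss

features = {"H": [1.0, 0.0], "W": [2.0, 0.0], "D": [0.0, 3.0]}   # toy aggregated features
pairs = {("H", "W"): 1.0, ("H", "D"): 1.0, ("W", "D"): 1.0}
l_axis = axis_consistency_loss(features, pairs)
l_total = total_loss(0.3, l_axis, lam=0.1)
```

Because the vectors are normalized first, H and W in this toy case count as identical despite their different magnitudes, so only direction mismatches are penalized.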

[0017] Compared with the known prior art, the technical solution provided by this invention has the following beneficial effects: This invention combines multi-axis convolution and axial attention mechanisms to fully exploit complementary information in magnetic resonance imaging data of different modalities, extract more discriminative three-dimensional spatial features, and reduce semantic bias across different axes. It uses a neural network architecture with cross-dimensional semantic alignment to establish semantic consistency across different spatial dimensions and scales of three-dimensional medical images, thereby improving the accuracy of tumor boundary recognition. At the same time, a training strategy in which multiple loss functions act as joint constraints balances segmentation accuracy and spatial semantic consistency, enhancing the model's generalization performance. Furthermore, by automating the processing and feature learning of medical image data, it reduces reliance on human experience, providing a reliable technical solution for brain tumor structural assessment, assisted diagnosis, and subsequent treatment decisions.

Attached Figure Description

[0018] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the accompanying drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are merely some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without any creative effort.

[0019] Figure 1 is a flowchart illustrating the brain tumor segmentation method based on convolutional attention and cross-layer semantic calibration provided in an embodiment of the present invention; Figure 2 is a schematic diagram of the module structure of the three-dimensional network segmentation model provided in an embodiment of the present invention; Figure 3 is an example diagram of multimodal magnetic resonance imaging of brain tumors provided in an embodiment of the present invention.

Detailed Implementation

[0020] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort are within the scope of protection of the present invention.

[0021] The present invention will be further described below with reference to embodiments.

[0022] Example 1: This embodiment presents a medical image segmentation method for multimodal brain tumor data which, as shown in Figure 1, includes:
Step 1: Integrate and filter multimodal magnetic resonance imaging data of different subjects, as well as the voxel-level brain tumor segmentation annotations corresponding to the multimodal magnetic resonance imaging data;
Step 2: Construct a 3D network segmentation model based on multi-axis convolution, and use the multimodal magnetic resonance image data obtained in Step 1 and the corresponding voxel-level brain tumor segmentation annotations to train the 3D network segmentation model.
The execution process of the 3D network segmentation model includes the following steps:
The multimodal magnetic resonance image data collected in Step 1 is input into the data processing module to complete data standardization;
The standardized data output from the data processing module is fed into the multi-axis convolutional feature extraction module to complete multi-scale three-dimensional spatial feature extraction;
The high-resolution feature information extracted by the shallow layers of the multi-axis convolutional feature extraction module is fed into the deeper layers of the network, which continue multi-axis convolutional feature extraction and output deep features;
The features output from different network layers of the multi-axis convolutional feature extraction module are fed into the cross-dimensional semantic alignment module to complete the semantic calibration and alignment of cross-layer features and obtain the semantic alignment feature map;
The feature results extracted layer by layer by the multi-axis convolutional feature extraction module, together with the semantic alignment features output by the cross-dimensional semantic alignment module, are fed into the semantic segmentation prediction module to complete the prediction and analysis of brain tumor regions.
When the data processing module standardizes the input multimodal magnetic resonance image data, it follows these rules:
Spatial filling and center cropping are applied to the medical images corresponding to the input multimodal magnetic resonance imaging data to unify all images to a preset three-dimensional voxel size;
Random spatial flipping and imaging intensity perturbation are applied to the medical images corresponding to the multimodal magnetic resonance imaging data. The imaging intensity perturbation is calculated as follows:

$$I' = \alpha I + \beta$$

In the formula: $I$ indicates the intensity value of the original medical image; $I'$ represents the intensity value of the transformed image; $\alpha$ and $\beta$ represent the scaling factor and the offset factor, respectively;
The multimodal magnetic resonance image data processed as described above is organized along the channel dimension in a preset order and converted into a three-dimensional tensor data format that can be processed by the network.
Step 3: Input the multimodal magnetic resonance image data to be predicted into the trained 3D network segmentation model, and automatically output the segmentation results of the brain tumor region.
The 3D network segmentation model includes a data processing module, a multi-axis convolutional feature extraction module, a cross-dimensional semantic alignment module, and a semantic segmentation prediction module connected in sequence. During the training process in Step 2, the total loss function is composed of the segmentation loss function and the consistency loss function of the multi-axis convolutional module.
The multi-axis convolutional feature extraction module captures data features through the encoder structure: The three-dimensional tensor data output by the data processing module is downsampled through a convolutional layer with a kernel of 3×3×3 to obtain the downsampled three-dimensional feature map.
The downsampled 3D feature map is input into the multi-axis convolution feature extraction unit. It is then processed by 7×1×1, 1×7×1 and 1×1×7 convolution kernels to extract directional features along the three spatial axes, thereby modeling the contextual dependencies of the 3D volume data in different directions. The convolution results in the three directions are added element by element to obtain a multi-axis aggregated feature map; The multi-axis aggregated feature map is expanded along three axes, and softmax normalization is performed on the corresponding axes. The weighted results of the three directions are then fused. The dimensions of the multi-axis aggregated feature map are C×H×W×D, where C is the number of feature channels, H is the feature map height, W is the feature map width, and D is the feature map depth. The fused feature map is then reconstructed using 1×1×7 and 1×7×1 convolution kernels to complete channel semantic mapping and fusion, and output the final extracted features of the multi-axis convolution feature extraction module. When expanding the multi-axis aggregated feature map along the three axes, global average pooling is performed on the corresponding axes of height, width and depth respectively, and then the weights of the corresponding axes are normalized by the softmax function to obtain the attention weight map of each axis. The attention weight maps of the three axes are multiplied element-wise with the multi-axis aggregated feature map, and then weighted fusion is performed to obtain a weighted feature map that incorporates multi-axis contextual information. 
The cross-dimensional semantic alignment module performs semantic calibration and alignment operations on the features output by different network layers of the multi-axis convolutional feature extraction module, specifically including the following steps: The features output from different network layers of the multi-axis convolutional feature extraction module are input into a convolutional layer with a kernel of 3×3×3 and a stride of 2 to perform feature downsampling and obtain a downsampled feature map. The downsampled feature map is input into two consecutive convolutional layers with a kernel size of 1×1×1. Channel expansion and compression mapping are performed by expanding the kernel by a factor of 2 to obtain the channel-mapped feature map. The channel mapping feature map is rearranged from the C×H×W×D dimension to H×W×D×C, and weighted and adjusted using a learnable scaling parameter corresponding to the channel dimension. Then it is restored to the channel calibration feature map with the C×H×W×D dimension. The channel calibration feature map is input into the residual unit to complete the residual feature mapping and output the semantic alignment feature map. The residual unit is a feature processing unit set within the cross-dimensional semantic alignment module, consisting of convolutional layers with a kernel size of 1×1×1 and a stride of 2. It is used to perform residual feature mapping on the input channel calibration feature map to achieve feature dimension matching and the preservation and transmission of core semantic information. The semantic segmentation prediction module is used to accurately classify different regions of brain tumors. Deep features are input into the transposed convolutional unit, and the spatial dimension of the feature map is adjusted to match the spatial dimension of the shallow features, resulting in a dimension-matched deep feature map. 
The semantically aligned feature map is fused with the dimension-matched deep feature map, and the fused feature is input into the decoding unit. The features output from the decoding unit are input into two consecutive convolutional layers with a kernel size of 3×3×3 to refine the features and obtain the refined segmentation feature map. The refined segmentation feature map is input into a convolutional layer with a kernel size of 1×1×1, and the final predicted segmentation result of the brain tumor region is output. The decoding unit is a feature decoding structure built into the semantic segmentation prediction module, used to upsample and semantically restore the fused features.
The total loss function used in the training process in Step 2 is calculated as follows:

$$L_{total} = L_{seg} + \lambda L_{axis}$$

In the formula: $L_{total}$ represents the total loss function; $L_{seg}$ represents the segmentation loss function; $L_{axis}$ represents the consistency loss function of the multi-axis convolution module, used to constrain the semantic consistency between feature representations along different axes; $\lambda$ represents the weight coefficient of the consistency loss, which is selected from preset values based on the convergence speed and validation set segmentation performance during training, and is used to balance the axial consistency constraint against segmentation accuracy optimization.

[0023] The segmentation loss function is a weighted sum of the Dice loss function and the binary cross-entropy loss function, calculated as follows:

$$L_{seg} = \lambda_{Dice} L_{Dice} + \lambda_{BCE} L_{BCE}$$

In the formula: $\lambda_{Dice}$ and $\lambda_{BCE}$ are the loss weight coefficients, selected from preset values based on the convergence speed and validation set segmentation performance during training, and used to balance the optimization weights of the Dice loss and the binary cross-entropy loss.
The Dice loss function is calculated as follows:

$$L_{Dice} = 1 - Dice$$

$$Dice = \frac{1}{M} \sum_{m=1}^{M} \frac{2 \sum_{n=1}^{N} p_{n,m}\, g_{n,m} + \epsilon}{\sum_{n=1}^{N} p_{n,m} + \sum_{n=1}^{N} g_{n,m} + \epsilon}$$

In the formula: $Dice$ is the Dice similarity coefficient, which measures the degree of regional overlap between the predicted segmentation result and the actual segmentation annotation of the brain tumor at the voxel level; $N$ represents the total number of voxels; $M$ represents the total number of categories; $\epsilon$ indicates the smoothing term; $p_{n,m}$ represents the predicted probability output of the n-th voxel in the m-th class; $g_{n,m}$ represents the true label of the n-th voxel in the m-th class in one-hot encoded form.
The binary cross-entropy loss function is calculated as follows:

$$L_{BCE} = -\frac{1}{N'} \sum_{i=1}^{N'} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$

In the formula: $N'$ represents the total number of samples; $y_i$ represents the true probability value of the i-th sample; $\hat{y}_i$ represents the model's predicted probability output for the i-th sample.
The consistency loss function of the multi-axis convolution module is calculated as follows:

$$L_{axis} = \sum_{(p,q) \in \Omega} w_{pq} \left\| \frac{f_p}{\| f_p \|_2} - \frac{f_q}{\| f_q \|_2} \right\|_2^2$$

$$\| f_p \|_2 = \sqrt{\sum_{k} f_{p,k}^{2}}$$

In the formula: $L_{axis}$ is used to constrain the semantic consistency between feature representations of different axes by comparing the distance between features of different axes, reducing the semantic bias caused by directional differences; $\| \cdot \|_2$ represents the L2 norm of a feature representation, used to normalize axial features; $\Omega$ represents the set of axial feature pairs, defined as $\Omega = \{(H, W), (H, D), (W, D)\}$, where H is the feature map height, W is the feature map width, and D is the feature map depth; $f_p$ and $f_q$ represent the axial feature representations obtained by aggregating along the p-th and q-th spatial dimensions, respectively; $f_{p,k}$ represents the value of the k-th feature component in the feature vector obtained by aggregating along the p-th spatial direction. The weight of each axial feature pair in the consistency loss is adaptively adjusted by introducing a corresponding axial weight coefficient $w_{pq}$, so that the constraint strength follows the importance of each spatial direction to the segmentation task. $L_{axis}$, as an auxiliary constraint term during the training phase, constitutes the total loss function together with the segmentation loss function and updates the parameters of the multi-axis convolutional feature extraction module through backpropagation.

[0024] Example 2: First, multimodal brain magnetic resonance imaging data of different subjects were integrated and screened, along with corresponding voxel-level brain tumor segmentation annotations. The multimodal data covered multi-sequence magnetic resonance images that could reflect different characteristics of brain tissue. After removing invalid data with imaging quality defects or mismatched annotation information, the valid data was divided into training set, validation set, and test set, which were used for model training, effect verification, and generalization performance testing, respectively.
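The screening and partitioning described in this example might look like the following sketch. The record fields `quality_ok` and `has_annotation`, the 70/15/15 ratios, and the fixed seed are all hypothetical, since the source specifies neither the filtering criteria in detail nor the split proportions.

```python
import random

def filter_valid(cases):
    """Keep only cases with acceptable image quality and a matching annotation."""
    return [c["id"] for c in cases if c["quality_ok"] and c["has_annotation"]]

def split_cases(case_ids, ratios=(0.7, 0.15, 0.15), seed=0):
    """Shuffle deterministically and cut into training, validation and test sets."""
    ids = sorted(case_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * ratios[0])
    n_val = round(len(ids) * ratios[1])
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# Toy records: case 3 has an imaging defect, case 7 lacks a usable annotation.
cases = [{"id": i, "quality_ok": i != 3, "has_annotation": i != 7} for i in range(22)]
valid_ids = filter_valid(cases)
train, val, test = split_cases(valid_ids)
```

Splitting by subject rather than by slice or patch keeps data from one patient out of more than one set, which matters when the test set is meant to measure generalization.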

[0025] Standardization preprocessing is performed on all valid data. First, spatial filling and center cropping are performed on all images to unify them to a preset 3D voxel size. Random spatial flipping and image intensity perturbation are performed on the training set data to complete data augmentation and improve the model's generalization ability. Finally, the multimodal images are organized into multi-channel 3D tensor data in a fixed order to complete the standardization processing of all data and adapt it to the model's input requirements.
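The spatial filling and center cropping step can be sketched as below for a volume stored as nested lists. Zero fill and symmetric placement are assumptions; the preset voxel size in the example is arbitrary because the source does not give one.

```python
def fit_axis(length, target):
    """Return (pad_before, pad_after, crop_start) that centre `length` within `target`."""
    if length >= target:
        return 0, 0, (length - target) // 2
    total = target - length
    return total // 2, total - total // 2, 0

def filled(shape, fill):
    """Build a nested list of the given shape holding the fill value."""
    if len(shape) == 1:
        return [fill] * shape[0]
    return [filled(shape[1:], fill) for _ in range(shape[0])]

def pad_or_crop(volume, target_shape, fill=0.0):
    """Recursively pad or centre-crop each axis of a nested-list volume."""
    target, rest = target_shape[0], tuple(target_shape[1:])
    before, after, start = fit_axis(len(volume), target)
    kept = volume[start:start + target]
    if not rest:
        return [fill] * before + list(kept) + [fill] * after
    kept = [pad_or_crop(sub, rest, fill) for sub in kept]
    return ([filled(rest, fill) for _ in range(before)] + kept
            + [filled(rest, fill) for _ in range(after)])

# A 2 x 3 x 5 toy volume whose values encode their own (h, w, d) position.
volume = [[[h * 100 + w * 10 + d for d in range(5)] for w in range(3)] for h in range(2)]
fitted = pad_or_crop(volume, (4, 2, 3))   # pads H, crops W and D
```

Each axis is handled independently, so a single call can pad some axes and crop others, as happens when scans from different sites have different matrix sizes.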

[0026] The 3D network segmentation model constructed in this implementation includes four core modules: data processing, multi-axis convolutional feature extraction, cross-dimensional semantic alignment, and semantic segmentation prediction. The complete execution flow is as follows. After the preprocessed training data is input into the model, it first enters the multi-axis convolutional feature extraction stage. Initial downsampling is performed with 3×3×3 convolutional kernels, and the result is fed to the multi-axis convolutional feature extraction unit. This unit extracts spatial features along each of the three spatial axes of the 3D volume data using convolutional kernels of corresponding sizes, and the convolution results in the three directions are summed element-wise to obtain multi-axis aggregated features. The feature map is then unfolded along the three axes, normalized, and weighted. After the weighted results from the three directions are fused, feature reconstruction is performed with the corresponding convolutional kernels to achieve channel semantic mapping and fusion. Meanwhile, the model passes the high-resolution features extracted by the shallow network into the deep network, so that multi-scale features are reused and extracted in depth.
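
The multi-axis convolution and axial weighting steps can be illustrated with a small NumPy sketch on a single-channel feature map. It rests on several assumptions: the kernel length of 7 follows claim 4, the axial weights are taken as a softmax over the per-axis global average (one reading of claims 4 and 5), the three weighted maps are fused with equal weights, and all function names are hypothetical. Initial downsampling and the final reconstruction convolution are omitted.

```python
import numpy as np

def axis_conv(x, kernel, axis):
    """Cross-correlate x with a 1-D kernel along one axis, keeping its size.

    Requires x.shape[axis] >= len(kernel) so that mode="same" preserves length.
    """
    return np.apply_along_axis(
        lambda v: np.correlate(v, kernel, mode="same"), axis, x)

def multi_axis_conv(x, k_h, k_w, k_d):
    """Element-wise sum of 7x1x1, 1x7x1 and 1x1x7 responses on an (H, W, D) map."""
    return axis_conv(x, k_h, 0) + axis_conv(x, k_w, 1) + axis_conv(x, k_d, 2)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def axial_attention(feat, fuse_weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weight an (H, W, D) map along each axis and fuse the three results.

    The equal fusion weights are an assumption made for this sketch.
    """
    out = np.zeros_like(feat, dtype=float)
    for axis, w in enumerate(fuse_weights):
        other = tuple(a for a in range(3) if a != axis)
        pooled = feat.mean(axis=other)   # one value per position on this axis
        attn = softmax(pooled)           # normalized weights along this axis
        shape = [1, 1, 1]
        shape[axis] = -1
        out += w * feat * attn.reshape(shape)
    return out

# Usage with averaging kernels on a random 8x8x8 volume.
kernel = np.ones(7) / 7
volume = np.random.default_rng(0).normal(size=(8, 8, 8))
weighted = axial_attention(multi_axis_conv(volume, kernel, kernel, kernel))
```

In a trained network the three kernels would be learned and applied across many channels; the sketch only shows how the directional responses are summed and then reweighted per axis.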

[0027] After multi-scale feature extraction, features from different levels are fed into a cross-dimensional semantic alignment module to perform cross-layer semantic calibration. This module first downsamples the feature map using a 3×3×3 convolutional layer, then uses two consecutive 1×1×1 convolutional layers to perform channel expansion and compression mapping according to a preset expansion ratio. Subsequently, the dimensions of the feature map are rearranged, and after weighting adjustment using learnable scaling parameters corresponding to the channel dimensions, it is restored to the standard feature map format. Finally, feature processing is completed through a 1×1×1 convolutional layer of residual units, achieving semantic alignment of features from different levels and dimensions, and eliminating semantic bias of cross-layer features.
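
The alignment steps above can be sketched in NumPy as follows. This is an illustrative reading rather than the patented module: the stride of 2 and the expansion ratio of 2 are taken from claim 6, no activation is placed between the two pointwise layers because the text does not specify one, the residual unit is assumed to add a 1×1×1 convolution of the calibrated features back onto them, and `conv3d`, `pointwise`, `semantic_align`, and the parameter keys are hypothetical names.

```python
import numpy as np

def conv3d(x, w, stride=1, pad=0):
    """Plain 3-D cross-correlation. x: (Cin, H, W, D); w: (Cout, Cin, k, k, k)."""
    if pad:
        x = np.pad(x, [(0, 0)] + [(pad, pad)] * 3)
    k = w.shape[-1]
    win = np.lib.stride_tricks.sliding_window_view(x, (k, k, k), axis=(1, 2, 3))
    win = win[:, ::stride, ::stride, ::stride]   # (Cin, H', W', D', k, k, k)
    return np.einsum("chwdijk,ocijk->ohwd", win, w)

def pointwise(x, w):
    """1x1x1 convolution. x: (Cin, H, W, D); w: (Cout, Cin)."""
    return np.einsum("oc,chwd->ohwd", w, x)

def semantic_align(x, params):
    """Downsample, expand and compress channels, rescale per channel, add residual."""
    y = conv3d(x, params["w_down"], stride=2, pad=1)   # 3x3x3, stride 2
    y = pointwise(y, params["w_expand"])               # C -> 2C
    y = pointwise(y, params["w_compress"])             # 2C -> C
    y = np.transpose(y, (1, 2, 3, 0))                  # C,H,W,D -> H,W,D,C
    y = y * params["gamma"]                            # learnable per-channel scale
    y = np.transpose(y, (3, 0, 1, 2))                  # back to C,H,W,D
    return y + pointwise(y, params["w_res"])           # assumed residual form

# Usage with random parameters for C = 4 channels.
rng = np.random.default_rng(0)
C = 4
params = {
    "w_down": rng.normal(size=(C, C, 3, 3, 3)) * 0.1,
    "w_expand": rng.normal(size=(2 * C, C)) * 0.1,
    "w_compress": rng.normal(size=(C, 2 * C)) * 0.1,
    "gamma": np.ones(C),
    "w_res": rng.normal(size=(C, C)) * 0.1,
}
aligned = semantic_align(rng.normal(size=(C, 8, 8, 8)), params)
```

Moving the channel axis last lets the per-channel scaling broadcast over every voxel with a single multiplication, which is the apparent purpose of the dimension rearrangement.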

[0028] The semantically aligned features, along with the original features extracted layer by layer by the multi-axis convolutional feature extraction module, are fed into the semantic segmentation prediction module to generate the segmentation result. This module first adjusts the spatial dimension of the input deep features through a transposed convolutional unit, then fuses the dimension-matched deep features with the corresponding semantically aligned features and feeds them into the decoding unit; subsequently, it refines the features through two consecutive 3×3×3 convolutional layers, and finally outputs the final segmentation prediction result of the brain tumor region through a 1×1×1 convolution.
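
A rough NumPy sketch of this prediction path is given below. Several details are assumptions, since the text does not fix them: the transposed convolution uses kernel size 2 and stride 2, fusion is done by channel concatenation, ReLU follows each refinement convolution, the separate decoding unit is folded into the two refinement layers, and a sigmoid turns the final 1×1×1 output into a tumor probability. All function and parameter names are hypothetical.

```python
import numpy as np

def conv3d(x, w, pad=0):
    """Plain 3-D cross-correlation. x: (Cin, H, W, D); w: (Cout, Cin, k, k, k)."""
    if pad:
        x = np.pad(x, [(0, 0)] + [(pad, pad)] * 3)
    k = w.shape[-1]
    win = np.lib.stride_tricks.sliding_window_view(x, (k, k, k), axis=(1, 2, 3))
    return np.einsum("chwdijk,ocijk->ohwd", win, w)

def transposed_conv_2x(x, w):
    """Kernel-2, stride-2 transposed convolution (output blocks do not overlap).

    x: (Cin, H, W, D); w: (Cin, Cout, 2, 2, 2); returns (Cout, 2H, 2W, 2D).
    """
    c_out = w.shape[1]
    h, wd, d = x.shape[1:]
    y = np.einsum("chwd,coijk->ohiwjdk", x, w)
    return y.reshape(c_out, 2 * h, 2 * wd, 2 * d)

def decode(deep, aligned, p):
    """Upsample deep features, fuse with aligned features, refine, and predict."""
    up = transposed_conv_2x(deep, p["w_up"])
    fused = np.concatenate([up, aligned], axis=0)         # assumed fusion
    h = np.maximum(conv3d(fused, p["w_ref1"], pad=1), 0)  # 3x3x3 + ReLU
    h = np.maximum(conv3d(h, p["w_ref2"], pad=1), 0)      # 3x3x3 + ReLU
    logits = np.einsum("oc,chwd->ohwd", p["w_out"], h)    # 1x1x1 output layer
    return 1.0 / (1.0 + np.exp(-logits))                  # assumed sigmoid

# Usage with random parameters and small feature maps.
rng = np.random.default_rng(0)
p = {
    "w_up": rng.normal(size=(8, 4, 2, 2, 2)) * 0.1,
    "w_ref1": rng.normal(size=(4, 8, 3, 3, 3)) * 0.1,
    "w_ref2": rng.normal(size=(4, 4, 3, 3, 3)) * 0.1,
    "w_out": rng.normal(size=(1, 4)) * 0.1,
}
prob = decode(rng.normal(size=(8, 2, 2, 2)), rng.normal(size=(4, 4, 4, 4)), p)
```

With kernel size equal to stride, each input voxel writes to its own 2×2×2 output block, so the transposed convolution reduces to a single `einsum` followed by a reshape.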

[0029] During the model training phase, the network is optimized iteratively with preset training parameters, and a combined total loss function is used to update the parameters. The voxel-level brain tumor segmentation annotations are defined as category labels assigned at the corresponding spatial positions of the multimodal image data, indicating whether each voxel belongs to the background region or the brain tumor region. During training, multimodal magnetic resonance image data is input into the 3D network segmentation model, which outputs the predicted probability of each category for each voxel. The segmentation loss is calculated from the difference between the predictions and the ground-truth annotations; it consists of the Dice loss and the binary cross-entropy loss and optimizes the model's segmentation accuracy. The consistency loss of the multi-axis convolution module constrains the semantic consistency of features along different axes, reducing the semantic bias caused by directional differences. Finally, the segmentation loss and the consistency loss are weighted and summed to form the total loss function, and the model parameters are updated by backpropagation on this total loss, achieving end-to-end training of the model. During training, the segmentation accuracy on the validation set improves steadily. After training, the model's segmentation accuracy on the validation set reaches the expected standard, with good convergence and no obvious overfitting.
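
The combined loss described above can be sketched in NumPy as follows. The sketch is written under stated assumptions and is not the patent's formula: Dice and binary cross-entropy use their common forms, the axial feature of each directional branch is taken to be its global-average-pooled channel vector, the consistency term is the squared distance between L2-normalized vectors, the three axis pairs and their fixed unit weights stand in for the set Ω and the adaptive axial weight coefficients, and the default coefficients `lam1`, `lam2`, and `lam_axis` are arbitrary placeholders.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-5):
    """1 - Dice, with a smoothing term eps in numerator and denominator."""
    inter = np.sum(pred * target)
    dice = (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)
    return 1.0 - dice

def bce_loss(pred, target, tiny=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    p = np.clip(pred, tiny, 1.0 - tiny)
    return -np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))

def axis_consistency_loss(branch_feats, pair_weights=None):
    """Squared distance between L2-normalized axial feature vectors.

    branch_feats maps "H", "W", "D" to the (C, H, W, D) output of the
    corresponding directional branch (an assumed reading of the text).
    """
    vecs = {}
    for name, feat in branch_feats.items():
        v = feat.mean(axis=(1, 2, 3))                 # one value per channel
        vecs[name] = v / (np.linalg.norm(v) + 1e-12)  # L2 normalization
    loss = 0.0
    for p, q in [("H", "W"), ("H", "D"), ("W", "D")]:  # assumed pair set
        w = 1.0 if pair_weights is None else pair_weights[(p, q)]
        loss += w * np.sum((vecs[p] - vecs[q]) ** 2)
    return loss

def total_loss(pred, target, branch_feats, lam1=0.5, lam2=0.5, lam_axis=0.1):
    """Weighted segmentation loss plus weighted axial consistency loss."""
    seg = lam1 * dice_loss(pred, target) + lam2 * bce_loss(pred, target)
    return seg + lam_axis * axis_consistency_loss(branch_feats)
```

In an actual training loop these terms would be written in an automatic-differentiation framework so that backpropagation can update the network parameters; NumPy is used here only to make the arithmetic explicit.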

[0030] The trained model is then used to segment the multimodal brain tumor images to be predicted. After the original multimodal images to be analyzed are input, the model automatically performs data standardization preprocessing and then carries out feature extraction, cross-dimensional semantic alignment, and segmentation prediction through its built-in network structure. It automatically outputs refined segmentation results for the brain tumor regions, and these results can accurately identify the different structural regions of the brain tumor.

[0031] In summary, the methods described above combine multi-axis convolution and axial attention mechanisms to fully exploit complementary information in different modal magnetic resonance imaging data, extract more discriminative three-dimensional spatial features, reduce semantic bias across different axes, and utilize an innovative neural network architecture with cross-dimensional semantic alignment to establish semantic consistency across different spatial dimensions and scales of three-dimensional medical images, thereby improving the accuracy of tumor boundary recognition. Furthermore, a training strategy employing multi-loss function collaborative constraints balances segmentation accuracy and spatial semantic consistency, enhancing the model's generalization performance. Finally, by automating the processing and feature learning of medical image data, reliance on human experience is reduced, providing a reliable technical solution for brain tumor structural assessment, assisted diagnosis, and subsequent treatment decisions.

[0032] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions will not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A medical image segmentation method for multimodal brain tumor data, characterized in that it includes: Step 1: integrating and filtering multimodal magnetic resonance image data of different subjects, as well as the voxel-level brain tumor segmentation annotations corresponding to the multimodal magnetic resonance image data; Step 2: constructing a 3D network segmentation model based on multi-axis convolution, and training the 3D network segmentation model using the multimodal magnetic resonance image data as input and the corresponding voxel-level brain tumor segmentation annotations as supervision signals; Step 3: inputting the multimodal magnetic resonance image data to be predicted into the trained 3D network segmentation model to obtain the segmentation results of the brain tumor region; wherein the 3D network segmentation model includes a data processing module, a multi-axis convolutional feature extraction module, a cross-dimensional semantic alignment module, and a semantic segmentation prediction module connected in sequence.

2. The medical image segmentation method for multimodal brain tumor data according to claim 1, characterized in that, The execution process of the three-dimensional network segmentation model includes the following steps: The multimodal magnetic resonance image data is input into the data processing module to complete the data standardization process and obtain standardized data. Standardized data is fed into the shallow coding branch of the multi-axis convolutional feature extraction module to perform multi-scale three-dimensional spatial feature extraction and obtain shallow features. The shallow features are fed into the deep coding branch of the multi-axis convolutional feature extraction module to continue multi-axis convolutional feature extraction and output deep features. The features output from each layer of the multi-axis convolutional feature extraction module are fed into the cross-dimensional semantic alignment module to complete the semantic calibration and alignment of cross-layer features, and obtain the semantic alignment feature map. The shallow features, deep features, and semantically aligned feature maps extracted layer by layer by the multi-axis convolution feature extraction module are fed into the semantic segmentation prediction module to complete the segmentation prediction of the brain tumor region.

3. The medical image segmentation method for multimodal brain tumor data according to claim 2, characterized in that the data processing module performs standardization processing on the input multimodal magnetic resonance image data, including: performing spatial padding and center cropping on the input multimodal magnetic resonance image data to unify all images to a preset three-dimensional voxel size; performing random spatial flipping and imaging intensity perturbation on the standardized multimodal magnetic resonance image data, where the imaging intensity perturbation is calculated as follows: [formula not reproduced in this text]; in the formula, the symbols represent, respectively, the intensity value of the original medical image, the intensity value of the transformed image, the scaling factor, and the offset factor; and organizing the multimodal magnetic resonance image data processed as described above into a three-dimensional tensor according to the channel dimension, which is output as the standardized data.

4. The medical image segmentation method for multimodal brain tumor data according to claim 2, characterized in that, The multi-axis convolutional feature extraction module performs the following operations on the standardized data: Standardized data is downsampled through a convolutional layer to obtain a downsampled 3D feature map. The downsampled 3D feature map is input into a multi-axis convolutional feature extraction unit, and directional features are extracted along the three spatial axes through 7×1×1, 1×7×1 and 1×1×7 convolutional kernels respectively, to model the contextual dependencies of the 3D volume data in different directions. The convolution results in the three directions are added element by element to obtain a multi-axis aggregated feature map; Axial attention weighting is performed on the multi-axis aggregated feature map along the height, width, and depth axes respectively to obtain a weighted feature map; The weighted feature map is reconstructed using an asymmetric convolution kernel to complete channel semantic mapping and fusion, and the extracted features of this layer are output.

5. The medical image segmentation method for multimodal brain tumor data according to claim 4, characterized in that, The specific steps of the axial attention weighting include: performing global average pooling on the multi-axis aggregated feature map on the three corresponding axes of height, width, and depth respectively, and then normalizing the weights of the corresponding axes through the softmax function to obtain the attention weight map of each axis. The attention weight maps of the three axes are multiplied element-wise with the multi-axis aggregated feature map, and then weighted and fused to obtain a weighted feature map that incorporates multi-axis contextual information.

6. The medical image segmentation method for multimodal brain tumor data according to claim 2, characterized in that, The cross-dimensional semantic alignment module performs semantic calibration and alignment on the features output by each network layer of the multi-axis convolutional feature extraction module, specifically including: The features output from each layer of the multi-axis convolution feature extraction module are downsampled through a convolutional layer with a stride of 2 to obtain a downsampled feature map. The downsampled feature map is passed through two consecutive pointwise convolutional layers to perform channel expansion and compression mapping at a scaling factor of 2, resulting in a channel-mapped feature map. The channel mapping feature map is rearranged from the C×H×W×D dimension to H×W×D×C, and a learnable scaling parameter is used for weighted adjustment. Then it is restored to the C×H×W×D dimension to obtain the channel calibration feature map. The channel calibration feature map is used to perform residual feature mapping through residual units to output a semantic alignment feature map.

7. The medical image segmentation method for multimodal brain tumor data according to claim 2, characterized in that the semantic segmentation prediction module performs segmentation prediction on brain tumor regions, including: upsampling the deep features by transposed convolution so that their spatial dimensions match those of the shallow features, resulting in an upsampled deep feature map; fusing the semantically aligned feature map with the upsampled deep feature map, and inputting the fused feature into the decoding unit for upsampling and semantic recovery; refining the features output by the decoding unit through at least two convolutional layers to obtain a refined segmentation feature map; and outputting the segmentation result of the brain tumor region from the refined segmentation feature map through pointwise convolution.

8. The medical image segmentation method for multimodal brain tumor data according to claim 1, characterized in that, in step 2, the total loss function used in training the 3D network segmentation model is calculated as follows: [formula not reproduced in this text]; in the formula, the symbols represent, respectively: the total loss function; the consistency loss function of the multi-axis convolution module, used to constrain the semantic consistency between feature representations along different axes; and the weighting coefficient of the consistency loss, used to balance the axial consistency constraint against segmentation accuracy optimization.

9. The medical image segmentation method for multimodal brain tumor data according to claim 8, characterized in that the segmentation loss function is a weighted sum of the Dice loss function and the binary cross-entropy loss function, calculated as follows: [formula not reproduced in this text]; in the formula, the two loss weight coefficients are used to balance the optimization weights between the Dice loss and the binary cross-entropy loss; the Dice loss function is calculated as follows: [formulas not reproduced in this text]; in the formulas: Dice is the Dice similarity coefficient, which measures the voxel-level regional overlap between the predicted segmentation result and the ground-truth brain tumor segmentation annotation; N represents the total number of voxels; M represents the total number of categories; the remaining symbols represent, respectively, the smoothing term, the predicted probability output for the n-th voxel in the m-th class, and the one-hot encoded true label of the n-th voxel in the m-th class; the binary cross-entropy loss function is calculated as follows: [formula not reproduced in this text]; in the formula: N′ represents the total number of samples; the remaining symbols represent, respectively, the true probability value of the i-th sample and the model's predicted probability output for the i-th sample.

10. The medical image segmentation method for multimodal brain tumor data according to claim 8, characterized in that the consistency loss function of the multi-axis convolution module is calculated as follows: [formulas not reproduced in this text]; in the formulas: the consistency loss constrains the semantic consistency between feature representations of different axes by comparing the distance between features of different axes, reducing the semantic bias caused by directional differences; the L2 norm of a feature representation is used to normalize the axial features; Ω represents the set of axial feature pairs, defined as... where H is the feature map height, W is the feature map width, and D is the feature map depth; the two axial feature representations in each pair are obtained by aggregating along the p-th and q-th spatial dimensions, respectively; and the component term represents the value of the k-th feature component in the feature vector obtained by aggregating along the p-th spatial direction; Laxis, as an auxiliary constraint term during the training phase, together with the segmentation loss function constitutes the total loss function and updates the parameters of the multi-axis convolutional feature extraction module through backpropagation.