A Mamba-based CT metal artifact removal model
The invention addresses metal artifact removal in CT imaging with a Mamba-based model that combines multi-scale Mamba modules with a maximum average feedforward network, removing artifacts efficiently and preserving image quality with low resource consumption.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CENT SOUTH UNIV
- Filing Date
- 2026-03-13
- Publication Date
- 2026-06-12
AI Technical Summary
Existing methods for removing metal artifacts in CT imaging rely on sinogram-domain information, which is difficult to obtain. They also require large computational resources and damage the original boundary and structural information of the image.
The invention adopts a Mamba-based CT metal artifact removal model. A multi-scale Mamba module and a maximum average feedforward network module, combined with upsampling and downsampling mechanisms, learn feature map information in different orientations and combine salient and average information, so that metal artifacts are removed efficiently while the original anatomical structure of the image is preserved.
With a low computational burden and parameter count, the model removes metal artifacts effectively, maintains image quality, and improves the restoration of image detail.
Figure CN122199746A_ABST
Description
Technical Field
[0001] This invention relates to the field of medical artificial intelligence technology, specifically to a CT metal artifact removal model based on Mamba.
Background Technology
[0002] Computed Tomography (CT) is a widely used technique in modern clinical diagnosis. It can be used to obtain detailed information about tissues, organs, and pathological structures, thereby better assisting doctors in making diagnoses.
[0003] However, in clinical applications, metallic implants (such as dental and orthopedic implants and prostheses) can produce severe artifacts during CT imaging. These artifacts obscure structures in the CT image, reduce image quality, and may mislead physicians' clinical decisions. Traditional methods focus on repairing damaged information in the sinogram domain.
[0004] With the rapid development of deep learning, many models have emerged that use dual-domain processing (combining image-domain and sinogram-domain methods) or specialized single-domain augmentation techniques. However, most current methods face the following challenges: 1) Many methods rely on sinogram-domain information, which is not readily available in real-world scenarios.
[0005] 2) Cross-domain fusion processing of multi-domain information also brings a large computational burden and memory overhead, requiring a large amount of computing resources.
[0006] 3) Current metal artifact removal (MAR) methods often damage the original boundary and structural information of the image while removing metal artifacts, thereby reducing the original image quality.
[0007] Based on this, the present invention designs a CT metal artifact removal model based on Mamba to solve the above problems.
Summary of the Invention
[0008] The purpose of this invention is to provide a CT metal artifact removal model based on Mamba, which learns different orientation information of feature maps and combines saliency and average information to efficiently remove metal artifacts while maintaining the original anatomical structure of the image, thereby solving the problems mentioned in the background art.
[0009] To achieve the above objectives, the present invention provides the following technical solution: a Mamba-based CT metal artifact removal model, including:
The input processing module, which includes a 3×3 convolutional layer and receives CT images with metal artifacts. The 3×3 convolutional layer expands the number of channels from 1 to D. The input dimension of the CT image is B×1×H×W, where B represents the batch size, and H and W represent the height and width of the image, respectively.
The downsampling module, which processes the input feature map and includes multiple downsampling stages. After each stage, the number of channels of the feature map doubles and the height and width are halved.
The upsampling module, which processes the feature map output by the downsampling module to gradually restore image details and includes multiple upsampling stages. After each upsampling stage, the number of channels of the feature map is halved and the height and width are doubled.
The output reconstruction module, which includes a 3×3 convolutional layer that reduces the number of channels of the feature map output by the upsampling module back to 1 to obtain the predicted artifact feature map. This feature map is added element-wise to the original CT image with metal artifacts, which constrains the model to learn the artifact features. The final output is an artifact-free CT image with dimensions B×1×H×W, the same size as the input.
The upsampling and downsampling modules both include multi-scale Mamba modules, each consisting of a normalization layer, a flipped Mamba module (FMB), and a maximum average feedforward network module (AMFN) connected in sequence. This combination enables the model to remove artifacts efficiently and with high quality.
The flipped Mamba module scans the input feature map in multiple directions to learn feature map information from different directions, thereby capturing comprehensive contextual information. The maximum average feedforward network module performs nonlinear mapping and enhances the model's feature learning ability, improving artifact removal by fusing salient features and average features.
[0010] Preferably, the flipped Mamba module includes:
The feature input adjustment unit, which includes a 1×1 convolution and splits the input of dimension B×D×H×W into three parts;
The feature transformation unit, which flattens and transforms the feature map of each of the three parts into a feature map of size B×N×C, where N equals H×W;
The Mamba processing unit, which processes the three transformed feature maps separately. The first feature map is processed directly by Mamba, while the second and third feature maps are first flipped vertically and horizontally, respectively, and then processed by Mamba;
The feature aggregation unit, which aggregates the features of the three feature maps through element-wise multiplication;
The channel restoration unit, consisting of a 1×1 convolution, which restores the number of channels of the feature map.
[0011] The flipped Mamba module (FMB) uses Mamba to learn information from different orientations of an image, thereby alleviating the weakness of Mamba, a sequence model, on image data whose pixels have no inherent sequence order.
[0012] Preferably, the maximum average feedforward network module includes:
The maximum average module, which has two branches: the maximum pooling branch and the average pooling branch. The maximum pooling branch uses a maximum pooling layer to extract salient features of the image, while the average pooling branch uses an average pooling layer to extract overall information of the image;
The fusion unit, which fuses the outputs of the two branches by element-wise multiplication so that the different features interact.
[0013] Preferably, the maximum average module also includes a restoration unit. Since the maximum pooling layer and the average pooling layer downsample the feature map, in order to maintain the consistency of the height and width of the input and output feature maps, the height and width of the image are finally restored by interpolation.
[0014] A method for removing CT metal artifacts based on Mamba includes the following steps:
S1. In the first 3×3 convolutional layer, increase the number of channels of the input image from 1 to D, while keeping the other dimensions unchanged;
S2. Downsample the feature map stage by stage, doubling the number of channels and halving the height and width in each stage;
S3. Upsample the feature map stage by stage, halving the number of channels and doubling the height and width in each stage;
S4. Refine the feature map in the last stage of upsampling, while keeping the feature map dimensions unchanged;
S5. Reduce the number of channels from D to 1 using a 3×3 convolutional layer.
The multi-scale Mamba module is used in each stage of steps S2, S3 and S4.
[0015] Preferably, the number of channels D in steps S1 and S5 is 12.
[0016] Compared with the prior art, the beneficial effects of the present invention are as follows. The model processes information only in the image domain and does not require sinogram-domain data as input, making it more suitable for CT metal artifact removal in real-world scenarios. It also balances computational cost against restoration quality, achieving good artifact removal with a low parameter count, computational burden, and memory overhead. The model combines upsampling and downsampling mechanisms with a multi-scale Mamba module. Within that module, the flipped Mamba module flips the feature map and thereby learns features from different orientations, overcoming the limitations of conventional Mamba. The maximum average feedforward network module combines max pooling and average pooling to improve feature extraction, which improves the restoration of image detail while maintaining artifact removal capability.
Description of the Drawings
[0017] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0018] Figure 1 is a schematic diagram of the overall structure of the model of the present invention; Figure 2 is a schematic diagram of the structure of the multi-scale Mamba module of the present invention.
Detailed Description
[0019] The technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
Example
[0020] This invention proposes a single-domain CT metal artifact removal model based on Mamba. It employs a UNet network structure with the multi-scale Mamba module as its core. Within the multi-scale Mamba module, we designed a "flipped Mamba module" that combines information from different image orientations, and a "maximum average feedforward network module" that extracts and fuses salient and average features of the image. By learning feature map information from different orientations and combining salient and average information, the model efficiently learns detailed features of metal artifacts at various scales while maintaining the original anatomical structure of the image and balancing restoration quality against computational cost.
[0021] I. Model Structure: The model of this invention adopts a structure similar to UNet, as shown in Figure 1.
[0022] The model's input dimension is B×1×H×W, where B represents the batch size, and H and W represent the height and width of the image, respectively. Images with metallic artifacts are first passed through a 3×3 convolutional layer to expand the number of channels from 1 to D.
[0023] Then, the feature map passes through each downsampling stage in sequence. After each stage, the number of channels of the feature map doubles, and the height and width are halved. It then passes through the upsampling stages, in each of which the number of channels is halved and the height and width are doubled.
[0024] Finally, a 3×3 convolutional layer is used to reduce the number of channels from D back to 1, and element-wise addition is performed with the original input to constrain the model to learn artifact features; the final output image with artifacts removed is the same size as the input.
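The tensor shapes described in paragraphs [0022] to [0024] can be traced with a short sketch. It only does shape bookkeeping and contains no learned layers. The number of downsampling stages (`num_down=3`) and the stage names are assumptions made for illustration, since the text only says "multiple" stages; D = 12 follows the preferred value given in the specification.

```python
def trace_shapes(batch, height, width, base_channels=12, num_down=3):
    """Return (stage name, (B, C, H, W)) pairs for the encoder-decoder path.

    num_down=3 is an assumption; the specification only says "multiple" stages.
    """
    if height % (2 ** num_down) or width % (2 ** num_down):
        raise ValueError("height and width must be divisible by 2**num_down")

    shapes = [("input", (batch, 1, height, width))]

    # First 3x3 convolution: 1 channel -> D channels, spatial size unchanged.
    c, h, w = base_channels, height, width
    shapes.append(("conv3x3_in", (batch, c, h, w)))

    # Each downsampling stage doubles the channels and halves H and W.
    for i in range(num_down):
        c, h, w = c * 2, h // 2, w // 2
        shapes.append((f"down{i + 1}", (batch, c, h, w)))

    # Each upsampling stage halves the channels and doubles H and W.
    for i in range(num_down):
        c, h, w = c // 2, h * 2, w * 2
        shapes.append((f"up{i + 1}", (batch, c, h, w)))

    # Refinement keeps the dimensions; the last 3x3 convolution maps D -> 1.
    shapes.append(("refine", (batch, c, h, w)))
    shapes.append(("conv3x3_out", (batch, 1, h, w)))

    # Residual connection: predicted artifact map added to the input image.
    shapes.append(("output", (batch, 1, h, w)))
    return shapes


if __name__ == "__main__":
    for name, shape in trace_shapes(batch=8, height=256, width=256):
        print(f"{name:<12} {shape}")
```

Under these assumptions the deepest feature map for a 256×256 input has 96 channels at 32×32, and the output has the same shape as the input, which is what allows the element-wise addition in the final step.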
[0025] At each stage of the model, the core module is the multi-scale Mamba module. The module uses a "flipped Mamba module" to combine context and aggregates features through a "maximum average feedforward network module". By combining upsampling and downsampling mechanisms with the multi-scale Mamba module, the model can learn artifact features at different scales and better leverage the context combination capabilities of Mamba.
[0026] Mamba, as a sequence model, performs well in processing sequence data with long contexts. However, since the pixels of an image are not arranged in a certain order, images are not strictly sequence data, and Mamba usually does not perform well in image processing.
[0027] To alleviate this problem, a multi-scale Mamba module (MS-Mamba) was constructed using multiple Mamba instances. As shown in Figure 2, MS-Mamba consists of two submodules and a normalization layer.
[0028] The first submodule is the Flipped Mamba module (FMB). In FMB, we use Mamba to scan the same feature map in different directions to learn features from different perspectives.
[0029] Specifically, the input with dimensions B×D×H×W is first processed by a 1×1 convolution, and then split into three parts along the channel dimension. The feature map of each part is transformed into a feature map of size B×N×C through flattening and dimension transformation operations, where N equals H×W. The first part of the feature map is directly processed by Mamba, while the second and third parts of the feature map are first flipped vertically and horizontally, respectively, and then processed by Mamba. Finally, the features of the three feature maps are aggregated by element-wise multiplication, and the number of channels is restored by a 1×1 convolution.
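The effect of the two flips on the scan order can be illustrated with a small sketch on a single-channel grid. It is not the patented module: `toy_scan` is a hypothetical stand-in for Mamba (a causal running mean), the 1×1 convolutions are omitted, and the row-major flattening is an assumption. The specification also does not say whether the flipped branches are flipped back before aggregation, so the sketch multiplies the three sequences position by position as they are.

```python
def flatten(grid):
    """Row-major flattening of an H x W grid into a length H*W sequence."""
    return [value for row in grid for value in row]


def flip_vertical(grid):
    """Reverse the order of the rows (top <-> bottom)."""
    return grid[::-1]


def flip_horizontal(grid):
    """Reverse the order of the columns (left <-> right)."""
    return [row[::-1] for row in grid]


def toy_scan(sequence):
    """Hypothetical stand-in for Mamba: a causal running mean.

    Each output depends only on earlier positions, so the order in which
    pixels are visited determines which context each position can see.
    """
    out, total = [], 0.0
    for i, value in enumerate(sequence, start=1):
        total += value
        out.append(total / i)
    return out


def three_way_scan(parts):
    """parts: three H x W grids, one per channel group after the split."""
    direct, vertical, horizontal = parts
    sequences = [
        flatten(direct),                       # scanned as-is
        flatten(flip_vertical(vertical)),      # flipped vertically first
        flatten(flip_horizontal(horizontal)),  # flipped horizontally first
    ]
    scanned = [toy_scan(seq) for seq in sequences]
    # Aggregate the three branches by element-wise multiplication.
    return [a * b * c for a, b, c in zip(*scanned)]


if __name__ == "__main__":
    grid = [[1, 2], [3, 4]]
    print(flatten(grid))                   # visiting order without a flip
    print(flatten(flip_vertical(grid)))    # bottom row is visited first
    print(flatten(flip_horizontal(grid)))  # each row is visited right to left
```

Because the stand-in scan is causal, each branch gives the same pixel a different set of preceding pixels, which is the sense in which scanning flipped copies gathers context from several directions.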
[0030] FMB uses Mamba to learn information about different orientations of an image, thereby alleviating Mamba's problems in image processing.
[0031] The second submodule is the maximum average feedforward network module (AMFN) that we designed, which combines maximum pooling layers and average pooling layers.
[0032] This module is used to perform nonlinear mapping and enhance the model's feature learning ability. Its core part is the maximum average module (AMB). One branch uses a maximum pooling layer to extract salient features of the image, and the other branch uses an average pooling layer to extract overall information of the image. Then, the information from the two branches is fused through element-wise multiplication to allow different features to interact.
[0033] Since the max pooling and average pooling layers downsample the feature maps, in order to maintain the consistency of the height and width of the input and output feature maps, the height and width of the image are finally restored by interpolation.
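The maximum average module described above (two pooling branches, element-wise multiplication, then interpolation back to the input size) can be sketched on a single-channel grid as follows. The 2×2 window with stride 2 and the nearest-neighbour interpolation are assumptions chosen for illustration; the specification does not state the pooling size or the interpolation mode, and any learned layers around the module are omitted.

```python
def pool2x2(grid, reducer):
    """Apply 2x2, stride-2 pooling using the given reducer (e.g. max or mean)."""
    h, w = len(grid), len(grid[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    return [
        [
            reducer([grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1]])
            for c in range(0, w, 2)
        ]
        for r in range(0, h, 2)
    ]


def mean(values):
    return sum(values) / len(values)


def upsample_nearest(grid, factor=2):
    """Nearest-neighbour interpolation: repeat every value factor x factor times."""
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def max_avg_block(grid):
    salient = pool2x2(grid, max)    # max pooling branch: salient features
    overall = pool2x2(grid, mean)   # average pooling branch: overall information
    # Fuse the two branches by element-wise multiplication.
    fused = [
        [s * o for s, o in zip(s_row, o_row)]
        for s_row, o_row in zip(salient, overall)
    ]
    # Restore the original height and width by interpolation.
    return upsample_nearest(fused, factor=2)


if __name__ == "__main__":
    feature_map = [[1, 2, 0, 0], [3, 4, 0, 8]]
    for row in max_avg_block(feature_map):
        print(row)
```

The output has the same height and width as the input, so the module can be placed inside a feedforward path without changing the feature map dimensions.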
[0034] This model uses a combination of two loss functions. The first is the Pseudo-Huber loss, in which Y represents the artifact-free reference image, I represents the model's output, and c is an adjustable coefficient set to 0.03.
[0035] The second is the LPIPS loss, which uses a pre-trained VGG model to calculate the perceptual difference between the two images. The two loss functions are combined using the weights α and β.
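The formulas for the two losses are not preserved in this text, so the following is only a sketch under stated assumptions. It uses one common form of the Pseudo-Huber loss, sqrt(d² + c²) − c averaged over pixels with d = Y − I, and assumes the combination is the weighted sum α·L_PH + β·L_LPIPS. Both forms, and the function names, are assumptions rather than the patent's definitions. The LPIPS value is passed in as a number because computing it requires a pre-trained VGG network.

```python
import math


def pseudo_huber(target, output, c=0.03):
    """Mean of sqrt(d^2 + c^2) - c over all pixels, with d = target - output.

    Assumed form; the specification's own formula is not available here.
    """
    diffs = [
        t - o
        for t_row, o_row in zip(target, output)
        for t, o in zip(t_row, o_row)
    ]
    return sum(math.sqrt(d * d + c * c) - c for d in diffs) / len(diffs)


def combined_loss(target, output, lpips_value, alpha=0.8, beta=0.2, c=0.03):
    """Assumed weighted sum: alpha * Pseudo-Huber + beta * LPIPS."""
    return alpha * pseudo_huber(target, output, c) + beta * lpips_value


if __name__ == "__main__":
    clean = [[0.0, 0.5], [1.0, 0.25]]
    predicted = [[0.04, 0.5], [1.0, 0.25]]
    # lpips_value=0.1 is a made-up placeholder, not a measured result.
    print(combined_loss(clean, predicted, lpips_value=0.1))
```

In this form the loss behaves like a squared error for differences much smaller than c and like an absolute error for larger ones, which is the usual motivation for choosing a Pseudo-Huber term.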
[0036] II. Experiment Setup
1. Dataset
SynDeepLesion: SynDeepLesion is synthesized from 1200 images selected from DeepLesion and is divided into a training set of 1000 image pairs and a test set of 200 image pairs. The metal sizes in the test set are [2061, 890, 881, 451, 254, 124, 118, 112, 53, 35]. We divide them into four groups: large [2061, 890, 881], medium [451, 254], small [124, 118, 112], and tiny [53, 35].
[0037] 2. Baseline and Evaluation Indicators
In our experiments, we used the following comparisons to evaluate the model's performance: (1) restoration quality for different metal sizes on the test set, evaluated only on the area outside the metal region; (2) full-image evaluation, including the metal regions, together with parameter count and inference time. The evaluation metrics are PSNR, SSIM and RMSE, and the baseline model is OSCNet+ [1].
[0038] 3. Implementation details
The model is configured as follows: the input channel count is 1, the first 3×3 convolutional layer has 12 output channels, and the number of multi-scale Mamba channels in each layer is set to [1, 2, 2, 4, 2, 2, 1, 1]. The loss function uses α = 0.8 and c = 0.03, and the LPIPS loss uses a pre-trained VGG model with β set to 0.2.
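The difference between the two evaluation settings (outside the metal region only, versus the full image) can be sketched with masked RMSE and PSNR. The intensity range `data_range=1.0` and the boolean mask layout are assumptions for illustration; the specification does not state how the images are normalized or how the metal mask is stored. SSIM is left out because it needs a windowed implementation.

```python
import math


def masked_rmse(reference, restored, metal_mask=None):
    """RMSE over pixels where metal_mask is False (all pixels if no mask)."""
    squared_errors = []
    for r, row in enumerate(reference):
        for c, ref in enumerate(row):
            if metal_mask is not None and metal_mask[r][c]:
                continue  # skip pixels inside the metal region
            squared_errors.append((ref - restored[r][c]) ** 2)
    if not squared_errors:
        raise ValueError("mask excludes every pixel")
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def masked_psnr(reference, restored, metal_mask=None, data_range=1.0):
    """PSNR in dB computed from the (optionally masked) RMSE."""
    rmse = masked_rmse(reference, restored, metal_mask)
    if rmse == 0:
        return math.inf
    return 20 * math.log10(data_range / rmse)


if __name__ == "__main__":
    # Toy values only; the large error sits inside the metal region.
    reference = [[0.0, 0.0], [0.0, 0.0]]
    restored = [[0.1, 0.1], [0.1, 0.9]]
    metal = [[False, False], [False, True]]
    print("outside metal:", masked_psnr(reference, restored, metal))
    print("full image:   ", masked_psnr(reference, restored))
```

Excluding the metal pixels keeps errors inside the implant itself from dominating the score, which is why the two settings are reported separately.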
[0039] The model was trained and evaluated on a single NVIDIA RTX 3090 GPU, using PyTorch version 2.3.0 and CUDA version 12.1. The Adam optimizer and CosineAnnealingLR learning rate scheduling strategy were used during model training.
[0040] Training is divided into three phases. The first phase has 100,000 iterations, an image size of 256×256, and a batch size of 8. The second phase has 200,000 iterations, an image size of 336×336, and a batch size of 4. The last phase has 20,000 iterations, an image size of 416×416, and a batch size of 2. The initial learning rate for all three phases is set to 0.0002, T_max is set to 1000, and the minimum learning rate is set to 1e-8.
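The schedule described in paragraph [0040] can be sketched with the standard closed-form cosine annealing expression, lr(t) = lr_min + (lr_0 − lr_min)(1 + cos(πt / T_max)) / 2. The sketch assumes the scheduler is stepped once per iteration, which the text does not state, and the `PHASES` list is only a hypothetical way of writing down the three phases.

```python
import math

# The three training phases from the description; the dictionary keys are
# illustrative names, not identifiers from the original implementation.
PHASES = [
    {"iterations": 100_000, "image_size": 256, "batch_size": 8},
    {"iterations": 200_000, "image_size": 336, "batch_size": 4},
    {"iterations": 20_000, "image_size": 416, "batch_size": 2},
]


def cosine_annealing_lr(step, base_lr=2e-4, eta_min=1e-8, t_max=1000):
    """Closed-form cosine annealing learning rate at a given step."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max)) / 2


if __name__ == "__main__":
    for step in (0, 500, 1000, 1500, 2000):
        print(step, cosine_annealing_lr(step))
    print("total iterations:", sum(p["iterations"] for p in PHASES))
```

With T_max = 1000 this expression reaches the minimum at step 1000 and returns to the initial value at step 2000, so over phases of tens of thousands of iterations the learning rate would oscillate rather than decay once, assuming per-iteration stepping.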
[0041] 4. Experimental Results
4.1 Quantitative Analysis
According to Table 1, our model outperforms OSCNet+ [1] in restoration quality for all metal sizes. In particular, for large metal sizes, our model improves PSNR by 37% and reduces RMSE by 78% relative to OSCNet+, indicating that it is suitable for removing metal artifacts of any size, including large ones.
[0042] Table 1: Comparison of repair effects for different metal sizes. The area outside the metal was evaluated using PSNR / SSIM / RMSE as indicators.
[0043] According to Table 2, although our model's inference time is 5 milliseconds longer than that of OSCNet+ [1], it has 53% fewer parameters, a 13% higher PSNR, an SSIM higher by 0.0029, and a 53% lower RMSE. Given the improved results, the slightly longer inference time is acceptable.
[0044] Table 2: Comparative Experiment Results of Full-Map Evaluation (Including Metallic Areas)
[0045] 4.2 Qualitative Analysis
We have developed a Mamba-based model for removing CT metal artifacts. By using Mamba to scan feature maps from different orientations, capturing comprehensive context, and fusing salient and global features from the feature maps, our model can efficiently and flexibly remove metal artifacts of various sizes while preserving the original anatomical structure of the image. Experimental results show that our model achieves a good balance between performance and effectiveness, and performs well in both synthetic datasets and real-world scenarios.
[0046] In the description of this specification, references to terms such as "an embodiment," "example," "specific example," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.
[0047] The preferred embodiments of the present invention disclosed above are merely illustrative of the invention. These preferred embodiments do not exhaustively describe all details, nor do they limit the invention to any specific implementation. Clearly, many modifications and variations can be made based on the content of this specification. This specification selects and specifically describes these embodiments to better explain the principles and practical applications of the invention, thereby enabling those skilled in the art to better understand and utilize the invention. The invention is limited only by the claims and their full scope and equivalents.
Claims
1. A CT metal artifact removal model based on Mamba, characterized in that it includes:
The input processing module, which includes a 3×3 convolutional layer for receiving CT images with metal artifacts and expands the number of channels from 1 to D through the 3×3 convolutional layer;
The downsampling module, which processes the input feature map and includes multiple downsampling stages, wherein after each stage the number of channels of the feature map doubles and the height and width are halved;
The upsampling module, which processes the feature map output by the downsampling module to gradually restore image details and includes multiple upsampling stages, wherein after each upsampling stage the number of channels of the feature map is halved and the height and width are doubled;
The output reconstruction module, which includes a 3×3 convolutional layer that reduces the number of channels of the feature map output by the upsampling module back to 1, and adds the obtained feature map element-wise to the original CT image with metal artifacts to constrain the model to learn artifact features and output an artifact-free CT image;
The upsampling module and the downsampling module both include a multi-scale Mamba module, which includes a normalization layer, a flipped Mamba module and a maximum average feedforward network module connected in sequence;
The flipped Mamba module scans the input feature map in multiple directions to learn feature map information in different directions, thereby capturing comprehensive contextual information;
The maximum average feedforward network module performs nonlinear mapping and enhances the model's feature learning ability.
2. The CT metal artifact removal model based on Mamba according to claim 1, characterized in that the flipped Mamba module includes:
The feature input adjustment unit, which includes a 1×1 convolution and splits the input of dimension B×D×H×W into three parts;
The feature transformation unit, which flattens and transforms the feature map of each of the three parts into a feature map of size B×N×C, where N equals H×W;
The Mamba processing unit, which processes the three transformed feature maps separately, wherein the first feature map is processed directly by Mamba, while the second and third feature maps are flipped vertically and horizontally, respectively;
The feature aggregation unit, which aggregates the features of the three feature maps through element-wise multiplication;
The channel restoration unit, consisting of a 1×1 convolution, which restores the number of channels of the feature map.
3. The CT metal artifact removal model based on Mamba according to claim 1, characterized in that the maximum average feedforward network module includes:
The maximum average module, which has two branches: the maximum pooling branch and the average pooling branch, wherein the maximum pooling branch uses a maximum pooling layer to extract salient features of the image, while the average pooling branch uses an average pooling layer to extract overall information of the image;
The fusion unit, which fuses the outputs of the two branches by element-wise multiplication so that the different features interact.
4. The CT metal artifact removal model based on Mamba according to claim 3, characterized in that the maximum average module also includes a restoration unit that restores the height and width of the image through interpolation.
5. A method for removing CT metal artifacts based on Mamba, characterized in that it includes the following steps:
S1. In the first 3×3 convolutional layer, increase the number of channels of the input image from 1 to D, while keeping the other dimensions unchanged;
S2. Downsample the feature map stage by stage, doubling the number of channels and halving the height and width in each stage;
S3. Upsample the feature map stage by stage, halving the number of channels and doubling the height and width in each stage;
S4. Refine the feature map in the last stage of upsampling, while keeping the feature map dimensions unchanged;
S5. Reduce the number of channels from D to 1 using a 3×3 convolutional layer;
wherein the multi-scale Mamba module is used in each stage of steps S2, S3 and S4.
6. The method for removing CT metal artifacts based on Mamba according to claim 5, characterized in that the number of channels D in steps S1 and S5 is 12.