A medical image classification method and system

By integrating a lightweight convolutional neural network with a multi-plane Transformer encoding structure, the problems of high model complexity and large number of parameters in existing technologies are solved, achieving efficient medical image classification in resource-constrained environments and improving diagnostic accuracy and computational efficiency.

CN122244558A · Pending · Publication Date: 2026-06-19 · UNIV OF JINAN

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: UNIV OF JINAN
Filing Date: 2026-04-30
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing medical image classification methods have complex model structures and large parameter counts, which makes them difficult to deploy on mobile devices and in other resource-constrained environments. In addition, feature fusion between different planes is insufficient, and effective modeling of global dependencies is lacking. Existing technologies therefore do not fully exploit three-dimensional spatial correlation, which makes efficient diagnosis in resource-constrained environments difficult.

Method used

By fusing a lightweight convolutional neural network with a Transformer coding structure designed for multi-plane processing, efficient modeling and classification of 3D medical images can be achieved through multi-plane slicing, local feature extraction, global feature embedding, and dynamic integration.

Benefits of technology

It significantly reduces the number of model parameters and computational complexity, while improving classification accuracy and computational efficiency, enabling efficient medical image diagnosis in resource-constrained environments.

✦ Generated by Eureka AI based on patent content.

[Figure: CN122244558A_ABST]

Abstract

This invention belongs to the field of medical image processing and artificial intelligence technology, specifically relating to a medical image classification method and system, comprising: acquiring a three-dimensional medical image to be classified; performing multi-planar slicing processing on the acquired three-dimensional medical image to obtain a two-dimensional slice sequence; extracting local features from the obtained two-dimensional slice sequence to obtain a two-dimensional feature map; embedding features into the obtained two-dimensional feature map to obtain a structured token sequence; capturing the global dependencies between slices in each plane based on the obtained structured token sequence, multi-head self-attention mechanism, and residual structure to obtain global semantic features for each plane; dynamically integrating the obtained global semantic features for each plane to obtain a three-dimensional global feature vector; and classifying the three-dimensional medical image based on the obtained three-dimensional global feature vector to complete the medical image classification.

Description

Technical Field

[0001] This invention belongs to the field of medical image processing and artificial intelligence technology, specifically relating to a medical image classification method and system.

Background Technology

[0002] The statements in this section are merely background information related to the present invention and do not necessarily constitute prior art.

[0003] With the rapid development of deep learning technology, Convolutional Neural Networks (CNNs) have become the mainstream method in the field of medical image analysis. For three-dimensional medical images (such as MRI and CT), traditional methods usually use three-dimensional convolutional neural networks (3D CNNs), such as 3D-ResNet and 3D-DenseNet, to extract voxel-level features by performing three-dimensional convolution in the spatial dimension. These networks can directly model the spatial information of volumetric data, but they have a large number of parameters, high computational complexity, and extremely high requirements for graphics memory and computing power, which is not conducive to deployment on clinical equipment or mobile devices.

[0004] To reduce model complexity, a multi-plane learning (MPL) strategy based on two-dimensional slices can be adopted. This involves slicing the 3D medical image along different planes (such as axial, coronal, and sagittal), extracting features from each slice using a 2D network, and then fusing them into a holistic representation. Such methods (e.g., Multi-Plane ResNet) can significantly reduce computational cost while preserving some 3D structural information. However, most existing multi-plane networks rely on a heavyweight feature extraction backbone (e.g., ResNet50, EfficientNet), resulting in a large number of parameters. Furthermore, feature fusion between different planes often involves simple concatenation or averaging and lacks effective global dependency modeling.

[0005] With the rise of Transformer models in the field of vision, structures such as Vision Transformer (ViT) and Swin Transformer have been introduced into medical image analysis tasks. Transformers excel at capturing long-range dependencies and have potential advantages in slice-level feature fusion. However, directly using Transformers for medical image classification still has the following shortcomings: native Transformers usually require large-scale training data and computational power, making them difficult to transfer directly to the medical field; they lack the ability of convolution to model local textures, resulting in poor performance when directly processing raw slices; existing Transformers mostly model only a single plane or single view, failing to fully utilize three-dimensional multi-view information; and they often increase model size to improve accuracy, making it difficult to balance high performance and low computational cost.

[0006] Multi-Plane and Multi-Slice Transformer (M3T) frameworks have made some progress in medical image classification, achieving strong 3D feature representation through planar segmentation and slice sequence modeling. However, M3T still uses multiple ResNet50 networks as front-end encoders, resulting in a high overall parameter count and computational complexity, which is not conducive to deployment in real-world scenarios.

[0007] In summary, existing medical image classification methods have complex model structures and a large number of parameters, making them difficult to apply in resource-constrained environments; they lack sufficient fusion of features across different planes, making it difficult to fully express three-dimensional spatial relationships; and they lack a unified framework that balances lightweight design and accuracy.

[0008] Therefore, there is an urgent need for a medical image classification method that can significantly reduce computational complexity while maintaining high accuracy and achieving deep fusion of multi-plane features, in order to meet the needs of efficient and intelligent diagnosis of brain diseases (such as Alzheimer's disease and Moyamoya disease).

Summary of the Invention

[0009] To address the aforementioned issues, this invention proposes a medical image classification method and system. By innovatively integrating a lightweight convolutional neural network with a Transformer encoding structure designed specifically for multi-plane models, it significantly reduces the number of model parameters and computational complexity while achieving comprehensive and efficient modeling of spatial dependencies within and between planes in three-dimensional medical images. This results in a substantial improvement in computational efficiency and deployment flexibility while ensuring or even enhancing classification accuracy.

[0010] According to some embodiments, the first aspect of the present invention provides a medical image classification method, which adopts the following technical solution:
A medical image classification method, comprising:
acquiring a three-dimensional medical image to be classified;
performing multi-planar slicing on the acquired three-dimensional medical image to obtain a two-dimensional slice sequence;
extracting local features from the obtained two-dimensional slice sequence to obtain a two-dimensional feature map;
performing feature embedding on the obtained two-dimensional feature map to obtain a structured token sequence;
capturing the global dependencies between slices in each plane based on the obtained structured token sequence, a multi-head self-attention mechanism, and a residual structure to obtain global semantic features for each plane;
dynamically integrating the obtained global semantic features of each plane to obtain a three-dimensional global feature vector; and
classifying the three-dimensional medical image based on the obtained three-dimensional global feature vector to complete the medical image classification.

[0011] As a further technical limitation, the obtained three-dimensional medical images are sliced along three orthogonal directions (axial, coronal, and sagittal) to obtain two-dimensional slice sequences in the three directions respectively.

[0012] As a further technical limitation, the obtained two-dimensional slice sequences are input into a lightweight feature extraction network to extract local features from each two-dimensional slice sequence. The extracted local features constitute a two-dimensional feature map. The lightweight feature extraction network is built on MobileNetV2 and includes a depthwise separable convolutional layer, an inverted residual structure, and a convolutional layer for feature dimension mapping.

[0013] Furthermore, the inverted residual structure includes three stages: pointwise convolution, depthwise convolution, and linear convolution. The feature calculation formula is:

$$F_{i+1} = W_i^{(3)} \cdot \sigma\left(\mathrm{BN}\left(W_i^{(2)}\, \sigma\left(\mathrm{BN}\left(W_i^{(1)} F_i\right)\right)\right)\right) + F_i$$

where $W_i^{(1)}$, $W_i^{(2)}$, and $W_i^{(3)}$ are the weights of the pointwise convolution, depthwise convolution, and linear convolution, respectively; $\sigma(\cdot)$ is a non-linear activation function; $\mathrm{BN}(\cdot)$ is batch normalization; and $F_i$ and $F_{i+1}$ are the input feature map and the output feature map, respectively.

[0014] As a further technical limitation, the global semantic features of each plane are combined by attention weighting or vector concatenation so that the spatial semantics of the planes complement and reinforce one another, yielding a three-dimensional global feature vector.

[0015] As a further technical limitation, the obtained three-dimensional global feature vector is input into a linear classification head, which maps it to the category dimension of the medical image. The Softmax function then converts the mapped output into class probabilities to obtain the classification result of the three-dimensional medical image.

[0016] According to some embodiments, a second aspect of the present invention provides a medical image classification system, employing the following technical solution:
A medical image classification system, comprising:
an acquisition module configured to acquire a three-dimensional medical image to be classified;
a processing module configured to perform multi-planar slicing on the acquired three-dimensional medical image to obtain a two-dimensional slice sequence;
an extraction module configured to extract local features from the obtained two-dimensional slice sequence to obtain a two-dimensional feature map;
an embedding module configured to perform feature embedding on the obtained two-dimensional feature map to obtain a structured token sequence;
a capture module configured to capture the global dependencies between slices in each plane based on the obtained structured token sequence, a multi-head self-attention mechanism, and a residual structure, so as to obtain the global semantic features of each plane;
an integration module configured to dynamically integrate the obtained global semantic features of each plane to obtain a three-dimensional global feature vector; and
a classification module configured to classify the three-dimensional medical image based on the obtained three-dimensional global feature vector, thereby completing the medical image classification.

[0017] According to some embodiments, a third aspect of the present invention provides a computer-readable storage medium, employing the following technical solution: A computer-readable storage medium having a program stored thereon, which, when executed by a processor, implements the steps in the medical image classification method as described in the first aspect of the present invention.

[0018] According to some embodiments, the fourth aspect of the present invention provides an electronic device, which adopts the following technical solution: An electronic device includes a memory, a processor, and a program stored in the memory and running on the processor, wherein the processor executes the program to implement the steps in the medical image classification method as described in the first aspect of the present invention.

[0019] According to some embodiments, the fifth aspect of the present invention provides a computer program product, which adopts the following technical solution: A computer program product includes software code, wherein the program in the software code performs the steps in the medical image classification method as described in the first aspect of the present invention.

[0020] Compared with the prior art, the beneficial effects of the present invention are as follows: This invention constructs a hierarchical "local-global-cross-viewpoint" feature extraction and fusion framework. It efficiently extracts fine texture and local morphological features within slices through a MobileNetV2 encoder. Each independent Transformer encoder uses its self-attention mechanism to capture the long-range spatial dependencies between different slices within the same anatomical plane. Combined with a multi-plane adaptive fusion module, it dynamically integrates complementary information from three orthogonal viewpoints through learnable weights, simulating the way radiologists rotate and observe three-dimensional images, and achieving three-dimensional spatial understanding. The multi-level modeling strategy from local to global and from intra-plane to inter-plane significantly improves the representation ability and classification accuracy for complex three-dimensional lesions.

Description of the Drawings

[0021] The accompanying drawings, which form part of this embodiment, are used to provide a further understanding of this embodiment. The illustrative embodiments and their descriptions are used to explain this embodiment and do not constitute an improper limitation of this embodiment.

[0022] Figure 1 is a flowchart of a medical image classification method according to Embodiment 1 of the present invention;
Figure 2 is an overall flowchart of the lightweight medical image classification method based on multi-planar MobileNet-Transformer fusion in Embodiment 1 of the present invention;
Figure 3 is a schematic diagram of multi-planar slicing of a three-dimensional medical image along the axial, coronal, and sagittal planes in Embodiment 1 of the present invention;
Figure 4 is a schematic diagram of the structure of the improved MobileNetV2 feature extraction module (MobileNet Embedding encoder) in Embodiment 1 of the present invention;
Figure 5 is a schematic diagram of the Transformer encoding and multi-plane semantic fusion module in Embodiment 1 of the present invention;
Figure 6 is a schematic diagram of the multi-plane feature fusion and classification module in Embodiment 1 of the present invention;
Figure 7 is a structural block diagram of a medical image classification system according to Embodiment 2 of the present invention.

Detailed Implementation

[0023] The present invention will be further described below with reference to the accompanying drawings and embodiments.

[0024] It should be noted that the following detailed descriptions are exemplary and intended to provide further illustration of the invention. Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains.

[0025] It should be noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of exemplary embodiments according to the invention. As used herein, the singular form is intended to include the plural form as well, unless the context clearly indicates otherwise. Furthermore, it should be understood that when the terms "comprising" and / or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and / or combinations thereof.

[0026] In this invention, terms such as "upper," "lower," "left," "right," "front," "back," "vertical," "horizontal," "side," and "bottom" indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. These terms are used only to facilitate the description of the structural relationships of the various components or elements of this invention and do not specifically refer to any component or element in this invention. They should not be construed as limiting the invention.

[0027] In this invention, terms such as "fixed connection," "connected," and "linked" should be interpreted broadly, indicating a fixed connection, an integral connection, or a detachable connection; a direct connection or an indirect connection through an intermediate medium. Those skilled in the art can determine the specific meaning of these terms in this invention based on the specific circumstances, and they should not be construed as limitations on the invention.

[0028] Where there is no conflict, the embodiments and features in the embodiments of the present invention can be combined with each other.

Example 1

[0029] Embodiment 1 of this invention introduces a medical image classification method.

[0030] As shown in Figures 1 and 2, a medical image classification method includes:
acquiring a three-dimensional medical image to be classified;
performing multi-planar slicing on the acquired three-dimensional medical image to obtain a two-dimensional slice sequence;
extracting local features from the obtained two-dimensional slice sequence to obtain a two-dimensional feature map;
performing feature embedding on the obtained two-dimensional feature map to obtain a structured token sequence;
capturing the global dependencies between slices in each plane based on the obtained structured token sequence, a multi-head self-attention mechanism, and a residual structure to obtain global semantic features for each plane;
dynamically integrating the obtained global semantic features of each plane to obtain a three-dimensional global feature vector; and
classifying the three-dimensional medical image based on the obtained three-dimensional global feature vector to complete the medical image classification.

[0031] This embodiment takes 3D medical image volume data as input. First, it performs multi-planar slicing operations along three orthogonal directions: axial, coronal, and sagittal, generating three sets of 2D slice sequences. The multi-planar slicing strategy can preserve 3D spatial structural information while significantly reducing the computational burden of 3D convolution. The slice sequence in each direction is used as an independent input channel into the lightweight feature extraction module.

[0032] The input is the patient's three-dimensional medical image volume data $V \in \mathbb{R}^{C \times D \times H \times W}$, where C is the number of channels (usually 1), and D, H, and W represent depth, height, and width, respectively. The method can adapt to images of different modalities, such as MRI and CT. Preprocessing includes: normalizing the image by mapping pixel values to the range [0, 1]; optionally applying contrast enhancement, histogram equalization, or noise removal (such as Gaussian filtering and median filtering) to improve slice quality; resampling the 3D volume data to a standard size, such as 256×256×256 or 224×224×N, to ensure consistency of data from different cases; and optionally using a brain mask or ROI extraction to retain only the target region and improve the feature signal-to-noise ratio.
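
The normalization step above can be sketched as follows. This is a minimal illustration, assuming plain min-max scaling over a flat list of voxel intensities; the patent does not state which normalization formula is used, and the function name `min_max_normalize` is hypothetical.

```python
def min_max_normalize(values):
    """Map intensities to [0, 1] by min-max scaling (assumed normalization rule)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant image carries no contrast; map every voxel to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Example: three voxel intensities on an arbitrary scanner scale.
normalized = min_max_normalize([0, 50, 100])
```

In practice the same scaling would be applied per volume before slicing, so that all three slice directions share one intensity range.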

[0033] As shown in Figure 3, in this embodiment, multi-planar slice extraction takes slices along the axial, coronal, and sagittal planes, generating a two-dimensional slice sequence in each direction. The slice spacing can be a fixed value (e.g., 1–3 mm) or an adaptive interval. An optional key slice extraction strategy uses traditional segmentation or pre-trained networks to select slices containing key structures, improving information density. The slice sequences in each direction can be given a uniform length (e.g., N slices); sequences that are too short can be padded by repetition or interpolation.
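
The multi-planar slicing step can be sketched in plain Python as follows. This is a minimal illustration, assuming the volume is a nested list indexed as `volume[d][h][w]` and that the D, H, and W axes correspond to the axial, coronal, and sagittal directions; that axis convention, the function names `slice_planes` and `pad_by_repetition`, and the choice to repeat the last slice when padding are assumptions rather than details given in the patent.

```python
def slice_planes(volume):
    """Return axial, coronal, and sagittal 2D slice sequences of a D x H x W volume."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    # Axial: one H x W slice per depth index.
    axial = [[row[:] for row in volume[d]] for d in range(depth)]
    # Coronal: one D x W slice per height index.
    coronal = [[volume[d][h][:] for d in range(depth)] for h in range(height)]
    # Sagittal: one D x H slice per width index.
    sagittal = [[[volume[d][h][w] for h in range(height)] for d in range(depth)]
                for w in range(width)]
    return axial, coronal, sagittal


def pad_by_repetition(slices, target_len):
    """Truncate or pad a slice sequence to target_len by repeating the last slice."""
    if not slices:
        raise ValueError("cannot pad an empty slice sequence")
    padded = list(slices[:target_len])
    while len(padded) < target_len:
        padded.append(padded[-1])
    return padded


# Toy volume with D=2, H=3, W=4; each voxel value encodes its (d, h, w) index.
volume = [[[100 * d + 10 * h + w for w in range(4)] for h in range(3)]
          for d in range(2)]
axial, coronal, sagittal = slice_planes(volume)
axial_padded = pad_by_repetition(axial, 5)
```

Each of the three returned sequences would then be fed to the feature extractor as its own input stream.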

[0034] As shown in Figure 4, this embodiment improves the traditional MobileNetV2 network structure by adding a 1×1 convolutional mapping layer at the end of the network for Transformer alignment and a grayscale channel replication structure at the input layer, forming a lightweight encoder suitable for medical images, the MobileNet Embedding module. This module is based on MobileNetV2 and retains its depthwise separable convolution and inverted residual structure to balance the model's feature representation ability and computational efficiency.

[0035] Assume the input two-dimensional medical slice is $X \in \mathbb{R}^{1 \times H \times W}$, where H and W represent the height and width of the slice, respectively. Since medical images are typically single-channel grayscale data, a channel replication operation is used during the input stage to expand them into a three-channel input, i.e., $X' = \mathrm{Repeat}(X, 3) \in \mathbb{R}^{3 \times H \times W}$. This operation not only preserves the structural information of the input image, but also enables the encoder to be compatible with MobileNetV2 weights pre-trained on natural images, thereby improving transfer learning performance.

[0036] In this embodiment, the expanded slices are sequentially passed through multiple inverted residual modules. Each inverted residual module includes three stages: pointwise convolution, depthwise convolution, and linear convolution. The feature calculation formula is:

$$F_{i+1} = W_i^{(3)} \cdot \sigma\left(\mathrm{BN}\left(W_i^{(2)}\, \sigma\left(\mathrm{BN}\left(W_i^{(1)} F_i\right)\right)\right)\right) + F_i$$

where $W_i^{(1)}$, $W_i^{(2)}$, and $W_i^{(3)}$ are the pointwise convolution, depthwise convolution, and linear convolution weights of this module, respectively; $\sigma(\cdot)$ is the nonlinear activation function ReLU6; $\mathrm{BN}(\cdot)$ represents batch normalization; and $F_i$ and $F_{i+1}$ are the input and output feature maps, respectively.

[0037] The above structure achieves efficient feature extraction through depthwise separable convolution, effectively reducing the number of convolution kernel parameters and floating-point operations.
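
The parameter saving from depthwise separable convolution can be checked with simple counting. The sketch below uses the standard weight-count formulas for bias-free convolutions; the channel sizes (32 and 64), the 3×3 kernel, and the expansion factor of 6 are illustrative assumptions, not values taken from the patent, and batch-normalization parameters are ignored.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (bias omitted)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Depthwise kernel x kernel weights plus pointwise 1x1 weights (bias omitted)."""
    return kernel * kernel * c_in + c_in * c_out


def inverted_residual_params(c_in, c_out, expansion, kernel=3):
    """Pointwise expansion, depthwise, and linear projection weights of one block."""
    hidden = c_in * expansion
    return c_in * hidden + kernel * kernel * hidden + hidden * c_out


standard = standard_conv_params(3, 32, 64)
separable = depthwise_separable_params(3, 32, 64)
block = inverted_residual_params(32, 32, expansion=6)
saving_ratio = standard / separable
```

For these assumed sizes the separable layer needs roughly one eighth of the weights of the standard layer, which is the kind of reduction the paragraph above refers to.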

[0038] At the end of the encoder, this embodiment adds a 1×1 convolutional mapping layer (Conv Projection) to map the convolutional feature map to the embedding dimension $d_{emb}$ required by the Transformer. That is, given $F_{out} \in \mathbb{R}^{C \times H' \times W'}$, the projection is $E = W_p F_{out}$ with $W_p \in \mathbb{R}^{d_{emb} \times C \times 1 \times 1}$, where $F_{out}$ is the output feature map of MobileNet and $E \in \mathbb{R}^{d_{emb} \times H' \times W'}$ is the embedded feature after mapping.
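
A 1×1 convolution is a per-pixel linear map over channels, which the following sketch makes explicit. It assumes a bias-free projection and stores the feature map as a nested list indexed `[channel][row][col]`; the function name `conv1x1` and the toy weights are hypothetical.

```python
def conv1x1(feature_map, weight):
    """Apply a bias-free 1x1 convolution: E[o][i][j] = sum_c W[o][c] * F[c][i][j]."""
    channels = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    if any(len(w_row) != channels for w_row in weight):
        raise ValueError("each weight row must have one entry per input channel")
    return [[[sum(w_row[c] * feature_map[c][i][j] for c in range(channels))
              for j in range(width)]
             for i in range(height)]
            for w_row in weight]


# Toy feature map with C=2, H'=1, W'=2, projected to d_emb=3.
f_out = [[[1, 2]], [[3, 4]]]
w_p = [[1, 0], [0, 1], [1, 1]]
embedded = conv1x1(f_out, w_p)
```

The spatial size is unchanged; only the channel dimension moves from C to the embedding dimension.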

[0039] The mapped feature map is flattened into a token sequence along the spatial dimensions, i.e., $T = \mathrm{Flatten}(E) \in \mathbb{R}^{N \times d_{emb}}$ with $N = H' \times W'$. The sequence is then standardized through layer normalization, i.e., $\hat{T} = \mathrm{LayerNorm}(T)$, so that the features of each token remain numerically stable in the Transformer encoding space, which facilitates the efficient convergence of the multi-head attention mechanism.
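
The flattening and normalization steps can be sketched as follows. This is a minimal illustration, assuming row-major flattening and a layer normalization without learnable scale and shift; both choices, and the names `flatten_to_tokens` and `layer_norm`, are assumptions rather than details from the patent.

```python
import math


def flatten_to_tokens(feature_map):
    """Convert a d_emb x H' x W' feature map into N = H' * W' tokens of length d_emb."""
    d_emb = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    return [[feature_map[c][i][j] for c in range(d_emb)]
            for i in range(height) for j in range(width)]


def layer_norm(token, eps=1e-5):
    """Normalize one token to zero mean and unit variance (no learnable affine)."""
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var + eps) for v in token]


# Toy feature map with d_emb=3, H'=2, W'=2, giving N=4 tokens.
feature_map = [[[c * 10 + i * 2 + j for j in range(2)] for i in range(2)]
               for c in range(3)]
tokens = flatten_to_tokens(feature_map)
normalized = [layer_norm(t) for t in tokens]
```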

[0040] It should be noted that, for lightweight feature extraction, this embodiment retains the depthwise separable convolutional design of MobileNetV2, achieving efficient local feature capture with a minimal number of parameters. It improves feature transfer to medical images through grayscale channel expansion and adaptation of pre-trained weights. The Transformer alignment mechanism uses the 1×1 convolutional mapping and layer normalization to align the embedding dimension and standardize the values between the convolution output and the Transformer input. When outputting serialized features, the two-dimensional convolutional features are converted into token sequences, providing structured input for the multi-plane Transformer encoder and enabling unified modeling of features within and between planes. In other words, while maintaining lightweight feature extraction, the MobileNet Embedding encoder establishes an efficient interface between the convolutional network and the Transformer structure, providing a stable, high-dimensional semantic representation for subsequent multi-plane feature fusion.

[0041] As shown in Figure 5, in the feature embedding stage, this embodiment uses a dynamic positional-plane embedding module to transform slice feature sequences in different directions into high-dimensional semantic inputs that can be processed by the Transformer. This module includes three parts: class token embedding (CLS token), a dynamic positional encoding module, and a plane embedding module.

[0042] For any planar direction (axial, coronal, or sagittal), let the input slice feature sequence be $X = \{x_1, x_2, \dots, x_N\}$, $x_i \in \mathbb{R}^d$, where N represents the number of slices in the plane, and d represents the feature dimension.

[0043] A learnable class token vector (CLS token) is added to the beginning of the sequence to obtain the extended sequence $\tilde{X} = \{x_{cls}, x_1, x_2, \dots, x_N\}$. The CLS vector is used to aggregate the global features of the entire sequence after Transformer encoding and serves as input to the subsequent classification module. A corresponding dynamic position encoding vector $p_i$ is generated for each slice position i.

[0044] Unlike static sinusoidal positional encoding, this embodiment employs a learnable adaptive positional encoding mechanism that dynamically generates positional embeddings through the parameter matrix $W_p \in \mathbb{R}^{(N+1) \times d}$, i.e., $P = \mathrm{Linear}(W_p) = [p_{cls}, p_1, p_2, \dots, p_N]$.

[0045] When the length N of the input sequence changes, the dimension of $W_p$ is automatically adjusted by linear interpolation or a parameter expansion mechanism, enabling the model to handle different numbers of slice inputs. This dynamic embedding strategy allows the network to maintain structural consistency and feature alignment under different scanning protocols and voxel resolutions.
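
The linear-interpolation option can be sketched as follows. This is a minimal illustration, assuming the table holds one CLS row followed by N slice rows and that only the slice rows are resampled while the CLS row is kept unchanged; that convention and the names `interpolate_rows` and `resize_position_table` are assumptions, not details from the patent.

```python
def interpolate_rows(rows, new_len):
    """Linearly resample a list of equal-length vectors to new_len rows."""
    old_len = len(rows)
    if old_len < 1 or new_len < 1:
        raise ValueError("both lengths must be at least 1")
    if old_len == 1 or new_len == 1:
        return [list(rows[0]) for _ in range(new_len)]
    resized = []
    for i in range(new_len):
        # Position of output row i on the original index axis.
        pos = i * (old_len - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, old_len - 1)
        frac = pos - lo
        resized.append([(1 - frac) * a + frac * b
                        for a, b in zip(rows[lo], rows[hi])])
    return resized


def resize_position_table(table, new_num_slices):
    """Keep the CLS row and interpolate the slice rows to new_num_slices."""
    cls_row, slice_rows = table[0], table[1:]
    return [list(cls_row)] + interpolate_rows(slice_rows, new_num_slices)


# Table for N=3 slices with d=2, resized to N=5 slices.
table = [[9.0, 9.0], [0.0, 0.0], [2.0, 4.0], [4.0, 8.0]]
resized = resize_position_table(table, 5)
```

The first and last slice positions are preserved exactly, and intermediate positions are blended from their two nearest original rows.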

[0046] Meanwhile, to distinguish different slice plane orientations (axial, coronal, sagittal), this embodiment introduces a plane embedding vector $E_{plane}$, which is a learnable parameter, namely $E_{plane} \in \mathbb{R}^d$.

[0047] Different planar directions correspond to independent embedding parameters $E_{axial}$, $E_{coronal}$, and $E_{sagittal}$. During the embedding stage, the plane embedding is fused with the positional encoding through element-wise addition, i.e., $z_i = x_i + p_i + E_{plane}$. This forms the final input sequence containing both position and orientation information, i.e., $Z = \{z_{cls}, z_1, z_2, \dots, z_N\}$, $z_i \in \mathbb{R}^d$.
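
The element-wise composition of the input sequence can be sketched as follows. This is a minimal illustration, assuming the CLS token receives its own position vector and the same plane embedding as the slice tokens; the function name `build_input_sequence` and the toy values are hypothetical.

```python
def build_input_sequence(slice_features, cls_token, position_table, plane_embedding):
    """Compute z_i = x_i + p_i + E_plane for the CLS token and every slice token."""
    sequence = [cls_token] + slice_features
    if len(sequence) != len(position_table):
        raise ValueError("position table must have N + 1 rows")
    return [[x + p + e for x, p, e in zip(token, pos, plane_embedding)]
            for token, pos in zip(sequence, position_table)]


# Toy example with N=2 slices and d=2.
slice_features = [[1, 2], [3, 4]]
cls_token = [0, 0]
position_table = [[10, 10], [20, 20], [30, 30]]
axial_embedding = [100, 200]
z = build_input_sequence(slice_features, cls_token, position_table, axial_embedding)
```

Using a different plane embedding for the coronal or sagittal sequence shifts every token of that sequence by a direction-specific offset, which is how the encoder can tell the planes apart.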

[0048] It should be noted that this embodiment can distinguish slice order within the same plane and can also explicitly encode the feature semantics of different plane directions, realizing comprehensive spatial location modeling. Through dynamic generation and interpolation mechanisms, it can flexibly handle inputs with different numbers of slices without requiring a fixed input size, and it tolerates common problems in medical images such as MRI, including size changes and missing slices. By distinguishing directions through plane embedding, it gives the Transformer a stronger structural understanding when modeling across planes.

[0049] In the global feature modeling process, this embodiment designs independent Transformer encoder structures for slice feature sequences in different planar directions (axial, coronal, and sagittal). Each Transformer encoder consists of a multi-head self-attention module, a feed-forward MLP, layer normalization, and residual connections. The self-attention module captures long-range dependencies between different slices within the same plane, the feed-forward network performs nonlinear feature transformation and high-dimensional semantic abstraction, and layer normalization and residual structures ensure the stability and information fidelity of feature flow. This structure enables contextual modeling between slices within a single plane, learning global structural patterns and spatial semantic dependencies, thereby significantly enhancing feature consistency and semantic expressiveness within the plane.
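
The core of each encoder layer can be illustrated with a simplified sketch. The code below implements single-head scaled dot-product self-attention with a residual connection, using the tokens themselves as queries, keys, and values; the learned projections, multiple heads, feed-forward MLP, and layer normalization described above are deliberately omitted, so this is an assumption-laden reduction, not the patent's encoder.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_with_residual(tokens):
    """Single-head scaled dot-product self-attention plus a residual connection."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for query in tokens:
        # Similarity of this slice token to every slice token in the plane.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        attended = [sum(w * value[j] for w, value in zip(weights, tokens))
                    for j in range(dim)]
        # Residual connection: add the input token back to the attended output.
        output.append([a + q for a, q in zip(attended, query)])
    return output


# Three identical slice tokens: attention is uniform, so output = 2 * input.
tokens = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
encoded = self_attention_with_residual(tokens)
```

Because every output token is a weighted mix of all tokens in the sequence, a slice at one end of the plane can directly influence the representation of a slice at the other end.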

[0050] Unlike traditional methods that rely on simple feature concatenation or mean fusion, this embodiment achieves deep semantic association modeling between slices through in-plane Transformer encoding. The three Transformer encoders are parameter-independent and functionally equivalent, each responsible for modeling information in one of the three orthogonal planes (axial, coronal, and sagittal), thereby preserving the integrity and complementarity of features in each direction as far as possible. This independent design avoids information interference between planes, enabling the model to more accurately extract anatomical structures and lesion patterns from various perspectives.

[0051] As shown in Figure 6, in the multi-plane feature fusion process, this embodiment extracts the CLS token representation from the Transformer output of each of the three plane directions as the global semantic feature vector for that plane. The CLS features of the three planes are then fused using a weighted average or an attention-based weighted fusion mechanism to form a comprehensive three-dimensional global feature representation. This fusion strategy can dynamically allocate weights according to the importance of features from different planes, thereby achieving adaptive integration of multi-view information. The fused global features are then processed by layer normalization and input into a linear classification head, which outputs the classification result for the target disease.
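
The fusion and classification steps can be sketched as follows. This is a minimal illustration, assuming one learnable scalar score per plane that is turned into fusion weights by a softmax, followed by a linear head and a softmax over classes; the layer normalization before the head is omitted for brevity, and all names and toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_planes(cls_vectors, plane_scores):
    """Weighted sum of per-plane CLS vectors with softmax-normalized weights."""
    weights = softmax(plane_scores)
    dim = len(cls_vectors[0])
    return [sum(w * vec[j] for w, vec in zip(weights, cls_vectors))
            for j in range(dim)]


def classify(feature, head_weight, head_bias):
    """Linear classification head followed by softmax class probabilities."""
    logits = [sum(w * x for w, x in zip(row, feature)) + b
              for row, b in zip(head_weight, head_bias)]
    return softmax(logits)


# Axial, coronal, and sagittal CLS vectors with equal plane scores.
cls_vectors = [[3.0, 0.0], [0.0, 3.0], [3.0, 3.0]]
fused = fuse_planes(cls_vectors, plane_scores=[0.0, 0.0, 0.0])
probabilities = classify(fused, head_weight=[[1.0, 0.0], [0.0, 1.0]],
                         head_bias=[0.0, 0.0])
```

Raising the score of one plane shifts the fused vector toward that plane's CLS feature, which is one way the adaptive weighting described above could be realized.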

[0052] It should be noted that in this embodiment, the in-plane Transformer module captures the spatial relationships between slices in the same direction, and the multi-plane fusion module integrates semantic features from different directions. The overall structure maintains a lightweight design while retaining strong three-dimensional spatial understanding and discrimination capabilities, and it achieves hierarchical semantic integration from local slices to global volume data. The method of this invention can be widely applied to computer-aided diagnosis of brain MRI images, especially in the early screening and classification of Alzheimer's disease and Moyamoya disease, enabling rapid and accurate inference on resource-constrained medical devices. Furthermore, the method in this embodiment is general and can be extended to other three-dimensional medical image analysis tasks, including multimodal fusion, organ segmentation, and lesion localization, which gives it engineering and research value.

[0053] To verify the effectiveness of the classification method in this embodiment, experiments were conducted on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, the OASIS brain imaging dataset, and a Moyamoya disease MRI dataset. Under the same hardware and data preprocessing conditions, the method was compared with existing 3D medical image classification models, including the 3D ResNet series, MedicalNet, the original M3T model, MRNet, and FCNlinksCNN. The comparison of lightweighting effects is shown in Table 1.

[0054] Table 1 Comparison of Lightweighting Effects

[0055] The number of parameters in this embodiment is only 7.96M, which is about 76.2% less than the original M3T and about 87.5% less than 3D ResNet50. The computational complexity (FLOPs) is reduced to 1.70G, which is significantly better than other models, verifying the lightweight design advantages of the MobileNetEmbedding encoder in this embodiment.
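The percentages above follow the usual relative-reduction formula, which the short sketch below makes explicit. The baseline parameter counts it prints are back-calculated from the rounded percentages quoted in this paragraph; they are not read from Table 1 and should be treated as approximate. The function names are hypothetical.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is smaller than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

def implied_baseline(ours, reduction_pct):
    """Baseline value implied by a stated percentage reduction."""
    return ours / (1.0 - reduction_pct / 100.0)

OURS_PARAMS_M = 7.96  # parameters of this embodiment, in millions (from the text)

# Approximate baselines implied by the stated reductions (derived, not measured).
implied_m3t = implied_baseline(OURS_PARAMS_M, 76.2)
implied_resnet50 = implied_baseline(OURS_PARAMS_M, 87.5)

print(f"implied M3T parameters:         {implied_m3t:.2f}M")
print(f"implied 3D ResNet50 parameters: {implied_resnet50:.2f}M")
```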

[0056] Table 2 Classification Performance Comparison

[0057] As shown in Table 2, this embodiment achieves the best performance on all three datasets: on the ADNI dataset, the AUC improved to 0.9906, with an accuracy of 95.96%; on the OASIS dataset, the AUC improved to 0.9655, with an accuracy of 96.88%; and on the Moyamoya disease dataset, the AUC reached 0.9881, with an accuracy of 94.59%.

[0058] Compared to traditional 3D convolutional networks and the original M3T model, the method of this invention achieves higher diagnostic accuracy while maintaining a lightweight structure, indicating that the design balances computational efficiency and recognition performance.

[0059] To further verify the effectiveness of this embodiment, several ablation experiments were designed to compare performance after removing individual components, namely the multi-plane structure, the positional encoding, and the independent Transformer encoders. The results are shown in Tables 3, 4 and 5.

[0060] Table 3 ADNI Dataset

[0061] Table 4 OASIS Dataset

[0062] Table 5 Moyamoya disease dataset

[0063] The ablation results show that multi-plane input, positional encoding, and independent Transformer structures all significantly contribute to performance improvement. In particular, removing positional encoding or sharing the Transformer leads to performance degradation, validating the rationality and necessity of the module design in this invention.

[0064] Embodiment 2: Embodiment 2 of the present invention provides a medical image classification system.

[0065] As shown in Figure 7, the medical image classification system includes: an acquisition module configured to acquire three-dimensional medical images to be classified; a processing module configured to perform multiplanar slicing on the acquired three-dimensional medical images to obtain a two-dimensional slice sequence; an extraction module configured to extract local features from the obtained two-dimensional slice sequence to obtain a two-dimensional feature map; an embedding module configured to perform feature embedding on the obtained two-dimensional feature map to obtain a structured token sequence; a capture module configured to capture the global dependencies between slices in each plane based on the obtained structured token sequence, a multi-head self-attention mechanism and a residual structure, so as to obtain the global semantic features of each plane; an integration module configured to dynamically integrate the obtained global semantic features of each plane to obtain a three-dimensional global feature vector; and a classification module configured to classify the three-dimensional medical images based on the obtained three-dimensional global feature vector, thereby completing the medical image classification.
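As an illustration of the processing module, the multiplanar slicing step can be sketched in plain Python. The sketch assumes the volume is stored as a nested list indexed `volume[z][y][x]` and that fixing z, y, or x yields an axial, coronal, or sagittal slice respectively. Both the storage layout and this axis-to-plane mapping are assumptions, since the actual mapping depends on how the scan is oriented. The function name `multiplanar_slices` is hypothetical.

```python
def multiplanar_slices(volume):
    """Split a 3D volume (indexed [z][y][x]) into three stacks of 2D slices."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    # Fix z: each slice is a height x width image.
    axial = [[row[:] for row in volume[z]] for z in range(depth)]
    # Fix y: each slice is a depth x width image.
    coronal = [[[volume[z][y][x] for x in range(width)] for z in range(depth)]
               for y in range(height)]
    # Fix x: each slice is a depth x height image.
    sagittal = [[[volume[z][y][x] for y in range(height)] for z in range(depth)]
                for x in range(width)]
    return {"axial": axial, "coronal": coronal, "sagittal": sagittal}

# Toy 2 x 3 x 4 volume whose voxel value encodes its own (z, y, x) position.
volume = [[[100 * z + 10 * y + x for x in range(4)] for y in range(3)]
          for z in range(2)]
slices = multiplanar_slices(volume)
```

Each of the three slice sequences would then be passed to the extraction and embedding modules described above.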

[0066] The detailed steps are the same as those of the medical image classification method provided in Embodiment 1 and are not repeated here.

[0067] Embodiment 3: Embodiment 3 of the present invention provides a computer-readable storage medium.

[0068] A computer-readable storage medium having a program stored thereon, which, when executed by a processor, implements the steps in the medical image classification method as described in Embodiment 1 of the present invention.

[0069] The detailed steps are the same as those of the medical image classification method provided in Embodiment 1 and are not repeated here.

[0070] Embodiment 4: Embodiment 4 of the present invention provides an electronic device.

[0071] An electronic device includes a memory, a processor, and a program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps in the medical image classification method as described in Embodiment 1 of the present invention.

[0072] The detailed steps are the same as those of the medical image classification method provided in Embodiment 1 and are not repeated here.

[0073] Embodiment 5: Embodiment 5 of the present invention provides a computer program product.

[0074] A computer program product includes software code, wherein the program in the software code performs the steps of the medical image classification method as described in Embodiment 1 of the present invention.

[0075] The detailed steps are the same as those of the medical image classification method provided in Embodiment 1 and are not repeated here.

[0076] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code. The solutions in the embodiments of the present invention can be implemented using various computer languages, such as the object-oriented programming language Java and the interpreted scripting language JavaScript.

[0077] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0078] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0079] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0080] Although preferred embodiments of the invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including both the preferred embodiments and all changes and modifications falling within the scope of the invention.

[0081] Obviously, those skilled in the art can make various modifications and variations to this invention without departing from its spirit and scope. Therefore, if these modifications and variations fall within the scope of the claims of this invention and their equivalents, this invention also intends to include these modifications and variations.

[0082] The above description is merely a preferred embodiment of the present invention and is not intended to limit its scope. Those skilled in the art can make various modifications and variations to the present invention. Any modifications, equivalent substitutions, or improvements made within the spirit and principles of the present invention should be included within its protection scope.

Claims

1. A medical image classification method, characterized in that it comprises: acquiring three-dimensional medical images to be classified; performing multiplanar slicing on the acquired three-dimensional medical images to obtain two-dimensional slice sequences; extracting local features from the obtained two-dimensional slice sequences to obtain a two-dimensional feature map; performing feature embedding on the obtained two-dimensional feature map to obtain a structured token sequence; capturing, based on the obtained structured token sequence, a multi-head self-attention mechanism and a residual structure, the global dependencies between slices in each plane to obtain the global semantic features of each plane; dynamically integrating the obtained global semantic features of each plane to obtain a three-dimensional global feature vector; and classifying the three-dimensional medical images based on the obtained three-dimensional global feature vector to complete the medical image classification.

2. The medical image classification method as described in claim 1, characterized in that the acquired three-dimensional medical images are sliced along three orthogonal directions, namely axial, coronal, and sagittal, to obtain two-dimensional slice sequences in each of the three directions.

3. The medical image classification method as described in claim 1, characterized in that the obtained two-dimensional slice sequences are input into a lightweight feature extraction network to extract local features from each two-dimensional slice sequence, and the extracted local features constitute the two-dimensional feature map; the lightweight feature extraction network is built on MobileNetV2 and includes a depthwise separable convolutional layer, an inverted residual structure, and a convolutional layer for feature dimension mapping.

4. The medical image classification method as described in claim 3, characterized in that the inverted residual structure comprises three stages: pointwise convolution, depthwise convolution, and linear convolution, and the feature calculation formula is: $F_{i+1} = W_i^{(3)} \cdot \sigma\left(\mathrm{BN}\left(W_i^{(2)}\, \sigma\left(\mathrm{BN}\left(W_i^{(1)} F_i\right)\right)\right)\right) + F_i$; where $W_i^{(1)}$, $W_i^{(2)}$ and $W_i^{(3)}$ denote the weights of the pointwise convolution, depthwise convolution, and linear convolution, respectively; $\sigma(\cdot)$ is a non-linear activation function; $\mathrm{BN}(\cdot)$ denotes batch normalization; and $F_i$ and $F_{i+1}$ denote the input feature map and the output feature map, respectively.

5. The medical image classification method as described in claim 1, characterized in that attention weighting or vector concatenation is applied to the global semantic features of each plane, and the spatial semantics between planes are enhanced through complementary fusion to obtain the three-dimensional global feature vector.

6. The medical image classification method as described in claim 1, characterized in that the obtained three-dimensional global feature vector is input into a linear classification head to map it to the category dimension of the medical image, and the probability of each category is calculated using the Softmax function to obtain the classification result of the three-dimensional medical image.

7. A medical image classification system, characterized in that it comprises: an acquisition module configured to acquire three-dimensional medical images to be classified; a processing module configured to perform multiplanar slicing on the acquired three-dimensional medical images to obtain a two-dimensional slice sequence; an extraction module configured to extract local features from the obtained two-dimensional slice sequence to obtain a two-dimensional feature map; an embedding module configured to perform feature embedding on the obtained two-dimensional feature map to obtain a structured token sequence; a capture module configured to capture the global dependencies between slices in each plane based on the obtained structured token sequence, a multi-head self-attention mechanism and a residual structure, so as to obtain the global semantic features of each plane; an integration module configured to dynamically integrate the obtained global semantic features of each plane to obtain a three-dimensional global feature vector; and a classification module configured to classify the three-dimensional medical images based on the obtained three-dimensional global feature vector, thereby completing the medical image classification.

8. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the steps of the medical image classification method as described in any one of claims 1-6.

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the steps of the medical image classification method as described in any one of claims 1-6.

10. A computer program product comprising software code, characterized in that the program in the software code performs the steps of the medical image classification method as described in any one of claims 1-6.