A multi-instance learning classification method based on bidirectional Mamba and adaptive masking

By improving the pyramid positional encoding and the bidirectional Mamba model and combining them with an adaptive masking strategy, the method addresses uneven attention distribution and spatial topology loss in breast cancer whole-slide image classification, achieving higher diagnostic accuracy and model generalization ability.

CN122244002A · Pending · Publication Date: 2026-06-19 · SOUTHWEST UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SOUTHWEST UNIV OF SCI & TECH
Filing Date
2026-04-28
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

When processing whole-slide images of breast cancer, existing multi-instance learning methods suffer from overly concentrated attention distributions, loss of local spatial topological information, and insufficient global modeling ability for ultra-long sequences, all of which limit the diagnostic accuracy and generalization ability of the model.

Method used

An improved pyramid positional encoding generator is used to reconstruct the two-dimensional physical space. It is combined with a bidirectional Mamba model and an adaptive masking strategy: features are aggregated through multi-branch gated attention while the masking intensity is adjusted dynamically, thereby capturing multi-scale spatial features and global contextual information.

Benefits of technology

It significantly improves the classification accuracy and robustness for whole-slide images of breast cancer, enhances the diagnostic and generalization capabilities of the model, and addresses the information loss and high computational complexity of traditional methods.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention proposes a multi-instance learning classification method based on bidirectional Mamba and adaptive masking, applicable to the fields of digital pathology and assisted diagnosis. To address the concentrated attention distribution and insufficient use of spatial topological information in existing whole-slide image analysis, the method first extracts multi-scale local spatial features through a pyramid positional encoding generator. It then combines a bidirectional state-space model with a progressive gating mechanism to achieve adaptive, direction-independent global context fusion over ultra-long sequences. Finally, it employs multi-branch gated attention and a dynamic adaptive masking strategy based on attention entropy, adjusting the mask ratio according to how concentrated the model's attention is and prompting the model to mine mutually exclusive and diverse morphological discriminative features. This invention effectively reduces the model's over-reliance on single salient local features, significantly improving the robustness of the bag-level representation of pathological images, the model's generalization ability, and classification accuracy.

Description

Technical Field

[0001] This invention relates to the fields of medical image processing and deep learning, specifically to a whole-slide image classification method based on multi-instance learning. More specifically, it is a multi-instance learning classification method based on bidirectional Mamba and adaptive masking, which is particularly suitable for intelligent auxiliary diagnosis of pathological images such as those of breast cancer.

Background Technology

[0002] Breast cancer is one of the most common malignant tumors among women worldwide, and its accurate pathological grading and early-to-mid-stage screening are directly related to patient survival rates and prognostic interventions. With the development of digital pathology, whole-slide images (WSIs) are increasingly widely used. Their ultra-high resolution preserves complete tissue morphology information, providing data support for deep learning-assisted diagnosis. Because WSIs are extremely large (typically containing billions of pixels), they cannot be directly input into conventional deep neural networks. Therefore, multi-instance learning has become the mainstream paradigm for current pathological image analysis. In the multi-instance learning framework, the entire WSI is treated as a "bag" containing multiple image patches (instances), so that only slide-level weak supervision labels are required for end-to-end network training.

[0003] In recent years, multi-instance learning methods based on attention mechanisms have made significant progress. For example, ABMIL introduced a gated network to learn instance weights, significantly improving model performance. However, existing multi-instance learning methods still have three major drawbacks when dealing with ultra-long WSI sequences. First, the attention distribution is over-concentrated. Conventional attention aggregation mechanisms are prone to "shortcut learning," concentrating weights excessively on a few highly salient lesion regions and ignoring other subtle clues with important diagnostic value, which limits model generalization. Although some studies have proposed fixed-ratio instance masks (such as STKIM), a fixed mask is difficult to adapt to slides with different complexities and lesion distribution densities.

[0004] Secondly, local spatial topological information is lost. WSI is usually cut and flattened into a one-dimensional sequence during preprocessing. This serialization process directly severs the original two-dimensional spatial physical structure of pathological tissue, which seriously affects the localization accuracy of small lesions and the analysis of the cellular microenvironment.

[0005] Finally, the global modeling capability for ultra-long sequences is insufficient. Although traditional Transformer-based methods can establish global dependencies, the computational complexity of their self-attention mechanism grows quadratically with sequence length, which is costly for pathological sequences containing tens of thousands of instances. While recent state-space models (such as Mamba) have linear complexity, standard Mamba is a unidirectional causal model, which loses reverse contextual information when applied to two-dimensional pathological slides that have no inherent direction.

[0006] In summary, how to capture multi-scale spatial feature representations, process the global context of undirected ultra-long sequences, and extract robust and diverse features to improve the accuracy and robustness of pathological image classification are urgent problems to be solved in multi-instance learning classification research.

Summary of the Invention

[0007] To address the aforementioned problems, this invention provides a multi-instance learning (MIL) classification method based on bidirectional Mamba and adaptive masking. First, it applies an improved pyramid positional encoding generator to reconstruct two-dimensional physical space anchors for the feature sequence, capturing multi-scale microenvironment topological information. Next, it introduces bidirectional Mamba and progressive gating logic to solve the global context interaction problem for ultra-long pathological sequences that have no inherent direction. Finally, it designs a multi-branch gated attention and adaptive masking aggregation strategy that dynamically adjusts the masking intensity based on the statistical distribution of the model's current attention, avoiding the information loss or local optima caused by fixed masks and forcing the network to mine mutually exclusive and diverse morphological discriminative features. These modules enable this invention to outperform traditional MIL methods in classification accuracy and generalization ability when predicting the diagnosis for breast cancer whole-slide images.

[0008] To achieve the above objectives, the technical solution of the present invention is as follows: a multi-instance learning classification method based on bidirectional Mamba and adaptive masking is provided. First, the whole-slide image is segmented and features are extracted. Then, the feature matrix is input into the main network module, which comprises spatial awareness, bidirectional global modeling, and adaptive masking attention aggregation. After training, the disease category of the whole slide is predicted and the diagnostic label is output.

[0009] The aforementioned multi-instance learning classification method based on bidirectional Mamba and adaptive masking specifically includes the following steps:
Step 1: Preprocess and segment the whole-slide pathological image, use a pre-trained convolutional neural network to obtain instance-level visual features, and construct a one-dimensional initial feature matrix sequence.
Step 2: Use an improved pyramid positional encoding generator to perform local spatial perception processing on the input feature matrix sequence, capturing the local microenvironment and cellular tissue features at different receptive fields.
Step 3: Use a bidirectional state-space model with a progressive gating mechanism to perform global context modeling on the local spatial features, realizing dynamic information exchange and fusion between local spatial features and global temporal features.
Step 4: Perform feature aggregation using a multi-branch gated attention mechanism and an adaptive masking strategy based on attention entropy. The global bag feature is obtained by integrating the pattern embedding representations of the different branches, and the final disease classification prediction is then output through a fully connected layer.

[0010] Preferably, the improved pyramid positional encoding generator applies depthwise separable convolutions with three different receptive fields (1×1, 3×3, 5×5) in parallel to capture the fine-grained features of the instance itself, the features of the adjacent local microenvironment, and the broader tissue edge structure features, thereby reconstructing two-dimensional physical space anchoring for the one-dimensional serialized instance features.

[0011] Preferably, the adaptive masking strategy calculates the dynamic masking ratio in real time by statistically analyzing the normalized entropy, maximum value, and standard deviation of the current network multi-branch attention distribution; when attention is overly concentrated (low entropy), the masking ratio is increased to force the model to explore suboptimal regions, and when attention is scattered (high entropy), the masking ratio is reduced to avoid the loss of key lesion information.

[0012] Preferably, step 2 specifically includes: The core of this step lies in capturing spatial topological information through depthwise separable convolutions at different scales, which can be formally expressed as follows. First, the initial feature sequence $X \in \mathbb{R}^{N \times D}$ is rearranged and restored to a two-dimensional spatial feature map $X_{2D} \in \mathbb{R}^{H \times W \times D}$. Then, depthwise separable convolutions $\mathrm{DWConv}_k$ with different kernel sizes $k$ are used to extract spatial features, and the features at the three scales are flattened back into one-dimensional sequences and summed element by element. Finally, a residual connection with the original input $X$ yields the feature representation $X_{\mathrm{loc}}$ that includes local spatial topological information:

$$X_{\mathrm{loc}} = X + \sum_{k \in \{1,3,5\}} \mathrm{Flatten}\big(\mathrm{DWConv}_k(X_{2D})\big)$$

where $N$ is the number of instances, $D$ is the feature dimension, and $k$ indicates the size of the convolution kernel.

[0013] Preferably, step 3 specifically includes: To adaptively fuse local priors and global semantics, a progressive gating logic is designed to control the information flow between feature channels, and its formal expression is as follows. First, the bidirectional Mamba model outputs the forward scan features $Y_{\mathrm{fwd}}$ and the backward (flipped) scan features $Y_{\mathrm{bwd}}$. The two are concatenated along the feature dimension to generate the direction-independent global temporal feature $X_{\mathrm{glob}}$:

$$X_{\mathrm{glob}} = [\,Y_{\mathrm{fwd}} \,\|\, Y_{\mathrm{bwd}}\,]$$

Then, the dynamic soft mask gating matrix $G$ that controls the information exchange is calculated:

$$G = \sigma\big(W_g\,[\,X_{\mathrm{loc}} \,\|\, X_{\mathrm{glob}}\,] + b_g\big)$$

where $W_g$ denotes the learnable weight parameters, $b_g$ is the bias that controls the initial feature preference, $[\,\cdot \,\|\, \cdot\,]$ indicates the feature concatenation operation, and $\sigma$ is the Sigmoid activation function.

[0014] Finally, the gating mechanism merges the local spatial features $X_{\mathrm{loc}}$ with the global temporal features $X_{\mathrm{glob}}$ and updates them into the context-enhanced feature representation $Z$:

$$Z = G \odot X_{\mathrm{loc}} + (1 - G) \odot X_{\mathrm{glob}}$$

where $\odot$ represents the Hadamard product (element-by-element multiplication).

[0015] Preferably, step 4 specifically includes: To extract discriminative and diverse bag-level representations, multi-branch gated attention and dynamic probabilistic masks are combined, and their formal expression is as follows. First, for the $i$-th instance feature $z_i$ in the $k$-th branch, its unnormalized attention score $a_{k,i}$ is calculated:

$$a_{k,i} = w_k^{\top}\big(\tanh(V z_i) \odot \sigma(U z_i)\big)$$

where $V$ and $U$ are feature mapping matrices shared by all branches, and $w_k$ is the attention aggregation vector unique to the $k$-th branch.

[0016] Next, the adaptive masking module calculates the masking ratio based on the overall scores and identifies the high-scoring instances. It then introduces a probability function that incorporates the learned temperature parameter to perform the masking operation, setting the scores of the selected instances to the minimum value and yielding the masked scores $\tilde{a}_{k,i}$.

[0017] Subsequently, the scores obtained after masking are normalized using Softmax to obtain the final attention weights $\alpha_{k,i}$:

$$\alpha_{k,i} = \frac{\exp(\tilde{a}_{k,i})}{\sum_{j=1}^{N} \exp(\tilde{a}_{k,j})}$$

Finally, the pattern embedding representation $m_k$ of each branch is generated by weighted summation, and the mean over all $K$ branches yields the global bag feature $B$:

$$m_k = \sum_{i=1}^{N} \alpha_{k,i}\, z_i, \qquad B = \frac{1}{K} \sum_{k=1}^{K} m_k$$

This global bag feature $B$ is then fed into a multilayer perceptron (MLP) and mapped to the final disease prediction probability score.

[0018] The present invention has the following advantages over the prior art:
(1) The improved pyramid positional encoding generator module proposed in this invention effectively compensates for the loss of spatial topological information caused by the serialization operation and enhances the model's ability to perceive the two-dimensional arrangement of the local microenvironment and cell communities.
(2) The bidirectional state-space model (Bi-Mamba) with a progressive gating mechanism designed in this invention achieves high-quality, direction-independent global context interaction for ultra-long pathological sequences while maintaining linear computational complexity, thus making up for the semantic defects of unidirectional scanning.
(3) The attention entropy-based adaptive masking strategy proposed in this invention overcomes the inflexibility of traditional fixed masks. By dynamically balancing the "exploitation" and "exploration" of features, the network is forced to discover mutually exclusive and diverse morphological discriminative features, which significantly improves the robustness of the pathological representation and the generalization ability of the model.

Attached Figure Description

[0019] Various other advantages and features will become more apparent from the following detailed description of preferred embodiments. The accompanying drawings are for illustrative purposes only and are not intended to limit the invention.
Figure 1 is a flowchart of a multi-instance learning whole-slide image classification method according to an embodiment of the present invention;
Figure 2 is an overall architecture diagram of a multi-instance learning whole-slide image classification method according to an embodiment of the present invention;
Figure 3 is a comparison of the results of this invention with those of other methods.

Detailed Implementation

[0020] The technical solutions of the present invention will be clearly, completely, and in detail described below with reference to the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort are within the scope of protection of the present invention.

[0021] This invention provides a multi-instance learning classification method based on bidirectional Mamba and adaptive masking, overcoming the problems of attention concentration, spatial topology loss, and low efficiency in long sequence modeling in WSI analysis. The Camelyon16 dataset is used, and the specific implementation process is as follows:

[0022] High-resolution whole-slide images (WSIs) of breast cancer are obtained from the Camelyon16 dataset. The Otsu algorithm is used to filter out blank background regions in each slide. The remaining valid tissue regions are then segmented using a fixed-size 256×256 sliding window, so that the model only uses image patches containing actual pathological content. Because the pixel dimension of the original image patches is excessively high, a pre-trained network (such as ResNet-18 or ViT-S/16) is used as a visual feature extractor to perform forward propagation and dimensionality reduction on each image patch. Finally, the entire WSI is converted into a one-dimensional feature matrix sequence $X \in \mathbb{R}^{N \times D}$ of length $N$ with feature dimension $D$.
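
By way of non-limiting illustration, the background filtering and tiling described above can be sketched in plain Python. The sketch assumes a low-resolution greyscale thumbnail in which tissue is darker than the background, non-overlapping windows, and a hypothetical `min_tissue` fraction for keeping a tile; none of these details are specified in this embodiment, and the toy example uses 2×2 tiles instead of 256×256.

```python
def otsu_threshold(hist):
    """Return the grey level t that maximizes the between-class variance.

    hist[i] is the number of pixels with grey level i; pixels <= t form one class.
    """
    total = sum(hist)
    weighted_total = sum(level * count for level, count in enumerate(hist))
    best_t, best_score = 0, -1.0
    count_low, sum_low = 0, 0.0
    for t, count in enumerate(hist):
        count_low += count
        sum_low += t * count
        count_high = total - count_low
        if count_low == 0 or count_high == 0:
            continue  # one class is empty, the variance is undefined
        mean_low = sum_low / count_low
        mean_high = (weighted_total - sum_low) / count_high
        # Proportional to the between-class variance.
        score = count_low * count_high * (mean_low - mean_high) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def tissue_tiles(gray, patch, threshold, min_tissue=0.5):
    """Return the top-left (row, col) of every patch x patch tile that is mostly tissue.

    Assumption: tissue pixels are darker than background (grey level <= threshold).
    """
    rows, cols = len(gray), len(gray[0])
    kept = []
    for r in range(0, rows - patch + 1, patch):
        for c in range(0, cols - patch + 1, patch):
            dark = sum(
                1
                for i in range(r, r + patch)
                for j in range(c, c + patch)
                if gray[i][j] <= threshold
            )
            if dark / (patch * patch) >= min_tissue:
                kept.append((r, c))
    return kept


# Toy 4 x 4 thumbnail: dark tissue on the left, bright background on the right.
gray = [[50, 50, 230, 230] for _ in range(4)]
hist = [0] * 256
for row in gray:
    for value in row:
        hist[value] += 1
threshold = otsu_threshold(hist)
tiles = tissue_tiles(gray, patch=2, threshold=threshold)
```

In this toy case only the two left-hand tiles survive the filter; each surviving tile would then be passed to the feature extractor to produce one row of the feature matrix.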

[0023] To recover the physical spatial relationships of image patches within the original slide, the model introduces an improved pyramid positional encoding generator. First, a rearrangement function is used to fold the one-dimensional sequence $X \in \mathbb{R}^{N \times D}$ back into an approximately square two-dimensional feature map $X_{2D} \in \mathbb{R}^{H \times W \times D}$ with $H \approx W$.

[0024] Subsequently, the feature map is fed in parallel into three depthwise separable convolution branches with different receptive fields:
1×1 convolutional branch: responsible for cross-channel information interaction, preserving fine-grained features at the instance level;
3×3 convolutional branch: perceives the adjacent 3×3 neighborhood microenvironment and captures the community arrangement patterns of local cells;
5×5 convolutional branch: provides a larger receptive field for perceiving broader tissue edge morphology and matrix structure.

[0025] The two-dimensional spatial features extracted by the three convolutional branches are flattened back into one-dimensional sequences, added element-wise, and then combined with the input sequence $X$ through a residual connection, which ensures the stability of deep network training. The calculation is formalized as follows:

$$X_{\mathrm{loc}} = X + \sum_{k \in \{1,3,5\}} \mathrm{Flatten}\big(\mathrm{DWConv}_k(X_{2D})\big)$$

After this step, each isolated instance feature not only contains its own visual semantics but also embeds rich two-dimensional local physical anchoring information. The output feature is denoted as $X_{\mathrm{loc}}$.
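
By way of non-limiting illustration, the fold, multi-scale convolution, flatten, sum, and residual sequence described above can be sketched in plain Python. The sketch makes several assumptions that are not stated in this embodiment: the sequence is zero-padded up to the next square grid, convolutions use zero padding at the borders, and only the per-channel (depthwise) part of each convolution is shown, with the pointwise channel-mixing part omitted. The function and variable names are hypothetical.

```python
import math


def depthwise_conv(grid, kernels):
    """Per-channel 2-D convolution with zero padding.

    grid[r][c] is a D-vector; kernels[d] is the k x k kernel of channel d.
    """
    side, dim = len(grid), len(grid[0][0])
    k = len(kernels[0])
    pad = k // 2
    out = [[[0.0] * dim for _ in range(side)] for _ in range(side)]
    for r in range(side):
        for c in range(side):
            for d in range(dim):
                acc = 0.0
                for i in range(k):
                    for j in range(k):
                        rr, cc = r + i - pad, c + j - pad
                        if 0 <= rr < side and 0 <= cc < side:
                            acc += kernels[d][i][j] * grid[rr][cc][d]
                out[r][c][d] = acc
    return out


def pyramid_position_encoding(x, kernels_by_size):
    """Fold x (N x D) into a square grid, apply each scale, flatten, sum, add residual."""
    n, dim = len(x), len(x[0])
    side = math.ceil(math.sqrt(n))
    # Assumption: pad with zero vectors so that the sequence fills a side x side grid.
    padded = [row[:] for row in x] + [[0.0] * dim for _ in range(side * side - n)]
    grid = [padded[r * side:(r + 1) * side] for r in range(side)]
    out = [row[:] for row in x]  # residual connection starts from the input
    for kernels in kernels_by_size.values():
        conv = depthwise_conv(grid, kernels)
        flat = [conv[r][c] for r in range(side) for c in range(side)][:n]
        for i in range(n):
            for d in range(dim):
                out[i][d] += flat[i][d]
    return out


def identity_kernel(k):
    """k x k kernel that copies the centre pixel (used only for the toy check)."""
    mid = k // 2
    return [[1.0 if (i == mid and j == mid) else 0.0 for j in range(k)] for i in range(k)]


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]  # N = 5, D = 2
kernels_by_size = {k: [identity_kernel(k) for _ in range(2)] for k in (1, 3, 5)}
encoded = pyramid_position_encoding(x, kernels_by_size)
```

With identity kernels each of the three branches returns the input unchanged, so the result is the input plus three copies of itself; learned kernels would instead mix in information from neighbouring grid positions.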

[0026] Pathological slides do not have an absolute temporal or directional order; therefore, unidirectional modeling will lead to information asymmetry. In this embodiment, the local spatial features $X_{\mathrm{loc}}$ are fed into a multi-layer bidirectional Mamba (Bi-Mamba) encoder.

[0027] In the core architecture of Bi-Mamba, the state-space equations are updated through dynamically discretized, input-dependent parameter matrices, giving the model a hardware-aware selective information filtering capability. Bi-Mamba employs two parallel scan branches: the forward branch scans the sequence in its original order to obtain $Y_{\mathrm{fwd}}$, and the backward branch reverses the sequence and then scans it to obtain $Y_{\mathrm{bwd}}$. The two are concatenated along the channel dimension to generate the undirected global context sequence $X_{\mathrm{glob}} = [\,Y_{\mathrm{fwd}} \,\|\, Y_{\mathrm{bwd}}\,]$.
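
By way of non-limiting illustration, the two-branch scanning pattern can be sketched as follows. The recurrence used here is a fixed exponential moving average, chosen only as a simple causal stand-in; it is not the Mamba selective state-space update, whose parameters depend on the input. The backward outputs are flipped back to the original order before concatenation, which is an assumption about the alignment step.

```python
def linear_scan(seq, decay=0.5):
    """Causal scan: each output depends only on the current and earlier inputs."""
    state = [0.0] * len(seq[0])
    out = []
    for x in seq:
        state = [decay * s + (1.0 - decay) * v for s, v in zip(state, x)]
        out.append(state)
    return out


def bidirectional_scan(seq, decay=0.5):
    """Scan forward and backward, realign, and concatenate along the channel axis."""
    fwd = linear_scan(seq, decay)
    bwd = linear_scan(seq[::-1], decay)[::-1]  # flip, scan, flip back
    return [f + b for f, b in zip(fwd, bwd)]   # list concatenation doubles the channels


# A single informative instance at the end of a three-instance sequence.
seq = [[0.0], [0.0], [1.0]]
context = bidirectional_scan(seq)
```

In this example the forward channel of the first instance is 0.0 because a causal scan cannot see later instances, while its backward channel is 0.125; this is the reverse context that a unidirectional scan would lose.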

[0028] To prevent the introduction of global semantics from disrupting the representation of key local lesions, the model employs progressive gating logic. This logic adaptively computes the soft mask matrix $G$ for each instance using a fully connected layer and a sigmoid function:

$$G = \sigma\big(W_g\,[\,X_{\mathrm{loc}} \,\|\, X_{\mathrm{glob}}\,] + b_g\big)$$

Here, the bias $b_g$ is initialized to a large positive number (such as 3.0) so that, in the early stages of training, the value of $G$ is close to 1, indicating that the model relies more on the reliable local spatial features $X_{\mathrm{loc}}$. As network optimization progresses, the model adaptively adjusts the values of $G$, smoothly merging in the global features $X_{\mathrm{glob}}$:

$$Z = G \odot X_{\mathrm{loc}} + (1 - G) \odot X_{\mathrm{glob}}$$

The resulting feature $Z$ combines local topological details with a macro-level global semantic perspective.
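
By way of non-limiting illustration, the effect of the large positive bias can be checked numerically. The sketch assumes that the fusion is a convex combination, gate times local features plus one minus gate times global features, and that the global features have already been brought to the same dimension as the local ones; the source text describes the behaviour of the gate but these two details are assumptions. All names are hypothetical.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def gated_fusion(local, global_feats, weights, bias):
    """Per-instance, per-channel soft gate computed from the concatenated features.

    weights has one row per output channel and 2 * D columns; bias has D entries.
    """
    fused, gates = [], []
    for loc, glo in zip(local, global_feats):
        joint = loc + glo  # feature concatenation [local ; global]
        gate = [
            sigmoid(sum(w * v for w, v in zip(row, joint)) + b)
            for row, b in zip(weights, bias)
        ]
        # Assumed fusion rule: gate * local + (1 - gate) * global.
        fused.append([g * l + (1.0 - g) * v for g, l, v in zip(gate, loc, glo)])
        gates.append(gate)
    return fused, gates


local = [[1.0, 2.0]]
global_feats = [[0.0, 0.0]]
weights = [[0.0] * 4 for _ in range(2)]  # untrained gate: only the bias acts
bias = [3.0, 3.0]                        # large positive initial bias
fused, gates = gated_fusion(local, global_feats, weights, bias)
```

With a bias of 3.0 and zero weights, every gate value equals sigmoid(3.0), roughly 0.953, so about 95% of the fused feature initially comes from the local branch; training can then lower the gate where global context is useful.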

[0029] To obtain bag-level (WSI-level) features for final classification, the feature sequence $Z$ is fed into a multi-branch gated attention module containing $K$ independent branches.

[0030] For the $i$-th instance feature $z_i$ in the $k$-th branch, the unnormalized raw attention score $a_{k,i}$ is calculated first.
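
By way of non-limiting illustration, the raw score computation can be sketched in the style of ABMIL gated attention: two projection matrices shared by all branches (here called `V` and `U`, hypothetical names) feed a tanh path and a sigmoid path, and each branch applies its own aggregation vector. This is a sketch under that assumption, not a definitive statement of the scoring function used in this embodiment.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def gated_attention_scores(instances, V, U, branch_vectors):
    """Return scores[k][i], the unnormalized score of instance i in branch k."""
    scores = [[] for _ in branch_vectors]
    for z in instances:
        # Shared gated hidden representation: tanh(V z) * sigmoid(U z), element-wise.
        hidden = [math.tanh(a) * sigmoid(b) for a, b in zip(matvec(V, z), matvec(U, z))]
        for k, w in enumerate(branch_vectors):
            scores[k].append(sum(wi * hi for wi, hi in zip(w, hidden)))
    return scores


instances = [[0.0, 0.0], [1.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0]]            # shared projection (identity for the toy case)
U = [[0.0, 0.0], [0.0, 0.0]]            # zero gate projection, so sigmoid gives 0.5
branch_vectors = [[1.0, 0.0], [0.0, 1.0]]  # one aggregation vector per branch
scores = gated_attention_scores(instances, V, U, branch_vectors)
```

Because each branch has its own aggregation vector, the two branches score the same instance differently: the first branch assigns the second instance 0.5 · tanh(1), while the second branch assigns it zero.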

[0031] The core innovation lies in the subsequent introduction of an adaptive masking module based on attention entropy: 1. Statistical Feature Extraction: The model first performs Softmax on the unnormalized scores to calculate the current attention distribution probability. Based on this probability, the normalized attention entropy, maximum attention value, standard deviation, and mean are calculated.

[0032] 2. Dynamic Mask Rate Prediction: The above four statistics are concatenated into a vector, which is then input into a small multilayer perceptron prediction network. The output is a dynamic mask ratio $r$ bounded between a lower and an upper limit. The algorithm logic is as follows: if the current attention entropy is extremely low (indicating that the model relies heavily on a very small region and has fallen into "shortcut learning"), the network outputs a high mask ratio; conversely, if the attention distribution is relatively uniform, it outputs a low mask ratio to prevent the loss of effective features.
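
By way of non-limiting illustration, the four statistics and the intended direction of the mapping can be sketched as follows. The learned prediction network is replaced by a hypothetical hand-written rule, `mask_ratio_stand_in`, that simply decreases linearly with the normalized entropy; the bounds `low` and `high` are likewise illustrative values, since the actual limits are not given in the text above.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_statistics(scores):
    """Return [normalized entropy, maximum, standard deviation, mean] of softmax(scores)."""
    probs = softmax(scores)
    n = len(probs)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    normalized_entropy = entropy / math.log(n) if n > 1 else 0.0
    mean = sum(probs) / n
    std = math.sqrt(sum((p - mean) ** 2 for p in probs) / n)
    return [normalized_entropy, max(probs), std, mean]


def mask_ratio_stand_in(stats, low=0.0, high=0.5):
    """Hypothetical replacement for the learned predictor: low entropy gives a high ratio."""
    normalized_entropy = stats[0]
    return low + (high - low) * (1.0 - normalized_entropy)


uniform_stats = attention_statistics([0.0, 0.0, 0.0, 0.0])
peaked_stats = attention_statistics([50.0, 0.0, 0.0, 0.0])
uniform_ratio = mask_ratio_stand_in(uniform_stats)
peaked_ratio = mask_ratio_stand_in(peaked_stats)
```

A uniform distribution has normalized entropy 1 and receives a ratio near the lower bound, whereas a sharply peaked distribution has entropy near 0 and receives a ratio near the upper bound. Note that the mean of a softmax distribution over N instances is always 1/N, so in this sketch it carries information only about the bag size.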

[0033] 3. Implementation of probability masking: Based on the dynamic mask ratio $r$, the number of instances that need to be masked is determined, and the instances are selected in descending order of score. Using a masking probability generated from a temperature parameter learned by the network, a random masking operation is applied to these high-scoring instances, forcing their scores to the minimum value. This forces the network to shift weight to other, suboptimal potential diagnostic regions during backpropagation.
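
By way of non-limiting illustration, the selection and random masking can be sketched as follows. The exact probability function is not given in the text above, so the sketch assumes a sigmoid of the score margin over the lowest selected score divided by the temperature; the constant `MIN_SCORE`, the truncation used to obtain the instance count, and all names are likewise assumptions.

```python
import math
import random

MIN_SCORE = -1e9  # assumed stand-in for the "minimum value" assigned to masked scores


def probabilistic_mask(scores, ratio, temperature, rng):
    """Randomly mask some of the top-scoring instances; return (masked scores, masked indices)."""
    n_mask = int(ratio * len(scores))  # assumption: truncate to an integer count
    if n_mask == 0:
        return list(scores), []
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    candidates = ranked[:n_mask]
    cutoff = scores[candidates[-1]]  # lowest score among the candidates
    masked = list(scores)
    dropped = []
    for i in candidates:
        # Hypothetical probability: higher margin or lower temperature masks more often.
        prob = 1.0 / (1.0 + math.exp(-(scores[i] - cutoff) / temperature))
        if rng.random() < prob:
            masked[i] = MIN_SCORE
            dropped.append(i)
    return masked, dropped


scores = [2.0, 0.1, 1.5, -0.3, 0.8, 0.0]
masked, dropped = probabilistic_mask(scores, ratio=0.5, temperature=1.0, rng=random.Random(0))
```

Only the three highest-scoring instances (indices 0, 2 and 4) can be masked here; which of them actually are depends on the random draws, so repeated training steps expose the aggregator to different subsets of the salient instances.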

[0034] After masking, the scores are normalized again using Softmax to obtain the true aggregation weights $\alpha_{k,i}$:

$$\alpha_{k,i} = \frac{\exp(\tilde{a}_{k,i})}{\sum_{j=1}^{N} \exp(\tilde{a}_{k,j})}$$

where $\tilde{a}_{k,i}$ is the masked score of the $i$-th instance in the $k$-th branch. Finally, the instance features $z_i$ are linearly weighted and summed using these weights to obtain the pattern embedding representation $m_k$ of the $k$-th branch, and the representations of all $K$ branches are fused by the mean operator to obtain the global bag feature $B$:

$$m_k = \sum_{i=1}^{N} \alpha_{k,i}\, z_i, \qquad B = \frac{1}{K} \sum_{k=1}^{K} m_k$$

The bag feature $B$ is then fed into the final classifier and mapped to a predicted probability score for breast cancer in the entire slide. During training, a diversity regularization penalty on the similarity between branches is also calculated, the entire network is optimized jointly in an end-to-end manner, and the classification result is finally obtained.
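
By way of non-limiting illustration, the aggregation and a possible diversity penalty can be sketched as follows. The text above states only that a penalty on inter-branch similarity is used; the mean pairwise cosine similarity shown here is one plausible form and is an assumption, as are the function names.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(instances, masked_scores_per_branch):
    """Return (branch embeddings, bag feature) from masked per-branch scores."""
    dim = len(instances[0])
    embeddings = []
    for scores in masked_scores_per_branch:
        weights = softmax(scores)  # masked instances receive (near) zero weight
        embeddings.append(
            [sum(w * z[d] for w, z in zip(weights, instances)) for d in range(dim)]
        )
    count = len(embeddings)
    bag = [sum(e[d] for e in embeddings) / count for d in range(dim)]
    return embeddings, bag


def diversity_penalty(embeddings):
    """Assumed form: mean pairwise cosine similarity between branch embeddings."""
    total, pairs = 0.0, 0
    for a in range(len(embeddings)):
        for b in range(a + 1, len(embeddings)):
            dot = sum(x * y for x, y in zip(embeddings[a], embeddings[b]))
            norm = math.sqrt(sum(x * x for x in embeddings[a])) * math.sqrt(
                sum(y * y for y in embeddings[b])
            )
            total += dot / norm if norm > 0.0 else 0.0
            pairs += 1
    return total / pairs if pairs else 0.0


instances = [[1.0, 0.0], [0.0, 1.0]]
masked_scores = [[0.0, -1e9], [-1e9, 0.0]]  # each branch has one instance masked out
embeddings, bag = aggregate(instances, masked_scores)
penalty = diversity_penalty(embeddings)
```

In this toy case the two branches attend to different instances, so their embeddings are orthogonal and the penalty is zero; branches that collapse onto the same instance would produce a penalty close to one.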

Claims

1. A multi-instance learning classification method based on bidirectional Mamba and adaptive masking, characterized in that it includes the following steps:
Step 1: Preprocess and segment the whole-slide pathological image, use a pre-trained convolutional neural network to obtain instance-level visual features, and construct a one-dimensional initial feature matrix sequence.
Step 2: Use an improved pyramid positional encoding generator to perform local spatial perception processing on the input feature matrix sequence, capturing the local microenvironment and cellular tissue features at different receptive fields.
Step 3: Use a bidirectional state-space model with a progressive gating mechanism to perform global context modeling on the local spatial features, thereby realizing dynamic information exchange and fusion between local spatial features and global temporal features.
Step 4: Perform feature aggregation using a multi-branch gated attention mechanism and an adaptive masking strategy based on attention entropy. The global bag feature is obtained by integrating the pattern embedding representations of the different branches, and the final disease classification prediction is then output through a fully connected layer.

2. The multi-instance learning classification method based on bidirectional Mamba and adaptive masking according to claim 1, characterized in that the improved pyramid positional encoding generator applies depthwise separable convolutions with three different receptive fields (1×1, 3×3, 5×5) in parallel to capture fine-grained features of the instance itself, features of the immediate local microenvironment, and features of the broader tissue edge structure, thereby reconstructing two-dimensional physical space anchoring for the one-dimensional serialized instance features.

3. The multi-instance learning classification method based on bidirectional Mamba and adaptive masking according to claim 1, characterized in that, The bidirectional state-space model with progressive gating mechanism consists of two parallel Mamba scan branches, forward and backward, used to achieve non-directional global contextual interaction of ultra-long pathological sequences, and utilizes dynamic soft mask gating to adaptively balance the proportion of local topological details and macro-global semantics.

4. The multi-instance learning classification method based on bidirectional Mamba and adaptive masking according to claim 1, characterized in that, The adaptive masking strategy calculates the dynamic masking ratio in real time by statistically analyzing the normalized entropy, maximum value, and standard deviation of the current multi-branch attention distribution of the network. When attention is overly concentrated (low entropy), the masking ratio is increased to force the model to explore suboptimal regions. When attention is scattered (high entropy), the masking ratio is reduced to avoid the loss of key lesion information.

5. The multi-instance learning classification method based on bidirectional Mamba and adaptive masking according to claim 1, characterized in that step 2 specifically comprises: The core of this step lies in capturing spatial topological information through depthwise separable convolutions at different scales, which can be formally expressed as follows. First, the initial feature sequence $X \in \mathbb{R}^{N \times D}$ is rearranged and converted back to a two-dimensional spatial feature map $X_{2D} \in \mathbb{R}^{H \times W \times D}$. Then, depthwise separable convolutions $\mathrm{DWConv}_k$ with different kernel sizes $k$ are applied to extract spatial features, and the features at the three scales are flattened back into one-dimensional sequences and summed element by element. Finally, a residual connection with the original input $X$ yields the feature representation $X_{\mathrm{loc}}$ that includes local spatial topological information:

$$X_{\mathrm{loc}} = X + \sum_{k \in \{1,3,5\}} \mathrm{Flatten}\big(\mathrm{DWConv}_k(X_{2D})\big)$$

where $N$ is the number of instances, $D$ is the feature dimension, and $k$ indicates the size of the convolution kernel.

6. The multi-instance learning classification method based on bidirectional Mamba and adaptive masking according to claim 1, characterized in that step 3 specifically comprises: To adaptively fuse local priors and global semantics, a progressive gating logic is designed to control the information flow between feature channels, and its formal expression is as follows. First, the bidirectional Mamba model outputs the forward scan features $Y_{\mathrm{fwd}}$ and the backward (flipped) scan features $Y_{\mathrm{bwd}}$. The two are concatenated along the feature dimension to generate the direction-independent global temporal feature $X_{\mathrm{glob}}$:

$$X_{\mathrm{glob}} = [\,Y_{\mathrm{fwd}} \,\|\, Y_{\mathrm{bwd}}\,]$$

Then, the dynamic soft mask gating matrix $G$ that controls the information exchange is calculated:

$$G = \sigma\big(W_g\,[\,X_{\mathrm{loc}} \,\|\, X_{\mathrm{glob}}\,] + b_g\big)$$

where $W_g$ denotes the learnable weight parameters, $b_g$ is the bias that controls the initial feature preference, $[\,\cdot \,\|\, \cdot\,]$ indicates the feature concatenation operation, and $\sigma$ is the Sigmoid activation function. Finally, the gating mechanism merges the local spatial features $X_{\mathrm{loc}}$ with the global temporal features $X_{\mathrm{glob}}$ and updates them into the context-enhanced feature representation $Z$:

$$Z = G \odot X_{\mathrm{loc}} + (1 - G) \odot X_{\mathrm{glob}}$$

where $\odot$ represents the Hadamard product (element-by-element multiplication).

7. The multi-instance learning classification method based on bidirectional Mamba and adaptive masking according to claim 1, characterized in that step 4 specifically comprises: To extract discriminative and diverse bag-level representations, multi-branch gated attention and dynamic probabilistic masks are combined, and their formal expression is as follows. First, for the $i$-th instance feature $z_i$ in the $k$-th branch, its unnormalized attention score $a_{k,i}$ is calculated:

$$a_{k,i} = w_k^{\top}\big(\tanh(V z_i) \odot \sigma(U z_i)\big)$$

where $V$ and $U$ are feature mapping matrices shared by all branches, and $w_k$ is the attention aggregation vector unique to the $k$-th branch. Then, the adaptive masking module calculates the masking ratio based on the overall scores and identifies the high-scoring instances. A probability function incorporating the learned temperature parameter is introduced to perform the masking operation, setting the scores of the selected instances to the minimum value and yielding the masked scores $\tilde{a}_{k,i}$. The scores obtained after masking are then normalized using Softmax to obtain the final attention weights $\alpha_{k,i}$:

$$\alpha_{k,i} = \frac{\exp(\tilde{a}_{k,i})}{\sum_{j=1}^{N} \exp(\tilde{a}_{k,j})}$$

Finally, the pattern embedding representation $m_k$ of each branch is generated by weighted summation, and the mean over all $K$ branches yields the global bag feature $B$:

$$m_k = \sum_{i=1}^{N} \alpha_{k,i}\, z_i, \qquad B = \frac{1}{K} \sum_{k=1}^{K} m_k$$

This global bag feature $B$ is then fed into a multilayer perceptron (MLP) and mapped to the final disease prediction probability score.