A large-scale image efficient feature extraction and multi-scale information modeling method and system

By employing window scale decay attention and region gating mechanisms in a cascaded architecture, the problems of high computational complexity and poor scale adaptability in large-scale image analysis are solved, achieving efficient feature extraction and multi-scale information modeling, and improving the model's adaptability and accuracy.

CN121544951B · Active · Publication Date: 2026-06-19 · BEIJING UNIV OF POSTS & TELECOMM

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING UNIV OF POSTS & TELECOMM
Filing Date
2025-12-10
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies have high computational complexity in large-scale image analysis, making it difficult to meet the computational and memory requirements of large-scale images. At the same time, fixed-scale mechanisms cannot fully capture lesion features when faced with different image region scales and structural diversity, affecting the model's generalization ability.

Method used

A cascaded architecture is adopted, including a window scale decay attention (WSDA) module and a squeeze-excitation-based region gating (SERG) mechanism. The computational complexity is reduced by feature clustering and hierarchical sampling strategies, and local correlations are modeled by a multi-scale window attention mechanism to dynamically adjust region weights to enhance feature extraction of key regions.

Benefits of technology

It significantly reduces the computational complexity of the self-attention mechanism, improves the model's adaptability and accuracy to pathological structures at different scales, and enhances the robustness and computational efficiency of large-scale image classification.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention relates to the field of image feature extraction technology, and in particular to a method and system for efficient feature extraction and multi-scale information modeling of large-scale images. The method includes: acquiring image data; preprocessing the acquired image data; extracting features from the preprocessed image; constructing a WSI classification model using a cascaded architecture; training the constructed WSI classification model; and performing classification prediction using the trained WSI classification model. This invention significantly reduces the computational complexity of the self-attention mechanism through feature clustering and hierarchical sampling strategies in the Window Scale Decaying Attention (WSDA) module, enabling the model to efficiently process large-scale images containing tens of thousands of instances. Simultaneously, by modeling local correlations from coarse to fine through a multi-scale window attention mechanism, it effectively improves the model's adaptability to pathological structures at different scales.

Description

Technical Field

[0001] This invention relates to the field of image feature extraction technology, and in particular to a method and system for efficient feature extraction and multi-scale information modeling of large-scale images.

Background Technology

[0002] With the rapid development of high-resolution imaging technology, large-scale images have been widely used in various fields such as medicine, remote sensing, and materials science. The digitization of large-scale images has provided abundant data resources and research opportunities for deep learning-driven automated analysis. However, these images often have ultra-high resolution (up to billions of pixels) and lack pixel-level annotations, posing a significant challenge to traditional supervised learning methods. To address these issues, recent research has explored weakly supervised learning paradigms, particularly the potential of Multiple Instance Learning (MIL) methods in large-scale image analysis. MIL requires no pixel or region-level labels, relying solely on image-level labels for modeling, and is therefore widely used in large-scale image classification, segmentation, and retrieval tasks. Taking whole-slide images (WSI) in digital pathology as an example, the MIL method effectively improves diagnostic efficiency and robustness by dividing the entire slice into multiple patches and performing end-to-end training during feature extraction and aggregation. Meanwhile, the Transformer architecture has demonstrated powerful capabilities in modeling nonlinear relationships between instances, gradually becoming a new mainstay in large-scale image modeling. Specifically, Transformer-based frameworks typically introduce multi-layered self-attention mechanisms between frozen feature extractors and aggregators to uncover complex semantic dependencies between patches. However, the main bottleneck of these methods lies in their computational complexity of O(N²), making it difficult to adapt to the significant computational and memory burden brought about by tens of thousands of patches in large-scale images. 
For example, in WSI analysis, each image may contain tens of thousands of patches, making it difficult to directly deploy conventional global self-attention mechanisms. Furthermore, applying the same attention mechanism to all patches globally may lead to feature homogenization, weakening the importance of key local regions. Existing works mostly employ fixed-window local self-attention strategies to alleviate computational pressure and model patch dependencies in local regions. However, this fixed-scale mechanism has significant limitations when facing the diversity of regional scales and structures in different images. Taking tumor lesions as an example, their spatial distribution and morphology are highly heterogeneous among different patients, and a fixed receptive field cannot fully capture lesion features, affecting the model's generalization ability. Furthermore, many structures in reality (such as tumor cells) exhibit clustered distributions, and the correlation between patches within these clusters should decrease with spatial distance. However, the current approach of uniform modeling within a window fails to reflect this spatial prior. In summary, large-scale image modeling urgently requires a method that simultaneously possesses computational efficiency, spatial adaptability, and dynamic correlation modeling capabilities. This method should reduce dependence on computational resources while improving the model's responsiveness to key local structures and overall predictive performance. Taking large-scale image analysis tasks, such as WSI, as a practical scenario, we propose a novel model framework with an efficient attention mechanism to address the aforementioned challenges and promote the research and application of intelligent large-scale image analysis.

Summary of the Invention

[0003] To address the aforementioned problems, this invention provides a method and system for efficient feature extraction and multi-scale information modeling of large-scale images.

[0004] Firstly, the present invention provides a method for efficient feature extraction and multi-scale information modeling of large-scale images, employing the following technical solution:

[0005] A method for efficient feature extraction and multi-scale information modeling of large-scale images includes:

[0006] Acquire image data;

[0007] Perform data preprocessing on the acquired image data;

[0008] Feature extraction is performed on the preprocessed image;

[0009] A cascaded architecture is used to construct the WSI classification model;

[0010] Train the constructed WSI classification model;

[0011] The trained WSI classification model is used for classification prediction.

[0012] Thirdly, the present invention provides a computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor of a terminal device to execute the method for efficient feature extraction and multi-scale information modeling of large-scale images described above.

[0013] Fourthly, the present invention provides a terminal device, including a processor and a computer-readable storage medium, wherein the processor is used to execute instructions, and the computer-readable storage medium is used to store a plurality of instructions adapted to be loaded by the processor to execute the method for efficient feature extraction and multi-scale information modeling of large-scale images described above.

[0014] In summary, the present invention has the following beneficial technical effects:

[0015] This invention significantly reduces the computational complexity of the self-attention mechanism by using feature clustering and hierarchical sampling strategies in the Window Scale Decaying Attention (WSDA) module, enabling the model to efficiently process large-scale images containing tens of thousands of instances. At the same time, by modeling local correlations from coarse to fine through the multi-scale window attention mechanism, the model's adaptability to pathological structures at different scales is effectively improved.

[0016] This invention introduces a squeeze-excitation-based region gating (SERG) mechanism, which learns adaptive gating weights for region blocks to enhance the features of discriminative regions and suppress redundant regions, thereby strengthening the global region-level information modeling capability and improving the model's focus on key pathological regions.

[0017] This invention organically combines feature extraction, WSDA, SERG, and aggregation classification modules through a cascaded architecture. While maintaining the stability of feature extraction, it effectively models the complex semantic dependencies between instances, significantly improving the accuracy and robustness of large-scale image classification, while reducing the computational resource requirements.

Attached Figure Description

[0018] Figure 1 is a schematic diagram of the overall architecture of the efficient feature extraction and multi-scale information modeling method for large-scale images according to Embodiment 1 of the present invention.

[0019] Figure 2 is a schematic diagram of the decaying window Transformer and the Squeeze-Excitation-based region gating module proposed in Embodiment 1 of the present invention.

[0020] Figure 3 is a schematic diagram of image data preprocessing in Embodiment 1 of the present invention.

Detailed Implementation

[0021] The present invention will be further described in detail below with reference to the accompanying drawings.

[0022] Embodiment 1

[0023] Referring to Figure 1, this embodiment presents an efficient feature extraction and multi-scale information modeling method for large-scale images. Specifically, in the feature extraction stage, the WSI image is segmented into non-overlapping small blocks at 20x magnification, and instance features are extracted using a pre-trained feature extractor. Next, in the model training stage, the extracted features are input into an attention module based on window scale decay. This module eliminates redundant block-level features through a clustering sampling strategy and then uses a decaying window Transformer to accurately model the instance correlations within tumor regions at different scales. Subsequently, all region features are processed by a squeeze-excitation-based region gating module, which dynamically assigns weights to different window regions to capture global cross-region dependencies. Finally, bag-level label prediction is completed using a full-slice-level attention aggregator and a linear classifier.

[0024] Specifically:

[0025] S1. Acquire image data;

[0026] S1-1 Data Sources and Imaging

[0027] This step aims to clarify the specific technical specifications and acquisition standards for digital pathology image data applicable to this method. This method primarily targets large-scale digital pathology images at the gigapixel level, with whole-slide images (WSI) as the core processing object. Regarding staining types, the standard workflow supports the most widely used hematoxylin-eosin (H&E) staining, which clearly displays the morphological characteristics of cell nuclei and cytoplasm. Simultaneously, this method is designed with good scalability, compatible with WSI data from various staining types, including immunohistochemical (IHC) staining, special staining, and even multispectral imaging, to adapt to different pathological diagnostic and research needs. Regarding scanning magnification, a default optical magnification of 20x is used as the core analysis scale, achieving a good balance between cell morphological details and overall tissue structure. To support multi-resolution analysis, this method is compatible with multi-level pyramid image structures including 5x, 10x, 20x, and 40x magnification, allowing for cross-scale information fusion when necessary. The imaging equipment must be a clinical-grade digital slide scanner that has undergone rigorous quality control. Its core requirement is that it must provide an optical magnification of at least 20x; currently, mainstream equipment typically offers two standard configurations: 20x and 40x. Regarding image resolution and file format, a typical WSI image can reach tens of thousands of pixels by tens of thousands of pixels, forming gigapixel-level ultra-large image files. Supported common formats include, but are not limited to, professional pathology image formats with multi-layered pyramid structures such as SVS and TIFF. These formats can efficiently store and manage image data at different resolution levels, facilitating subsequent processing.

[0028] S1-2 Labels and Metadata

[0029] This step aims to define the types and sources of annotation information required for model training and evaluation. This method employs a weakly supervised learning paradigm, requiring only slice-level category labels for model training, eliminating the need for extensive pixel-level or region-level fine-grained annotation. These weakly supervised labels are directly derived from diagnostic reports issued by pathologists, commonly including binary positive / negative classifications (e.g., the presence or absence of tumor metastasis) or multi-class subtype diagnoses (e.g., differentiating between invasive ductal carcinoma (IDC) and invasive lobular carcinoma (ILC)). A key advantage is that this method is completely independent of any form of pixel-level annotation, such as lesion boundary delineation or cell nuclear segmentation masks. This significantly reduces the cost and barrier to data annotation, enabling rapid deployment in real-world clinical scenarios.

[0030] S1-3 Dataset Examples and Splitting

[0031] This step aims to explain how to construct specific datasets for training and testing, and to formulate a reasonable data partitioning scheme. To verify the effectiveness and generalization ability of this method, the study used two publicly available authoritative digital pathology datasets as benchmarks. The CAMELYON16 dataset focuses on lymph node breast cancer metastasis detection, containing a total of 399 WSIs, of which 239 are positive samples containing metastatic lesions and 160 are negative samples. According to the official partitioning, 270 WSIs were used for training (162 positive and 108 negative), and 129 were used for testing (77 positive and 52 negative). This method strictly follows its weakly supervised setting, using only slide-level positive / negative labels. The TCGA-BRCA dataset focuses on breast cancer subtype classification, containing 977 breast cancer diagnostic images (WSIs), including 779 invasive ductal carcinoma (IDC) and 198 invasive lobular carcinoma (ILC) samples. To ensure the rigor of the evaluation and prevent patient data leakage, images were randomly split according to patient identification, with 780 WSIs assigned to the training set and 197 to the test set. These two datasets represent classification tasks of different difficulty and focus, together forming a reliable benchmark for evaluating the performance of this method.
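A patient-level split of the kind described for TCGA-BRCA can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the function name `split_by_patient`, the `(slide_id, patient_id)` record layout, the 80% training fraction, and the seed are all assumptions made for the example.

```python
import random
from collections import defaultdict

def split_by_patient(slides, train_fraction=0.8, seed=0):
    """Split (slide_id, patient_id) records so that no patient spans both sets."""
    by_patient = defaultdict(list)
    for slide_id, patient_id in slides:
        by_patient[patient_id].append(slide_id)

    patients = sorted(by_patient)          # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(patients)  # shuffle patients, not individual slides

    n_train = round(train_fraction * len(patients))
    train = [s for p in patients[:n_train] for s in by_patient[p]]
    test = [s for p in patients[n_train:] for s in by_patient[p]]
    return train, test

# Hypothetical toy data: 10 patients with 2 slides each.
slides = [(f"slide_{p}_{k}", f"patient_{p}") for p in range(10) for k in range(2)]
train_ids, test_ids = split_by_patient(slides)
```

Because the shuffle operates on patient identifiers, every slide from a given patient lands on the same side of the split, which is what prevents leakage between training and test data.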

[0032] S2. Perform data preprocessing on the acquired image data;

[0033] After obtaining the raw WSI data, a systematic preprocessing procedure is required to adapt it to the input requirements of subsequent feature extraction and model training. Figure 3 illustrates this preprocessing. A WSI typically contains a large number of background areas (such as glass slide substrates, unevenly stained areas, or blank areas outside the tissue), so directly processing the entire image introduces a large amount of redundant information and increases the computational burden. Therefore, a threshold-based image segmentation method is first used to extract the foreground tissue region: typically, color features are used to binarize the low-resolution overview image to generate a mask for the tissue region. This allows for accurate differentiation between diagnostically valuable tissue regions and the background.

[0034] ,
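The threshold-based foreground extraction can be sketched as below. This is only an illustration under stated assumptions: the text does not say which color feature or threshold is used, so the saturation feature, the fixed threshold of 0.1, and the function name `tissue_mask` are hypothetical choices.

```python
import numpy as np

def tissue_mask(thumbnail_rgb, sat_threshold=0.1):
    """Binarize a low-resolution RGB overview (H, W, 3, uint8) into a tissue mask."""
    rgb = thumbnail_rgb.astype(np.float64) / 255.0
    c_max = rgb.max(axis=-1)
    c_min = rgb.min(axis=-1)
    # HSV-style saturation: near 0 for white or gray background,
    # larger for stained tissue. The threshold is an assumed value.
    saturation = (c_max - c_min) / np.maximum(c_max, 1e-12)
    return saturation > sat_threshold

# Toy overview: white background with a pinkish "tissue" square in the middle.
thumb = np.full((8, 8, 3), 255, dtype=np.uint8)
thumb[2:6, 2:6] = (200, 120, 170)
mask = tissue_mask(thumb)
```

In practice the threshold is often chosen adaptively (for example with Otsu's method) rather than fixed; the fixed value here only keeps the sketch short.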

[0035] Subsequently, the original high-resolution WSI is cropped based on the binary mask, retaining only the tissue area, which effectively reduces the number of pixels required for subsequent processing. Next, the retained tissue area is divided into a set of image patches, with the stride set equal to the patch size so that the patches do not overlap. The default parameters are a magnification of 20× and a patch size of 256×256 pixels. Finally, all preprocessed image patches are converted into high-dimensional feature vectors by a pre-trained feature extraction network (Virchow, a pathology foundation model); the feature dimension is 1280 for the Virchow model.
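Non-overlapping tiling restricted to tissue can be sketched as follows. The 256-pixel patch size and the stride-equals-patch-size rule come from the text; the function name `patch_grid`, the `min_tissue` fraction of 0.5, and the assumption that the mask has the same resolution as the image being tiled are illustrative choices, not part of the described method.

```python
import numpy as np

def patch_grid(mask, patch_size=256, min_tissue=0.5):
    """Return (top, left) corners of non-overlapping patches that lie on tissue.

    `mask` is a boolean tissue mask at the resolution being tiled. The stride
    equals `patch_size`, so patches never overlap.
    """
    coords = []
    height, width = mask.shape
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            window = mask[top:top + patch_size, left:left + patch_size]
            # Keep the patch only if enough of it is tissue (assumed rule).
            if window.mean() >= min_tissue:
                coords.append((top, left))
    return coords

# Toy example: a 1024 x 1024 mask whose left half is tissue.
mask = np.zeros((1024, 1024), dtype=bool)
mask[:, :512] = True
coords = patch_grid(mask)
```

A real pipeline would usually compute the mask on a low-resolution overview and scale the coordinates up to the 20× level, rather than holding a full-resolution mask in memory.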

[0036] S3. Perform feature extraction on the preprocessed image;

[0037] After introducing the image data preprocessing, this section details the feature extraction process. This process aims to transform the visual information of image patches into high-dimensional semantic feature representations, providing input for subsequent multi-instance learning modeling. In a specific embodiment of this invention, we explicitly use the Virchow model, pre-trained on a large amount of pathological data, as the core encoder. Its advantage lies in the high specificity and discriminative power of the extracted features for pathological morphology (such as nuclear atypia, glandular structure, etc.). First, the encoder is loaded and configured. We load the pre-trained Virchow model and modify its output layer to directly output high-dimensional feature vectors instead of classification probabilities. The encoder is set to a frozen state, meaning that none of its weights are updated in subsequent training. This aims to maintain the stability of feature extraction, prevent overfitting, and significantly reduce the number of trainable parameters of the model. The Virchow encoder has a preset input size of 224×224 pixels and an output feature dimension of 1280. Then comes image patch preprocessing and batch loading, which operates on the 256×256 pixel blocks produced by the preprocessing stage. First, the image patches are cropped or scaled to 224×224 pixels to fit the input requirements of the Virchow encoder. The image patches are then standardized using the mean and standard deviation employed during Virchow model pre-training. The processed image patches are then batched and fed into the encoder for forward propagation. For each input image patch, the frozen Virchow encoder performs forward propagation to map it into a 1280-dimensional feature vector. This process is executed for all n image patches of a WSI to generate the corresponding instance-level feature set.
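The patch preparation step (256×256 to 224×224, then standardization) can be sketched as below. This is a hedged illustration: center cropping is one of the two options the text allows, the function name `prepare_patch` is hypothetical, and the `MEAN` and `STD` values are the common ImageNet statistics used here only as placeholders, since the text does not list the values used in Virchow pre-training.

```python
import numpy as np

# Placeholder statistics (the widely used ImageNet values). The real values
# should be whatever the encoder's pre-training used.
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

def prepare_patch(patch_rgb, out_size=224):
    """Center-crop an (H, W, 3) uint8 patch to out_size and standardize it."""
    height, width, _ = patch_rgb.shape
    top = (height - out_size) // 2
    left = (width - out_size) // 2
    cropped = patch_rgb[top:top + out_size, left:left + out_size]
    scaled = cropped.astype(np.float32) / 255.0
    normalized = (scaled - MEAN) / STD
    # Channels-first layout, as most deep-learning encoders expect.
    return normalized.transpose(2, 0, 1).astype(np.float32)

patch = np.zeros((256, 256, 3), dtype=np.uint8)
x = prepare_patch(patch)
```

The resulting array has shape (3, 224, 224) and would be stacked with other patches into a batch before being passed to the frozen encoder.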

[0038] S4. Model Building

[0039] The core network of this invention adopts a cascaded architecture, consisting sequentially of a Window Scale Decaying Attention (WSDA) module, an SE Region Gating (SERG) module, a bag-level attention aggregator, and a linear classifier. This design aims to gradually extract highly discriminative bag-level representations from instance-level features through multi-stage refinement processing, ultimately achieving accurate WSI classification.

[0040] S4-1 Problem Formalization and MIL Process Framework Establishment

[0041] This step aims to formalize the whole-slice image classification task as a multi-instance learning problem and construct a complete mathematical model and processing flow. Given a dataset containing N WSI images, each paired with a slice-level label, our goal is to learn, in a weakly supervised setting, a mapping function from a WSI to a predicted label such that the predicted value approximates the true label as accurately as possible. This clarifies the core setting of the present invention: using only slice-level labels for learning and inference.

[0042] Under the MIL paradigm, each WSI is divided into n non-overlapping patches, forming a "bag". Following the standard MIL assumption, a bag is considered positive if it contains at least one positive instance; a bag is considered negative only if all instances are negative. This assumption allows the model to be trained using only bag-level labels, eliminating the need for expensive pixel-level annotations. Each instance is encoded as a feature vector by a pre-trained feature extractor, and all instance features constitute a set. Subsequently, a permutation-invariant aggregator maps the instance feature set to a bag-level representation, and the classifier finally outputs the prediction. This process clearly distinguishes three core capabilities: instance feature extraction, inter-instance relationship modeling, and final classification.
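The standard MIL assumption stated in this paragraph amounts to a logical OR over instance labels. The short sketch below makes that explicit; the function name `bag_label` and the toy label lists are illustrative, and in real training the instance labels are never observed (only the bag label is).

```python
def bag_label(instance_labels):
    """Standard MIL assumption: a bag is positive iff any instance is positive."""
    return int(any(label == 1 for label in instance_labels))

positive_bag = bag_label([0, 0, 1, 0])   # a single positive instance suffices
negative_bag = bag_label([0, 0, 0, 0])   # negative only if every instance is negative
shuffled_bag = bag_label([1, 0, 0, 0])   # instance order does not matter
```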

[0043] Traditional MIL methods mostly introduce weighting mechanisms only during the aggregation phase, making it difficult to effectively capture spatial and semantic dependencies between instances before aggregation. While Transformer-based methods can model instance dependencies before aggregation, their O(N²) computational complexity presents significant computational and memory challenges when handling WSIs containing tens of thousands of instances, and their fixed attention scale struggles to adapt to scale variations in tumor regions across different cases. Furthermore, task-relevant instances in WSIs are typically extremely sparse, and global average attention often dilutes the impact of key local features. To address these issues, this invention introduces two key enhancement modules into the traditional workflow. First, a window-scale decay attention module is inserted between the feature extractor and the aggregator to reduce redundancy through a clustering-driven sampling strategy and to characterize the dependencies between instances using a multi-scale window attention mechanism. Second, an SE region gating module is inserted before the aggregator to enhance global region-level information modeling by learning the gating weights of region blocks. The overall data flow follows the pipeline of "feature extraction → WSDA → SERG → WSI-level aggregation → linear classification", forming a complete WSD-MIL framework.

[0044] S4-2 WSDA Module: Multi-scale Attention Modeling Based on Feature Clustering and Decaying Windows

[0045] This step aims to address the high computational complexity and poor scale adaptability issues in WSI processing through a window-scale decaying attention module. The WSDA module employs a two-stage processing strategy: first, it significantly reduces the number of instances through hierarchical sampling based on feature clustering while maintaining the diversity of organizational morphology; then, it uses a decaying window multi-head self-attention mechanism combined with Nystrom approximation to model multi-scale local correlations from coarse to fine.

[0046] In the feature clustering and hierarchical sampling stage, image patches acquired at 20x magnification are converted into an instance-level feature set by the pre-trained encoder. This step maps the original pixel space to the semantic embedding space, providing a metric basis for subsequent dependency modeling and sampling. To reduce the computational complexity of subsequent attention processing while preserving histological diversity, K-means clustering is performed on the feature set by minimizing the within-cluster squared error, dividing the features into a fixed number of clusters, each with its own cluster center. The clustering process groups instances with similar morphological characteristics together, laying the foundation for stratified sampling. Within each cluster, random sampling at a fixed proportion yields a subset, and the subsets together form a compact feature pool. This strategy significantly reduces the number of instances, saving computational and memory resources, while maintaining the representativeness of the various morphological patterns and avoiding the loss of rare but crucial lesion features that would result from retaining only common patterns. The recommended parameters are 10 clusters and a sampling rate of 20%. In public experiments, this setting reduces memory consumption by approximately 58% to 62% while maintaining or even slightly improving performance.
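Clustering followed by per-cluster sampling can be sketched as below. This is a minimal illustration, not the implementation behind the reported results: it uses a plain Lloyd's K-means written in NumPy, the function names `kmeans` and `stratified_sample` are hypothetical, and the rule of keeping at least one instance per non-empty cluster is an added assumption meant to protect rare patterns. The 10 clusters and 20% rate are the recommended settings from the text.

```python
import numpy as np

def kmeans(features, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster index for every feature row."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every feature to every center.
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = features[assign == j]
            if len(members) > 0:      # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return assign

def stratified_sample(features, k=10, rate=0.2, seed=0):
    """Sample a fixed proportion of instances from each k-means cluster."""
    rng = np.random.default_rng(seed)
    assign = kmeans(features, k, seed=seed)
    keep = []
    for j in range(k):
        idx = np.flatnonzero(assign == j)
        if len(idx) == 0:
            continue
        n_keep = max(1, int(round(rate * len(idx))))  # assumed floor of one
        keep.extend(rng.choice(idx, size=n_keep, replace=False).tolist())
    keep = np.sort(np.array(keep))
    return features[keep], keep

feats = np.random.default_rng(1).normal(size=(1000, 16))
pool, kept_idx = stratified_sample(feats)
```

Sampling inside each cluster, rather than over the whole bag, is what keeps small clusters represented in the reduced feature pool.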

[0047] In the global attention approximation stage, the Nystrom mechanism is used to efficiently model long-range dependencies. The batch input is linearly projected to obtain the query, key, and value matrices. Regional sampling of the query and key matrices yields a small set of representative queries and keys, and global attention is approximated from them using the Nystrom method, in which the Moore-Penrose generalized inverse of the attention matrix among the representatives is used. This step approximates global relevance using a small number of representative key and query vectors, effectively reducing the computational complexity from O(N²) to approximately linear.
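A Nystrom-style attention approximation can be sketched as follows. This follows the commonly published Nystromformer formulation and is an assumption about the details: the text does not specify how the representative queries and keys are sampled, so segment means are used here, and the names `nystrom_attention` and `n_landmarks` are hypothetical. Learned projections and multiple heads are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def nystrom_attention(q, k, v, n_landmarks=8):
    """Approximate softmax(q k^T / sqrt(d)) v using landmark queries and keys."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    # Landmarks: means over contiguous segments (n must divide evenly here).
    q_l = q.reshape(n_landmarks, n // n_landmarks, d).mean(axis=1)
    k_l = k.reshape(n_landmarks, n // n_landmarks, d).mean(axis=1)

    f = softmax(q @ k_l.T * scale)       # (n, m): all queries vs landmark keys
    a = softmax(q_l @ k_l.T * scale)     # (m, m): landmarks vs landmarks
    b = softmax(q_l @ k.T * scale)       # (m, n): landmark queries vs all keys
    # Moore-Penrose pseudo-inverse of the small landmark matrix.
    return f @ np.linalg.pinv(a) @ (b @ v)

rng = np.random.default_rng(0)
n, d = 64, 16
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
approx = nystrom_attention(q, k, v)
```

With m landmarks, no n × n matrix is ever formed: the largest intermediates are n × m, which is where the near-linear cost in the number of instances comes from.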

[0048] In the decaying-window multi-head attention stage, the feature representation is reshaped into multiple local windows, and local multi-head self-attention is computed sequentially at three successively smaller window scales (typically refining the window size from 16×16 to 4×4). Attention is computed independently within each window, with position encoding implemented through one-dimensional convolution. The output of this layer is then obtained through layer normalization, linear transformation, and residual connections. This decreasing window scale is equivalent to gradually increasing the resolution of the receptive field, achieving a local correlation characterization from coarse to fine, while incorporating relative spatial information through position encoding to ensure that the model can perceive the spatial relationships between instances.
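The coarse-to-fine window schedule can be sketched as below. This is a deliberately stripped-down illustration: it uses a single head with no learned projections, no layer normalization, and no position encoding, and the intermediate window size of 8 is an assumption (the text only states that the window is refined from 16×16 to 4×4 over three scales). The function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(grid, w):
    """Self-attention restricted to non-overlapping w x w windows of an (H, W, d) grid."""
    h, wd, d = grid.shape
    # (H/w, w, W/w, w, d) -> (num_windows, w*w, d)
    x = grid.reshape(h // w, w, wd // w, w, d).transpose(0, 2, 1, 3, 4)
    x = x.reshape(-1, w * w, d)
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(d)
    out = softmax(scores) @ x
    # Undo the window partition.
    out = out.reshape(h // w, wd // w, w, w, d).transpose(0, 2, 1, 3, 4)
    return out.reshape(h, wd, d)

def decaying_window_attention(grid, window_sizes=(16, 8, 4)):
    """Apply windowed attention from the coarsest to the finest window size."""
    for w in window_sizes:
        grid = grid + window_attention(grid, w)   # residual connection
    return grid

tokens = np.random.default_rng(0).normal(size=(32, 32, 8))
out = decaying_window_attention(tokens)
```

Each pass costs on the order of the number of tokens times the window area, so shrinking the window makes later passes cheaper while letting them focus on finer local structure.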

[0049] S4-3 SERG Module: Region Gating and Global Feature Optimization Based on Squeeze-Excitation Mechanism

[0050] Although the previous stage of WSDA has characterized instance-level (patch-level) relevance from "coarse to fine," the contribution of different spatial regions to the final discrimination is significantly uneven at the whole-bag scale: regions containing positive microfoci / metastatic lesions tend to carry stronger discriminative power, while large areas of tissue background or negative regions have lower information density. To address this, this module introduces a region gating mechanism based on squeeze-and-excitation (SE), which learns adaptive weights for each "region block" at the global scale, thereby strengthening high-value regions and suppressing low-information regions, further improving global feature modeling capabilities and the final bag-level discrimination effect. This idea and structure are explicitly proposed as SERG (Squeeze-and-Excitation Region Gate) in this paper and work in conjunction with "Window Scale Decaying Attention (WSDA)" to form a three-level modeling chain of "global-local-region".

[0051] 1. Region division and input representation

[0052] Let the intermediate features output by the WSDA module have shape B×M×F (where B is the batch size, M is the number of instances, and F is the channel dimension). To perform region-level reweighting, the features are divided, by spatial position or instance order, into L×L "region windows". In this invention, L=8 is used, meaning the entire image is uniformly divided into 64 regions. For each region, global average pooling (GAP) is first applied to extract its global representation:

[0053] ,

[0054] The pooled representations of all regions are concatenated to form the set of regional features. Spatial partitioning can be based on the order of instances sorted by WSI coordinates to ensure that adjacent patches are placed into adjacent windows; when the instance distribution is extremely uneven, a lower bound constraint can be imposed on the number of samples within the window, or a padding strategy can be enabled to maintain statistical stability.

[0055] 2. Two-step mapping of "squeeze-excitation": learning region gating weights

[0056] SERG uses the standard SE concept to perform a two-layer mapping on region-level features:

[0057] (1) Squeeze: the region-level features are projected onto a low-dimensional embedding space to yield an intermediate representation:

[0058] ,

[0059] where r is the dimensionality reduction ratio and the activation function is ReLU. This step "compresses" global regional statistics to suppress collinear / redundant information and establish global dependencies between regions.

[0060] (2) Excitation: The low-dimensional representation is then "extended" back to the original dimension, and the final region-gated weight matrix G is obtained through a Sigmoid mapping.

[0061] ,

[0062] Finally, G is reshaped into an L×L region weight map, which is used as a weighting factor in subsequent calculations.

[0063] 3. Regional-level reweighting and feature output

[0064] After obtaining G, the features are weighted at region granularity: each region's weight is multiplied with (broadcast to) all instances within that region to obtain the weighted output features:

[0065] ,

[0066] where ⊙ represents element-wise multiplication at the region level; intuitively, all instance representations in high-weight regions are "amplified," while those in low-weight regions are "compressed," thus highlighting discriminative regions and suppressing redundant background at the global level. The weighted features are then fed into subsequent network layers, and finally, bag-level label prediction is achieved through WSI-level attention aggregation and a linear classifier.
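The three SERG steps (region pooling, squeeze-excitation, region-wise reweighting) can be sketched as below. Because the formulas are not available in this text, the sketch rests on one plausible reading that should be treated as an assumption: each region is pooled to a single scalar, the squeeze and excitation layers act across the 64 region scalars, and the resulting 64 gates are reshaped to 8×8. The function name `serg`, the reduction ratio r = 4, the random weights, and the omission of the batch dimension are all illustrative.

```python
import numpy as np

def serg(x, w1, w2, side=8):
    """Gate instance features x of shape (M, F) with one weight per region."""
    n_regions = side * side
    m, f = x.shape
    # Contiguous chunks of spatially ordered instances act as regions
    # (M must be divisible by the number of regions in this sketch).
    regions = x.reshape(n_regions, m // n_regions, f)
    z = regions.mean(axis=(1, 2))            # global average pooling per region
    s = np.maximum(z @ w1, 0.0)              # squeeze to n_regions / r, then ReLU
    g = 1.0 / (1.0 + np.exp(-(s @ w2)))      # excite back to n_regions, then sigmoid
    gated = regions * g[:, None, None]       # broadcast each weight over its region
    return gated.reshape(m, f), g.reshape(side, side)

rng = np.random.default_rng(0)
x = rng.normal(size=(640, 32))               # 640 instances, 10 per region
r = 4                                        # assumed reduction ratio
w1 = rng.normal(size=(64, 64 // r)) * 0.1    # stand-ins for learned weights
w2 = rng.normal(size=(64 // r, 64)) * 0.1
y, gate = serg(x, w1, w2)
```

Every instance in a region is scaled by the same gate value, so the module changes the relative emphasis between regions without altering the feature pattern inside any one region.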

[0067] S5. Model Training

[0068] This step aims to establish the optimization objective and corresponding loss function for model training. For the C-class classification task across the entire WSI dataset, cross-entropy loss is used as the training objective function, mathematically expressed as:

[0069] $$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log p_{i,c},$$

[0070] where $y_{i,c}$ is the one-hot encoding of the true label of the $i$-th sample on category $c$, $p_{i,c}$ is the probability predicted by the model that the $i$-th sample belongs to category $c$, and $N$ is the number of training samples. This loss function effectively measures the difference between the model's predicted probability distribution and the true label distribution, and drives model parameter updates through gradient descent, allowing the model's predictions to continuously approach the true situation. In binary classification scenarios (such as cancer / normal), this formula simplifies to binary cross-entropy loss, while still maintaining the same optimization principle.
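The cross-entropy objective, and its reduction to binary cross-entropy for two classes, can be checked with a small sketch. The function names and the toy probabilities are illustrative only.

```python
import math

def cross_entropy(probs, labels):
    """Mean cross-entropy for per-class probabilities and integer class labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        total -= math.log(p[y])   # the one-hot label keeps only the true-class term
    return total / len(labels)

def binary_cross_entropy(p_positive, labels):
    """Mean binary cross-entropy given the predicted probability of class 1."""
    total = 0.0
    for p, y in zip(p_positive, labels):
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)

probs = [[0.9, 0.1], [0.2, 0.8]]   # two samples, two classes
labels = [0, 1]
loss = cross_entropy(probs, labels)
loss_binary = binary_cross_entropy([p[1] for p in probs], labels)
```

For two classes the two functions agree, because the probability of class 0 is one minus the probability of class 1.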

[0071] S5-2 Optimizer Selection and Hyperparameter Configuration

[0072] This step aims to determine the optimization algorithm for model parameter updates and its associated hyperparameter settings. The Adam optimizer is chosen for parameter updates. This optimizer combines the advantages of momentum and adaptive learning rates, exhibiting stability in non-stationary objectives and sparse gradient scenarios. The learning rate is set to 1e-5, employing a fixed learning rate strategy to avoid instability in the early stages of training while ensuring convergence accuracy in later stages. The batch size is set to 1, meaning a single WSI image constitutes a training batch, which aligns with the characteristics of WSI data and fully utilizes GPU memory. The total number of training epochs is set to 100 to ensure sufficient model convergence while preventing overfitting.

[0073] Regarding the sampling strategy, the sampling rate is set to α; α = 20% is recommended, as experiments on public datasets have shown that this setting significantly improves training efficiency while maintaining classification accuracy. The window attention mechanism starts from a base window size of 16×16 and applies a three-scale decreasing strategy, gradually refining the window size from 16×16 to 4×4, which enables adaptive modeling of tumor regions at different scales. The region gating module divides the feature map into L×L regions with L = 8, that is, 8×8 = 64 region blocks, balancing global information capture and computational complexity. In the feature clustering stage, the number of clusters is set to K = 10, so that the instance features are divided into 10 semantic clusters for stratified sampling. All of the above hyperparameter configurations have been validated on public datasets and can be used as default configurations.
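
For reference, the default settings from this step can be collected in one configuration object. The class and field names below are hypothetical, and the intermediate window size of 8 is an assumption: the text states only that the window is refined from 16×16 to 4×4 over three scales.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class WSDMILConfig:
    """Default hyperparameters as described in S5-2 (names are illustrative)."""
    learning_rate: float = 1e-5      # fixed learning rate, Adam optimizer
    batch_size: int = 1              # one WSI per training batch
    epochs: int = 100
    sampling_rate: float = 0.20      # per-cluster sampling rate (alpha)
    window_sizes: Tuple[int, ...] = (16, 8, 4)  # middle value 8 is assumed
    regions_per_side: int = 8        # L, giving L x L region blocks
    num_clusters: int = 10           # K for K-means

    @property
    def num_regions(self) -> int:
        """Total number of region blocks, L * L."""
        return self.regions_per_side ** 2
```

A frozen dataclass keeps the defaults immutable, so an experiment that deviates from them has to construct a new, explicitly different configuration.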

[0074] S5-3 End-to-End Training Process Execution

[0075] This step details the complete training process from data input to model parameter updates. First, data preprocessing and image patch segmentation are performed: for each training WSI image, threshold-based foreground region segmentation is performed to remove background interference; at 20x magnification, the foreground region is divided into non-overlapping image patches of 256×256 pixels, which form the set of instances of the bag.
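
Non-overlapping tiling over a foreground mask can be sketched as below. The binary mask is assumed to be given at the same resolution as the patch grid, and the rule for keeping a patch (a minimum foreground fraction, `min_fg`) is a hypothetical choice that the text does not specify.

```python
def tile_foreground(mask, patch, min_fg=0.5):
    """Return top-left (row, col) coordinates of non-overlapping patches.

    mask:   2D list of 0/1 values, 1 marking foreground tissue.
    patch:  patch side length in pixels; the stride equals the patch size.
    min_fg: minimum foreground fraction for a patch to be kept (assumption).
    """
    height, width = len(mask), len(mask[0])
    coords = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            foreground = sum(
                mask[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            )
            if foreground / (patch * patch) >= min_fg:
                coords.append((top, left))
    return coords
```

In the setting described above, `patch` would be 256 and the returned coordinates would index the instances of one bag.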

[0076] Next, instance-level feature extraction is performed: deep features are extracted for each image patch using a pre-trained ResNet-50 or Virchow encoder. The encoder parameters are kept frozen throughout the training process to ensure the stability of feature extraction and reduce the number of trainable parameters.

[0077] Then, feature clustering and stratified sampling are performed: K-means clustering is applied to all instance features of a single WSI, and instances are divided into K clusters based on feature similarity; within each cluster, random sampling at the preset sampling rate α is performed to construct a compact subset of features, which significantly reduces the computational burden while maintaining the diversity of tissue morphology.
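
A small, dependency-free sketch of this clustering and per-cluster sampling step follows. The plain Lloyd iteration, the fixed iteration count, the seeds, and the rule of keeping at least one instance per non-empty cluster are all illustrative assumptions.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def stratified_sample(labels, k, rate, seed=0):
    """Sample a fixed proportion of instance indices from every cluster."""
    rng = random.Random(seed)
    keep = []
    for j in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == j]
        if not idx:
            continue
        n_keep = max(1, math.ceil(rate * len(idx)))  # keep at least one (assumption)
        keep.extend(rng.sample(idx, n_keep))
    return sorted(keep)
```

Sampling within each cluster, rather than over the whole bag, is what preserves rare tissue patterns: a small cluster still contributes instances to the compact subset.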

[0078] Then, the WSDA module performs forward propagation: the Nyström attention mechanism is first applied to the sampled features to approximate global dependencies, and multi-head self-attention is then computed sequentially at three decreasing window scales to model local instance correlations from coarse to fine, with spatial structure information incorporated through convolutional position encoding.
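
The coarse-to-fine window stage can be illustrated with a heavily simplified sketch. It is not the WSDA module itself: it uses a single head, identity query/key/value projections, windows of consecutive instances in a 1D sequence, and a plain residual connection, and it omits the Nyström stage, the convolutional position encoding, and layer normalization. The window sizes (16, 8, 4) include an assumed middle value.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def window_self_attention(x, window):
    """Scaled dot-product self-attention restricted to consecutive windows."""
    dim = len(x[0])
    out = []
    for start in range(0, len(x), window):
        block = x[start:start + window]
        for q in block:
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in block]
            weights = softmax(scores)
            out.append([sum(w * v[c] for w, v in zip(weights, block)) for c in range(dim)])
    return out


def decaying_window_attention(x, windows=(16, 8, 4)):
    """Apply windowed attention at successively smaller window sizes."""
    for w in windows:
        attended = window_self_attention(x, w)
        # Residual connection around each attention stage.
        x = [[a + b for a, b in zip(xi, ai)] for xi, ai in zip(x, attended)]
    return x
```

Each stage narrows the context an instance can attend to, so early stages mix information across larger neighborhoods and later stages refine it locally.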

[0079] Next, the SERG module forward propagation is performed: the WSDA output features are divided into L×L region blocks, and global average pooling is applied to each region block to obtain region-level representations. Region gating weights are learned through a squeeze-excitation network consisting of two fully connected layers. The first fully connected layer performs dimensionality reduction, and the second fully connected layer restores the original dimension and generates weight values between 0 and 1 through Sigmoid activation. Finally, the learned region weights are multiplied with the original features at the region level to enhance discriminative regions and suppress redundant regions.

[0080] Then, bag-level aggregation and classification output are performed: the weighted instance features are integrated using an attention aggregator, the attention weight of each instance is calculated using a learnable query vector, and the instance features are weighted and summed based on these weights to obtain the bag-level representation; the bag-level representation is input into a linear classifier, and the predicted probability of each class is output through a fully connected layer and a Softmax activation function.
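
Attention aggregation followed by a linear classifier can be sketched as below. The dot-product scoring against a single query vector is one simple reading of "learnable query vector"; the exact scoring function, and the names `query`, `class_weights`, and `class_bias`, are assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_pool(features, query):
    """Weight each instance by softmax(query . h) and sum into one bag vector."""
    weights = softmax([dot(query, h) for h in features])
    dim = len(features[0])
    bag = [sum(w * h[c] for w, h in zip(weights, features)) for c in range(dim)]
    return weights, bag


def classify(bag, class_weights, class_bias):
    """Fully connected layer followed by Softmax over classes."""
    return softmax([dot(w, bag) + b for w, b in zip(class_weights, class_bias)])
```

The attention weights sum to 1, so the bag vector is a convex combination of instance features, and the weights themselves can be inspected to see which instances drove the prediction.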

[0081] Finally, backpropagation and parameter updates are performed: the cross-entropy loss value is calculated based on the model's predicted output and the true label, and the gradient of the loss function with respect to the parameters of each component of the model is calculated using the backpropagation algorithm; the trainable parameters of the WSDA module, SERG module, attention aggregator, and classifier are updated using the Adam optimizer based on the gradient information, while the learning rate is kept at 1e-5.

[0082] Throughout the training process, model selection and early stopping are determined based on validation set performance: after each training epoch, the model performance is evaluated on the validation set, mainly monitoring the Acc, AUC and F1-score metrics; when the validation set performance no longer improves within several consecutive epochs, an early stopping mechanism is triggered to prevent overfitting, and the model parameters that perform best on the validation set are selected as the final training result.
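
The early-stopping rule can be sketched as follows. The text says only "several consecutive epochs", so the `patience` parameter is hypothetical, as is the use of a single scalar validation score per epoch (for example AUC).

```python
def select_best_epoch(val_scores, patience):
    """Return (best_epoch, stop_epoch) for a sequence of validation scores.

    Training stops once the score has failed to improve for `patience`
    consecutive epochs; the parameters from best_epoch are kept.
    """
    best_score = float("-inf")
    best_epoch = -1
    waited = 0
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    # No early stop was triggered; training ran to the last epoch.
    return best_epoch, len(val_scores) - 1
```

For example, with scores `[0.80, 0.85, 0.84, 0.83, 0.86]` and `patience=2`, training stops at epoch 3 and the epoch-1 parameters are kept, even though epoch 4 would have scored higher; a larger patience trades training time for robustness to such dips.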

[0083] S6. Model Inference and Prediction

[0084] S6-1 Inference Flow

[0085] This step details the complete forward computation process of the trained WSD-MIL model on the new WSI data. The data processing flow in the inference phase is largely the same as in the training phase, but backpropagation parameter updates and data augmentation operations are not required, ensuring the efficiency and determinism of the prediction process.

[0086] First, foreground tissue extraction and image patch segmentation are performed: the entire WSI to be inferred is loaded, and the same color threshold-based segmentation algorithm as in the training phase is used to distinguish the foreground tissue region from the background; at a magnification of 20x, the foreground region is systematically divided into non-overlapping image patches of 256×256 pixels to ensure that each image patch contains diagnostically valuable tissue structure and cell morphology information.

[0087] Next, instance-level deep feature extraction is performed: the same pre-trained encoder (ResNet-50 or Virchow) used in the training phase is used to extract feature vectors for each image patch, and the encoder parameters are kept frozen; this step converts visual information into high-dimensional semantic features, providing a unified feature representation basis for subsequent clustering sampling and attention modeling.

[0088] Then, feature space clustering and stratified sampling are performed: all instance features of a single WSI are taken as input, and the K-means clustering algorithm is applied to divide them into K=10 feature clusters. The cluster centers can be directly used from the center points learned during the training phase or quickly approximated by online calculation. Within each feature cluster, stratified random sampling is performed according to the sampling rate α=20% determined during the training phase to construct a representative subset of instance features, which significantly improves computational efficiency while ensuring inference accuracy.

[0089] Then, the forward computation of the WSDA module is performed: the Nyström attention mechanism is first applied to the sampled feature matrix to approximate the global instance dependencies and capture long-range spatial correlation; then, multi-head self-attention is computed sequentially at decreasing window scales to refine the local feature representation from coarse to fine granularity, thus fully reproducing the multi-scale attention modeling process of the training stage.

[0090] Next, the SERG module region reweighting is performed: the refined features output by WSDA are divided into 8×8=64 region blocks in spatial order, and global average pooling is applied to each region to obtain region-level statistical features; the gating weight of each region is calculated through the trained squeeze-excitation network, and the weight is multiplied by all instance features in the corresponding region to strengthen the feature response of discriminative regions and suppress the contribution of non-critical regions.

[0091] Finally, the bag-level representation is aggregated and the classification output is produced: the trained attention aggregator integrates all weighted instance features, and the instance features are weighted and summed using the learned attention weights to generate a compact and highly discriminative bag-level representation vector; this bag-level representation is input into the trained linear classifier, and the probability distribution of the WSI over the categories is calculated through a fully connected layer and a Softmax activation function to complete the final classification prediction.

[0092] S6-2 Predictive Probability Analysis and Class Decision

[0093] This step aims to transform the probability distribution output by the model into specific category predictions, providing interpretable classification decisions. For multi-class classification tasks (such as breast cancer subtype classification), the argmax decision rule is used to determine the final predicted category from the probability distribution output by the model: let the probability vector output by the model be $p = (p_1, \dots, p_C)$, where $p_c$ is the predicted probability that the sample belongs to the $c$-th class; the final predicted label is then $\hat{y} = \arg\max_{c} p_c$, that is, the class with the highest probability value is selected as the prediction result of the model.
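
As a minimal illustration of the argmax rule (using 0-based class indices, and resolving ties in favor of the lowest index, which is a convention the text does not specify):

```python
def predict_class(probs):
    """Return the index of the class with the highest predicted probability."""
    return max(range(len(probs)), key=lambda c: probs[c])
```

For a probability vector such as `[0.1, 0.7, 0.2]`, this selects class index 1.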

[0094] For binary classification tasks (such as cancer / normal classification), a threshold decision mechanism is used to convert continuous probability outputs into discrete class labels: let the positive class probability output by the model be $p_{+}$ and the preset decision threshold be $\tau$ (default 0.5); when $p_{+} \geq \tau$, the sample is classified as positive, otherwise as negative. In practical applications, this threshold can be optimized based on performance on the validation set. By plotting precision-recall curves or ROC curves, the threshold that best balances recall and precision can be selected to suit different clinical application needs. This flexible threshold adjustment allows the classification strictness to be tuned to specific task requirements: in scenarios where sensitivity matters most, the threshold can be lowered to improve the disease detection rate, while in scenarios with high specificity requirements, the threshold can be raised to reduce false positives.
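
The threshold rule and one possible way of tuning it on a validation set can be sketched as below. Choosing the threshold that maximizes F1 is just one concrete criterion for balancing precision and recall, named here as an assumption; the text leaves the selection criterion open. The comparison `>=` is likewise an assumed convention.

```python
def decide(p_pos, tau=0.5):
    """Return 1 (positive) if the positive-class probability reaches tau, else 0."""
    return 1 if p_pos >= tau else 0


def f1_at(probs, labels, tau):
    """F1 score of the thresholded predictions at threshold tau."""
    preds = [decide(p, tau) for p in probs]
    tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(probs, labels):
    """Pick the validation threshold with the highest F1 (illustrative criterion)."""
    candidates = sorted(set(probs))
    return max(candidates, key=lambda t: f1_at(probs, labels, t))
```

Only the observed probabilities need to be tried as candidates, because the predictions change only when the threshold crosses one of them.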

[0095] S7. Experimental Verification

[0096] To verify the effectiveness of the proposed method, we conducted experiments on two widely cited public histopathology datasets: CAMELYON-16 and TCGA-BRCA. Both datasets contain whole-slide images (WSIs) stained with hematoxylin and eosin (H&E) and digitized using a high-resolution scanner. These datasets have become authoritative reference standards in computational pathology because of their widespread use in cancer diagnostic algorithm development and benchmarking. Detailed descriptions of the datasets and experimental protocols are provided below.

[0097] The CAMELYON-16 dataset contains 399 whole-slide images (WSIs) of sentinel lymph node biopsies from breast cancer patients. Of these, 239 slides show metastases (positive) and 160 show no metastases (negative). The official partitioning assigns 270 slides to the training set (162 positive, 108 negative) and 129 slides to the test set (77 positive, 52 negative). Metastases are delineated at the pixel level, and each slide is accompanied by a slide-level label provided by a pathologist. This study uses only slide-level labels and employs a weakly supervised learning paradigm, which avoids the need for pixel-level annotation. A key challenge of the CAMELYON-16 dataset is that metastases are often very small, making precise localization within a large amount of benign tissue particularly difficult.

[0098] The TCGA-BRCA dataset contains diagnostic whole-slide images (WSIs) of 977 cases of primary invasive breast cancer. We focus on two main histological subtypes: invasive ductal carcinoma (IDC, 779 slides) and invasive lobular carcinoma (ILC, 198 slides). Slides were randomly assigned by patient to a training set (780 slides) and a test set (197 slides). Only slide-level subtype labels extracted from pathology reports are provided; region-level annotations are not provided. Compared to CAMELYON-16, tumor regions typically occupy a larger proportion of the tissue in TCGA-BRCA slides. However, significant morphological heterogeneity, abundant adipose stroma, and inter-scanner color variations make subtype classification difficult, making TCGA-BRCA a realistic and challenging benchmark for weakly supervised breast cancer analysis.

[0099] To systematically evaluate the applicability and superiority of the efficient feature extraction and multi-scale information modeling method for large-scale images (WSD-MIL) under different data distributions, feature extractors, and mainstream multi-instance learning (MIL) frameworks, we conducted experiments on two representative public datasets: CAMELYON-16 (with a low proportion of metastatic tissue) and TCGA-BRCA (known for its significant morphological heterogeneity). For both datasets, we used a shared ResNet-50 backbone network to extract whole-slide features; for TCGA-BRCA, we additionally used the pathology foundation model Virchow for feature extraction. All experimental conditions were kept consistent, including training hyperparameters, five-fold cross-validation, and evaluation metrics (accuracy, area under the curve, F1 score). Under these conditions, we compared our method with seven representative baseline models: simple aggregation techniques (mean pooling, max pooling), attention-based MIL models (ABMIL, CLAM, S4-MIL), and fixed-scale Transformer MIL models (TransMIL, R2T-MIL). This progression, from traditional pooling to fine-grained attention mechanisms, to fixed-scale Transformers, and finally to our proposed adaptive decay Transformer, enables us to objectively quantify the advantages of WSD-MIL in modeling local-global correlations and addressing tumor size variability.
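
Two of the evaluation metrics named above can be computed as in the following sketch. The pairwise formulation of AUC (the fraction of positive/negative pairs ranked correctly, with ties counted as one half) is a standard equivalent of the area under the ROC curve; the function names are illustrative and this is not the evaluation code used in the experiments.

```python
def accuracy(y_true, y_pred):
    """Fraction of slides whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def roc_auc(y_true, scores):
    """Binary AUC as the probability that a positive outranks a negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Unlike accuracy and F1, AUC does not depend on the decision threshold, which is why it is reported alongside them.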

[0100] Table 1 compares the experimental results of Embodiment 1 of this invention with those of currently popular methods on the two datasets. As Table 1 shows, WSD-MIL achieves the best performance in all experimental settings. On the CAMELYON-16 dataset with the ResNet-50 backbone network, its accuracy reaches 91.20%, its F1 score 87.89%, and its AUC 92.74%, a 1.5% and 1.4% improvement in accuracy and F1 score respectively over R2T-MIL and a 2% improvement in AUC over TransMIL, highlighting its high sensitivity to sparse metastatic lesions. On the morphologically heterogeneous TCGA-BRCA cohort, WSD-MIL still maintains a significant advantage (accuracy of 90.38%, AUC of 91.14%, F1 score of 77.68%), surpassing the traditional attention model ABMIL by more than 7 percentage points in F1 score, which confirms its robustness in complex tumor scenarios. When the ResNet-50 features are replaced with high-dimensional semantic embeddings from the Virchow foundation model, overall scores improve further and WSD-MIL remains the leader (accuracy of 93.35%, AUC of 94.83%, and F1 score of 84.19%), demonstrating its strong synergy with large-scale pre-trained representations. In summary, these findings indicate that the proposed adaptive mechanism, progressive attention window decay combined with region-gated weighting, effectively overcomes the scalability limitations of fixed-scale Transformers, setting a new benchmark for weakly supervised whole-slide image classification.

[0101] Table 1 Comparison of Experimental Results

[0102]

[0103] Figure 1 shows the overall structure of the weakly supervised tumor classification algorithm based on attention scale decay (WSD-MIL). First, in the feature extraction stage, the WSI is divided into non-overlapping patches at 20x magnification, and instance features are extracted using a pre-trained feature extractor. Next, in the model training stage, the extracted features are input into the window scale decay attention (WSDA) module. This module eliminates redundant patch-level features through a clustering sampling strategy and then uses a decaying-window Transformer to model instance relationships within tumor regions at different scales. Subsequently, all region features are processed by the squeeze-excitation-based region gating (SERG) module, which dynamically assigns weights to different window regions to capture global cross-region dependencies. Finally, bag-level label prediction is completed using a whole-slide-level attention aggregator and a linear classifier.

[0104] Figure 2 shows the detailed structures of the Transformer module with window scale decay and the squeeze-excitation-based region gating module.

[0105] Embodiment 2

[0106] This embodiment provides a system for efficient feature extraction and multi-scale information modeling of large-scale images, including:

[0107] The data acquisition module is configured to acquire image data;

[0108] The data preprocessing module is configured to perform data preprocessing on the acquired image data;

[0109] The feature extraction module is configured to extract features from the preprocessed image;

[0110] The model building module is configured to build WSI classification models using a cascaded architecture;

[0111] The model training module is configured to train the constructed WSI classification model;

[0112] The prediction module is configured to perform classification prediction using the trained WSI classification model.

[0113] A computer-readable storage medium storing a plurality of instructions, the instructions being adapted to be loaded by a processor of a terminal device to execute the method for efficient feature extraction and multi-scale information modeling of large-scale images.

[0114] A terminal device includes a processor and a computer-readable storage medium, the processor being configured to implement instructions, and the computer-readable storage medium being configured to store a plurality of instructions adapted to be loaded by the processor to execute the method for efficient feature extraction and multi-scale information modeling of large-scale images.

[0115] The above are all preferred embodiments of the present invention and are not intended to limit the scope of protection of the present invention. Therefore, all equivalent changes made in accordance with the structure, shape and principle of the present invention should be covered within the scope of protection of the present invention.

Claims

1. A method for efficient feature extraction and multi-scale information modeling of large-scale images, characterized in that it includes: acquiring image data; performing data preprocessing on the acquired image data; performing feature extraction on the preprocessed image, each WSI being represented as a bag consisting of n image patch instances; constructing a WSI classification model using a cascaded architecture; training the constructed WSI classification model; and performing classification prediction using the trained WSI classification model; wherein constructing the WSI classification model using the cascaded architecture includes formalizing the whole-slide image classification task as a multi-instance learning (MIL) problem and constructing a complete mathematical model: first, given a dataset containing N WSI images with slide-level labels, a mapping function from WSI to predicted label is learned in a weakly supervised setting, so that the predicted value accurately approximates the true label; under the MIL paradigm, each WSI is divided into n non-overlapping patches, forming a "bag"; following the standard MIL assumption, each instance is encoded into a feature vector by a pre-trained feature extractor, and all instance features constitute a set; subsequently, a permutation-invariant aggregator maps the instance feature set to a bag-level representation, and finally the classifier outputs the prediction; a window scale decay attention module is inserted between the feature extractor and the aggregator to reduce redundancy through a clustering-driven sampling strategy and to characterize the dependencies between instances using a multi-scale window attention mechanism; and an SE region gating module is inserted before the aggregator to enhance global region-level information modeling by learning the gating weights of region blocks.

2. The method for efficient feature extraction and multi-scale information modeling of large-scale images according to claim 1, characterized in that, The data preprocessing of the acquired image data includes, after acquiring the original WSI data, firstly, using a threshold-based image segmentation method to extract the foreground tissue region, then using a color feature adaptive thresholding algorithm to binarize the low-resolution overview image to generate a mask of the tissue region, and then cropping the original high-resolution WSI based on the mask to retain only the tissue region. The image blocks are then segmented within the preserved tissue regions, dividing the tissue regions into a series of non-overlapping small blocks. Finally, all preprocessed image blocks are converted into high-dimensional feature vectors through a pre-trained feature extraction network, forming an instance-level feature set, which serves as the input for the subsequent multi-instance learning framework.

3. The method for efficient feature extraction and multi-scale information modeling of large-scale images according to claim 2, characterized in that, the feature extraction of the preprocessed image includes: after the image data preprocessing is completed, determining the range of pathologically significant tissue regions on the original WSI image based on the binary foreground mask obtained by foreground extraction and background removal. A systematic tiling strategy is then employed to divide the foreground region into a series of p×p pixel image patches at a specified magnification. By setting the stride equal to the patch size, non-overlapping tiling is achieved, striking a balance between covering the complete tissue region and controlling redundancy. Each WSI is thus represented as a bag consisting of n image patch instances. Each image patch is then mapped to a feature vector by a pre-trained encoder, using a ResNet-50 pre-trained on ImageNet to capture multi-level visual patterns from low-level texture to high-level morphology. The encoder is selected based on data characteristics and computational resources, and the encoder parameters remain frozen throughout the WSI classification process. Finally, each WSI is transformed into a feature set.

4. The method for efficient feature extraction and multi-scale information modeling of large-scale images according to claim 3, characterized in that, the cascaded architecture used to construct the WSI classification model also includes a window scale decay attention (WSDA) module to address the high computational complexity and poor scale adaptability in WSI processing. The WSDA module employs a two-stage processing strategy: first, it reduces the number of instances through stratified sampling based on feature clustering while maintaining the diversity of tissue morphology; then, it uses a decaying-window multi-head self-attention mechanism combined with the Nyström approximation to model multi-scale local correlations from coarse to fine. Specifically, in the feature clustering and stratified sampling stage, image patches are converted into an instance-level feature set by a pre-trained encoder; to reduce the computational complexity of attention while preserving histological diversity, K-means clustering is performed on the feature set by minimizing the within-cluster squared error, dividing the features into K clusters, each with its own cluster center; within each cluster, random sampling at a fixed proportion α yields a subset, and the subsets together form a compact feature pool. In the global attention approximation stage, the Nyström mechanism is used to efficiently model long-range dependencies: a linear projection of the batch input yields the query, key, and value matrices; regional sampling of the query and key matrices yields reduced versions of each; and global attention is approximated using Nyström, where the approximation involves the Moore-Penrose generalized inverse. In the decaying-window multi-head attention stage, the feature representation is reshaped into multiple local windows, and local multi-head self-attention is computed sequentially at three decreasing window scales; attention is computed within each window, the position encoding is obtained through one-dimensional convolution, and the output of the layer is obtained after layer normalization, linear transformation, and residual connection. The entire multi-scale processing flow operates on features of shape B×M×F, where B is the batch size, M is the number of instances, and F is the channel dimension.

5. The method for efficient feature extraction and multi-scale information modeling of large-scale images according to claim 4, characterized in that, the WSI classification model constructed using a cascaded architecture also includes a squeeze-excitation-based region gating (SERG) mechanism, which learns adaptive weights for each region block at a global scale. First, region partitioning and input representation are performed: the intermediate features output by the WSDA module are divided into L×L regions according to spatial or instance order; global average pooling is applied to each region block to extract its global representation, and the pooled representations are concatenated into a set of region-level features; spatial partitioning is based on the instance order after WSI coordinate sorting, to ensure that adjacent patches fall into adjacent windows. Then, region gating weights are learned: following the SE concept, SERG performs a two-layer mapping on the region-level features, first projecting them onto a low-dimensional embedding space to obtain an intermediate representation, where r is the dimensionality reduction ratio and ReLU activation is used to suppress redundant information and establish global dependencies between regions; the low-dimensional representation is then expanded back to the original dimension, and the final region-gated weight matrix G is obtained through a Sigmoid mapping. Finally, G is reshaped into an L×L region weight map. After obtaining G, each region's features are multiplied by that region's weight at region granularity to obtain the weighted output features, the multiplication being element-wise at the region level; the weighted features are then fed into subsequent network layers, and finally, bag-level label prediction is achieved through WSI-level attention aggregation and a linear classifier.

6. The method for efficient feature extraction and multi-scale information modeling of large-scale images according to claim 5, characterized in that, the training of the constructed WSI classification model includes: for the C-class classification task over WSIs, using cross-entropy loss as the training objective function, expressed as $\mathcal{L}_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\log p_{i,c}$, where $y_{i,c}$ is the one-hot encoding of the true label of the $i$-th sample on category $c$, $p_{i,c}$ is the model's predicted probability that the $i$-th sample belongs to category $c$, and $N$ is the number of training samples; determining the optimization algorithm and hyperparameter settings for model parameter updates, wherein the Adam optimizer is selected for parameter updates with a fixed learning rate strategy to avoid instability in the early stages of training while ensuring convergence accuracy in the later stages, and a single WSI image is used as a training batch to ensure sufficient model convergence while preventing overfitting; regarding the sampling strategy, a sampling rate α is set; the window attention mechanism uses a three-scale decreasing window strategy to achieve adaptive modeling of tumor regions at different scales; the region gating module divides the feature map into L×L regions, that is, 8×8 = 64 region blocks, balancing global information capture and computational complexity; and in the feature clustering stage, the number of clusters is set to K, and the instance features are divided into K semantic clusters for stratified sampling.

7. The method for efficient feature extraction and multi-scale information modeling of large-scale images according to claim 6, characterized in that, the training of the constructed WSI classification model includes extracting deep features for each image patch using a pre-trained ResNet-50 encoder, the encoder parameters being frozen throughout the training process to ensure the stability of feature extraction and reduce the number of trainable parameters. Then, feature clustering and stratified sampling are performed: K-means clustering is applied to all instance features of a single WSI image, dividing the instances into K clusters based on feature similarity; within each cluster, random sampling at the preset sampling rate α is performed to construct a compact feature subset, significantly reducing the computational burden while maintaining the diversity of tissue morphology. Subsequently, the WSDA module is forward-propagated: the Nyström attention mechanism is applied to the sampled features to approximate global dependencies, and multi-head self-attention is then computed sequentially at three decreasing window scales to model local instance correlations from coarse to fine, with spatial structure information incorporated through convolutional position encoding. Then, the SERG module is forward-propagated: the WSDA output features are divided into L×L region blocks, and global average pooling is applied to each region block to obtain a region-level representation; a squeeze-excitation network consisting of two fully connected layers learns the region gating weights, the first fully connected layer performing dimensionality reduction and the second fully connected layer restoring the original dimension and generating weight values between 0 and 1 through Sigmoid activation; finally, the learned region weights are multiplied with the original features at the region level to strengthen discriminative regions and suppress redundant regions. Then, bag-level aggregation and classification output are performed: an attention aggregator integrates the weighted instance features, calculates the attention weight of each instance using a learnable query vector, and performs a weighted summation of the instance features based on these weights to obtain a bag-level representation; this bag-level representation is input into a linear classifier, which outputs the predicted probability of each class through a fully connected layer and a Softmax activation function. Finally, backpropagation and parameter updates are performed: the cross-entropy loss is calculated based on the model's predicted output and the true labels, the gradient of the loss function with respect to the parameters of each model component is calculated using the backpropagation algorithm, and the Adam optimizer updates the trainable parameters of the WSDA module, SERG module, attention aggregator, and classifier based on the gradient information.

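The SERG region gating and the attention aggregation of claim 7 can be illustrated with a small pure-Python sketch. Several details here are assumptions rather than statements from the claim: regions are taken as equal-sized contiguous blocks of instances, a ReLU sits between the two fully connected layers, biases are omitted, the gate is applied per feature channel, and attention scores are plain dot products with the query. The names `serg_gate` and `attention_pool` are hypothetical.

```python
import math

def matvec(weight, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def serg_gate(features, num_regions, w_squeeze, w_excite):
    """Reweight instance features region by region.

    features:  N instance vectors of dimension D (N divisible by num_regions)
    w_squeeze: H x D matrix, first fully connected layer (H < D)
    w_excite:  D x H matrix, second fully connected layer
    """
    size = len(features) // num_regions
    dim = len(features[0])
    gated = []
    for r in range(num_regions):
        region = features[r * size:(r + 1) * size]
        # Global average pooling over the instances of this region.
        pooled = [sum(v[j] for v in region) / size for j in range(dim)]
        # First layer: dimensionality reduction (ReLU is an assumption).
        hidden = [max(0.0, h) for h in matvec(w_squeeze, pooled)]
        # Second layer: restore dimension D; sigmoid maps weights into (0, 1).
        gate = [1.0 / (1.0 + math.exp(-z)) for z in matvec(w_excite, hidden)]
        # Multiply the region's original features by its gate.
        gated.extend([g * x for g, x in zip(gate, v)] for v in region)
    return gated

def attention_pool(features, query):
    """Aggregate instance features into one bag-level vector."""
    scores = [sum(q * x for q, x in zip(query, v)) for v in features]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    bag = [sum(w * v[j] for w, v in zip(weights, features)) for j in range(dim)]
    return bag, weights
```

In training, `w_squeeze`, `w_excite`, and `query` would be the learnable parameters updated by the optimizer; here they are plain lists passed in by the caller.
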
8. The method for efficient feature extraction and multi-scale information modeling of large-scale images according to claim 7, characterized in that performing classification prediction using the trained WSI classification model includes: loading the entire WSI to be inferred and applying the same color-threshold-based segmentation algorithm as in the training phase to separate foreground tissue regions from the background; using the same pre-trained encoder as in the training phase to extract a feature vector from each image patch, then performing feature-space clustering, hierarchical sampling, and forward computation of the WSDA module, in which the Nyström attention mechanism is first applied to the sampled feature matrix to approximate global instance dependencies and capture long-range spatial associations, and multi-head self-attention is then computed sequentially at decreasing window scales, refining the local feature representation from coarse to fine; next, performing SERG module region reweighting, in which the gating weight of each region is calculated by the previously trained squeeze-excitation network; finally, performing bag-level representation aggregation and classification output, in which the probability distribution of the WSI over the categories is calculated through fully connected layers and the Softmax activation function to complete the final classification prediction; and converting the probability distribution output by the model into a specific category prediction: for multi-class classification tasks, the argmax decision rule is used, that is, letting the probability vector output by the model be ..., in which ... denotes the predicted probability that the sample belongs to the ...-th class, the final predicted label is the class with the highest probability value; for binary classification tasks, a threshold decision mechanism converts the continuous probability output into a discrete class label, that is, letting the positive-class probability output by the model be ... and the preset decision threshold be ..., the sample is determined to be positive when ..., and negative otherwise.
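
The two decision rules at the end of claim 8 can be sketched as follows. The function names are hypothetical, and two details are assumptions because the extracted claim text omits them: the comparison is taken as "greater than or equal to", and the threshold defaults to 0.5.

```python
def predict_multiclass(probs):
    """Argmax rule: return the index of the class with the highest probability."""
    return max(range(len(probs)), key=lambda c: probs[c])

def predict_binary(p_positive, threshold=0.5):
    """Threshold rule: 1 (positive) if the probability reaches the threshold.

    The >= comparison and the 0.5 default are assumptions, not claim text.
    """
    return 1 if p_positive >= threshold else 0
```

For example, `predict_multiclass([0.1, 0.7, 0.2])` selects class index 1, and `predict_binary(0.4)` yields the negative label under the assumed default threshold.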

9. A system for efficient feature extraction and multi-scale information modeling of large-scale images, which performs the method for efficient feature extraction and multi-scale information modeling of large-scale images as described in claim 1, characterized in that it includes: a data acquisition module configured to acquire image data; a data preprocessing module configured to perform data preprocessing on the acquired image data; a feature extraction module configured to extract features from the preprocessed image; a model building module configured to build the WSI classification model using a cascaded architecture; a model training module configured to train the constructed WSI classification model; and a prediction module configured to perform classification prediction using the trained WSI classification model.