A pathological image classification method based on self-supervised learning and positional encoding
By employing self-supervised learning and positional encoding methods, and utilizing the SimSiam network and Transformer attention mechanism, the problem of classifying whole-slice pathological images without lesion region annotation was solved, achieving efficient pathological image classification and lesion region detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHANGCHUN UNIV OF SCI & TECH
- Filing Date
- 2026-01-16
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies struggle to effectively classify whole-slice pathological images without lesion area annotation, resulting in low diagnostic efficiency.
We employ a self-supervised learning and location encoding approach, pre-training the SimSiam self-supervised learning network and combining spatial location encoding and Transformer attention mechanism to generate a visual heatmap of the lesion region.
It enables accurate classification and visualization of lesion areas in whole-slice pathological images without lesion area annotation, thus improving diagnostic efficiency and accuracy.
Figure: abstract drawing (CN122244486A)
Abstract
Description
Technical Field
[0001] This invention belongs to the technical field of weakly supervised learning and image classification, and specifically relates to a pathological image classification method based on self-supervised learning and positional encoding.
Background Technology
[0002] In clinical practice, pathological diagnosis is the "gold standard" for cancer diagnosis and subtyping. Pathologists obtain a unified diagnostic opinion by analyzing the histological structure, cell morphology, and differentiation degree of tumor tissue samples under a microscope. With the development of digital imaging technology, whole-slide pathology images (WSI) are driving the digitalization and intelligentization of pathology. Currently, utilizing artificial intelligence technology to assist pathological diagnosis is a hot topic in the field of computer vision, and its application can improve the diagnostic efficiency of pathologists to a certain extent.
[0003] Because of its extremely high resolution, a whole-slide image (WSI) presents a significant challenge to AI-assisted diagnosis. Existing image classification networks generally cannot process images with billions of pixels directly. They typically adopt a multi-instance learning approach: the WSI is divided into instance-level image patches, each patch is fed into the classification network separately, and the results are aggregated into a bag-level classification representation of the WSI. However, a WSI contains not only patches from the lesion regions that correspond to its label but also a large number of normal tissue regions, and training directly on all patches can degrade the final prediction. Professional pathologists are therefore required to annotate lesion regions on the WSI at the pixel level, which, given the extremely high resolution of WSIs, is time-consuming and labor-intensive. The main challenge at present is how to classify and diagnose WSI types without lesion region annotation.
Summary of the Invention
[0004] To address the shortcomings of existing technologies, this invention provides a pathological image classification method based on self-supervised learning and positional encoding. The method pre-trains the network using self-supervised learning, so that the model requires no lesion-area annotation and can classify and diagnose WSIs based solely on WSI-level type labels.
[0005] To achieve the above objectives, the present invention specifically adopts the following technical solution:
[0006] A pathological image classification method based on self-supervised learning and positional encoding, characterized in that the method includes the following steps:
[0007] S1. Obtain the whole-slice pathological image dataset and the corresponding pathological image labels;
[0008] S2. Preprocess the whole-slice pathological images and divide the dataset into training set, validation set and test set;
[0009] S3. Construct a self-supervised contrastive learning framework and pre-train the feature extraction network using a self-supervised learning approach;
[0010] S4. Use the pre-trained feature extraction network to extract features from each instance-level image and generate feature vectors for each instance.
[0011] S5. Generate the location code for each instance, convert the location code into a vector form, integrate the location code information into the feature vector of each instance, and weight and fuse the feature vectors of instances at different scales.
[0012] S6. The Transformer attention mechanism is used to globally aggregate all instances within a whole-slice pathological image to make classification decisions. Grad-CAM and sliding window are combined to generate a visual heatmap of the lesion area of the entire pathological image.
[0013] In one possible embodiment, the steps for constructing the dataset in step S2 are as follows:
[0014] S2-1. Using OpenSlide, images at various scales were extracted from whole-slice pathological images;
[0015] S2-2. Using Vahadane color normalization, all images in the dataset are normalized according to the selected standard color pathological image template;
[0016] S2-3. Otsu's method is used to perform edge detection on tissue regions in pathological images and generate a mask to filter out invalid blank backgrounds and holes in the pathological images;
[0017] S2-4. Divide the image of the tissue region into several 224×224 pixel instance-level image blocks, and filter out image blocks in which the blank area is greater than 50%.
[0018] In one possible embodiment, the step of constructing the self-supervised contrastive learning framework in step S3 is as follows:
[0019] S3-1. Let each whole-slide pathological image be denoted as a bag X, whose segmented image patches are the instances in the bag. Then x_n denotes the n-th image patch in the bag;
[0020] S3-2. Perform two random data augmentations on each instance. Augmentation methods include rotation, flipping, and adding noise. The two augmented images generated from the same source image form a positive sample pair. The Siamese self-supervised learning network SimSiam is used to perform self-supervised pre-training of the feature extraction network, a Swin Transformer.
[0021] In one possible embodiment, the feature extraction step for each instance-level image in step S4 is as follows:
[0022] After completing the self-supervised pre-training, the projection head used in the contrastive learning stage is removed, and the trained weights are used as the pre-training weights of the new feature extraction network. Features are then extracted again from the instance-level image patches to obtain the instance-level high-order feature vectors.
[0023] In one possible embodiment, step S5 specifically includes:
[0024] S5-1. Construct two-dimensional spatial position codes for each image block under different magnifications, and integrate the position code information into the feature vector;
[0025] S5-2. Based on the pyramid structure of whole-slice pathological images, the image features belonging to the same low-magnification image block at high magnification are weighted and fused to obtain the feature vectors of each instance at multiple scales.
[0026] In one possible embodiment, step S6 specifically includes:
[0027] S6-1. By introducing a learnable CLS token, each instance interacts with all other instances through self-attention, and the contribution weight of each instance to the bag-level prediction is learned;
[0028] S6-2. Based on the contribution weight of each instance, the feature vectors of each instance at multiple scales are weighted and aggregated to obtain a bag-level representation for classification;
[0029] S6-3. Grad-CAM is used to compute gradients with respect to the predicted class and obtain a probability heatmap for each image block. The darker the color of the heatmap, the greater the probability that the area is a lesion area.
[0030] Compared with existing technologies, this invention provides a pathological image classification method based on self-supervised learning and positional encoding, which has the following beneficial effects:
[0031] 1. The present invention first performs two random data augmentations on each image patch in the dataset and uses the two augmented images as a pair of positive samples. The Siamese self-supervised learning network SimSiam is used for self-supervised pre-training, so that the feature extraction network can distinguish the feature differences between different image patches without labels before classification training.
[0032] 2. This invention introduces spatial location encoding when performing feature fusion, constructs two-dimensional spatial location information of image blocks at different magnifications, integrates it into the feature vector of the image block, and, according to the pyramid structure of WSI, performs weighted fusion of high-magnification image features belonging to the same low-magnification image block, so that the model can obtain feature vectors of each instance at multiple scales.
[0033] 3. By introducing a learnable CLS token to facilitate information interaction among instances, the contribution weight of each instance to the bag-level prediction is learned, and instance-level features are weighted and aggregated according to these weights, thereby improving the reliability and robustness of WSI bag-level prediction.
Attached Figure Description
[0034] Figure 1 is a flowchart of the pathological image classification method based on self-supervised learning and positional encoding described in this invention;
[0035] Figure 2 is a schematic diagram of the self-supervised pre-training structure described in this invention;
[0036] Figure 3 is a schematic diagram of the pathological image classification method based on self-supervised learning and positional encoding described in this invention;
[0037] Figure 4 is a comparison of WSI heatmap results and details for lung cancer.
Detailed Implementation
[0038] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0039] The components of the embodiments of the invention described and illustrated herein can typically be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of the invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely to illustrate selected embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort are within the scope of protection of the invention.
[0040] In the following, the terms “comprising,” “having,” and their cognates, as used in various embodiments of the invention, are intended only to indicate a particular feature, number, step, operation, element, component, or combination thereof, and should not be construed as excluding the presence of, or the possibility of adding, one or more other features, numbers, steps, operations, elements, components, or combinations thereof.
[0041] Furthermore, the terms "first," "second," and "third" are used only to distinguish descriptions and should not be interpreted as indicating or implying relative importance.
[0042] Unless otherwise specified, all terms used herein (including technical and scientific terms) shall have the same meaning as commonly understood by one of ordinary skill in the art to which the various embodiments of the invention pertain. Terms (such as those defined in commonly used dictionaries) shall be interpreted as having the same meaning as in their contextual meaning in the relevant technical field and shall not be interpreted as having an idealized or overly formal meaning, unless clearly defined in the various embodiments of the invention.
[0043] To address the shortcomings of existing technologies, this invention falls under the category of weakly supervised learning, enabling WSI classification with only WSI category labels and no lesion region annotations. The model constructs positive sample pairs by augmenting each instance in the WSI dataset and employs the Siamese self-supervised learning network SimSiam for self-supervised pre-training of features, allowing the model to distinguish different instances without labels. Spatial positional encoding is then introduced into multi-instance learning to obtain the contribution weight of each instance to the WSI classification. Finally, instance features are aggregated by weight, and a heatmap is generated through gradient backpropagation to achieve WSI classification and lesion region detection.
[0044] Example 1
[0045] A pathological image classification method based on self-supervised learning and positional encoding specifically includes the following steps:
[0046] S1. Obtain the whole-slice pathological image dataset and the corresponding pathological image labels.
[0047] Specifically, the pathological image labels serve as ground truth in subsequent evaluation: a classification result is judged by whether the class predicted by the model matches the true label.
[0048] S2. Preprocess the whole-section pathological images. The preprocessing methods include color normalization, tissue region segmentation, and invalid information filtering. The dataset is then divided into training set, validation set, and test set.
[0049] Furthermore, in step S2, the preprocessing steps for the dataset are as follows:
[0050] S2-1. Use OpenSlide to extract images at various scales from whole-slice pathological images.
[0051] Specifically, a pathological image contains images under different objective magnifications; different scales refer to images under different magnifications.
[0052] S2-2. Using Vahadane color normalization, all images in the dataset are normalized according to the selected standard color pathological image template.
[0053] S2-3. Otsu's method is used to perform edge detection on tissue regions in pathological images and generate a mask to filter out invalid blank backgrounds and holes in the pathological images.
[0054] S2-4. Divide the image of the tissue region into several 224×224 pixel instance-level image blocks, and filter out image blocks in which the blank area is greater than 50%.
[0055] S3. Construct a self-supervised contrastive learning framework and pre-train the feature extraction network using self-supervised learning, so that the model can distinguish different instances under unlabeled conditions.
[0056] Furthermore, in step S3, the steps for constructing the self-supervised contrastive learning framework are as follows:
[0057] S3-1. Let each whole-slide pathological image be denoted as a bag X, whose segmented image patches are the instances in the bag. Then x_n denotes the n-th image patch in the bag.
[0058] S3-2. Perform two random data augmentations on each instance. Augmentation methods include rotation, flipping, and adding noise. The two augmented images generated from the same source image form a positive sample pair. The Siamese self-supervised learning network SimSiam is used to perform self-supervised pre-training of the feature extraction network, a Swin Transformer.
[0059] S4. Use the pre-trained feature extraction network to extract features from each instance-level image and generate the corresponding feature vector.
[0060] Furthermore, in step S4, the feature extraction steps for instance-level image patches are as follows:
[0061] After completing the self-supervised pre-training, the projection head used in the contrastive learning stage is removed, and the trained weights are used as the pre-training weights of the new feature extraction network. Features are then extracted again from the instance-level image patches to obtain instance-level high-dimensional feature vectors.
[0062] S5. Generate the location code for each instance, convert it into a vector form, and fuse it with the instance feature vector. Then, weight and fuse the feature vectors at different scales.
[0063] Furthermore, in step S5, the feature fusion steps based on spatial coding are as follows:
[0064] S5-1. Construct two-dimensional spatial location codes for each image block under different magnifications, and integrate the location code information into the feature vector.
[0065] S5-2. Based on the pyramid structure of whole-slice pathological images, the image features belonging to the same low-magnification image block at high magnification are weighted and fused to obtain the feature vectors of each instance at multiple scales.
[0066] S6. The Transformer attention mechanism is used to globally aggregate all instances within a whole-slice pathological image to make classification decisions. Grad-CAM and sliding window are combined to generate a visual heatmap of the lesion area of the entire pathological image.
[0067] Furthermore, in step S6, the classification decision steps are as follows:
[0068] S6-1. By introducing a learnable CLS token, each instance interacts with all other instances through self-attention, and the contribution weight of each instance to the bag-level prediction is learned.
[0069] S6-2. Based on the contribution weight of each instance, perform weighted aggregation of instance-level features to obtain a bag-level representation for classification.
[0070] S6-3. Grad-CAM is used to compute gradients with respect to the predicted class and obtain a probability heatmap for each image block. The darker the color of the heatmap, the greater the probability that the area is a lesion area.
[0071] Example 2
[0072] As shown in Figure 1, this invention provides a pathological image classification method based on self-supervised learning and positional encoding, comprising the following steps:
[0073] Furthermore, in step S1, the whole-slice pathological image formats supported by this invention include SVS, TIFF, NDPI, etc., and the magnification includes "2.5×", "5×", "10×", "20×" and "40×". The whole-slice pathological image only contains the category label of the whole image and does not contain any lesion area annotation.
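The relationship between the listed magnifications and the downsample factors of a WSI pyramid can be sketched as follows. This is a minimal illustration, assuming the base objective power of the slide is known; the helper names `downsample_for` and `plan_levels` are hypothetical, and the OpenSlide calls are shown only in comments because they need a real slide file.

```python
def downsample_for(base_magnification: float, target_magnification: float) -> float:
    """Downsample factor that turns a slide scanned at base_magnification
    into a view at target_magnification (hypothetical helper)."""
    if target_magnification <= 0 or target_magnification > base_magnification:
        raise ValueError("target magnification must be in (0, base magnification]")
    return base_magnification / target_magnification


def plan_levels(base_magnification: float, targets=(2.5, 5, 10, 20, 40)) -> dict:
    """Map each reachable target magnification to its downsample factor."""
    return {
        t: downsample_for(base_magnification, t)
        for t in targets
        if t <= base_magnification
    }


# With OpenSlide installed, each factor could be resolved to a pyramid level:
#   slide = openslide.OpenSlide(path)
#   level = slide.get_best_level_for_downsample(factor)
#   region = slide.read_region((x, y), level, (224, 224))  # (x, y) in level-0 coordinates
plan = plan_levels(40.0)
```

For a slide scanned at 40×, the 2.5× view corresponds to a downsample factor of 16; a slide scanned at 20× simply has no 40× entry.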
[0074] S2. Preprocess the whole-section pathological images. The preprocessing methods include color normalization, tissue region segmentation, and invalid information filtering. The dataset is then divided into training set, validation set, and test set.
[0075] Furthermore, in step S2, the preprocessing steps for the dataset are as follows:
[0076] S2-1. Using the OpenSlide library, extract pathological images at various scales from the whole-slice pathological images.
[0077] S2-2. Select an image with significant color contrast from the dataset as the standard color template for color normalization. In this embodiment, the Vahadane color normalization algorithm is used to correct the colors of all pathological images based on the selected standard color template, thereby improving the color contrast between cell nuclei and cytoplasm in the pathological images. The normalization formula is as follows:
[0078] $$V = -\log\left(\frac{I_s + \varepsilon}{I_0}\right) = W_s H_s$$
[0079] $$I_{norm} = I_0 \exp\left(-W_t H_s\right)$$
[0080] where $I_{norm}$ denotes the normalization result, $I_0$ the background light intensity, $I_s$ the source image to be normalized, $\varepsilon$ a numerical stability constant, $W_t$ the stain basis of the target image, $W_s$ the stain basis of the source image, $H_s$ the stain concentration of the source image, and $V$ the optical density.
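The optical-density conversion and stain recombination that the symbols above describe can be sketched as follows. This is only an illustration under stated assumptions: the stain bases are taken as given, and a plain least-squares solve stands in for the sparse non-negative factorization that the Vahadane algorithm uses to estimate them. The function names are hypothetical.

```python
import numpy as np


def to_optical_density(rgb, i0=255.0, eps=1e-6):
    """Beer-Lambert transform V = -log((I + eps) / I0), returned as (3, n_pixels)."""
    pixels = rgb.reshape(-1, 3).astype(np.float64).T
    return -np.log((pixels + eps) / i0)


def recombine_stains(rgb, w_source, w_target, i0=255.0, eps=1e-6):
    """Re-render a source image with the stain basis of a target image.

    w_source, w_target: (3, n_stains) stain bases, assumed to be known here.
    """
    v = to_optical_density(rgb, i0, eps)
    # Assumption: least squares replaces Vahadane's sparse NMF concentration step.
    h_source, *_ = np.linalg.lstsq(w_source, v, rcond=None)
    h_source = np.clip(h_source, 0.0, None)  # concentrations cannot be negative
    normalized = i0 * np.exp(-w_target @ h_source)
    return normalized.T.reshape(rgb.shape)
```

When the source and target bases are identical, the function should return the input image essentially unchanged, which is a convenient sanity check.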
[0081] S2-3. Otsu's method is used for edge detection of tissue regions in pathological images. First, the original RGB pathological image is converted to a grayscale image, and grayscale histogram statistical analysis is performed. Otsu's method adaptively determines the globally optimal segmentation threshold T by maximizing the inter-class variance between the foreground and background while minimizing the intra-class variance. Based on this, the grayscale image is binarized to generate a mask M. When the pixel grayscale value is greater than the threshold T, it is assigned a value of 255 to represent the tissue region; when the pixel grayscale value is less than or equal to the threshold T, it is assigned a value of 0 to represent the background and hole regions. The mathematical expression for this process is as follows:
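[0082] $$M(x, y) = \begin{cases} 255, & G(x, y) > T \\ 0, & G(x, y) \le T \end{cases}$$ where $G(x, y)$ is the grayscale value of the pixel at $(x, y)$.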
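A minimal NumPy sketch of the thresholding step described above: the threshold maximizes the between-class variance of the grayscale histogram, and the mask follows the stated convention (values above T become 255). The function names are hypothetical, and the input is assumed to be an 8-bit grayscale array.

```python
import numpy as np


def otsu_threshold(gray: np.ndarray) -> int:
    """Return the threshold T that maximizes between-class variance (uint8 input)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                  # probability of the class <= t
    mu = np.cumsum(prob * np.arange(256))    # cumulative mean up to t
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    sigma_b = np.zeros(256)
    valid = denom > 1e-12                    # skip thresholds with an empty class
    sigma_b[valid] = (mu_total * omega[valid] - mu[valid]) ** 2 / denom[valid]
    return int(np.argmax(sigma_b))


def binary_mask(gray: np.ndarray, threshold=None) -> np.ndarray:
    """Binarize as in the text: 255 where the gray value exceeds T, else 0."""
    t = otsu_threshold(gray) if threshold is None else threshold
    return np.where(gray > t, 255, 0).astype(np.uint8)
```

Maximizing the between-class variance is equivalent to minimizing the within-class variance, since the two sum to the fixed total variance of the image.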
[0083] S2-4. Use a mask to filter out the background and cavity areas in the pathological image, and divide the remaining tissue area image into image blocks of size 224×224 with an overlap of 50%. Image blocks with more than 50% blank area are filtered out.
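The tiling and blank-area filter can be sketched as below, assuming a binary mask in which 0 marks blank pixels. A 50% overlap on 224×224 tiles gives a stride of 112 pixels. The function name and parameters are hypothetical.

```python
import numpy as np


def tile_coordinates(mask, tile=224, overlap=0.5, max_blank=0.5):
    """Return top-left (x, y) of tiles whose blank fraction does not exceed max_blank."""
    stride = int(tile * (1.0 - overlap))
    height, width = mask.shape
    kept = []
    for y in range(0, height - tile + 1, stride):
        for x in range(0, width - tile + 1, stride):
            window = mask[y:y + tile, x:x + tile]
            blank_fraction = float(np.mean(window == 0))
            if blank_fraction <= max_blank:  # drop tiles that are more than 50% blank
                kept.append((x, y))
    return kept
```

On a 448×448 mask whose left half is tissue, the tiles starting at x = 0 and x = 112 are kept (0% and exactly 50% blank) and those at x = 224 are dropped.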
[0084] S3. Construct a self-supervised contrastive learning framework and pre-train the feature extraction network using self-supervised learning, so that the model can distinguish different instances under unlabeled conditions.
[0085] Furthermore, referring to step S3, the steps for constructing a self-supervised learning framework are as follows:
[0086] S3-1. Let each whole-slide pathological image be denoted as a bag X, whose segmented image patches are the instances within the bag. Then $X = \{x_1, x_2, \dots, x_N\}$, where $x_n$ denotes the n-th image patch. The bag contains image patches at different magnifications, and each patch is named according to its two-dimensional coordinates and magnification.
[0087] S3-2. Referring to Figure 2, the Siamese self-supervised learning network SimSiam is used to pre-train the feature extraction network (Swin Transformer in this case). Each image patch undergoes two random data augmentations (including rotation, flipping, and noise interference), and the two augmented images are used as a pair of positive samples for SimSiam self-supervised learning. By minimizing the loss function, only the learnable parameters of the feature extraction network are updated, so that different views of the same instance gradually become consistent in the feature space. This allows the network to distinguish the features of each instance without labels. The loss function is as follows:
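[0088] $$\mathcal{L} = \frac{1}{2} D\big(p_1, \operatorname{sg}(z_2)\big) + \frac{1}{2} D\big(p_2, \operatorname{sg}(z_1)\big), \qquad D(p, z) = -\frac{p}{\lVert p \rVert_2} \cdot \frac{z}{\lVert z \rVert_2}$$
[0089] where $z$ denotes the feature after MLP projection, $p$ denotes the prediction vector computed from the projected feature, the subscripts 1 and 2 index the two augmented views, and $\operatorname{sg}(\cdot)$ denotes the stop-gradient operator.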
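The symmetric negative-cosine loss that SimSiam is generally trained with can be illustrated numerically as follows. This is a sketch in NumPy, which has no automatic differentiation, so the stop-gradient is only noted in a comment; the function names are hypothetical.

```python
import numpy as np


def negative_cosine(p, z):
    """Mean negative cosine similarity between rows of p and rows of z."""
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return float(-np.mean(np.sum(p * z, axis=1)))


def simsiam_loss(p1, z1, p2, z2):
    """Symmetric loss over the two augmented views.

    In an autograd framework z1 and z2 would pass through a stop-gradient
    (for example `z.detach()` in PyTorch) so that only the predictor branch
    receives gradients from each term.
    """
    return 0.5 * negative_cosine(p1, z2) + 0.5 * negative_cosine(p2, z1)
```

The loss reaches its minimum of -1 when each prediction is perfectly aligned with the projection of the other view, and equals 0 when they are orthogonal.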
[0090] The training batch size is set to 128, the number of training iterations is set to 200, the learning rate is set to lr = 0.0001, and the decay rate is set to decay rate = 0.00001. The Adam optimizer is used to update the parameters of the feature extraction network. Iterative training is performed on a large number of unlabeled image patch data until the loss function converges, thereby learning a stable and discriminative feature representation.
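The positive-pair construction in step S3-2 might look like the sketch below. The specific choices (rotations by multiples of 90°, a horizontal flip with probability 0.5, Gaussian noise with standard deviation 5) are assumptions made for illustration; the text names only the augmentation types.

```python
import numpy as np


def random_view(patch, rng, noise_std=5.0):
    """One random augmentation of a square uint8 patch of shape (H, W, 3)."""
    view = np.rot90(patch, k=int(rng.integers(0, 4)), axes=(0, 1))
    if rng.random() < 0.5:
        view = view[:, ::-1]  # horizontal flip
    noisy = view.astype(np.float64) + rng.normal(0.0, noise_std, size=view.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)


def positive_pair(patch, rng):
    """Two independently augmented views of the same source patch."""
    return random_view(patch, rng), random_view(patch, rng)
```

Both views keep the 224×224×3 shape of the source patch, so they can be batched directly for the two branches of the Siamese network.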
[0091] S4. Use the pre-trained feature extraction network to extract features from each instance-level image and generate the corresponding feature vector.
[0092] Furthermore, in step S4, after completing the self-supervised pre-training, the projection head introduced in the self-supervised learning stage is removed, and the obtained weights are used as the pre-training weights of the model. The feature extraction network is then used again to extract features from the image patches, thereby realizing high-level semantic modeling of the input image patch data. The high-dimensional feature vectors output by the model are used as instance-level representations and fed into subsequent modules for aggregation, discrimination, and decision-making, thus providing a stable and discriminative feature foundation for the overall model.
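Removing the projection head before feature extraction often amounts to filtering a saved checkpoint. The sketch below assumes a hypothetical key layout in which backbone parameters carry a `backbone.` prefix and the projection and prediction heads use other prefixes; real checkpoints may be organized differently.

```python
def backbone_weights(state_dict, backbone_prefix="backbone."):
    """Keep only backbone parameters and strip the prefix (hypothetical key layout)."""
    return {
        key[len(backbone_prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(backbone_prefix)
    }
```

In PyTorch, the resulting dictionary could then be passed to the `load_state_dict` method of a fresh feature extraction network.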
[0093] S5. Generate the location code for each instance, convert it into a vector form, and fuse it with the instance feature vector. Then, weight and fuse the feature vectors at different scales.
[0094] Furthermore, in step S5, the specific steps for feature fusion based on spatial coding are as follows:
[0095] S5-1. Based on the two-dimensional coordinates (x, y) of each image patch in the original WSI magnification level, map it into a continuous position embedding vector. Then, fuse the constructed two-dimensional spatial position encoding with the image patch feature vector output by the feature extraction backbone network to obtain a feature vector with spatial position information.
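One common way to map (x, y) coordinates to a continuous embedding is a two-dimensional sinusoidal encoding, sketched below. The text does not state which encoding is used, so the sinusoidal form, fusion by addition, and the use of grid indices as coordinates are all assumptions; the function names are hypothetical.

```python
import numpy as np


def sincos_1d(position, dim):
    """Sinusoidal embedding of a scalar position; dim must be even."""
    i = np.arange(dim // 2)
    freq = 1.0 / (10000.0 ** (2 * i / dim))
    angles = position * freq
    return np.concatenate([np.sin(angles), np.cos(angles)])


def position_embedding_2d(x, y, dim):
    """Concatenate embeddings of x and y, each using half of the channels."""
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    return np.concatenate([sincos_1d(x, dim // 2), sincos_1d(y, dim // 2)])


def add_position(feature, x, y):
    """Fuse by addition (an assumption; concatenation is another option)."""
    return feature + position_embedding_2d(x, y, feature.shape[-1])
```

Addition keeps the feature dimension unchanged, so the downstream aggregation module needs no modification.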
[0096] S5-2. Based on the pyramid structure of the WSI, for the multi-level high-magnification image blocks located at the position of the same lowest-magnification image block, the method performs weighted fusion of their features with the corresponding low-magnification image block features, thereby integrating multi-scale contextual information at the feature level and obtaining feature vectors with multi-scale information.
[0097]
[0098] In the formula above, the fused feature is defined at position i; m represents the number of image patches at position i under magnification 1, and n represents the number of image patches at position i under magnification 2. The formula also contains a weight for the low-magnification features (adjusted according to slice size) and the image patch features at the high magnification.
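One possible form of the weighted fusion is sketched below: the low-magnification feature is blended with the mean of the high-magnification features that fall inside the same low-magnification block. The exact formula is not recoverable from the text, so the convex combination and the parameter `low_weight` are assumptions.

```python
import numpy as np


def fuse_scales(low_feature, high_features, low_weight=0.5):
    """Blend one low-magnification feature with its high-magnification children.

    low_feature:   (d,) feature of the low-magnification block
    high_features: (n, d) features of the n high-magnification blocks inside it
    low_weight:    weight of the low-magnification feature (hypothetical default)
    """
    high_features = np.atleast_2d(high_features)
    high_mean = high_features.mean(axis=0)
    return low_weight * low_feature + (1.0 - low_weight) * high_mean
```

With `low_weight=0.5`, a low-magnification feature of all ones and two children of all threes and all fives yield a fused feature of all 2.5.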
[0099] S6. The Transformer attention mechanism is used to globally aggregate the instances within a whole-slice pathological image to make classification decisions, and Grad-CAM is combined with a sliding window to generate a visual heatmap of the lesion area of the entire pathological image.
[0100] Figure 3 shows a structural schematic of the pathological image classification method based on self-supervised learning and positional encoding, covering the processes in steps S3-S6.
[0101] Furthermore, in step S6, the classification decision steps are as follows:
[0102] S6-1. To achieve global modeling of bag-level features, a learnable classification token (CLS token) is introduced and fed into the Transformer encoder together with the features of all instances within the bag. Under the self-attention mechanism, the CLS token acts as a global information aggregation node, establishing fully connected attention interactions with each instance feature. By calculating its attention weights for all instances, it adaptively aggregates key information from different instances. After multiple layers of Transformer encoding, the CLS token gradually integrates the discriminative features of each instance within the bag, resulting in attention weights that reflect the relative contributions of different instances to the bag-level prediction.
[0103] S6-2. Based on the weight coefficients learned from the self-attention mechanism, the features of each instance in the bag are weighted and summed so that key instances that contribute more to the classification results receive higher weights during the aggregation process, thereby forming a compact WSI bag-level feature representation with global semantic expression capabilities, which is ultimately used to complete the WSI classification task.
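The attention-weighted aggregation in steps S6-1 and S6-2 can be illustrated with a single attention head in which the CLS token is the only query. This is a simplified sketch: a full Transformer encoder also has value projections, several heads and layers, and trained parameters, none of which are modeled here. The names `w_q` and `w_k` are hypothetical projection matrices.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def cls_attention_pool(cls_token, instances, w_q, w_k):
    """Aggregate instance features using the CLS token as the query.

    cls_token: (d,)   instances: (n, d)   w_q, w_k: (d, d_k)
    Returns the bag-level representation (d,) and the instance weights (n,).
    """
    q = cls_token @ w_q                       # query from the CLS token
    k = instances @ w_k                       # one key per instance
    scores = k @ q / np.sqrt(q.shape[0])      # scaled dot-product scores
    weights = softmax(scores)                 # contribution of each instance
    bag = weights @ instances                 # weighted sum of instance features
    return bag, weights
```

The returned weights are non-negative and sum to one, so they can be read directly as the relative contribution of each instance to the bag-level representation.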
[0104] S6-3. Grad-CAM is used to backpropagate gradients and generate a probability heatmap for each image patch in the WSI. The entire WSI is then scanned patch by patch, row by row, with a scan step size of 112. Where the heatmaps of adjacent patches overlap, their pixel values are weighted and summed to produce the heatmap of the entire WSI. As shown in Figure 4, the darker the color in the heatmap, the greater the probability that the area is a lesion.
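Stitching the per-patch heatmaps into a slide-level heatmap can be sketched as an accumulate-and-normalize pass. Equal weights for overlapping patches (a plain average) are an assumption here, since the text does not specify the weighting; the function name is hypothetical.

```python
import numpy as np


def stitch_heatmaps(patch_maps, coords, canvas_shape, tile=224):
    """Average overlapping patch heatmaps onto one canvas.

    patch_maps:   iterable of (tile, tile) arrays with values in [0, 1]
    coords:       top-left (x, y) of each patch on the canvas
    canvas_shape: (height, width) of the output heatmap
    """
    total = np.zeros(canvas_shape, dtype=np.float64)
    count = np.zeros(canvas_shape, dtype=np.float64)
    for heat, (x, y) in zip(patch_maps, coords):
        total[y:y + tile, x:x + tile] += heat
        count[y:y + tile, x:x + tile] += 1.0
    # Pixels not covered by any patch stay at zero.
    return np.divide(total, count, out=np.zeros_like(total), where=count > 0)
```

With a step of 112 and 224×224 patches, every interior pixel is covered by up to four patches, so the averaging smooths the seams between neighboring patch heatmaps.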
[0105] It should be understood that various parts of the present invention can be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, it can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.
[0106] In the description of this specification, references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.
[0107] Although embodiments of the invention have been shown and described, those skilled in the art will understand that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
[0108] The above is a detailed description of the preferred embodiments of the present invention, but the present invention is not limited to the embodiments described. Those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention, and these equivalent modifications or substitutions are all included within the scope defined by the claims of this application.
Claims
1. A pathological image classification method based on self-supervised learning and positional encoding, characterized in that the method includes the following steps:
S1. Obtain the whole-slice pathological image dataset and the corresponding pathological image labels;
S2. Preprocess the whole-slice pathological images and divide the dataset into training set, validation set and test set;
S3. Construct a self-supervised contrastive learning framework and pre-train the feature extraction network using a self-supervised learning approach;
S4. Use the pre-trained feature extraction network to extract features from each instance-level image and generate feature vectors for each instance;
S5. Generate the location code for each instance, convert the location code into a vector form, integrate the location code information into the feature vector of each instance, and weight and fuse the feature vectors of instances at different scales;
S6. Use the Transformer attention mechanism to globally aggregate all instances within a whole-slice pathological image to make classification decisions, and combine Grad-CAM with a sliding window to generate a visual heatmap of the lesion area of the entire pathological image.
2. The pathological image classification method based on self-supervised learning and positional encoding according to claim 1, characterized in that the steps for constructing the dataset in step S2 are as follows:
S2-1. Use the relevant functions of the OpenSlide library to extract images at various scales from whole-slice pathological images;
S2-2. Using Vahadane color normalization, normalize all images in the dataset according to the selected standard color pathological image template;
S2-3. Use Otsu's method to perform edge detection on tissue regions in pathological images and generate a mask to filter out invalid blank backgrounds and holes in the pathological images;
S2-4. Divide the image of the tissue region into several 224×224 pixel instance-level image blocks, and filter out image blocks in which the blank area is greater than 50%.
3. The pathological image classification method based on self-supervised learning and positional encoding according to claim 1, characterized in that the steps for constructing the self-supervised contrastive learning framework in step S3 are as follows:
S3-1. Let each whole-slide pathological image be denoted as a bag X, whose segmented image patches are the instances in the bag; then x_n denotes the n-th image patch in the bag;
S3-2. Perform two random data augmentations on each instance, the augmentation methods including rotation, flipping, and adding noise; combine the two augmented images generated from the same source image into a positive sample pair; and use the Siamese self-supervised learning network SimSiam to perform self-supervised pre-training of the feature extraction network, a Swin Transformer.
4. The pathological image classification method based on self-supervised learning and positional encoding according to claim 1, characterized in that the feature extraction steps for each instance-level image in step S4 are as follows:
After completing the self-supervised pre-training, the projection head used in the contrastive learning stage is removed, and the trained weights are used as the pre-training weights of the new feature extraction network. Features are then extracted again from the instance-level image patches to obtain the instance-level high-order feature vectors.
5. The pathological image classification method based on self-supervised learning and positional encoding according to claim 1, characterized in that step S5 specifically includes:
S5-1. Construct two-dimensional spatial position codes for each image block under different magnifications, and integrate the position code information into the feature vector;
S5-2. Based on the pyramid structure of whole-slice pathological images, the image features belonging to the same low-magnification image block at high magnification are weighted and fused to obtain the feature vectors of each instance at multiple scales.
6. The pathological image classification method based on self-supervised learning and positional encoding according to claim 1, characterized in that step S6 specifically includes:
S6-1. By introducing a learnable CLS token, each instance interacts with all other instances through self-attention, and the contribution weight of each instance to the bag-level prediction is learned;
S6-2. Based on the contribution weight of each instance, the feature vectors of each instance at multiple scales are weighted and aggregated to obtain a bag-level representation for classification;
S6-3. Grad-CAM is used to compute gradients with respect to the predicted class and obtain a probability heatmap for each image block. The darker the color of the heatmap, the greater the probability that the area is a lesion area.