A Transformer-based method for rock cuttings image segmentation
The Transformer-based RdsFormer model addresses the difficulty of handling long-distance dependencies in rock cuttings image segmentation, enabling efficient automatic classification and labeling of rock cuttings images and improving segmentation performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SICHUAN UNIV
- Filing Date
- 2022-07-25
- Publication Date
- 2026-06-30
AI Technical Summary
Existing technologies for segmenting rock debris images suffer from the problem that convolutional neural network models cannot effectively handle long-distance dependencies in complex scenes, resulting in poor performance in particle boundary segmentation.
We employ the Transformer-based RdsFormer model, utilize a split-shifted window multi-head self-attention mechanism to construct a hierarchical encoder, and combine it with a three-level cascaded decoder, including multi-scale feature fusion, contextual relation, and enhancement modules. We optimize network parameters through a multi-task learning strategy.
It improves the accuracy and efficiency of rock cuttings image segmentation, effectively handles particle boundaries in complex scenes, achieves automatic classification and labeling, and reduces human interference and costs.
Figure CN117523182B_ABST
Description
Technical Field
[0001] This invention provides a Transformer-based rock debris image segmentation method and a new network model, RdsFormer (Rock Debris Image Segmentation Using Transformer). It addresses the problem of determining the lithological characteristics of underground rock strata in geological research and oil exploration, and belongs to the field of computer vision and intelligent information processing.
Background Technology
[0002] Rock cuttings, or fine rock particles, are produced when drill bits break rocks during drilling operations. They play a crucial role in geological research and oil exploration. During drilling, rock cuttings are brought to the surface with the drilling mud. After sampling, washing, and screening, rock cutting samples are formed. By observing, describing, classifying, and naming these samples, the lithological characteristics of the formations can be effectively understood, and the hydrocarbon content can be analyzed. Generally, rock cutting samples contain different types of rock particles, varying in size, color, and texture, which can visually reflect the lithological characteristics of the formations. However, manual classification of rock cuttings is time-consuming and labor-intensive, and easily affected by human factors. On the other hand, methods such as computed tomography (CT) and spectral analysis of rock cuttings are expensive and not suitable for widespread use. Therefore, using image segmentation technology to automatically classify and label various rock particles in digital images of rock cuttings is an economical and effective method for analyzing rock cutting samples.
[0003] Image segmentation technology can automatically classify particles in rock debris images, thereby quickly determining the lithological characteristics of underground rock strata in geological research and oil exploration. Traditional image segmentation algorithms typically rely on low-level image features to segment complex rock debris images, resulting in generally poor performance. In recent years, Convolutional Neural Networks (CNNs) have demonstrated superior performance in computer vision applications such as image classification, image segmentation, and object recognition. A CNN is a framework that automatically learns and extracts high-dimensional abstract semantic features from images by stacking convolutional layers. Currently, CNNs are widely used in rock debris image segmentation, outperforming traditional algorithms. However, a drawback of CNNs is the small size of the convolutional kernels, which limits their ability to establish long-range dependencies in images. Furthermore, particles in rock debris samples are often interconnected and piled up, with complex contextual information at particle boundaries. CNN-based models often struggle to handle particle edges effectively, so their segmentation performance needs improvement. It is therefore essential to seek a novel image feature extractor for rock debris image segmentation. The Transformer was first proposed in the natural language processing field in 2017 and has since achieved substantial success in image processing. The Transformer's self-attention mechanism enables it to capture long-range dependencies in the data, making it more suitable than CNN-based models for segmenting rock debris images in complex scenes. However, there is currently no documented application of the Transformer to rock debris image segmentation, a gap that motivated this invention to apply it to that task.
Summary of the Invention
[0004] This invention provides a Transformer-based rock debris image segmentation method called RdsFormer. For the encoder, this invention uses a Transformer based on a split-shifted window multi-head self-attention mechanism to construct a hierarchical encoder, which can effectively extract multi-scale features from rock debris images. For the decoder, this invention introduces a three-level cascaded structure comprising a multi-scale feature fusion module, a contextual relationship module, and an enhancement module. This decoder can fully utilize multi-scale context information and relational context information to predict the rock debris category for each pixel.
[0005] A Transformer-based method for segmenting rock cuttings images includes the following steps:
[0006] (1) Construct a Transformer-based encoder (Rds-TE) to extract the multi-scale feature set $F = \{F_1, F_2, F_3, F_4\}$ from rock cuttings images.
[0007] (2) Construct a three-level cascaded decoder (Rds-TD), which includes a multi-scale feature fusion module, a context relation module, and an enhancement module.
[0008] (3) The multi-scale feature fusion module uses $F$ to generate a coarse category representation $S_c$ and a pixel representation $S_p$.
[0009] (4) $S_c$ and $S_p$ are input simultaneously into the context relation module of the decoder, which generates an enhanced pixel representation $S_P$ based on the relationship between pixels and object regions.
[0010] (5) $S_p$ and $S_P$ are fed into the decoder's enhancement module to produce the final predicted rock cuttings category value $S_O$.
[0011] (6) RdsFormer employs a multi-task learning strategy in which a loss function is added at each of $S_c$ and $S_O$.
[0012] During the training phase, two-way backpropagation is used to jointly supervise the model's self-learning and optimize the network parameters.
Attached Figure Description
[0013] Figure 1 is a block diagram of the Transformer-based rock debris image segmentation method of the present invention.
[0014] Figure 2 is a structural diagram of the Rds-TE of the present invention.
[0015] Figure 3 is a schematic diagram of the SSW-MHSA operation process of the present invention.
[0016] Figure 4 is a structural diagram of the Rds-TD of the present invention.
Detailed Implementation
[0017] The present invention is further explained below with reference to Figures 1, 2, 3, and 4:
[0018] The network structure and principle of the RdsFormer model are as follows:
[0019] The RdsFormer proposed in this invention is a Transformer-based segmentation model designed specifically for rock debris images. As shown in Figure 1, RdsFormer consists of two parts: 1) a Transformer-based encoder, called Rds-TE, which extracts multi-scale feature maps from the input image; 2) a three-level decoder, called Rds-TD, which classifies each pixel.
[0020] (1) Rds-TE:
[0021] As shown in Figure 2, Rds-TE is a four-layer architecture, with each layer containing a patch embedding module (PE) and $N_i$ custom Transformer-layer (TE) modules. PE consists of a convolutional layer (Conv) with a kernel size of 4×4 and a layer normalization layer (LN); it divides the input image into individual patches and normalizes the data to accelerate training convergence.
[0022] The TE consists of a local window multi-head self-attention (W-MHSA) module, a split-shifted window multi-head self-attention (SSW-MHSA) module, and two feedforward network layers (FFN). Each MHSA module and each FFN is preceded by an LN, each MHSA is followed by a Dropout layer (DR), and each module is followed by a residual connection that connects to the next stage.
[0023] As shown in Figure 3, building on multi-head self-attention (MHSA), and to increase connectivity between windows, this invention designs an SSW-MHSA module based on W-MHSA. Assume the input feature map is $X \in \mathbb{R}^{h \times w \times C}$ and divide it evenly into $M$ non-overlapping local windows $[X_1, X_2, \ldots, X_M]$. Assuming each window contains $n \times n$ tokens, then $M = (h \times w) / n^2$. From another perspective, $X$ is divided into $J$ heads along the $C$ channels, where each head has a dimension of $d_j = C / J$, and the $J$ heads are then divided into three equal groups. Relative to the $M$ regularly partitioned windows, the first group shifts the windows horizontally by a set number of pixels, the second group shifts the windows diagonally by a set number of pixels, and the third group shifts the windows vertically by a set number of pixels. Finally, each group performs MHSA within its own shifted windows. The output of the self-attention computation for the horizontally shifted window of the $j$-th head is therefore defined as:
[0024]
[0025]
[0026] Here $i = 1, \ldots, M$, and the remaining symbols are the parameters of the linear transformation matrices. Local window self-attention in the vertical and diagonal directions is defined in the same way, with V-Attention$_j(X)$ and D-Attention$_j(X)$ denoting the corresponding outputs of the $j$-th head. The calculation expression for SSW-MHSA can therefore be defined as:
[0027] $\text{SSW-MHSA}(X) = \mathrm{Concat}(h_1, \ldots, h_J)\, W^O$
[0028]
[0029] Here, Concat(·) is a concatenation (stacking) operation, and $W^O \in \mathbb{R}^{C \times C}$ is the parameter of a linear transformation matrix.
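The window and head bookkeeping described for SSW-MHSA can be sketched in a few lines of plain Python. This is an illustrative sketch, not the patented implementation: the sizes `h`, `w`, `n`, `C`, `J`, the shift amount `n // 2`, and the use of a cyclic shift are all assumptions, since the text above does not give the shift values.

```python
def window_count(h, w, n):
    """Number of non-overlapping n x n windows on an h x w map: M = (h*w) / n^2."""
    if h % n or w % n:
        raise ValueError("h and w must be divisible by n")
    return (h * w) // (n * n)


def head_groups(num_heads):
    """Split head indices into three equal groups, one per shift direction."""
    if num_heads % 3:
        raise ValueError("num_heads must be divisible by 3")
    size = num_heads // 3
    heads = list(range(num_heads))
    return {
        "horizontal": heads[:size],
        "diagonal": heads[size:2 * size],
        "vertical": heads[2 * size:],
    }


def cyclic_shift(grid, dy, dx):
    """Cyclically shift a 2D list by dy rows and dx columns (assumed shift style)."""
    h, w = len(grid), len(grid[0])
    return [[grid[(r + dy) % h][(c + dx) % w] for c in range(w)] for r in range(h)]


def partition_windows(grid, n):
    """Cut a 2D list into non-overlapping n x n windows in row-major order."""
    h, w = len(grid), len(grid[0])
    return [
        [row[c:c + n] for row in grid[r:r + n]]
        for r in range(0, h, n)
        for c in range(0, w, n)
    ]


# Hypothetical sizes chosen only for illustration.
h, w, n, C, J = 8, 8, 4, 96, 6
shift = n // 2  # assumed shift amount; not stated in the text
M = window_count(h, w, n)
d_j = C // J
groups = head_groups(J)
offsets = {"horizontal": (0, shift), "diagonal": (shift, shift), "vertical": (shift, 0)}
grid = [[r * w + c for c in range(w)] for r in range(h)]
windows = {
    name: partition_windows(cyclic_shift(grid, dy, dx), n)
    for name, (dy, dx) in offsets.items()
}
```

Each head group would then run ordinary window attention on its own entry of `windows`, so neighbouring windows exchange information in three directions within a single layer.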
[0030] The specific formula for TE is as follows:
[0031]
[0032]
[0033]
[0034]
[0035] Here, $z^l$ and its intermediate counterpart denote the outputs of the FFN and of the (SS)W-MHSA module in layer $l$, respectively.
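The structure described in [0022] (LN before every MHSA and FFN, Dropout after every MHSA, a residual connection after every module) suggests a pre-norm block of the following form. This is a hedged reconstruction, not the patent's own formula: the symbol $\hat{z}$ for the attention output and the exact placement of DR are assumptions.

```latex
\begin{aligned}
\hat{z}^{l}   &= \mathrm{DR}\big(\text{W-MHSA}(\mathrm{LN}(z^{l-1}))\big) + z^{l-1} \\
z^{l}         &= \mathrm{FFN}\big(\mathrm{LN}(\hat{z}^{l})\big) + \hat{z}^{l} \\
\hat{z}^{l+1} &= \mathrm{DR}\big(\text{SSW-MHSA}(\mathrm{LN}(z^{l}))\big) + z^{l} \\
z^{l+1}       &= \mathrm{FFN}\big(\mathrm{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1}
\end{aligned}
```

Under this reading, one TE module spans two consecutive layers: the first uses regular windows (W-MHSA) and the second uses split-shifted windows (SSW-MHSA).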
[0036] (2) Multi-scale feature fusion module
[0037] As shown in Figure 4, this module is divided into two branches: a multi-scale feature fusion branch and a feature fusion branch for different sub-regions. $F = \{F_1, F_2, F_3, F_4\}$ consists of feature maps with 128, 256, 512, and 1024 channels, whose spatial sizes are 1/4, 1/8, 1/16, and 1/32 of the original image (consistent with the 2×, 4×, and 8× upsampling below and the final 4× upsampling in the enhancement module). After these feature maps are fed into the first branch, $F_2$, $F_3$, and $F_4$ are upsampled to the size of $F_1$ by bilinear interpolation of 2×, 4×, and 8×, respectively. The four feature maps of the same size are then stacked sequentially and merged into $F_5$ according to the following formula:
[0038] $F_5 = \mathrm{Concat}(\mathrm{Up}_{8\times}(F_4), \mathrm{Up}_{4\times}(F_3), \mathrm{Up}_{2\times}(F_2), F_1)$
[0039] Finally, $F_5$ passes through an affine transformation to generate the pixel representation $S_p$ ($C$ is the dimension of the feature channel). The calculation formula is as follows:
[0040] $S_p = \alpha(F_5)$
[0041] Here α(·) represents a 1×1 Conv operation, which reduces the number of channels in $F_5$ from 1920 to 512.
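The channel and size arithmetic of the first branch can be checked with a short shape calculation. This is a sketch under stated assumptions: the 512×512 input size is hypothetical, and the stride of 4 for the first feature map is inferred from the upsampling factors rather than stated outright.

```python
def encoder_shapes(height, width, channels=(128, 256, 512, 1024), first_stride=4):
    """Return (h, w, c) for F1..F4, assuming each stage halves the resolution."""
    shapes = []
    stride = first_stride  # assumed: F1 is 1/4 of the input size
    for ch in channels:
        shapes.append((height // stride, width // stride, ch))
        stride *= 2
    return shapes


def fused_shape(shapes):
    """Shape of F5 after upsampling F2..F4 by 2x, 4x, 8x and concatenating with F1."""
    target_h, target_w, _ = shapes[0]
    total_channels = 0
    for index, (fh, fw, ch) in enumerate(shapes):
        factor = 2 ** index  # 1x, 2x, 4x, 8x
        if (fh * factor, fw * factor) != (target_h, target_w):
            raise ValueError("upsampled map does not match the size of F1")
        total_channels += ch
    return (target_h, target_w, total_channels)


# Hypothetical 512 x 512 input image.
shapes = encoder_shapes(512, 512)
f5 = fused_shape(shapes)
s_p = (f5[0], f5[1], 512)  # the 1x1 Conv reduces 1920 channels to 512
```

The concatenated channel count 128 + 256 + 512 + 1024 = 1920 matches the figure quoted for $F_5$.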
[0042] After $F_4$ is fed into the second branch, it passes through the Pyramid Pooling Module (PPM) to obtain different sub-region representations. These are upsampled by 8×, concatenated with the affine-transformed $F_5$, and then subjected to another affine transformation to produce the coarse category representation $S_c$ ($k$ represents the number of rock debris types). The specific calculation formula is as follows:
[0043] $S_c = \gamma(\mathrm{Concat}(\mathrm{Up}_{8\times}(\mathrm{PPM}(F_4)), \beta(F_5)))$
[0044] Here β(·) and γ(·) represent the affine transformation operations $\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{1\times1}(\cdot)))$ and $\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{3\times3}(\cdot)))$, respectively.
[0045] (3) Context Relationship Module
[0046] As shown in Figure 4, $S_c$, after two different linear transformations, serves as the inputs K and V of the self-attention module, and $S_p$, after a linear transformation, serves as the input Q. A cross-attention mechanism is then used to compute the relationship between pixels and categories, thereby generating the enhanced pixel representation $S_P$. The specific calculation formula is as follows:
[0047] $Q = \varphi(S_p),\quad K = \kappa(S_c),\quad V = v(S_c)$
[0048]
[0049] Here φ(·), κ(·), v(·), and ρ(·) represent four 1×1 Conv operations, each with its own parameters.
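The pixel-to-category cross-attention can be illustrated with a small pure-Python sketch. It is a toy under assumptions: the scaled dot-product softmax form is the standard one and is assumed here because the formula itself is not reproduced in the text, and the projections φ, κ, v, and ρ are replaced by identity maps for brevity.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Attend from each pixel (query) to every category (key/value).

    queries: N x d list, keys: k x d list, values: k x d_v list.
    Returns an N x d_v list of enhanced pixel features.
    """
    d = len(queries[0])
    width = len(values[0])
    enhanced = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d) for key in keys]
        weights = softmax(scores)
        enhanced.append(
            [sum(wt * val[j] for wt, val in zip(weights, values)) for j in range(width)]
        )
    return enhanced


# Toy data: 3 pixels, 2 categories, feature dimension 2 (all values hypothetical).
pixels = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
categories = [[4.0, 0.0], [0.0, 4.0]]
enhanced_pixels = cross_attention(pixels, categories, categories)
```

Each output row is a mixture of category features weighted by how strongly the pixel matches each category, which is the sense in which the pixel representation is "enhanced" by category context.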
[0050] (4) Enhancement Module
[0051] As shown in Figure 4, the two input features $S_p$ and $S_P$ are first concatenated along the channel dimension. An affine transformation is then applied. Finally, bilinear interpolation upsampling restores the original image size to generate the final predicted rock debris category value for each pixel, $S_O \in \mathbb{R}^{H \times W \times k}$:
[0052] $S_O = \mathrm{Up}_{4\times}(\theta(\mathrm{Concat}(S_p, S_P)))$
[0053] (5) Multi-task learning strategy
[0054] As shown in Figure 3, a loss function is added at each of $S_c$ and $S_O$. The loss at $S_c$ is used to supervise the model's self-learning to predict the initial segmentation, and its formula is as follows:
[0055]
[0056]
[0057] The loss at $S_O$ is used to supervise the model to predict more accurate pixel-level segmentation results. The specific formula is as follows:
[0058]
[0059]
[0060] The model's final loss function is the joint loss function. After the loss value is calculated with it, the model parameters are optimized jointly through backpropagation.
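A joint loss over the coarse and final predictions might be combined as in the sketch below. This is only an illustration: the loss formulas are not reproduced in the text, so per-pixel cross-entropy, the equal weights, and the names `coarse_weight` and `final_weight` are all assumptions.

```python
import math


def cross_entropy(probabilities, labels):
    """Mean negative log-likelihood over pixels (assumed loss form)."""
    eps = 1e-12  # guards against log(0)
    total = sum(-math.log(max(p[y], eps)) for p, y in zip(probabilities, labels))
    return total / len(labels)


def joint_loss(coarse_probs, final_probs, labels, coarse_weight=1.0, final_weight=1.0):
    """Weighted sum of the coarse-branch loss and the final-output loss."""
    loss_coarse = cross_entropy(coarse_probs, labels)
    loss_final = cross_entropy(final_probs, labels)
    total = coarse_weight * loss_coarse + final_weight * loss_final
    return total, loss_coarse, loss_final


# Two pixels, two classes; all probabilities are made-up illustrative values.
labels = [0, 1]
coarse = [[0.6, 0.4], [0.5, 0.5]]
final = [[0.9, 0.1], [0.2, 0.8]]
total, loss_coarse, loss_final = joint_loss(coarse, final, labels)
```

Backpropagating the single `total` value sends gradients through both supervision points at once, which is one way to read the "two-way" joint learning described above.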
[0061] The RdsFormer model of this invention runs on two RTX 2080 Ti GPUs under the PyTorch 1.8 framework and is trained with AdamW as the optimizer. Training a Transformer model typically requires a properly designed learning rate warm-up phase. This invention trains the proposed RdsFormer model for 200 epochs (2.5 days) with a mini-batch size of 4. The learning rate warm-up phase covers the first 5 epochs, and the linearly scaled learning rate is used for the remaining 195 epochs. The weight decay is 0.01, and the learning rate $L_r$, starting from its initial value, is defined as:
[0062]
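A warm-up schedule of the kind described (5 warm-up epochs followed by 195 linearly scaled epochs) could look like the sketch below. It is a guess at the shape only: the base learning rate `1e-4` is a hypothetical placeholder because the initial value is not given in the text, and a linear ramp followed by a linear decay is an assumption.

```python
def learning_rate(epoch, base_lr=1e-4, warmup_epochs=5, total_epochs=200):
    """Learning rate for a zero-based epoch index.

    Assumed shape: linear ramp up to base_lr during warm-up,
    then linear decay over the remaining epochs.
    """
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch out of range")
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    remaining = total_epochs - epoch
    return base_lr * remaining / (total_epochs - warmup_epochs)


schedule = [learning_rate(epoch) for epoch in range(200)]
```

In PyTorch, a function like this is typically wrapped in a per-epoch scheduler around the AdamW optimizer; that wiring is omitted here to keep the sketch dependency-free.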
[0063] The rock debris image dataset used in this experiment is a dataset containing oily siltstone, purple andesite, and 13 other different rock types. The rock debris image dataset consists of 7583 images, which can be divided into 15 categories. 5157 images were randomly selected from the dataset as the training set, 1213 images were randomly selected as the validation set, and another 1213 images were selected as the test set.
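The split sizes quoted above add up to the full dataset (5157 + 1213 + 1213 = 7583). A reproducible random split with those sizes can be sketched as follows; the seed and the use of integer indices in place of image files are assumptions made for illustration.

```python
import random


def split_indices(total, train, val, test, seed=0):
    """Shuffle indices 0..total-1 and cut them into train/val/test parts."""
    if train + val + test != total:
        raise ValueError("split sizes must sum to the dataset size")
    indices = list(range(total))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    return (
        indices[:train],
        indices[train:train + val],
        indices[train + val:],
    )


train_ids, val_ids, test_ids = split_indices(7583, 5157, 1213, 1213)
```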
[0064] This invention conducted comparative experiments on a rock debris image dataset, investigated network structure ablation, and examined the impact of different loss functions on the segmentation accuracy of RdsFormer. mIoU and mAcc were used as evaluation metrics, with higher values indicating better segmentation performance. The relevant experimental results are shown in Tables 1, 2, 3, and 4.
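For reference, mIoU and mAcc are commonly computed from a per-class confusion matrix, as in the sketch below. It uses the standard definitions (per-class intersection over union and per-class pixel accuracy, averaged over classes) and is not taken from the patent; the toy labels are made up, and skipping classes absent from the ground truth is a design choice of this sketch.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """matrix[t][p] counts pixels of true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def miou_macc(matrix):
    """Mean IoU and mean per-class accuracy over classes present in the ground truth."""
    ious, accuracies = [], []
    for c in range(len(matrix)):
        true_positive = matrix[c][c]
        actual = sum(matrix[c])
        predicted = sum(row[c] for row in matrix)
        if actual == 0:
            continue  # class absent from the ground truth; skipped in this sketch
        union = actual + predicted - true_positive
        ious.append(true_positive / union)
        accuracies.append(true_positive / actual)
    return sum(ious) / len(ious), sum(accuracies) / len(accuracies)


# Six toy pixels over three classes.
truth = [0, 0, 1, 1, 2, 2]
prediction = [0, 1, 1, 1, 2, 0]
miou, macc = miou_macc(confusion_matrix(truth, prediction, 3))
```

On this toy input the per-class IoUs are 1/3, 2/3, and 1/2, and the per-class accuracies are 1/2, 1, and 1/2.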
[0065] Table 1 compares RdsFormer and traditional image segmentation methods on the rock cuttings image dataset.
[0066]
[0067] Table 2 compares the RdsFormer and CNN-based segmentation models on the rock debris image dataset.
[0068]
[0069] Table 3 Ablation Study of Network Structure on Rock Cuttings Image Dataset
[0070]
[0071] Table 4. Study on the influence of different loss functions on the segmentation accuracy of RdsFormer
[0072]
[0073] The results in Tables 1 and 2 show that RdsFormer performs better than traditional algorithms and CNN-based segmentation models. The results in Table 3 show that Rds-TD and Rds-TE of RdsFormer contribute to the improvement of segmentation accuracy. The results in Table 4 show that the multi-task loss function can improve the performance of RdsFormer.
Claims
1. A Transformer-based method for segmenting rock cuttings images, characterized by comprising the following steps: (1) constructing a Transformer-based encoder Rds-TE to extract the multi-scale feature maps $F = \{F_1, F_2, F_3, F_4\}$ of rock cuttings images; (2) constructing a three-level cascaded decoder Rds-TD, which includes a multi-scale feature fusion module, a context relationship module, and an enhancement module; (3) using the multi-scale feature fusion module to generate, from $F$, a coarse category representation $S_c$ and a pixel representation $S_p$; (4) inputting $S_c$ and $S_p$ simultaneously into the context relationship module of the decoder, where $S_c$, after two different linear transformations, serves as the K and V inputs of the self-attention module, and $S_p$, after a linear transformation, serves as the Q input, and the relationship between pixels and categories is computed through a cross-attention mechanism, thereby producing the enhanced pixel representation $S_P$; (5) inputting $S_p$ and $S_P$ into the enhancement module of the decoder, where $S_p$ and $S_P$ are first concatenated along the channel dimension, then subjected to an affine transformation, and finally upsampled using bilinear interpolation to restore the original image size, thus generating the final predicted rock debris category value $S_O$ for each pixel; (6) employing a multi-task learning strategy in which a loss function is added at each of $S_c$ and $S_O$, and, during the training phase, two-way backpropagation is used to jointly supervise the model's self-learning and optimize the network parameters.
2. The method according to claim 1, characterized in that the Rds-TE in step (1) is constructed as follows: it is a four-layer architecture, with each layer containing a patch embedding module PE and $N_i$ custom Transformer-layer modules TE; PE consists of a convolutional layer Conv with a kernel size of 4×4 and a layer normalization layer LN, and is used to divide the input image into individual patches and to normalize the data to accelerate training convergence; TE consists of a local window multi-head self-attention module W-MHSA, a split-shifted window multi-head self-attention module SSW-MHSA, and two feedforward network layers FFN; each MHSA module and each FFN is preceded by an LN, each MHSA is followed by a Dropout layer DR, and each module is followed by a residual connection that connects to the next stage; building on the multi-head self-attention mechanism MHSA, and to increase the connectivity between windows, the SSW-MHSA module is designed on the basis of W-MHSA: assuming the input feature map is $X \in \mathbb{R}^{h \times w \times C}$, it is divided evenly into $M$ non-overlapping local windows $[X_1, X_2, \ldots, X_M]$; assuming each window contains $n \times n$ tokens, then $M = (h \times w) / n^2$; from another perspective, $X$ is divided into $J$ heads along the $C$ channels, each head having a dimension of $d_j = C / J$; the $J$ heads are then divided into three groups of equal size; relative to the $M$ regularly partitioned windows, the first group shifts the windows horizontally by a set number of pixels, the second group shifts the windows diagonally by a set number of pixels, and the third group shifts the windows vertically by a set number of pixels; finally, each group performs MHSA within its own shifted windows.
The output of the self-attention calculation of the $j$-th head on the horizontally shifted windows is therefore defined over the windows $i = 1, \ldots, M$ using the parameters of the linear transformation matrices; local window self-attention in the vertical and diagonal directions is calculated in the same way as for the horizontally shifted windows, with only the direction of the window shift differing, and yields V-Attention$_j(X)$ and D-Attention$_j(X)$, the self-attention outputs of the $j$-th head on the vertically shifted and diagonally shifted windows, respectively; the calculation expression of SSW-MHSA is then $\text{SSW-MHSA}(X) = \mathrm{Concat}(h_1, \ldots, h_J)\, W^O$, where Concat(·) is a concatenation operation and $W^O \in \mathbb{R}^{C \times C}$ is the parameter of a linear transformation matrix; in the specific formula of TE, the symbols denote the input of layer $l$, the outputs of the W-MHSA module and the FFN in layer $l$, and the outputs of the SSW-MHSA module and the FFN in the following layer.
3. The method according to claim 1, characterized in that the multi-scale feature fusion module in step (3) is constructed as follows: the module is divided into two branches, a multi-scale feature fusion branch and a feature fusion branch for different sub-regions; after $F$ is fed into the first branch, $F_2$, $F_3$, and $F_4$ are upsampled by bilinear interpolation by 2×, 4×, and 8×, respectively, to the size of $F_1$, and the four feature maps of uniform size are stacked sequentially and merged into $F_5$ according to the following formula: $F_5 = \mathrm{Concat}(\mathrm{Up}_{8\times}(F_4), \mathrm{Up}_{4\times}(F_3), \mathrm{Up}_{2\times}(F_2), F_1)$; finally, $F_5$ generates the pixel representation $S_p$ after an affine transformation, where $C$ is the dimension of the feature channel, calculated as $S_p = \alpha(F_5)$, in which α(·) represents a 1×1 Conv operation that reduces the number of channels of $F_5$ to $C$; after $F_4$ is fed into the second branch, it passes through the Pyramid Pooling Module (PPM) to obtain different sub-region representations, which are upsampled by 8×, concatenated with the affine-transformed $F_5$, and then subjected to an affine transformation to produce the coarse category representation $S_c$, where $k$ is the number of rock cuttings types, calculated as $S_c = \gamma(\mathrm{Concat}(\mathrm{Up}_{8\times}(\mathrm{PPM}(F_4)), \beta(F_5)))$, in which β(·) and γ(·) represent the affine transformation operations $\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{1\times1}(\cdot)))$ and $\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{3\times3}(\cdot)))$, respectively.
4. The method according to claim 1, characterized in that in step (6) a multi-task learning strategy is used to jointly optimize the model parameters during the training phase: a loss function is added at each of $S_c$ and $S_O$; the loss at $S_c$ is used to supervise the model's self-learning to predict the initial segmentation, and the loss at $S_O$ is used to supervise the model to predict more accurate pixel-level segmentation results; the model's final loss function is the joint loss function, and after the loss value is calculated with it, the model parameters are optimized jointly through backpropagation.