Coal and rock image classification method, device, and electronic equipment based on a CNN-Transformer combination
By combining feature extraction methods from CNN and Transformer, and optimizing convolutional and attention layers, the accuracy and efficiency issues of coal and rock image classification in complex underground coal mine environments were addressed, achieving high-precision image classification.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- LINYI UNIVERSITY
- Filing Date
- 2025-01-16
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies struggle to classify coal and rock images with high precision in complex underground coal mine environments. Existing combinations of convolutional neural networks and Transformers suffer from an excessively large number of parameters and insufficient generalization ability.
By combining feature extraction methods from CNN and Transformer, an initial deep network model is constructed, convolutional and attention layers are optimized, and RefConv and SE residual connection modules are used to extract local and global features of the image. Refocusing transformation and global average pooling are used to refine and classify the feature maps.
It improves the accuracy and robustness of coal and rock image classification, reduces computational redundancy and memory access, is suitable for mobile devices, and adapts to complex coal mine environments.
Figure CN119942210B_ABST
Description
Technical Field
[0001] This invention belongs to the field of image processing technology, and specifically relates to a coal and rock image classification method, device, and electronic equipment based on a CNN-Transformer combination.
Background Technology
[0002] Coal is the world's most economical fossil fuel, playing a decisive role in global energy security and social development. However, due to the complex environment of coal mines, heavy dust, and poor lighting conditions, high-noise, low-quality coal images are often captured. Extracting useful information from these low-quality images is difficult, severely limiting the application of image and video technologies in intelligent coal mining. Coal and rock image classification is a particularly challenging task. Since coal and rock images typically possess complex textures and structural features, such as the longitudinal and transverse textures of coal seams and the contact surfaces between rock strata, effective feature extraction methods are crucial for solving the coal and rock image classification problem. Simple convolutional neural networks have fallen behind emerging technologies like Transformers in many coal and rock image classification tasks. Furthermore, the demand for high-accuracy coal and rock image classification on low-latency mobile devices is increasing due to the complex underground environment of coal mines.
[0003] To address these challenges, image classification tasks typically require high-performance computer vision techniques, such as introducing Transformers into the field. However, these techniques still lag behind state-of-the-art convolutional networks. This work demonstrates that while Transformers often have larger model capacity, their generalization can be worse than convolutional networks due to a lack of proper inductive bias, resulting in poor performance in coal and rock image classification tasks. Furthermore, the introduction of Transformers leads to an excessive number of computational parameters, hindering support for mobile devices. Convolutional layers, due to their strong inductive bias priors, often exhibit better generalization and faster convergence, while attention layers offer higher model capacity and can benefit from larger datasets. Combining convolutional and attention layers can achieve better generalization and capacity; however, a key challenge here is how to effectively combine them to achieve a better trade-off between accuracy and efficiency.
[0004] Therefore, the present invention provides a coal and rock image classification method, device, and electronic equipment based on a CNN-Transformer combination to solve the above problems.
Summary of the Invention
[0005] To address the aforementioned problems, this invention provides a coal and rock image classification method, device, and electronic device based on a combination of CNN and Transformer. By combining the respective advantages of CNN and Transformer, it can extract both local and global features of the image, providing a powerful tool for the field of geological research.
[0006] This invention is achieved through the following technical solution:
[0007] In a first aspect, the present invention provides a coal and rock image classification method based on a combination of CNN and Transformer, specifically including the following steps:
[0008] S1. Construct the initial CNN-Transformer deep network model, and input data from the texture library dataset into the initial CNN-Transformer model to obtain the base weights;
[0009] S2. Collect coal and rock images and filter them to obtain a dataset; preprocess the dataset to obtain the preprocessed dataset; then apply preliminary processing to the images in the preprocessed dataset to obtain feature maps;
[0010] S3. Optimize the initial CNN-Transformer model by replacing one standard convolutional layer and one depthwise separable convolutional layer in the initial CNN-Transformer model, obtaining the optimized CNN-Transformer model;
[0011] S4. Input the feature maps into the optimized CNN-Transformer model to obtain the output feature maps;
[0012] S5. Input the output of the optimized CNN-Transformer model into the classifier to obtain the classification results.
[0013] S1 is as follows:
[0014] The initial CNN-Transformer deep network model consists of the following structures: RBConv shallow feature extraction module blocks and Sim-Transformer deep feature extraction module blocks;
[0015] The RBConv shallow feature extraction module includes: a standard convolutional layer, a depthwise separable convolutional layer, and an SE residual connection module.
[0016] The Sim-Transformer deep feature extraction module includes: the Sim-Attention module and the feedforward neural network FFN;
[0017] The texture library dataset is an existing dataset whose images have already been classified and labeled. After the data from the texture library dataset is input into the initial CNN-Transformer model, feature maps are first extracted from its images. The extracted feature maps are then processed by the initial CNN-Transformer model to complete the training of the model. Finally, the convolutional kernels output by the initial CNN-Transformer model are frozen as the base weights.
[0018] S2 is as follows:
[0019] The images in the dataset are preprocessed by scaling them to the same size and by increasing the number of images through random cropping and rotation, yielding the preprocessed dataset;
[0020] The dataset and the preprocessed dataset are each represented as an indexed set of images, with notation for the number of images in the dataset, the number of images in the preprocessed dataset, an indexed image in the dataset, an indexed image in the preprocessed dataset, and an arbitrary image in the preprocessed dataset;
[0021] Each image in the preprocessed dataset undergoes preliminary processing that converts it into digital form, and is then fed into two 3×3 convolutions for feature extraction, resulting in the feature map.
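The preprocessing in S2 (scaling to a common size, then enlarging the dataset by random cropping and rotation) can be sketched in plain Python. This is a minimal illustration, not the patented procedure: the function names, the 8×8 target size, the 6×6 crop, the fixed 90° rotation and the nearest-neighbour resampling are all assumptions chosen to keep the example small.

```python
import random

def resize_nearest(img, out_h, out_w):
    """Scale a 2-D grayscale image (list of rows) with nearest-neighbour sampling."""
    in_h, in_w = len(img), len(img[0])
    return [[img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]

def random_crop(img, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(img) - crop_h)
    left = rng.randint(0, len(img[0]) - crop_w)
    return [row[left:left + crop_w] for row in img[top:top + crop_h]]

def rotate90(img):
    """Rotate the image 90 degrees clockwise (assumed rotation angle)."""
    return [list(row) for row in zip(*img[::-1])]

def augment(dataset, size=8, crop=6, seed=0):
    """Return each resized image plus one cropped and one rotated copy of it."""
    rng = random.Random(seed)
    out = []
    for img in dataset:
        base = resize_nearest(img, size, size)
        cropped = resize_nearest(random_crop(base, crop, crop, rng), size, size)
        out.extend([base, cropped, rotate90(base)])
    return out

# One hypothetical 5x10 toy image stands in for the collected coal and rock images.
raw = [[[r * 10 + c for c in range(10)] for r in range(5)]]
augmented = augment(raw)
```

Every image in `augmented` has the same size, so the set can be batched directly into the two 3×3 convolutions that follow.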
[0022] S3 is as follows:
[0023] The optimized model is derived from the initial model by replacing one standard convolutional layer and one depthwise separable convolutional layer with RefConv convolutions.
[0024] The optimized CNN-Transformer model is formed by stacking RBConv shallow feature extraction modules and Sim-Transformer deep feature extraction modules and connecting them in series.
[0025] The RBConv shallow feature extraction module includes: RefConv reparameterized refocusing convolution and SE residual connection module;
[0026] The SE residual connection module includes: a global average pooling layer, a fully connected layer, and an activation function for generating channel weights;
[0027] The Sim-Transformer module includes: the Sim-Attention module and the feedforward neural network FFN.
[0028] S4 is as follows:
[0029] S4.1 Before being fed into the optimized CNN-Transformer model, the feature map first passes through a ... convolution that expands its number of channels by a factor of 4: the original number of channels is C, and the expanded number of channels is 4C;
[0030] S4.2 The channel-expanded feature map is input into the RBConv shallow feature extraction module of the CNN-Transformer model, and RefConv outputs a feature map;
[0031] A refocusing transformation is then applied to the base weights $W_b$ output by the initial CNN-Transformer model to obtain the transformed weights $W_t$. The calculation is as follows:
[0032] $W_t = T(W_b, W_r)$,
[0033] where $T$ denotes the refocusing transformation and $W_r$ denotes the trainable parameters;
[0034] The feature map output by RefConv, denoted $X$, is then input into the SE residual module, which performs global average pooling on it. The calculation is as follows:
[0035] $z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_c(i, j)$,
[0036] where $z_c$ denotes the global average pooling output feature of channel $c$, the pooled output having size 1×1×4C; $H$ denotes the height and $W$ the width of the feature map $X$, whose size is height × width × channels; $i$ is the index in the height direction, $i = 1, \dots, H$; $j$ is the index in the width direction, $j = 1, \dots, W$; and $c$ indexes the channels, of which there are 4C;
[0037] The features $z$ output by global average pooling are fed into a fully connected layer $\mathrm{FC}$ and then passed through an activation function $\sigma$ to obtain the weight $s_c$ of each channel $c$. The calculation is as follows:
[0038] $s = \sigma(\mathrm{FC}(z))$,
[0039] Each channel $c$ of the input feature map is multiplied by its corresponding weight to generate a new feature map $\tilde{X}$. The calculation is as follows:
[0040] $\tilde{X}_c = s_c \cdot X_c$,
[0041] where the feature map $\tilde{X}$ has dimensions H×W×4C;
[0042] Finally, a 1×1 convolution reduces the number of channels of the feature map $\tilde{X}$ to 1/4, so that the number of channels after reduction is C;
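The channel reweighting performed by the SE residual module (global average pooling, a fully connected layer, an activation, then per-channel scaling) can be sketched as below. This is a hedged illustration: the sigmoid activation is an assumption (the common choice in squeeze-and-excitation blocks), and the identity weight matrix, the helper names and the toy input are hypothetical.

```python
import math

def global_avg_pool(x):
    """x[c][i][j] -> one mean value per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]

def fully_connected(v, weight, bias):
    """weight holds one row per output unit."""
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weight, bias)]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def se_scale(x, weight, bias):
    """Reweight every channel of x by its gate; returns (scaled map, gates)."""
    z = global_avg_pool(x)                                    # squeeze
    s = [sigmoid(a) for a in fully_connected(z, weight, bias)]  # excitation (assumed sigmoid)
    scaled = [[[s_c * v for v in row] for row in ch] for ch, s_c in zip(x, s)]
    return scaled, s

# Two channels of size 2x2; identity FC weights are a hypothetical choice.
x = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]
scaled, gates = se_scale(x, identity, [0.0, 0.0])
```

The output keeps the input shape (H×W×4C in the text), so the 1×1 channel reduction can be applied to it directly.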
[0043] S4.3 The feature map is input to the Sim-Transformer module. The feature map is a matrix composed of multiple neurons, each corresponding to a receptive field. An energy function is defined to find the important neurons, calculated as follows:
[0044] ,
[0045] ,
[0046] ,
[0047] ,
[0048] where the terms denote: the target neuron in a single channel of the feature map; the other neurons in that channel, excluding the target neuron; the index of the energy function; the total number of energy functions in each channel; the transformation weight; the transformation bias; and the mean and variance of all neurons in that channel other than the target neuron;
[0049] The minimum energy is then calculated, and importance is determined from it: the lower the minimum energy, the more the target neuron differs from the surrounding neurons and the higher its importance. The calculation is as follows:
[0050] ,
[0051] where the terms denote the mean and the variance of the neurons and the regularization coefficient;
[0052] The energy function values of the feature map are then propagated downstream, with the energy values of all neurons forming a set;
[0053] The feature map is then input to the Sim-Attention module. The neuron weights obtained through an activation function, which maps values to the range (0,1), are multiplied element-wise with the feature map, and the output is a refined feature map focused on the perception region of the key target. The Sim-Attention module is calculated as follows:
[0054] ,
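The Sim-Attention step can be sketched in plain Python. The sketch assumes the module follows the published SimAM formulation, in which each neuron is gated by a sigmoid of its inverse minimum energy; the function name, the regularization value `lam=1e-4` and the toy input are assumptions, not values from this document.

```python
import math

def sim_attention(x, lam=1e-4):
    """Gate every neuron of x[c][i][j] by sigmoid(1 / minimal energy).

    Returns (refined feature map, gates). Each channel needs at least two neurons.
    """
    refined, gates = [], []
    for ch in x:
        vals = [v for row in ch for v in row]
        mean = sum(vals) / len(vals)
        sq_dev = [[(v - mean) ** 2 for v in row] for row in ch]
        var = sum(sum(row) for row in sq_dev) / (len(vals) - 1)
        # Inverse minimal energy: large when a neuron stands out from its channel.
        inv_e = [[d / (4.0 * (var + lam)) + 0.5 for d in row] for row in sq_dev]
        g = [[1.0 / (1.0 + math.exp(-e)) for e in row] for row in inv_e]
        gates.append(g)
        refined.append([[v * w for v, w in zip(vr, gr)] for vr, gr in zip(ch, g)])
    return refined, gates

# One 3x3 channel with a single neuron that differs from its surroundings.
x = [[[1.0, 1.0, 1.0], [1.0, 9.0, 1.0], [1.0, 1.0, 1.0]]]
refined, gates = sim_attention(x)
```

In this toy input the centre neuron receives the largest gate, which matches the statement that a neuron differing more from its neighbours is treated as more important.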
[0055] The feature map is then input to the feedforward neural network FFN, where it first passes through a fully connected layer to obtain a feature map. The calculation is as follows:
[0056] ,
[0057] where the terms denote the activation function operation, the feature dimension, and a fixed parameter;
[0058] Layer normalization is then applied to obtain the final output feature map. The calculation is as follows:
[0059] ,
[0060] where the operation standardizes and normalizes each channel of the feature map.
[0061] S5 is detailed below:
[0062] The feature map is classified using an average pooling layer and a fully connected layer;
[0063] Global average pooling is performed on each channel of the input feature map $F$:
[0064] $g_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_c(i, j)$,
[0065] where the input feature map $F$ has size $H \times W \times C$, with $H$ and $W$ its height and width; $g_c$ is the global average pooling output feature of channel $c$; the number of channels is $C$, and the pooled output $g$ has dimensions 1×1×C;
[0066] The pooled vector $g$ is mapped by a fully connected layer to the category space, giving a probability distribution over the categories, and the category label with the highest probability is selected as the final label of the coal and rock image. Assuming there are $K$ categories in total:
[0067] $y = W_{fc}\, g + b$,
[0068] where $W_{fc}$ is the weight matrix of the fully connected layer, of size $K \times C$; $b$ is the bias vector; $y$ is the vector of category scores, of size $K$; and $y_k$ is the score of the $k$-th category, $k = 1, \dots, K$;
[0069] The Softmax activation function is then applied to the output of the fully connected layer to obtain the probability distribution over the categories:
[0070] $p_k = \frac{e^{y_k}}{\sum_{m=1}^{K} e^{y_m}}$,
[0071] where $p_k$ is the probability that the input image belongs to category $k$;
[0072] The classification probability of each category is calculated, and the category with the highest probability is taken as the category of the input image.
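The classification head in S5 (global average pooling, a fully connected layer, a probability distribution, then the most probable label) can be sketched as follows. The softmax normalization is assumed here as the function that turns scores into probabilities, and the labels, weights and toy feature map are hypothetical.

```python
import math

def global_avg_pool(x):
    """x[c][i][j] -> one mean value per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]

def softmax(scores):
    """Numerically stable softmax (subtracts the maximum score first)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(x, weight, bias, labels):
    """Return (predicted label, per-class probabilities) for feature map x."""
    pooled = global_avg_pool(x)
    scores = [sum(w * a for w, a in zip(row, pooled)) + b
              for row, b in zip(weight, bias)]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs

# Two channels (C = 2) and two hypothetical classes (K = 2).
feature_map = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
weight = [[1.0, 0.0], [0.0, 1.0]]          # K x C, illustrative values
label, probs = classify(feature_map, weight, [0.0, 0.0], ["coal", "rock"])
```

Here the pooled vector is `[1.0, 3.0]`, so the second class scores higher and `label` is `"rock"`.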
[0073] In a second aspect, the present invention also provides a coal and rock image classification device based on a combination of CNN and Transformer, which executes the coal and rock image classification method based on a combination of CNN and Transformer and comprises the following units:
[0074] 1) Pre-trained learning unit: The initial CNN-Transformer model is pre-trained by processing existing data with labeled image types.
[0075] 2) Coal and rock image acquisition unit: Acquires images for constructing a coal and rock image dataset and preprocesses the dataset;
[0076] 3) Optimization Unit: Replace some structures in the initial CNN-Transformer model to obtain the optimized CNN-Transformer model;
[0077] 4) Image data processing unit: The unit processes the acquired images using the parameters output by the optimized CNN-Transformer model and the initial CNN-Transformer model to obtain the feature map of the input image;
[0078] 5) Classification unit: Classifies the input image based on the feature map obtained from the image data processing unit.
[0079] In a third aspect, the present invention also provides an electronic device, including a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor. When the electronic device is running, the processor communicates with the memory via the bus. When the machine-readable instructions are executed by the processor, the coal and rock image classification method based on a combination of CNN and Transformer is performed.
[0080] The advantages of this invention are:
[0081] The technical solution of this invention classifies low-quality coal and rock images with high accuracy and robustness by combining convolutional networks and Transformers. Convolutional layers, due to their strong inductive bias priors, often exhibit better generalization and faster convergence, while the attention layer in the Transformer has higher model capacity and benefits from larger datasets. The combination of convolutional and attention layers achieves better generalization and capacity. Furthermore, the use of RefConv reduces computational redundancy and memory access while greatly reducing the number of parameters, making coal and rock image recognition feasible on mobile devices in complex coal mine environments.
Attached Figure Description
[0082] The accompanying drawings are provided to further illustrate the invention and form part of the specification. They are used together with the embodiments of the invention to explain the invention and do not constitute a limitation thereof.
[0083] Figure 1 is a schematic flow diagram of the method of the present invention.
[0084] Figure 2 is a heatmap comparing the regions of interest of the method of this invention and of CoAtNet-0 in coal and rock image feature extraction.
Detailed Implementation
[0085] To better understand the above-mentioned objectives, features, and advantages of the present invention, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, unless otherwise specified, the embodiments and features described in these embodiments can be combined with each other.
[0086] Many specific details are set forth in the following description in order to provide a full understanding of the invention. However, the invention may also be practiced in other ways different from those described herein, and therefore the scope of protection of the invention is not limited to the specific embodiments disclosed below.
[0087] Example 1
[0088] The coal and rock image classification method based on a combination of CNN and Transformer is described in detail below with reference to Figure 1.
[0089] A coal and rock image classification method based on CNN-Transformer combination includes the following steps:
[0090] S1. Construct the initial CNN-Transformer deep network model, and input data from the texture library dataset into the initial CNN-Transformer model to obtain the base weights;
[0091] S2. Collect coal and rock images and filter them to obtain a dataset; preprocess the dataset to obtain the preprocessed dataset; then apply preliminary processing to the images in the preprocessed dataset to obtain feature maps;
[0092] S3. Optimize the initial CNN-Transformer model by replacing one standard convolutional layer and one depthwise separable convolutional layer in the initial CNN-Transformer model, obtaining the optimized CNN-Transformer model;
[0093] S4. Input the feature maps into the optimized CNN-Transformer model to obtain the output feature maps;
[0094] S5. Input the output of the optimized CNN-Transformer model into the classifier to obtain the classification results.
[0095] S1 is as follows:
[0096] The initial CNN-Transformer deep network model consists of the following structures: RBConv shallow feature extraction module blocks and Sim-Transformer deep feature extraction module blocks;
[0097] The RBConv shallow feature extraction module includes: a standard convolutional layer, a depthwise separable convolutional layer, and an SE residual connection module.
[0098] The Sim-Transformer deep feature extraction module includes: the Sim-Attention module and the feedforward neural network FFN;
[0099] The texture library dataset is an existing dataset whose images have already been classified and labeled. After the data from the texture library dataset is input into the initial CNN-Transformer model, feature maps are first extracted from its images. The extracted feature maps are then processed by the initial CNN-Transformer model to complete the training of the model. Finally, the convolutional kernels output by the initial CNN-Transformer model are frozen as the base weights.
[0100] S2 is as follows:
[0101] The images in the dataset are preprocessed by scaling them to the same size and by increasing the number of images through random cropping and rotation, yielding the preprocessed dataset;
[0102] The dataset and the preprocessed dataset are each represented as an indexed set of images, with notation for the number of images in the dataset, the number of images in the preprocessed dataset, an indexed image in the dataset, an indexed image in the preprocessed dataset, and an arbitrary image in the preprocessed dataset;
[0103] Each image in the preprocessed dataset undergoes preliminary processing that converts it into digital form, and is then fed into two 3×3 convolutions for feature extraction, resulting in the feature map.
[0104] S3 is as follows:
[0105] The optimized model is derived from the initial model by replacing one standard convolutional layer and one depthwise separable convolutional layer with RefConv convolutions.
[0106] The optimized CNN-Transformer model is formed by stacking RBConv shallow feature extraction modules and Sim-Transformer deep feature extraction modules and connecting them in series.
[0107] The RBConv shallow feature extraction module includes: RefConv reparameterized refocusing convolution and SE residual connection module;
[0108] The SE residual connection module includes: a global average pooling layer, a fully connected layer, and an activation function for generating channel weights;
[0109] The Sim-Transformer module includes: the Sim-Attention module and the feedforward neural network FFN.
[0110] S4 is as follows:
[0111] S4.1 Before being fed into the optimized CNN-Transformer model, the feature map first passes through a ... convolution that expands its number of channels by a factor of 4: the original number of channels is C, and the expanded number of channels is 4C;
[0112] S4.2 The channel-expanded feature map is input into the RBConv shallow feature extraction module of the CNN-Transformer model, and RefConv outputs a feature map;
[0113] A refocusing transformation is then applied to the base weights $W_b$ output by the initial CNN-Transformer model to obtain the transformed weights $W_t$. The calculation is as follows:
[0114] $W_t = T(W_b, W_r)$,
[0115] where $T$ denotes the refocusing transformation and $W_r$ denotes the trainable parameters;
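One way to picture the refocusing transformation is as a small convolution applied to the frozen base kernel itself, with the base kernel added back through an identity connection, as in the published RefConv design. The sketch below assumes that form and simplifies it to a single 2-D kernel slice; the function names, kernel values and the use of zero padding are illustrative assumptions.

```python
def conv2d_same(inp, kernel):
    """Zero-padded 2-D cross-correlation that keeps the input size (odd kernel)."""
    h, w = len(inp), len(inp[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    r, c = i + a - ph, j + b - pw
                    if 0 <= r < h and 0 <= c < w:
                        acc += inp[r][c] * kernel[a][b]
            out[i][j] = acc
    return out

def refocus(w_base, w_refocus):
    """Transformed kernel = conv(base kernel, refocusing weights) + base kernel."""
    delta = conv2d_same(w_base, w_refocus)
    return [[d + b for d, b in zip(drow, brow)] for drow, brow in zip(delta, w_base)]

# Frozen base kernel from pre-training (hypothetical values).
w_base = [[0.0, 1.0, 0.0], [1.0, -4.0, 1.0], [0.0, 1.0, 0.0]]
# Trainable refocusing weights; all zeros leave the base kernel unchanged.
w_zero = [[0.0] * 3 for _ in range(3)]
w_center = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]

unchanged = refocus(w_base, w_zero)
doubled = refocus(w_base, w_center)
```

Because only the refocusing weights are trained while the base kernel stays frozen, the transformed kernel can be computed once after training and used as an ordinary convolution kernel at inference time.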
[0116] The feature map output by RefConv, denoted $X$, is then input into the SE residual module, which performs global average pooling on it. The calculation is as follows:
[0117] $z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_c(i, j)$,
[0118] where $z_c$ denotes the global average pooling output feature of channel $c$, the pooled output having size 1×1×4C; $H$ denotes the height and $W$ the width of the feature map $X$, whose size is height × width × channels; $i$ is the index in the height direction, $i = 1, \dots, H$; $j$ is the index in the width direction, $j = 1, \dots, W$; and $c$ indexes the channels, of which there are 4C;
[0119] The features $z$ output by global average pooling are fed into a fully connected layer $\mathrm{FC}$ and then passed through an activation function $\sigma$ to obtain the weight $s_c$ of each channel $c$. The calculation is as follows:
[0120] $s = \sigma(\mathrm{FC}(z))$,
[0121] Each channel $c$ of the input feature map is multiplied by its corresponding weight to generate a new feature map $\tilde{X}$. The calculation is as follows:
[0122] $\tilde{X}_c = s_c \cdot X_c$,
[0123] where the feature map $\tilde{X}$ has dimensions H×W×4C;
[0124] Finally, a 1×1 convolution reduces the number of channels of the feature map $\tilde{X}$ to 1/4, so that the number of channels after reduction is C;
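The 1×1 convolution that brings the channel count from 4C back to C is a per-pixel linear mix of channels. A minimal sketch follows; the averaging weights and the toy sizes (C = 2, 3×3 maps) are hypothetical, since in the model these weights would be learned.

```python
def pointwise_conv(x, weight):
    """1x1 convolution: x[c_in][i][j], weight[c_out][c_in] -> y[c_out][i][j]."""
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(wk * x[k][i][j] for k, wk in enumerate(row)) for j in range(w)]
             for i in range(h)] for row in weight]

C = 2
# 4C = 8 input channels of size 3x3; channel k is filled with the value k.
x = [[[float(k)] * 3 for _ in range(3)] for k in range(4 * C)]
# Hypothetical weights: each output channel averages one group of 4 input channels.
reduce_w = [[0.25 if k // 4 == o else 0.0 for k in range(4 * C)] for o in range(C)]
y = pointwise_conv(x, reduce_w)
```

The spatial size is untouched; only the channel dimension shrinks, which is why the same operation with a C×4C-shaped weight transposed in role can also serve for channel expansion.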
[0125] S4.3 The feature map is input to the Sim-Transformer module. The feature map is a matrix composed of multiple neurons, each corresponding to a receptive field. An energy function is defined to find the important neurons, calculated as follows:
[0126] ,
[0127] ,
[0128] ,
[0129] ,
[0130] where the terms denote: the target neuron in a single channel of the feature map; the other neurons in that channel, excluding the target neuron; the index of the energy function; the total number of energy functions in each channel; the transformation weight; the transformation bias; and the mean and variance of all neurons in that channel other than the target neuron;
[0131] The minimum energy is then calculated, and importance is determined from it: the lower the minimum energy, the more the target neuron differs from the surrounding neurons and the higher its importance. The calculation is as follows:
[0132] ,
[0133] where the terms denote the mean and the variance of the neurons and the regularization coefficient;
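As a point of reference, the quantities described for S4.3 (target neuron, other neurons in the channel, transformation weight and bias, regularization coefficient, minimum energy) match the SimAM attention formulation. Assuming that formulation is what is intended, they can be written as follows. The symbols $t$, $x_i$, $M$, $w_t$, $b_t$, $\mu_t$, $\sigma_t^2$, $\hat{\mu}$, $\hat{\sigma}^2$, $\lambda$ and $E$ are borrowed from SimAM and are not taken from this document.

```latex
% Energy of target neuron t in one channel with M = H x W neurons
e_t(w_t, b_t, x_i) = \frac{1}{M-1} \sum_{i=1}^{M-1} \bigl(-1 - (w_t x_i + b_t)\bigr)^2
                   + \bigl(1 - (w_t t + b_t)\bigr)^2 + \lambda w_t^2

% Closed-form minimizers
w_t = -\frac{2\,(t - \mu_t)}{(t - \mu_t)^2 + 2\sigma_t^2 + 2\lambda},
\qquad
b_t = -\frac{1}{2}\,(t + \mu_t)\, w_t

% Statistics of the other neurons in the channel
\mu_t = \frac{1}{M-1} \sum_{i=1}^{M-1} x_i,
\qquad
\sigma_t^2 = \frac{1}{M-1} \sum_{i=1}^{M-1} (x_i - \mu_t)^2

% Minimum energy, using channel-wide mean and variance
e_t^{*} = \frac{4\,(\hat{\sigma}^2 + \lambda)}{(t - \hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda}

% Attention output, with E collecting e_t^* over all neurons
\tilde{X} = \operatorname{sigmoid}\!\left(\frac{1}{E}\right) \odot X
```

Under this reading, a small $e_t^{*}$ gives a large $1/e_t^{*}$ and therefore a gate close to 1, which is consistent with lower energy meaning higher importance.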
[0134] The energy function values of the feature map are then propagated downstream, with the energy values of all neurons forming a set;
[0135] The feature map is then input to the Sim-Attention module. The neuron weights obtained through an activation function, which maps values to the range (0,1), are multiplied element-wise with the feature map, and the output is a refined feature map focused on the perception region of the key target. The Sim-Attention module is calculated as follows:
[0136] ,
[0137] The feature map is then input to the feedforward neural network FFN, where it first passes through a fully connected layer to obtain a feature map. The calculation is as follows:
[0138] ,
[0139] where the terms denote the activation function operation, the feature dimension, and a fixed parameter;
[0140] Layer normalization is then applied to obtain the final output feature map. The calculation is as follows:
[0141] ,
[0142] where the operation standardizes and normalizes each channel of the feature map.
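The normalization step at the end of the FFN can be illustrated with the standard layer-normalization computation: subtract the mean, divide by the standard deviation. This sketch assumes that standard form and omits the learnable scale and shift that layer normalization usually carries; `eps` and the sample vector are illustrative.

```python
import math

def layer_norm(v, eps=1e-5):
    """Standardize a feature vector to zero mean and (almost) unit variance."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

features = [1.0, 2.0, 3.0, 4.0]
normalized = layer_norm(features)
out_mean = sum(normalized) / len(normalized)
out_var = sum((a - out_mean) ** 2 for a in normalized) / len(normalized)
```

The small `eps` keeps the division stable when a feature vector is nearly constant, at the cost of an output variance very slightly below 1.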
[0143] S5 is detailed below:
[0144] The feature map is classified using an average pooling layer and a fully connected layer;
[0145] Global average pooling is performed on each channel of the input feature map $F$:
[0146] $g_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_c(i, j)$,
[0147] where the input feature map $F$ has size $H \times W \times C$, with $H$ and $W$ its height and width; $g_c$ is the global average pooling output feature of channel $c$; the number of channels is $C$, and the pooled output $g$ has dimensions 1×1×C;
[0148] The pooled vector $g$ is mapped by a fully connected layer to the category space, giving a probability distribution over the categories, and the category label with the highest probability is selected as the final label of the coal and rock image. Assuming there are $K$ categories in total:
[0149] $y = W_{fc}\, g + b$,
[0150] where $W_{fc}$ is the weight matrix of the fully connected layer, of size $K \times C$; $b$ is the bias vector; $y$ is the vector of category scores, of size $K$; and $y_k$ is the score of the $k$-th category, $k = 1, \dots, K$;
[0151] The Softmax activation function is then applied to the output of the fully connected layer to obtain the probability distribution over the categories:
[0152] $p_k = \frac{e^{y_k}}{\sum_{m=1}^{K} e^{y_m}}$,
[0153] where $p_k$ is the probability that the input image belongs to category $k$;
[0154] The classification probability of each category is calculated, and the category with the highest probability is taken as the category of the input image.
[0155] This invention combines CNN and Transformer. Convolutional layers, due to their strong inductive bias prior, often have better generalization and faster convergence, while the attention layer in the Transformer has higher model capacity and can benefit from larger datasets. The combination of convolutional and attention layers can achieve better generalization and capacity. Furthermore, RefConv is used, which greatly reduces the number of parameters while reducing computational redundancy and memory access, and the resulting model achieves higher classification accuracy than comparable models. This makes it possible to recognize coal and rock images on mobile devices in complex coal mine environments.
[0156] To verify the effectiveness of this method, it was compared with existing methods under the same conditions. As shown in Table 1, thirteen models were compared, which fall into three categories: image classification models with a pure convolutional (CNN) architecture (ResNet-101 and NFNet-F3); Vision Transformer (ViT) variants (DeiT-B and ViT-L/16); and models with a CNN-Transformer hybrid architecture (MobileViT-S, ConFormer-S, T2T-ViT-24, FastViT-SA36, Next-ViT-S, DeepMAD-50M, CoAtNet-0), including the baseline model CoAtNet-0 (the model with the RBConv shallow feature extraction module and the Sim-Transformer deep feature extraction module removed). Three metrics were compared: the number of model parameters (Params), the number of floating-point operations (FLOPs), and Top-1 accuracy. The results show that, without pre-training, the model of this invention achieved an accuracy of 91.4%, the highest among the compared models and 7.4% above the baseline; the number of parameters did not increase significantly, while FLOPs were reduced by 19.0%. After pre-training, the model of this invention achieved a still higher accuracy of 95.72%, a significant improvement in both data efficiency and computational efficiency.
[0157] Table 1. Performance comparison results of the method in this invention and existing methods on coal and rock datasets.
[0158]
[0159] Example 2
[0160] A coal and rock image classification device based on CNN-Transformer combination, which implements a coal and rock image classification method based on CNN-Transformer combination, includes the following units:
[0161] 1) Pre-trained learning unit: The initial CNN-Transformer model is pre-trained by processing existing data with labeled image types.
[0162] 2) Coal and rock image acquisition unit: Acquires images for constructing a coal and rock image dataset and preprocesses the dataset;
[0163] 3) Optimization Unit: Replace some structures in the initial CNN-Transformer model to obtain the optimized CNN-Transformer model;
[0164] 4) Image data processing unit: The unit processes the acquired images using the parameters output by the optimized CNN-Transformer model and the initial CNN-Transformer model to obtain the feature map of the input image;
[0165] 5) Classification unit: Classifies the input image based on the feature map obtained from the image data processing unit.
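The division of the device into five units can be pictured as a small pipeline in which each unit is a replaceable callable. The sketch below is purely structural: the class name, the field names and the stub functions are hypothetical and stand in for the real pre-training, acquisition, optimization, feature extraction and classification logic.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class CoalRockClassifierDevice:
    pretrain: Callable[[Any], Any]        # unit 1: labeled data -> base weights
    acquire: Callable[[], List[Any]]      # unit 2: returns preprocessed images
    optimize: Callable[[Any], Any]        # unit 3: base weights -> optimized model
    extract: Callable[[Any, Any], Any]    # unit 4: (model, image) -> feature map
    classify: Callable[[Any], str]        # unit 5: feature map -> class label

    def run(self, labeled_data: Any) -> List[str]:
        """Chain the five units in the order given in the description."""
        base_weights = self.pretrain(labeled_data)
        model = self.optimize(base_weights)
        return [self.classify(self.extract(model, img)) for img in self.acquire()]

# Stub units with toy arithmetic, used only to show the data flow.
device = CoalRockClassifierDevice(
    pretrain=lambda data: sum(data),
    acquire=lambda: [1, 5],
    optimize=lambda weights: weights,
    extract=lambda model, img: img - model,
    classify=lambda feature: "coal" if feature > 0 else "rock",
)
labels = device.run([1, 2])
```

Keeping the units as separate callables mirrors the description, where the optimization unit only swaps structures in the model and the classification unit only consumes feature maps.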
[0166] Example 3
[0167] An electronic device includes a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor. When the electronic device is running, the processor communicates with the memory via the bus. When the machine-readable instructions are executed by the processor, the aforementioned coal and rock image classification method based on CNN-Transformer is performed.
[0168] Memory, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs and non-transitory computer-executable programs. Furthermore, memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory may optionally include memory remotely located relative to the processor, and these remote memories can be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
[0169] Example 4
[0170] To further demonstrate the effectiveness of the method of the present invention in extracting feature information from coal and rock images, Figure 2 shows heatmaps of the focus and attention information of the method of the present invention and of the existing method CoAtNet-0 on the test set. The blue area represents the network's key attention area. The heatmaps show that the method of the present invention focuses accurately on the key feature areas in the image and extracts detailed information for analysis. In contrast, although CoAtNet-0 also shows some feature capture capability, its accuracy on local features is slightly lower than that of the model of the present invention. These observations further demonstrate the advantages of the method of the present invention in coal and rock image analysis and support its future practical application.
[0171] Finally, it should be noted that the above descriptions are merely preferred embodiments of the present invention and are not intended to limit the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some of the technical features. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A coal and rock image classification method based on a CNN-Transformer combination, characterized in that the steps are as follows:
S1. Construct an initial deep network model CNN-Transformer, and input the data of a texture library dataset into the initial CNN-Transformer model to obtain the base weights;
S2. Collect coal and rock images and filter them to obtain a dataset; preprocess the dataset to obtain the preprocessed dataset; then perform preliminary processing on the images of the preprocessed dataset to obtain an initial feature map;
S3. Optimize the initial CNN-Transformer model by replacing one standard convolutional layer and one depthwise separable convolutional layer in the initial CNN-Transformer model with RefConv convolutions to obtain the optimized CNN-Transformer model. The optimized CNN-Transformer model is formed by stacking RBConv shallow feature extraction modules and Sim-Transformer deep feature extraction modules in series. The RBConv shallow feature extraction module includes a RefConv re-parameterized refocusing convolution and an SE residual connection module; the SE residual connection module includes a global average pooling layer, a fully connected layer, and an activation function for generating channel weights; the Sim-Transformer module includes the Sim-Attention module and the feedforward neural network FFN;
S4. Input the initial feature map into the optimized CNN-Transformer model to obtain a feature map, as follows:
S4.1. Before being fed into the optimized CNN-Transformer model, the initial feature map first passes through a ... convolution that increases its number of channels fourfold; the original number of channels is C and the expanded number of channels is 4C;
S4.2. The channel-expanded feature map is input into the RBConv shallow feature extraction module of the optimized CNN-Transformer model, and RefConv outputs a feature map. A refocusing transformation is applied to the base weights output by the initial CNN-Transformer model to obtain the transformed weights: $W_t = \mathrm{Trans}(W_b, W_r)$, wherein $\mathrm{Trans}(\cdot)$ denotes the refocusing transformation, $W_b$ denotes the base weights, and $W_r$ denotes the trainable parameters. The feature map output by RefConv is then input into the SE residual connection module, which performs global average pooling on it: $z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j)$, wherein $z_c$ denotes the global average pooling output feature of channel $c$, the pooled output having a size of 1×1×4C; $H$ denotes the height and $W$ the width of the feature map, whose size is height × width × channels; $i$ denotes the index in the height direction, $i = 1, \dots, H$; $j$ denotes the index in the width direction, $j = 1, \dots, W$; $x_c(i, j)$ denotes the value of channel $c$ at position $(i, j)$; and the number of channels is 4C. The features output by global average pooling are fed into a fully connected layer and then through an activation function to obtain the weight value $s_c$ of channel $c$. Each channel of the input feature map is multiplied by its corresponding weight to generate a new feature map, $\tilde{x}_c = s_c \cdot x_c$, wherein the new feature map has a size of H×W×4C. Finally, the number of channels of this feature map is reduced to 1/4 by another 1×1 convolution, so that the reduced number of channels is C;
S4.3. The channel-reduced feature map is input into the Sim-Transformer module. The feature map is a matrix composed of multiple neurons, each neuron corresponding to a receptive field. An energy function is defined to find the important neurons. It is defined in terms of the target neuron in a single channel of the feature map, the other neurons in that channel apart from the target neuron, the index and total number of energy functions within each channel, the transformation weight, the transformation bias, and the mean and variance of all neurons in the channel other than the target neuron. The minimum energy is then calculated from the mean and variance of the neurons and a regularization coefficient, and importance is determined from it: the lower the minimum energy, the more the target neuron differs from the surrounding neurons and the higher its importance. The feature map then propagates onward in the form of energy function values, and the energy values of all neurons form a set. The feature map is next input into the Sim-Attention module, where the neuron weights, obtained through an activation function that maps its inputs to the range (0,1), are multiplied element-wise with the feature map to output a refined feature map focusing on the perception region of the key target. The refined feature map is input into the feedforward neural network FFN: it first passes through a fully connected layer, whose calculation involves an activation function operation, the feature dimension, and a fixed parameter; layer normalization, which normalizes each channel of the feature map one by one, is then performed to obtain the final output feature map;
S5. Input the output of the optimized CNN-Transformer model into the classifier to obtain the classification result.
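The neuron weighting in step S4.3 closely resembles the published SimAM attention module. As a minimal sketch, assuming the minimum energy takes SimAM's closed form $e_t^* = 4(\hat{\sigma}^2 + \lambda) / ((t - \hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda)$ and that the activation is a sigmoid applied to $1/e_t^*$ (both are assumptions, since the formulas are not legible in the claim text), one channel could be refined as follows. The function name and the default value of `lam` are hypothetical.

```python
import math
from typing import List


def sim_attention_channel(channel: List[List[float]], lam: float = 1e-4) -> List[List[float]]:
    """Refine one H x W channel with SimAM-style energy weights (assumed form)."""
    flat = [v for row in channel for v in row]
    m = len(flat)
    mu = sum(flat) / m
    # Variance over the channel with an (M - 1) denominator; the mean and
    # variance are shared by all neurons of the channel as a simplification.
    var = sum((v - mu) ** 2 for v in flat) / max(m - 1, 1)

    refined = []
    for row in channel:
        new_row = []
        for t in row:
            # 1 / e_t^* simplifies to (t - mu)^2 / (4 * (var + lam)) + 0.5
            inv_energy = (t - mu) ** 2 / (4.0 * (var + lam)) + 0.5
            weight = 1.0 / (1.0 + math.exp(-inv_energy))  # sigmoid, in (0, 1)
            new_row.append(t * weight)                     # element-wise product
        refined.append(new_row)
    return refined
```

Under these assumptions a neuron that differs strongly from the channel mean has lower minimum energy, hence a larger weight, which matches the importance rule stated in S4.3.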
2. The coal and rock image classification method based on a CNN-Transformer combination according to claim 1, characterized in that S1 is as follows: the structure of the initial deep network model CNN-Transformer comprises a number of RBConv shallow feature extraction module blocks and a number of Sim-Transformer deep feature extraction module blocks; the RBConv shallow feature extraction module includes a standard convolutional layer, a depthwise separable convolutional layer, and an SE residual connection module; the Sim-Transformer deep feature extraction module includes the Sim-Attention module and the feedforward neural network FFN; the texture library dataset is an existing dataset whose images have been classified and labeled; after the data of the texture library dataset are input into the initial CNN-Transformer model, the features of the images in the texture library dataset are first extracted and then processed by the initial CNN-Transformer model to complete its training; finally, the convolution kernel output by the trained initial CNN-Transformer model is frozen as the base weights.
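The SE residual connection module listed among the RBConv components reweights channels using pooled statistics. The following is a minimal sketch under stated assumptions: the feature map is stored channel-first as nested lists, the single fully connected layer is passed in as hypothetical `fc_weight` and `fc_bias` arguments, and a sigmoid is assumed as the activation, which the claim text does not name. The residual path itself is not modeled.

```python
import math
from typing import List

Channel = List[List[float]]  # one H x W channel


def global_average_pool(fmap: List[Channel]) -> List[float]:
    """Average every channel over its height and width."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]


def se_reweight(fmap: List[Channel],
                fc_weight: List[List[float]],
                fc_bias: List[float]) -> List[Channel]:
    """Scale each channel by a weight derived from the pooled descriptor."""
    z = global_average_pool(fmap)
    # Fully connected layer: one score per channel.
    scores = [sum(w * zc for w, zc in zip(w_row, z)) + b
              for w_row, b in zip(fc_weight, fc_bias)]
    # Assumed sigmoid activation gives channel weights in (0, 1).
    s = [1.0 / (1.0 + math.exp(-v)) for v in scores]
    return [[[s_c * v for v in row] for row in ch] for ch, s_c in zip(fmap, s)]
```

The output keeps the input shape; only the per-channel scale changes, which is why the surrounding 1×1 convolutions can expand and restore the channel count independently of this step.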
3. The coal and rock image classification method based on a CNN-Transformer combination according to claim 2, characterized in that S2 is as follows: the images in the dataset are preprocessed by scaling them to the same size, and the number of images in the dataset is increased through random cropping and rotation, giving the final preprocessed dataset; the dataset is represented as $D = \{I_1, I_2, \dots, I_n\}$ and the preprocessed dataset as $D' = \{I'_1, I'_2, \dots, I'_m\}$, wherein $n$ represents the number of images in the dataset $D$, $m$ represents the number of images in the preprocessed dataset $D'$, $I_i$ represents the $i$-th image in $D$, and $I'_j$ represents the $j$-th image in $D'$, standing for any image in the preprocessed dataset; preliminary processing is then performed on the images of the preprocessed dataset: each image $I'_j$ is input to two 3×3 convolutions for feature extraction to obtain the feature map.
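The augmentation in S2 (scaling to one size, then random cropping and rotation to enlarge the dataset) can be sketched on toy images stored as nested lists. The nearest-neighbour resize, the restriction to 90° rotations, and all function names, sizes, and the `copies` parameter are assumptions made for illustration; the claim does not specify them.

```python
import random
from typing import List

Image = List[List[float]]


def resize_nearest(img: Image, size: int) -> Image:
    """Scale an image to size x size by nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    return [[img[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]


def random_crop(img: Image, crop: int, rng: random.Random) -> Image:
    """Cut a crop x crop patch at a random position."""
    h, w = len(img), len(img[0])
    top = rng.randint(0, h - crop)
    left = rng.randint(0, w - crop)
    return [row[left:left + crop] for row in img[top:top + crop]]


def rotate90(img: Image) -> Image:
    """Rotate an image by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def augment(dataset: List[Image], size: int, crop: int,
            copies: int, seed: int = 0) -> List[Image]:
    """Return the resized originals plus `copies` augmented variants of each."""
    rng = random.Random(seed)
    out: List[Image] = []
    for img in dataset:
        base = resize_nearest(img, size)
        out.append(base)
        for _copy in range(copies):
            patch = random_crop(base, crop, rng)
            for _turn in range(rng.randint(0, 3)):
                patch = rotate90(patch)
            out.append(resize_nearest(patch, size))  # back to the common size
    return out
```

With `n` input images the sketch yields `n * (1 + copies)` images, all of the same size, which corresponds to the enlarged preprocessed dataset described in the claim.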
4. The coal and rock image classification method based on a CNN-Transformer combination according to claim 3, characterized in that S5 is as follows: the feature map is classified through an average pooling layer and a fully connected layer; global average pooling is performed on each channel of the input feature map: $v_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j)$, wherein the input feature map has a size of $H \times W \times C$, $H$ and $W$ are the height and the width of the feature map, respectively, $x_c(i, j)$ is the value of channel $c$ at position $(i, j)$, and $v_c$ is the global average pooling output feature of channel $c$, the number of channels being $C$ and the pooled output having a size of 1×1×C; the average-pooled vector $v$ is mapped to the category space through the fully connected layer, the probability distribution over the categories is obtained, and the category label with the highest probability is selected as the final category label of the coal and rock image; assuming that there are $K$ categories, the specific process is as follows: $z = W v + b$, wherein $W$ denotes the weight matrix of the fully connected layer, with dimensions $K \times C$, $b$ denotes the bias vector, $z$ denotes the scores for the classes, with dimensions $K \times 1$, $z = [z_1, z_2, \dots, z_K]^\top$, and $z_k$ denotes the score for the $k$-th class, $k = 1, 2, \dots, K$; an activation function is then applied to the output of the fully connected layer to obtain a probability distribution over the classes, wherein $p_k$ represents the probability that the input image belongs to class $k$; the classification probability is calculated for each category, and the category corresponding to the highest probability value is taken as the category of the input image.
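The classification head of S5 (global average pooling, a fully connected layer, then picking the most probable class) can be sketched as follows. Softmax is assumed as the activation that yields the probability distribution, which the claim text does not name; the function name and the channel-first nested-list layout are likewise illustrative choices.

```python
import math
from typing import List, Tuple

Channel = List[List[float]]  # one H x W channel


def classify(fmap: List[Channel],
             weight: List[List[float]],
             bias: List[float]) -> Tuple[int, List[float]]:
    """Return (index of the most probable class, class probabilities)."""
    # Global average pooling: one value per channel (vector of length C).
    v = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]
    # Fully connected layer: z = W v + b, with W of shape K x C.
    z = [sum(w * x for w, x in zip(w_row, v)) + b
         for w_row, b in zip(weight, bias)]
    # Assumed softmax, shifted by max(z) for numerical stability.
    z_max = max(z)
    exps = [math.exp(s - z_max) for s in z]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs
```

For example, with two channels whose means are 1 and 3, an identity weight matrix, and zero bias, the scores are `[1, 3]` and the second class is selected.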
5. A coal and rock image classification device based on a CNN-Transformer combination, characterized in that it implements the coal and rock image classification method based on a CNN-Transformer combination as described in any one of claims 1-4 and includes the following units: 1) Pre-training learning unit: pre-trains the initial CNN-Transformer model on existing data whose image types have been labeled; 2) Coal and rock image acquisition unit: acquires images for constructing a coal and rock image dataset and preprocesses the dataset; 3) Optimization unit: replaces some structures in the initial CNN-Transformer model to obtain the optimized CNN-Transformer model; 4) Image data processing unit: processes the acquired images using the optimized CNN-Transformer model and the parameters output by the initial CNN-Transformer model to obtain the feature map of the input image; 5) Classification unit: classifies the input image based on the feature map obtained from the image data processing unit.
6. An electronic device, comprising a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device is running, the processor communicates with the memory via the bus; and when the machine-readable instructions are executed by the processor, the coal and rock image classification method based on a CNN-Transformer combination as described in any one of claims 1 to 4 is performed.
Citation Information
Patent Citations
Image classification method and device in combination with CNN and Transform, and computer storage medium
CN116912552A