A target device image segmentation method and system in a power operation scene

By combining an image semantic segmentation model and a global vision attention model, the problems of wasted computing resources and insufficient data in deep learning models in power operation scenarios are solved, thereby improving the accuracy and robustness of power equipment image segmentation.

CN119399459B · Active · Publication Date: 2026-06-26 · CHINA ELECTRIC POWER RESEARCH INSTITUTE CO LTD +4

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: CHINA ELECTRIC POWER RESEARCH INSTITUTE CO LTD
Filing Date: 2024-09-27
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

When deep learning models are used to segment images of key equipment in power operation scenarios, they suffer from wasted computing resources, insufficient data, and imbalanced category distribution. These lead to poor generalization on small samples and weak recognition and segmentation of rare classes.

Method used

An image semantic segmentation model is adopted, which combines a multi-scale fast fusion extraction model and a global vision attention model with loss function training to expand the dataset and extract multi-layer features, thereby generating global vision attention feature maps and predicting semantic segmentation.

Benefits of technology

It improves the model's accuracy and robustness, enhances the understanding of image details and segmentation accuracy, especially in the case of small samples.



Abstract

The application discloses a target device image segmentation method and system in a power operation scene, and belongs to the technical field of target device image segmentation. The method comprises the following steps: acquiring a device image of a target device in a power operation scene; training an image semantic segmentation model by using the device image and a loss function; performing semantic segmentation on the target device image of the target device in the power operation scene based on the image semantic segmentation model, and generating a semantic segmentation prediction image. The application improves the precision and robustness of the image semantic segmentation model.

Description

Technical Field

[0001] This invention relates to the field of target equipment image segmentation technology, and more specifically, to a target equipment image segmentation method and system in a power operation scenario.

Background Technology

[0002] In power operation scenarios, accurate image segmentation of critical equipment is crucial. This not only helps monitor and maintain the health status of equipment but also plays a key role in fault prediction and accident prevention. For example, in intelligent inspection systems, real-time segmentation of power equipment images is required to quickly identify different components of the equipment, thereby achieving efficient condition monitoring and fault diagnosis.

[0003] Deep learning plays a crucial role in the accurate image segmentation of critical equipment in power operation scenarios. Deep learning models can learn the appearance features and structural information of equipment by training on large amounts of power equipment image data. This enables the model to automatically and accurately segment key equipment parts in images, such as transformer cooling systems, insulation layers, and wiring. Compared to traditional manual feature and rule design, deep learning models are better able to adapt to the variations and complexities of different equipment, providing more accurate segmentation results.

[0004] However, deep learning also has some limitations in accurately segmenting critical equipment images in power operation scenarios. Deep learning models are typically complex neural network structures containing a large number of parameters, leading to a waste of computational resources. In some cases, image data for power equipment may be limited, especially for specific types of equipment or rare fault conditions. This can result in insufficient generalization ability of the model in small sample situations. Furthermore, the class distribution of different components in power equipment images may be unbalanced, with some parts having fewer samples. This can affect the model's ability to identify and segment a few classes or rare cases.

[0005] Chinese patent CN116862847A discloses "An Interactive Segmentation Method and System for Power Equipment from Infrared Images." This method acquires images of power equipment and uses pixel-level labels as ground truth masks to construct a power equipment segmentation dataset. Using the known ground truth masks, the infrared image and simulated input information are jointly input into a trained image segmentation model to obtain predicted masks for the power equipment, thus yielding the segmentation results. While this method has made some progress in power equipment segmentation, it cannot capture and integrate feature information of different sizes. This results in insufficient understanding and representation of image details, leading to low accuracy and robustness of the model.

Summary of the Invention

[0006] To address the above problems, this invention proposes a target equipment image segmentation method for power operation scenarios, comprising:

[0007] Acquire equipment images of target devices in power operation scenarios;

[0008] An image semantic segmentation model is trained using the device image and the loss function;

[0009] Based on an image semantic segmentation model, semantic segmentation is performed on the target equipment image of the target equipment in the power operation scenario to generate a semantic segmentation prediction image.

[0010] Optionally, an image semantic segmentation model is trained using the device image and the loss function, including:

[0011] The device image is transformed to expand its data size, generating a dataset, which is then divided into a training set and a test set according to a preset ratio.

[0012] The training set is fed into a multi-scale fast fusion extraction model, and multi-layer features are extracted from the training set based on a pre-established loss function.

[0013] The global vision attention model connects each location on the feature map of multi-layer features to other locations to output a global vision attention feature map.

[0014] The semantic segmentation prediction model is used to perform semantic segmentation on the global vision attention feature map to generate semantic segmentation prediction results for the device image.

[0015] The data processing model, multi-scale fast fusion extraction model, global vision attention model, and semantic segmentation prediction model are fused together. The device image is used as the input data of the fused model, and the semantic segmentation prediction result is used as the output data of the fused model. The fused model is trained to obtain an image semantic segmentation model, and the accuracy of the image semantic segmentation model is verified using the test set.

[0016] Optionally, the transformation operations include:

[0017] Flip, rotate, crop, scale, and translate operations.

[0018] Optionally, the preset ratio is 9:1.

[0019] Optionally, the multi-scale fast fusion extraction model includes: two convolutional layers and three residual structures, wherein the convolutional layers and residual structures are used to perform convolution operations on the training set, and features are extracted from the training set through the convolution operations;

[0020] The convolution operation includes: inputting the training set into two convolutional layers for convolution operation, and inputting the output of the two convolutional layers into three residual structures for convolution operation;

[0021] The kernel size of the convolutional layer is 3×3, and the stride of the convolution operation is 2.

[0022] The first residual structure includes two convolutional kernels, each 3×3 in size. The stride of the first convolutional operation is 2, and the stride of the second convolutional operation is 1.

[0023] The second residual structure includes three convolutional kernels, each 3×3 in size. The stride of the first convolutional operation is 1, and the strides of the second and third convolutional operations are 2.

[0024] The third residual structure includes two convolutional kernels, each 3×3 in size. The stride of the first convolutional operation is 1, and the stride of the second convolutional operation is 2.

[0025] Optionally, the global vision attention model is used to connect each location on the feature map of the multi-layer features to other locations to output a global vision attention feature map, including:

[0026] The feature map of the multi-layer features is divided into three branches;

[0027] For the first branch, feature extraction is performed through two consecutive 3×3 convolutional layers of the global vision attention model. The extracted features are then subjected to average pooling to reduce the feature dimensionality and obtain key information features.

[0028] For the second branch, a 3×3 convolutional layer of the global vision attention model is used for preliminary feature extraction. For the preliminary extracted features, a 1×1 convolutional layer is used for refined feature extraction. The extracted refined features are then subjected to max pooling to obtain the most significant response features.

[0029] For the third branch, feature compression is performed through a 1×1 convolutional layer of the global vision attention model. For the compressed features, feature extraction is performed through a 3×3 convolutional layer. The extracted features are then subjected to average pooling and max pooling operations to obtain two types of feature representations at different scales. The two types of feature representations at different scales are then concatenated to obtain concatenated features.

[0030] The key information features of the first branch are concatenated with the most significant response features of the second branch, and the result is normalized to obtain normalized features. The normalized features are then summed with the concatenated features of the third branch, so that each position on the feature map of the multi-layer features is connected to other positions, and a global vision attention feature map is output.

[0031] Optionally, the global vision attention feature map is semantically segmented using the semantic segmentation prediction model to generate semantic segmentation prediction results for the device image, including:

[0032] The global vision attention feature map is input into a 3×3 convolutional layer of the semantic segmentation prediction model, and the size of the global vision attention feature map is adjusted in the 3×3 convolutional layer.

[0033] The resized global vision attention feature map is segmented using a 1×1 convolutional layer of the semantic segmentation prediction model to output the semantic segmentation prediction result.

[0034] Furthermore, this invention also proposes a target equipment image segmentation system for power operation scenarios, comprising:

[0035] The data acquisition unit is used to acquire equipment images of target equipment in power operation scenarios;

[0036] The training unit is used to train an image semantic segmentation model using the device image and the loss function;

[0037] The segmentation prediction unit is used to perform semantic segmentation on the target equipment image of the target equipment in the power operation scenario based on the image semantic segmentation model, and generate a semantic segmentation prediction image.

[0038] Optionally, the training unit trains an image semantic segmentation model using the device image and the loss function, including:

[0039] The device image is transformed to expand its data size, generating a dataset, which is then divided into a training set and a test set according to a preset ratio.

[0040] The training set is fed into a multi-scale fast fusion extraction model, and multi-layer features are extracted from the training set based on a pre-established loss function.

[0041] The global vision attention model connects each location on the feature map of multi-layer features to other locations to output a global vision attention feature map.

[0042] The semantic segmentation prediction model is used to perform semantic segmentation on the global vision attention feature map to generate semantic segmentation prediction results for the device image.

[0043] The data processing model, multi-scale fast fusion extraction model, global vision attention model, and semantic segmentation prediction model are fused together. The device image is used as the input data of the fused model, and the semantic segmentation prediction result is used as the output data of the fused model. The fused model is trained to obtain an image semantic segmentation model, and the accuracy of the image semantic segmentation model is verified using the test set.

[0044] Optionally, the transformation operations include:

[0045] Flip, rotate, crop, scale, and translate operations.

[0046] Optionally, the preset ratio is 9:1.

[0047] Optionally, the multi-scale fast fusion extraction model includes: two convolutional layers and three residual structures, wherein the convolutional layers and residual structures are used to perform convolution operations on the training set, and features are extracted from the training set through the convolution operations;

[0048] The convolution operation includes: inputting the training set into two convolutional layers for convolution operation, and inputting the output of the two convolutional layers into three residual structures for convolution operation;

[0049] The kernel size of the convolutional layer is 3×3, and the stride of the convolution operation is 2.

[0050] The first residual structure includes two convolutional kernels, each 3×3 in size. The stride of the first convolutional operation is 2, and the stride of the second convolutional operation is 1.

[0051] The second residual structure includes three convolutional kernels, each 3×3 in size. The stride of the first convolutional operation is 1, and the strides of the second and third convolutional operations are 2.

[0052] The third residual structure includes two convolutional kernels, each 3×3 in size. The stride of the first convolutional operation is 1, and the stride of the second convolutional operation is 2.

[0053] Optionally, the global vision attention model is used to connect each location on the feature map of the multi-layer features to other locations to output a global vision attention feature map, including:

[0054] The feature map of the multi-layer features is divided into three branches;

[0055] For the first branch, feature extraction is performed through two consecutive 3×3 convolutional layers of the global vision attention model. The extracted features are then subjected to average pooling to reduce the feature dimensionality and obtain key information features.

[0056] For the second branch, a 3×3 convolutional layer of the global vision attention model is used for preliminary feature extraction. For the preliminary extracted features, a 1×1 convolutional layer is used for refined feature extraction. The extracted refined features are then subjected to max pooling to obtain the most significant response features.

[0057] For the third branch, feature compression is performed through a 1×1 convolutional layer of the global vision attention model. For the compressed features, feature extraction is performed through a 3×3 convolutional layer. The extracted features are then subjected to average pooling and max pooling operations to obtain two types of feature representations at different scales. The two types of feature representations at different scales are then concatenated to obtain concatenated features.

[0058] The key information features of the first branch are concatenated with the most significant response features of the second branch, and the result is normalized to obtain normalized features. The normalized features are then summed with the concatenated features of the third branch, so that each position on the feature map of the multi-layer features is connected to other positions, and a global vision attention feature map is output.

[0059] Optionally, the global vision attention feature map is semantically segmented using the semantic segmentation prediction model to generate semantic segmentation prediction results for the device image, including:

[0060] The global vision attention feature map is input into a 3×3 convolutional layer of the semantic segmentation prediction model, and the size of the global vision attention feature map is adjusted in the 3×3 convolutional layer.

[0061] The resized global vision attention feature map is segmented using a 1×1 convolutional layer of the semantic segmentation prediction model to output the semantic segmentation prediction result.

[0062] In another aspect, the present invention also provides a computing device, comprising: one or more processors;

[0063] A processor is used to execute one or more programs;

[0064] When the one or more programs are executed by the one or more processors, the method described above is implemented.

[0065] In another aspect, the present invention also provides a computer-readable storage medium having a computer program stored thereon, which, when executed, implements the method described above.

[0066] Compared with the prior art, the beneficial effects of the present invention are as follows:

[0067] This invention provides a method for segmenting target equipment images in a power operation scenario, comprising: acquiring equipment images of target equipment in a power operation scenario; training an image semantic segmentation model using the equipment images and a loss function; and performing semantic segmentation on the target equipment images of the target equipment in the power operation scenario based on the image semantic segmentation model to generate a semantic segmentation prediction image. The image semantic segmentation model obtained by training with the equipment images and the loss function together improves the accuracy and robustness of the model compared to existing models trained without using a loss function.

Description of the Drawings

[0068] Figure 1 is a flowchart of the method of the present invention;

[0069] Figure 2 is a structural diagram of the system of the present invention.

Detailed Implementation

[0070] Exemplary embodiments of the invention will now be described with reference to the accompanying drawings. However, the invention may be embodied in many different forms and is not limited to the embodiments described herein. These embodiments are provided to fully and completely disclose the invention and to fully convey its scope to those skilled in the art. The terminology used in the exemplary embodiments illustrated in the drawings is not intended to limit the invention. In the drawings, the same units / elements are referred to by the same reference numerals.

[0071] Unless otherwise stated, the terms used herein (including technical terms) have their common meaning as understood by one of ordinary skill in the art. Furthermore, it is understood that terms defined in commonly used dictionaries should be understood to have a meaning consistent with the context of their relevant field, and not to be interpreted as having an idealized or overly formal meaning.

[0072] Example 1:

[0073] This invention proposes a target equipment image segmentation method in power operation scenarios which, as shown in Figure 1, includes:

[0074] Step 1: Obtain equipment images of the target equipment in the power operation scenario;

[0075] Step 2: Train the image semantic segmentation model using the device image and the loss function;

[0076] Step 3: Based on the image semantic segmentation model, perform semantic segmentation on the target equipment image of the target equipment in the power operation scenario to generate a semantic segmentation prediction image.

[0077] The generated semantic segmentation prediction image is pixel-level.

[0078] Training the image semantic segmentation model using the device image and the loss function includes:

[0079] The device image is transformed to expand its data size, generating a dataset, which is then divided into a training set and a test set according to a preset ratio.

[0080] The training set is fed into a multi-scale fast fusion extraction model, and multi-layer features are extracted from the training set based on a pre-established loss function.

[0081] The global vision attention model connects each location on the feature map of multi-layer features to other locations to output a global vision attention feature map.

[0082] The semantic segmentation prediction model is used to perform semantic segmentation on the global vision attention feature map to generate semantic segmentation prediction results for the device image.

[0083] The data processing model, multi-scale fast fusion extraction model, global vision attention model, and semantic segmentation prediction model are fused together. The device image is used as the input data of the fused model, and the semantic segmentation prediction result is used as the output data of the fused model. The fused model is trained to obtain an image semantic segmentation model, and the accuracy of the image semantic segmentation model is verified using the test set.

[0084] The transformation operations include:

[0085] Flip, rotate, crop, scale, and translate operations.

[0086] The preset ratio is 9:1.

[0087] The multi-scale fast fusion extraction model includes two convolutional layers and three residual structures. The convolutional layers and residual structures are used to perform convolution operations on the training set, and features are extracted from the training set through the convolution operations.

[0088] The convolution operation includes: inputting the training set into two convolutional layers for convolution operation, and inputting the output of the two convolutional layers into three residual structures for convolution operation;

[0089] The kernel size of the convolutional layer is 3×3, and the stride of the convolution operation is 2.

[0090] The first residual structure includes two convolutional kernels, each 3×3 in size. The stride of the first convolutional operation is 2, and the stride of the second convolutional operation is 1.

[0091] The second residual structure includes three convolutional kernels, each 3×3 in size. The stride of the first convolutional operation is 1, and the strides of the second and third convolutional operations are 2.

[0092] The third residual structure includes two convolutional kernels, each 3×3 in size. The stride of the first convolutional operation is 1, and the stride of the second convolutional operation is 2.

[0093] Specifically, using the global vision attention model to connect each location on the feature map of the multi-layer features to other locations and output a global vision attention feature map includes:

[0094] The feature map of the multi-layer features is divided into three branches;

[0095] For the first branch, feature extraction is performed through two consecutive 3×3 convolutional layers of the global vision attention model. The extracted features are then subjected to average pooling to reduce the feature dimensionality and obtain key information features.

[0096] For the second branch, a 3×3 convolutional layer of the global vision attention model is used for preliminary feature extraction. For the preliminary extracted features, a 1×1 convolutional layer is used for refined feature extraction. The extracted refined features are then subjected to max pooling to obtain the most significant response features.

[0097] For the third branch, feature compression is performed through a 1×1 convolutional layer of the global vision attention model. For the compressed features, feature extraction is performed through a 3×3 convolutional layer. The extracted features are then subjected to average pooling and max pooling operations to obtain two types of feature representations at different scales. The two types of feature representations at different scales are then concatenated to obtain concatenated features.

[0098] The key information features of the first branch are concatenated with the most significant response features of the second branch, and the result is normalized to obtain normalized features. The normalized features are then summed with the concatenated features of the third branch, so that each position on the feature map of the multi-layer features is connected to other positions, and a global vision attention feature map is output.

[0099] Using the semantic segmentation prediction model to perform semantic segmentation on the global vision attention feature map and generate semantic segmentation prediction results for the device image includes:

[0100] The global vision attention feature map is input into a 3×3 convolutional layer of the semantic segmentation prediction model, and the size of the global vision attention feature map is adjusted in the 3×3 convolutional layer.

[0101] The resized global vision attention feature map is segmented using a 1×1 convolutional layer of the semantic segmentation prediction model to output the semantic segmentation prediction result.

[0102] The invention will be further illustrated below with specific examples:

[0103] The specific steps include:

[0104] S1. Design an image semantic segmentation model suitable for parsing images of target equipment in power operations;

[0105] S2. Design the loss function used in training the semantic segmentation model of the target equipment image in power operation, and train the image semantic segmentation model designed in step S1;

[0106] The loss function consists of the fast fusion extraction model loss function $L_{combined}$;

[0107] The fast fusion extraction model loss function $L_{combined}$ is specifically as follows:

[0108] $L_{combined} = \alpha L_{bce} + \beta L_{dice}$

[0109] where $L_{bce}$ measures the difference between the probability distribution predicted by the model and the true labels; $L_{dice}$ is used to optimize the model to improve segmentation accuracy; and $\alpha$ and $\beta$ are hyperparameters that control the weights of $L_{bce}$ and $L_{dice}$.

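[0110] $L_{bce}$ is specifically as follows:

[0111] $L_{bce} = -\frac{1}{N}\sum_{i=1}^{N}\left[g_i \log p_i + (1 - g_i)\log(1 - p_i)\right]$ (the equation is missing from this text; this is the standard binary cross-entropy form consistent with the definitions in [0112])

[0112] where $N$ is the number of pixels; $i$ denotes the i-th pixel; $p_i$ is the predicted probability of the i-th pixel; and $g_i$ is the true label of the i-th pixel.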

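[0113] $L_{dice}$ is specifically as follows:

[0114] $L_{dice} = 1 - \frac{2\sum_{i=1}^{N} p_i g_i + smooth}{\sum_{i=1}^{N} p_i + \sum_{i=1}^{N} g_i + smooth}$ (the equation is missing from this text; this is a commonly used Dice loss form consistent with the definitions in [0115])

[0115] where $N$ is the number of pixels; $i$ denotes the i-th pixel; $p_i$ is the predicted probability of the i-th pixel; $g_i$ is the true label of the i-th pixel; and $smooth$ is a small constant used to avoid a zero denominator.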
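The weighted combination of the two losses can be sketched in plain Python. This is a minimal illustration, not the patented implementation: the exact equations are not reproduced in this text, so the standard binary cross-entropy and Dice forms are assumed, and the function names, the default weights alpha = beta = 0.5, the clipping constant eps, and the default smooth value are all hypothetical choices.

```python
import math

def bce_loss(p, g, eps=1e-7):
    """Mean binary cross-entropy over pixels (assumed standard form).

    p: predicted probabilities, g: 0/1 true labels, both flat lists.
    Probabilities are clipped to [eps, 1 - eps] so log() stays finite.
    """
    total = 0.0
    for pi, gi in zip(p, g):
        pi = min(max(pi, eps), 1.0 - eps)
        total += gi * math.log(pi) + (1 - gi) * math.log(1.0 - pi)
    return -total / len(p)

def dice_loss(p, g, smooth=1e-6):
    """Dice loss (assumed common form); smooth keeps the denominator nonzero."""
    intersection = sum(pi * gi for pi, gi in zip(p, g))
    return 1.0 - (2.0 * intersection + smooth) / (sum(p) + sum(g) + smooth)

def combined_loss(p, g, alpha=0.5, beta=0.5, smooth=1e-6):
    """L_combined = alpha * L_bce + beta * L_dice."""
    return alpha * bce_loss(p, g) + beta * dice_loss(p, g, smooth)

# A 100-pixel image with only 2 foreground pixels, predicted as almost
# all background: the per-pixel BCE average stays small, while the Dice
# term stays large because the small object is missed entirely.
labels = [1, 1] + [0] * 98
preds = [0.01] * 100
print(bce_loss(preds, labels), dice_loss(preds, labels))
```

In this example the BCE term is roughly 0.10 while the Dice term is roughly 0.99, which illustrates why a Dice component is often paired with BCE when foreground classes occupy few pixels.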

[0116] S3. Use the target equipment image segmentation model trained in step S2 to parse the target equipment image in the power operation scenario and generate a pixel-level predicted image.

[0117] Furthermore, step S1, which designs an image semantic segmentation model suitable for parsing images of target equipment in power operations, specifically includes the following sub-steps:

[0118] S1-1. A dataset of target equipment images is collected from power operation scenarios using a data preprocessing model. Then, the data preprocessing model performs transformations on the image dataset, such as flipping, rotating, cropping, scaling, and translation, to expand the size of the original dataset. Finally, the dataset is divided in a 9:1 ratio to form the training set and the test set.
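The expansion and 9:1 division described in S1-1 can be sketched as follows. This is a minimal sketch under stated assumptions: images are represented as plain lists of rows, only three of the listed transformations are shown, and all function names, the one-pixel shift, and the fixed shuffle seed are hypothetical.

```python
import random

def hflip(img):
    """Mirror a 2-D image (list of rows) left to right."""
    return [row[::-1] for row in img]

def rot90(img):
    """Rotate a 2-D image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def translate(img, dx, fill=0):
    """Shift every row dx >= 0 pixels to the right, padding with `fill`."""
    width = len(img[0])
    return [([fill] * dx + list(row))[:width] for row in img]

def augment(images):
    """Return each original plus a flipped, a rotated and a shifted copy."""
    out = []
    for img in images:
        out.extend([img, hflip(img), rot90(img), translate(img, 1)])
    return out

def split_dataset(samples, train_parts=9, test_parts=1, seed=0):
    """Shuffle, then split train:test in the given ratio (9:1 by default)."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n_train = len(samples) * train_parts // (train_parts + test_parts)
    return samples[:n_train], samples[n_train:]

images = [[[i, i + 1], [i + 2, i + 3]] for i in range(5)]
train_set, test_set = split_dataset(augment(images))
print(len(train_set), len(test_set))
```

For segmentation, each geometric transformation would also have to be applied identically to the pixel-level label mask so that image and mask stay aligned.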

[0119] S1-2. Feed the training set and test set from step S1-1 into the multi-scale fast fusion extraction model for feature extraction, and extract the multi-layer features of the data in the training set and test set;

[0120] S1-3. The extracted multi-layer features are input into the global vision attention model. Through adaptive learning of the attention mask, each location on the feature map is connected to all other locations. In this way, information can flow globally, providing a more comprehensive contextual understanding and thus improving the accuracy of semantic segmentation.

[0121] S1-4. Input the features output by the global vision attention model into the semantic segmentation prediction model to obtain the final prediction result.

[0122] Furthermore, the fast fusion extraction model includes two convolutional layers and three residual structures, comprising the following sub-steps:

[0123] The image output from the data preprocessing model is input into two convolutional layers for convolution operation. The kernel size of the convolutional layers is 3×3, and the stride of the convolution operation is 2.

[0124] The output then enters the first residual structure, which includes two convolution kernels, each 3×3 in size. The stride of the first convolution operation is 2, and the stride of the second convolution operation is 1.

[0125] Next is the second residual structure, which includes three convolution kernels, each 3×3 in size. The stride of the first convolution operation is 1, and the strides of the second and third convolution operations are 2.

[0126] Last is the third residual structure, which includes two convolution kernels, each 3×3 in size. The stride of the first convolution operation is 1, and the stride of the second convolution operation is 2.
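The effect of the listed strides on feature map resolution can be traced with the usual convolution output-size formula, floor((n + 2p - k) / s) + 1. The sketch below assumes a padding of 1 for every 3×3 convolution, assumes the three residual structures are applied one after another, and ignores the skip connections; none of these details are stated in the text, and the names and the 512-pixel input are hypothetical.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Spatial output size of one convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# Strides of the 3x3 convolutions in each stage, as listed in the text.
STAGES = [
    ("conv layers", [2, 2]),
    ("residual 1", [2, 1]),
    ("residual 2", [1, 2, 2]),
    ("residual 3", [1, 2]),
]

def trace_sizes(size, stages=STAGES):
    """Return the feature map side length after each stage."""
    sizes = []
    for name, strides in stages:
        for stride in strides:
            size = conv_out(size, stride=stride)
        sizes.append((name, size))
    return sizes

for name, side in trace_sizes(512):
    print(name, side)
```

Under these assumptions a 512-pixel side shrinks to 128, 64, 16 and 8 after the four stages, so each stage yields features at a different scale. In a residual structure whose convolutions use stride 2, the skip path would also need to be downsampled to match, which the text does not describe.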

[0127] Furthermore, the global vision attention model includes the following sub-steps:

[0128] S1-3-1. The output features of the fast fusion extraction model are divided into three branches. The first branch extracts features through two consecutive 3×3 convolutional layers, followed by average pooling to reduce the feature dimension while retaining key information.

[0129] S1-3-2. The second branch performs initial processing through a 3×3 convolutional layer, then further refines it through a 1×1 convolutional layer, and finally highlights the most significant feature responses through max pooling.

[0130] S1-3-3. The third branch performs feature compression using a 1×1 convolutional layer, followed by feature extraction using a 3×3 convolutional layer. Then, average pooling and max pooling operations are performed to obtain feature representations at different scales. Finally, the results of these two pooling operations are concatenated to integrate different feature information.

[0131] S1-3-4. Finally, the output of step S1-3-1 is concatenated with the output of step S1-3-2 and input into the sigmoid function for normalization. Then, the normalized result is summed with the output of step S1-3-3 to obtain the output feature map.
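The data flow of steps S1-3-1 to S1-3-4 can be sketched in plain Python. This is only an illustration of how the three branches are pooled, concatenated, normalized and summed: the convolutions are left as pluggable placeholders that default to the identity, the pooling is assumed to use non-overlapping 2×2 windows, and a feature map is represented as a list of channels, each a list of rows. All names are hypothetical.

```python
import math

def pool2x2(channel, reduce):
    """Apply `reduce` over non-overlapping 2x2 windows of one channel."""
    h, w = len(channel), len(channel[0])
    return [
        [reduce([channel[r][c], channel[r][c + 1],
                 channel[r + 1][c], channel[r + 1][c + 1]])
         for c in range(0, w - 1, 2)]
        for r in range(0, h - 1, 2)
    ]

def mean(values):
    return sum(values) / len(values)

def avg_pool(fmap):
    return [pool2x2(ch, mean) for ch in fmap]

def max_pool(fmap):
    return [pool2x2(ch, max) for ch in fmap]

def elementwise(fn, *fmaps):
    """Apply fn position by position across feature maps of equal shape."""
    return [
        [[fn(*vals) for vals in zip(*rows)] for rows in zip(*chans)]
        for chans in zip(*fmaps)
    ]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def identity(fmap):
    """Placeholder standing in for the learned convolutions of a branch."""
    return fmap

def global_attention(fmap, conv1=identity, conv2=identity, conv3=identity):
    b1 = avg_pool(conv1(fmap))            # S1-3-1: convs, then average pooling
    b2 = max_pool(conv2(fmap))            # S1-3-2: convs, then max pooling
    f3 = conv3(fmap)
    b3 = avg_pool(f3) + max_pool(f3)      # S1-3-3: list "+" concatenates channels
    gate = elementwise(sigmoid, b1 + b2)  # S1-3-4: concatenate, then sigmoid
    return elementwise(lambda a, b: a + b, gate, b3)  # sum with branch 3

fmap = [[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]]
out = global_attention(fmap)
print(len(out), len(out[0]), len(out[0][0]))
```

Concatenating branches 1 and 2 doubles the channel count, which matches the doubled channel count of branch 3, so the final sum is well defined whenever the three branches keep the same number of channels and the same pooling window.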

[0132] Furthermore, the semantic segmentation prediction model includes the following sub-steps:

[0133] S1-4-1. Input the output of step S1-3-4 into a 3×3 convolutional layer, and change the output size through the 3×3 convolutional layer;

[0134] S1-4-2. Input the result of step S1-4-1 into a 1×1 convolution to directly output the final prediction result.
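The final 1×1 convolution of step S1-4-2 is a weighted sum over channels computed independently at every pixel. The sketch below shows that operation and a thresholding step that turns the result into a pixel-level mask. It assumes a single output channel with a sigmoid and a 0.5 threshold (a binary setting consistent with the binary cross-entropy loss); the function names and the example weights are hypothetical, and the preceding 3×3 convolution is omitted.

```python
import math

def conv1x1(fmap, weights, bias):
    """1x1 convolution: one weighted sum over channels at each pixel.

    fmap    : list of C channels, each an H x W list of rows
    weights : list of K rows, each holding C channel weights
    bias    : list of K offsets
    returns : list of K output channels, each H x W
    """
    h, w = len(fmap[0]), len(fmap[0][0])
    return [
        [[b + sum(wk[c] * fmap[c][r][col] for c in range(len(fmap)))
          for col in range(w)]
         for r in range(h)]
        for wk, b in zip(weights, bias)
    ]

def to_mask(logits, threshold=0.5):
    """Turn a single-channel logit map into a 0/1 pixel mask."""
    return [[1 if 1.0 / (1.0 + math.exp(-v)) >= threshold else 0 for v in row]
            for row in logits]

features = [[[1.0, -1.0]], [[2.0, 0.5]]]   # 2 channels, 1 x 2 pixels
logits = conv1x1(features, weights=[[1.0, 1.0]], bias=[0.0])
print(to_mask(logits[0]))
```

Because the kernel covers a single pixel, this layer changes only the channel count, mapping the attention features to one score per class at each location.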

[0135] Furthermore, step S3 uses the semantic segmentation model for the power operation target equipment image trained in step S2 to parse the power operation target equipment image and generate a pixel-level predicted image, as follows:

[0136] By loading a pre-trained semantic segmentation model for power operation target equipment images, the images of power operation target equipment to be parsed are preprocessed and inferred by the model to generate pixel-level predicted images.

[0137] Compared with the prior art, the beneficial effects of the present invention are:

[0138] (1) By designing a fast fusion extraction model, which employs convolution operations with multiple strides, the model can effectively capture and integrate feature information of different sizes, thereby improving the model's understanding and representation of image details.

[0139] (2) By designing a global vision attention model, features of different scales can be combined and the response to important features can be enhanced through the attention mechanism, thereby improving the accuracy and robustness of semantic segmentation while maintaining real-time performance.

[0140] (3) A loss function is designed for the fast fusion extraction model, comprising the L_bce loss function and the L_dice loss function. The L_bce loss function helps the model better learn and locate the boundaries of objects: for pixels near a boundary, it guides the model to predict their categories accurately, thereby improving the boundary accuracy of the segmentation results and enhancing the model's accuracy and robustness. The L_dice loss function facilitates model optimization and is robust to small targets and class imbalance.
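The two loss terms named above can be sketched in pure Python for the binary case. This is a minimal sketch under stated assumptions: the unweighted sum in `combined_loss`, the clamping constant `eps`, and the smoothing constant `smooth` are illustrative choices, since this section does not say how the two terms are weighted or stabilized.

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened pixels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """One minus the (smoothed) Dice overlap between prediction and mask."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def combined_loss(probs, targets):
    """Assumed combination: plain sum of the two terms."""
    return bce_loss(probs, targets) + dice_loss(probs, targets)


# A confident, correct prediction scores lower than an uncertain one.
good = combined_loss([0.9, 0.1], [1, 0])
poor = combined_loss([0.5, 0.5], [1, 0])
```

The Dice term depends on the overlap ratio rather than on the pixel count, which is why it is commonly used when the foreground class occupies only a small part of the image.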

[0141] Example 2:

[0142] This invention also proposes a target equipment image segmentation system 200 for power operation scenarios which, as shown in Figure 2, includes:

[0143] Data acquisition unit 201 is used to acquire equipment images of target equipment in power operation scenarios;

[0144] Training unit 202 is used to train an image semantic segmentation model using the device image and the loss function;

[0145] The segmentation prediction unit 203 is used to perform semantic segmentation on the target equipment image of the target equipment in the power operation scenario based on the image semantic segmentation model, and generate a semantic segmentation prediction image.

[0146] The training unit 202 trains an image semantic segmentation model using the device image and the loss function, including:

[0147] The device image is transformed to expand its data size, generating a dataset, which is then divided into a training set and a test set according to a preset ratio.

[0148] The training set is fed into a multi-scale fast fusion extraction model, which extracts multi-layer features from the training set based on a pre-established loss function.

[0149] The global vision attention model connects each location on the feature map of multi-layer features to other locations to output a global vision attention feature map.

[0150] The semantic segmentation prediction model is used to perform semantic segmentation on the global vision attention feature map to generate semantic segmentation prediction results for the device image.

[0151] The data processing model, multi-scale fast fusion extraction model, global vision attention model, and semantic segmentation prediction model are fused together. The device image is used as the input data of the fused model, and the semantic segmentation prediction result is used as the output data of the fused model. The fused model is trained to obtain an image semantic segmentation model, and the accuracy of the image semantic segmentation model is verified using the test set.

[0152] The conversion operations include:

[0153] Flip, rotate, crop, scale, and translate operations.

[0154] The preset ratio is 9:1.
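The expansion and 9:1 split described above can be sketched as follows. Only two of the listed operations (flip and a 90-degree rotation) are shown, images are represented as nested lists, and the function names, the shuffle, and the fixed seed are hypothetical choices for illustration. In a segmentation setting, each label mask would need the same geometric transformation as its image.

```python
import random


def horizontal_flip(image):
    """Mirror each row of a nested-list image."""
    return [row[::-1] for row in image]


def rotate_90(image):
    """Rotate a nested-list image clockwise by 90 degrees."""
    return [list(row) for row in zip(*image[::-1])]


def expand_and_split(images, train_parts=9, test_parts=1, seed=0):
    """Expand the data with transformed copies, then split at the preset ratio."""
    dataset = []
    for image in images:
        dataset.extend([image, horizontal_flip(image), rotate_90(image)])
    random.Random(seed).shuffle(dataset)  # assumed: shuffle before splitting
    cut = len(dataset) * train_parts // (train_parts + test_parts)
    return dataset[:cut], dataset[cut:]


# Ten toy 2 x 2 images become thirty samples, split 27 / 3.
toy_images = [[[i, i + 1], [i + 2, i + 3]] for i in range(10)]
train_set, test_set = expand_and_split(toy_images)
```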

[0155] The multi-scale fast fusion extraction model includes two convolutional layers and three residual structures. The convolutional layers and residual structures are used to perform convolution operations on the training set, and features are extracted from the training set through the convolution operations.

[0156] The convolution operation includes: inputting the training set into two convolutional layers for convolution operation, and inputting the output of the two convolutional layers into three residual structures for convolution operation;

[0157] The kernel size of the convolutional layer is 3×3, and the stride of the convolution operation is 2.

[0158] The first residual structure includes two convolutional kernels, each 3×3 in size. The stride of the first convolutional operation is 2, and the stride of the second convolutional operation is 1.

[0159] The second residual structure includes three convolutional kernels, each 3×3 in size. The stride of the first convolutional operation is 1, and the strides of the second and third convolutional operations are 2.

[0160] The third residual structure includes two convolutional kernels, each 3×3 in size. The stride of the first convolutional operation is 1, and the stride of the second convolutional operation is 2.
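To see what the listed strides imply for spatial resolution, the sketch below traces the side length of a feature map through the nine 3×3 convolutions in order. It assumes a padding of 1, strictly sequential application, and a hypothetical 512-pixel input side; none of these are stated in the text, and the skip connections of the residual structures are ignored because they do not change the output size.

```python
def conv_output_size(size, kernel=3, stride=1, padding=1):
    """Standard convolution output-size formula along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# (stage name, stride) pairs in the order given in the description.
STAGES = [
    ("conv layer 1", 2),
    ("conv layer 2", 2),
    ("residual 1, conv 1", 2),
    ("residual 1, conv 2", 1),
    ("residual 2, conv 1", 1),
    ("residual 2, conv 2", 2),
    ("residual 2, conv 3", 2),
    ("residual 3, conv 1", 1),
    ("residual 3, conv 2", 2),
]


def trace_sizes(size):
    """Return the side length after the input and after each stage."""
    sizes = [size]
    for _name, stride in STAGES:
        size = conv_output_size(size, stride=stride)
        sizes.append(size)
    return sizes


sizes = trace_sizes(512)  # hypothetical input side length
```

Under these assumptions the six stride-2 convolutions reduce a 512-pixel side to 8 pixels, an overall downsampling factor of 64.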

[0161] Specifically, the global vision attention model is used to connect each location on the feature map of the multi-layer features to other locations, resulting in a global vision attention feature map, including:

[0162] The feature map of the multi-layer features is divided into three branches;

[0163] For the first branch, feature extraction is performed through two consecutive 3×3 convolutional layers of the global vision attention model. The extracted features are then subjected to average pooling to reduce the feature dimensionality and obtain key information features.

[0164] For the second branch, a 3×3 convolutional layer of the global vision attention model is used for preliminary feature extraction. For the preliminary extracted features, a 1×1 convolutional layer is used for refined feature extraction. The extracted refined features are then subjected to max pooling to obtain the most significant response features.

[0165] For the third branch, feature compression is performed through a 1×1 convolutional layer of the global vision attention model. For the compressed features, feature extraction is performed through a 3×3 convolutional layer. The extracted features are then subjected to average pooling and max pooling operations to obtain two types of feature representations at different scales. The two types of feature representations at different scales are then concatenated to obtain concatenated features.

[0166] The key information features of the first branch are concatenated with the most significant response features of the second branch, and the result is normalized with the sigmoid function to obtain normalized features. The normalized features are then summed with the concatenated features of the third branch, so that each position on the feature map of the multi-layer features is connected to other positions, and a global vision attention feature map is output.
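The pooling stage of the third branch, in which average-pooled and max-pooled versions of the same features are concatenated, can be sketched in pure Python. The 2×2 window with stride 2 and the helper names are assumptions for illustration; the text does not give the pooling window or stride.

```python
def pool2x2(channel, reduce):
    """Apply a 2 x 2, stride-2 pooling with the given reduction function."""
    height, width = len(channel), len(channel[0])
    return [
        [reduce([channel[y][x], channel[y][x + 1],
                 channel[y + 1][x], channel[y + 1][x + 1]])
         for x in range(0, width - 1, 2)]
        for y in range(0, height - 1, 2)
    ]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def third_branch_pooling(channels):
    """Concatenate average-pooled and max-pooled channels along the channel axis."""
    averaged = [pool2x2(ch, mean) for ch in channels]
    maximal = [pool2x2(ch, max) for ch in channels]
    return averaged + maximal


pooled = third_branch_pooling([[[1.0, 2.0], [3.0, 4.0]]])
```

With C input channels this produces 2C output channels: the first C carry the window averages and the last C carry the window maxima.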

[0167] The semantic segmentation prediction model is used to perform semantic segmentation on the global vision attention feature map to generate semantic segmentation prediction results for the device image, including:

[0168] The global vision attention feature map is input into a 3×3 convolutional layer of the semantic segmentation prediction model, and the size of the global vision attention feature map is adjusted in the 3×3 convolutional layer.

[0169] The resized global vision attention feature map is segmented using a 1×1 convolutional layer of the semantic segmentation prediction model to output the semantic segmentation prediction result.

[0170] This invention improves the accuracy and robustness of image semantic segmentation models.

[0171] Example 3:

[0172] Based on the same inventive concept, this invention also provides a computer device, which includes a processor and a memory. The memory stores a computer program, which includes program instructions. The processor executes the program instructions stored in the computer storage medium. The processor may be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. It is the computing and control core of the terminal, suitable for implementing one or more instructions, specifically suitable for loading and executing one or more instructions in the computer storage medium to implement corresponding method flows or corresponding functions, thereby implementing the steps of the methods in the above embodiments.

[0173] Example 4:

[0174] Based on the same inventive concept, this invention also provides a storage medium, specifically a computer-readable storage medium (Memory), which is a memory device in a computer device used to store programs and data. It is understood that the computer-readable storage medium here can include both the built-in storage medium in the computer device and extended storage media supported by the computer device. The computer-readable storage medium provides storage space that stores the terminal's operating system. Furthermore, this storage space also stores one or more instructions suitable for loading and execution by a processor. These instructions can be one or more computer programs (including program code). It should be noted that the computer-readable storage medium here can be high-speed RAM or non-volatile memory, such as at least one disk storage device. The processor can load and execute one or more instructions stored in the computer-readable storage medium to implement the steps of the method in the above embodiments.

[0175] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code. The solutions in the embodiments of the present invention can be implemented using various computer languages, such as the object-oriented programming language Java and the interpreted scripting language JavaScript.

[0176] This invention is described with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0177] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0178] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0179] Although preferred embodiments of the invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including both the preferred embodiments and all changes and modifications falling within the scope of the invention.

[0180] Obviously, those skilled in the art can make various modifications and variations to this invention without departing from its spirit and scope. Therefore, if these modifications and variations fall within the scope of the claims of this invention and their equivalents, this invention also intends to include these modifications and variations.

Claims

1. A method for segmenting target equipment images in a power operation scenario, characterized in that it includes:
Acquiring equipment images of target devices in power operation scenarios;
Training an image semantic segmentation model using the device image and the loss function, including:
The data processing model is used to transform the device image to expand the data size of the device image, generate a dataset, and divide the dataset into a training set and a test set according to a preset ratio.
The training set is fed into a multi-scale fast fusion extraction model, and features are extracted from the training set based on a pre-established loss function to extract multi-layer features from the training set.
The global vision attention model connects each location on the feature map of multi-layer features to other locations to output a global vision attention feature map.
The step of using a global vision attention model to connect each location on the feature map of multi-layer features with other locations to output a global vision attention feature map includes:
The feature map of the multi-layer feature is divided into three branches;
For the first branch, feature extraction is performed through two consecutive 3×3 convolutional layers of the global vision attention model. The extracted features are then subjected to average pooling to reduce the feature dimensionality and obtain key information features.
For the second branch, a 3×3 convolutional layer of the global vision attention model is used for preliminary feature extraction. For the preliminary extracted features, a 1×1 convolutional layer is used for refined feature extraction. The extracted refined features are then subjected to max pooling to obtain the most significant response features.
For the third branch, feature compression is performed through a 1×1 convolutional layer of the global vision attention model. For the compressed features, feature extraction is performed through a 3×3 convolutional layer. The extracted features are then subjected to average pooling and max pooling operations to obtain two types of feature representations at different scales. The two types of feature representations at different scales are then concatenated to obtain concatenated features.
The key information features are concatenated with the most significant response features, and the concatenated features are normalized to obtain normalized concatenated features. The normalized concatenated features are then summed with the concatenated features to ensure that each position on the feature map of the multi-layer features is connected to other positions, so as to output a global vision attention feature map.
Based on an image semantic segmentation model, semantic segmentation is performed on the target equipment image of the target equipment in the power operation scenario to generate a semantic segmentation prediction image.

2. The target device image segmentation method according to claim 1, characterized in that, The image semantic segmentation model trained using the device image and loss function further includes: The semantic segmentation prediction model is used to perform semantic segmentation on the global field of view attention feature map to generate semantic segmentation prediction results for the device image. The data processing model, multi-scale fast fusion extraction model, global vision attention model, and semantic segmentation prediction model are fused together. The device image is used as the input data of the fused model, and the semantic segmentation prediction result is used as the output data of the fused model. The fused model is trained to obtain an image semantic segmentation model, and the accuracy of the image semantic segmentation model is verified using the test set.

3. The target device image segmentation method according to claim 1, characterized in that, The conversion operation includes: Flip, rotate, crop, scale, and translate operations.

4. The target device image segmentation method according to claim 1, characterized in that, The preset ratio is 9:1.

5. The target device image segmentation method according to claim 1, characterized in that, The multi-scale fast fusion extraction model includes: two convolutional layers and three residual structures. The convolutional layers and residual structures are used to perform convolution operations on the training set, and feature extraction is performed on the training set through the convolution operations. The convolution operation includes: inputting the training set into two convolutional layers for convolution operation, and inputting the output of the two convolutional layers into three residual structures for convolution operation; The kernel size of the convolutional layer is 3×3, and the stride of the convolution operation is 2. The first residual structure includes: two convolutional kernels, each 3×3 in size, with a stride of 2 for the first convolutional operation and a stride of 1 for the second convolutional operation; The second residual structure includes three convolutional kernels, each 3×3 in size. The stride of the first convolutional operation is 1, and the strides of the second and third convolutional operations are 2. The third residual structure includes two convolutional kernels, each 3×3 in size. The stride of the first convolutional operation is 1, and the stride of the second convolutional operation is 2.

6. The target device image segmentation method according to claim 2, characterized in that, The step of performing semantic segmentation on the global field-of-view attention feature map using a semantic segmentation prediction model to generate semantic segmentation prediction results for the device image includes: The global vision attention feature map is input into a 3×3 convolutional layer of the semantic segmentation prediction model, and the size of the global vision attention feature map is adjusted in the 3×3 convolutional layer. The resized global vision attention feature map is segmented using a 1×1 convolutional layer of the semantic segmentation prediction model to output the semantic segmentation prediction result.

7. A target equipment image segmentation system for power operation scenarios, characterized in that it includes:
The data acquisition unit is used to acquire equipment images of target equipment in power operation scenarios;
A training unit, used to train an image semantic segmentation model using the device image and a loss function, includes:
The data processing model is used to transform the device image to expand the data size of the device image, generate a dataset, and divide the dataset into a training set and a test set according to a preset ratio.
The training set is fed into a multi-scale fast fusion extraction model, and features are extracted from the training set based on a pre-established loss function to extract multi-layer features from the training set.
The global vision attention model connects each location on the feature map of multi-layer features to other locations to output a global vision attention feature map.
The step of using a global vision attention model to connect each location on the feature map of multi-layer features with other locations to output a global vision attention feature map includes:
The feature map of the multi-layer feature is divided into three branches;
For the first branch, feature extraction is performed through two consecutive 3×3 convolutional layers of the global vision attention model. The extracted features are then subjected to average pooling to reduce the feature dimensionality and obtain key information features.
For the second branch, a 3×3 convolutional layer of the global vision attention model is used for preliminary feature extraction. For the preliminary extracted features, a 1×1 convolutional layer is used for refined feature extraction. The extracted refined features are then subjected to max pooling to obtain the most significant response features.
For the third branch, feature compression is performed through a 1×1 convolutional layer of the global vision attention model. For the compressed features, feature extraction is performed through a 3×3 convolutional layer. The extracted features are then subjected to average pooling and max pooling operations to obtain two types of feature representations at different scales. The two types of feature representations at different scales are then concatenated to obtain concatenated features.
The key information features are concatenated with the most significant response features, and the concatenated features are normalized to obtain normalized concatenated features. The normalized concatenated features are then summed with the concatenated features to ensure that each position on the feature map of the multi-layer features is connected to other positions, so as to output a global vision attention feature map.
The segmentation prediction unit is used to perform semantic segmentation on the target equipment image of the target equipment in the power operation scenario based on the image semantic segmentation model, and generate a semantic segmentation prediction image.

8. The target device image segmentation system according to claim 7, characterized in that, The training unit trains an image semantic segmentation model using the device image and the loss function, and also includes: The semantic segmentation prediction model is used to perform semantic segmentation on the global field of view attention feature map to generate semantic segmentation prediction results for the device image. The data processing model, multi-scale fast fusion extraction model, global vision attention model, and semantic segmentation prediction model are fused together. The device image is used as the input data of the fused model, and the semantic segmentation prediction result is used as the output data of the fused model. The fused model is trained to obtain an image semantic segmentation model, and the accuracy of the image semantic segmentation model is verified using the test set.

9. The target device image segmentation system according to claim 7, characterized in that, The conversion operation includes: Flip, rotate, crop, scale, and translate operations.

10. The target device image segmentation system according to claim 7, characterized in that, The preset ratio is 9:1.

11. The target device image segmentation system according to claim 7, characterized in that, The multi-scale fast fusion extraction model includes: two convolutional layers and three residual structures. The convolutional layers and residual structures are used to perform convolution operations on the training set, and feature extraction is performed on the training set through the convolution operations. The convolution operation includes: inputting the training set into two convolutional layers for convolution operation, and inputting the output of the two convolutional layers into three residual structures for convolution operation; The kernel size of the convolutional layer is 3×3, and the stride of the convolution operation is 2. The first residual structure includes: two convolutional kernels, each 3×3 in size, with a stride of 2 for the first convolutional operation and a stride of 1 for the second convolutional operation; The second residual structure includes three convolutional kernels, each 3×3 in size. The stride of the first convolutional operation is 1, and the strides of the second and third convolutional operations are 2. The third residual structure includes two convolutional kernels, each 3×3 in size. The stride of the first convolutional operation is 1, and the stride of the second convolutional operation is 2.

12. The target device image segmentation system according to claim 8, characterized in that, The step of performing semantic segmentation on the global field-of-view attention feature map using a semantic segmentation prediction model to generate semantic segmentation prediction results for the device image includes: The global vision attention feature map is input into a 3×3 convolutional layer of the semantic segmentation prediction model, and the size of the global vision attention feature map is adjusted in the 3×3 convolutional layer. The resized global vision attention feature map is segmented using a 1×1 convolutional layer of the semantic segmentation prediction model to output the semantic segmentation prediction result.

13. A computer device, characterized in that it includes: one or more processors, the processors being used to execute one or more programs; when the one or more programs are executed by the one or more processors, the method described in any one of claims 1-6 is implemented.

14. A computer-readable storage medium, characterized in that, It contains a computer program, which, when executed, implements the method as described in any one of claims 1-6.