U-net-based image semantic segmentation device and method having self-attention and separable convolutional neural network
The U-Net-based image semantic segmentation device with self-attention and separable convolution addresses the challenge of accurately classifying objects in complex aerial images: self-attention strengthens feature extraction, and separable convolution reduces computational cost.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- DONGGUK UNIVERSITY INDUSTRY ACADEMIC COOPERATION FOUNDATION
- Filing Date
- 2025-12-22
- Publication Date
- 2026-07-02
Abstract
Description
U-NET-based image semantic segmentation device and method with self-attention and separable convolutional neural network
[0001] The present invention relates to image semantic segmentation technology, and more specifically, to a U-Net-based image semantic segmentation device and method having self-attention and a separable convolutional neural network.
[0002] Aerial imagery is utilized in various fields such as urban planning and environmental monitoring; as cities expand and landscapes evolve, accurate land cover classification is crucial for effective urban development and resource management.
[0003] Semantic segmentation, a technology that analyzes and classifies images at the pixel level, provides detailed data for land cover classification; consequently, there is a growing need for sophisticated semantic segmentation devices that accurately classify objects for urban infrastructure, changes in land use patterns, and environmental monitoring.
[0004] The background technology of the present invention is disclosed in Korean Registered Patent No. 10-1624503.
[0005] The present invention provides a U-Net-based image semantic segmentation device and method having self-attention and a separable convolutional neural network.
[0006] The technical problems that the present invention aims to solve are not limited to those mentioned above, and other unmentioned technical problems will be clearly understood by those skilled in the art to which the present invention belongs from the description below.
[0007] According to one aspect of the present invention, a U-Net-based image semantic segmentation device having self-attention and a separable convolutional neural network is provided.
[0008] A U-Net-based image semantic segmentation device having self-attention and a separable convolutional neural network according to one embodiment of the present invention may include an input unit for inputting an image, a semantic segmentation unit for training a model with the image and performing semantic segmentation, and an output unit for outputting the semantic segmentation result.
[0009] According to another aspect of the present invention, a U-Net-based image semantic segmentation method having self-attention and a separable convolutional neural network and a computer program for executing the same are provided.
[0010] A U-Net-based image semantic segmentation method having self-attention and a separable convolutional neural network according to one embodiment of the present invention may include the steps of inputting an image, training a model with the image and performing semantic segmentation, and outputting the semantic segmentation result.
[0011] According to one embodiment of the present invention, a U-Net-based image semantic segmentation device having self-attention and a separable convolutional neural network can perform sophisticated semantic segmentation of complex aerial images: self-attention improves accuracy by focusing on important features within the image, and separable convolution improves computational efficiency.
[0012] The effects of the present invention are not limited to the effects described above, and should be understood to include all effects that can be inferred from the composition of the invention described in the description or claims of the present invention.
[0013] FIG. 1 is a block diagram of a U-Net-based image semantic segmentation device having self-attention and a separable convolutional neural network according to an embodiment of the present invention.
[0014] FIG. 2 is a drawing describing a U-Net-based image semantic segmentation device having self-attention and a separable convolutional neural network according to an embodiment of the present invention.
[0015] FIG. 3 is a block diagram of a semantic segmentation unit according to an embodiment of the present invention.
[0016] FIGS. 4 and 5 are drawings relating to a patch image unit according to an embodiment of the present invention.
[0017] FIG. 6 is a block diagram of an encoder unit according to an embodiment of the present invention.
[0018] FIG. 7 is a drawing illustrating self-attention according to an embodiment of the present invention.
[0019] FIG. 8 is a structural diagram relating to a separable convolution according to an embodiment of the present invention.
[0020] FIG. 9 is a drawing describing the performance of a U-Net-based image semantic segmentation device having self-attention and a separable convolutional neural network according to an embodiment of the present invention.
[0021] FIG. 10 is a predicted image drawing of a U-Net-based image semantic segmentation device having self-attention and a separable convolutional neural network according to an embodiment of the present invention.
[0022] FIG. 11 is a plot of the training IoU and test IoU of a U-Net-based image semantic segmentation device having self-attention and a separable convolutional neural network according to an embodiment of the present invention.
[0023] FIG. 12 is a plot of the training loss and test loss of a U-Net-based image semantic segmentation device having self-attention and a separable convolutional neural network according to an embodiment of the present invention.
[0024] FIG. 13 is a diagram of a U-Net-based image semantic segmentation method having self-attention and a separable convolutional neural network according to an embodiment of the present invention.
[0025] The present invention is susceptible to various modifications and may have various embodiments. Specific embodiments are illustrated in the drawings and described in detail in the detailed description. However, this is not intended to limit the present invention to specific embodiments, and it should be understood that it includes all modifications, equivalents, and substitutions that fall within the spirit and scope of the invention. In describing the present invention, detailed descriptions of related prior art are omitted if it is determined that such detailed descriptions would unnecessarily obscure the essence of the invention. Furthermore, singular expressions used in this specification and claims should generally be interpreted as meaning "one or more" unless otherwise stated.
[0026] Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. In describing with reference to the accompanying drawings, identical or corresponding components are given the same reference numerals, and redundant descriptions thereof will be omitted.
[0027] FIG. 1 is a block diagram of a U-Net-based image semantic segmentation device having self-attention and a separable convolutional neural network according to one embodiment of the present invention, and FIG. 2 is a diagram describing a U-Net-based image semantic segmentation device having self-attention and a separable convolutional neural network according to one embodiment of the present invention.
[0028] Referring to FIGS. 1 and 2, a U-Net-based image semantic segmentation device (10) having self-attention and a separable convolutional neural network includes an input unit (110), a semantic segmentation unit (120), and an output unit (130).
[0029] The input unit (110) inputs an image.
[0030] An image according to one embodiment of the present invention uses an aerial image of Dubai obtained from the MBRSC (Mohammed Bin Rashid Space Centre) satellite, but the input image is not limited thereto. The aerial image of Dubai is grouped into 8 large tiles containing a total of 72 images, and each image is divided into 6 classes of buildings, land, roads, vegetation, water, and unlabeled, with a specific label assigned to each pixel.
[0031] The semantic segmentation unit (120) performs semantic segmentation by training a model with the input image. The semantic segmentation unit (120) is described in detail with reference to FIGS. 3 to 8.
[0032] The output unit (130) outputs the semantic segmentation result. Based on the output result, the output unit (130) outputs a semantic segmentation map in which a class is assigned to each pixel.
[0033]
[0034] FIG. 3 is a block diagram of a semantic segmentation unit according to one embodiment of the present invention.
[0035] Referring to FIG. 3, the semantic segmentation unit (120) includes a patch image unit (121), an encoder unit (122), and a decoder unit (123).
[0036] The semantic segmentation unit (120) performs semantic segmentation by training a model with the input image.
[0037] In this case, Adam is used as the optimizer, and the composite loss function combines Dice Loss and Focal Loss.
[0038] The composite loss function according to one embodiment of the present invention is as shown in Equation 1.
[0039] Total Loss = 0.5 × Dice Loss + 0.5 × Focal Loss (Equation 1)
[0040] According to one embodiment of the present invention, Focal Loss solves the class imbalance problem, and Dice Loss learns accurate object boundaries. Focal Loss is a loss function designed to enhance learning for classes with few pixels (rare classes) by assigning higher weights to incorrectly predicted pixels, thereby preventing the model from focusing only on easily predictable classes. Dice Loss is a loss function that helps learn accurate boundaries by emphasizing the overlapping area between the pixel distribution predicted by the model and the actual correct labels, thereby improving pixel-level accuracy by learning the accurate boundaries and locations of objects. The loss function according to one embodiment of the present invention equally reflects the advantages of both loss functions.
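Equation 1 can be illustrated with a minimal NumPy sketch. The specification gives only the 0.5/0.5 weighting; the soft Dice form, the multi-class focal form, the focusing parameter `gamma=2.0`, and the smoothing constants used here are common textbook choices adopted as assumptions, and the function names are hypothetical.

```python
import numpy as np

def dice_loss(probs, onehot, eps=1e-6):
    """Soft Dice loss averaged over classes (assumed form).
    probs, onehot: arrays of shape (H, W, n_classes)."""
    intersection = np.sum(probs * onehot, axis=(0, 1))
    denom = np.sum(probs, axis=(0, 1)) + np.sum(onehot, axis=(0, 1))
    dice = (2.0 * intersection + eps) / (denom + eps)
    return float(1.0 - dice.mean())

def focal_loss(probs, onehot, gamma=2.0, eps=1e-7):
    """Multi-class focal loss; gamma=2.0 is an assumed value."""
    # Probability the model assigns to the true class of each pixel.
    p_t = np.clip(np.sum(probs * onehot, axis=-1), eps, 1.0)
    # Well-classified pixels (p_t near 1) are down-weighted by (1 - p_t)^gamma.
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))

def total_loss(probs, onehot):
    """Equation 1: equal weighting of Dice Loss and Focal Loss."""
    return 0.5 * dice_loss(probs, onehot) + 0.5 * focal_loss(probs, onehot)

# Toy 4x4 label map with 6 classes, compared against two predictions.
rng = np.random.default_rng(0)
labels = rng.integers(0, 6, size=(4, 4))
onehot = np.eye(6)[labels]
loss_perfect = total_loss(onehot.copy(), onehot)          # exact prediction
loss_uniform = total_loss(np.full((4, 4, 6), 1 / 6), onehot)  # uninformative prediction
```

An exact prediction drives both terms to zero, while the uninformative prediction is penalized by both the overlap term and the per-pixel term.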
[0041] The patch image unit (121) divides the input image to generate and augment patch images.
[0042] The encoder unit (122) inputs a patch image to train the model.
[0043] A model according to one embodiment of the present invention is a U-Net-based semantic segmentation model having self-attention and a separable convolutional neural network. The U-Net-based image semantic segmentation model having self-attention and a separable convolutional neural network improves accuracy by focusing on important features within the image using self-attention and improves computational efficiency using separable convolution to perform sophisticated semantic segmentation of complex aerial images.
[0044] The decoder unit (123) restores the spatial details of the encoder's output through upsampling using transposed convolution and skip connections. In the last layer, the decoder unit (123) performs a 1x1 convolution with a softmax activation function to predict, for each pixel in the image, the class to which that pixel belongs.
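The final layer described above can be sketched as follows. A 1x1 convolution is a per-pixel linear map over channels, so it is written here as a matrix product; the feature width of 16 channels, the random weights, and the function name are illustrative assumptions, not values from the specification.

```python
import numpy as np

def pointwise_softmax_head(features, weights, bias):
    """1x1 convolution followed by softmax over the class axis.
    features: (H, W, C_in); weights: (C_in, n_classes); bias: (n_classes,)."""
    logits = features @ weights + bias                      # per-pixel linear map
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 8, 16))   # assumed decoder output
weights = rng.normal(size=(16, 6))       # 6 classes, as in the Dubai dataset
bias = np.zeros(6)

probs = pointwise_softmax_head(features, weights, bias)   # (8, 8, 6)
class_map = probs.argmax(axis=-1)                         # one class index per pixel
```

Taking the argmax over the class axis yields the segmentation map in which a class is assigned to each pixel.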
[0045]
[0046] FIGS. 4 and 5 are drawings relating to a patch image unit according to an embodiment of the present invention.
[0047] Referring to FIGS. 4 and 5, the patch image unit (121) divides the input image to generate and augment patch images. The patch image unit (121) generates patch images through a patch-based augmentation technique using the Patchify library and normalization. When generating patch images, the patch image unit (121) forms a 1x1 grid so that the patch images do not overlap. Each patch image is 256 x 256 pixels with three color channels (R, G, B). The patch image unit (121) augments the generated patch images to create 945 patch images and divides them into 803 training images and 142 test images.
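Non-overlapping patch extraction can be sketched in plain NumPy as a stand-in for the Patchify library named above. Cropping the image to a multiple of the patch size and scaling pixel values to [0, 1] are assumptions; the specification states only that normalization is applied. The function names and the 600 x 800 input are hypothetical.

```python
import numpy as np

def split_into_patches(image, patch=256):
    """Cut an (H, W, C) image into non-overlapping patch x patch tiles.
    Rows and columns that do not fill a whole tile are cropped (assumption)."""
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    cropped = image[: rows * patch, : cols * patch]
    tiles = cropped.reshape(rows, patch, cols, patch, c).swapaxes(1, 2)
    return tiles.reshape(rows * cols, patch, patch, c)

def normalize(patches):
    """Scale 8-bit pixel values to [0, 1] (assumed normalization)."""
    return patches.astype(np.float32) / 255.0

# A synthetic 600 x 800 RGB image yields 2 x 3 = 6 patches.
image = np.random.default_rng(0).integers(0, 256, size=(600, 800, 3), dtype=np.uint8)
patches = normalize(split_into_patches(image))
```

Because the stride equals the patch size, no pixel appears in more than one patch.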
[0048]
[0049] FIG. 6 is a block diagram of an encoder unit according to an embodiment of the present invention, FIG. 7 is a diagram explaining self-attention according to an embodiment of the present invention, and FIG. 8 is a structural diagram of a separable convolution according to an embodiment of the present invention.
[0050] Referring to FIGS. 6 to 8, the encoder unit (122) includes a convolution unit (124), a self-attention unit (125), and a max pooling layer unit (126).
[0051] The encoder unit (122) inputs a patch image to train the model.
[0052] The convolution unit (124) takes a patch image as input and passes the feature map, which has passed through a convolution layer, a dropout layer, and a separable convolution, to the self-attention unit (125). Here, the separable convolution proceeds in two stages: depth-wise convolution and point-wise convolution. Depth-wise convolution performs the convolution operation on each channel of the input individually, so it learns spatial features within one channel at a time but does not consider how different channels are related. Point-wise convolution examines all channels together at each pixel position and combines the per-channel information obtained from depth-wise convolution into overall features. In other words, separable convolution extracts features within each channel through depth-wise convolution and learns overall features by combining information between channels through point-wise convolution. Because separable convolution requires fewer computations and fewer parameters than standard convolution, it is faster and can efficiently extract important details from data even when the data volume is large or computing power is limited. Separable convolution can also help prevent the model from overfitting to the training data, and by proceeding in two stages it reduces the computational burden, allowing it to perform tasks more efficiently than standard convolution in resource-limited situations.
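The two stages can be sketched in NumPy as follows. The 3x3 kernel, the channel counts, "valid" padding, and stride 1 are illustrative assumptions; biases and activations are omitted, and the function names are hypothetical.

```python
import numpy as np

def depthwise_conv(x, kernels):
    """Stage 1: each channel is filtered by its own k x k kernel.
    x: (H, W, C); kernels: (k, k, C); valid padding, stride 1."""
    k = kernels.shape[0]
    h, w, c = x.shape
    out = np.zeros((h - k + 1, w - k + 1, c))
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            window = x[i:i + k, j:j + k, :]                # (k, k, C)
            out[i, j, :] = np.sum(window * kernels, axis=(0, 1))
    return out

def pointwise_conv(x, weights):
    """Stage 2: a 1x1 convolution that mixes channels at each pixel.
    x: (H, W, C_in); weights: (C_in, C_out)."""
    return x @ weights

def separable_conv(x, dw_kernels, pw_weights):
    return pointwise_conv(depthwise_conv(x, dw_kernels), pw_weights)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 3))     # small RGB-like input
dw = rng.normal(size=(3, 3, 3))    # one 3x3 kernel per input channel
pw = rng.normal(size=(3, 16))      # mixes 3 channels into 16
y = separable_conv(x, dw, pw)      # shape (6, 6, 16)
```

The result is identical to a standard convolution whose kernel factorizes as `dw[:, :, c] * pw[c, o]`, which is why the separable form needs far fewer weights.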
[0053] The self-attention unit (125) transmits the input feature map to the self-attention block to add relationship information between pixels, and transmits the output feature map to the max pooling layer unit (126). The self-attention unit (125) learns the importance between pixels by connecting all pixels to each other. By treating the image as a sequence of pixels and considering each pixel's association with the other pixels in the same image, the self-attention unit (125) can attend to the entire image and to local features together. The self-attention unit (125) can also fuse temporal information and spatial information in semantic segmentation, so it can be useful for tracking landscapes where the image changes over time.
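Treating the feature map as a sequence of pixels can be sketched with scaled dot-product attention. The specification does not state which attention formulation is used, so the query/key/value projections, the scaling by the square root of the key dimension, and all names below are assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(feature_map, w_q, w_k, w_v):
    """Scaled dot-product self-attention over all pixels (assumed form).
    feature_map: (H, W, C); w_q, w_k, w_v: (C, D)."""
    h, w, c = feature_map.shape
    tokens = feature_map.reshape(h * w, c)        # image as a sequence of pixels
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])       # (HW, HW) pixel-to-pixel affinity
    attn = softmax(scores, axis=-1)               # importance of every pixel for each pixel
    out = attn @ v                                # weighted mix of all pixels
    return out.reshape(h, w, -1), attn

rng = np.random.default_rng(0)
fmap = rng.normal(size=(4, 4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = self_attention(fmap, w_q, w_k, w_v)
```

Each row of `attn` holds the weights one pixel assigns to every pixel in the map, which is how distant regions can influence a local prediction. The attention matrix grows with the square of the pixel count, so this is typically applied to downsampled feature maps.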
[0054] The max pooling layer unit (126) performs depthwise concatenation on the input feature map and then transmits the feature map that has passed through the max pooling layer to the decoder unit (123).
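A minimal sketch of depthwise concatenation followed by max pooling is given below. The specification does not say which tensors are concatenated or what pooling window is used; concatenating the convolution output with the self-attention output along the channel axis and pooling with a 2x2 window are assumptions made for illustration.

```python
import numpy as np

def depth_concat(a, b):
    """Concatenate two (H, W, C) feature maps along the channel (depth) axis."""
    return np.concatenate([a, b], axis=-1)

def max_pool_2x2(x):
    """2x2 max pooling with stride 2; odd trailing rows/columns are dropped."""
    h, w, c = x.shape
    x = x[: h - h % 2, : w - w % 2]
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

conv_out = np.arange(16, dtype=float).reshape(4, 4, 1)   # stand-in convolution output
attn_out = -conv_out                                     # stand-in attention output
merged = depth_concat(conv_out, attn_out)                # (4, 4, 2)
pooled = max_pool_2x2(merged)                            # (2, 2, 2)
```

Pooling halves the spatial resolution while keeping the strongest response in each window for every channel.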
[0055] The decoder unit (123) restores the spatial details of the input feature map through upsampling using transposed convolution and skip connections. In the last layer, the decoder unit (123) performs a 1x1 convolution with a softmax activation function to predict, for each pixel in the image, the class to which that pixel belongs.
[0056] The output unit (130) outputs the semantic segmentation result. Based on the output result, the output unit (130) outputs a semantic segmentation map in which a class is assigned to each pixel.
[0057]
[0058] FIG. 9 is a diagram illustrating the performance of a U-Net-based image semantic segmentation device having self-attention and a separable convolutional neural network according to one embodiment of the present invention, and FIG. 10 is a diagram of a predicted image of a U-Net-based image semantic segmentation device having self-attention and a separable convolutional neural network according to one embodiment of the present invention.
[0059] Referring to FIGS. 9 and 10, the U-Net-based image semantic segmentation device (10) with self-attention and a separable convolutional neural network according to one embodiment of the present invention outperformed FCN, U-Net, and Dense+U-Net on the accuracy, loss, and IoU performance indicators. In particular, the device (10) achieved the highest scores in accuracy (91.78%) and IoU (81.82%). These results indicate that the self-attention mechanism, by focusing on the more relevant image regions, provides excellent feature extraction and captures long-range dependencies, and that separable convolution reduces complexity while maintaining efficiency, so that the spatial hierarchy of the data is learned effectively.
[0060] FIG. 11 is a plot of the training IoU and test IoU of a U-Net-based image semantic segmentation device having self-attention and a separable convolutional neural network according to one embodiment of the present invention.
[0061] Referring to FIG. 11, both the IoU on the training data and the IoU on the test data improved continuously, indicating effective training and generalization of the model. IoU is an indicator that measures the degree of overlap between a predicted mask and the actual mask, and the result indicates that the U-Net-based image semantic segmentation device (10) having self-attention and a separable convolutional neural network according to one embodiment of the present invention can accurately identify and distinguish objects in unseen data.
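The IoU indicator can be computed as in the following sketch. Averaging per-class IoU and skipping classes absent from both masks are common conventions adopted here as assumptions; the specification does not state how the reported IoU is aggregated, and the function names are hypothetical.

```python
import numpy as np

def iou_per_class(pred, target, n_classes):
    """Intersection over union for each class index.
    Classes absent from both masks get NaN so they can be skipped."""
    ious = []
    for c in range(n_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            ious.append(float("nan"))
        else:
            ious.append(np.logical_and(p, t).sum() / union)
    return np.array(ious)

def mean_iou(pred, target, n_classes):
    return float(np.nanmean(iou_per_class(pred, target, n_classes)))

target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
pred = np.array([[0, 0, 0, 1],
                 [0, 0, 1, 1]])

# Class 0: intersection 4, union 5 -> 0.80; class 1: intersection 3, union 4 -> 0.75.
score = mean_iou(pred, target, n_classes=2)   # 0.775
```

A single mislabeled pixel lowers the IoU of both classes it touches, which makes the metric stricter than plain pixel accuracy.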
[0062]
[0063] FIG. 12 is a plot of the training loss and test loss of a U-Net-based image semantic segmentation device having self-attention and a separable convolutional neural network according to one embodiment of the present invention.
[0064] Referring to FIG. 12, both the loss on the training data and the loss on the test data decreased, indicating effective training and generalization of the model. The loss function is an indicator representing the error between the predicted segmentation mask and the actual segmentation mask, and the result shows that the U-Net-based image semantic segmentation device (10) having self-attention and a separable convolutional neural network according to one embodiment of the present invention performs well on both training data and test data.
[0065] A U-Net-based image semantic segmentation device (10) having self-attention and a separable convolutional neural network according to one embodiment of the present invention integrates separable convolution and a self-attention mechanism, and demonstrates sophisticated multi-class semantic segmentation performance in terms of prediction accuracy and efficiency with a composite loss function combining Dice Loss and Focal Loss.
[0066]
[0067] FIG. 13 is a diagram of a U-Net-based image semantic segmentation method having self-attention and a separable convolutional neural network according to an embodiment of the present invention. Each process described below is a process performed by each functional part constituting the U-Net-based image semantic segmentation device (10) having self-attention and a separable convolutional neural network in each step, but for a concise and clear explanation of the present invention, the subject of each step is collectively referred to as the U-Net-based image semantic segmentation device (10) having self-attention and a separable convolutional neural network.
[0068] Referring to FIG. 13, in step S1310, a U-Net-based image semantic segmentation device (10) with self-attention and a separable convolutional neural network inputs an image.
[0069] An image according to one embodiment of the present invention uses an aerial image of Dubai obtained from the MBRSC (Mohammed Bin Rashid Space Centre) satellite, but the input image is not limited thereto. The aerial image of Dubai is grouped into 8 large tiles containing a total of 72 images, and each image is divided into 6 classes of buildings, land, roads, vegetation, water, and unlabeled, with a specific label assigned to each pixel.
[0070] In step S1320, the U-Net-based image semantic segmentation device (10) with self-attention and a separable convolutional neural network divides the input image to generate and augment patch images. The device (10) generates patch images through a patch-based augmentation technique using the Patchify library and normalization. When generating patch images, the device (10) forms a 1x1 grid so that the patch images do not overlap. Each patch image is 256 x 256 pixels with three color channels (R, G, B). The device (10) augments the generated patch images to create 945 patch images and divides them into 803 training images and 142 test images.
[0071] In step S1330, the U-Net-based image semantic segmentation device (10) with self-attention and a separable convolutional neural network inputs a patch image and passes the feature map, which has passed through a convolution layer, a dropout layer, and a separable convolution, to a self-attention block. Here, the separable convolution proceeds in two stages: depth-wise convolution and point-wise convolution. Depth-wise convolution performs the convolution operation on each channel of the input individually, so it learns spatial features within one channel at a time but does not consider how different channels are related. Point-wise convolution examines all channels together at each pixel position and combines the per-channel information obtained from depth-wise convolution into overall features. In other words, separable convolution extracts features within each channel through depth-wise convolution and learns overall features by combining information between channels through point-wise convolution. Because separable convolution requires fewer computations and fewer parameters than standard convolution, it is faster and can efficiently extract important details from data even when the data volume is large or computing power is limited. Separable convolution can also help prevent the model from overfitting to the training data, and by proceeding in two stages it reduces the computational burden, allowing it to perform tasks more efficiently than standard convolution in resource-limited situations.
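The 803/142 division of the 945 patch images is consistent with holding out 15% of the data and rounding up (0.15 x 945 = 141.75, rounded up to 142). The specification does not state the ratio or the splitting procedure, so the fraction, the random shuffle, and the function name in this sketch are assumptions.

```python
import math
import random

def split_indices(n_items, test_fraction, seed=0):
    """Shuffle item indices and hold out ceil(n_items * test_fraction) for testing."""
    n_test = math.ceil(n_items * test_fraction)
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]   # (train, test)

train_idx, test_idx = split_indices(945, 0.15)
```

Fixing the seed keeps the split reproducible, so that training and test curves from different runs are comparable.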
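The reduction in parameters can be made concrete with a short calculation. A standard convolution with a k x k kernel, C_in input channels, and C_out output channels has k·k·C_in·C_out weights, whereas the separable form has k·k·C_in depth-wise weights plus C_in·C_out point-wise weights. The layer sizes below are illustrative assumptions, and biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Depth-wise weights (k*k per input channel) plus point-wise weights (1x1 mixing)."""
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 64, 128                          # assumed layer sizes
standard = standard_conv_params(k, c_in, c_out)      # 73728
separable = separable_conv_params(k, c_in, c_out)    # 576 + 8192 = 8768
ratio = separable / standard                         # equals 1/c_out + 1/k**2
```

For this layer the separable form uses roughly 12% of the weights of the standard convolution, and the multiply-accumulate count per output position shrinks by the same factor.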
[0072] A model according to one embodiment of the present invention is a U-Net-based semantic segmentation model having self-attention and a separable convolutional neural network.
[0073] According to one embodiment of the present invention, Adam is used as the optimizer, and the composite loss function combines Dice Loss and Focal Loss. The composite loss function according to one embodiment of the present invention is as shown in Equation 1.
[0074] Total Loss = 0.5 × Dice Loss + 0.5 × Focal Loss (Equation 1)
[0075] According to one embodiment of the present invention, Focal Loss solves the class imbalance problem, and Dice Loss learns accurate object boundaries. Focal Loss is a loss function designed to enhance learning for classes with few pixels (rare classes) by assigning higher weights to incorrectly predicted pixels, thereby preventing the model from focusing only on easily predictable classes. Dice Loss is a loss function that helps learn accurate boundaries by emphasizing the overlapping area between the pixel distribution predicted by the model and the actual correct labels, thereby improving pixel-level accuracy by learning the accurate boundaries and locations of objects. The loss function according to one embodiment of the present invention equally reflects the advantages of both loss functions.
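For reference, the two terms of Equation 1 are commonly written as shown below. The specification does not give these expressions, so the forms are assumptions: p_i is the predicted probability and g_i the ground-truth label (0 or 1) at pixel i for a given class, p_{t,i} is the probability assigned to the true class of pixel i, N is the number of pixels, gamma is the focusing parameter, and epsilon is a small smoothing constant.

```latex
\begin{aligned}
\mathcal{L}_{\text{Dice}} &= 1 - \frac{2\sum_{i} p_i\, g_i + \epsilon}{\sum_{i} p_i + \sum_{i} g_i + \epsilon}, \\
\mathcal{L}_{\text{Focal}} &= -\frac{1}{N}\sum_{i=1}^{N} \left(1 - p_{t,i}\right)^{\gamma} \log p_{t,i}, \\
\mathcal{L}_{\text{Total}} &= 0.5\,\mathcal{L}_{\text{Dice}} + 0.5\,\mathcal{L}_{\text{Focal}}.
\end{aligned}
```

With an assumed gamma of 2, a pixel predicted correctly with p_t = 0.9 is weighted by (1 - 0.9)^2 = 0.01, while a pixel predicted poorly with p_t = 0.1 is weighted by (1 - 0.1)^2 = 0.81, which is how incorrectly predicted pixels receive the higher weights described above.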
[0076] In step S1340, the U-Net-based image semantic segmentation device (10) with self-attention and a separable convolutional neural network transmits the input feature map to the self-attention block to add relationship information between pixels and transmits the output feature map to the max pooling layer. By treating the image as a sequence of pixels and considering each pixel's association with the other pixels in the same image, the device (10) can attend to the entire image and to local features together. The device (10) can also fuse temporal information and spatial information in semantic segmentation, so it can be useful for tracking landscapes where the image changes over time.
[0077] In step S1350, a U-Net-based image semantic segmentation device (10) with self-attention and a separable convolutional neural network performs depthwise concatenation on the input feature map and then passes the feature map through a max pooling layer to a decoder.
[0078] In step S1360, the U-Net-based image semantic segmentation device (10) with self-attention and a separable convolutional neural network restores the spatial details of the input feature map through upsampling using transposed convolution and skip connections. In the last layer, the device (10) performs a 1x1 convolution with a softmax activation function to predict, for each pixel in the image, the class to which that pixel belongs.
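One decoder step can be sketched as a stride-2 transposed convolution followed by concatenation with the matching encoder feature map. The 2x2 kernel with stride 2 follows the classic U-Net design and is an assumption here, as are the channel counts and function names; the specification does not give these details.

```python
import numpy as np

def transposed_conv_2x2(x, kernels):
    """Stride-2 transposed convolution with a 2x2 kernel (assumed sizes).
    x: (H, W, C_in); kernels: (2, 2, C_in, C_out). Output: (2H, 2W, C_out).
    Each input pixel writes one 2x2 block of the output, so blocks never overlap."""
    h, w, _ = x.shape
    c_out = kernels.shape[-1]
    out = np.zeros((2 * h, 2 * w, c_out))
    for di in range(2):
        for dj in range(2):
            out[di::2, dj::2, :] = x @ kernels[di, dj]   # (H, W, C_out)
    return out

def decoder_step(x, skip, kernels):
    """Upsample, then append the encoder feature map of the same resolution."""
    up = transposed_conv_2x2(x, kernels)
    return np.concatenate([up, skip], axis=-1)           # skip connection

x = np.ones((4, 4, 8))             # low-resolution decoder input
kernels = np.ones((2, 2, 8, 4))    # learnable in practice; ones for illustration
skip = np.zeros((8, 8, 4))         # stand-in encoder feature map
merged = decoder_step(x, skip, kernels)
```

The skip connection reintroduces fine spatial detail that pooling removed in the encoder, which the upsampled map alone cannot recover.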
[0079] In step S1370, the U-Net-based image semantic segmentation device (10) with self-attention and a separable convolutional neural network outputs a semantic segmentation result. The U-Net-based image semantic segmentation device (10) with self-attention and a separable convolutional neural network outputs a semantic segmentation map in which a class is assigned to each pixel based on the output result.
[0080] The U-Net-based image semantic segmentation method with self-attention and a separable convolutional neural network described above can be implemented as computer-readable code on a computer-readable medium. The computer-readable recording medium may be, for example, a removable recording medium (CD, DVD, Blu-ray disc, USB storage device, removable hard disk) or a fixed recording medium (ROM, RAM, hard disk built into a computer). The computer program recorded on the computer-readable recording medium can be transmitted to another computing device via a network such as the Internet and installed on that computing device, thereby allowing it to be used there.
[0081] Although it has been described above that all components constituting an embodiment of the present invention are combined or operate as a single unit, the present invention is not necessarily limited to such an embodiment. That is, within the scope of the purpose of the present invention, all components may be selectively combined in one or more ways to operate.
[0082] Although operations are depicted in a specific order in the drawings, it should not be understood that the operations must be executed in the specific order depicted or in a sequential order, or that all depicted operations must be executed to obtain the desired result. In certain situations, multitasking and parallel processing may be advantageous. Furthermore, the separation of various configurations in the embodiments described above should not be understood as a necessary separation, and it should be understood that the described program components and systems can generally be integrated together into a single software product or packaged into multiple software products.
[0083] The present invention has been described above with reference to its embodiments. Those skilled in the art will understand that the present invention may be implemented in modified forms without departing from the essential characteristics of the invention. Therefore, the disclosed embodiments should be considered in an illustrative rather than a restrictive sense. The scope of the invention is defined by the claims, not by the foregoing description, and all variations within the scope of the claims should be interpreted as being included in the invention.
[0084] The modes for carrying out the invention are described together in the best mode for carrying out the invention above.
[0085]
[0086] The present invention relates to image semantic segmentation technology, and more specifically, to a U-Net-based image semantic segmentation device and method having self-attention and a separable convolutional neural network, and is industrially applicable as it can be used in various ways.
Claims
1. A U-Net-based image semantic segmentation device having self-attention and a separable convolutional neural network, comprising: an input unit that inputs an image; a semantic segmentation unit that trains a model with the image and performs semantic segmentation; and an output unit that outputs the semantic segmentation result.
2. The U-Net-based image semantic segmentation device of claim 1, wherein the semantic segmentation unit comprises: a patch image unit that divides the image to generate a patch image; an encoder unit that trains the model by receiving the patch image as input; and a decoder unit that upsamples the trained result and performs semantic segmentation.
3. The U-Net-based image semantic segmentation device of claim 2, wherein the encoder unit comprises: a convolution unit that receives the patch image as input and outputs a result after passing it through a convolution layer, a dropout layer, and a separable convolution; a self-attention unit that learns the importance between pixels by passing the result to self-attention; and a max pooling layer unit that extracts a feature map by passing the output of the self-attention unit through a max pooling layer.
4. The U-Net-based image semantic segmentation device of claim 2, wherein the decoder unit upsamples the feature map using transposed convolution.
5. The U-Net-based image semantic segmentation device of claim 2, wherein the semantic segmentation unit trains the model based on a composite loss function.
6. A U-Net-based image semantic segmentation method having self-attention and a separable convolutional neural network, performed by a U-Net-based image semantic segmentation device having self-attention and a separable convolutional neural network, the method comprising: inputting an image; training a model with the image and performing semantic segmentation; and outputting the semantic segmentation result.
7. The U-Net-based image semantic segmentation method of claim 6, wherein the step of training a model with the image and performing semantic segmentation comprises: dividing the image to generate a patch image; training the model by inputting the patch image; and upsampling the trained result and performing semantic segmentation.
8. The U-Net-based image semantic segmentation method of claim 7, wherein the step of training the model by inputting the patch image comprises: inputting the patch image and outputting a result after passing it through a convolution layer, a dropout layer, and a separable convolution; learning the importance between pixels by passing the result to self-attention; and extracting a feature map by passing the output of the self-attention step through a max pooling layer.
9. The U-Net-based image semantic segmentation method of claim 7, wherein the step of upsampling the trained result and performing semantic segmentation upsamples the feature map using transposed convolution.
10. The U-Net-based image semantic segmentation method of claim 7, wherein the step of training a model with the image and performing semantic segmentation trains the model based on a composite loss function.
11. A computer program recorded on a computer-readable recording medium, the computer program executing the U-Net-based image semantic segmentation method having self-attention and a separable convolutional neural network of any one of claims 6 to 10.