An image semantic segmentation method, device, equipment, medium and program product
By employing multi-stage feature extraction and decoding in encoders and decoders, and combining convolutional neural networks and spatial hybrid networks, the problem of accuracy and detail recovery in complex scenes in existing semantic segmentation methods is solved, achieving more efficient image semantic segmentation.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: TENCENT TECHNOLOGY (SHENZHEN) CO LTD
- Filing Date: 2024-12-23
- Publication Date: 2026-06-23
AI Technical Summary
Existing semantic segmentation methods face challenges such as insufficient accuracy, long-distance dependency issues, and difficulty in detail recovery when dealing with complex scenes and large-scale transformations.
Feature extraction is performed using multiple encoding layers of the encoder, combined with convolutional neural networks and spatial hybrid networks, and feature decoding is performed through the decoder, which enhances the model's ability to capture long-range dependencies and local features.
It significantly improves the accuracy and efficiency of image semantic segmentation, and realizes the collaborative modeling of global semantics and detailed information of images.
Figure CN122265641A_ABST
Description
Technical Field
[0001] This application relates to the field of artificial intelligence, and in particular to an image semantic segmentation method, apparatus, device, medium, and program product.
Background Technology
[0002] Artificial intelligence (AI) utilizes digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceiving the environment, acquiring knowledge, and using that knowledge to achieve optimal results. In other words, AI is a branch of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a way similar to human intelligence. AI studies the design principles and implementation methods of various intelligent machines, enabling them to possess perception, reasoning, and decision-making capabilities.
[0003] Semantic segmentation is an important research direction in computer vision. Its goal is to assign semantic labels to each pixel in an image, going beyond object detection and instance segmentation to provide refined scene understanding. Semantic segmentation has a wide range of applications, including autonomous driving, medical image analysis, remote sensing image processing, and virtual reality. In this task, the model needs to accurately identify and segment different categories of regions, which is crucial for applications such as object detection and scene understanding. Most existing semantic segmentation methods are based on convolutional neural networks, but they still face challenges such as insufficient accuracy, long-range dependencies, and difficulty in detail recovery when dealing with complex scenes and large-scale transformations.
[0004] Therefore, how to achieve more efficient image semantic segmentation is a problem that urgently needs to be solved.
Summary of the Invention
[0005] This application provides an image semantic segmentation method, apparatus, device, medium, and program product to enhance the model's ability to capture long-distance dependencies and improve contextual understanding, thereby improving the accuracy and efficiency of image semantic segmentation.
[0006] In view of this, this application provides an image semantic segmentation method, comprising: acquiring an image to be recognized; sequentially extracting features from the image to be recognized using multiple coding layers of an encoder to obtain multiple feature maps, wherein the multiple feature maps are feature maps of different sizes output by the multiple coding layers in the encoder, and the multiple coding layers include coding layers based on convolutional neural networks and coding layers based on both convolutional neural networks and spatial hybrid networks; sequentially decoding the multiple feature maps using multiple decoding layers of a decoder to obtain a semantic segmentation result of the image to be recognized, wherein the multiple decoding layers correspond one-to-one with the multiple coding layers, the input of the first decoding layer of the multiple decoding layers is the output of the last coding layer of the multiple coding layers, and the inputs of each of the other decoding layers include the output of the previous decoding layer and the output of the coding layer corresponding to that decoding layer.
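The decoder wiring described above can be sketched as a small connection plan. This is only an illustration: the function name `decoder_inputs`, the labels `encK` / `decK`, and the mirrored pairing (the first decoding layer paired with the last coding layer, the second with the second-to-last, and so on) are assumptions made for the sketch, not details stated in the application.

```python
def decoder_inputs(num_layers):
    """Return, for each decoding layer, the names of the tensors it consumes.

    Assumption: decoding layer i (1-based) is paired with coding layer
    num_layers - i + 1, i.e. the pairing mirrors the encoder order.
    """
    plan = []
    for i in range(num_layers):
        if i == 0:
            # The first decoding layer only sees the last coding layer's output.
            plan.append([f"enc{num_layers}"])
        else:
            # Later decoding layers see the previous decoding layer's output
            # plus the output of their corresponding coding layer.
            plan.append([f"dec{i}", f"enc{num_layers - i}"])
    return plan


plan = decoder_inputs(4)
for index, inputs in enumerate(plan, start=1):
    print(f"dec{index} <- {inputs}")
```

With four layers, the last decoding layer therefore receives the highest-resolution encoder feature map, which is the usual way skip connections help recover detail.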
[0007] Another aspect of this application provides an image semantic segmentation apparatus, including: an acquisition module for acquiring an image to be recognized;
[0008] A processing module for sequentially extracting features from the image to be recognized based on multiple coding layers of an encoder to obtain multiple feature maps, wherein the multiple feature maps are feature maps of different sizes output by the multiple coding layers in the encoder, and the multiple coding layers include coding layers based on convolutional neural networks and coding layers based on both convolutional neural networks and spatial hybrid networks; and for sequentially decoding the multiple feature maps based on multiple decoding layers of a decoder to obtain the semantic segmentation result of the image to be recognized. The multiple decoding layers correspond one-to-one with the multiple coding layers, and the input of the first decoding layer of the multiple decoding layers is the output of the last coding layer of the multiple coding layers. The inputs of each of the other decoding layers include the output of the previous decoding layer and the output of the coding layer corresponding to that decoding layer.
[0009] In one possible design, in another implementation of another aspect of the embodiments of this application, the coding layer based on the convolutional neural network is generated by stacking multiple first-type convolutional modules, which include convolutional layers and depthwise convolutional layers.
[0010] The coding layer based on convolutional neural networks and spatial hybrid networks is generated by stacking one first-type convolutional module and multiple second-type convolutional modules. The second-type convolutional module includes convolutional layers, spatial hybrid networks, and depthwise convolutional layers.
[0011] In one possible design, in another implementation of another aspect of the embodiments of this application, the encoder includes four coding layers, wherein the first coding layer and the second coding layer are coding layers based on convolutional neural networks, and the third coding layer and the fourth coding layer are coding layers based on convolutional neural networks and spatial hybrid networks; the processing module is used to extract features from the image to be recognized based on the first coding layer to obtain a first feature map;
[0012] Based on the second coding layer, features are extracted from the first feature map to obtain the second feature map;
[0013] Based on the third coding layer, features are extracted from the second feature map to obtain the third feature map;
[0014] Based on the fourth coding layer, features are extracted from the third feature map to obtain the fourth feature map;
[0015] The first feature map, the second feature map, the third feature map, and the fourth feature map are the multiple feature maps.
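A minimal sketch of the four-stage encoder flow described above, in which each coding layer consumes the previous layer's output and every intermediate output is kept. The stand-in layer `downsample_by_two`, the table `CODING_LAYERS`, and the halving of resolution at each stage are assumptions for illustration; the application only states that the feature maps have different sizes.

```python
def downsample_by_two(fmap):
    """Stand-in for a coding layer: keep every second row and column."""
    return [row[::2] for row in fmap[::2]]


# Hypothetical layer table: the first two layers are CNN-based, the last two
# combine a CNN with a spatial hybrid network. Real layers would replace the
# stand-in callable.
CODING_LAYERS = [
    ("cnn", downsample_by_two),
    ("cnn", downsample_by_two),
    ("cnn+spatial", downsample_by_two),
    ("cnn+spatial", downsample_by_two),
]


def encode(image, coding_layers=CODING_LAYERS):
    """Run the coding layers in sequence and collect every layer's output."""
    feature_maps = []
    x = image
    for _kind, layer in coding_layers:
        x = layer(x)
        feature_maps.append(x)
    return feature_maps


image = [[float(r * 16 + c) for c in range(16)] for r in range(16)]
maps = encode(image)
sizes = [(len(m), len(m[0])) for m in maps]
print(sizes)
```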
[0016] In one possible design, in another implementation of another aspect of the embodiments of this application, the processing module is used to extract features from the image to be recognized using the first first-type convolutional module in the first coding layer to obtain a first intermediate feature map.
[0017] Features are extracted from the first intermediate feature map using the second first-type convolutional module in the first coding layer to obtain a second intermediate feature map;
[0018] The first intermediate feature map and the second intermediate feature map are added together to obtain a third intermediate feature map, which is used as the output of the second first-type convolutional module in the first coding layer.
[0019] Features are extracted from the third intermediate feature map using the third first-type convolutional module in the first coding layer to obtain a fourth intermediate feature map;
[0020] The third intermediate feature map and the fourth intermediate feature map are added together to obtain a fifth intermediate feature map, which is used as the output of the third first-type convolutional module in the first coding layer.
[0021] The remaining convolutional modules in the first coding layer perform feature extraction in the same manner to obtain the first feature map.
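The stacking rule in the steps above (the first module's output is used directly, and every later module's output is added to that module's input) can be sketched as follows. The helper names `add` and `run_coding_layer`, and the toy modules operating on flat lists, are hypothetical stand-ins for real convolutional modules.

```python
def add(a, b):
    """Element-wise addition of two equally sized flat feature vectors."""
    return [x + y for x, y in zip(a, b)]


def run_coding_layer(x, modules):
    """Apply stacked modules with a residual connection around all but the first."""
    first, *rest = modules
    out = first(x)  # first intermediate feature map, no residual
    for module in rest:
        # Output of this module = its input + the features it extracts.
        out = add(out, module(out))
    return out


# Toy modules standing in for first-type convolutional modules.
double = lambda v: [2 * e for e in v]
plus_one = lambda v: [e + 1 for e in v]

result = run_coding_layer([1, 2], [double, plus_one, plus_one])
print(result)  # [11, 19]
```

Tracing the example: the first module gives [2, 4]; the second extracts [3, 5] and the sum is [5, 9]; the third extracts [6, 10] and the sum is [11, 19].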
[0022] In one possible design, in another implementation of another aspect of the embodiments of this application, the processing module is used to perform layer normalization processing on the image to be recognized using the first convolutional layer in the first first-type convolutional module to obtain a first intermediate feature sequence.
[0023] The depthwise convolutional layer in the first first-type convolutional module is used to perform local feature aggregation on the first intermediate feature sequence to obtain a second intermediate feature sequence;
[0024] The second convolutional layer in the first first-type convolutional module is used to perform feature mapping on the second intermediate feature sequence to obtain the first intermediate feature map.
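A rough sketch of one first-type convolutional module following the three steps above. Several details are assumptions rather than statements from the application: normalization is taken over the channel vector of each pixel, both convolutional layers are modeled as 1x1 (pointwise) convolutions, the depthwise kernel is 3x3 with zero padding, and feature maps are nested lists indexed as [channel][row][column].

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize each pixel's channel vector to zero mean and unit variance."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in range(c)]
    for i in range(h):
        for j in range(w):
            vals = [x[k][i][j] for k in range(c)]
            mean = sum(vals) / c
            var = sum((v - mean) ** 2 for v in vals) / c
            for k in range(c):
                out[k][i][j] = (vals[k] - mean) / math.sqrt(var + eps)
    return out


def pointwise_conv(x, weight):
    """1x1 convolution: weight[o][k] mixes input channel k into output channel o."""
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(wk * x[k][i][j] for k, wk in enumerate(row))
              for j in range(w)] for i in range(h)] for row in weight]


def depthwise_conv3x3(x, kernels):
    """3x3 depthwise convolution with zero padding, one kernel per channel."""
    out = []
    for ch, ker in zip(x, kernels):
        h, w = len(ch), len(ch[0])
        res = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += ker[di + 1][dj + 1] * ch[ii][jj]
                res[i][j] = acc
        out.append(res)
    return out


def first_type_module(x, w_in, dw_kernels, w_out):
    """Normalize + first conv, depthwise local aggregation, second conv."""
    seq1 = pointwise_conv(layer_norm(x), w_in)   # first intermediate sequence
    seq2 = depthwise_conv3x3(seq1, dw_kernels)   # second intermediate sequence
    return pointwise_conv(seq2, w_out)           # intermediate feature map


ones = [[[1.0] * 3 for _ in range(3)]]
box = [[[1.0] * 3 for _ in range(3)]]
summed = depthwise_conv3x3(ones, box)  # centre sees 9 neighbours, corner sees 4
```

The depthwise step is what gives the module its local receptive field: each channel is filtered on its own, and channel mixing is left to the pointwise convolutions.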
[0025] In one possible design, in another implementation of another aspect of the embodiments of this application, the processing module is used to extract features from the second feature map using the first-type convolutional module in the third coding layer to obtain a sixth intermediate feature map.
[0026] The first second-type convolutional module in the third coding layer is used to extract features from the sixth intermediate feature map to obtain a seventh intermediate feature map;
[0027] The seventh intermediate feature map and the sixth intermediate feature map are added together to obtain an eighth intermediate feature map, which is used as the output of the first second-type convolutional module in the third coding layer.
[0028] The second second-type convolutional module in the third coding layer is used to extract features from the eighth intermediate feature map to obtain a ninth intermediate feature map;
[0029] The eighth intermediate feature map and the ninth intermediate feature map are added together to obtain a tenth intermediate feature map, which is used as the output of the second second-type convolutional module in the third coding layer.
[0030] The remaining convolutional modules in the third coding layer perform feature extraction in the same manner to obtain the third feature map.
[0031] In one possible design, in another implementation of another aspect of the embodiments of this application, the processing module is used to perform layer normalization processing on the image to be recognized using the first convolutional layer in the first second-type convolutional module to obtain a third intermediate feature sequence.
[0032] The spatial hybrid network in the first second-type convolutional module is used to spatially mix the third intermediate feature sequence to obtain a fourth intermediate feature sequence;
[0033] The depthwise convolutional layer in the first second-type convolutional module is used to perform local feature aggregation on the fourth intermediate feature sequence to obtain a fifth intermediate feature sequence;
[0034] The second convolutional layer in the first second-type convolutional module is used to perform feature mapping on the fifth intermediate feature sequence to obtain the seventh intermediate feature map.
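The order of operations in a second-type convolutional module can be sketched as below. The internal structure of the spatial hybrid network is not given in this passage, so `spatial_mix` is purely an assumed stand-in: a linear map over all flattened spatial positions, chosen only to show how a mixing step lets every position depend on every other position. The other stages are passed in as callables.

```python
def spatial_mix(seq, mix):
    """Assumed stand-in for spatial mixing.

    seq: [C][N] channels over N flattened positions; mix: [N][N] weights.
    Each output position is a weighted sum over all input positions.
    """
    return [[sum(mix[p][q] * ch[q] for q in range(len(ch)))
             for p in range(len(ch))] for ch in seq]


def second_type_module(x, conv_in, mixer, depthwise, conv_out):
    """First conv, spatial mixing, depthwise aggregation, second conv."""
    seq3 = conv_in(x)        # third intermediate feature sequence
    seq4 = mixer(seq3)       # fourth: after spatial mixing
    seq5 = depthwise(seq4)   # fifth: after local aggregation
    return conv_out(seq5)    # resulting intermediate feature map


n = 4
uniform = [[1.0 / n] * n for _ in range(n)]  # every position attends equally
identity = lambda t: t

x = [[1.0, 0.0, 0.0, 0.0]]  # one channel, a single active position
y = second_type_module(x, identity, lambda s: spatial_mix(s, uniform),
                       identity, identity)
print(y)  # [[0.25, 0.25, 0.25, 0.25]]
```

The single active position influences all four outputs after one mixing step, which is the long-range behavior a purely local 3x3 convolution cannot provide in one layer.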
[0035] In one possible design, in another implementation of another aspect of the embodiments of this application, the processing module is used to perform feature map alignment on a first feature map set of the plurality of feature maps to obtain a first intermediate feature map set, the first feature map set being all feature maps except for the feature map output by the last coding layer in the encoder;
[0036] The feature maps in the first intermediate feature map set are fused to obtain a fused feature map;
[0037] The fused feature map is then subjected to feature separation to obtain a second intermediate feature map set, wherein the number of feature maps in the second intermediate feature map set is the same as the number of feature maps in the first feature map set;
[0038] Each feature map in the second intermediate feature map set is restored in size and channel dimension to obtain a second feature map set, wherein each feature map in the second feature map set serves as the output of the corresponding coding layer in the encoder other than the last coding layer, and has the same size and number of channels as the corresponding feature map in the first feature map set.
[0039] In one possible design, in another implementation of another aspect of the embodiments of this application, the processing module is used to upsample and convolve each feature map in the first feature map set according to the size of the target feature map to obtain the first intermediate feature map set, wherein the target feature map is the feature map with the largest size in the first feature map set.
[0040] In one possible design, in another implementation of another aspect of the embodiments of this application, the processing module is used to concatenate the intermediate feature maps in the first intermediate feature map set along the channel dimension to generate a joint feature map.
[0041] The joint feature map is segmented according to the target size to obtain a set of separated feature maps;
[0042] Multi-size global feature fusion is performed on each feature map in the set of separated feature maps to obtain the fused feature map.
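The align, fuse, separate, and restore sequence described in the preceding paragraphs can be sketched in shape-preserving form. The sketch makes several assumptions: feature maps are square nested lists indexed as [channel][row][column], alignment uses nearest-neighbor upsampling without the accompanying convolution, restoration uses strided sampling, and the fusion step is a pluggable `fuse` callable that defaults to the identity because the multi-size global fusion itself is not detailed here.

```python
def upsample_nearest(fmap, size):
    """Resize every channel of a square map to size x size (nearest neighbor)."""
    h, w = len(fmap[0]), len(fmap[0][0])
    return [[[ch[i * h // size][j * w // size] for j in range(size)]
             for i in range(size)] for ch in fmap]


def downsample_stride(fmap, size):
    """Restore a square map to size x size by strided sampling."""
    step = len(fmap[0]) // size
    return [[row[::step] for row in ch[::step]] for ch in fmap]


def align_fuse_separate(feature_maps, fuse=lambda joint: joint):
    """Align to the largest map, concatenate, fuse, split, and restore."""
    target = max(len(f[0]) for f in feature_maps)
    aligned = [upsample_nearest(f, target) for f in feature_maps]
    joint = [ch for f in aligned for ch in f]  # concatenate along channels
    fused = fuse(joint)
    restored, start = [], 0
    for f in feature_maps:
        channels = len(f)
        part = fused[start:start + channels]   # separate by channel count
        restored.append(downsample_stride(part, len(f[0])))
        start += channels
    return restored


f1 = [[[float(r * 4 + c) for c in range(4)] for r in range(4)]]  # 1 x 4 x 4
f2 = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]        # 2 x 2 x 2
restored = align_fuse_separate([f1, f2])
```

With the identity fusion, each restored map equals its input, which confirms that sizes and channel counts survive the round trip; a real fusion step would change the values but not the shapes.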
[0043] In one possible design, in another implementation of another aspect of the embodiments of this application, the acquisition module is used to acquire an initial encoder, an initial decoder and training samples, the training samples including training image data and real labels, the initial encoder having the same network structure as the encoder, and the initial decoder having the same network structure as the decoder.
[0044] The processing module is used to perform semantic segmentation on the training sample using the initial semantic segmentation model formed by the initial encoder and the initial decoder to obtain the predicted semantic segmentation result of the training sample; calculate the loss value based on the real label and the predicted semantic segmentation result, the loss value being the sum of the cross-entropy loss value and the overlap loss value; and train the initial encoder and the initial decoder based on the loss value to obtain the encoder and the decoder.
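A minimal sketch of the combined training loss described above. The passage does not define the overlap loss, so a soft Dice loss averaged over classes is assumed here; the function names and the flat per-pixel data layout are also assumptions for illustration.

```python
import math


def cross_entropy(probs, labels):
    """Mean negative log-likelihood; probs[i][c] is pixel i's probability of class c."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def dice_loss(probs, labels, num_classes, eps=1e-6):
    """Assumed overlap loss: 1 - soft Dice coefficient, averaged over classes."""
    total = 0.0
    for c in range(num_classes):
        intersection = sum(p[c] for p, y in zip(probs, labels) if y == c)
        denominator = sum(p[c] for p in probs) + sum(1 for y in labels if y == c)
        total += 1.0 - (2.0 * intersection + eps) / (denominator + eps)
    return total / num_classes


def segmentation_loss(probs, labels, num_classes):
    """Loss value = cross-entropy loss + overlap loss."""
    return cross_entropy(probs, labels) + dice_loss(probs, labels, num_classes)


perfect = segmentation_loss([[1.0, 0.0], [0.0, 1.0]], [0, 1], num_classes=2)
uncertain = segmentation_loss([[0.5, 0.5], [0.5, 0.5]], [0, 1], num_classes=2)
```

Cross-entropy scores each pixel independently, while an overlap term scores each class region as a whole, so the sum penalizes both per-pixel errors and poor region agreement.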
[0045] In one possible design, in another implementation of another aspect of the embodiments of this application, the first feature map, the second feature map, the third feature map and the fourth feature map each have a corresponding weight value.
[0046] This application also provides a computer device, including: a memory, a processor, and a bus system;
[0047] The memory is used to store programs;
[0048] The processor is used to execute the programs in the memory, including performing the methods described above according to the instructions in the program code;
[0049] The bus system is used to connect the memory and the processor to enable communication between them.
[0050] Another aspect of this application provides a computer-readable storage medium storing instructions that, when executed on a computer, cause the computer to perform the methods described above.
[0051] Another aspect of this application provides a computer program product or computer program including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform the methods provided in the above aspects.
[0052] As can be seen from the above technical solutions, the embodiments of this application have the following advantages: the encoder uses a multi-stage coding layer for feature extraction, and the encoder is constructed by a coding layer based on a convolutional neural network and a coding layer combining a convolutional neural network and a spatial hybrid network. In this way, the encoder can capture the long-distance dependency information of the image to be recognized through the spatial hybrid network, and at the same time, it can learn the fine expression of the local features of the image to be recognized based on the convolutional neural network. Therefore, it can realize the collaborative modeling of the global semantics and detailed information of the image to be recognized, thereby significantly improving the accuracy of image semantic segmentation.
Attached Figure Description
[0053] Figure 1 is a schematic diagram of the architecture of an application scenario for image semantic segmentation in an embodiment of this application;
[0054] Figure 2 is a schematic diagram of one embodiment of the image semantic segmentation method in this application;
[0055] Figure 3 is a schematic diagram of the architecture of an image semantic segmentation model in an embodiment of this application;
[0056] Figure 4 is a schematic diagram of the architecture of a shallow convolution module in an embodiment of this application;
[0057] Figure 5 is another schematic diagram of the architecture of the shallow convolution module in an embodiment of this application;
[0058] Figure 6 is a schematic diagram of the architecture of a deep convolutional module in an embodiment of this application;
[0059] Figure 7 is a schematic diagram of the architecture of the decoding layer in the decoder in an embodiment of this application;
[0060] Figure 8 is a schematic diagram of another architecture of the image semantic segmentation model in an embodiment of this application;
[0061] Figure 9 is a schematic diagram of one embodiment of the image semantic segmentation device in this application;
[0062] Figure 10 is a schematic diagram of one embodiment of the server in this application;
[0063] Figure 11 is a schematic diagram of one embodiment of the terminal in this application.
Detailed Implementation
[0064] This application provides an image semantic segmentation method, apparatus, device, medium, and program product to enhance the model's ability to capture long-distance dependencies and improve contextual understanding, thereby improving the accuracy and efficiency of image semantic segmentation.
[0065] The terms “first,” “second,” “third,” “fourth,” etc. (if present) in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented, for example, in orders other than those illustrated or described herein. Furthermore, the terms “comprising” and “corresponding to,” and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0066] In this application embodiment, the terms "module" or "unit" refer to a computer program or part of a computer program that has a predetermined function and works with other related parts to achieve a predetermined goal, and can be implemented wholly or partially using software, hardware (such as processing circuitry or memory), or a combination thereof. Similarly, a processor (or multiple processors or memory) can be used to implement one or more modules or units. Furthermore, each module or unit can be part of an overall module or unit that includes the functionality of that module or unit.
[0067] Artificial intelligence (AI) utilizes digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceiving the environment, acquiring knowledge, and using that knowledge to achieve optimal results. In other words, AI is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a way similar to human intelligence. AI studies the design principles and implementation methods of various intelligent machines, enabling them to have perception, reasoning, and decision-making capabilities. Semantic segmentation is an important research direction in computer vision. Its purpose is to assign semantic labels to each pixel in an image, going beyond object detection and instance segmentation to provide refined scene understanding. Semantic segmentation has a wide range of applications, including autonomous driving, medical image analysis, remote sensing image processing, and virtual reality. In this task, the model needs to accurately identify and segment different categories of regions, which is crucial for applications such as object detection and scene understanding. Most existing semantic segmentation methods are based on convolutional neural networks, but they still face challenges such as insufficient accuracy, long-range dependency problems, and difficulty in detail recovery when dealing with complex scenes and large-scale transformations. Therefore, how to achieve more efficient image semantic segmentation is a problem that urgently needs to be solved.
[0068] To address this technical problem, this application provides the following technical solution: acquiring an image to be recognized; sequentially extracting features from the image to be recognized using multiple coding layers of an encoder to obtain multiple feature maps, wherein the multiple feature maps are feature maps of different sizes output by the multiple coding layers in the encoder, and the multiple coding layers include coding layers based on convolutional neural networks and coding layers based on both convolutional neural networks and spatial hybrid networks; sequentially decoding the multiple feature maps using multiple decoding layers of a decoder to obtain the semantic segmentation result of the image to be recognized, wherein the multiple decoding layers correspond one-to-one with the multiple coding layers, the input of the first decoding layer of the multiple decoding layers is the output of the last coding layer of the multiple coding layers, and the inputs of each of the other decoding layers include the output of the previous decoding layer and the output of the coding layer corresponding to that decoding layer. The encoder employs multi-stage coding layers for feature extraction. It is constructed from coding layers based on convolutional neural networks and coding layers combining convolutional neural networks and spatial hybrid networks. This allows the encoder to capture long-range dependency information of the image to be recognized through the spatial hybrid network, while also learning a fine representation of the local features of the image to be recognized through the convolutional neural network. Therefore, it can achieve collaborative modeling of the global semantics and detailed information of the image to be recognized, thereby significantly improving the accuracy of image semantic segmentation.
[0069] For ease of understanding, some of the technical terms used in this application are explained below.
[0070] (1) Global Semantics: Also known as Global Features. It refers to the overall attributes of an image. Common global semantic features include color features, texture features, line features, and shape features, such as intensity histograms. As low-level visual features at the pixel level, global semantic features have good invariance, are simple to compute, and are intuitive to represent. However, their weakness is that they have high feature dimensionality and large computational cost.
[0071] (2) Local Semantics: Also known as Local Features. It refers to features extracted from local regions of an image, including edges, corners, lines, curves, and regions with special attributes. Compared with global semantic features, local semantic features are more abundant in an image and less correlated with one another, and under occlusion the disappearance of some features does not affect the detection and matching of the others.
[0072] The image semantic segmentation methods in the various optional embodiments of this application can be implemented based on artificial intelligence (AI) technology. AI technology is a comprehensive discipline involving a wide range of fields, encompassing both hardware and software technologies. Fundamental AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, pre-trained model technology, operating / interactive systems, and mechatronics. Among these, pre-trained models, also known as large models or foundational models, can be widely applied to downstream tasks across various AI fields after fine-tuning. AI software technologies mainly include computer vision, speech processing, natural language processing, and machine learning / deep learning.
[0073] This application also relates to cloud technology. Cloud technology refers to a managed technology that unifies hardware, software, network, and other system resources within a wide area network (WAN) or local area network (LAN) to achieve data computation, storage, processing, and sharing.
[0074] Cloud technology is a general term encompassing network technology, information technology, integration technology, management platform technology, and application technology applied to the cloud computing business model. It can form resource pools, providing flexible and convenient on-demand access. Cloud computing technology will become a crucial support. Backend services of technical network systems require substantial computing and storage resources, such as video websites, image websites, and many portal websites. With the rapid development and application of internet behavior, every item may possess its own identification mark in the future, requiring transmission to backend systems for logical processing. Data at different levels will be processed separately, and various industry data will require robust system support, which can only be achieved through cloud computing. The cloud technology involved in this application mainly refers to the transmission of images to be identified between terminal devices or servers via the "cloud," etc.
[0075] The technical solutions of this application and their effects are described below through several exemplary embodiments. It should be noted that the following embodiments can be referenced, borrowed from, or combined with each other. Identical terms, similar features, and similar implementation steps in different embodiments will not be repeated.
[0076] The following examples illustrate several typical application scenarios for the image semantic segmentation method in this application. It should be understood that these examples do not constitute a limitation on the scope of application of the method in this application.
[0077] (1) Autonomous Driving. While a vehicle is driving autonomously, its on-board camera captures images of the road environment. The image semantic segmentation method in this application can then be used to perform semantic segmentation on the collected road or environment image data, identifying traffic signs, vehicles, pedestrians, or other obstacles in the road environment and thereby providing a basis for driving decisions.
[0078] (2) Medical Assistance. Due to the nature of the work and career development of professionals such as doctors, the number of expert-level doctors is relatively small. In the medical field, the image semantic segmentation method described in this application can be used to identify organs and tissues in medical images, providing corresponding assistance and reference for doctors' diagnoses.
[0079] (3) Data annotation. Currently, annotating training data in the field of Artificial Intelligence (AI) is a very time-consuming and costly task. In AI, the image semantic segmentation method in this application can be used to identify the object categories in each image in the training set, thereby annotating the image data in the training set and significantly improving the efficiency of data annotation.
[0080] (4) Industrial Quality Inspection. Industrial production lines can produce a large number of products, but product defect detection is usually required before the products leave the factory. At this time, the image semantic segmentation method in this application can be used to perform semantic segmentation on the image data of each product, thereby identifying and classifying defects on the product surface, such as scratches, dents, stains, and discoloration. Through accurate segmentation and recognition of product images, the accuracy and efficiency of product quality control can be effectively improved, and human judgment errors and missed detection problems can be reduced.
[0081] (5) Virtual Reality. In the process of simulating real-world scenes in virtual reality, targeted recognition based on the input scene image is usually required. At this time, the image semantic segmentation method in this application can be used to segment the scene image into different semantic regions, such as walls, floors, and furniture. These segmented semantic regions can provide a basis for scene generation in virtual reality, making the virtual reality scene more realistic. It should be understood that semantic segmentation processing can also be performed on the input human image, which can help understand the content in the image and help construct a more realistic virtual human character.
[0082] (6) Remote Sensing Image Processing. Remote sensing images are generated based on electromagnetic waves reflected or emitted from the Earth's surface, acquired by sensors mounted on satellites, aircraft, or drones. Therefore, the foundation of remote sensing image processing is the electromagnetic spectrum, particularly the visible, infrared, and microwave bands. Different land features and materials exhibit different reflection and radiation characteristics in different bands, which can be used to identify and classify surface features. In practical applications, image semantic segmentation can be used to classify different land features in remote sensing images at the pixel level, such as buildings, roads, water bodies, and forests. This classification provides a foundation for the understanding and analysis of remote sensing data and is widely used in fields such as urban planning, environmental monitoring, and agricultural assessment.
[0083] Of course, in addition to the above-mentioned scenarios, the method provided in this application embodiment can also be applied to other scenarios that require image semantic segmentation. This application embodiment does not limit the specific application scenario.
[0084] Based on the aforementioned technical principles or related theoretical foundations, the image semantic segmentation method proposed in this application can be applied to any computer device capable of image semantic segmentation computing, and this computer device can be various types of terminals or servers. When the computer device in the embodiment is a server, the server can be an independent physical server, or a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud service content delivery networks (CDNs), and big data and artificial intelligence platforms.
[0085] It should be further noted that the terminals involved in the embodiments of this application include, but are not limited to, smartphones, computers, intelligent voice interaction devices, smart home appliances, vehicle terminals, and aircraft. The embodiments of this application can be applied to various scenarios, including but not limited to cloud technology, artificial intelligence, smart transportation, and assisted driving.
[0086] In some possible implementations, embodiments of the present invention provide a computer program for an image semantic segmentation method, which can be deployed and executed on a single computer device, or on multiple computer devices located in one location; or, on multiple computer devices distributed across multiple locations and interconnected via a communication network, wherein the multiple computer devices distributed across multiple locations and interconnected via a communication network can form a blockchain system.
[0087] Based on the fact that multiple computer devices can form a blockchain system, the image semantic segmentation method provided in this application can be executed and completed by a node in the blockchain; and the node used to execute the image semantic segmentation method can be any mobile terminal that can provide a front-end page, such as a smartphone, tablet computer, or personal computer (PC).
[0088] The embodiments of this application can be applied to scenarios involving image semantic segmentation, such as artificial intelligence, smart healthcare, and autonomous driving. An example follows. Figure 1 illustrates an exemplary architecture comprising a terminal 100, a server 200, a database 300, and a network 400. The terminal 100, the server 200, and the database 300 are connected via the network 400. The database 300 is used to store images to be recognized and various models. The numbers of terminals 100, servers 200, and databases 300 shown in Figure 1 are only examples; there may be multiple terminals 100, servers 200, and databases 300, and this application does not limit their numbers.
[0089] In this system, terminal 100 communicates with server 200 via a network. Database 300 can be integrated onto server 200, or it can be located in the cloud or on another server. During image semantic segmentation, interaction can occur between terminal 100 and server 200. For example, a user inputs an image to be recognized through terminal 100, which then sends the image to server 200. Server 200 performs semantic segmentation on the image to obtain the semantic segmentation result and determines subsequent services based on the result.
[0090] Terminal 100 can be a smartphone, tablet, laptop, desktop computer, smart speaker, smartwatch, smart voice interaction device, smart home appliance, or in-vehicle terminal, but is not limited to these. Server 200 can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (CDN), big data, and artificial intelligence platforms.
[0091] In short, a database can be viewed as an electronic filing cabinet—a place to store electronic files, where users can perform operations such as adding, querying, updating, and deleting data. A "database" is a collection of data stored together in a certain way, shared by multiple users, with minimal redundancy, and independent of application programs. A Database Management System (DBMS) is a computer software system designed to manage databases, generally possessing basic functions such as storage, retrieval, security, and backup. DBMSs can be classified according to the database model they support, such as relational or Extensible Markup Language (XML); or according to the type of computer they support, such as server clusters or mobile phones; or according to the query language used, such as Structured Query Language (SQL) or XQuery; or according to performance priorities, such as maximum scale or maximum operating speed; or other classification methods. Regardless of the classification method used, some DBMSs can cross categories, for example, supporting multiple query languages simultaneously.
[0092] Based on the above description, when performing semantic segmentation on an image to be recognized, it is necessary to train an image semantic segmentation model in advance. The architecture of this model can include an encoder and a decoder. The training process can be as follows: Obtain an initial encoder, an initial decoder, and training samples. The training samples include training image data and ground truth labels. The initial encoder and the initial decoder have the same network structure. Use the initial semantic segmentation model, formed by the initial encoder and the initial decoder, to perform semantic segmentation on the training samples to obtain the predicted semantic segmentation result. Calculate the loss value based on the ground truth labels and the predicted semantic segmentation result. This loss value is a weighted sum of the cross-entropy loss and the overlap loss (Dice Loss). Train the initial encoder and the initial decoder based on this loss value to obtain the encoder and the decoder. It should be understood that the cross-entropy loss can be expressed as follows:
[0093]
[0094] where $N$ is the total number of pixels in the training samples, $G_{gt}$ denotes the ground-truth labels, and $S_{px}$ denotes the predicted category labels.
[0095] The overlap loss function (Dice Loss) measures the degree of overlap between the segmented region and the ground truth label, and can be expressed as follows:
[0096]
[0097] where $\epsilon$ is a smoothing factor that prevents the denominator from being zero, $N$ is the total number of pixels in the training samples, $G_{gt}$ denotes the ground-truth labels, and $S_{px}$ denotes the predicted category labels.
[0098] The combined loss of this image semantic segmentation model is a weighted combination of the two:
[0099]
[0100] where α and β are weighting coefficients used to balance the contributions of the cross-entropy loss and the Dice Loss.
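The weighted combination described above can be sketched as follows. This is a minimal illustration that assumes the commonly used forms of the pixel-wise cross-entropy and the smoothed Dice loss; the function names, the `(N, K)` array layout (N pixels, K categories), and the default values of `alpha`, `beta`, and `eps` are assumptions made for the example, not values given in this application.

```python
import numpy as np

def cross_entropy_loss(probs, onehot, eps=1e-7):
    """Mean pixel-wise cross-entropy.

    probs:  (N, K) predicted class probabilities per pixel (assumed layout).
    onehot: (N, K) one-hot ground-truth labels per pixel.
    """
    return float(-np.mean(np.sum(onehot * np.log(probs + eps), axis=1)))

def dice_loss(probs, onehot, eps=1e-6):
    """1 - Dice coefficient; eps keeps the denominator away from zero."""
    intersection = np.sum(probs * onehot)
    denominator = np.sum(probs) + np.sum(onehot)
    return float(1.0 - (2.0 * intersection + eps) / (denominator + eps))

def combined_loss(probs, onehot, alpha=1.0, beta=1.0):
    """Weighted combination: alpha * cross-entropy + beta * Dice loss."""
    return alpha * cross_entropy_loss(probs, onehot) + beta * dice_loss(probs, onehot)
```

In such a sketch, a larger `beta` puts more weight on region overlap, while a larger `alpha` puts more weight on per-pixel classification.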
[0101] It should be understood that when the image semantic segmentation model also includes a cross-channel mixing (CCM) module, its training method also adopts the above scheme, which will not be elaborated here.
[0102] It is understood that in the specific embodiments of this application, data related to the image to be identified, training samples, and the model composed of encoders and decoders are involved. When the above embodiments of this application are applied to specific products or technologies, user permission or consent is required, and the collection, use and processing of related data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
[0103] Based on the above introduction, the image semantic segmentation method in this application will be described below, taking the server as the execution entity. Referring to Figure 2, one embodiment of the image semantic segmentation method in this application includes:
[0104] 201. Obtain the image to be recognized.
[0105] In this embodiment, the server can receive images to be identified transmitted by a third party, or it can acquire images to be identified transmitted by an image acquisition device connected to the server.
[0106] It should be understood that the image to be identified can be of different types depending on the application scenario. For example, in an intelligent driving scenario, the image to be identified could be a real-time captured road condition image. In an intelligent medical scenario, the image to be identified could be a patient's color ultrasound image, etc.
[0107] 202. Based on multiple coding layers of the encoder, sequentially extract features from the image to be recognized to obtain multiple feature maps. The multiple feature maps are feature maps of different sizes output by the multiple coding layers in the encoder. The multiple coding layers include coding layers based on convolutional neural networks and coding layers based on convolutional neural networks and spatial hybrid networks.
[0108] In this embodiment, the server inputs the image to be recognized into an image semantic segmentation model through an input layer. This model includes an encoder and a decoder. The encoder comprises multiple encoding layers, which may have the same network structure or different network structures. Specifically, some encoding layers are built on convolutional neural networks, while others are built on convolutional neural networks combined with spatial hybrid networks. The different encoding layers then output feature maps of different sizes, resulting in multiple feature maps.
[0109] A Convolutional Neural Network (CNN) is a deep neural network with convolutional structures, and is a deep learning architecture. Deep learning architectures refer to learning at multiple levels of abstraction using machine learning algorithms. As a deep learning architecture, a CNN is a feed-forward artificial neural network where each neuron responds to overlapping regions in the input image. The convolutional layers in a CNN can include multiple convolution operators, also called kernels. In this application, the kernel acts as a filter to extract specific information from the input image. A convolution operator is essentially a weight matrix, whose size is usually predefined. The weight values in these weight matrices need to be obtained through extensive training in practical applications. The weight matrices formed by these trained weight values can extract information from the input image, thereby helping the CNN to make correct predictions. When a convolutional neural network has multiple convolutional layers, the initial convolutional layers tend to extract more general features, which can also be called low-level features. As the depth of the convolutional neural network increases, the features extracted by later convolutional layers become more and more complex, such as high-level semantic features. Features with higher semantic levels are more suitable for the problem to be solved.
[0110] In one exemplary scheme, the architecture of the image semantic segmentation model can be as shown in Figure 3. The encoder of this image semantic segmentation model consists of four encoding layers, with the output of each previous encoding layer serving as the input to the next, and the output of each encoding layer also serving as the input to the corresponding decoding layer. The decoder of this image semantic segmentation model consists of four decoding layers, with the output of each previous decoding layer serving as the input to the next. In the encoder shown in Figure 3, the first and second coding layers have the same network architecture, and both are based on convolutional neural networks. The third and fourth coding layers have the same network architecture, and both are based on convolutional neural networks and spatial hybrid networks.
[0111] In one exemplary scheme, the first coding layer is a coding layer based on stacked inverted residual (IR) modules. The network structure of the IR module can be as shown in Figure 4: the IR module is constructed from a 1×1 convolutional layer, a depthwise convolutional layer, and a 1×1 convolutional layer. It should be understood that in this first encoding layer, the first IR module may have no residual connection, ensuring that the model performs sufficient initial extraction of the input features. Its processing flow can therefore be as shown in Figure 4. Given a batch of input feature maps $X \in \mathbb{R}^{B \times C_{in} \times H \times W}$, where $B$ indicates the batch size, $C_{in}$ indicates the number of channels, and $H \times W$ indicates the height and width of the feature map, the input $X$ is first projected onto the intermediate dimension $C_{mid}$ by a 1×1 convolutional layer, and layer normalization is then performed. The result of this layer normalization can be expressed as follows:
[0112] $I_1 = \mathrm{LayerNorm}(\mathrm{Conv}_{1\times1}(X))$,
[0113] where $C_{mid}$ should be greater than $C_{in}$, which allows for expansion in the channel dimension.
[0114] Then, a 5×5 depthwise convolutional layer (referred to as the DW-Conv layer) is used to perform local feature aggregation on the result of the layer normalization. It should be understood that the size of the convolutional kernel in this depthwise convolutional layer can be set according to the actual situation; no specific limitation is made here. The result of this local feature aggregation can be represented as follows:
[0115] $I_2 = \text{DW-Conv}(I_1) + I_1$,
[0116] where the spatial size $H' \times W'$ of $I_2$ is determined by the stride of the depthwise convolutional layer; the specific stride is not limited here. Finally, $I_2$ is mapped to the output dimension to obtain the output of the first IR module, which can be represented as follows:
[0117] $F = I_2 + \mathrm{Conv}_{1\times1}(I_2)$,
[0118] where $F$ is the output feature map of the first IR module.
[0119] To balance the accuracy and efficiency of semantic segmentation and avoid a sharp increase in computational load, the second through the last IR modules of the coding layer can be designed as inverted residual structures. In an exemplary scheme, the processing flow of the second IR module can be as shown in Figure 5. Given a batch of input feature maps $X \in \mathbb{R}^{B \times C_{in} \times H \times W}$, where $B$ indicates the batch size, $C_{in}$ indicates the number of channels, and $H \times W$ indicates the height and width of the feature map, the input $X$ is first projected onto the intermediate dimension $C_{mid}$ by a 1×1 convolutional layer, and layer normalization is then performed. The result of this layer normalization can be expressed as follows:
[0120] $I_1 = \mathrm{LayerNorm}(\mathrm{Conv}_{1\times1}(X))$,
[0121] where $C_{mid}$ should be greater than $C_{in}$, which allows for expansion in the channel dimension.
[0122] Then, a 5×5 depthwise convolutional layer (referred to as the DW-Conv layer) is used to perform local feature aggregation on the result of the layer normalization. It should be understood that the size of the convolutional kernel in this depthwise convolutional layer can be set according to the actual situation; no specific limitation is made here. The result of this local feature aggregation can be represented as follows:
[0123] $I_2 = \text{DW-Conv}(I_1) + I_1$,
[0124] where the spatial size $H' \times W'$ of $I_2$ is determined by the stride of the depthwise convolutional layer; the specific stride is not limited here.
[0125] Finally, $I_2$ is mapped to the output dimension and a global skip connection is added to obtain the output of the second IR module, which can be represented as follows:
[0126] $F = I_2 + \mathrm{Conv}_{1\times1}(I_2) + X$,
[0127] where $F$ is the output feature map of the second IR module.
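The two IR module variants described above (without and with the global skip connection) can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: stride 1 with zero padding so that $H' \times W' = H \times W$, layer normalization taken over the channel axis, equal input and output channel counts when the skip is used, and an output stage that keeps only the 1×1 projection plus the optional skip. The function and parameter names (`ir_module`, `w_expand`, `dw_kernels`, `w_project`, `use_skip`) are hypothetical.

```python
import numpy as np

def conv1x1(x, weight):
    """Pointwise convolution. x: (B, C_in, H, W); weight: (C_out, C_in)."""
    return np.einsum("oc,bchw->bohw", weight, x)

def layer_norm(x, eps=1e-5):
    """Normalize each pixel's feature vector over the channel axis (assumed)."""
    mean = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def depthwise_conv(x, kernels):
    """Stride-1 depthwise convolution with zero padding.

    x: (B, C, H, W); kernels: (C, k, k), one k x k kernel per channel.
    """
    k = kernels.shape[-1]
    pad = k // 2
    h, w = x.shape[2], x.shape[3]
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            out += kernels[None, :, i, j, None, None] * xp[:, :, i:i + h, j:j + w]
    return out

def ir_module(x, w_expand, dw_kernels, w_project, use_skip):
    """Expand -> depthwise aggregate -> project, with an optional global skip."""
    i1 = layer_norm(conv1x1(x, w_expand))        # I1 = LayerNorm(Conv1x1(X))
    i2 = depthwise_conv(i1, dw_kernels) + i1     # I2 = DW-Conv(I1) + I1
    f = conv1x1(i2, w_project)                   # map back to the output channels
    return f + x if use_skip else f              # global skip: later modules only
```

Calling `ir_module(..., use_skip=False)` corresponds to the first IR module of a coding layer, and `use_skip=True` to the second and later modules.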
[0128] The third encoding layer is a stacked encoding layer based on inverted residual (IR) modules and IR-RWKV modules. The RWKV (Receptance Weighted Key Value) module is a lightweight sequence modeling architecture that combines the time recursion of RNNs with the attention mechanism of Transformers. RWKV employs an attention mechanism with linear time complexity, which is suitable for processing long sequence data while reducing computational overhead. The processing flow of the first IR module in this third encoding layer is the same as that of the first IR module in the first encoding layer, and will not be repeated here. The processing flow of the IR-RWKV module is described below with reference to Figure 6:
[0129] Given a batch of input feature maps $X \in \mathbb{R}^{B \times C_{in} \times H \times W}$, where $B$ indicates the batch size, $C_{in}$ indicates the number of channels, and $H \times W$ indicates the height and width of the feature map, the input $X$ is first projected onto the intermediate dimension $C_{mid}$ by a 1×1 convolutional layer, and layer normalization is then performed. The result of the layer normalization can be expressed as follows:
[0130] $I_1 = \mathrm{LayerNorm}(\mathrm{Conv}_{1\times1}(X))$,
[0131] where $C_{mid}$ should be greater than $C_{in}$, which allows for expansion in the channel dimension.
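[0132] The feature map is divided into 1×1 blocks, and the spatial mixing (Spatial Mix) operation from Vision RWKV is then applied:
[0133] $I_2 = \mathrm{Unfolding}(I_1)$,
[0134] $I_3 = \mathrm{SpatialMix}(\mathrm{LayerNorm}(I_2)) + I_2$,
[0135] where $I_2$ and $I_3$ are one-dimensional feature sequences.
[0136] The one-dimensional feature sequence $I_3$ is then converted back into a two-dimensional feature map, denoted $I_4$.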
[0137] Then, a 5×5 depthwise convolutional layer (referred to as the DW-Conv layer) is used to perform local feature aggregation on the restored two-dimensional feature map $I_4$. It should be understood that the kernel size of this depthwise convolutional layer can be set according to the actual situation, and is not limited here. The result of this local feature aggregation can be represented as follows:
[0138] $I_5 = \text{DW-Conv}(I_4) + I_4$,
[0139] where the spatial size $H' \times W'$ of $I_5$ is determined by the stride of the depthwise convolutional layer; the specific stride is not limited here. Finally, $I_5$ is mapped to the output dimension and a global skip connection is added:
[0140] $F = I_5 + \mathrm{Conv}_{1\times1}(I_5) + X$,
[0141] where $F$ is the output feature map of the IR-RWKV module.
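The unfolding, spatial mixing, and folding steps of the IR-RWKV module can be sketched as follows. The sketch only shows the data layout: a feature map is flattened into a sequence of 1×1 blocks, mixed along the sequence with a residual connection, and reshaped back. The actual Spatial Mix operator of Vision RWKV is not implemented here; `spatial_mix` is a hypothetical callable supplied by the caller, and all function names are assumptions.

```python
import numpy as np

def unfold(x):
    """(B, C, H, W) -> (B, N, C) with N = H * W: each 1x1 block becomes a token."""
    b, c, h, w = x.shape
    return x.reshape(b, c, h * w).transpose(0, 2, 1)

def fold(seq, h, w):
    """(B, N, C) -> (B, C, H, W): inverse of unfold."""
    b, n, c = seq.shape
    return seq.transpose(0, 2, 1).reshape(b, c, h, w)

def token_layer_norm(seq, eps=1e-5):
    """Layer normalization over the feature axis of each token."""
    mean = seq.mean(axis=-1, keepdims=True)
    var = seq.var(axis=-1, keepdims=True)
    return (seq - mean) / np.sqrt(var + eps)

def spatial_mix_stage(i1, spatial_mix):
    """I2 = Unfolding(I1); I3 = SpatialMix(LayerNorm(I2)) + I2; I4 = fold(I3)."""
    b, c, h, w = i1.shape
    i2 = unfold(i1)
    i3 = spatial_mix(token_layer_norm(i2)) + i2
    return fold(i3, h, w)
```

Because the mixing operator is passed in as a callable, the same skeleton would also accommodate another linear attention model in place of RWKV.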
[0142] It should be understood that the RWKV module can also be replaced with other linear attention models, such as Mamba's bidirectional recursive structure, etc., without being limited here.
[0143] In order to achieve adaptive multi-scale information fusion and avoid the limitations of fixed weights, different weight values can be added to multiple feature maps in this embodiment.
[0144] In this embodiment, to fully integrate global and local information at different scales and improve the model's ability to recognize complex targets and detailed regions, thereby enhancing the robustness and generalization of semantic segmentation, a process of fusing, separating, and restoring multiple feature maps of the encoding layer along the channel dimension can be designed. Specifically, the server can perform feature map alignment on a first set of these multiple feature maps to obtain a first intermediate feature map set, which consists of all feature maps except for the feature map output by the last encoding layer in the encoder; feature fusion is then performed on each feature map in the first intermediate feature map set to obtain a fused feature map; and feature separation and restoration are then performed on the fused feature map to obtain a second feature map set, where each feature map in the second feature map set serves as the output of each encoding layer in the encoder except for the last encoding layer.
[0145] In one exemplary approach, a cross-channel mixing (CCM) module can be used to fuse, separate, and restore feature maps of different sizes output from the coding layers. In another exemplary approach, after the CCM module is included, the image semantic segmentation model can have the structure shown in Figure 7. In this structure, the feature maps output from the first to the third coding layers can be fused, separated, and restored based on the CCM module, and then used as inputs to the corresponding decoding layers.
[0146] The processing flow of this CCM module is described below based on the model architecture shown in Figure 7:
[0147] Assume that the input image has the original size $H_0 \times W_0$. Starting from the encoder's initial output, the four encoding stages output $F_1$, $F_2$, $F_3$, and $F_4$, respectively, where $F_i$ has spatial size $H_i \times W_i$. These outputs have different spatial dimensions and numbers of channels: $F_1$ has the largest size, and $F_4$ has the smallest. In an exemplary scheme, assuming the downsampling factor of each coding layer is set to 2, the size relationship can be expressed as follows: $H_0 = 2H_1 = 4H_2 = 8H_3 = 16H_4$, $W_0 = 2W_1 = 4W_2 = 8W_3 = 16W_4$. In the processing flow of this CCM module, only the feature maps output from the first three stages are processed. The specific flow can be as follows:
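As a quick arithmetic check of the size relationship above, the following sketch computes the spatial size of each stage output for a downsampling factor of 2. The function name `stage_sizes` and the 512×512 example input are assumptions for illustration; the sketch also assumes the input size is divisible by 16.

```python
def stage_sizes(h0, w0, num_stages=4, factor=2):
    """Return [(H1, W1), ..., (H4, W4)] when every stage downsamples by `factor`."""
    sizes = []
    h, w = h0, w0
    for _ in range(num_stages):
        h, w = h // factor, w // factor
        sizes.append((h, w))
    return sizes

# Example: a 512 x 512 input image.
sizes = stage_sizes(512, 512)
print(sizes)  # [(256, 256), (128, 128), (64, 64), (32, 32)]
```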
[0148] 1. Feature map alignment:
[0149] The smaller feature maps $F_2$ and $F_3$ are mapped to the same size and number of channels as the largest feature map $F_1$ through upsampling and convolution:
[0150]
[0151]
[0152] The aligned feature map can be represented as follows:
[0153] 2. Feature fusion:
[0154] The aligned feature maps are concatenated along the channel dimension to form a joint feature map, which can be represented as follows:
[0155]
[0156] Then the joint feature map is divided into smaller blocks:
[0157]
[0158] Next, multi-scale global feature fusion is performed along the channel dimension using the Channel Mix operation of VRWKV. VRWKV (Vision RWKV) is a design that extends the RWKV architecture to visual tasks, enabling it to capture long-range dependencies in visual data for tasks such as medical image segmentation and object detection. VRWKV achieves efficient long-sequence information modeling through a linear attention mechanism, while introducing spatial mixing and channel mixing modules to further enhance model performance. The mixing result can be represented as follows:
[0159] where $N = H_1 \times W_1$.
[0160] The mixed features are then reprojected back onto the two-dimensional feature map, which can be represented as follows:
[0161]
[0162] 3. Feature separation and restoration:
[0163] The joint feature map $F_{fold}$ is separated into three sub-maps, whose original sizes and numbers of channels are then restored; that is, the sizes and channel counts of $F_1$, $F_2$, and $F_3$ are restored. This process can be represented as follows:
[0164]
[0165] The three restored feature maps serve as the second feature map set and as inputs to the corresponding decoding layers.
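The three CCM steps above (alignment, fusion, separation and restoration) can be sketched end to end in NumPy. This is only a shape-level illustration under several assumptions that are not stated in this application: nearest-neighbour upsampling for alignment, average pooling followed by a 1×1 convolution for restoration, 1×1 blocks for the sequence layout, and a caller-supplied `channel_mix` callable standing in for the Channel Mix operation of VRWKV. All function and weight names are hypothetical.

```python
import numpy as np

def conv1x1(x, weight):
    """Pointwise convolution. x: (B, C_in, H, W); weight: (C_out, C_in)."""
    return np.einsum("oc,bchw->bohw", weight, x)

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of the two spatial axes."""
    return x.repeat(factor, axis=2).repeat(factor, axis=3)

def downsample_mean(x, factor):
    """Average pooling with a factor x factor window."""
    b, c, h, w = x.shape
    return x.reshape(b, c, h // factor, factor, w // factor, factor).mean(axis=(3, 5))

def ccm(f1, f2, f3, w_align2, w_align3, channel_mix, w_restore2, w_restore3):
    # 1. Alignment: bring F2 and F3 to the size and channel count of F1.
    a2 = conv1x1(upsample_nearest(f2, 2), w_align2)
    a3 = conv1x1(upsample_nearest(f3, 4), w_align3)
    # 2. Fusion: concatenate along channels, flatten to N = H1 * W1 tokens, mix.
    joint = np.concatenate([f1, a2, a3], axis=1)
    b, c, h, w = joint.shape
    seq = joint.reshape(b, c, h * w).transpose(0, 2, 1)
    seq = channel_mix(seq)
    folded = seq.transpose(0, 2, 1).reshape(b, c, h, w)
    # 3. Separation and restoration: split into three maps, restore sizes/channels.
    s1, s2, s3 = np.split(folded, 3, axis=1)
    r2 = conv1x1(downsample_mean(s2, 2), w_restore2)
    r3 = conv1x1(downsample_mean(s3, 4), w_restore3)
    return s1, r2, r3
```

The returned maps have the same shapes as `f1`, `f2`, and `f3`, so they can replace the original encoder outputs as skip inputs to the decoder.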
[0166] 203. Based on multiple decoding layers of the decoder, sequentially perform feature decoding on the multiple feature maps to obtain the semantic segmentation result of the image to be recognized. The multiple decoding layers correspond one-to-one with the multiple encoding layers, and the input of the first decoding layer of the multiple decoding layers is the output of the last encoding layer of the encoder. The inputs of the other decoding layers of the multiple decoding layers include the output of the previous decoding layer and the output of the encoding layer corresponding to the decoding layer.
[0167] In this embodiment, the server uses the output of the last encoding layer of the encoder as the input of the first decoding layer of the decoder, and then performs upsampling decoding on the feature map to obtain the corresponding decoding result. The input of the second decoding layer of the decoder comprises this decoding result together with the output of the encoding layer that precedes the last encoding layer. Decoding is performed based on these inputs to obtain the corresponding decoding result. This process continues until decoding is complete, resulting in the semantic segmentation result of the image to be recognized.
[0168] The structure of each decoding layer in this decoder can be as shown in Figure 8. It can include a 1×1 convolutional layer, a depthwise convolutional layer, a 1×1 convolutional layer, and an upsampling layer. Based on this structure, the specific processing flow can be as follows:
[0169] Assume the input is a feature map. After the first 1×1 convolution, the feature distribution is adjusted while maintaining the same number of input and output channels. The result can be represented as follows:
[0170]
[0171] Spatial feature extraction using DW-Conv yields the following results:
[0172]
[0173] After a second 1×1 convolution, the feature map is mapped to the target output channel number of the decoder, and the result can be represented as follows:
[0174]
[0175] After upsampling, the size of the feature map is increased. The upsampling ratio should be the same as the downsampling ratio of the encoder.
[0176]
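The four steps of a decoding layer described above can be sketched as follows. This is a minimal NumPy illustration that assumes a 3×3 depthwise kernel with stride 1 and nearest-neighbour upsampling by a factor of 2; neither choice is specified in this application, and the function and weight names are hypothetical.

```python
import numpy as np

def conv1x1(x, weight):
    """Pointwise convolution. x: (B, C_in, H, W); weight: (C_out, C_in)."""
    return np.einsum("oc,bchw->bohw", weight, x)

def depthwise_conv3x3(x, kernels):
    """Stride-1 depthwise 3x3 convolution with zero padding. kernels: (C, 3, 3)."""
    h, w = x.shape[2], x.shape[3]
    xp = np.pad(x, ((0, 0), (0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += kernels[None, :, i, j, None, None] * xp[:, :, i:i + h, j:j + w]
    return out

def upsample_nearest(x, factor=2):
    """Nearest-neighbour upsampling of the two spatial axes."""
    return x.repeat(factor, axis=2).repeat(factor, axis=3)

def decoding_layer(x, w_first, dw_kernels, w_second, factor=2):
    y = conv1x1(x, w_first)               # adjust features, channel count unchanged
    y = depthwise_conv3x3(y, dw_kernels)  # spatial feature extraction (DW-Conv)
    y = conv1x1(y, w_second)              # map to the decoder's target channel count
    return upsample_nearest(y, factor)    # enlarge to match the encoder's downsampling
```

With `factor=2`, each decoding layer undoes the downsampling of one coding layer, so four layers return the feature map to the original image size.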
[0177] Taking the model architecture shown in Figure 7 as an example, suppose the entire decoder consists of four stages whose outputs are $F_{out1}$, $F_{out2}$, $F_{out3}$, and $F_{out4}$, the fourth-stage output of the encoder is $F_4$, and the three feature maps obtained through the CCM module are $F_1'$, $F_2'$, and $F_3'$. The decoding process can then be represented as follows:
[0180]
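The data flow between the decoder stages and the skip inputs can be sketched as follows. The first decoding layer receives only the last encoder output; each later layer receives the previous decoding result fused with the corresponding feature map from the CCM module, taken from the deepest to the shallowest. How the two inputs are fused (for example, addition or concatenation) is not specified in this text, so `fuse` is a hypothetical caller-supplied function, as are the names `decode`, `layers`, and `skips`.

```python
def decode(f4, skips, layers, fuse):
    """Run the decoder stages in order.

    f4:     output of the last encoding layer.
    skips:  [F1', F2', F3'], the feature maps restored by the CCM module.
    layers: four callables, one per decoding layer (each upsamples its input).
    fuse:   callable combining a decoding result with a skip feature map.
    """
    out = layers[0](f4)                       # F_out1 from F4 alone
    for layer, skip in zip(layers[1:], reversed(skips)):
        out = layer(fuse(out, skip))          # F_out2..F_out4 use F3', F2', F1'
    return out
```

For example, with element-wise addition as `fuse` and layers that double the spatial size, the output of stage 1 has the size of $F_3'$, the output of stage 2 the size of $F_2'$, and so on, which matches the size relationship of the encoder.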
[0181] If the original size of the image to be recognized is $H_0 \times W_0$, and after the encoder the sizes of the feature maps satisfy $H_0 = 2H_1 = 4H_2 = 8H_3 = 16H_4$ and $W_0 = 2W_1 = 4W_2 = 8W_3 = 16W_4$, then after one output layer, the final output of the image semantic segmentation model is:
[0182] $S_{px} = \mathrm{softmax}(\mathrm{Conv}_{1\times1}(F_{out4}))$
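The output layer above can be sketched as a 1×1 convolution followed by a softmax over the category axis. Taking the arg-max of the probabilities to obtain a per-pixel label map is a common final step and is an assumption here, as are the function names and the `(K, C)` layout of the head weights.

```python
import numpy as np

def segmentation_probabilities(f_out4, w_head):
    """S_px = softmax(Conv1x1(F_out4)).

    f_out4: (B, C, H, W) output of the last decoding layer.
    w_head: (K, C) weights of the 1x1 convolution, K = number of categories.
    """
    logits = np.einsum("kc,bchw->bkhw", w_head, f_out4)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def label_map(probs):
    """Pick the most probable category for every pixel: (B, K, H, W) -> (B, H, W)."""
    return probs.argmax(axis=1)
```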
[0183] The image semantic segmentation apparatus of this application is described in detail below. Referring to Figure 9, which is a schematic diagram of one embodiment of the image semantic segmentation apparatus in this application, the image semantic segmentation apparatus 20 includes:
[0184] The acquisition module 201 is used to acquire the image to be recognized;
[0185] The processing module 202 is used to sequentially extract features from the image to be recognized based on multiple coding layers of the encoder to obtain multiple feature maps, where the multiple feature maps are feature maps of different sizes output by the multiple coding layers in the encoder, and the multiple coding layers include coding layers based on convolutional neural networks and coding layers based on convolutional neural networks and spatial hybrid networks; and to sequentially decode the multiple feature maps based on multiple decoding layers of the decoder to obtain the semantic segmentation result of the image to be recognized, where the multiple decoding layers correspond one-to-one with the multiple coding layers, the input of the first decoding layer is the output of the last coding layer of the encoder, and the inputs of the other decoding layers include the output of the previous decoding layer and the output of the coding layer corresponding to that decoding layer.
[0186] This application provides an image semantic segmentation apparatus. Using this apparatus, a multi-stage encoding layer is employed for feature extraction in the encoder. The encoder is constructed from a convolutional neural network-based encoding layer and an encoding layer combining convolutional neural networks and spatial hybrid networks. This allows the encoder to capture long-range dependency information of the image to be recognized through the spatial hybrid network, while simultaneously learning a fine representation of the local features of the image to be recognized based on the convolutional neural network. Therefore, it enables collaborative modeling of the global semantics and detailed information of the image to be recognized, significantly improving the accuracy of image semantic segmentation.
[0187] Optionally, on the basis of the embodiment corresponding to Figure 9 above, in another embodiment of the image semantic segmentation apparatus 20 provided in this application,
[0188] The encoding layer based on the convolutional neural network is generated by stacking multiple first-type convolutional modules, which include convolutional layers and depthwise convolutional layers.
[0189] The encoding layer based on convolutional neural networks and spatial hybrid networks is generated by stacking one first-type convolutional module and multiple second-type convolutional modules. The second-type convolutional module includes convolutional layers, a spatial hybrid network, and depthwise convolutional layers.
[0190] In this embodiment, an image semantic segmentation apparatus is provided. Using this apparatus, the encoder is constructed from a coding layer based on a convolutional neural network and a coding layer combining a convolutional neural network and a spatial hybrid network. This allows the encoder to capture long-range dependency information of the image to be recognized through the spatial hybrid network, while simultaneously learning a fine representation of the local features of the image to be recognized based on the convolutional neural network. Therefore, it enables collaborative modeling of the global semantics and detailed information of the image to be recognized, thereby significantly improving the accuracy of image semantic segmentation.
[0191] Optionally, on the basis of the embodiment corresponding to Figure 9 above, in another embodiment of the image semantic segmentation apparatus 20 provided in this application, the encoder includes four coding layers, wherein the first coding layer and the second coding layer are coding layers based on convolutional neural networks, and the third coding layer and the fourth coding layer are coding layers based on convolutional neural networks and spatial hybrid networks; the processing module 202 is used to extract features from the image to be recognized based on the first coding layer to obtain a first feature map;
[0192] Based on the second coding layer, features are extracted from the first feature map to obtain the second feature map;
[0193] Based on this third coding layer, features are extracted from the second feature map to obtain the third feature map;
[0194] Based on this fourth coding layer, features are extracted from the third feature map to obtain the fourth feature map;
[0195] The first feature map, the second feature map, the third feature map, and the fourth feature map are the multiple feature maps.
[0196] In this embodiment, an image semantic segmentation apparatus is provided. Using this apparatus, the encoder is constructed from a coding layer based on a convolutional neural network and a coding layer combining a convolutional neural network and a spatial hybrid network. This allows the encoder to capture long-range dependency information of the image to be recognized through the spatial hybrid network, while simultaneously learning a fine representation of the local features of the image to be recognized based on the convolutional neural network. Therefore, it enables collaborative modeling of the global semantics and detailed information of the image to be recognized, thereby significantly improving the accuracy of image semantic segmentation.
[0197] Optionally, on the basis of the embodiment corresponding to Figure 9 above, in another embodiment of the image semantic segmentation apparatus 20 provided in this application,
[0198] The processing module 202 is used to extract features from the image to be recognized using the first first-type convolutional module in the first coding layer to obtain a first intermediate feature map;
[0199] Features are extracted from the first intermediate feature map using the second first-type convolutional module in the first coding layer to obtain a second intermediate feature map;
[0200] The first intermediate feature map and the second intermediate feature map are added together to obtain a third intermediate feature map, which serves as the output of the second first-type convolutional module in the first coding layer;
[0201] Features are extracted from the third intermediate feature map using the third first-type convolutional module in the first coding layer to obtain a fourth intermediate feature map;
[0202] The third intermediate feature map and the fourth intermediate feature map are added together to obtain a fifth intermediate feature map, which serves as the output of the third first-type convolutional module in the first coding layer;
[0203] The feature extraction operations of the remaining convolutional modules in the first coding layer are performed in the same manner to obtain the first feature map.
[0204] This application provides an image semantic segmentation apparatus. Using this apparatus, the coding layer based on a convolutional neural network can learn a fine representation of the local features of the image to be recognized, thereby significantly improving the accuracy of image semantic segmentation.
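The residual stacking described in [0198] to [0203] can be sketched as follows. This is a minimal illustration only, assuming each convolutional module after the first can be treated as a shape-preserving function on a feature map. The names `run_coding_layer`, `add`, and the toy modules are hypothetical and do not come from this application; feature maps are flattened to plain lists for brevity.

```python
from typing import Callable, List, Sequence

FeatureMap = List[float]
Module = Callable[[FeatureMap], FeatureMap]


def add(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    """Element-wise addition of two feature maps of equal length."""
    if len(a) != len(b):
        raise ValueError("feature maps must have the same shape")
    return [x + y for x, y in zip(a, b)]


def run_coding_layer(x: FeatureMap, modules: Sequence[Module]) -> FeatureMap:
    """Apply the first module directly, then every later module with a skip connection."""
    if not modules:
        raise ValueError("a coding layer needs at least one module")
    out = modules[0](x)            # first intermediate feature map
    for module in modules[1:]:
        extracted = module(out)    # e.g. second / fourth intermediate feature map
        out = add(out, extracted)  # e.g. third / fifth intermediate feature map
    return out


# Toy stand-ins for the convolutional modules (assumption: shape-preserving).
double: Module = lambda fm: [2.0 * v for v in fm]
plus_one: Module = lambda fm: [v + 1.0 for v in fm]

first_feature_map = run_coding_layer([1.0, 2.0], [double, plus_one, double])
```

In this sketch the output of each module after the first is the sum of its input and its extracted features, which mirrors how the third and fifth intermediate feature maps are formed.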
[0205] Optionally, on the basis of the embodiments corresponding to Figure 9 above, in another embodiment of the image semantic segmentation apparatus 20 provided in this application,
[0206] The processing module 202 is used to perform layer normalization processing on the image to be recognized using the first convolutional layer in the first first-type convolutional module to obtain a first intermediate feature sequence;
[0207] Local feature aggregation is performed on the first intermediate feature sequence using the deep convolutional layer in the first first-type convolutional module to obtain a second intermediate feature sequence;
[0208] Feature mapping is performed on the second intermediate feature sequence using the second convolutional layer in the first first-type convolutional module to obtain the first intermediate feature map.
[0209] This application provides an image semantic segmentation apparatus. Using this apparatus, the coding layer based on a convolutional neural network can learn a fine representation of the local features of the image to be recognized, thereby significantly improving the accuracy of image semantic segmentation.
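The three steps in [0206] to [0208] can be illustrated with a small sketch. It rests on several assumptions that are not stated in this application: the feature sequence is a list of positions with a channel vector at each position, the "deep convolutional layer" is read as a per-channel (depthwise) convolution along the sequence, and the final feature mapping is a per-position channel projection. All function names and parameters are hypothetical.

```python
import math
from typing import List

Seq = List[List[float]]  # [position][channel]


def layer_norm(seq: Seq, eps: float = 1e-5) -> Seq:
    """Normalize each position across its channels to zero mean and unit variance."""
    out = []
    for token in seq:
        mean = sum(token) / len(token)
        var = sum((v - mean) ** 2 for v in token) / len(token)
        out.append([(v - mean) / math.sqrt(var + eps) for v in token])
    return out


def depthwise_conv1d(seq: Seq, kernels: List[List[float]]) -> Seq:
    """Per-channel convolution along positions with zero padding (odd kernel size)."""
    n, c = len(seq), len(seq[0])
    half = len(kernels[0]) // 2
    out = [[0.0] * c for _ in range(n)]
    for ch in range(c):
        for i in range(n):
            for k, w in enumerate(kernels[ch]):
                j = i + k - half
                if 0 <= j < n:
                    out[i][ch] += w * seq[j][ch]
    return out


def pointwise_map(seq: Seq, weight: List[List[float]]) -> Seq:
    """Mix channels at each position; weight has shape [out_channels][in_channels]."""
    return [[sum(w * v for w, v in zip(row, token)) for row in weight] for token in seq]


def first_type_module(seq: Seq, kernels: List[List[float]], weight: List[List[float]]) -> Seq:
    normed = layer_norm(seq)                        # first intermediate feature sequence
    aggregated = depthwise_conv1d(normed, kernels)  # second intermediate feature sequence
    return pointwise_map(aggregated, weight)        # first intermediate feature map
```

The depthwise step only combines neighbouring positions within one channel, which is why it is described as local aggregation; channels are mixed only in the final mapping step.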
[0210] Optionally, on the basis of the embodiments corresponding to Figure 9 above, in another embodiment of the image semantic segmentation apparatus 20 provided in this application, the processing module 202 is used to extract features from the second feature map using the first-type convolutional module in the third coding layer to obtain a sixth intermediate feature map;
[0211] Features are extracted from the sixth intermediate feature map using the first second-type convolutional module in the third coding layer to obtain a seventh intermediate feature map;
[0212] The seventh intermediate feature map and the sixth intermediate feature map are added together to obtain an eighth intermediate feature map, which is used as the output of the first second-type convolutional module in the third coding layer;
[0213] Features are extracted from the eighth intermediate feature map using the second second-type convolutional module in the third coding layer to obtain a ninth intermediate feature map;
[0214] The eighth intermediate feature map and the ninth intermediate feature map are added together to obtain a tenth intermediate feature map, which is used as the output of the second second-type convolutional module in the third coding layer;
[0215] The feature extraction operations of the remaining convolutional modules in the third coding layer are performed in the same manner to obtain the third feature map.
[0216] This application provides an image semantic segmentation apparatus. Using this apparatus, the coding layer based on convolutional neural networks and spatial hybrid networks can capture long-range dependency information of the image to be identified, thereby improving the accuracy of image semantic segmentation.
[0217] Optionally, on the basis of the embodiments corresponding to Figure 9 above, in another embodiment of the image semantic segmentation apparatus 20 provided in this application, the processing module 202 is used to perform layer normalization processing on the image to be identified using the first convolutional layer in the first second-type convolutional module to obtain a third intermediate feature sequence;
[0218] Spatial mixing is performed on the third intermediate feature sequence using the spatial hybrid network in the first second-type convolutional module to obtain a fourth intermediate feature sequence;
[0219] Local feature aggregation is performed on the fourth intermediate feature sequence using the deep convolutional layer in the first second-type convolutional module to obtain a fifth intermediate feature sequence;
[0220] Feature mapping is performed on the fifth intermediate feature sequence using the second convolutional layer in the first second-type convolutional module to obtain the seventh intermediate feature map.
[0221] This application provides an image semantic segmentation apparatus. Using this apparatus, a coding layer based on a convolutional neural network and a spatial hybrid network can capture long-range dependency information of the image to be identified, thereby significantly improving the accuracy of image semantic segmentation.
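The order of operations in [0217] to [0220] can be sketched as a four-step pipeline. The internals of the spatial hybrid network are not specified in this passage, so `spatial_mix` below is only a hypothetical stand-in that lets every output position take a weighted sum over all positions; the other three steps are passed in as callables so that the sketch stays short. All names are assumptions for illustration.

```python
from typing import Callable, List

Seq = List[List[float]]  # [position][channel]
Step = Callable[[Seq], Seq]


def spatial_mix(seq: Seq, mixing: List[List[float]]) -> Seq:
    """Hypothetical stand-in: each output position is a weighted sum over all positions."""
    channels = len(seq[0])
    return [
        [sum(w * seq[j][ch] for j, w in enumerate(row)) for ch in range(channels)]
        for row in mixing
    ]


def second_type_module(seq: Seq, norm: Step, mix: Step, aggregate: Step, project: Step) -> Seq:
    third = norm(seq)          # third intermediate feature sequence
    fourth = mix(third)        # fourth intermediate feature sequence (long-range mixing)
    fifth = aggregate(fourth)  # fifth intermediate feature sequence (local aggregation)
    return project(fifth)      # seventh intermediate feature map


identity: Step = lambda s: s
uniform = [[1 / 3] * 3 for _ in range(3)]  # every position weights all three positions equally
tokens = [[3.0, 0.0], [0.0, 3.0], [3.0, 3.0]]
mixed = second_type_module(
    tokens, identity, lambda s: spatial_mix(s, uniform), identity, identity
)
```

With the uniform mixing matrix every position ends up with the average of all positions, which shows how a mixing step can propagate information across the whole sequence before the local aggregation step runs.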
[0222] Optionally, on the basis of the embodiments corresponding to Figure 9 above, in another embodiment of the image semantic segmentation apparatus 20 provided in this application,
[0223] The processing module 202 is used to perform feature map alignment on the first feature map set of the multiple feature maps to obtain a first intermediate feature map set, wherein the first feature map set is all feature maps except for the feature map output by the last coding layer in the encoder;
[0224] The feature maps in the first intermediate feature map set are fused to obtain a fused feature map;
[0225] The fused feature map is then subjected to feature separation to obtain a second intermediate feature map set, wherein the number of feature maps in the second intermediate feature map set is the same as the number of feature maps in the first feature map set;
[0226] Each feature map in the second intermediate feature map set is restored in terms of size and channel dimension to obtain the second feature map set, wherein each feature map in the second feature map set is the output of the coding layer in the encoder except for the last coding layer, and each feature map in the second feature map set has the same size and number of channels as each feature map in the first feature map set.
[0227] This application provides an image semantic segmentation apparatus. By employing this apparatus, global and local information at different scales is fully integrated, enhancing the model's ability to recognize complex targets and detailed regions, thereby improving the robustness and generalization of the segmentation.
[0228] Optionally, on the basis of the embodiments corresponding to Figure 9 above, in another embodiment of the image semantic segmentation apparatus 20 provided in this application,
[0229] The processing module 202 is used to upsample and convolve each feature map in the first feature map set according to the size of the target feature map to obtain the first intermediate feature map set, wherein the target feature map is the feature map with the largest size in the first feature map set.
[0230] This application provides an image semantic segmentation apparatus. Using this apparatus, multiple feature maps of different sizes are upsampled to obtain feature maps of the same size. This allows for the learning of more generalized feature representations, reducing the risk of overfitting. Simultaneously, aligned features reduce unnecessary feature transformations, thereby reducing computational resource consumption.
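The size alignment in [0229] can be sketched as follows, assuming single-channel feature maps stored as nested lists and nearest-neighbour upsampling. The convolution that the passage applies together with the upsampling is omitted here, and the function names are hypothetical.

```python
from typing import List

Map2D = List[List[float]]  # [row][col], single channel


def upsample_nearest(fm: Map2D, out_h: int, out_w: int) -> Map2D:
    """Resize a feature map by copying the nearest source pixel."""
    in_h, in_w = len(fm), len(fm[0])
    return [
        [fm[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]


def align_to_largest(maps: List[Map2D]) -> List[Map2D]:
    """Bring every map to the size of the largest map in the set (the target feature map)."""
    target = max(maps, key=lambda m: len(m) * len(m[0]))
    h, w = len(target), len(target[0])
    return [upsample_nearest(m, h, w) for m in maps]


small = [[1.0, 2.0], [3.0, 4.0]]
large = [[0.0] * 4 for _ in range(4)]
aligned = align_to_largest([small, large])
```

After alignment every map in the set has the same height and width, so the maps can be combined position by position in the later fusion step.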
[0231] Optionally, on the basis of the embodiments corresponding to Figure 9 above, in another embodiment of the image semantic segmentation apparatus 20 provided in this application,
[0232] The processing module 202 is used to concatenate the intermediate feature maps in the first intermediate feature map set according to the channel dimension to generate a joint feature map;
[0233] The joint feature map is segmented according to the target size to obtain a set of separate feature maps;
[0234] Multi-size global feature fusion is performed on each feature map in the set of separated feature maps to obtain the fused feature map.
[0235] This application provides an image semantic segmentation apparatus. Using this apparatus, multiple feature maps are stitched together according to channel dimensions, then separated and fused again. This allows for better capture of the feature data of the image to be identified, effectively merging information from different feature maps. Furthermore, the separation and fusion process reduces information loss, thereby improving the accuracy of image semantic segmentation.
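The concatenation and separation steps in [0232] and [0233] can be sketched as below. The sketch assumes that "target size" refers to a number of channels per separated feature map, which is an interpretation and not something this passage states. The multi-size global feature fusion of [0234] is not sketched because its internals are not described here. Function names are hypothetical.

```python
from typing import List

Channel = List[List[float]]  # [row][col]
FeatureMap = List[Channel]   # [channel][row][col]


def concat_channels(maps: List[FeatureMap]) -> FeatureMap:
    """Concatenate aligned feature maps along the channel dimension."""
    sizes = {(len(m[0]), len(m[0][0])) for m in maps}
    if len(sizes) != 1:
        raise ValueError("feature maps must be aligned to the same height and width")
    return [channel for m in maps for channel in m]


def split_channels(joint: FeatureMap, group_size: int) -> List[FeatureMap]:
    """Split a joint feature map into groups of `group_size` channels (assumed meaning)."""
    if group_size <= 0 or len(joint) % group_size != 0:
        raise ValueError("channel count must be a positive multiple of group_size")
    return [joint[i:i + group_size] for i in range(0, len(joint), group_size)]
```

Because the maps are aligned before concatenation, the joint feature map keeps one spatial size and only its channel count grows.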
[0236] Optionally, on the basis of the embodiments corresponding to Figure 9 above, in another embodiment of the image semantic segmentation apparatus 20 provided in this application,
[0237] The acquisition module 201 is used to acquire an initial encoder, an initial decoder, and training samples. The training samples include training image data and real labels. The initial encoder has the same network structure as the encoder, and the initial decoder has the same network structure as the decoder.
[0238] The processing module 202 is used to perform semantic segmentation processing on the training sample using the initial semantic segmentation model to obtain the predicted semantic segmentation result of the training sample; calculate the loss value based on the real label and the predicted semantic segmentation result, the loss value being the sum of the cross-entropy loss value and the overlap loss value; and train the initial encoder and the initial decoder based on the loss value to obtain the encoder and the decoder.
[0239] This application provides an image semantic segmentation apparatus. With this apparatus, training uses two loss functions. The cross-entropy loss function mainly measures the difference between the predicted probability distribution and the true probability distribution, and is suitable for classification problems. The overlap loss function measures the degree of overlap between the predicted bounding box and the true bounding box, and is suitable for tasks requiring localization, such as object detection. Using both loss functions together can optimize the model's classification and localization performance.
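The combined loss in [0238] can be sketched as follows. The exact form of the overlap loss is not given in this passage, so the sketch assumes a soft Dice loss over per-pixel class probabilities as one common overlap-style loss; this choice, the function names, and the smoothing constants are assumptions for illustration only.

```python
import math
from typing import List


def cross_entropy(probs: List[List[float]], labels: List[int], eps: float = 1e-12) -> float:
    """Mean negative log-probability of the true class over all pixels."""
    return -sum(math.log(p[y] + eps) for p, y in zip(probs, labels)) / len(labels)


def dice_loss(probs: List[List[float]], labels: List[int], num_classes: int,
              eps: float = 1e-6) -> float:
    """Soft Dice loss averaged over classes (assumed form of the overlap loss)."""
    total = 0.0
    for c in range(num_classes):
        inter = sum(p[c] for p, y in zip(probs, labels) if y == c)
        pred = sum(p[c] for p in probs)
        true = sum(1 for y in labels if y == c)
        total += 1.0 - (2.0 * inter + eps) / (pred + true + eps)
    return total / num_classes


def segmentation_loss(probs: List[List[float]], labels: List[int], num_classes: int) -> float:
    """Loss value as the sum of the cross-entropy term and the overlap term."""
    return cross_entropy(probs, labels) + dice_loss(probs, labels, num_classes)


labels = [0, 1]                                  # true class of each pixel
perfect = [[1.0, 0.0], [0.0, 1.0]]               # confident and correct prediction
uncertain = [[0.5, 0.5], [0.5, 0.5]]             # maximally uncertain prediction
loss_perfect = segmentation_loss(perfect, labels, 2)
loss_uncertain = segmentation_loss(uncertain, labels, 2)
```

In this sketch a correct, confident prediction drives both terms toward zero, while an uncertain prediction is penalized by both the per-pixel term and the region-overlap term.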
[0240] The image semantic segmentation apparatus provided in this application can be used on a server. Please refer to Figure 10, which is a schematic diagram of a server structure provided in an embodiment of this application. The server 300 can vary significantly due to different configurations or performance. It may include one or more central processing units (CPUs) 322 (e.g., one or more processors) and memory 332, and one or more storage media 330 (e.g., one or more mass storage devices) for storing application programs 342 or data 344. The memory 332 and storage media 330 can be temporary or persistent storage. The program stored in the storage media 330 may include one or more modules (not shown in the diagram), each module including a series of instruction operations on the server. Furthermore, the CPU 322 may be configured to communicate with the storage media 330 and execute the series of instruction operations stored in the storage media 330 on the server 300.
[0241] Server 300 may also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input/output interfaces 358, and/or one or more operating systems 341, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
[0242] The steps performed by the server in the above embodiments can be based on the server structure shown in Figure 10.
[0243] The image semantic segmentation apparatus provided in this application can be used in terminal devices. Please refer to Figure 11. For ease of explanation, only the parts relevant to the embodiments of this application are shown. For specific technical details not disclosed, please refer to the method section of the embodiments of this application. In the embodiments of this application, a smartphone is used as an example for illustration:
[0244] Figure 11 is a block diagram illustrating part of the structure of a smartphone related to the terminal device provided in the embodiments of this application. Referring to Figure 11, the smartphone includes components such as a radio frequency (RF) circuit 410, a memory 420, an input unit 430, a display unit 440, a sensor 450, an audio circuit 460, a wireless fidelity (WiFi) module 470, a processor 480, and a power supply 490. Those skilled in the art will understand that the smartphone structure shown in Figure 11 does not constitute a limitation on smartphones, and a smartphone may include more or fewer components than shown, combine certain components, or have a different component arrangement.
[0245] The components of the smartphone are described in detail below with reference to Figure 11:
[0246] RF circuit 410 can be used for receiving and transmitting signals during information transmission or calls. Specifically, it receives downlink information from the base station and processes it with processor 480; additionally, it transmits uplink data to the base station. Typically, RF circuit 410 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low-noise amplifier (LNA), a duplexer, etc. Furthermore, RF circuit 410 can also communicate wirelessly with networks and other devices. The aforementioned wireless communication can use any communication standard or protocol, including but not limited to Global System for Mobile Communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), etc.
[0247] The memory 420 can be used to store software programs and modules. The processor 480 executes various functions and data processing of the smartphone by running the software programs and modules stored in the memory 420. The memory 420 may mainly include a program storage area and a data storage area. The program storage area may store the operating system, applications required for at least one function (such as sound playback function, image playback function, etc.), etc.; the data storage area may store data created according to the use of the smartphone (such as audio data, phonebook, etc.). In addition, the memory 420 may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, or other volatile solid-state storage device.
[0248] The input unit 430 can be used to receive input numerical or character information, and to generate key signal inputs related to user settings and function control of the smartphone. Specifically, the input unit 430 may include a touch panel 431 and other input devices 432. The touch panel 431, also known as a touch screen, can collect touch operations performed by the user on or near it (such as operations performed by the user using a finger, stylus, or any suitable object or accessory on or near the touch panel 431), and drive the corresponding connected devices according to a pre-set program. Optionally, the touch panel 431 may include two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch position and the signal generated by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts it into touch point coordinates, and sends it to the processor 480, and can also receive and execute commands sent by the processor 480. In addition, the touch panel 431 can be implemented using various types such as resistive, capacitive, infrared, and surface acoustic wave. In addition to the touch panel 431, the input unit 430 may also include other input devices 432. Specifically, other input devices 432 may include, but are not limited to, one or more of the following: physical keyboard, function keys (such as volume control buttons, power buttons, etc.), trackball, mouse, joystick, etc.
[0249] Display unit 440 can be used to display information input by the user or information provided to the user, as well as various menus of the smartphone. Display unit 440 may include display panel 441, optionally configured as a liquid crystal display (LCD), organic light-emitting diode (OLED), or similar form. Further, touch panel 431 may cover display panel 441. When touch panel 431 detects a touch operation on or near it, it transmits the information to processor 480 to determine the type of touch event. Subsequently, processor 480 provides corresponding visual output on display panel 441 based on the type of touch event. Although in Figure 11 the touch panel 431 and the display panel 441 are shown as two separate components that realize the input and output functions of the smartphone, in some embodiments the touch panel 431 and the display panel 441 can be integrated to realize the input and output functions of the smartphone.
[0250] The smartphone may also include at least one sensor 450, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor can adjust the brightness of the display panel 441 according to the ambient light level, and the proximity sensor can turn off the display panel 441 and / or the backlight when the smartphone is moved to the ear. As a type of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in various directions (generally three axes), and can detect the magnitude and direction of gravity when stationary. It can be used for applications that recognize the smartphone's posture (such as landscape / portrait switching, related games, magnetometer posture calibration), vibration recognition-related functions (such as pedometer, tapping), etc. Other sensors that may be configured in the smartphone, such as gyroscopes, barometers, hygrometers, thermometers, and infrared sensors, will not be described in detail here.
[0251] Audio circuit 460, speaker 461, and microphone 462 provide an audio interface between the user and the smartphone. Audio circuit 460 converts received audio data into electrical signals and transmits them to speaker 461, where speaker 461 converts them into sound signals for output. On the other hand, microphone 462 converts collected sound signals into electrical signals, which are received by audio circuit 460, converted into audio data, and then processed by processor 480 before being transmitted via RF circuit 410 to, for example, another smartphone, or the audio data can be output to memory 420 for further processing.
[0252] WiFi is a short-range wireless transmission technology. Through the WiFi module 470, the smartphone can help users send and receive emails, browse web pages, and access streaming media, providing users with wireless broadband internet access. Although Figure 11 shows the WiFi module 470, it is understood that it is not an essential component of the smartphone and can be omitted as needed without changing the nature of the invention.
[0253] The processor 480 is the control center of the smartphone, connecting various parts of the smartphone through various interfaces and lines. It performs various functions and processes data by running or executing software programs and / or modules stored in the memory 420, and by calling data stored in the memory 420, thereby providing overall monitoring of the smartphone. Optionally, the processor 480 may include one or more processing units; optionally, the processor 480 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interface, and applications, and the modem processor mainly handles wireless communication. It is understood that the aforementioned modem processor may also not be integrated into the processor 480.
[0254] The smartphone also includes a power supply 490 (such as a battery) that supplies power to various components. Optionally, the power supply can be logically connected to the processor 480 through a power management system, thereby enabling functions such as charging, discharging, and power consumption management through the power management system.
[0255] Although not shown, smartphones may also include a camera, Bluetooth module, etc., which will not be described in detail here.
[0256] The steps performed by the terminal device in the above embodiments can be based on the terminal device structure shown in Figure 11.
[0257] This application also provides a computer-readable storage medium storing a computer program that, when run on a computer, causes the computer to perform the methods described in the foregoing embodiments.
[0258] This application also provides a computer program product including a program, which, when run on a computer, causes the computer to perform the methods described in the foregoing embodiments.
[0259] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be found in the corresponding processes of the foregoing method embodiments, and will not be repeated here.
[0260] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection between apparatuses or units through some interfaces, and may be electrical, mechanical, or other forms.
[0261] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0262] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0263] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0264] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.
Claims
1. An image semantic segmentation method, characterized in that it includes: acquiring an image to be recognized; sequentially performing feature extraction on the image to be recognized based on multiple coding layers of an encoder to obtain multiple feature maps, wherein the multiple feature maps are feature maps of different sizes output by the multiple coding layers in the encoder, and the multiple coding layers include coding layers based on a convolutional neural network and coding layers based on a convolutional neural network and a spatial hybrid network; and sequentially performing feature decoding on the multiple feature maps based on multiple decoding layers of a decoder to obtain a semantic segmentation result of the image to be recognized, wherein the multiple decoding layers correspond one-to-one with the multiple coding layers, the input of the first decoding layer of the multiple decoding layers is the output of the last coding layer of the multiple coding layers, and the inputs of each of the other decoding layers include the output of the previous decoding layer and the output of the coding layer corresponding to that decoding layer.
2. The method according to claim 1, characterized in that the coding layer based on a convolutional neural network is formed by stacking multiple first-type convolutional modules, which include convolutional layers and deep convolutional layers; and the coding layer based on a convolutional neural network and a spatial hybrid network is formed by stacking one first-type convolutional module and multiple second-type convolutional modules, wherein the second-type convolutional module includes convolutional layers, a spatial hybrid network, and deep convolutional layers.
3. The method according to claim 2, characterized in that the encoder includes four coding layers, wherein the first and second coding layers are coding layers based on a convolutional neural network, and the third and fourth coding layers are coding layers based on a convolutional neural network and a spatial hybrid network; and the step of sequentially performing feature extraction on the image to be recognized based on the multiple coding layers of the encoder to obtain multiple feature maps includes: performing feature extraction on the image to be recognized based on the first coding layer to obtain a first feature map; performing feature extraction on the first feature map based on the second coding layer to obtain a second feature map; performing feature extraction on the second feature map based on the third coding layer to obtain a third feature map; and performing feature extraction on the third feature map based on the fourth coding layer to obtain a fourth feature map; wherein the first feature map, the second feature map, the third feature map, and the fourth feature map are the multiple feature maps.
4. The method according to claim 3, characterized in that the step of performing feature extraction on the image to be recognized based on the first coding layer to obtain a first feature map includes: extracting features from the image to be recognized using the first first-type convolutional module in the first coding layer to obtain a first intermediate feature map; extracting features from the first intermediate feature map using the second first-type convolutional module in the first coding layer to obtain a second intermediate feature map; adding the first intermediate feature map and the second intermediate feature map to obtain a third intermediate feature map, wherein the third intermediate feature map is used as the output of the second first-type convolutional module in the first coding layer; extracting features from the third intermediate feature map using the third first-type convolutional module in the first coding layer to obtain a fourth intermediate feature map; adding the third intermediate feature map and the fourth intermediate feature map to obtain a fifth intermediate feature map, wherein the fifth intermediate feature map is used as the output of the third first-type convolutional module in the first coding layer; and performing the feature extraction operations of the remaining convolutional modules in the first coding layer in the same manner to obtain the first feature map.
5. The method according to claim 4, characterized in that the step of extracting features from the image to be recognized using the first first-type convolutional module in the first coding layer to obtain a first intermediate feature map includes: performing layer normalization processing on the image to be recognized using the first convolutional layer in the first first-type convolutional module to obtain a first intermediate feature sequence; performing local feature aggregation on the first intermediate feature sequence using the deep convolutional layer in the first first-type convolutional module to obtain a second intermediate feature sequence; and performing feature mapping on the second intermediate feature sequence using the second convolutional layer in the first first-type convolutional module to obtain the first intermediate feature map.
6. The method according to claim 3, characterized in that the step of performing feature extraction on the second feature map based on the third coding layer to obtain the third feature map includes: extracting features from the second feature map using the first-type convolutional module in the third coding layer to obtain a sixth intermediate feature map; extracting features from the sixth intermediate feature map using the first second-type convolutional module in the third coding layer to obtain a seventh intermediate feature map; adding the seventh intermediate feature map and the sixth intermediate feature map to obtain an eighth intermediate feature map, wherein the eighth intermediate feature map is used as the output of the first second-type convolutional module in the third coding layer; extracting features from the eighth intermediate feature map using the second second-type convolutional module in the third coding layer to obtain a ninth intermediate feature map; adding the eighth intermediate feature map and the ninth intermediate feature map to obtain a tenth intermediate feature map, wherein the tenth intermediate feature map is used as the output of the second second-type convolutional module in the third coding layer; and performing the feature extraction operations of the remaining convolutional modules in the third coding layer in the same manner to obtain the third feature map.
7. The method according to claim 6, characterized in that the step of extracting features from the sixth intermediate feature map using the first second-type convolutional module in the third coding layer to obtain the seventh intermediate feature map includes: performing layer normalization processing on the image to be identified using the first convolutional layer in the first second-type convolutional module to obtain a third intermediate feature sequence; performing spatial mixing on the third intermediate feature sequence using the spatial hybrid network in the first second-type convolutional module to obtain a fourth intermediate feature sequence; performing local feature aggregation on the fourth intermediate feature sequence using the deep convolutional layer in the first second-type convolutional module to obtain a fifth intermediate feature sequence; and performing feature mapping on the fifth intermediate feature sequence using the second convolutional layer in the first second-type convolutional module to obtain the seventh intermediate feature map.
8. The method according to any one of claims 1 to 7, characterized in that the method further includes: aligning a first feature map set among the plurality of feature maps to obtain a first intermediate feature map set, wherein the first feature map set consists of all feature maps except the feature map output by the last coding layer in the encoder; fusing the feature maps in the first intermediate feature map set to obtain a fused feature map; performing feature separation on the fused feature map to obtain a second intermediate feature map set, wherein the number of feature maps in the second intermediate feature map set is the same as the number of feature maps in the first feature map set; and restoring the size and channel dimensions of each feature map in the second intermediate feature map set to obtain a second feature map set, wherein each feature map in the second feature map set is used as the output of a coding layer in the encoder other than the last coding layer, and each feature map in the second feature map set has the same size and number of channels as the corresponding feature map in the first feature map set.
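The align, fuse, separate, and restore steps of claim 8 are easiest to follow by tracking tensor shapes. The sketch below is illustrative only: it assumes square feature maps whose sizes are integer multiples of one another, uses random 1x1 projections in place of learned convolutions, and uses plain channel concatenation as a placeholder for the fusion step. All function names and the `common` channel count are hypothetical.

```python
import numpy as np

def upsample_nearest(x, size):
    # x: (C, H, W) with H == W; repeat pixels up to the target size.
    factor = size // x.shape[1]
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def downsample_mean(x, size):
    # Average-pool a square (C, H, W) map down to (C, size, size).
    c, h, _ = x.shape
    f = h // size
    return x.reshape(c, size, f, size, f).mean(axis=(2, 4))

def pointwise(x, weight):
    # 1x1 convolution; weight has shape (C_out, C_in).
    return np.einsum("oc,chw->ohw", weight, x)

def cross_scale_exchange(maps, common=4, seed=0):
    rng = np.random.default_rng(seed)          # random weights stand in for learned ones
    target = max(m.shape[1] for m in maps)     # size of the largest feature map
    aligned = [pointwise(upsample_nearest(m, target),
                         rng.normal(size=(common, m.shape[0]))) for m in maps]
    fused = np.concatenate(aligned, axis=0)    # placeholder for the fusion step
    parts = np.split(fused, len(maps), axis=0) # feature separation
    return [pointwise(downsample_mean(p, m.shape[1]),
                      rng.normal(size=(m.shape[0], common)))
            for p, m in zip(parts, maps)]      # restore size and channels

maps = [np.ones((8, 16, 16)), np.ones((16, 8, 8)), np.ones((32, 4, 4))]
restored = cross_scale_exchange(maps)
```

Each restored map has the same shape as its input, so it can replace that coding layer's output when it is passed on to the decoder.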
9. The method according to claim 8, characterized in that, Aligning the first set of feature maps among the plurality of feature maps to obtain a first intermediate set of feature maps includes: Each feature map in the first feature map set is upsampled and convolved according to the size of the target feature map to obtain the first intermediate feature map set, wherein the target feature map is the largest feature map in the first feature map set.
10. The method according to claim 8, characterized in that, The step of fusing the feature maps in the first intermediate feature map set to obtain a fused feature map includes: The intermediate feature maps in the first set of intermediate feature maps are concatenated according to the channel dimension to generate a joint feature map; The joint feature map is segmented according to the target size to obtain a set of separate feature maps; Multi-size global feature fusion is performed on each feature map in the set of separated feature maps to obtain the fused feature map.
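Claim 10 does not spell out how the joint feature map is segmented or what form the multi-size global fusion takes. One possible reading, offered only as an assumption, is to split the joint map into channel groups of a fixed size and enrich each group with averaged context pooled at several grid sizes (a pyramid-pooling style operation). The names `pooled_context`, `fuse`, `group_channels`, and `bin_sizes` are hypothetical.

```python
import numpy as np

def pooled_context(x, bins):
    # Average-pool (C, H, W) to a bins x bins grid, then repeat back to (C, H, W).
    c, h, w = x.shape
    fh, fw = h // bins, w // bins
    pooled = x.reshape(c, bins, fh, bins, fw).mean(axis=(2, 4))
    return pooled.repeat(fh, axis=1).repeat(fw, axis=2)

def fuse(aligned_maps, group_channels, bin_sizes=(1, 2, 4)):
    joint = np.concatenate(aligned_maps, axis=0)       # joint feature map
    n_groups = joint.shape[0] // group_channels
    groups = np.split(joint, n_groups, axis=0)         # separated feature maps
    fused = [g + sum(pooled_context(g, b) for b in bin_sizes) / len(bin_sizes)
             for g in groups]                          # add multi-size global context
    return np.concatenate(fused, axis=0)               # fused feature map

aligned = [np.ones((4, 16, 16)) for _ in range(3)]
fused_map = fuse(aligned, group_channels=4)
```

The height and width must be divisible by every entry in `bin_sizes`, and the total channel count by `group_channels`, for the reshapes and splits to succeed.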
11. The method according to any one of claims 1 to 7 and 9 to 10, characterized in that the method further includes: obtaining an initial encoder, an initial decoder, and training samples, wherein the training samples include training image data and ground truth labels, the initial encoder has the same network structure as the encoder, and the initial decoder has the same network structure as the decoder; using the initial semantic segmentation model to perform semantic segmentation on the training samples to obtain predicted semantic segmentation results of the training samples; calculating a loss value based on the ground truth labels and the predicted semantic segmentation results, wherein the loss value is the sum of a cross-entropy loss value and an overlap loss value; and training the initial encoder and the initial decoder based on the loss value to obtain the encoder and the decoder.
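The combined loss in claim 11 can be sketched as below. The claim names only an "overlap loss"; this sketch assumes a soft Dice loss for that term, which is a common choice but is not confirmed by the source. The function names and the `eps` smoothing constant are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class axis (axis 0).
    shifted = logits - logits.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

def segmentation_loss(logits, labels, eps=1e-6):
    # logits: (K, H, W) class scores; labels: (H, W) integer class ids.
    num_classes = logits.shape[0]
    probs = softmax(logits)
    one_hot = np.eye(num_classes)[labels].transpose(2, 0, 1)    # (K, H, W)
    # Pixel-wise cross-entropy, averaged over all pixels.
    ce = -np.mean(np.sum(one_hot * np.log(probs + eps), axis=0))
    # Soft Dice loss (assumed form of the overlap loss), averaged over classes.
    inter = np.sum(probs * one_hot, axis=(1, 2))
    denom = np.sum(probs, axis=(1, 2)) + np.sum(one_hot, axis=(1, 2))
    dice = 1.0 - np.mean((2.0 * inter + eps) / (denom + eps))
    return ce + dice

labels = np.array([[0, 1], [1, 0]])
target = np.eye(2)[labels].transpose(2, 0, 1)
good = segmentation_loss(20.0 * (2.0 * target - 1.0), labels)   # confident and correct
bad = segmentation_loss(-20.0 * (2.0 * target - 1.0), labels)   # confident and wrong
```

A confident correct prediction drives both terms toward zero, while a confident wrong prediction is penalized by both, so `bad` is much larger than `good`.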
12. An image semantic segmentation device, characterized in that it includes: an acquisition module, used to acquire an image to be recognized; and a processing module, used to sequentially extract features from the image to be recognized based on multiple coding layers of an encoder to obtain multiple feature maps, wherein the multiple feature maps are feature maps of different sizes output by the multiple coding layers in the encoder, and the multiple coding layers include coding layers based on convolutional neural networks and coding layers based on a combination of convolutional neural networks and spatial hybrid networks; the processing module is further used to sequentially perform feature decoding on the multiple feature maps based on multiple decoding layers of a decoder to obtain a semantic segmentation result of the image to be recognized, wherein each decoding layer corresponds one-to-one with a coding layer, the input of the first decoding layer is the output of the last coding layer, and the inputs of the other decoding layers include the output of the previous decoding layer and the output of the corresponding coding layer.
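The wiring between coding and decoding layers described in claim 12 can be sketched as follows. The toy `down` and `up` layers are assumptions used only to make the data flow visible; they are not the claimed convolutional or spatial hybrid layers, and all function names are hypothetical.

```python
import numpy as np

def encode(image, coding_layers):
    # Run the coding layers in order and keep every layer's output.
    maps, x = [], image
    for layer in coding_layers:
        x = layer(x)
        maps.append(x)
    return maps

def decode(maps, decoding_layers):
    # The first decoding layer sees only the last coding layer's output.
    x = decoding_layers[0](maps[-1], None)
    # Each later decoding layer also receives its paired coding layer's output.
    for layer, skip in zip(decoding_layers[1:], reversed(maps[:-1])):
        x = layer(x, skip)
    return x

def down(x):
    # Assumed coding layer: stride-2 subsampling of a (C, H, W) map.
    return x[:, ::2, ::2]

def up(x, skip):
    # Assumed decoding layer: merge the skip input (if any), then upsample by 2.
    if skip is not None:
        x = x + skip
    return x.repeat(2, axis=1).repeat(2, axis=2)

image = np.ones((1, 16, 16))
feature_maps = encode(image, [down, down, down])
result = decode(feature_maps, [up, up, up])
```

Three coding layers produce maps of size 8, 4, and 2; the three decoding layers consume them in reverse order and return a map at the input resolution.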
13. A computer device, characterized in that it includes: a memory, a processor, and a bus system; the memory is used to store a program; the processor is used to execute the program in the memory, so as to perform the method of any one of claims 1 to 11 according to instructions in the program code; and the bus system is used to connect the memory and the processor to enable communication between the memory and the processor.
14. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the method as claimed in any one of claims 1 to 11.
15. A computer program product, comprising a computer program, characterized in that, when executed by a processor, the computer program implements the method as described in any one of claims 1 to 11.