Method and device for constructing image semantic segmentation model and image processing, electronic equipment and medium
By identifying candidate regions for pasting in the source domain image of a virtual scene and fusing them with the target domain image of a real scene, and training the model using an exponential moving average machine learning model, the problem of poor generalization of virtual synthetic image training models on real images is solved, thus improving the robustness and segmentation accuracy of the model.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING WODONG TIANJUN INFORMATION TECH CO LTD
- Filing Date
- 2021-05-21
- Publication Date
- 2026-06-19
AI Technical Summary
In existing technologies, semantic segmentation models trained from virtual synthetic images have poor generalization ability on real images, resulting in inaccurate predictions and low reliability.
Candidate regions for pasting are identified from the source domain image of the virtual scene, including high-frequency categories and low-frequency long-tail categories. These are then fused with the target domain image of the real scene and trained using a machine learning model with exponential moving average to achieve feature-level and output-level alignment.
This improved the generalization performance of the semantic segmentation model in new unlabeled scenarios, and enhanced the robustness of the model and the accuracy of semantic segmentation.
Figure CN115393599B_ABST
Description
Technical Field
[0001] This disclosure relates to the field of computer technology, and in particular to a method, apparatus, electronic device, and medium for constructing an image semantic segmentation model and image processing.
Background Technology
[0002] As one of the fundamental tasks of computer vision, semantic segmentation of images can serve as a preliminary step in many practical applications, including image / video captioning, image-to-image translation, video content analysis, scene understanding, and autonomous driving. The goal of semantic segmentation is to predict a semantic category label for each pixel in an image, enabling the differentiation of different objects within the image.
[0003] However, training a high-performing semantic segmentation model typically requires a large amount of pixel-level labeled data, and manual annotation is extremely costly in terms of manpower, financial resources, and materials. Since it is much easier to generate synthetic images with dense pixel labels through 3D game engines, most of the work in related techniques has focused on using virtual synthetic images to train semantic segmentation models.
[0004] In the course of realizing the present invention, the inventors found at least the following technical problems in the related art: a semantic segmentation model trained on virtual synthetic images usually generalizes poorly to real images, and when a model built from labeled virtual source domain images performs semantic segmentation on real scene images, its predictions are inaccurate and unreliable.
Summary of the Invention
[0005] To address or at least partially address the aforementioned technical problems, embodiments of this disclosure provide a method, apparatus, electronic device, and medium for constructing an image semantic segmentation model and image processing.
[0006] In a first aspect, embodiments of this disclosure provide a method for constructing an image semantic segmentation model. The method includes: determining a candidate region image to be pasted in source domain images representing a virtual scene, wherein the candidate region image to be pasted contains both frequently occurring categories and less frequently occurring long-tail categories; for a randomly selected current source domain image x_s and a randomly selected current target domain image x_t representing a real scene, fusing the candidate region image to be pasted with the current source domain image x_s and with the current target domain image x_t separately, based on a preset transparency parameter, to obtain a current source domain mixed image x_ps and a current target domain mixed image x_pt, wherein the statistical distribution of the real scene differs from that of the virtual scene; inputting multiple sets of matching source domain images, source domain mixed images, and target domain mixed images into a first machine learning model, while inputting the corresponding target domain images into a second machine learning model for training, wherein the second machine learning model has the same model structure as the first machine learning model but different parameters, and the parameters of the second machine learning model are the exponential moving average of the parameters of the first machine learning model; and training the first machine learning model and the second machine learning model simultaneously to achieve feature-level alignment and output-level alignment between the source domain and the target domain, thereby obtaining a trained semantic segmentation model.
[0007] According to embodiments of this disclosure, determining the candidate region image to be pasted in the source domain images representing a virtual scene includes: randomly selecting a source domain image as a source domain template image x_p; determining, as the candidate image region, the image region in the source domain template image x_p that corresponds to the categories accounting for a preset proportion of the total; determining the k least frequent long-tail categories according to the pixel category distribution of all source domain images in the source domain, where k ≥ 2 and k is an integer; selecting a preset number of categories from the k long-tail categories as specified long-tail categories; selecting, from source domain images containing the specified long-tail categories, the long-tail image regions corresponding to the specified long-tail categories; and merging the source domain image that includes only the candidate image region with the source domain image that includes only the long-tail image regions to obtain the candidate region image to be pasted.
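The selection steps of paragraph [0007] can be sketched as follows. This is a minimal illustration, not the patented implementation. The function names, the NumPy array layout (H×W integer label maps with values in [0, num_classes), H×W×3 images), the choice of the most frequent half of the template's categories, and the rule that long-tail pixels override template pixels when the two regions overlap are all assumptions made for the example.

```python
import numpy as np

def long_tail_classes(label_maps, num_classes, k):
    """Return the k class ids with the fewest pixels over all source label maps.

    Assumes every label value lies in [0, num_classes).
    """
    counts = np.zeros(num_classes, dtype=np.int64)
    for lab in label_maps:
        counts += np.bincount(lab.ravel(), minlength=num_classes)
    return np.argsort(counts, kind="stable")[:k]

def frequent_class_mask(template_label, ratio=0.5):
    """Mask of the template pixels belonging to the most frequent classes.

    Assumption: 'preset proportion' is read as keeping a `ratio` fraction
    of the classes present in the template, most frequent first.
    """
    classes, counts = np.unique(template_label, return_counts=True)
    n_keep = max(1, int(len(classes) * ratio))
    keep = classes[np.argsort(-counts, kind="stable")[:n_keep]]
    return np.isin(template_label, keep)

def build_paste_candidate(template_img, template_label,
                          tail_img, tail_label, tail_classes, ratio=0.5):
    """Merge the frequent-class region and the long-tail region into one image."""
    freq_mask = frequent_class_mask(template_label, ratio)
    tail_mask = np.isin(tail_label, tail_classes)
    mask = freq_mask | tail_mask
    # Assumption: long-tail pixels take priority where both regions overlap.
    img = np.where(tail_mask[..., None], tail_img, template_img) * mask[..., None]
    label = np.where(tail_mask, tail_label, template_label)
    return img, label, mask
```

The returned mask marks the candidate region to be pasted; pixels outside it are zeroed in the image and are not meant to be used.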
[0008] According to embodiments of this disclosure, fusing the candidate region image to be pasted with the current source domain image x_s and with the current target domain image x_t separately, based on a preset transparency parameter, to obtain the current source domain mixed image x_ps and the current target domain mixed image x_pt includes: applying transparency weighting to the candidate region image to be pasted according to a preset transparency parameter β, where 0 < β < 1; weighting the region of the current source domain image x_s that corresponds in position to the candidate region image to be pasted by 1-β, while the weight of the remaining regions is 1; weighting the region of the current target domain image x_t that corresponds in position to the candidate region image to be pasted by 1-β, while the weight of the remaining regions is 1; fusing the transparency-weighted candidate region image to be pasted with the transparency-weighted current source domain image x_s to obtain the source domain mixed image x_ps for the current source domain image x_s; and fusing the transparency-weighted candidate region image to be pasted with the transparency-weighted current target domain image x_t to obtain the target domain mixed image x_pt for the current target domain image x_t.
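The transparency weighting of paragraph [0008] amounts to an alpha blend restricted to the pasted region. The sketch below is one way to write it with NumPy. The function name `soft_paste`, the binary `mask` argument marking the candidate region, and the H×W×C float image layout are assumptions made for the illustration.

```python
import numpy as np

def soft_paste(paste_img, mask, base_img, beta=0.5):
    """Blend the candidate region image into a base image.

    Inside the mask:  beta * paste_img + (1 - beta) * base_img
    Outside the mask: base_img unchanged (weight 1)
    """
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must satisfy 0 < beta < 1")
    m = mask.astype(np.float64)[..., None]  # (H, W, 1), broadcast over channels
    return beta * m * paste_img + (1.0 - beta * m) * base_img
```

Calling `soft_paste(x_p, mask, x_s, beta)` and `soft_paste(x_p, mask, x_t, beta)` with the same candidate image and mask would give x_ps and x_pt, so both mixed images share the same pasted region.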
[0009] According to embodiments of this disclosure, simultaneously training the first machine learning model and the second machine learning model to achieve feature-level alignment and output-level alignment between the source domain and the target domain, thereby obtaining a trained semantic segmentation model, includes: aligning the probability maps of the target domain mixed image in the first machine learning model and the second machine learning model based on weighted cross-entropy loss to achieve output-level alignment; and aligning the feature maps that the first machine learning model extracts from the source domain mixed image and the target domain mixed image based on weighted maximum mean discrepancy (MMD) loss to achieve feature-level alignment; wherein, during the simultaneous training, the parameters θ of the first machine learning model are optimized through gradient backpropagation, and the parameters θ′ of the second machine learning model are updated at training step t using an exponential moving average.
[0010] According to embodiments of this disclosure, the parameters of the second machine learning model satisfy: θ′_t = α·θ′_{t-1} + (1-α)·θ_t, t ≥ 2, where θ′_t denotes the parameters of the second machine learning model at training step t, α denotes the exponential moving average coefficient with 0 < α < 1, and θ_t denotes the parameters of the first machine learning model at training step t.
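The update rule in paragraph [0010] can be sketched as a small function over named parameters. The dictionary representation, the function name `ema_update`, and the default α = 0.99 are assumptions for illustration. The text only defines the rule for t ≥ 2, so initializing the second model's parameters as a copy of the first model's parameters at t = 1 is also an assumption.

```python
def ema_update(teacher_params, student_params, alpha=0.99):
    """Return theta'_t = alpha * theta'_{t-1} + (1 - alpha) * theta_t per parameter.

    Values may be floats or NumPy arrays; both dicts must share the same keys.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    return {
        name: alpha * teacher_params[name] + (1.0 - alpha) * student_params[name]
        for name in teacher_params
    }
```

Because the second model is only ever updated through this average, it receives no gradients of its own; a larger α makes it follow the first model more slowly.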
[0011] According to embodiments of this disclosure, aligning the probability maps of the target domain mixed image in the second machine learning model and the first machine learning model based on weighted cross-entropy loss to achieve output-level alignment includes: for any one of multiple sets of matching source domain images, source domain mixed images, target domain images, and target domain mixed images, inputting the current source domain image x_s, the current source domain mixed image x_ps, and the current target domain mixed image x_pt into the first machine learning model for training, and inputting the corresponding current target domain image x_t into the second machine learning model for training; determining the pseudo-label of the current target domain image x_t from the second machine learning model f_θ′; determining, from the first machine learning model f_θ, the predicted semantic map p_pt of the current target domain mixed image x_pt, the predicted semantic map p_s of the current source domain image x_s, and the predicted semantic map p_ps of the current source domain mixed image x_ps; determining a cross-entropy-based semantic segmentation loss from the ground-truth label of the current source domain image x_s and the predicted semantic map p_s; determining a cross-entropy-based soft-paste semantic segmentation loss from the ground-truth label of the current source domain mixed image x_ps and the predicted semantic map p_ps; determining a prediction consistency loss from the ground-truth label of the current source domain mixed image x_ps, the predicted semantic map p_pt of the current target domain mixed image x_pt, and the pseudo-label of the current target domain image x_t; and training the first and second machine learning models multiple times until the prediction consistency loss converges, thereby achieving output-level alignment.
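To make the prediction consistency term of paragraph [0011] more concrete, the sketch below computes a per-pixel cross-entropy on the target domain mixed image against two label sources: the ground-truth labels of the pasted region and the second model's pseudo-labels. The exact weights of the weighted cross-entropy are not recoverable from this text, so the β-based weighting used here is an assumption, as are the function names and the (H, W, C) probability layout.

```python
import numpy as np

def pixel_ce(prob, label, eps=1e-8):
    """Per-pixel cross-entropy; prob is (H, W, C), label is (H, W) integer ids."""
    picked = np.take_along_axis(prob, label[..., None], axis=-1)[..., 0]
    return -np.log(picked + eps)

def consistency_loss(p_pt, paste_label, mask, teacher_prob_t, beta=0.5):
    """Cross-entropy of the first model's prediction on x_pt.

    Assumed weighting (not taken from the source text):
      inside the pasted region: beta * CE(paste label) + (1 - beta) * CE(pseudo-label)
      outside the region:       CE(pseudo-label)
    paste_label must hold valid class ids everywhere, even where mask is 0.
    """
    pseudo = teacher_prob_t.argmax(axis=-1)  # pseudo-label from the second model
    m = mask.astype(np.float64)
    per_pixel = (beta * m * pixel_ce(p_pt, paste_label)
                 + (1.0 - beta * m) * pixel_ce(p_pt, pseudo))
    return float(per_pixel.mean())
```

The same `pixel_ce` helper, averaged over all pixels, would serve for the plain semantic segmentation loss on x_s and the soft-paste segmentation loss on x_ps.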
[0012] According to embodiments of this disclosure, the processing of any one of the multiple sets of matching source domain images, source domain mixed images, target domain images, and target domain mixed images includes: randomly selecting one set from the multiple sets as the input for the current training iteration, wherein the set includes a matching source domain image, source domain mixed image, target domain image, and target domain mixed image; and counting training iterations with a counter until the counter reaches a preset number of iterations, at which point the selection of input images stops, wherein the preset number of iterations corresponds to the learning rate of the neural network decaying to zero.
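The counter-driven sampling of paragraph [0012] might look like the generator below. The text does not say which decay schedule is used, so the polynomial schedule, the default base learning rate, and all names here are assumptions chosen only because such a schedule reaches zero exactly at the preset iteration count.

```python
import random

def poly_lr(base_lr, step, max_steps, power=0.9):
    """Polynomial decay (assumed schedule); equals 0 when step == max_steps."""
    return base_lr * (1.0 - step / max_steps) ** power

def sample_training_groups(groups, max_steps, base_lr=2.5e-4, seed=0):
    """Yield (group, learning_rate) pairs until the counter hits max_steps."""
    rng = random.Random(seed)
    counter = 0
    while counter < max_steps:
        yield rng.choice(groups), poly_lr(base_lr, counter, max_steps)
        counter += 1
```

Each yielded group stands for one matching set {source domain image, source domain mixed image, target domain image, target domain mixed image}.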
[0013] According to embodiments of this disclosure, aligning the feature maps that the first machine learning model extracts from the source domain mixed image and the target domain mixed image based on weighted maximum mean discrepancy (MMD) loss to achieve feature-level alignment includes: determining a soft-paste image region alignment loss and a global feature alignment loss from the features extracted from the current source domain mixed image x_ps and the current target domain mixed image x_pt; and training the first and second machine learning models multiple times until the soft-paste image region alignment loss and the global feature alignment loss converge, thereby achieving feature-level alignment.
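As a rough illustration of the two alignment terms in paragraph [0013], the sketch below uses the simplest form of maximum mean discrepancy, the linear-kernel case, which reduces to the squared distance between mean feature vectors. The kernel choice, the optional per-sample weights, the function names, and the assumption that the paste mask is available at feature-map resolution and is non-empty are all illustrative and are not taken from the source text.

```python
import numpy as np

def linear_mmd(feat_a, feat_b, w_a=None, w_b=None):
    """Linear-kernel MMD: squared distance between (weighted) feature means.

    feat_a and feat_b have shape (N, D) and (M, D); weights are optional.
    """
    mean_a = np.average(feat_a, axis=0, weights=w_a)
    mean_b = np.average(feat_b, axis=0, weights=w_b)
    return float(np.sum((mean_a - mean_b) ** 2))

def alignment_losses(feat_ps, feat_pt, mask):
    """Return (region loss, global loss) for (H, W, D) feature maps.

    mask is an (H, W) boolean array marking the pasted region.
    """
    d = feat_ps.shape[-1]
    region = linear_mmd(feat_ps[mask], feat_pt[mask])
    global_ = linear_mmd(feat_ps.reshape(-1, d), feat_pt.reshape(-1, d))
    return region, global_
```

Because x_ps and x_pt contain the same pasted content, the region term compares features of identical content seen against different backgrounds, while the global term compares the two feature maps as a whole.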
[0014] Secondly, embodiments of this disclosure provide an image processing method. The image processing method includes: acquiring a target image to be processed, representing an actual scene; inputting the target image into a pre-constructed semantic segmentation model, and outputting a semantic segmentation result of the target image from the semantic segmentation model, wherein the semantic segmentation model is constructed using the method for constructing an image semantic segmentation model as described above; and processing the target image based on the semantic segmentation result.
[0015] Thirdly, embodiments of this disclosure provide an apparatus for constructing an image semantic segmentation model. The apparatus includes: a candidate region image determination module, an image fusion module, a model input module, and a model training module. The candidate region image determination module is used to determine a candidate region image to be pasted in source domain images, wherein the source domain images represent a virtual scene, and the candidate region image to be pasted contains both frequently occurring categories and less frequently occurring long-tail categories. The image fusion module is used to, for a randomly selected current source domain image x_s and a randomly selected current target domain image x_t representing a real scene, fuse the candidate region image to be pasted with the current source domain image x_s and with the current target domain image x_t separately, based on a preset transparency parameter, to obtain a current source domain mixed image x_ps and a current target domain mixed image x_pt, wherein the statistical distribution of the real scene differs from that of the virtual scene. The model input module is used to input multiple sets of matching source domain images, source domain mixed images, and target domain mixed images into the first machine learning model, and simultaneously input the corresponding target domain images into the second machine learning model for training. The second machine learning model has the same model structure as the first machine learning model but different parameters, and the parameters of the second machine learning model are an exponential moving average of the parameters of the first machine learning model. The model training module is used to simultaneously train the first and second machine learning models, achieving feature-level alignment and output-level alignment between the source and target domains, thereby obtaining the trained semantic segmentation model.
[0016] Fourthly, embodiments of this disclosure provide an image processing apparatus. The image processing apparatus includes: an image acquisition module, a semantic segmentation module, and an image processing module. The image acquisition module acquires a target image to be processed, representing a real-world scene. The semantic segmentation module inputs the target image into a pre-constructed semantic segmentation model, and outputs a semantic segmentation result of the target image from the semantic segmentation model. The semantic segmentation model is constructed using the method for constructing an image semantic segmentation model as described above, or using the apparatus for constructing an image semantic segmentation model as described above. The image processing module processes the target image based on the semantic segmentation result.
[0017] Fifthly, embodiments of this disclosure provide an electronic device. The electronic device includes a processor, a communication interface, a memory, and a communication bus, wherein the processor, communication interface, and memory communicate with each other via the communication bus; the memory stores computer programs; and the processor, when executing the program stored in the memory, implements the method for constructing an image semantic segmentation model or the image processing method described above.
[0018] Sixthly, embodiments of this disclosure provide a computer-readable storage medium. The computer-readable storage medium stores a computer program that, when executed by a processor, implements the method for constructing an image semantic segmentation model or the image processing method as described above.
[0019] Compared with the prior art, the technical solutions provided in this disclosure have at least some or all of the following advantages:
[0020] The method for constructing an image semantic segmentation model provided in this disclosure introduces a candidate region image to be pasted, which includes both frequently occurring categories and less frequently occurring long-tail categories. The candidate region image is fused with a source domain image to obtain a source domain mixed image, and with a target domain image to obtain a target domain mixed image. In addition to inputting the source domain image into the first machine learning model and the target domain image into the second machine learning model for training, the source domain mixed image and the target domain mixed image are also input into the first machine learning model. This effectively reduces the difference between the source domain representing a virtual scene and the target domain representing a real scene. Furthermore, adding the source domain mixed image and the target domain mixed image to the input of the first machine learning model improves the generalization performance of the trained semantic segmentation model in unlabeled new scenes (the target domain), and enhances the robustness and accuracy of the model's semantic segmentation.
Attached Figure Description
[0021] The accompanying drawings, which are incorporated in and form a part of this specification, illustrate embodiments consistent with this disclosure and, together with the description, serve to explain the principles of this disclosure.
[0022] To more clearly illustrate the technical solutions in the embodiments of this disclosure or the prior art, the accompanying drawings used in the description of the embodiments or related technologies will be briefly introduced below. Obviously, those skilled in the art can obtain other drawings based on these drawings without creative effort.
[0023] Figure 1 schematically illustrates the system architecture of the method and apparatus for constructing an image semantic segmentation model applicable to embodiments of this disclosure;
[0024] Figure 2 schematically illustrates a flowchart of a method for constructing an image semantic segmentation model according to an embodiment of the present disclosure;
[0025] Figure 3 schematically illustrates a detailed implementation flowchart of operation S21, in which a candidate region image to be pasted is determined in a source domain image used to characterize a virtual scene, according to an embodiment of the present disclosure;
[0026] Figure 4 schematically illustrates a specific implementation scenario of operation S21 according to an embodiment of the present disclosure;
[0027] Figure 5 schematically illustrates a detailed implementation flowchart of operation S22, in which the candidate region image to be pasted is fused with the current source domain image x_s and the current target domain image x_t separately, based on a preset transparency parameter, according to an embodiment of the present disclosure;
[0028] Figure 6 schematically illustrates a specific implementation scenario of operation S22 according to an embodiment of the present disclosure;
[0029] Figure 7 schematically illustrates a detailed implementation flowchart of operation S24, in which a student model and a teacher model are trained simultaneously so that feature-level alignment and output-level alignment are achieved between the source domain and the target domain, thereby obtaining a trained semantic segmentation model, according to an embodiment of the present disclosure;
[0030] Figure 8 schematically illustrates the detailed implementation process of the method for constructing an image semantic segmentation model according to embodiments of the present disclosure;
[0031] Figure 9 schematically illustrates a flowchart of an image processing method according to an embodiment of the present disclosure;
[0032] Figure 10 schematically illustrates a structural block diagram of an apparatus for constructing an image semantic segmentation model according to an embodiment of the present disclosure;
[0033] Figure 11 schematically illustrates a structural block diagram of an image processing apparatus according to embodiments of the present disclosure; and
[0034] Figure 12 schematically illustrates a structural block diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Implementation
[0035] In related technologies, due to the difference in statistical distribution between virtual synthetic images (belonging to the source domain) and real images (belonging to the target domain), i.e., a domain difference, directly using semantic segmentation models trained on source domain image data to test target domain images may lead to a significant degradation in semantic segmentation prediction performance, resulting in poor model prediction reliability. In other words, models trained on virtual synthetic images have poor generalization ability on real images. Repeatedly applying models learned on virtual synthetic data may impair their performance on real-world data; this phenomenon is called "domain shift." For example, taking the segmentation result of a single frame in a real street scene video as an example, a semantic segmentation model trained on synthetic data from a video game cannot correctly segment the real street scene into different semantic categories, such as roads, people, and vehicles.
[0036] Therefore, how to utilize labeled source domain data that has a different distribution from the target domain data to guide the improvement of model robustness, so that it has better segmentation performance in the unlabeled target domain and improves the performance of unsupervised domain adaptation (UDA), has important practical value.
[0037] The mean teacher (average teacher) framework has been used in some methods for unsupervised domain adaptive semantic segmentation. These UDA methods achieve output-level alignment by applying a consistency constraint between the target domain predictions of the student model and the teacher model. While effective, these methods still suffer from unstable training and slow convergence, especially in the initial training phase, because predictions on the unlabeled target domain are inaccurate.
[0038] In view of this, embodiments of the present disclosure provide a method and apparatus for constructing an image semantic segmentation model, and also provide an image processing method and apparatus, an electronic device, and a computer-readable storage medium. The method for constructing an image semantic segmentation model includes: determining a candidate region image to be pasted in source domain images representing a virtual scene, wherein the candidate region image to be pasted contains both frequently occurring categories and less frequently occurring long-tail categories; for a randomly selected current source domain image x_s and a randomly selected current target domain image x_t representing a real scene, fusing the candidate region image to be pasted with the current source domain image x_s and with the current target domain image x_t separately, based on a preset transparency parameter, to obtain a current source domain mixed image x_ps and a current target domain mixed image x_pt, wherein the statistical distribution of the real scene differs from that of the virtual scene; inputting multiple sets of matching source domain images, source domain mixed images, and target domain mixed images into the first machine learning model, while inputting the corresponding target domain images into the second machine learning model for training, wherein the second machine learning model has the same model structure as the first machine learning model but different parameters, and the parameters of the second machine learning model are the exponential moving average of the parameters of the first machine learning model; and training the first and second machine learning models simultaneously to achieve feature-level alignment and output-level alignment between the source and target domains, thereby obtaining a trained semantic segmentation model.
[0039] To make the objectives, technical solutions, and advantages of the embodiments of this disclosure clearer, the technical solutions of the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this disclosure. Based on the embodiments of this disclosure, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this disclosure.
[0040] Figure 1 schematically illustrates the system architecture of the method and apparatus for constructing an image semantic segmentation model applicable to embodiments of this disclosure.
[0041] Referring to Figure 1, the system architecture 100 of the method and apparatus for constructing an image semantic segmentation model applicable to embodiments of this disclosure includes: terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 serves as a medium for providing a communication link between the terminal devices 101, 102, and 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
[0042] Users can use terminal devices 101, 102, and 103 to interact with server 105 via network 104 to receive or send messages, etc. Terminal devices 101, 102, and 103 can be equipped with image capture devices, image / video playback applications, etc. Other communication client applications can also be installed, such as shopping applications, web browser applications, search applications, instant messaging tools, email clients, social media platform software, etc. (for example only).
[0043] Terminal devices 101, 102, and 103 can be various electronic devices with displays that support image / video playback. These electronic devices may further include image capture devices, such as smartphones, tablets, laptops, desktop computers, autonomous vehicles, and so on.
[0044] Server 105 can be a server that provides various services, such as a backend management server that provides data processing support for images or videos captured by users using terminal devices 101, 102, and 103 (this is just an example). The backend management server can analyze and process received image / video processing requests and other data, and feed back the processing results (such as web pages, information, or data obtained or generated according to user requests) to the terminal devices.
[0045] It should be noted that the method for constructing an image semantic segmentation model provided in this disclosure can generally be executed by server 105 or a terminal device with a certain computing power. Correspondingly, the apparatus for constructing an image semantic segmentation model provided in this disclosure can generally be located in server 105 or the aforementioned terminal device with a certain computing power. The method for constructing an image semantic segmentation model provided in this disclosure can also be executed by a server or server cluster that is different from server 105 and capable of communicating with terminal devices 101, 102, 103 and / or server 105. Correspondingly, the apparatus for constructing an image semantic segmentation model provided in this disclosure can also be located in a server or server cluster that is different from server 105 and capable of communicating with terminal devices 101, 102, 103 and / or server 105.
[0046] It should be understood that the numbers of terminal devices, networks, and servers shown in Figure 1 are merely illustrative. Depending on implementation needs, any number of terminal devices, networks, and servers can be included.
[0047] The first exemplary embodiment of this disclosure provides a method for constructing an image semantic segmentation model.
[0048] In the embodiments of this disclosure, the second machine learning model has the same model structure as the first machine learning model but different parameters, and the parameters of the second machine learning model are the exponential moving average of the parameters of the first machine learning model. In this embodiment, the first machine learning model is used as the student model and the second machine learning model is used as the teacher model as an example.
[0049] Figure 2 schematically illustrates a flowchart of a method for constructing an image semantic segmentation model according to an embodiment of the present disclosure.
[0050] Referring to Figure 2, the method for constructing an image semantic segmentation model provided in this embodiment includes the following operations: S21, S22, S23 and S24.
[0051] In operation S21, a candidate region image to be pasted is determined in the source domain image used to characterize the virtual scene. The candidate region image to be pasted contains both high-frequency categories and low-frequency long-tail categories.
[0052] Here, "high-frequency categories" refers to categories that appear frequently in all source domain images. For example, categories that account for a predetermined proportion (for example, 1 / 2) of the total categories in a given source domain image (hereinafter described as a randomly selected source domain template image) can be considered high-frequency categories. "Low-frequency long-tail categories" refers to the categories that appear least frequently in all source domain images. The "high" and "low" here refer to the relative importance between long-tail categories and high-frequency categories; therefore, the scope of protection for these terms is clear.
[0053] In operation S22, for a randomly selected current source domain image x_s and a randomly selected current target domain image x_t representing a real scene, the candidate region image to be pasted is fused with the current source domain image x_s and with the current target domain image x_t separately, based on a preset transparency parameter, to obtain the current source domain mixed image x_ps and the current target domain mixed image x_pt.
[0054] The statistical distribution of the real scene differs from that of the virtual scene.
[0055] By fusing the candidate region image to be pasted with a source domain image and a target domain image randomly selected from the source domain and the target domain, respectively, the current source domain mixed image x_ps for the current source domain image x_s and the current target domain mixed image x_pt for the current target domain image x_t are obtained. The resulting x_ps and x_pt share the same image regions (the candidate regions to be pasted), which serve as an intermediary connecting the source and target domains, reducing the domain difference between them and improving the model's generalization performance in unlabeled new scenes.
[0056] In operation S23, multiple sets of matching source domain images, source domain mixed images, and target domain mixed images are input into the student model of the average teacher framework, while the corresponding target domain images are input into the teacher model of the average teacher framework for training.
[0057] For a randomly selected source domain image x_s and a randomly selected target domain image x_t, operation S22 yields the current source domain mixed image x_ps corresponding to x_s and the current target domain mixed image x_pt corresponding to x_t. This gives one set of matching input data, represented as {source domain image, source domain mixed image, target domain image, target domain mixed image}. In operation S23, multiple sets of matching input data are input successively into the average teacher framework for training. Each training iteration takes one set of matching input data, in which the source domain image, source domain mixed image, and target domain mixed image are input into the student model, and the target domain image is input into the teacher model.
[0058] In operation S24, both the student model and the teacher model are trained simultaneously, enabling feature-level alignment and output-level alignment between the source and target domains, thereby obtaining the trained semantic segmentation model.
[0059] By simultaneously training the student model and the teacher model in the average teacher framework, feature-level alignment and output-level alignment are achieved between the source domain and the target domain. At this point, the student model and the teacher model with the corresponding well-trained parameters are the semantic segmentation models that have been trained.
[0060] Through operations S21-S24 described above, candidate region images to be pasted are introduced that contain both high-frequency categories and low-frequency long-tail categories. These candidate region images are fused with the source domain image and the target domain image, respectively, to obtain source domain mixed images and target domain mixed images. This narrows the distance between the target domain and the source domain, and avoids problems such as large statistical distribution differences between the source and target domains, inconsistent spatial layouts caused by directly pasting parts of the source domain, and class imbalance. In addition to training the student model with the source domain image and the teacher model with the target domain image, the source domain mixed image and the target domain mixed image are also input into the student model for training, which effectively reduces the difference between the source domain representing the virtual scene and the target domain representing the real scene. Adding the source domain mixed image and the target domain mixed image to the input of the average teacher framework thus improves the generalization performance of the trained semantic segmentation model in unlabeled new scenes (the target domain) and enhances the robustness and accuracy of the semantic segmentation.
[0061] In this disclosure, the terms have the following meanings:
Source domain: virtual street view images or other data whose statistical distribution (weather, lighting, scene spatial layout, etc.) differs from that of the target domain, and which have semantic labels.
Target domain: real street view images or other data whose statistical distribution (weather, lighting, scene spatial layout, etc.) differs from that of the source domain, and which have no semantic labels.
Pseudo-labels: semantic segmentation results obtained by using an existing model to predict on the target domain.
Average teacher framework: a widely used framework based on the simple idea that, under the supervision of labeled data, unlabeled data should produce consistent predictions under different perturbations. It consists of two models, a student model and a teacher model, where the parameters of the teacher model are the exponential moving average (EMA) of the parameters of the student model. The teacher model transfers learned knowledge to the student model by aligning the two domains at the output level using consistency regularization.
UDA: unsupervised domain adaptation, a special case of transfer learning. The goal is to map data from source and target domains with different distributions into the same feature space and to make their distance in that space as small as possible, so that a model trained on the source domain in that feature space can be transferred to the target domain to improve accuracy there.
[0062] Figure 3 schematically shows a detailed implementation flowchart of operation S21, which determines a candidate region image to be pasted in a source domain image used to characterize a virtual scene, according to an embodiment of the present disclosure.
[0063] According to embodiments of this disclosure, as shown in Figure 3, operation S21, which determines the candidate region image to be pasted in the source domain image used to characterize the virtual scene, includes the following sub-operations: S211, S212, S213, S214, and S215. In this disclosure, the candidate region image to be pasted refers to a source domain image containing only the candidate regions to be pasted, which can be obtained by performing sub-operations S211 to S215.
[0064] In sub-operation S211, a source domain image is randomly selected as the source domain template image x_p, and the image regions corresponding to a preset proportion of the categories present in the source domain template image x_p are determined as candidate image regions.
[0065] In suboperation S212, the k least frequent long-tail categories are determined based on the pixel category distribution of all source domain images in the source domain, where k ≥ 2.
[0066] In sub-operation S213, a preset number of categories are selected from the aforementioned k long-tail categories as designated long-tail categories. Here, the preset number is less than k; in some embodiments, the preset number may also be equal to k.
[0067] In sub-operation S214, a long-tail image region corresponding to the specified long-tail category is selected from the source domain image containing the specified long-tail category.
[0068] In sub-operation S215, the source domain image containing only the above-mentioned candidate image region is merged with the source domain image containing only the above-mentioned long-tail image region to obtain the above-mentioned candidate region image to be pasted.
[0069] In the embodiments of this disclosure, the preset ratio in the sub-operation S211 can be any number within a range. For example, the ratio range can be 1 / 3 to 5 / 9 (inclusive of endpoints), and the value of this ratio range can be adaptively changed according to actual needs. In an exemplary instance, the preset ratio can be 1 / 2. The setting of this preset ratio needs to ensure that it does not affect the coverage of more features of the source domain image / target domain image, and also enables the connection between the source domain and the target domain.
[0070] For a concrete dataset, for example the synthetic GTA5 dataset, the k least frequent long-tail categories in sub-operation S212 are: rider, bus, train, motorcycle, and bicycle, corresponding to k=5. For the SYNTHIA dataset, the k least frequent long-tail categories are: wall, light, bus, and bicycle, corresponding to k=4.
[0071] In one embodiment, S denotes the source domain, which includes source domain images X_S and pixel-level labels Y_S, and T denotes the target domain, which contains only unlabeled target domain images X_T. The goal of semantic segmentation based on unsupervised domain adaptation (UDA) is to train, on {X_S, Y_S, X_T}, a model that accurately predicts semantic labels for the target domain images X_T.
[0072] This disclosure proposes a novel semantic segmentation model within the average teacher framework. This model obtains its input data using a dual soft paste (DSP) approach. Specifically, it introduces candidate regions to be pasted that include both frequently occurring categories and less frequently occurring long-tail categories. These candidate regions are fused with the source and target domain images, respectively, to obtain a source domain mixed image and a target domain mixed image, which serve as input data for training the teacher model and the student model under the average teacher framework. The semantic segmentation model comprises two segmentation models: a student model and a teacher model. Each model is a neural network; therefore, the student model can also be called a student network, and the teacher model can also be called a teacher network. The parameters θ of the student model f_θ are learnable, and the parameters θ′ of the teacher network f_θ′ are calculated as the exponential moving average (EMA) of the parameters θ of the student model.
[0073] The implementation process of the above sub-operations S211 to S215 will be described below in the context of specific scenarios.
[0074] In one exemplary embodiment, there are a total of 8 images (or pictures) in the source domain S. These 8 images are images A to H, and are all different virtual scene images, such as screenshots from a game scene or images from other virtual scenes. These 8 images contain a total of 18 categories, namely: trees, motorcycles, riders, buses, trains, bicycles, pedestrians, walls, fences, rivers, racetracks, lighthouses, stadiums, halls, flags, podiums, race cars, and race car drivers.
[0075] For example, image A contains 4 categories: trees, pedestrians, buses, and fences. Image B contains 3 categories: trains, rivers, and trees. Image C contains 4 categories: motorcycles, riders, bicycles, and racetracks. Image D contains 6 categories: trees, lighthouses, flags, race car drivers, race cars, and racetracks. Image E contains 3 categories: stadiums, halls, and podiums. Image F contains 6 categories: stadiums, walls, podiums, racetracks, race cars, and race car drivers. Image G contains 6 categories: trees, buses, rivers, fences, pedestrians, and stadiums. Image H contains 5 categories: riders, motorcycles, trees, fences, and lighthouses.
[0076] Figure 4 schematically shows a specific implementation scenario of operation S21 according to an embodiment of the present disclosure.
[0077] In the images illustrated in Figure 4, the placement and size of each bounding box represent the position and size of the image region corresponding to that box. The text within each bounding box indicates the category of that image region. For simplicity, the actual image content within many image regions is not shown; only the category of the image region is indicated. The representation in Figure 6 below is similar.
[0078] In sub-operation S211, one image is randomly selected from the 8 images of the source domain S as the source domain template image x_p. Take a randomly selected image A as an example. As shown in Figure 4, image A contains 4 categories. The image regions of half of the categories (the preset ratio is 1 / 2 here) are selected as candidate image regions, so two categories in image A need to be selected. Here, trees and buses are randomly selected as an example. The image regions in image A corresponding to trees and buses are denoted candidate image regions 401a and 401b, respectively. The source domain image of sub-operation S215 that contains only candidate image regions 401a and 401b corresponds to 401.
[0079] Given the images of the source domain S, the frequency distribution of pixel categories can be computed as {p_1, p_2, ..., p_c}. The frequency p_i of the i-th category satisfies the following expression:
[0080] $$p_i = \frac{1}{N}\sum_{j=1}^{N} c_{ij} \qquad (1)$$
[0081] Where c represents the total number of categories in all source domain images, c_ij indicates whether the j-th source domain image contains category c_i (1 if it does, 0 otherwise), and N represents the total number of source domain images in the source domain S. All images in the source domain have the same size.
[0082] Taking the above embodiment as an example, the frequencies of trees, motorcycles, riders, buses, trains, bicycles, pedestrians, walls, fences, rivers, racetracks, lighthouses, stadiums, halls, flags, podiums, race cars, and race car drivers are denoted p_1, p_2, ..., p_18, respectively. According to expression (1), the following can be calculated: p_1 (trees) = 5 / 8, p_2 (motorcycles) = 2 / 8, p_3 (riders) = 2 / 8, p_4 (buses) = 2 / 8, p_5 (trains) = 1 / 8, p_6 (bicycles) = 1 / 8, p_7 (pedestrians) = 2 / 8, p_8 (walls) = 1 / 8, p_9 (fences) = 3 / 8, p_10 (rivers) = 2 / 8, p_11 (racetracks) = 3 / 8, p_12 (lighthouses) = 2 / 8, p_13 (stadiums) = 3 / 8, p_14 (halls) = 1 / 8, p_15 (flags) = 1 / 8, p_16 (podiums) = 2 / 8, p_17 (race cars) = 2 / 8, p_18 (race car drivers) = 2 / 8.
[0083] In suboperation S212, the k least frequent categories are selected as long-tail categories, and images containing these categories are recorded as dataset D to facilitate subsequent sampling processes.
[0084] In this embodiment, taking k=5 as an example, the five categories with the lowest distribution frequency are: train, bicycle, wall, hall and flag. These five categories are long-tail categories.
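The frequency calculation and the long-tail selection of sub-operation S212 can be checked against the 8-image example with a short sketch. The category sets are copied from the example above; the function names and the alphabetical tie-breaking rule are assumptions made for illustration only.

```python
from fractions import Fraction

# Category sets for images A-H, copied from the example in the text.
IMAGES = {
    "A": {"trees", "pedestrians", "buses", "fences"},
    "B": {"trains", "rivers", "trees"},
    "C": {"motorcycles", "riders", "bicycles", "racetracks"},
    "D": {"trees", "lighthouses", "flags", "race car drivers", "race cars", "racetracks"},
    "E": {"stadiums", "halls", "podiums"},
    "F": {"stadiums", "walls", "podiums", "racetracks", "race cars", "race car drivers"},
    "G": {"trees", "buses", "rivers", "fences", "pedestrians", "stadiums"},
    "H": {"riders", "motorcycles", "trees", "fences", "lighthouses"},
}


def category_frequencies(images):
    """Return p_i = (number of images containing category i) / N."""
    n = len(images)
    categories = set().union(*images.values())
    return {c: Fraction(sum(c in cats for cats in images.values()), n) for c in categories}


def long_tail_categories(freqs, k):
    """Return the k least frequent categories (ties broken alphabetically, an assumption)."""
    return sorted(freqs, key=lambda c: (freqs[c], c))[:k]


def long_tail_dataset(images, tail):
    """Return the names of the images containing at least one long-tail category (dataset D)."""
    return sorted(name for name, cats in images.items() if cats & set(tail))


freqs = category_frequencies(IMAGES)
tail = long_tail_categories(freqs, k=5)
dataset_d = long_tail_dataset(IMAGES, tail)
print(tail)       # the five categories that each appear in only one image
print(dataset_d)  # the images recorded as dataset D
```

In this example exactly five categories appear in a single image, so the choice of k=5 is unambiguous; for other data the tie-breaking rule would matter.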
[0085] In sub-operation S213, two categories are selected from the long-tail categories as designated long-tail categories. For example, two categories can be randomly selected from train, bicycle, wall, hall, and flag as designated long-tail categories. Flag and train are used as designated long-tail categories as an example.
[0086] In sub-operation S214, long-tail image regions 402a and 402b corresponding to the designated long-tail categories are selected from the source domain images containing the designated long-tail categories. Specifically, the long-tail image region 402a corresponding to the train and the long-tail image region 402b corresponding to the flag may belong to different source domain images or to the same source domain image. The source domain images containing the long-tail image regions of all the designated long-tail categories are merged into a single image. As shown in Figure 4, the source domain image in sub-operation S215 that contains only the long-tail image regions corresponds to 402.
[0087] In sub-operation S215, the source domain image 401, which contains only the candidate image regions, and the source domain image 402, which contains only the long-tail image regions, are merged to obtain the candidate region image 403 to be pasted, as shown in Figure 4.
[0088] In one embodiment, the hybrid source and target domains are obtained by soft-pasting candidate region images that simultaneously possess both frequently occurring categories and less frequently occurring long-tail categories onto the source and target domain images. By employing a transparent weighted processing (soft-pasting) strategy in the fusion of the candidate region images to be pasted with the source / target domain images, the structural layout and appearance of the pasted images can be maintained intact. This preserves the information of the original domain and reduces the domain differences between the source and target domains through the fusion of the candidate regions to be pasted, thereby increasing the model's generalization performance in unlabeled new scenarios.
[0089] Figure 5 schematically shows a detailed implementation flowchart of operation S22 according to an embodiment of the present disclosure, in which the candidate region image to be pasted is fused separately with the current source domain image x_s and the current target domain image x_t based on a preset transparency parameter.
[0090] According to embodiments of this disclosure, as shown in Figure 5, operation S22, in which the candidate region image to be pasted is fused separately with the current source domain image x_s and the current target domain image x_t based on a preset transparency parameter to obtain the current source domain mixed image x_ps and the current target domain mixed image x_pt, includes the following sub-operations: S221, S222, S223, S224 and S225.
[0091] In sub-operation S221, the above candidate region image to be pasted is subjected to transparency weighting processing according to the preset transparency parameter β, where 0 < β < 1.
[0092] In sub-operation S222, the regions of the current source domain image x_s corresponding to the positions of the candidate regions to be pasted are subjected to transparency weighting according to 1-β, while the transparency of the remaining regions is 1.
[0093] In sub-operation S223, the regions of the current target domain image x_t corresponding to the positions of the candidate regions to be pasted are subjected to transparency weighting according to 1-β, while the transparency of the remaining regions is 1.
[0094] In sub-operation S224, the transparency-weighted candidate region image to be pasted and the transparency-weighted current source domain image x_s are fused to obtain the source domain mixed image x_ps for the current source domain image x_s.
[0095] In sub-operation S225, the transparency-weighted candidate region image to be pasted and the transparency-weighted current target domain image x_t are fused to obtain the target domain mixed image x_pt for the current target domain image x_t.
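Per pixel, sub-operations S221 to S225 amount to a convex combination of two images. Writing M(i, j) ∈ {0, 1} for an indicator of whether pixel (i, j) lies in a candidate region to be pasted, and x_c for the candidate region image (both symbols are notation assumed here for illustration), the weighting can be written as:

```latex
\begin{aligned}
x_{ps}(i,j) &= \beta M(i,j)\, x_c(i,j) + \bigl(1 - \beta M(i,j)\bigr)\, x_s(i,j),\\
x_{pt}(i,j) &= \beta M(i,j)\, x_c(i,j) + \bigl(1 - \beta M(i,j)\bigr)\, x_t(i,j).
\end{aligned}
```

For instance, with an illustrative β = 0.8, a pasted pixel of intensity 200 over a source pixel of intensity 100 gives 0.8 · 200 + 0.2 · 100 = 180, while pixels outside the candidate regions (M = 0) keep their original values.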
[0096] Figure 6 schematically shows a specific implementation scenario of operation S22 according to an embodiment of the present disclosure.
[0097] As shown in Figure 6, the candidate region image 403 to be pasted obtained in operation S21 is subjected to transparency weighting with the transparency parameter β in sub-operation S221, resulting in the transparency-weighted candidate region image 603 to be pasted. This transparency weighting of the candidate region image 403 includes performing transparency weighting on each candidate region to be pasted in the image 403, where the candidate regions to be pasted comprise the candidate image regions and the long-tail image regions of the designated long-tail categories. For example, in Figure 6, the candidate image regions 401a (tree category) and 401b (bus category) and the long-tail image regions 402a (train category) and 402b (flag category) are each transparency weighted, resulting in candidate regions 601a, 601b, 602a, and 602b to be pasted. Figure 6 uses dot filling to indicate transparency processing with the transparency parameter β.
[0098] In this embodiment, image E from the aforementioned embodiment is taken as the randomly selected current source domain image x_s, and the image REAL, which shows a real river scene, is taken as the target domain image x_t. Image E contains three categories: stadiums, halls, and podiums.
[0099] In sub-operations S222 and S223, as shown in Figure 6, the regions in images E and REAL corresponding to the positions of image regions 401a, 401b, 402a, and 402b in the candidate region image 403 to be pasted are subjected to transparency weighting according to 1-β, while the transparency of the other regions is 1. Figure 6 uses grid filling to indicate transparency processing with the transparency parameter 1-β. The corresponding regions in image E after transparency processing with 1-β are labeled 604a, 604b, 604c, and 604d, and the transparency-weighted current source domain image x_s is labeled 604 in Figure 6. The corresponding regions in the image REAL after transparency processing with 1-β are labeled 605a, 605b, 605c, and 605d, and the transparency-weighted current target domain image x_t is labeled 605 in Figure 6.
[0100] Furthermore, in sub-operation S224, the transparency-weighted candidate region image 603 and the transparency-weighted current source domain image x_s (604) are fused to obtain the source domain mixed image x_ps (606) for the current source domain image x_s. In sub-operation S225, the transparency-weighted candidate region image 603 and the transparency-weighted current target domain image x_t (605) are fused to obtain the target domain mixed image x_pt (607) for the current target domain image x_t.
[0101] The sub-operations S221 to S225 described above, which perform transparency weighting on both the source domain image and the target domain image, can be described as a dual soft-paste method, and can be implemented by applying a dual soft-paste algorithm to the images. Since the candidate regions to be pasted are pasted into the source or target domain images using transparency weighting, in the following description the candidate regions to be pasted (401a, 401b, 402a, and 402b) and the corresponding transparency-processed regions 604a, 604b, 604c, and 604d are collectively referred to as soft-paste image regions.
[0102] For example, the dual soft paste (DSP) algorithm is as follows.
[0103] Input: source domain template image x_p and its label y_p, source domain image x_s, target domain image x_t, predefined long-tail dataset D, transparency parameter β.
[0104] Output: DSP mask M, source domain mixed image x_ps, target domain mixed image x_pt.
[0105] The parameters and corresponding calculation logic defined in the DSP algorithm are as follows:
[0106] S_class ← the set of categories that appear in y_p
[0107] c ← randomly select |S_class| / 2 categories from S_class
[0108] for each i, j do
[0109]     M(i, j) = 1 if y_p(i, j) ∈ c, otherwise M(i, j) = 0
[0110] end for
[0111] c_tail, y_tail ← randomly select from D an image containing long-tail categories
[0112] for each i, j do
[0113]     if y_tail(i, j) ∈ c_tail
[0114]     then M(i, j) = 1
[0115] end for
[0116] x_ps = βM⊙x_p + (1-βM)⊙x_s; x_pt = βM⊙x_p + (1-βM)⊙x_t
[0117] return M, x_ps, x_pt.
[0118] For example, β can be set to 0.8 by default. β can also be other parameter values, and the specific parameter values can be modified according to actual needs.
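The DSP steps listed above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: images are float arrays of shape (H, W, 3), label maps are integer arrays of shape (H, W), and the function name `dual_soft_paste` and the explicit `x_tail` argument (the pixel content of the long-tail image, merged into the candidate region image as in sub-operation S215) are introduced here for illustration.

```python
import numpy as np


def dual_soft_paste(x_p, y_p, x_tail, y_tail, tail_classes, x_s, x_t, beta=0.8, rng=None):
    """Return the DSP mask M and the mixed images x_ps and x_pt."""
    rng = np.random.default_rng() if rng is None else rng

    # S_class: categories present in the template label; keep a random half of them.
    classes = np.unique(y_p)
    chosen = rng.choice(classes, size=len(classes) // 2, replace=False)
    m_template = np.isin(y_p, chosen)

    # Pixels of the designated long-tail categories are added to the mask.
    m_tail = np.isin(y_tail, tail_classes)
    mask = (m_template | m_tail).astype(np.float64)

    # Candidate region image: long-tail pixels take precedence on overlap (assumption).
    candidate = np.where(m_tail[..., None], x_tail, x_p)

    # Soft paste: beta*M*candidate + (1 - beta*M)*base, with M broadcast over channels.
    w = beta * mask[..., None]
    x_ps = w * candidate + (1.0 - w) * x_s
    x_pt = w * candidate + (1.0 - w) * x_t
    return mask, x_ps, x_pt


rng = np.random.default_rng(0)
y_p = np.tile(np.array([0, 0, 1, 1, 2, 3]), (4, 1))  # template labels, 4 categories
y_tail = np.zeros((4, 6), dtype=int)
y_tail[0, 0] = 9                                     # one long-tail pixel (category 9)
x_p = np.full((4, 6, 3), 1.0)
x_tail = np.full((4, 6, 3), 0.5)
x_s = np.zeros((4, 6, 3))
x_t = np.full((4, 6, 3), 0.2)

mask, x_ps, x_pt = dual_soft_paste(x_p, y_p, x_tail, y_tail, [9], x_s, x_t, beta=0.8, rng=rng)
```

Because the same mask and candidate pixels are used for both outputs, x_ps and x_pt share identical pasted content and differ only in the background that shows through with weight 1-β.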
[0119] Figure 7 schematically shows a detailed implementation flowchart of operation S24 according to an embodiment of the present disclosure, in which the student model and the teacher model are trained simultaneously to achieve feature-level alignment and output-level alignment between the source and target domains and thereby obtain a trained semantic segmentation model.
[0120] According to embodiments of this disclosure, as shown in Figure 7, operation S24, which trains the student model and the teacher model simultaneously to achieve feature-level alignment and output-level alignment between the source domain and the target domain and thereby obtains the trained semantic segmentation model, includes the following sub-operations: S241 and S242.
[0121] In sub-operation S241, the probability maps of the target domain fusion images in the teacher model and student model are aligned based on weighted cross-entropy loss to achieve output-level alignment.
[0122] In sub-operation S242, the feature maps that the student model extracts from the source domain mixed image and the target domain mixed image are aligned based on a weighted maximum mean discrepancy (MMD) loss to achieve feature-level alignment.
[0123] During the simultaneous training of the student and teacher models, while the parameters θ of the student model are optimized through gradient backpropagation, the parameters θ′ of the teacher model are updated at each training step using an exponential moving average: θ′_t = α·θ′_{t-1} + (1-α)·θ_t, t ≥ 2, where θ′_t represents the parameters of the teacher model at training step t, α represents the exponential moving average coefficient with 0 < α < 1, and θ_t represents the parameters of the student model at training step t.
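The teacher update can be sketched as a per-parameter operation. In this minimal illustration the parameters are held in dictionaries of NumPy arrays, and the function name `ema_update` and the default α = 0.99 are assumptions chosen for the example; the text only requires 0 < α < 1. Initializing the teacher as a copy of the student at the first step is likewise an assumption.

```python
import numpy as np


def ema_update(teacher, student, alpha=0.99):
    """Return theta'_t = alpha * theta'_{t-1} + (1 - alpha) * theta_t for every parameter."""
    return {name: alpha * teacher[name] + (1.0 - alpha) * student[name] for name in teacher}


teacher = {"w": np.array([1.0, 1.0])}   # theta'_{t-1}
student = {"w": np.array([0.0, 2.0])}   # theta_t after the gradient step
teacher = ema_update(teacher, student, alpha=0.9)
```

Only the student receives gradients; the teacher changes solely through this averaging, which makes its predictions smoother over training steps.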
[0124] Specifically, the above-mentioned process of aligning the probability maps of the target domain fusion images in the teacher model and student model based on weighted cross-entropy loss to achieve output-level alignment includes the following steps.
[0125] (a) For any one of the multiple sets of matching source domain images, source domain mixed images, target domain images, and target domain mixed images, the current source domain image x_s, the current source domain mixed image x_ps, and the current target domain mixed image x_pt are input into the student model for training, and the corresponding current target domain image x_t is input into the teacher model for training.
[0126] During each training iteration, operations (b) to (f) are performed on the current input image data. By training the teacher model and the student model over multiple iterations until the prediction consistency loss converges, output-level alignment is achieved.
[0127] (b) Based on the teacher model f_θ′, determine the pseudo-label of the current target domain image x_t.
[0128] (c) Based on the student model f_θ, determine the predicted semantic map p_pt of the current target domain mixed image x_pt, the predicted semantic map p_s of the current source domain image x_s, and the predicted semantic map p_ps of the current source domain mixed image x_ps.
[0129] (d) Based on the true label y_s of the current source domain image x_s and the predicted semantic map p_s, determine the semantic segmentation loss based on cross-entropy.
[0130] According to embodiments of this disclosure, the semantic segmentation loss based on cross-entropy satisfies the following expression:
[0131]
[0132] Where H, W, and C represent the height, width, and number of categories of the image, respectively, and y_s represents the true label of the current source domain image x_s.
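A pixel-wise cross-entropy over an H × W × C prediction can be sketched as follows. This is one standard form consistent with the symbols defined here, offered as an assumption rather than as the exact expression of this disclosure: in particular, averaging over the H·W pixels and the small constant `eps` are choices made for the example.

```python
import numpy as np


def pixel_cross_entropy(probs, labels, eps=1e-12):
    """Mean over pixels of -log p[h, w, y[h, w]]; probs is (H, W, C), labels is (H, W)."""
    h, w, _ = probs.shape
    # Pick, for every pixel, the predicted probability of its true category.
    picked = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return float(-np.log(picked + eps).mean())


p_s = np.full((2, 3, 4), 0.25)           # uniform prediction over C = 4 categories
y_s = np.zeros((2, 3), dtype=int)        # true labels
loss = pixel_cross_entropy(p_s, y_s)     # equals log(4) for a uniform prediction
```

The same function applies unchanged to the source domain mixed image by passing p_ps and its label map.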
[0133] (e) Based on the true label y_p of the current source domain mixed image x_ps and the predicted semantic map p_ps, determine the soft-paste semantic segmentation loss based on cross-entropy.
[0134] According to embodiments of this disclosure, the soft-paste semantic segmentation loss based on cross-entropy satisfies the following expression:
[0135]
[0136] Where y_p represents the true label of the source domain mixed image.
[0137] (f) Based on the true label y_p of the current source domain mixed image x_ps, the predicted semantic map p_pt of the current target domain mixed image x_pt, and the pseudo-label of the current target domain image x_t, determine the prediction consistency loss.
[0138] Since the teacher model and the student model are expected to produce the same predictions for the same image under different perturbations, the prediction consistency loss is used to train the network.
[0139] According to embodiments of this disclosure, the prediction consistency loss satisfies the following expression:
[0140]
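How the three inputs of operation (f) could fit together can be sketched as follows. This is an assumption-laden illustration, not the expression of this disclosure: it assumes that the supervision target for x_pt takes the template label y_p inside the pasted regions and the teacher's pseudo-label elsewhere, and that per-pixel weights (here `weights`, a hypothetical input such as a teacher confidence map) are supplied by the caller.

```python
import numpy as np


def mixed_target(y_p, pseudo_t, mask):
    """Template labels inside pasted regions, teacher pseudo-labels elsewhere (assumed)."""
    return np.where(mask.astype(bool), y_p, pseudo_t)


def weighted_cross_entropy(probs, target, weights, eps=1e-12):
    """Weighted mean of -log p[h, w, target[h, w]]; probs is (H, W, C)."""
    h, w, _ = probs.shape
    picked = probs[np.arange(h)[:, None], np.arange(w)[None, :], target]
    return float((weights * -np.log(picked + eps)).sum() / max(weights.sum(), eps))


y_p = np.full((2, 2), 1)                 # labels of the template image
pseudo_t = np.full((2, 2), 2)            # teacher pseudo-labels for x_t
mask = np.array([[1, 0], [0, 1]])        # DSP mask M
target = mixed_target(y_p, pseudo_t, mask)

p_pt = np.full((2, 2, 3), 1.0 / 3.0)     # stand-in student prediction for x_pt
weights = np.ones((2, 2))                # hypothetical per-pixel weights
loss = weighted_cross_entropy(p_pt, target, weights)
```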
[0141] Here, taking any one of the multiple sets of matching source domain images, source domain mixed images, target domain images, and target domain mixed images means the following: one set is randomly selected from the multiple sets as the input of the current training iteration, the set comprising a matching source domain image, source domain mixed image, target domain image, and target domain mixed image; and a counter counts the training iterations until it reaches a preset number of iterations, at which point the selection of input images stops.
[0142] According to embodiments of this disclosure, aligning the feature maps that the student model extracts from the source domain mixed image and the target domain mixed image based on the weighted maximum mean discrepancy (MMD) loss to achieve feature-level alignment includes: determining the soft-paste image region alignment loss and the global feature alignment loss based on the features extracted from the source domain mixed image x_ps and the target domain mixed image x_pt. By training the teacher model and the student model over multiple iterations until the soft-paste image region alignment loss and the global feature alignment loss converge, feature-level alignment is achieved.
[0143] Because the source domain mixed image x_ps and the target domain mixed image x_pt contain the same soft-paste image regions (taken from the source domain image), the features extracted from x_ps and x_pt within those regions should be as similar as possible.
[0144] To this end, the maximum mean discrepancy (MMD) loss is employed, which learns transferable features by minimizing the MMD of their kernel embeddings.
[0145] The soft-paste image region alignment loss satisfies the following expression:
[0146]
[0147] Where μ(·) represents the kernel mean embedding, f_e represents the feature extractor of the student model f_θ, and the embedding lies in the reproducing kernel Hilbert space (RKHS). Because the pasted image regions in x_ps and x_pt have different background information, from the source domain and the target domain respectively, and this information may be embedded in the extracted features, the soft-paste image region alignment loss can implicitly reduce the domain gap.
[0148] The global feature alignment loss satisfies the following expression:
[0149]
[0150] Minimizing the maximum mean discrepancy between the global image features of x_ps and x_pt aligns the feature distributions of the source and target domains.
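The two alignment terms can be illustrated with a kernel-based estimate of the squared maximum mean discrepancy. The sketch below is an assumption for illustration only: it uses a Gaussian kernel with bandwidth `sigma`, the biased estimator, random arrays standing in for the features f_e(x_ps) and f_e(x_pt), and a mask already resized to the feature resolution; none of these choices is specified in the text.

```python
import numpy as np


def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between the rows of a (n, d) and b (m, d)."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))


def mmd2(a, b, sigma=1.0):
    """Biased estimate of ||mu(a) - mu(b)||^2 in the RKHS of the Gaussian kernel."""
    return float(rbf_kernel(a, a, sigma).mean()
                 + rbf_kernel(b, b, sigma).mean()
                 - 2.0 * rbf_kernel(a, b, sigma).mean())


rng = np.random.default_rng(0)
f_ps = rng.normal(size=(4, 4, 8))                       # stand-in for f_e(x_ps), (H', W', D)
f_pt = f_ps + rng.normal(scale=0.1, size=f_ps.shape)    # stand-in for f_e(x_pt)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True                                   # pasted region at feature resolution

region_loss = mmd2(f_ps[mask], f_pt[mask])                      # soft-paste region alignment
global_loss = mmd2(f_ps.reshape(-1, 8), f_pt.reshape(-1, 8))    # global feature alignment
```

The region term compares only the feature vectors inside the pasted regions, while the global term compares all spatial positions.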
[0151] The teacher and student models can be trained simultaneously, constructing a total loss function that includes all the aforementioned loss functions:
[0152]
[0153] Where λ_feature is a hyperparameter used to balance the different losses; for example, it is set to 0.005.
[0154] Figure 8 schematically shows the detailed implementation process of the method for constructing an image semantic segmentation model according to embodiments of the present disclosure.
[0155] As shown in Figure 8, candidate regions to be pasted are determined in the source domain template image. The source domain template image and the source domain image are fused using the dual soft paste (DSP) algorithm to obtain a source domain mixed image, and the source domain template image and the target domain image are fused using the DSP algorithm to obtain a target domain mixed image. The source domain image, the source domain mixed image, and the target domain mixed image are input into the student network (corresponding to the student model), and the target domain image is input into the teacher network (corresponding to the teacher model). By simultaneously training the teacher network and the student network, the probability maps of the target domain fusion images in the teacher model and the student model are aligned based on weighted cross-entropy loss to achieve output-level alignment, and the feature maps that the student model extracts from the source domain mixed image and the target domain mixed image are aligned based on a weighted maximum mean discrepancy (MMD) loss to achieve feature-level alignment.
[0156] A second exemplary embodiment of this disclosure provides a method for image processing.
[0157] Figure 9 schematically shows a flowchart of an image processing method according to an embodiment of the present disclosure.
[0158] As shown in Figure 9, the image processing method provided in this embodiment includes the following operations: S91, S92 and S93.
[0159] In operation S91, a target image to be processed, which characterizes a real scene, is acquired.
[0160] The target image can be a frame from a video or a standalone image to be processed. For example, in an autonomous driving application scenario, scene analysis is performed based on images or video captured by the autonomous vehicle during its operation. In this case, the target image is a specific frame, from the video captured by the autonomous vehicle, that needs to be analyzed.
[0161] In operation S92, the target image is input into a pre-constructed semantic segmentation model, and the semantic segmentation result of the target image is output by the semantic segmentation model.
[0162] The semantic segmentation model described above is constructed using the method for constructing an image semantic segmentation model disclosed herein. For example, the semantic segmentation model can be pre-constructed using operations S21 to S24 described in the first embodiment. The constructed semantic segmentation model is stored in the image processing device. When performing operations S91 to S92 multiple times, it is not necessary to construct the semantic segmentation model each time; the target image can be directly input into the stored semantic segmentation model for processing.
[0163] In the above scenario, a video frame image to be analyzed from the video captured by the autonomous vehicle is input into a pre-constructed semantic segmentation model. Since the semantic segmentation model has good prediction accuracy for real-world scenarios, the semantic segmentation result of the video frame image output by the semantic segmentation model can serve as the basis for subsequent analysis.
[0164] In operation S93, the target image is processed based on the semantic segmentation results.
[0165] In the above scenario, scene analysis can be performed on the video frame image based on the semantic segmentation results, such as security analysis.
[0166] Operation S93, which processes the target image based on the semantic segmentation result, can be extended to many practical applications, such as image / video captioning, image-to-image translation, video content analysis, scene understanding, and autonomous driving, with the semantic segmentation result serving as the basis for the subsequent application.
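Operations S92 and S93 can be sketched as follows, under stated assumptions. The model is treated as a callable that returns per-class scores of shape (num_classes, H, W); `toy_model` is a hypothetical stand-in for a stored semantic segmentation model, and the per-class pixel fraction is only one illustrative example of processing based on the segmentation result.

```python
import numpy as np


def segment(image, model):
    """S92: run the model and take the per-pixel argmax over class scores."""
    scores = model(image)  # assumed shape: (num_classes, H, W)
    return np.argmax(scores, axis=0)


def class_pixel_fractions(label_map, num_classes):
    """S93 (illustrative): share of pixels assigned to each class."""
    counts = np.bincount(label_map.ravel(), minlength=num_classes)
    return counts / label_map.size


def process_frame(image, model, num_classes):
    """Run S92 then S93 on one target image (S91 is assumed already done)."""
    label_map = segment(image, model)
    return label_map, class_pixel_fractions(label_map, num_classes)


def toy_model(image):
    """Hypothetical stand-in: class 1 on the left half, class 0 on the right."""
    h, w = image.shape[:2]
    scores = np.zeros((2, h, w))
    scores[1, :, : w // 2] = 1.0
    scores[0, :, w // 2:] = 1.0
    return scores
```

Since the model is built once and stored, `process_frame` can be called repeatedly on successive video frames without rebuilding it.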
[0167] A third exemplary embodiment of this disclosure provides an apparatus for constructing an image semantic segmentation model.
[0168] Figure 10 schematically shows a structural block diagram of an apparatus for constructing an image semantic segmentation model according to an embodiment of the present disclosure.
[0169] As shown in Figure 10, the apparatus 1000 for constructing an image semantic segmentation model provided in this embodiment includes: a candidate region image determination module 1001, an image fusion module 1002, a model input module 1003, and a model training module 1004.
[0170] The above-mentioned candidate region image determination module 1001 is used to determine the candidate region image to be pasted in the source domain image. The source domain image is used to represent the virtual scene. The candidate region image to be pasted simultaneously includes categories with high frequency of occurrence and long-tail categories with low frequency of occurrence.
[0171] The image fusion module 1002 is used to, for a randomly selected source domain image x_s and a randomly selected target domain image x_t representing a real scene, fuse the candidate region image to be pasted with the current source domain image x_s and with the current target domain image x_t respectively, based on a preset transparency parameter, to obtain a current source domain mixed image x_ps and a current target domain mixed image x_pt. The statistical distribution of the real scene differs from that of the virtual scene.
[0172] The aforementioned model input module 1003 is used to input multiple sets of matching source domain images, source domain mixed images and target domain mixed images into the first machine learning model, and at the same time input the corresponding target domain images into the second machine learning model for training. The second machine learning model has the same model structure as the first machine learning model but different parameters, and the parameters of the second machine learning model are the exponential moving average of the parameters of the first machine learning model.
[0173] The aforementioned model training module 1004 is used to simultaneously train the first machine learning model and the second machine learning model, so that feature-level alignment and output-level alignment are achieved between the source domain and the target domain, thereby obtaining the trained semantic segmentation model.
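The relationship between the two models' parameters can be sketched as follows. Parameters are represented as a dictionary of NumPy arrays, and the decay value `alpha` = 0.99 is an assumption made for illustration; only the exponential-moving-average form is taken from the text.

```python
import numpy as np


def ema_update(teacher_params, student_params, alpha=0.99):
    """Return second-model (teacher) parameters updated as an exponential
    moving average of the first-model (student) parameters.

    alpha is an assumed decay value; values near 1 make the teacher change slowly.
    """
    return {
        name: alpha * teacher_params[name] + (1.0 - alpha) * student_params[name]
        for name in teacher_params
    }
```

In such a scheme the first model is updated by gradient backpropagation, and `ema_update` is then applied once per training step, so the second model receives no gradients of its own.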
[0174] A fourth exemplary embodiment of this disclosure provides an apparatus for image processing.
[0175] Figure 11 schematically shows a structural block diagram of an image processing apparatus according to an embodiment of the present disclosure.
[0176] As shown in Figure 11, the image processing apparatus 1100 provided in this embodiment includes: an image acquisition module 1101, a semantic segmentation module 1102, and an image processing module 1103. The image processing apparatus 1100 stores a pre-built semantic segmentation model, or can communicate with a device that builds a semantic segmentation model in order to invoke the built model for image semantic segmentation processing.
[0177] The image acquisition module 1101 described above is used to acquire the target image to be processed, which is used to characterize the actual scene.
[0178] The semantic segmentation module 1102 is used to input the target image into a pre-constructed semantic segmentation model, and the semantic segmentation model outputs the semantic segmentation result of the target image. The semantic segmentation model is constructed using the method for constructing an image semantic segmentation model as described above or using the apparatus for constructing an image semantic segmentation model as described above.
[0179] The image processing module 1103 is used to process the target image based on the semantic segmentation result.
[0180] In the third embodiment described above, any plurality of the candidate region image determination module 1001, image fusion module 1002, model input module 1003, and model training module 1004 can be combined into one module, or any one of these modules can be split into multiple modules. Alternatively, at least part of the functionality of one or more of these modules can be combined with at least part of the functionality of other modules and implemented in one module. At least one of the candidate region image determination module 1001, image fusion module 1002, model input module 1003, and model training module 1004 can be at least partially implemented as hardware circuitry, such as a field-programmable gate array (FPGA), a programmable logic array (PLA), a system-on-a-chip, a system-on-a-substrate, a system-on-package, an application-specific integrated circuit (ASIC), or any other reasonable means of integrating or packaging the circuitry, or implemented in software, hardware, or firmware, or in any appropriate combination of any of these three implementation methods. Alternatively, at least one of the candidate region image determination module 1001, image fusion module 1002, model input module 1003, and model training module 1004 can be at least partially implemented as a computer program module, which can perform corresponding functions when the computer program module is run.
[0181] In the fourth embodiment described above, any plurality of the image acquisition module 1101, semantic segmentation module 1102, and image processing module 1103 can be combined into one module, or any one of these modules can be split into multiple modules. Alternatively, at least part of the functionality of one or more of these modules can be combined with at least part of the functionality of other modules and implemented in one module. At least one of the image acquisition module 1101, semantic segmentation module 1102, and image processing module 1103 can be at least partially implemented as hardware circuitry, such as a field-programmable gate array (FPGA), a programmable logic array (PLA), a system-on-a-chip, a system-on-a-substrate, a system-on-package, an application-specific integrated circuit (ASIC), or any other reasonable means of integrating or packaging the circuitry, or implemented in software, hardware, or firmware, or in any appropriate combination of any of these three implementation methods. Alternatively, at least one of the image acquisition module 1101, semantic segmentation module 1102, and image processing module 1103 can be at least partially implemented as a computer program module, which can perform corresponding functions when the computer program module is run.
[0182] The fifth exemplary embodiment of this disclosure provides an electronic device.
[0183] Figure 12 schematically shows a structural block diagram of an electronic device provided in an embodiment of the present disclosure.
[0184] As shown in Figure 12, the electronic device 1200 provided in this embodiment includes a processor 1201, a communication interface 1202, a memory 1203, and a communication bus 1204. The processor 1201, the communication interface 1202, and the memory 1203 communicate with each other through the communication bus 1204. The memory 1203 is used to store computer programs. When the processor 1201 executes the program stored in the memory, it implements the method for constructing an image semantic segmentation model or the image processing method described above.
[0185] A sixth exemplary embodiment of this disclosure also provides a computer-readable storage medium. The computer-readable storage medium stores a computer program that, when executed by a processor, implements the method for constructing an image semantic segmentation model or the image processing method as described above.
[0186] The computer-readable storage medium may be included in the device / apparatus described in the above embodiments; or it may exist independently and not assembled into the device / apparatus. The computer-readable storage medium carries one or more programs that, when executed, implement the method according to the embodiments of this disclosure.
[0187] According to embodiments of this disclosure, the computer-readable storage medium can be a non-volatile computer-readable storage medium, such as including, but not limited to: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof. In this disclosure, the computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
[0188] It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0189] The above description is merely a specific embodiment of this disclosure, enabling those skilled in the art to understand or implement it. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this disclosure. Therefore, this disclosure is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features claimed herein.
Claims
1. A method for constructing an image semantic segmentation model, characterized in that it comprises: determining, in a source domain image used to characterize a virtual scene, a candidate region image to be pasted, the candidate region image to be pasted simultaneously including a category with a high frequency of occurrence and a long-tail category with a low frequency of occurrence; for a randomly selected source domain image and a randomly selected target domain image representing a real scene, fusing, based on a preset transparency parameter, the candidate region image to be pasted with the current source domain image and the current target domain image respectively to obtain a current source domain mixed image and a current target domain mixed image, wherein the statistical distribution of the real scene differs from that of the virtual scene; inputting multiple sets of matching source domain images, source domain mixed images, and target domain mixed images into a first machine learning model, and inputting the corresponding target domain images into a second machine learning model for training, wherein the second machine learning model has the same model structure as the first machine learning model but different parameters, and the parameters of the second machine learning model are the exponential moving average of the parameters of the first machine learning model; and simultaneously training the first machine learning model and the second machine learning model to achieve feature-level alignment and output-level alignment between the source domain and the target domain, thereby obtaining the trained semantic segmentation model.
2. The method according to claim 1, characterized in that determining the candidate region image to be pasted in the source domain image used to characterize the virtual scene comprises: randomly selecting a source domain image as a source domain template image, and determining, as a candidate image region, the image region corresponding to the category that accounts for a preset proportion of the total number of pixels in the source domain template image; determining, based on the pixel category distribution of all source domain images in the source domain, the k long-tail categories with the lowest frequency of occurrence, where k ≥ 2 and k is an integer; selecting a preset number of categories from the k long-tail categories as specified long-tail categories; selecting, from a source domain image containing a specified long-tail category, the long-tail image region corresponding to the specified long-tail category; and merging the source domain image containing only the candidate image region with the source domain image containing only the long-tail image region to obtain the candidate region image to be pasted.
3. The method according to claim 1, characterized in that fusing the candidate region image to be pasted with the current source domain image and the current target domain image respectively, based on a preset transparency parameter, to obtain a current source domain mixed image and a current target domain mixed image comprises: applying transparency weighting to the candidate region image to be pasted according to a preset transparency parameter β, where 0 < β < 1; applying transparency weighting of 1-β to the regions in the current source domain image that correspond to the positions of the candidate region image to be pasted, while the transparency of the remaining regions is 1; applying transparency weighting of 1-β to the regions in the current target domain image that correspond to the positions of the candidate region image to be pasted, while the transparency of the remaining regions is 1; fusing the transparency-weighted candidate region image to be pasted with the transparency-weighted current source domain image to obtain a source domain mixed image for the current source domain image; and fusing the transparency-weighted candidate region image to be pasted with the transparency-weighted current target domain image to obtain a target domain mixed image for the current target domain image.
4. The method according to claim 1, characterized in that simultaneously training the first machine learning model and the second machine learning model to achieve feature-level alignment and output-level alignment between the source domain and the target domain, thereby obtaining the trained semantic segmentation model, comprises: aligning, based on a weighted cross-entropy loss, the probability maps of the target domain mixed image in the first machine learning model and the second machine learning model to achieve output-level alignment; and aligning, based on a weighted maximum mean discrepancy (MMD) loss, the feature maps that the first machine learning model extracts from the source domain mixed image and the target domain mixed image to achieve feature-level alignment; wherein, when the first machine learning model and the second machine learning model are trained simultaneously, the parameters of the first machine learning model are optimized through gradient backpropagation while the parameters of the second machine learning model are updated using an exponential moving average at each training step.
5. The method according to claim 4, characterized in that aligning the probability maps of the target domain mixed image in the first machine learning model and the second machine learning model based on the weighted cross-entropy loss to achieve output-level alignment comprises: for any one set of images among the multiple sets of source domain images, source domain mixed images, target domain images, and target domain mixed images, inputting the current source domain image, the current source domain mixed image, and the current target domain mixed image into the first machine learning model for training, and inputting the corresponding current target domain image into the second machine learning model for training; determining a pseudo-label for the current target domain image based on the second machine learning model; determining, based on the first machine learning model, the predicted semantic map of the current target domain mixed image, the predicted semantic map of the current source domain image, and the predicted semantic map of the current source domain mixed image; determining a cross-entropy-based semantic segmentation loss from the ground truth label of the current source domain image and the predicted semantic map of the current source domain image; determining a cross-entropy-based soft-paste semantic segmentation loss from the ground truth label of the current source domain mixed image and the predicted semantic map of the current source domain mixed image; and determining a prediction consistency loss from the ground truth label of the current source domain mixed image, the predicted semantic map of the current target domain mixed image, and the pseudo-label of the current target domain image; wherein the second machine learning model and the first machine learning model are trained multiple times so that the prediction consistency loss converges, thereby achieving output-level alignment.
6. The method according to claim 5, characterized in that the any one set of images among the multiple sets of source domain images, source domain mixed images, target domain images, and target domain mixed images comprises: a set of images randomly selected, as the input for the current training, from among the multiple sets of matching source domain images, source domain mixed images, target domain images, and target domain mixed images, the set of images including a matching source domain image, source domain mixed image, target domain image, and target domain mixed image; wherein, during each training session, the number of training iterations is counted by a counter, and the selection of input images stops when the counter reaches a preset number.
7. The method according to claim 5, characterized in that aligning, based on the weighted maximum mean discrepancy (MMD) loss, the feature maps that the first machine learning model extracts from the source domain mixed image and the target domain mixed image to achieve feature-level alignment comprises: determining a soft-paste image region alignment loss and a global feature alignment loss based on the features extracted from the current source domain mixed image and the current target domain mixed image; wherein the second machine learning model and the first machine learning model are trained multiple times so that the soft-paste image region alignment loss and the global feature alignment loss converge, thereby achieving feature-level alignment.
8. An image processing method, characterized in that it comprises: acquiring a target image to be processed, the target image characterizing a real scene; inputting the target image into a pre-constructed semantic segmentation model, and outputting, by the semantic segmentation model, the semantic segmentation result of the target image, wherein the semantic segmentation model is constructed using the method described in any one of claims 1-7; and processing the target image based on the semantic segmentation result.
9. An apparatus for constructing an image semantic segmentation model, characterized in that it comprises: a candidate region image determination module, used to determine a candidate region image to be pasted in a source domain image, wherein the source domain image is used to represent a virtual scene, and the candidate region image to be pasted simultaneously contains a category with a high frequency of occurrence and a long-tail category with a low frequency of occurrence; an image fusion module, used to, for a randomly selected source domain image and a randomly selected target domain image representing a real scene, fuse the candidate region image to be pasted with the current source domain image and the current target domain image respectively, based on a preset transparency parameter, to obtain a current source domain mixed image and a current target domain mixed image, wherein the statistical distribution of the real scene differs from that of the virtual scene; a model input module, used to input multiple sets of matching source domain images, source domain mixed images, and target domain mixed images into a first machine learning model, and at the same time input the corresponding target domain images into a second machine learning model for training, wherein the second machine learning model has the same model structure as the first machine learning model but different parameters, and the parameters of the second machine learning model are the exponential moving average of the parameters of the first machine learning model; and a model training module, used to train the first machine learning model and the second machine learning model simultaneously, so that feature-level alignment and output-level alignment are achieved between the source domain and the target domain, thereby obtaining the trained semantic segmentation model.
10. An image processing apparatus, characterized in that it comprises: an image acquisition module, used to acquire a target image to be processed, the target image characterizing a real scene; a semantic segmentation module, used to input the target image into a pre-constructed semantic segmentation model and to output the semantic segmentation result of the target image from the semantic segmentation model, wherein the semantic segmentation model is constructed using the method described in any one of claims 1-7 or by the apparatus described in claim 9; and an image processing module, used to process the target image based on the semantic segmentation result.
11. An electronic device, characterized in that it comprises a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus; the memory is used to store a computer program; and the processor, when executing the program stored in the memory, implements the method of any one of claims 1-8.
12. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the method of any one of claims 1-8.