Network training method and device, electronic equipment and storage medium
By masking the training sample images and enhancing the knowledge distillation process with a decoder, the method addresses the poor performance of the student model, enabling effective detection on partially masked images and improving the learning and detection capabilities of the student network.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHENZHEN SENSETIME TECH CO LTD
- Filing Date
- 2023-01-29
- Publication Date
- 2026-06-12
AI Technical Summary
In existing dense visual detection tasks, the student model performs poorly, mainly because existing knowledge distillation methods rely on a simple feature-map imitation process that fails to fully exploit the learning ability of the student model.
By masking the training sample images and enhancing the knowledge distillation process using a decoder, the learning ability of the student network is improved. The specific method includes obtaining masked sample images and extracting features through a first feature extraction network, and then performing feature recovery processing in conjunction with a decoder until the training results meet the preset requirements.
It improves the detection performance of the student network under partial occlusion conditions, effectively identifies defective images, and enhances the learning and detection capabilities of the student model.
Figure CN116050498B_ABST (abstract figure; image not reproduced)
Description
Technical Field
[0001] This disclosure relates to the field of artificial intelligence technology, and more specifically, to a network training method, apparatus, electronic device, and storage medium.
Background Technology
[0002] Knowledge distillation, also known as teacher-student learning, is an effective technique for model compression and model accuracy improvement. Through knowledge distillation, knowledge can be transferred from a larger teacher model to a smaller student model that is more deployable, thereby improving the performance of the student model.
[0003] Research has found that for dense visual detection tasks, which are more sensitive to image localization information, current knowledge distillation methods mainly rely on imitating teacher feature maps. However, this feature map-based knowledge distillation typically inputs the complete image into the student network and then performs pixel-by-pixel spatial imitation. This imitation process is relatively simple, resulting in insufficient exploitation of the student model's learning ability and poor performance.
Summary of the Invention
[0004] This disclosure provides at least one network training method, apparatus, electronic device, and storage medium that can improve the performance of student networks.
[0005] The disclosed embodiments provide a network training method, including:
[0006] Obtain training sample images and corresponding mask sample images; wherein, the mask sample images are generated from the training sample images and the target mask image;
[0007] The mask sample image is input into the first network, and features are extracted from the mask sample image based on the first feature extraction network of the first network to obtain a first multi-level feature map corresponding to the mask sample image.
[0008] The training sample image is input into the second network, and features are extracted from the training sample image based on the second feature extraction network of the second network to obtain a second multi-level feature map corresponding to the training sample image; the size of the second network is larger than the size of the first network;
[0009] Based on the decoder and the mask image corresponding to each feature map in the first multi-level feature map, feature restoration processing is performed on the first multi-level feature map to obtain the restored first multi-level feature map; wherein, the mask image corresponding to each feature map in the first multi-level feature map is obtained by scaling the target mask image respectively.
[0010] The first network is trained based on the recovered first multi-level feature map and the second multi-level feature map, and the above steps are repeated until the training result of the first network meets the preset requirements, thus obtaining the trained first network.
[0011] In this embodiment, the first network is also called the student network, and the second network is also called the teacher network. During feature-based knowledge distillation, the training sample images are masked, and the features corresponding to the masked areas are recovered by imitating the second multi-level feature map output by the second network, which increases the difficulty of feature imitation. That is, without changing the network structure of the first network, the distillation process is enhanced by a separate decoder, thereby improving the learning ability of the first network. As a result, even if the input image to be detected is partially covered, the trained first network can still perform detection and recognition, which improves its detection performance.
[0012] In one possible implementation, the first feature extraction network includes a pyramid-structured feature extraction module. The step of inputting the mask sample image into the first network and extracting features from the mask sample image based on the first feature extraction network of the first network to obtain a first multi-level feature map corresponding to the mask sample image includes:
[0013] The masked sample image is input into the first network, and features are extracted from the masked sample image based on the feature extraction module to obtain intermediate multi-level feature maps, and the intermediate multi-level feature maps are used as the first multi-level feature map; wherein, the intermediate multi-level feature map includes multiple intermediate feature maps of different sizes.
[0014] In this embodiment of the disclosure, since the feature extraction module is designed in a pyramid structure, the extracted first multi-level feature map includes multiple intermediate feature maps of different sizes, which can be applied to various dense visual detection tasks, such as object detection, instance segmentation and semantic segmentation.
[0015] In one possible implementation, the first feature extraction network includes a feature extraction module with a pyramid structure and multiple mask convolution modules; the step of inputting the mask sample image into the first network and performing feature extraction on the mask sample image based on the first feature extraction network of the first network to obtain a first multi-level feature map corresponding to the mask sample image includes:
[0016] The mask sample image and the target mask image are input into the first network, and the feature extraction module is used to extract features from the mask sample image to obtain intermediate multi-level feature maps, wherein the intermediate multi-level feature maps include multiple intermediate feature maps of different sizes.
[0017] Based on the multiple mask convolutional modules and the target mask image, each intermediate feature map in the intermediate multi-level feature map is masked, and the masked intermediate multi-level feature map is used as the first multi-level feature map.
[0018] In this embodiment, in addition to improving the applicability of the first network, by masking each intermediate feature map, the confusing feature interaction between the masked area and the visible area can be avoided. That is, since the backbone network for feature extraction in the first network uses masked convolution, the image blocks that are masked during the convolution process can be prevented from being affected by other visible image blocks, which helps to further improve the performance of the first network.
[0019] In one possible implementation, the masking process performed on each intermediate feature map in the intermediate multi-level feature maps based on the plurality of mask convolutional modules and the target mask image includes:
[0020] Based on the size of each intermediate feature map in the intermediate multi-level feature maps, the target mask image is scaled to obtain a mask image corresponding to each intermediate feature map in the intermediate multi-level feature maps.
[0021] For each intermediate feature map, the intermediate feature map and the corresponding mask image are multiplied element-wise in the mask convolution module to obtain the masked intermediate feature map, and the masked multi-level feature map is obtained based on each masked intermediate feature map.
[0022] In this embodiment, the intermediate feature map and its corresponding mask image are multiplied element-wise (dot multiplication) in the mask convolution module to obtain the masked intermediate feature map, thereby improving the efficiency of mask processing of the intermediate feature map.
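The mask scaling and element-wise multiplication described above can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the function names `scale_mask` and `mask_feature_maps`, the nearest-neighbour scaling, and the convention that 1 marks a masked position are all assumptions made for illustration.

```python
import numpy as np

def scale_mask(mask, out_h, out_w):
    """Nearest-neighbour resize of a 2-D binary mask to (out_h, out_w)."""
    in_h, in_w = mask.shape
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return mask[rows[:, None], cols[None, :]]

def mask_feature_maps(feature_maps, target_mask):
    """Zero out masked positions in every (C, H, W) feature map.

    Assumed convention: target_mask holds 1 for masked and 0 for visible.
    """
    masked = []
    for fmap in feature_maps:
        _, h, w = fmap.shape
        level_mask = scale_mask(target_mask, h, w)  # mask at this level's size
        keep = 1.0 - level_mask                     # 1 where visible
        masked.append(fmap * keep[None, :, :])      # element-wise product
    return masked

# Hypothetical three-level pyramid with a 4x4 patch-level target mask.
target_mask = np.zeros((4, 4))
target_mask[:2, :2] = 1.0
pyramid = [np.ones((2, 8, 8)), np.ones((2, 4, 4)), np.ones((2, 2, 2))]
masked_pyramid = mask_feature_maps(pyramid, target_mask)
```

Scaling one target mask per level keeps the masked region consistent across the pyramid, which is what lets the same positions be recovered later.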
[0023] In one possible implementation, the decoder includes a spatial alignment module, a decoding module, and a spatial recovery module. The feature recovery processing of the first multi-level feature map based on the decoder and the mask image corresponding to each feature map in the first multi-level feature map includes:
[0024] Based on the spatial alignment module, the feature maps of different sizes in the first multi-level feature map are aligned to the same spatial resolution, so that the size of each feature map in the first multi-level feature map is aligned, resulting in a spatially aligned multi-level feature map.
[0025] Based on the mask images corresponding to the spatially aligned multi-level feature maps, the mask regions in the spatially aligned multi-level feature maps are replaced with mask markers to obtain multi-level feature maps with mask markers. Then, based on the decoding module, feature prediction processing is performed on the multi-level feature maps with mask markers to obtain multi-level feature maps with feature prediction processing.
[0026] Based on the spatial restoration module, the multi-level feature map of the feature prediction processing with the same spatial resolution is restored to the original size multi-level feature map, thus obtaining the first multi-level feature map after restoration processing.
[0027] In this embodiment, by spatially aligning the first multi-level feature map and then performing feature recovery, and then restoring the spatially aligned feature map to its original size, not only can feature recovery be achieved, but the size of the first multi-level feature map can also be guaranteed, which helps to determine the subsequent feature recovery loss.
[0028] In one possible implementation, aligning the feature maps of different sizes in the first multi-level feature map to the same spatial resolution based on the spatial alignment module includes:
[0029] For each feature map in the first multi-level feature map, the size of the feature map is compared with the size of the target image;
[0030] When the size of the feature map is larger than the size of the target image, the feature map is subjected to dimensionality reduction processing so that its size is the same as that of the target image; or
[0031] When the size of the feature map is smaller than the size of the target image, nearest neighbor interpolation is used to upsample the feature map so that the size of the feature map is consistent with the size of the target image.
[0032] In this embodiment of the disclosure, each feature map in the first multi-level feature map is aligned to the target image size. In the implementation process, different methods are used for feature maps larger than the target image and feature maps smaller than the target image. In this way, not only can the alignment between each feature map be guaranteed, but the efficiency of spatial alignment can also be improved.
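As a rough illustration of this alignment rule, the sketch below reduces larger feature maps and upsamples smaller ones by nearest-neighbour repetition. The disclosure does not specify how the size reduction is done, so average pooling is an assumption here, as are the function name `align_to_target` and the restriction to integer scale factors.

```python
import numpy as np

def align_to_target(fmap, target_hw):
    """Align a (C, H, W) feature map to the target spatial size.

    Assumes H and W are integer multiples or divisors of the target size.
    """
    c, h, w = fmap.shape
    th, tw = target_hw
    if h > th:
        # Larger than target: reduce size (average pooling, an assumption).
        fh, fw = h // th, w // tw
        return fmap.reshape(c, th, fh, tw, fw).mean(axis=(2, 4))
    if h < th:
        # Smaller than target: nearest-neighbour upsampling.
        fh, fw = th // h, tw // w
        return fmap.repeat(fh, axis=1).repeat(fw, axis=2)
    return fmap

# Hypothetical pyramid aligned to a 4x4 target resolution.
pyramid = [np.ones((1, 8, 8)),
           np.ones((1, 4, 4)),
           np.arange(4, dtype=float).reshape(1, 2, 2)]
aligned = [align_to_target(f, (4, 4)) for f in pyramid]
```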
[0033] In one possible implementation, before aligning the feature maps of different sizes in the first multi-level feature maps to the same spatial resolution based on the spatial alignment module, the method further includes:
[0034] Align the number of channels in the first multi-level feature map with the number of channels in the second multi-level feature map; and / or perform layer normalization on the first multi-level feature map and the second multi-level feature map.
[0035] In this embodiment of the disclosure, before aligning each feature map in the first multi-level feature map, the number of channels in the first multi-level feature map is aligned with the number of channels in the second multi-level feature map; and / or, layer normalization processing is performed on the first multi-level feature map and the second multi-level feature map, which helps to improve the accuracy and efficiency of spatial alignment.
[0036] In one possible implementation, replacing, based on the mask images corresponding to the spatially aligned multi-level feature maps, the mask regions in the spatially aligned multi-level feature maps with mask markers to obtain multi-level feature maps with mask markers includes:
[0037] For each spatially aligned feature map, the spatially aligned feature map is expanded to obtain a one-dimensional expanded feature map, and based on the mask image corresponding to the spatially aligned feature map, the mask region that needs to be replaced in the expanded feature map is determined.
[0038] The mask region is replaced with the mask mark to obtain an expanded feature map with the mask mark, and based on each expanded feature map with the mask mark, a multi-level feature map with the mask mark is obtained.
[0039] In this embodiment of the disclosure, by expanding the spatially aligned feature map, it is helpful to replace the mask mark in the mask region, which in turn helps to predict the features corresponding to the mask region in the subsequent process.
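The expansion and replacement steps can be sketched as follows. This is an illustrative assumption rather than the disclosed implementation: the feature map is flattened into a sequence of per-position vectors, and `mask_token` stands in for the mask marker, which would typically be a learnable vector in a real decoder.

```python
import numpy as np

def insert_mask_tokens(fmap, level_mask, mask_token):
    """Flatten a (C, H, W) map to (H*W, C) and overwrite masked positions.

    level_mask: (H, W) array, 1 where masked (assumed convention).
    mask_token: (C,) vector standing in for the mask marker.
    """
    c, h, w = fmap.shape
    tokens = fmap.reshape(c, h * w).T.copy()        # one row per spatial position
    flat_mask = level_mask.reshape(h * w).astype(bool)
    tokens[flat_mask] = mask_token                  # replace the masked region
    return tokens, flat_mask

# Hypothetical 2-channel, 2x2 feature map with the top-left position masked.
fmap = np.arange(8, dtype=float).reshape(2, 2, 2)
level_mask = np.array([[1, 0], [0, 0]])
tokens, flat_mask = insert_mask_tokens(fmap, level_mask, np.array([-1.0, -1.0]))
```

Working on the flattened sequence turns the replacement into a single boolean-indexed assignment, and the returned `flat_mask` records which positions the decoding module is expected to predict.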
[0040] In one possible implementation, after obtaining the expanded feature map with the mask markers, the method further includes:
[0041] Add cosine absolute position encoding to the expanded feature map with the mask mark, and adaptively adjust the expanded feature map with the mask mark by interpolation based on a preset absolute scale to obtain the adjusted expanded feature map.
[0042] The process of obtaining the multi-level feature map with masked markers based on each expanded feature map bearing the masked markers includes:
[0043] Based on the various adjusted expanded feature maps, the multi-level feature map with mask markings is obtained.
[0044] In this embodiment of the disclosure, by adding cosine absolute position encoding to the expanded feature map with the mask mark, it is convenient to determine and replace the mask region, and it is also convenient to obtain the feature map before expansion based on the expanded map after feature prediction.
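For illustration only, the sketch below adds a fixed sine-cosine absolute position encoding to a flattened feature map. The disclosure only names a "cosine absolute position encoding", so the exact formula, the base of 10000, and the function name `sincos_position_encoding` are assumptions; the interpolation-based adjustment is not modelled.

```python
import numpy as np

def sincos_position_encoding(num_tokens, dim):
    """Fixed sine-cosine encoding of shape (num_tokens, dim); dim must be even."""
    positions = np.arange(num_tokens)[:, None]                # (N, 1)
    freqs = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))   # (dim / 2,)
    angles = positions * freqs[None, :]                       # (N, dim / 2)
    enc = np.zeros((num_tokens, dim))
    enc[:, 0::2] = np.sin(angles)   # even channels
    enc[:, 1::2] = np.cos(angles)   # odd channels
    return enc

# Hypothetical flattened feature map: 16 positions, 8 channels.
tokens = np.zeros((16, 8))
encoding = sincos_position_encoding(16, 8)
tokens_with_position = tokens + encoding
```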
[0045] In one possible implementation, training the first network based on the recovered first multi-level feature map and the second multi-level feature map includes:
[0046] The feature recovery loss between the recovered first multi-level feature map and the second multi-level feature map is determined, and the parameters of the first network are adjusted based on the feature recovery loss.
[0047] In this embodiment of the disclosure, the feature recovery loss between the recovered first multi-level feature map and the second multi-level feature map can guide the first network to further imitate the second multi-level feature map, thereby improving the training efficiency.
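A minimal sketch of such a loss is given below. The disclosure does not state which distance is used, so the mean squared error summed over pyramid levels is an assumption, as is the function name `feature_recovery_loss`.

```python
import numpy as np

def feature_recovery_loss(recovered_student, teacher):
    """Sum over levels of the mean squared error between matching feature maps."""
    per_level = [np.mean((s - t) ** 2) for s, t in zip(recovered_student, teacher)]
    return float(np.sum(per_level))

# Hypothetical two-level example.
teacher_maps = [np.zeros((2, 4, 4)), np.zeros((2, 2, 2))]
student_maps = [np.ones((2, 4, 4)), np.ones((2, 2, 2))]
loss_value = feature_recovery_loss(student_maps, teacher_maps)
```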
[0048] In one possible implementation, the method further includes:
[0049] Based on the first multi-level feature map after the recovery process, determine the task loss of the first network; and / or,
[0050] The global context module determines the first global relation and the second global relation corresponding to the first multi-level feature map and the second multi-level feature map after the recovery process, respectively, and determines the global loss between the first global relation and the second global relation.
[0051] The adjustment of the parameters of the first network based on the feature recovery loss includes:
[0052] The parameters of the first network are adjusted based on the feature recovery loss, the task loss, and / or the global loss.
[0053] In this embodiment of the disclosure, in addition to the feature recovery loss, the parameters of the first network are also adjusted according to the task loss and / or the global loss. In this way, the accuracy of adjusting the parameters of the first network can be improved based on multiple losses, which can further improve the training efficiency and performance of the first network.
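One simple way to combine these losses is a weighted sum in which the task loss and the global loss are optional, mirroring the "and / or" wording. The weights `w_task` and `w_global` and the function name `total_loss` are hypothetical and do not come from the disclosure.

```python
def total_loss(recovery_loss, task_loss=None, global_loss=None,
               w_task=1.0, w_global=1.0):
    """Weighted sum of the feature recovery loss and any optional losses."""
    loss = recovery_loss
    if task_loss is not None:
        loss += w_task * task_loss
    if global_loss is not None:
        loss += w_global * global_loss
    return loss

# Hypothetical loss values.
only_recovery = total_loss(1.0)
all_three = total_loss(1.0, task_loss=2.0, global_loss=3.0, w_global=0.5)
```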
[0054] In one possible implementation, acquiring the training sample image and the mask sample image corresponding to the training sample image includes:
[0055] The training sample image is acquired and divided into multiple non-overlapping image patches.
[0056] The target mask image is obtained by random sampling based on a preset mask ratio. The target mask image includes a mask indicator for indicating that the corresponding image patch is masked.
[0057] Based on the mask indicator in the target mask image, the training sample image is masked to obtain a mask sample image corresponding to the training sample image.
[0058] In this embodiment of the disclosure, during the masking process of the training sample image, the target mask image is obtained by random sampling, which can avoid learning the features of the mask itself during feature extraction and help improve the accuracy of feature extraction.
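The patch division, random sampling and masking steps can be sketched as follows. The function names `make_target_mask` and `apply_mask`, the patch size, and the choice of zeroing masked pixels are assumptions for illustration, not details given in the disclosure.

```python
import numpy as np

def make_target_mask(image_hw, patch, mask_ratio, rng):
    """Randomly pick patches to mask; returns a (H/patch, W/patch) grid, 1 = masked."""
    h, w = image_hw
    gh, gw = h // patch, w // patch
    n = gh * gw
    n_masked = int(round(n * mask_ratio))
    flat = np.zeros(n, dtype=np.float32)
    flat[rng.choice(n, size=n_masked, replace=False)] = 1.0
    return flat.reshape(gh, gw)

def apply_mask(image, target_mask, patch):
    """Zero out the pixels of every masked patch in a (C, H, W) image."""
    # Expand each grid cell to a patch x patch block of pixels.
    pixel_mask = np.kron(target_mask, np.ones((patch, patch), dtype=image.dtype))
    return image * (1.0 - pixel_mask)[None, :, :]

# Hypothetical 8x8 image, 2x2 patches, half of the patches masked.
rng = np.random.default_rng(0)
image = np.ones((3, 8, 8), dtype=np.float32)
target_mask = make_target_mask((8, 8), patch=2, mask_ratio=0.5, rng=rng)
masked_image = apply_mask(image, target_mask, patch=2)
```

Sampling a fresh mask for every training image means no fixed mask pattern is available for the network to learn.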
[0059] In one possible implementation, the method further includes:
[0060] The image to be detected is acquired, and an image detection task is performed on the image based on the trained first network; the image detection task includes an object detection task, a semantic segmentation task, or an instance segmentation task.
[0061] In this embodiment of the disclosure, an image detection task is performed on the image to be detected based on the trained first network, which can realize the detection of various dense visual tasks.
[0062] This disclosure provides a network training device, including:
[0063] An image acquisition module is used to acquire training sample images and mask sample images corresponding to the training sample images; wherein, the mask sample images are generated from the training sample images and the target mask images;
[0064] The first extraction module is used to input the mask sample image into the first network, and perform feature extraction on the mask sample image based on the first feature extraction network of the first network to obtain a first multi-level feature map corresponding to the mask sample image.
[0065] The second extraction module is used to input the training sample image into the second network and extract features from the training sample image based on the second feature extraction network of the second network to obtain a second multi-level feature map corresponding to the training sample image; the size of the second network is larger than the size of the first network.
[0066] The feature prediction module is used to perform feature recovery processing on the first multi-level feature map based on the decoder and the mask image corresponding to each feature map in the first multi-level feature map, to obtain the recovered first multi-level feature map; wherein, the mask image corresponding to each feature map in the first multi-level feature map is obtained by scaling the target mask image respectively.
[0067] The network training module is used to train the first network based on the recovered first multi-level feature map and the second multi-level feature map, and repeat the above steps until the training result of the first network meets the preset requirements, so as to obtain the trained first network.
[0068] In one possible implementation, the first extraction module is specifically used for:
[0069] The masked sample image is input into the first network, and features are extracted from the masked sample image based on the feature extraction module to obtain intermediate multi-level feature maps, and the intermediate multi-level feature maps are used as the first multi-level feature map; wherein, the intermediate multi-level feature map includes multiple intermediate feature maps of different sizes.
[0070] In one possible implementation, the first extraction module is specifically used for:
[0071] The mask sample image and the target mask image are input into the first network, and the feature extraction module is used to extract features from the mask sample image to obtain intermediate multi-level feature maps, wherein the intermediate multi-level feature maps include multiple intermediate feature maps of different sizes.
[0072] Based on the multiple mask convolutional modules and the target mask image, each intermediate feature map in the intermediate multi-level feature map is masked, and the masked intermediate multi-level feature map is used as the first multi-level feature map.
[0073] In one possible implementation, the first extraction module is specifically used for:
[0074] Based on the size of each intermediate feature map in the intermediate multi-level feature maps, the target mask image is scaled to obtain a mask image corresponding to each intermediate feature map in the intermediate multi-level feature maps.
[0075] For each intermediate feature map, the intermediate feature map and the corresponding mask image are multiplied element-wise in the mask convolution module to obtain the masked intermediate feature map, and the masked multi-level feature map is obtained based on each masked intermediate feature map.
[0076] In one possible implementation, the decoder includes a spatial alignment module, a decoding module, and a spatial recovery module, wherein the feature prediction module is specifically used for:
[0077] Based on the spatial alignment module, the feature maps of different sizes in the first multi-level feature map are aligned to the same spatial resolution, so that the size of each feature map in the first multi-level feature map is aligned, resulting in a spatially aligned multi-level feature map.
[0078] Based on the mask images corresponding to the spatially aligned multi-level feature maps, the mask regions in the spatially aligned multi-level feature maps are replaced with mask markers to obtain multi-level feature maps with mask markers. Then, based on the decoding module, feature prediction processing is performed on the multi-level feature maps with mask markers to obtain multi-level feature maps with feature prediction processing.
[0079] Based on the spatial restoration module, the multi-level feature map of the feature prediction processing with the same spatial resolution is restored to the original size multi-level feature map, thus obtaining the first multi-level feature map after restoration processing.
[0080] In one possible implementation, the feature prediction module is specifically used for:
[0081] For each feature map in the first multi-level feature map, the size of the feature map is compared with the size of the target image;
[0082] When the size of the feature map is larger than the size of the target image, the feature map is subjected to dimensionality reduction processing so that its size is the same as that of the target image; or
[0083] When the size of the feature map is smaller than the size of the target image, nearest neighbor interpolation is used to upsample the feature map so that the size of the feature map is consistent with the size of the target image.
[0084] In one possible implementation, the feature prediction module is further configured to:
[0085] Align the number of channels in the first multi-level feature map with the number of channels in the second multi-level feature map; and / or perform layer normalization on the first multi-level feature map and the second multi-level feature map.
[0086] In one possible implementation, the feature prediction module is specifically used for:
[0087] For each spatially aligned feature map, the spatially aligned feature map is expanded to obtain a one-dimensional expanded feature map, and based on the mask image corresponding to the spatially aligned feature map, the mask region that needs to be replaced in the expanded feature map is determined.
[0088] The mask region is replaced with the mask mark to obtain an expanded feature map with the mask mark, and based on each expanded feature map with the mask mark, a multi-level feature map with the mask mark is obtained.
[0089] In one possible implementation, the feature prediction module is further configured to:
[0090] Add cosine absolute position encoding to the expanded feature map with the mask mark, and adaptively adjust the expanded feature map with the mask mark by interpolation based on a preset absolute scale to obtain the adjusted expanded feature map.
[0091] Based on the various adjusted expanded feature maps, the multi-level feature map with mask markings is obtained.
[0092] In one possible implementation, the network training module is specifically used for:
[0093] The feature recovery loss between the recovered first multi-level feature map and the second multi-level feature map is determined, and the parameters of the first network are adjusted based on the feature recovery loss.
[0094] In one possible implementation, the network training module is further configured to:
[0095] Based on the first multi-level feature map after the recovery process, the task loss of the first network is determined; and / or, the first global relationship and the second global relationship corresponding to the first multi-level feature map and the second multi-level feature map after the recovery process are determined by the global context module, and the global loss between the first global relationship and the second global relationship is determined.
[0096] The parameters of the first network are adjusted based on the feature recovery loss, the task loss, and / or the global loss.
[0097] In one possible implementation, the image acquisition module is specifically used for:
[0098] The training sample image is acquired and divided into multiple non-overlapping image patches.
[0099] The target mask image is obtained by random sampling based on a preset mask ratio. The target mask image includes a mask indicator for indicating that the corresponding image patch is masked.
[0100] Based on the mask indicator in the target mask image, the training sample image is masked to obtain a mask sample image corresponding to the training sample image.
[0101] In one possible implementation, the device further includes:
[0102] The task detection module is used to acquire the image to be detected and perform an image detection task on the image to be detected based on the trained first network; the image detection task includes an object detection task, a semantic segmentation task, or an instance segmentation task.
[0103] This disclosure provides an electronic device including a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor. When the electronic device is running, the processor communicates with the memory via the bus. When the machine-readable instructions are executed by the processor, they perform the steps of the network training method as described in any of the above possible embodiments.
[0104] This disclosure provides a computer-readable storage medium storing a computer program that, when executed by a processor, performs the steps of the network training method as described in any of the possible embodiments above.
[0105] To make the above-mentioned objects, features and advantages of this disclosure more apparent and understandable, preferred embodiments are described below in detail with reference to the accompanying drawings.
Brief Description of the Drawings
[0106] To more clearly illustrate the technical solutions of the embodiments of this disclosure, the accompanying drawings used in the embodiments will be briefly described below. These drawings are incorporated in and constitute a part of this specification. They illustrate embodiments conforming to this disclosure and, together with the specification, serve to explain the technical solutions of this disclosure. It should be understood that the following drawings only show some embodiments of this disclosure and should not be considered as limiting the scope. Those skilled in the art can obtain other related drawings based on these drawings without creative effort.
[0107] Figure 1 shows a flowchart of a network training method provided by an embodiment of this disclosure;
[0108] Figure 2 shows a schematic diagram of an image masking process provided by an embodiment of this disclosure;
[0109] Figure 3 shows a schematic diagram of a network training process provided by an embodiment of this disclosure;
[0110] Figure 4 shows a schematic diagram of a decoding module processing procedure provided by an embodiment of this disclosure;
[0111] Figure 5 shows a schematic diagram of the structure of a network training device provided by an embodiment of this disclosure;
[0112] Figure 6 shows a schematic diagram of another network training device provided by an embodiment of this disclosure;
[0113] Figure 7 shows a schematic diagram of an electronic device provided by an embodiment of this disclosure.
Detailed Implementation
[0114] To make the objectives, technical solutions, and advantages of the embodiments of this disclosure clearer, the technical solutions of the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this disclosure, and not all of them. The components of the embodiments of this disclosure described and shown in the accompanying drawings can generally be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of this disclosure provided in the accompanying drawings is not intended to limit the scope of the claimed disclosure, but merely represents selected embodiments of this disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of this disclosure without inventive effort are within the scope of protection of this disclosure.
[0115] It should be noted that similar labels and letters in the following figures indicate similar items. Therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures.
[0116] In this document, the term "and / or" merely describes an association relationship, indicating that three relationships can exist. For example, A and / or B can represent three cases: A alone, A and B simultaneously, and B alone. Furthermore, the term "at least one" in this document means any one of multiple elements or any combination of at least two of multiple elements. For example, including at least one of A, B, and C can mean including any one or more elements selected from the set consisting of A, B, and C.
[0117] First, the relevant terms and concepts involved in the embodiments of this application will be introduced and explained:
[0118] Artificial intelligence (AI) is the theory, methods, technology, and application systems that use digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to achieve optimal results. In other words, AI is a comprehensive technology within computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a way similar to human intelligence. AI studies the design principles and implementation methods of various intelligent machines, enabling them to possess the functions of perception, reasoning, and decision-making.
[0119] Artificial intelligence (AI) is a comprehensive discipline encompassing a wide range of fields, including both hardware and software technologies. Fundamental AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating / interactive systems, and mechatronics. AI software technologies primarily include computer vision, speech processing, natural language processing, and machine learning / deep learning.
[0120] Computer vision (CV) is a science that studies how to enable machines to "see." More specifically, it refers to machine vision, which uses cameras and computers to replace human eyes in recognizing, tracking, and measuring targets, and then performs image processing to create images more suitable for human observation or transmission to instruments. As a scientific discipline, computer vision studies related theories and technologies, attempting to build artificial intelligence systems capable of extracting information from images or multidimensional data. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (OCR), video processing, video semantic understanding, video content / behavior recognition, 3D object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping (SLAM), and common biometric recognition technologies such as facial recognition and fingerprint recognition.
[0121] Machine learning (ML) is a multidisciplinary field involving probability theory, statistics, approximation theory, convex analysis, and algorithm complexity theory. It specifically studies how computers can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; its applications span all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, and inductive learning.
[0122] Decision intelligence, which includes fully optimized training configurations, efficient algorithm implementations, and pre-trained model libraries, can help researchers and engineers quickly begin learning reinforcement learning, validating ideas, and developing production business baseline models.
[0123] Knowledge distillation, also known as dark knowledge extraction, refers to the process of using a complex, computationally intensive, but high-performance teacher network to guide the training of a relatively simple student network, thereby improving the performance of the student network and achieving knowledge transfer. Knowledge distillation can make the model lighter (easier to deploy) while minimizing performance loss.
[0124] Research has found that for dense visual detection tasks, which are more sensitive to image localization information, current knowledge distillation methods mainly rely on imitating teacher feature maps. However, this feature map-based knowledge distillation typically inputs the complete image into the student network and then performs pixel-by-pixel spatial imitation. This imitation process is relatively simple, allowing the distillation loss to converge quickly. Consequently, the learning ability of the student model is not well explored, resulting in poor performance. For example, in practical applications, this makes the student model highly dependent on the image quality of the input image and unable to detect and recognize images with defects (such as partial blurring or missing parts).
[0125] Based on the above research, this disclosure provides a network training method. First, a training sample image and a corresponding mask sample image are acquired, wherein the mask sample image is generated from the training sample image and a target mask image. Then, the mask sample image is input into a first network, and feature extraction is performed on the mask sample image based on a first feature extraction network of the first network to obtain a first multi-level feature map corresponding to the mask sample image. Next, the training sample image is input into a second network, and feature extraction is performed on the training sample image based on a second feature extraction network of the second network to obtain a second multi-level feature map corresponding to the training sample image; the scale of the second network is larger than that of the first network. After that, feature recovery processing is performed on the first multi-level feature map based on a decoder and the mask image corresponding to each feature map in the first multi-level feature map, to obtain a recovered first multi-level feature map, wherein the mask image corresponding to each feature map is obtained by scaling the target mask image. Finally, the first network is trained based on the recovered first multi-level feature map and the second multi-level feature map, and the above steps are repeated until the training result of the first network meets the preset requirements, thus obtaining the trained first network.
[0126] In this embodiment of the disclosure, during the feature-based knowledge distillation process, the training sample images are masked, and the features corresponding to the masked regions are recovered by imitating the second multi-level feature map output by the second network. This increases the difficulty of feature imitation. In other words, without changing the network structure of the first network, the distillation process is enhanced by a separate decoder, thereby improving the learning ability of the first network. Thus, even if the input image to be detected is partially covered, the trained first network can still perform detection and recognition, thereby improving the detection performance of the first network.
[0127] It is understood that the network training method can be applied to a terminal, a server, or an implementation environment consisting of a terminal and a server. Furthermore, the network training method can also be software running on a terminal or server, such as an application with network training functionality.
[0128] The terminal can be a smartphone, tablet, laptop, desktop computer, smart speaker, smartwatch, etc., but is not limited to these. The server can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.
[0129] In some possible implementations, this network training method can be implemented by the processor calling computer-readable instructions stored in memory.
[0130] Referring to Figure 1, which shows a flowchart of a network training method provided in this embodiment of the present disclosure, the network training method includes the following steps S101 to S105:
[0131] S101, acquire training sample images and mask sample images corresponding to the training sample images; wherein, the mask sample images are generated from the training sample images and the target mask images.
[0132] For example, the content of the training sample images can be determined according to different application scenarios. For instance, if the application scenario is to classify pets, the training sample images can be images of different categories of pets, such as images of various types of dogs (e.g., Golden Retrievers, Huskies, Teddy Bears, etc.) and images of different types of cats; or, for example, if the application scenario is to detect various targets in the environment in which a vehicle is located, the training sample images can be road images of different environments in which the vehicle is driving. In this embodiment of the disclosure, the content of the training sample images is not specifically limited.
[0133] As shown in Figure 2, the mask sample image refers to the image obtained after masking the training sample image. That is, the mask sample image is generated from the training sample image and the target mask image.
[0134] As an example, after obtaining the training sample image, the training sample image can be divided into multiple non-overlapping image blocks; then, based on a preset mask ratio, the target mask image is obtained by random sampling. The target mask image includes a mask indicator for indicating that the corresponding image block is masked. Therefore, the training sample image can be masked based on the mask indicator in the target mask image to obtain a mask sample image corresponding to the training sample image.
[0135] The preset mask ratio refers to the proportion of the entire training sample image that is masked. Optionally, the preset mask ratio is 30%, meaning that 30% of a training sample image is masked. It is understood that in other embodiments, the preset mask ratio can be other ratios, such as 40%, 25%, etc., and can be set according to actual needs; no specific limitation is made here.
[0136] Specifically, a mask image can be randomly sampled according to a preset mask ratio to obtain the target mask image. Then, the target mask image is multiplied by the training sample image to obtain the mask sample image. For example, the target mask image is a binary mask image composed of 0s and 1s, where 0 represents being covered and 1 represents not being covered. That is, the "0" in the binary mask image is the mask indicator in the target mask image. It is understood that in other embodiments, other identifiers (such as letters) can also be used to indicate the masked image blocks.
[0137] As an example, the image patch size can be 32 pixels * 32 pixels, which keeps the patch from being too large or too small. If the patch is too small, the masked content can be learned from neighboring image patches, which lowers the learning difficulty; if the patch is too large, objects (such as road signs) in the training sample image may be completely obscured. In other embodiments, the image patch size can be determined based on the size of objects in the actual training sample image. For example, if the objects in the training sample image are all relatively small, the image patch size can be appropriately reduced.
[0138] It should be noted that, in this embodiment of the disclosure, although the target mask image is obtained through random sampling, the masked area of the training sample image should be uniformly distributed within the training sample image. That is, the mask indicators in the target mask image should be uniformly distributed within the target mask image to eliminate potential centering bias. Furthermore, the size of the target mask image should match the size of the training sample image.
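The masking procedure described above (non-overlapping image blocks, a preset mask ratio, a binary target mask in which 0 marks a covered block, and multiplication with the training sample image) can be sketched as follows. This is only a minimal illustration, not the implementation of the disclosure: the function names `make_target_mask` and `apply_mask`, the single-channel toy image, and the assumption that the image height and width are multiples of the patch size are all hypothetical choices made for the example.

```python
import random

def make_target_mask(height, width, patch=32, mask_ratio=0.3, seed=None):
    """Return an H x W binary mask: 0 = covered, 1 = visible.

    Assumes height and width are multiples of `patch` (illustrative only).
    """
    rng = random.Random(seed)
    rows, cols = height // patch, width // patch
    n_patches = rows * cols
    n_masked = round(n_patches * mask_ratio)
    # Sampling block indices uniformly without replacement spreads the
    # covered blocks over the whole image instead of favouring the centre.
    masked = set(rng.sample(range(n_patches), n_masked))
    mask = [[1] * width for _ in range(height)]
    for idx in masked:
        r0, c0 = (idx // cols) * patch, (idx % cols) * patch
        for r in range(r0, r0 + patch):
            for c in range(c0, c0 + patch):
                mask[r][c] = 0
    return mask

def apply_mask(image, mask):
    """Element-wise product of a single-channel image and the binary mask."""
    return [[p * m for p, m in zip(img_row, m_row)]
            for img_row, m_row in zip(image, mask)]

image = [[255] * 128 for _ in range(96)]   # toy 96 x 128 single-channel image
mask = make_target_mask(96, 128, patch=32, mask_ratio=0.3, seed=0)
masked_image = apply_mask(image, mask)
```

With a 96 x 128 image and 32 x 32 blocks there are 12 blocks, so a 30% ratio rounds to 4 covered blocks in this sketch. The target mask has the same size as the image, as the text requires.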
[0139] In the above example, the training sample image is masked after it is obtained, yielding the mask sample image. As another example, each training sample image in the sample image set can be masked in advance to obtain a mask sample image corresponding to each training sample image. Then, the training sample image and its corresponding mask sample image can be obtained according to the pre-saved correspondence between training sample images and mask sample images.
[0140] S102, the mask sample image is input into the first network, and features are extracted from the mask sample image based on the first feature extraction network of the first network to obtain a first multi-level feature map corresponding to the mask sample image.
[0141] For example, the first network can be a network for performing dense visual detection tasks. For instance, the first network can be a detection neural network for performing object detection tasks, a segmentation neural network for performing semantic segmentation tasks, or a segmentation neural network for performing instance segmentation tasks.
[0142] As shown in Figure 3, in one embodiment of this disclosure, in order to accurately extract local feature information of training sample images, the first feature extraction network includes a feature extraction module, which is also called a feature extraction backbone network. Specifically, it can be a convolutional neural network. For example, the feature extraction module can be a Residual Network-50-Feature Pyramid Network (ResNet-50-FPN). After the mask sample image is input into the first network, the feature extraction module can output multiple intermediate feature maps with different receptive fields and different sizes. These intermediate feature maps of different sizes form an intermediate multi-level feature map. In this embodiment of the disclosure, the intermediate multi-level feature map is used as the first multi-level feature map.
[0143] Specifically, the feature extraction module can adopt a pyramid design, with different stride factors at different feature extraction stages. That is, the feature extraction module can perform multiple downsampling processes on the mask sample image through multiple feature extraction layers in the pyramid structure to obtain the intermediate multi-level feature map of the mask sample image. Furthermore, it should be noted that since the masked portion of the mask sample image is obtained from a random mask, the feature extraction module can avoid learning the features of the mask itself during feature extraction.
[0144] It should be understood that the first multi-level feature map in the aforementioned embodiments is an intermediate multi-level feature map obtained after feature extraction by the feature extraction module. In other embodiments, the first network further includes multiple mask convolution modules. For step S102, when the mask sample image is input into the first network and the first feature extraction network of the first network is used to extract features from the mask sample image to obtain the first multi-level feature map corresponding to the mask sample image, the following (1) to (2) may be included:
[0145] (1) Input the mask sample image and the target mask image into the first network, and extract features from the mask sample image based on the feature extraction module to obtain intermediate multi-level feature maps;
[0146] (2) Based on the multiple mask convolution modules and the target mask image, each intermediate feature map in the intermediate multi-level feature map is masked, and the masked intermediate multi-level feature map is used as the first multi-level feature map.
[0147] Specifically, the target mask image can be scaled according to the size of each intermediate feature map in the intermediate multi-level feature map to obtain a mask image corresponding to each intermediate feature map. For example, for intermediate feature map 1, the target mask image can be scaled to obtain a mask image matching the size of intermediate feature map 1; for intermediate feature map 2, the target mask image can be scaled to obtain a mask image matching the size of intermediate feature map 2; and so on, until a mask image corresponding to each intermediate feature map is obtained.
[0148] Then, for each intermediate feature map, the intermediate feature map is multiplied element-wise by the corresponding mask image in the mask convolution module to obtain a masked intermediate feature map. Based on each masked intermediate feature map, the masked multi-level feature map is obtained. Each mask convolution module can correspond to the masking of one intermediate feature map. In this way, by masking each intermediate feature map, confusing feature interactions between masked and visible regions can be avoided. That is, since mask convolution is used in the backbone network for feature extraction in the first network, the masked image blocks can be prevented from being affected by other visible image blocks during the convolution process.
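The per-level masking in (1) to (2) can be sketched as below. The sketch assumes nearest-neighbour scaling of the target mask and single-channel feature maps stored as nested lists; the scaling method is not specified in the text, and the names `scale_mask` and `mask_feature_map` are hypothetical.

```python
def scale_mask(mask, out_h, out_w):
    """Nearest-neighbour resize of a binary mask to out_h x out_w (assumed method)."""
    in_h, in_w = len(mask), len(mask[0])
    return [[mask[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]

def mask_feature_map(feature, target_mask):
    """Scale the target mask to the feature map size, then multiply element-wise."""
    scaled = scale_mask(target_mask, len(feature), len(feature[0]))
    masked = [[f * m for f, m in zip(f_row, m_row)]
              for f_row, m_row in zip(feature, scaled)]
    return masked, scaled

# Toy 64 x 64 target mask whose top-left 32 x 32 block is covered (0).
target_mask = [[0 if (r < 32 and c < 32) else 1 for c in range(64)]
               for r in range(64)]
# Three toy intermediate feature maps of sizes 8 x 8, 4 x 4 and 2 x 2.
features = [[[1.0] * s for _ in range(s)] for s in (8, 4, 2)]
masked_levels = [mask_feature_map(f, target_mask) for f in features]
```

Each level receives its own scaled mask, so the covered block occupies the same relative region at every resolution.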
[0149] S103, the training sample image is input into the second network, and features are extracted from the training sample image based on the second feature extraction network of the second network to obtain a second multi-level feature map corresponding to the training sample image; the scale of the second network is larger than the scale of the first network.
[0150] The structure and function of the second network are similar to those of the first network, except that the second network is larger in scale than the first network. The scale refers to the number of parameters; that is, the second network has more parameters than the first network. In this embodiment, the second network is also called the teacher network, and the first network is also called the student network. After the second network is pre-trained, it is used to perform knowledge distillation on the first network. During the distillation process, the model information of the first network does not affect the parameter updates of the second network; that is, knowledge is transferred in one direction only, from the second network to the first network, and the first network does not affect the second network.
[0151] S104, based on the decoder and the mask image corresponding to each feature map in the first multi-level feature map, feature recovery processing is performed on the first multi-level feature map to obtain the recovered first multi-level feature map; wherein, the mask image corresponding to each feature map in the first multi-level feature map is obtained by scaling the target mask image respectively.
[0152] As shown in Figure 3, for example, after the first multi-level feature map is obtained, the first multi-level feature map and the mask image corresponding to each feature map in the first multi-level feature map can be input into the decoder. Then, based on the mask image corresponding to each feature map, the decoder can restore the features of the masked area in each feature map in the first multi-level feature map and obtain the recovered first multi-level feature map.
[0153] It is understood that, for each feature map in the first multi-level feature map, the mask image corresponding to that feature map indicates the masked area in that feature map. Therefore, feature recovery processing can be performed on the masked area in each feature map based on the mask image corresponding to each feature map, which helps to improve the accuracy and efficiency of feature recovery.
[0154] Furthermore, the process of feature recovery processing of the first multi-level feature map by the decoder will be described in detail later.
[0155] Furthermore, it should be noted that the second multi-level feature map is also input into the decoder. The decoder can be trained in a supervised manner based on the recovered first multi-level feature map and the second multi-level feature map. That is, the second multi-level feature map can be used as the label of the first multi-level feature map to calculate the supervised loss, and then the parameters of the decoder can be adjusted.
[0156] S105, the first network is trained based on the recovered first multi-level feature map and the second multi-level feature map, and then the process returns to step S101 until the training result of the first network meets the preset requirements, thus obtaining the trained first network.
[0157] As an example, a feature recovery loss can be determined between the recovered first multi-level feature map and the second multi-level feature map, and the parameters of the first network can be adjusted based on the feature recovery loss. Here, the feature recovery loss refers to the mean squared error loss between the recovered first multi-level feature map and the second multi-level feature map.
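As an illustration of the feature recovery loss, the sketch below computes a mean squared error per level and sums the results over the levels. Summing across levels is an assumption made for the example, since the text only states that the loss is a mean squared error; the function names are hypothetical and the feature maps are toy single-channel nested lists.

```python
def mse(a, b):
    """Mean squared error between two equally sized 2-D feature maps."""
    diffs = [(x - y) ** 2
             for row_a, row_b in zip(a, b)
             for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)

def feature_recovery_loss(recovered_levels, teacher_levels):
    """Sum the per-level MSE over all levels (the summation is an assumption)."""
    return sum(mse(s, t) for s, t in zip(recovered_levels, teacher_levels))

# Toy two-level example: recovered student features vs. teacher features.
recovered = [[[1.0, 2.0], [3.0, 4.0]], [[0.0]]]
teacher = [[[1.0, 2.0], [3.0, 2.0]], [[1.0]]]
loss = feature_recovery_loss(recovered, teacher)
```

In this toy case each level contributes an error of 1.0, so the total is 2.0.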
[0158] As another example, to further improve the network performance of the first network, the task loss of the first network can be determined based on the recovered first multi-level feature map. For example, a test image can be acquired and input into the first network for feature extraction to obtain the feature extraction result. Then, based on the feature extraction result, a target task (such as a classification task, object detection task, etc.) can be performed, and the task loss of the first network can be obtained based on the task execution result and the label of the test image.
[0159] In addition, a first global relationship corresponding to the recovered first multi-level feature map and a second global relationship corresponding to the second multi-level feature map can be determined by the global context module, and the global loss between the first global relationship and the second global relationship can be determined.
[0160] Global relationships refer to the relationship between a certain region in an image and other regions in the image.
[0161] Therefore, the parameters of the first network can be updated based on the feature recovery loss, the task loss, and / or the global loss until the training result of the first network meets the preset requirements, thus obtaining a trained first network. That is, the parameters of the first network can be updated based solely on the feature recovery loss, or the parameters of the first network can be updated based on the feature recovery loss and at least one of the global loss and the task loss.
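One way to read the loss combinations above is as a weighted sum in which the feature recovery loss is always present and the task loss and global loss are optional. The sketch below follows that reading. The weights `w_task` and `w_global` are hypothetical, since the text does not say how the losses are combined.

```python
def total_loss(recovery, task=None, global_=None, w_task=1.0, w_global=1.0):
    """Combine the losses; the weighted sum and its weights are assumptions."""
    loss = recovery                      # feature recovery loss is always used
    if task is not None:
        loss += w_task * task            # optional task loss
    if global_ is not None:
        loss += w_global * global_       # optional global loss
    return loss

only_recovery = total_loss(0.5)
all_three = total_loss(0.5, task=1.0, global_=0.25, w_global=2.0)
```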
[0162] The preset requirement refers to the conditions under which the training of the first network ends. This preset requirement can be configured according to actual needs. For example, meeting the preset requirement could mean the feature recovery loss is less than a preset value, or that the change in the feature recovery loss approaches stability, meaning the difference between the feature recovery losses of two or more adjacent training iterations is less than a set value, indicating that the feature recovery loss essentially no longer changes. Additionally, the preset requirement could also be that the first network has been trained a preset number of times; this is not specifically limited here.
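The three example end conditions (loss below a preset value, loss change approaching stability, or a preset number of training iterations) could be checked as in the following sketch. All default thresholds and the name `should_stop` are hypothetical placeholders rather than values from the disclosure.

```python
def should_stop(loss_history, max_iters=10000, loss_threshold=1e-3,
                delta=1e-5, window=2):
    """Return True when any of the three example end conditions holds."""
    # Condition 1: the network has been trained a preset number of times.
    if len(loss_history) >= max_iters:
        return True
    # Condition 2: the latest feature recovery loss is below a preset value.
    if loss_history and loss_history[-1] < loss_threshold:
        return True
    # Condition 3: the last `window` adjacent losses differ by less than `delta`.
    recent = loss_history[-window:]
    return len(recent) == window and max(recent) - min(recent) < delta
```

A training loop would append the feature recovery loss of each iteration to `loss_history` and return to step S101 while `should_stop` is false.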
[0163] The following section details the process of using a decoder to perform feature recovery on the first multi-level feature map.
[0164] In some embodiments, the decoder includes a spatial alignment module, a decoding module, and a spatial recovery module. Regarding step S104 above, when performing feature recovery processing on the first multi-level feature map based on the decoder and the mask image corresponding to each feature map in the first multi-level feature map to obtain the recovered first multi-level feature map, the following (a) to (c) may be included:
[0165] (a) Based on the spatial alignment module, the feature maps of different sizes in the first multi-level feature map are aligned to the same spatial resolution, so that the sizes of the feature maps in the first multi-level feature map are aligned, resulting in a spatially aligned multi-level feature map (as shown in Figure 4).
[0166] It can be understood that after inputting the first multi-level feature map, the mask image corresponding to each feature map in the first multi-level feature map, and the second multi-level feature map into the decoder, the feature maps of different sizes in the first multi-level feature map can be aligned to the same spatial resolution based on the spatial alignment module, so that the size of each feature map in the first multi-level feature map is aligned, resulting in a spatially aligned multi-level feature map.
[0167] The same spatial resolution can be 1 / 32 of the size of the training sample image. That is, the size of the target image can be determined first. For example, the size of the target image can be (H / 32)*(W / 32), where H is the height of the training sample image and W is the width of the training sample image.
[0168] Specifically, each feature map in the first multi-level feature map can be compared with a preset target image. When the feature map is larger than the target image, it is downsampled to the size of the target image; when it is smaller, nearest neighbor interpolation is used to upsample it to the size of the target image. That is, for the feature map currently being adjusted, let p be the ratio of its size to the size of the target image. When p > 1, the feature map is reduced using a p×p convolution with stride p. When p < 1, the feature map is upsampled to the target image size by nearest neighbor interpolation.
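The alignment rule can be sketched as follows for single-channel feature maps. The averaging kernels stand in for learnable convolution weights, and the sketch assumes an integer ratio p that is the same for height and width. Both are simplifications made for the example, and the helper names are hypothetical.

```python
def strided_conv(feature, kernel):
    """Apply a p x p kernel with stride p, i.e. over non-overlapping windows."""
    p = len(kernel)
    out_h, out_w = len(feature) // p, len(feature[0]) // p
    return [[sum(kernel[i][j] * feature[r * p + i][c * p + j]
                 for i in range(p) for j in range(p))
             for c in range(out_w)] for r in range(out_h)]

def nearest_upsample(feature, out_h, out_w):
    """Nearest-neighbour interpolation to out_h x out_w."""
    in_h, in_w = len(feature), len(feature[0])
    return [[feature[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]

def align(feature, target_h, target_w, kernels):
    """Bring one feature map to the target size; `kernels` maps p to a p x p kernel."""
    h = len(feature)
    if h > target_h:                       # p > 1: p x p convolution, stride p
        return strided_conv(feature, kernels[h // target_h])
    if h < target_h:                       # p < 1: nearest-neighbour upsampling
        return nearest_upsample(feature, target_h, target_w)
    return feature                         # p == 1: already aligned

# Averaging kernels used only as placeholders for learned weights.
kernels = {p: [[1.0 / (p * p)] * p for _ in range(p)] for p in (2, 4)}
level_a = [[float(4 * r + c + 1) for c in range(4)] for r in range(4)]  # 4 x 4
level_b = [[7.0]]                                                        # 1 x 1
aligned_a = align(level_a, 2, 2, kernels)
aligned_b = align(level_b, 2, 2, kernels)
```

Both toy levels end up at the 2 x 2 target size, so each position in every aligned map corresponds to the same image region.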
[0169] In this embodiment of the disclosure, after aligning the feature maps of different sizes in the first multi-level feature map to the same spatial resolution, it is convenient to replace the mask markers in the subsequent process. For example, during the replacement, each pixel in the spatially aligned feature map can correspond to a mask marker, which helps to improve the replacement efficiency of the mask markers.
[0170] As an example, before spatial alignment of the first multi-level feature maps, a 1×1 convolutional layer can be used to align the number of channels in the first multi-level feature maps with the number of channels in the second multi-level feature maps. As another example, layer normalization can be applied to both the first and second multi-level feature maps, which helps improve the prediction accuracy of features corresponding to the masked regions.
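A 1×1 convolution is a per-position linear map over the channel axis, so the channel alignment and layer normalization mentioned above can be illustrated with the sketch below. The weight values are arbitrary placeholders, the layer normalization is applied over the channels of each position without learnable scale and shift, and both of these choices are assumptions made for the example.

```python
import math

def conv1x1(feature, weight):
    """feature: H x W x C_in nested lists; weight: C_in x C_out matrix."""
    c_in, c_out = len(weight), len(weight[0])
    return [[[sum(px[k] * weight[k][j] for k in range(c_in))
              for j in range(c_out)] for px in row] for row in feature]

def layer_norm(vec, eps=1e-6):
    """Normalize one channel vector to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

student = [[[1.0, 2.0], [3.0, 4.0]]]            # 1 x 2 map with 2 channels
weight = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]     # placeholder 2 -> 3 channel weights
aligned = conv1x1(student, weight)              # now 3 channels per position
normed = [[layer_norm(px) for px in row] for row in aligned]
```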
[0171] It should be noted that each image can be viewed as having three dimensions: height, width, and channels. Each channel is a two-dimensional matrix whose two dimensions are the height and the width, and the value of each element in the matrix is the pixel value of that channel at the corresponding position.
[0172] (b) Based on the mask images corresponding to the spatially aligned multi-level feature maps, the mask regions in the spatially aligned multi-level feature maps are replaced with mask markers to obtain multi-level feature maps with mask markers. Based on the decoding module and the second multi-level feature map, feature prediction processing is performed on the multi-level feature maps with mask markers to obtain multi-level feature maps with feature prediction processing.
[0173] For example, as shown in Figure 4, each spatially aligned feature map can first be unfolded (flattened) to obtain a one-dimensional unfolded feature map. Then, based on the mask image corresponding to the spatially aligned feature map, the mask region that needs to be replaced in the unfolded feature map is determined, and the mask region is replaced with a mask marker M to obtain an unfolded feature map with the mask marker. Based on each unfolded feature map with the mask marker, the multi-level feature map with the mask marker is obtained.
[0174] Next, each unfolded feature map with the mask marker is input into the corresponding decoding module, which can predict the features of the region corresponding to the mask marker M and output a feature-predicted unfolded feature map. Then, based on each feature-predicted unfolded feature map, the multi-level feature map of feature prediction processing is obtained.
[0175] Each mask marker is a learnable vector, and mask markers are not shared between different feature maps.
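The unfolding and replacement step can be sketched as below. The feature map is an H x W x C nested list, the mask image uses 0 for covered positions as in the binary mask example of the text, and the mask marker is shown as a fixed vector although in training it would be a learnable parameter. The function name and the per-level marker values are hypothetical.

```python
def insert_mask_markers(feature, mask, mask_marker):
    """Unfold an H x W x C map to (H*W) x C and replace covered positions."""
    tokens = [vec for row in feature for vec in row]   # row-major unfolding
    flags = [m for row in mask for m in row]           # 0 = covered, 1 = visible
    return [list(mask_marker) if m == 0 else list(vec)
            for vec, m in zip(tokens, flags)]

# One marker per feature level, since markers are not shared between levels.
mask_markers = {0: [9.0, 9.0, 9.0], 1: [-1.0, -1.0, -1.0]}

feature = [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
           [[0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]]          # 2 x 2 map, 3 channels
mask = [[0, 1],
        [1, 0]]
tokens = insert_mask_markers(feature, mask, mask_markers[0])
```

The resulting sequence keeps the visible features untouched and carries the marker M at every covered position, which is what the decoding module then predicts.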
[0176] It can be understood that the input size of the detection task is variable; that is, the training sample images input at different times are not necessarily the same size. To enable the decoding module to perform feature prediction for training sample images of different sizes, in some embodiments, after the unfolded feature map with the mask marker is obtained, a cosine absolute position encoding can be added to it to represent the position, in the original image, of each sub-image in the unfolded feature map. Then, based on a preset absolute scale, the unfolded feature map with the mask marker is adaptively adjusted by interpolation to obtain an adjusted unfolded feature map, and the multi-level feature map with the mask marker is obtained based on each adjusted unfolded feature map.
[0177] The preset absolute scale can be set according to implementation requirements, for example, it can be 28*28, and then the resolution can be adaptively interpolated according to the size of the training sample image.
[0178] Furthermore, in this embodiment of the disclosure, the decoding module is a 4-layer decoder module, which may consist of a normalization layer, a multi-head self-attention layer, and a feedforward layer.
[0179] (c) Based on the spatial restoration module, the multi-level feature map of the feature prediction processing with the same spatial resolution is restored to the original size multi-level feature map to obtain the first multi-level feature map after restoration processing.
[0180] Specifically, the feature map output by the decoding module (the multi-level feature map of feature prediction processing) first has its number of channels changed by a linear layer and is then reshaped to restore its size. The reshaped feature map is the complete feature map finally generated by the first network.
[0181] In this embodiment of the disclosure, during the feature-based knowledge distillation process, the training sample images are masked, and the features corresponding to the masked regions are predicted by imitating the second multi-level feature map output by the second network. This increases the difficulty of feature imitation. In other words, without changing the network structure of the first network, the distillation process is enhanced by a separate decoder, thereby improving the learning ability of the first network and thus improving the performance of the first network.
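The channel change followed by the size restoration can be illustrated with a small sketch: a linear layer maps each decoded vector to the desired number of channels, and the flat sequence is folded back into a two-dimensional grid. The weights, bias, and function names are hypothetical, and resizing the folded map back to the original resolution of each level is not shown.

```python
def linear(tokens, weight, bias):
    """Apply y = x W + b to every token; weight is D_in x D_out."""
    return [[sum(t[k] * weight[k][j] for k in range(len(t))) + bias[j]
             for j in range(len(bias))] for t in tokens]

def fold(tokens, height, width):
    """Reshape a flat (H*W) x C sequence back into an H x W x C grid."""
    if len(tokens) != height * width:
        raise ValueError("token count does not match the requested grid")
    return [tokens[r * width:(r + 1) * width] for r in range(height)]

decoded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]   # 4 tokens, 2 dims
weight = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]                  # placeholder 2 -> 3
bias = [0.5, 0.5, 0.5]
restored = fold(linear(decoded, weight, bias), 2, 2)          # 2 x 2 x 3 map
```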
[0182] Furthermore, since the feature maps in the first multi-level feature map have different sizes, the first network can be applied to various dense prediction tasks, such as object detection, instance segmentation, and semantic segmentation, thus improving the practicality of the first network.
[0183] In some embodiments, after obtaining the trained first network, an image to be detected can be acquired, and an image detection task can be performed on the image to be detected based on the trained first network; the image detection task includes an object detection task, a semantic segmentation task, or an instance segmentation task.
[0184] Those skilled in the art will understand that, in the above-described method of the specific implementation, the order in which each step is written does not imply a strict execution order and does not constitute any limitation on the implementation process. The specific execution order of each step should be determined by its function and possible internal logic.
[0185] Based on the same technical concept, this disclosure also provides a network training device corresponding to the network training method. Since the principle of the device in this disclosure for solving the problem is similar to that of the network training method described above, the implementation of the device can refer to the implementation of the method, and the repeated parts will not be described again.
[0186] Referring to Figure 5, which is a schematic diagram of a network training device 500 provided in an embodiment of this disclosure, the network training device 500 includes:
[0187] Image acquisition module 501 is used to acquire training sample images and mask sample images corresponding to the training sample images; wherein, the mask sample images are generated from the training sample images and the target mask images;
[0188] The first extraction module 502 is used to input the mask sample image into the first network, and perform feature extraction on the mask sample image based on the first feature extraction network of the first network to obtain a first multi-level feature map corresponding to the mask sample image.
[0189] The second extraction module 503 is used to input the training sample image into the second network and extract features from the training sample image based on the second feature extraction network of the second network to obtain a second multi-level feature map corresponding to the training sample image; the size of the second network is larger than the size of the first network.
[0190] The feature prediction module 504 is used to perform feature recovery processing on the first multi-level feature map based on the decoder and the mask image corresponding to each feature map in the first multi-level feature map, to obtain the recovered first multi-level feature map; wherein, the mask image corresponding to each feature map in the first multi-level feature map is obtained by scaling the target mask image respectively.
[0191] The network training module 505 is used to train the first network based on the recovered first multi-level feature map and the second multi-level feature map, and repeat the above steps until the training result of the first network meets the preset requirements, thereby obtaining the trained first network.
[0192] In one possible implementation, the first extraction module 502 is specifically used for:
[0193] The masked sample image is input into the first network, and features are extracted from the masked sample image based on the feature extraction module to obtain intermediate multi-level feature maps, and the intermediate multi-level feature maps are used as the first multi-level feature map; wherein, the intermediate multi-level feature map includes multiple intermediate feature maps of different sizes.
[0194] In one possible implementation, the first extraction module 502 is specifically used for:
[0195] The mask sample image and the target mask image are input into the first network, and the feature extraction module is used to extract features from the mask sample image to obtain intermediate multi-level feature maps, wherein the intermediate multi-level feature maps include multiple intermediate feature maps of different sizes.
[0196] Based on the multiple mask convolutional modules and the target mask image, each intermediate feature map in the intermediate multi-level feature map is masked, and the masked intermediate multi-level feature map is used as the first multi-level feature map.
[0197] In one possible implementation, the first extraction module 502 is specifically used for:
[0198] Based on the size of each intermediate feature map in the intermediate multi-level feature maps, the target mask image is scaled to obtain a mask image corresponding to each intermediate feature map in the intermediate multi-level feature maps.
[0199] For each intermediate feature map, the intermediate feature map and the corresponding mask image are multiplied by the mask convolution module to obtain the masked intermediate feature map, and the masked multi-level feature map is obtained based on each masked intermediate feature map.
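The scaling and multiplication described in the two paragraphs above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: feature maps are single-channel nested lists, a mask value of 1 marks a masked cell (so the map is multiplied by 1 minus the mask), and the names `scale_mask` and `mask_feature_map` are hypothetical.

```python
def scale_mask(mask, out_h, out_w):
    """Nearest-neighbour resize of a binary mask grid to out_h x out_w."""
    in_h, in_w = len(mask), len(mask[0])
    return [[mask[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]


def mask_feature_map(feature, mask):
    """Zero the masked cells of one single-channel feature map.

    Assumption: 1 in the mask means "masked", so we multiply by (1 - mask).
    """
    scaled = scale_mask(mask, len(feature), len(feature[0]))
    return [[f * (1 - m) for f, m in zip(f_row, m_row)]
            for f_row, m_row in zip(feature, scaled)]


# Toy pyramid: a 4x4 level and a 2x2 level sharing one 2x2 target mask.
target_mask = [[1, 0], [0, 0]]
pyramid = [[[1.0] * 4 for _ in range(4)], [[1.0] * 2 for _ in range(2)]]
masked_pyramid = [mask_feature_map(level, target_mask) for level in pyramid]
```

Because the same target mask is rescaled for every level, the masked region covers the same relative area of each intermediate feature map.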
[0200] In one possible implementation, the decoder includes a spatial alignment module, a decoding module, and a spatial restoration module, wherein the feature prediction module 504 is specifically used for:
[0201] Based on the spatial alignment module, the feature maps of different sizes in the first multi-level feature map are aligned to the same spatial resolution, so that every feature map in the first multi-level feature map has the same size, resulting in a spatially aligned multi-level feature map.
[0202] Based on the mask images corresponding to the spatially aligned multi-level feature maps, the mask regions in the spatially aligned multi-level feature maps are replaced with mask markers to obtain multi-level feature maps with mask markers. Then, based on the decoding module, feature prediction processing is performed on the multi-level feature maps with mask markers to obtain feature-predicted multi-level feature maps.
[0203] Based on the spatial restoration module, the feature-predicted multi-level feature maps, which share the same spatial resolution, are restored to multi-level feature maps of the original sizes, thus obtaining the recovered first multi-level feature map.
[0204] In one possible implementation, the feature prediction module 504 is specifically used for:
[0205] For each feature map in the first multi-level feature map, the size of the feature map is compared with the size of the target image;
[0206] When the size of the feature map is larger than the size of the target image, the feature map is subjected to dimensionality reduction processing so that its size is the same as that of the target image; or...
[0207] When the size of the feature map is smaller than the size of the target image, nearest neighbor interpolation is used to upsample the feature map so that the size of the feature map is consistent with the size of the target image.
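The size comparison above can be sketched as a single resize routine. This is an illustration under assumptions: the source does not specify how the "dimensionality reduction" is done, so average pooling with an integer stride is used here as a stand-in; nearest-neighbour upsampling follows the text. The function name `align_to_size` is hypothetical, and the sketch assumes single-channel maps whose sizes divide evenly.

```python
def align_to_size(feature, target_h, target_w):
    """Resize one single-channel feature map (nested lists) to target_h x target_w."""
    h, w = len(feature), len(feature[0])
    if h > target_h:
        # Larger map: average-pool with an integer stride (assumed reduction).
        sh, sw = h // target_h, w // target_w
        return [[sum(feature[y * sh + dy][x * sw + dx]
                     for dy in range(sh) for dx in range(sw)) / (sh * sw)
                 for x in range(target_w)]
                for y in range(target_h)]
    # Smaller (or equal) map: nearest-neighbour upsampling.
    return [[feature[y * h // target_h][x * w // target_w]
             for x in range(target_w)]
            for y in range(target_h)]


large = [[1, 1, 3, 3], [1, 1, 3, 3], [5, 5, 7, 7], [5, 5, 7, 7]]
small = [[2]]
aligned = [align_to_size(level, 2, 2) for level in (large, small)]
```

After this step every level has the same 2x2 resolution, so the levels can be processed together by the decoding module.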
[0208] In one possible implementation, the feature prediction module 504 is further configured to:
[0209] Align the number of channels in the first multi-level feature map with the number of channels in the second multi-level feature map; and / or perform layer normalization on the first multi-level feature map and the second multi-level feature map.
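As a rough sketch of the two optional operations above, the following applies a per-position linear projection (equivalent to a 1x1 convolution) to change the channel count, and a parameter-free layer normalization over the channel vector. Both choices are assumptions made for illustration: the source does not state how the channels are aligned or whether the normalization has learnable parameters. The names `project_channels` and `layer_norm` are hypothetical.

```python
import math


def project_channels(vec, weight):
    """Map a C_in-dim channel vector to C_out dims; weight has C_out rows of C_in values."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def layer_norm(vec, eps=1e-6):
    """Normalize one channel vector to zero mean and unit variance (no affine terms)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


# Student vector with 2 channels projected to a 3-channel teacher width.
student_vec = [1.0, 2.0]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
projected = project_channels(student_vec, weight)
normalized = layer_norm([1.0, 3.0])
```

Normalizing both the first and the second multi-level feature maps puts the student and teacher features on a comparable scale before they are compared.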
[0210] In one possible implementation, the feature prediction module 504 is specifically used for:
[0211] For each spatially aligned feature map, the spatially aligned feature map is expanded to obtain a one-dimensional expanded feature map, and based on the mask image corresponding to the spatially aligned feature map, the mask region that needs to be replaced in the expanded feature map is determined.
[0212] The mask region is replaced with the mask mark to obtain an expanded feature map with the mask mark, and based on each expanded feature map with the mask mark, a multi-level feature map with the mask mark is obtained.
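The expansion and replacement steps above can be sketched as follows. This is a minimal illustration, assuming each spatial position holds a channel vector, a mask value of 1 marks a masked position, and the mask marker is a fixed vector; in practice such a marker is often a learnable parameter, which the source does not specify. The function name `flatten_with_mask_token` is hypothetical.

```python
def flatten_with_mask_token(feature, mask, mask_token):
    """Flatten an H x W grid of channel vectors into a token list.

    Positions flagged with 1 in the (already spatially aligned) mask are
    replaced by a copy of mask_token; other positions keep their features.
    """
    tokens = [vec for row in feature for vec in row]
    flags = [m for row in mask for m in row]
    return [list(mask_token) if flag else list(vec)
            for vec, flag in zip(tokens, flags)]


feature = [[[1, 2], [3, 4]],
           [[5, 6], [7, 8]]]      # 2 x 2 positions, 2 channels each
mask = [[1, 0],
        [0, 1]]
tokens = flatten_with_mask_token(feature, mask, mask_token=[9, 9])
```

The resulting one-dimensional token sequence is what the decoding module would receive for this level.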
[0213] In one possible implementation, the feature prediction module 504 is further configured to:
[0214] Add cosine absolute position encoding to the expanded feature map with the mask mark, and adaptively adjust the expanded feature map with the mask mark by interpolation based on a preset absolute scale to obtain the adjusted expanded feature map.
[0215] Based on the various adjusted expanded feature maps, the multi-level feature map with mask markings is obtained.
[0216] In one possible implementation, the network training module 505 is specifically used for:
[0217] The feature recovery loss between the recovered first multi-level feature map and the second multi-level feature map is determined, and the parameters of the first network are adjusted based on the feature recovery loss.
[0218] In one possible implementation, the network training module 505 is further configured to:
[0219] Based on the recovered first multi-level feature map, the task loss of the first network is determined; and / or, the first global relationship corresponding to the recovered first multi-level feature map and the second global relationship corresponding to the second multi-level feature map are determined by the global context module, and the global loss between the first global relationship and the second global relationship is determined.
[0220] The parameters of the first network are adjusted based on the feature recovery loss, the task loss, and / or the global loss.
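One way to picture how these losses combine is sketched below. The loss form and weights are assumptions: the source does not state which distance measures the feature recovery loss, so per-level mean squared error is used here, and the weights `w_recovery` and `w_global` are hypothetical hyperparameters.

```python
def mse(a, b):
    """Mean squared error between two equally shaped single-channel maps."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def feature_recovery_loss(recovered, teacher):
    """Sum of per-level MSE between recovered student maps and teacher maps."""
    return sum(mse(s, t) for s, t in zip(recovered, teacher))


def total_loss(recovery, task=0.0, global_=0.0, w_recovery=1.0, w_global=1.0):
    """Weighted sum; task and global terms are optional, matching the and/or wording."""
    return task + w_recovery * recovery + w_global * global_


recovered = [[[1.0, 2.0]], [[0.0]]]   # two pyramid levels
teacher = [[[1.0, 4.0]], [[1.0]]]
recovery = feature_recovery_loss(recovered, teacher)
loss = total_loss(recovery, task=0.5, global_=0.25, w_recovery=2.0, w_global=4.0)
```

Leaving `task` and `global_` at their defaults reduces the total to the feature recovery loss alone, which corresponds to the simpler implementation described earlier.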
[0221] In one possible implementation, the image acquisition module 501 is specifically used for:
[0222] The training sample image is acquired and divided into multiple non-overlapping image patches.
[0223] The target mask image is obtained by random sampling based on a preset mask ratio. The target mask image includes a mask indicator for indicating that the corresponding image block is masked.
[0224] Based on the mask indicator in the target mask image, the training sample image is masked to obtain a mask sample image corresponding to the training sample image.
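The three steps above (patch division, random sampling at a preset mask ratio, and masking) can be sketched as follows. This is an illustration under assumptions: the image is a single-channel nested list, masked pixels are filled with 0 (the source does not state the fill value), and the names `make_target_mask` and `apply_mask` are hypothetical.

```python
import random


def make_target_mask(grid_h, grid_w, mask_ratio, seed=0):
    """Return a grid_h x grid_w grid; 1 marks a masked patch, 0 a visible one."""
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [[1 if r * grid_w + c in masked else 0 for c in range(grid_w)]
            for r in range(grid_h)]


def apply_mask(image, mask, patch):
    """Zero every pixel whose non-overlapping patch is flagged in the mask."""
    return [[0 if mask[y // patch][x // patch] else image[y][x]
             for x in range(len(image[0]))]
            for y in range(len(image))]


image = [[1] * 8 for _ in range(8)]             # toy 8x8 image
mask = make_target_mask(2, 2, mask_ratio=0.5)   # 2x2 grid of 4x4 patches
masked_image = apply_mask(image, mask, patch=4)
```

With a mask ratio of 0.5, two of the four patches are hidden, so half of the pixels in the toy image are zeroed regardless of which patches the sampler picks.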
[0225] In one possible implementation, as shown in Figure 6, the device further includes:
[0226] The task detection module 506 is used to acquire the image to be detected and perform an image detection task on the image to be detected based on the trained first network; the image detection task includes an object detection task, a semantic segmentation task, or an instance segmentation task.
[0227] The processing flow of each module in the device and the interaction flow between each module can be referred to the relevant descriptions in the above method embodiments, and will not be detailed here.
[0228] Based on the same technical concept, this disclosure also provides an electronic device. Referring to Figure 7, which is a structural schematic diagram of an electronic device 700 provided in an embodiment of this disclosure, the electronic device 700 includes a processor 701, a memory 702, and a bus 703. The memory 702 is used to store execution instructions and includes a main memory 7021 and an external memory 7022. The main memory 7021, also called internal memory, is used to temporarily store computational data in the processor 701, as well as data exchanged with external memory 7022 such as a hard disk. The processor 701 exchanges data with the external memory 7022 through the main memory 7021.
[0229] In this embodiment, the memory 702 is specifically used to store application code that executes the solution of this disclosure, and its execution is controlled by the processor 701. That is, when the electronic device 700 is running, the processor 701 communicates with the memory 702 through the bus 703, so that the processor 701 executes the application code stored in the memory 702, and then executes the method described in any of the foregoing embodiments.
[0230] The memory 702 may be, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), etc.
[0231] Processor 701 may be an integrated circuit chip with signal processing capabilities. The aforementioned processor can be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it can also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this disclosure. The general-purpose processor can be a microprocessor or any conventional processor.
[0232] It is understood that the structures illustrated in the embodiments of this disclosure do not constitute a specific limitation on the electronic device 700. In other embodiments of this disclosure, the electronic device 700 may include more or fewer components than illustrated, or combine some components, or split some components, or have different component arrangements. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
[0233] This disclosure also provides a computer-readable storage medium storing a computer program, which, when executed by a processor, performs the steps of the network training method described in the above-described method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
[0234] This disclosure also provides a computer program product carrying program code. The program code includes instructions that can be used to execute the steps of the network training method in the above method embodiments. For details, please refer to the above method embodiments, which will not be repeated here.
[0235] The aforementioned computer program product can be implemented through hardware, software, or a combination thereof. In one optional embodiment, the computer program product is specifically embodied in a computer storage medium; in another optional embodiment, the computer program product is specifically embodied in a software product, such as a software development kit (SDK), etc.
[0236] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems and devices described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here. In the several embodiments provided in this disclosure, it should be understood that the disclosed systems, devices, and methods can be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. Furthermore, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Another point is that the displayed or discussed mutual coupling or direct coupling or communication connection may be through some communication interfaces; the indirect coupling or communication connection of devices or units may be electrical, mechanical, or other forms.
[0237] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0238] In addition, the functional units in the various embodiments of this disclosure can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.
[0239] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a processor-executable, non-volatile, computer-readable storage medium. Based on this understanding, the technical solution of this disclosure, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this disclosure. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, external hard drives, ROM, RAM, magnetic disks, or optical disks.
[0240] Finally, it should be noted that the above-described embodiments are merely specific implementations of this disclosure, used to illustrate the technical solutions of this disclosure and not to limit them, and the protection scope of this disclosure is not limited thereto. Although this disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed herein, anyone familiar with the art can still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this disclosure, and should all be covered within the protection scope of this disclosure. Therefore, the protection scope of this disclosure should be determined by the protection scope of the claims.
Claims
1. A network training method, characterized in that it comprises: S101, acquiring training sample images and mask sample images corresponding to the training sample images; wherein the mask sample images are generated from the training sample images and the target mask images; S102, inputting the mask sample image into the first network, and extracting features from the mask sample image based on the first feature extraction network of the first network to obtain a first multi-level feature map corresponding to the mask sample image; S103, inputting the training sample image into the second network, and extracting features from the training sample image based on the second feature extraction network of the second network to obtain a second multi-level feature map corresponding to the training sample image; the size of the second network is larger than the size of the first network; S104, based on the decoder and the mask image corresponding to each feature map in the first multi-level feature map, performing feature restoration processing on the first multi-level feature map to obtain the restored first multi-level feature map; wherein the mask image corresponding to each feature map in the first multi-level feature map is obtained by scaling the target mask image respectively; the decoder includes a spatial alignment module, a decoding module, and a spatial restoration module; the feature restoration processing performed by the decoder on the first multi-level feature map includes: aligning the feature maps of different sizes in the first multi-level feature map to the same spatial resolution based on the spatial alignment module, so that the size of each feature map in the first multi-level feature map is aligned, to obtain a spatially aligned multi-level feature map; based on the mask images corresponding to the spatially aligned multi-level feature maps, replacing the mask regions in the spatially aligned multi-level feature maps with mask markers to obtain a multi-level feature map with mask markers; performing feature prediction processing on the multi-level feature map with mask markers by the decoding module to obtain a feature-predicted multi-level feature map; and restoring the feature-predicted multi-level feature map, whose levels share the same spatial resolution, to a multi-level feature map of the original sizes by the spatial restoration module to obtain the restored first multi-level feature map; S105, training the first network based on the restored first multi-level feature map and the second multi-level feature map, and repeating the above steps S101 to S104 until the training result of the first network meets the preset requirements, to obtain the trained first network.
2. The method of claim 1, wherein, The first feature extraction network includes a pyramid-structured feature extraction module. The step of inputting the mask sample image into the first network and extracting features from the mask sample image based on the first feature extraction network of the first network to obtain a first multi-level feature map corresponding to the mask sample image includes: The masked sample image is input into the first network, and features are extracted from the masked sample image based on the feature extraction module to obtain intermediate multi-level feature maps, and the intermediate multi-level feature maps are used as the first multi-level feature map; wherein, the intermediate multi-level feature map includes multiple intermediate feature maps of different sizes.
3. The method of claim 1, wherein, The first feature extraction network includes a feature extraction module with a pyramid structure and multiple mask convolution modules; the step of inputting the mask sample image into the first network and extracting features from the mask sample image based on the first feature extraction network of the first network to obtain a first multi-level feature map corresponding to the mask sample image includes: The mask sample image and the target mask image are input into the first network, and the feature extraction module is used to extract features from the mask sample image to obtain intermediate multi-level feature maps, wherein the intermediate multi-level feature maps include multiple intermediate feature maps of different sizes. Based on the multiple mask convolutional modules and the target mask image, each intermediate feature map in the intermediate multi-level feature map is masked, and the masked intermediate multi-level feature map is used as the first multi-level feature map.
4. The method of claim 3, wherein, The step of masking each intermediate feature map in the intermediate multi-level feature maps based on the plurality of mask convolutional modules and the target mask image includes: Based on the size of each intermediate feature map in the intermediate multi-level feature maps, the target mask image is scaled to obtain a mask image corresponding to each intermediate feature map in the intermediate multi-level feature maps. For each intermediate feature map, the intermediate feature map and the corresponding mask image are multiplied by the mask convolution module to obtain the masked intermediate feature map, and the masked multi-level feature map is obtained based on each masked intermediate feature map.
5. The method of claim 1, wherein, The step of aligning the feature maps of different sizes in the first multi-level feature map to the same spatial resolution based on the spatial alignment module includes: For each feature map in the first multi-level feature map, the feature map is compared with the target image; When the size of the feature map is larger than the size of the target image, the feature map is subjected to dimensionality reduction processing so that its size is the same as that of the target image; or... When the size of the feature map is smaller than the size of the target image, nearest neighbor interpolation is used to upsample the feature map so that the size of the feature map is consistent with the size of the target image.
6. The method of claim 1, wherein, Before aligning the feature maps of different sizes in the first multi-level feature maps to the same spatial resolution based on the spatial alignment module, the method further includes: Align the number of channels in the first multi-level feature map with the number of channels in the second multi-level feature map; and / or perform layer normalization on the first multi-level feature map and the second multi-level feature map.
7. The method of claim 1, wherein, The step of replacing the mask regions in the spatially aligned multi-level feature maps with mask markers based on the mask images corresponding to the spatially aligned multi-level feature maps includes: For each spatially aligned feature map, the spatially aligned feature map is expanded to obtain a one-dimensional expanded feature map, and based on the mask image corresponding to the spatially aligned feature map, the mask region that needs to be replaced in the expanded feature map is determined. The mask region is replaced with the mask mark to obtain an expanded feature map with the mask mark, and based on each expanded feature map with the mask mark, a multi-level feature map with the mask mark is obtained.
8. The method of claim 7, wherein, After obtaining the expanded feature map with the mask marker, the method further includes: Add cosine absolute position encoding to the expanded feature map with the mask mark, and adaptively adjust the expanded feature map with the mask mark by interpolation based on a preset absolute scale to obtain the adjusted expanded feature map. The process of obtaining the multi-level feature map with masked markers based on each expanded feature map bearing the masked markers includes: Based on the various adjusted expanded feature maps, the multi-level feature map with mask markings is obtained.
9. The method of claim 1, wherein, Training the first network based on the recovered first multi-level feature maps and the second multi-level feature maps includes: The feature recovery loss between the recovered first multi-level feature map and the second multi-level feature map is determined, and the parameters of the first network are adjusted based on the feature recovery loss.
10. The method of claim 9, wherein, The method further includes: Based on the recovered first multi-level feature map, determine the task loss of the first network; and / or, The global context module determines the first global relation corresponding to the recovered first multi-level feature map and the second global relation corresponding to the second multi-level feature map, and determines the global loss between the first global relation and the second global relation. The adjustment of the parameters of the first network based on the feature recovery loss includes: The parameters of the first network are adjusted based on the feature recovery loss, the task loss, and / or the global loss.
11. The method according to any one of claims 1-10, characterized in that, The step of obtaining the training sample image and the mask sample image corresponding to the training sample image includes: The training sample image is acquired and divided into multiple non-overlapping image patches. The target mask image is obtained by random sampling based on a preset mask ratio. The target mask image includes a mask indicator for indicating that the corresponding image block is masked. Based on the mask indicator in the target mask image, the training sample image is masked to obtain a mask sample image corresponding to the training sample image.
12. The method of any one of claims 1-2, wherein, The method further includes: The image to be detected is acquired, and an image detection task is performed on the image based on the trained first network; the image detection task includes an object detection task, a semantic segmentation task, or an instance segmentation task.
13. A network training apparatus, characterized in that, The apparatus is used to perform the steps of the network training method as described in any one of claims 1-12.
14. An electronic device, characterized in that, The device includes a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor. When the electronic device is running, the processor communicates with the memory via the bus. When the machine-readable instructions are executed by the processor, they perform the steps of the network training method as described in any one of claims 1-12.
15. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program that, when executed by a processor, performs the network training steps as described in any one of claims 1-12.
Citation Information
Patent Citations
Training method of image classification model, and image classification method and device
CN115457329A