An end-to-end foggy image multi-target detection model based on knowledge embedding
By combining an image defogging subnetwork and an object detection subnetwork, and employing a semantic association feature learning method with knowledge embedding, the problem of decreased detection accuracy in multi-object detection of foggy images is solved, achieving high-precision detection in multi-object scenarios.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NAT UNIV OF DEFENSE TECH
- Filing Date
- 2022-08-11
- Publication Date
- 2026-06-26
AI Technical Summary
In complex imaging scenarios such as foggy weather, existing end-to-end target detection models face challenges in data acquisition and annotation for multi-target detection, leading to decreased detection accuracy. This is especially true in scenarios with multiple target types, where missed detections and mislabeling are common, impacting environmental perception capabilities.
An end-to-end multi-target detection model for foggy images based on knowledge embedding is adopted, which combines an image defogging sub-network and a target detection sub-network. Through semantic association feature learning and RetinaNet network structure, feature extraction is shared, and the detection accuracy is improved by using a knowledge-guided semantic association feature learning method.
It improves the accuracy of multi-target detection in foggy images, reduces the noise impact of missed detections and mislabeling in the dataset, and enhances the model's generalization ability and detection performance.
Figure CN115424026B_ABST
Description
Technical Field
[0001] This invention relates to the field of pattern recognition technology, specifically to an end-to-end multi-target detection model for foggy images based on knowledge embedding.
Background Technology
[0002] In complex imaging scenarios such as foggy weather, the quality of images captured by the image acquisition system degrades severely, which impairs target detection algorithms and leads to missed and false detections. This, in turn, affects the environmental perception capabilities of unmanned system platforms in the air, on the ground, or at sea. Target detection methods for foggy scenes fall into two categories. The first is the two-stage approach, in which dehazing and detection are carried out as separate, unrelated steps: image enhancement and restoration methods first sharpen the foggy image, and a target detection method is then applied to the result. However, the first-stage dehazing process may introduce artifacts and color distortion, so it does not improve detection accuracy for all images, and it is generally unsuitable for scenarios with strict real-time requirements. The second is the end-to-end approach, which jointly optimizes and trains the dehazing network and the target detection network so that dehazing and detection are performed simultaneously. By sharing feature extraction, it reduces the impact of image degradation and improves target detection accuracy in foggy images.
[0003] End-to-end target detection models for foggy images mainly include the KODNet and DONet network models. KODNet designs the anchor box aspect ratios in the deep detection model to guide target detection in real foggy scenes. DONet cascades a dehazing model with the target detection model and learns them jointly, which effectively addresses the difficulty and low accuracy of target detection in foggy scenes while avoiding the artifacts, loss of detail, and color distortion caused by image enhancement and restoration methods. However, training data for end-to-end target detection in foggy images is difficult to collect and label, especially when multiple different types of targets exist in the scene. The resulting datasets contain noise such as missed and mislabeled targets, which degrades target detection performance in foggy scenes. How to achieve high target detection accuracy with a limited training set is an urgent problem in multi-target detection for foggy images.
Summary of the Invention
[0004] (I) Technical Problem to Be Solved
[0005] To address the shortcomings of existing technologies, this invention provides an end-to-end multi-target detection model for foggy images based on knowledge embedding, which solves the problem of decreased target detection performance in foggy scenes.
[0006] (II) Technical Solution
[0007] To achieve the above objectives, the present invention provides the following technical solution: an end-to-end multi-target detection model for foggy images based on knowledge embedding, comprising an image defogging sub-network, a target detection sub-network, and semantic association feature learning based on knowledge embedding. The image defogging sub-network comprises a common module and a feature recovery module. The feature recovery module comprises an upsampling sub-module, a multi-scale mapping sub-module, and an image generation sub-module.
[0008] Preferably, the image defogging sub-network is used to generate features and shares features with the detection sub-network during joint training to improve target detection accuracy under foggy conditions. The image defogging sub-network is built on the atmospheric scattering model.
[0009] Preferably, the common module extracts features from the input image including important feature information for simultaneous visual enhancement, target recognition, and localization.
[0010] Preferably, the upsampling sub-module belongs to the feature recovery sub-network, in which the output image and the input image are the same size, but the feature map extracted by the common module is one-quarter the size of the input image.
[0011] Preferably, in the multi-scale mapping submodule, the feature resolution is first increased by the upsampling submodule, and the resulting feature map is then passed to the multi-scale mapping submodule for multi-scale feature extraction.
[0012] Preferably, the image generation submodule is the final stage of the image restoration subnetwork, through which scene restoration is completed.
[0013] Preferably, the target detection sub-network model for foggy images uses RetinaNet as the backbone network. RetinaNet utilizes a feature pyramid network to provide a top-down path. Lateral connections enable the construction of higher-resolution network layers from a rich semantic layer, thereby significantly improving the detection accuracy of small targets in foggy scenes. This is because deep networks contain rich semantic information but lack positional information after pooling. Lateral connections between deep networks and corresponding shallow networks can enrich positional information and improve accuracy.
[0014] Preferably, the knowledge-embedded semantic association feature learning adopts a knowledge-guided semantic association feature learning method, which learns features that cover more comprehensive discriminative information by structurally expressing prior knowledge of category-attribute association and category association and embedding it into a deep network model.
[0015] Working principle: First, the common module of the end-to-end foggy image multi-target detection model extracts features from the input image, including important feature information for learning visual enhancement, target recognition, and localization, and the feature recovery module is then used to restore the degraded foggy image. Second, the RetinaNet network structure extracts the feature map of the entire image, and a Feature Pyramid Network (FPN) attached on top of that structure constructs multi-scale features, which addresses feature construction for targets at different scales. Third, a knowledge-embedded feature learning representation method is adopted to address the problem that, when samples are few, the labeled samples cover only a very small portion of appearance features, which results in poor model representation and generalization ability. This also avoids the influence of noise from missed detections and mislabeled samples in the dataset on multi-target detection in foggy scenes. This invention can be used for multi-target detection in foggy scenes and helps promote the application of foggy image scene understanding.
[0016] (III) Beneficial Effects
[0017] This invention provides an end-to-end multi-target detection model for foggy images based on knowledge embedding. It has the following advantages:
[0018] By jointly optimizing the dehazing network and the detection network, the image restoration result is reconstructed under the guidance of target detection information, and the detection network learns the target structural details and color features restored by image dehazing, thereby improving target detection accuracy. At the same time, the knowledge-embedded feature learning representation method addresses the problem that, when samples are few, the labeled samples cover only the appearance features of a very small number of cases, which results in poor expressive power and generalization ability of the learned model. This also avoids the impact of noise such as missed detections and mislabeling in the dataset on target detection in foggy scenes.
Attached Figure Description
[0019] Figure 1 is a flowchart of the target detection network structure for foggy images according to the present invention;
[0020] Figure 2 is a flowchart of the knowledge-guided semantic feature learning framework of the present invention;
[0021] Figure 3 shows the target detection results of the end-to-end foggy image multi-target detection model based on knowledge embedding of the present invention.
Detailed Implementation
[0022] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0023] Example:
[0024] As shown in Figures 1-3, this embodiment of the invention provides an end-to-end multi-target detection model for foggy images based on knowledge embedding, including an image defogging sub-network, a target detection sub-network, and semantic association feature learning based on knowledge embedding. The image defogging sub-network includes a common module and a feature recovery module. The feature recovery module is needed because the input image features extracted by the common module are degraded by fog, which lowers target detection performance. To recover the features output by the common module, the feature recovery sub-network adopts a feature recovery (FR) module, which includes an upsampling sub-module, a multi-scale mapping sub-module, and an image generation sub-module.
[0025] The dehazing subnetwork for foggy images is used to generate features. It is built upon an atmospheric scattering model and shares features with the detection subnetwork during joint training to improve target detection accuracy under foggy conditions.
[0026] Image dehazing is achieved using an atmospheric scattering model, the formula of which is:
[0027] (Formula 1)
[0028] To facilitate estimation of the transmittance and the global atmospheric light intensity value, the formula can be rewritten as:
[0029] (Formula 2)
[0030] Here, during the visual enhancement process, the image dehazing subnetwork combines the transmittance and the atmospheric light intensity value into a single quantity and estimates it with a network model.
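As a rough illustration of the two formulas referenced above, the sketch below assumes the widely used atmospheric scattering form I(x) = J(x)t(x) + A(1 - t(x)) and an AOD-Net-style rewriting J(x) = K(x)I(x) - K(x) + b, in which K(x) absorbs both the transmittance t(x) and the atmospheric light A. Formulas 1 and 2 are not reproduced in this text, so both forms, the function names, and the constant b = 1 are assumptions, not the patent's exact expressions.

```python
def add_fog(clean, t, a):
    """Hazy pixel under the assumed scattering model: I = J*t + A*(1 - t)."""
    return clean * t + a * (1.0 - t)


def combined_k(hazy, t, a, b=1.0):
    """Assumed AOD-Net-style K(x), folding t and A into one quantity."""
    return ((hazy - a) / t + (a - b)) / (hazy - 1.0)


def restore(hazy, k, b=1.0):
    """Restored pixel: multiply by K, subtract K, add the constant bias b."""
    return k * hazy - k + b


# Round trip on a single pixel value (intensities scaled to [0, 1)).
hazy = add_fog(clean=0.3, t=0.6, a=0.9)
k = combined_k(hazy, t=0.6, a=0.9)
restored = restore(hazy, k)
```

In a trained model, K would be predicted by the network from the hazy image rather than computed from a known t and A; the closed form here only checks that the rewriting is algebraically consistent with the scattering model.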
[0031] The common module extracts features from the input image, including important feature information for simultaneous visual enhancement, object recognition, and localization. This common module is not designed independently; to keep the structure simple, it is co-designed with some of the residual modules of the object detection subnetwork. Specifically, the 16 residual modules in the detection subnetwork are divided into four residual stages (denoted Conv_2, Conv_3, Conv_4, and Conv_5, as shown in Figure 1). Features obtained from shallow layers tend to retain more spatial information, which benefits visual enhancement, whereas spatial information in deeper layers is reduced by pooling. Therefore, the first 10 convolutional layers of the detection sub-network are selected to form the common module, and Conv_2 is used as its output. The feature maps obtained from the common module are passed simultaneously to the feature recovery module for visual enhancement and to Conv_3 for target detection.
[0032] The upsampling sub-module belongs to the feature recovery sub-network. The output image and the input image are the same size, but the feature map extracted by the common module is one-quarter the size of the input image. Therefore, the feature recovery sub-network uses the upsampling sub-module to match the resolution of the input image. In current research on deep learning-based dehazing, bilinear interpolation has been applied in convolutional-neural-network-based dehazing, where the pooled feature map is upsampled by bilinear interpolation to generate the dehazed output image. Accordingly, the upsampling sub-module in this model first uses convolutional layers to reduce the feature dimension (specifically, the number of feature channels is reduced by a factor of 7) and then uses bilinear interpolation to enlarge the feature map to the size of the input image.
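The interpolation step can be sketched in plain Python as follows. This is a minimal single-channel illustration, assuming that "one-quarter" refers to each spatial dimension (so the scale factor is 4) and using the half-pixel sampling convention; the function name is hypothetical, and the channel-reducing convolution is omitted.

```python
def bilinear_upsample(feat, scale):
    """Upsample a 2-D feature map (list of rows) by an integer factor."""
    h, w = len(feat), len(feat[0])
    out = []
    for y in range(h * scale):
        # Map the output pixel centre back to input coordinates and clamp.
        sy = min(max((y + 0.5) / scale - 0.5, 0.0), h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        wy = sy - y0
        row = []
        for x in range(w * scale):
            sx = min(max((x + 0.5) / scale - 0.5, 0.0), w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            wx = sx - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = feat[y0][x0] * (1 - wx) + feat[y0][x1] * wx
            bottom = feat[y1][x0] * (1 - wx) + feat[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


small = [[0.0, 1.0], [2.0, 3.0]]
large = bilinear_upsample(small, 4)  # 2x2 map enlarged to 8x8
```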
[0033] Multi-scale mapping submodule: after the feature resolution is increased by the upsampling submodule, the resulting feature map is passed to the multi-scale mapping submodule for multi-scale feature extraction. Multi-scale feature extraction is widely used in image dehazing methods and is very effective in enhancing visibility. The multi-scale mapping submodule consists of four parallel convolutions of different kernel sizes, each with 4 channels. The final features are passed through an additional convolution to estimate the quantity that combines the transmittance and the global atmospheric light intensity value.
[0034] The image generation submodule is the final stage of the image restoration subnetwork and completes the scene restoration process. Taking the estimate from the multi-scale mapping submodule as input, it evaluates the transformation in Formula 2 using an element-wise multiplication layer, a feature vector extraction layer, and a feature vector addition layer.
[0035] To train the visibility enhancement subnetwork, the image restoration subnetwork uses the mean squared error (MSE) loss, which can be described as:
[0036] (Formula 3)
[0037] Here, the symbols in Formula 3 denote the size of the image patch, the ground-truth restored image, and the estimated restored image, respectively. It should be emphasized that, although the restoration subnetwork can directly generate dehazed images, its goal is not to produce a dehazed image as input to the detection subnetwork. Its goal is to make the common module produce features unaffected by fog by learning the visibility enhancement task.
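A minimal sketch of such a loss on a single-channel patch is given below. Formula 3 is not reproduced in this text, so the normalization by the number of pixels and the function name are assumptions based on the usual definition of mean squared error.

```python
def mse_loss(estimated, target):
    """Mean squared error between two equally sized 2-D patches."""
    n = len(estimated) * len(estimated[0])  # number of pixels in the patch
    total = 0.0
    for est_row, tgt_row in zip(estimated, target):
        for e, t in zip(est_row, tgt_row):
            total += (e - t) ** 2
    return total / n


estimated_patch = [[1.0, 2.0], [3.0, 4.0]]
target_patch = [[1.0, 2.0], [3.0, 2.0]]
loss = mse_loss(estimated_patch, target_patch)  # one pixel differs by 2
```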
[0038] The target detection sub-network model for foggy images uses RetinaNet as the backbone network. RetinaNet utilizes a feature pyramid network (FPN) to provide a top-down path. Lateral connections enable the construction of higher-resolution network layers from a rich semantic layer, thereby significantly improving the detection accuracy of small targets in foggy scenes. This is because deep networks contain rich semantic information but lack positional information after pooling. Lateral connections between deep networks and corresponding shallow networks can enrich positional information and improve accuracy.
[0039] To detect targets efficiently, the detection sub-network proceeds in three steps. First, the RetinaNet network structure extracts feature maps from the entire image. Then, a Feature Pyramid Network (FPN) attached on top of that structure builds multi-scale features, which addresses feature construction for targets at different scales. Finally, simplified fully convolutional networks (FCNs) are added to the FPN layers to perform target classification and bounding box regression, completing the multi-target recognition and localization tasks.
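The top-down path with lateral connections can be illustrated with a toy single-channel pyramid. This is a simplified sketch: the 1x1 lateral convolutions and the smoothing convolutions of a real FPN are replaced by identity mappings, nearest-neighbour upsampling is assumed, and all names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D map (list of rows)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def top_down(backbone_maps):
    """Merge maps ordered fine to coarse; each is half the size of the previous."""
    pyramid = [backbone_maps[-1]]  # start from the coarsest, most semantic map
    for lateral in reversed(backbone_maps[:-1]):
        up = upsample2x(pyramid[0])
        # Lateral connection: add the shallow map to the upsampled deep map.
        merged = [[a + b for a, b in zip(lr, ur)] for lr, ur in zip(lateral, up)]
        pyramid.insert(0, merged)
    return pyramid


c3 = [[1.0] * 4 for _ in range(4)]
c4 = [[2.0] * 2 for _ in range(2)]
c5 = [[4.0]]
p3, p4, p5 = top_down([c3, c4, c5])
```

Each output level keeps the resolution of its shallow input while inheriting information from all deeper levels, which is the property the text relies on for small targets.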
[0040] To train the target classification and detection network, the focal loss function with a balancing variable is used, described as follows:
[0041] (Formula 4)
[0042] Here, the balancing variable takes one value for target category 1 and another for target category -1, the focusing parameter is adjustable, and the class probability term is defined as
[0043]
[0044] Here, the class label specifies the ground-truth class, and the target class probability is obtained through model estimation.
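For orientation, the sketch below implements the focal loss in the form commonly used with RetinaNet, FL = -alpha_t (1 - p_t)^gamma log(p_t), with labels in {1, -1}. Formula 4 is not reproduced in this text, so this standard form, the symbol names, and the default values alpha = 0.25 and gamma = 2 are assumptions.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss for one prediction.

    p: estimated probability of the positive class (0 < p < 1)
    y: class label, 1 for the target class and -1 otherwise
    """
    p_t = p if y == 1 else 1.0 - p               # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha   # class-balancing weight
    # (1 - p_t)^gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)   # confident and correct: small loss
hard = focal_loss(0.1, 1)   # confident and wrong: large loss
```

With gamma set to 0 the expression reduces to an alpha-weighted cross-entropy, which is a convenient sanity check.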
[0045] For target localization, the detection network applies a smooth L1 loss between the predicted bounding boxes and the reference (ground-truth) boxes. The matching pairs between anchor boxes and reference boxes form a set whose size is the number of matching pairs. For each matched anchor box, a reference box regression target is defined, and the corresponding predicted box is represented in the same way by four values: the two center coordinates, the width, and the height of the bounding box. The localization loss is expressed as follows:
[0046] (Formula 5)
[0047] here, , , , .
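The localization loss can be sketched as below. Formula 5 and the definitions in this paragraph are not reproduced in this text, so the sketch assumes the common anchor-relative box encoding (center offsets scaled by anchor size, log-ratios for width and height) and the usual smooth L1 function with threshold 1; the function names are hypothetical.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def encode_box(box, anchor):
    """Regression target of a (cx, cy, w, h) box relative to its anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def localization_loss(predicted_offsets, reference_boxes, anchors):
    """Smooth L1 loss averaged over the matched anchor/reference pairs."""
    total = 0.0
    for pred, ref, anchor in zip(predicted_offsets, reference_boxes, anchors):
        target = encode_box(ref, anchor)
        total += sum(smooth_l1(p - t) for p, t in zip(pred, target))
    return total / max(len(anchors), 1)


anchors = [(10.0, 10.0, 4.0, 4.0)]
references = [(10.0, 10.0, 4.0, 4.0)]
perfect = localization_loss([(0.0, 0.0, 0.0, 0.0)], references, anchors)
```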
[0048] Knowledge-embedded semantic association feature learning is a knowledge-guided semantic association feature learning method. It expresses prior knowledge about category-attribute associations and category-category associations in structured form and embeds it into a deep network model, so that the learned features cover more comprehensive discriminative information. In the foggy image target detection model, the semantic association feature learning model is connected to the RetinaNet backbone network. First, a knowledge graph of category-attribute associations is constructed for each target class, and its attribute knowledge features are learned using a graph propagation network. Then, a semantic association attention mechanism is introduced, which uses these attribute knowledge features to guide the learning of attribute association features for each target class. Based on these attribute association features, the coexistence probability of different target classes in the image is learned, and a category-association knowledge graph is constructed from the coexistence probabilities. Context-related semantic features are then learned through graph propagation and interaction networks.
[0049] Assume the foggy image scene contains a given number of object categories and a given number of object attributes. For each category, a knowledge graph associating the category with its attributes is constructed. The graph consists of a node set, which contains the category node and the attribute nodes, and a node association matrix, whose entries give the association probability between pairs of nodes. A knowledge graph of category-category associations is then constructed in the same way: its nodes are the category nodes, and the entries of its node association matrix give the coexistence probability of pairs of categories.
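The text does not say how prior association probabilities are obtained. One plausible option, shown here purely as an assumption, is to estimate the coexistence probability of two categories from how often they appear together in annotated images; the function name and the toy labels are hypothetical.

```python
def cooccurrence_matrix(image_labels, categories):
    """Entry [i][j] estimates P(category j present | category i present)."""
    index = {c: i for i, c in enumerate(categories)}
    n = len(categories)
    count = [0] * n
    joint = [[0] * n for _ in range(n)]
    for labels in image_labels:
        present = sorted({index[label] for label in labels})
        for i in present:
            count[i] += 1
            for j in present:
                if i != j:
                    joint[i][j] += 1
    return [
        [joint[i][j] / count[i] if count[i] else 0.0 for j in range(n)]
        for i in range(n)
    ]


categories = ["car", "person", "bus"]
annotations = [["car", "person"], ["car"], ["person", "bus"]]
association = cooccurrence_matrix(annotations, categories)
```

The resulting matrix is not symmetric in general, since the conditional probability depends on which category is taken as given.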
[0050] Given a foggy image, image features are first extracted by the multi-scale global feature extraction of the object detection sub-network. For each target category, the GloVe model is used to extract features of the category and of its corresponding attributes, which are used to initialize the corresponding category and attribute nodes of the graph. Then, a Graph Convolutional Network (GCN) is introduced to explore information propagation and interaction among different nodes and to update the node features, i.e.
[0051] (Formula 6)
[0052] The adjacency matrix is initialized using prior information, and the relationships between categories and attributes are then jointly optimized and learned during training. In the graph convolution operations, information is exchanged among the graph nodes, which yields updated node features. The features of all nodes are concatenated and mapped to the attribute knowledge representation of the category:
[0053] (Formula 7)
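A single propagation step of this kind can be sketched as follows. Formulas 6 and 7 are not reproduced in this text, so the sketch assumes a common GCN layer of the form H' = ReLU(A_norm H W) with self-loops and row normalization, followed by plain concatenation of the node features; the final learned mapping is omitted, and all names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_rows(adj):
    """Add self-loops, then scale each row to sum to one."""
    out = []
    for i, row in enumerate(adj):
        with_loop = [v + (1.0 if i == j else 0.0) for j, v in enumerate(row)]
        total = sum(with_loop)
        out.append([v / total for v in with_loop])
    return out


def gcn_layer(adj, feats, weight):
    """One graph convolution: aggregate neighbours, project, apply ReLU."""
    hidden = matmul(matmul(normalize_rows(adj), feats), weight)
    return [[max(0.0, v) for v in row] for row in hidden]


def concatenate_nodes(node_feats):
    """Concatenate all node features into one vector."""
    return [v for row in node_feats for v in row]


adjacency = [[0.0, 1.0], [1.0, 0.0]]   # one category node, one attribute node
features = [[1.0, 0.0], [0.0, 1.0]]    # stand-ins for GloVe embeddings
identity = [[1.0, 0.0], [0.0, 1.0]]
updated = gcn_layer(adjacency, features, identity)
knowledge = concatenate_nodes(updated)
```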
[0054] A knowledge-guided attention mechanism is then introduced, which uses the attribute knowledge representation of each category to guide the learning of semantic attribute association features. Specifically, for each position of the image feature map, the feature at that position is first fused with the corresponding knowledge representation, and the importance factor of that position is then learned. This operation is repeated for every position to obtain the importance factors of all positions, which are normalized with the softmax function to give the final normalized importance factors. Finally, weighted average pooling is used to obtain the semantic attribute association feature of the category:
[0055] (Formula 8)
[0056] The above operations are performed for all categories to obtain the attribute association features of all categories, where the feature vector of each category mainly captures the features of the regions that cover the attributes associated with that category.
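The attention and pooling steps can be sketched as follows. The fusion and scoring functions are not reproduced in this text, so the sketch assumes fusion by element-wise product with the knowledge vector and a linear scoring vector w; these choices and all names are hypothetical.

```python
import math


def attention_pool(position_feats, knowledge, w):
    """Return (normalized importance factors, pooled feature vector)."""
    # Fuse each position with the knowledge vector and score it.
    scores = [
        sum(wi * fi * ki for wi, fi, ki in zip(w, feat, knowledge))
        for feat in position_feats
    ]
    # Softmax over positions (shifted by the maximum for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    # Weighted average pooling over positions.
    dim = len(position_feats[0])
    pooled = [
        sum(a * feat[d] for a, feat in zip(weights, position_feats))
        for d in range(dim)
    ]
    return weights, pooled


feats = [[10.0, 0.0], [0.0, 0.0]]   # two positions, two channels
weights, pooled = attention_pool(feats, knowledge=[1.0, 1.0], w=[1.0, 1.0])
```

Positions whose fused features score highly dominate the pooled vector, which matches the description that each category's feature concentrates on the regions covering its associated attributes.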
[0057] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.
Claims
1. An end-to-end multi-target detection system for foggy images based on knowledge embedding, comprising an image defogging sub-network, a target detection sub-network, and semantic association feature learning based on knowledge embedding, characterized in that: The image dehazing subnetwork includes a common module and a feature recovery module. The feature recovery module includes an upsampling submodule, a multi-scale mapping submodule, and an image generation submodule. The target detection sub-network uses RetinaNet as the backbone network. RetinaNet uses a feature pyramid network to provide a top-down path. Lateral connections enable the construction of a higher resolution network layer from a rich semantic layer, thereby significantly improving the detection accuracy of small targets in foggy scenes. This is because deep networks contain rich semantic information but lack positional information after pooling. Lateral connections between deep networks and corresponding shallow networks can enrich positional information and improve accuracy. The knowledge-embedded semantic association feature learning method adopts a knowledge-guided semantic association feature learning approach. It learns features that cover more comprehensive discriminative information by structurally expressing prior knowledge of category-attribute association and category association and embedding it into a deep network model. Knowledge-embedded semantic association feature learning is a knowledge-guided semantic association feature learning method. It uses a structured representation of prior knowledge about category-attribute associations and category associations, embedding it into a deep network model to learn features covering more comprehensive discriminative information. In the foggy image target detection model, the semantic association feature learning model is connected to the RetinaNet backbone network. 
First, a knowledge graph of category-attribute associations is constructed for each target class, and its attribute knowledge features are learned using a graph propagation network. Then, a semantic association attention mechanism is introduced, using these attribute knowledge features to guide the learning of attribute association features for each target class. Based on these attribute association features, the coexistence probability of different target classes in the image is learned, and a category-association knowledge graph is constructed based on the coexistence probability. Context-related semantic features are learned through graph propagation and interaction networks.
2. The end-to-end multi-target detection system for foggy images based on knowledge embedding according to claim 1, characterized in that: The image defogging subnetwork is used to generate features and shares features with the detection subnetwork during joint training to improve target detection accuracy under foggy conditions. The image defogging subnetwork is built on the atmospheric scattering model.
3. The end-to-end multi-target detection system for foggy images based on knowledge embedding according to claim 1, characterized in that: The common module extracts features from the input image, including important feature information for simultaneous visual enhancement, target recognition, and localization.
4. The end-to-end multi-target detection system for foggy images based on knowledge embedding according to claim 1, characterized in that: The upsampling submodule is located in the feature recovery subnetwork. The output image and the input image are the same size, but the size of the feature map extracted by the common module is one-quarter of the input image.
5. The end-to-end multi-target detection system for foggy images based on knowledge embedding according to claim 1, characterized in that: In the multi-scale mapping submodule, the feature resolution is first increased by the upsampling submodule, and the resulting feature map is then passed to the multi-scale mapping submodule for multi-scale feature extraction.
6. The end-to-end multi-target detection system for foggy images based on knowledge embedding according to claim 1, characterized in that: The image generation submodule is the final stage of the image restoration subnetwork, through which scene restoration is completed.