A five-modal unified object detection method
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NORTHWEST A & F UNIV
- Filing Date
- 2026-03-20
- Publication Date
- 2026-06-19
AI Technical Summary
Existing target detection methods for five heterogeneous image modalities suffer from large parameter redundancy and high training and deployment costs. Joint training is also prone to inconsistent input structures, large differences in feature distribution, class space conflicts, and gradient interference.
A unified detection framework is adopted, consisting of modality-specific input adapters, a shared backbone network, a shared feature fusion network, and modality-specific detection heads. Combined with a single-batch, single-modality round-robin joint training mechanism, it enables training and inference on five heterogeneous image modalities in one unified model.
Sharing intermediate feature extraction and multi-scale semantic fusion within a unified framework reduces the number of model parameters and deployment costs, improves training stability and engineering feasibility, and suits multi-dataset scenarios in which images are not spatially registered.
Abstract
Description
Technical Field
[0001] This invention relates to the fields of computer vision, remote sensing image processing, and artificial intelligence target detection technology, and in particular to a unified target detection method, apparatus, electronic device, and computer-readable storage medium for five heterogeneous modal images: visible light, infrared, synthetic aperture radar, multispectral, and hyperspectral.
Background Technology
[0002] With the development of remote sensing, UAV observation, and intelligent visual perception technologies, various modal images, including visible light, infrared, synthetic aperture radar, multispectral, and hyperspectral images, are widely used in target detection tasks. Different modal images exhibit significant differences in imaging mechanisms, texture distribution, signal-to-noise characteristics, channel composition, and target appearance. For example, visible light images possess rich texture and color information, infrared images highlight the target's thermal radiation characteristics, synthetic aperture radar images are significantly affected by scattering characteristics, and multispectral and hyperspectral images contain even richer spectral information. Therefore, while different modal data have their own advantages in target detection tasks, they also present significant challenges for unified modeling.
[0003] In existing technologies, target detection methods for different modalities typically employ two main approaches: one is to build independent detection models for each modality, i.e., training separate target detection networks for visible light, infrared, SAR, multispectral, and hyperspectral images; the other is to perform joint detection using feature-level fusion or decision-level fusion for two or more modalities with spatial registration relationships. While the former approach can adapt to the imaging characteristics of different modalities to some extent, it requires maintaining multiple independent models, resulting in a large number of model parameters, high training and deployment costs, and difficulty in utilizing the shared representation capabilities between different modalities in terms of edges, geometric structures, and semantic targetness. The latter approach usually relies on strict spatial alignment relationships or scene correspondences between multimodal data, which is often difficult to apply directly to heterogeneous modal data from different public datasets, different sensor platforms, and different category systems.
[0004] Furthermore, most existing general-purpose object detection networks are built based on three-channel image input and rely on pre-trained weights from visible light images for initialization. For infrared images, SAR images, and high-dimensional spectral images, directly feeding the network with the raw input can easily lead to problems such as inconsistent input dimensions, significant feature distribution shifts, and difficulty in effectively transferring pre-trained weights. Meanwhile, simply mixing multiple modalities directly into a single shared detection head for training can easily result in training instability and decreased detection performance due to inconsistencies in different modality category sets, label space conflicts, and mutual interference in gradient updates.
[0005] Therefore, existing technologies still lack a unified object detection method that can adapt to the differences in image input from five heterogeneous modalities, achieve shared feature extraction within a unified framework, avoid category space conflicts, and reduce overall training and deployment costs.
Summary of the Invention
[0006] Existing technologies for heterogeneous modal images such as visible light, infrared, synthetic aperture radar, multispectral, and hyperspectral images typically require a separate target detection model for each modality, which leads to large parameter redundancy and high training and deployment costs. Direct joint training, in turn, suffers from inconsistent input structures, significant differences in feature distributions, class space conflicts, and mutual interference of multimodal gradients. To overcome these problems, this invention proposes a five-modal unified target detection method, device, electronic device, and computer-readable storage medium. The invention introduces modality-specific input adapters, a shared backbone network, a shared feature fusion network, and modality-specific detection heads into a unified detection framework, combined with a single-batch, single-modality, round-robin joint training mechanism. This enables five types of heterogeneous modal images to complete training and inference within a single unified model, thus balancing multimodal difference modeling, cross-modal shared representation learning, and engineering deployment efficiency.
[0007] The purpose of this invention is to provide a unified target detection scheme applicable to visible light images, infrared images, synthetic aperture radar images, multispectral images, and hyperspectral images. While preserving the differences in input features and output categories across different modalities, it aims to share intermediate feature extraction and multi-scale semantic fusion capabilities as much as possible, reducing the maintenance and deployment costs of multiple models, and addressing the difficulties of joint training in scenarios with multiple datasets, non-registration, and inconsistent class spaces. Furthermore, this invention aims to enable multiple non-RGB modalities to be compatible with a target detection backbone network initialized with three-channel pre-trained weights through unified three-channel input processing and a modality-specific lightweight adaptation mechanism, thereby improving training stability and the engineering feasibility of the method.
[0008] To achieve the above objectives, the present invention adopts the following technical solution, and the steps are as follows:
[0009] S1. Obtain and construct a five-modal object detection dataset.
[0010] Image data of five modalities are acquired: visible light, infrared, synthetic aperture radar, multispectral, and hyperspectral. The original annotations of each modal dataset are organized uniformly, and target bounding box labels from different sources and in different layouts are converted into a unified target detection annotation format. An index file that maps each image path to its label path is established to support unified reading and joint training of the multimodal heterogeneous data.
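As an illustration of the label conversion in S1, the following sketch converts a pixel-coordinate corner box into a normalized center-format text row and builds one index entry. The target format (class id followed by normalized center x, center y, width, height), the function names, and the file paths are assumptions made for this example; the text only requires that all datasets share one annotation format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One row of the image/label index file (hypothetical layout)."""
    modality: str
    image_path: str
    label_path: str


def corners_to_normalized(box, img_w, img_h):
    """Convert (x_min, y_min, x_max, y_max) in pixels to normalized
    (cx, cy, w, h). The normalized center format is an assumed target format."""
    x_min, y_min, x_max, y_max = box
    if not (0 <= x_min < x_max <= img_w and 0 <= y_min < y_max <= img_h):
        raise ValueError(f"box {box} lies outside a {img_w}x{img_h} image")
    cx = (x_min + x_max) / 2 / img_w
    cy = (y_min + y_max) / 2 / img_h
    return cx, cy, (x_max - x_min) / img_w, (y_max - y_min) / img_h


def label_row(class_id, box, img_w, img_h):
    """Format one annotation as text: class id plus four normalized values."""
    values = corners_to_normalized(box, img_w, img_h)
    return " ".join([str(class_id)] + [f"{v:.6f}" for v in values])


# Hypothetical paths and class id, for illustration only.
entry = IndexEntry("infrared", "images/ir/0001.png", "labels/ir/0001.txt")
row = label_row(2, (100, 50, 300, 150), img_w=640, img_h=500)
print(entry.modality, row)
```

Keeping the modality name in each index entry lets a later sampler group samples by modality without inspecting the image itself.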
[0011] S2. Perform three-channel input unification on the five modal images.
[0012] To ensure compatibility with the input structure of the unified detection network and the three-channel pre-training weight initialization method, the five modal images are uniformly constructed as three-channel inputs. Visible light images directly use the original three-channel image. For infrared and synthetic aperture radar images, if the original data is single-channel, a three-channel representation can be constructed through single-channel duplication, grayscale enhancement followed by duplication, pseudo-color mapping, or other equivalent methods. For multispectral images, a three-channel representation can be constructed through band selection, band combination, linear projection, feature compression mapping, or a preprocessed three-channel result. For hyperspectral images, a three-channel representation can be constructed through principal component analysis, spectral band selection, linear mapping, pseudo-color synthesis, or other spectral band compression methods. This unified three-channel processing allows images from different modalities to enter the same detection network with a unified dimension, without restricting the three-channel construction to a single implementation path.
[0013] Furthermore, the unified processing of the three-channel input can be expressed as:
[0014] $$X_m = T_m(I_m), \quad m \in \{1, 2, 3, 4, 5\}$$
[0015] where $I_m$ denotes the original input data of the $m$-th modality, $T_m(\cdot)$ denotes the three-channel constructor of the corresponding modality, $X_m$ denotes the unified three-channel input image, and $m$ is the modality index. This step ensures that data from different modalities have a consistent input dimension before entering the unified object detection network.
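A per-modality three-channel constructor can be sketched as a small registry of functions. The example below is a minimal illustration that uses only two of the listed options (single-channel duplication and band selection); the registry, the function names, and the band indices are placeholders chosen for this example, not values from the source.

```python
import numpy as np


def replicate_single_channel(img):
    """Single-channel duplication for infrared or SAR: (H, W) -> (H, W, 3)."""
    if img.ndim != 2:
        raise ValueError("expected a single-channel (H, W) image")
    return np.repeat(img[:, :, None], 3, axis=2)


def select_bands(cube, bands):
    """Band selection for spectral cubes: (H, W, C) -> (H, W, 3)."""
    if len(bands) != 3:
        raise ValueError("exactly three band indices are required")
    return cube[:, :, list(bands)]


# Hypothetical registry; the band indices are placeholders.
CONSTRUCTORS = {
    "visible": lambda img: img,
    "infrared": replicate_single_channel,
    "sar": replicate_single_channel,
    "multispectral": lambda cube: select_bands(cube, (3, 2, 1)),
    "hyperspectral": lambda cube: select_bands(cube, (30, 20, 10)),
}


def to_three_channels(raw, modality):
    """Apply the constructor registered for the modality and check the result."""
    out = CONSTRUCTORS[modality](raw)
    if out.ndim != 3 or out.shape[2] != 3:
        raise ValueError(f"{modality} constructor did not return three channels")
    return out


ir = np.random.default_rng(0).random((64, 64))      # synthetic infrared frame
ms = np.random.default_rng(1).random((64, 64, 8))   # synthetic 8-band image
x_ir = to_three_channels(ir, "infrared")
x_ms = to_three_channels(ms, "multispectral")
print(x_ir.shape, x_ms.shape)
```

Because every constructor returns the same shape, the downstream network never needs to know how a given modality was reduced to three channels.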
[0016] S3. Construct a five-modal unified target detection network.
[0017] A unified detection network structure is constructed, consisting of "modality-specific input adapters—shared backbone network—shared feature fusion network—modality-specific detector heads". The network comprises five modality-specific input adapters, one shared backbone network, one shared feature fusion network, and five modality-specific detector heads. Each modality corresponds to one input adapter and one detector head. The shared backbone network and shared feature fusion network are used by all five modalities to learn shared edge, shape, structural, and semantic features across modalities, reducing parameter count and deployment overhead.
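The parameter saving from sharing can be made explicit with a simple count. The symbols below are introduced here for illustration only: $P_A$, $P_B$, $P_N$, and $P_H$ stand for the parameter counts of one input adapter, the backbone, the feature fusion network, and one detection head, and they are assumed equal across modalities.

```latex
\begin{aligned}
P_{\text{separate}} &= 5\,(P_B + P_N + P_H), \\
P_{\text{unified}}  &= 5\,P_A + P_B + P_N + 5\,P_H, \\
P_{\text{separate}} - P_{\text{unified}} &= 4\,(P_B + P_N) - 5\,P_A .
\end{aligned}
```

Under these assumptions the unified network is smaller whenever $5P_A < 4(P_B + P_N)$, which holds when the adapters are lightweight relative to the backbone and fusion network.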
[0018] S4. Perform modality-specific input adaptation based on the modality identifier.
[0019] For any input sample, a corresponding modality identifier is assigned during the training or inference phase. The system then invokes the corresponding modality-specific input adapter based on the modality identifier to perform lightweight feature correction and input distribution adjustment on the three-channel input of the current modality. Preferably, the modality-specific input adapter employs a residual gating adapter structure, and its output satisfies:
[0020] $$\tilde{X}_m = X_m + \sigma(g_m) \cdot F_m(X_m)$$
[0021] where $X_m$ denotes the three-channel input image of the $m$-th modality, $\tilde{X}_m$ denotes the output features after processing by the adapter, $F_m(X_m)$ denotes the correction features extracted by the lightweight convolutional branch, $g_m$ denotes the learnable gating parameter of the corresponding modality, and $\sigma(\cdot)$ denotes the activation function, preferably the sigmoid function. By introducing learnable gating parameters, the input adapter maintains a near-identity mapping in the early stages of training, reducing disruption to the pre-trained feature distribution. As training progresses, the adapter can gradually learn an input correction suited to the corresponding modality, thereby enhancing the unified detection network's adaptability to differences in heterogeneous modal input distributions.
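The effect of the gate on the near-identity behavior can be checked numerically. The sketch below assumes the adapter output has the form input plus sigmoid(gate parameter) times the branch output; under that assumption, a "small" initial value has to be strongly negative for the gate to start near zero. The pixel value, the branch output, and the gate values are illustrative, not taken from the source.

```python
import math


def sigmoid(z):
    """Logistic function, the preferred gate activation."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_residual(x, correction, gate_param):
    """Assumed adapter form for one scalar value: x + sigmoid(g) * correction."""
    return x + sigmoid(gate_param) * correction


x, correction = 0.80, 0.30          # illustrative pixel value and branch output
for g in (-6.0, -2.0, 0.0, 2.0):    # illustrative gate values
    y = gated_residual(x, correction, g)
    print(f"g={g:+.1f}  gate={sigmoid(g):.4f}  output={y:.4f}")
```

As the gate parameter grows during training, the correction branch contributes a larger share of the output, which matches the gradual adaptation described above.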
[0022] S5. Perform shared feature extraction and modality-specific detection output.
[0023] The features adapted to modality-specific inputs are fed into a shared backbone network and a shared feature fusion network to extract multi-scale visual features that can be shared across modalities. Then, based on the modality identifier, the corresponding modality-specific detection head is invoked, and the category prediction result and bounding box regression result corresponding to the current modality are output. This process can be represented as:
[0024] $$Y_m = H_m\big(N\big(B\big(A_m(X_m)\big)\big)\big)$$
[0025] where $A_m(\cdot)$ denotes the input adapter of the $m$-th modality, $B(\cdot)$ denotes the shared backbone network, $N(\cdot)$ denotes the shared feature fusion network, $H_m(\cdot)$ denotes the detection head of the $m$-th modality, and $Y_m$ denotes the detection output for the current modality. Since each modality's detection head configures its classification output dimension according to the number of categories in the corresponding dataset, the conflicts that a single unified output would suffer from inconsistent category spaces across modalities are avoided.
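The routing described in S5 can be sketched with plain callables standing in for the real network stages. Everything concrete here is a placeholder: the class name, the toy "features" (lists of floats), and the per-modality class counts are assumptions chosen only to show that each head has its own output dimension.

```python
MODALITIES = ("visible", "infrared", "sar", "multispectral", "hyperspectral")

# Hypothetical class counts; real values come from each modality's dataset.
NUM_CLASSES = {"visible": 80, "infrared": 3, "sar": 6,
               "multispectral": 8, "hyperspectral": 2}


class UnifiedDetector:
    """Routes a sample through its own adapter and head around shared stages."""

    def __init__(self, adapters, backbone, neck, heads):
        if set(adapters) != set(heads):
            raise ValueError("each modality needs one adapter and one head")
        self.adapters, self.heads = adapters, heads
        self.backbone, self.neck = backbone, neck

    def forward(self, x, modality):
        if modality not in self.adapters:
            raise KeyError(f"unknown modality identifier: {modality}")
        features = self.neck(self.backbone(self.adapters[modality](x)))
        return self.heads[modality](features)


def make_head(num_classes):
    """Toy head: emits one score per class of its own modality."""
    def head(feature):
        return [feature] * num_classes
    return head


model = UnifiedDetector(
    adapters={m: (lambda x: x) for m in MODALITIES},   # identity stand-ins
    backbone=lambda x: [2.0 * v for v in x],           # shared stand-in
    neck=lambda f: sum(f) / len(f),                    # shared stand-in
    heads={m: make_head(n) for m, n in NUM_CLASSES.items()},
)
scores_ir = model.forward([0.1, 0.2, 0.3], "infrared")
scores_sar = model.forward([0.1, 0.2, 0.3], "sar")
print(len(scores_ir), len(scores_sar))
```

The same input produces outputs of different lengths depending on the modality identifier, which is the property that keeps the label spaces from colliding.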
[0026] S6. Train using a round-robin single-batch single-modal joint training mechanism.
[0027] The training samples for the five modalities are divided into five sample groups. During training, a batch is selected sequentially from different modal sample groups according to a preset rotation order as the current training batch, ensuring that each training batch contains only samples of the same modality. When a modal batch is fed into the network, only the input adapter and detector head corresponding to that modality are used for forward propagation; the detector heads for other modalities do not participate in the output of the current batch. Preferably, the rotating sampler can adjust the sampling order and sampling ratio of the five modalities according to a preset sampling mode, and perform repeated sampling on modalities with smaller sample sizes using a rollback oversampling method to ensure that single-modal batches remain valid.
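A round-robin single-modality sampler along these lines might look as follows. This sketch interprets "rollback oversampling" as wrapping around to the start of a sample group once it is exhausted; that interpretation, the function name, and the toy sample lists are assumptions for illustration.

```python
from itertools import islice


def round_robin_batches(groups, batch_size, order=None):
    """Yield (modality, batch) pairs forever, one modality per batch.

    groups: dict mapping modality -> list of samples.
    order:  rotation order; repeating a name raises that modality's ratio.
    A group that runs out wraps around to its start, so a small group is
    oversampled rather than producing a short or empty batch.
    """
    order = list(order or groups)
    for name in order:
        if not groups[name]:
            raise ValueError(f"sample group {name!r} is empty")
    cursors = {name: 0 for name in groups}
    while True:
        for name in order:
            samples = groups[name]
            start = cursors[name]
            batch = [samples[(start + i) % len(samples)]
                     for i in range(batch_size)]
            cursors[name] = (start + batch_size) % len(samples)
            yield name, batch


groups = {
    "visible": [f"v{i}" for i in range(6)],
    "infrared": [f"i{i}" for i in range(3)],
    "sar": [f"s{i}" for i in range(4)],
    "multispectral": [f"m{i}" for i in range(4)],
    "hyperspectral": [f"h{i}" for i in range(2)],
}
first_ten = list(islice(round_robin_batches(groups, batch_size=4), 10))
for name, batch in first_ten:
    print(name, batch)
```

Passing an `order` such as `["visible", "sar", "visible", "infrared", ...]` would change the sampling ratio without touching the rest of the loop.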
[0028] S7. Compute the detection loss and update parameters only for the current modal batch.
[0029] For the current training batch, only the loss function of the detector head corresponding to its modality is calculated, and this loss is used to update the network parameters through backpropagation. The total loss function can be expressed as:
[0030] $$\mathcal{L}_m = \lambda_{cls}\,\mathcal{L}_{cls} + \lambda_{box}\,\mathcal{L}_{box} + \lambda_{dfl}\,\mathcal{L}_{dfl}$$
[0031] where $\mathcal{L}_m$ denotes the total loss of the current batch of the $m$-th modality, $\mathcal{L}_{cls}$ denotes the classification loss, $\mathcal{L}_{box}$ denotes the bounding box regression loss, $\mathcal{L}_{dfl}$ denotes the distribution focal loss, and $\lambda_{cls}$, $\lambda_{box}$, and $\lambda_{dfl}$ denote the weight coefficients of the corresponding loss terms. Under this training mechanism, the shared backbone network and the shared feature fusion network receive joint gradient updates from batches of different modalities, while each modality-specific input adapter and each modality-specific detection head are updated only by samples of their own modality.
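The weighted sum of the three loss terms, restricted to the head of the current batch's modality, can be written as a short sketch. The weight values and the per-head loss numbers below are illustrative assumptions; the text does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    """Weight coefficients of the three loss terms (illustrative values)."""
    cls: float = 0.5
    box: float = 7.5
    dfl: float = 1.5


def total_loss(cls_loss, box_loss, dfl_loss, weights=LossWeights()):
    """Weighted sum of classification, box regression, and distribution focal loss."""
    return (weights.cls * cls_loss
            + weights.box * box_loss
            + weights.dfl * dfl_loss)


def batch_loss(modality, per_head_terms, weights=LossWeights()):
    """Only the head matching the batch's modality contributes to the loss."""
    cls_loss, box_loss, dfl_loss = per_head_terms[modality]
    return total_loss(cls_loss, box_loss, dfl_loss, weights)


# Made-up loss terms per head; the SAR entry is ignored for an infrared batch.
terms = {"infrared": (0.8, 0.1, 0.2), "sar": (5.0, 5.0, 5.0)}
loss = batch_loss("infrared", terms)
print(round(loss, 4))
```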
[0032] S8. Perform modality-specific inference and output the target detection results.
[0033] During the inference phase, based on the modality of the image to be detected or an externally provided modality identifier, the corresponding input adapter and modality-specific detection head are invoked. After processing by a shared backbone network and a shared feature fusion network, the target category, bounding box, and confidence score are output. Subsequently, non-maximum suppression is performed on the detection results to obtain the final target detection result. Since the inference phase only activates the input adapter and detection head corresponding to the current modality, this method preserves modality specificity while avoiding the additional overhead of maintaining complete and independent networks for each of the five modalities.
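The non-maximum suppression step can be illustrated with a greedy per-class variant. This is one common form of NMS, shown as a sketch; the IoU threshold and the example detections are assumptions, and the text does not state which NMS variant is used.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy per-class NMS. Each detection is (box, score, class_id)."""
    kept = []
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        box, _, cls = det
        # Keep the box unless a higher-scoring box of the same class overlaps it.
        if all(k[2] != cls or iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


detections = [
    ((10, 10, 50, 50), 0.9, 0),
    ((12, 12, 52, 52), 0.8, 0),      # overlaps the first box, same class
    ((12, 12, 52, 52), 0.7, 1),      # same place, different class
    ((100, 100, 140, 140), 0.6, 0),  # far away
]
kept = nms(detections)
print([d[1] for d in kept])
```

Because suppression is per class and each modality has its own head, detections from one modality's label space never suppress those of another.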
[0034] This invention also provides a five-modal unified target detection device, comprising: a data acquisition and annotation module for acquiring image data of five modalities and performing unified label conversion; an input unification processing module for converting raw data of different modalities into a unified three-channel input; a modality recognition and scheduling module for generating modality identifiers based on the modality of the input sample and selecting the corresponding processing path; a modality-specific input adaptation module for calling the input adapter corresponding to the current modality; a shared feature extraction module and a shared feature fusion module for extracting cross-modal shared features; a modality-specific detection output module for calling the corresponding detection head to output the detection result; and a training control module for performing round-robin single-batch single-modality sampling, loss calculation, and parameter updating. When the modules of the above device work together, the aforementioned method steps can be implemented.
[0035] The present invention also provides an electronic device, including a processor and a memory, wherein the memory stores a computer program, and when the computer program is executed by the processor, the electronic device performs each step of the aforementioned five-modal unified target detection method.
[0036] Compared with the prior art, the present invention has at least the following beneficial effects:
[0037] 1. This invention implements three-channel input unification processing on five heterogeneous modal images and introduces a modality-specific lightweight adapter at the input end, enabling non-RGB modalities such as infrared, SAR, multispectral, and hyperspectral to be compatible with a unified detection network based on three-channel pre-training, thereby improving model initialization compatibility and engineering feasibility;
[0038] 2. This invention adopts a unified structure of "modality-specific input adapter - shared backbone network - shared feature fusion network - modality-specific detection head", which retains the ability to model modal differences while sharing intermediate visual representations, reducing the number of model parameters, training resource consumption and deployment costs;
[0039] 3. By setting up independent detection heads for different modalities, this invention effectively avoids problems such as inconsistent number of categories in different datasets, category space conflicts, and difficulties in unifying output;
[0040] 4. This invention adopts a single-batch, single-modality, round-robin joint training mechanism, which enables joint training of five heterogeneous modal data under a unified framework without requiring strict spatial registration or temporal alignment of samples from different modalities. This improves the applicability and practical value of the method for heterogeneous multi-dataset scenarios.
Attached Figure Description
[0041] Figure 1 is the overall flowchart of the five-modal unified target detection method;
[0042] Figure 2 is a structural diagram of the five-modal unified target detection network;
[0043] Figure 3 is a schematic diagram of the modality-specific input adapter structure;
[0044] Figure 4 is a schematic diagram of the modality selection and forward propagation process;
[0045] Figure 5 is a flowchart of the round-robin multimodal joint training process.
Detailed Implementation
[0046] The present invention will be further described below with reference to the accompanying drawings and specific embodiments. It should be understood that the following embodiments are for illustrative purposes only and are not intended to limit the scope of protection of the present invention. Equivalent substitutions, modifications, or improvements made by those skilled in the art based on the disclosure of the present invention without departing from the concept of the present invention should all fall within the scope of protection of the present invention.
[0047] Example 1: Overall Implementation:
[0048] This embodiment provides a five-modal unified target detection method, applicable to unified target detection scenarios using visible light images, infrared images, synthetic aperture radar images, multispectral images, and hyperspectral images. As shown in Figure 1, the overall process includes the construction of a five-modal object detection dataset, unified processing of three-channel input, construction of a unified object detection network, modality-specific input adaptation, shared feature extraction and modality-specific detection output, round-robin single-batch single-modal joint training, and modality-specific inference. The key steps of the present invention are further explained below with reference to Figures 2 to 5.
[0049] Example 2: Implementation of Three-Channel Input Unified Processing:
[0050] In this embodiment, to be compatible with the input structure of the unified detection network and the three-channel pre-training weight initialization method, the five modal images are uniformly constructed as a three-channel input format.
[0051] For visible light images, the original RGB three-channel image is directly used as input. For infrared images and synthetic aperture radar images, if the original data is a single-channel image, a three-channel representation can be constructed by duplicating the single channel three times, by duplicating it after grayscale enhancement, by pseudo-color mapping, or by other equivalent methods. For multispectral images, a three-channel representation can be obtained through band selection, band combination, linear projection, feature compression mapping, or a preprocessed three-channel result. For hyperspectral images, three-channel representations can be generated using principal component analysis, spectral band selection, linear mapping, pseudo-color synthesis, or other spectral band compression methods. The three-channel unification process can be expressed as follows:
[0052] $$X_m = T_m(I_m), \quad m \in \{1, 2, 3, 4, 5\}$$
[0053] where $I_m$ denotes the original input data of the $m$-th modality, $T_m(\cdot)$ denotes the three-channel constructor of the corresponding modality, $X_m$ denotes the unified three-channel input image, and $m$ is the modality index. Through the above processing, data from different modalities have a consistent input dimension before entering the unified object detection network.
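Of the hyperspectral options listed in this embodiment, principal component analysis is the least obvious to implement, so a minimal sketch follows. It projects each pixel spectrum onto the first three principal components computed via SVD and rescales each component to [0, 1]; the min-max rescaling, the function name, and the synthetic cube are assumptions for illustration.

```python
import numpy as np


def pca_to_three_channels(cube):
    """Project an (H, W, C) spectral cube onto its first three principal
    components and rescale each component to [0, 1]."""
    h, w, c = cube.shape
    if c < 3:
        raise ValueError("PCA to three channels needs at least three bands")
    pixels = cube.reshape(-1, c).astype(np.float64)
    centered = pixels - pixels.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:3].T
    lo, hi = projected.min(axis=0), projected.max(axis=0)
    # Min-max rescaling is an assumed normalization; guard against flat components.
    scaled = (projected - lo) / np.where(hi > lo, hi - lo, 1.0)
    return scaled.reshape(h, w, 3)


cube = np.random.default_rng(0).random((32, 32, 100))   # synthetic 100-band cube
x = pca_to_three_channels(cube)
print(x.shape, float(x.min()), float(x.max()))
```

In practice the projection matrix would normally be fitted once on training data and reused at inference, so that the same spectrum always maps to the same three values.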
[0054] Example 3: Implementation of a five-modal unified target detection network:
[0055] As shown in Figure 2, this embodiment constructs a unified target detection network consisting of "modality-specific input adapter - shared backbone network - shared feature fusion network - modality-specific detection head".
[0056] The network comprises five modality-specific input adapters, a shared backbone network, a shared feature fusion network, and five modality-specific detection heads. The five modality-specific input adapters correspond to the visible light modality, infrared modality, synthetic aperture radar modality, multispectral modality, and hyperspectral modality, respectively. The five modality-specific detection heads correspond to the class spaces of the datasets for each of the five modalities. The shared backbone network and the shared feature fusion network are used by all five modalities to learn shared edge, texture, structural, and semantic features across modalities. In one embodiment, the shared backbone network and the shared feature fusion network can be implemented using the feature extraction and feature fusion parts of a lightweight single-stage object detection network, preferably using a YOLO11s-based network structure, but are not limited thereto.
[0057] Example 4: Implementation of Modal-Specific Input Adapter
[0058] As shown in Figure 3, for any input sample, based on its modality or an externally provided modality identifier, the corresponding modality-specific input adapter is invoked to perform lightweight feature correction and input distribution adjustment on the three-channel input of the current modality.
[0059] In one embodiment, the modality-specific input adapter employs a residual gated adapter structure, and its output satisfies:
[0060] $$\tilde{X}_m = X_m + \sigma(g_m) \cdot F_m(X_m)$$
[0061] where $X_m$ denotes the three-channel input image of the $m$-th modality, $\tilde{X}_m$ denotes the output features after processing by the adapter, $F_m(X_m)$ denotes the correction features extracted by the lightweight convolutional branch, $g_m$ denotes the learnable gating parameter of the corresponding modality, and $\sigma(\cdot)$ denotes the activation function, preferably the sigmoid function. In one specific embodiment, the lightweight convolutional branch includes a first convolutional layer, a normalization layer, an activation layer, a second convolutional layer, and a normalization layer, wherein the first convolutional layer maps the input channels to the intermediate channels, and the second convolutional layer maps the intermediate channels back to the three-channel output. The gating parameters are initialized to small values so that the input adapter maintains a near-identity mapping in the early stages of training and gradually learns an input correction suited to the corresponding modality as training progresses.
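The branch layout in this embodiment (convolution, normalization, activation, convolution, normalization, followed by a gated residual) can be sketched in NumPy without a deep learning framework. Several details are assumptions made for the sketch: 1x1 convolutions, per-channel standardization as the normalization layer, ReLU as the activation, eight intermediate channels, a scalar gate combined as input plus sigmoid(gate) times branch output, and a strongly negative gate initialization.

```python
import numpy as np


def conv1x1(x, weight):
    """1x1 convolution on a (C_in, H, W) array with a (C_out, C_in) weight."""
    return np.einsum("oc,chw->ohw", weight, x)


def channel_norm(x, eps=1e-5):
    """Per-channel standardization over the spatial dimensions (a stand-in
    for the unspecified normalization layer)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    return (x - mean) / (std + eps)


class ResidualGatedAdapter:
    """x + sigmoid(g) * branch(x), with branch = conv, norm, act, conv, norm."""

    def __init__(self, mid_channels=8, gate_init=-6.0, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, size=(mid_channels, 3))  # 3 -> mid
        self.w2 = rng.normal(0.0, 0.1, size=(3, mid_channels))  # mid -> 3
        self.gate = gate_init                                   # scalar g

    def branch(self, x):
        hidden = np.maximum(channel_norm(conv1x1(x, self.w1)), 0.0)  # ReLU
        return channel_norm(conv1x1(hidden, self.w2))

    def __call__(self, x):
        gate = 1.0 / (1.0 + np.exp(-self.gate))
        return x + gate * self.branch(x)


x = np.random.default_rng(1).random((3, 16, 16))
adapter = ResidualGatedAdapter()
y = adapter(x)
max_change = float(np.abs(y - x).max())
print(y.shape, max_change)
```

With the gate initialized this way the output stays within a small distance of the input, so features fed to a pre-trained backbone are barely perturbed at the start of training.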
[0062] Example 5: Implementation of Shared Feature Extraction and Modality-Specific Detection Output:
[0063] As shown in Figure 4, the features adapted to modality-specific inputs are fed into a shared backbone network and a shared feature fusion network to extract multi-scale visual features shared across modalities. Then, based on the modality identifier, the corresponding modality-specific detection head is invoked, outputting the category prediction result and bounding box regression result for the current modality. This process can be represented as:
[0064] $$Y_m = H_m\big(N\big(B\big(A_m(X_m)\big)\big)\big)$$
[0065] where $A_m(\cdot)$ denotes the input adapter of the $m$-th modality, $B(\cdot)$ denotes the shared backbone network, $N(\cdot)$ denotes the shared feature fusion network, $H_m(\cdot)$ denotes the detection head of the $m$-th modality, and $Y_m$ denotes the detection output for the current modality. Since each modality's detection head configures its classification output dimension according to the number of categories in the corresponding modality dataset, the conflicts that a single unified output would suffer from inconsistent category spaces across modalities are avoided.
[0066] Example 6: Implementation of Round-Robin Single-Batch Single-Modal Joint Training:
[0067] As shown in Figure 5, during the training phase, the training samples for the five modalities are divided into five sample groups. During training, a batch is selected sequentially from different modal sample groups according to a preset rotation order as the current training batch, ensuring that each training batch contains only samples of the same modality. When a modal batch is fed into the network, only the input adapter and detection head corresponding to that modality are used for the current forward propagation; detection heads of other modalities do not participate in the current batch output. In one embodiment, the round-robin sampler can adjust the sampling order and sampling ratio of the five modalities according to a preset sampling mode, and perform repeated sampling on modalities with fewer samples using a rollback oversampling method to ensure that single-modal batches remain valid. For the current training batch, only the loss function of the detection head corresponding to its modality is calculated, and this loss is used to update the network parameters through backpropagation. The total loss function can be expressed as:
[0068] $$\mathcal{L}_m = \lambda_{cls}\,\mathcal{L}_{cls} + \lambda_{box}\,\mathcal{L}_{box} + \lambda_{dfl}\,\mathcal{L}_{dfl}$$
[0069] where $\mathcal{L}_m$ denotes the total loss of the current batch of the $m$-th modality, $\mathcal{L}_{cls}$ denotes the classification loss, $\mathcal{L}_{box}$ denotes the bounding box regression loss, $\mathcal{L}_{dfl}$ denotes the distribution focal loss, and $\lambda_{cls}$, $\lambda_{box}$, and $\lambda_{dfl}$ denote the weight coefficients of the corresponding loss terms. Under this training mechanism, the shared backbone network and the shared feature fusion network receive joint gradient updates from batches of different modalities, while each modality-specific input adapter and each modality-specific detection head are updated only by samples of their own modality.
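The update pattern described here, where shared parts are updated by every batch and modality-specific parts only by their own batches, can be made concrete with a small bookkeeping sketch. The parameter group names and the 20-batch schedule are hypothetical; the code only counts which groups a batch would update and does not train anything.

```python
SHARED = ("backbone", "neck")   # hypothetical names for the shared parts


def trainable_groups(modality):
    """Parameter groups that receive gradients from a batch of this modality."""
    return SHARED + (f"adapter:{modality}", f"head:{modality}")


def count_updates(batch_modalities, modalities):
    """Count how many batches update each parameter group over a run."""
    counts = {group: 0 for group in SHARED}
    for m in modalities:
        counts[f"adapter:{m}"] = 0
        counts[f"head:{m}"] = 0
    for m in batch_modalities:
        for group in trainable_groups(m):
            counts[group] += 1
    return counts


modalities = ("visible", "infrared", "sar", "multispectral", "hyperspectral")
schedule = [modalities[i % 5] for i in range(20)]   # four full rotations
counts = count_updates(schedule, modalities)
print(counts["backbone"], counts["head:sar"], counts["adapter:visible"])
```

With an even rotation the shared groups see five times as many updates as any single adapter or head, which is worth keeping in mind when choosing learning rates for the two kinds of parameters.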
[0070] Example 7: Implementation of Inference:
[0071] During the inference phase, based on the modality of the image to be detected or an externally provided modality identifier, the corresponding input adapter and modality-specific detection head are invoked. The input image is processed through a shared backbone network and a shared feature fusion network, outputting the target category, bounding box, and confidence score for the current modality. Subsequently, non-maximum suppression is performed on the detection results to obtain the final target detection result. Since the inference phase only activates the input adapter and detection head corresponding to the current modality, this method preserves modality specificity while avoiding the additional overhead of maintaining complete and independent detection networks for each of the five modalities.
[0072] The above content is only for illustrating the technical concept of this invention and should not be used to limit the scope of protection of this invention. Any modifications made to the technical solution based on the technical concept proposed in this invention shall fall within the scope of protection of the claims of this invention.
Claims
1. A five-modal unified target detection method, characterized in that it comprises: acquiring and constructing a five-modal target detection dataset comprising a visible light modality, an infrared modality, a synthetic aperture radar modality, a multispectral modality, and a hyperspectral modality, and uniformly labeling and organizing the data of each modality; processing the images of the five modalities using a three-channel input unification method to obtain input images of a unified dimension; constructing a unified five-modal target detection network comprising five modality-specific input adapters, a shared backbone network, a shared feature fusion network, and five modality-specific detection heads; based on the modality to which the input sample belongs or an externally provided modality identifier, invoking the corresponding modality-specific input adapter to perform feature correction and distribution adjustment on the current modality input; feeding the corrected features into the shared backbone network and the shared feature fusion network for feature extraction and fusion; invoking the corresponding modality-specific detection head to output the category prediction result and bounding box regression result for the current modality; during the training phase, dividing the training samples of the five modalities into five sample groups and selecting the current training batch from different modal sample groups in a preset rotation order, so that each training batch contains only samples of the same modality, and calculating only the loss function of the detection head corresponding to that modality to update the network parameters; and during the inference phase, according to the modality to which the image to be detected belongs or an externally provided modality identifier, invoking the corresponding input adapter and modality-specific detection head to output the target detection result.
2. The five-modal unified target detection method according to claim 1, characterized in that the unified processing of the three-channel input includes: for visible light images, directly using the original three-channel image as input; for infrared images and synthetic aperture radar images, constructing a three-channel representation by duplicating the single channel three times, by duplicating it after grayscale enhancement, or by pseudo-color mapping; for multispectral images, constructing a three-channel representation through band selection, band combination, linear projection, feature compression mapping, or a preprocessed three-channel result; for hyperspectral images, generating a three-channel representation through principal component analysis, spectral band selection, linear mapping, pseudo-color synthesis, or spectral band compression.
3. The five-modal unified target detection method according to claim 1, characterized in that the three-channel input unification satisfies the following relationship:
$$X_m = \mathcal{T}_m(I_m)$$
where $I_m$ denotes the original input data of the $m$-th modality, $\mathcal{T}_m$ denotes the three-channel construction function of the corresponding modality, $X_m$ denotes the three-channel input image after standardization, and $m$ is the modality index.
4. The five-modal unified target detection method according to claim 1, characterized in that the modality-specific input adapter adopts a residual gated adapter structure, and its output satisfies the following relationship:
$$\tilde{X}_m = X_m + \sigma(\alpha_m)\,\mathcal{F}_m(X_m)$$
where $X_m$ denotes the three-channel input image of the $m$-th modality, $\tilde{X}_m$ denotes the output features after processing by the adapter, $\mathcal{F}_m(X_m)$ denotes the corrected features extracted by the lightweight convolutional branch, $\alpha_m$ denotes the learnable gating parameter of the corresponding modality, and $\sigma(\cdot)$ denotes the activation function.
5. The five-modal unified target detection method according to claim 4, characterized in that the lightweight convolutional branch comprises a first convolutional layer, a first normalization layer, an activation layer, a second convolutional layer, and a second normalization layer; the first convolutional layer maps the input channels to intermediate channels, and the second convolutional layer maps the intermediate channels back to a three-channel output; and the learnable gating parameter is initialized to a preset small value, so that the modality-specific input adapter remains close to an identity mapping in the early stages of training.
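The following NumPy sketch illustrates how the adapter of claims 4 and 5 could behave at initialization. Several details are assumptions that the claims do not fix: 1×1 convolutions, per-channel standardization as the normalization layer, ReLU as the activation layer, `tanh` as the activation applied to the gating parameter, random untrained weights, a channel-first `(3, H, W)` layout, and the class name `ResidualGatedAdapter`.

```python
import numpy as np


class ResidualGatedAdapter:
    """Sketch of x + tanh(alpha) * branch(x) for one modality.

    Assumptions (not specified in the claims): 1x1 kernels, ReLU,
    tanh gate, normalization without learnable scale or shift.
    """

    def __init__(self, mid_channels=8, alpha_init=1e-3, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.1, size=(mid_channels, 3))  # 3 -> mid
        self.w2 = rng.normal(scale=0.1, size=(3, mid_channels))  # mid -> 3
        self.alpha = alpha_init  # gating parameter, small at initialization

    @staticmethod
    def _normalize(x, eps=1e-5):
        # Standardize each channel over its spatial positions.
        mean = x.mean(axis=(1, 2), keepdims=True)
        std = x.std(axis=(1, 2), keepdims=True)
        return (x - mean) / (std + eps)

    def branch(self, x):
        h = np.einsum("oc,chw->ohw", self.w1, x)  # first convolution (1x1)
        h = self._normalize(h)                     # first normalization
        h = np.maximum(h, 0.0)                     # activation
        h = np.einsum("oc,chw->ohw", self.w2, h)  # second convolution (1x1)
        return self._normalize(h)                  # second normalization

    def __call__(self, x):
        # Residual path plus gated correction.
        return x + np.tanh(self.alpha) * self.branch(x)
```

With a small `alpha_init`, the gate is close to zero and the output stays close to the input, so the shared backbone initially sees nearly unmodified images. Under the `tanh` assumption the gate can later grow in either direction as training adjusts it.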
6. The five-modal unified target detection method according to claim 1, characterized in that, after the input is processed by the modality-specific input adapter, the features are processed by the shared backbone network and the shared feature fusion network, and the detection result is output by the corresponding modality-specific detection head, the process satisfying the following relationship:
$$Y_m = \mathcal{H}_m\big(\mathcal{N}(\mathcal{B}(\mathcal{A}_m(X_m)))\big)$$
where $X_m$ denotes the three-channel input image of the $m$-th modality, $\mathcal{A}_m$ denotes the input adapter of the $m$-th modality, $\mathcal{B}$ denotes the shared backbone network, $\mathcal{N}$ denotes the shared feature fusion network, $\mathcal{H}_m$ denotes the detection head corresponding to the $m$-th modality, and $Y_m$ denotes the detection output under the current modality.
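The routing described in claim 6 (modality-specific adapter, then shared backbone and shared fusion network, then modality-specific head) can be sketched in plain Python. The class name `UnifiedDetector`, the modality strings, the use of `neck` for the shared feature fusion network, and the toy stages that only record the path taken are all illustrative assumptions.

```python
from typing import Any, Callable, Dict

Stage = Callable[[Any], Any]


class UnifiedDetector:
    """Routes an input through adapter[m] -> backbone -> neck -> head[m]."""

    def __init__(self, adapters: Dict[str, Stage], backbone: Stage,
                 neck: Stage, heads: Dict[str, Stage]) -> None:
        if adapters.keys() != heads.keys():
            raise ValueError("each modality needs both an adapter and a head")
        self.adapters = adapters
        self.backbone = backbone  # shared across all modalities
        self.neck = neck          # shared feature fusion network
        self.heads = heads

    def __call__(self, x: Any, modality: str) -> Any:
        if modality not in self.adapters:
            raise KeyError(f"unknown modality: {modality!r}")
        adapted = self.adapters[modality](x)
        fused = self.neck(self.backbone(adapted))
        return self.heads[modality](fused)


# Toy stages that append their own name, to make the routing visible.
MODALITIES = ("visible", "infrared", "sar", "multispectral", "hyperspectral")


def tag(name: str) -> Stage:
    return lambda trace: trace + [name]


detector = UnifiedDetector(
    adapters={m: tag(f"adapter:{m}") for m in MODALITIES},
    backbone=tag("backbone"),
    neck=tag("neck"),
    heads={m: tag(f"head:{m}") for m in MODALITIES},
)
print(detector(["image"], "sar"))
# ['image', 'adapter:sar', 'backbone', 'neck', 'head:sar']
```

Only the first and last stages depend on the modality identifier, so supporting five modalities costs five small adapters and five heads rather than five complete networks.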
7. The five-modal unified target detection method according to claim 1, characterized in that, in the round-robin joint training, a round-robin sampler adjusts the sampling order and sampling ratio of the five modality sample groups according to a preset sampling mode; and when the samples of a modality are insufficient to form a complete batch, samples of that modality are drawn repeatedly by a rollback oversampling method, so that each training batch contains samples of only one modality.
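A minimal sketch of such a sampler follows, in plain Python. It interprets "rollback oversampling" as wrapping around to the start of a modality's sample list when the end is reached; that interpretation, the function name `round_robin_batches`, and its parameters are assumptions. The sampling ratio is expressed by repeating a modality in `order`.

```python
from itertools import cycle
from typing import Dict, Iterator, List, Sequence, Tuple


def round_robin_batches(
    groups: Dict[str, Sequence[str]],
    batch_size: int,
    order: Sequence[str],
    num_batches: int,
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (modality, batch) pairs; each batch holds one modality only.

    `order` is the preset rotation; listing a modality more than once
    raises its sampling ratio. Assumed interpretation of rollback
    oversampling: indices past the end of a group wrap back to the start.
    """
    cursors = {m: 0 for m in groups}
    schedule = cycle(order)
    for _ in range(num_batches):
        modality = next(schedule)
        samples = groups[modality]
        if not samples:
            raise ValueError(f"no samples for modality {modality!r}")
        start = cursors[modality]
        # Wrap around so a short group still fills a complete batch.
        batch = [samples[(start + i) % len(samples)] for i in range(batch_size)]
        cursors[modality] = (start + batch_size) % len(samples)
        yield modality, batch


groups = {"visible": ["v0", "v1", "v2", "v3", "v4"], "sar": ["s0", "s1", "s2"]}
for modality, batch in round_robin_batches(groups, 4, ["visible", "sar"], 4):
    print(modality, batch)
# visible ['v0', 'v1', 'v2', 'v3']
# sar ['s0', 's1', 's2', 's0']
# visible ['v4', 'v0', 'v1', 'v2']
# sar ['s1', 's2', 's0', 's1']
```

The `sar` group has only three samples, fewer than the batch size of four, so its batches repeat samples instead of borrowing from another modality.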
8. The five-modal unified target detection method according to claim 1, characterized in that, for the current training batch, only the loss function of the detection head corresponding to its modality is calculated, and the total loss function satisfies the following relationship:
$$L_m = \lambda_{cls} L_{cls} + \lambda_{box} L_{box} + \lambda_{dfl} L_{dfl}$$
where $L_m$ denotes the total loss of the current batch in the $m$-th modality, $L_{cls}$ denotes the classification loss, $L_{box}$ denotes the bounding box regression loss, $L_{dfl}$ denotes the distribution focal loss, and $\lambda_{cls}$, $\lambda_{box}$, and $\lambda_{dfl}$ denote the weight coefficients of the corresponding loss terms; the shared backbone network and the shared feature fusion network receive joint gradient updates from batches of different modalities, while each modality-specific input adapter and each modality-specific detection head is updated only by samples of its own modality.
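The weighted sum in claim 8, and the rule for which parts of the network a single-modality batch updates, can be sketched in plain Python. The function names, the parameter-group labels, and the numeric values in the usage lines are illustrative assumptions; the claims do not state weight values.

```python
from typing import Set


def modality_loss(l_cls: float, l_box: float, l_dfl: float,
                  w_cls: float, w_box: float, w_dfl: float) -> float:
    """Total loss of one single-modality batch: weighted sum of three terms."""
    return w_cls * l_cls + w_box * l_box + w_dfl * l_dfl


def updated_groups(modality: str) -> Set[str]:
    """Parameter groups that receive gradients from a batch of `modality`.

    The shared backbone and fusion network ("neck", an assumed label) are
    always updated; only the adapter and head of the batch's own modality
    join them.
    """
    return {"backbone", "neck", f"adapter:{modality}", f"head:{modality}"}


# Illustrative values only.
print(modality_loss(1.0, 2.0, 4.0, w_cls=0.5, w_box=1.0, w_dfl=0.25))  # 3.5
print(sorted(updated_groups("infrared")))
# ['adapter:infrared', 'backbone', 'head:infrared', 'neck']
```

Because a batch contains one modality only, the heads of the other four modalities contribute no loss terms and their parameters, like those of the other four adapters, are left untouched by that step.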