Multi-modal target re-identification system and method for missing modal scene
By explicitly encoding modality loss perception and semantic anchoring mechanisms, the feature learning failure problem in the training phase of multimodal target re-identification is solved, and the recognition performance of the model in modality loss scenarios is improved, making it suitable for applications such as intelligent video surveillance and public safety.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HANGZHOU DIANZI UNIV
- Filing Date
- 2026-04-03
- Publication Date
- 2026-06-26
AI Technical Summary
Existing multimodal target re-identification methods rely on full-modal data during the training phase, which leads to feature learning failure and cross-modal alignment relationship breakdown in actual modality missing scenarios, resulting in decreased recognition performance and difficulty in meeting the needs of practical applications.
By explicitly encoding modality loss awareness cues, the network is guided to adaptively adjust feature learning. A stable identity semantic reference space is constructed through a semantic anchoring mechanism to alleviate the feature distribution shift caused by modality loss. An attention mechanism is used to aggregate complementary semantic information and construct a joint feature representation.
It improves the robustness and generalization ability of the multimodal target re-identification model in modality-deficient scenarios, achieves stable feature representation and identity discrimination, and is applicable to practical engineering scenarios such as intelligent video surveillance and public safety.
Figure CN122289787A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to a multimodal target re-identification system and method for modal missing scenarios, applicable to application scenarios requiring cross-scenario target matching and recognition such as intelligent video surveillance, public safety, traffic management, and smart cities, and belongs to the field of computer vision and pattern recognition technology.
Background Technology
[0002] Object Re-Identification (ReID) is one of the core technologies in the field of computer vision. Its core objective is to accurately match and identify the same target object in a sequence of images / videos acquired at different times, from different perspectives, and with non-overlapping camera devices. It is a key link in realizing intelligent visual analysis.
[0003] With the development of sensor imaging technology, target re-identification technology has rapidly evolved from a single visible light modality to a multimodal approach. Multimodal target re-identification integrates data sources from different imaging modalities such as visible light, near-infrared, and thermal infrared. By leveraging the complementarity of each modality in terms of color, texture, contour, and thermal radiation characteristics, it effectively solves the problem of drastic performance degradation of single-modality recognition in harsh environments such as low illumination, strong backlight, occlusion, and complex backgrounds. This significantly improves the robustness and discriminative ability of target recognition, making it the mainstream research and application direction in the current field of target re-identification.
[0004] However, existing multimodal target re-identification methods all rely on the ideal assumption that all modal data is complete and synchronized during the training phase. These methods learn multimodal features and perform cross-modal alignment using full-modal samples to construct a unified discriminative feature space. However, in practical engineering applications, various factors such as sensor hardware failures, interference from the data acquisition environment, privacy protection restrictions, and storage / transmission cost constraints can easily lead to missing modal data during the training phase. This causes the feature learning process of existing methods to fail, cross-modal alignment relationships to break down, ultimately resulting in unstable feature representation, poor model generalization ability, and a significant decrease in recognition performance in real-world scenarios with missing modalities, making it difficult to meet the actual needs of engineering applications.
[0005] Research on multimodal target re-identification in modality-deficient scenarios is still in its early stages. Existing improvement methods are mostly simple adaptations for modality-deficient scenarios during the testing phase, failing to fundamentally solve the feature learning and alignment problems caused by modality-deficient scenarios during the training phase. Furthermore, they lack explicit modeling of modality-deficient states, making it difficult for models to adapt to different modality-deficient configurations and maintain stable recognition performance in complex modality-deficient scenarios.
Summary of the Invention
[0006] This invention provides a multimodal target re-identification system and method for modality-deficient scenarios. It solves the technical problems of existing multimodal target re-identification methods, which rely on full-modal data during the training phase and suffer from unstable feature representation, cross-modal alignment failure, severe feature distribution shift, and weak identity discrimination ability in actual modality-deficient scenarios. It breaks through the dependence of existing methods on full-modal data during the training phase, and achieves stable and efficient target re-identification in modality-deficient scenarios during both the training and testing phases, thereby improving the robustness and generalization ability of the model in real-world scenarios.
[0007] This invention guides the network to adaptively adjust the feature learning process according to different modality missing scenarios by explicitly encoding missing perception cues during model training. Furthermore, it alleviates the feature distribution shift problem caused by modality missing by performing feature aggregation and semantic anchor-based calibration on incomplete modality samples. This improves the robustness and generalization ability of the multimodal target re-identification model under the condition of modality missing during the training phase, thus meeting the application requirements in real-world scenarios.
[0008] A multimodal target re-identification system for modality-deficient scenarios includes five modules that work in sequence: a multimodal data preparation and encoding module, a missing-aware feature extraction module, a conditional feature aggregation module, a semantic anchor feature alignment module, and a joint optimization and identity discrimination learning module.
The multimodal data preparation and encoding module is used to acquire and preprocess multimodal target image data in the visible light (RGB), near-infrared, and thermal infrared modalities, construct a modality indicator vector for each sample to identify the missing state of each modality, and divide the modality set of the sample into a visible modality set and a missing modality set.
The missing-aware feature extraction module is used to generate learnable missing-aware cues based on the modality missing state of the sample, inject these cues into the multimodal feature extraction network, modulate the global feature representation of the backbone network, and output a missing-aware multimodal feature representation.
The conditional feature aggregation module is used to model the complementary semantic information of the visible modalities for incomplete modality samples through an attention-based cross-modality aggregation mechanism, and construct an initial joint feature representation containing the visible modalities and the missing modalities.
The semantic anchor feature alignment module is used to introduce a semantic anchor mechanism based on the category prototype memory to construct a stable identity semantic reference space. For incomplete modality samples, feature alignment constraints are applied only in the feature subspace corresponding to their observable modalities, guiding the initial joint features to move closer to the semantic anchor of the corresponding identity.
The joint optimization and identity discrimination learning module is used to perform joint supervised learning on training samples based on the calibrated features, optimize network parameters through backpropagation, and complete target re-identification through the optimized network.
[0009] The conditional feature aggregation module and the semantic anchor feature alignment module work together to form a staged cross-modal feature calibration paradigm, implemented as follows. Based on the modality indicator vector of a sample, $m_i = [m_i^R, m_i^N, m_i^T]$, where $m_i^k \in \{0,1\}$ represents the availability of the corresponding modality, the modality set is divided into the visible modality set $\mathcal{V}_i = \{k \mid m_i^k = 1\}$ and the missing modality set $\mathcal{M}_i = \{k \mid m_i^k = 0\}$. With the learnable query vector $q^k$ corresponding to each missing modality $k \in \mathcal{M}_i$ as the condition, an attention-based conditional feature aggregation function $\mathcal{A}(\cdot)$ guides the visible modal features $\{f_i^j\}_{j \in \mathcal{V}_i}$ to perform cross-modal aggregation, giving $\hat{f}_i^k = \mathcal{A}(q^k, \{f_i^j\}_{j \in \mathcal{V}_i})$, and the initial joint feature representation $F_i = \{f_i^j\}_{j \in \mathcal{V}_i} \cup \{\hat{f}_i^k\}_{k \in \mathcal{M}_i}$ is constructed. The prototype memory maintains the category-level semantic anchor vectors $a_c^k$ of each identity $c$ in the multimodal space. Feature alignment constraints are applied only within the feature subspace corresponding to the observable modalities, through a feature alignment loss $\mathcal{L}_{align}$ computed between $f_i^k$ and the anchor $a_{y_i}^k$ for $k \in \mathcal{V}_i$. Through a mapping function $\psi(\cdot)$, the sample features after staged cross-modal calibration are fused and a discriminative representation $z_i = \psi(F_i)$ is output.
[0010] A recognition method for a multimodal target re-identification system in modality-deficient scenarios includes the following steps:
S1: Collect multimodal target image data in visible light, near infrared and thermal infrared and preprocess it. Construct a modality indicator vector for each sample to label the availability of each modality. At the same time, divide the modality set of the sample into a visible modality set and a missing modality set.
S2: Generate learnable missing-aware cues based on the modality missing state, inject the cues, which contain modality-level and sample-level missing information, into the backbone network, and modulate the global feature representation to achieve adaptive feature extraction.
S3: For samples with incomplete modalities, based on their visible modal features, an attention mechanism is used to aggregate complementary semantic information from each modality to construct a stable multimodal joint feature representation.
S4: Introduce a semantic anchor mechanism to construct a stable identity semantic reference space. For incomplete modality samples, apply feature calibration constraints only in the feature subspace corresponding to their observable modalities to guide the joint features to align with the semantic anchor of the corresponding identity.
S5: Identity discrimination learning is carried out under joint supervision of identity classification loss and triplet loss. Cross-modal consistency constraints are applied to samples with complete modalities, and feature alignment loss is applied to samples with incomplete modalities. The joint loss is calculated to optimize the network parameters, and the target re-identification is completed based on the optimized re-identification network.
[0011] Furthermore, S1 specifically includes:
S11: Multimodal target image data acquisition and construction: acquire visible light, near-infrared, and thermal infrared multimodal target images in multiple scenes. For the $i$-th target sample, denote its modal images as $\{x_i^R, x_i^N, x_i^T\}$, where $R$, $N$, and $T$ represent visible light, near-infrared, and thermal infrared respectively, and label the sample with an identity category label $y_i$. Build a training dataset $\mathcal{D} = \{(x_i^R, x_i^N, x_i^T, y_i)\}_{i=1}^{N}$, where $N$ is the total number of training samples;
S12: Multimodal target image preprocessing: the multimodal target image is subjected to target region cropping, image size normalization, random cropping, random horizontal flipping, and random erasure operations in sequence;
S13: Modality missing indicator vector construction: for the $i$-th target sample, construct the modality indicator vector $m_i = [m_i^R, m_i^N, m_i^T]$, where $m_i^k = 1$ when the modal image $x_i^k$ exists and $m_i^k = 0$ when it is missing;
S14: Modality set partitioning: based on $m_i$, the modality set of the $i$-th sample is divided into the visible modality set $\mathcal{V}_i = \{k \mid m_i^k = 1\}$ and the missing modality set $\mathcal{M}_i = \{k \mid m_i^k = 0\}$.
[0012] Furthermore, S2 specifically includes:
S21: Constructing a learnable cue vector for the modality-missing state: for the $k$-th modality, construct a state embedding matrix $E^k \in \mathbb{R}^{2 \times d_p}$, where $d_p$ is the state representation vector dimension; according to $m_i^k$, a table lookup yields the modal state cue vector of the $i$-th sample under the $k$-th modality, $p_i^k = E^k[m_i^k]$;
S22: Generating a global representation of the missing state at the sample level: the modal state cue vectors of the $i$-th sample are summed and then transformed by a linear transformation matrix $W_g$ to generate the global missing state representation vector $g_i = W_g \sum_{k} p_i^k$;
S23: Fusing modality-level and sample-level missing state information: add the modality-level state representation to the global missing state representation to obtain the missing-aware representation vector of the $k$-th modality, $s_i^k = p_i^k + g_i$;
S24: Constructing the backbone network feature representation: input the preprocessed multimodal target image into the Vision Transformer-based backbone network, and take the classification token feature vector $c_l^k$ of the $l$-th layer as the global feature representation of the $k$-th modality branch at the $l$-th layer;
S25: Residual modulation of global features: $c_l^k$ and $s_i^k$ are concatenated and input into a feature modulation unit $\Phi$ composed of a two-layer multilayer perceptron, which generates the residual modulation term and completes global feature modulation: $\tilde{c}_l^k = c_l^k + \Phi([c_l^k; s_i^k])$, where $\Phi(u) = W_{up}\,\sigma(W_{down}\,u)$, $W_{down}$ is a dimension-reducing linear mapping matrix, $W_{up}$ is a dimension-raising linear mapping matrix, and $\sigma$ is a non-linear activation function.
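[0013] Furthermore, S3 specifically includes:
S31: Feature set partitioning: obtain the multimodal feature set $\{f_i^k\}$ output from S2 and, according to $\mathcal{V}_i$ and $\mathcal{M}_i$, divide it into the visible modal feature set $\{f_i^j\}_{j \in \mathcal{V}_i}$ and the set of unobserved features of the missing modalities;
S32: Single-modality missing aggregation: when $|\mathcal{M}_i| = 1$, the learnable query vector $q^k$ for the missing modality is initialized and the visible features are aggregated through a multi-head cross-attention mechanism, $\hat{f}_i^k = \mathrm{MHCA}(q^k, \{f_i^j\}_{j \in \mathcal{V}_i})$;
S33: Bimodal missing aggregation: when $|\mathcal{M}_i| = 2$, the query vectors of the two missing modalities are initialized and the aggregated features are obtained through a controlled two-stage aggregation, in which $\mathrm{MHSA}$ denotes a multi-head self-attention mechanism;
S34: Constructing a joint feature representation: merge the visible modality features with the aggregated missing modality features to obtain the joint feature representation $F_i = \{f_i^j\}_{j \in \mathcal{V}_i} \cup \{\hat{f}_i^k\}_{k \in \mathcal{M}_i}$.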
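The single-modality case of S32 can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the patented implementation: it uses one attention head with no learned projection matrices, plain Python lists instead of tensors, and hypothetical names (`cross_attend`, `joint_representation`, `queries`). It does not reproduce the controlled two-stage aggregation of S33; with a single visible modality the attention here simply returns that modality's feature. At least one visible modality is assumed.

```python
import math

MODALITIES = ("R", "N", "T")  # visible light, near-infrared, thermal infrared


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, visible_feats):
    """Single-head scaled dot-product attention of one query over visible features.

    Assumption: no learned key/value projections, unlike a full multi-head layer.
    """
    d = len(query)
    scores = [sum(q * f for q, f in zip(query, feat)) / math.sqrt(d)
              for feat in visible_feats]
    weights = softmax(scores)
    return [sum(w * feat[j] for w, feat in zip(weights, visible_feats))
            for j in range(d)]


def joint_representation(features, m, queries):
    """Keep visible features as-is and fill each missing modality by attention."""
    visible = [k for k, flag in zip(MODALITIES, m) if flag == 1]
    missing = [k for k, flag in zip(MODALITIES, m) if flag == 0]
    visible_feats = [features[k] for k in visible]
    joint = {k: list(features[k]) for k in visible}
    for k in missing:
        joint[k] = cross_attend(queries[k], visible_feats)
    return joint


# Near-infrared is missing; a zero query gives uniform attention weights.
features = {"R": [1.0, 0.0], "N": None, "T": [0.0, 1.0]}
queries = {"N": [0.0, 0.0]}
joint = joint_representation(features, [1, 0, 1], queries)
```

In a trained model the query vectors would be learnable parameters, so the attention weights would differ per missing modality rather than being uniform as in this toy input.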
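[0014] Furthermore, S4 specifically includes:
S41: Constructing a category prototype memory: maintain a prototype memory stored by identity category; for each identity category $c$, its semantic anchor prototype is $P_c = \{a_c^R, a_c^N, a_c^T\}$, where $a_c^k$ is the prototype vector for identity $c$ in modality $k$;
S42: Anchor update from modality-complete samples: when $|\mathcal{M}_i| = 0$, the prototype vector is updated using an exponential moving average, $a_{y_i}^k \leftarrow \mu\, a_{y_i}^k + (1-\mu)\, f_i^k$, where $\mu$ is the momentum coefficient;
S43: Anchor update from modality-incomplete samples: when $|\mathcal{M}_i| > 0$, the prototype vector is updated only in the visible modality subspace, that is, only for $k \in \mathcal{V}_i$;
S44: Constructing feature alignment constraints: for modality-incomplete samples, construct a feature alignment loss $\mathcal{L}_{align}$ in the visible modality subspace from the normalized dot product $\langle f_i^k, a_{y_i}^k \rangle / (\|f_i^k\|_2\, \|a_{y_i}^k\|_2)$ for $k \in \mathcal{V}_i$, where $\langle \cdot, \cdot \rangle$ is the vector dot product and $\|\cdot\|_2$ is the vector 2-norm.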
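The prototype memory of S41 to S44 can be sketched as follows. This is an illustrative sketch only: the class and function names are hypothetical, the exact loss expression is not recoverable from the text, and the form used here (one minus cosine similarity, averaged over the visible modalities) is an assumption consistent with the stated use of the dot product and the 2-norm. Initialising an anchor with the first feature seen for an identity is likewise an assumption.

```python
import math


class PrototypeMemory:
    """Per-identity, per-modality semantic anchors updated by exponential moving average."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum  # hypothetical default value
        self.anchors = {}         # (identity, modality) -> anchor vector

    def update(self, identity, feats, visible):
        """Update anchors only for the modalities that were actually observed."""
        for k in visible:
            key = (identity, k)
            old = self.anchors.get(key)
            if old is None:
                # Assumption: the first observed feature initialises the anchor.
                self.anchors[key] = list(feats[k])
            else:
                mu = self.momentum
                self.anchors[key] = [mu * a + (1.0 - mu) * f
                                     for a, f in zip(old, feats[k])]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def alignment_loss(memory, identity, feats, visible):
    """Assumed form: mean of (1 - cosine similarity) over visible modalities."""
    terms = [1.0 - cosine(feats[k], memory.anchors[(identity, k)])
             for k in visible if (identity, k) in memory.anchors]
    return sum(terms) / len(terms) if terms else 0.0


memory = PrototypeMemory(momentum=0.9)
# A modality-complete sample (two modalities shown for brevity) sets the anchors.
memory.update(7, {"R": [1.0, 0.0], "T": [0.0, 1.0]}, ["R", "T"])
# An incomplete sample with only R visible leaves the T anchor untouched.
memory.update(7, {"R": [0.0, 1.0]}, ["R"])
loss = alignment_loss(memory, 7, {"T": [0.0, 2.0]}, ["T"])
```

Restricting both the update and the loss to the visible modalities keeps aggregated (inferred) features from contaminating the anchors, which matches the stated intent of avoiding unreliable inferences from missing modalities.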
[0015] Furthermore, S5 specifically includes:
S51: Constructing the identity classification loss: input the features output from S4 into the classification network to obtain the identity discrimination feature vector $z_i$. Cross-entropy loss is used to construct the identity classification loss $\mathcal{L}_{id} = -\log \frac{\exp(w_{y_i}^{\top} z_i)}{\sum_{c=1}^{C} \exp(w_c^{\top} z_i)}$, where $C$ is the total number of identity categories and $w_c$ is the weight vector of the $c$-th identity;
S52: Constructing the metric learning loss: introduce the triplet loss as a metric learning constraint. For a triplet $(z_a, z_p, z_n)$ of anchor, positive, and negative features, the loss is defined as $\mathcal{L}_{tri} = [\,d(z_a, z_p) - d(z_a, z_n) + \alpha\,]_+$, where $d(\cdot,\cdot)$ is the feature distance, $\alpha$ is the margin parameter, and $[\cdot]_+$ takes the non-negative part;
S53: Constructing the cross-modal consistency loss: when $|\mathcal{M}_i| = 0$, construct a cross-modal consistency loss $\mathcal{L}_{cons}$ on the multimodal features $\{f_i^k\}$;
S54: Introducing the feature calibration loss: when $|\mathcal{M}_i| > 0$, the feature calibration loss $\mathcal{L}_{align}$ of S44 is introduced;
S55: Constructing and optimizing the total loss: construct the total loss function $\mathcal{L} = \mathcal{L}_{id} + \mathcal{L}_{tri} + \lambda_1 \mathcal{L}_{cons} + \lambda_2 \mathcal{L}_{align}$, where $\lambda_1$ and $\lambda_2$ are the weight coefficients for the cross-modal consistency loss and the feature calibration loss, respectively. The network is trained end-to-end using gradient descent until convergence.
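How the loss terms of S51 to S55 combine for a single sample can be sketched as below. The function names, the margin of 0.3, and the default weights are hypothetical; the consistency and alignment terms are passed in as precomputed numbers because their exact expressions are not given in the text. The sketch assumes the consistency term applies only to modality-complete samples and the calibration term only to incomplete ones, as the steps describe.

```python
import math


def identity_loss(logits, label):
    """Softmax cross-entropy for one sample, computed in a numerically stable way."""
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_sum - logits[label]


def triplet_loss(d_ap, d_an, margin=0.3):
    """Hinge on anchor-positive vs. anchor-negative distance; margin is a hypothetical value."""
    return max(d_ap - d_an + margin, 0.0)


def total_loss(l_id, l_tri, l_cons, l_align, is_complete,
               lam_cons=1.0, lam_align=1.0):
    """Identity and triplet terms always apply; the third term depends on completeness."""
    loss = l_id + l_tri
    if is_complete:
        loss += lam_cons * l_cons    # cross-modal consistency, complete samples only
    else:
        loss += lam_align * l_align  # feature calibration, incomplete samples only
    return loss


l_id = identity_loss([0.0, 0.0], label=0)   # two equally likely identities
l_tri = triplet_loss(d_ap=1.0, d_an=0.5)    # negative is closer than positive
complete = total_loss(1.0, 0.5, 0.2, 0.4, is_complete=True, lam_cons=0.5, lam_align=2.0)
incomplete = total_loss(1.0, 0.5, 0.2, 0.4, is_complete=False, lam_cons=0.5, lam_align=2.0)
```

In a batch setting the same gating would be applied per sample before averaging, so that a mixed batch of complete and incomplete samples contributes both auxiliary terms.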
[0016] Compared with the prior art, the beneficial effects of the present invention are as follows:
1. Breaking through the limitations of full-modal training and improving applicability to real-world scenarios: This invention is the first to address the core problem of modality missing during the training phase by explicitly modeling the state of modality missing. This makes the model applicable to modality missing scenarios in both the training and testing phases, breaking through the dependence of existing methods on full-modal training data and significantly improving the applicability of multimodal target re-identification technology in real-world engineering scenarios such as intelligent monitoring and public safety.
2. Giving the model missing-modality awareness to achieve adaptive feature extraction: Through a missing-aware cue learning mechanism, discrete modality missing states are transformed into continuous learnable cue information, and residual modulation is injected into the backbone network, enabling the model to perceive the modality missing configuration and adaptively adjust the feature learning strategy. This effectively solves the problem that existing methods cannot adapt to different missing patterns and ensures the stability of feature representation.
3. Constructing an efficient cross-modal aggregation strategy to enhance the robustness of joint features: For the different scenarios of one or two missing modalities, differentiated attention aggregation strategies are designed to fully explore the complementary semantic information of the visible modalities, construct a stable multimodal joint feature representation, provide a reliable feature foundation for subsequent feature calibration, and solve the problem of insufficient feature representation capability for incomplete modality samples.
4. Mitigating feature distribution shift and improving feature discriminativeness and consistency: A semantic anchoring mechanism based on a category prototype memory is introduced, which applies feature alignment constraints only within the observable modality subspace. This effectively mitigates the feature distribution shift problem caused by modality loss and avoids unreliable inferences introduced by missing modalities, significantly improving the identity discriminativeness and cross-modal consistency of features.
5. Ensuring overall model performance through a unified joint optimization strategy: A joint loss function with multiple loss terms is constructed to achieve unified identity discrimination learning for modality-complete and modality-incomplete samples. Modal alignment of complete samples is enhanced through cross-modal consistency loss, and the feature distribution of incomplete samples is constrained through feature calibration loss. This enables the model to maintain stable and excellent re-identification performance in various modality missing scenarios.
[0017] The multimodal target re-identification system and method of the present invention significantly outperform existing methods in terms of feature expression stability, identity discrimination accuracy, and model generalization ability in modality-deficient scenarios. It can be directly deployed in practical application scenarios such as intelligent video surveillance, public safety investigation, and traffic target tracking, and has important engineering application value and promotion prospects.
Attached Figure Description
[0018] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0019] Figure 1 is a flowchart of the multimodal target re-identification system and method for modal missing scenarios according to the present invention.
[0020] Figure 2 is a detailed structural diagram of the feature extraction and cross-modal calibration module of the multimodal target re-identification system and method for modal missing scenarios of the present invention.
Detailed Implementation
[0021] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0022] Example 1: This embodiment provides a multimodal target re-identification system for modality missing scenarios, including the following modules: a multimodal data preparation and encoding module, a missing-aware feature extraction module, a conditional feature aggregation module, a semantic anchor feature alignment module, and a joint optimization and identity discrimination learning module. The overall module structure is shown in Figure 1.
[0023] The multimodal data preparation and encoding module is used to acquire multimodal target image data and perform unified encoding processing. This module first acquires target image data from different imaging modalities, including visible light, near-infrared, and thermal infrared, and performs preprocessing operations. Then, for each sample, a modality indicator vector is constructed to identify whether each modality is available. Based on this, the modality set of the sample is divided into a visible modality set and a missing modality set, providing the basic input for subsequent processing.
[0024] The missing modality-aware feature extraction module introduces modality missingness awareness during the feature extraction stage, enabling the model to adapt to different modality missing configurations. This module generates learnable missing modality-aware prompts based on the modality indication vectors output by the multimodal data preparation and encoding module. Subsequently, during feature extraction, these prompts are injected into the multimodal feature extraction network to modulate the global feature representation in the backbone network, thereby improving the stability of the feature extraction process under missing modality conditions.
[0025] The conditional feature aggregation module is used to construct a stable multimodal joint feature representation under the condition of incomplete modalities. When a current sample has missing modalities, this module guides the visible modal features to perform semantic aggregation through an attention mechanism, using the learnable query vector corresponding to the missing modality as a condition, to obtain the aggregated feature representation of the missing modality. Subsequently, the visible modal features are merged with the aggregated missing modal features to construct an initial multimodal joint feature representation, providing a stable and observable feature foundation for subsequent feature calibration.
[0026] The semantic anchor feature alignment module is designed to mitigate the distribution shift of multimodal features under modality missing conditions. This module introduces a semantic anchor mechanism based on a category prototype memory to construct a stable identity semantic reference space. For samples with incomplete modalities, feature alignment constraints are applied only within the feature subspace corresponding to their observable modalities, guiding their initial joint features to gradually move closer to the semantic anchor of the corresponding identity, thereby mitigating the feature distribution shift caused by modality missingness.
[0027] The joint optimization and identity discrimination learning module is used to achieve unified identity discrimination feature learning. This module introduces identity classification loss and metric learning loss to all samples for identity discrimination learning; simultaneously, it applies cross-modal consistency constraints to samples with complete modalities and feature calibration constraints to samples with incomplete modalities. By jointly optimizing and updating the network parameters through multiple losses, a multimodal target re-identification network that remains robust even under modality-missing conditions during the training phase is finally obtained.
[0028] Example 2: Referring to Figure 2, this embodiment provides a multimodal target re-identification method for modality-deficient scenarios, including the following steps: S1: Collect multimodal target image data in visible light, near infrared and thermal infrared and preprocess it. Construct a modality indicator vector for each sample to label the availability of each modality. At the same time, divide the modality set of the sample into a visible modality set and a missing modality set. First, multimodal target image data is acquired and constructed. The multimodal data includes at least the visible light, near-infrared, and thermal infrared modalities. The acquired raw image data undergoes uniform preprocessing operations, including image size normalization, pixel value standardization, and data augmentation processes such as random cropping and flipping, to ensure consistency in scale and distribution across the input data of different modalities. Then, for each target sample, a corresponding modality indicator vector is constructed based on the availability of images for each modality. This vector explicitly labels the availability state of the sample in each modality and, based on it, the modality set of the sample is divided into a visible modality set $\mathcal{V}_i$ and a missing modality set $\mathcal{M}_i$. This process not only completes the standardized input construction of multimodal target images, but also provides stable and distinguishable prior conditions for subsequent processing by uniformly modeling the modality missing cases.
[0029] S1 is implemented by explicitly encoding the modality missing case. The specific steps are as follows: S11. Multimodal target image data acquisition and construction. Multimodal target image data from multiple different scenes are acquired, including visible light, near-infrared, and thermal infrared. For the $i$-th target sample, its corresponding target image in each modality is collected, denoted as $\{x_i^k\}_{k \in \{R, N, T\}}$, where $R$, $N$, and $T$ represent visible light, near-infrared, and thermal infrared, respectively, and $x_i^k$ indicates the target image of the $i$-th target sample in the $k$-th modality; simultaneously, each target sample is labeled with its corresponding identity category label, denoted as $y_i$. Thus, a multimodal target image training dataset is constructed: $\mathcal{D} = \{(x_i^R, x_i^N, x_i^T, y_i)\}_{i=1}^{N}$, where $N$ represents the total number of training samples.
[0030] S12. Multimodal target image preprocessing. The acquired multimodal target image data undergoes preprocessing operations, including the following steps: target region cropping, image size normalization, random cropping, random horizontal flipping, and random erasure. These preprocessing operations ensure consistency in input scale and data distribution across different modalities, thereby improving the stability of subsequent feature extraction.
[0031] S13. Construction of the modality missing indicator vector. For each target sample in the training dataset, a modality missing indicator vector is constructed to describe its modality availability state, denoted as $m_i = [m_i^R, m_i^N, m_i^T]$, where $m_i^k \in \{0, 1\}$. When the target image of the $i$-th target sample exists in the $k$-th modality, let $m_i^k = 1$; when the target image of the $i$-th target sample is missing in the $k$-th modality, let $m_i^k = 0$. Through the modality indicator vector $m_i$, the modality loss of target samples during the training phase is modeled explicitly.
[0032] S14. Modality set partitioning. Based on the modality indicator vector $m_i$, the modality set corresponding to the $i$-th target sample is divided into two non-overlapping subsets: the visible modality set $\mathcal{V}_i = \{k \mid m_i^k = 1\}$ and the missing modality set $\mathcal{M}_i = \{k \mid m_i^k = 0\}$. The visible modality set $\mathcal{V}_i$ represents the modalities of the current target sample that can be used for feature extraction, and the missing modality set $\mathcal{M}_i$ represents the modalities that are not available in the current target sample.
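Steps S13 and S14 amount to a small bookkeeping routine, sketched below. The sample layout (a dictionary mapping a modality key to an image path, with `None` for a missing image) and the function names are assumptions made for illustration.

```python
MODALITIES = ("R", "N", "T")  # visible light, near-infrared, thermal infrared


def indicator_vector(sample):
    """Return the modality indicator vector: 1 if the image exists, 0 if missing."""
    return [1 if sample.get(k) is not None else 0 for k in MODALITIES]


def partition_modalities(m):
    """Split the modality set into non-overlapping visible and missing subsets."""
    visible = [k for k, flag in zip(MODALITIES, m) if flag == 1]
    missing = [k for k, flag in zip(MODALITIES, m) if flag == 0]
    return visible, missing


# Hypothetical sample whose near-infrared image is unavailable.
sample = {"R": "rgb_0001.jpg", "N": None, "T": "tir_0001.jpg"}
m = indicator_vector(sample)
visible, missing = partition_modalities(m)
```

Because the two subsets are derived from the same vector, they are disjoint by construction and together cover all three modalities.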
[0033] S2: Generate learnable missing-aware cues based on the modality missing state, inject the cues, which contain modality-level and sample-level missing information, into the backbone network, and modulate the global feature representation to achieve adaptive feature extraction. After acquiring, preprocessing, and encoding the multimodal image data, the core task of this step is to introduce a missing-aware mechanism that adaptively modulates the multimodal feature extraction process. The method first uses the modality indicator vector $m_i$ to look up modality-level state cues $p_i^k$. The modality-level cues are then aggregated to obtain a sample-level global missing state representation $g_i$, which characterizes the overall missing pattern of the sample. Next, the modality-level cues and the sample-level global cue are fused to form a missing-aware representation $s_i^k$ for modulation. This representation is injected into the multimodal backbone network, and residual modulation is applied to the global feature output by each modality branch; the adapter that produces the residual term is a two-layer bottleneck MLP. The remaining local feature vectors are left unchanged and, together with the modulated global feature, are fed into the subsequent feed-forward network for feature extraction. By continuously injecting these missing-aware cues into each layer, this step explicitly encodes and propagates modality missing information, providing a stable, missing-aware feature input for subsequent cross-modal feature aggregation.
[0034] In step S2, missing-aware modulated multimodal feature extraction is achieved through cue learning and residual modulation. The specific steps are as follows: S21. Construct learnable cue vectors for modality missing states. For the modality indicator vector $m_i$ obtained in S1, a learnable modality-state embedding matrix is constructed for each modality to encode the availability state of that modality into a continuous vector representation. For the $k$-th modality, construct the corresponding state embedding matrix $E^k \in \mathbb{R}^{2 \times d_p}$, where $d_p$ indicates the dimension of the state representation vector. Based on the modality indicator value $m_i^k$, a table lookup gives the modal state cue vector of the $i$-th sample for the $k$-th modality: $p_i^k = E^k[m_i^k]$. This maps discrete modality availability states into continuous learnable feature representations.
[0035] S22. Generate a sample-level global representation of the missing state. Considering that the missing states of different modalities have a joint impact on the overall feature representation of the same target sample, the state representation vectors corresponding to each modality are aggregated to obtain a global representation of the missing states at the sample level. Specifically, for the $i$-th target sample, its state representation vectors in each modality are summed and then passed through a linear transformation to generate a global missing state representation vector: $g_i = W_g \sum_{k \in \{R, N, T\}} p_i^k$, where $W_g$ is a learnable linear transformation matrix and $g_i$ indicates the global missing state representation vector of the $i$-th sample. This global representation vector characterizes the overall modality missing pattern of the current sample.
[0036] S23. Fuse the modality-level and sample-level missing-state information. To retain both the missing-state information of each individual modality and the missing-state context of the whole sample, the modality-level state representation and the global missing-state representation are fused to obtain the final missing-aware representation used for feature modulation. For the k-th modality, the missing-aware representation vector is obtained by adding the modality-level state cue vector to the global missing-state representation vector, so that it contains both modality-level and sample-level missing information.
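The cue construction of S21 to S23 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the patented implementation: the three modalities come from the text, while `STATE_DIM`, the random initial values, and the names `state_tables`, `W_global`, and `missing_aware_cues` are hypothetical. In a trained system the tables and the matrix would be learned parameters rather than random numbers.

```python
import random

NUM_MODALITIES = 3   # RGB, near-infrared, thermal infrared (from the text)
STATE_DIM = 4        # hypothetical dimension of the state cue vectors

random.seed(0)

def rand_matrix(rows, cols):
    """Random stand-in for a learnable parameter matrix."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

# S21: one 2 x STATE_DIM table per modality (row 0 = missing, row 1 = available).
state_tables = [rand_matrix(2, STATE_DIM) for _ in range(NUM_MODALITIES)]
# S22: stand-in for the learnable linear transformation matrix.
W_global = rand_matrix(STATE_DIM, STATE_DIM)

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def missing_aware_cues(indicator):
    """indicator[k] is 1 if modality k is available and 0 if it is missing."""
    # S21: table lookup of the modality-level cue vectors.
    cues = [state_tables[k][indicator[k]] for k in range(NUM_MODALITIES)]
    # S22: sum over modalities, then apply the linear map.
    summed = [sum(cue[d] for cue in cues) for d in range(STATE_DIM)]
    global_cue = matvec(W_global, summed)
    # S23: additive fusion of modality-level and sample-level cues.
    return [[c + g for c, g in zip(cue, global_cue)] for cue in cues]

cues = missing_aware_cues([1, 0, 1])  # near-infrared missing
```

Because the global cue is shared by all modalities of a sample, two samples with the same modality available but different missing patterns still receive different missing-aware vectors for that modality.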
[0037] S24. Construct the input sequence of each backbone layer and obtain the global feature representation. The preprocessed multimodal target images from S1 are input to the backbone network for feature extraction. For each modality branch, the input sequence of a given backbone layer is formed by concatenating the class token feature vector of that layer, which aggregates global information, with the remaining local feature vectors; the sequence is characterized by the number of local feature vectors and the feature dimension. The class token feature vector is taken as the global feature representation of the modality branch at that layer.
[0038] S25. After the attention computation of the layer is complete, the missing-aware representation vector obtained in S23 is fused with the updated global feature and input to a feature modulation unit, which generates a residual modulation term and thereby achieves adaptive modulation for different missing patterns. Specifically, the missing-aware representation vector and the updated global feature are concatenated and input to the feature modulation unit, and the resulting residual modulation term is used to obtain the modulated global feature representation. The feature modulation unit is a two-layer multilayer perceptron consisting of a dimension-reducing linear mapping matrix, a non-linear activation function, and a dimension-increasing linear mapping matrix. Meanwhile, the remaining local feature vectors in the sequence are left unchanged and, together with the modulated global feature, are fed into the subsequent network layers. Missing-state information is thereby explicitly injected into the global feature stream, enabling adaptive feature extraction under different modality-missing scenarios.
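The residual modulation of S25 can be illustrated as follows. This is a sketch under assumptions: the text does not name the activation function, so ReLU is used as a stand-in, and adding the bottleneck output back onto the global feature is one plausible reading of "residual modulation term". The function and parameter names (`modulate_global_feature`, `w_down`, `w_up`) are hypothetical.

```python
def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def relu(x):
    # Stand-in activation; the source only says "non-linear activation function".
    return x if x > 0.0 else 0.0

def modulate_global_feature(global_feat, cue, w_down, w_up):
    """Two-layer bottleneck MLP applied to [global_feat ; cue], added as a residual.

    w_down: bottleneck_dim x (len(global_feat) + len(cue))  (dimension-reducing)
    w_up:   len(global_feat) x bottleneck_dim               (dimension-increasing)
    """
    joint = global_feat + cue                      # list concatenation
    hidden = [relu(h) for h in matvec(w_down, joint)]
    residual = matvec(w_up, hidden)                # residual modulation term
    return [g + r for g, r in zip(global_feat, residual)]

modulated = modulate_global_feature(
    [1.0, 2.0], [0.5], w_down=[[1.0, 1.0, 1.0]], w_up=[[1.0], [-1.0]]
)
```

With `w_up` set to zeros the unit reduces to the identity, which is why adapters of this kind are often initialized that way: training starts from the unmodulated backbone.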
[0039] S3: For modality-incomplete samples, based on their visible modality features, an attention mechanism aggregates the complementary semantic information of each modality to construct a stable multimodal joint feature representation. After missing-aware feature extraction, the method takes the visible modality features of a modality-incomplete sample as conditional information and performs conditional feature aggregation to obtain feature representations for the missing modalities. When a sample has a single missing modality, the method uses the visible modality features as keys and values, introduces a learnable query vector corresponding to the missing modality, and aggregates the multimodal complementary semantics through a cross-attention mechanism to obtain the feature representation of the missing modality. When a sample has multiple missing modalities, the method first applies a cross-attention mechanism for preliminary semantic modeling based on the only visible modality, and then generates the conditional feature representations of the missing modalities through a self-attention mechanism. The aggregated missing modality features are then merged with the original visible modality features to form the joint feature representation set of the modality-incomplete sample. This step exploits the complementary semantic information among visible modalities to construct a stable joint feature representation, providing a reliable input for the subsequent feature calibration based on semantic anchors.
[0040] In step S3, conditional feature aggregation is accomplished through an attention mechanism conditioned on the visible modalities. The specific steps are as follows: S31. Feature set partitioning. The global features output by each modality branch after the missing-aware modulation of S2 are collected as the multimodal feature set of a target sample, in which each element is the global feature vector output by the S2 feature extraction network for that sample in one modality. Based on the visible modality set and the missing modality set constructed in S1, the multimodal feature set is divided into the visible modality feature set and the set of unobserved features corresponding to the missing modalities.
[0041] S32. Conditional cross-modal aggregation when a single modality is missing. When a target sample has exactly one missing modality, a learnable query vector corresponding to the missing modality is first initialized. The visible modality feature vectors are concatenated and used as the key and value inputs. Semantic aggregation is then performed by a multi-head cross-attention operation to obtain the aggregated feature representation of the missing modality.
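The aggregation in S32 can be illustrated with a deliberately simplified attention routine. The sketch below is an assumption-laden reduction of the multi-head cross-attention named in the text: it uses a single head and omits the learned query, key, value, and output projections, so it only shows how a query for the missing modality weights and mixes the visible features. The names `cross_attention` and `aggregate_missing_modality` are hypothetical.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def aggregate_missing_modality(query, visible_features):
    """Visible modality features act as both keys and values (as in S32)."""
    return cross_attention(query, visible_features, visible_features)

# Hypothetical 2-D features for two visible modalities and a query for the missing one.
estimate = aggregate_missing_modality([0.2, -0.1], [[1.0, 2.0], [3.0, 4.0]])
```

A zero query gives uniform attention weights, so the output is the plain average of the visible features; training the query moves the mixture toward whichever visible modality is more informative for the missing one.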
[0042] S33. Controlled two-stage aggregation when two modalities are missing. When a target sample has two missing modalities, a learnable query vector is initialized for each of them. A controlled two-stage aggregation strategy is adopted: first, the features of the only visible modality are used to perform preliminary semantic aggregation for the missing modalities; then the correlation between the two missing modalities is modeled by a multi-head self-attention operation to obtain the aggregated feature representations of the missing modalities.
[0043] S34. Construct the joint feature set representation for modality-incomplete samples. After the conditional semantic aggregation of the missing modalities, the visible modality features are merged with the aggregated missing modality features to form the joint feature representation set of the target sample under modality absence.
[0044] S4: Introduce a semantic anchor mechanism to construct a stable identity semantic reference space. For modality-incomplete samples, apply feature calibration constraints only in the feature subspace corresponding to their observable modalities, guiding the joint features to align with the semantic anchor of the corresponding identity. Building on the joint feature representation constructed in S3, the feature calibration stage introduces an identity-level semantic anchor mechanism to construct a stable identity semantic reference space that constrains the feature distributions under different modality combinations. The method maintains a prototype memory organized by identity category, in which each identity corresponds to a set of cross-modal semantic anchor vectors that characterize the stable semantic representation of that identity in each modality. During training, the semantic anchors of the corresponding identity category are updated by exponential smoothing within the subspace corresponding to the sample's visible modality set. For modality-incomplete samples, the method computes, within the observable modality subspace, a feature alignment loss between the joint features and the corresponding identity semantic anchors. This constraint guides the joint features of modality-incomplete samples to move gradually toward the stable identity semantic anchors. The step alleviates the feature distribution drift caused by missing modalities during the training phase, further improving the stability and accuracy of multimodal target re-identification.
[0045] In step S4, semantic-anchor feature calibration is achieved through a prototype memory and a feature alignment loss. The specific steps are as follows: S41. Construct a category prototype memory to form the semantic anchor space. To build a stable identity semantic reference space, a prototype memory organized by identity category is maintained, which stores the prototype vector of each identity in each modality as a semantic anchor. For any identity category, its semantic anchor prototype is defined as the set of that identity's prototype vectors in the individual modalities, which together form a stable semantic reference.
[0046] S42. Update the semantic anchor prototypes with modality-complete samples. When a training sample is modality-complete, that is, when its missing modality set is empty, the global feature vector of the sample in each modality is used to update the prototype vector of the corresponding identity category. The update is an exponential moving average controlled by a momentum coefficient, selected by the identity category label of the sample, and applied as an assignment that overwrites the stored prototype. In this way, the semantic anchor prototypes make full use of modality-complete samples to form a stable identity semantic structure.
[0047] S43. Update the semantic anchor prototypes with modality-incomplete samples. When a training sample is modality-incomplete, that is, when its missing modality set is non-empty, only the feature sub-vectors corresponding to its visible modality set are used for a controlled update of the semantic anchor prototypes; the prototype sub-vectors corresponding to missing modalities are not updated. Restricting the update to the observable modality subspace avoids the unreliable update noise that missing modalities would introduce.
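The prototype updates of S42 and S43 differ only in which modalities are touched, so a single routine can cover both. The sketch below assumes the common exponential moving average convention, new = momentum × old + (1 − momentum) × feature; the source formula is not legible, so this convention, the default momentum value, and the names (`ema_update_prototypes`, the modality keys `"rgb"`, `"nir"`, `"tir"`) are illustrative assumptions.

```python
def ema_update_prototypes(prototypes, identity, features, visible, momentum=0.9):
    """Update the anchors of one identity, only for the visible modalities.

    prototypes: {identity: {modality: vector}}  (the prototype memory of S41)
    features:   {modality: vector} for the current sample
    visible:    iterable of modality names observed for this sample
    """
    for modality in visible:
        old = prototypes[identity][modality]
        new = features[modality]
        # Assumed convention: anchor <- momentum * anchor + (1 - momentum) * feature
        prototypes[identity][modality] = [
            momentum * o + (1.0 - momentum) * n for o, n in zip(old, new)
        ]
    return prototypes

memory = {7: {"rgb": [0.0, 0.0], "nir": [1.0, 1.0], "tir": [2.0, 2.0]}}
sample = {"rgb": [1.0, 1.0], "tir": [0.0, 0.0]}       # near-infrared is missing
ema_update_prototypes(memory, 7, sample, visible=["rgb", "tir"], momentum=0.5)
```

For a modality-complete sample, `visible` simply lists all three modalities, which reproduces the S42 case; for an incomplete sample the anchor of the missing modality is left exactly as stored.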
[0048] S44. Construct feature calibration constraints within the observable modality subspace. For modality-incomplete samples, to mitigate the feature distribution shift caused by missing modalities, feature calibration constraints are applied only within the feature subspace corresponding to the sample's observable modality set, gradually drawing the joint features toward the semantic anchor of the corresponding identity. Specifically, the feature alignment loss of a sample within the visible modality subspace is computed from the vector dot product and the vector 2-norm of the sample's features and the semantic anchor prototypes of its identity category in each visible modality. In this way, modality-incomplete samples are aligned and calibrated only within the observable modality subspace, which improves feature stability and avoids unreliable inferences about missing modalities.
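Since S44 defines the alignment loss through a dot product and 2-norms, a cosine-similarity form is a natural reading. The sketch below assumes the loss is one minus the cosine similarity between a feature and its anchor, averaged over the visible modalities; the exact form and normalization in the source formula are not recoverable, so treat this as an illustration only. The names `alignment_loss` and `cosine_similarity` and the small `eps` guard are hypothetical.

```python
import math

def cosine_similarity(u, v, eps=1e-12):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)   # eps guards against zero vectors

def alignment_loss(features, anchors, visible):
    """Assumed form: mean over visible modalities of (1 - cosine similarity).

    features: {modality: vector} for one sample
    anchors:  {modality: vector} semantic anchors of the sample's identity
    visible:  modalities observed for this sample; missing ones are skipped
    """
    terms = [1.0 - cosine_similarity(features[m], anchors[m]) for m in visible]
    return sum(terms) / len(terms)

loss = alignment_loss(
    {"rgb": [1.0, 0.0], "nir": [0.0, 2.0]},
    {"rgb": [2.0, 0.0], "nir": [3.0, 0.0], "tir": [1.0, 1.0]},
    visible=["rgb", "nir"],
)
```

The thermal infrared anchor is present in memory but never enters the loss, which is the point of restricting calibration to the observable modality subspace.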
[0049] S5: Identity discrimination learning is carried out under joint supervision of identity classification loss and triplet loss. Cross-modal consistency constraints are applied to samples with complete modalities, and feature alignment loss is applied to samples with incomplete modalities. The joint loss is calculated to optimize the network parameters, and the target re-identification is completed based on the optimized re-identification network.
[0050] After the feature calibration based on semantic anchors, a set of joint feature representations has been obtained through missing-aware learning and cross-modal feature calibration. These representations consist of visible modality features and missing modality features obtained through conditional cross-modal feature calibration, and they stably characterize the identity semantics of the target sample. For the calibrated joint feature representations, the method introduces an identity classification loss and a metric learning loss to strengthen identity discrimination. For modality-complete samples, the method additionally imposes cross-modal consistency constraints among the multimodal features. For modality-incomplete samples, a feature alignment loss is additionally applied, which avoids unreliable interference from missing modalities during optimization. The re-identification network is trained end-to-end under the joint loss, yielding a parameter-optimized re-identification model. This model extracts features from the target samples to be identified and computes their similarity to the features in the target database, and the final re-identification result is output according to the similarity ranking. This step achieves robust multimodal target re-identification in modality-missing scenarios and can be applied directly to practical surveillance retrieval or downstream analysis tasks.
[0051] In S5, the network parameters are optimized and target re-identification is achieved under joint loss supervision. The specific implementation is as follows: S51. Construct the identity discrimination supervised loss. For all training samples, identity discrimination learning is performed on the joint feature representation. The sample feature representation output by S4 is input to the classification network and, after pooling layers and fully connected layers, yields the identity discrimination feature vector of each target sample, which is used for identity classification. Based on this feature vector, an identity classification loss is constructed in the form of a cross-entropy loss, defined in terms of the identity category label of the sample, the total number of identity categories in the training set, and the classifier weight vector corresponding to each identity category.
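The identity classification loss of S51 is described as a cross-entropy over per-identity weight vectors, which can be sketched as below. The sketch assumes a bias-free linear classifier whose logits are dot products between the feature and each identity's weight vector; whether the source formula includes a bias or normalization is not recoverable. The function name `identity_classification_loss` is hypothetical.

```python
import math

def identity_classification_loss(feature, class_weights, label):
    """Cross-entropy of one sample under a linear classifier (assumed bias-free).

    feature:       identity discrimination feature vector of the sample
    class_weights: one weight vector per identity category
    label:         index of the sample's identity category
    """
    logits = [sum(w * x for w, x in zip(weights, feature)) for weights in class_weights]
    top = max(logits)                                   # for numerical stability
    log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum_exp - logits[label]                  # equals -log softmax[label]

weights = [[1.0, 0.0], [0.0, 1.0]]                      # two hypothetical identities
loss_correct = identity_classification_loss([5.0, 0.0], weights, label=0)
loss_wrong = identity_classification_loss([5.0, 0.0], weights, label=1)
```

A feature that points along the weight vector of its own identity gives a small loss, while the same feature scored against another identity's label gives a large one.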
[0052] S52. Construct the metric learning loss. To further increase feature compactness among samples of the same identity and enlarge the feature distance between samples of different identities, a triplet loss is introduced as a metric learning constraint. For any triplet consisting of an anchor sample, a positive sample, and a negative sample, the triplet loss is defined in terms of the feature vectors of the anchor, positive, and negative samples, a margin parameter, and the operation of taking the non-negative part.
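The triplet loss of S52 can be written out in its standard form, the non-negative part of the anchor-positive distance minus the anchor-negative distance plus the margin. The sketch assumes Euclidean distance, which the text does not specify, and the default margin of 0.3 is only a placeholder value.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard triplet loss: [d(a, p) - d(a, n) + margin]_+ (Euclidean d assumed)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Positive far from the anchor, negative close: the triplet is violated.
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0], margin=0.5)
# Positive close, negative far: the margin is already satisfied.
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0], margin=0.5)
```

Only triplets that violate the margin contribute gradient, which is why practical re-identification training usually mines hard positives and negatives within each batch.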
[0053] S53. Construct cross-modal consistency constraints for modality-complete samples. When a training sample is modality-complete, that is, when its missing modality set is empty, cross-modal consistency constraints are imposed on its multimodal features to keep the features of the different modalities consistent in the semantic space. The cross-modal consistency loss is defined over the feature representations of the modality-complete sample in the different modalities. This constraint ensures that modality-complete samples maintain a consistent identity semantic representation in the multimodal feature space.
[0054] S54. Introduce the feature calibration loss for modality-incomplete samples. When a training sample is modality-incomplete, the feature alignment loss defined in S44 is introduced to keep the sample consistent with its identity semantic anchors within the observable modality subspace.
[0055] S55. Construct the overall loss function and update the network parameters. The loss terms above are combined into the overall loss function for training the multimodal target re-identification network, in which two weighting coefficients scale the cross-modal consistency loss and the feature calibration loss, respectively. Based on the overall loss function, the multimodal target re-identification network, consisting of the feature extraction network and the classification network, is trained end-to-end by gradient descent until a preset number of iterations is reached or the loss convergence condition is met.
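The way the loss terms of S51 to S54 combine in S55 can be sketched as a small function. The additive form below, with the identity and triplet losses unweighted and one coefficient each for the consistency and calibration terms, is an assumption consistent with the description; the source formula itself is not legible. The names `total_loss`, `lambda_con`, and `lambda_align` are hypothetical.

```python
def total_loss(l_id, l_tri, l_con, l_align, is_complete,
               lambda_con=1.0, lambda_align=1.0):
    """Assumed overall loss for one sample.

    Modality-complete samples add the weighted cross-modal consistency loss;
    modality-incomplete samples add the weighted feature alignment loss.
    """
    loss = l_id + l_tri
    if is_complete:
        loss += lambda_con * l_con
    else:
        loss += lambda_align * l_align
    return loss

complete = total_loss(1.0, 0.5, 0.25, 2.0, is_complete=True,
                      lambda_con=2.0, lambda_align=0.5)
incomplete = total_loss(1.0, 0.5, 0.25, 2.0, is_complete=False,
                        lambda_con=2.0, lambda_align=0.5)
```

In a batch that mixes complete and incomplete samples, the per-sample values would typically be averaged before the gradient step.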
[0056] This invention addresses multimodal target re-identification in which modalities are missing during the training phase. It overcomes the limitation of existing methods, which generally rely on modality completeness during training. By explicitly modeling the modality-missing state during model training, the method can be applied in complex environments where modalities are missing in both the training and testing phases, improving its applicability and generalization in real-world scenarios. The invention introduces a missing-aware cue learning mechanism that explicitly encodes the modality availability of a sample as learnable cue information and injects it into the multimodal feature extraction process. The model can thus perceive the modality-missing situation of the current sample and adaptively adjust its feature learning strategy, obtaining a missing-aware identity feature representation and mitigating the adverse effect of missing modalities on the stability of the feature representation. For modality-incomplete samples, the invention proposes a conditional feature aggregation strategy based on the visible modalities: an attention mechanism models the complementary semantic information among the observable modalities to construct a stable multimodal joint feature representation, providing a reliable feature foundation and enhancing feature robustness under missing modalities. The invention further introduces a feature calibration mechanism based on semantic anchors. By applying feature alignment constraints only within the observable modality subspace of modality-incomplete samples, it alleviates the feature distribution shift caused by missing modalities during the training phase, improves feature discriminability and cross-modal consistency, and enables the model to maintain stable re-identification performance under different missing conditions.
[0057] The embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. For those skilled in the art, various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principles and spirit of the present invention, and these variations still fall within the protection scope of the present invention.
Claims
1. A multimodal target re-identification system for modality-missing scenarios, characterized in that: the system comprises a multimodal data preparation and encoding module, a missing-aware feature extraction module, a conditional feature aggregation module, a semantic anchor feature alignment module, and a joint optimization and identity discrimination learning module, which operate in sequence; the multimodal data preparation and encoding module is used to acquire and preprocess multimodal target image data in the RGB, near-infrared, and thermal infrared modalities, construct a modality indicator vector for each sample to identify the missing state of each modality, and divide the modality set of the sample into a visible modality set and a missing modality set; the missing-aware feature extraction module is used to generate learnable missing-state cues based on the modality-missing state of the sample, inject the cues into the multimodal feature extraction network, modulate the global feature representation of the backbone network, and output a missing-aware multimodal feature representation; the conditional feature aggregation module is used to model, for modality-incomplete samples, the complementary semantic information of the visible modalities through an attention-based cross-modal aggregation mechanism, and to construct an initial joint feature representation containing the visible modalities and the missing modalities; the semantic anchor feature alignment module is used to introduce a semantic anchor mechanism based on a category prototype memory to construct a stable identity semantic reference space and, for modality-incomplete samples, to apply feature alignment constraints only in the feature subspace corresponding to their observable modalities, guiding the initial joint features toward the semantic anchor of the corresponding identity; and the joint optimization and identity discrimination learning module is used to perform joint supervised learning on training samples based on the calibrated features, optimize the network parameters through backpropagation, and complete target re-identification with the optimized network.
2. The multimodal target re-identification system for modality-missing scenarios according to claim 1, characterized in that: the conditional feature aggregation module and the semantic anchor feature alignment module work together to form a staged cross-modal feature calibration paradigm, specifically implemented as follows: based on the modality indicator vector of the sample, each element of which takes a value in {0,1} and represents the availability of the corresponding modality, the modality set is divided into a visible modality set and a missing modality set; with the learnable query vectors corresponding to the missing modalities as the condition, an attention-based conditional feature aggregation function guides cross-modal aggregation of the visible modality features, from which the initial joint feature representation is constructed; the prototype memory maintains the category-level semantic anchor vectors of each identity in the multimodal space, feature alignment constraints are applied only within the feature subspace corresponding to the observable modalities, and a feature alignment loss is computed; and through a mapping function, the sample features after staged cross-modal calibration are fused and a discriminative representation is output.
3. A recognition method using the multimodal target re-identification system for modality-missing scenarios according to claim 1 or 2, characterized in that the method includes the following steps: S1: collect multimodal target image data in the visible light, near-infrared, and thermal infrared modalities and preprocess it; construct a modality indicator vector for each sample to mark the availability of each modality, and divide the modality set of the sample into a visible modality set and a missing modality set; S2: generate learnable missing-state cues from the modality-missing state information, inject these cues, which carry both modality-level and sample-level missing information, into the backbone network, and modulate the global feature representation to achieve adaptive feature extraction; S3: for modality-incomplete samples, based on their visible modality features, use an attention mechanism to aggregate the complementary semantic information of each modality and construct a stable multimodal joint feature representation; S4: introduce a semantic anchor mechanism to construct a stable identity semantic reference space, and for modality-incomplete samples apply feature calibration constraints only in the feature subspace corresponding to their observable modalities, guiding the joint features to align with the semantic anchor of the corresponding identity; S5: perform identity discrimination learning under the joint supervision of an identity classification loss and a triplet loss, apply cross-modal consistency constraints to modality-complete samples and a feature alignment loss to modality-incomplete samples, compute the joint loss to optimize the network parameters, and complete target re-identification with the optimized re-identification network.
4. The multimodal target re-identification method for modality-missing scenarios according to claim 3, characterized in that S1 specifically includes: S11: multimodal target image data acquisition and construction: acquire visible light, near-infrared, and thermal infrared multimodal target images in multiple scenes; for each target sample, record its modality images in the visible light, near-infrared, and thermal infrared modalities, label it with an identity category tag, and build a training dataset over the total number of training samples; S12: multimodal target image preprocessing: subject the multimodal target images to target region cropping, image size normalization, random cropping, random horizontal flipping, and random erasing, in sequence; S13: modality-missing indicator vector construction: construct a modality indicator vector for each target sample, in which an element is 1 when the corresponding modality image exists and 0 when it is missing; S14: modality set partitioning: based on the modality indicator vector, divide the modality set of each sample into a visible modality set and a missing modality set.
5. The multimodal target re-identification method for modality-missing scenarios according to claim 3, characterized in that S2 specifically includes: S21: constructing learnable cue vectors for the modality-missing states: for the k-th modality, construct a state embedding matrix whose vector length is the dimension of the state representation vector, and look up, according to the modality indicator value, the modality state cue vector of each sample in the k-th modality; S22: generating a sample-level global representation of the missing state: sum the modality state cue vectors of each sample and apply a linear transformation matrix to generate the global missing-state representation vector; S23: fusing modality-level and sample-level missing-state information: add the modality-level state representation to the global missing-state representation to obtain the missing-aware representation vector of the k-th modality; S24: constructing the backbone network feature representation: input the preprocessed multimodal target images into the Vision Transformer-based backbone network, and take the class token feature vector of each layer as the global feature representation of the modality branch at that layer; S25: residual modulation of the global feature: concatenate the missing-aware representation vector and the global feature, input them to a feature modulation unit composed of a two-layer multilayer perceptron to generate a residual modulation term, and complete the global feature modulation, wherein the feature modulation unit comprises a dimension-reducing linear mapping matrix, a dimension-increasing linear mapping matrix, and a non-linear activation function.
6. The multimodal target re-identification method for modality-missing scenarios according to claim 3, characterized in that S3 specifically includes: S31: feature set partitioning: obtain the multimodal feature set output by S2 and, according to the visible modality set and the missing modality set, divide it into the visible modality feature set and the set of unobserved features of the missing modalities; S32: single-modality missing aggregation: when exactly one modality is missing, initialize the learnable query vector of the missing modality and aggregate the visible modality features through a multi-head cross-attention mechanism; S33: two-modality missing aggregation: when two modalities are missing, initialize their query vectors and obtain the aggregated features through controlled two-stage aggregation, the second stage of which is a multi-head self-attention mechanism; S34: constructing the joint feature representation: merge the visible modality features with the aggregated missing modality features to obtain the joint feature representation.
7. The multimodal target re-identification method for modality-missing scenarios according to claim 3, characterized in that S4 specifically includes: S41: constructing a category prototype memory: maintain a prototype memory organized by identity category, in which the semantic anchor prototype of each identity category consists of the prototype vectors of that identity in each modality; S42: updating the anchors with modality-complete samples: when the missing modality set of a sample is empty, update the prototype vectors by an exponential moving average with a momentum coefficient; S43: updating the anchors with modality-incomplete samples: when the missing modality set of a sample is non-empty, update the prototype vectors only in the visible modality subspace; S44: constructing feature alignment constraints: for modality-incomplete samples, construct a feature alignment loss in the visible modality subspace, defined through the vector dot product and the vector 2-norm.
8. The multimodal target re-identification method for modality-missing scenarios according to claim 7, characterized in that S5 specifically includes: S51: constructing the identity classification loss: input the features output by S4 into the classification network to obtain the identity discrimination feature vector, and construct the identity classification loss as a cross-entropy loss defined by the total number of identity categories and the weight vector of each identity; S52: constructing the metric learning loss: introduce a triplet loss as a metric learning constraint, defined for each triplet by a margin parameter and the operation of taking the non-negative part; S53: constructing the cross-modal consistency loss: when the missing modality set of a sample is empty, construct a cross-modal consistency loss over its multimodal features; S54: introducing the feature calibration loss: when the missing modality set of a sample is non-empty, introduce the feature alignment loss of S44; S55: constructing and optimizing the total loss: construct the total loss function, in which two weight coefficients scale the cross-modal consistency loss and the feature calibration loss, respectively, and train the network end-to-end by gradient descent until convergence.