A multimodal target perception and re-identification method for images

By combining modal embedding and modal awareness enhancement loss functions, and utilizing the transformer model, the problem of insufficient utilization of modal information in cross-modal target re-identification is solved, and better modal feature learning and recognition results are achieved.

CN116168418B · Active · Publication Date: 2026-06-12 · BEIJING JIAOTONG UNIV


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: BEIJING JIAOTONG UNIV
Filing Date: 2023-01-29
Publication Date: 2026-06-12

Patent Timeline

29 Jan 2023: Application
12 Jun 2026: Publication

CN116168418B

IPC: G06V40/10; G06V10/42; G06V10/80; G06V10/82; G06V10/764; G06N3/0455; G06N3/084

CPC: G06V40/10; G06V10/42; G06V10/806; G06V10/82; G06V10/764; G06N3/084; Y02T10/40

AI Tagging

Application Domain: Internal combustion piston engines; Biometric pattern recognition


Abstract

The application provides a multimodal target perception and re-identification method of images. The method comprises the following steps: pre-processing cross-modal image data to obtain a block vector sequence, learning the modal information of the cross-modal image data through ME, fusing the block data, position information and modal information of the cross-modal image data together to obtain serialized image data, inputting the serialized image data into a ViT model, outputting the feature information of the cross-modal image data by the ViT model, calculating a modal perception enhancement loss value, adjusting the model parameters through back propagation according to the loss value, obtaining a trained target re-identification model, and using the model to perform cross-modal target re-identification on a to-be-identified pedestrian image. The method of the application introduces a learnable modal embedding into a network, directly encodes modal information, can effectively be used to alleviate the gap between heterogeneous images, and realizes target perception and re-identification of cross-modal images.

Description

Technical Field

[0001] This invention relates to the field of target re-identification technology, and in particular to a multimodal target perception and re-identification method for images.

Background Technology

[0002] With the increasing number of surveillance cameras and rising public safety demands, target re-identification has attracted significant interest from industry and holds substantial research value. While deep learning has led to the development of methods that have achieved excellent performance, these methods are only applicable under ideal lighting conditions and cannot solve the problems of low-light environments in real-world scenarios. To address this issue, a large number of infrared cameras have been deployed in video surveillance systems, demonstrating significant application value. Therefore, researchers have begun to focus on the cross-modal target re-identification problem.

[0003] Visible light and infrared images are generated by cameras that capture light in different wavelength ranges. Visible light images consist of three channels (red, green, and blue) containing color information, while infrared images contain only one channel with infrared radiation; they are fundamentally different. Therefore, reducing modal differences for the same target object or identity is crucial for solving cross-modal target re-identification tasks. Currently, existing target re-identification methods can be broadly categorized into two directions: modality transformation-based methods and representation and metric learning-based methods. Modality transformation-based methods attempt to eliminate modality differences by converting images from one modality to another, learning modality transformation mappings through generative adversarial networks (GANs). However, since the mapping process is not one-to-one, the generation process produces images with inconsistent colors, and there is no reliable mapping relationship to support the generative model. Therefore, researchers focus on the structural design of CNN (Convolutional Neural Networks) models, extracting modality-shared features through representation learning and metric learning to reduce inter-modality differences. Based on the two-stream architecture, the corresponding method uses a shallow layer with non-shared weights to extract shared features of different modalities and a deep layer with shared weights to learn distinguishing features. However, this learning strategy cannot fully perceive and deeply mine the built-in modal features and cannot learn good modality-invariant feature representations.

[0004] Compared to CNNs, transformer models can obtain a global receptive field with a self-attention module and complete spatial features. Therefore, this invention aims to design a novel system and method for multimodal object perception and re-identification based on transformers, which can capture modal features through learnable feature vectors and generate more effective matching vectors based on these features.

[0005] Cross-modal target re-identification is a challenging task in computer vision, aiming to match target objects or pedestrians across images in the visible and infrared modes. One existing cross-modal target re-identification method is the DFLN-ViT scheme shown in Figure 1. This scheme uses a two-stream network to extract features. Transformer blocks with spatial feature awareness and channel feature enhancement modules are inserted after different convolutional blocks to mine positional dependencies and obtain refined features. In the backbone network, skip connections from the first stage to the last stage combine high-level and low-level information to form a robust feature representation. The output I of each channel feature enhancement module is fed into the next stage of the network, and the output S of each spatial feature awareness module is combined with I in the last stage by element-wise addition to obtain the final feature representation. Both the spatial feature awareness module and the channel feature enhancement module introduce transformer structures to capture the spatial location and long-term channel dependencies between features. Finally, the network is trained using a designed ternary auxiliary heterogeneous center loss (THC loss) and classification loss (ID loss).

[0006] The spatial feature perception module takes as input the features extracted from the convolution stage. To input the feature map into the transformer and obtain refined spatial features, convolution operations are used to combine information from neighboring pixels. For integration in the final stage, the output features of all spatial feature perception modules have the same size as the output features of the final stage. The transformer excels at modeling dependencies and can capture correlations between channels. The channel feature enhancement module employs an attention-like mechanism to generate attention weights for each channel. To obtain the sequence input, GAP (global average pooling) is used for patch encoding, and the attention weights obtained from the transformer are multiplied by the original feature map to obtain the final output features.

[0007] The disadvantages of the DFLN-ViT scheme in the prior art include: the scheme learns modality invariant features implicitly, while ignoring the direct mining and utilization of modality information, and cannot learn good, discriminative modality invariant features.

[0008] The existing loss function directly affects the extracted features, and this approach does not consider the effective mining and rational utilization of modal information, which limits the model performance.

[0009] To achieve multi-level fusion of semantic information, the DFLN-ViT model uses a multi-layer transformer structure, which makes computation more complex.

Summary of the Invention

[0010] Embodiments of the present invention provide a multimodal target perception and re-identification method for images, so as to effectively perform target perception and re-identification on cross-modal images.

[0011] To achieve the above objectives, the present invention adopts the following technical solution.

[0012] A multimodal target perception and re-identification method for images includes:

[0013] The cross-modal image data is preprocessed, segmented, and vectorized to obtain a sequence of segmented vectors. Category and location information are added to each segmented vector.

[0014] Modal information of the cross-modal image data is learned by modal embedding (ME), and the block data, location information and modal information of the cross-modal image data are fused together by embedding and superimposing to obtain serialized image data;

[0015] The serialized image data is input into the ViT model, and the ViT model outputs the feature information of the cross-modal image data;

[0016] The modality perception enhancement loss value is calculated based on the feature information of the cross-modal image data. Backpropagation is performed based on the modality perception enhancement loss value to adjust the parameters of the ViT model and obtain the trained target re-identification model.

[0017] The trained target re-identification model is used to perform cross-modal target re-identification on the pedestrian images to be identified, and the target re-identification results of the pedestrian images are output.

[0018] Preferably, the preprocessing, segmentation, and vectorization of the cross-modal image data to obtain a segmented vector sequence, and the addition of category and location information to each segmented vector, includes:

[0019] The image data of the training and test sets in the cross-modal pedestrian dataset are loaded into the graphics processor. The graphics processor performs normalization operations on the images of the training and test sets, scaling the pixel value range of the images to between 0 and 1, cropping the images according to the set size, and performing random horizontal flipping, random cropping, and random erasing data augmentation operations on the images.

[0020] The image data is grouped into batches according to the set batch size. Each batch of image data is divided into a sequence of overlapping small blocks according to the step size. The small blocks are vectorized by compression and linearly mapped using a linear transformation matrix to obtain a block vector sequence. Category information and location information are added to each block vector.

[0021] Preferably, the step of learning the modal information of the cross-modal image data through ME, and fusing the block data, location information, and modal information of the cross-modal image data together in an embedded overlay manner to obtain serialized image data, includes:

[0022] The design includes a ViT model with modal embedding (ME). The block vectors in the cross-modal image data, along with the category and location information of the block vectors, are input into the ViT model. The ViT model learns the modal information of the cross-modal image data through ME. The modal information includes visible light RGB mode or infrared mode, which is used to perceive and encode different types of information.

[0023] The image block vectors, along with their location, category, and modality information, are fused together using an embedding and overlay method to obtain serialized image data.

[0024] Preferably, the step of inputting the serialized image data into the ViT model, wherein the ViT model outputs the feature information of the cross-modal image data, includes:

[0025] The serialized image data is input into the ViT model, which extracts features from the serialized image data using a multi-layer self-attention module and outputs a feature vector for each image in the serialized image data.

[0026] Preferably, the step of calculating the modality perception enhancement loss value based on the feature information of the cross-modal image data includes:

[0027] Calculate the classification loss (ID loss) and the weighted regularized triplet loss (WRT loss) based on the feature vector of each image in the serialized image data.

[0028] The modal perception enhancement loss $L_{MAE}$ is calculated from the feature vector of each image in the serialized image data using formulas (1)-(5);

[0029]

[0030]

[0031]

[0032]

[0033] $L_{MAE} = L_{MAC} + L_{MAID}$ (5)

[0034] where the feature term represents the feature extracted from the k-th image of modality m;

[0035] the label term represents an identity label;

[0036] φ represents the mapping for mining modal embedding knowledge;

[0037] $e_m$ represents the modal embedding of modality m;

[0038] the center term represents the central feature vector of identity q, which is the average of the image features after modality removal;

[0039] the prediction term represents the predicted label;

[0040] $L_{MAID}$ is the cross-entropy between the prediction and the target, and $L_{MAC}$ is the modal perception center loss.

[0041] In a batch, K images of Q identities are selected, and a fully connected layer is selected as the mapping φ for mining modality embedding knowledge;

[0042] The classification loss, weighted regularized triplet loss, and modality perception enhancement loss are weighted and fused according to the set hyperparameter λ to obtain the final loss value.

[0043] Preferably, the step of backpropagating based on the modality-aware enhancement loss value to adjust the parameters of the ViT model to obtain a trained target re-identification model includes:

[0044] Based on the final loss value, the gradient of the objective function with respect to the current parameters of the ViT model is calculated. The first-order and second-order momentum are calculated from the historical gradients, the descent step at the current moment is calculated from them, and the parameters of the ViT model are updated accordingly. This process is repeated until the number of training rounds set by the hyperparameters is reached, at which point the iterative optimization of the ViT model parameters stops. The ViT model is then evaluated using evaluation metrics. After passing the evaluation, the trained target re-identification model is obtained.

[0045] Preferably, the step of using the trained target re-identification model to perform cross-modal target re-identification on the pedestrian image to be identified, and outputting the target re-identification result of the pedestrian image, includes:

[0046] The pedestrian image to be identified is input into the trained target re-identification model, which performs cross-modal target re-identification on it and outputs the target re-identification result. The result consists of the most likely classification for the pedestrian, given as a label together with its probability value.

[0047] As can be seen from the technical solutions provided by the embodiments of the present invention described above, the method of the embodiments of the present invention leverages the advantage of transformers in capturing global context information and enhances modal information perception using the transformer architecture; it introduces learnable modal embeddings (MEs) into the network to directly encode modal information, which can effectively alleviate the gap between heterogeneous images; and it forces MEs to capture more useful features of the modality through the MAE loss function and adjusts the distribution of the extracted embeddings.

[0048] Additional aspects and advantages of the invention will be set forth in part in the description which follows, and will become apparent from the description or may be learned by practice of the invention.

Attached Figure Description

[0049] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0050] Figure 1 is a schematic diagram of the DFLN-ViT method in the prior art;

[0051] Figure 2 is a schematic diagram illustrating the implementation principle of a multimodal target perception and re-identification method for images provided in an embodiment of the present invention;

[0052] Figure 3 is a flowchart of a multimodal target perception and re-identification method for images provided in an embodiment of the present invention;

[0053] Figure 4 is a schematic diagram of the structure of the input data of a ViT model provided in an embodiment of the present invention;

[0054] Figure 5 is a schematic diagram of a modal perception enhancement loss provided in an embodiment of the present invention.

Detailed Implementation

[0055] Embodiments of the present invention are described in detail below, examples of which are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and are only used to explain the present invention, and should not be construed as limiting the present invention.

[0056] Those skilled in the art will understand that, unless specifically stated otherwise, the singular forms “a,” “an,” “said,” and “the” used herein may also include the plural forms. It should be further understood that the term “comprising” as used in this specification means the presence of the stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It should be understood that when an element is said to be “connected” or “coupled” to another element, it can be directly connected or coupled to the other element, or there may be intermediate elements. Furthermore, “connected” or “coupled” as used herein can include wireless connections or couplings. The term “and/or” as used herein includes any and all combinations of one or more of the associated listed items.

[0057] It will be understood by those skilled in the art that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. It should also be understood that terms such as those defined in general dictionaries should be understood to have the same meaning as in the context of the prior art, and should not be interpreted in an idealized or overly formal sense unless defined as herein.

[0058] To facilitate understanding of the embodiments of the present invention, the following will provide further explanation and description with reference to the accompanying drawings and several specific embodiments. These embodiments do not constitute a limitation on the embodiments of the present invention.

[0059] This invention focuses on addressing the core challenge of cross-modal target re-identification, namely the significant modal differences between visible light and infrared images. It proposes a system and method for multimodal target perception and re-identification. To capture the inherent features of each modality, a novel Modality Embeddings (ME) is designed to directly encode modal information. To enhance the constraint on the learnable ME and optimize the distribution of extracted features, a Modality-aware Enhancement (MAE) loss is further designed. By subtracting learned modality-specific features from the ME, the ME is forced to capture more useful features for each modality, and the distribution of extracted features is adjusted, overcoming the shortcomings of existing losses. The network proposed in this invention can be jointly optimized in an end-to-end manner, generating more effective and discriminative features.

[0060] The implementation principle of the multimodal target perception and re-identification method for images provided in this embodiment of the invention is shown in Figure 2. The overall process includes image preprocessing, data serialization input, feature extraction, loss calculation, iterative model optimization, and model testing and evaluation. The first stage is model training. Pedestrian image data from the training set is input and undergoes various image preprocessing operations, including image standardization, image resizing, random horizontal flipping, random cropping, and random erasing. The image data is then passed through the forward propagation of the designed network model to obtain the image classification results. Next, the loss is calculated and used for backpropagation to update the model weights. This process is repeated until the set number of iterations is reached.

[0061] During the testing phase, image data from the test set is loaded. The neural network layer for classification is removed from the trained model to directly obtain the features of the test samples. Feature similarity is calculated and compared to complete the retrieval process. Then, evaluation metrics are calculated to determine the model's performance. If the expected requirements are not met, the model returns to the training phase for further adjustments and training. If the expected performance has been achieved, the model weights are saved, completing the entire invention process and obtaining the final solution.
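The retrieval step described above can be sketched as ranking gallery features by their similarity to a query feature. The patent text does not name the similarity measure at this point, so cosine similarity is an assumption here, and the function names and toy vectors are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def rank_gallery(query, gallery):
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)

# Toy features (assumed): one query and three gallery entries.
query = [1.0, 0.0]
gallery = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
print(rank_gallery(query, gallery))  # [1, 0, 2]
```

In a cross-modal setting the query would typically come from one modality and the gallery from the other, with both sets of features produced by the same trained model.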

[0062] Cross-modal image data: The image data exists in two different modalities: visible light (RGB) and infrared (Infrared). Each image belongs to either the visible light (RGB) modality or the infrared modality, and each image also corresponds to multiple labeled target categories. This cross-modal image data includes training images, search images, and a search library. The training images are used to train the target re-identification model's ability to extract features, while the search images and search library are used to validate the performance of the target re-identification model.

[0063] For the infrared modality, the ViT (vision transformer) model was used, with weights pre-trained on the ImageNet dataset. Visible and infrared images were resized to 3×256×128 (C×H×W); the single channel of each infrared image was repeated three times to form three channels. The input image was divided into 16×16 blocks with a stride S of 8. The AdamW optimizer was used. The base learning rate was initialized to 0.001, and the learning rate for all pre-trained layers was 0.1 times the base learning rate. Model hyperparameters included image cropping size, batch size during training, number of iterations, learning rate, input image block stride S, and the balancing coefficient λ for the modality-aware enhancement loss.
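As a quick check of these settings, the number of overlapping blocks per image follows from the image size, block size, and stride. The sketch below assumes a sliding window without padding; the function name `num_patches` is illustrative and does not come from the patent.

```python
def num_patches(height, width, patch, stride):
    """Count sliding-window patches along each axis, assuming no padding."""
    rows = (height - patch) // stride + 1
    cols = (width - patch) // stride + 1
    return rows, cols, rows * cols

rows, cols, total = num_patches(256, 128, patch=16, stride=8)
print(rows, cols, total)  # 31 15 465
```

With a stride equal to the block size (16), the same image would give only 16 × 8 = 128 non-overlapping blocks, so the stride of 8 increases the sequence length considerably.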

[0064] The specific processing flow of the multimodal target perception and re-identification method for images provided in this embodiment of the invention is shown in Figure 3 and includes the following steps:

[0065] Step S10: Image preprocessing stage

[0066] The image data (including training and test sets) of the cross-modal pedestrian dataset is loaded into the GPU (graphics processing unit) memory;

[0067] The images in the training and test sets are standardized to scale the pixel value range to between 0 and 1, cropped according to the set size, and data augmentation operations such as random horizontal flipping, random cropping, and random erasing are used as appropriate.

[0068] The data is grouped into batches according to the set batch size for use as input to the model algorithm later.
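The preprocessing operations of step S10 can be illustrated on a toy single-channel image stored as nested lists. This is a minimal sketch under stated assumptions: pixel values are 8-bit, random cropping is omitted for brevity, and the probabilities, erase size limit, and function names are hypothetical choices rather than values from the patent.

```python
import random

def scale_to_unit(img):
    """Map 8-bit pixel values (0..255) to floats in [0, 1]."""
    return [[p / 255.0 for p in row] for row in img]

def horizontal_flip(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]

def random_erase(img, rng, max_frac=0.5, fill=0.0):
    """Overwrite one randomly placed rectangle with a constant fill value."""
    h, w = len(img), len(img[0])
    eh = rng.randint(1, max(1, int(h * max_frac)))
    ew = rng.randint(1, max(1, int(w * max_frac)))
    top = rng.randint(0, h - eh)
    left = rng.randint(0, w - ew)
    out = [row[:] for row in img]
    for r in range(top, top + eh):
        for c in range(left, left + ew):
            out[r][c] = fill
    return out

def augment(img, rng, p_flip=0.5, p_erase=0.5):
    """Training-time pipeline: scale, then flip and erase at random."""
    out = scale_to_unit(img)
    if rng.random() < p_flip:
        out = horizontal_flip(out)
    if rng.random() < p_erase:
        out = random_erase(out, rng)
    return out

rng = random.Random(0)
image = [[(r * 4 + c) * 16 for c in range(4)] for r in range(4)]  # toy 4x4 image
augmented = augment(image, rng)
```

At test time only `scale_to_unit` (and resizing) would apply, since the random operations are training-only augmentations.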

[0069] Step S20: Data serialization input stage

[0070] The preprocessed image is divided into a sequence of small patches, and overlapping patches are segmented according to the stride.

[0071] The small patches are vectorized through a compression flatten operation, and then a linear transformation matrix is used to linearly map the vectorized patches to obtain a block vector sequence.

[0072] Add category and location information to each block vector.
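The first two operations of step S20 can be sketched as a sliding window followed by a matrix product. The example uses a toy single-channel 6×6 image and a hand-written 2×16 projection matrix; both are assumptions chosen to keep the numbers small, and the function names are illustrative.

```python
def extract_patches(img, patch, stride):
    """Slide a patch x patch window over a 2-D image and flatten each window."""
    h, w = len(img), len(img[0])
    patches = []
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            flat = [img[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(flat)
    return patches

def linear_map(vec, weight):
    """Project a flattened patch with a (dim_out x dim_in) weight matrix."""
    return [sum(w_i * x for w_i, x in zip(row, vec)) for row in weight]

image = [[r * 6 + c for c in range(6)] for r in range(6)]  # toy 6x6 image
patches = extract_patches(image, patch=4, stride=2)        # windows overlap by 2 pixels
weight = [[1.0 / 16] * 16, [0.0] * 15 + [1.0]]             # toy 2x16 projection
tokens = [linear_map(p, weight) for p in patches]
print(len(patches), len(patches[0]), len(tokens[0]))       # 4 16 2
```

Because the stride (2) is smaller than the patch size (4), neighbouring windows share pixels, which is what "overlapping patches" refers to in the text.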

[0073] The Transformer model is an NLP (Natural Language Processing) model that uses a self-attention mechanism to obtain global information. The ViT (Vision Transformer) model applies the Transformer model to vision tasks, specifically classification tasks.

[0074] Figure 4 is a schematic diagram of the input data structure of a ViT model provided in an embodiment of the present invention. In the ViT model of this embodiment, a modal embedding (ME) is introduced to learn the information of each modality. The different types of embeddings are fused together additively. The positional embedding differs between image blocks, while the modal embedding, which is also a learnable parameter, differs between image modalities, so that different types of information can be perceived and encoded. The modal information covers both the visible light and infrared modalities.

[0075] Modal information can be integrated into the transformer framework as naturally as positional encoding. As shown in Figure 4, the input data of the ViT model consists of three parts: image block vectors, location information, and category information. Corresponding to the visible light and infrared modes, this invention defines two learnable embeddings, one for each mode, which are used to learn the information of that mode and facilitate the subsequent learning of mode-invariant features. All image blocks of an image in a given mode share the same modal embedding. The image block vectors, location information, and modal information are fused together by additive embedding overlay to obtain the serialized image data. The location information differs between image blocks, while the modal information differs between image modes, so that different types of information can be perceived and encoded.
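The additive fusion described above can be sketched with plain lists. In this illustration the random vectors stand in for learnable parameters, the class token is assumed to receive the modal embedding like every other token, and all names (`fuse_embeddings`, `modal_embed`, and so on) are hypothetical.

```python
import random

def init_vector(dim, rng):
    """Small random values standing in for a learnable parameter."""
    return [rng.uniform(-0.02, 0.02) for _ in range(dim)]

def fuse_embeddings(patch_tokens, cls_token, pos_embed, modal_embed, modality):
    """Add position and modality embeddings to the [class] + patch sequence.

    pos_embed holds one vector per sequence position; modal_embed holds one
    vector per modality, shared by every token of an image in that modality.
    """
    sequence = [cls_token] + patch_tokens
    me = modal_embed[modality]
    return [[t + p + m for t, p, m in zip(tok, pos, me)]
            for tok, pos in zip(sequence, pos_embed)]

rng = random.Random(0)
dim, n_patches = 4, 3
cls_token = init_vector(dim, rng)
pos_embed = [init_vector(dim, rng) for _ in range(n_patches + 1)]
modal_embed = {"visible": init_vector(dim, rng), "infrared": init_vector(dim, rng)}
patch_tokens = [[0.0] * dim for _ in range(n_patches)]

seq_rgb = fuse_embeddings(patch_tokens, cls_token, pos_embed, modal_embed, "visible")
seq_ir = fuse_embeddings(patch_tokens, cls_token, pos_embed, modal_embed, "infrared")
```

For identical patch content, `seq_rgb` and `seq_ir` differ at every position by exactly the difference between the two modal embeddings, which is how the sequence carries its modality.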

[0076] Step S30: Feature Extraction Stage

[0077] In the feature extraction stage, the ViT model is used as the backbone extractor. The ViT model, through the use of multi-layer self-attention modules, can perceive global features more effectively than CNN-based methods. The serialized image data described above is input into the ViT model for feature extraction.

[0078] Corresponding to the position of the category vector, the feature vector of each image is obtained from the output data of the ViT model. These feature vectors are then passed through batch normalization layers and fully connected (FC) layers, followed by calculation of various losses such as MAE loss and classification loss. These losses collectively constrain the distribution of the extracted vectors.

[0079] Step S40: Loss Calculation Stage

[0080] Calculate the classification loss (ID loss);

[0081] The re-identification training process is viewed as an image classification problem, where each different label represents a different class. The classification loss is calculated using the labeled input image and the predicted probability of being identified as a certain class, through cross-entropy.
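The cross-entropy calculation behind the ID loss can be written out directly. This is a minimal sketch assuming the classifier outputs raw scores (logits) per identity and that the batch loss is the mean over samples; the function names are illustrative.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of the true class under a softmax over logits."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]

def id_loss(batch_logits, labels):
    """Mean cross-entropy over a batch; each identity is one class."""
    losses = [cross_entropy(z, y) for z, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)

# Two samples, three identities: the first prediction is confident and correct.
print(id_loss([[4.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0, 2]))
```

A uniform prediction over n identities costs log(n), and the loss approaches zero as the score of the correct identity dominates.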

[0082] Calculate the weighted regularization triplet (WRT) loss;

[0083] The re-identification training process is treated as a retrieval and ranking problem, where the distance between positive pairs should be smaller than that between negative pairs by a predefined margin. A triplet contains a positive sample, a negative sample, and an anchor point. The weighted regularized triplet loss uses a softmax function to assign a weight to all positive and negative samples. The weighted regularized triplet loss is shown below:

[0084]

[0085]

[0086] where (i, j, k) denotes a triplet within each training batch. For anchor i, $P_i$ is the corresponding positive set and $N_i$ is the corresponding negative set. The distance terms represent the pairwise distances of the positive and negative sample pairs.
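The formulas for this loss are not reproduced in the text, so the sketch below follows the commonly published form of the weighted regularized triplet loss: positive distances are weighted by a softmax over the distances, negative distances by a softmax over the negated distances, and a softplus is applied to the difference in place of a fixed margin. Euclidean distance and the function names are assumptions.

```python
import math

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def wrt_loss(features, labels):
    """Weighted regularized triplet loss averaged over all valid anchors."""
    losses = []
    for i, (f_i, y_i) in enumerate(zip(features, labels)):
        pos = [euclidean(f_i, f)
               for j, (f, y) in enumerate(zip(features, labels))
               if y == y_i and j != i]
        neg = [euclidean(f_i, f) for f, y in zip(features, labels) if y != y_i]
        if not pos or not neg:
            continue  # anchor has no valid triplet in this batch
        # Far positives and near negatives receive the largest weights.
        w_pos = softmax(pos)
        w_neg = softmax([-d for d in neg])
        gap = (sum(w * d for w, d in zip(w_pos, pos))
               - sum(w * d for w, d in zip(w_neg, neg)))
        losses.append(math.log1p(math.exp(gap)))  # softplus of the distance gap
    return sum(losses) / len(losses) if losses else 0.0

features = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]]
well_separated = wrt_loss(features, [0, 0, 1, 1])
mixed_up = wrt_loss(features, [0, 1, 0, 1])
```

With the toy features, labels that match the two clusters give a loss near zero, while labels that cut across the clusters give a large loss.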

[0087] The modal sensing enhancement loss (MAE loss) is calculated according to formulas (1)-(5);

[0088] The total loss is calculated, and the three losses are weighted and fused using the set hyperparameter λ to obtain the final loss value.
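The text does not spell out how λ enters the total. One reading, consistent with the description of λ as the balancing coefficient for the modality-aware enhancement loss, is that λ scales only that term; the sketch below makes that assumption explicit, and the function name is hypothetical.

```python
def total_loss(id_loss_value, wrt_loss_value, mae_loss_value, lam):
    """Combine the three losses; lam scales only the MAE term (an assumption)."""
    return id_loss_value + wrt_loss_value + lam * mae_loss_value

print(total_loss(1.0, 2.0, 4.0, lam=0.5))  # 5.0
```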

[0089] Step S50: Model Iterative Optimization Stage

[0090] The code implementation is based on the PyTorch deep learning framework, which can perform backpropagation from the final calculated loss value and automatically calculate the gradient values of the parameters in the target re-identification model.

[0091] Using the gradients calculated in the previous steps, the parameter values in the target re-identification model are updated using an optimizer (such as PyTorch's Adam optimizer).

[0092] During iterative optimization, the gradient of the objective function with respect to the current parameters is calculated, the first-order and second-order momentum are calculated from the historical gradients, the descent step at the current moment is calculated from them, and the parameters are updated according to that step.
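The update described here matches the standard Adam rule, which can be written out for a list of scalar parameters. This is a sketch of plain Adam with its usual default coefficients; it omits the decoupled weight decay that distinguishes AdamW, and the toy objective is an assumption for illustration.

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are the running first and second moments."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g      # first-order momentum
        v_i = beta2 * v_i + (1 - beta2) * g * g  # second-order momentum
        m_hat = m_i / (1 - beta1 ** t)           # bias correction for early steps
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# Minimise f(x) = (x - 3)^2 from x = 0 as a toy objective.
x, m, v = [0.0], [0.0], [0.0]
for t in range(1, 2001):
    grad = [2 * (x[0] - 3.0)]
    x, m, v = adam_step(x, grad, m, v, t, lr=0.01)
```

In a PyTorch training loop the same steps are performed by the optimizer after backpropagation has filled in the gradients.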

[0093] Repeat all the above steps before the target re-identification model reaches the number of training rounds set by the hyperparameters. After reaching the number of training rounds, stop the training process of the parameters in the target re-identification model to obtain a trained target re-identification model that meets the performance evaluation criteria.

[0094] Step S60: Test and Evaluation Phase

[0095] Read pedestrian images from the test set, load them into GPU memory, and perform the same standardization operations as in the training phase (note that data augmentation operations such as random horizontal flipping are not required during testing).

[0096] We used CMC (Cumulative Matching Characteristics) and mAP (Mean Average Precision), commonly used evaluation metrics for pedestrian re-identification, to conduct a preliminary evaluation of the merits of the target re-identification model by assessing the calculated metric values.
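Both metrics can be computed from, for each query, the ranked gallery list marked as correct or incorrect. The sketch below assumes that marking has already been done; the function names and the two toy queries are illustrative.

```python
def average_precision(ranked_matches):
    """AP for one query; ranked_matches[i] is True if the i-th result is correct."""
    hits, precision_sum = 0, 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def cmc_at_k(all_ranked_matches, k):
    """Fraction of queries with at least one correct result in the top k."""
    found = sum(1 for matches in all_ranked_matches if any(matches[:k]))
    return found / len(all_ranked_matches)

def mean_average_precision(all_ranked_matches):
    aps = [average_precision(matches) for matches in all_ranked_matches]
    return sum(aps) / len(aps)

queries = [
    [True, False, True, False],   # correct results at ranks 1 and 3
    [False, True, False, False],  # first correct result at rank 2
]
print(cmc_at_k(queries, 1), cmc_at_k(queries, 2))  # 0.5 1.0
print(mean_average_precision(queries))
```

CMC only asks whether a correct match appears within the top k, whereas mAP rewards ranking all correct matches early, so the two can disagree about which model is better.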

[0097] If the evaluation results do not meet the requirements, the parameters of the target re-identification model need to be adjusted, and the process should return to the first step of the execution steps to retrain the target re-identification model. If the evaluation results meet the requirements, the parameters of the target re-identification model can be saved, resulting in a well-trained target re-identification model. The trained target re-identification model described above can serve as a solution for visible light and infrared cross-modal pedestrian re-identification tasks.

[0098] Then, the trained target re-identification model can be used to perform cross-modal target re-identification on the pedestrian image to be identified and to output the target re-identification result of the pedestrian image. The target re-identification result includes the most likely classification result for the identified pedestrian, together with its label and probability value.
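The text of [0098] does not state how the probability value is obtained. One common choice, shown here purely as an assumption, is to apply a softmax to the classification scores and report the label with the highest probability. The function name `top1_prediction` and the example labels are hypothetical.

```python
import math


def top1_prediction(logits, labels):
    """Return (label, probability) of the most likely class.

    Assumes softmax over raw classification scores; the maximum is
    subtracted before exponentiation for numerical stability.
    """
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical scores for three candidate identities.
example_label, example_prob = top1_prediction([2.0, 0.5, -1.0], ["id_7", "id_3", "id_9"])
```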

[0099] Figure 5 illustrates the modality-aware enhancement loss provided in this embodiment of the invention. The modality-aware enhancement loss comprises two parts, the modality-aware center loss (Formula 1) and the modality-aware ID loss (Formula 3), and aims to narrow intra-class distances and widen inter-class distances. The modality-aware center loss focuses on reducing the gap between different modalities under the same identity and uses the knowledge learned by the ME to reduce intra-class feature distances. First, the center feature vectors of all identities in a batch are calculated (Formula 2). Here, modality-specific information is removed so that only modality-invariant features remain, and the center feature vector is the average value of the image features after modality removal. The cosine distance D is used to calculate the distance between the extracted image features and their center feature vectors. Under the constraint of the modality-aware center loss, more compact cross-modal features can be extracted for each identity. The modality-aware ID loss (Formula 3) aims to learn discriminative features between different identities, widening the distance between image features of different identities based on the information learned by the ME. It also involves modality removal, and classifies input images of different identities by calculating a cross-entropy loss. Under the constraint of the modality-aware ID loss, the features extracted by the ViT model are more discriminative and can achieve more accurate matching.

[0100]

[0101]

[0102]

[0103]

[0104] $L_{MAE} = L_{MAC} + L_{MAID}$ (5)

[0105] This represents the feature extracted from the k-th image of the m-modality.

[0106] It represents an identity label.

[0107] φ represents the mapping for mining modal embedding knowledge.

[0108] $e_m$ denotes the modal embedding.

[0109] The center feature vector of identity q is the average value of the image features after modality removal.

[0110] Labels indicating predictions.

[0111] $L_{MAID}$ is the cross-entropy between the prediction and the target. $L_{MAC}$ denotes the modality-aware center loss.

[0112] K images with Q identities are selected in a batch, and a fully connected layer is chosen as the mapping φ for mining modality embedding knowledge. Through the modality removal process, the modality-aware enhancement loss can force the ME to mine more useful modality-specific features. The ME-based loss function can adjust the distribution of feature vectors to generate more discriminative features for image retrieval, without being affected by large modality differences.
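Formulas (1) to (4) are not reproduced in this text, so the following plain-Python sketch is only one possible reading of the description in [0099] and [0112], not the formulas of the invention. It assumes that modality removal subtracts φ(e_m) from each feature, that both loss terms operate on the modality-removed features and are averaged over the batch, and that a linear classifier produces the identity scores. All function names, the classifier, and the toy values are hypothetical; only the sum in Formula (5) is taken directly from the text.

```python
import math


def linear(weight, bias, x):
    """Fully connected layer: weight is a list of rows, bias a list."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + 1e-12)


def cross_entropy(logits, target):
    shift = max(logits)
    log_sum = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum - logits[target]


def mae_loss(feats, ids, mods, mod_embed, phi, classifier):
    """Return (l_mac, l_maid, l_mae); ids must be integers 0..Q-1."""
    phi_w, phi_b = phi
    cls_w, cls_b = classifier
    # Modality removal (assumed form): f - phi(e_m).
    removed = []
    for f, m in zip(feats, mods):
        modal_part = linear(phi_w, phi_b, mod_embed[m])
        removed.append([fi - si for fi, si in zip(f, modal_part)])
    # Center of each identity: mean of its modality-removed features.
    centers = {}
    for q in set(ids):
        members = [r for r, y in zip(removed, ids) if y == q]
        centers[q] = [sum(col) / len(members) for col in zip(*members)]
    n = len(ids)
    l_mac = sum(cosine_distance(r, centers[y]) for r, y in zip(removed, ids)) / n
    l_maid = sum(cross_entropy(linear(cls_w, cls_b, r), y) for r, y in zip(removed, ids)) / n
    return l_mac, l_maid, l_mac + l_maid  # Formula (5): L_MAE = L_MAC + L_MAID


# Toy batch (assumption): 2 identities, each seen in modality 0 and modality 1.
identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])  # identity weights, zero bias
toy_feats = [[3.0, 0.0], [2.0, 1.0], [1.0, 2.0], [0.0, 3.0]]
toy_ids = [0, 0, 1, 1]
toy_mods = [0, 1, 0, 1]
toy_embed = {0: [1.0, 0.0], 1: [0.0, 1.0]}
l_mac, l_maid, l_mae = mae_loss(toy_feats, toy_ids, toy_mods, toy_embed, identity, identity)
```

In the toy batch the two images of each identity differ only by their modal component, so after modality removal they coincide and the center term is close to zero, which is the behaviour the center loss is described as encouraging.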

[0113] In summary, the method of this invention utilizes modal coding to better perceive modal information and learn better modality-invariant feature representations.

[0114] The modality-aware enhancement loss function better adjusts the feature distribution, yielding compact intra-class distances and larger inter-class distances, and thereby enhances the learning ability of the modality embedding.

[0115] The method of this invention leverages the advantage of transformers in capturing global contextual information and enhances modal information perception using the transformer architecture; it introduces learnable modal embeddings (MEs) into the network, which directly encode modal information and can be effectively used to alleviate the gap between heterogeneous images; and it designs a new MAE loss function that forces the ME to capture more useful features of the modality and adjusts the distribution of the extracted embeddings.

[0116] Those skilled in the art will understand that the accompanying drawings are merely schematic diagrams of one embodiment, and the modules or processes shown in the drawings are not necessarily essential for implementing the present invention.

[0117] As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by means of software plus necessary general-purpose hardware platforms. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in various embodiments or some parts of the embodiments of the present invention.

[0118] The various embodiments in this specification are described in a progressive manner. Similar or identical parts between embodiments can be referred to mutually. Each embodiment focuses on describing the differences from other embodiments. In particular, for apparatus or system embodiments, since they are basically similar to method embodiments, the description is relatively simple; relevant parts can be referred to the descriptions in the method embodiments. The apparatus and system embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without creative effort.

[0119] The above description is merely a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

Claims

1. A multimodal target perception and re-identification method for images, characterized in that it comprises: preprocessing, segmenting, and vectorizing cross-modal image data to obtain a sequence of segmented vectors, and adding category and location information to each segmented vector; learning the modal information of the cross-modal image data by modal embedding (ME), and fusing the block data, location information, and modal information of the cross-modal image data together by embedding and superimposing to obtain serialized image data; inputting the serialized image data into a ViT model, the ViT model outputting the feature information of the cross-modal image data; calculating a modality-aware enhancement loss value based on the feature information of the cross-modal image data, and performing backpropagation based on the modality-aware enhancement loss value to adjust the parameters of the ViT model and obtain a trained target re-identification model; and using the trained target re-identification model to perform cross-modal target re-identification on the pedestrian images to be identified, and outputting the target re-identification results of the pedestrian images. The calculation of the modality-aware enhancement loss value based on the feature information of the cross-modal image data includes: calculating the classification loss (ID loss) and the weighted regularized triplet loss (WRT loss) based on the feature vector of each image in the serialized image data; and calculating the modality-aware enhancement loss according to the feature vector of each image in the serialized image data using formulas (1)-(5), in which the symbols denote, in order: the feature extracted from the k-th image of modality m; the identity label; the mapping φ for mining modal embedding knowledge; the modal embedding $e_m$; the center feature vector of identity q, which is the average value of the image features after modality removal; the predicted label; $L_{MAID}$, the cross-entropy between the prediction and the target; and $L_{MAC}$, the modality-aware center loss; K images of Q identities are selected in a batch, and a fully connected layer is chosen as the mapping for mining modality embedding knowledge. According to the set hyperparameters, the classification loss, the weighted regularized triplet loss, and the modality-aware enhancement loss are weighted and fused to obtain the final loss value. The step of performing backpropagation based on the modality-aware enhancement loss value to adjust the parameters of the ViT model and obtain a trained target re-identification model includes: based on the final loss value, calculating the gradient of the objective function with respect to the current parameters of the ViT model; calculating the first-order and second-order momentum based on the historical gradients; calculating the descent gradient at the current moment and updating the parameters of the ViT model based on the descent gradient; repeating the above process until the number of training rounds set by the hyperparameters for the iterative optimization process is reached, and then stopping the iterative optimization training of the ViT model parameters; and evaluating the ViT model using evaluation metrics, the trained target re-identification model being obtained after the evaluation is passed.

2. The method according to claim 1, characterized in that the process of preprocessing, segmenting, and vectorizing cross-modal image data to obtain a sequence of segmented vectors, and adding category and location information to each segmented vector, includes: The image data of the training and test sets in the cross-modal pedestrian dataset are loaded into the graphics processor. The graphics processor performs normalization operations on the images of the training and test sets, scaling the pixel value range of the images to between 0 and 1, cropping the images according to the set size, and performing random horizontal flipping, random cropping, and random erasing data augmentation operations on the images. The image data is grouped into batches according to the set batch size. Each batch of image data is divided into a sequence of overlapping small blocks according to the step size. The small blocks are vectorized by compression and linearly mapped using a linear transformation matrix to obtain a block vector sequence. Category information and location information are added to each block vector.

3. The method according to claim 2, characterized in that the step of learning the modal information of the cross-modal image data through ME, and fusing the block data, location information, and modal information of the cross-modal image data together by embedding and superimposing to obtain serialized image data, includes: A ViT model with modal embedding (ME) is designed. The block vectors of the cross-modal image data, along with the category and location information of the block vectors, are input into the ViT model. The ViT model learns the modal information of the cross-modal image data through ME. The modal information indicates the visible-light (RGB) modality or the infrared modality and is used to perceive and encode the different types of information. The image block vectors, along with their location, category, and modality information, are fused together by embedding and superimposing to obtain serialized image data.

4. The method according to claim 3, characterized in that the step of inputting the serialized image data into the ViT model, and the ViT model outputting the feature information of the cross-modal image data, includes: The serialized image data is input into the ViT model, which extracts features from the serialized image data using a multi-layer self-attention module and outputs a feature vector for each image in the serialized image data.

5. The method according to claim 1, characterized in that the step of using the trained target re-identification model to perform cross-modal target re-identification on pedestrian images to be identified, and outputting the target re-identification results of the pedestrian images, includes: The pedestrian image to be identified is input into the trained target re-identification model. The trained target re-identification model performs cross-modal target re-identification on the pedestrian image to be identified and outputs the target re-identification result of the pedestrian image. The target re-identification result includes the most likely classification result for the identified pedestrian, together with the label and probability value of that classification result.