Cross-modal pedestrian re-identification method and device, electronic equipment and storage medium
By splitting the batch normalization layers of a convolutional neural network and designing a total loss function to optimize the architecture parameters, the method addresses the time consumption and overlooked network structures of existing cross-modal pedestrian re-identification, achieving efficient use of cross-modal data and accurate pedestrian re-identification.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- JINGDONG TECH HLDG CO LTD
- Filing Date
- 2021-05-10
- Publication Date
- 2026-06-16
AI Technical Summary
Existing technologies are time-consuming in cross-modal pedestrian re-identification and may ignore the optimal network structure, making it difficult to effectively utilize cross-modal data captured by visible light and infrared cameras.
By splitting the batch normalization layer of a convolutional neural network into two branches to process visible light and infrared image data respectively, and designing a total loss function to optimize architecture parameters, the optimal network structure is automatically searched. Combined with feature matching and similarity comparison, pedestrian re-identification across modal data is achieved.
It improves the accuracy of cross-modal pedestrian re-identification, optimizes the network structure search process, reduces time consumption, and makes full use of visible light and infrared information for cross-modal pedestrian re-identification.
Figure CN115346231B_ABST
Description
Technical Field
[0001] This invention relates to the field of data processing technology, and in particular to a cross-modal pedestrian re-identification method, apparatus, electronic device, and non-transitory computer-readable storage medium.
Background Technology
[0002] Currently, pedestrian re-identification technology is receiving increasing attention due to its broad potential applications in the security field. Pedestrian re-identification technology refers to matching information about the same person from multiple discontinuous surveillance camera feeds.
[0003] With the development of deep learning, pedestrian re-identification technology has made great progress. However, some problems and challenges still exist in this field, such as the difficulty of visible light cameras working well in nighttime scenes. To overcome the shortcomings of visible light cameras in handling nighttime scenes, researchers have adopted a combination of visible light and infrared cameras for nighttime photography. The data types captured by visible light and infrared cameras are inconsistent; the data collected by both cameras constitutes cross-modal data.
[0004] Current mainstream methods rely on manually designed neural networks to learn from both visible light and infrared data. Because they do not use architecture search to learn the optimal structure automatically, they depend heavily on manual work and experience, are excessively time-consuming, and may overlook potentially optimal network structures. Furthermore, existing architecture search methods primarily address single-modal data and are not well suited to cross-modal search tasks.
Summary of the Invention
[0005] This invention provides a cross-modal pedestrian re-identification method, apparatus, electronic device, and non-transitory computer-readable storage medium to address the shortcomings of existing cross-modal pedestrian re-identification technologies, such as long processing time and potential neglect of optimal network structure, thereby optimizing the cross-modal pedestrian re-identification function.
[0006] This invention provides a cross-modal pedestrian re-identification method, comprising: acquiring cross-modal data to be detected, the cross-modal data to be detected including visible light image-infrared image pairs; inputting the cross-modal data to be detected into a preset pedestrian re-identification model to obtain feature data of the cross-modal data to be detected, so as to obtain a pedestrian re-identification result based on the feature data; wherein, the pedestrian re-identification model is trained based on a sample dataset, and the pedestrian re-identification model is used to extract feature data from the cross-modal data to be detected.
[0007] According to a cross-modal pedestrian re-identification method provided by the present invention, the pedestrian re-identification method further includes: dividing a sample dataset into three parts to form a first training set, a validation set, and a test set, wherein the data in the sample dataset includes visible light image-infrared image pairs; training the model parameters of an initial model using a total loss function and the first training set, and updating the architecture parameters using the total loss function and the validation set until the initial model converges, to obtain the architecture parameters at the convergence of the initial model, wherein each batch normalization (BN) layer of the initial model includes a first branch and a second branch, the first branch including two identical sub-BN layers used to process visible light images and infrared images respectively, the second branch including one sub-BN layer, and the first branch and the second branch each being configured with a corresponding architecture parameter; selecting and retaining one of the first branch and the second branch according to the architecture parameters to form an intermediate model; and training the model parameters of the intermediate model using a second training set and testing the intermediate model using the test set to obtain the pedestrian re-identification model, wherein the second training set consists of the first training set and the validation set.
[0008] According to a cross-modal person re-identification method provided by the present invention, the step of training the model parameters of an initial model using a total loss function and a first training set includes: weighting and summing the outputs of the first branch and the second branch according to the architecture parameters to obtain the output of the BN layer; wherein the output of the first branch is the concatenation of the outputs of the two sub-BN layers of the first branch, and the output of the second branch is the output of one sub-BN layer of the second branch; obtaining the forward propagation output of the initial model based on the initial model and the output of the BN layer; obtaining a total loss value based on the total loss function and the forward propagation output of the initial model; and updating the model parameters of the initial model through backpropagation based on the total loss value until the total loss value is less than a set threshold.
[0009] According to a cross-modal pedestrian re-identification method provided by the present invention, the step of selecting one of the first branch and the second branch to retain based on the architecture parameters includes: when the initial model converges, if the architecture parameters of the first branch are greater than the architecture parameters of the second branch, then the second branch of the initial model is removed to form the intermediate model; when the initial model converges, if the architecture parameters of the second branch are greater than the architecture parameters of the first branch, then the first branch of the initial model is removed to form the intermediate model.
[0010] According to a cross-modal pedestrian re-identification method provided by the present invention, the step of training the model parameters of an initial model using a total loss function and a first training set includes: calculating a base loss function, a loss function representing the differences in data distribution across different modalities, and a loss function representing the correlation between data across different modalities, and summing them to obtain a total loss value; and adjusting the model parameters of the initial model according to the total loss value until the total loss value is less than a set threshold. According to the cross-modal pedestrian re-identification method provided by the present invention, the calculation formulas of the total loss function include:
[0011]
[0012] where the total loss function comprises a base loss function and a special loss function,
[0013]
[0014] where λ1 and λ2 are adjustment parameters, and one part of the special loss function characterizes the differences in data distribution across different modalities,
[0015]
[0016] where $f_{c,vis}$ and $f_{c,ir}$ are the features obtained by forward propagation through the hypernetwork of the visible light image and the infrared image numbered c in the sample dataset, respectively; $m_c$ and $n_c$ are the numbers of $f_{c,vis}$ and $f_{c,ir}$, with $m_c = n_c$; and $\langle\cdot,\cdot\rangle$ denotes the sum of the dot products of vectors;
[0017] and the other part of the special loss function characterizes the correlation between data in different modalities,
[0018]
[0019]
[0020]
[0021]
[0022]
[0023] According to a cross-modal pedestrian re-identification method provided by the present invention, after obtaining the feature data of the cross-modal data to be detected, the pedestrian re-identification method further includes: matching the feature data with the identity information of persons in the database to obtain the identity information of the person corresponding to the feature data; and / or comparing the similarity of the first feature data corresponding to the visible light image and the second feature data corresponding to the infrared image to determine whether the first feature data and the second feature data correspond to the same person.
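The two uses of the feature data described above (matching against a database, and comparing a visible light feature with an infrared feature) can be sketched as follows. This is a minimal illustration, not the patented implementation: cosine similarity, the `threshold` value, and the names `match_identity` and `same_person` are all assumptions introduced here, since the text does not name a similarity measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def match_identity(query, gallery):
    """Return the gallery identity whose stored feature is most similar to the query.

    `gallery` is a hypothetical dict mapping an identity label to a feature vector.
    """
    return max(gallery, key=lambda pid: cosine_similarity(query, gallery[pid]))

def same_person(feat_vis, feat_ir, threshold=0.8):
    """Decide whether a visible light feature and an infrared feature belong
    to the same person. The threshold is an assumed value, not from the text."""
    return cosine_similarity(feat_vis, feat_ir) >= threshold
```

In practice the threshold would be tuned on held-out data; the value above is only a placeholder.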
[0024] The present invention also provides a cross-modal pedestrian re-identification device, comprising: an acquisition unit for acquiring cross-modal data to be detected, the cross-modal data to be detected including a visible light image-infrared image pair; and an identification unit for inputting the cross-modal data to be detected into a preset pedestrian re-identification model to obtain feature data of the cross-modal data to be detected, so as to obtain a pedestrian re-identification result based on the feature data; wherein the pedestrian re-identification model is trained based on a sample dataset, and the pedestrian re-identification model is used to extract feature data from the cross-modal data to be detected.
[0025] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of any of the cross-modal pedestrian re-identification methods described above.
[0026] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the cross-modal pedestrian re-identification method as described above.
[0027] The present invention provides a cross-modal pedestrian re-identification method, apparatus, electronic device and non-transitory computer-readable storage medium. By using a pedestrian re-identification model to extract features from cross-modal data and performing pedestrian re-identification based on the extracted feature data, it can make full use of visible light and infrared information for cross-modal pedestrian re-identification.
Description of the Drawings
[0028] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0029] Figure 1 is a flowchart illustrating the cross-modal pedestrian re-identification method provided by the present invention;
[0030] Figure 2 is a schematic diagram of the BN layer splitting structure provided by the present invention;
[0031] Figure 3A is the first flowchart illustrating the process of training the pedestrian re-identification model provided by the present invention;
[0032] Figure 3B is the second flowchart illustrating the process of training the pedestrian re-identification model provided by the present invention;
[0033] Figure 4 is a schematic diagram illustrating the classification accuracy of the pedestrian re-identification model provided by this invention on the SYSU-MM01 and RegDB datasets;
[0034] Figure 5 is a schematic diagram of the cross-modal pedestrian re-identification device provided by the present invention;
[0035] Figure 6 is a schematic diagram of the structure of the electronic device provided by the present invention.
Detailed Implementation
[0036] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.
[0037] The terminology used in one or more embodiments of the present invention is for the purpose of describing particular embodiments only and is not intended to limit the scope of the invention. The singular forms “a,” “an,” and “the” used in one or more embodiments of the invention and in the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and / or” used in one or more embodiments of the invention refers to and includes any or all possible combinations of one or more associated listed items.
[0038] It should be understood that although the terms first, second, etc., may be used to describe various information in one or more embodiments of the present invention, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, first may also be referred to as second without departing from the scope of one or more embodiments of the present invention, and similarly, second may also be referred to as first. Depending on the context, the word "if" as used herein may be interpreted as "when," "upon," or "in response to a determination."
[0039] Data comes from a wide range of sources and takes many forms across different industries. Each source or form of data can be considered a modality, such as video, images, voice, and sensor data like infrared and acoustic spectra in industrial scenarios.
[0040] In related technologies, when using visible light images and infrared images for pedestrian re-identification, the traditional method of manually designing neural networks to achieve feature matching tasks has poor performance.
[0041] To address this problem, embodiments of the present invention provide a cross-modal pedestrian re-identification scheme. Exemplary embodiments of the present invention are described in detail below with reference to Figures 1 to 6.
[0042] Figure 1 shows a flowchart of a cross-modal pedestrian re-identification method according to an embodiment of the present invention. The method provided by this embodiment can be executed by any electronic device with computer processing capabilities, such as a terminal device and / or a server. As shown in Figure 1, this cross-modal pedestrian re-identification method includes:
[0043] Step 102: Obtain the cross-modal data to be detected, which includes visible light image-infrared image pairs.
[0044] Specifically, the cross-modal data to be detected includes visible light images and infrared images. The two types of data in this cross-modal data are inconsistent, therefore they represent data from different modalities.
[0045] Step 104: Input the cross-modal data to be detected into a preset pedestrian re-identification model to obtain the feature data of the cross-modal data to be detected, and then obtain the pedestrian re-identification result based on the feature data; wherein the pedestrian re-identification model is trained based on the sample dataset and is used to extract feature data from the cross-modal data to be detected. Specifically, the pedestrian re-identification model is a convolutional neural network model. A convolutional neural network (CNN) is a type of feedforward neural network that includes convolutional computation and has a deep structure. As shown in Figure 2, an existing convolutional neural network structure can include convolutional layers 201, batch normalization (BN) layers 202, and activation function layers 203. Feature data refers to image feature data obtained by extracting features from visible light and infrared images.
[0046] The technical solution of this invention, by training a cross-modal pedestrian re-identification model, enables the cross-modal data to be detected to be re-identified using the pedestrian re-identification model.
[0047] Before step 104, the model must be trained on the sample dataset to obtain the pedestrian re-identification model. As shown in Figure 3A, the steps for training the pedestrian re-identification model include:
[0048] Step 1031: Divide the sample dataset into three parts to form the first training set, validation set and test set. The data in the sample dataset includes visible light images and infrared images.
[0049] Specifically, the sample dataset includes visible light images and infrared images, with a one-to-one correspondence between the two. The two types of data in this sample dataset are inconsistent, representing different modalities. The first training set is a sample dataset used for fitting the convolutional neural network model; the validation set is a sample dataset used to adjust the hyperparameters of the convolutional neural network model and to conduct a preliminary evaluation of its capabilities; and the test set is a sample dataset used to evaluate the generalization ability of the final neural network model.
[0050] Step 1032: Train the model parameters of the initial model using the total loss function and the first training set, and update the architecture parameters using the total loss function and the validation set until the initial model converges, obtaining the architecture parameters when the initial model converges. Each batch normalization (BN) layer of the initial model includes a first branch and a second branch. The first branch includes two identical sub-BN layers, which are used to process visible light images and infrared images, respectively. The second branch includes one sub-BN layer. The first branch and the second branch are configured with their respective architecture parameters.
[0051] Specifically, the initial model is a custom hypernetwork structure model. The initial model has multiple batch normalization (BN) layers, each BN layer including a first branch and a second branch. The first branch includes two identical sub-BN layers, and the second branch includes one sub-BN layer. Input data is processed through both branches, and the results are weighted and summed according to the architecture parameters before being output. Architecture parameters are a set of optimizable parameters; each parameter is assigned to a branch, representing the probability of that branch being retained. Model parameters are the parameters of the convolutional, pooling, and linear operation layers in the hypernetwork structure. The total loss function is a loss function used to evaluate the degree to which the predicted values of the convolutional neural network model differ from the true values. Model convergence refers to the error falling below a pre-set small threshold, or the weight change between two iterations being minimal, or the maximum number of iterations being reached. Training is the process of finding the parameters of the convolutional neural network model based on known data, while validation involves adjusting the parameters of the convolutional neural network model and conducting a preliminary assessment of its capabilities.
[0052] Starting from manually designed neural network structures, it was determined that appropriately splitting BN layers can bring better cross-modal matching performance. The method of this embodiment can therefore be called cross-modal neural network architecture search; its purpose is to find, for each BN layer, the optimal decision on whether that layer should be split.
[0053] This network structure can simultaneously learn different modal data captured by visible light cameras and infrared cameras in pedestrian re-identification tasks, so as to make full use of visible light and infrared information for cross-modal data processing, and has superior performance in pedestrian re-identification tasks.
[0054] As shown in Figure 2, in this embodiment of the invention, the original BN layer 202 is split into a left branch and a right branch, namely the first branch o1 and the second branch o2, respectively. The left branch and the right branch each have their own architecture parameter.
[0055] The data composed of visible light image and infrared image pairs is input into the BN layer. In the second branch, sub-BN layer 2023 processes both visible light and infrared image data simultaneously. However, during forward propagation in the first branch, the visible light image and infrared image pairs are separated. The visible light image data passes through one sub-BN layer 2021 in the first branch, and the infrared image data passes through another sub-BN layer 2022. The calculation results are then concatenated as the output of the first branch o1. Here, the names of the three sub-BN layers are merely for differentiation from BN layer 202; the functions of the three sub-BN layers are the same as those of BN layer 202.
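The difference between the two branches can be illustrated with a small NumPy sketch. It is a simplification under stated assumptions: features are 2-D arrays of shape (batch, channels) rather than image feature maps, the learnable scale and shift of a real BN layer are omitted, and the function names are hypothetical.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature column of x with its batch mean and variance.
    Learnable scale/shift are omitted to keep the sketch short."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def first_branch(x_vis, x_ir):
    """Branch o1: one sub-BN per modality, then concatenate the two outputs."""
    return np.concatenate([batch_norm(x_vis), batch_norm(x_ir)], axis=0)

def second_branch(x_vis, x_ir):
    """Branch o2: a single sub-BN applied to both modalities together."""
    return batch_norm(np.concatenate([x_vis, x_ir], axis=0))
```

With the first branch each modality is normalized by its own statistics, so a systematic offset between visible light and infrared features is removed; with the second branch the offset survives normalization because the statistics are shared.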
[0056] Step 1033: Select one of the first branch and the second branch to retain based on the architecture parameters to form an intermediate model.
[0057] Specifically, retaining one of the two branches means removing one of the two branches and keeping the other to form an intermediate model.
Step 1034: Train the model parameters of the intermediate model using the second training set and test the intermediate model using the test set to obtain the pedestrian re-identification model. The second training set consists of the first training set and the validation set.
[0058] Specifically, the second training set consists of all the data from the first training set and the validation set, where the ratio of the number of samples in the first training set to the number in the validation set can be 8:2. Testing evaluates the generalization ability of the final neural network model.
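The data split described in steps 1031 and 1034 can be sketched as below. The 8:2 ratio between the first training set and the validation set follows the example in the text; the `test_fraction` value, the random shuffle, and the function name `split_dataset` are assumptions made only for illustration.

```python
import random

def split_dataset(pairs, test_fraction=0.2, train_to_val=(8, 2), seed=0):
    """Split image pairs into first training set, validation set and test set.

    test_fraction is an assumed value; train_to_val follows the 8:2 example.
    """
    items = list(pairs)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_fraction)
    test_set, rest = items[:n_test], items[n_test:]
    n_train = len(rest) * train_to_val[0] // sum(train_to_val)
    first_train, val_set = rest[:n_train], rest[n_train:]
    # The second training set is used to retrain the intermediate model.
    second_train = first_train + val_set
    return first_train, val_set, test_set, second_train
```

For 100 pairs with the defaults above, this yields 64 training pairs, 16 validation pairs, 20 test pairs, and a second training set of 80 pairs.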
[0059] In the technical solution of this invention embodiment, cross-modal feature matching task is achieved by splitting the BN layer, and a targeted architecture parameter search method is designed in combination with the loss function to find the optimal network structure in the designed search space, thus realizing a solution for finding the optimal splitting of the BN layer of the convolutional neural network through architecture search.
[0060] When building the initial model, each batch normalization (BN) layer in the convolutional neural network is split into two branches, o1 and o2, where o1 includes two identical sub-BN layers and o2 includes one sub-BN layer. Each branch is assigned an architecture parameter. Let $x_l$ be the data input to this BN layer; it is propagated forward through o1 and o2 separately, and the output of this BN layer is a weighted summation of the two branch outputs based on the two architecture parameters.
[0061] As shown in Figure 2, the output of BN layer 202 is obtained by a weighted summation of the output of the first branch and the output of the second branch according to their corresponding weights.
[0062] Specifically, the output $x_{l+1}$ of the BN layer is:
[0063]
[0064]
[0065] where i indexes the branches and their architecture parameters, i being a natural number that here is 1 or 2; $x_l$ is the input of the BN layer; $o_1(x_l)$ represents the output of the first branch; and $o_2(x_l)$ represents the output of the second branch. The weights used in the weighted sum of the output of the first branch and the output of the second branch are calculated from the architecture parameters of the two branches.
[0066] Thus, in step 1032, when training the model parameters of the initial model using the total loss function and the first training set, the outputs of the first branch and the second branch can be weighted and summed according to the architecture parameters to obtain the output of the BN layer. Then, based on the initial model and the output of the BN layer, the forward propagation output of the initial model is obtained, and based on the total loss function and the forward propagation output of the initial model, the total loss value is obtained. Finally, based on the total loss value, the model parameters of the initial model are updated through backpropagation until the total loss value is less than the set threshold.
[0067] The output of the first branch is the concatenation of the outputs of the two sub-BN layers of the first branch, and the output of the second branch is the output of one sub-BN layer of the second branch.
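The weighted summation of the two branch outputs can be sketched as follows. The text states only that the weights are calculated from the architecture parameters; normalizing them with a softmax, as is common in differentiable architecture search, is an assumption here, and the function names are hypothetical.

```python
import numpy as np

def branch_weights(arch_params):
    """Turn architecture parameters into weights that sum to 1.
    Softmax normalization is an assumption, not stated in the text."""
    a = np.asarray(arch_params, dtype=float)
    e = np.exp(a - a.max())  # subtract the max for numerical stability
    return e / e.sum()

def mixed_bn_output(out_branch1, out_branch2, arch_params):
    """Weighted sum of the two branch outputs for one BN layer."""
    w1, w2 = branch_weights(arch_params)
    return w1 * out_branch1 + w2 * out_branch2
```

Because the weights are a smooth function of the architecture parameters, the mixed output stays differentiable with respect to them, which is what allows the parameters to be updated by gradient descent during the search.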
[0068] Specifically, in step 1032, the architecture parameters are first fixed at their current values, the first training set is input into the initial model, and the model parameters of the initial model are trained using the total loss function. The model parameters of the trained initial model are then fixed, the validation set is input into the initial model, and the architecture parameters are updated using the total loss function. Afterwards, the updated architecture parameters are fixed, the model parameters of the initial model are trained again, and the architecture parameters are updated again. This process is repeated until the architecture parameters finally converge, that is, until the initial model converges.
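The alternating procedure can be illustrated with a toy example in which the model parameters and the architecture parameters are each reduced to a single scalar. Everything here is an assumption for illustration: the scalar losses, the numerical gradient, the learning rate, and the stopping rule on the change in the architecture parameter.

```python
def numeric_grad(f, x, h=1e-5):
    """Central-difference estimate of df/dx at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

def alternate_search(train_loss, val_loss, w=0.0, alpha=0.0,
                     lr=0.1, tol=1e-6, max_rounds=1000):
    """Alternate two updates until the architecture parameter stops changing:
    1) fix alpha, update model parameter w on the training loss;
    2) fix w, update architecture parameter alpha on the validation loss.
    """
    for _ in range(max_rounds):
        w = w - lr * numeric_grad(lambda v: train_loss(v, alpha), w)
        new_alpha = alpha - lr * numeric_grad(lambda v: val_loss(w, v), alpha)
        if abs(new_alpha - alpha) < tol:
            return w, new_alpha
        alpha = new_alpha
    return w, alpha
```

For example, with the toy losses `train_loss = lambda w, a: (w - 2.0) ** 2` and `val_loss = lambda w, a: (a - w) ** 2`, both values approach 2. A real search would replace the scalar updates with optimizer steps over mini-batches from the first training set and the validation set.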
[0069] Step 1032 is used to search for architecture parameters. In this embodiment of the invention, a cross-modal loss function that considers both the differences in data distribution across different modalities and the correlation between modalities is designed to guide the search process. Here, the final cross-modal loss function is the total loss function, which is the sum of several loss functions.
[0070] Specifically, the total loss function includes a base loss function and a special loss function. The special loss function includes a loss function part that characterizes the differences in data distribution across different modalities and a loss function part that characterizes the correlation between data across different modalities.
[0071] In step 1032, the base loss function, the loss function characterizing the differences in data distribution across different modalities, and the loss function characterizing the correlation between data across different modalities are calculated respectively, and the total loss value is obtained by summing them.
[0072] After obtaining the total loss value, adjust the model parameters of the initial model based on the total loss value until the total loss value is less than the set threshold.
[0073] The calculation formulas of the total loss function include:
[0074]
[0075] where the total loss function comprises a base loss function and a special loss function,
[0076]
[0077] where λ1 and λ2 are adjustment parameters, and one part of the special loss function characterizes the differences in data distribution across different modalities,
[0078]
[0079] where $f_{c,vis}$ and $f_{c,ir}$ are the features obtained by forward propagation through the hypernetwork of the visible light image and the infrared image numbered c in the sample set, respectively; $m_c$ and $n_c$ are the numbers of $f_{c,vis}$ and $f_{c,ir}$, with $m_c = n_c$; and $\langle\cdot,\cdot\rangle$ denotes the sum of the dot products of vectors;
[0080] and the other part of the special loss function characterizes the correlation between data in different modalities,
[0081]
[0082]
[0083]
[0084]
[0085]
[0086] When designing the loss function, one loss function is designed to characterize the differences in data distribution across different modalities; minimizing it reduces the distributional differences between the two modalities. At the same time, it is necessary to constrain the consistency of similarity associations between images from different modalities; that is, the feature similarity of images of the same identity across different modalities should remain consistent. Therefore, a second loss function is designed to characterize the correlation between data in different modalities.
[0087] In summary, a special loss function for cross-modal data can be obtained. Building upon this, classification and triplet loss functions, i.e., the base loss function, are added to learn the features. Finally, the total loss function is obtained.
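The composition of the total loss (a base loss plus two cross-modal terms scaled by λ1 and λ2) can be sketched as below. The formulas in this text are not reproduced, so both cross-modal terms are stand-ins chosen here: `distribution_gap` is a squared distance between modality means written as sums of dot products, and `correlation_gap` compares the within-modality similarity matrices. Neither should be read as the patent's actual formula.

```python
import numpy as np

def distribution_gap(f_vis, f_ir):
    """Squared distance between modality means, expanded into sums of dot
    products (a stand-in for the distribution-difference term)."""
    m, n = len(f_vis), len(f_ir)
    return ((f_vis @ f_vis.T).sum() / (m * m)
            + (f_ir @ f_ir.T).sum() / (n * n)
            - 2.0 * (f_vis @ f_ir.T).sum() / (m * n))

def correlation_gap(f_vis, f_ir):
    """Mean squared difference between the two within-modality similarity
    matrices (a stand-in for the correlation term; needs equal counts)."""
    return float(np.mean((f_vis @ f_vis.T - f_ir @ f_ir.T) ** 2))

def total_loss(base_loss, f_vis, f_ir, lam1=1.0, lam2=1.0):
    """Base loss plus the two weighted cross-modal terms."""
    return (base_loss
            + lam1 * distribution_gap(f_vis, f_ir)
            + lam2 * correlation_gap(f_vis, f_ir))
```

Here `base_loss` is assumed to be a precomputed scalar, for example the sum of a classification loss and a triplet loss, and `lam1` and `lam2` play the role of the adjustment parameters λ1 and λ2.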
[0088] In step 1033, during the process of determining the intermediate model based on the architecture parameters when the initial model converges, if the architecture parameter of the first branch is greater than that of the second branch when the initial model converges, the second branch of the initial model is removed to form the intermediate model; if the architecture parameter of the second branch is greater than that of the first branch when the initial model converges, the first branch of the initial model is removed to form the intermediate model.
[0089] Specifically, given a batch of data including visible light images and infrared images, after the architecture search process ends (i.e., after the architecture parameters converge), the architecture parameters of the first branch and the second branch determine whether, during forward propagation, the BN layer uses the first branch o1 or the second branch o2. If the architecture parameter of the first branch is greater than that of the second branch, this BN layer takes the form of the first branch o1 in the final convolutional neural network structure; if the architecture parameter of the second branch is greater than that of the first branch, this BN layer takes the form of the second branch o2.
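The per-layer selection rule can be written in a few lines. The labels `"split"` and `"shared"` and the function name are hypothetical, and the handling of exact ties is an assumption because the text does not cover that case.

```python
def select_branches(arch_params_per_layer):
    """For each BN layer, keep the branch with the larger architecture parameter.

    Each entry is a (first_branch_param, second_branch_param) pair.
    Ties keep the shared branch, which is an assumption; the text is silent on ties.
    """
    return ["split" if a1 > a2 else "shared" for a1, a2 in arch_params_per_layer]
```

For instance, `select_branches([(0.7, 0.3), (0.2, 0.8)])` keeps the split first branch for the first layer and the shared second branch for the second layer.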
[0090] In step 1034, the architecture parameters obtained by the search are fixed, the corresponding branches are retained according to the architecture parameters, the searched convolutional network structure is trained on the second training set, and then tested on the test set to obtain the final pedestrian re-identification model.
[0091] As shown in Figure 3B, a method for training a pedestrian re-identification model according to an embodiment of the present invention includes the following steps:
[0092] Step 301: Divide the sample dataset into the first training set, validation set, and test set.
[0093] Step 302: Obtain the defined hypernetwork model.
[0094] Specifically, each batch normalization (BN) layer of the hypernetwork model can be split into a first branch and a second branch. The first branch processes visible light image data and infrared image data separately, while the second branch processes visible light image data and infrared image data together. The first and second branches have corresponding architecture parameters.
[0095] Step 303: Train the model parameters of the hypernetwork model on the first training set.
[0096] Step 304: Update the architecture parameters of the first and second branches on the validation set.
[0097] Step 305: Determine if the architecture parameters have converged. If yes, proceed to step 306; otherwise, proceed to step 303.
[0098] Step 306: Train the searched neural network architecture on the second training set. This neural network architecture is determined based on the converged architecture parameters obtained from the last execution of step 304.
[0099] The second training set consists of the first training set and the validation set. In step 306, the architecture parameters are fixed to determine the BN layer structure of the neural network, and the model parameters are trained.
[0100] Step 307: Test the trained neural network structure on the test set.
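The data partition used by steps 301 and 306 can be sketched as follows. The split fractions, the random shuffling, the fixed seed, and the function name `split_dataset` are assumptions for illustration; the text only states that the sample dataset is divided into a first training set, a validation set, and a test set, and that the second training set consists of the first training set and the validation set.

```python
import random

def split_dataset(samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Split samples into first training, validation, and test sets,
    and build the second training set (first training set + validation set)."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # assumed: random split with a fixed seed
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test_set = items[:n_test]
    val_set = items[n_test:n_test + n_val]
    first_train = items[n_test + n_val:]
    second_train = first_train + val_set  # used in step 306
    return first_train, val_set, test_set, second_train

# Example: ten placeholder sample identifiers.
train1, val, test, train2 = split_dataset(range(10))
```

In practice, re-identification datasets are usually split by person identity rather than by individual sample; the sketch ignores that detail.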
[0101] In each iteration of the loop consisting of steps 303 and 304, the architecture parameters are first fixed and the model parameters are trained on the first training set; the model parameters are then fixed and the architecture parameters are updated on the validation set.
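The alternating loop of steps 303 to 305 can be sketched as below. The convergence test (largest change in any architecture parameter below a tolerance), the iteration cap, and the callback interface are assumptions; the toy callbacks are stand-ins for real training and validation passes and only demonstrate the control flow.

```python
def alternating_search(train_step, arch_step, alpha, max_iters=100, tol=1e-4):
    """Alternate model-parameter and architecture-parameter updates.

    train_step(alpha): updates the model parameters with alpha fixed (step 303).
    arch_step(alpha): returns updated architecture parameters with the
        model parameters fixed (step 304).
    Stops when alpha changes by less than tol (assumed form of step 305).
    """
    iteration = 0
    for iteration in range(1, max_iters + 1):
        train_step(alpha)
        new_alpha = arch_step(alpha)
        change = max(abs(n - o) for n, o in zip(new_alpha, alpha))
        alpha = new_alpha
        if change < tol:
            break
    return alpha, iteration

# Toy stand-ins: one scalar model parameter and one architecture parameter.
state = {"w": 0.0}

def toy_train_step(alpha):
    # Move the model parameter halfway towards the current alpha.
    state["w"] += 0.5 * (alpha[0] - state["w"])

def toy_arch_step(alpha):
    # Move alpha halfway towards 1.0, the minimizer of a toy validation loss.
    return [alpha[0] + 0.5 * (1.0 - alpha[0])]

final_alpha, n_iters = alternating_search(toy_train_step, toy_arch_step, [0.0])
```

Once the loop stops, the returned architecture parameters are the ones used to decide which branch each BN layer keeps.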
[0102] After obtaining the final pedestrian re-identification model, performance testing is conducted according to the testing protocol. Specifically, Cumulative Match Characteristics (CMC) and Mean Average Precision (mAP) can be used as evaluation metrics.
[0103] Cumulative Match Characteristics is a classic evaluation metric in person re-identification problems. Mean Average Precision is a performance metric for algorithms that predict target location and category, and is very useful for evaluating target localization models, target detection models, and instance segmentation models.
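For reference, the two metrics can be computed from ranked gallery lists as in the following sketch. It is a simplified version under stated assumptions: each query comes with gallery identities already sorted by decreasing similarity, and the camera-based filtering used by dataset-specific testing protocols is ignored. The function names are illustrative.

```python
def cmc_and_ap(ranked_gallery_ids, query_id, max_rank=5):
    """Return the CMC curve (up to max_rank) and average precision for one query."""
    hits = [gid == query_id for gid in ranked_gallery_ids]
    # CMC: 1.0 at rank k if a correct match appears within the top k results.
    cmc, found = [], False
    for hit in hits[:max_rank]:
        found = found or hit
        cmc.append(1.0 if found else 0.0)
    # Pad when the gallery is shorter than max_rank.
    cmc += [cmc[-1] if cmc else 0.0] * (max_rank - len(cmc))
    n_relevant = sum(hits)
    if n_relevant == 0:
        return cmc, 0.0
    # Average precision: mean of the precision values at each correct match.
    correct, precision_sum = 0, 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            correct += 1
            precision_sum += correct / rank
    return cmc, precision_sum / n_relevant

def evaluate(queries, max_rank=5):
    """queries: list of (query_id, ranked_gallery_ids). Returns (mean CMC, mAP)."""
    results = [cmc_and_ap(ranked, qid, max_rank) for qid, ranked in queries]
    n = len(results)
    mean_cmc = [sum(c[k] for c, _ in results) / n for k in range(max_rank)]
    mean_ap = sum(ap for _, ap in results) / n
    return mean_cmc, mean_ap
```

The first entry of the mean CMC curve corresponds to the accuracy of the top-ranked match.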
[0104] As shown in Figure 4, on the SYSU-MM01 dataset, the classification accuracy for the most likely class (rank-1 accuracy) is improved by 6.70%, and the mean average precision (mAP) is improved by 6.13%. On the RegDB dataset, the rank-1 accuracy is improved by 12.17%, and the mAP is improved by 11.23%. Here, the classification accuracy for the most likely class refers to the probability that the model's top-ranked match is correct.
[0105] The SYSU-MM01 dataset is a popular cross-modal person re-identification dataset, containing 491 pedestrians captured by 4 visible light cameras and 2 infrared cameras. The training set contains 19,659 visible light images and 12,792 infrared images of 395 people, while the test set contains 96 people. The RegDB dataset was collected with two aligned cameras, one visible light camera and one infrared camera. It contains a total of 412 people, with 10 visible light images and 10 infrared images for each person.
[0106] As shown in Figure 4, the results are reported separately for single-occurrence (single-shot) and multiple-occurrence (multi-shot) scenarios. In multiple-occurrence scenarios, each person has multiple images; in single-occurrence scenarios, each person has only one image.
[0107] In cross-dataset experiments, we searched for network structures on the SYSU-MM01 dataset and trained and tested the searched network structures on the RegDB dataset. We still achieved state-of-the-art results, demonstrating the algorithm's generalization ability.
[0108] After step 104, the feature data can be matched with the identity information of the people in the database to obtain the identity information of the people corresponding to the feature data. Alternatively, the first feature data corresponding to the visible light image and the second feature data corresponding to the infrared image can be compared for similarity to determine whether the first feature data and the second feature data correspond to the same person.
[0109] Specifically, when matching with the identity information of people in the database, the similarity of the features of the people in the database can be compared, and the information can be sorted according to the similarity. The corresponding identity information of the people can be found from the sorted list to obtain the recognition result.
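The matching and comparison described above can be sketched as follows. Cosine similarity, the decision threshold of 0.8, and the function names are assumptions for illustration; the text only requires some similarity measure, a sort by similarity, and a same-person decision. Feature vectors are assumed to be non-zero.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_gallery(query_feature, gallery):
    """Sort database entries by decreasing similarity to the query.

    gallery: list of (identity, feature_vector) pairs.
    Returns a list of (identity, similarity) pairs, best match first.
    """
    scored = [(identity, cosine_similarity(query_feature, feature))
              for identity, feature in gallery]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def same_person(visible_feature, infrared_feature, threshold=0.8):
    """Decide whether two features belong to the same person (assumed threshold)."""
    return cosine_similarity(visible_feature, infrared_feature) >= threshold

# Example with toy two-dimensional features.
gallery = [("p1", [1.0, 0.0]), ("p2", [0.0, 1.0]), ("p3", [1.0, 1.0])]
ranking = rank_gallery([1.0, 0.1], gallery)
```

The identity at the head of the ranking is the recognition result; lower-ranked entries can be kept as alternative candidates.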
[0110] The cross-modal pedestrian re-identification method provided by this invention extracts features from cross-modal data using a pedestrian re-identification model and performs pedestrian re-identification based on the extracted features. This method can make full use of visible light and infrared information for pedestrian re-identification of cross-modal data.
[0111] The cross-modal pedestrian re-identification device provided by the present invention will be described below. The cross-modal pedestrian re-identification device described below can be referred to in correspondence with the cross-modal pedestrian re-identification method described above.
[0112] As shown in Figure 5, the cross-modal pedestrian re-identification device includes:
[0113] The acquisition unit 502 is used to acquire cross-modal data to be detected, which includes visible light image-infrared image pairs.
[0114] The recognition unit 504 is used to input the cross-modal data to be detected into a preset pedestrian re-identification model to obtain the feature data of the cross-modal data to be detected, so as to obtain the pedestrian re-identification result based on the feature data.
[0115] The pedestrian re-identification model is trained based on a sample dataset and is used to extract feature data from the cross-modal data to be detected.
[0116] Since the functional modules of the cross-modal pedestrian re-identification device of the example embodiment of the present invention correspond to the steps of the example embodiment of the cross-modal pedestrian re-identification method described above, for details not disclosed in the device embodiment of the present invention, please refer to the embodiment of the cross-modal pedestrian re-identification method described above.
[0117] The cross-modal pedestrian re-identification device provided by the present invention extracts features from cross-modal data using a pedestrian re-identification model and performs pedestrian re-identification based on the extracted feature data. It can make full use of visible light and infrared information to perform pedestrian re-identification of cross-modal data.
[0118] Figure 6 is a schematic diagram of the physical structure of an example electronic device. As shown in Figure 6, the electronic device may include a processor 610, a communications interface 620, a memory 630, and a communication bus 640, wherein the processor 610, communications interface 620, and memory 630 communicate with each other via the communication bus 640. The processor 610 can call logical instructions in the memory 630 to execute a cross-modal person re-identification method. This method includes: acquiring cross-modal data to be detected, the cross-modal data to be detected including a visible light image-infrared image pair; inputting the cross-modal data to be detected into a preset person re-identification model to obtain feature data of the cross-modal data to be detected, and obtaining a person re-identification result based on the feature data; wherein the person re-identification model is trained based on a sample dataset, and the person re-identification model is used to extract feature data from the cross-modal data to be detected.
[0119] Furthermore, the logical instructions in the aforementioned memory 630 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, essentially, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0120] On the other hand, the present invention also provides a computer program product, the computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions, wherein when the program instructions are executed by a computer, the computer is able to execute the cross-modal person re-identification method provided by the above methods, the method comprising: acquiring cross-modal data to be detected, the cross-modal data to be detected comprising visible light image-infrared image pairs; inputting the cross-modal data to be detected into a preset person re-identification model to obtain feature data of the cross-modal data to be detected, so as to obtain a person re-identification result based on the feature data; wherein, the person re-identification model is trained based on a sample dataset, the person re-identification model being used to extract feature data from the cross-modal data to be detected.
[0121] In another aspect, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the aforementioned cross-modal pedestrian re-identification methods. The method includes: acquiring cross-modal data to be detected, the cross-modal data to be detected including a visible light image-infrared image pair; inputting the cross-modal data to be detected into a preset pedestrian re-identification model to obtain feature data of the cross-modal data to be detected, so as to obtain a pedestrian re-identification result based on the feature data; wherein, the pedestrian re-identification model is trained based on a sample dataset, and the pedestrian re-identification model is used to extract feature data from the cross-modal data to be detected.
[0122] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.
[0123] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.
[0124] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A cross-modal pedestrian re-identification method, characterized in that the method includes: acquiring cross-modal data to be detected, the cross-modal data to be detected including visible light image-infrared image pairs; inputting the cross-modal data to be detected into a preset pedestrian re-identification model to obtain feature data of the cross-modal data to be detected, so as to obtain a pedestrian re-identification result based on the feature data; wherein the pedestrian re-identification model is trained based on a sample dataset and is used to extract feature data from the cross-modal data to be detected; the training method for the pedestrian re-identification model includes: dividing the sample dataset into three parts to form a first training set, a validation set, and a test set, the data in the sample dataset including visible light image-infrared image pairs; training the model parameters of an initial model using a total loss function and the first training set, and updating the model parameters using the total loss function and the validation set until the initial model converges, thus obtaining the model parameters at convergence, wherein each batch normalization (BN) layer of the initial model includes a first branch and a second branch, the first branch includes two identical sub-BN layers used to process visible light images and infrared images respectively, the second branch includes one sub-BN layer, and the first branch and the second branch are configured with their respective corresponding model parameters; selecting, based on the model parameters, one of the first branch and the second branch to be retained to form an intermediate model; training the model parameters of the intermediate model using a second training set, and testing the intermediate model using the test set to obtain the pedestrian re-identification model, wherein the second training set consists of the first training set and the validation set; wherein training the model parameters of the initial model using the total loss function and the first training set includes: weighting and summing the outputs of the first branch and the second branch according to the model parameters to obtain the output of the BN layer, wherein the output of the first branch is the concatenation of the outputs of the two sub-BN layers of the first branch, and the output of the second branch is the output of the one sub-BN layer of the second branch; obtaining the forward propagation output of the initial model based on the initial model and the output of the BN layer; obtaining a total loss value based on the total loss function and the forward propagation output of the initial model; and updating the model parameters of the initial model through backpropagation based on the total loss value until the total loss value is less than a set threshold.
2. The pedestrian re-identification method according to claim 1, characterized in that the step of selecting one of the first branch and the second branch to be retained based on the model parameters includes: when the initial model converges, if the model parameters of the first branch are greater than the model parameters of the second branch, removing the second branch of the initial model to form the intermediate model; and when the initial model converges, if the model parameters of the second branch are greater than the model parameters of the first branch, removing the first branch of the initial model to form the intermediate model.
3. The pedestrian re-identification method according to claim 1, characterized in that training the model parameters of the initial model using the total loss function and the first training set includes: calculating the basis loss function, the loss function representing the differences in data distribution across different modalities, and the loss function representing the correlation between data across different modalities, and summing them to obtain the total loss value; and adjusting the model parameters of the initial model based on the total loss value until the total loss value is less than a set threshold.
4. The pedestrian re-identification method according to claim 3, characterized in that the calculation formula of the total loss function is composed of the basis loss function, a special loss function, and adjusting parameters; one loss function characterizes the differences in data distribution across different modalities and is computed from the features obtained after forward propagation, through the hypernetwork, of the visible light image and the infrared image numbered c in the sample dataset, using the magnitudes of these feature vectors and the summation of their dot products; C represents the total number of visible light and infrared images; i represents the i-th visible light image, and j represents the j-th infrared image; and another loss function characterizes the correlation of data in different modalities.
5. The pedestrian re-identification method according to claim 1, characterized in that, after obtaining the feature data of the cross-modal data to be detected, the pedestrian re-identification method further includes: matching the feature data with the identity information of individuals in the database to obtain the identity information of the individuals corresponding to the feature data; and/or comparing the similarity of the first feature data corresponding to the visible light image and the second feature data corresponding to the infrared image to determine whether the first feature data and the second feature data correspond to the same person.
6. A cross-modal pedestrian re-identification device, characterized in that the pedestrian re-identification device includes: an acquisition unit used to acquire cross-modal data to be detected, the cross-modal data to be detected including visible light image-infrared image pairs; and a recognition unit used to input the cross-modal data to be detected into a preset pedestrian re-identification model to obtain the feature data of the cross-modal data to be detected, so as to obtain the pedestrian re-identification result based on the feature data; wherein the pedestrian re-identification model is trained based on a sample dataset and is used to extract feature data from the cross-modal data to be detected; the training of the pedestrian re-identification model includes: dividing the sample dataset into three parts to form a first training set, a validation set, and a test set, the data in the sample dataset including visible light image-infrared image pairs; training the model parameters of an initial model using a total loss function and the first training set, and updating the model parameters using the total loss function and the validation set until the initial model converges, thus obtaining the model parameters at convergence, wherein each batch normalization (BN) layer of the initial model includes a first branch and a second branch, the first branch includes two identical sub-BN layers used to process visible light images and infrared images respectively, the second branch includes one sub-BN layer, and the first branch and the second branch are configured with their respective corresponding model parameters; selecting, based on the model parameters, one of the first branch and the second branch to be retained to form an intermediate model; training the model parameters of the intermediate model using a second training set, and testing the intermediate model using the test set to obtain the pedestrian re-identification model, wherein the second training set consists of the first training set and the validation set; wherein training the model parameters of the initial model using the total loss function and the first training set includes: weighting and summing the outputs of the first branch and the second branch according to the model parameters to obtain the output of the BN layer, wherein the output of the first branch is the concatenation of the outputs of the two sub-BN layers of the first branch, and the output of the second branch is the output of the one sub-BN layer of the second branch; obtaining the forward propagation output of the initial model based on the initial model and the output of the BN layer; obtaining a total loss value based on the total loss function and the forward propagation output of the initial model; and updating the model parameters of the initial model through backpropagation based on the total loss value until the total loss value is less than a set threshold.
7. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the program, it implements the steps of the pedestrian re-identification method as described in any one of claims 1 to 5.
8. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by a processor, it implements the steps of the pedestrian re-identification method as described in any one of claims 1 to 5.