Domain generalization pedestrian re-identification method based on frequency domain enhancement and dynamic convolution cooperation
By combining frequency domain enhancement with dynamic convolution, the method addresses poor feature robustness and low cross-domain recognition accuracy in cross-camera pedestrian re-identification. It matches pedestrians accurately under different lighting, viewing angles and resolutions, thereby improving recognition accuracy and robustness.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHENGDU UNIV OF INFORMATION TECH
- Filing Date
- 2026-05-18
- Publication Date
- 2026-06-19
AI Technical Summary
Existing generalized pedestrian re-identification technologies suffer from poor feature robustness and low cross-domain recognition accuracy in cross-camera image matching, especially in images with different lighting, angles, and resolutions. Furthermore, they lack a precise weight allocation mechanism for multi-scale features.
We employ a method that combines frequency domain enhancement and dynamic convolution. By using a frequency domain feature extraction module, a statistical feature module, and a multi-scale dynamic space module, we enhance feature extraction and fusion capabilities, enabling in-depth mining of fine-grained features and dynamic adaptation of multi-scale features.
It improves the recognition accuracy and robustness across different scenarios, alleviates the problem of poor generalization of cross-domain features, and enhances the model's recognition ability in complex environments.
Figure CN122244910A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of image technology, and specifically to a domain generalization pedestrian re-identification method based on frequency domain enhancement and dynamic convolution synergy.
Background Technology
[0002] Domain-generalized person re-identification is a subtask in computer vision, based on deep learning research. Its core advantage is that it allows models to generalize directly to new scenes without requiring target scene data, learning universal pedestrian features across scenes. This makes it highly effective in handling pedestrian images with different lighting, angles, and backgrounds, unlike ordinary person re-identification techniques that rely on target scene data for training.
[0003] Currently, domain-generalized pedestrian re-identification technology has become mainstream in the fields of intelligent security and smart cities. It is a key supporting technology for cross-camera personnel tracking, particularly in core applications such as cross-regional suspect tracking, checkpoint deployment and verification, and public safety incident tracing.
[0004] The core challenge of cross-camera pedestrian image matching lies in the robustness of features under domain shift: the pedestrian feature distribution in a single camera image is stable, and the matching criteria are clear. However, the feature distribution of images collected from different cameras varies greatly. For example, there are bright, high-definition images indoors in shopping malls, backlit, blurry images at intersections, and low-resolution images at subway entrances. This results in weak generalization ability of ordinary pedestrian re-identification models. Even with domain generalization-based techniques, it is necessary to balance source domain accuracy and target domain generalization simultaneously. In the public security field, the core requirements for cross-camera pedestrian matching are "cross-scene robustness" and "identification timeliness": on the one hand, it is necessary to accurately match the same suspect in different scenes without target camera data, avoiding recognition failures due to scene differences; on the other hand, it is necessary to quickly process massive amounts of surveillance data to buy time for case investigation and emergency response.
[0005] In related technologies, when performing cross-domain generalization pedestrian re-identification, the inherent differences between training and testing scenarios lead to "domain shift," and fine-grained pedestrian features are easily masked by complex environmental noise. This makes it difficult for the model to capture differentiated visual information and leaves it susceptible to interference from inter-domain differences, ultimately reducing the accuracy of cross-domain recognition. The image encoder primarily uses a general pre-trained ViT model, which has limited ability to capture fine-grained pedestrian features. Furthermore, the lack of a precise weight allocation mechanism for multi-scale features makes it unable to adaptively strengthen key-scale feature information or effectively suppress feature distribution differences caused by domain shift, resulting in poor feature generalization in cross-domain scenarios.
Summary of the Invention
[0006] The purpose of this invention is to provide a domain-generalized pedestrian re-identification method based on frequency domain enhancement and dynamic convolution collaboration, so as to solve the technical problems existing in related technologies.
[0007] To achieve the above objectives, this invention provides a domain-generalized person re-identification method based on frequency domain enhancement and dynamic convolution, comprising:
[0008] The image to be identified is acquired and input into the target person recognition model to obtain the target person image of the image to be identified. The target person recognition model includes a convolution module, a frequency domain reconstruction module, a multi-scale dynamic space module and an image encoder connected in sequence. The frequency domain reconstruction module includes a frequency domain feature extraction module, a statistical feature module and a fusion feature module.
[0009] The target person recognition model's processing method for the image to be recognized includes:
[0010] The first feature image is obtained by extracting features from the image to be identified through the first convolution module;
[0011] The second feature image is extracted from the first feature image by the second convolution module, and frequency domain features are extracted from the second feature image by the frequency domain feature extraction module. Statistical features are extracted from the second feature image by the statistical feature module. The first convolution module and the second convolution module have different sizes.
[0012] The frequency domain features, statistical features, first feature image, and second feature image are fused by the fusion feature module to obtain the first fused feature;
[0013] The first fusion feature is spatially adaptively fused using a multi-scale dynamic spatial module to obtain the second fusion feature;
[0014] The second fused feature is input into the image encoder to obtain the third image feature;
[0015] Based on the third image features, the target person image in the image to be identified is determined.
[0016] Optionally, the step of extracting frequency domain features from the second feature image through the frequency domain feature extraction module includes:
[0017] The second feature image is mapped to the frequency domain by using a two-dimensional fast Fourier transform to obtain the complex spectral features in the frequency domain.
[0018] The real part of the complex spectral features in the frequency domain is processed by the first convolution branch to obtain a learnable filter for the real part;
[0019] The imaginary part of the complex spectral features in the frequency domain is processed by the second convolution branch to obtain a learnable filter for the imaginary part. The first and second convolution branches have the same layered processing structure: each consists of a first convolution layer, a normalization and rectified linear activation layer, and a second convolution layer connected in series. The first convolution layer reduces the number of channels from C to C/4, and the second convolution layer restores it from C/4 to C.
[0020] The real part learnable filter and the imaginary part learnable filter are combined to form a new complex feature. The new complex feature is then processed by the inverse fast Fourier transform to obtain spatial domain feature information.
[0021] The spatial domain feature information is processed by average pooling and convolution to obtain the channel weight vector;
[0022] Based on the channel weight vector, the second feature image is recalibrated to obtain the frequency domain features.
[0023] Optionally, the channel weight vector is expressed by the following formula:
[0024] [formula not reproduced];
[0025] where the symbols in the formula denote, in order: the channel weight vector; the sigmoid activation function; convolution with a 1×1 kernel; the rectified linear activation function; batch normalization; adaptive average pooling; and the spatial domain feature information.
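As a reading aid, one composition that is consistent with the operators listed above can be written as follows. The symbol names ($w$, $X_s$, $X_2$, $F_{freq}$) are assumed for illustration, and the nesting order, in particular the use of two 1×1 convolutions, is an assumption rather than the formula as filed.

```latex
% Hypothetical reconstruction: symbol names and nesting order are assumptions.
w = \sigma\Bigl(\mathrm{Conv}_{1\times 1}\bigl(\mathrm{ReLU}\bigl(\mathrm{BN}\bigl(\mathrm{Conv}_{1\times 1}\bigl(\mathrm{AAP}(X_s)\bigr)\bigr)\bigr)\bigr)\Bigr),
\qquad
F_{freq} = w \odot X_2
```

Here $X_s$ stands for the spatial domain feature information, $w$ for the channel weight vector, $X_2$ for the second feature image, and $F_{freq}$ for the recalibrated frequency domain features.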
[0026] Optionally, the step of extracting statistical features from the second feature image through the statistical feature module includes:
[0027] The second feature image is divided into multiple regional feature images according to a preset ratio. The channel mean and variance of each regional feature image are calculated, and the channel means and variances of all regional feature images are concatenated to obtain a statistical description vector.
[0028] A statistical feature sequence is obtained by performing a nonlinear mapping on the statistical description vector using a one-dimensional convolution method.
[0029] The statistical feature sequence is dynamically fused with the second feature image to obtain the statistical features.
[0030] Optionally, the statistical feature sequence is expressed by the following calculation formula:
[0031] [formula not reproduced];
[0032] where the symbols in the formula denote, in order: the statistical feature sequence; the statistical description vector; one-dimensional convolution; the rectified linear activation function; and batch normalization;
[0033] The statistical characteristics are expressed by the following formula:
[0034] [formula not reproduced];
[0035] where the symbols in the formula denote, in order: the statistical features; the second feature image; averaging over the channel dimension; element-wise multiplication; the sigmoid activation function; and a computation on the third dimension.
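The statistical feature steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the filed implementation: the regions are taken to be equal horizontal stripes, batch normalization is omitted, and the "dynamic fusion" is approximated by a per-channel sigmoid gate. The function names are hypothetical.

```python
import numpy as np

def statistical_descriptor(x, num_regions=4):
    """x: (C, H, W). Split into horizontal stripes (assumed region layout) and
    return a (C, 2 * num_regions) array: per-region channel means, then variances."""
    stripes = np.array_split(x, num_regions, axis=1)
    means = [s.mean(axis=(1, 2)) for s in stripes]
    variances = [s.var(axis=(1, 2)) for s in stripes]
    return np.stack(means + variances, axis=1)

def conv1d_relu(desc, kernel):
    """Nonlinear mapping: 'same' 1-D convolution along the descriptor axis, then ReLU.
    Batch normalization is left out of this sketch."""
    out = np.stack([np.convolve(row, kernel, mode="same") for row in desc])
    return np.maximum(out, 0.0)

def statistical_features(x, kernel, num_regions=4):
    """Gate the input feature map with a per-channel weight derived from the
    statistical feature sequence (the gating form is an assumption)."""
    seq = conv1d_relu(statistical_descriptor(x, num_regions), kernel)
    gate = 1.0 / (1.0 + np.exp(-seq.mean(axis=1)))  # sigmoid, shape (C,)
    return x * gate[:, None, None]
```

With two stripes on a 2×4×4 input, the descriptor has four entries per channel: two means followed by two variances.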
[0036] Optionally, the first fusion feature is expressed by the following calculation formula:
[0037] [formula not reproduced];
[0038] [formula not reproduced];
[0039] where the symbols in the formulas denote, in order: the concatenated feature; the reshape operation; the tensor transpose operation; the frequency domain features; the feature concatenation operation; the first fusion feature; the first feature image; a linear transformation; three quantities obtained through three independent linear transformations; the number of channels; the second feature image; and Softmax, which is used for normalization.
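The symbol list above (concatenation, reshape and transpose, three independent linear transformations, the number of channels, and Softmax) suggests an attention-style fusion. The sketch below shows one such reading as scaled dot-product attention over the concatenated features. It is an assumption, not the filed formula: the concatenation axis, the scaling by the square root of the channel count, and all names (`fuse`, `w_q`, `w_k`, `w_v`, `w_out`) are hypothetical, and all inputs are assumed to share one shape.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse(features, w_q, w_k, w_v, w_out):
    """features: list of (C, H, W) arrays of equal shape (assumed)."""
    # Reshape each map to (H*W, C) tokens and concatenate along the token axis.
    tokens = np.concatenate(
        [f.reshape(f.shape[0], -1).T for f in features], axis=0
    )
    # Three independent linear transformations of the concatenated feature.
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Softmax-normalized similarity, scaled by sqrt(number of channels).
    attn = softmax(q @ k.T / np.sqrt(tokens.shape[1]), axis=-1)
    # Final linear transformation of the attended values.
    return (attn @ v) @ w_out, attn
```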
[0040] Optionally, the step of performing spatial adaptive fusion processing on the first fusion feature through a multi-scale dynamic spatial module to obtain the second fusion feature includes:
[0041] The global context vector is extracted from the first fusion feature by a global average pooling layer. The global context vector is then processed by a two-layer convolutional network to obtain the parameters of the dynamic convolutional kernel, which is then split into multiple dynamic convolutional kernels of different scales.
[0042] The average feature is obtained by averaging the first fused feature along the channel dimension.
[0043] Each dynamic convolution kernel is convolved with the average feature, and spatial attention features are calculated based on the convolution results.
[0044] Multiple spatial attention features are summed using a weighted method to obtain a dynamic spatial attention map;
[0045] Based on the dynamic spatial attention map, the first fusion feature is weighted to obtain the second fusion feature.
[0046] Optionally, the second fusion feature is expressed by the following calculation formula:
[0047] [formula not reproduced];
[0048] where the symbols in the formula denote, in order: the second fusion feature; the overall dynamic spatial attention map; independent learnable parameters that have been passed through the Softmax operation; the dynamic spatial attention map produced by convolution with the i-th dynamic kernel; the i-th dynamic convolution kernel; the total number of dynamic convolution kernels; the average feature; the convolution operation; and element-wise multiplication.
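The multi-scale dynamic spatial module described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: kernel sizes of 3, 5 and 7, a sigmoid applied to each convolution result, and a kernel generator reduced to two linear maps with normalization omitted. The names `w1`, `w2` and `scale_logits` are hypothetical parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def conv2d_same(img, kernel):
    """Zero-padded 'same' 2-D cross-correlation of an (H, W) map with an odd k x k kernel."""
    k = kernel.shape[0]
    padded = np.pad(img, k // 2)
    out = np.zeros(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out

def multi_scale_dynamic_attention(x, w1, w2, scale_logits, sizes=(3, 5, 7)):
    """x: (C, H, W). Returns the reweighted feature and the attention map."""
    context = x.mean(axis=(1, 2))        # global context vector, shape (C,)
    flat = w2 @ (w1 @ context)           # two-layer generator (normalization omitted)
    avg_map = x.mean(axis=0)             # average over channels, shape (H, W)
    alphas = softmax(scale_logits)       # Softmax-normalized scale weights
    attention = np.zeros(avg_map.shape)
    offset = 0
    for alpha, k in zip(alphas, sizes):
        # Split the generated parameters into one k x k kernel per scale.
        kernel = flat[offset:offset + k * k].reshape(k, k)
        offset += k * k
        attention += alpha * sigmoid(conv2d_same(avg_map, kernel))
    return x * attention[None], attention
```

Because the scale weights sum to 1 and each per-scale map lies between 0 and 1, the combined attention map stays in the same range.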
[0049] Optionally, determining the target person image in the image to be identified based on the third image features includes:
[0050] Determine multiple candidate person images;
[0051] Calculate the cosine similarity between the third image features and each candidate person image to obtain multiple cosine similarities;
[0052] Identify the candidate person image with the highest cosine similarity as the target person image in the image to be identified.
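The matching step can be sketched as a nearest-neighbour search by cosine similarity. The sketch assumes each candidate person image is already represented by a feature vector of the same length as the query feature; `best_match` is a hypothetical helper name.

```python
import numpy as np

def best_match(query, gallery):
    """query: (D,) feature of the image to be identified.
    gallery: (N, D) features of the candidate person images (assumed precomputed).
    Returns the index of the most similar candidate and all cosine similarities."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q
    return int(np.argmax(sims)), sims
```

For example, a query pointing along the first axis matches a candidate that points the same way regardless of that candidate's magnitude, since cosine similarity ignores vector length.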
[0053] Optionally, the target person recognition model is trained using person images and pedestrian description text containing learnable variables. For training, the target person recognition model includes a text encoder as well as the convolution module, frequency domain reconstruction module, multi-scale dynamic space module and image encoder connected in sequence. The training method includes:
[0054] Set the image encoder to an unfrozen (trainable) state and pre-train it for a preset period to obtain a sub-person recognition model;
[0055] Set both the image encoder and the text encoder to a frozen state, and train the learnable variables of the pedestrian description text using the sub-person recognition model to obtain the target pedestrian description text;
[0056] Set the text encoder to a frozen state and the image encoder to an unfrozen state, and train the image encoder in the sub-person recognition model using the target pedestrian description text to obtain the trained target person recognition model.
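The three training phases can be summarized as a schedule of which components receive gradient updates. The sketch below is illustrative only: the stage and component names are hypothetical, and it assumes that the text encoder is also frozen during the pre-training phase, which the text does not state.

```python
# Each entry: (stage name, set of components that are trainable in that stage).
# Assumption: anything not listed is frozen, including the text encoder in stage 0.
TRAINING_SCHEDULE = (
    ("pretrain_image_encoder", {"image_encoder"}),
    ("learn_text_variables", {"learnable_text_variables"}),
    ("finetune_image_encoder", {"image_encoder"}),
)

COMPONENTS = ("image_encoder", "text_encoder", "learnable_text_variables")

def trainable_flags(stage_index):
    """Return a {component: bool} map for the given stage of the schedule."""
    _, trainable = TRAINING_SCHEDULE[stage_index]
    return {name: name in trainable for name in COMPONENTS}
```

In a deep learning framework, these flags would typically be applied by enabling or disabling gradient computation on the parameters of each component before the stage begins.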
[0057] The above technical solution adds a frequency domain feature extraction module, a statistical feature module, a fusion feature module, and a multi-scale dynamic space module to the target person recognition model. During the image recognition process, a second convolution module extracts a second feature image from a first feature image, and the frequency domain feature extraction module extracts frequency domain features from the second feature image. The statistical feature module extracts statistical features from the second feature image. The fusion feature module fuses the frequency domain features, statistical features, the first feature image, and the second feature image to obtain a first fused feature. The multi-scale dynamic space module performs spatial adaptive fusion processing on the first fused feature to obtain a second fused feature. The second fused feature is input into the image encoder to obtain a third image feature. Based on the third image feature, the person information in the image to be recognized is determined. This dual-module collaboration achieves in-depth mining of fine-grained features and dynamic adaptation and fusion of multi-scale features, alleviating the general problems of insufficient fine-grained features and poor cross-domain feature generalization in the CLIP-ReID architecture. It also improves the recognition accuracy and robustness of the target person recognition model in complex cross-scene scenarios.
[0058] Other features and advantages of the present invention will be described in detail in the following detailed description section.
Attached Figure Description
[0059] Figure 1 is a structure diagram of stage 1 training of CLIP-ReID in the first related technology.
[0060] Figure 2 is a structure diagram of stage 2 training of CLIP-ReID in the first related technology.
[0061] Figure 3 is a structural diagram of the channel attention module in the second related technology.
[0062] Figure 4 is a structural diagram of dynamic spatial attention in the third related technology.
[0063] Figure 5 is a schematic diagram illustrating a domain generalization pedestrian re-identification method based on frequency domain enhancement and dynamic convolution collaboration according to an exemplary embodiment of the present invention.
[0064] Figure 6 is a flowchart illustrating a domain generalization pedestrian re-identification method based on frequency domain enhancement and dynamic convolution collaboration according to an exemplary embodiment of the present invention.
[0065] Figure 7 is a structural diagram of a frequency domain reconstruction module according to an exemplary embodiment of the present invention.
[0066] Figure 8 is a structural diagram of a multi-scale dynamic spatial module according to an exemplary embodiment of the present invention.
[0067] Figure 9 is a schematic diagram illustrating the training of a target person recognition model according to an exemplary embodiment of the present invention.
Detailed Implementation
[0068] The technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, so as to provide a better understanding of the concept of the present invention, the technical problem solved, the technical features constituting the technical solution, and the technical effects brought about.
[0069] In the first related technology, as shown in Figure 1 and Figure 2, a neural network structure for person re-identification based on text-image alignment is disclosed. Compared with the single-modal networks that rely solely on visual information in traditional person re-identification (ReID) tasks, CLIP-ReID (CLIP-based Person Re-Identification) introduces person description text with learnable variables, such as the description text shown in Figure 1, of the form "a photo of a … person", in which the elided tokens are learnable pedestrian feature variables. Through two stages of training, the bidirectional matching of text features and image features is strengthened, allowing the model to use both the semantic information of the text and the visual information of the image, thus alleviating the tendency of traditional ReID to confuse the visual features of pedestrians with similar clothing.
[0070] In Stage 1, to avoid compromising the feature representation capabilities of the pre-trained encoders, the model freezes the pre-trained text encoder and image encoder and trains only the pedestrian description text containing learnable variables. This "text-anchored" learning approach allows the text descriptions to adapt quickly to the visual feature distribution of pedestrians: when a pedestrian image is input, the model uses the text encoder to transform the variable-containing description into text features, and then performs matching optimization against the visual features output by the image encoder. The two loss functions involved in this stage are similar in form; one of them is expressed as follows:
[0071] [formula not reproduced];
[0072] where the symbols in the formula denote: the feature of the p-th pedestrian image output by the image encoder; the text description of the pedestrian in question; the text feature of the a-th pedestrian description output by the text encoder; the text feature of the description of the pedestrian in question output by the text encoder; the similarity scores between an image feature and a text feature; the set of indices of positive images matching the text description, and the number of images in that set; B1, the total number of text features for the current sample; and b, the traversal index over those text features. The total loss for this stage is given by a further formula [formula not reproduced].
[0073] In Stage 2, the pedestrian description text has already acquired good feature representation capability, so the model freezes the text encoder and focuses on training the image encoder. The goal is to better adapt the image features to the optimized text features. This stage uses the classification loss and triplet loss commonly used in pedestrian re-identification tasks, together with an image-to-text cross-entropy loss, which is calculated as follows:
[0074] [formula not reproduced];
[0075] where the symbols in the formula denote: the feature of the pedestrian image in question output by the image encoder; the text feature of the k-th pedestrian description output by the text encoder; the text feature of the a-th pedestrian description output by the text encoder; the similarity scores between the image feature and each of these text features; a label variable whose value depends on whether the k-th text is the correct match for the image feature; N, the total number of text features in the current sample; and the number of images in the current sample. The overall loss for the second stage is given by a further formula [formula not reproduced].
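An image-to-text cross-entropy of the kind described above can be sketched as follows: the similarity scores between one image feature and all N text features are turned into a Softmax distribution, and the loss is the label-weighted negative log-probability. This is a generic cross-entropy sketch under that reading; the function and argument names are hypothetical, and the exact label values used in the filed formula are not reproduced here.

```python
import numpy as np

def image_to_text_cross_entropy(similarities, labels):
    """similarities: (N,) similarity scores between one image feature and N text features.
    labels: (N,) label weights, e.g. 1 for the matching text and 0 otherwise (assumed).
    Returns -sum_k labels[k] * log softmax(similarities)[k]."""
    s = np.asarray(similarities, dtype=float)
    q = np.asarray(labels, dtype=float)
    shifted = s - s.max()                           # for numerical stability
    log_prob = shifted - np.log(np.exp(shifted).sum())
    return float(-(q * log_prob).sum())
```

With two equally similar texts and a one-hot label, the loss equals ln 2; it approaches zero as the matching text's similarity dominates.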
[0076] In the second related technology, as shown in Figure 3, the module takes an input feature A and obtains three features from it through reshape and transpose operations, where B3 is the number of samples in the feature image, C is the number of channels, H is the height and W is the width of the feature image, and the weights along the normalized dimension sum to 1. The first two features are multiplied together, and the product is passed through a Softmax (normalization) operation to generate the channel attention weights. These weights are then multiplied with the third feature, the output is reshaped back to its original dimensions, and finally the reshaped feature is added element-wise to the original input A to obtain the final result.
[0077] This module explores the dependencies between channels and spatial dimensions, first decomposing the original features into feature matrices of different dimensions, and then generating precise channel attention weights through matrix operations and Softmax operations. Finally, through weighted feature fusion, this module can enhance the feature representation of key task information while suppressing redundant and invalid feature interference, thereby effectively improving the feature representation ability and robustness.
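The channel attention module of the second related technology can be sketched for a single sample as follows. The sketch assumes the two multiplications are matrix products between the reshaped (C × HW) feature and its transpose, which is the only shape-compatible reading; `channel_attention` is a hypothetical name.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(a):
    """a: (C, H, W) input feature. Returns the output feature and the (C, C) weights."""
    c, h, w = a.shape
    flat = a.reshape(c, h * w)                 # reshape to (C, HW)
    energy = flat @ flat.T                     # channel-to-channel affinity, (C, C)
    weights = softmax(energy, axis=-1)         # each row sums to 1
    out = (weights @ flat).reshape(c, h, w)    # reweight and restore the original shape
    return out + a, weights                    # residual addition with the input A
```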
[0078] In the third related technology, as shown in Figure 4, the module processes the input feature F in two branches. The first branch compresses the feature using global average pooling and obtains a dynamic convolution kernel through a kernel generator. The second branch obtains a single-channel spatial feature map through a Mean operation. The kernel generator is mainly composed of a 1×1 convolution, a BN layer and a second 1×1 convolution: the first 1×1 convolution keeps the number of channels at C, while the second changes the number of channels to …. The single-channel spatial feature map is convolved with the dynamic convolution kernel, and the result is normalized by the Sigmoid activation function to obtain the spatial attention map. Finally, the spatial attention map is multiplied element-wise with the original input feature F to obtain the final result Y.
[0079] This module adaptively generates dynamic convolution kernels based on input features. Compared to traditional fixed-parameter convolution kernels, it can more accurately match the spatial feature distribution of different samples, making the allocation of attention weights more sample-specific. By fusing "single-channel spatial feature map + dynamic convolution", it can effectively capture information from local key regions in the input features.
[0080] However, the inventors found that when the first related technology is used for pedestrian re-identification, CLIP-ReID, thanks to its pedestrian description text containing learnable variables, can improve on the recognition accuracy of traditional ReID models in complex scenes through cross-modal text-image alignment. The method nevertheless has a significant limitation: its core focus is the cross-modal alignment of visual and textual features and the semantic optimization of prompts, which leaves the fine-grained representation capability of the visual encoder insufficient. As a result, pedestrian identification is prone to errors.
[0081] When the second related technology is used for pedestrian re-identification, the channel attention module (CAM) mines the dependencies between channels and spatial dimensions, but its initial feature sources are limited, so the improvement it offers is limited. Although it enhances feature representation capability, its feature fusion quality is poor.
[0082] When the third related technology is used for pedestrian re-identification, the dynamic spatial attention (DSA) module employs only a single-size dynamic convolution kernel for spatial feature extraction, which cannot simultaneously adapt to spatial targets of different scales. In domain-generalized pedestrian re-identification, the scale of the same pedestrian varies greatly between scenes (for example, a pedestrian occupies a large proportion of the image in the foreground and a small proportion in the background). The module cannot cover both the detailed features of small-scale pedestrians (such as clothing textures) and the overall outline features of large-scale pedestrians, resulting in incomplete representation of pedestrian features across different domain scenarios.
[0083] In view of this, the present invention provides a domain generalization pedestrian re-identification method based on frequency domain enhancement and dynamic convolution collaboration to solve the technical problems existing in the above-mentioned related technologies.
[0084] Figure 5 is a schematic diagram illustrating a domain generalization pedestrian re-identification method based on frequency domain enhancement and dynamic convolution collaboration according to an exemplary embodiment of the present invention. Referring to Figure 5, the method includes:
[0085] The image to be identified is acquired and input into the target person recognition model to obtain the target person image of the image to be identified. The target person recognition model includes a convolution module, a frequency domain reconstruction module, a multi-scale dynamic space module and an image encoder connected in sequence. The frequency domain reconstruction module includes a frequency domain feature extraction module, a statistical feature module and a fusion feature module.
[0086] The target person recognition model's processing method for the image to be recognized includes:
[0087] S501: Extract the features of the image to be recognized through the first convolution module to obtain the first feature image;
[0088] S502: Extract the second feature image from the first feature image through the second convolution module, extract frequency domain features from the second feature image through the frequency domain feature extraction module, and extract statistical features from the second feature image through the statistical feature module. The first convolution module and the second convolution module have different sizes.
[0089] S503: The frequency domain features, statistical features, first feature image and second feature image are fused by the fusion feature module to obtain the first fused feature;
[0090] S504: The first fusion feature is spatially adaptively fused using a multi-scale dynamic spatial module to obtain the second fusion feature;
[0091] S505: Input the second fused feature into the image encoder to obtain the third image feature;
[0092] S506: Based on the third image features, determine the target person image in the image to be identified.
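The data flow of steps S501 to S505 can be sketched as a chain of callables. The dictionary keys below are hypothetical names for the modules; the sketch only shows which output feeds which module and makes no assumption about their internals.

```python
def extract_third_image_feature(image, modules):
    """modules: dict of callables keyed by (hypothetical) module names."""
    f1 = modules["conv1"](image)                       # S501: first feature image
    f2 = modules["conv2"](f1)                          # S502: second feature image
    freq = modules["frequency"](f2)                    # S502: frequency domain features
    stat = modules["statistical"](f2)                  # S502: statistical features
    fused1 = modules["fusion"](freq, stat, f1, f2)     # S503: first fused feature
    fused2 = modules["multi_scale_dynamic"](fused1)    # S504: second fused feature
    return modules["image_encoder"](fused2)            # S505: third image feature
```

Step S506 then compares the returned feature against candidate person images. Note that the fusion module receives four inputs, including both the first and the second feature image.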
[0093] The above technical solution adds a frequency domain feature extraction module, a statistical extraction module, a fusion feature module, and a multi-scale dynamic space module to the target person recognition model. During the image recognition process, a second convolution module extracts a second feature image from a first feature image, and the frequency domain feature extraction module extracts frequency domain features from the second feature image. The statistical feature module extracts statistical features from the second feature image. The fusion feature module fuses the frequency domain features, statistical features, the first feature image, and the second feature image to obtain a first fused feature. The multi-scale dynamic space module performs spatial adaptive fusion processing on the first fused feature to obtain a second fused feature. The second fused feature is input into the image encoder to obtain a third image feature. Based on the third image feature, the person information in the image to be recognized is determined. This dual-module collaboration achieves in-depth mining of fine-grained features and dynamic adaptation and fusion of multi-scale features, alleviating the general problems of insufficient fine-grained features and poor cross-domain feature generalization in the CLIP-ReID architecture. It also improves the recognition accuracy and robustness of the target person recognition model in complex cross-scene scenarios.
[0094] To enable those skilled in the art to better understand the domain generalization pedestrian re-identification method based on frequency domain enhancement and dynamic convolution collaboration provided by this invention, the above steps are illustrated in detail below.
[0095] For example, the image to be identified can be a photograph or a video. The image can be acquired through surveillance video or a camera; the acquisition module is not specifically limited. For instance, the image to be identified could be a bright, high-definition image of a shopping mall interior, a backlit, blurred image of an intersection, or a low-resolution image of a subway entrance. The first convolution module can have a size of 16*16. In this embodiment of the invention, when the image to be identified is input into the target person recognition model, the image of the target person in the image to be identified can be obtained.
[0096] For example, in a specific processing procedure, as shown in Figure 6, before the image to be recognized is input into the first convolution module, it can be augmented by random cropping and rotation. The first convolution module then extracts features from the image to obtain the first feature image.
[0097] For example, after obtaining the first feature image, a second feature image can be extracted from the first feature image through a second convolution module, and frequency domain features can be extracted from the second feature image through a frequency domain feature extraction module, and statistical features can be extracted from the second feature image through a statistical feature module.
[0098] In one possible manner, the extraction of frequency domain features from the second feature image via the frequency domain feature extraction module includes:
[0099] The second feature image is mapped to the frequency domain by using a two-dimensional fast Fourier transform to obtain the complex spectral features in the frequency domain.
[0100] The real part of the complex spectral features in the frequency domain is processed by the first convolution branch to obtain a learnable filter for the real part;
[0101] The imaginary part of the complex spectral features in the frequency domain is processed by the second convolution branch to obtain a learnable filter for the imaginary part. The first and second convolution branches have the same layered structure: each consists of a first convolutional layer, a normalization and rectified linear activation layer, and a second convolutional layer connected in series. The first convolutional layer reduces the number of channels from C to C / 4, and the second convolutional layer restores it from C / 4 to C.
[0102] The real part learnable filter and the imaginary part learnable filter are combined to form a new complex feature. The new complex feature is then processed by the inverse fast Fourier transform to obtain spatial domain feature information.
[0103] The spatial domain feature information is processed by average pooling and convolution to obtain the channel weight vector;
[0104] Based on the channel weight vector, the second feature image is recalibrated to obtain the frequency domain features.
[0105] It should be understood that, in the embodiments of the present invention, the second convolution module, the frequency domain feature extraction module (FCR), the statistical feature module (GSE), and the fusion feature module together constitute the frequency domain reconstruction module (FRM) shown in Figure 7, which can restore features from the frequency domain back to the spatial domain. Figure 7 shows the network structure of the frequency domain reconstruction module. After the first feature image enters the frequency domain reconstruction module, the second feature image can be extracted from it through the second convolution module, whose convolution kernel size can be 3*3.
[0106] Specifically, the input first feature image $X_1$ is first passed through a Conv3×3 convolutional layer to obtain the second feature image $X_2$. Then, in the frequency domain feature extraction module, $X_2$ is processed by a two-dimensional fast Fourier transform (FFT) to obtain the frequency-domain complex spectral features, which can be expressed by the following formula.
[0107] $F = \mathrm{FFT}(X_2) = F_{re} + j\,F_{im}$;
[0108] where $F_{re}$ is the real part and $F_{im}$ the imaginary part of the frequency-domain complex spectral features $F \in \mathbb{C}^{B \times C \times W \times H}$, and $j$ is the imaginary unit; B is the number of samples in a batch of data, C is the number of feature channels, W is the width of the feature image, H is the height of the feature image, and FFT is the two-dimensional fast Fourier transform.
[0109] The real part $F_{re}$ and the imaginary part $F_{im}$ of the complex spectral features are input into the first and second convolution branches, respectively. Each branch uses two 1×1 convolutional layers in a bottleneck structure: the first convolutional layer reduces the number of channels from C to C / 4, and the second convolutional layer restores it from C / 4 to C. Combined with batch normalization (BN) and ReLU activation, this achieves learnable filtering in the frequency domain. The real-part learnable filter $\hat{F}_{re}$ is obtained through the first convolution branch and can be expressed by the following formula:
[0110] $\hat{F}_{re} = \mathrm{Conv}_{1\times1}(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{1\times1}(F_{re}))))$;
[0111] The imaginary-part learnable filter $\hat{F}_{im}$ is obtained through the second convolution branch and can be expressed by the following formula:
[0112] $\hat{F}_{im} = \mathrm{Conv}_{1\times1}(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{1\times1}(F_{im}))))$.
[0113] The enhanced real-part learnable filter $\hat{F}_{re}$ and the imaginary-part learnable filter $\hat{F}_{im}$ are then combined into new complex features, which are processed by the inverse fast Fourier transform to obtain the spatial domain feature information $X_s$. This can be expressed by the following formula:
[0114] $X_s = \mathrm{IFFT}(\mathrm{Complex}(\hat{F}_{re}, \hat{F}_{im}))$;
[0115] where $\mathrm{Complex}(\cdot,\cdot)$ reconstructs complex features from the real and imaginary parts, and IFFT is the inverse fast Fourier transform.
[0116] Then, average pooling and convolution can be performed on the spatial domain feature information to obtain the channel weight vector, which is expressed by the following formula:
[0117] ;
[0118] where the symbols in the formula denote, respectively, the channel weight vector, the sigmoid activation function, convolution with a 1×1 kernel, the rectified linear activation function, batch normalization, average pooling, and the spatial domain feature information.
[0119] Finally, the channel weight vector $w$ is broadcast over the spatial dimensions to perform channel-wise recalibration of the second feature image $X_2$, thereby strengthening the channels that are more sensitive to key frequency components and yielding the frequency domain features $F_{freq}$. The frequency domain features can be expressed by the following calculation formula:
[0120] $F_{freq} = X_2 \odot w$, where $\odot$ denotes element-wise multiplication.
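The frequency-domain steps in [0106] to [0120] can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the embodiment: the function names are hypothetical, batch normalization is replaced by plain batch-statistics normalization without learnable scale and shift, the channel gate uses a single 1×1 convolution, only the real part of the inverse transform is kept, and the array layout is taken as (B, C, H, W).

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution as a channel-mixing matrix: x is (B, C_in, H, W), w is (C_out, C_in)."""
    return np.einsum("oc,bchw->bohw", w, x)

def bottleneck_branch(x, w_down, w_up, eps=1e-5):
    """Conv1x1 (C -> C/4), batch-statistics normalisation, ReLU, Conv1x1 (C/4 -> C)."""
    h = conv1x1(x, w_down)
    mean = h.mean(axis=(0, 2, 3), keepdims=True)
    var = h.var(axis=(0, 2, 3), keepdims=True)
    h = np.maximum((h - mean) / np.sqrt(var + eps), 0.0)
    return conv1x1(h, w_up)

def frequency_recalibration(x2, real_w, imag_w, gate_w):
    """Return (recalibrated features, channel weights) for x2 of shape (B, C, H, W)."""
    spectrum = np.fft.fft2(x2, axes=(-2, -1))          # 2-D FFT over the spatial axes
    real = bottleneck_branch(spectrum.real, *real_w)   # first branch: real part
    imag = bottleneck_branch(spectrum.imag, *imag_w)   # second branch: imaginary part
    # Assumption: only the real part of the inverse transform is used downstream.
    spatial = np.fft.ifft2(real + 1j * imag, axes=(-2, -1)).real
    pooled = spatial.mean(axis=(2, 3), keepdims=True)  # average pooling -> (B, C, 1, 1)
    weights = 1.0 / (1.0 + np.exp(-conv1x1(pooled, gate_w)))  # sigmoid gate
    return x2 * weights, weights                       # weights broadcast over H and W

# Toy input with random (untrained) parameters, for shape checking only.
rng = np.random.default_rng(0)
B, C, H, W = 2, 8, 6, 4
x2 = rng.normal(size=(B, C, H, W))
real_w = (rng.normal(size=(C // 4, C)), rng.normal(size=(C, C // 4)))
imag_w = (rng.normal(size=(C // 4, C)), rng.normal(size=(C, C // 4)))
gate_w = rng.normal(size=(C, C))
freq_features, channel_weights = frequency_recalibration(x2, real_w, imag_w, gate_w)
print(freq_features.shape, channel_weights.shape)
```

Keeping the real and imaginary branches as separate parameter sets mirrors the two-branch structure described above; in a trained model these matrices would be learned rather than sampled.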
[0121] In one possible manner, extracting the statistical features from the second feature image through the statistical feature module includes:
[0122] Dividing the second feature image into multiple regional feature images according to a preset ratio, calculating the channel mean and variance of each regional feature image, and splicing the multiple channel means and variances to obtain a statistical description vector;
[0123] Performing non-linear mapping on the statistical description vector by a one-dimensional convolution method to obtain a statistical feature sequence;
[0124] Dynamically fusing the statistical feature sequence with the second feature image to obtain the statistical features.
[0125] It should be understood that, as shown in Figure 7, the statistical features in the second feature image can be extracted by statistical methods in the statistical feature module. The preset ratio is used to divide the second feature image into three regional feature images: a head region, a body region, and a leg region. The preset ratio can be 2:5:7; the embodiments of the present invention do not specifically limit it. According to this ratio, the second feature image is divided into three regions, and the channel mean and variance are calculated for each region.
[0126] The channel mean can be expressed by the following calculation formula:
[0127] $\mu_i = \mathrm{mean}(X_2[:, :, s_i:e_i, :],\ \mathrm{dim}=(2,3))$;
[0128] where mean is the mean operation; $X_2$ is the second feature image, whose dimensions are (B, C, H, W); ":" indicates taking all elements of a dimension; $s_i:e_i$ indicates taking the interval from $s_i$ to $e_i$ along the height dimension; $\mathrm{dim}=(2,3)$ indicates that the calculation is performed over the height and width dimensions; and $\mu_i$ is the channel mean of the i-th regional feature image.
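[0129] The variance can be expressed by the following formula:
[0130] $v_i = \mathrm{std}(X_2[:, :, s_i:e_i, :],\ \mathrm{dim}=(2,3))$;
[0131] where $v_i$ is the variance of the i-th regional feature image, std is the variance calculation (the standard-deviation operator), $X_2$ is the second feature image, and $s_i:e_i$ is the height interval of the i-th region.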
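As a rough illustration of the region statistics in [0125] to [0131], the sketch below splits a (B, C, H, W) feature map along the height axis in the 2:5:7 ratio and stacks the per-region channel means and standard deviations into a (B, 6, C) array, which matches the "6 channels, length C" layout that a Conv1d would consume. The helper names and the rounding rule for region boundaries are assumptions, not taken from the source.

```python
import numpy as np

def region_bounds(height, ratio=(2, 5, 7)):
    """Split `height` rows into consecutive (start, end) ranges proportional to `ratio`."""
    total = sum(ratio)
    edges, acc = [0], 0
    for part in ratio:
        acc += part
        edges.append(round(height * acc / total))  # rounding rule is an assumption
    return list(zip(edges[:-1], edges[1:]))

def region_statistics(x, ratio=(2, 5, 7)):
    """Return a (B, 2 * len(ratio), C) array: mean and std of each region, per channel."""
    stats = []
    for start, end in region_bounds(x.shape[2], ratio):
        region = x[:, :, start:end, :]            # X[:, :, s_i:e_i, :]
        stats.append(region.mean(axis=(2, 3)))    # channel mean over height and width
        stats.append(region.std(axis=(2, 3)))     # channel std over height and width
    return np.stack(stats, axis=1)

# Toy feature map: batch 1, 2 channels, height 14, width 3.
x = np.arange(1 * 2 * 14 * 3, dtype=float).reshape(1, 2, 14, 3)
stats = region_statistics(x)
print(region_bounds(14), stats.shape)
```

With a height of 14 the three regions cover rows 0 to 2, 2 to 7, and 7 to 14, so the head, body, and leg bands contain 2, 5, and 7 rows respectively.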
[0132] The means and variances of the three regions can then be concatenated to obtain the statistical description vector. A one-dimensional convolution is then used to perform nonlinear mapping, modeling the statistical relationships between the different regions and yielding the statistical feature sequence. The statistics are treated as a one-dimensional sequence of length C with 6 channels (the three means and the three variances), so Conv1d is used to explicitly model the relationship between statistic types and channels. The resulting statistical feature sequence is dynamically combined with the second feature image to obtain the statistical features.
[0133] The statistical characteristic sequence is expressed by the following formula:
[0134] ;
[0135] where the symbols in the formula denote, respectively, the statistical feature sequence, the statistical description vector, one-dimensional convolution, the rectified linear activation function, and batch normalization;
[0136] The statistical characteristics are expressed by the following formula:
[0137] ;
[0138] where the symbols in the formula denote, respectively: the statistical features; the second feature image; averaging over the channel dimension; element-wise multiplication; the sigmoid activation function; and an operation performed on the third dimension.
[0139] For example, after obtaining the frequency domain features and statistical features, the frequency domain features, statistical features, first feature image, and second feature image can be fused in the fusion feature module to obtain the first fused feature. The first fused feature is expressed by the following calculation formula:
[0140] ;
[0141] ;
[0142] where the symbols in the formulas denote, respectively: the spliced feature; the reshape operation; the tensor transpose operation; the frequency domain features; the feature splicing operation; the first fusion feature; the first feature image; the linear transformation; the three quantities obtained through three independent linear transformations; the number of channels; and the second feature image.
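The definitions above mention splicing, reshape and transpose operations, three independent linear transformations, and the number of channels, and claim 6 additionally mentions Softmax normalization. This combination suggests a scaled dot-product attention over the spliced features, although the formula itself is not reproduced in the text. The sketch below is therefore only one plausible reading: the function names, the choice to splice along the channel axis, and the omission of any residual connection to the first feature image are all assumptions.

```python
import numpy as np

def to_tokens(feature_maps):
    """Splice (B, C_i, H, W) maps on the channel axis, then reshape and transpose to (B, H*W, C)."""
    cat = np.concatenate(feature_maps, axis=1)
    b, c, h, w = cat.shape
    return cat.reshape(b, c, h * w).transpose(0, 2, 1)

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_fusion(tokens, wq, wk, wv):
    """Scaled dot-product self-attention; wq, wk, wv are three independent (C, C) linear maps."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(tokens.shape[-1]))
    return attn @ v, attn

# Toy stand-ins for the frequency domain features and the statistical features.
rng = np.random.default_rng(0)
f_freq = rng.normal(size=(2, 4, 3, 3))
f_stat = rng.normal(size=(2, 4, 3, 3))
tokens = to_tokens([f_freq, f_stat])                  # (2, 9, 8)
wq, wk, wv = (rng.normal(size=(8, 8)) for _ in range(3))
fused, attn = attention_fusion(tokens, wq, wk, wv)
print(fused.shape, attn.shape)
```

Dividing the scores by the square root of the channel count keeps the Softmax inputs in a moderate range as the number of channels grows.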
[0143] For example, a multi-scale dynamic spatial module can be a core module in deep learning used to simultaneously capture targets of different sizes, adaptively adjust spatial weights, and fuse multi-scale features. In an embodiment of the invention, the network structure of the multi-scale dynamic spatial module is shown in Figure 8. After the first fusion feature is obtained, it can be input into the multi-scale dynamic spatial module, which processes it to obtain the second fusion feature.
[0144] In one possible manner, the step of performing spatial adaptive fusion processing on the first fusion feature through a multi-scale dynamic spatial module to obtain the second fusion feature includes:
[0145] The global context vector is extracted from the first fusion feature by a global average pooling layer. The global context vector is then processed by a two-layer convolutional network to obtain the parameters of the dynamic convolutional kernel, which is then split into multiple dynamic convolutional kernels of different scales.
[0146] The average feature is obtained by averaging the first fused feature along the channel dimension.
[0147] Each dynamic convolution kernel is convolved with the average feature, and spatial attention features are calculated based on the convolution results.
[0148] Multiple spatial attention features are summed using a weighted method to obtain a dynamic spatial attention map;
[0149] Based on the dynamic spatial attention map, the first fusion feature is weighted to obtain the second fusion feature.
[0150] It should be understood that the multi-scale dynamic spatial module first extracts the global context vector over the spatial dimensions through a global average pooling layer. This vector is then processed by a small two-layer network of 1×1 convolutions, with the ReLU activation function in the intermediate layer, to generate the parameters $W_{dyn}$ of the dynamic convolution kernels.
[0151] $W_{dyn} = \mathrm{Conv}^{(2)}_{1\times1}(\mathrm{ReLU}(\mathrm{Conv}^{(1)}_{1\times1}(\mathrm{GAP}(F_1))))$;
[0152] where $W_{dyn}$ denotes the batch of convolution kernels dynamically generated from the input first fused feature $F_1$ (three convolution kernels for each sample); K=[1,3,5] represents the dynamic convolution kernels at different scales; $\mathrm{Conv}^{(1)}_{1\times1}$ is a 1×1 convolution used for channel compression and non-linear mapping; $\mathrm{Conv}^{(2)}_{1\times1}$ is the 1×1 convolution that generates the dynamic kernels; and GAP is global average pooling.
[0153] Subsequently, to further exploit the spatial structure information of the input features, the first fused feature is averaged along the channel dimension to obtain the average feature. Next, the generated kernel parameters are split into three dynamic convolution kernels of different sizes. For each sample, the average feature is convolved with the corresponding dynamically generated kernels to calculate the corresponding dynamic spatial attention maps. Finally, the spatial attention features of the dynamic convolution kernels of different sizes are combined by a weighted sum to obtain the final dynamic spatial attention map.
[0154] In one possible manner, the second fusion feature is expressed by the following calculation formula:
[0155] ;
[0156] where the symbols in the formula denote, respectively: the second fusion feature; the overall dynamic spatial attention map; the independent learnable parameters after the Softmax operation; the dynamic spatial attention map obtained by convolution with the i-th dynamic kernel; the i-th dynamic convolution kernel; the total number of dynamic convolution kernels; the average feature; the convolution operation; and element-wise multiplication of tensors.
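The multi-scale dynamic spatial module described in [0144] to [0156] can be sketched as below. This is a minimal illustration under several assumptions that are not stated in the source: the two 1×1 convolutions acting on the pooled vector are written as dense matrices (equivalent on a 1×1 map), each attention feature is obtained by applying a sigmoid to the convolution result, zero padding is used, and all names are hypothetical.

```python
import numpy as np

KERNEL_SIZES = (1, 3, 5)  # K = [1, 3, 5], i.e. 1 + 9 + 25 = 35 parameters per sample

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_same(img, kernel):
    """Cross-correlate a 2-D map with an odd-sized k x k kernel, zero padded to keep the size."""
    k = kernel.shape[0]
    padded = np.pad(img, k // 2)
    out = np.zeros_like(img)
    for i in range(k):
        for j in range(k):
            out += kernel[i, j] * padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out

def dynamic_spatial_attention(x, w1, w2, alpha):
    """x: (B, C, H, W). Returns (x weighted by the attention map, attention map of shape (B, H, W))."""
    batch, _, height, width = x.shape
    context = x.mean(axis=(2, 3))                   # global average pooling -> (B, C)
    params = np.maximum(context @ w1, 0.0) @ w2     # two-layer network with ReLU -> (B, 35)
    avg = x.mean(axis=1)                            # average along the channel axis -> (B, H, W)
    mix = np.exp(alpha - alpha.max())
    mix = mix / mix.sum()                           # Softmax over the learnable scale weights
    attention = np.zeros((batch, height, width))
    for b in range(batch):
        offset = 0
        for weight, k in zip(mix, KERNEL_SIZES):
            kernel = params[b, offset:offset + k * k].reshape(k, k)  # split out one dynamic kernel
            offset += k * k
            attention[b] += weight * sigmoid(conv2d_same(avg[b], kernel))  # sigmoid is an assumption
    return x * attention[:, None, :, :], attention

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 5, 6))
w1 = rng.normal(size=(4, 8))
w2 = rng.normal(size=(8, sum(k * k for k in KERNEL_SIZES)))
alpha = np.zeros(len(KERNEL_SIZES))                 # equal weights before any training
fused2, attention = dynamic_spatial_attention(x, w1, w2, alpha)
print(fused2.shape, attention.shape)
```

Because the kernels are generated per sample from the pooled context, two images in the same batch are filtered with different 1×1, 3×3, and 5×5 kernels, which is what distinguishes this from a fixed multi-scale convolution.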
[0157] For example, after obtaining the second fusion feature, the second fusion feature can be input into the image encoder to obtain the third image feature, and then the target person image in the image to be identified can be determined based on the third image feature.
[0158] In one possible manner, determining the target person image in the image to be identified based on third image features includes:
[0159] Obtain multiple person images;
[0160] Calculate the cosine similarity between the third image features and each person image to obtain multiple cosine similarities;
[0161] The image of the person with the highest cosine similarity is identified as the target person image in the image to be identified.
[0162] It should be understood that the target person recognition model also stores multiple person images. After obtaining the third image feature, the third image feature needs to be compared with each person image in turn to obtain the target person image. In this embodiment of the invention, the cosine similarity between each person image and the third image feature can be calculated in turn, and then the multiple cosine similarities obtained can be compared. The person image with the highest cosine similarity is determined as the target person image.
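The matching step can be illustrated with a short sketch. It assumes that each stored person image has already been encoded into a feature vector of the same length as the third image feature; the gallery contents and identity names below are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_match(query, gallery):
    """Return (identity, similarity) of the gallery entry most similar to `query`."""
    scores = {name: cosine_similarity(query, feat) for name, feat in gallery.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical two-dimensional features; real features would be much longer.
gallery = {"person_a": [1.0, 0.0], "person_b": [0.6, 0.8], "person_c": [0.0, 1.0]}
query = [0.5, 0.9]
identity, score = best_match(query, gallery)
print(identity, round(score, 3))
```

Cosine similarity ignores vector length, so the comparison depends only on the direction of the features and not on their overall magnitude.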
[0163] In one possible manner, the target person recognition model is trained using person images and pedestrian description text containing learnable variables. The target person recognition model includes a text encoder and, connected in sequence, the convolution modules, the frequency domain reconstruction module, the multi-scale dynamic space module, and the image encoder. The training method includes:
[0164] Setting the image encoder to an unfrozen state and pre-training it for a preset number of epochs to obtain a sub-person recognition model;
[0165] Setting both the image encoder and the text encoder to a frozen state, and training the pedestrian description text with learnable variables through the sub-person recognition model to obtain the target pedestrian description text with learnable variables;
[0166] Setting the text encoder to a frozen state and the image encoder to an unfrozen state, and training the image encoder in the sub-person recognition model using the target pedestrian description text with learnable variables to obtain the trained target person recognition model.
[0167] It should be understood that the training of the target person recognition model is divided into three stages. Figure 9 shows the staged process of the overall training.
[0168] In the first stage (stage 1) of training, the image encoder is set to an unfrozen state and trained. The unfrozen state means that the parameters of the image encoder are no longer fixed and can be updated during training. In this stage, the image encoder is trained for three epochs with two loss functions, $L_{ce}$ and $L_{tri}$, which can be expressed by the following calculation formulas.
[0169] $L_{ce} = -\sum_{k=1}^{K} y_k \log \hat{y}_k$;
[0170] where $y_k$ is the true probability that the input sample belongs to category $k$, $\hat{y}_k$ is the predicted probability that the input sample belongs to category $k$, $y$ is the true probability distribution, $\hat{y}$ is the predicted probability distribution output by the model, and $K$ is the total number of categories.
[0171] $L_{tri} = \max(d(a, p) - d(a, n) + m,\ 0)$;
[0172] where $a$, $p$, and $n$ denote an anchor sample, a positive sample, and a negative sample from a different identity, respectively; $d(\cdot,\cdot)$ measures the Euclidean distance between two samples; $m$ is the margin, which ensures that the distance of the negative pair exceeds the distance of the positive pair by at least this value; and $d(a, p)$ is the distance between samples $p$ and $a$.
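The two stage-1 losses can be written out as a small sketch. The function names are illustrative, and the margin of 0.3 used in the example is only a placeholder: the description states a value of 0.3 for the stage-3 margin but does not give the triplet margin.

```python
import math

def cross_entropy(true_dist, pred_dist, eps=1e-12):
    """-sum_k y_k * log(y_hat_k); eps guards against log(0)."""
    return -sum(y * math.log(p + eps) for y, p in zip(true_dist, pred_dist))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin):
    """max(d(a, p) - d(a, n) + margin, 0)."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)

ce = cross_entropy([0.0, 1.0, 0.0], [0.2, 0.5, 0.3])
easy = triplet_loss([0.0, 0.0], [3.0, 4.0], [6.0, 8.0], margin=0.3)  # negative is far away
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0], margin=0.3)  # negative is closer than positive
print(round(ce, 4), easy, round(hard, 4))
```

The loss is zero for the first triplet because the negative is already more than the margin farther from the anchor than the positive, and positive for the second because that ordering is violated.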
[0173] Then, in the second stage (stage 2) of training, both the image encoder and the text encoder are set to a frozen state, and the input pedestrian description text with learnable variables is trained. In Figure 9, GRL stands for Gradient Reversal Layer. It passes features through unchanged in the forward pass and reverses the gradient coming from the domain discriminator in the backward pass, so that the learned representation does not reveal which domain a feature comes from and the model ultimately learns "cross-domain invariant" features. The frozen state means that the parameters of the text encoder and the image encoder are not updated during model training. The pedestrian description text with learnable variables is a text representation, constructed from a set of trainable parameters, that describes pedestrian appearance features. In the specific training process, the first 120 epochs are used to train the domain-independent text, and the last 30 epochs are used to train the domain-dependent text. During backpropagation, the gradient reversal layer reverses the gradients to suppress domain-specific information in the text tokens. The loss function $L_{dom}$ is used as the constraint, and the specific formula is as follows:
[0174] $L_{dom} = -\sum_{i=1}^{M} y_i \log p_i$;
[0175] where $M$ is the number of domain categories, $y_i$ is the real label, and $p_i$ is the predicted probability of the $i$-th category. By training the pedestrian description text with learnable variables in the second stage, the target pedestrian description text with learnable variables is obtained.
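To make the role of the gradient reversal layer concrete, the following sketch writes its forward and backward passes by hand, without an automatic differentiation library. The class name and the scaling factor `lam` are hypothetical; the source does not state whether the reversed gradient is scaled.

```python
import math

class GradientReversal:
    """Identity in the forward pass; multiplies incoming gradients by -lam in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # hypothetical scaling factor, not given in the source

    def forward(self, features):
        return list(features)

    def backward(self, grad_output):
        return [-self.lam * g for g in grad_output]

def domain_cross_entropy(one_hot_label, predicted_probs, eps=1e-12):
    """-sum_i y_i * log(p_i) over the domain categories."""
    return -sum(y * math.log(p + eps) for y, p in zip(one_hot_label, predicted_probs))

grl = GradientReversal(lam=1.0)
features = [0.2, -1.5, 3.0]
forwarded = grl.forward(features)               # unchanged on the way to the domain discriminator
reversed_grad = grl.backward([0.1, -0.4, 0.0])  # sign flipped on the way back
loss = domain_cross_entropy([0, 0, 1], [0.1, 0.2, 0.7])
print(forwarded, reversed_grad, round(loss, 4))
```

The discriminator itself still minimizes the domain loss, while everything upstream of the layer receives the negated gradient and is therefore pushed toward features that make the domains harder to tell apart.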
[0176] Then, in the third stage (stage 3), the text encoder is frozen and the image encoder is unfrozen, and the image encoder is trained again using the target pedestrian description text with learnable variables, finally yielding the trained target person recognition model.
[0177] Specifically, this stage lasts 30 epochs, during which the image encoder is trained using the domain-relevant and domain-independent text. The specific expression of the loss function is as follows:
[0178] ;
[0179] where the symbols in the formula denote, respectively: the function that calculates the similarity between samples; the output features of the image encoder; the prompt output features that do not contain domain-specific markers; the prompt output features that contain domain-specific markers; and the margin parameter, which is set to 0.3.
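The three training stages can be summarized as a small schedule table. The sketch below only records which parts are updated and for how long; the field names are hypothetical, and two entries are assumptions because the text does not state them explicitly: the text encoder is treated as frozen in stage 1, and the learned description text is treated as fixed in stage 3.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    epochs: int
    train_image_encoder: bool
    train_text_encoder: bool
    train_prompt: bool

SCHEDULE = (
    # Stage 1: image encoder unfrozen for three epochs (text encoder state is an assumption).
    Stage("stage 1: image encoder pre-training", 3, True, False, False),
    # Stage 2: both encoders frozen; 120 domain-independent + 30 domain-dependent epochs.
    Stage("stage 2: description text learning", 120 + 30, False, False, True),
    # Stage 3: text encoder frozen, image encoder unfrozen (fixed text is an assumption).
    Stage("stage 3: image encoder training", 30, True, False, False),
)

def trainable_parts(stage):
    """List the components whose parameters are updated in the given stage."""
    flags = {
        "image_encoder": stage.train_image_encoder,
        "text_encoder": stage.train_text_encoder,
        "prompt": stage.train_prompt,
    }
    return [name for name, enabled in flags.items() if enabled]

for stage in SCHEDULE:
    print(stage.name, stage.epochs, trainable_parts(stage))
```

In a deep learning framework, each flag would typically translate into enabling or disabling gradient updates for the corresponding parameter group at the start of the stage.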
[0180] In actual processing, the algorithm flow is shown in Figure 6, and the specific operation steps are as follows.
[0181] Data preprocessing: The input images are randomly cropped and rotated for data augmentation to obtain the images to be identified.
[0182] Feature extraction: The features of the image to be identified are extracted by the convolution module to obtain the first feature image.
[0183] Frequency domain feature processing: The first feature image is convolved to obtain the second feature image, and the frequency domain features in the second feature image are extracted by the frequency domain feature extraction module of the frequency domain reconstruction module.
[0184] Statistical feature processing: Statistical features in the second feature image are extracted through the statistical feature module in the frequency domain reconstruction module.
[0185] Feature fusion: The frequency domain features, statistical features, second feature images and first feature images are dynamically fused to obtain the first fused feature.
[0186] Multi-scale spatial feature processing: The first fused feature is put into the multi-scale dynamic spatial module (MDSM) for spatial adaptive fusion to obtain the second fused feature.
[0187] Image Encoder: The second fused feature is input into the image encoder, which is a pre-trained ViT-16 model, to obtain the third image feature. Simultaneously, the model parameters are updated through a loss function. The ViT-16 model is a pre-trained image classification model based on the Transformer architecture. It abandons the core logic of CNNs (Convolutional Neural Networks) that relies on convolutional kernels to extract local features, instead using the Transformer's self-attention mechanism to directly model global image features. It segments the image into fixed-size image patches, converts them into sequential data, and inputs them into the Transformer encoder for feature learning, ultimately achieving tasks such as image classification and feature extraction. "16" represents the model's core parameter: the image patch size is 16×16 pixels.
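The patch segmentation described for the ViT-16 image encoder can be illustrated as follows. The sketch only shows how an image is cut into 16×16 patches and flattened into a sequence; the 256×128 input size is a hypothetical example, and the subsequent linear projection and Transformer encoder are omitted.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into a (num_patches, patch * patch * C) sequence, row by row."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be multiples of the patch size")
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((256, 128, 3))   # hypothetical input size
tokens = patchify(image)
print(tokens.shape)
```

A 256×128 image yields 16 × 8 = 128 patches, each flattened to 16 × 16 × 3 = 768 values, and it is this sequence that the self-attention layers operate on.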
[0188] Matching: The target person image is obtained by comparing the third image feature with the stored person images.
[0189] The experiment was conducted using this method, and the specific comparison methods and results are as follows.
[0190] Experiments were conducted on multi-source domain-generalized pedestrian re-identification datasets, following three protocols to evaluate the model's generalization ability across multiple domains. In Protocol 1, the model was trained on a combination of datasets: Market1501 (a benchmark dataset containing 1501 pedestrian identities from a "market / campus square" scenario, abbreviated as M), CUHK02 (the Chinese University of Hong Kong Pedestrian Re-identification Dataset No. 02, abbreviated as C2), CUHK03 (the Chinese University of Hong Kong Pedestrian Re-identification Dataset No. 03, abbreviated as C3), and CUHK-SYSU (the Chinese University of Hong Kong-Sun Yat-sen University Joint Pedestrian Search Dataset, abbreviated as CS). The trained model was then tested on four independent datasets to evaluate its generalization ability to unseen domains: PRID (Person Re-Identification, an image dataset for person re-identification research), GRID (the QMUL underGround Re-IDentification dataset), VIPeR (Viewpoint Invariant Pedestrian Recognition), and iLIDs (i-LIDS, the Imagery Library for Intelligent Detection Systems). Protocol 2 employs a single-domain testing approach, reserving one dataset for testing and using the remaining datasets for training. This approach helps show how a model trained on multiple sources performs when tested on a single unseen domain. Protocol 3 is very similar to Protocol 2; the main difference is whether both the training and testing data of the source domains are used to train the model. These protocols provide a framework for evaluating the model's generalization ability across different domains. Note that all ablation studies were conducted under Protocol 2. To ensure the validity of the experiment, all experimental parameters were kept consistent.
Baseline* represents the baseline experimental data under the condition that all experimental parameters were kept consistent.
[0191] Table 1 Comparison with state-of-the-art methods under Protocol 1.
[0192] Protocol 1: As shown in Table 1, the method of this invention improves significantly on Baseline*. Specifically, it improves mAP by an average of 2.0% and R1 accuracy by 3.2%. The improvement is particularly marked on the GRID dataset, where mAP increases by 3.5% and R1 accuracy by 4.8%. This further demonstrates the effectiveness of the method of this invention in standard evaluation scenarios. Here, R1 denotes Rank-1 accuracy, the proportion of queries whose top-ranked match in the unseen domain is correct; mAP (mean average precision) reflects the overall retrieval ranking quality of the model in the unseen domain.
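For readers unfamiliar with the two metrics, the sketch below computes Rank-1 accuracy and mAP from ranked gallery lists. It is a simplified illustration with made-up identities; common re-identification evaluation code also removes gallery images taken by the same camera as the query, which is omitted here.

```python
def average_precision(ranked_ids, query_id):
    """Mean of the precision values at each rank where a correct match appears."""
    hits, precision_sum = 0, 0.0
    for rank, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def evaluate(rankings):
    """rankings: list of (query identity, gallery identities sorted by decreasing similarity)."""
    rank1 = sum(ranked[0] == query for query, ranked in rankings) / len(rankings)
    mean_ap = sum(average_precision(ranked, query) for query, ranked in rankings) / len(rankings)
    return rank1, mean_ap

rankings = [
    ("a", ["a", "b", "a"]),   # correct at ranks 1 and 3
    ("b", ["a", "b", "c"]),   # first correct match only at rank 2
]
rank1, mean_ap = evaluate(rankings)
print(rank1, round(mean_ap, 4))
```

Rank-1 only looks at the top result of each query, whereas mAP rewards placing every correct gallery image near the top, which is why the two numbers can move differently between methods.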
[0193] Table 2 Comparison with state-of-the-art methods under Protocol 2.
[0194] Table 3 Comparison with state-of-the-art methods under Protocol 3.
[0195] Protocols 2 and 3: Tables 2 and 3 show that the method of this invention outperforms the other methods in terms of average mAP and average R1. Under Protocols 2 and 3, it improves the average mAP by 1.1% compared to Baseline*; in particular, for M+MS+CS→C3 under Protocol 3, mAP and R1 are both 1.7% higher than Baseline*, demonstrating its effectiveness. These experimental results show that the method of this invention significantly improves the generalization performance of the model, achieves state-of-the-art retrieval accuracy under various conditions, and demonstrates strong competitiveness in DG ReID. In Tables 1-3, QAConv50 stands for Query-Adaptive Convolution, a query-adaptive convolution method, where 50 indicates that ResNet-50 is used as the backbone network. M3L stands for Memory-based Multi-Source Meta-Learning. MetaBIN stands for Meta Batch-Instance Normalization. META stands for Mimic Embedding via adapTive Aggregation. ACL stands for Adaptive Cross-domain Learning. ReFID stands for Reciprocal Frequency-aware Generalizable Person Re-identification via Decomposition and Filtering. LDA stands for Latent Distribution Alignment. BAU stands for Balancing Alignment and Uniformity. Baseline* stands for CLIP-FGDI: Exploiting Vision-Language Model for Generalizable Person Re-Identification, a baseline model that uses a vision-language model for generalizable person re-identification. ReNorm stands for Rethinking Normalization Layers for Domain Generalizable Person Re-identification. Ours refers to the domain-generalized person re-identification method based on the collaboration of frequency domain enhancement and dynamic convolution in this embodiment of the invention. Average represents the average value.
[0196] Table 4 Ablation experiments.
[0197] To evaluate the effectiveness of the proposed method, this embodiment compares the performance of each module against the Baseline* model; the results are shown in Table 4. The FRM and MDSM modules demonstrate the effectiveness of frequency-domain and statistical feature extraction and of dynamic convolution kernels, respectively. Specifically, adding the FRM module alone and the MDSM module alone resulted in mAP of 82.6% and 82.9%, respectively. Furthermore, cascading the two modules significantly enhances the model's generalization and robustness, achieving mAP and R1 of 83.6% and 77.1%, respectively.
[0198] The above technical solution improves the recognition accuracy of domain-generalized person re-identification. By integrating a frequency domain reconstruction module and a multi-scale dynamic space module, the shortcomings of the CLIP-ReID architecture in feature extraction, fusion, and low-level feature capture are specifically addressed. This significantly enhances the detail perception, feature representation, and adaptability to complex scenarios of the domain-generalized person re-identification algorithm, effectively improving the model's accuracy and robustness in cross-domain and complex environments. Experimental data shows that compared to existing baseline models, this solution improves average mAP by 2.0% and R1 accuracy by 3.2%, with particularly outstanding performance advantages in complex scenarios. It can serve as a technical reference for intelligent law enforcement assistance in public security, providing efficient and accurate technical support for suspect tracking, checkpoint deployment, and other operations.
[0199] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention.
Claims
1. A domain-generalized person re-identification method based on frequency domain enhancement and dynamic convolution collaboration, characterized in that it comprises: acquiring the image to be identified and inputting it into the target person recognition model to obtain the target person image of the image to be identified. The target person recognition model includes a convolution module, a frequency domain reconstruction module, a multi-scale dynamic space module and an image encoder connected in sequence. The frequency domain reconstruction module includes a frequency domain feature extraction module, a statistical feature module and a fusion feature module. The processing of the image to be identified by the target person recognition model includes: The first feature image is obtained by extracting features from the image to be identified through the first convolution module; The second feature image is extracted from the first feature image by the second convolution module, frequency domain features are extracted from the second feature image by the frequency domain feature extraction module, and statistical features are extracted from the second feature image by the statistical feature module, wherein the first convolution module and the second convolution module have different sizes; The frequency domain features, statistical features, first feature image, and second feature image are fused by the fusion feature module to obtain the first fusion feature; The first fusion feature is spatially adaptively fused using the multi-scale dynamic space module to obtain the second fusion feature; The second fusion feature is input into the image encoder to obtain the third image feature; Based on the third image feature, the target person image in the image to be identified is determined.
2. The method for domain generalization pedestrian re-identification based on frequency domain enhancement and dynamic convolution collaboration as described in claim 1, characterized in that, The step of extracting frequency domain features from the second feature image through the frequency domain feature extraction module includes: The second feature image is mapped to the frequency domain by using a two-dimensional fast Fourier transform to obtain the complex spectral features in the frequency domain. The real part of the complex spectral features in the frequency domain is processed by the first convolution branch to obtain a learnable filter for the real part; The imaginary part of the complex spectral features in the frequency domain is processed by the second convolution branch to obtain a learnable filter for the imaginary part. The first and second convolution branches have the same layered processing structure, and each has a first convolutional layer, a normalization and linear function activation layer and a second convolutional layer connected in series. The first convolutional layer is used to transform channel C into channel C / 4, and the second convolutional layer is used to transform channel C / 4 into C. The real part learnable filter and the imaginary part learnable filter are combined to form a new complex feature. The new complex feature is then processed by the inverse fast Fourier transform to obtain spatial domain feature information. The spatial domain feature information is processed by average pooling and convolution to obtain the channel weight vector; Based on the channel weight vector, the second feature image is recalibrated to obtain the frequency domain features.
3. The domain generalization pedestrian re-identification method based on frequency domain enhancement and dynamic convolution collaboration as described in claim 2, characterized in that the channel weight vector is expressed by the following formula: ; where the symbols in the formula denote, respectively, the channel weight vector, the sigmoid activation function, convolution with a 1×1 kernel, the rectified linear activation function, batch normalization, adaptive average pooling, and the spatial domain feature information.
4. The domain generalization pedestrian re-identification method based on frequency domain enhancement and dynamic convolution collaboration according to claim 1, characterized in that the step of extracting statistical features from the second feature image through the statistical feature module includes: dividing the second feature image into multiple regional feature images according to a preset ratio; calculating the channel mean and variance of each regional feature image; concatenating the channel means and variances of the multiple regional feature images to obtain a statistical description vector; applying a nonlinear mapping to the statistical description vector using one-dimensional convolution to obtain a statistical feature sequence; and dynamically fusing the statistical feature sequence with the second feature image to obtain the statistical features.
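The steps of claim 4 can be illustrated with the NumPy sketch below. Several choices are assumptions made only for illustration: the regions are equal-height horizontal stripes, the one-dimensional convolution uses one kernel shared across channels (computed as a cross-correlation, as is usual in deep learning), batch normalization is omitted, and the dynamic fusion is a per-channel sigmoid gate. The function names are hypothetical.

```python
import numpy as np

def region_statistics(x, num_regions=3):
    """x: (C, H, W). Returns per-channel means and variances, shape (C, 2 * num_regions)."""
    # Assumption: regions are horizontal stripes of (nearly) equal height.
    stripes = np.array_split(x, num_regions, axis=1)
    means = [s.mean(axis=(1, 2)) for s in stripes]
    variances = [s.var(axis=(1, 2)) for s in stripes]
    return np.stack(means + variances, axis=1)

def conv1d_same(seq, kernel):
    """Cross-correlate each row of seq (C, L) with a shared 1D kernel, keeping length L."""
    return np.stack([np.correlate(row, kernel, mode="same") for row in seq])

def statistical_features(x, kernel, num_regions=3):
    desc = region_statistics(x, num_regions)            # statistical description vector
    seq = np.maximum(conv1d_same(desc, kernel), 0.0)    # 1D conv + ReLU (normalization omitted)
    # Assumption: fuse by gating each channel with the sigmoid of its mean response.
    gate = 1.0 / (1.0 + np.exp(-seq.mean(axis=1)))      # (C,)
    return x * gate[:, None, None]
```

For a constant feature map every regional variance is zero, so the description vector contains only the regional means followed by zeros.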
5. The domain generalization pedestrian re-identification method based on frequency domain enhancement and dynamic convolution collaboration according to claim 4, characterized in that the statistical feature sequence is expressed by the following formula: [formula not reproduced]; where the terms of the formula denote, in order: the statistical feature sequence; the statistical description vector; one-dimensional convolution; the rectified linear activation function; and batch normalization; and the statistical features are expressed by the following formula: [formula not reproduced]; where the terms of the formula denote, in order: the statistical features; the second feature image; averaging over the channel dimension; element-wise multiplication; the sigmoid activation function; and an operation applied to the third dimension.
6. The domain generalization pedestrian re-identification method based on frequency domain enhancement and dynamic convolution collaboration according to claim 1, characterized in that the first fusion feature is expressed by the following calculation formulas: [formula not reproduced]; [formula not reproduced]; where the terms of the formulas denote, in order: the spliced features; the reshape operation; the tensor transpose operation; the frequency domain features; the feature splicing operation; the first fusion feature; the first feature image; the linear transformation; three quantities obtained through three independent linear transformations; the number of channels; the second feature image; and Softmax, which is used for normalization.
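The terms listed in claim 6 (reshape, transpose, splicing, three independent linear transformations, the number of channels, and Softmax) are consistent with a scaled dot-product attention over spliced feature tokens. The sketch below shows that general pattern only. Which features are spliced and how the first and second feature images enter the result cannot be recovered from the claim text, so `feat_a`, `feat_b`, and the weight matrices `w_q`, `w_k`, `w_v` are hypothetical placeholders, and scaling by the square root of the channel count is an assumption.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def to_tokens(feat):
    """Reshape (C, H, W) to (C, H*W), then transpose to (H*W, C) tokens."""
    c, h, w = feat.shape
    return feat.reshape(c, h * w).T

def attention_fuse(feat_a, feat_b, w_q, w_k, w_v):
    """Splice two feature maps as tokens and mix them with self-attention."""
    tokens = np.concatenate([to_tokens(feat_a), to_tokens(feat_b)], axis=0)  # (2*H*W, C)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v   # three independent linear maps
    c = tokens.shape[1]
    attn = softmax(q @ k.T / np.sqrt(c), axis=-1)        # each row sums to 1
    return attn @ v, attn
```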
7. The domain generalization pedestrian re-identification method based on frequency domain enhancement and dynamic convolution collaboration according to claim 1, characterized in that the step of performing spatially adaptive fusion on the first fusion feature using the multi-scale dynamic spatial module to obtain the second fusion feature includes: extracting a global context vector from the first fusion feature through a global average pooling layer; processing the global context vector through a two-layer convolutional network to obtain dynamic convolution kernel parameters, and splitting the parameters into multiple dynamic convolution kernels of different scales; averaging the first fusion feature along the channel dimension to obtain an average feature; convolving each dynamic convolution kernel with the average feature, and calculating a spatial attention feature from each convolution result; computing a weighted sum of the multiple spatial attention features to obtain a dynamic spatial attention map; and weighting the first fusion feature based on the dynamic spatial attention map to obtain the second fusion feature.
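The steps of claim 7 can be sketched in NumPy as follows. This is an illustration under assumptions rather than the claimed implementation: the kernel sizes 3, 5 and 7 are hypothetical, the two-layer convolutional network is written as two matrix products (equivalent to 1×1 convolutions on a pooled vector), each spatial attention feature is the sigmoid of the convolution result, and the mixing weights come from a softmax over the hypothetical learnable vector `scale_logits`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def conv2d_same(img, kernel):
    """Zero-padded cross-correlation of an (H, W) map with an odd (k, k) kernel."""
    k = kernel.shape[0]
    padded = np.pad(img, k // 2)
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k))  # (H, W, k, k)
    return np.einsum("hwij,ij->hw", windows, kernel)

def dynamic_spatial_attention(x, w1, w2, scale_logits, sizes=(3, 5, 7)):
    """x: (C, H, W) first fusion feature. Returns the weighted feature and attention map."""
    context = x.mean(axis=(1, 2))                    # global context vector, (C,)
    hidden = np.maximum(w1 @ context, 0.0)           # layer 1 + ReLU
    flat = w2 @ hidden                               # all kernel parameters, flattened
    offsets = np.cumsum([k * k for k in sizes])[:-1]
    kernels = [p.reshape(k, k) for p, k in zip(np.split(flat, offsets), sizes)]
    avg = x.mean(axis=0)                             # average feature over channels, (H, W)
    maps = [sigmoid(conv2d_same(avg, kern)) for kern in kernels]
    alphas = softmax(scale_logits)                   # normalized mixing weights
    attn = sum(a * m for a, m in zip(alphas, maps))  # dynamic spatial attention map
    return x * attn[None], attn
```

Because the kernels are generated from the input's own context vector, the spatial filters change from image to image, while `scale_logits` stays fixed after training.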
8. The domain generalization pedestrian re-identification method based on frequency domain enhancement and dynamic convolution collaboration according to claim 7, characterized in that the second fusion feature is expressed by the following calculation formula: [formula not reproduced]; where the terms of the formula denote, in order: the second fusion feature; the overall dynamic spatial attention map; independent learnable parameters normalized by the Softmax operation; the dynamic spatial attention map obtained by convolution with the i-th dynamic convolution kernel; the i-th dynamic convolution kernel; the total number of dynamic convolution kernels; the average feature; the convolution operation; and element-wise multiplication.
9. The domain generalization pedestrian re-identification method based on frequency domain enhancement and dynamic convolution collaboration according to claim 1, characterized in that the step of determining the target person image in the image to be identified based on the third image features includes: determining multiple person images; calculating the cosine similarity between the third image features and each person image to obtain multiple cosine similarities; and taking the person image with the highest cosine similarity as the target person image in the image to be identified.
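The matching step of claim 9 amounts to a nearest-neighbor search under cosine similarity. A minimal sketch, assuming each person image is already represented by a feature vector of the same length as the third image features (the function name `match_target` is hypothetical):

```python
import numpy as np

def match_target(query, gallery):
    """query: (D,) feature vector; gallery: (N, D) person image features.

    Returns the index of the most similar gallery row and all cosine similarities.
    """
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q                       # cosine similarity with each person image
    return int(np.argmax(sims)), sims
```

Because both sides are L2-normalized, the result depends only on the direction of the feature vectors, not on their magnitude.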
10. The domain generalization pedestrian re-identification method based on frequency domain enhancement and dynamic convolution collaboration according to any one of claims 1-9, characterized in that the target person recognition model is trained using person images and pedestrian description text containing learnable variables; the model includes a text encoder and, connected in series, a convolutional module, a frequency domain reconstruction module, a multi-scale dynamic space module, and an image encoder; and the training method includes: setting the image encoder to an unfrozen state and pre-training the image encoder for a preset period to obtain a sub-person recognition model; setting both the image encoder and the text encoder to a frozen state, and training the pedestrian description text containing learnable variables through the sub-person recognition model to obtain pedestrian description text containing target learnable variables; and setting the text encoder to a frozen state and the image encoder to an unfrozen state, and training the image encoder in the sub-person recognition model using the pedestrian description text containing target learnable variables to obtain the trained target person recognition model.
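The three training stages of claim 10 differ only in which components receive gradient updates, which can be written down as a small schedule. The sketch below is illustrative: the stage names and the `Stage` type are hypothetical, the claim does not state the text encoder's status in the first stage (it is assumed frozen here), and the learned text variables are assumed fixed in the third stage.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    image_encoder_trainable: bool
    text_encoder_trainable: bool
    prompt_trainable: bool  # the learnable variables in the description text

SCHEDULE = [
    # Stage 1: pre-train the unfrozen image encoder (text encoder status assumed frozen).
    Stage("pretrain_image_encoder", True, False, False),
    # Stage 2: both encoders frozen; only the learnable text variables are updated.
    Stage("learn_text_variables", False, False, True),
    # Stage 3: text encoder frozen, image encoder unfrozen and trained with the learned text.
    Stage("finetune_image_encoder", True, False, False),
]

def trainable_parts(stage):
    """List the components that are updated during a given stage."""
    flags = {
        "image_encoder": stage.image_encoder_trainable,
        "text_encoder": stage.text_encoder_trainable,
        "prompt": stage.prompt_trainable,
    }
    return [name for name, on in flags.items() if on]
```

In a deep learning framework, each flag would typically be applied by enabling or disabling gradient tracking on the corresponding parameters before the stage begins.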