Pedestrian re-identification method and device, computer device and readable storage medium

By extracting features from pedestrian images and fusing global and local features, the problem of insufficient accuracy in pedestrian re-identification is solved, achieving stronger representation capabilities and recognition accuracy.

CN122244784A · Pending · Publication Date: 2026-06-19 · GUANGZHOU POWER SUPPLY BUREAU GUANGDONG POWER GRID CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
GUANGZHOU POWER SUPPLY BUREAU GUANGDONG POWER GRID CO LTD
Filing Date
2026-02-25
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In pedestrian re-identification tasks, the accuracy of pedestrian appearance recognition is insufficient due to factors such as perspective, posture, lighting conditions, and complex backgrounds. It is especially difficult to distinguish similar pedestrians in low-to-medium resolution images and when the perspective changes.

Method used

By extracting features from the target image and obtaining the input feature map, spatial partitioning and multi-head self-attention processing are performed to obtain global features. Convolutional dimensionality reduction, global average pooling, and channel recalibration are then performed to obtain local features. Finally, the global and local features are fused, and the model is trained using the training sample set and distance matrix to improve recognition accuracy.

Benefits of technology

It achieves the coordinated output of global and local information in pedestrian re-identification tasks, improving the accuracy and representation ability of pedestrian re-identification and effectively distinguishing similar pedestrians.

✦ Generated by Eureka AI based on patent content.


Abstract

This application relates to a pedestrian re-identification method, apparatus, computer device, and readable storage medium, belonging to the field of computer vision technology. The method includes: acquiring a target image and performing feature extraction processing on the target image to obtain an input feature map; spatially partitioning the input feature map to obtain multiple spatial locations, and performing multi-head self-attention processing on the multiple spatial locations to obtain global features of the target object in the target image; performing convolutional dimensionality reduction, global average pooling, and channel recalibration processing on the input feature map to obtain local features of the target object in the target image; fusing the global features and the local features to obtain fused features of the target image; and performing pedestrian re-identification processing based on the fused features. This method can improve the accuracy of pedestrian re-identification.

Description

Technical Field

[0001] This application relates to the field of computer vision technology, and in particular to a pedestrian re-identification method, a pedestrian re-identification device, a computer device, a computer-readable storage medium, and a computer program product.

Background Technology

[0002] Pedestrian re-identification aims to match images of the same pedestrian from a large-scale image database, utilizing computer vision technology to identify the same pedestrian across camera viewpoints, locations, and time periods. Due to its enormous application potential in areas such as intelligent security, public safety monitoring, and behavior analysis, this task has received widespread attention in the field of computer vision. However, pedestrian re-identification still faces many challenges, primarily stemming from significant variations in pedestrian appearance due to factors such as viewpoint, posture, lighting conditions, and complex backgrounds.

[0003] Therefore, the accuracy of pedestrian re-identification needs to be improved.

Summary of the Invention

[0004] In response to the above technical problem, it is necessary to provide a pedestrian re-identification method, pedestrian re-identification device, computer device, computer-readable storage medium, and computer program product that can improve the accuracy of pedestrian re-identification.

[0005] Firstly, this application provides a pedestrian re-identification method, including:

[0006] Acquire the target image and perform feature extraction processing on the target image to obtain the input feature map;

[0007] The input feature map is spatially partitioned to obtain multiple spatial locations, and multi-head self-attention processing is performed on these multiple spatial locations to obtain the global features of the target object in the target image.

[0008] The input feature map is subjected to convolutional dimensionality reduction, global average pooling, and channel recalibration to obtain the local features of the target object in the target image.

[0009] Global and local features are fused to obtain the fused features of the target image;

[0010] Perform pedestrian re-identification processing based on the fused features.

[0011] In some embodiments, the input feature map is spatially partitioned to obtain multiple spatial locations, and multi-head self-attention processing is performed on these multiple spatial locations to obtain global features of the target object in the target image, including:

[0012] Perform a 1×1 convolution on the input feature map to obtain the dimension-reduced feature map;

[0013] Multi-head attention computation is performed on the dimensionality-reduced feature map to obtain global attention weights;

[0014] The global attention weights are multiplied by the dimensionality-reduced feature map to obtain the global interaction feature map;

[0015] Perform a 1×1 convolution on the global interaction feature map to obtain the global features of the target object.

[0016] In some embodiments, the input feature map is subjected to convolutional dimensionality reduction, global average pooling, and channel recalibration to obtain local features of the target object in the target image, including:

[0017] Perform a 1×1 convolution on the input feature map to obtain the dimension-reduced feature map;

[0018] Global average pooling is performed on the dimensionality-reduced feature map to obtain channel descriptors;

[0019] The channel descriptor is modeled using two fully connected layers to obtain local channel attention weights.

[0020] The local channel attention weights are multiplied with the dimensionality-reduced feature map to obtain the local interaction feature map;

[0021] The local interactive feature map is processed by 1×1 convolution to obtain the local features of the target object.
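The local branch in [0017] to [0021] can be sketched as follows. This is a minimal NumPy illustration, not the applicant's implementation: the function and weight names are hypothetical, the ReLU and sigmoid activations between and after the two fully connected layers are assumptions (the text only says "two fully connected layers"), and restoring the original channel count in the last 1×1 convolution is assumed by analogy with the global branch.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution as a per-position linear map over channels.
    x: (C_in, H, W), weight: (C_out, C_in) -> (C_out, H, W)."""
    return np.einsum("oc,chw->ohw", weight, x)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_branch(x, w_reduce, w_fc1, w_fc2, w_restore):
    reduced = conv1x1(x, w_reduce)                  # [0017] (C_r, H, W)
    descriptor = reduced.mean(axis=(1, 2))          # [0018] global average pooling -> (C_r,)
    hidden = np.maximum(w_fc1 @ descriptor, 0.0)    # [0019] first FC layer, ReLU assumed
    weights = sigmoid(w_fc2 @ hidden)               # [0019] second FC layer, sigmoid assumed
    recalibrated = reduced * weights[:, None, None] # [0020] channel recalibration
    local_features = conv1x1(recalibrated, w_restore)  # [0021] final 1x1 convolution
    return local_features, weights

# Toy sizes chosen for illustration only.
rng = np.random.default_rng(0)
C, C_r, H, W = 16, 4, 6, 3
x = rng.standard_normal((C, H, W))
local_feat, channel_weights = local_branch(
    x,
    rng.standard_normal((C_r, C)) * 0.1,   # reduce C -> C_r
    rng.standard_normal((2, C_r)),         # FC 1 (bottleneck width 2 is arbitrary)
    rng.standard_normal((C_r, 2)),         # FC 2
    rng.standard_normal((C, C_r)) * 0.1,   # restore C_r -> C (assumption)
)
```

Each entry of `channel_weights` lies strictly between 0 and 1 under the sigmoid assumption, so the recalibration step scales each channel of the reduced map rather than replacing it.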

[0022] In some embodiments, the global attention weights are multiplied by the dimensionality-reduced feature map to obtain a global interaction feature map, including:

[0023] The local channel attention weights generated from the local features of the target object in the target image are reshaped to obtain the reshaped local channel attention weights;

[0024] The global attention weights are multiplied by the dimensionality-reduced feature map to obtain the initial global interaction feature map;

[0025] The local channel attention weights after the reshaping operation are added to the initial global interaction feature map to obtain the global interaction feature map.
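The addition in [0025] can be illustrated with broadcasting. The text does not state the tensor shapes, so the sketch below assumes the initial global interaction feature map has shape (C_r, H, W) and the local channel attention weights have shape (C_r,); under that assumption the "reshaping operation" turns the weights into (C_r, 1, 1) so that one value is added to every spatial position of its channel. The function name is hypothetical.

```python
import numpy as np

def add_local_weights(initial_global_map, local_channel_weights):
    """Add per-channel local attention weights to the global interaction map.

    initial_global_map:    (C_r, H, W), assumed layout
    local_channel_weights: (C_r,), assumed layout
    """
    reshaped = local_channel_weights.reshape(-1, 1, 1)  # (C_r, 1, 1)
    return initial_global_map + reshaped                # broadcast over H and W

initial_map = np.zeros((2, 2, 2))
weights = np.array([0.25, 0.75])
global_interaction_map = add_local_weights(initial_map, weights)
```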

[0026] In some embodiments, the local channel attention weights are multiplied with the dimensionality-reduced feature map to obtain a local interaction feature map, including:

[0027] Two reshaping operations are performed on the global attention weights generated from the global features of the target object in the target image to obtain the reshaped global attention weights;

[0028] The local channel attention weights, the global attention weights after the reshaping operation, and the dimensionality-reduced feature map are multiplied together to obtain the local interactive feature map.

[0029] In some embodiments, the pedestrian re-identification model obtained through training is used to extract features from the target object to obtain an input feature map, and the input feature map is processed to obtain the fused features of the target image; pedestrian re-identification is then performed based on the fused features.

[0030] The methods for training and obtaining a pedestrian re-identification model include:

[0031] Obtain the training sample set and determine the feature vector of each training sample in the training sample set; construct a distance matrix based on the feature vector of each training sample, where the matrix elements are the distance between the feature vectors of two corresponding training samples; extract the anchor-positive sample distance and the anchor-negative sample distance through the positive sample indicator matrix and the negative sample indicator matrix.
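The distance matrix and indicator matrices described in [0031] can be sketched as follows. The distance metric is not specified in the text, so Euclidean distance is an assumption here, as are the function names and the choice to exclude each sample's distance to itself from the positive pairs.

```python
import numpy as np

def pairwise_distances(features):
    """Distance matrix D with D[i, j] = ||f_i - f_j|| (Euclidean, assumed)."""
    sq = np.sum(features ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * features @ features.T
    return np.sqrt(np.maximum(d2, 0.0))  # clamp tiny negatives from rounding

def indicator_matrices(labels):
    """Boolean masks marking anchor-positive and anchor-negative pairs."""
    same = labels[:, None] == labels[None, :]
    positive = same & ~np.eye(len(labels), dtype=bool)  # same identity, not self
    negative = ~same                                     # different identity
    return positive, negative

features = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [0.0, 1.0]])
labels = np.array([0, 0, 1, 1])

dist = pairwise_distances(features)
pos_mask, neg_mask = indicator_matrices(labels)
anchor_positive = dist[pos_mask]  # distances between samples of the same identity
anchor_negative = dist[neg_mask]  # distances between samples of different identities
```

Indexing the distance matrix with a boolean mask is one simple way to realize "extracting" the two sets of distances; a masked maximum or minimum per row would serve equally well when only the hardest pair per anchor is needed.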

[0032] During the training process of any training batch, the following processing is performed:

[0033] Calculate the current maximum and minimum distance difference between positive samples in the current batch of training samples;

[0034] Obtain the maximum and minimum distance difference based on the historical training sample set, and determine the average maximum and minimum distance difference;

[0035] Based on the relationship between the current maximum and minimum distance difference and the average maximum and minimum distance difference, the sample set category of the current batch of training sample set is determined;

[0036] The target training sample set for the current batch is determined from the training sample set of the current batch based on the sample set category;

[0037] The pedestrian re-identification model is trained based on the inter-class distance threshold corresponding to the sample set categories and the target training sample set of the current batch.
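The comparison in [0033] to [0035] can be sketched as below. This only covers computing the maximum-minimum distance difference and classifying the batch; sample selection and the inter-class distance thresholds are left out. The function names, the category labels, and the tolerance used to decide "equal" are assumptions, as is the reading that the average is taken over the differences recorded for earlier batches.

```python
import math

def max_min_gap(positive_distances):
    """Difference between the largest and smallest anchor-positive distance."""
    return max(positive_distances) - min(positive_distances)

def batch_category(current_gap, historical_gaps, tol=1e-6):
    """Compare the current gap with the average gap of earlier batches.

    Returns "greater", "less", or "equal"; `tol` is a hypothetical tolerance,
    since exact floating-point equality would almost never occur in practice.
    """
    average_gap = sum(historical_gaps) / len(historical_gaps)
    if math.isclose(current_gap, average_gap, abs_tol=tol):
        return "equal"
    return "greater" if current_gap > average_gap else "less"

history = [1.0, 2.0, 3.0]  # gaps recorded for earlier batches (illustrative values)
cat_a = batch_category(max_min_gap([0.5, 3.5]), history)
cat_b = batch_category(max_min_gap([1.0, 1.5]), history)
cat_c = batch_category(max_min_gap([1.0, 3.0]), history)
```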

[0038] In some of these embodiments:

[0039] If the current maximum-minimum distance difference is greater than the average maximum-minimum distance difference, select difficult positive samples and easy negative samples from the current batch of training sample set to obtain the target training sample set of the current batch, and set the inter-class distance threshold to the first inter-class distance threshold.

[0040] If the current maximum-minimum distance difference is less than the average maximum-minimum distance difference, select easy positive samples and difficult negative samples from the current batch of training sample set to obtain the target training sample set for the current batch, and set the inter-class distance threshold to the first inter-class distance threshold.

[0041] If the current maximum-minimum distance difference is greater than the average maximum-minimum distance difference, select difficult positive samples and easy negative samples from the current batch of training sample set to obtain the target training sample set of the current batch, and set the inter-class distance threshold to the second inter-class distance threshold.

[0042] When the current maximum and minimum distance difference is equal to the average maximum and minimum distance difference, select semi-hard positive samples and semi-hard negative samples from the current batch of training sample set to obtain the target training sample set of the current batch, and set the inter-class distance threshold to the third inter-class distance threshold.

[0043] Here, the first inter-class distance threshold is less than the third inter-class distance threshold, and the third inter-class distance threshold is less than the second inter-class distance threshold.

[0044] Secondly, this application also provides a pedestrian re-identification device, comprising:

[0045] The input processing module is used to acquire the target image and perform feature extraction processing on the target image to obtain the input feature map;

[0046] The global feature processing module is used to spatially partition the input feature map, obtain multiple spatial locations, and perform multi-head self-attention processing on multiple spatial locations to obtain the global features of the target object in the target image.

[0047] The local feature processing module is used to perform convolutional dimensionality reduction, global average pooling, and channel recalibration on the input feature map to obtain the local features of the target object in the target image.

[0048] The fusion module is used to fuse global and local features to obtain the fused features of the target image;

[0049] The recognition and processing module is used for pedestrian re-identification processing based on fused features.

[0050] Thirdly, this application also provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the pedestrian re-identification method of any of the above embodiments.

[0051] Fourthly, this application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the pedestrian re-identification method of any of the above embodiments.

[0052] Fifthly, this application also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the pedestrian re-identification method of any of the above embodiments.

[0053] With the aforementioned pedestrian re-identification method, apparatus, computer device, computer-readable storage medium, and computer program product, feature extraction processing is first performed on the target image to obtain an input feature map. On the one hand, the input feature map is spatially partitioned to obtain multiple spatial locations, and multi-head self-attention processing is performed on the multiple spatial locations to obtain global features of the target object in the target image. On the other hand, convolutional dimensionality reduction, global average pooling, and channel recalibration are performed on the input feature map to obtain local features of the target object in the target image. The global features and local features are then fused to obtain fused features of the target image, realizing the coordinated output of global and local information of the target image, thereby obtaining stronger representation ability in pedestrian re-identification tasks and improving the accuracy of pedestrian re-identification.

Description of the Attached Figures

[0054] To more clearly illustrate the technical solutions in the embodiments of this application or related technologies, the drawings used in the description of the embodiments of this application or related technologies will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0055] Figure 1 is an application environment diagram of the pedestrian re-identification method in one embodiment;

[0056] Figure 2 is a flowchart of a pedestrian re-identification method in one embodiment;

[0057] Figure 3 is a flowchart of the process of extracting global features in one embodiment;

[0058] Figure 4 is a flowchart of the process of extracting local features in one embodiment;

[0059] Figure 5 is a schematic diagram of the process for obtaining a global interaction feature map in one embodiment;

[0060] Figure 6 is a flowchart of the process of obtaining a local interaction feature map in one embodiment;

[0061] Figure 7 is a flowchart of the training process for a training batch in one example;

[0062] Figure 8 is a schematic diagram of the model architecture for pedestrian re-identification in a specific example;

[0063] Figure 9 is a structural block diagram of a pedestrian re-identification device in one embodiment;

[0064] Figure 10 is an internal structural diagram of a computer device in one embodiment;

[0065] Figure 11 is an internal structural diagram of a computer device in another embodiment.

Detailed Implementation

[0066] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0067] It should be noted that the terms "first," "second," etc., used in this application can be used to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another. The terms "comprising" and "having," and any variations thereof, used in this application, are intended to cover non-exclusive inclusion. The term "multiple" used in this application refers to two or more. The term "and/or" used in this application refers to any one of the listed items, or any combination of several of them.

[0068] It should be noted that all information and data involved in this application (including but not limited to data used for analysis, stored data, and displayed data) are information and data authorized by the user or fully authorized by all parties, and the acquisition, transmission, storage, use, and processing of related data comply with the relevant provisions of national laws and regulations. Users can refuse content pushed to them or can easily refuse content pushes. In the embodiments of this application, certain existing solutions in the industry, such as software, components, and models, may be mentioned. These should be considered exemplary, and their purpose is merely to illustrate the feasibility of implementing the technical solution of this application, but does not mean that the applicant has already used or necessarily used such a solution.

[0069] The biggest challenge in pedestrian re-identification lies in the fundamental problem of "high similarity in appearance among similar pedestrians." Even among different pedestrians, if their height, clothing, gait, and other features are similar, they are almost indistinguishable from each other in images. This is especially true in low- to medium-resolution images, where different pedestrians may exhibit almost identical visual appearances, greatly increasing the difficulty of identification. In addition, there are several other major challenges:

[0070] Insufficient visual information: Compared to facial recognition or vehicle recognition, pedestrians have fewer distinguishable features. Pedestrians lack obvious structural features, most are in a dynamic state, and their clothing tends to be uniform, with significant similarities in clothing types. For example, common clothing colors such as black, white, and gray frequently appear on different pedestrians; while external accessories (such as hats, handbags, and glasses) are usable features, they often fail to differentiate pedestrians due to potential overlap.

[0071] Limitations in fine-grained differentiation: Existing person re-identification methods typically employ global feature learning, which, while extracting some semantic information, often fails to accurately characterize differences in small regions. In person re-identification, particular attention must be paid to the detailed features of pedestrians, such as clothing texture, shoe style, facial contours, and body posture. These fine-grained local features are easily diluted in the global vector, making it difficult to effectively distinguish similar pedestrians.

[0072] Recognition difficulties arise from changes in perspective: When pedestrians appear under different cameras, the shooting angle can change drastically (e.g., forward view, backward view, side view, oblique view), resulting in significant variations in the pedestrian's appearance in the image. Even the same pedestrian may present completely different visual features. Common scenarios involving changes in perspective include a frontal view showing only the pedestrian's face and upper body, while a side view lacks facial features and shows more of the pedestrian's side profile. To address these variations, techniques such as multi-view data augmentation, perspective-invariant feature extraction, and multi-view matching are typically used.

[0073] Visual differences caused by different weather and lighting conditions: Surveillance cameras in outdoor environments are significantly affected by lighting conditions and weather. Daytime, nighttime, cloudy, rainy, snowy, and backlit conditions can all change the visual appearance of pedestrians and severely interfere with the stability of the feature extraction network.

[0074] Pedestrian appearance changes and occlusion issues: In some real-world scenarios, a pedestrian's appearance may change over time, such as when they change clothes, wear glasses, or carry a backpack. Simultaneously, pedestrians may be occluded by other objects, such as other pedestrians, vehicles, or the surrounding environment (e.g., trees, railings) in crowded environments, thus affecting feature extraction and recognition performance.

[0075] Currently, a large number of studies have been dedicated to solving related problems in pedestrian re-identification scenarios. Existing solutions mainly focus on global feature learning and metric learning. Although they can extract strong global features of the overall appearance of pedestrians, more fine-grained feature extraction is still needed to distinguish nearly duplicate pedestrians.

[0076] Accordingly, this application provides a pedestrian re-identification method, which can be applied in, for example, the application environment shown in Figure 1, in which terminal 102 communicates with server 104 via a network. In some embodiments, terminal 102 acquires a target image, performs pedestrian re-identification processing on the target image, and executes the training process of a pedestrian re-identification model. In other embodiments, server 104 may acquire the target image, perform pedestrian re-identification processing on the target image, and execute the training process of the pedestrian re-identification model. In still other embodiments, the pedestrian re-identification processing on the target image and the training process of the pedestrian re-identification model can be completed through interaction between terminal 102 and server 104. For example, after server 104 trains and obtains the pedestrian re-identification model, it deploys the trained model to terminal 102, and terminal 102 obtains the target image and performs pedestrian re-identification processing. As another example, after server 104 trains and obtains the pedestrian re-identification model, it deploys the trained model locally on server 104 or to other servers; after terminal 102 obtains the target image, it sends the target image to server 104, which performs pedestrian re-identification processing and can send the result to terminal 102.

[0077] The data storage system can store the data that the terminal 102 or server 104 needs to process. The data storage system can be integrated onto server 104, or it can be located in the cloud or on other network servers. Terminal 102 can be, but is not limited to, various personal computers, laptops, smartphones, tablets, drones, low-altitude aircraft, IoT devices, and portable wearable devices. IoT devices can include smart speakers, smart TVs, smart air conditioners, smart in-vehicle devices, projection devices, etc. Portable wearable devices can include smartwatches, smart bracelets, head-mounted devices, etc. Head-mounted devices can be virtual reality (VR) devices, augmented reality (AR) devices, smart glasses, etc. Server 104 can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing cloud computing services.

[0078] In one exemplary embodiment, as shown in Figure 2, a pedestrian re-identification method is provided. Taking the application of the method to terminal 102 or server 104 in Figure 1 as an example, the method includes the following steps S201 to S205, wherein:

[0079] Step S201: Obtain the target image and perform feature extraction processing on the target image to obtain the input feature map.

[0080] The target image is the image for which pedestrian re-identification needs to be performed. The method of obtaining the target image can be the method already available in related technologies, such as images extracted from a database or images extracted from video streams provided by camera devices. This application does not impose specific limitations on this.

[0081] The input feature map obtained by performing feature extraction on the target image can be a shallow feature of the extracted target image, such as edge information, color blocks, texture features, corner information, etc., but is not limited to these.

[0082] There are no restrictions on the way to perform feature extraction on the target image to obtain the input feature map. For example, the target image can be processed through the first three layers of the backbone network of the network model to obtain the input feature map, but it is not limited to this.

[0083] Step S202: Spatially partition the input feature map to obtain multiple spatial locations, and perform multi-head self-attention processing on the multiple spatial locations to obtain global features of the target object in the target image.

[0084] The obtained input feature map can be spatially partitioned to obtain multiple spatial locations. Multi-head self-attention processing can then be applied to these multiple spatial locations to capture long-distance feature dependencies and construct global features of the target object.

[0085] Step S203: Perform convolution dimensionality reduction, global average pooling, and channel recalibration on the input feature map to obtain the local features of the target object in the target image.

[0086] The obtained input feature map can be processed by convolutional dimensionality reduction, global average pooling, and channel recalibration, thereby enhancing the ability to capture key channel information and obtaining local features of the target object.

[0087] Step S204: Fuse global features and local features to obtain fused features of the target image.

[0088] The obtained global and local features can be fused to obtain the fused features of the target object. The specific method for fusing global and local features is not limited; for example, it can be fused using weighted fusion, cascaded fusion, attention mechanism fusion, or bilinear pooling, etc. This application embodiment does not impose specific limitations in this regard.
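Two of the fusion options named above can be sketched briefly. The text leaves the fusion method open, so this is only an illustration: the function names and the mixing coefficient `alpha` are hypothetical, and "cascaded fusion" is read here as channel-wise concatenation, which is an assumption about the intended meaning.

```python
import numpy as np

def weighted_fusion(global_feat, local_feat, alpha=0.5):
    """Element-wise weighted sum; both inputs must share one shape."""
    return alpha * global_feat + (1.0 - alpha) * local_feat

def concat_fusion(global_feat, local_feat):
    """Concatenate along the channel axis (axis 0 of a (C, H, W) map)."""
    return np.concatenate([global_feat, local_feat], axis=0)

global_feat = np.ones((4, 3, 3))
local_feat = np.zeros((4, 3, 3))
fused_sum = weighted_fusion(global_feat, local_feat, alpha=0.25)
fused_cat = concat_fusion(global_feat, local_feat)
```

Weighted fusion keeps the channel count unchanged, while concatenation doubles it, so any layer that consumes the fused features has to be sized accordingly.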

[0089] Step S205: Perform pedestrian re-identification processing based on fused features.

[0090] Based on the obtained fused features, pedestrian re-identification processing can be performed. Specifically, the method of pedestrian re-identification processing based on the obtained fused features can be the same as the method of pedestrian re-identification processing based on already obtained features in related technologies; this application embodiment does not impose specific limitations on this.
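One common way to use a feature vector for re-identification in related technologies is to rank a gallery of stored feature vectors by their distance to the query. The sketch below assumes that approach with cosine distance; neither the ranking scheme nor the metric is prescribed by this application, and the function name is hypothetical.

```python
import numpy as np

def rank_gallery(query, gallery):
    """Return gallery indices sorted from most to least similar to the query.

    query:   (D,) fused feature vector of the target image
    gallery: (M, D) fused feature vectors of candidate images
    """
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    distances = 1.0 - g @ q            # cosine distance, assumed metric
    order = np.argsort(distances)      # ascending: closest candidate first
    return order, distances[order]

query = np.array([1.0, 0.0])
gallery = np.array([[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]])
order, sorted_distances = rank_gallery(query, gallery)
```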

[0091] Based on this embodiment, in the process of pedestrian re-identification, after performing feature extraction on the target image to obtain the input feature map, the input feature map is spatially divided to obtain multiple spatial locations, and multi-head self-attention processing is performed on these multiple spatial locations to obtain the global features of the target object in the target image. On the other hand, convolutional dimensionality reduction, global average pooling, and channel recalibration are performed on the input feature map to obtain the local features of the target object in the target image. The global and local features are then fused to obtain the fused features of the target image, realizing the coordinated output of global and local information of the target image. This results in stronger representation capabilities in the pedestrian re-identification task, thereby improving the accuracy of pedestrian re-identification.

[0092] In some embodiments, as shown in Figure 3, step S202 above, in which the input feature map is spatially partitioned to obtain multiple spatial locations and multi-head self-attention processing is performed on these multiple spatial locations to obtain the global features of the target object in the target image, includes:

[0093] Step S301: Perform 1×1 convolution on the input feature map to obtain the dimensionality-reduced feature map.

[0094] Performing a 1×1 convolution on the input feature map reduces its dimensionality and yields the dimensionality-reduced feature map. For example, a 1×1 convolutional layer in a Transformer (encoder) module can be used to convolve the input feature map: the input feature map $X_g \in \mathbb{R}^{C \times N \times N}$ is reduced to $C_r$ channels by the first 1×1 convolution, for example from 1024 channels to 256, which reduces the data dimension being processed and lowers computation and memory overhead.
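A 1×1 convolution is a linear map applied to the channel vector at each spatial position independently, which is why it can change the channel count without touching the spatial layout. The sketch below illustrates the 1024-to-256 reduction mentioned above with random weights; the spatial size of 8×8 and the function name are illustrative assumptions.

```python
import numpy as np

def conv1x1(x, weight, bias=None):
    """1x1 convolution: x (C_in, H, W), weight (C_out, C_in) -> (C_out, H, W)."""
    out = np.einsum("oc,chw->ohw", weight, x)
    if bias is not None:
        out = out + bias[:, None, None]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((1024, 8, 8))                 # input feature map, C = 1024
w = rng.standard_normal((256, 1024)) / np.sqrt(1024)  # random weights, C_r = 256
reduced = conv1x1(x, w)                               # (256, 8, 8)
```

The reduced map holds a quarter of the values of the input, and every later operation on it, including the attention computation, works on 256-dimensional rather than 1024-dimensional channel vectors.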

[0095] Step S302: Perform multi-head attention calculation on the dimensionality-reduced feature map to obtain global attention weights.

[0096] The obtained dimensionality-reduced feature map can be processed by multi-head attention to obtain global attention weights. The specific multi-head self-attention (MHSA) mechanism used for the calculation is not limited; any existing multi-head attention processing method can be used to obtain the global attention weights.

[0097] Step S303: Multiply the global attention weights with the dimensionality-reduced feature map to obtain the global interaction feature map.

[0098] The obtained global attention weights can be multiplied with the dimensionality-reduced feature map to obtain a global interaction feature map processed by the global attention mechanism. For example, the global interaction feature map can be obtained by multiplying the obtained global attention weights $\mathrm{Attention}_G$ with the matrix $V_r$ corresponding to the dimensionality-reduced feature map.

[0099] Step S304: Perform 1×1 convolution on the global interactive feature map to obtain the global features of the target object.

[0100] Further processing of the obtained global interaction feature map with 1×1 convolution can restore the number of channels to the original dimension after attention calculation, for example, restoring the number of channels from 256 to 1024, thereby maintaining the expressive power of the network. The restored feature map can retain the rich information of the original features and provide sufficient expressive power for subsequent recognition tasks.

[0101] Based on this embodiment, in the process of obtaining the global features of the target object in the target image, the number of channels is first reduced by a 1×1 convolution, which reduces the data dimension being processed, lowers computational and memory overhead, and makes the computation more efficient. Moreover, by performing multi-head attention computation on the dimensionality-reduced feature map, different regions can be modeled by multiple attention heads in parallel, thereby effectively learning the long-range dependency information in the target image and capturing the overall structure of human body contours, height, and clothing, which helps to improve the accuracy of pedestrian re-identification after fusing global and local features.
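Steps S301 to S304 can be sketched end to end in NumPy. This is a simplified illustration under stated assumptions, not the applicant's implementation: each spatial position of the reduced map is treated as one token, the query, key, and value projections `w_q`, `w_k`, `w_v` are hypothetical learned matrices, scaled dot-product attention with softmax is assumed as the attention form, and all names are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def global_branch(x, w_reduce, w_q, w_k, w_v, w_restore, num_heads):
    c, h, w = x.shape
    c_r = w_reduce.shape[0]
    d = c_r // num_heads                       # per-head channel width
    # S301: 1x1 convolution, then flatten the grid into L = h * w tokens.
    tokens = (w_reduce @ x.reshape(c, -1)).T   # (L, C_r)

    def split_heads(t):                        # (L, C_r) -> (heads, L, d)
        return t.reshape(-1, num_heads, d).transpose(1, 0, 2)

    q = split_heads(tokens @ w_q)
    k = split_heads(tokens @ w_k)
    v = split_heads(tokens @ w_v)              # plays the role of V_r
    # S302: global attention weights, one (L, L) matrix per head.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))
    # S303: multiply the weights with the reduced features.
    interacted = attn @ v                      # (heads, L, d)
    merged = interacted.transpose(1, 0, 2).reshape(-1, c_r)  # (L, C_r)
    # S304: 1x1 convolution restoring the original channel count.
    out = (w_restore @ merged.T).reshape(-1, h, w)
    return out, attn

rng = np.random.default_rng(0)
C, C_r, H, W, HEADS = 8, 4, 3, 3, 2
x = rng.standard_normal((C, H, W))
global_feat, attn = global_branch(
    x,
    rng.standard_normal((C_r, C)),
    rng.standard_normal((C_r, C_r)),
    rng.standard_normal((C_r, C_r)),
    rng.standard_normal((C_r, C_r)),
    rng.standard_normal((C, C_r)),
    num_heads=HEADS,
)
```

Because every token attends to every other token, the attention matrix grows with the square of the number of spatial positions, which is one reason the channel reduction in S301 matters for cost.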

[0102] In some embodiments, as shown in Figure 4, step S203 above, which performs convolutional dimensionality reduction, global average pooling, and channel recalibration on the input feature map to obtain local features of the target object in the target image, includes:

[0103] Step S401: Perform 1×1 convolution on the input feature map to obtain the dimensionality-reduced feature map.

[0104] Performing 1×1 convolution on the input feature map reduces its dimensionality and yields the dimensionality-reduced feature map. For example, the input feature map $X_g \in \mathbb{R}^{C \times N \times N}$ is reduced to $C_r$ channels through a 1×1 convolutional layer, for instance from 1024 channels to 256, thereby reducing the data dimension during processing and the computation and memory overhead.

[0105] Step S402: Perform global average pooling on the dimensionality-reduced feature map to obtain channel descriptors.

[0106] The obtained dimensionality-reduced feature map can be processed by global average pooling to extract channel-level statistical information, i.e., the channel descriptors. This process also compresses the global information of the feature map, reduces the number of parameters, and improves processing efficiency.

[0107] Step S403: Perform channel modeling on the channel descriptor through two fully connected layers to obtain local channel attention weights.

[0108] A small neural network can be constructed using two fully connected layers to learn the linear relationship between channels, i.e., to perform channel modeling processing, thereby obtaining channel attention weights, i.e., obtaining the attention weights between channels. Specifically, the method of performing channel modeling processing on the channel descriptors using two fully connected layers to obtain channel attention weights can adopt existing methods in related technologies, and this application embodiment does not impose specific limitations on this.

[0109] Step S404: Multiply the local channel attention weights with the dimensionality-reduced feature map to obtain the local interactive feature map.

[0110] The obtained channel attention weights can be multiplied with the dimensionality-reduced feature map obtained in step S401 to perform channel recalibration on the dimensionality-reduced feature map based on the channel attention weights, thereby obtaining a local interactive feature map based on channel attention.

[0111] Step S405: Perform 1×1 convolution on the local interactive feature map to obtain the local features of the target object.

[0112] Further 1×1 convolution processing is performed on the obtained local interaction feature maps, which can restore the number of channels to the original dimension after the channel attention calculation is completed, for example, restoring the number of channels from 256 to 1024, thereby maintaining the expressive power of the network. The restored feature maps can retain the rich information of the original features and provide sufficient expressive power for subsequent recognition tasks.

[0113] Based on this embodiment, in the process of obtaining local features of the target object in the target image, the number of channels is first reduced by 1x1 convolution to reduce the computational cost of each convolutional layer, thereby reducing the data dimension in the processing, reducing computational and memory overhead, and making the computation more efficient. Moreover, the statistical information of the channel level is extracted from the dimensionality-reduced feature map through global average pooling operation, and then the inter-channel dependencies are modeled through a fully connected network to further enhance the response capability of fine-grained regions such as shoe, backpack, and clothing details. Thus, by effectively utilizing local feature details, the ability to express local region information can be improved, thereby improving the accuracy of re-identification.
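Steps S401 to S405 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation of the embodiment: the ReLU between the two fully connected layers and the sigmoid at the output follow the common squeeze-and-excitation design and are not specified in the text, and the names `local_branch`, `w_fc1`, and `w_fc2` are hypothetical.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution as a per-pixel linear map: x is (C_in, H, W), w is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_branch(x, w_down, w_fc1, w_fc2, w_up):
    """Hypothetical local branch: reduce -> pool -> two FC layers -> recalibrate -> restore.

    x: (C, H, W); w_down: (C_r, C); w_fc1: (C_h, C_r); w_fc2: (C_r, C_h); w_up: (C, C_r).
    """
    xr = conv1x1(x, w_down)                  # S401: (C_r, H, W)
    descriptor = xr.mean(axis=(1, 2))        # S402: global average pooling, (C_r,)
    hidden = np.maximum(w_fc1 @ descriptor, 0.0)  # S403: first FC layer (ReLU is an assumption)
    attn_l = sigmoid(w_fc2 @ hidden)         # S403: second FC layer, weights in (0, 1)
    recalibrated = xr * attn_l[:, None, None]     # S404: channel-wise recalibration
    return conv1x1(recalibrated, w_up), attn_l    # S405: channels restored to C
```

The channel attention weights are returned alongside the local features so that they can be reused by other branches, for example in the embodiment of Figure 5.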

[0114] In some embodiments, as shown in Figure 5, step S303 above, which multiplies the global attention weights with the dimensionality-reduced feature map to obtain the global interaction feature map, may include:

[0115] Step S501: After reshaping the local channel attention weights generated from the local features of the target object in the obtained target image, the reshaped local channel attention weights are obtained.

[0116] The local channel attention weights generated from the local features of the target object in the obtained target image can be the local channel attention weights obtained in step S403 above.

[0117] Reshaping the channel attention weights refers to converting the channel attention weights from a one-dimensional vector into a multi-dimensional tensor that matches the dimensionality-reduced feature map, so that tensors of different shapes can be correctly operated on element-wise.

[0118] Step S502: Multiply the global attention weights with the dimensionality-reduced feature map to obtain the initial global interaction feature map.

[0119] By multiplying the global attention weights with the dimensionality-reduced feature map, an initial global interaction feature map determined based on global features can be obtained.

[0120] Step S503: Add the local channel attention weights after the reshaping operation to the initial global interaction feature map to obtain the global interaction feature map.

[0121] The local channel attention weights obtained after the reshaping operation can be added to the initial global interaction feature map to obtain the global interaction feature map.

[0122] By incorporating local channel attention weights into the global interactive feature map, the spatial information of local features can guide global features to focus on key regions, further optimizing the representation effect of global features.

[0123] Based on this embodiment, by reshaping the channel attention weights generated in the local features of the target object in the obtained target image, and then combining the reshaped channel attention weights with the dimensionality-reduced feature map based on the global attention weights, a global interactive feature map is obtained. This can guide the response intensity of local features in each channel, so that important channels receive more attention, thereby enhancing the sensitivity of local features to key details.
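A minimal sketch of steps S501 to S503 is given below, assuming a single attention head for brevity and assuming that the reshaping turns the length-C_r channel vector into a (C_r, 1, 1) tensor so that it broadcasts over the spatial positions. The function name and argument layout are hypothetical.

```python
import numpy as np

def inject_local_into_global(attn_g, v_r, attn_l, height, width):
    """Add reshaped local channel attention weights to the attention output.

    attn_g: (HW, HW) global attention weights (single head, an assumption)
    v_r:    (HW, C_r) value matrix of the dimensionality-reduced feature map
    attn_l: (C_r,) local channel attention weights from step S403
    """
    c_r = attn_l.shape[0]
    # S502: attention weights times values, laid out as a (C_r, H, W) feature map
    initial = (attn_g @ v_r).T.reshape(c_r, height, width)
    # S501: one-dimensional vector -> tensor matching the feature map's channel axis
    reshaped = attn_l.reshape(c_r, 1, 1)
    # S503: element-wise addition (broadcast over the H and W axes)
    return initial + reshaped
```

Because the step consists only of a reshape and an addition, it introduces no trainable parameters.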

[0124] In some embodiments, as shown in Figure 6, step S404 above, which multiplies the local channel attention weights with the dimensionality-reduced feature map to obtain the local interaction feature map, includes:

[0125] Step S601: After performing two reshaping operations on the global attention weights generated from the global features of the target object in the target image, the reshaped global attention weights are obtained.

[0126] The global attention weights generated from the global features of the target object in the obtained target image can be the global attention weights obtained in step S302 above.

[0127] Step S602: Multiply the local channel attention weights, the global attention weights after the reshaping operation, and the dimensionality-reduced feature map to obtain the local interactive feature map.

[0128] Accordingly, the spatial attention map (global attention weight) generated by the global branch is compressed into a channel vector and multiplied channel by channel with the local channel attention weight and the dimensionality-reduced feature map. This can guide the response intensity of local features in each channel, so that important channels receive more attention, thereby enhancing the sensitivity of local features to key details.

[0129] Based on this embodiment, the global attention weights generated from the global features of the target object in the target image are reshaped twice. The reshaped global attention weights are then added to the initial local interactive feature map obtained by multiplying the channel attention weights with the dimensionality-reduced feature map to obtain the local interactive feature map. This process guides the response intensity of local features in each channel, allowing important channels to receive more attention. As a result, the spatial information of local features can guide global features to focus on key areas, further optimizing the representation effect of global features. This process realizes the dynamic interaction between global and local information, thereby improving the overall performance of the pedestrian re-identification task.

[0130] Based on the example above, a bidirectional attention interaction mechanism was implemented, which enables the global branch to integrate local details and the local branch to absorb the global context, significantly improving the fine-grained discrimination capability of features.

[0131] In some embodiments, the target image can be processed by a trained pedestrian re-identification model to obtain an input feature map, and the input feature map can be processed to obtain the fused features of the target image; pedestrian re-identification is then performed based on the fused features.

[0132] The method of training the pedestrian re-identification model is not limited. In this embodiment, the method of training the pedestrian re-identification model may include:

[0133] Obtain the training sample set and determine the feature vector of each training sample in the training sample set; construct a distance matrix based on the feature vector of each training sample, where the matrix elements are the distance between the feature vectors of two corresponding training samples; extract the anchor-positive sample distance and the anchor-negative sample distance through the positive sample indicator matrix and the negative sample indicator matrix.

[0134] The pedestrian re-identification model is obtained by training based on the training sample set, the positive sample indicator matrix, the negative sample indicator matrix, the anchor-positive sample distance, and the anchor-negative sample distance.

[0135] The distance matrix, constructed based on the feature vectors of the training samples, represents the distances between different training samples: the element in row $i$ and column $j$ indicates the feature distance between the $i$-th training sample and the $j$-th training sample, and each diagonal element is 0, indicating that the distance from a sample to itself is 0. The method for calculating the distance between training samples is not limited; for example, it can be the Euclidean distance or the cosine distance. This application embodiment does not impose specific limitations.

[0136] The positive and negative indicator matrices are matrices determined based on the sample labels of the training samples. The positive indicator matrix quickly identifies which sample pairs belong to the same positive class (positive sample pair), while the negative indicator matrix quickly identifies which sample pairs belong to different negative classes (negative sample pair). Specifically, the elements of the positive indicator matrix indicate whether two corresponding samples belong to the same class (positive sample pair), and the elements of the negative indicator matrix indicate whether two corresponding samples belong to different classes (negative sample pair).

[0137] Based on the obtained positive and negative sample indicator matrices, the distances Dap to positive samples and Dan to negative samples can be extracted for each anchor sample, thus obtaining the anchor-positive sample distance and the anchor-negative sample distance. The specific method for calculating these distances is not limited; in this embodiment, a hard sample mining method can be used. In this case, the anchor-positive sample distance can be the distance between the anchor sample and the farthest of all its positive samples (i.e., the most difficult positive sample), and the anchor-negative sample distance can be the distance between the anchor sample and the closest of all its negative samples (i.e., the most difficult negative sample).
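The construction described in paragraphs [0133] to [0137] can be sketched as follows, assuming Euclidean distance (one of the options named above) and assuming that every anchor has at least one positive and one negative sample in the batch. The function name `batch_hard_distances` is hypothetical.

```python
import numpy as np

def batch_hard_distances(features, labels):
    """Build the distance matrix, the indicator matrices, and hard-mined distances.

    features: (B, D) feature vectors; labels: (B,) identity labels.
    """
    diff = features[:, None, :] - features[None, :, :]
    m_dis = np.sqrt((diff ** 2).sum(axis=-1))      # (B, B), zero on the diagonal

    same = labels[:, None] == labels[None, :]
    pos = same & ~np.eye(len(labels), dtype=bool)  # positive indicator matrix (self excluded)
    neg = ~same                                    # negative indicator matrix

    # Hardest positive: farthest sample with the same label as the anchor.
    d_ap = np.where(pos, m_dis, -np.inf).max(axis=1)
    # Hardest negative: closest sample with a different label.
    d_an = np.where(neg, m_dis, np.inf).min(axis=1)
    return m_dis, pos, neg, d_ap, d_an
```

For one-dimensional features 0, 1, 3 (identity A) and 10, 12 (identity B), the anchor at 0 obtains an anchor-positive distance of 3 and an anchor-negative distance of 10.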

[0138] Based on this, as shown in Figure 7, during the training process of any training batch, the following processing is performed:

[0139] Step S701: Calculate the current maximum and minimum distance difference between positive samples in the current batch of training samples.

[0140] The current maximum-minimum distance difference of the positive samples in the current batch of training samples refers to the difference between the distance from the anchor sample to the most difficult positive sample (farthest distance) in the current batch and the distance from the anchor sample to the easiest positive sample (closest distance) in the current batch.

[0141] The maximum and minimum distance difference reflects the feature distribution in the training sample set. When the maximum and minimum distance difference is large, it indicates that the feature distribution in the training sample set is not compact enough and the variance is large. There may be some samples (outliers) in the training sample set that are difficult to learn. When the maximum and minimum distance difference is small, it indicates that similar samples are clustered more closely in the feature space.

[0142] Step S702: Obtain the maximum and minimum distance difference based on the historical training sample set, and determine the average maximum and minimum distance difference.

[0143] Based on the maximum and minimum distance difference of the historical training sample set, the average maximum and minimum distance difference can be determined. That is, the average value of each maximum and minimum distance difference corresponding to the historical training sample set can be calculated to obtain the average maximum and minimum distance difference.

[0144] In some examples, the average maximum-minimum distance difference can be obtained by calculating the average value based solely on the maximum-minimum distance differences in the historical training sample set. In other examples, the average maximum-minimum distance difference can be obtained by adding the current maximum-minimum distance difference to the distance difference set of the maximum-minimum distance differences in the historical training sample set, and then calculating the average value of each maximum-minimum distance difference in the distance difference set.
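Steps S701 and S702, including both averaging variants just described, might be sketched as below. The bounded queue length `maxlen`, the class name `SpreadHistory`, and the fallback to the current value when no history exists yet are assumptions introduced for illustration.

```python
from collections import deque

class SpreadHistory:
    """Keeps past maximum-minimum distance differences and their average."""

    def __init__(self, maxlen=100):
        # Bounded queue of past differences; the length limit is an assumption.
        self.queue = deque(maxlen=maxlen)

    def update(self, positive_distances, include_current=True):
        """Return (current difference, average difference) for one batch."""
        # S701: farthest positive distance minus closest positive distance
        current = max(positive_distances) - min(positive_distances)
        if include_current:
            # Variant 2: the current difference joins the set before averaging.
            self.queue.append(current)
            average = sum(self.queue) / len(self.queue)
        else:
            # Variant 1: average over history only; with no history yet,
            # fall back to the current value (an assumption).
            average = sum(self.queue) / len(self.queue) if self.queue else current
            self.queue.append(current)
        return current, average
```

A typical use is to call `update` once per batch with the anchor-positive distances of that batch and pass the returned pair to the category decision of step S703.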

[0145] Step S703: Based on the relationship between the current maximum and minimum distance difference and the average maximum and minimum distance difference, determine the sample set category of the current batch of training sample sets.

[0146] Based on the relationship between the current maximum and minimum distance difference and the average maximum and minimum distance difference, the sample set category of the current batch of training samples can be determined.

[0147] In the case where the current maximum-minimum distance difference is greater than the average maximum-minimum distance difference, it indicates that the feature distribution in the training sample set is not compact enough, the variance is large, and there may be some samples in the training sample set that are difficult to learn.

[0148] When the current maximum-minimum distance difference is less than the average maximum-minimum distance difference, it indicates that similar samples are clustered relatively closely in the feature space.

[0149] Step S704: Determine the target training sample set for the current batch from the training sample set of the current batch based on the sample set category.

[0150] The determined sample set category reflects the distribution of sample features in the current batch of training samples, so the target training sample set of the current batch can be determined from the current batch of training samples according to that category.

[0151] Step S705: Train the pedestrian re-identification model based on the inter-class distance threshold corresponding to the sample set category and the target training sample set of the current batch.

[0152] Based on the determined target training sample set for the current batch and the inter-class distance thresholds corresponding to the sample set categories, the pedestrian re-identification model can be trained for the current batch. The specific model training method can employ methods already existing in related technologies; this application embodiment does not specifically limit this approach.

[0153] The method for selecting the target training sample set and the corresponding inter-class distance threshold for the current batch is not limited. Specifically, the following methods can be used:

[0154] If the current maximum-minimum distance difference is greater than the average maximum-minimum distance difference, select difficult positive samples and easy negative samples from the current batch of training sample set to obtain the target training sample set of the current batch, and set the inter-class distance threshold to the first inter-class distance threshold.

[0155] If the current maximum-minimum distance difference is less than the average maximum-minimum distance difference, select easy positive samples and difficult negative samples from the current batch of training sample set to obtain the target training sample set for the current batch, and set the inter-class distance threshold to the second inter-class distance threshold.

[0157] When the current maximum and minimum distance difference is equal to the average maximum and minimum distance difference, select semi-hard positive samples and semi-hard negative samples from the current batch of training sample set to obtain the target training sample set of the current batch, and set the inter-class distance threshold to the third inter-class distance threshold.

[0158] The first inter-class distance threshold is less than the third inter-class distance threshold, and the third inter-class distance threshold is less than the second inter-class distance threshold.
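The case analysis above can be summarised in a short sketch. It assumes that the case in which the current difference is smaller than the average uses the second (larger) threshold, consistent with the ordering of the thresholds and with the detailed example given later in the description; the tolerance `tol` used to decide equality and the function name are hypothetical.

```python
import math

def choose_mining_strategy(current, average, alpha1, alpha2, alpha3, tol=1e-6):
    """Return (positive difficulty, negative difficulty, inter-class distance threshold)."""
    if not (alpha1 < alpha3 < alpha2):
        raise ValueError("expected alpha1 < alpha3 < alpha2")
    if math.isclose(current, average, rel_tol=0.0, abs_tol=tol):
        # Medium intra-class spread: semi-hard samples, third threshold
        return ("semi-hard", "semi-hard", alpha3)
    if current > average:
        # Large intra-class spread: difficult positives, easy negatives, first threshold
        return ("hard", "easy", alpha1)
    # Small intra-class spread: easy positives, difficult negatives, second threshold
    return ("easy", "hard", alpha2)
```

For example, with thresholds 0.2, 0.6, and 0.4 for the first, second, and third threshold respectively, a batch whose difference exceeds the historical average is trained with difficult positives, easy negatives, and the threshold 0.2.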

[0159] Based on this embodiment, during the training of the pedestrian re-identification model, the sample set category and the inter-class distance threshold are dynamically determined for each batch of training samples from the current maximum-minimum distance difference of the positive samples in that batch and the average maximum-minimum distance difference derived from the historical training sample sets. The target training sample set is then determined based on the sample set category, and the pedestrian re-identification model is trained based on the inter-class distance threshold corresponding to the sample set category and the target training sample set of the current batch. This achieves adaptive adjustment based on sample difficulty, maintains viewpoint consistency while making full use of multi-view samples, improves model robustness, avoids instability caused by excessively difficult samples in the early stage of training, and prevents overfitting caused by overly simple samples in the later stage of training. This further improves the performance of the pedestrian re-identification model and the accuracy of pedestrian re-identification.

[0160] Based on the embodiments described above, a detailed example will be provided below.

[0161] As shown in Figure 8, based on the solution of this embodiment, a re-identification process based on a knowledge-driven multi-branch fusion network is realized. The specific processing flow can be as follows:

[0162] After the input image (i.e., the target image) is processed by the first three layers of the backbone network, the input feature map is obtained. The fourth layer of the backbone network is then transformed into a Global-Local Branch Multi-structure (GLMB). Specifically:

[0163] The global branch consists of Transformer blocks and captures global information through a multi-head attention mechanism;

[0164] The local branch consists of squeeze-and-excitation (SE) blocks, which extract local information through convolution and pooling operations.

[0165] Each branch in the global and local branches further comprises two sub-branches, trained using cross-entropy loss and triplet loss, respectively. This dual design effectively enhances feature representation capabilities through the structural diversity of the global and local branches, as well as the diversity of training objectives for different loss functions.

[0166] In this approach, an attention transfer module (ATM) is introduced between sub-branches that use the same loss function. This promotes the integration of local information into global branches and the integration of global information into local branches. In other words, the global branch structure is updated to a global-local branch structure, and the local branch structure is updated to a local-global branch structure, thereby enhancing the discriminative ability of each branch.

[0167] In training a pedestrian re-identification model, the triplet loss can be used for sample selection based on the Knowledge-Driven Triplet Mining (KDTM) method. A specific example of the processing flow is as follows:

[0168] Global-Local Branching Structure: This structure aims to capture long-range feature dependencies to construct global features. The Transformer block consists of three parts: a 1×1 convolution, multi-head self-attention (MHSA), and another 1×1 convolution. After the input feature map $X_g \in \mathbb{R}^{C \times N \times N}$ is reduced to $C_r$ channels by the first 1×1 convolution, the global attention weights AttentionG are calculated using the MHSA mechanism. The global attention weights are multiplied by the value matrix Vr to obtain the MHSA output, whose number of channels is then restored to $C$ by a 1×1 convolution. This process models complex global patterns through a cascaded operation of dimension reduction, global attention, and channel restoration.

[0169] Local-Global Branching Structure: This structure focuses on extracting detailed pedestrian features. The squeeze-and-excitation block consists of three parts: a 1×1 convolution, an SE module, and another 1×1 convolution. After dimensionality reduction by the first 1×1 convolution, the input features are used to generate channel descriptors through global average pooling (GAP). These descriptors are then modeled through two fully connected layers to generate the local channel attention weights (AttentionL). These weights recalibrate the channels of the feature map, and finally a 1×1 convolution restores the number of channels. This structure enhances the ability to capture key channel information through a process of spatial squeeze, channel excitation, and feature recalibration.

[0170] In the above process, to address the issue of insufficient information exchange between the global and local branch structures, this application embodiment employs a parameterless attention transfer module (ATM). As shown in Figure 8:

[0171] In the global-local branching structure, the local channel attention weights (AttentionL) of the local branching structure are incorporated into the MHSA output of the global branching structure through a reshaping operation. In the local-global branching structure, the global attention weights (AttentionG) of the global branching structure are reshaped twice and then added to the SE output. This bidirectional attention interaction mechanism enables the global branch to integrate local details, while the local branch absorbs the global context, significantly improving the fine-grained discriminative ability of features.

[0172] To address the limitations of existing triplet mining methods, such as the tendency of hard triplet mining to get trapped in local optima, the neglect of multi-view samples when viewpoint consistency is maintained, and the weak contribution of easy-sample mining to the loss in later training stages, this application proposes a knowledge-driven triplet mining method. Its core process includes:

[0173] Distance matrix calculation: Construct a distance matrix Mdis based on feature vectors, and extract the anchor-positive sample distance Dap and the anchor-negative sample distance Dan through the positive / negative sample indicator matrix P / N.

[0174] Intra-class distance evaluation: Calculate the maximum-minimum distance difference of the anchor-positive sample distances Dap in the current batch and store it in the historical knowledge queue Qdis.

[0175] Adaptive threshold decision: Based on the average of the maximum and minimum distance differences between positive samples in the historical knowledge queue Qdis, determine the target training sample set for training and the corresponding inter-class distance threshold, where:

[0176] If the maximum-minimum distance difference of the positive-sample distances Dap is greater than the average, the intra-class distance is high: a combination of difficult positive samples and easy negative samples can be used as the target training sample set, and a smaller first inter-class distance threshold α1 can be used to adjust the inter-class boundaries.

[0177] If the maximum-minimum distance difference of the positive-sample distances Dap is less than the average, the intra-class distance is low: a combination of easy positive samples and difficult negative samples can be used as the target training sample set, and a larger second inter-class distance threshold α2 can be used to suppress overfitting.

[0178] If the maximum-minimum distance difference of the positive-sample distances Dap equals the average (i.e., the deviation is within the allowable error range), the intra-class distance is medium: a combination of semi-difficult positive samples and semi-difficult negative samples can be used as the target training sample set, and a moderate third inter-class distance threshold α3 can be used to balance the training mechanism.
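Once the difficulty levels and the threshold have been chosen, the triplet loss for a batch might be computed as in the sketch below. Several points are assumptions not fixed by the text: the threshold α is used as the margin of the standard triplet loss max(0, Dap − Dan + α), a "semi-difficult" sample is taken to be the median-distance candidate, and the function names are hypothetical.

```python
import numpy as np

def select_distance(m_dis, mask, level, kind):
    """Pick one distance per anchor among the candidates marked in mask.

    level: "hard", "easy" or "semi-hard"; kind: "positive" or "negative".
    """
    picked = np.empty(len(m_dis))
    for i in range(len(m_dis)):
        candidates = np.sort(m_dis[i][mask[i]])  # ascending distances
        if level == "semi-hard":
            picked[i] = candidates[len(candidates) // 2]  # median (assumption)
        else:
            # Hard positives and easy negatives are the farthest candidates;
            # easy positives and hard negatives are the closest ones.
            farthest = (level == "hard") == (kind == "positive")
            picked[i] = candidates[-1] if farthest else candidates[0]
    return picked

def kdtm_triplet_loss(m_dis, pos, neg, pos_level, neg_level, alpha):
    """Mean triplet loss over anchors, with alpha used as the margin (assumption)."""
    d_ap = select_distance(m_dis, pos, pos_level, "positive")
    d_an = select_distance(m_dis, neg, neg_level, "negative")
    return np.maximum(d_ap - d_an + alpha, 0.0).mean()
```

Here `m_dis` is the distance matrix Mdis and `pos` and `neg` are the boolean indicator matrices P and N; every anchor is assumed to have at least one positive and one negative sample in the batch.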

[0179] Accordingly, by guiding the selection of triplets through historical knowledge, dynamic adaptation to sample difficulty is achieved, which not only ensures viewpoint consistency but also makes full use of multi-view samples, significantly improving the robustness of the model.

[0180] In related embodiments, a viewpoint consistency constraint strategy can also be used to filter the negative sample candidate pool based on pedestrian attributes, restricting the matching of sample pairs with consistent attributes and reducing viewpoint interference. Simultaneously, the RANSAC (RANdom SAmple Consensus) algorithm, which estimates the parameters of a mathematical model from a sample dataset containing outliers in order to obtain the valid sample data, is used to perform geometric consistency verification on candidate triplets, eliminating samples with mismatched spatial geometric relationships. This enables adaptive data distribution and cross-viewpoint error control, further improving the stability, effectiveness, and generalization ability of the triplet learning process.

[0181] Based on the solution described in the embodiments of this application as above, it can be determined that:

[0182] The Global-Local Branch Multi-structure (GLMB) design addresses the balance between long-range feature dependencies and local detail extraction. Compared to traditional global or local feature extraction methods, GLMB achieves complementary capture of global and local information by combining Transformer blocks with squeeze-and-excitation blocks. The global branch models global features through a multi-head attention mechanism, while the local branch enhances the capture of key channel information through squeeze-and-excitation modules. This multi-structure design improves feature representation capabilities, especially in complex tasks, where such a high degree of fusion between global and local features is difficult to achieve with traditional single-structure approaches.

[0183] The global and local branches are trained using different loss functions (cross-entropy and triplet loss), which enhances their respective feature extraction capabilities. This dual design effectively improves the model's discriminative ability through the diversity of global and local branch structures and the diversity of loss function objectives. In particular, the introduction of an attention transfer module (ATM) between sub-branches enables bidirectional global and local information interaction, allowing the global branch to incorporate local details and the local branch to absorb global context, demonstrating innovative cross-branch feature interaction capabilities.

[0184] During training, by dynamically adjusting the adaptive threshold based on the historical knowledge queue, the difficulty of the sample can be dynamically adapted when selecting samples. This dynamic sample selection method based on historical knowledge effectively avoids the local optimum problem in the mining of difficult triples, and at the same time makes full use of multi-view samples to improve the robustness of the model.

[0185] The parameterless attention transfer module (ATM) introduced between global and local branches not only enables bidirectional attention interaction between global and local branches but also significantly reduces computational overhead by eliminating the need for additional training parameters. Through this parameterless design, the system can significantly improve the fusion effect of global and local features without increasing model complexity, especially during large-scale model training, where it greatly enhances training efficiency.

[0186] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages in other steps. It is understood that the steps in different embodiments can be freely combined as needed, and all non-contradictory solutions formed by such combinations are within the scope of protection of this application.

[0187] Based on the same inventive concept, this application also provides a pedestrian re-identification device for implementing the pedestrian re-identification method described above. The solution provided by this device is similar to the implementation described in the above method; therefore, the specific limitations in one or more pedestrian re-identification device embodiments provided below can be found in the limitations of the pedestrian re-identification method described above, and will not be repeated here.

[0188] In one exemplary embodiment, as shown in Figure 9, a pedestrian re-identification device is provided, comprising: an input processing module 901, a global feature processing module 902, a local feature processing module 903, a fusion module 904, and a recognition processing module 905, wherein:

[0189] The input processing module 901 is used to acquire the target image and perform feature extraction processing on the target image to obtain the input feature map.

[0190] The global feature processing module 902 is used to spatially divide the input feature map to obtain multiple spatial locations, and to perform multi-head self-attention processing on the multiple spatial locations to obtain the global features of the target object in the target image.

[0191] The local feature processing module 903 is used to perform convolutional dimensionality reduction, global average pooling, and channel recalibration on the input feature map to obtain the local features of the target object in the target image.

[0192] The fusion module 904 is used to fuse the global features and the local features to obtain the fused features of the target image.

[0193] The recognition processing module 905 is used for pedestrian re-identification processing based on fused features.

[0194] Each module in the aforementioned pedestrian re-identification device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can call and execute the corresponding operations of each module.

[0195] In some embodiments, the global feature processing module 902 is used to perform 1×1 convolution processing on the input feature map to obtain a dimension-reduced feature map; perform multi-head attention calculation processing on the dimension-reduced feature map to obtain global attention weights; multiply the global attention weights with the dimension-reduced feature map to obtain a global interaction feature map; and perform 1×1 convolution processing on the global interaction feature map to obtain global features of the target object.
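The four steps of the global branch described above can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of this application: the feature map is assumed to be flattened into a list of spatial positions, each holding a channel vector; a 1×1 convolution is modeled as the same linear map applied at every position; the attention uses the dimension-reduced features directly as queries and keys without learned projections; and all function and parameter names (`conv1x1`, `global_branch`, `w_reduce`, `w_expand`) are hypothetical.

```python
import math

def conv1x1(tokens, weight):
    """1x1 convolution: apply the same linear map (rows of `weight`) at every position."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weight] for tok in tokens]

def softmax(values):
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def multi_head_attention_weights(tokens, num_heads):
    """Return one N x N attention matrix per head.

    Assumption: queries and keys are the channel slice of each token that
    belongs to the head; no learned projections are used in this sketch.
    """
    head_dim = len(tokens[0]) // num_heads
    scale = math.sqrt(head_dim)
    weights = []
    for h in range(num_heads):
        part = [tok[h * head_dim:(h + 1) * head_dim] for tok in tokens]
        weights.append([
            softmax([sum(a * b for a, b in zip(q, k)) / scale for k in part])
            for q in part
        ])
    return weights

def global_branch(tokens, w_reduce, w_expand, num_heads=2):
    # Step 1: 1x1 convolution to obtain the dimension-reduced feature map.
    reduced = conv1x1(tokens, w_reduce)
    # Step 2: multi-head attention calculation to obtain global attention weights.
    attn = multi_head_attention_weights(reduced, num_heads)
    # Step 3: multiply the weights with the reduced map (global interaction map).
    head_dim = len(reduced[0]) // num_heads
    count = len(reduced)
    interacted = []
    for i in range(count):
        row = []
        for h in range(num_heads):
            for c in range(h * head_dim, (h + 1) * head_dim):
                row.append(sum(attn[h][i][j] * reduced[j][c] for j in range(count)))
        interacted.append(row)
    # Step 4: 1x1 convolution to obtain the global features.
    return conv1x1(interacted, w_expand)
```

In this sketch every output position is a weighted mixture of all positions of the dimension-reduced map, which is what allows the branch to capture global context.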

[0196] In some embodiments, the local feature processing module 903 is used to perform 1×1 convolution processing on the input feature map to obtain a dimension-reduced feature map; perform global average pooling processing on the dimension-reduced feature map to obtain a channel descriptor; perform channel modeling processing on the channel descriptor through two fully connected layers to obtain local channel attention weights; multiply the local channel attention weights with the dimension-reduced feature map to obtain a local interaction feature map; and perform 1×1 convolution processing on the local interaction feature map to obtain local features of the target object.
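The local branch described above can be sketched in the same way. This is a minimal sketch under assumptions rather than the implementation of this application: the text only states that two fully connected layers model the channel descriptor, so the ReLU after the first layer and the sigmoid after the second are assumptions borrowed from common channel-attention designs, and the names `local_branch`, `w_fc1`, and `w_fc2` are hypothetical.

```python
import math

def conv1x1(tokens, weight):
    """1x1 convolution: apply the same linear map (rows of `weight`) at every position."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weight] for tok in tokens]

def linear(vector, weight):
    """Fully connected layer without bias."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def local_channel_weights(reduced, w_fc1, w_fc2):
    """Global average pooling followed by two fully connected layers."""
    count = len(reduced)
    channels = len(reduced[0])
    # Channel descriptor: mean of each channel over all spatial positions.
    descriptor = [sum(tok[c] for tok in reduced) / count for c in range(channels)]
    # Assumed activations: ReLU, then sigmoid so each weight lies in (0, 1).
    hidden = [max(0.0, v) for v in linear(descriptor, w_fc1)]
    return [sigmoid(v) for v in linear(hidden, w_fc2)]

def local_branch(tokens, w_reduce, w_fc1, w_fc2, w_expand):
    # Step 1: 1x1 convolution to obtain the dimension-reduced feature map.
    reduced = conv1x1(tokens, w_reduce)
    # Steps 2-3: channel descriptor and local channel attention weights.
    weights = local_channel_weights(reduced, w_fc1, w_fc2)
    # Step 4: recalibrate each channel (local interaction feature map).
    recalibrated = [[w * x for w, x in zip(weights, tok)] for tok in reduced]
    # Step 5: 1x1 convolution to obtain the local features.
    return conv1x1(recalibrated, w_expand)
```

Because one weight is produced per channel, the recalibration rescales whole channels rather than individual positions.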

[0197] In some embodiments, the global feature processing module 902 is used to reshape the local channel attention weights generated while obtaining the local features of the target object in the target image, to obtain reshaped local channel attention weights; multiply the global attention weights with the dimension-reduced feature map to obtain an initial global interaction feature map; and add the reshaped local channel attention weights to the initial global interaction feature map to obtain the global interaction feature map.
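The local-to-global transfer described above involves only a reshape and an addition, so it needs no trainable parameters. A minimal sketch follows, assuming the local channel attention weights form a vector with one entry per channel and that reshaping means repeating this vector at every spatial position; the exact reshape is not specified in the text, and the function names are hypothetical.

```python
def reshape_channel_weights(channel_weights, num_positions):
    """Assumed reshape: repeat the length-C weight vector at each of N positions."""
    return [list(channel_weights) for _ in range(num_positions)]

def add_local_to_global(initial_global_map, channel_weights):
    """Add the reshaped local channel weights to the initial global interaction map."""
    reshaped = reshape_channel_weights(channel_weights, len(initial_global_map))
    return [
        [g + w for g, w in zip(map_row, weight_row)]
        for map_row, weight_row in zip(initial_global_map, reshaped)
    ]
```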

[0198] In some embodiments, the local feature processing module 903 is used to perform two reshaping operations on the global attention weights generated while obtaining the global features of the target object in the target image, to obtain reshaped global attention weights; and to multiply the local channel attention weights, the reshaped global attention weights, and the dimension-reduced feature map to obtain a local interaction feature map.

[0199] In some embodiments, the apparatus includes:

[0200] The model training module is used to acquire the training sample set and determine the feature vector of each training sample in the training sample set; a distance matrix is constructed based on the feature vector of each training sample, and each matrix element in the distance matrix is the distance between the feature vectors of the corresponding two training samples; the anchor-positive sample distances and the anchor-negative sample distances are extracted through the positive sample indicator matrix and the negative sample indicator matrix.
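The construction of the distance matrix and the two indicator matrices can be sketched as follows. This is a sketch under assumptions: the text does not name the distance metric, so Euclidean distance is assumed; a sample is assumed not to count as its own positive; and the function names are hypothetical.

```python
import math

def distance_matrix(features):
    """Pairwise distances between feature vectors (Euclidean distance assumed)."""
    return [[math.dist(a, b) for b in features] for a in features]

def indicator_matrices(labels):
    """Boolean matrices marking positive pairs (same identity) and negative pairs."""
    count = len(labels)
    positive = [[labels[i] == labels[j] and i != j for j in range(count)]
                for i in range(count)]
    negative = [[labels[i] != labels[j] for j in range(count)]
                for i in range(count)]
    return positive, negative

def masked_distances(distances, mask):
    """For each anchor (row), keep only the distances selected by the mask."""
    return [[d for d, keep in zip(dist_row, mask_row) if keep]
            for dist_row, mask_row in zip(distances, mask)]
```

For a batch with identity labels `[1, 1, 2]`, for example, the first row of the anchor-positive result holds the distance to the second sample, and the first row of the anchor-negative result holds the distance to the third.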

[0201] During the training process of any training batch, the following steps are performed: calculate the current maximum-minimum distance difference of the positive samples in the current batch's training sample set; obtain the average maximum-minimum distance difference, which is determined from the maximum-minimum distance differences of historical training sample sets; determine the sample set category of the current batch's training sample set based on the relationship between the current maximum-minimum distance difference and the average maximum-minimum distance difference; determine the target training sample set of the current batch from the current batch's training sample set based on the sample set category; and train the pedestrian re-identification model based on the inter-class distance threshold corresponding to the sample set category and the target training sample set of the current batch.
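The per-batch comparison described above can be sketched as follows. This is an illustrative sketch only: the average over historical batches is assumed to be a simple running mean, the small tolerance used to decide equality of floating-point values is an assumption, and the names `SpreadHistory`, `positive_spread`, and `batch_category` are hypothetical.

```python
class SpreadHistory:
    """Running mean of the maximum-minimum distance differences of earlier batches."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, spread):
        self.total += spread
        self.count += 1

    def average(self):
        # Requires at least one earlier batch to have been recorded.
        return self.total / self.count

def positive_spread(anchor_positive_distances):
    """Maximum minus minimum anchor-positive distance in the current batch."""
    flat = [d for row in anchor_positive_distances for d in row]
    return max(flat) - min(flat)

def batch_category(current, average, tolerance=1e-9):
    """Relationship between the current difference and the historical average."""
    if abs(current - average) <= tolerance:
        return "equal"
    return "greater" if current > average else "less"
```

The returned category would then drive both the choice of target training samples and the inter-class distance threshold used for the batch.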

[0202] In some embodiments, the model training module is used for:

[0203] If the current maximum-minimum distance difference is greater than the average maximum-minimum distance difference, select difficult positive samples and easy negative samples from the current batch of training sample set to obtain the target training sample set of the current batch, and set the inter-class distance threshold to the first inter-class distance threshold.

[0204] If the current maximum-minimum distance difference is less than the average maximum-minimum distance difference, select easy positive samples and difficult negative samples from the current batch of training sample set to obtain the target training sample set for the current batch, and set the inter-class distance threshold to the first inter-class distance threshold.

[0205] If the current maximum-minimum distance difference is greater than the average maximum-minimum distance difference, select difficult positive samples and easy negative samples from the current batch of training sample set to obtain the target training sample set of the current batch, and set the inter-class distance threshold to the second inter-class distance threshold.

[0206] If the current maximum-minimum distance difference is equal to the average maximum-minimum distance difference, select semi-hard positive samples and semi-hard negative samples from the current batch of training sample set to obtain the target training sample set of the current batch, and set the inter-class distance threshold to the third inter-class distance threshold.

[0207] Here, the first inter-class distance threshold is less than the third inter-class distance threshold, and the third inter-class distance threshold is less than the second inter-class distance threshold.
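The ordering of the three thresholds, and one way such a threshold could enter the training objective, can be sketched as follows. The text does not state the loss function, so treating the inter-class distance threshold as the margin of a triplet-style hinge loss is an assumption, as are the function names; the numeric values in the usage note are hypothetical.

```python
def thresholds_are_ordered(first, second, third):
    """Check the stated ordering: first < third < second."""
    return first < third < second

def margin_loss(anchor_positive, anchor_negative, threshold):
    """Assumed triplet-style hinge loss using the threshold as the margin.

    The loss is zero once the negative is farther from the anchor than the
    positive by at least the threshold.
    """
    return max(0.0, anchor_positive - anchor_negative + threshold)
```

With hypothetical thresholds 0.2, 0.6, and 0.4 for the first, second, and third values, a triplet with anchor-positive distance 1.0 and anchor-negative distance 1.5 incurs no loss under the first threshold but a positive loss under the second, so a larger threshold demands a wider separation between identities.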

[0208] In one exemplary embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 10. The computer device includes a processor, memory, an input/output (I/O) interface, and a communication interface. The processor, memory, and I/O interface are connected via a system bus, and the communication interface is also connected to the system bus via the I/O interface. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and a database. The internal memory provides an environment for running the operating system and computer programs stored in the non-volatile storage media. The database stores data. The I/O interface is used for exchanging information between the processor and external devices. The communication interface is used for communicating with external terminals via a network. When executed by the processor, the computer program implements a pedestrian re-identification method.

[0209] In one exemplary embodiment, a computer device is provided, which may be a terminal, and its internal structure may be as shown in Figure 11. The computer device includes a processor, memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, memory, and input/output interface are connected via a system bus, and the communication interface, display unit, and input device are also connected to the system bus via the input/output interface. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for running the operating system and computer programs stored in the non-volatile storage media. The input/output interface is used for exchanging information between the processor and external devices. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, mobile cellular networks, Near Field Communication (NFC), or other technologies. When the computer program is executed by the processor, it implements a pedestrian re-identification method. The display unit is used to form a visible image and can be a display screen, a projection device, or a virtual reality imaging device. The display screen can be an LCD screen or an e-ink screen. The input device of the computer device can be a touch layer covering the display screen; a button, trackball, or touchpad on the casing of the computer device; or an external keyboard, touchpad, or mouse.

[0210] Those skilled in the art will understand that the structures shown in Figures 10 and 11 are merely block diagrams of the parts of the structure related to the present application and do not constitute a limitation on the computer device to which the present application is applied. A specific computer device may include more or fewer components than shown in the figures, combine certain components, or have a different arrangement of components.

[0211] In one embodiment, a computer device is provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the methods in the above embodiments.

[0212] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the steps of the methods described in the above embodiments.

[0213] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps of the methods described in the above embodiments.

[0214] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile memory and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, artificial intelligence (AI) processors, etc., and are not limited to these.

[0215] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this application.

[0216] The above embodiments are merely illustrative of several implementation methods of this application, and their descriptions are relatively specific and detailed. However, they should not be construed as limiting the scope of this application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

1. A pedestrian re-identification method, characterized in that the method includes: acquiring a target image and performing feature extraction processing on the target image to obtain an input feature map; spatially partitioning the input feature map to obtain multiple spatial locations, and performing multi-head self-attention processing on the multiple spatial locations to obtain global features of the target object in the target image; performing convolutional dimensionality reduction, global average pooling, and channel recalibration processing on the input feature map to obtain local features of the target object in the target image; fusing the global features and the local features to obtain fused features of the target image; and performing pedestrian re-identification processing based on the fused features.

2. The method according to claim 1, characterized in that spatially partitioning the input feature map to obtain multiple spatial locations, and performing multi-head self-attention processing on the multiple spatial locations to obtain the global features of the target object in the target image, includes: performing 1×1 convolution processing on the input feature map to obtain a dimension-reduced feature map; performing multi-head attention calculation processing on the dimension-reduced feature map to obtain global attention weights; multiplying the global attention weights with the dimension-reduced feature map to obtain a global interaction feature map; and performing 1×1 convolution processing on the global interaction feature map to obtain the global features of the target object.

3. The method according to claim 1, characterized in that performing convolutional dimensionality reduction, global average pooling, and channel recalibration processing on the input feature map to obtain the local features of the target object in the target image includes: performing 1×1 convolution processing on the input feature map to obtain a dimension-reduced feature map; performing global average pooling processing on the dimension-reduced feature map to obtain a channel descriptor; performing channel modeling processing on the channel descriptor through two fully connected layers to obtain local channel attention weights; multiplying the local channel attention weights with the dimension-reduced feature map to obtain a local interaction feature map; and performing 1×1 convolution processing on the local interaction feature map to obtain the local features of the target object.

4. The method according to claim 2, characterized in that multiplying the global attention weights with the dimension-reduced feature map to obtain the global interaction feature map includes: reshaping the local channel attention weights generated while obtaining the local features of the target object in the target image, to obtain reshaped local channel attention weights; multiplying the global attention weights with the dimension-reduced feature map to obtain an initial global interaction feature map; and adding the reshaped local channel attention weights to the initial global interaction feature map to obtain the global interaction feature map.

5. The method according to claim 3, characterized in that multiplying the local channel attention weights with the dimension-reduced feature map to obtain the local interaction feature map includes: performing two reshaping operations on the global attention weights generated while obtaining the global features of the target object in the target image, to obtain reshaped global attention weights; and multiplying the local channel attention weights, the reshaped global attention weights, and the dimension-reduced feature map to obtain the local interaction feature map.

6. The method according to any one of claims 1 to 5, characterized in that a pedestrian re-identification model obtained through training is used to extract features from the target object to obtain the input feature map, to process the input feature map to obtain the fused features of the target image, and to perform pedestrian re-identification processing based on the fused features; the method for training the pedestrian re-identification model includes: acquiring a training sample set and determining the feature vector of each training sample in the training sample set; constructing a distance matrix based on the feature vector of each training sample, wherein each matrix element in the distance matrix is the distance between the feature vectors of the corresponding two training samples; extracting the anchor-positive sample distances and the anchor-negative sample distances through the positive sample indicator matrix and the negative sample indicator matrix; and, during the training process of any training batch, performing the following processing: calculating the current maximum-minimum distance difference of the positive samples in the current batch's training sample set; obtaining the average maximum-minimum distance difference determined based on the maximum-minimum distance differences of historical training sample sets; determining the sample set category of the current batch's training sample set based on the relationship between the current maximum-minimum distance difference and the average maximum-minimum distance difference; determining the target training sample set of the current batch from the current batch's training sample set based on the sample set category; and training the pedestrian re-identification model based on the inter-class distance threshold corresponding to the sample set category and the target training sample set of the current batch.

7. The method according to claim 6, characterized in that the method further includes: if the current maximum-minimum distance difference is greater than the average maximum-minimum distance difference, selecting difficult positive samples and easy negative samples from the current batch's training sample set to obtain the target training sample set of the current batch, and setting the inter-class distance threshold to the first inter-class distance threshold; if the current maximum-minimum distance difference is less than the average maximum-minimum distance difference, selecting easy positive samples and difficult negative samples from the current batch's training sample set to obtain the target training sample set of the current batch, and setting the inter-class distance threshold to the first inter-class distance threshold; if the current maximum-minimum distance difference is greater than the average maximum-minimum distance difference, selecting difficult positive samples and easy negative samples from the current batch's training sample set to obtain the target training sample set of the current batch, and setting the inter-class distance threshold to the second inter-class distance threshold; if the current maximum-minimum distance difference is equal to the average maximum-minimum distance difference, selecting semi-hard positive samples and semi-hard negative samples from the current batch's training sample set to obtain the target training sample set of the current batch, and setting the inter-class distance threshold to the third inter-class distance threshold; wherein the first inter-class distance threshold is less than the third inter-class distance threshold, and the third inter-class distance threshold is less than the second inter-class distance threshold.

8. A pedestrian re-identification device, characterized in that the device includes: an input processing module, used to acquire a target image and perform feature extraction processing on the target image to obtain an input feature map; a global feature processing module, used to spatially partition the input feature map to obtain multiple spatial locations, and to perform multi-head self-attention processing on the multiple spatial locations to obtain global features of the target object in the target image; a local feature processing module, used to perform convolutional dimensionality reduction, global average pooling, and channel recalibration processing on the input feature map to obtain local features of the target object in the target image; a fusion module, used to fuse the global features and the local features to obtain fused features of the target image; and a recognition processing module, used to perform pedestrian re-identification processing based on the fused features.

9. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 7.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 7.