Pedestrian re-identification method and device, electronic device, and storage medium

By introducing a spatiotemporal attention mechanism to process the feature maps in pedestrian re-identification technology, the problem of decreased accuracy caused by occlusion and pose changes is solved, and higher pedestrian re-identification accuracy is achieved.

CN116563750B · Active · Publication Date: 2026-06-16 · PING AN TECH (SHENZHEN) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
PING AN TECH (SHENZHEN) CO LTD
Filing Date
2023-04-07
Publication Date
2026-06-16


Abstract

The application provides a pedestrian re-identification method and device, an electronic device and a storage medium, and belongs to the technical field of artificial intelligence. The method comprises the following steps: performing feature extraction on an original video frame to obtain an initial feature map, inputting the initial feature map into a space-time attention model for normalization processing to obtain an initial attention map, performing segmentation processing on the initial feature map to obtain an initial sub-feature map, performing segmentation processing on the initial attention map to obtain a sub-attention map, performing normalization processing on the sub-attention map to obtain an initial spatial attention score of the initial sub-feature map, constructing a target feature attention matrix according to the initial spatial attention score, constructing a target feature map according to the target feature attention matrix and the initial sub-feature map, performing similarity calculation on the target feature map and a reference person image to obtain a target similarity, and performing screening processing on the original video frame according to the target similarity to obtain a target video frame, so that the accuracy of pedestrian re-identification can be improved.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and in particular to a pedestrian re-identification method, apparatus, electronic device, and storage medium.

Background Technology

[0002] In related technologies, when performing pedestrian re-identification tasks, specific pedestrians are detected based on multiple video sequences captured by multiple cameras. However, some video frames in the video sequences may be affected by occlusion, making it impossible to accurately identify the body parts of pedestrians and reducing the accuracy of pedestrian re-identification.

Summary of the Invention

[0003] The main objective of this application is to provide a pedestrian re-identification method, apparatus, electronic device, and storage medium, which aims to improve the accuracy of pedestrian re-identification.

[0004] To achieve the above objectives, a first aspect of this application proposes a pedestrian re-identification method, the method comprising:

[0005] Acquire multiple original video frames to be identified;

[0006] Feature extraction is performed on the original video frames to obtain the initial feature map of the original video frames;

[0007] The initial feature map is input into a preset spatiotemporal attention model for normalization processing to obtain the initial attention map of the initial feature map;

[0008] The initial feature map is segmented to obtain initial sub-feature maps, and the initial attention map is segmented to obtain sub-attention maps, wherein the initial sub-feature maps and the sub-attention maps are in one-to-one correspondence;

[0009] The sub-attention map is normalized to obtain the initial spatial attention score of the initial sub-feature map;

[0010] Construct the target feature attention matrix based on the initial spatial attention score;

[0011] A target feature map is obtained by constructing a feature map based on the target feature attention matrix and multiple initial sub-feature maps.

[0012] The similarity between the target feature map and a preset reference person image is calculated to obtain the target similarity.

[0013] Multiple original video frames are filtered based on the target similarity to obtain target video frames, which are video frames containing the target person.
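The last two steps above (similarity calculation and screening) can be illustrated with a minimal sketch. The source does not name the similarity measure or the screening rule, so cosine similarity, the fixed `threshold`, and the idea that the reference person image has already been turned into a feature array `reference_feature` by the same pipeline are all assumptions made for illustration; the function names are hypothetical.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature arrays (flattened first)."""
    a, b = a.ravel(), b.ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def screen_frames(frame_features, reference_feature, threshold=0.8):
    """Return indices of frames whose similarity reaches the threshold.

    frame_features: one target feature array per original video frame.
    threshold: assumed cut-off; the source does not specify a value.
    """
    sims = [cosine_similarity(f, reference_feature) for f in frame_features]
    keep = [i for i, s in enumerate(sims) if s >= threshold]
    return keep, sims
```

Frames whose indices are returned in `keep` would play the role of the target video frames; in practice the threshold (or a top-k rule) would need to be tuned on validation data.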

[0014] In some embodiments, the initial feature map includes a first feature map and a second feature map, and the step of extracting features from the original video frame to obtain the initial feature map of the original video frame includes:

[0015] The original video frame is scaled to obtain a first video image and a second video image, wherein the resolution of the first video image is higher than that of the second video image.

[0016] Based on a preset feature extraction model, features are extracted from the first video image to obtain a first feature map, and then features are extracted from the second video image using the same feature extraction model to obtain a second feature map.

[0017] In some embodiments, the feature extraction model includes a first neural network and a second neural network. The first feature map includes a first global feature map and a first local feature map, and the second feature map includes a second global feature map and a second local feature map. The step of extracting features from the first video image based on the preset feature extraction model to obtain a first feature map, and then extracting features from the second video image using the same feature extraction model to obtain a second feature map, includes:

[0018] Based on the first neural network, pose estimation is performed on the person in the first video image to obtain a first heatmap of the first video image, and pose estimation is performed on the person in the second video image to obtain a second heatmap of the second video image.

[0019] Based on the second neural network, feature extraction is performed on the first video image to obtain a first intermediate feature map, and then feature extraction is performed on the second video image through the second neural network to obtain a second intermediate feature map;

[0020] The first heatmap and the first intermediate feature map are fused to obtain the first local feature map, and the second heatmap and the second intermediate feature map are fused to obtain the second local feature map.

[0021] The first intermediate feature map is pooled to obtain the first global feature map, and the second intermediate feature map is pooled to obtain the second global feature map.

[0022] In some embodiments, the first neural network includes a first convolutional layer, a deconvolutional layer, and a second convolutional layer. The step of performing pose estimation on the person in the first video image based on the first neural network to obtain a first heatmap of the first video image includes:

[0023] Features are extracted from the first video image through the first convolutional layer to obtain a third intermediate feature map.

[0024] The third intermediate feature map is upsampled by the deconvolution layer to obtain the fourth intermediate feature map;

[0025] The first heatmap is obtained by performing heatmap prediction on the fourth intermediate feature map through the second convolutional layer.

[0026] In some embodiments, constructing the target feature attention matrix based on the initial spatial attention score includes:

[0027] Acquire the temporal data of the original video frame and the spatial location data of the initial sub-feature map in the initial feature map;

[0028] Based on the spatial location data, multiple initial spatial attention scores are concatenated to obtain the target spatial attention score for each initial feature map.

[0029] The multiple target spatial attention scores are aggregated based on the temporal data to obtain the target feature attention matrix.
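The construction of the target feature attention matrix described above can be sketched as follows. This is only one possible reading: it assumes that "concatenating based on spatial location" means ordering each frame's region scores by a spatial index, and that "aggregating based on temporal data" means stacking the per-frame score vectors in chronological order. The function and argument names are hypothetical.

```python
import numpy as np

def build_attention_matrix(scores, positions, timestamps):
    """Arrange region scores into an (N frames, K regions) matrix.

    scores[n][k]    : initial spatial attention score of a region of frame n
    positions[n][k] : spatial index of that region inside the feature map
    timestamps[n]   : capture time of frame n
    """
    per_frame = []
    for frame_scores, frame_pos in zip(scores, positions):
        order = np.argsort(frame_pos)  # order regions by spatial location
        per_frame.append(np.asarray(frame_scores, dtype=float)[order])
    time_order = np.argsort(timestamps)  # order frames chronologically
    return np.stack([per_frame[i] for i in time_order])
```

Row n of the result then holds the target spatial attention score of the n-th frame in time order, and column k holds the scores of the k-th spatial region across frames.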

[0030] In some embodiments, the step of constructing a feature map based on the target feature attention matrix and multiple initial sub-feature maps to obtain a target feature map includes:

[0031] Each initial sub-feature map is filtered according to the target feature attention matrix to obtain multiple target sub-feature maps;

[0032] Based on the target feature attention matrix, a weighted calculation is performed on multiple initial sub-feature maps to obtain a target weighted feature map for each initial feature map;

[0033] The target feature map is obtained based on the target sub-feature map and the target weighted feature map.

[0034] In some embodiments, obtaining the target feature map based on the target sub-feature map and the target weighted feature map includes:

[0035] The multiple target sub-feature maps are concatenated to obtain a first fused feature map;

[0036] The multiple target weighted feature maps are concatenated to obtain a second fused feature map.

[0037] The first fused feature map and the second fused feature map are fused to obtain the target feature map.
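A minimal sketch of the screening, weighting, and fusion steps above is given below. The source does not define the screening rule, the weighting formula, or the final fusion operator, so the sketch assumes: each sub-feature map has already been reduced to a D-dimensional descriptor; screening keeps, for every spatial region, the descriptor from the frame with the highest attention score; weighting is an attention-weighted sum over frames; and the final fusion is a concatenation. All names are hypothetical, and scores are assumed to be positive.

```python
import numpy as np

def build_target_feature(sub_feats: np.ndarray, attn: np.ndarray) -> np.ndarray:
    """sub_feats: (N, K, D) region descriptors; attn: (N, K) attention matrix."""
    _, n_regions, _ = sub_feats.shape
    # Screening (assumed rule): per region, keep the highest-scoring frame.
    best = attn.argmax(axis=0)                                # (K,)
    selected = sub_feats[best, np.arange(n_regions)]          # (K, D)
    # Weighting (assumed rule): attention-weighted sum over frames.
    weights = attn / attn.sum(axis=0, keepdims=True)          # columns sum to 1
    weighted = (weights[:, :, None] * sub_feats).sum(axis=0)  # (K, D)
    first_fused = selected.reshape(-1)    # concatenated target sub-features
    second_fused = weighted.reshape(-1)   # concatenated weighted features
    return np.concatenate([first_fused, second_fused])        # assumed fusion
```

With this reading, regions that are occluded in some frames contribute little: the screened branch ignores low-scoring frames entirely, and the weighted branch down-weights them.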

[0038] To achieve the above objectives, a second aspect of this application provides a pedestrian re-identification device, the device comprising:

[0039] The acquisition module is used to acquire multiple original video frames to be identified;

[0040] The feature extraction module is used to extract features from the original video frame to obtain an initial feature map of the original video frame;

[0041] The first normalization module is used to input the initial feature map into a preset spatiotemporal attention model for normalization processing to obtain the initial attention map of the initial feature map.

[0042] The segmentation module is used to segment the initial feature map to obtain initial sub-feature maps, and to segment the initial attention map to obtain sub-attention maps, wherein the initial sub-feature maps and the sub-attention maps are in one-to-one correspondence;

[0043] The second normalization module is used to normalize the sub-attention map to obtain the initial spatial attention score of the initial sub-feature map;

[0044] The first construction module is used to construct a target feature attention matrix based on the initial spatial attention score;

[0045] The second construction module is used to construct a feature map based on the target feature attention matrix and multiple initial sub-feature maps to obtain a target feature map.

[0046] The calculation module is used to calculate the similarity between the target feature map and the preset reference person image to obtain the target similarity.

[0047] The filtering module is used to filter multiple original video frames based on the target similarity to obtain target video frames, wherein the target video frames are video frames containing the target person.

[0048] To achieve the above objectives, a third aspect of this application provides an electronic device, which includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the method described in the first aspect.

[0049] To achieve the above objectives, a fourth aspect of the present application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the method described in the first aspect.

[0050] The pedestrian re-identification method, device, electronic device, and computer-readable storage medium proposed in this application acquire multiple original video frames to be identified and extract features from the original video frames to obtain initial feature maps. To avoid spatial misalignment caused by pedestrian pose changes between video frames, an attention mechanism is introduced to process the initial feature maps: the initial feature maps are input into a preset spatiotemporal attention model for normalization to obtain an initial attention map of each initial feature map. The initial feature maps are then segmented to obtain initial sub-feature maps, and the initial attention maps are segmented to obtain sub-attention maps. The initial sub-feature maps and the sub-attention maps are in one-to-one correspondence, so that each spatial region of the initial feature map has a corresponding sub-attention map. The sub-attention maps are normalized to obtain initial spatial attention scores, so that each spatial region of the initial feature map has a corresponding spatial attention score. A target feature attention matrix is constructed based on these initial spatial attention scores, and a feature map is then constructed from the target feature attention matrix and the multiple initial sub-feature maps to obtain the target feature map. This avoids relying on features of occluded areas in video frames for pedestrian re-identification; instead, key features with high spatial attention scores are used to accurately identify pedestrian body parts in video frames. Furthermore, the similarity between the target feature map and a preset reference person image is calculated to obtain a target similarity, and the multiple original video frames are filtered based on this target similarity to obtain the target video frame, which contains the target person. Using the target video frame as the result of pedestrian re-identification improves the accuracy of pedestrian re-identification.

Description of the Drawings

[0051] Figure 1 is a flowchart of the pedestrian re-identification method provided in the embodiments of this application;

[0052] Figure 2 is a flowchart of step S120 in Figure 1;

[0053] Figure 3 is a flowchart of step S220 in Figure 2;

[0054] Figure 4 is a flowchart of step S310 in Figure 3;

[0055] Figure 5 is a flowchart of step S160 in Figure 1;

[0056] Figure 6 is a flowchart of step S170 in Figure 1;

[0057] Figure 7 is a flowchart of step S630 in Figure 6;

[0058] Figure 8 is a schematic diagram of the pedestrian re-identification device provided in the embodiments of this application;

[0059] Figure 9 is a schematic diagram of the hardware structure of the electronic device provided in the embodiments of this application.

Detailed Implementation

[0060] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0061] It should be noted that although functional modules are divided in the device schematic diagram and a logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in a different order than the module division in the device or the order in the flowchart. The terms "first," "second," etc., in the specification, claims, and the aforementioned drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.

[0062] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of this application only and is not intended to limit this application.

[0063] First, several terms used in this application are explained:

[0064] Artificial intelligence (AI) is a new branch of computer science that studies, develops, and applies theories, methods, technologies, and systems to simulate, extend, and expand human intelligence. It aims to understand the essence of intelligence and produce intelligent machines that can react in a way similar to human intelligence. Research in this field includes robotics, speech recognition, image recognition, natural language processing, and expert systems. AI can simulate the information processes of human consciousness and thought. Furthermore, AI utilizes digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceiving the environment, acquiring knowledge, and using that knowledge to achieve optimal results.

[0065] Person re-identification (ReID), also known as pedestrian re-identification, is a technique that uses computer vision to determine whether images of a pedestrian appearing in different time periods and under different surveillance cameras belong to a specific pedestrian. Person re-identification is considered a sub-problem of image retrieval, that is, given a surveillance image of a pedestrian, retrieving images of that pedestrian from different camera devices.

[0066] In related technologies, when performing pedestrian re-identification tasks, specific pedestrians are detected based on multiple video sequences captured by multiple cameras. However, existing pedestrian re-identification models are based on basic operations such as average pooling or max pooling across video frames to obtain the representation of the input video. These models are easily limited by data noise such as camera viewpoint, lighting, occlusion, and background clutter, and cannot handle spatial misalignment caused by changes in human posture between video frames, resulting in a decrease in the accuracy of pedestrian re-identification.

[0067] Based on this, embodiments of this application provide a pedestrian re-identification method, apparatus, electronic device, and storage medium, aiming to improve the accuracy of pedestrian re-identification.

[0068] The pedestrian re-identification method, pedestrian re-identification device, electronic device, and computer-readable storage medium provided in this application are specifically described through the following embodiments. First, the pedestrian re-identification method in this application embodiment is described.

[0069] The embodiments of this application can acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) refers to the theories, methods, technologies, and application systems that use digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.

[0070] Foundational technologies for artificial intelligence generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating / interactive systems, and mechatronics. AI software technologies mainly encompass computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning / deep learning.

[0071] The pedestrian re-identification method provided in this application relates to the field of artificial intelligence technology. The pedestrian re-identification method provided in this application can be applied to a terminal, a server, or software running on either a terminal or a server. In some embodiments, the terminal can be a smartphone, tablet, laptop, desktop computer, etc.; the server can be configured as an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms; the software can be an application implementing the pedestrian re-identification method, but is not limited to the above forms.

[0072] This application can be used in a wide variety of general-purpose or special-purpose computer system environments or configurations. Examples include: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices. This application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform specific tasks or implement specific abstract data types. This application can also be practiced in distributed computing environments where tasks are performed by remote processing devices connected via a communication network. In distributed computing environments, program modules can reside in local and remote computer storage media, including storage devices.

[0073] It should be noted that in all specific embodiments of this application, when processing data related to user identity or characteristics, such as user information, user behavior data, user historical data, and user location information, user permission or consent is obtained first. Furthermore, the collection, use, and processing of this data comply with relevant laws, regulations, and standards. In addition, when embodiments of this application require access to sensitive personal information of users, separate permission or consent from the user is obtained through pop-ups or redirection to confirmation pages. Only after obtaining the user's separate permission or consent is the necessary user-related data required for the proper functioning of these embodiments acquired.

[0074] Figure 1 is an optional flowchart of the pedestrian re-identification method provided in the embodiments of this application. The method in Figure 1 may include, but is not limited to, steps S110 to S190.

[0075] Step S110: Obtain multiple original video frames to be identified;

[0076] Step S120: Extract features from the original video frames to obtain the initial feature map of the original video frames;

[0077] Step S130: Input the initial feature map into the preset spatiotemporal attention model for normalization processing to obtain the initial attention map of the initial feature map;

[0078] Step S140: Segment the initial feature map to obtain initial sub-feature maps, and segment the initial attention map to obtain sub-attention maps; wherein the initial sub-feature maps and the sub-attention maps are in one-to-one correspondence;

[0079] Step S150: Normalize the sub-attention map to obtain the initial spatial attention score of the initial sub-feature map;

[0080] Step S160: Construct the target feature attention matrix based on the initial spatial attention score;

[0081] Step S170: Construct a feature map based on the target feature attention matrix and multiple initial sub-feature maps to obtain the target feature map;

[0082] Step S180: Calculate the similarity between the target feature map and the preset reference person image to obtain the target similarity.

[0083] Step S190: Filter multiple original video frames based on target similarity to obtain target video frames; wherein, the target video frame is a video frame containing the target person.

[0084] In steps S110 to S190 illustrated in this embodiment, multiple original video frames to be identified are acquired, and features are extracted from the original video frames to obtain initial feature maps. To avoid spatial misalignment between video frames caused by changes in pedestrian posture, an attention mechanism is introduced to process the initial feature maps: the initial feature maps are input into a preset spatiotemporal attention model for normalization, resulting in an initial attention map. To ensure that each spatial region of the initial feature map has a corresponding sub-attention map, the initial feature map is segmented to obtain initial sub-feature maps, and the initial attention map is segmented to obtain sub-attention maps, with the initial sub-feature maps and the sub-attention maps in one-to-one correspondence. To ensure that each spatial region of the initial feature map has a corresponding spatial attention score, the sub-attention maps are normalized to obtain the initial spatial attention scores of the initial sub-feature maps. A target feature attention matrix is then constructed based on these initial spatial attention scores. Finally, a feature map is constructed using the target feature attention matrix and the multiple initial sub-feature maps to obtain the target feature map. This avoids relying on features of occluded areas in video frames for pedestrian re-identification; instead, key features with high spatial attention scores are used to accurately identify the body parts of pedestrians in video frames. The similarity between the target feature map and a preset reference person image is calculated to obtain the target similarity. The multiple original video frames are then filtered based on the target similarity to obtain the target video frame, which contains the target person. Using the target video frame as the result of pedestrian re-identification improves the accuracy of pedestrian re-identification.
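The segmentation and score normalization of steps S140 and S150 can be sketched as below. The source does not specify how the attention map is computed, how the maps are split, or which normalization is applied. The sketch therefore assumes that the attention map is the channel-wise L2 norm of the feature map, that both maps are split into horizontal strips along the height, and that each strip's score is its summed attention divided by the frame total (an L1 normalization). The function name and the number of regions are hypothetical.

```python
import numpy as np

def split_and_score(feature_map: np.ndarray, n_regions: int):
    """feature_map: (C, H, W). Returns K sub-feature maps, K sub-attention
    maps in one-to-one correspondence, and K spatial attention scores."""
    # Assumed attention map: L2 norm over channels at every location.
    attention = np.linalg.norm(feature_map, axis=0)              # (H, W)
    # Split both maps into the same K horizontal strips.
    sub_feats = np.array_split(feature_map, n_regions, axis=1)
    sub_attn = np.array_split(attention, n_regions, axis=0)
    # Assumed normalization: strip sums divided by the frame total.
    raw = np.array([a.sum() for a in sub_attn])
    scores = raw / raw.sum()
    return sub_feats, sub_attn, scores
```

Because both maps are split with the same boundaries, the k-th sub-attention map always covers the same spatial region as the k-th sub-feature map. An all-zero feature map would make the total zero, so a real implementation would need to guard that case.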

[0085] In step S110 of some embodiments, multiple original video frames to be identified are obtained, where the original video frames are pedestrian images captured by multiple cameras at different times and in different areas. Specifically, video sequences captured by multiple cameras at different times and in different areas are obtained from open-source datasets such as MARS and DukeMTMC-VideoReID. Each video sequence includes M original video frames. Video frames in a sequence are correlated; for example, adjacent video frames may differ only slightly in parts such as the head and eyes. To avoid repeatedly processing strongly correlated video frames, reduce the computation required for video processing, and improve the efficiency of pedestrian re-identification, the video sequence is randomly sampled to obtain N original video frames, where M is greater than N.
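The random sampling of N frames from an M-frame sequence can be sketched in a few lines. Sampling without replacement and returning the frames in their original temporal order are assumptions, as are the function name and the optional `seed` argument, which is included only to make runs reproducible.

```python
import random

def sample_frames(frames, n, seed=None):
    """Randomly pick n of the M frames without replacement,
    returned in their original temporal order (an assumption)."""
    if n >= len(frames):
        raise ValueError("N must be smaller than M")
    rng = random.Random(seed)
    indices = sorted(rng.sample(range(len(frames)), n))
    return [frames[i] for i in indices]
```

For example, `sample_frames(sequence, 8)` would reduce a long tracklet to eight frames before feature extraction.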

[0086] It should be noted that the similarity of the same person across different original video frames can be preserved through inter-frame regularization.

[0087] Referring to Figure 2, in some embodiments, the initial feature map includes a first feature map and a second feature map, and step S120 may include, but is not limited to, steps S210 to S220:

[0088] Step S210: Perform scale transformation on the original video frame to obtain a first video image and a second video image; wherein the resolution of the first video image is higher than that of the second video image.

[0089] Step S220: Based on the preset feature extraction model, feature extraction is performed on the first video image to obtain a first feature map, and feature extraction is performed on the second video image through the feature extraction model to obtain a second feature map.

[0090] In step S210 of some embodiments, the original video frame is scale-transformed using a stationary wavelet transform, which decomposes the original video frame into a first video image and a second video image of the same image size. The first video image represents the overall appearance of the original video frame, and the second video image represents its details in the horizontal, vertical, and diagonal directions. Specifically, a wavelet basis and a scale coefficient are determined. A wavelet transform is performed on each row of the original video frame according to the wavelet basis to obtain low-frequency data and high-frequency data. A wavelet transform is then performed on each column of the low-frequency data and the high-frequency data according to the wavelet basis, decomposing the low-frequency data into low-low-frequency data and low-high-frequency data, and the high-frequency data into high-low-frequency data and high-high-frequency data. The low-low-frequency data is decomposed further in the same way until the number of decomposition levels equals the scale coefficient. The low-low-frequency data corresponds to the first video image, the low-high-frequency data reflects horizontal detail information, the high-low-frequency data reflects vertical detail information, and the high-high-frequency data reflects diagonal detail information. The wavelet basis may be a Haar wavelet, a Daubechies (db) wavelet, or the like. As the number of decomposition levels increases, the data noise in the original video frames decreases, which reduces the impact of data noise on the accuracy of pedestrian re-identification.
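One level of the row-then-column decomposition described above can be sketched with NumPy for the Haar basis. The sketch assumes an undecimated (stationary) transform with periodic boundary handling and orthonormal Haar filters; the sign convention of the high-pass filter and the function name are illustrative choices, not taken from the source.

```python
import numpy as np

def haar_swt_level1(img: np.ndarray):
    """One level of an undecimated Haar transform (periodic boundaries).

    Returns (ll, lh, hl, hh), each with the same shape as img.
    """
    s = 1.0 / np.sqrt(2.0)

    def low(x, axis):   # Haar low-pass: scaled sum of neighbours
        return s * (x + np.roll(x, -1, axis=axis))

    def high(x, axis):  # Haar high-pass: scaled difference of neighbours
        return s * (x - np.roll(x, -1, axis=axis))

    lo, hi = low(img, 1), high(img, 1)   # transform each row
    ll, lh = low(lo, 0), high(lo, 0)     # transform columns of the low band
    hl, hh = low(hi, 0), high(hi, 0)     # transform columns of the high band
    return ll, lh, hl, hh
```

Because nothing is downsampled, all four bands keep the size of the input frame, which matches the statement that the first and second video images have the same image size. Deeper levels would repeat the procedure on `ll` with dilated filters, which this sketch does not cover.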

[0091] In step S220 of some embodiments, the feature extraction model may include a first neural network and a second neural network. The first neural network estimates the pose of a person in the first video image to obtain a first heatmap of the first video image, and the first neural network also estimates the pose of a person in the second video image to obtain a second heatmap of the second video image. Further, the second neural network extracts features from the first video image to obtain a first intermediate feature map, and the second neural network also extracts features from the second video image to obtain a second intermediate feature map. Then, the first heatmap and the first intermediate feature map are fused to obtain a first local feature map, and the second heatmap and the second intermediate feature map are fused to obtain a second local feature map. Finally, the first intermediate feature map is pooled to obtain a first global feature map, and the second intermediate feature map is pooled to obtain a second global feature map. Thus, a first feature map is obtained based on the first global feature map and the first local feature map, and a second feature map is obtained based on the second global feature map and the second local feature map.

[0092] Steps S210 to S220 above involve scaling the original video frames to perform multi-scale refinement analysis, resulting in a first video image and a second video image. This process extracts video images of the original video frames at different scales. Based on a feature extraction model, features are extracted from the video images at different scales to obtain multi-scale features of the original video frames. Pedestrian re-identification is then performed based on these multi-scale features, improving the accuracy of pedestrian re-identification.

[0093] Referring to Figure 3, in some embodiments, the feature extraction model includes a first neural network and a second neural network, the first feature map includes a first global feature map and a first local feature map, the second feature map includes a second global feature map and a second local feature map, and step S220 may include, but is not limited to, steps S310 to S340:

[0094] Step S310: Based on the first neural network, perform human pose estimation on the first video image to obtain the first heatmap of the first video image, and perform human pose estimation on the second video image through the first neural network to obtain the second heatmap of the second video image.

[0095] Step S320: Extract features from the first video image based on the second neural network to obtain a first intermediate feature map, and extract features from the second video image using the second neural network to obtain a second intermediate feature map;

[0096] Step S330: Perform feature fusion on the first heatmap and the first intermediate feature map to obtain the first local feature map, and perform feature fusion on the second heatmap and the second intermediate feature map to obtain the second local feature map;

[0097] Step S340: Perform pooling processing on the first intermediate feature map to obtain the first global feature map, and perform pooling processing on the second intermediate feature map to obtain the second global feature map.

[0098] In step S310 of some embodiments, the first neural network may include a first convolutional layer, a deconvolutional layer, and a second convolutional layer. For example, the first convolutional layer extracts features from the first video image to obtain a third intermediate feature map; the deconvolutional layer upsamples the third intermediate feature map to obtain a fourth intermediate feature map; and the second convolutional layer performs heatmap prediction on the fourth intermediate feature map to obtain a first heatmap.

[0099] Furthermore, the specific process of using the first neural network to estimate the pose of the person in the second video image and obtaining the second heatmap of the second video image is basically the same as the specific process of using the first neural network to estimate the pose of the person in the first video image and obtaining the first heatmap, and will not be repeated here.

[0100] In step S320 of some embodiments, the second neural network is ResNet50. ResNet50 is used as the backbone network to extract features from the first video image to obtain a first intermediate feature map. ResNet50 is then used to extract features from the second video image to obtain a second intermediate feature map.

[0101] In step S330 of some embodiments, feature fusion is performed on the first heatmap and the first intermediate feature map; that is, the feature elements in the first heatmap are multiplied by the corresponding feature elements in the first intermediate feature map to obtain a first local feature map. The first local feature map characterizes the features of a certain region of the first video image, such as head features or shoulder features. Feature fusion is likewise performed on the second heatmap and the second intermediate feature map; that is, the feature elements in the second heatmap are multiplied by the corresponding feature elements in the second intermediate feature map to obtain a second local feature map. The second local feature map characterizes the features of a certain region of the second video image, such as head features or shoulder features.
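
The element-wise multiplication described in this step can be sketched as follows. This is a minimal illustration, assuming a single-channel keypoint heatmap that has already been resized to the same H x W size as the intermediate feature map and is applied to every channel; the function name `fuse_heatmap` and the nested-list layout are hypothetical and are not taken from the application.

```python
def fuse_heatmap(heatmap, features):
    """Multiply each feature vector by the heatmap value at the same (h, w).

    heatmap:  H x W nested list of floats (assumed single channel).
    features: H x W x D nested list of floats.
    """
    if len(heatmap) != len(features) or any(
        len(h_row) != len(f_row) for h_row, f_row in zip(heatmap, features)
    ):
        raise ValueError("heatmap and feature map must share the same H x W size")
    return [
        [[weight * value for value in vector] for weight, vector in zip(h_row, f_row)]
        for h_row, f_row in zip(heatmap, features)
    ]


# Toy 2 x 2 x 2 example: positions with a zero heatmap response are suppressed.
heatmap = [[1.0, 0.0], [0.5, 0.0]]
features = [[[2.0, 4.0], [6.0, 8.0]], [[1.0, 3.0], [5.0, 7.0]]]
local_map = fuse_heatmap(heatmap, features)
```

In this sketch, positions where the heatmap is zero contribute nothing to the local feature map, which is why the result emphasizes the region around a pose key point.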

[0102] In step S340 of some embodiments, the first intermediate feature map is subjected to max pooling or average pooling to obtain a first global feature map. The first global feature map is used to characterize the overall attributes of the first video image, such as color features, texture features, shape features, etc. The second intermediate feature map is subjected to max pooling or average pooling to obtain a second global feature map. The second global feature map is used to characterize the overall attributes of the second video image, such as color features, texture features, shape features, etc.
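
The pooling in this step can be illustrated with a short sketch. It assumes global pooling, meaning that each channel is reduced over all H x W positions to a single value; the application only states "max pooling or average pooling", so the global reduction, the function name `global_pool`, and the toy values are assumptions.

```python
def global_pool(features, mode="avg"):
    """Pool an H x W x D nested-list feature map into a length-D vector."""
    vectors = [vector for row in features for vector in row]
    channels = list(zip(*vectors))  # one tuple of H*W values per channel
    if mode == "max":
        return [max(channel) for channel in channels]
    if mode == "avg":
        return [sum(channel) / len(channel) for channel in channels]
    raise ValueError("mode must be 'max' or 'avg'")


features = [[[2.0, 4.0], [6.0, 8.0]], [[1.0, 3.0], [5.0, 7.0]]]
global_max = global_pool(features, "max")  # [6.0, 8.0]
global_avg = global_pool(features, "avg")  # [3.5, 5.5]
```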

[0103] In steps S310 to S340 above, person pose estimation is performed on the first video image and the second video image based on the first neural network, resulting in a first heatmap of the first video image and a second heatmap of the second video image. Feature extraction is then performed on the first video image and the second video image based on the second neural network, resulting in a first intermediate feature map and a second intermediate feature map. To obtain a more accurate representation of pedestrian local features, feature fusion is performed on the first heatmap and the first intermediate feature map to obtain a first local feature map, and on the second heatmap and the second intermediate feature map to obtain a second local feature map. To obtain global information of the video image, pooling processing is performed on the first intermediate feature map and the second intermediate feature map to obtain a first global feature map and a second global feature map. Since global features do not include image spatial information, and local features do not include overall image attribute information, combining global and local features allows for a complete representation of the original video frame, avoiding the omission of information about specific pedestrians.

[0104] Referring to Figure 4, in some embodiments, the first neural network includes a first convolutional layer, a deconvolutional layer, and a second convolutional layer, and step S310 may include, but is not limited to, steps S410 to S430:

[0105] Step S410: Extract features from the first video image through the first convolutional layer to obtain the third intermediate feature map;

[0106] Step S420: Upsample the third intermediate feature map through a deconvolution layer to obtain the fourth intermediate feature map;

[0107] Step S430: Perform heatmap prediction on the fourth intermediate feature map through the second convolutional layer to obtain the first heatmap.

[0108] In step S410 of some embodiments, the first convolutional layer is a ResNet50 network. Feature extraction is performed on the first video image using the ResNet50 network to obtain a third intermediate feature map. Specifically, the ResNet50 network includes convolutional layers and residual layers. The convolutional layers perform convolution operations on the first video image to obtain a convolutional feature map, and the residual layers perform residual operations on the convolutional feature map to obtain the third intermediate feature map. The residual layers employ skip connections, mitigating the gradient vanishing problem caused by the increase in neural network depth and improving the accuracy of feature extraction.

[0109] In step S420 of some embodiments, the deconvolution layer includes a first deconvolution layer, a second deconvolution layer, and a third deconvolution layer. The first deconvolution layer upsamples the third intermediate feature map to obtain a first deconvolution feature map. The second deconvolution layer upsamples the first deconvolution feature map to obtain a second deconvolution feature map. The third deconvolution layer upsamples the second deconvolution feature map to obtain a fourth intermediate feature map. It should be noted that the feature map size doubles with each upsampling process. Through three upsampling processes, the low-resolution third intermediate feature map is converted into a high-resolution fourth intermediate feature map. The pedestrian re-identification result is obtained based on the high-resolution fourth intermediate feature map, which can improve the accuracy of detecting specific pedestrians from video sequences and increase the reliability of pedestrian re-identification.
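
As a worked example of the size arithmetic, the sketch below tracks the spatial size through the three deconvolution layers, each of which doubles the height and width. The 8 x 4 input size and the function name `upsampled_size` are hypothetical; the application does not state the actual feature map dimensions.

```python
def upsampled_size(height, width, num_layers=3, factor=2):
    """Return the (height, width) after each of num_layers upsampling layers."""
    sizes = []
    for _ in range(num_layers):
        height, width = height * factor, width * factor
        sizes.append((height, width))
    return sizes


# Three doublings enlarge each spatial dimension by a factor of 2**3 = 8.
sizes = upsampled_size(8, 4)  # [(16, 8), (32, 16), (64, 32)]
```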

[0110] In step S430 of some embodiments, the second convolutional layer includes a third convolutional layer, a fourth convolutional layer, and a fifth convolutional layer. The fourth intermediate feature map is input into the second convolutional layer. The third convolutional layer performs heatmap prediction on the fourth intermediate feature map to obtain the heatmap size. The fourth convolutional layer performs target prediction on the fourth intermediate feature map to obtain the target's width and height data. The fifth convolutional layer performs target prediction on the fourth intermediate feature map to obtain the target's center point coordinate data. A first heatmap is generated based on the heatmap size, the target's width and height data, and the center point coordinate data. The center point of the target is a pedestrian posture key point, such as the elbow, head, left shoulder, right knee, etc.
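
The application does not state how the first heatmap is rendered from the predicted quantities. One common choice for keypoint heatmaps is a 2D Gaussian centered on the key point, and the sketch below uses that purely as an assumption. The function name `render_heatmap`, the `sigma_scale` parameter, and the rule that derives the spread from the predicted width and height are all hypothetical.

```python
import math


def render_heatmap(size, center, box, sigma_scale=6.0):
    """Render a Gaussian peak at `center` on a grid of the given size.

    size:   (height, width) of the heatmap.
    center: (cx, cy) key point coordinates in heatmap pixels.
    box:    (box_w, box_h) predicted target width and height.
    """
    height, width = size
    cx, cy = center
    box_w, box_h = box
    sigma = max(box_w, box_h) / sigma_scale  # assumed spread rule
    return [
        [
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


heatmap = render_heatmap(size=(5, 5), center=(2, 2), box=(6, 6))
```

With these toy inputs the response equals 1.0 at the key point and decays smoothly with distance, so nearby positions still receive a non-zero weight.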

[0111] It should be noted that the second heatmap of the second video image is obtained by estimating the pose of the person in the second video image through the first neural network. The second heatmap is generated in the same way as the first heatmap, which will not be described again here.

[0112] In steps S410 to S430 above, features are extracted from the first video image through the first convolutional layer to obtain a third intermediate feature map. Because the first convolutional layer uses skip connections, it alleviates the vanishing-gradient problem caused by increasing network depth and improves the feature extraction accuracy of the third intermediate feature map. The third intermediate feature map is upsampled through the deconvolutional layer to increase its resolution and obtain a fourth intermediate feature map. The second convolutional layer then performs heatmap prediction on the fourth intermediate feature map to obtain a first heatmap of the pedestrian pose key points in the first video image. Performing pedestrian re-identification based on the first heatmap with pedestrian pose key points can improve the accuracy of pedestrian detection.

[0113] In step S130 of some embodiments, the initial feature map is represented as f[H,W,D], where H is the height of the initial feature map, W is its width, and D is its number of channels. The initial feature maps corresponding to the multiple original video frames are input into the spatiotemporal attention model for L2 normalization processing to obtain the initial attention map of each initial feature map. The method for performing L2 normalization processing is shown in formula (1).
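
Formula (1) is not reproduced in this text, so the following is only a sketch of one plausible reading: the attention value at each position (h, w) is the L2 norm of the D-dimensional feature vector at that position, divided by the sum of those norms over the whole map. The normalization over all positions, the function name `attention_map`, and the toy values are assumptions.

```python
import math


def attention_map(features):
    """Turn an H x W x D feature map into an H x W attention map summing to 1."""
    norms = [
        [math.sqrt(sum(v * v for v in vector)) for vector in row]
        for row in features
    ]
    total = sum(sum(row) for row in norms)
    if total == 0.0:
        raise ValueError("feature map is all zeros")
    return [[value / total for value in row] for row in norms]


features = [[[3.0, 4.0], [0.0, 0.0]], [[6.0, 8.0], [0.0, 5.0]]]
g = attention_map(features)  # [[0.25, 0.0], [0.5, 0.25]]
```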

[0114]

[0115] In step S140 of some embodiments, the initial feature map is segmented horizontally along its height direction into multiple region blocks, each region block being an initial sub-feature map. Similarly, the initial attention map is segmented horizontally into multiple sub-attention maps. The initial sub-feature maps and the sub-attention maps are in one-to-one correspondence, meaning that each region of the initial feature map has its corresponding sub-attention map.
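
The horizontal segmentation can be sketched as slicing the rows of a map into strips. The sketch assumes strips of equal height and a height that is divisible by the number of regions; neither is stated in the application, and the function name `split_rows` is hypothetical.

```python
def split_rows(grid, num_regions):
    """Split an H-row map into num_regions horizontal strips of equal height."""
    height = len(grid)
    if num_regions <= 0 or height % num_regions != 0:
        raise ValueError("height must be divisible by num_regions")
    step = height // num_regions
    return [grid[i:i + step] for i in range(0, height, step)]


feature_map = [[[1.0]], [[2.0]], [[3.0]], [[4.0]]]  # H=4, W=1, D=1
attention = [[0.1], [0.2], [0.3], [0.4]]            # H=4, W=1

# Using the same split for both maps keeps region k of the feature map
# aligned with region k of the attention map (one-to-one correspondence).
sub_features = split_rows(feature_map, 2)
sub_attention = split_rows(attention, 2)
```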

[0116] In step S150 of some embodiments, the sub-attention map is subjected to L1 normalization to obtain the initial spatial attention score of the initial sub-feature map. The sub-attention map is denoted g_{n,k}, which represents the sub-attention map of the k-th region of the n-th original video frame. The attention score of the k-th region of the n-th original video frame is obtained based on the sub-attention maps of the k-th region of the multiple original video frames. The method for L1 normalization is shown in formula (2).
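
Formula (2) is not reproduced in this text. As a hedged sketch of one reading consistent with the description, the raw score of region k in frame n is taken as the L1 norm (sum of absolute values) of its sub-attention map, and it is then divided by the sum of the raw scores of region k over all frames. The function name `region_scores` and the toy values are assumptions.

```python
def region_scores(sub_attention_maps):
    """sub_attention_maps[n][k] is the sub-attention map of region k in frame n.

    Returns scores[n][k]; for each region k the scores sum to 1 across frames.
    """
    raw = [
        [sum(abs(v) for row in region for v in row) for region in frame]
        for frame in sub_attention_maps
    ]
    totals = [sum(column) for column in zip(*raw)]  # per-region sum over frames
    return [[s / t if t else 0.0 for s, t in zip(frame, totals)] for frame in raw]


frames = [
    [[[0.25, 0.25]], [[0.25, 0.25]]],    # frame 0: region 0, region 1
    [[[0.125, 0.125]], [[0.5, 0.25]]],   # frame 1: region 0, region 1
]
scores = region_scores(frames)
```

Under this reading, a region that is occluded in one frame receives a small share of that region's total attention, so the same region in clearer frames dominates.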

[0117]

[0118] Referring to Figure 5, in some embodiments, step S160 may include, but is not limited to, steps S510 to S530:

[0119] Step S510: Obtain the temporal data of the original video frame and the spatial location data of the initial sub-feature map in the initial feature map;

[0120] Step S520: Concatenate the multiple initial spatial attention scores based on the spatial location data to obtain the target spatial attention score for each initial feature map;

[0121] Step S530: Aggregate the multiple target spatial attention scores based on the temporal data to obtain the target feature attention matrix.

[0122] In step S510 of some embodiments, the original video sequence is composed of multiple original video frames arranged in chronological order according to their temporal attributes; therefore, the original video frames have temporal attributes. The temporal data of the original video frames and the spatial location data of the initial sub-feature map within the initial feature map are obtained. The temporal data represents the temporal attribute of the original video frame, i.e., the time when the camera captured the original video frame. The spatial location data is used to characterize the spatial location of the initial sub-feature map within the initial feature map; for example, the initial sub-feature map is the k-th region of the initial feature map.

[0123] In step S520 of some embodiments, the multiple initial spatial attention scores are concatenated based on the spatial position data of the initial sub-feature maps on the initial feature map to obtain the target spatial attention score for each initial feature map. For example, the initial feature map is divided into four initial sub-feature maps, with spatial position indices of 1, 2, 3, and 4, respectively. The initial spatial attention scores are concatenated in the order of their spatial position indices: the score of spatial position index 1 is concatenated with the score of spatial position index 2, and so on through index 4, to obtain the target spatial attention score for the initial feature map.

[0124] In step S530 of some embodiments, the target spatial attention scores of the initial feature map are concatenated from first to last according to the chronological order of the time data to obtain the target feature attention matrix, wherein the target feature attention matrix includes the attention score of each region in multiple original video frames.
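
Steps S520 and S530 together can be sketched as building a matrix with one row per frame and one column per region. The input layout of `(timestamp, region_index, score)` triples and the function name `build_attention_matrix` are hypothetical; the application only specifies that scores are ordered by spatial position within a frame and by time across frames.

```python
def build_attention_matrix(entries):
    """entries: iterable of (timestamp, region_index, score) triples.

    Returns a matrix with one row per frame (earliest first) and one
    column per region (ordered by spatial position index).
    """
    by_time = {}
    for timestamp, region_index, score in entries:
        by_time.setdefault(timestamp, {})[region_index] = score
    return [
        [regions[k] for k in sorted(regions)]       # S520: order by position
        for _, regions in sorted(by_time.items())   # S530: order by time
    ]


# Scores arrive in arbitrary order and are arranged by time, then position.
entries = [(2.0, 2, 0.6), (1.0, 1, 0.7), (2.0, 1, 0.3), (1.0, 2, 0.4)]
matrix = build_attention_matrix(entries)  # [[0.7, 0.4], [0.3, 0.6]]
```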

[0125] Through steps S510 to S530, spatial attention scores can be assigned to each spatial region of the original video frame, instead of directly assigning attention scores to the original video frame. Since changes in human posture can cause changes in the features of the video frame and generate errors, directly assigning attention scores to each original video frame of the video sequence will not correct these errors. However, based on the attention scores of each spatial region, each human body part in the original video frame can be identified. It is possible to identify feature changes caused by changes in human posture and assign different attention scores to different features to correct the errors caused by posture changes, thereby improving the accuracy of pedestrian re-identification.

[0126] Referring to Figure 6, in some embodiments, step S170 may include, but is not limited to, steps S610 to S630:

[0127] Step S610: Filter each initial sub-feature map according to the target feature attention matrix to obtain multiple target sub-feature maps;

[0128] Step S620: Based on the target feature attention matrix, perform weighted calculation on multiple initial sub-feature maps to obtain the target weighted feature map of each initial feature map;

[0129] Step S630: Obtain the target feature map based on the target sub-feature map and the target weighted feature map.

[0130] In step S610 of some embodiments, the initial sub-feature maps are filtered according to the spatial attention score of the target feature attention matrix to obtain the sub-feature map with the highest spatial attention score in each initial feature map. The highest spatial attention score indicates that the sub-feature map belongs to the feature of the pedestrian in the original video frame. The sub-feature map with the highest attention score in the initial feature map is taken as the target sub-feature map.
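
The filtering in this step amounts to picking, for each initial feature map, the region whose spatial attention score is highest. A minimal sketch follows; the function name `select_target_sub_map` and the toy data are hypothetical, and ties are resolved in favor of the first region, which the application does not specify.

```python
def select_target_sub_map(sub_maps, scores):
    """Return (index, sub_map) for the region with the highest attention score."""
    if len(sub_maps) != len(scores) or not scores:
        raise ValueError("need one score per sub-feature map")
    best = max(range(len(scores)), key=lambda k: scores[k])
    return best, sub_maps[best]


sub_maps = [[[[1.0]]], [[[2.0]]], [[[3.0]]]]  # three 1 x 1 x 1 regions
scores = [0.2, 0.5, 0.3]                      # one row of the attention matrix
index, target_sub_map = select_target_sub_map(sub_maps, scores)
```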

[0131] In step S620 of some embodiments, each spatial attention score in the target feature attention matrix corresponds to an initial sub-feature map. The initial sub-feature maps are weighted according to the spatial attention scores to obtain the target weighted feature map. If the target feature attention matrix is represented as [w1, w2, w3], and the initial sub-feature maps are represented as x1, x2, and x3, then the target weighted feature map is represented as w1x1 + w2x2 + w3x3.
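
The weighted sum w1x1 + w2x2 + w3x3 can be sketched for equally sized sub-feature maps as follows. The nested-list layout, the function name `weighted_feature_map`, and the toy weights are assumptions made for illustration.

```python
def weighted_feature_map(weights, sub_maps):
    """Return sum_k w_k * x_k for equally sized H_k x W x D sub-feature maps."""
    if len(weights) != len(sub_maps) or not sub_maps:
        raise ValueError("need one weight per sub-feature map")
    first = sub_maps[0]
    rows, cols, depth = len(first), len(first[0]), len(first[0][0])
    return [
        [
            [
                sum(w * x[i][j][d] for w, x in zip(weights, sub_maps))
                for d in range(depth)
            ]
            for j in range(cols)
        ]
        for i in range(rows)
    ]


x1 = [[[1.0, 2.0]]]
x2 = [[[3.0, 4.0]]]
x3 = [[[5.0, 6.0]]]
weighted = weighted_feature_map([0.5, 0.25, 0.25], [x1, x2, x3])  # [[[2.5, 3.5]]]
```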

[0132] In step S630 of some embodiments, multiple target sub-feature maps can be concatenated to obtain a first fused feature map; and multiple target weighted feature maps can be concatenated to obtain a second fused feature map. Finally, the first fused feature map and the second fused feature map are fused to obtain a target feature map.

[0133] In steps S610 to S630 above, the initial sub-feature map with the highest spatial attention score is taken as the target sub-feature map. The initial sub-feature map is weighted according to the spatial attention score to obtain the target weighted feature map. The target feature map is obtained based on the target sub-feature map and the target weighted feature map. This can obtain an accurate representation of pedestrian features in the original video frame, avoid the influence of video occlusion on pedestrian features, and improve the accuracy of pedestrian re-identification.

[0134] Referring to Figure 7, in some embodiments, step S630 may include, but is not limited to, steps S710 to S730:

[0135] Step S710: Concatenate the multiple target sub-feature maps to obtain the first fused feature map;

[0136] Step S720: Concatenate the multiple target weighted feature maps to obtain the second fused feature map;

[0137] Step S730: Perform feature fusion on the first fused feature map and the second fused feature map to obtain the target feature map.

[0138] In step S710 of some embodiments, the target sub-feature maps of multiple original video frames are concatenated to obtain a first fused feature map.

[0139] In step S720 of some embodiments, the target weighted feature maps of multiple original video frames are concatenated to obtain a second fused feature map.

[0140] In step S730 of some embodiments, the first fused feature map and the second fused feature map are concatenated to obtain a third fused feature map, and the third fused feature map is input into a global average pooling layer and a fully connected layer to obtain a target feature map.
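
The concatenation, global average pooling, and fully connected layer of step S730 can be sketched on small lists of feature vectors. The sketch assumes that each fused feature map is flattened to a list of length-D vectors and that concatenation appends one list to the other; the function name `fuse_and_project`, the weight matrix, and the bias are hypothetical stand-ins for parameters that would be learned in practice.

```python
def fuse_and_project(first_fused, second_fused, weight, bias):
    """first_fused, second_fused: lists of length-D feature vectors.

    weight: out_dim x D matrix; bias: length-out_dim vector.
    """
    # Concatenate the two fused maps into the third fused feature map.
    third_fused = first_fused + second_fused
    # Global average pooling: one mean value per channel.
    pooled = [sum(channel) / len(channel) for channel in zip(*third_fused)]
    # Fully connected layer: out[i] = sum_d weight[i][d] * pooled[d] + bias[i].
    return [
        sum(w * p for w, p in zip(row, pooled)) + b
        for row, b in zip(weight, bias)
    ]


first = [[1.0, 2.0], [3.0, 4.0]]
second = [[5.0, 6.0], [7.0, 8.0]]
target = fuse_and_project(
    first, second, weight=[[1.0, 0.0], [0.5, 0.5]], bias=[0.0, 1.0]
)  # [4.0, 5.5]
```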

[0141] Through steps S710 to S730, the pedestrian features in each video frame of the original video sequence can be expressed. Based on these pedestrian features, pedestrian re-identification of specific pedestrians can be performed, which can improve the efficiency and accuracy of pedestrian re-identification.

[0142] In step S180 of some embodiments, a similarity calculation is performed between the target feature map and a preset reference person image to obtain a target similarity. The reference person image is an image of the specific pedestrian for whom pedestrian re-identification is required. Specifically, features are extracted from the reference person image according to a preset feature extraction method to obtain a reference person feature map. The similarity between the target feature map and the reference person feature map is then calculated to obtain the target similarity. It should be noted that the preset feature extraction method can be a supervised feature extraction method, an unsupervised feature extraction method, or the like. Euclidean distance, cosine similarity, or similar measures can be used to measure the similarity between the target feature map and the reference person feature map.

[0143] In step S190 of some embodiments, multiple original video frames are filtered according to target similarity, and the original video frame with the highest target similarity is taken as the target video frame. The highest target similarity indicates that the target person appears in the original video frame, and the target video frame is a video frame containing the target person.
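
The similarity calculation and the selection of the target video frame can be sketched together. The sketch assumes one feature vector per original video frame and uses cosine similarity, which is one of the measures the description mentions; the function names `cosine_similarity` and `select_target_frame` and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def select_target_frame(frame_features, reference):
    """Return (index of the most similar frame, list of all similarities)."""
    sims = [cosine_similarity(f, reference) for f in frame_features]
    best = max(range(len(sims)), key=lambda i: sims[i])
    return best, sims


reference = [1.0, 0.0]                                  # reference person feature
frame_features = [[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]]   # one vector per frame
best_index, similarities = select_target_frame(frame_features, reference)
```

If Euclidean distance were used instead, the frame with the smallest distance, rather than the largest value, would be selected.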

[0144] Referring to Figure 8, this application also provides a pedestrian re-identification device that can implement the above-described pedestrian re-identification method. The device includes:

[0145] The acquisition module 810 is used to acquire multiple original video frames to be identified;

[0146] The feature extraction module 820 is used to extract features from the original video frames to obtain the initial feature map of the original video frames;

[0147] The first normalization module 830 is used to input the initial feature map into a preset spatiotemporal attention model for normalization processing to obtain the initial attention map of the initial feature map.

[0148] The segmentation module 840 is used to segment the initial feature map to obtain initial sub-feature maps, and to segment the initial attention map to obtain sub-attention maps, wherein the initial sub-feature maps and the sub-attention maps are in one-to-one correspondence;

[0149] The second normalization module 850 is used to normalize the sub-attention map to obtain the initial spatial attention score of the initial sub-feature map.

[0150] The first construction module 860 is used to construct the target feature attention matrix based on the initial spatial attention score;

[0151] The second construction module 870 is used to construct a feature map based on the target feature attention matrix and multiple initial sub-feature maps to obtain the target feature map.

[0152] The calculation module 880 is used to calculate the similarity between the target feature map and the preset reference person image to obtain the target similarity.

[0153] The filtering module 890 is used to filter multiple original video frames based on target similarity to obtain target video frames, which are video frames containing the target person.

[0154] The specific implementation of this pedestrian re-identification device is basically the same as the specific implementation of the pedestrian re-identification method described above, and will not be repeated here.

[0155] This application also provides an electronic device, which includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the aforementioned pedestrian re-identification method. This electronic device can be any smart terminal, including tablet computers, in-vehicle computers, etc.

[0156] Referring to Figure 9, which illustrates the hardware structure of an electronic device according to another embodiment, the electronic device includes:

[0157] The processor 910 can be implemented using a general-purpose CPU (Central Processing Unit), microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of this application.

[0158] The memory 920 can be implemented as a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 920 can store the operating system and other applications. When the technical solutions provided in the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 920 and is called by the processor 910 to execute the pedestrian re-identification method of the embodiments of this application.

[0159] The input / output interface 930 is used to implement information input and output;

[0160] The communication interface 940 is used to enable communication and interaction between this device and other devices. Communication can be achieved through wired means (such as USB, network cable, etc.) or wireless means (such as mobile network, WIFI, Bluetooth, etc.).

[0161] Bus 950 transmits information between various components of the device (e.g., processor 910, memory 920, input / output interface 930, and communication interface 940);

[0162] The processor 910, memory 920, input / output interface 930 and communication interface 940 are connected to each other within the device via bus 950.

[0163] This application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described pedestrian re-identification method.

[0164] Memory, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs and non-transitory computer-executable programs. Furthermore, memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory may optionally include memory remotely located relative to the processor, and these remote memories can be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

[0165] The pedestrian re-identification method, pedestrian re-identification device, electronic device, and computer-readable storage medium provided in this application acquire multiple original video frames to be identified and extract features from the original video frames to obtain initial feature maps of the original video frames. To avoid spatial misalignment between multiple video frames caused by changes in pedestrian posture, an attention mechanism is introduced to process the initial feature maps: the initial feature maps are input into a preset spatiotemporal attention model for normalization processing to obtain an initial attention map of each initial feature map. To ensure that each spatial region of the initial feature map has a corresponding sub-attention map, the initial feature map is segmented to obtain initial sub-feature maps, and the initial attention map is segmented to obtain sub-attention maps, the initial sub-feature maps and the sub-attention maps being in one-to-one correspondence. To ensure that each spatial region of the initial feature map has a corresponding spatial attention score, the sub-attention maps are normalized to obtain the initial spatial attention scores of the initial sub-feature maps. A target feature attention matrix is then constructed based on these initial spatial attention scores, and a feature map is constructed using the target feature attention matrix and multiple initial sub-feature maps to obtain the target feature map. This avoids relying on features of occluded areas in video frames for pedestrian re-identification, instead using key features with high spatial attention scores to accurately identify the body parts of pedestrians in video frames. The similarity between the target feature map and a preset reference person image is calculated to obtain the target similarity. Multiple original video frames are then filtered based on the target similarity to obtain the target video frame, which contains the target person. Using the target video frame as the result of pedestrian re-identification improves the accuracy of pedestrian re-identification.

[0166] The embodiments described in this application are for the purpose of more clearly illustrating the technical solutions of the embodiments of this application, and do not constitute a limitation on the technical solutions provided by the embodiments of this application. As those skilled in the art will know, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of this application are also applicable to similar technical problems.

[0167] Those skilled in the art will understand that the technical solutions shown in the figures do not constitute a limitation on the embodiments of this application, and may include more or fewer steps than shown, or combine certain steps, or different steps.

[0168] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.

[0169] Those skilled in the art will understand that all or some of the steps in the methods disclosed above, as well as the functional modules / units in the systems and devices, can be implemented as software, firmware, hardware, or suitable combinations thereof.

[0170] The terms “first,” “second,” “third,” “fourth,” etc. (if present) in the specification and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms “comprising” and “having,” and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0171] It should be understood that in this application, "at least one (item)" means one or more, and "multiple" means two or more. "And/or" is used to describe the relationship between related objects, indicating that three relationships can exist. For example, "A and/or B" can represent three cases: only A exists, only B exists, and both A and B exist simultaneously, where A and B can be singular or plural. The character "/" generally indicates that the preceding and following related objects are in an "or" relationship. "At least one (item) of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one (item) of a, b, or c can represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can be single or multiple.

[0172] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units described above is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.

[0173] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0174] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0175] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes multiple instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned storage medium includes various media capable of storing programs, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0176] The preferred embodiments of the present application have been described above with reference to the accompanying drawings, but this does not limit the scope of the claims of the present application. Any modifications, equivalent substitutions, and improvements made by those skilled in the art without departing from the scope and substance of the embodiments of the present application shall be within the scope of the claims of the present application.

Claims

1. A pedestrian re-identification method, characterized in that the method comprises:
acquiring multiple original video frames to be identified;
performing feature extraction on the original video frames to obtain an initial feature map of the original video frames;
inputting the initial feature map into a preset spatiotemporal attention model for normalization processing to obtain an initial attention map of the initial feature map;
segmenting the initial feature map to obtain initial sub-feature maps, and segmenting the initial attention map to obtain sub-attention maps, wherein the initial sub-feature maps and the sub-attention maps correspond one-to-one;
normalizing each sub-attention map to obtain an initial spatial attention score of the corresponding initial sub-feature map;
constructing a target feature attention matrix based on the initial spatial attention scores;
constructing a feature map based on the target feature attention matrix and the multiple initial sub-feature maps to obtain a target feature map;
calculating the similarity between the target feature map and a preset reference person image to obtain a target similarity; and
filtering the multiple original video frames based on the target similarity to obtain target video frames, wherein the target video frames are video frames containing the target person;
wherein the initial feature map includes a first feature map and a second feature map;
the step of performing feature extraction on the original video frames to obtain the initial feature map of the original video frames includes: scaling the original video frames to obtain a first video image and a second video image, wherein the resolution of the first video image is higher than that of the second video image; extracting features from the first video image based on a preset feature extraction model to obtain the first feature map, and extracting features from the second video image using the same feature extraction model to obtain the second feature map;
the step of constructing the target feature attention matrix based on the initial spatial attention scores includes: acquiring temporal data of the original video frames and spatial location data of each initial sub-feature map within the initial feature map; concatenating the multiple initial spatial attention scores based on the spatial location data to obtain a target spatial attention score for each initial feature map; and aggregating the multiple target spatial attention scores based on the temporal data to obtain the target feature attention matrix;
the step of constructing a feature map based on the target feature attention matrix and the multiple initial sub-feature maps to obtain the target feature map includes: filtering each initial sub-feature map according to the target feature attention matrix to obtain multiple target sub-feature maps; weighting the multiple initial sub-feature maps according to the target feature attention matrix to obtain a target weighted feature map for each initial feature map; and obtaining the target feature map according to the target sub-feature maps and the target weighted feature maps.
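
For illustration only, the attention-related steps of claim 1 might be sketched as follows. The claim does not fix the segmentation scheme, the normalization function, or the filtering rule, so this sketch assumes horizontal stripes, a softmax over the mean attention of each stripe, and top-k selection; all function names and the toy data are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def split_rows(grid, parts):
    """Segment a 2-D map (list of rows) into `parts` horizontal stripes."""
    step = len(grid) // parts
    return [grid[i * step:(i + 1) * step] for i in range(parts)]

def spatial_scores(attention_map, parts):
    """Assumed normalization: softmax over the mean attention of each stripe.
    The scores are returned in stripe order, i.e. concatenated by spatial position."""
    stripes = split_rows(attention_map, parts)
    means = [sum(sum(row) for row in s) / sum(len(row) for row in s) for s in stripes]
    return softmax(means)

def attention_matrix(attention_maps, parts):
    """Aggregate per-frame scores in temporal order: one row per frame."""
    return [spatial_scores(a, parts) for a in attention_maps]

def select_top(sub_features, scores, keep):
    """Assumed filtering rule: keep the `keep` highest-scoring sub-features."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:keep]
    return [sub_features[i] for i in sorted(ranked)]

def weighted_feature(sub_features, scores):
    """Attention-weighted sum of the sub-feature vectors of one frame."""
    dim = len(sub_features[0])
    return [sum(s * f[d] for s, f in zip(scores, sub_features)) for d in range(dim)]

# Toy data: two frames, each with a 4x2 attention map split into 2 stripes.
frame0 = [[1.0, 1.0], [1.0, 1.0], [3.0, 3.0], [3.0, 3.0]]
frame1 = [[2.0, 2.0], [2.0, 2.0], [2.0, 2.0], [2.0, 2.0]]
matrix = attention_matrix([frame0, frame1], parts=2)

sub_features = [[1.0, 2.0], [3.0, 4.0]]      # one vector per stripe
kept = select_top(sub_features, matrix[0], keep=1)
weighted = weighted_feature(sub_features, matrix[1])
```

In this toy case the lower stripe of the first frame receives the larger score and is the one kept, while the uniform attention of the second frame yields equal weights for both stripes.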

2. The pedestrian re-identification method according to claim 1, characterized in that the feature extraction model includes a first neural network and a second neural network, the first feature map includes a first global feature map and a first local feature map, and the second feature map includes a second global feature map and a second local feature map;
the step of extracting features from the first video image based on the preset feature extraction model to obtain the first feature map, and extracting features from the second video image using the same feature extraction model to obtain the second feature map, includes:
performing person pose estimation on the first video image through the first neural network to obtain a first heatmap of the first video image, and performing person pose estimation on the second video image through the first neural network to obtain a second heatmap of the second video image;
performing feature extraction on the first video image through the second neural network to obtain a first intermediate feature map, and performing feature extraction on the second video image through the second neural network to obtain a second intermediate feature map;
fusing the first heatmap and the first intermediate feature map to obtain the first local feature map, and fusing the second heatmap and the second intermediate feature map to obtain the second local feature map; and
pooling the first intermediate feature map to obtain the first global feature map, and pooling the second intermediate feature map to obtain the second global feature map.
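
As a rough sketch of the fusion and pooling steps in claim 2, the fragment below treats a feature map as a list of channels and a heatmap as one weight grid per keypoint. The claim does not say how the heatmap and the intermediate feature map are fused or which pooling is used; heatmap-weighted summation and global average pooling are assumptions made here, and the names and toy values are hypothetical.

```python
def global_average_pool(feature_map):
    """Assumed pooling: mean of each channel, giving one value per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]

def fuse_heatmaps(heatmaps, feature_map):
    """Assumed fusion: for each keypoint heatmap, take the heatmap-weighted
    sum of every channel, giving one local feature vector per keypoint."""
    local = []
    for hm in heatmaps:
        vec = []
        for ch in feature_map:
            vec.append(sum(hm[y][x] * ch[y][x]
                           for y in range(len(ch))
                           for x in range(len(ch[0]))))
        local.append(vec)
    return local

# Toy intermediate feature map: 2 channels of size 2x2.
intermediate = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [4.0, 4.0]],
]
# Toy heatmaps for two hypothetical keypoints (upper-left and lower row).
heatmaps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.5, 0.5]],
]

local_features = fuse_heatmaps(heatmaps, intermediate)
global_feature = global_average_pool(intermediate)
```

Under the claim, the same two functions would be applied once to the higher-resolution first video image and once to the lower-resolution second video image.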

3. The pedestrian re-identification method according to claim 2, characterized in that the first neural network includes a first convolutional layer, a deconvolutional layer, and a second convolutional layer;
the step of performing person pose estimation on the first video image through the first neural network to obtain the first heatmap of the first video image includes:
performing feature extraction on the first video image through the first convolutional layer to obtain a third intermediate feature map;
upsampling the third intermediate feature map through the deconvolutional layer to obtain a fourth intermediate feature map; and
performing heatmap prediction on the fourth intermediate feature map through the second convolutional layer to obtain the first heatmap.
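
To make the convolution, deconvolution, and convolution sequence of claim 3 concrete, the following sketch traces only the spatial size of the data through the three layers, using the standard output-size formulas for convolution and transposed convolution (dilation 1). The kernel sizes, strides, paddings, the 256x128 input, and the 17 keypoints are illustrative assumptions and are not taken from the claims.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one dimension (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

def deconv_out(size, kernel, stride=1, padding=0, output_padding=0):
    """Output length of a transposed convolution along one dimension (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

def heatmap_shape(height, width, keypoints=17):
    """Trace (channels, height, width) through the assumed three-layer network."""
    # First convolutional layer (assumed 3x3, stride 2, padding 1): downsamples by 2.
    h, w = conv_out(height, 3, 2, 1), conv_out(width, 3, 2, 1)
    third_intermediate = (h, w)
    # Deconvolutional layer (assumed 4x4, stride 2, padding 1): upsamples by 2.
    h, w = deconv_out(h, 4, 2, 1), deconv_out(w, 4, 2, 1)
    fourth_intermediate = (h, w)
    # Second convolutional layer (assumed 1x1): one output channel per keypoint.
    h, w = conv_out(h, 1), conv_out(w, 1)
    return third_intermediate, fourth_intermediate, (keypoints, h, w)

third, fourth, heatmap = heatmap_shape(256, 128)
```

With these assumed hyperparameters the heatmap has the same spatial size as the input image, so each keypoint channel can be overlaid on the frame directly.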

4. The pedestrian re-identification method according to claim 1, characterized in that the step of obtaining the target feature map according to the target sub-feature maps and the target weighted feature maps includes:
concatenating the multiple target sub-feature maps to obtain a first fused feature map;
concatenating the multiple target weighted feature maps to obtain a second fused feature map; and
fusing the first fused feature map and the second fused feature map to obtain the target feature map.
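
A minimal sketch of the two concatenations and the final fusion in claim 4 is given below. The claim leaves the fusion operation open; this sketch assumes each concatenated vector is L2-normalized and the two results are then joined end to end, so that neither part dominates by scale. The function names and values are hypothetical.

```python
import math

def concat(vectors):
    """Concatenate a list of feature vectors into one flat vector."""
    return [x for v in vectors for x in v]

def l2_normalize(vector):
    """Scale a vector to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

def fuse(target_sub_features, target_weighted_features):
    """Assumed fusion: normalize each concatenated part, then join them."""
    first_fused = concat(target_sub_features)        # first fused feature map
    second_fused = concat(target_weighted_features)  # second fused feature map
    return l2_normalize(first_fused) + l2_normalize(second_fused)

target_feature = fuse(
    target_sub_features=[[3.0, 0.0], [0.0, 4.0]],
    target_weighted_features=[[1.0, 0.0]],
)
```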

5. A pedestrian re-identification device, characterized in that the device comprises:
an acquisition module, used to acquire multiple original video frames to be identified;
a feature extraction module, used to perform feature extraction on the original video frames to obtain an initial feature map of the original video frames, wherein the initial feature map includes a first feature map and a second feature map;
a first normalization module, used to input the initial feature map into a preset spatiotemporal attention model for normalization processing to obtain an initial attention map of the initial feature map;
a segmentation module, used to segment the initial feature map to obtain initial sub-feature maps, and to segment the initial attention map to obtain sub-attention maps, wherein the initial sub-feature maps and the sub-attention maps correspond one-to-one;
a second normalization module, used to normalize each sub-attention map to obtain an initial spatial attention score of the corresponding initial sub-feature map;
a first construction module, used to construct a target feature attention matrix based on the initial spatial attention scores;
a second construction module, used to construct a feature map based on the target feature attention matrix and the multiple initial sub-feature maps to obtain a target feature map;
a calculation module, used to calculate the similarity between the target feature map and a preset reference person image to obtain a target similarity; and
a filtering module, used to filter the multiple original video frames based on the target similarity to obtain target video frames, wherein the target video frames are video frames containing the target person;
wherein the device is further used for:
scaling the original video frames to obtain a first video image and a second video image, wherein the resolution of the first video image is higher than that of the second video image; extracting features from the first video image based on a preset feature extraction model to obtain the first feature map, and extracting features from the second video image using the same feature extraction model to obtain the second feature map;
acquiring temporal data of the original video frames and spatial location data of each initial sub-feature map within the initial feature map; concatenating the multiple initial spatial attention scores based on the spatial location data to obtain a target spatial attention score for each initial feature map; and aggregating the multiple target spatial attention scores based on the temporal data to obtain the target feature attention matrix; and
filtering each initial sub-feature map according to the target feature attention matrix to obtain multiple target sub-feature maps; weighting the multiple initial sub-feature maps according to the target feature attention matrix to obtain a target weighted feature map for each initial feature map; and obtaining the target feature map according to the target sub-feature maps and the target weighted feature maps.
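
The calculation module and the filtering module could, for example, behave like the sketch below. The claims do not name a similarity measure or a filtering rule; cosine similarity with a fixed threshold is assumed here, as is the simplification that each frame and the reference person image are already represented by a feature vector. The threshold values and names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def filter_frames(frame_features, reference_feature, threshold=0.8):
    """Return the indices of frames whose similarity to the reference
    reaches the (assumed) threshold; these are the target video frames."""
    return [
        index
        for index, feature in enumerate(frame_features)
        if cosine_similarity(feature, reference_feature) >= threshold
    ]

reference = [1.0, 0.0]                                 # reference person feature
frames = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]          # one feature per frame

strict_matches = filter_frames(frames, reference, threshold=0.8)
loose_matches = filter_frames(frames, reference, threshold=0.7)
```

Lowering the threshold admits the third frame, whose similarity to the reference is about 0.707, which illustrates the trade-off between missed and false matches that such a threshold controls.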

6. An electronic device, characterized in that the electronic device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to implement the pedestrian re-identification method according to any one of claims 1 to 4.

7. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the pedestrian re-identification method according to any one of claims 1 to 4.