A pedestrian re-identification method based on sparse attention and learnable pooling
By combining sparse attention and learnable pooling layers, the problems of high computational cost and insufficient feature extraction in pedestrian re-identification technology are solved, achieving efficient and accurate pedestrian identification.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents(China)
- Current Assignee / Owner: NANJING SHICHAZHE INFORMATION TECH CO LTD
- Filing Date: 2022-12-01
- Publication Date: 2026-06-12
Figure CN116092116B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of image recognition research, and in particular to pedestrian re-identification methods, specifically a pedestrian re-identification method based on sparse attention and learnable pooling.
Background Technology
[0002] With the accelerating pace of smart city development, a number of artificial intelligence technologies have emerged and are being applied across urban sectors. To meet the public's growing security needs and maintain social stability, the field of intelligent security has consistently been at the forefront of innovation in smart city construction. Pedestrian re-identification, as one of the core technologies of intelligent security, not only compensates for the shortcomings of manual retrieval but also overcomes the visual limitations imposed by cameras. It aims to automatically retrieve and track target pedestrians in surveillance videos across scenes and regions, and has broad application prospects in intelligent surveillance, criminal investigation, and other scenarios. However, complex surveillance environments pose many challenges, such as changes in pedestrian posture and clothing, changes in ambient light, differences in camera angles, and occlusion of the pedestrian's body. Innovative research on pedestrian re-identification technology is therefore of great significance. Current problems include a computational load too high for easy deployment, an inability to balance global and local features effectively, and loss of information during feature dimensionality reduction.
Summary of the Invention
[0003] To overcome the shortcomings of existing technologies, this invention provides a pedestrian re-identification method based on sparse attention and learnable pooling. The method reduces computational cost while letting local and global features interact fully, so that more comprehensive feature information can be extracted. The technical solution is as follows:
[0004] This invention provides a pedestrian re-identification method based on sparse attention and learnable pooling, which includes the following main steps:
[0005] Step 1: Scale the original pedestrian image to a uniform size to complete downsampling and obtain a reduced-size feature map;
[0006] Step 2: Input the reduced-size feature map into the sparse attention module, which includes an inverted residual module, a local attention module, and a global attention module.
[0007] Step 3: Take the output feature map $f_a$ of the inverted residual module as the input to the local attention module, where $f_a \in \mathbb{R}^{H \times W \times C}$ and H, W, and C represent the height, width, and number of channels of the feature map, respectively;
[0008] The feature map $f_a$ is divided into multiple non-overlapping windows, each of size A×A, and the spatial dimensions are then flattened. The resulting feature map is denoted $f'_a$, with dimensions $\left(\frac{H}{A} \times \frac{W}{A},\ A \times A,\ C\right)$. $f'_a$ is fed into the self-attention module, where self-attention is calculated within each A×A feature window. The results obtained from the self-attention module are then spatially reconstructed so that the extracted local features can interact fully with the subsequent global features. Finally, the feature map $f_b \in \mathbb{R}^{H \times W \times C}$ is obtained through a feedforward network;
[0009] Step 4: Feed the feature map $f_b$ into the global attention module, where a fixed-size B×B grid divides $f_b$ into a feature map $f'_b$ with dimensions $\left(B \times B,\ \frac{H}{B} \times \frac{W}{B},\ C\right)$; the size of each feature window after division is $\frac{H}{B} \times \frac{W}{B}$. Self-attention is applied along the B×B dimension, the results obtained by the self-attention module are reconstructed at the grid level, and the feature map $f_c$ is then obtained through a feedforward network;
[0010] Step 5: Feed $f_c$ into the learnable pooling layer to obtain the output $f_d$. Let $f_c^m$, m∈{1,2,...,C}, be the m-th feature map of $f_c$, and let $X_m$ be the set of elements of $f_c^m$. The m-th feature $f_d^m$ output by the learnable pooling layer is expressed as follows: [formula not reproduced in the source text], where $\alpha_m$ is the hyperparameter corresponding to the m-th feature map and x is an element of the set $X_m$. The output $f_d$ of the learnable pooling layer passes through a fully connected layer and is then fed into the loss function.
[0011] Preferably, step 1 specifically involves scaling the original pedestrian image to 128×256, inputting it into a 3×3 convolutional layer with a stride of 2, completing downsampling, and obtaining a feature map with reduced size.
[0012] Preferably, the inverted residual module in step 2 includes two 1×1 convolutional layers, a 3×3 depthwise convolutional layer, and an SE structure.
[0013] Preferably, the loss function in step 5 is the triplet loss. Let (i, j, k) denote a triplet sample with anchor i. The triplet loss pulls the anchor closer to the positive sample j and pushes it away from the negative sample k, as shown below: [formula not reproduced in the source text], where $d_{i,j}$ is the Euclidean distance from anchor i to positive sample j and $d_{i,k}$ is the Euclidean distance from anchor i to negative sample k.
[0014] Furthermore, the method also includes step 6: in the model inference stage, the pedestrian image to be identified and the image of a person in the gallery (the reference database) are each input into the model, the output features of the fully connected layer are extracted, the cosine similarity of the two output features is calculated, and a similarity threshold is set to determine whether the pedestrian to be identified is a person in the gallery.
[0015] Compared with existing technologies, one of the above technical solutions has the following advantages: the sparse attention mechanism in this method includes local sparse attention and global sparse attention. Using sparse attention reduces the amount of computation, which is beneficial for edge deployment, while still letting local and global features interact fully so that more comprehensive feature information is extracted. With a learnable pooling layer, the features are reduced in dimensionality through learned weights, so that more discriminative deep features can be extracted.
Attached Figure Description
[0016] Figure 1 is a flowchart of a pedestrian re-identification method based on sparse attention and learnable pooling provided in embodiments of this disclosure.
Detailed Implementation
[0017] To clarify the technical solutions and working principles of the present invention, the embodiments of this disclosure are described in further detail below with reference to the accompanying drawings. All the above-described optional technical solutions can be combined in any way to form optional embodiments of this disclosure, and will not be elaborated upon here. The terms "step 1," "step 2," "step 3," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that terms used in this way can be interchanged where appropriate, so that the embodiments of this application described herein can be implemented in sequences other than those described herein.
[0018] This disclosure provides a pedestrian re-identification method based on sparse attention and learnable pooling. Figure 1 is a flowchart of the method. As shown in the flowchart, the method includes the following main steps:
[0019] Step 1: Scale the original pedestrian image to a uniform size, perform downsampling, and obtain a reduced-size feature map.
[0020] Preferably, in step 1, the original pedestrian image is scaled to 128×256 and input into a 3×3 convolutional layer with a stride of 2 to complete downsampling and obtain a feature map with reduced size, thereby reducing the computational load of subsequent operations.
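The size reduction in step 1 can be checked with the standard convolution output-size formula. The sketch below is illustrative only: the source gives the 3×3 kernel and the stride of 2 but not the padding, so a padding of 1 is assumed here, and the function name is hypothetical.

```python
def conv_output_size(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Spatial size after a convolution: floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


# The 128x256 input from step 1; padding=1 is an assumption, not stated in the source.
in_a, in_b = 128, 256
out_a, out_b = conv_output_size(in_a), conv_output_size(in_b)

# Ratio of spatial positions before and after downsampling.
reduction = (in_a * in_b) / (out_a * out_b)
print(out_a, out_b, reduction)  # 64 128 4.0
```

Under this assumption each side is halved, so every later attention stage operates on a quarter of the original spatial positions.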
[0021] Step 2: Input the reduced-size feature map into the sparse attention module, which includes an inverted residual module, a local attention module, and a global attention module.
[0022] Preferably, the inverted residual module described in step 2 includes two 1×1 convolutional layers, one 3×3 depthwise convolutional layer, and an SE structure. Compared with a general residual module, this module can not only reduce the number of parameters, but also increase the model depth and improve the model's learning ability.
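The parameter saving claimed for the inverted residual module can be illustrated by a rough weight count. The layer order (1×1 expansion, 3×3 depthwise, SE, 1×1 projection), the expansion ratio of 4, the SE reduction ratio of 4, and the choice of a two-layer 3×3 residual block as the baseline are all assumptions made for this sketch. None of them is specified in the source, and the size of the saving depends on them.

```python
def inverted_residual_params(c_in: int, c_out: int, expand: int = 4, se_ratio: int = 4) -> int:
    """Weight count (biases and normalization ignored) for an assumed layout:
    1x1 expand -> 3x3 depthwise -> SE -> 1x1 project."""
    hidden = c_in * expand
    expand_1x1 = c_in * hidden                   # first 1x1 convolution
    depthwise_3x3 = 3 * 3 * hidden               # one 3x3 filter per channel
    squeezed = hidden // se_ratio
    se = hidden * squeezed + squeezed * hidden   # two fully connected layers of the SE structure
    project_1x1 = hidden * c_out                 # second 1x1 convolution
    return expand_1x1 + depthwise_3x3 + se + project_1x1


def basic_residual_params(channels: int) -> int:
    """Weight count of a baseline residual block with two full 3x3 convolutions."""
    return 2 * 3 * 3 * channels * channels


print(inverted_residual_params(64, 64))  # 67840
print(basic_residual_params(64))         # 73728
```

Most of the saving comes from the depthwise convolution, which uses one 3×3 filter per channel instead of a full channel-mixing 3×3 kernel.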
[0023] Step 3: Take the output feature map $f_a$ of the inverted residual module as the input to the local attention module, where $f_a \in \mathbb{R}^{H \times W \times C}$ and H, W, and C represent the height, width, and number of channels of the feature map, respectively;
[0024] The feature map $f_a$ is divided into multiple non-overlapping windows, each of size A×A, and the spatial dimensions are then flattened. The resulting feature map is denoted $f'_a$, with dimensions $\left(\frac{H}{A} \times \frac{W}{A},\ A \times A,\ C\right)$. $f'_a$ is fed into the self-attention module, where self-attention is calculated within each A×A feature window. This operation fully extracts local feature information and significantly reduces computational cost compared with computing self-attention over the entire image. The results obtained from the self-attention module are spatially reconstructed so that the extracted local features can interact fully with the subsequent global features. Finally, a feedforward network is used to obtain the feature map $f_b \in \mathbb{R}^{H \times W \times C}$.
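The window partition of step 3 and its effect on attention cost can be sketched with plain index arithmetic. This illustration works on (row, column) positions rather than real feature tensors. The function names and the example sizes (an 8×8 map with A = 4, then a 64×128 map with A = 8) are hypothetical choices, not values from the source.

```python
def window_partition(height: int, width: int, a: int) -> list:
    """Group the (row, col) positions of a height x width map into non-overlapping a x a windows.

    Returns (height/a * width/a) windows, each a flattened list of a*a positions.
    """
    if height % a or width % a:
        raise ValueError("height and width must be divisible by the window size")
    windows = []
    for top in range(0, height, a):
        for left in range(0, width, a):
            windows.append([(top + r, left + c) for r in range(a) for c in range(a)])
    return windows


def attention_pairs_full(height: int, width: int) -> int:
    """Query-key pairs when every position attends to every position."""
    return (height * width) ** 2


def attention_pairs_windowed(height: int, width: int, a: int) -> int:
    """Query-key pairs when each position attends only within its a x a window."""
    return height * width * a * a


windows = window_partition(8, 8, 4)
print(len(windows), len(windows[0]))  # 4 16
print(windows[1][:4])                 # [(0, 4), (0, 5), (0, 6), (0, 7)]
print(attention_pairs_full(64, 128) // attention_pairs_windowed(64, 128, 8))  # 128
```

The number of query-key pairs falls by a factor of H·W / A², which is the sparsity the local attention module relies on.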
[0025] Step 4: Feed the feature map $f_b$ into the global attention module, where a fixed-size B×B grid divides $f_b$ into a feature map $f'_b$ with dimensions $\left(B \times B,\ \frac{H}{B} \times \frac{W}{B},\ C\right)$; the size of each feature window after division is $\frac{H}{B} \times \frac{W}{B}$. Applying self-attention along the B×B dimension both preserves sparsity (reducing computational cost) and incorporates global spatial information. Similarly, the results obtained from the self-attention module are reconstructed at the grid level, and the feature map $f_c$ is obtained through a feedforward network.
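The grid partition of step 4 can be sketched in the same index-based way. The sketch follows one reading of the description, in which each attention group collects B×B positions spaced H/B rows and W/B columns apart, so every group spans the whole map. The function name and the 8×8, B = 2 example are hypothetical.

```python
def grid_partition(height: int, width: int, b: int) -> list:
    """Group (row, col) positions so that each group holds b*b positions spread over the map.

    Returns (height/b * width/b) groups; self-attention would run over the b*b
    positions inside each group.
    """
    if height % b or width % b:
        raise ValueError("height and width must be divisible by the grid size")
    step_h, step_w = height // b, width // b
    groups = []
    for off_r in range(step_h):
        for off_c in range(step_w):
            groups.append(
                [(off_r + i * step_h, off_c + j * step_w) for i in range(b) for j in range(b)]
            )
    return groups


groups = grid_partition(8, 8, 2)
print(len(groups), len(groups[0]))  # 16 4
print(groups[0])                    # [(0, 0), (0, 4), (4, 0), (4, 4)]
```

Each group attends over only B×B positions, which keeps the cost low, while those positions are drawn from across the map, which is how global spatial information enters.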
[0026] Step 5: Feed $f_c$ into the learnable pooling layer to obtain the output $f_d$. Unlike the commonly used max pooling and average pooling, a learnable pooling layer acquires more discriminative and domain-specific features through learned weights. Let $f_c^m$, m∈{1,2,...,C}, be the m-th feature map of $f_c$, and let $X_m$ be the set of elements of $f_c^m$. The m-th feature $f_d^m$ output by the learnable pooling layer is expressed as follows: [formula not reproduced in the source text], where $\alpha_m$ is the hyperparameter corresponding to the m-th feature map and x is an element of the set $X_m$.
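The pooling formula itself is not available in this text. One widely used pooling with a per-channel exponent over the elements of each feature map is generalized-mean pooling. The sketch below uses it purely as an assumed stand-in to show how a parameter like $\alpha_m$ can interpolate between average and max pooling. It should not be read as the formula of the original filing, and all names in it are hypothetical.

```python
def generalized_mean_pool(elements, alpha: float, eps: float = 1e-6) -> float:
    """Assumed stand-in: ((1/|X|) * sum(x**alpha for x in X)) ** (1/alpha).

    Inputs are clamped to eps so that fractional powers stay well defined.
    """
    clamped = [max(x, eps) for x in elements]
    mean_power = sum(x ** alpha for x in clamped) / len(clamped)
    return mean_power ** (1.0 / alpha)


def pool_feature_maps(feature_maps, alphas):
    """Pool each flattened feature map X_m with its own exponent alpha_m."""
    return [generalized_mean_pool(x_m, a_m) for x_m, a_m in zip(feature_maps, alphas)]


channel = [1.0, 2.0, 3.0, 4.0]
print(generalized_mean_pool(channel, 1.0))   # alpha = 1 gives the plain average
print(generalized_mean_pool(channel, 50.0))  # large alpha approaches the maximum
print(pool_feature_maps([channel, channel], [1.0, 3.0]))
```

In a trained model the exponents would be updated by gradient descent together with the other weights; here they are fixed numbers for illustration.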
[0027] The output $f_d$ of the learnable pooling layer passes through a fully connected layer and is then fed into the loss function.
[0028] Preferably, the loss function used is the triplet loss. Let (i, j, k) denote a triplet sample with anchor i. The triplet loss pulls the anchor closer to the positive sample j and pushes it away from the negative sample k, as shown in the following formula: [formula not reproduced in the source text], where $d_{i,j}$ is the Euclidean distance from anchor i to positive sample j and $d_{i,k}$ is the Euclidean distance from anchor i to negative sample k.
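As a concrete illustration of the behavior described above, the sketch below implements the common hinge form of the triplet loss on Euclidean distances. The hinge form and the margin value of 0.3 are assumptions, since the formula of the filing is not available in this text. The function names are hypothetical.

```python
import math


def euclidean(u, v) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin: float = 0.3) -> float:
    """Assumed hinge form: max(d(anchor, positive) - d(anchor, negative) + margin, 0)."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


anchor, positive, negative = [0.0, 0.0], [3.0, 4.0], [6.0, 8.0]

# Positive is already closer than negative by more than the margin: no penalty.
print(triplet_loss(anchor, positive, negative))  # 0.0

# Swapping the roles puts the "positive" farther away, so the loss is positive.
print(triplet_loss(anchor, negative, positive))
```

The loss is zero once the negative sample is farther from the anchor than the positive sample by at least the margin, so training effort goes to triplets that still violate that ordering.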
[0029] Preferably, the method further includes step 6: in the model inference stage, the pedestrian image to be identified and the image of a person in the gallery (the reference database) are each input into the model, the output features of the fully connected layer are extracted, the cosine similarity of the two output features is calculated, and a similarity threshold is set to determine whether the pedestrian to be identified is a person in the gallery.
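The matching rule of step 6 can be sketched as follows. The feature vectors stand in for fully connected layer outputs. The threshold of 0.5 and the "greater than or equal to" decision rule are assumptions, as the source only says that a similarity threshold is set. The function names are hypothetical.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def is_same_person(query_feature, gallery_feature, threshold: float = 0.5) -> bool:
    """Declare a match when the similarity reaches the (assumed) threshold."""
    return cosine_similarity(query_feature, gallery_feature) >= threshold


print(is_same_person([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # True: same direction
print(is_same_person([1.0, 0.0], [0.0, 1.0]))            # False: orthogonal features
```

In practice the threshold would be tuned on validation data to balance false matches against missed matches.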
[0030] The present invention has been described above by way of example with reference to the accompanying drawings. Obviously, the specific implementation of the present invention is not limited to the above-described manner. Any non-substantial improvements made using the inventive concept and technical solution of the present invention, or the direct application of the inventive concept and technical solution of the present invention to other occasions without improvement or equivalent substitution, are all within the protection scope of the present invention.
Claims
1. A pedestrian re-identification method based on sparse attention and learnable pooling, characterized in that the method includes the following main steps:
Step 1: Scale the original pedestrian image to a uniform size to complete downsampling and obtain a reduced-size feature map;
Step 2: Input the reduced-size feature map into the sparse attention module, which includes an inverted residual module, a local attention module, and a global attention module;
Step 3: Take the output feature map $f_a$ of the inverted residual module as the input to the local attention module, where $f_a \in \mathbb{R}^{H \times W \times C}$ and H, W, and C represent the height, width, and number of channels of the feature map, respectively; the feature map $f_a$ is divided into multiple non-overlapping windows, each of size A×A, and the spatial dimensions are then flattened; the resulting feature map is denoted $f'_a$, with dimensions $\left(\frac{H}{A} \times \frac{W}{A},\ A \times A,\ C\right)$; $f'_a$ is fed into the self-attention module, where self-attention is calculated within each A×A feature window; the results obtained from the self-attention module are then spatially reconstructed so that the extracted local features can interact fully with the subsequent global features; finally, the feature map $f_b \in \mathbb{R}^{H \times W \times C}$ is obtained through a feedforward network;
Step 4: Feed the feature map $f_b$ into the global attention module, where a fixed-size B×B grid divides $f_b$ into a feature map $f'_b$ with dimensions $\left(B \times B,\ \frac{H}{B} \times \frac{W}{B},\ C\right)$; the size of each feature window after division is $\frac{H}{B} \times \frac{W}{B}$; self-attention is applied along the B×B dimension, the results obtained by the self-attention module are reconstructed at the grid level, and the feature map $f_c$ is then obtained through a feedforward network;
Step 5: Feed $f_c$ into the learnable pooling layer to obtain the output $f_d$; let $f_c^m$, m∈{1,2,...,C}, be the m-th feature map of $f_c$, and let $X_m$ be the set of elements of $f_c^m$; the m-th feature $f_d^m$ output by the learnable pooling layer is expressed as follows: [formula not reproduced in the source text], where $\alpha_m$ is the hyperparameter corresponding to the m-th feature map and x is an element of the set $X_m$; the output $f_d$ of the learnable pooling layer passes through a fully connected layer and is then fed into the loss function.
2. The pedestrian re-identification method based on sparse attention and learnable pooling according to claim 1, characterized in that, Step 1 specifically involves scaling the original pedestrian image to 128×256, inputting it into a 3×3 convolutional layer with a stride of 2, completing downsampling, and obtaining a feature map with reduced size.
3. The pedestrian re-identification method based on sparse attention and learnable pooling according to claim 2, characterized in that, The inverted residual module described in step 2 consists of two 1×1 convolutional layers, a 3×3 depthwise convolutional layer, and an SE structure.
4. A pedestrian re-identification method based on sparse attention and learnable pooling according to any one of claims 1-3, characterized in that step 5 uses the triplet loss function, where (i, j, k) denotes a triplet sample with anchor i, and the triplet loss pulls the anchor closer to the positive sample j and pushes it away from the negative sample k, as shown below: [formula not reproduced in the source text], where $d_{i,j}$ is the Euclidean distance from anchor i to positive sample j and $d_{i,k}$ is the Euclidean distance from anchor i to negative sample k.
5. The pedestrian re-identification method based on sparse attention and learnable pooling according to claim 4, characterized in that the method also includes step 6: in the model inference stage, the pedestrian image to be identified and the image of a person in the gallery (the reference database) are each input into the model, the output features of the fully connected layer are extracted, the cosine similarity of the two output features is calculated, and a similarity threshold is set to determine whether the pedestrian to be identified is a person in the gallery.