Pedestrian re-identification method and device based on attribute description partial shielding

By extracting color information from pedestrian images, masking clothing features, and applying a self-attention feature extraction network, the method addresses the failure of traditional clothes-changing pedestrian re-identification methods to fully use color information and achieves more efficient recognition.

CN118097710B · Active · Publication Date: 2026-06-23 · XIDIAN UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: XIDIAN UNIV
Filing Date: 2024-01-03
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Traditional methods for re-identifying pedestrians who change clothes fail to fully use color information, have high computational complexity, and have low recognition efficiency.

Method used

By using an attribute description-based method, color information is extracted from pedestrian images and clothing features are masked. A self-attention feature extraction network is then used to perform multi-head self-attention operations to obtain pedestrian recognition results.

Benefits of technology

By making full use of color information while eliminating interference from clothing, computational complexity is reduced, and recognition efficiency and accuracy are improved.

✦ Generated by Eureka AI based on patent content.

[Figure CN118097710B_ABST]

Abstract

The application discloses a clothes-changing pedestrian re-identification method and device based on partial masking of attribute descriptions. The method includes: using a description extraction model to extract the pedestrian's attribute features, including color information, and to mask the pedestrian's clothing features, obtaining one-dimensional masked attribute description features; and inputting the pedestrian image and the one-dimensional masked attribute description features into a trained self-attention feature extraction network, which performs multi-head self-attention operations on them to obtain the pedestrian's identification result. With this method, color information can be fully used while the interference of pedestrian clothing with identification is excluded. Because the application does not rely on complex processing to extract clothing-irrelevant information directly from the pedestrian image, the model structure is simple, which can reduce computational complexity and improve application efficiency.

Description

Technical Field

[0001] This invention belongs to the field of image processing technology, specifically relating to a method and apparatus for re-identifying pedestrians changing clothes based on partial masking of attribute descriptions.

Background Technology

[0002] In recent years, with the development of neural network technology and the increasing demand for surveillance security, person re-identification has become one of the research hotspots in computer vision. The goal of person re-identification is to retrieve a target person across different cameras. Most research on person re-identification assumes that pedestrians do not change their clothing, and many papers have proposed efficient methods under this assumption. In the real world, however, pedestrians may change their clothes at any time, which can make the performance of person re-identification methods unstable. Addressing clothing changes is therefore crucial in person re-identification; this task is called clothes-changing person re-identification. The original image of a pedestrian contains rich information, but when pedestrians change clothes, recognition methods based on the original image often perform poorly, because the information in the image is dominated by the color of the clothing, which occupies a large proportion of the image. When pedestrians change clothes, color information is not reliable enough. Unlike other variations such as changes in viewpoint and lighting, clothing changes are virtually impossible to model in cross-view matching. Some methods therefore study multimodal biometrics of pedestrians, such as contours, movement postures, joint information, skeletal information, and radio frequency signals. This information retains more invariance than the original image.

[0003] For example, recognition methods based on FSAM networks use identity to guide mask learning from coarse to fine granularity, and use the generated human-body masks to extract shape features. Recognition methods based on SPTSKM networks learn identity features from image sequences and explore pedestrian motion pattern information from 3D skeletons normalized by skeleton generative adversarial networks. Recognition methods based on 3DSL networks aim to learn 3D shape features to distinguish different identities and shape-related parameters for reconstructing subnetworks. Recognition methods based on CESD networks use a pose detector to detect pedestrian body joints and use shape embedding to separate clothing and distinguish shape information through joint features. Recognition methods based on GI-REID networks propose a gait prediction module that predicts pedestrian gait sequences in order to extract gait features; a shape flow supervised by the gait features then captures clothing-independent features from the pedestrian's original image.

[0004] However, these methods all discard the color information of the original image in the contour processing module and therefore fail to fully use the clothing-independent features in the original image, even though some color information is helpful for pedestrian re-identification. Furthermore, these methods identify clothing-independent features through adversarial models, which are prone to training instability and make it difficult to explain how appearance and clothing information are separated from the original image.

[0005] Therefore, traditional methods do not make full use of color information and have high computational complexity and low recognition efficiency.

Summary of the Invention

[0006] This invention provides a method and device for re-identifying pedestrians changing clothes based on partial masking of attribute descriptions, which can solve the problems of traditional methods not making full use of color information and having high computational complexity and low recognition efficiency.

[0007] In a first aspect, embodiments of the present invention provide a method for re-identifying pedestrians changing clothes based on partial masking of attribute descriptions, the method comprising:

[0008] The pedestrian image is input into the trained description extraction model to extract the pedestrian's attribute features and mask the pedestrian's clothing features to obtain the pedestrian's occlusion attribute description information, which includes color information unrelated to clothing.

[0009] Linear projection is performed on the occlusion attribute description information to obtain a one-dimensional occlusion attribute description feature of the pedestrian.

[0010] The pedestrian image and one-dimensional occlusion attribute description features are input into a trained self-attention feature extraction network to perform multi-head self-attention operation on the pedestrian image and one-dimensional occlusion attribute description features to obtain the pedestrian recognition result.

[0011] In one possible implementation of the first aspect, pedestrian attribute features can be extracted from pedestrian images; the attribute features can be binarized to obtain binary vectors corresponding to the attribute features; the elements corresponding to clothing features in the binary vectors can be set to 0 to mask the clothing features and obtain occlusion attribute description information.

[0012] For example, a binary vector includes multiple elements, each element corresponding to the binarized value of an attribute feature.

[0013] In one possible implementation of the first aspect, the attribute features can be described by text.

[0014] In one possible implementation of the first aspect, after linearly projecting the occlusion attribute description information to obtain a one-dimensional occlusion attribute description feature of the pedestrian, attribute encoding can be embedded in the one-dimensional occlusion attribute description feature to obtain the encoded occlusion attribute description feature.

[0015] In one possible implementation of the first aspect, the pedestrian image can be segmented and flattened into a one-dimensional image vector; positional encoding can be embedded in the one-dimensional image vector to obtain an encoded one-dimensional image vector; a multi-head self-attention operation can be performed on the encoded one-dimensional image vector to obtain low-level image features of the pedestrian image; the low-level image features and the encoded occlusion attribute description features can be sequentially subjected to multiple concatenation processes and multi-head self-attention operations to obtain the first and second recognition features of the pedestrian; the low-level image features, the first recognition features, and the second recognition features can be aggregated to obtain the recognition features of the pedestrian; and the recognition result of the pedestrian can be determined based on the recognition features.

[0016] In one possible implementation of the first aspect, a multi-head self-attention operation can be performed on the encoded one-dimensional image vector to obtain the relevance features of the pedestrian image; based on the relevance features, global information of the pedestrian image can be extracted to obtain the low-level image features of the pedestrian image.

[0017] In one possible implementation of the first aspect, low-level image features and encoded occlusion attribute description features can be concatenated to obtain a first concatenated feature; a multi-head self-attention operation can be performed on the first concatenated feature to obtain a first recognition feature; the first recognition feature and encoded occlusion attribute description features can be concatenated to obtain a second concatenated feature; and a multi-head self-attention operation can be performed on the second concatenated feature to obtain a second recognition feature.

[0018] In a second aspect, embodiments of the present invention provide a device for re-identifying pedestrians changing clothes based on partial masking of attribute descriptions. The device includes a processing unit; the processing unit may include a description extraction model and a self-attention feature extraction network.

[0019] The processing unit is used for:

[0020] The pedestrian image is input into the description extraction model to extract the pedestrian's attribute features and mask the pedestrian's clothing features to obtain the pedestrian's occlusion attribute description information, which includes color information unrelated to clothing.

[0021] Linear projection is performed on the occlusion attribute description information to obtain a one-dimensional occlusion attribute description feature of the pedestrian.

[0022] The pedestrian image and one-dimensional occlusion attribute description features are input into a self-attention feature extraction network to perform multi-head self-attention operation on the pedestrian image and one-dimensional occlusion attribute description features to obtain the pedestrian recognition result.

[0023] In one possible implementation of the second aspect, the description extraction model can be used to extract the attribute features of pedestrians from pedestrian images; the attribute features are binarized to obtain the binary vector corresponding to the attribute features; the elements corresponding to clothing features in the binary vector are set to 0 to mask the clothing features and obtain the occlusion attribute description information.

[0024] For example, a binary vector includes multiple elements, each element corresponding to the binarized value of an attribute feature.

[0025] In one possible implementation of the second aspect, the attribute features are described through text.

[0026] In one possible implementation of the second aspect, after linearly projecting the occlusion attribute description information to obtain a one-dimensional occlusion attribute description feature of the pedestrian, the processing unit can also be used to embed attribute encoding in the one-dimensional occlusion attribute description feature to obtain the encoded occlusion attribute description feature.

[0027] In one possible implementation of the second aspect, the self-attention feature extraction network can be used to: segment and flatten the pedestrian image into a one-dimensional image vector; embed positional encoding into the one-dimensional image vector to obtain an encoded one-dimensional image vector; perform multi-head self-attention operation on the encoded one-dimensional image vector to obtain low-level image features of the pedestrian image; sequentially perform multiple concatenation processes and multi-head self-attention operations on the low-level image features and the encoded occlusion attribute description features to obtain the first and second recognition features of the pedestrian; aggregate the low-level image features, the first recognition features, and the second recognition features to obtain the recognition features of the pedestrian; and determine the recognition result of the pedestrian based on the recognition features.

[0028] In one possible implementation of the second aspect, the self-attention feature extraction network can be used to perform multi-head self-attention operation on the encoded one-dimensional image vector to obtain the relevance features of the pedestrian image; based on the relevance features, global information of the pedestrian image is extracted to obtain the low-level image features of the pedestrian image.

[0029] In one possible implementation of the second aspect, the self-attention feature extraction network can specifically be used to concatenate low-level image features and encoded occlusion attribute description features to obtain a first concatenated feature; perform a multi-head self-attention operation on the first concatenated feature to obtain a first recognition feature; concatenate the first recognition feature and encoded occlusion attribute description features to obtain a second concatenated feature; and perform a multi-head self-attention operation on the second concatenated feature to obtain a second recognition feature.

[0030] In a third aspect, embodiments of the present invention provide an electronic device, including a processor and a memory, wherein the memory is used to store a computer program; the processor can be used to execute the computer program (instructions) stored in the memory to implement the method of the first aspect described above.

[0031] In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium storing a computer program that, when executed, can implement the method described in the first aspect above.

[0032] It is understood that the beneficial effects of the second to fourth aspects mentioned above can be found in the relevant description of the first aspect mentioned above, and will not be repeated here.

[0033] The beneficial effects of the embodiments of the present invention compared with the prior art are as follows: According to the method provided by the present invention, by extracting attribute features including color information from pedestrian images and masking the clothing features of pedestrians, the interference of pedestrian clothing on recognition can be eliminated while making full use of the color information of pedestrian images, thus realizing the recognition of pedestrians who have changed clothes; since the present invention obtains attribute features including color information and then masks the clothing features of pedestrians to avoid the influence of clothing information on recognition, rather than directly extracting information unrelated to clothing from pedestrian images for recognition through complex analysis and processing, the model structure used in the present invention is simple, which can reduce computational complexity and improve application efficiency.

Attached Figure Description

[0034] Figure 1 is a flowchart of a method for re-identifying pedestrians changing clothes based on partial masking of attribute descriptions, provided in an embodiment of the present invention.

[0035] Figure 2 is a schematic diagram of a scenario for re-identifying pedestrians changing clothes, provided in an embodiment of the present invention.

[0036] Figure 3 is a schematic diagram of a device for re-identifying pedestrians changing clothes based on partial masking of attribute descriptions, provided in an embodiment of the present invention.

[0037] Figure 4 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention.

Detailed Implementation

[0038] The present invention will be further described in detail below with reference to specific embodiments, but the implementation of the present invention is not limited thereto.

[0039] The method for re-identifying pedestrians changing clothes based on partial masking of attribute descriptions provided in this embodiment of the invention can be applied to electronic devices such as mobile terminals, personal laptops, and supercomputers. This embodiment of the invention does not impose any restrictions on the specific type of electronic device.

[0040] Figure 1 is a schematic flowchart of a method for re-identifying pedestrians changing clothes based on partial masking of attribute descriptions, provided by an embodiment of the present invention. As an example and not a limitation, method 100 may include steps S101-S103. Each step is described below.

[0041] S101, input the pedestrian image into the trained description extraction model to extract the pedestrian's attribute features and mask the pedestrian's clothing features to obtain the pedestrian's occlusion attribute description information.

[0042] For example, the occlusion attribute description information includes color information unrelated to clothing.

[0043] In one possible implementation, the description extraction model can be used to extract the attribute features of pedestrians from pedestrian images; the attribute features are binarized to obtain the corresponding binary vectors; the elements corresponding to clothing features in the binary vectors are set to 0 to mask the clothing features and obtain the occlusion attribute description information.

[0044] For example, a binary vector may include multiple elements, each of which corresponds to the binarized value of an attribute feature.

[0045] In one example, the description extraction model can store an attribute list containing multiple attribute labels. For instance, the list could comprise 61 binary attribute labels and 4 multi-class attributes, for a total of 105 attribute labels. Every extracted pedestrian attribute feature can be represented by an attribute label in the list. The description extraction model arranges the attribute labels according to their positions in the attribute list to form an attribute label sequence. If the feature corresponding to an attribute label is extracted, the value of that label is set to 1; otherwise, it is set to 0. This binarizes the attribute features and yields a binary vector covering all attribute features, such as [0, 1, 0, ...]. The values of the clothing-related attribute labels are then set to 0 to mask the clothing features, yielding the occlusion attribute description information.

[0046] For example, referring to Figure 2, the description extraction model can extract the attribute features of a pedestrian from the pedestrian image, such as male, black hair, purple shirt, and sneakers, to obtain an attribute label sequence. Binarizing this sequence yields a binary vector. The value corresponding to the clothing feature (purple shirt) in this binary vector is set to 0, thus obtaining the occlusion attribute description information (see 201 in Figure 2).
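The binarize-then-mask step described above can be sketched as follows. This is a minimal illustration only: the six-entry `ATTRIBUTE_LIST`, the `CLOTHING_LABELS` set, and the helper names `binarize` and `mask_clothing` are hypothetical and are not taken from the patent, whose actual attribute list is not reproduced in the text.

```python
# Hypothetical attribute list; the patent's real list is not given in the text.
ATTRIBUTE_LIST = ["male", "female", "black hair", "purple shirt", "sneakers", "backpack"]
# Assumption: which labels count as clothing-related is chosen by hand here,
# following the Figure 2 example in which only the purple shirt is masked.
CLOTHING_LABELS = {"purple shirt"}


def binarize(extracted, attribute_list):
    """Return 1 for every label of attribute_list that was extracted, else 0."""
    found = set(extracted)
    return [1 if label in found else 0 for label in attribute_list]


def mask_clothing(binary_vector, attribute_list, clothing_labels):
    """Set the entries that belong to clothing-related labels to 0."""
    return [
        0 if label in clothing_labels else bit
        for label, bit in zip(attribute_list, binary_vector)
    ]


extracted = ["male", "black hair", "purple shirt", "sneakers"]
binary_vector = binarize(extracted, ATTRIBUTE_LIST)
masked_vector = mask_clothing(binary_vector, ATTRIBUTE_LIST, CLOTHING_LABELS)

print(binary_vector)  # [1, 0, 1, 1, 1, 0]
print(masked_vector)  # [1, 0, 1, 0, 1, 0]
```

Keeping the vector at a fixed length and zeroing entries, rather than deleting them, means every pedestrian is described by a vector with the same layout regardless of what was masked.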

[0047] Optionally, attribute features are described using text. Compared to the traditional approach of describing pedestrian features with complex coordinate information such as skeletal features, describing them with text can reduce the difficulty of feature extraction.

[0048] S102, linear projection is performed on the occlusion attribute description information to obtain a one-dimensional occlusion attribute description feature of the pedestrian.

[0049] For example, the occlusion attribute description information can be linearly projected to obtain a one-dimensional occlusion attribute description feature.

[0050] In one example, after obtaining the one-dimensional occlusion attribute description feature, attribute encoding [CLS] can be embedded in the one-dimensional occlusion attribute description feature to obtain the encoded occlusion attribute description feature.
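As a rough sketch of steps S102 and the optional encoding, the masked binary vector can be mapped to a feature vector with a linear projection and then combined with an attribute encoding. The text does not say how the encoding is embedded, so this example assumes it is a learnable vector added element-wise; the sizes, the random weights, and the variable names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_ATTRIBUTES = 6  # hypothetical length of the masked binary vector
EMBED_DIM = 8       # hypothetical feature dimension

masked_vector = np.array([1, 0, 1, 0, 1, 0], dtype=float)

# Stand-ins for trained parameters (randomly initialised for this sketch).
weight = rng.normal(scale=0.02, size=(NUM_ATTRIBUTES, EMBED_DIM))
bias = np.zeros(EMBED_DIM)
attribute_encoding = rng.normal(scale=0.02, size=EMBED_DIM)

# Linear projection: a masked (zero) entry selects no row of `weight`,
# so masked clothing attributes contribute nothing to the projected feature.
projected = masked_vector @ weight + bias

# Assumption: the attribute encoding is added element-wise.
encoded_attribute_feature = projected + attribute_encoding

print(projected.shape, encoded_attribute_feature.shape)
```

With a zero bias, the projected feature is simply the sum of the weight rows of the attributes that remain set to 1, which makes the effect of masking easy to inspect.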

[0051] S103, the pedestrian image and one-dimensional occlusion attribute description features are input into the trained self-attention feature extraction network to perform multi-head self-attention operation on the pedestrian image and one-dimensional occlusion attribute description features to obtain the pedestrian recognition result.

[0052] In some embodiments, the self-attention feature extraction network can segment the pedestrian image and flatten it into a one-dimensional image vector; embed positional encoding into the one-dimensional image vector to obtain an encoded one-dimensional image vector; perform multi-head self-attention operation on the encoded one-dimensional image vector to obtain low-level image features of the pedestrian image; perform multiple concatenation processes and multi-head self-attention operations on the low-level image features and the encoded occlusion attribute description features in sequence to obtain the first and second recognition features of the pedestrian; aggregate the low-level image features, the first recognition features, and the second recognition features to obtain the recognition features of the pedestrian; and determine the recognition result of the pedestrian based on the recognition features.

[0053] For example, a self-attention feature extraction network could be the EVA02-large network.

[0054] In one possible implementation, the self-attention feature extraction network can specifically be used to divide the pedestrian image $x_i$ into a sequence of $N = H \times W / P^2$ fixed-size, non-overlapping image patches, where $P$ is the size of each patch, $H$ is the height of the pedestrian image, $W$ is the width of the pedestrian image, and $N$ is the number of patches. The image patches are then flattened and mapped, using a trainable linear projection, into one-dimensional image vectors with the same dimension as the one-dimensional occlusion attribute description features. Positional encoding [DES] is embedded in the one-dimensional image vector to obtain the encoded one-dimensional image vector.
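The patch splitting, flattening, projection, and positional encoding described above can be illustrated with the sketch below. The image size, patch size, and feature dimension are hypothetical, the weights are random stand-ins for trained parameters, and the positional encoding is assumed to be a learnable tensor added to the patch tokens, which the text does not state explicitly.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into N = H*W/P^2 flattened, non-overlapping patches."""
    height, width, channels = image.shape
    p = patch_size
    if height % p or width % p:
        raise ValueError("image height and width must be multiples of the patch size")
    grid = image.reshape(height // p, p, width // p, p, channels)
    grid = grid.transpose(0, 2, 1, 3, 4)       # (rows, cols, p, p, C)
    return grid.reshape(-1, p * p * channels)  # (N, p*p*C)


rng = np.random.default_rng(0)
H, W, C, P, D = 32, 16, 3, 8, 8  # hypothetical sizes
image = rng.random((H, W, C))

patches = patchify(image, P)  # N = 32 * 16 / 8**2 = 8 patches

# Stand-ins for the trainable projection and the positional encoding.
w_proj = rng.normal(scale=0.02, size=(P * P * C, D))
pos_encoding = rng.normal(scale=0.02, size=(patches.shape[0], D))

image_tokens = patches @ w_proj               # (N, D), same dimension as the attribute feature
encoded_tokens = image_tokens + pos_encoding  # assumption: additive positional encoding

print(patches.shape, encoded_tokens.shape)  # (8, 192) (8, 8)
```

Projecting every patch to dimension `D` is what later allows the image tokens and the attribute description feature to be concatenated into one sequence.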

[0055] In one example, referring to Figure 2, the self-attention feature extraction network can include 24 Transform Vision (TrV) blocks, which are divided into three TrV modules. The first TrV module contains 10 TrV blocks, the second TrV module contains 10 TrV blocks, and the third TrV module contains 4 TrV blocks. Furthermore, the self-attention feature extraction network may also include an aggregation layer and a recognition output layer.

[0056] For example, as shown in Figure 2, each TrV module may sequentially include a linear projection layer, a multi-head self-attention module, a linear projection layer, a fully connected layer, a linear projection layer, and a fully connected layer.

[0057] In one possible implementation, the self-attention feature extraction network can input the encoded one-dimensional image vector into the first TrV module to perform multi-head self-attention operations on the encoded one-dimensional image vector, obtaining correlation features between image patches. Then, based on the correlation features, global information of the pedestrian image is extracted to obtain low-level image features of the pedestrian image.

[0058] In one possible implementation, the self-attention feature extraction network can sequentially perform multiple concatenation processes and multi-head self-attention operations on low-level image features and encoded occlusion attribute description features to obtain the first and second identification features of the pedestrian.

[0059] In one example, the self-attention feature extraction network can input the low-level image features into the second TrV module. The low-level image features are projected through the linear projection layer in the first TrV block of the second TrV module, resulting in projected low-level image features. The self-attention feature extraction network can then concatenate the projected low-level image features with the encoded occlusion attribute description features to obtain the first concatenated feature. The first concatenated feature is input into the multi-head self-attention module in the first TrV block of the second TrV module, where it undergoes a multi-head self-attention operation. Finally, the second TrV module outputs the first recognition feature.

[0060] Similarly, referring to Figure 2, the self-attention feature extraction network can input the first recognition feature into the third TrV module. The first recognition feature is linearly projected and then concatenated with the encoded occlusion attribute description feature to obtain the second concatenated feature. This second concatenated feature is then input into the next multi-head self-attention layer, and the third TrV module ultimately outputs the second recognition feature. The multi-head self-attention operation is used to learn the relationship between the attribute features of the image and the occluded clothing information.
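The two project, concatenate, and attend stages can be sketched as below. This is a simplified illustration under stated assumptions, not the patented network: each stage is reduced to one linear projection plus one multi-head self-attention call (the fully connected layers and the remaining TrV blocks are omitted), all weights are random, and only the image-token rows are passed on to the next stage, a detail the text leaves open. The names `fuse_stage` and `random_weights` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention over a (num_tokens, dim) array."""
    n, d = tokens.shape
    head_dim = d // num_heads

    def split_heads(x):
        # (n, d) -> (num_heads, n, head_dim)
        return x.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    q = split_heads(tokens @ w_q)
    k = split_heads(tokens @ w_k)
    v = split_heads(tokens @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, n, n)
    context = softmax(scores) @ v                          # (heads, n, head_dim)
    merged = context.transpose(1, 0, 2).reshape(n, d)      # back to (n, d)
    return merged @ w_o


def fuse_stage(image_tokens, attribute_token, w_lin, attn_weights, num_heads):
    """Project image tokens, append the attribute token, then self-attend."""
    projected = image_tokens @ w_lin
    concatenated = np.concatenate([projected, attribute_token], axis=0)
    attended = multi_head_self_attention(concatenated, *attn_weights, num_heads=num_heads)
    # Assumption: only the image-token rows are handed to the next stage.
    return attended[: image_tokens.shape[0]]


rng = np.random.default_rng(0)
N, D, HEADS = 8, 8, 2  # hypothetical sizes


def random_weights(count):
    """Random stand-ins for trained (D, D) weight matrices."""
    return [rng.normal(scale=0.1, size=(D, D)) for _ in range(count)]


low_level = rng.normal(size=(N, D))        # stand-in for the first module's output
attribute_token = rng.normal(size=(1, D))  # encoded occlusion attribute description feature

first_recognition = fuse_stage(
    low_level, attribute_token, random_weights(1)[0], random_weights(4), HEADS
)
second_recognition = fuse_stage(
    first_recognition, attribute_token, random_weights(1)[0], random_weights(4), HEADS
)
print(first_recognition.shape, second_recognition.shape)
```

Because the attribute token sits in the same sequence as the image tokens, every image token can attend to it, which is how the masked description influences the image features at each stage.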

[0061] For example, by performing multiple multi-head self-attention operations on the encoded one-dimensional image vector and the encoded occlusion attribute description features, the self-attention feature extraction network can gradually learn pedestrian image features from low to high levels, that is, progress from low-level image features to recognition features.

[0062] In one possible implementation, referring to Figure 2, the aggregation layer in the self-attention feature extraction network can aggregate the low-level image features, the first recognition feature, and the second recognition feature to obtain the recognition feature of the pedestrian. The recognition feature is then input into the recognition output layer, which classifies it to obtain the recognition result of the pedestrian.
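The text does not specify which operation the aggregation layer uses. Purely as an illustration, the sketch below assumes that each of the three features is mean-pooled over its tokens and that the pooled vectors are concatenated; the function name `aggregate`, the sizes, and the random inputs are hypothetical.

```python
import numpy as np


def aggregate(low_level, first_recognition, second_recognition):
    """Assumed aggregation: mean-pool each stage over tokens, then concatenate."""
    pooled = [
        stage.mean(axis=0)
        for stage in (low_level, first_recognition, second_recognition)
    ]
    return np.concatenate(pooled)


rng = np.random.default_rng(0)
N, D = 8, 8  # hypothetical token count and feature dimension
low_level = rng.normal(size=(N, D))
first_recognition = rng.normal(size=(N, D))
second_recognition = rng.normal(size=(N, D))

recognition_feature = aggregate(low_level, first_recognition, second_recognition)
print(recognition_feature.shape)  # (24,)
```

Other choices, such as summation or a learned weighting of the three stages, would fit the description equally well.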

[0063] For example, the recognition output layer can compare the identity features of pedestrians in the pedestrian image with those in the database to determine the probability that the pedestrian in the pedestrian image matches the identity of each pedestrian in the database, thereby obtaining the pedestrian recognition result.
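One way to turn the comparison with the database into match probabilities is sketched below. It assumes cosine similarity followed by a temperature-scaled softmax, neither of which is stated in the text; the gallery contents, the identifiers, and the `temperature` value are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_probabilities(query, gallery, temperature=0.1):
    """Softmax over cosine similarities between the query and each gallery identity."""
    sims = {pid: cosine(query, feat) for pid, feat in gallery.items()}
    top = max(sims.values())  # subtract the maximum for numerical stability
    exps = {pid: math.exp((s - top) / temperature) for pid, s in sims.items()}
    total = sum(exps.values())
    return {pid: e / total for pid, e in exps.items()}


# Hypothetical database of identity features obtained from earlier images.
gallery = {
    "id_001": [1.0, 0.0, 0.0],
    "id_002": [0.0, 1.0, 0.0],
    "id_003": [0.6, 0.8, 0.0],
}
query = [0.9, 0.1, 0.0]

probs = match_probabilities(query, gallery)
best = max(probs, key=probs.get)
print(best)  # id_001
```

A lower `temperature` sharpens the distribution toward the most similar identity, while a higher one spreads probability more evenly across the gallery.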

[0064] For example, the identity features of the pedestrians in the database can be obtained from historical image data. The clothing of a pedestrian in the historical image data may differ from the clothing of that pedestrian in the image currently being identified.

[0065] According to the method provided by this invention, by extracting attribute features including color information from pedestrian images and masking the clothing features of pedestrians, the method can fully utilize the color information of pedestrian images while eliminating the interference of pedestrian clothing on recognition, thus achieving the recognition of pedestrians changing clothes. Since this invention obtains attribute features including color information and then masks the clothing features to avoid the influence of clothing information on recognition, rather than directly extracting information unrelated to clothing from pedestrian images through complex analysis and processing, the model structure used in this invention is simple, reducing computational complexity and improving application efficiency. Furthermore, recognizing pedestrians by describing their attribute features through text rather than through features such as skeletal features that require cumbersome analysis and processing further reduces computational complexity and improves recognition efficiency. By concatenating and embedding the masking attribute description information encoded by linear projection into different levels of TrV blocks and fusing it with low-level to high-level features of the image, different feature spaces are mapped to a shared latent space, enabling a better understanding and utilization of the semantic relationship between them, thus improving recognition accuracy.

[0066] Figure 3 illustrates the structure of a device for re-identifying pedestrians changing clothes based on partial masking of attribute descriptions, according to an embodiment of the present invention. As an example and not a limitation, the device 300 may include a processing unit 310, which may include a description extraction model 311 and a self-attention feature extraction network 312.

[0067] Processing unit 310 is used for:

[0068] The pedestrian image is input into the description extraction model to extract the pedestrian's attribute features and mask the pedestrian's clothing features to obtain the pedestrian's occlusion attribute description information, which includes color information unrelated to clothing.

[0069] Linear projection is performed on the occlusion attribute description information to obtain a one-dimensional occlusion attribute description feature of the pedestrian.

[0070] The pedestrian image and one-dimensional occlusion attribute description features are input into a self-attention feature extraction network to perform multi-head self-attention operation on the pedestrian image and one-dimensional occlusion attribute description features to obtain the pedestrian recognition result.

[0071] In one possible implementation, the description extraction model 311 can be used to extract the attribute features of pedestrians from pedestrian images; binarize the attribute features to obtain the binary vector corresponding to the attribute features; and set the elements corresponding to clothing features in the binary vector to 0 to mask the clothing features and obtain the occlusion attribute description information.

[0072] For example, a binary vector includes multiple elements, each element corresponding to the binarized value of an attribute feature.

[0073] In one possible implementation, attribute features are described via text.

[0074] In one possible implementation, after linearly projecting the occlusion attribute description information to obtain a one-dimensional occlusion attribute description feature of the pedestrian, the processing unit 310 can also be used to embed attribute encoding in the one-dimensional occlusion attribute description feature to obtain the encoded occlusion attribute description feature.

[0075] In one possible implementation, the self-attention feature extraction network 312 can be used to: segment and flatten the pedestrian image into a one-dimensional image vector; embed positional encoding into the one-dimensional image vector to obtain an encoded one-dimensional image vector; perform multi-head self-attention operation on the encoded one-dimensional image vector to obtain low-level image features of the pedestrian image; sequentially perform multiple concatenation processes and multi-head self-attention operations on the low-level image features and the encoded occlusion attribute description features to obtain the first and second recognition features of the pedestrian; aggregate the low-level image features, the first recognition features, and the second recognition features to obtain the recognition features of the pedestrian; and determine the recognition result of the pedestrian based on the recognition features.
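The first two steps listed above, segmenting and flattening the image and then embedding positional encoding, can be illustrated with a toy example. The 4x4 single-channel image, the patch size of 2, and the positional values are assumptions chosen so the result is easy to check by hand; a trained network would typically use learned positional encodings and a linear patch embedding.

```python
def patchify(image, patch):
    """Split an H x W image (nested lists) into flattened patch vectors, row by row."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append(
                [image[r][c]
                 for r in range(top, top + patch)
                 for c in range(left, left + patch)]
            )
    return patches


def add_positional_encoding(patches, pos_table):
    """Add one positional vector to each flattened patch, element-wise."""
    return [[v + p for v, p in zip(vec, pos)] for vec, pos in zip(patches, pos_table)]


# Toy 4x4 image whose pixel values are 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, 2)

# Hypothetical positional table: patch i receives the constant vector [i, i, i, i].
pos_table = [[float(i)] * 4 for i in range(len(patches))]
encoded = add_positional_encoding(patches, pos_table)
print(patches)
print(encoded)
```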

[0076] In one possible implementation, the self-attention feature extraction network 312 can be used to perform multi-head self-attention operation on the encoded one-dimensional image vector to obtain the relevance features of the pedestrian image; based on the relevance features, global information of the pedestrian image is extracted to obtain the low-level image features of the pedestrian image.

[0077] In one possible implementation, the self-attention feature extraction network 312 is specifically used to concatenate low-level image features and encoded occlusion attribute description features to obtain a first concatenated feature; perform a multi-head self-attention operation on the first concatenated feature to obtain a first recognition feature; concatenate the first recognition feature and encoded occlusion attribute description features to obtain a second concatenated feature; and perform a multi-head self-attention operation on the second concatenated feature to obtain a second recognition feature.
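The two concatenation and attention stages above can be sketched in plain Python. The sketch assumes that the encoded occlusion attribute description feature is appended as one extra token along the sequence dimension, that each stage has its own randomly initialized projection matrices, and that residual connections, normalization, and feed-forward layers are omitted; none of these details are specified in the text, and the names `multi_head_self_attention` and `make_layer` are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_self_attention(tokens, layer, num_heads):
    """Scaled dot-product self-attention with the feature axis split across heads."""
    wq, wk, wv, wo = layer
    q, k, v = matmul(tokens, wq), matmul(tokens, wk), matmul(tokens, wv)
    head_dim = len(q[0]) // num_heads
    merged = [[] for _ in tokens]
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        qh = [r[lo:hi] for r in q]
        kh = [r[lo:hi] for r in k]
        vh = [r[lo:hi] for r in v]
        scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(head_dim)
                   for kj in kh] for qi in qh]
        weights = [softmax(r) for r in scores]
        for i, out_row in enumerate(matmul(weights, vh)):
            merged[i].extend(out_row)  # concatenate the head outputs per token
    return matmul(merged, wo)


def make_layer(rng, dim):
    """Random query, key, value, and output projections (stand-ins for trained weights)."""
    return tuple([[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
                 for _ in range(4))


rng = random.Random(0)
dim, num_heads = 4, 2
low_level = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(4)]  # 4 image tokens
attribute_token = [0.5, 1.5, 0.0, 1.5]  # encoded occlusion attribute description feature

first_concat = low_level + [attribute_token]
first_recognition = multi_head_self_attention(first_concat, make_layer(rng, dim), num_heads)
second_concat = first_recognition + [attribute_token]
second_recognition = multi_head_self_attention(second_concat, make_layer(rng, dim), num_heads)
print(len(first_recognition), len(second_recognition))
```

Under these assumptions the token count grows by one at each concatenation, so four image tokens yield a first recognition feature of five tokens and a second recognition feature of six tokens.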

[0078] To better illustrate the beneficial effects of the method provided by this invention, the following simulation experiments were conducted:

[0079] For example, the method provided in this invention and the recognition methods based on OSNet, DG-Net, BoT, AGW, TransReID, RCSANet, MBUNet, IRANet, and DeSKPro were tested on the Celeb-reID-light dataset. The Rank-1 accuracy and mean average precision (mAP) of the recognition results for each method are shown in the table below.
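For readers unfamiliar with the two metrics, the sketch below shows one common way to compute Rank-1 accuracy and mAP from ranked gallery lists. It is a simplified illustration with made-up identities, not the evaluation protocol used in the experiments; practical re-identification benchmarks usually apply additional filtering rules that are omitted here, and the function name `rank1_and_map` is hypothetical.

```python
def rank1_and_map(query_ids, ranked_gallery_ids):
    """Return (Rank-1 accuracy, mAP) for ranked gallery identity lists.

    query_ids[i] is the true identity of query i, and ranked_gallery_ids[i]
    lists gallery identities sorted from most to least similar to query i.
    """
    rank1_hits = 0
    average_precisions = []
    for qid, ranked in zip(query_ids, ranked_gallery_ids):
        if ranked and ranked[0] == qid:
            rank1_hits += 1
        hits = 0
        precisions = []
        for position, gid in enumerate(ranked, start=1):
            if gid == qid:
                hits += 1
                precisions.append(hits / position)  # precision at each correct match
        if precisions:  # queries without any correct match are skipped
            average_precisions.append(sum(precisions) / len(precisions))
    rank1 = rank1_hits / len(query_ids)
    mean_ap = sum(average_precisions) / len(average_precisions) if average_precisions else 0.0
    return rank1, mean_ap


# Made-up example: two queries with identities "A" and "B".
query_ids = ["A", "B"]
ranked_gallery_ids = [["A", "B", "A"], ["A", "B", "C"]]
rank1, mean_ap = rank1_and_map(query_ids, ranked_gallery_ids)
print(rank1, mean_ap)
```

In this toy case only the first query has a correct top match, so Rank-1 is 0.5, and the two average precisions (5/6 and 1/2) give an mAP of 2/3.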

[0080] Table 1

[0081]

| Method | Rank-1 (%) | mAP (%) |
| --- | --- | --- |
| OSNet | 21.3 | 11.7 |
| DG-Net | 23.5 | 12.6 |
| BoT | 24.2 | 13.6 |
| AGW | 30.2 | 15.4 |
| TransReID | 31.3 | 18.6 |
| RCSANet | 29.5 | 16.7 |
| MBUNet | 33.9 | 21.3 |
| IRANet | 46.2 | 25.4 |
| DeSKPro | 52.0 | 29.8 |
| This invention | 69.6 | 52.1 |

[0082] As shown in Table 1, on the Celeb-reID-light dataset, both the Rank-1 accuracy and the mAP of this invention are significantly higher than those of the second-best method, namely the DeSKPro-based recognition method. The Rank-1 accuracy of this invention is 17.6 percentage points higher (69.6 versus 52.0), and the mAP is 22.3 percentage points higher (52.1 versus 29.8), than the DeSKPro-based recognition method.
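The margins quoted above are differences between the Table 1 entries. The short check below recomputes them and also reports the corresponding relative improvements, which are not stated in the text and are included only to distinguish the two readings of "higher"; the helper name `margins` is hypothetical.

```python
def margins(ours, baseline):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = round(ours - baseline, 1)
    relative = round(100.0 * (ours - baseline) / baseline, 1)
    return absolute, relative


# Values taken from Table 1: (this invention, DeSKPro).
rank1_gain = margins(69.6, 52.0)
map_gain = margins(52.1, 29.8)
print(rank1_gain)
print(map_gain)
```

The absolute gains are 17.6 and 22.3 percentage points, which correspond to relative improvements of roughly 33.8% in Rank-1 accuracy and 74.8% in mAP over DeSKPro.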

[0083] This demonstrates that the present invention unifies pedestrian appearance features and textual descriptions. For pedestrian re-identification after clothing changes, multimodal attribute description information is introduced through a description extraction model; this information is more explicit and easier to extract and edit than the skeletal or contour features in the original pedestrian image. By masking clothing attributes, information helpful for pedestrian identification is preserved to the greatest extent possible, while interference from clothing information is accurately eliminated. Interference from sensitive clothing information is removed through attribute masking and concatenation embedding, which avoids the difficulty and complexity of extracting multimodal biometric features.

[0084] Figure 4 is a structural schematic of an electronic device provided in an embodiment of the present invention. The electronic device 400 illustrated in Figure 4 may include: at least one processor 410 (Figure 4 shows only one processor), a memory 420, and a computer program 430 stored in the memory 420 and executable on the at least one processor 410. When executing the computer program 430, the processor 410 implements the steps in any of the above method embodiments.

[0085] The electronic device 400 may be a robot or other processing device capable of implementing the above methods. This embodiment of the invention does not impose any restrictions on the specific type of electronic device.

[0086] Those skilled in the art will understand that Figure 4 is merely an example of the electronic device 400 and does not constitute a limitation on the electronic device. The electronic device may include more or fewer components than shown, combine certain components, or use different components. For example, the electronic device 400 may also include input/output interfaces.

[0087] The processor 410 may be a central processing unit (CPU), or it may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), another programmable logic device, a discrete gate or transistor logic device, etc. A general-purpose processor may be a microprocessor or any conventional processor.

[0088] In some embodiments, the memory 420 may be an internal storage unit, such as a hard disk or RAM. In other embodiments, the memory 420 may be an external storage device, such as a plug-in hard disk, a smart memory card (SMC), a secure digital card (SD), or a flash card. Furthermore, the memory 420 may include both internal and external storage units. The memory 420 is used to store the operating system, applications, boot loader, data, and other programs, such as the program code of the computer program. The memory 420 can also be used to temporarily store data that has been output or will be output.

[0089] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.

[0090] Those skilled in the art will understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is merely an example. In practical applications, the functions described above can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit. Furthermore, the specific names of the functional units and modules are only for easy differentiation and are not intended to limit the scope of protection of this invention. The specific working process of the units and modules in the above system can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.

[0091] This invention also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps described in the various method embodiments above.

[0092] This invention provides a computer program product that, when run on an electronic device, enables the electronic device to perform the steps described in the various method embodiments above.

[0093] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of the present invention can be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. The computer-readable medium can include at least: any entity or device capable of carrying computer program code to a photographing device / terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium. Examples include USB flash drives, portable hard drives, magnetic disks, or optical disks. In some jurisdictions, according to legislation and patent practice, computer-readable media cannot be electrical carrier signals or telecommunication signals.

[0094] In the above embodiments, the descriptions of each embodiment have different focuses. For parts that are not described in detail or recorded in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0095] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this invention.

Claims

1. A method for re-identifying pedestrians changing clothes based on partial masking of attribute descriptions, characterized in that the method includes: inputting a pedestrian image into a trained description extraction model to extract the pedestrian's attribute features and mask the pedestrian's clothing features to obtain the pedestrian's occlusion attribute description information, wherein the occlusion attribute description information includes color information unrelated to clothing; performing linear projection on the occlusion attribute description information to obtain a one-dimensional occlusion attribute description feature of the pedestrian; and inputting the pedestrian image and the one-dimensional occlusion attribute description feature into a trained self-attention feature extraction network to perform a multi-head self-attention operation on the pedestrian image and the one-dimensional occlusion attribute description feature to obtain the pedestrian's recognition result; wherein performing the multi-head self-attention operation on the pedestrian image and the one-dimensional occlusion attribute description feature to obtain the pedestrian's recognition result includes: segmenting and flattening the pedestrian image into a one-dimensional image vector; embedding positional encoding into the one-dimensional image vector to obtain an encoded one-dimensional image vector; performing a multi-head self-attention operation on the encoded one-dimensional image vector to obtain low-level image features of the pedestrian image; sequentially performing multiple concatenation processes and multi-head self-attention operations on the low-level image features and the encoded occlusion attribute description features to obtain a first recognition feature and a second recognition feature of the pedestrian; aggregating the low-level image features, the first recognition feature, and the second recognition feature to obtain the recognition features of the pedestrian; and determining the pedestrian's recognition result based on the recognition features.

2. The method according to claim 1, characterized in that extracting the pedestrian's attribute features and masking the pedestrian's clothing features to obtain the pedestrian's occlusion attribute description information includes: extracting the attribute features of the pedestrian from the pedestrian image; binarizing the attribute features to obtain a binary vector corresponding to the attribute features, wherein the binary vector includes multiple elements, and each element corresponds to the binarized value of an attribute feature; and setting the elements corresponding to the clothing features in the binary vector to 0 to mask the clothing features, thereby obtaining the occlusion attribute description information.

3. The method according to claim 1, characterized in that the attribute features are described through text.

4. The method according to claim 1, characterized in that, after linearly projecting the occlusion attribute description information to obtain a one-dimensional occlusion attribute description feature of the pedestrian, the method further includes: embedding attribute encoding in the one-dimensional occlusion attribute description feature to obtain the encoded occlusion attribute description feature.

5. The method according to claim 1, characterized in that performing a multi-head self-attention operation on the encoded one-dimensional image vector to obtain low-level image features of the pedestrian image includes: performing a multi-head self-attention operation on the encoded one-dimensional image vector to obtain the relevance features of the pedestrian image; and extracting, based on the relevance features, global information of the pedestrian image to obtain the low-level image features of the pedestrian image.

6. The method according to claim 1, characterized in that sequentially performing multiple concatenation processes and multi-head self-attention operations on the low-level image features and the encoded occlusion attribute description features to obtain the first recognition feature and the second recognition feature of the pedestrian includes: concatenating the low-level image features and the encoded occlusion attribute description features to obtain a first concatenated feature; performing a multi-head self-attention operation on the first concatenated feature to obtain the first recognition feature; concatenating the first recognition feature and the encoded occlusion attribute description features to obtain a second concatenated feature; and performing a multi-head self-attention operation on the second concatenated feature to obtain the second recognition feature.

7. A device for re-identifying pedestrians changing clothes based on partial masking of attribute descriptions, characterized in that the device includes a processing unit, which includes a description extraction model and a self-attention feature extraction network. The processing unit is used for: inputting a pedestrian image into the description extraction model to extract the pedestrian's attribute features and mask the pedestrian's clothing features to obtain the pedestrian's occlusion attribute description information, wherein the occlusion attribute description information includes color information unrelated to clothing; performing linear projection on the occlusion attribute description information to obtain a one-dimensional occlusion attribute description feature of the pedestrian; and inputting the pedestrian image and the one-dimensional occlusion attribute description feature into the self-attention feature extraction network to perform a multi-head self-attention operation on the pedestrian image and the one-dimensional occlusion attribute description feature to obtain the pedestrian's recognition result. The processing unit is further configured to: segment the pedestrian image and flatten it into a one-dimensional image vector; embed positional encoding into the one-dimensional image vector to obtain an encoded one-dimensional image vector; perform a multi-head self-attention operation on the encoded one-dimensional image vector to obtain low-level image features of the pedestrian image; sequentially perform multiple concatenation processes and multi-head self-attention operations on the low-level image features and the encoded occlusion attribute description features to obtain a first recognition feature and a second recognition feature of the pedestrian; aggregate the low-level image features, the first recognition feature, and the second recognition feature to obtain the recognition features of the pedestrian; and determine the pedestrian's recognition result based on the recognition features.

8. An electronic device comprising a memory, a processor, and a computer program stored in the memory, characterized in that, when the processor executes the computer program, it implements the method as described in any one of claims 1-7.

9. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processing device, it implements the method as described in any one of claims 1-7.