Pedestrian image matching method and device

By calculating the distance between image acquisition devices and processing images using a feature extraction model, a pedestrian feature matrix is generated to determine whether images belong to the same pedestrian. This solves the problem of low matching accuracy of pedestrian images from multiple image acquisition devices and improves matching accuracy.

CN115359274B · Active · Publication Date: 2026-06-19 · BEIJING ZHIDA TIANJIE COMMERCIAL OPERATION MANAGEMENT CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING ZHIDA TIANJIE COMMERCIAL OPERATION MANAGEMENT CO LTD
Filing Date
2022-08-09
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In scenarios with multiple image acquisition devices, there is a problem of low pedestrian image matching accuracy.

Method used

A dataset of pedestrian images collected by multiple image acquisition devices within a target area over a preset time period is acquired, and the distances between the devices are calculated to determine a device vector for each device. The images are then processed using a feature extraction and processing model to generate pedestrian feature vectors and background feature vectors. Based on these vectors, an image feature matrix is generated for each image, and image similarity is calculated to determine whether two images belong to the same pedestrian.

Benefits of technology

It improves the accuracy of pedestrian image matching in scenarios with multiple image acquisition devices.

✦ Generated by Eureka AI based on patent content.


Abstract

This disclosure relates to the field of image processing technology, and provides a method and apparatus for matching pedestrian images. The method includes: determining a first target device vector and a second target device vector; processing the first target image and the second target image using a feature extraction and processing model to obtain a first pedestrian feature vector and a first background feature vector corresponding to the first target image, and a second pedestrian feature vector and a second background feature vector corresponding to the second target image; determining a first image feature matrix corresponding to the first target image based on the first target device vector, the first pedestrian feature vector, and the first background feature vector; determining a second image feature matrix corresponding to the second target image based on the second target device vector, the second pedestrian feature vector, and the second background feature vector; and determining whether the first target image and the second target image belong to the same pedestrian based on the first image feature matrix and the second image feature matrix.

Description

Technical Field

[0001] This disclosure relates to the field of image processing technology, and in particular to a method and apparatus for matching pedestrian images.

Background Technology

[0002] Within a given time period, how can one determine whether two or more surveillance images belong to the same pedestrian? (Matching multiple surveillance images reduces to matching two images: it simply involves matching pairs of images multiple times.) Currently, a common approach is to use a general-purpose neural network to extract pedestrian features from the surveillance images and then match the images based on those features. However, this approach considers neither the influence of the image background nor the characteristics of the different image acquisition devices that captured the images. It can therefore produce significant errors, especially in scenarios with multiple image acquisition devices, that is, when the images being compared were captured by different devices.

[0003] In realizing the concept disclosed herein, the inventors discovered at least the following technical problem in the related art: low pedestrian image matching accuracy when multiple image acquisition devices are present.

Summary of the Invention

[0004] In view of this, the present disclosure provides a method, apparatus, electronic device, and computer-readable storage medium for matching pedestrian images, in order to solve the problem of low pedestrian image matching accuracy in scenarios with multiple image acquisition devices in the prior art.

[0005] A first aspect of this disclosure provides a method for matching pedestrian images, comprising: acquiring a pedestrian image dataset collected by multiple image acquisition devices within a target area over a preset time period, wherein the pedestrian image dataset includes multiple images of multiple pedestrians; acquiring the distance between each image acquisition device and the other image acquisition devices in the target area, and determining a device vector corresponding to each image acquisition device based on those distances; determining whether any two images in the pedestrian image dataset belong to the same pedestrian, denoting the two images to be judged as a first target image and a second target image, denoting the device vector corresponding to the image acquisition device that acquired the first target image as a first target device vector, and denoting the device vector corresponding to the image acquisition device that acquired the second target image as a second target device vector; processing the first target image and the second target image respectively using a feature extraction and processing model to obtain a first pedestrian feature vector and a first background feature vector corresponding to the first target image, and a second pedestrian feature vector and a second background feature vector corresponding to the second target image; determining a first image feature matrix corresponding to the first target image based on the first target device vector, the first pedestrian feature vector, and the first background feature vector; determining a second image feature matrix corresponding to the second target image based on the second target device vector, the second pedestrian feature vector, and the second background feature vector; and calculating a target similarity between the first target image and the second target image based on the first image feature matrix and the second image feature matrix, and determining, based on the target similarity, whether the first target image and the second target image belong to the same pedestrian.

[0006] A second aspect of this disclosure provides a pedestrian image matching apparatus, comprising: a first acquisition module configured to acquire a pedestrian image dataset collected by multiple image acquisition devices within a target area over a preset time period, wherein the pedestrian image dataset includes multiple images of multiple pedestrians; a second acquisition module configured to acquire the distance between each image acquisition device and the other image acquisition devices in the target area, and to determine a device vector corresponding to each image acquisition device based on those distances; a first judgment module configured to judge whether any two images in the pedestrian image dataset belong to the same pedestrian, to denote the two images to be judged as a first target image and a second target image, to denote the device vector corresponding to the image acquisition device that acquired the first target image as a first target device vector, and to denote the device vector corresponding to the image acquisition device that acquired the second target image as a second target device vector; a model module configured to process the first target image and the second target image respectively using a feature extraction and processing model to obtain a first pedestrian feature vector and a first background feature vector corresponding to the first target image, and a second pedestrian feature vector and a second background feature vector corresponding to the second target image; a first determination module configured to determine a first image feature matrix corresponding to the first target image based on the first target device vector, the first pedestrian feature vector, and the first background feature vector; a second determination module configured to determine a second image feature matrix corresponding to the second target image based on the second target device vector, the second pedestrian feature vector, and the second background feature vector; and a second judgment module configured to calculate a target similarity between the first target image and the second target image based on the first image feature matrix and the second image feature matrix, and to determine, based on the target similarity, whether the first target image and the second target image belong to the same pedestrian.

[0007] A third aspect of this disclosure provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method described above.

[0008] A fourth aspect of this disclosure provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the above-described method.

[0009] Compared with the prior art, the embodiments of this disclosure have the following beneficial effects. A pedestrian image dataset collected by multiple image acquisition devices within a target area over a preset time period is acquired, the pedestrian image dataset including multiple images of multiple pedestrians. The distance between each image acquisition device and the other image acquisition devices in the target area is acquired, and the device vector corresponding to each image acquisition device is determined based on those distances. Whether any two images in the pedestrian image dataset belong to the same pedestrian is determined: the two images to be judged are denoted as the first target image and the second target image, the device vector corresponding to the image acquisition device that acquired the first target image is denoted as the first target device vector, and the device vector corresponding to the image acquisition device that acquired the second target image is denoted as the second target device vector. The first target image and the second target image are each processed using a feature extraction and processing model to obtain the first pedestrian feature vector and the first background feature vector corresponding to the first target image, and the second pedestrian feature vector and the second background feature vector corresponding to the second target image. Based on the first target device vector, the first pedestrian feature vector, and the first background feature vector, the first image feature matrix corresponding to the first target image is determined. Based on the second target device vector, the second pedestrian feature vector, and the second background feature vector, the second image feature matrix corresponding to the second target image is determined. Based on the first image feature matrix and the second image feature matrix, the target similarity between the first target image and the second target image is calculated, and based on the target similarity, it is determined whether the first target image and the second target image belong to the same pedestrian. These technical means solve the prior-art problem of low pedestrian image matching accuracy in scenarios with multiple image acquisition devices, thereby improving the accuracy of pedestrian image matching in such scenarios.

Attached Figure Description

[0010] To more clearly illustrate the technical solutions in the embodiments of this disclosure, the accompanying drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this disclosure. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0011] Figure 1 is a schematic diagram illustrating an application scenario of an embodiment of this disclosure;

[0012] Figure 2 is a flowchart illustrating a pedestrian image matching method provided in an embodiment of this disclosure;

[0013] Figure 3 is a schematic diagram of the structure of a pedestrian image matching device provided in an embodiment of this disclosure;

[0014] Figure 4 is a schematic diagram of the structure of an electronic device provided in an embodiment of this disclosure.

Detailed Implementation

[0015] In the following description, specific details such as particular system architectures and techniques are set forth for illustrative purposes and not for limitation, so as to provide a thorough understanding of the embodiments of this disclosure. However, those skilled in the art will understand that this disclosure may also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods have been omitted so as not to obscure the description of this disclosure with unnecessary detail.

[0016] A method and apparatus for matching pedestrian images according to embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings.

[0017] Figure 1 is a schematic diagram illustrating an application scenario of an embodiment of this disclosure. The application scenario may include terminal devices 101, 102, and 103, server 104, and network 105.

[0018] Terminal devices 101, 102, and 103 can be hardware or software. When terminal devices 101, 102, and 103 are hardware, they can be various electronic devices with displays that support communication with server 104, including but not limited to smartphones, tablets, laptops, and desktop computers. When terminal devices 101, 102, and 103 are software, they can be installed in the aforementioned electronic devices. Terminal devices 101, 102, and 103 can be implemented as multiple software programs or software modules, or as a single software program or software module; this disclosure does not impose any limitations on this. Furthermore, various applications can be installed on terminal devices 101, 102, and 103, such as data processing applications, instant messaging tools, social platform software, search applications, shopping applications, etc.

[0019] Server 104 can be a server that provides various services, such as a backend server that receives requests sent by terminal devices with which it has established communication connections. This backend server can receive and analyze the requests sent by the terminal devices and generate processing results. Server 104 can be a single server, a server cluster consisting of several servers, or a cloud computing service center. This embodiment of the disclosure does not impose any limitations on these aspects.

[0020] It should be noted that server 104 can be either hardware or software. When server 104 is hardware, it can be various electronic devices that provide various services to terminal devices 101, 102, and 103. When server 104 is software, it can be multiple software programs or software modules that provide various services to terminal devices 101, 102, and 103, or it can be a single software program or software module that provides various services to terminal devices 101, 102, and 103. This disclosure does not limit the scope of the embodiments.

[0021] Network 105 can be a wired network using coaxial cable, twisted pair, and fiber optic connection, or it can be a wireless network that enables interconnection of various communication devices without wiring, such as Bluetooth, Near Field Communication (NFC), Infrared, etc. This disclosure does not limit the scope of the network.

[0022] Users can establish a communication connection with server 104 via network 105 through terminal devices 101, 102, and 103 to receive or send information, etc. It should be noted that the specific types, quantities, and combinations of terminal devices 101, 102, and 103, server 104, and network 105 can be adjusted according to the actual needs of the application scenario, and this disclosure embodiment does not impose any limitations on this.

[0023] Figure 2 is a flowchart illustrating a pedestrian image matching method provided in an embodiment of this disclosure. The pedestrian image matching method of Figure 2 may be executed by the computer or server of Figure 1, or by software on the computer or server. As shown in Figure 2, the pedestrian image matching method includes:

[0024] S201, acquire a pedestrian image dataset collected by multiple image acquisition devices in the target area within a preset time period, wherein the pedestrian image dataset includes: multiple images of multiple pedestrians;

[0025] Image acquisition devices can be common devices such as cameras used to acquire images. The images in the pedestrian image dataset are pedestrian images collected within a preset time period before the acquisition time of the pedestrian image dataset, with the acquisition time being the end time.
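The time window described above can be sketched as a simple filter. This is a minimal illustration only: the record layout (`image_id`, `device_id`, `captured_at`) and the function name `select_window` are hypothetical and do not come from the disclosure.

```python
from datetime import datetime, timedelta

def select_window(records, acquired_at, window):
    """Keep records captured within the preset period that ends at acquired_at."""
    start = acquired_at - window
    return [r for r in records if start <= r["captured_at"] <= acquired_at]

# Hypothetical capture records from three devices.
records = [
    {"image_id": "a", "device_id": 0, "captured_at": datetime(2022, 8, 9, 9, 50)},
    {"image_id": "b", "device_id": 1, "captured_at": datetime(2022, 8, 9, 10, 5)},
    {"image_id": "c", "device_id": 2, "captured_at": datetime(2022, 8, 9, 8, 0)},
]

# Dataset acquired at 10:00 with a 30-minute preset period (both values are illustrative).
dataset = select_window(records, datetime(2022, 8, 9, 10, 0), timedelta(minutes=30))
selected_ids = [r["image_id"] for r in dataset]
```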

[0026] S202, obtain the distance between each image acquisition device and other image acquisition devices in the target area, and determine the device vector corresponding to each image acquisition device based on the distance between each image acquisition device and other image acquisition devices;

[0027] Based on the distance between two image acquisition devices, determine the coefficients corresponding to those two devices (the greater the distance, the smaller the coefficient). Once the coefficients for each pair of image acquisition devices are determined, the relationship matrix G for all image acquisition devices is formed (the coefficient for each image acquisition device is an element in the relationship matrix). Obtain the initialization matrix E and the relationship adjustment matrix σ (E and σ are pre-set). Then, according to E′=(G*E)*σ...

[0028] The device vector corresponding to each image acquisition device is a row vector in matrix E′.
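The computation of the device vectors in S202 can be sketched as follows. The disclosure only states that the coefficient decreases as the distance grows and that E and σ are preset, so the mapping 1 / (1 + d), the matrix shapes, and all numeric values below are assumptions chosen for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relation_matrix(distances):
    """Build G from pairwise distances.

    Assumption: coefficient = 1 / (1 + d). The source only requires that a
    greater distance gives a smaller coefficient.
    """
    return [[1.0 / (1.0 + d) for d in row] for row in distances]

def device_vectors(distances, E, sigma):
    """Compute E' = (G * E) * sigma; row i of E' is the device vector of device i."""
    G = relation_matrix(distances)
    return matmul(matmul(G, E), sigma)

# Three devices; distances, E, and sigma are illustrative values only.
distances = [[0.0, 1.0, 3.0],
             [1.0, 0.0, 1.0],
             [3.0, 1.0, 0.0]]
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # preset initialization matrix (n x k)
sigma = [[1.0, 0.0], [0.0, 1.0]]           # preset relationship adjustment matrix (k x k)

E_prime = device_vectors(distances, E, sigma)
first_device_vector = E_prime[0]
```

With these shapes, G is n × n, E is n × k, and σ is k × k, so each device receives a k-dimensional row vector.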

[0029] S203, determine whether any two images in the pedestrian image dataset belong to the same pedestrian, and record the two images to be judged as the first target image and the second target image respectively, record the device vector corresponding to the image acquisition device that acquires the first target image as the first target device vector, and record the device vector corresponding to the image acquisition device that acquires the second target image as the second target device vector;

[0030] By determining whether any two images in the pedestrian image dataset belong to the same pedestrian, we can identify all the images corresponding to each pedestrian in the dataset. For ease of explanation later, one of the two images will be designated as the first target image, and the other as the second target image.
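One way to turn pairwise decisions into per-pedestrian image sets is to merge every pair judged to be the same pedestrian. The sketch below uses a union-find structure; merging transitively is an assumption, and `same_pedestrian` stands in for the full S204 to S207 comparison, which is replaced here by a toy lookup.

```python
from itertools import combinations

def group_by_pedestrian(image_ids, same_pedestrian):
    """Group images by running the pairwise judgment on every pair of images."""
    parent = {i: i for i in image_ids}

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for first, second in combinations(image_ids, 2):
        if same_pedestrian(first, second):
            parent[find(first)] = find(second)

    groups = {}
    for i in image_ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Toy stand-in for the similarity-based judgment (hypothetical labels).
labels = {"img1": "p1", "img2": "p2", "img3": "p1"}
groups = group_by_pedestrian(list(labels), lambda a, b: labels[a] == labels[b])
```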

[0031] S204, using a feature extraction and processing model to process the first target image and the second target image respectively, to obtain the first pedestrian feature vector and the first background feature vector corresponding to the first target image, and the second pedestrian feature vector and the second background feature vector corresponding to the second target image;

[0032] Feature extraction and processing models can extract and process features from images.

[0033] S205, based on the first target device vector, the first pedestrian feature vector, and the first background feature vector, determine the first image feature matrix corresponding to the first target image;

[0034] The first image feature matrix may be composed of the first target device vector, the first pedestrian feature vector, and the first background feature vector.

[0035] S206, Based on the second target device vector, the second pedestrian feature vector, and the second background feature vector, determine the second image feature matrix corresponding to the second target image;

[0036] The second image feature matrix can be composed of the second target device vector, the second pedestrian feature vector, and the second background feature vector.

[0037] S207, based on the first image feature matrix and the second image feature matrix, calculate the target similarity between the first target image and the second target image, and determine whether the first target image and the second target image belong to the same pedestrian based on the target similarity.

[0038] If the target similarity is greater than a preset threshold, the first target image and the second target image are determined to belong to the same pedestrian.
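Steps S205 to S207 can be sketched as below. The disclosure does not specify how the three vectors are combined or which similarity measure is used, so stacking equal-length vectors as rows, using cosine similarity over the flattened matrices, and the threshold of 0.9 are all assumptions made for illustration.

```python
import math

def feature_matrix(device_vec, pedestrian_vec, background_vec):
    """Stack the three vectors as rows (assumption: they have equal length)."""
    return [list(device_vec), list(pedestrian_vec), list(background_vec)]

def cosine_similarity(m1, m2):
    """Cosine similarity of two matrices after flattening (an assumed measure)."""
    a = [v for row in m1 for v in row]
    b = [v for row in m2 for v in row]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def same_pedestrian(m1, m2, threshold=0.9):
    """Apply the preset threshold to the target similarity."""
    return cosine_similarity(m1, m2) > threshold

# Illustrative vectors only.
m1 = feature_matrix([1.0, 0.0], [0.5, 0.5], [0.2, 0.1])
m2 = feature_matrix([1.0, 0.0], [0.5, 0.5], [0.2, 0.1])
m3 = feature_matrix([0.0, 1.0], [-0.5, 0.5], [0.1, -0.2])

match_same = same_pedestrian(m1, m2)
match_other = same_pedestrian(m1, m3)
```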

[0039] According to the technical solution provided in the embodiments of this disclosure, a pedestrian image dataset collected by multiple image acquisition devices within a target area over a preset time period is acquired, the pedestrian image dataset including multiple images of multiple pedestrians. The distance between each image acquisition device and the other image acquisition devices in the target area is acquired, and the device vector corresponding to each image acquisition device is determined based on those distances. Whether any two images in the pedestrian image dataset belong to the same pedestrian is determined: the two images to be judged are denoted as the first target image and the second target image, the device vector corresponding to the image acquisition device that acquired the first target image is denoted as the first target device vector, and the device vector corresponding to the image acquisition device that acquired the second target image is denoted as the second target device vector. The first target image and the second target image are each processed using a feature extraction and processing model to obtain the first pedestrian feature vector and the first background feature vector corresponding to the first target image, and the second pedestrian feature vector and the second background feature vector corresponding to the second target image. Based on the first target device vector, the first pedestrian feature vector, and the first background feature vector, the first image feature matrix corresponding to the first target image is determined. Based on the second target device vector, the second pedestrian feature vector, and the second background feature vector, the second image feature matrix corresponding to the second target image is determined. Based on the first image feature matrix and the second image feature matrix, the target similarity between the first target image and the second target image is calculated, and based on the target similarity, it is determined whether the first target image and the second target image belong to the same pedestrian. These technical means solve the prior-art problem of low pedestrian image matching accuracy in scenarios with multiple image acquisition devices, thereby improving the accuracy of pedestrian image matching in such scenarios.

[0040] In step S204, processing the first target image and the second target image using the feature extraction and processing model, to obtain the first pedestrian feature vector and the first background feature vector corresponding to the first target image and the second pedestrian feature vector and the second background feature vector corresponding to the second target image, includes the following (that is, the feature extraction and processing model includes the following structures or calculations): using the network from the input layer to the second-stage network layer of a residual neural network to extract a first target feature map of the first target image and a second target feature map of the second target image, wherein the feature extraction and processing model includes the network from the input layer to the second-stage network layer of the residual neural network; performing, in sequence, convolution, normalization, convolution, normalization, activation, convolution, and normalization on the first target feature map and on the second target feature map to obtain a third target feature map and a fourth target feature map, respectively; and processing the third target feature map and the fourth target feature map using the normalized exponential function, and determining the first pedestrian feature vector and the first background feature vector, as well as the second pedestrian feature vector and the second background feature vector, based on the processing results.

[0041] The classic residual neural network structure consists of: an input layer, a Stemblock layer, a first-stage network layer, a second-stage network layer, a third-stage network layer, a fourth-stage network layer, a global average pooling layer, and a fully connected layer. The Stemblock layer is used for downsampling. The first, second, third, and fourth-stage network layers process the features obtained from the Stemblock downsampling. The global average pooling layer reduces the dimensionality, and the fully connected layer is used for classification. In this disclosure, the feature extraction and processing model includes the network structure of a residual neural network from the input layer to the second-stage network layer. This can be understood as inputting a first target image and a second target image into the residual neural network, and outputting the first target feature map and the second target feature map from the second-stage network layer.

[0042] For example, the first and second target feature maps each undergo two rounds of 3x3 convolution (Conv) followed by batch normalization (BN); then an activation, a further 3x3 convolution, and a further batch normalization are applied to obtain the third and fourth target feature maps. The normalized exponential function is the softmax function.
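The sequence of operations in this example can be written as a single composition. In the sketch below, $M_1$ and $M_2$ denote the first and second target feature maps, $M_3$ and $M_4$ the third and fourth, and $\phi$ the activation; these symbols are introduced here for illustration, and the disclosure does not name the activation function.

```latex
M_3 = \mathrm{BN}\Big(\mathrm{Conv}_{3\times 3}\Big(\phi\Big(\mathrm{BN}\Big(\mathrm{Conv}_{3\times 3}\Big(\mathrm{BN}\Big(\mathrm{Conv}_{3\times 3}(M_1)\Big)\Big)\Big)\Big)\Big)\Big),
\qquad
M_4 = \mathrm{BN}\Big(\mathrm{Conv}_{3\times 3}\Big(\phi\Big(\mathrm{BN}\Big(\mathrm{Conv}_{3\times 3}\Big(\mathrm{BN}\Big(\mathrm{Conv}_{3\times 3}(M_2)\Big)\Big)\Big)\Big)\Big)\Big)
```

Reading from the innermost term outward gives the order stated in step S204: convolution, normalization, convolution, normalization, activation, convolution, normalization.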

[0043] Processing the third target feature map and the fourth target feature map using the normalized exponential function, and determining the first pedestrian feature vector and the first background feature vector, as well as the second pedestrian feature vector and the second background feature vector, based on the processing results includes: processing the third target feature map using the normalized exponential function, and determining, based on the processing result, the first pedestrian head vector, first pedestrian chest vector, first pedestrian abdomen vector, first pedestrian left hand vector, first pedestrian right hand vector, first pedestrian left thigh vector, first pedestrian right thigh vector, first pedestrian left foot vector, first pedestrian right foot vector, and first background vector; processing the fourth target feature map using the normalized exponential function, and determining, based on the processing result, the second pedestrian head vector, second pedestrian chest vector, second pedestrian abdomen vector, second pedestrian left hand vector, second pedestrian right hand vector, second pedestrian left thigh vector, second pedestrian right thigh vector, second pedestrian left foot vector, second pedestrian right foot vector, and second background vector; determining a first vector group based on the first pedestrian head, chest, abdomen, left hand, right hand, left thigh, right thigh, left foot, and right foot vectors, and determining the first pedestrian feature vector based on the first vector group; determining a second vector group based on the first background vector, and determining the first background feature vector based on the second vector group; determining a third vector group based on the second pedestrian head, chest, abdomen, left hand, right hand, left thigh, right thigh, left foot, and right foot vectors, and determining the second pedestrian feature vector based on the third vector group; and determining a fourth vector group based on the second background vector, and determining the second background feature vector based on the fourth vector group.

[0044] Each image can be divided into ten parts: pedestrian head, pedestrian chest, pedestrian abdomen, pedestrian left hand, pedestrian right hand, pedestrian left thigh, pedestrian right thigh, pedestrian left foot, pedestrian right foot, and background. Processing the image using the normalized exponential function identifies these 10 parts (the normalized exponential function gives the probability that each part is the pedestrian head, pedestrian chest, pedestrian abdomen, pedestrian left hand, pedestrian right hand, pedestrian left thigh, pedestrian right thigh, pedestrian left foot, pedestrian right foot, or background, and the category with the highest probability is taken as the category of that part). The first pedestrian head vector, first pedestrian chest vector, first pedestrian abdomen vector, first pedestrian left hand vector, first pedestrian right hand vector, first pedestrian left thigh vector, first pedestrian right thigh vector, first pedestrian left foot vector, first pedestrian right foot vector, and first background vector can then be extracted from the 10 parts of the third target feature map.
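The category assignment described above, softmax followed by choosing the most probable category, can be sketched as follows. The ten scores are invented values for one part of a feature map; how such scores are produced from the feature map is not specified here.

```python
import math

PARTS = ["head", "chest", "abdomen", "left hand", "right hand",
         "left thigh", "right thigh", "left foot", "right foot", "background"]

def softmax(scores):
    """Normalized exponential function, shifted by the maximum for numerical stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(scores):
    """Return the most probable category and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return PARTS[best], probs[best]

# Illustrative scores for one part (one score per category).
scores = [0.1, 2.5, 0.3, 0.0, 0.0, 0.2, 0.1, 0.0, 0.0, 1.0]
label, prob = classify(scores)
```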

[0045] Based on the first pedestrian head vector, chest vector, abdomen vector, left hand vector, right hand vector, left thigh vector, right thigh vector, left foot vector, and right foot vector, form vector group A, and then form the first vector group $F_a$:

[0046] $F_a = W_1 A$

[0047] Based on the first background vector (there can be multiple first background vectors), form vector group B, and then form the second vector group $F_b$:

[0048] $F_b = W_2 B$

[0049] $W_1$ and $W_2$ are preset matrices.

[0050] Determining the third vector group is similar to determining the first vector group; determining the second pedestrian feature vector is similar to determining the first pedestrian feature vector; determining the fourth vector group is similar to determining the second vector group; determining the second background feature vector is similar to determining the first background feature vector.

[0051] Determining the first pedestrian feature vector based on the first vector group includes: performing a first-stage calculation, a second-stage calculation, and a third-stage calculation on the first vector group in sequence to obtain the first pedestrian feature vector, wherein each stage of calculation includes no fewer than two two-round attention calculations, and each two-round attention calculation includes a same-category self-attention calculation and a cross-category self-attention calculation.

[0052] The first two-round attention calculation in the first stage of computation is performed on the first vector group, including: dividing the first vector group according to the vector category to obtain nine first vector groups, and performing self-attention calculation within the same category on each first vector group to obtain a fifth vector group, wherein each first vector group has only one type of vector; dividing the fifth vector group according to the vector category to obtain N / 9 second vector groups, and performing cross-category self-attention calculation on each second vector group to obtain a sixth vector group, wherein each second vector group has exactly one vector of each category, and N is the number of vectors in the fifth vector group; performing layer normalization and matrix operations on the sixth vector group to obtain a seventh vector group; wherein the seventh vector group is the result of the first two-round attention calculation in the first stage of computation and is the input for the second two-round attention calculation in the first stage of computation.
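The two partitions used in a two-round attention calculation can be sketched as below: nine same-category groups, followed by N / 9 groups that each hold exactly one vector of every category. Pairing the k-th vector of each category into the k-th cross-category group is an assumption, since the disclosure does not say how the vectors are selected, and the attention computation itself is omitted.

```python
from collections import defaultdict

CATEGORIES = ["head", "chest", "abdomen", "left hand", "right hand",
              "left thigh", "right thigh", "left foot", "right foot"]

def same_category_groups(vectors):
    """Split (category, vector) pairs into one group per category."""
    groups = defaultdict(list)
    for category, vec in vectors:
        groups[category].append(vec)
    return dict(groups)

def cross_category_groups(groups):
    """Form groups containing exactly one vector of each category.

    Assumption: the k-th vectors of all categories are grouped together.
    """
    return [list(members) for members in zip(*groups.values())]

# Toy input: two vectors per category, so N = 18.
vectors = [(c, [float(i), float(k)]) for k in range(2) for i, c in enumerate(CATEGORIES)]

same = same_category_groups(vectors)    # 9 groups, input to same-category self-attention
cross = cross_category_groups(same)     # N / 9 = 2 groups, input to cross-category self-attention
```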

[0053] Presuppose three matrices α, β, γ, and another...

[0054] y=softmax(α(x)*β(x))*γ(x)=SAT(x)

[0055] SAT(x) is self-attention computation, where x is the object or input to be computed, and y is the result or output.
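As a minimal sketch of the formula in [0054], the following NumPy code reads α(x), β(x), and γ(x) as the products of the input with the matrices α, β, and γ, and reads the product inside the softmax as the first projection times the transpose of the second. That reading, the row-wise softmax, the matrix shapes, and the names `softmax` and `sat` are assumptions made for illustration; the source does not spell them out.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the maximum for numerical stability before exponentiating.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sat(x, alpha, beta, gamma):
    """Self-attention y = softmax(alpha(x) * beta(x)) * gamma(x).

    x: (n, d) array holding n vectors; alpha, beta, gamma: (d, d) arrays
    (assumed shapes).
    """
    weights = softmax((x @ alpha) @ (x @ beta).T, axis=-1)  # (n, n), rows sum to 1
    return weights @ (x @ gamma)                            # (n, d)

# Toy input: 4 vectors of dimension 8 with random preset matrices.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
alpha, beta, gamma = (rng.normal(size=(8, 8)) * 0.1 for _ in range(3))
y = sat(x, alpha, beta, gamma)
```

Each output row is a weighted average of the rows of γ(x), with weights given by the softmax of the pairwise scores.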

[0056] The first vector group F_a is divided according to vector category to obtain nine first vector groups, and same-category self-attention is performed on each first vector group to obtain the fifth vector group. The fifth vector group is then divided according to vector category to obtain N / 9 second vector groups, and cross-category self-attention is performed on each second vector group to obtain the sixth vector group. Each first vector group contains vectors of only one category; that is, the vectors of each category constitute one first vector group. Each second vector group contains exactly one vector of each category; that is, a second vector group is formed by selecting one vector from each category. There are nine categories of vectors: first pedestrian head vector, first pedestrian chest vector, first pedestrian abdomen vector, first pedestrian left hand vector, first pedestrian right hand vector, first pedestrian left thigh vector, first pedestrian right thigh vector, first pedestrian left foot vector, and first pedestrian right foot vector. Layer normalization (LN) and a matrix operation are performed on the sixth vector group to obtain the seventh vector group; the matrix operation multiplies the layer-normalized sixth vector group by a preset matrix. The above calculation is defined as a two-round attention computation, denoted by GSAT(), and can be simplified as follows:

[0057] The above describes the first two-round attention calculation in the first stage of the computation for the first vector group. The first stage of computation may also include...

[0058]

[0059] The result of the preceding calculation is multiplied on the left by a preset matrix and on the right by a preset matrix (this step adjusts the dimension) to obtain the output of the first stage of computation.

[0060] The second stage of computation is performed on the first vector group, including:

[0061]

[0062]

[0063] The result of the preceding calculation is multiplied on the left by a preset matrix and on the right by a preset matrix (this step adjusts the dimension) to obtain the output of the second stage of computation.

[0064] The third stage of computation is performed on the first vector group, including:

[0065]

[0066]

[0067] The result of the preceding calculation is multiplied on the left by a preset matrix and on the right by a preset matrix (this step adjusts the dimension) to obtain the first pedestrian feature vector.

[0068] Determining the feature vector of the second pedestrian is similar to determining the feature vector of the first pedestrian, and will not be repeated here.

[0069] Determining the first background feature vector based on the second vector group includes: performing a fourth-stage calculation on the second vector group to obtain the first background feature vector; wherein the fourth-stage calculation includes no less than two rounds of attention calculation, and the attention calculation includes: self-attention calculation within the same category.

[0070] The fourth stage of computation is not progressive from the first, second, and third stages of computation; it is only used here to distinguish that the fourth stage of computation is performed on the second vector group.

[0071] The fourth stage of computation is performed on the second vector group, which includes:

[0072]

[0073]

[0074] The result of the preceding calculation is multiplied on the left by a preset matrix and on the right by a preset matrix (this step adjusts the dimension) to obtain the first background feature vector.

[0075] The GSAT() function used in calculating the first background feature vector is actually different from the GSAT() function used in calculating the first pedestrian feature vector. The difference is that the GSAT() function used in calculating the first background feature vector only performs two rounds of self-attention calculations for the same category.

[0076] Determining the second background feature vector is similar to determining the first background feature vector, and will not be repeated here.

[0077] The target similarity between the first target image and the second target image is calculated based on the first image feature matrix and the second image feature matrix, including: determining the third image feature matrix and the fourth image feature matrix based on the first image feature matrix and the second image feature matrix; normalizing the first pedestrian feature vector in the third image feature matrix and the second pedestrian feature vector in the fourth image feature matrix respectively; performing a dot product calculation on the normalized third image feature matrix and the fourth image feature matrix, and using the result of the dot product calculation as the target similarity.

[0078] The first image feature matrix is denoted as d1, and the second image feature matrix is denoted as d2.

[0079] Third image feature matrix d3:

[0080] d3=softmax(a(d1)*b(d2))*c(d2)

[0081] Fourth image feature matrix d4:

[0082] d4=softmax(a(d2)*b(d1))*c(d1)

[0083] The first 9 vectors in d3 and d4 (the pedestrian feature vectors are the first 9) are normalized (pooled), and then the target similarity s is calculated by dot product:

[0084] s = pool(d3) · pool(d4)
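The similarity calculation in [0080] to [0084] can be sketched as follows. The sketch treats a, b, and c as preset matrices applied by right-multiplication, and treats pool() as averaging the first 9 rows and then scaling the result to unit length; both readings, the matrix sizes, and the helper names `cross_attend`, `pool`, and `target_similarity` are assumptions for illustration rather than details stated in the source.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(dq, dkv, a, b, c):
    # softmax(a(dq) * b(dkv)) * c(dkv): rows of dq attend to rows of dkv.
    return softmax((dq @ a) @ (dkv @ b).T, axis=-1) @ (dkv @ c)

def pool(d, n_ped=9):
    # Assumed pooling: mean of the first 9 (pedestrian) rows, L2-normalized.
    v = d[:n_ped].mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-12)

def target_similarity(d1, d2, a, b, c):
    d3 = cross_attend(d1, d2, a, b, c)  # third image feature matrix
    d4 = cross_attend(d2, d1, a, b, c)  # fourth image feature matrix
    return float(pool(d3) @ pool(d4))   # target similarity s

# Toy matrices: 9 pedestrian rows plus 2 further rows (assumed layout), dimension 16.
rng = np.random.default_rng(1)
d1 = rng.normal(size=(11, 16))
d2 = rng.normal(size=(11, 16))
a, b, c = (rng.normal(size=(16, 16)) * 0.1 for _ in range(3))
s = target_similarity(d1, d2, a, b, c)
```

Under these assumptions s is a cosine similarity, so it lies between -1 and 1 and is unchanged when the two images are swapped.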

[0085] All of the above-mentioned optional technical solutions can be combined in any way to form the optional embodiments of this application, and will not be described in detail here.

[0086] The following are embodiments of the apparatus disclosed herein, which can be used to execute embodiments of the method disclosed herein. For details not disclosed in the apparatus embodiments of this disclosure, please refer to the embodiments of the method disclosed herein.

[0087] Figure 3 is a schematic diagram of a pedestrian image matching device provided in an embodiment of this disclosure. As shown in Figure 3, the pedestrian image matching device includes:

[0088] The first acquisition module 301 is configured to acquire a pedestrian image dataset collected by multiple image acquisition devices in a target area within a preset time period, wherein the pedestrian image dataset includes: multiple images of multiple pedestrians;

[0089] Image acquisition devices can be common devices such as cameras used to acquire images. The images in the pedestrian image dataset are pedestrian images collected within a preset time period before the acquisition time of the pedestrian image dataset, with the acquisition time being the end time.

[0090] The second acquisition module 302 is configured to acquire the distance between each image acquisition device and other image acquisition devices in the target area, and determine the device vector corresponding to each image acquisition device based on the distance between each image acquisition device and other image acquisition devices;

[0091] Based on the distance between two image acquisition devices, determine the coefficients corresponding to those two devices (the greater the distance, the smaller the coefficient). Once the coefficients for each pair of image acquisition devices are determined, the relationship matrix G for all image acquisition devices is formed (the coefficient for each image acquisition device is an element in the relationship matrix). Obtain the initialization matrix E and the relationship adjustment matrix σ (E and σ are pre-set). Then, according to E′=(G*E)*σ...

[0092] The device vector corresponding to each image acquisition device is a row vector in matrix E′.
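The device-vector construction can be illustrated with a short NumPy sketch. It uses only the part of the formula visible in [0091], E′=(G*E)*σ, read as matrix products. The coefficient rule 1 / (1 + distance) is a hypothetical choice that merely satisfies "the greater the distance, the smaller the coefficient"; the matrix sizes and the names `relationship_matrix` and `device_vectors` are likewise assumptions.

```python
import numpy as np

def relationship_matrix(dist):
    """Map pairwise distances to coefficients that shrink as distance grows.

    dist: (n, n) symmetric array of distances between image acquisition devices.
    The rule 1 / (1 + distance) is a hypothetical stand-in, not the source's rule.
    """
    return 1.0 / (1.0 + np.asarray(dist, dtype=float))

def device_vectors(dist, E, sigma):
    # E' = (G * E) * sigma; row i of E' is the device vector of device i.
    G = relationship_matrix(dist)
    return (G @ E) @ sigma

# Three devices with pairwise distances in arbitrary units.
dist = np.array([[0.0, 10.0, 20.0],
                 [10.0, 0.0, 15.0],
                 [20.0, 15.0, 0.0]])
rng = np.random.default_rng(2)
E = rng.normal(size=(3, 5))      # preset initialization matrix (assumed shape)
sigma = rng.normal(size=(5, 4))  # preset relationship adjustment matrix (assumed shape)
G = relationship_matrix(dist)
E_prime = device_vectors(dist, E, sigma)
```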

[0093] The first judgment module 303 is configured to judge whether any two images in the pedestrian image dataset belong to the same pedestrian, and to record the two images to be judged as the first target image and the second target image respectively, to record the device vector corresponding to the image acquisition device that acquires the first target image as the first target device vector, and to record the device vector corresponding to the image acquisition device that acquires the second target image as the second target device vector.

[0094] By determining whether any two images in the pedestrian image dataset belong to the same pedestrian, we can identify all the images corresponding to each pedestrian in the dataset. For ease of explanation later, one of the two images will be designated as the first target image, and the other as the second target image.

[0095] Model module 304 is configured to process the first target image and the second target image respectively using a feature extraction and processing model to obtain the first pedestrian feature vector and the first background feature vector corresponding to the first target image, and the second pedestrian feature vector and the second background feature vector corresponding to the second target image.

[0096] Feature extraction and processing models can extract and process features from images.

[0097] The first determining module 305 is configured to determine the first image feature matrix corresponding to the first target image based on the first target device vector, the first pedestrian feature vector, and the first background feature vector.

[0098] The first image feature matrix may be composed of the first target device vector, the first pedestrian feature vector, and the first background feature vector.

[0099] The second determining module 306 is configured to determine the second image feature matrix corresponding to the second target image based on the second target device vector, the second pedestrian feature vector, and the second background feature vector.

[0100] The second image feature matrix can be composed of the second target device vector, the second pedestrian feature vector, and the second background feature vector.
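One way to compose such an image feature matrix is to stack the vectors as rows. In the sketch below the pedestrian feature rows come first, which matches the statement elsewhere in this disclosure that the first 9 vectors are the pedestrian feature vectors; the order of the background and device rows, the common dimension, and the function name `image_feature_matrix` are assumptions.

```python
import numpy as np

def image_feature_matrix(pedestrian, background, device):
    """Stack pedestrian, background, and device vectors into one matrix.

    pedestrian: (9, D) array; background: (D,) or (k, D); device: (D,).
    Pedestrian rows are placed first; the remaining order is assumed.
    """
    return np.vstack([np.atleast_2d(pedestrian),
                      np.atleast_2d(background),
                      np.atleast_2d(device)])

rng = np.random.default_rng(3)
ped = rng.normal(size=(9, 16))
bg = rng.normal(size=16)
dev = rng.normal(size=16)
d1 = image_feature_matrix(ped, bg, dev)  # shape (11, 16)
```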

[0101] The second judgment module 307 is configured to calculate the target similarity between the first target image and the second target image based on the first image feature matrix and the second image feature matrix, and to determine whether the first target image and the second target image belong to the same pedestrian based on the target similarity.

[0102] If the target similarity is greater than a preset threshold, the first target image and the second target image are determined to belong to the same pedestrian.
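The pairwise judgment and the collection of all images of each pedestrian can be sketched in plain Python. Merging matched pairs transitively with a union-find structure is an assumption added for illustration, as are the function name `group_images` and the toy scores; the source only states that every pair is judged against a preset threshold.

```python
from itertools import combinations

def group_images(image_ids, similarity, threshold):
    """Group images whose pairwise target similarity exceeds the threshold."""
    parent = {i: i for i in image_ids}

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(image_ids, 2):
        if similarity(i, j) > threshold:
            parent[find(i)] = find(j)  # same pedestrian: merge the two groups

    groups = {}
    for i in image_ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Toy similarity scores for four images; unlisted pairs score 0.
scores = {("a", "b"): 0.9, ("a", "c"): 0.2, ("b", "c"): 0.1, ("c", "d"): 0.8}

def similarity(i, j):
    return scores.get((i, j), scores.get((j, i), 0.0))

groups = group_images(["a", "b", "c", "d"], similarity, threshold=0.5)
```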

[0103] According to the technical solution provided in this disclosure, a pedestrian image dataset collected by multiple image acquisition devices in a target area within a preset time period is acquired, the pedestrian image dataset including multiple images of multiple pedestrians. The distance between each image acquisition device and the other image acquisition devices in the target area is obtained, and the device vector corresponding to each image acquisition device is determined based on those distances. Whether any two images in the pedestrian image dataset belong to the same pedestrian is determined; the two images to be judged are denoted as the first target image and the second target image, the device vector corresponding to the image acquisition device that acquired the first target image is denoted as the first target device vector, and the device vector corresponding to the image acquisition device that acquired the second target image is denoted as the second target device vector. The first target image and the second target image are each processed using a feature extraction and processing model to obtain the first pedestrian feature vector and the first background feature vector corresponding to the first target image, and the second pedestrian feature vector and the second background feature vector corresponding to the second target image. The first image feature matrix corresponding to the first target image is determined based on the first target device vector, the first pedestrian feature vector, and the first background feature vector; the second image feature matrix corresponding to the second target image is determined based on the second target device vector, the second pedestrian feature vector, and the second background feature vector. The target similarity between the first target image and the second target image is calculated based on the first image feature matrix and the second image feature matrix, and whether the first target image and the second target image belong to the same pedestrian is determined based on the target similarity. By adopting the above technical means, the problem in the prior art of low pedestrian image matching accuracy in scenarios with multiple image acquisition devices can be solved, thereby improving the accuracy of pedestrian image matching in such scenarios.

[0104] Optionally, model module 304 is further configured to extract a first target feature map of the first target image and a second target feature map of the second target image using the network from the input layer to the second-stage network layer of the residual neural network. The feature extraction and processing model includes: a network from the input layer to the second-stage network layer of the residual neural network; performing convolution calculation, normalization calculation, convolution calculation, normalization calculation, activation calculation, convolution calculation, and normalization calculation sequentially on the first target feature map and the second target feature map to obtain a third target feature map and a fourth target feature map; and processing the third target feature map and the fourth target feature map using a normalization exponential function to determine a first pedestrian feature vector and a first background feature vector, as well as a second pedestrian feature vector and a second background feature vector, based on the processing results.

[0105] The classic residual neural network structure consists of: an input layer, a Stemblock layer, a first-stage network layer, a second-stage network layer, a third-stage network layer, a fourth-stage network layer, a global average pooling layer, and a fully connected layer. The Stemblock layer is used for downsampling. The first, second, third, and fourth-stage network layers process the features obtained from the Stemblock downsampling. The global average pooling layer reduces the dimensionality, and the fully connected layer is used for classification. In this disclosure, the feature extraction and processing model includes the network structure of a residual neural network from the input layer to the second-stage network layer. This can be understood as inputting a first target image and a second target image into the residual neural network, and outputting the first target feature map and the second target feature map from the second-stage network layer.

[0106] For example, the first and second target feature maps are subjected to two 3x3 convolutional operations (Conv) and batch normalization (BN) calculations, respectively; then, activation calculation, 3x3 convolution, and batch normalization are performed again to obtain the third and fourth target feature maps. The normalization exponential function is the softmax function.
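The sequence in this example (convolution, normalization, convolution, normalization, activation, convolution, normalization) can be sketched in NumPy for a single feature map. Several details are assumptions: per-channel normalization over spatial positions stands in for batch normalization without learned scale and shift, ReLU stands in for the unspecified activation, zero padding keeps the spatial size, and the final layer outputs 10 channels to match the 10 parts described in this disclosure. The names `conv3x3`, `norm`, `relu`, and `head` are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv3x3(x, w):
    """3x3 convolution (cross-correlation) with zero padding.

    x: (C_in, H, W); w: (C_out, C_in, 3, 3); returns (C_out, H, W).
    """
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    win = sliding_window_view(xp, (3, 3), axis=(1, 2))  # (C_in, H, W, 3, 3)
    return np.einsum("chwij,ocij->ohw", win, w)

def norm(x, eps=1e-5):
    # Per-channel normalization over spatial positions (stand-in for BN).
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def relu(x):
    return np.maximum(x, 0.0)

def head(x, w1, w2, w3):
    x = norm(conv3x3(x, w1))     # convolution, normalization
    x = norm(conv3x3(x, w2))     # convolution, normalization
    x = relu(x)                  # activation
    return norm(conv3x3(x, w3))  # convolution, normalization

rng = np.random.default_rng(4)
feat = rng.normal(size=(4, 6, 5))  # toy target feature map (C, H, W)
w1 = rng.normal(size=(4, 4, 3, 3))
w2 = rng.normal(size=(4, 4, 3, 3))
w3 = rng.normal(size=(10, 4, 3, 3))
out = head(feat, w1, w2, w3)       # shape (10, 6, 5)
```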

[0107] Optionally, model module 304 is further configured to: process the third target feature map using the normalized exponential function, and determine, based on the processing result, the first pedestrian head vector, first pedestrian chest vector, first pedestrian abdomen vector, first pedestrian left hand vector, first pedestrian right hand vector, first pedestrian left thigh vector, first pedestrian right thigh vector, first pedestrian left foot vector, first pedestrian right foot vector, and first background vector corresponding to the first target image; process the fourth target feature map using the normalized exponential function, and determine, based on the processing result, the second pedestrian head vector, second pedestrian chest vector, second pedestrian abdomen vector, second pedestrian left hand vector, second pedestrian right hand vector, second pedestrian left thigh vector, second pedestrian right thigh vector, second pedestrian left foot vector, second pedestrian right foot vector, and second background vector corresponding to the second target image; determine the first vector group based on the first pedestrian head, chest, abdomen, left hand, right hand, left thigh, right thigh, left foot, and right foot vectors, and determine the first pedestrian feature vector based on the first vector group; determine the second vector group based on the first background vector, and determine the first background feature vector based on the second vector group; determine the third vector group based on the second pedestrian head, chest, abdomen, left hand, right hand, left thigh, right thigh, left foot, and right foot vectors, and determine the second pedestrian feature vector based on the third vector group; and determine the fourth vector group based on the second background vector, and determine the second background feature vector based on the fourth vector group.

[0108] Each image can be divided into pedestrian head, pedestrian chest, pedestrian abdomen, pedestrian left hand, pedestrian right hand, pedestrian left thigh, pedestrian right thigh, pedestrian left foot, pedestrian right foot, and background. By processing the image using the normalized exponential function, a total of 10 parts can be determined (the normalized exponential function can give the probability that each part is pedestrian head, pedestrian chest, pedestrian abdomen, pedestrian left hand, pedestrian right hand, pedestrian left thigh, pedestrian right thigh, pedestrian left foot, pedestrian right foot, and background, and the category with the highest probability is the true category of that part). The vectors of the first pedestrian's head, chest, abdomen, left hand, right hand, left thigh, right thigh, left foot, right foot, and first background can be extracted from the 10 parts of the third target feature map.
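The part assignment described above can be sketched as a per-position softmax over 10 channels followed by taking the most probable category. How a vector is then extracted from each part is not specified in the source; the sketch assumes averaging a feature map over the positions assigned to that part, and returns a zero vector for a part that receives no positions. The names `PARTS` and `part_vectors` are illustrative.

```python
import numpy as np

PARTS = ["head", "chest", "abdomen", "left_hand", "right_hand",
         "left_thigh", "right_thigh", "left_foot", "right_foot", "background"]

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def part_vectors(scores, features):
    """scores: (10, H, W) per-part scores; features: (D, H, W) feature map."""
    probs = softmax(scores, axis=0)  # probability of each part at each position
    labels = probs.argmax(axis=0)    # (H, W): most probable part per position
    vectors = {}
    for idx, name in enumerate(PARTS):
        mask = labels == idx
        if mask.any():
            # Assumed extraction: mean feature over the positions of this part.
            vectors[name] = features[:, mask].mean(axis=1)
        else:
            vectors[name] = np.zeros(features.shape[0])
    return labels, vectors

# Toy 2x2 map: the top row scores highest for head, the bottom row for background.
scores = np.zeros((10, 2, 2))
scores[0, 0, :] = 5.0
scores[9, 1, :] = 5.0
features = np.arange(12, dtype=float).reshape(3, 2, 2)
labels, vectors = part_vectors(scores, features)
```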

[0109] Based on the vectors of the first pedestrian's head, chest, abdomen, left hand, right hand, left thigh, right thigh, left foot, and right foot, form vector group A, then form the first vector group F_a:

[0110] F_a = W1 * A

[0111] Based on the first background vector (there can be multiple first background vectors), form vector group B, then form the second vector group F_b:

[0112] F_b = W2 * B

[0113] W1 and W2 are pre-set matrices.

[0114] Determining the third vector group is similar to determining the first vector group; determining the second pedestrian feature vector is similar to determining the first pedestrian feature vector; determining the fourth vector group is similar to determining the second vector group; determining the second background feature vector is similar to determining the first background feature vector.

[0115] Optionally, the model module 304 is further configured to perform a first-stage calculation, a second-stage calculation, and a third-stage calculation on the first vector group in sequence to obtain a first pedestrian feature vector; wherein each stage calculation includes no less than two rounds of attention calculation, the rounds of attention calculation including: self-attention calculation within the same category and self-attention calculation across categories.

[0116] Optionally, model module 304 is further configured to divide the first vector group according to the vector category to obtain nine first vector groups, and perform self-attention calculation for each first vector group of the same category to obtain a fifth vector group, wherein each first vector group has only one type of vector; divide the fifth vector group according to the vector category to obtain N / 9 second vector groups, and perform cross-category self-attention calculation for each second vector group to obtain a sixth vector group, wherein each second vector group has exactly one vector of each category, and N is the number of vectors in the fifth vector group; perform layer normalization and matrix operations on the sixth vector group to obtain a seventh vector group; wherein the seventh vector group is the result of the first two-round attention calculation in the first stage calculation and is the input of the second two-round attention calculation in the first stage calculation.
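The two-round attention described above can be sketched in NumPy by storing the first vector group as an array of shape (9, M, D), where M = N / 9 is the number of vectors per category. The sketch reuses the softmax self-attention form given elsewhere in this disclosure; using one set of preset matrices per round, normalizing each vector over its last axis, and the names `sat`, `layer_norm`, and `gsat` are assumptions for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sat(x, alpha, beta, gamma):
    # Self-attention over the rows of x: softmax(alpha(x) * beta(x)) * gamma(x).
    return softmax((x @ alpha) @ (x @ beta).T, axis=-1) @ (x @ gamma)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gsat(x, within, across, w_out):
    """x: (9, M, D); within, across: (alpha, beta, gamma) tuples; w_out: (D, D_out)."""
    n_cat, m, _ = x.shape
    # Round 1: nine groups, each holding the M vectors of a single category.
    fifth = np.stack([sat(x[c], *within) for c in range(n_cat)])
    # Round 2: M groups, each holding exactly one vector from every category.
    sixth = np.stack([sat(fifth[:, j], *across) for j in range(m)], axis=1)
    # Layer normalization, then multiplication by a preset matrix.
    return layer_norm(sixth) @ w_out

rng = np.random.default_rng(5)
x = rng.normal(size=(9, 3, 8))  # N = 27 vectors: 9 categories, 3 per category
within = tuple(rng.normal(size=(8, 8)) * 0.1 for _ in range(3))
across = tuple(rng.normal(size=(8, 8)) * 0.1 for _ in range(3))
w_out = rng.normal(size=(8, 8)) * 0.1
seventh = gsat(x, within, across, w_out)  # shape (9, 3, 8)
```

The output keeps the (9, M, D_out) layout, so it can be fed directly into a second call to `gsat` within the same stage.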

[0117] Presuppose three matrices α, β, γ, and another...

[0118] y=softmax(α(x)*β(x))*γ(x)=SAT(x)

[0119] SAT(x) is self-attention computation, where x is the object or input to be computed, and y is the result or output.

[0120] The first vector group F_a is divided according to vector category to obtain nine first vector groups, and same-category self-attention is performed on each first vector group to obtain the fifth vector group. The fifth vector group is then divided according to vector category to obtain N / 9 second vector groups, and cross-category self-attention is performed on each second vector group to obtain the sixth vector group. Each first vector group contains vectors of only one category; that is, the vectors of each category constitute one first vector group. Each second vector group contains exactly one vector of each category; that is, a second vector group is formed by selecting one vector from each category. There are nine categories of vectors: first pedestrian head vector, first pedestrian chest vector, first pedestrian abdomen vector, first pedestrian left hand vector, first pedestrian right hand vector, first pedestrian left thigh vector, first pedestrian right thigh vector, first pedestrian left foot vector, and first pedestrian right foot vector. Layer normalization (LN) and a matrix operation are performed on the sixth vector group to obtain the seventh vector group; the matrix operation multiplies the layer-normalized sixth vector group by a preset matrix. The above calculation is defined as a two-round attention computation, denoted by GSAT(), and can be simplified as follows:

[0121] The above describes the first two-round attention calculation in the first stage of the computation for the first vector group. The first stage of computation may also include...

[0122]

[0123] The result of the preceding calculation is multiplied on the left by a preset matrix and on the right by a preset matrix (this step adjusts the dimension) to obtain the output of the first stage of computation.

[0124] The second stage of computation is performed on the first vector group, including:

[0125]

[0126]

[0127] The result of the preceding calculation is multiplied on the left by a preset matrix and on the right by a preset matrix (this step adjusts the dimension) to obtain the output of the second stage of computation.

[0128] The third stage of computation is performed on the first vector group, including:

[0129]

[0130]

[0131] The result of the preceding calculation is multiplied on the left by a preset matrix and on the right by a preset matrix (this step adjusts the dimension) to obtain the first pedestrian feature vector.

[0132] Determining the feature vector of the second pedestrian is similar to determining the feature vector of the first pedestrian, and will not be repeated here.

[0133] Optionally, model module 304 is further configured to perform a fourth-stage calculation on the second vector group to obtain a first background feature vector; wherein the fourth-stage calculation includes no less than two rounds of attention calculation, the rounds of attention calculation including: self-attention calculation of the same category.

[0134] The fourth stage of computation is not progressive from the first, second, and third stages of computation; it is only used here to distinguish that the fourth stage of computation is performed on the second vector group.

[0135] The fourth stage of computation is performed on the second vector group, which includes:

[0136]

[0137]

[0138] The result of the preceding calculation is multiplied on the left by a preset matrix and on the right by a preset matrix (this step adjusts the dimension) to obtain the first background feature vector.

[0139] The GSAT() function used in calculating the first background feature vector is actually different from the GSAT() function used in calculating the first pedestrian feature vector. The difference is that the GSAT() function used in calculating the first background feature vector only performs two rounds of self-attention calculations for the same category.

[0140] Determining the second background feature vector is similar to determining the first background feature vector, and will not be repeated here.

[0141] Optionally, the second judgment module 307 is further configured to determine the third image feature matrix and the fourth image feature matrix based on the first image feature matrix and the second image feature matrix; normalize the first pedestrian feature vector in the third image feature matrix and the second pedestrian feature vector in the fourth image feature matrix respectively; perform a dot product calculation on the normalized third image feature matrix and the fourth image feature matrix, and use the result of the dot product calculation as the target similarity.

[0142] The first image feature matrix is denoted as d1, and the second image feature matrix is denoted as d2.

[0143] Third image feature matrix d3:

[0144] d3=softmax(a(d1)*b(d2))*c(d2)

[0145] Fourth image feature matrix d4:

[0146] d4=softmax(a(d2)*b(d1))*c(d1)

[0147] The first 9 vectors in d3 and d4 (the pedestrian feature vectors are the first 9) are normalized (pooled), and then the target similarity s is calculated by dot product:

[0148] s = pool(d3) · pool(d4)

[0149] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this disclosure.

[0150] Figure 4 is a schematic diagram of the electronic device 4 provided in an embodiment of this disclosure. As shown in Figure 4, the electronic device 4 of this embodiment includes: a processor 401, a memory 402, and a computer program 403 stored in the memory 402 and executable on the processor 401. When the processor 401 executes the computer program 403, it implements the steps in the various method embodiments described above. Alternatively, when the processor 401 executes the computer program 403, it implements the functions of each module / unit in the various device embodiments described above.

[0151] Electronic device 4 can be a desktop computer, laptop, handheld computer, cloud server, or other electronic device. Electronic device 4 may include, but is not limited to, processor 401 and memory 402. Those skilled in the art will understand that Figure 4 is merely an example of electronic device 4 and does not constitute a limitation on electronic device 4, which may include more or fewer components than shown, or different components.

[0152] The processor 401 may be a central processing unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.

[0153] The memory 402 can be an internal storage unit of the electronic device 4, such as a hard disk or RAM of the electronic device 4. The memory 402 can also be an external storage device of the electronic device 4, such as a plug-in hard disk, Smart Media Card (SMC), Secure Digital (SD) card, Flash Card, etc., equipped on the electronic device 4. The memory 402 can also include both internal and external storage units of the electronic device 4. The memory 402 is used to store computer programs and other programs and data required by the electronic device.

[0154] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is merely an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0155] If an integrated module / unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program may include computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. A computer-readable medium may include: any entity or device capable of carrying computer program code, recording media, USB flash drives, portable hard drives, magnetic disks, optical disks, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, etc. It should be noted that the content included in a computer-readable medium may be appropriately added to or subtracted according to the requirements of legislation and patent practice in a jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, computer-readable media may not include electrical carrier signals and telecommunication signals.

[0156] The above embodiments are only used to illustrate the technical solutions of this disclosure, and are not intended to limit it. Although this disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this disclosure, and should all be included within the protection scope of this disclosure.

Claims

1. A method for matching pedestrian images, characterized by comprising: acquiring a pedestrian image dataset collected by multiple image acquisition devices in a target area within a preset time period, wherein the pedestrian image dataset includes multiple images of multiple pedestrians; obtaining the distance between each image acquisition device and the other image acquisition devices in the target area, and determining the device vector corresponding to each image acquisition device based on the distance between that image acquisition device and the other image acquisition devices; determining whether any two images in the pedestrian image dataset belong to the same pedestrian, wherein the two images to be determined are recorded as the first target image and the second target image, respectively, the device vector corresponding to the image acquisition device that acquired the first target image is recorded as the first target device vector, and the device vector corresponding to the image acquisition device that acquired the second target image is recorded as the second target device vector; processing the first target image and the second target image using a feature extraction and processing model to obtain the first pedestrian feature vector and the first background feature vector corresponding to the first target image, and the second pedestrian feature vector and the second background feature vector corresponding to the second target image; determining, based on the first target device vector, the first pedestrian feature vector, and the first background feature vector, a first image feature matrix corresponding to the first target image; determining, based on the second target device vector, the second pedestrian feature vector, and the second background feature vector, a second image feature matrix corresponding to the second target image; and calculating, based on the first image feature matrix and the second image feature matrix, the target similarity between the first target image and the second target image, and determining, based on the target similarity, whether the first target image and the second target image belong to the same pedestrian; wherein the step of calculating the target similarity between the first target image and the second target image based on the first image feature matrix and the second image feature matrix includes: determining a third image feature matrix and a fourth image feature matrix based on the first image feature matrix and the second image feature matrix; normalizing the first pedestrian feature vector in the third image feature matrix and the second pedestrian feature vector in the fourth image feature matrix, respectively; and performing a dot product calculation on the normalized third image feature matrix and the normalized fourth image feature matrix, and using the result of the dot product calculation as the target similarity.
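
The similarity step of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not state here how the third and fourth image feature matrices are derived or how the device vector is encoded, so the sketch assumes that a device vector is simply the list of distances from one device to every device, that an image feature matrix is the three rows [device vector, pedestrian vector, background vector], and that the dot product is the sum of row-wise dot products. All function names and sample values are hypothetical.

```python
import math

def device_vectors(positions):
    """One vector per device: its distances to every device (assumed encoding)."""
    return [[math.dist(p, q) for q in positions] for p in positions]

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def image_feature_matrix(device_vec, pedestrian_vec, background_vec):
    """Assumed layout: row 0 = device, row 1 = pedestrian, row 2 = background."""
    return [list(device_vec), list(pedestrian_vec), list(background_vec)]

def target_similarity(matrix_a, matrix_b):
    """Normalize only the pedestrian rows, then sum the row-wise dot products."""
    a = [matrix_a[0], l2_normalize(matrix_a[1]), matrix_a[2]]
    b = [matrix_b[0], l2_normalize(matrix_b[1]), matrix_b[2]]
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

# Hypothetical usage: two devices 5 units apart, two images of similar appearance.
vectors = device_vectors([(0.0, 0.0), (3.0, 4.0)])
image_1 = image_feature_matrix(vectors[0], [3.0, 4.0], [0.1, 0.2])
image_2 = image_feature_matrix(vectors[1], [6.0, 8.0], [0.1, 0.1])
score = target_similarity(image_1, image_2)
```

In practice the score would be compared against a decision threshold to decide whether the two images show the same pedestrian; the claim does not specify that threshold, so none is assumed here.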

2. The method of claim 1, wherein the step of processing the first target image and the second target image using a feature extraction and processing model to obtain a first pedestrian feature vector and a first background feature vector corresponding to the first target image, and a second pedestrian feature vector and a second background feature vector corresponding to the second target image, includes: extracting a first target feature map of the first target image and a second target feature map of the second target image using the portion of a residual neural network from the input layer to the second-stage network layer, wherein the feature extraction and processing model includes that portion of the residual neural network; subjecting the first target feature map and the second target feature map, respectively, to convolution calculation, normalization calculation, convolution calculation, normalization calculation, activation calculation, convolution calculation, and normalization calculation to obtain a third target feature map and a fourth target feature map; and processing the third target feature map and the fourth target feature map using a normalized exponential function (softmax), and determining the first pedestrian feature vector and the first background feature vector, as well as the second pedestrian feature vector and the second background feature vector, based on the processing results.

3. The method of claim 2, wherein the step of processing the third target feature map and the fourth target feature map using a normalized exponential function to determine the first pedestrian feature vector and the first background feature vector, as well as the second pedestrian feature vector and the second background feature vector, based on the processing results includes: processing the third target feature map using the normalized exponential function, and determining, based on the processing result, the first pedestrian head vector, first pedestrian chest vector, first pedestrian abdomen vector, first pedestrian left hand vector, first pedestrian right hand vector, first pedestrian left thigh vector, first pedestrian right thigh vector, first pedestrian left foot vector, first pedestrian right foot vector, and first background vector corresponding to the first target image; processing the fourth target feature map using the normalized exponential function, and determining, based on the processing result, the second pedestrian head vector, second pedestrian chest vector, second pedestrian abdomen vector, second pedestrian left hand vector, second pedestrian right hand vector, second pedestrian left thigh vector, second pedestrian right thigh vector, second pedestrian left foot vector, second pedestrian right foot vector, and second background vector corresponding to the second target image; determining a first vector group based on the first pedestrian head vector, first pedestrian chest vector, first pedestrian abdomen vector, first pedestrian left hand vector, first pedestrian right hand vector, first pedestrian left thigh vector, first pedestrian right thigh vector, first pedestrian left foot vector, and first pedestrian right foot vector, and determining the first pedestrian feature vector based on the first vector group; determining a second vector group based on the first background vector, and determining the first background feature vector based on the second vector group; determining a third vector group based on the second pedestrian head vector, second pedestrian chest vector, second pedestrian abdomen vector, second pedestrian left hand vector, second pedestrian right hand vector, second pedestrian left thigh vector, second pedestrian right thigh vector, second pedestrian left foot vector, and second pedestrian right foot vector, and determining the second pedestrian feature vector based on the third vector group; and determining a fourth vector group based on the second background vector, and determining the second background feature vector based on the fourth vector group.
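
One way to picture how a normalized exponential function can yield nine body-part vectors plus a background vector is sketched below in Python. The claim does not say how the vectors are read out of the softmax result, so the sketch assumes that each spatial location carries ten scores (one per category) and a feature vector, and that each category vector is the softmax-weighted average of the location features. The names `PART_NAMES`, `part_vectors`, and `split_groups`, and the weighted-pooling step itself, are assumptions made for illustration.

```python
import math

# Nine pedestrian categories followed by the background category (order assumed).
PART_NAMES = ["head", "chest", "abdomen", "left_hand", "right_hand",
              "left_thigh", "right_thigh", "left_foot", "right_foot", "background"]

def softmax(scores):
    """Normalized exponential function, shifted by the max for numerical stability."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def part_vectors(score_map, feature_map):
    """Pool location features into one vector per category, weighted by softmax probability."""
    dim = len(feature_map[0])
    sums = {name: [0.0] * dim for name in PART_NAMES}
    weights = {name: 0.0 for name in PART_NAMES}
    for scores, feature in zip(score_map, feature_map):
        for name, prob in zip(PART_NAMES, softmax(scores)):
            weights[name] += prob
            for i, x in enumerate(feature):
                sums[name][i] += prob * x
    return {name: [s / weights[name] if weights[name] > 0 else 0.0 for s in sums[name]]
            for name in PART_NAMES}

def split_groups(vectors):
    """Separate the nine pedestrian vectors from the background vector."""
    pedestrian_group = [vectors[name] for name in PART_NAMES[:9]]
    background_group = [vectors["background"]]
    return pedestrian_group, background_group
```

Under these assumptions, `split_groups(part_vectors(score_map, feature_map))` returns the counterparts of the first vector group (nine pedestrian part vectors) and the second vector group (the background vector) for one image.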

4. The method of claim 3, wherein determining the first pedestrian feature vector based on the first vector group includes: sequentially subjecting the first vector group to a first stage calculation, a second stage calculation, and a third stage calculation to obtain the first pedestrian feature vector; wherein each stage of calculation includes at least two rounds of attention calculation, the attention calculation including self-attention calculation within the same category and self-attention calculation across categories.

5. The method of claim 3, wherein determining the first background feature vector based on the second vector group includes: subjecting the second vector group to a fourth stage calculation to obtain the first background feature vector; wherein the fourth stage calculation includes at least two rounds of attention calculation, the attention calculation including self-attention calculation within the same category.

6. The method of claim 4, wherein performing the first dual-round attention calculation in the first stage calculation on the first vector group includes: dividing the first vector group into nine first vector groups according to the category of the vectors, wherein each first vector group contains vectors of only one category, and performing self-attention calculation within the same category on each first vector group to obtain a fifth vector group; dividing the fifth vector group according to the category of the vectors to obtain N/9 second vector groups, wherein each second vector group contains exactly one vector of each category and N is the number of vectors in the fifth vector group, and performing cross-category self-attention calculation on each second vector group to obtain a sixth vector group; and subjecting the sixth vector group to layer normalization and matrix operations to obtain a seventh vector group, wherein the seventh vector group is the result of the first dual-round attention calculation in the first stage calculation and the input of the second dual-round attention calculation in the first stage calculation.
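
The two grouping steps of the dual-round attention calculation can be sketched in Python as follows. This is a simplified illustration under several assumptions: self-attention is plain scaled dot-product attention with identity query, key, and value projections (no learned weights); the k-th vector of each category is assumed to form the k-th cross-category group, since the claim does not say how vectors are paired; and the layer normalization and matrix operations that produce the seventh vector group are omitted. The function names are hypothetical.

```python
import math

def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections (assumption)."""
    if not vectors:
        return []
    dim = len(vectors[0])
    attended = []
    for query in vectors:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in vectors]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        attended.append([sum(w / total * v[i] for w, v in zip(weights, vectors))
                         for i in range(dim)])
    return attended

def dual_round_attention(tagged, num_categories=9):
    """tagged: list of (category, vector) pairs with the same count per category."""
    # Round 1: same-category attention inside each of the nine category groups.
    same_category = {
        c: self_attention([v for cat, v in tagged if cat == c])
        for c in range(num_categories)
    }
    # Round 2: N / 9 groups, each holding exactly one vector of every category.
    per_category = len(tagged) // num_categories
    result = []
    for k in range(per_category):
        group = [same_category[c][k] for c in range(num_categories)]
        result.extend(enumerate(self_attention(group)))
    return result
```

With N = 18 input vectors (two per category), the first round runs attention over nine groups of two vectors, and the second round runs attention over 18 / 9 = 2 groups of nine vectors, matching the N/9 grouping stated in the claim.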

7. A pedestrian image matching apparatus, characterized by comprising: a first acquisition module configured to acquire a pedestrian image dataset collected by multiple image acquisition devices in a target area within a preset time period, wherein the pedestrian image dataset includes multiple images of multiple pedestrians; a second acquisition module configured to acquire the distance between each image acquisition device and the other image acquisition devices in the target area, and to determine the device vector corresponding to each image acquisition device based on the distance between that image acquisition device and the other image acquisition devices; a first judgment module configured to judge whether any two images in the pedestrian image dataset belong to the same pedestrian, to record the two images to be judged as the first target image and the second target image, respectively, to record the device vector corresponding to the image acquisition device that acquired the first target image as the first target device vector, and to record the device vector corresponding to the image acquisition device that acquired the second target image as the second target device vector; a model module configured to process the first target image and the second target image, respectively, using a feature extraction and processing model to obtain a first pedestrian feature vector and a first background feature vector corresponding to the first target image, and a second pedestrian feature vector and a second background feature vector corresponding to the second target image; a first determining module configured to determine the first image feature matrix corresponding to the first target image based on the first target device vector, the first pedestrian feature vector, and the first background feature vector; a second determining module configured to determine the second image feature matrix corresponding to the second target image based on the second target device vector, the second pedestrian feature vector, and the second background feature vector; and a second judgment module configured to calculate the target similarity between the first target image and the second target image based on the first image feature matrix and the second image feature matrix, and to determine whether the first target image and the second target image belong to the same pedestrian based on the target similarity; wherein calculating the target similarity between the first target image and the second target image based on the first image feature matrix and the second image feature matrix includes: determining a third image feature matrix and a fourth image feature matrix based on the first image feature matrix and the second image feature matrix; normalizing the first pedestrian feature vector in the third image feature matrix and the second pedestrian feature vector in the fourth image feature matrix, respectively; and performing a dot product calculation on the normalized third image feature matrix and the normalized fourth image feature matrix, and using the result of the dot product calculation as the target similarity.

8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the steps of the method as described in any one of claims 1 to 6.

9. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the method as described in any one of claims 1 to 6.